paper_id string | paper_url string | pdf_url string | status string | processed_at string | elapsed_seconds float64 | num_pages int64 | num_pages_processed int64 | max_pages_per_paper int64 | pdf_exceeds_page_limit bool | failed_pages list | prompt_type string | model_id string | script_version timestamp[s] | markdown string | html string | pages_with_images int64 | image_file_count int64 | paper_output_prefix string | page_stats list |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2604.08626 | https://arxiv.org/abs/2604.08626 | https://arxiv.org/pdf/2604.08626.pdf | success | 2026-04-16T08:38:11.720357+00:00 | 494.03 | 33 | 33 | 200 | false | [] | ocr_layout | datalab-to/chandra-ocr-2 | 2026-04-15T00:00:00 | "# WildDet3D\n\n## Scaling Promptable 3D Detection in the Wild\n\nWeikai Huang<sup>♥1,2</sup> Jiey(...TRUNCATED) | "<h1> WildDet3D</h1>\n<h2>Scaling Promptable 3D Detection in the Wild</h2>\n\n<p>Weikai Huang<sup>(...TRUNCATED) | 9 | 13 | 2604.08626 | [{"page_number":1,"error":false,"token_count":1703,"num_chunks":17,"image_count":1,"image_files":["i(...TRUNCATED) |
arXiv OCR with Chandra OCR 2
This output bundle stores OCR results for arXiv PDFs, produced with datalab-to/chandra-ocr-2.
Summary
- Output dataset: nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-08626-20260416
- Output bucket: hf://buckets/nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-08626-20260416
- Source paper IDs in input list: 1
- Processed IDs recorded in state/processed_ids.txt: 1
- Successes: 1
- Partial successes: 0
- Errors: 0
- Next shard index: 1
- Updated at: 2026-04-16T08:38:11.738458+00:00
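The processed-ID count above comes from state/processed_ids.txt, which is used for resuming a run. A minimal sketch of how such a resume step could work, assuming the file holds one paper ID per line (the file format is not documented here, and the helper names and the second paper ID are hypothetical):

```python
from pathlib import Path
import tempfile


def load_processed_ids(path):
    """Return the set of recorded paper IDs, or an empty set if the file is missing."""
    p = Path(path)
    if not p.exists():
        return set()
    lines = p.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pending_ids(input_ids, processed):
    """Keep the input order and drop IDs that were already processed."""
    return [pid for pid in input_ids if pid not in processed]


# Demo: a temporary file stands in for state/processed_ids.txt.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "processed_ids.txt"
    state_file.write_text("2604.08626\n", encoding="utf-8")
    done = load_processed_ids(state_file)
    # "0000.00000" is a made-up placeholder ID, not a real paper.
    todo = pending_ids(["2604.08626", "0000.00000"], done)
    print(todo)
```

Treating a missing state file as an empty set lets the same code path serve both a first run and a resumed one.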
Files
- data/part-*.jsonl.gz: OCR result shards, one JSON object per paper
- state/processed_ids.txt: completed paper IDs used for resume
- state/summary.json: aggregate counters and bookkeeping
- <paper_id>/<paper_id>.md, .html, _metadata.json: optional per-paper outputs when --write-paper-files is enabled
- <paper_id>/images/*: extracted image assets when --include-images is enabled
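If you have downloaded a shard locally, it can be read without the datasets library. The sketch below uses only the standard library and assumes the shard is gzip-compressed JSON Lines, as the description above implies. The demo file name and record are stand-ins, not real shard contents:

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_shard(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo: write a tiny stand-in shard, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "part-demo.jsonl.gz"  # hypothetical file name
    with gzip.open(shard, "wt", encoding="utf-8") as fh:
        fh.write(json.dumps({"paper_id": "2604.08626", "status": "success"}) + "\n")
    records = list(iter_shard(shard))
    print(records[0]["paper_id"])
```

Reading line by line keeps memory use flat, which matters because each record embeds the full markdown and html of a paper.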
Each paper record includes:
- num_pages: total number of pages in the source PDF
- num_pages_processed: number of pages actually sent to OCR
- pdf_exceeds_page_limit: whether the PDF had more pages than the configured OCR cap
- max_pages_per_paper: configured OCR page cap for the run
- pages_with_images: number of OCR pages that produced extracted images
- image_file_count: total number of extracted image files for the paper
- paper_output_prefix: root folder for the optional per-paper files
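The page-count fields are related, and a quick sanity check can catch inconsistent records. The sketch below assumes two invariants that follow from the field descriptions but are not stated explicitly: pdf_exceeds_page_limit is true exactly when num_pages is greater than max_pages_per_paper, and num_pages_processed never exceeds the smaller of the two. The function name is hypothetical.

```python
def check_page_counts(record):
    """Return a list of problems found; an empty list means the record looks consistent."""
    problems = []
    total = record["num_pages"]
    processed = record["num_pages_processed"]
    cap = record["max_pages_per_paper"]

    # Assumed invariant: the flag mirrors the comparison against the cap.
    if record["pdf_exceeds_page_limit"] != (total > cap):
        problems.append("pdf_exceeds_page_limit does not match num_pages vs. cap")

    # Assumed invariant: OCR never sees more pages than exist or than the cap allows.
    if processed > min(total, cap):
        problems.append("num_pages_processed exceeds min(num_pages, cap)")

    return problems


# Values taken from the single record in this bundle.
record = {
    "num_pages": 33,
    "num_pages_processed": 33,
    "max_pages_per_paper": 200,
    "pdf_exceeds_page_limit": False,
}
print(check_page_counts(record))
```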
Load the results
from datasets import load_dataset
dataset = load_dataset("nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-08626-20260416", data_files="data/*.jsonl.gz", split="train")
print(dataset[0]["paper_id"])
print(dataset[0]["markdown"][:1000])
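Each record also carries a page_stats list. As a rough sketch, the per-page entries can be rolled up as shown below. It relies only on the keys visible in the dataset preview (page_number, error, token_count, image_count) and assumes each entry is a plain dict. The function name and the second sample entry are made up for illustration.

```python
def summarize_page_stats(page_stats):
    """Aggregate per-page OCR stats into a few totals."""
    total_tokens = sum(p.get("token_count", 0) for p in page_stats)
    pages_with_images = sum(1 for p in page_stats if p.get("image_count", 0) > 0)
    error_pages = [p["page_number"] for p in page_stats if p.get("error")]
    return {
        "total_tokens": total_tokens,
        "pages_with_images": pages_with_images,
        "error_pages": error_pages,
    }


sample = [
    # First entry mirrors the values shown for page 1 in the preview.
    {"page_number": 1, "error": False, "token_count": 1703, "num_chunks": 17, "image_count": 1},
    # Second entry is invented to exercise the error path.
    {"page_number": 2, "error": True, "token_count": 0, "num_chunks": 0, "image_count": 0},
]
summary = summarize_page_stats(sample)
print(summary)
```

With the loaded dataset, the same function could be applied as summarize_page_stats(dataset[0]["page_stats"]), provided the column loads as a list of dicts.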
Job config
- Prompt type: ocr_layout
- Page batch size: 16
- Max output tokens: 12384
- Max model length: 18000
- GPU memory utilization: 0.85
- Minimum arXiv request interval: 3.1 seconds
- Max pages per paper sent to OCR: 200
- Bucket backend: hf-cli
- Paginate output: False
- Include headers/footers: False
- Include images: True
- Write paper files: True
- Image URL base: https://huggingface.co/buckets/nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-08626-20260416/resolve
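Two of these settings are easy to picture in code: the page batch size and the minimum arXiv request interval. The sketch below is an illustration under assumptions, not the script's actual implementation. It assumes pages are batched consecutively and that the interval is enforced between the starts of successive requests. Both names are hypothetical.

```python
import time


def page_batches(num_pages, batch_size=16):
    """Split 1-based page numbers into consecutive batches of at most batch_size."""
    pages = list(range(1, num_pages + 1))
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]


class MinIntervalThrottle:
    """Block so that successive wait() calls return at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Sleep if needed and return the number of seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited


# A 33-page paper with a batch size of 16 yields three batches.
batches = page_batches(33, batch_size=16)
print([len(b) for b in batches])

# Typical use: call throttle.wait() right before each arXiv download.
throttle = MinIntervalThrottle(3.1)
```

Using a monotonic clock avoids surprises if the system time is adjusted during a long job.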
Reproduction
hf jobs uv run --flavor l4x1 --image vllm/vllm-openai:v0.17.0 \
-s HF_TOKEN --timeout 2d \
./chandra2-arxiv-ocr.py --output-dataset nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-08626-20260416 \
--output-bucket nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-08626-20260416 \
--paper-ids-url https://.../hf_missing_paper_ids.txt