Record columns (type) with values from the first row:

  • paper_id (string): 2604.08626
  • paper_url (string): https://arxiv.org/abs/2604.08626
  • pdf_url (string): https://arxiv.org/pdf/2604.08626.pdf
  • status (string): success
  • processed_at (string): 2026-04-16T09:15:30.896239+00:00
  • elapsed_seconds (float64): 494.38
  • num_pages (int64): 33
  • num_pages_processed (int64): 33
  • max_pages_per_paper (int64): 200
  • pdf_exceeds_page_limit (bool): false
  • failed_pages (list): []
  • prompt_type (string): ocr_layout
  • model_id (string): datalab-to/chandra-ocr-2
  • script_version (string): 2026-04-16.1
  • markdown (string): "# WildDet3D\n\n## Scaling Promptable 3D Detection in the Wild\n\nWeikai Huang<sup>♥1,2</sup> Jiey(...TRUNCATED)
  • html (string): "<h1> WildDet3D</h1>\n<h2>Scaling Promptable 3D Detection in the Wild</h2>\n\n<p>Weikai Huang<sup>(...TRUNCATED)
  • pages_with_images (int64): 9
  • image_file_count (int64): 13
  • paper_output_prefix (string): 2604.08626
  • page_stats (list): [{"page_number":1,"error":false,"token_count":1703,"num_chunks":17,"image_count":1,"image_files":["i(...TRUNCATED)

arXiv OCR with Chandra OCR 2

This output bundle stores OCR results for arXiv PDFs, generated with the datalab-to/chandra-ocr-2 model.

Summary

  • Output dataset: nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-08626-spacing-fix-v2-20260416
  • Output bucket: hf://buckets/nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-08626-spacing-fix-v2-20260416
  • Source paper IDs in input list: 1
  • Processed IDs recorded in state/processed_ids.txt: 1
  • Successes: 1
  • Partial successes: 0
  • Errors: 0
  • Next shard index: 1
  • Updated at: 2026-04-16T09:15:30.914138+00:00
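
The resume bookkeeping in this summary can be illustrated with a short sketch. It assumes that state/processed_ids.txt holds one paper ID per line, which this card does not state explicitly, and the helper name pending_paper_ids is hypothetical rather than taken from the job script.

```python
from pathlib import Path


def pending_paper_ids(input_ids, processed_path):
    """Return the input IDs that are not yet recorded as processed.

    Assumption: the state file holds one paper ID per line.
    """
    path = Path(processed_path)
    done = set()
    if path.exists():
        # Ignore blank lines and surrounding whitespace.
        done = {
            line.strip()
            for line in path.read_text(encoding="utf-8").splitlines()
            if line.strip()
        }
    # Preserve the input order so reruns stay deterministic.
    return [paper_id for paper_id in input_ids if paper_id not in done]
```

For this run, with one input ID and one processed ID, such a helper would return an empty list, so a rerun would have nothing left to do.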

Files

  • data/part-*.jsonl.gz: OCR result shards, one JSON object per paper
  • state/processed_ids.txt: completed paper IDs used for resume
  • state/summary.json: aggregate counters and bookkeeping
  • <paper_id>/<paper_id>.md, .html, _metadata.json: optional per-paper outputs when --write-paper-files is enabled
  • <paper_id>/images/*: extracted image assets when --include-images is enabled
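
The shards can also be read without the datasets library. The sketch below relies only on the layout described above (gzip-compressed JSON Lines, one JSON object per paper); the function name iter_shard_records is hypothetical.

```python
import gzip
import json


def iter_shard_records(shard_path):
    """Yield one dict per paper from a gzip-compressed JSON Lines shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Pass it the local path of any downloaded file matching data/part-*.jsonl.gz and iterate over the result to get one record per paper.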

Each paper record includes:

  • num_pages: total number of pages in the source PDF
  • num_pages_processed: number of pages actually sent to OCR
  • pdf_exceeds_page_limit: whether the PDF had more pages than the configured OCR cap
  • max_pages_per_paper: configured OCR page cap for the run
  • pages_with_images: number of OCR pages that produced extracted images
  • image_file_count: total number of extracted image files for the paper
  • paper_output_prefix: root folder for the optional per-paper files
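
The page-count fields above can be combined to check whether a paper was OCRed in full. The following is a minimal sketch that uses only the documented field names; the helper name page_coverage and the shape of its return value are illustrative choices.

```python
def page_coverage(record):
    """Summarize how much of the source PDF was sent to OCR."""
    total = record["num_pages"]
    processed = record["num_pages_processed"]
    return {
        # True when the PDF had more pages than the configured cap.
        "truncated": record["pdf_exceeds_page_limit"],
        # Pages that were never sent to OCR.
        "skipped_pages": total - processed,
        # Guard against a zero-page record to avoid division by zero.
        "fraction_processed": processed / total if total else 0.0,
    }
```

For paper 2604.08626 (33 pages, 33 processed, cap 200), nothing was skipped and the record is not truncated.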

Load the results

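from datasets import load_dataset

# Load every OCR shard from the dataset repo as a single "train" split.
dataset = load_dataset(
    "nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-08626-spacing-fix-v2-20260416",
    data_files="data/*.jsonl.gz",
    split="train",
)

# Each row is one paper; print its ID and the first 1000 characters of Markdown.
print(dataset[0]["paper_id"])
print(dataset[0]["markdown"][:1000])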
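
Once a record is loaded, its page_stats field can be aggregated per paper. The sketch below assumes page_stats is available as a list of dicts with the keys page_number, error, token_count and image_count, as suggested by the first entry of this dataset's preview row; other keys may exist, and the helper name summarize_page_stats is hypothetical.

```python
def summarize_page_stats(page_stats):
    """Aggregate per-page OCR statistics for one paper.

    Assumption: each entry is a dict with at least the keys
    page_number, error, token_count and image_count.
    """
    return {
        "pages": len(page_stats),
        # Page numbers whose OCR call was flagged as an error.
        "failed_pages": [p["page_number"] for p in page_stats if p["error"]],
        "total_tokens": sum(p["token_count"] for p in page_stats),
        "total_images": sum(p["image_count"] for p in page_stats),
    }
```

A call such as summarize_page_stats(dataset[0]["page_stats"]) would then give a quick per-paper overview, provided the loader returns the field in that list-of-dicts form.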

Job config

  • Prompt type: ocr_layout
  • Page batch size: 16
  • Max output tokens: 12384
  • Max model length: 18000
  • GPU memory utilization: 0.85
  • Minimum arXiv request interval: 3.1 seconds
  • Max pages per paper sent to OCR: 200
  • Bucket backend: hf-cli
  • Paginate output: False
  • Include headers/footers: False
  • Include images: True
  • Write paper files: True
  • Image URL base: https://huggingface.co/buckets/nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-08626-spacing-fix-v2-20260416/resolve
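
To make the page cap, batch size and request interval concrete, here is a rough sketch of how such settings could interact. It is not the job script's actual code: both plan_page_batches and MinIntervalThrottle are hypothetical names, and it assumes that pages are batched per paper and that the interval is enforced by sleeping between arXiv requests.

```python
import time


def plan_page_batches(num_pages, max_pages=200, batch_size=16):
    """Return (start, end) page index ranges, end exclusive, for one paper."""
    # Pages beyond the cap are not sent to OCR.
    pages_to_process = min(num_pages, max_pages)
    return [
        (start, min(start + batch_size, pages_to_process))
        for start in range(0, pages_to_process, batch_size)
    ]


class MinIntervalThrottle:
    """Ensure consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval=3.1, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock  # injectable for testing
        self._sleep = sleep  # injectable for testing
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Under these assumptions, a 33-page paper splits into three batches covering pages 0-15, 16-31 and 32, and calling wait() before each arXiv request keeps requests at least 3.1 seconds apart.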

Reproduction

hf jobs uv run --flavor l4x1 --image vllm/vllm-openai:v0.17.0 \
  -s HF_TOKEN --timeout 2d \
  ./chandra2-arxiv-ocr.py --output-dataset nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-08626-spacing-fix-v2-20260416 \
  --output-bucket nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-08626-spacing-fix-v2-20260416 \
  --paper-ids-url https://.../hf_missing_paper_ids.txt