paper_id string	paper_url string	pdf_url string	status string	processed_at string	elapsed_seconds float64	num_pages int64	num_pages_processed int64	max_pages_per_paper int64	pdf_exceeds_page_limit bool	failed_pages list	prompt_type string	model_id string	script_version string	markdown string	html string	pages_with_images int64	image_file_count int64	paper_output_prefix string	page_stats list
2604.07429	https://arxiv.org/abs/2604.07429	https://arxiv.org/pdf/2604.07429.pdf	success	2026-04-17T05:38:30.787232+00:00	449.41	52	52	200	false	[]	ocr_layout	datalab-to/chandra-ocr-2	2026-04-16.3	"# GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents\n\nMingyu Ouy(...TRUNCATED)	"<h1>GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents</h1>\n\n<p>(...TRUNCATED)	6	7	2604.07429	[{"page_number":1,"error":false,"token_count":998,"num_chunks":10,"image_count":1,"image_files":["im(...TRUNCATED)

arXiv OCR with Chandra OCR 2

This output bundle stores OCR results for arXiv PDFs, produced with the datalab-to/chandra-ocr-2 model.

Summary

Output dataset: nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-07429-retry-20260417
Output bucket: hf://buckets/nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-07429-retry-20260417
Source paper IDs in input list: 1
Processed IDs recorded in state/processed_ids.txt: 1
Successes: 1
Partial successes: 0
Errors: 0
Next shard index: 1
Updated at: 2026-04-17T05:38:30.829282+00:00
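
The processed-ID file above suggests a simple resume scheme: skip any input ID that is already recorded. A minimal sketch, assuming state/processed_ids.txt holds one paper ID per line; the file layout and the helper name pending_ids are assumptions for illustration, not taken from the script:

```python
from pathlib import Path


def pending_ids(input_ids, processed_path):
    """Return the input IDs not yet listed in the processed-ID file, in input order."""
    path = Path(processed_path)
    done = set()
    if path.exists():
        # Assumed layout: one paper ID per line; blank lines are ignored.
        done = {line.strip() for line in path.read_text().splitlines() if line.strip()}
    return [paper_id for paper_id in input_ids if paper_id not in done]
```

For this run, pending_ids(["2604.07429"], "state/processed_ids.txt") would return an empty list if the file lists that ID, which matches the summary's count of one processed ID.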

Files

data/part-*.jsonl.gz: OCR result shards, one JSON object per paper
state/processed_ids.txt: completed paper IDs used for resume
state/summary.json: aggregate counters and bookkeeping
<paper_id>/<paper_id>.md, <paper_id>/<paper_id>.html, <paper_id>/<paper_id>_metadata.json: optional per-paper outputs when --write-paper-files is enabled
<paper_id>/images/*: extracted image assets when --include-images is enabled
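
Because each shard is gzip-compressed JSON Lines, it can also be read with the standard library alone. A minimal sketch, assuming a shard has already been downloaded locally; the helper name iter_records and the file name used in the usage note are hypothetical:

```python
import gzip
import json


def iter_records(shard_path):
    """Yield one dict per non-empty line of a gzip-compressed JSONL shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

For example, `for record in iter_records("data/part-00000.jsonl.gz"): print(record["paper_id"])` would print one ID per paper, where part-00000 stands in for whatever shard name the run actually produced.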

Each paper record includes:

num_pages: total number of pages in the source PDF
num_pages_processed: number of pages actually sent to OCR
pdf_exceeds_page_limit: whether the PDF had more pages than the configured OCR cap
max_pages_per_paper: configured OCR page cap for the run
pages_with_images: number of OCR pages that produced extracted images
image_file_count: total number of extracted image files for the paper
paper_output_prefix: root folder for the optional per-paper files
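
The page-count fields above can be combined to check how much of a PDF was actually sent to OCR. A minimal sketch that relies only on the documented field names; the helper name page_coverage and the shape of its return value are illustrative assumptions:

```python
def page_coverage(record):
    """Summarize OCR page coverage for one paper record."""
    total = record["num_pages"]
    processed = record["num_pages_processed"]
    return {
        # Share of source pages that were sent to OCR.
        "fraction_processed": processed / total if total else 0.0,
        # True when the PDF had more pages than max_pages_per_paper.
        "truncated_by_cap": record["pdf_exceeds_page_limit"],
        "pages_skipped": total - processed,
    }


# Values taken from the single record in this dataset.
example = {
    "num_pages": 52,
    "num_pages_processed": 52,
    "max_pages_per_paper": 200,
    "pdf_exceeds_page_limit": False,
}
print(page_coverage(example))
```

For the record in this dataset, all 52 pages fall under the 200-page cap, so no pages are skipped.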

Load the results

from datasets import load_dataset

dataset = load_dataset("nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-07429-retry-20260417", data_files="data/*.jsonl.gz", split="train")
print(dataset[0]["paper_id"])
print(dataset[0]["markdown"][:1000])
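
Each record also carries a page_stats list. A minimal sketch for aggregating it, assuming every entry decodes to a dict with the keys visible in the dataset preview (error, token_count, image_count); the helper name summarize_page_stats is hypothetical, and entries may hold further keys that are not used here:

```python
def summarize_page_stats(page_stats):
    """Aggregate per-page OCR statistics into totals for one paper."""
    return {
        "pages": len(page_stats),
        "pages_with_errors": sum(1 for page in page_stats if page.get("error")),
        "total_tokens": sum(page.get("token_count", 0) for page in page_stats),
        "total_images": sum(page.get("image_count", 0) for page in page_stats),
    }
```

With the dataset loaded as above, `summarize_page_stats(dataset[0]["page_stats"])` would return the totals for the single paper in this bundle.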

Job config

Prompt type: ocr_layout
Page batch size: 16
Max output tokens: 12384
Max model length: 18000
GPU memory utilization: 0.85
Minimum arXiv request interval: 3.1 seconds
Max pages per paper sent to OCR: 200
Bucket backend: hf-cli
Paginate output: False
Include headers/footers: False
Include images: True
Write paper files: True
Image URL base: https://huggingface.co/buckets/nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-07429-retry-20260417/resolve
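
To illustrate how the page cap and the page batch size interact, the sketch below splits a paper's pages into consecutive batches. This is an assumption about the general scheme, not the script's actual implementation, and the helper name page_batches is hypothetical:

```python
def page_batches(num_pages, batch_size=16, max_pages=200):
    """Split 1-based page numbers into consecutive batches, honoring the page cap."""
    # Pages beyond the cap are never sent to OCR.
    pages = list(range(1, min(num_pages, max_pages) + 1))
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]


batches = page_batches(52, batch_size=16, max_pages=200)
print([len(batch) for batch in batches])
```

Under these assumptions, the 52-page paper in this run would be processed as three full batches of 16 pages and a final batch of 4.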

Reproduction

hf jobs uv run --flavor l4x1 --image vllm/vllm-openai:v0.17.0 \
  -s HF_TOKEN --timeout 2d \
  ./chandra2-arxiv-ocr.py --output-dataset nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-07429-retry-20260417 \
  --output-bucket nielsr/arxiv-chandra-ocr-2-include-images-demo-2604-07429-retry-20260417 \
  --paper-ids-url https://.../hf_missing_paper_ids.txt

Total file size: 1.19 MB