Datasets:

AmanPriyanshu
/

regularizer-250K-from-reasoning-sft-3M-random-compilation

Parquet error: Scan size limit exceeded: attempted to read 2227869151 bytes, limit is 300000000 bytes Make sure that 1. the Parquet files contain a page index to enable random access without loading entire row groups2. otherwise use smaller row-group sizes when serializing the Parquet files

Error code:   TooBigContentError

Need help to make the dataset viewer work? Make sure to review how to configure the dataset viewer, and open a discussion for direct support.

Regularizer 250K (from Reasoning SFT 3M)

250,000 deduplicated reasoning samples extracted from AmanPriyanshu/reasoning-sft-3M-random-compilation. Designed as a regularization set during tool-use and agentic mid-training — preserves general reasoning, code, structured output, and instruction-following capabilities.

Construction

MD5-hashed all 3M input fields to identify duplicates (471,296 duplicate rows, 15.7%)
Kept exactly one copy of each duplicated input (284,866 unique)
Dropped domains unsuitable for mid-training regularization:
- safety — reserved for final alignment
- SYNTHETIC-2-SFT-Verified — overrepresented (same source as Dolci Think Python Algorithms)
- N/A — unlabeled, unverifiable
- Wildchat — raw user chats, weak reasoning signal
- One-Shot-CFT-Data — tiny dataset, 79.5% duplicated in parent
- Misc noise (nitpick, style, OpenAssistant, TableGPT, synthetic, java, python, c)
Added back 36 samples from SYNTHETIC-2-SFT-Verified to hit exactly 250,000

Schema

Column	Type	Description
`input`	`list[{role: str, content: str}]`	Conversation history with roles: `user`, `system`, `assistant`
`response`	`str`	Model response following strict think template
`domain`	`str`	Unified domain label
`source_dataset`	`str`	HuggingFace source dataset identifier
`dataset_license`	`str`	License of the source dataset

Response Template

Every response follows exactly:

<think>
{reasoning}
</think>
{answer}

Domain Distribution (56 domains)

Category	Samples	%	Key Domains
Code / SWE	~75K	30%	Dolci Think Python Algorithms (31.6K), SWE Repair (19K), code (8.6K), suggestion (11K), bug/refactor/perf (4.6K)
Reasoning / Math / Science	~45K	18%	math (8.9K), science (9.7K), stem-reasoning (4.2K), analytical_reasoning (6.3K), fermi (6.3K), brain_teaser (6.3K)
Structured Output / IF	~43K	17%	instruction_following (19.1K), structured_outputs (5K), text_classification/extraction/modification (19K)
Agentic / Tool-use	~46K	18%	tool_use (6.6K), webagent_flow (6.3K), rag (6.3K), fs_cot_flow (6.3K), struct2text_flow (6.3K), follow_up (6.5K)
General / Conversational	~41K	17%	chat (6.7K), creative_content (6.3K), rc (6.3K), mcq (6.2K), open_domain_qa (1.3K)

Sources

Source	License
AmanPriyanshu/reasoning-sft-CHIMERA	apache-2.0
AmanPriyanshu/reasoning-sft-IF_multi_constraints_upto5	odc-by
AmanPriyanshu/reasoning-sft-Nemotron-Cascade-SFT-SWE-210K	cc-by-4.0
AmanPriyanshu/reasoning-sft-Nemotron-Instruction-Following-Chat-v1	cc-by-4.0
AmanPriyanshu/reasoning-sft-OpenThoughts3-1.2M-450K	apache-2.0
AmanPriyanshu/reasoning-sft-Superior-Reasoning-SFT-gpt-oss-120b-434K	cc-by-4.0
AmanPriyanshu/reasoning-sft-dolci-think-sft-32b-1M	odc-by
AmanPriyanshu/reasoning-sft-github-codereview	mit
AmanPriyanshu/reasoning-sft-interstellarninja-json-mode-reasoning-160K	apache-2.0
AmanPriyanshu/reasoning-sft-minimax-microsoft-orca-agentinstruct-1M-v1	cdla-permissive-2.0
AmanPriyanshu/reasoning-sft-minimax-stratified-kmeans-diverse-reasoning-842K-only	cc-by-4.0
AmanPriyanshu/reasoning-sft-poor-quality-reasoning-sample-mix	apache-2.0
AmanPriyanshu/reasoning-sft-stem-reasoning-complex-FineProofs-126K	apache-2.0

Files

Single parquet file, 250,000 rows, ~2.1GB. Rows sorted by domain then input hash for deterministic reproducibility.

Downloads last month: 42

Collection including AmanPriyanshu/regularizer-250K-from-reasoning-sft-3M-random-compilation

Pooled Sets

Collection

8 items • Updated about 12 hours ago