Regularizer 250K (from Reasoning SFT 3M)

250,000 deduplicated reasoning samples extracted from AmanPriyanshu/reasoning-sft-3M-random-compilation. Designed as a regularization set for tool-use and agentic mid-training, with the goal of preserving general reasoning, code, structured output, and instruction-following capabilities.

Construction

  1. MD5-hashed all 3M input fields to identify duplicates (471,296 duplicate rows, 15.7%)
  2. Kept exactly one copy of each duplicated input (284,866 unique)
  3. Dropped domains unsuitable for mid-training regularization:
    • safety — reserved for final alignment
    • SYNTHETIC-2-SFT-Verified — overrepresented (same source as Dolci Think Python Algorithms)
    • N/A — unlabeled, unverifiable
    • Wildchat — raw user chats, weak reasoning signal
    • One-Shot-CFT-Data — tiny dataset, 79.5% duplicated in parent
    • Misc noise (nitpick, style, OpenAssistant, TableGPT, synthetic, java, python, c)
  4. Added back 36 samples from SYNTHETIC-2-SFT-Verified to hit exactly 250,000
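
The steps above can be read as a hash, group, and filter pass. The sketch below is one possible interpretation, not the script used to build the dataset. It assumes that the input is serialized with `json.dumps` before hashing (the card only says the input fields were MD5-hashed), that "duplicated input" means an input whose hash occurs more than once, and that the domain labels are spelled exactly as in the list. The names `input_hash`, `DROPPED_DOMAINS`, and `build_candidates` are hypothetical.

```python
import hashlib
import json
from collections import Counter

# Assumption: labels are spelled exactly as in the construction list above.
DROPPED_DOMAINS = {
    "safety", "SYNTHETIC-2-SFT-Verified", "N/A", "Wildchat", "One-Shot-CFT-Data",
    "nitpick", "style", "OpenAssistant", "TableGPT", "synthetic", "java", "python", "c",
}


def input_hash(messages):
    """MD5 of the conversation; the JSON serialization is an assumption."""
    payload = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def build_candidates(rows):
    """Keep one copy of each input that occurs more than once, then drop domains."""
    hashes = [input_hash(row["input"]) for row in rows]
    counts = Counter(hashes)
    seen = set()
    kept = []
    for row, digest in zip(rows, hashes):
        if counts[digest] < 2 or digest in seen:
            continue  # not a duplicated input, or a copy is already kept
        seen.add(digest)
        if row["domain"] not in DROPPED_DOMAINS:
            kept.append(row)
    return kept
```

Under this reading, step 4 would then top the result up with a few rows from a dropped domain until the target size is reached.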

Schema

| Column | Type | Description |
|---|---|---|
| input | list[{role: str, content: str}] | Conversation history with roles: user, system, assistant |
| response | str | Model response following strict think template |
| domain | str | Unified domain label |
| source_dataset | str | HuggingFace source dataset identifier |
| dataset_license | str | License of the source dataset |
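
A row can be checked against this schema with a few lines of plain Python. This is a minimal sketch that assumes each row is available as a dict; `validate_row` is a hypothetical helper, not part of any library.

```python
EXPECTED_COLUMNS = ("input", "response", "domain", "source_dataset", "dataset_license")
ALLOWED_ROLES = {"user", "system", "assistant"}


def validate_row(row):
    """Return a list of schema problems; an empty list means the row looks valid."""
    problems = [f"missing column: {col}" for col in EXPECTED_COLUMNS if col not in row]

    if "input" in row:
        messages = row["input"]
        if not isinstance(messages, list):
            problems.append("input is not a list")
        else:
            for i, msg in enumerate(messages):
                if not isinstance(msg, dict) or set(msg) != {"role", "content"}:
                    problems.append(f"input[{i}] must have exactly role and content")
                elif msg["role"] not in ALLOWED_ROLES:
                    problems.append(f"input[{i}] has unknown role: {msg['role']}")
                elif not isinstance(msg["content"], str):
                    problems.append(f"input[{i}] content is not a string")

    # All remaining columns are plain strings according to the schema.
    for col in EXPECTED_COLUMNS[1:]:
        if col in row and not isinstance(row[col], str):
            problems.append(f"{col} is not a string")
    return problems
```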

Response Template

Every response follows exactly:

<think>
{reasoning}
</think>
{answer}
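
Because the template is fixed, the reasoning and the answer can be separated with a single regular expression. The sketch below assumes the newlines sit exactly where the template shows them; if the real data carries extra whitespace, a looser pattern would be needed. `split_response` is a hypothetical helper.

```python
import re

# Assumption: one newline after <think>, one before and one after </think>.
THINK_RE = re.compile(
    r"\A<think>\n(?P<reasoning>.*?)\n</think>\n(?P<answer>.*)\Z",
    re.DOTALL,
)


def split_response(response):
    """Return (reasoning, answer), or None if the template does not match."""
    match = THINK_RE.match(response)
    if match is None:
        return None
    return match.group("reasoning"), match.group("answer")
```

Returning `None` instead of raising makes it easy to count rows that deviate from the template.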

Domain Distribution (56 domains)

| Category | Samples | % | Key Domains |
|---|---|---|---|
| Code / SWE | ~75K | 30% | Dolci Think Python Algorithms (31.6K), SWE Repair (19K), code (8.6K), suggestion (11K), bug/refactor/perf (4.6K) |
| Reasoning / Math / Science | ~45K | 18% | math (8.9K), science (9.7K), stem-reasoning (4.2K), analytical_reasoning (6.3K), fermi (6.3K), brain_teaser (6.3K) |
| Structured Output / IF | ~43K | 17% | instruction_following (19.1K), structured_outputs (5K), text_classification/extraction/modification (19K) |
| Agentic / Tool-use | ~46K | 18% | tool_use (6.6K), webagent_flow (6.3K), rag (6.3K), fs_cot_flow (6.3K), struct2text_flow (6.3K), follow_up (6.5K) |
| General / Conversational | ~41K | 17% | chat (6.7K), creative_content (6.3K), rc (6.3K), mcq (6.2K), open_domain_qa (1.3K) |
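
Per-domain counts and shares like these can be recomputed from the `domain` column. A minimal sketch, assuming the rows are available as dicts; `domain_distribution` is a hypothetical helper, and the grouping of domains into the five categories is not reproduced here because the card lists only the key domains per category.

```python
from collections import Counter


def domain_distribution(rows):
    """Return (domain, count, percent) tuples, most frequent domain first."""
    counts = Counter(row["domain"] for row in rows)
    total = sum(counts.values())
    return [
        (domain, n, round(100 * n / total, 1))
        for domain, n in counts.most_common()
    ]
```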

Sources

| Source | License |
|---|---|
| AmanPriyanshu/reasoning-sft-CHIMERA | apache-2.0 |
| AmanPriyanshu/reasoning-sft-IF_multi_constraints_upto5 | odc-by |
| AmanPriyanshu/reasoning-sft-Nemotron-Cascade-SFT-SWE-210K | cc-by-4.0 |
| AmanPriyanshu/reasoning-sft-Nemotron-Instruction-Following-Chat-v1 | cc-by-4.0 |
| AmanPriyanshu/reasoning-sft-OpenThoughts3-1.2M-450K | apache-2.0 |
| AmanPriyanshu/reasoning-sft-Superior-Reasoning-SFT-gpt-oss-120b-434K | cc-by-4.0 |
| AmanPriyanshu/reasoning-sft-dolci-think-sft-32b-1M | odc-by |
| AmanPriyanshu/reasoning-sft-github-codereview | mit |
| AmanPriyanshu/reasoning-sft-interstellarninja-json-mode-reasoning-160K | apache-2.0 |
| AmanPriyanshu/reasoning-sft-minimax-microsoft-orca-agentinstruct-1M-v1 | cdla-permissive-2.0 |
| AmanPriyanshu/reasoning-sft-minimax-stratified-kmeans-diverse-reasoning-842K-only | cc-by-4.0 |
| AmanPriyanshu/reasoning-sft-poor-quality-reasoning-sample-mix | apache-2.0 |
| AmanPriyanshu/reasoning-sft-stem-reasoning-complex-FineProofs-126K | apache-2.0 |

Files

Single Parquet file, 250,000 rows, ~2.1 GB. Rows are sorted by domain, then by input hash, so the row order is deterministic and reproducible.
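
The ordering described here amounts to sorting on a two-part key. The sketch below assumes the hash is an MD5 over a JSON serialization of `input`; the card does not specify the serialization, so the exact order of the published file may differ. `input_hash` and `sort_rows` are hypothetical names.

```python
import hashlib
import json


def input_hash(messages):
    """MD5 of the conversation; the JSON serialization is an assumption."""
    payload = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def sort_rows(rows):
    """Order rows by domain first, then by the hash of their input."""
    return sorted(rows, key=lambda row: (row["domain"], input_hash(row["input"])))
```

Since the key depends only on row content, any shuffling of the same rows sorts back to the same sequence.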


Collection including AmanPriyanshu/regularizer-250K-from-reasoning-sft-3M-random-compilation