Pooled Sets
Collection
8 items • Updated
Error code: TooBigContentError
Need help to make the dataset viewer work? Make sure to review how to configure the dataset viewer, and open a discussion for direct support.
250,000 deduplicated reasoning samples extracted from AmanPriyanshu/reasoning-sft-3M-random-compilation. Designed as a regularization set during tool-use and agentic mid-training — preserves general reasoning, code, structured output, and instruction-following capabilities.
input fields to identify duplicates (471,296 duplicate rows, 15.7%)safety — reserved for final alignmentSYNTHETIC-2-SFT-Verified — overrepresented (same source as Dolci Think Python Algorithms)N/A — unlabeled, unverifiableWildchat — raw user chats, weak reasoning signalOne-Shot-CFT-Data — tiny dataset, 79.5% duplicated in parentnitpick, style, OpenAssistant, TableGPT, synthetic, java, python, c)SYNTHETIC-2-SFT-Verified to hit exactly 250,000| Column | Type | Description |
|---|---|---|
input |
list[{role: str, content: str}] |
Conversation history with roles: user, system, assistant |
response |
str |
Model response following strict think template |
domain |
str |
Unified domain label |
source_dataset |
str |
HuggingFace source dataset identifier |
dataset_license |
str |
License of the source dataset |
Every response follows exactly:
<think>
{reasoning}
</think>
{answer}
| Category | Samples | % | Key Domains |
|---|---|---|---|
| Code / SWE | ~75K | 30% | Dolci Think Python Algorithms (31.6K), SWE Repair (19K), code (8.6K), suggestion (11K), bug/refactor/perf (4.6K) |
| Reasoning / Math / Science | ~45K | 18% | math (8.9K), science (9.7K), stem-reasoning (4.2K), analytical_reasoning (6.3K), fermi (6.3K), brain_teaser (6.3K) |
| Structured Output / IF | ~43K | 17% | instruction_following (19.1K), structured_outputs (5K), text_classification/extraction/modification (19K) |
| Agentic / Tool-use | ~46K | 18% | tool_use (6.6K), webagent_flow (6.3K), rag (6.3K), fs_cot_flow (6.3K), struct2text_flow (6.3K), follow_up (6.5K) |
| General / Conversational | ~41K | 17% | chat (6.7K), creative_content (6.3K), rc (6.3K), mcq (6.2K), open_domain_qa (1.3K) |
| Source | License |
|---|---|
| AmanPriyanshu/reasoning-sft-CHIMERA | apache-2.0 |
| AmanPriyanshu/reasoning-sft-IF_multi_constraints_upto5 | odc-by |
| AmanPriyanshu/reasoning-sft-Nemotron-Cascade-SFT-SWE-210K | cc-by-4.0 |
| AmanPriyanshu/reasoning-sft-Nemotron-Instruction-Following-Chat-v1 | cc-by-4.0 |
| AmanPriyanshu/reasoning-sft-OpenThoughts3-1.2M-450K | apache-2.0 |
| AmanPriyanshu/reasoning-sft-Superior-Reasoning-SFT-gpt-oss-120b-434K | cc-by-4.0 |
| AmanPriyanshu/reasoning-sft-dolci-think-sft-32b-1M | odc-by |
| AmanPriyanshu/reasoning-sft-github-codereview | mit |
| AmanPriyanshu/reasoning-sft-interstellarninja-json-mode-reasoning-160K | apache-2.0 |
| AmanPriyanshu/reasoning-sft-minimax-microsoft-orca-agentinstruct-1M-v1 | cdla-permissive-2.0 |
| AmanPriyanshu/reasoning-sft-minimax-stratified-kmeans-diverse-reasoning-842K-only | cc-by-4.0 |
| AmanPriyanshu/reasoning-sft-poor-quality-reasoning-sample-mix | apache-2.0 |
| AmanPriyanshu/reasoning-sft-stem-reasoning-complex-FineProofs-126K | apache-2.0 |
Single parquet file, 250,000 rows, ~2.1GB. Rows sorted by domain then input hash for deterministic reproducibility.