Atom2.7m
Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and uses custom code for both the model and the tokenizer path.
Model Details
- Architecture: decoder-only GPT
- Parameters: 2,738,880
- Layers: 5
- Hidden size: 192
- Attention heads: 4
- KV heads: 2
- Context length: 512
- Vocabulary size: 4,096
- Token embeddings: tied input/output embeddings
- Arithmetic feature embeddings:
  - place_vocab_size: 66
  - role_vocab_size: 12
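As a rough sanity check on the numbers in the Model Details list, the sketch below derives a few sizes from them. It assumes the conventional layout (head dimension equals hidden size divided by attention heads, grouped-query attention for the KV heads, and feature embedding tables as wide as the hidden size); none of these assumptions is confirmed by the model code here.

```python
# Values copied from the Model Details list.
hidden_size = 192
num_heads = 4
num_kv_heads = 2
vocab_size = 4096
place_vocab_size = 66
role_vocab_size = 12
total_params = 2_738_880

# Assumption: standard multi-head layout, head_dim = hidden_size / num_heads.
head_dim = hidden_size // num_heads
# Assumption: grouped-query attention, so K and V are projected to
# num_kv_heads * head_dim and each KV head serves several query heads.
kv_dim = num_kv_heads * head_dim
queries_per_kv_head = num_heads // num_kv_heads

# Tied input/output embeddings are counted once.
token_embedding_params = vocab_size * hidden_size
# Assumption: the place and role tables are also hidden_size wide.
feature_embedding_params = (place_vocab_size + role_vocab_size) * hidden_size

print(f"head_dim={head_dim}, kv_dim={kv_dim}, queries per KV head={queries_per_kv_head}")
print(f"token embedding params: {token_embedding_params:,}")
print(f"feature embedding params: {feature_embedding_params:,}")
print(f"token embedding share of total: {token_embedding_params / total_params:.1%}")
```

Under these assumptions the tied token embedding is the largest single table in the model, which is typical for models of this size.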
Tokenizer
Do not evaluate or use this model with a plain Hugging Face tokenizer alone. It relies on a custom fusion tokenizer implemented in tokenizer_utils.py.
The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic-sensitive spans specially:
- digits 0-9 are atomic and never BPE-merged
- digit spans are emitted least-significant-digit first
- the symbols + - * / = ( ) are isolated atomic tokens
- whitespace is isolated from text
- place_ids are assigned to every digit run
- role_ids are assigned only for strict integer equation spans
The model expects aligned input_ids, place_ids, and role_ids.
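The rules listed above can be illustrated with a toy pre-tokenizer. This is a minimal sketch, not the code in tokenizer_utils.py: the function name `split_arithmetic_aware` and the place-id numbering (0 for non-digits, k + 1 for the digit with place value 10^k) are assumptions made for illustration. It leaves ordinary text as a single piece instead of applying BPE, and it omits role_ids.

```python
import re

# One alternative per rule: digit run, single operator, whitespace run, other text.
_PIECE = re.compile(r"[0-9]+|[+\-*/=()]|\s+|[^0-9\s+\-*/=()]+")

def split_arithmetic_aware(text):
    """Toy pre-tokenizer returning (pieces, place_ids) of equal length.

    Hypothetical convention: place id 0 marks a non-digit piece and
    place id k + 1 marks the digit with place value 10**k.
    """
    pieces, place_ids = [], []
    for match in _PIECE.finditer(text):
        chunk = match.group()
        if chunk[0] in "0123456789":
            # Emit each digit as its own piece, least-significant digit first.
            for place, digit in enumerate(reversed(chunk)):
                pieces.append(digit)
                place_ids.append(place + 1)
        else:
            # Operators, whitespace, and ordinary text carry no place id.
            pieces.append(chunk)
            place_ids.append(0)
    return pieces, place_ids

pieces, place_ids = split_arithmetic_aware("12 + 34 =")
print(pieces)     # ['2', '1', ' ', '+', ' ', '4', '3', ' ', '=']
print(place_ids)  # [1, 2, 0, 0, 0, 1, 2, 0, 0]
```

Emitting the ones digit first means a carry produced at one position is consumed by the very next digit the model generates, which is the usual motivation for reversed-digit arithmetic formats.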
Usage
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM

from tokenizer_utils import load_tokenizer

model_dir = Path(".")
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    trust_remote_code=True,
).eval()
tokenizer = load_tokenizer(model_dir)

text = "12 + 34 ="
encoding = tokenizer.encode(text)
input_ids = torch.tensor([encoding.input_ids])
place_ids = torch.tensor([encoding.place_ids])
role_ids = torch.tensor([encoding.role_ids])

with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        place_ids=place_ids,
        role_ids=role_ids,
    )
For correct results, do not rely on pipeline("text-generation") unless it is wrapped to provide place_ids and role_ids.
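Because the three id sequences must stay aligned, a small guard before the forward pass can catch wrapper bugs early. The helper below is a hypothetical sketch (the name `check_alignment` is not part of the repository); it relies only on the attribute names used in the usage example and the 512-token context length from Model Details.

```python
from types import SimpleNamespace

MAX_CONTEXT = 512  # context length from Model Details

def check_alignment(encoding, max_len=MAX_CONTEXT):
    """Raise ValueError unless the three id sequences line up and fit the context."""
    lengths = {
        name: len(getattr(encoding, name))
        for name in ("input_ids", "place_ids", "role_ids")
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"misaligned encoding: {lengths}")
    n = lengths["input_ids"]
    if n > max_len:
        raise ValueError(f"{n} tokens exceed the context length of {max_len}")
    return n

# Stand-in object with the same attribute names as the encoding in the usage
# example; the id values are made up.
fake = SimpleNamespace(input_ids=[5, 6, 7], place_ids=[1, 2, 0], role_ids=[0, 0, 0])
print(check_alignment(fake))  # 3
```

A custom generation wrapper could call a check like this after every re-encoding step, before building the tensors.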
Evaluation
ArithMark 2.0
Use the included fusion-aware benchmark script:
python benchmark_fusion_arithmark.py \
  --checkpoint . \
  --tokenizer-dir . \
  --data-path arithmark_2.0.jsonl \
  --batch-size 64 \
  --device cuda \
  --output benchmark_results/fusion_arithmark_2.0_results.json
lm-evaluation-harness
Use the included launcher so the atom2.7m model wrapper is registered:
python lm_eval_fusion run \
  --model atom2.7m \
  --model_args pretrained=.,tokenizer_dir=. \
  --tasks hellaswag,arc_easy,arc_challenge,piqa \
  --device cuda:0 \
  --batch_size auto \
  --output_path benchmark_results/lm_eval
The wrapper uses tokenizer_utils.load_tokenizer() and forwards place_ids and role_ids to the model.
Results
| Benchmark | Metric | Value |
|---|---|---|
| ArithMark 2.0 | acc | 0.6380 |
| arc_challenge | acc_norm | 0.2261 |
| arc_easy | acc_norm | 0.3270 |
| hellaswag | acc_norm | 0.2733 |
| piqa | acc_norm | 0.5305 |
Training Data
The pretraining mixture targeted about 3.5B tokens:
- Ultra-FineWeb: 900M
- FineWeb-Edu: 900M
- FineMath: 450M
- Cosmopedia-v2: 337.5M
- UltraData-Math-L2-preview: 337.5M
- Ultra-FineWeb-L3-en-QA-Synthetic: 225M
- Synthetic-Arithmetic: 350M
Synthetic-Arithmetic is AtomCalc-style canonical integer equation data. The training curriculum is included as pretraining_curriculum.json.
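The per-source budgets can be checked against the stated 3.5B target with a few lines of arithmetic. This sketch only restates the numbers from the list above; the dictionary name is illustrative, and the actual schedule lives in pretraining_curriculum.json.

```python
# Target token budgets in millions, copied from the mixture list.
mixture_millions = {
    "Ultra-FineWeb": 900.0,
    "FineWeb-Edu": 900.0,
    "FineMath": 450.0,
    "Cosmopedia-v2": 337.5,
    "UltraData-Math-L2-preview": 337.5,
    "Ultra-FineWeb-L3-en-QA-Synthetic": 225.0,
    "Synthetic-Arithmetic": 350.0,
}

total = sum(mixture_millions.values())
shares = {name: tokens / total for name, tokens in mixture_millions.items()}

print(f"total: {total / 1000:.1f}B tokens")
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:35s} {share:6.1%}")
```

The budgets sum to exactly 3,500M tokens, and Synthetic-Arithmetic accounts for 10% of the mixture.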
Limitations
- This is a very small model and should be treated as an experimental research artifact.
- The custom tokenizer makes plain AutoTokenizer or default lm_eval --model hf unsuitable for final reported numbers.
- Numeric text is represented least-significant-digit first internally.
- Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats.
Files
- model.safetensors: model weights
- config.json, config.py, configuration_gpt.py, model.py: custom model code
- tokenizer.json, tokenizer_utils.py: tokenizer files and fusion wrapper
- benchmark_fusion_arithmark.py: ArithMark evaluation
- lm_eval_fusion.py, lm_eval_fusion: lm-eval custom model wrapper
- pretraining_curriculum.json: training curriculum
