Atom2.7m

Atom2.7m is a small decoder-only causal language model trained with a general byte-level BPE tokenizer plus arithmetic-specific digit features. The model has 2,738,880 parameters and relies on custom code for both the model and the tokenizer.

Model Details

Architecture: decoder-only GPT
Parameters: 2,738,880
Layers: 5
Hidden size: 192
Attention heads: 4
KV heads: 2
Context length: 512
Vocabulary size: 4,096
Token embeddings: tied input/output embeddings
Arithmetic feature embeddings:
- place_vocab_size: 66
- role_vocab_size: 12
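
The figures above can be cross-checked with a little arithmetic. The sketch below is a hypothetical breakdown, not read from config.json or model.py: it assumes bias-free attention projections, a gated three-matrix MLP with an intermediate size of 480 (the name mlp_hidden and that value are assumptions), two weight-only norms per block, and one final norm. Under those assumptions the total matches the stated 2,738,880 parameters, but other layouts could reproduce the same count, so config.json remains authoritative.

```python
# Values taken from the model card.
hidden, layers, heads, kv_heads = 192, 5, 4, 2
vocab, place_vocab, role_vocab = 4096, 66, 12

# Assumption: intermediate size of a gated MLP (not stated in the card).
mlp_hidden = 480

head_dim = hidden // heads             # width of one attention head
queries_per_kv = heads // kv_heads     # query heads sharing each KV head
kv_dim = kv_heads * head_dim           # width of the key and value projections

# Tied input/output embeddings mean the output head adds no parameters.
embeddings = (vocab + place_vocab + role_vocab) * hidden
attention = 2 * hidden * hidden + 2 * hidden * kv_dim   # q, o, k, v (assumed bias-free)
mlp = 3 * hidden * mlp_hidden                           # gate, up, down (assumed)
norms = 2 * hidden                                      # two weight-only norms (assumed)

total = embeddings + layers * (attention + mlp + norms) + hidden  # plus final norm

print(head_dim, queries_per_kv, total)  # 48 2 2738880
```

With 4 attention heads and 2 KV heads, each KV head is shared by 2 query heads, which halves the key and value projection widths relative to full multi-head attention.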

Tokenizer

This model should not be evaluated or used with a plain Hugging Face tokenizer alone. It uses a custom fusion tokenizer implemented in tokenizer_utils.py.

The tokenizer keeps byte-level BPE for ordinary text, but treats arithmetic-sensitive spans specially:

digits 0-9 are atomic and never BPE-merged
digit spans are emitted least-significant-digit first (123 is emitted as 3, 2, 1)
the characters + - * / = ( ) are isolated atomic tokens
whitespace is isolated from text
place_ids are assigned to every digit run
role_ids are assigned only for strict integer equation spans

The model expects aligned input_ids, place_ids, and role_ids.
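
The splitting rules above can be illustrated with a toy pre-tokenizer. This is a minimal sketch and not the logic in tokenizer_utils.py: the function name toy_split, the treatment of a whitespace run as a single piece, and the place-id numbering (0 for non-digit pieces, 1 for the units digit, 2 for the tens digit, and so on) are all assumptions made for illustration. The real tokenizer also applies BPE to text pieces and assigns role_ids, which this sketch omits.

```python
import re

# Hypothetical pattern: digit runs, isolated operators, whitespace, other text.
PIECE_RE = re.compile(
    r"(?P<digits>[0-9]+)"
    r"|(?P<op>[+\-*/=()])"
    r"|(?P<space>\s+)"
    r"|(?P<text>[^0-9+\-*/=()\s]+)"
)

def toy_split(text):
    """Split text into pieces with aligned, assumed place ids."""
    pieces, place_ids = [], []
    for match in PIECE_RE.finditer(text):
        piece = match.group()
        if match.lastgroup == "digits":
            # Emit each digit on its own, least-significant digit first.
            for place, digit in enumerate(reversed(piece), start=1):
                pieces.append(digit)
                place_ids.append(place)
        else:
            # Operators, whitespace, and text carry the assumed "no place" id 0.
            pieces.append(piece)
            place_ids.append(0)
    return pieces, place_ids

pieces, place_ids = toy_split("12 + 34 =")
print(pieces)     # ['2', '1', ' ', '+', ' ', '4', '3', ' ', '=']
print(place_ids)  # [1, 2, 0, 0, 0, 1, 2, 0, 0]
```

Both lists always have the same length, which mirrors the alignment the model expects between input_ids and place_ids.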

Usage

from pathlib import Path

import torch
from transformers import AutoModelForCausalLM

from tokenizer_utils import load_tokenizer

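model_dir = Path(".")

# trust_remote_code is needed because the model uses custom code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    trust_remote_code=True,
).eval()
# Load the fusion tokenizer rather than a plain AutoTokenizer.
tokenizer = load_tokenizer(model_dir)

text = "12 + 34 ="
encoding = tokenizer.encode(text)

# Add a batch dimension; all three tensors must share the same shape.
input_ids = torch.tensor([encoding.input_ids])
place_ids = torch.tensor([encoding.place_ids])
role_ids = torch.tensor([encoding.role_ids])

# Forward pass without gradient tracking.
with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        place_ids=place_ids,
        role_ids=role_ids,
    )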

For correct results, do not rely on pipeline("text-generation") unless it is wrapped to provide place_ids and role_ids.
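
Any wrapper that batches inputs has to keep the three id streams aligned after padding. The helper below is a minimal sketch of that step in plain Python. The name pad_aligned, the use of 0 as the pad value for all three streams, and the attention_mask output are assumptions; the actual pad ids and whether the model accepts an attention mask should be checked against tokenizer_utils.py and model.py.

```python
def pad_aligned(encodings, pad_id=0):
    """Right-pad (input_ids, place_ids, role_ids) triples to a common length.

    pad_id=0 is an assumed placeholder, not a value confirmed by the model card.
    """
    max_len = max(len(input_ids) for input_ids, _, _ in encodings)
    batch = {"input_ids": [], "place_ids": [], "role_ids": [], "attention_mask": []}
    for input_ids, place_ids, role_ids in encodings:
        # Reject inputs whose feature streams have drifted out of alignment.
        if not (len(input_ids) == len(place_ids) == len(role_ids)):
            raise ValueError("input_ids, place_ids and role_ids must have equal length")
        gap = max_len - len(input_ids)
        batch["input_ids"].append(list(input_ids) + [pad_id] * gap)
        batch["place_ids"].append(list(place_ids) + [pad_id] * gap)
        batch["role_ids"].append(list(role_ids) + [pad_id] * gap)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(input_ids) + [0] * gap)
    return batch
```

Each value in the returned dictionary is a rectangular list of lists, so it can be passed to torch.tensor directly.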

Evaluation

ArithMark 2.0

Use the included fusion-aware benchmark script:

python benchmark_fusion_arithmark.py \
  --checkpoint . \
  --tokenizer-dir . \
  --data-path arithmark_2.0.jsonl \
  --batch-size 64 \
  --device cuda \
  --output benchmark_results/fusion_arithmark_2.0_results.json

lm-evaluation-harness

Use the included launcher so the atom2.7m model wrapper is registered:

python lm_eval_fusion run \
  --model atom2.7m \
  --model_args pretrained=.,tokenizer_dir=. \
  --tasks hellaswag,arc_easy,arc_challenge,piqa \
  --device cuda:0 \
  --batch_size auto \
  --output_path benchmark_results/lm_eval

The wrapper uses tokenizer_utils.load_tokenizer() and forwards place_ids and role_ids to the model.

Results

Benchmark	Metric	Value
ArithMark 2.0	acc	0.6380
arc_challenge	acc_norm	0.2261
arc_easy	acc_norm	0.3270
hellaswag	acc_norm	0.2733
piqa	acc_norm	0.5305

Training Data

The pretraining mixture targeted about 3.5B tokens:

Ultra-FineWeb: 900M
FineWeb-Edu: 900M
FineMath: 450M
Cosmopedia-v2: 337.5M
UltraData-Math-L2-preview: 337.5M
Ultra-FineWeb-L3-en-QA-Synthetic: 225M
Synthetic-Arithmetic: 350M

Synthetic-Arithmetic is AtomCalc-style canonical integer equation data. The training curriculum is included as pretraining_curriculum.json.
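
The mixture sizes listed above can be turned into sampling shares with a few lines of Python. This is only an illustration of the arithmetic; the dictionary below restates the listed token targets (in millions) and is not the format of pretraining_curriculum.json, which is not described here.

```python
# Token targets in millions, copied from the list above.
mixture_m_tokens = {
    "Ultra-FineWeb": 900.0,
    "FineWeb-Edu": 900.0,
    "FineMath": 450.0,
    "Cosmopedia-v2": 337.5,
    "UltraData-Math-L2-preview": 337.5,
    "Ultra-FineWeb-L3-en-QA-Synthetic": 225.0,
    "Synthetic-Arithmetic": 350.0,
}

total_m_tokens = sum(mixture_m_tokens.values())
shares = {name: tokens / total_m_tokens for name, tokens in mixture_m_tokens.items()}

print(f"total: {total_m_tokens / 1000:.1f}B tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The components sum to 3,500M tokens, matching the stated 3.5B target, and Synthetic-Arithmetic accounts for 10% of the mixture.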

Limitations

This is a very small model and should be treated as an experimental research artifact.
The custom tokenizer makes plain AutoTokenizer or default lm_eval --model hf unsuitable for final reported numbers.
Numeric text is represented least-significant-digit first internally.
Role annotations intentionally target strict integer equations, not broad math prose, decimals, rationals, or QA formats.

Files

model.safetensors: model weights
config.json, config.py, configuration_gpt.py, model.py: custom model code
tokenizer.json, tokenizer_utils.py: tokenizer files and fusion wrapper
benchmark_fusion_arithmark.py: ArithMark evaluation
lm_eval_fusion.py, lm_eval_fusion: lm-eval custom model wrapper
pretraining_curriculum.json: training curriculum
