How to use Fu01978/TinyLM with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="Fu01978/TinyLM")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Fu01978/TinyLM", dtype="auto")

How to use Fu01978/TinyLM with vLLM:
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Fu01978/TinyLM"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Fu01978/TinyLM",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'
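Because the server speaks the OpenAI-compatible API, it can also be called from Python. Below is a minimal stdlib-only sketch, assuming a `vllm serve` instance is already running on port 8000 (the same payload works against the SGLang server below on port 30000):

```python
import json
from urllib import request

API_URL = "http://localhost:8000/v1/completions"  # default vllm serve port


def build_payload(prompt: str, max_tokens: int = 512, temperature: float = 0.5) -> dict:
    """Request body for the OpenAI-compatible completions endpoint."""
    return {
        "model": "Fu01978/TinyLM",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt: str) -> str:
    """POST the prompt to the running server and return the completion text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = request.Request(API_URL, data=data, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# With a server running: print(complete("Once upon a time,"))
```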
How to use Fu01978/TinyLM with SGLang:

# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Fu01978/TinyLM" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Fu01978/TinyLM",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
    }'

# Alternatively, run the SGLang server in Docker:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Fu01978/TinyLM" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API), exactly as shown above.

How to use Fu01978/TinyLM with Docker Model Runner:
docker model run hf.co/Fu01978/TinyLM
A 3.4M-parameter causal language model trained from scratch for experimentation.
| Hyperparameter | Value |
|---|---|
| Parameters | 3,403,968 |
| Layers | 4 |
| Hidden size | 64 |
| Attention heads | 4 |
| FFN dim | 192 |
| Embedding rank | 32 |
| Context length | 256 |
| Tokenizer | GPT-2 (50257 vocab) |
Uses a factored (low-rank) embedding to keep the vocab projection from eating the entire parameter budget, with weight tying on the output head.
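To illustrate the factorization (module names and layout here are illustrative, not the model's actual code in modeling_tinylm.py), the full vocab-by-hidden embedding table is replaced by two smaller factors:

```python
import torch.nn as nn

VOCAB, RANK, HIDDEN = 50257, 32, 64

# Factored embedding: token id -> rank-32 vector -> hidden size 64.
factored_embed = nn.Sequential(
    nn.Embedding(VOCAB, RANK),            # 50257 * 32 = 1,608,224 params
    nn.Linear(RANK, HIDDEN, bias=False),  # 32 * 64    =     2,048 params
)

full_params = VOCAB * HIDDEN                    # full table: 3,216,448
factored_params = VOCAB * RANK + RANK * HIDDEN  # factored:   1,610,272
```

Since a full 50257 x 64 table alone would cost roughly 3.2M of the ~3.4M total budget, the rank-32 factorization roughly halves the embedding cost and leaves room for the transformer layers.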
| Training setting | Value |
|---|---|
| Datasets | Skylion007/openwebtext (10k samples), roneneldan/TinyStories (10k samples) |
| Optimizer | AdamW (lr=3e-3, weight_decay=0.01) |
| Scheduler | Cosine annealing with warm restarts |
| Mixed precision | fp16 (torch.cuda.amp) |
| Hardware | Nvidia P100 |
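The optimizer and scheduler rows above translate to standard PyTorch calls. A sketch under stated assumptions: the restart period `T_0` is not given in the card and is a placeholder, and the `Linear` module stands in for the actual model:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(64, 64)  # stand-in module for TinyLM

# AdamW with the card's learning rate and weight decay
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.01)

# Cosine annealing with warm restarts; T_0 (steps per cycle) is assumed, not from the card
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=1000)

# fp16 mixed precision via torch.cuda.amp, as listed in the card (no-op on CPU)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
```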
Because the architecture is custom, the repository ships its own modeling script. To load the model with it:

from huggingface_hub import snapshot_download
import importlib.util
import torch

# Download the repository files
snapshot_download(repo_id="Fu01978/TinyLM", local_dir="./tinylm")

# Load the model class via the bundled script
spec = importlib.util.spec_from_file_location("modeling_tinylm", "./tinylm/modeling_tinylm.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

model, tokenizer, config = module.load_tinylm("./tinylm")
model.eval()

# Generate
output = module.generate(model, tokenizer, "Once upon a time, ")
print(output)