ForSureLLM
Ultra-fast yes / no / unknown classifier for short user replies, distilled from Claude Sonnet 4.6 into a multilingual MiniLM-L12. ~2 ms on CPU, 24-113 MB, no API calls needed.
- 🎯 Try it live: HuggingFace Space demo
- 📦 Source code: github.com/jcfossati/ForSureLLM
What it does
Given a short French or English reply (1-30 words typically), returns whether the user is agreeing, refusing, or hesitating about a pending action. Designed as a consent-intent oracle for chatbots, IVR systems, CLI confirmations, and automation flows.
```python
from forsurellm import classify

classify("carrément")         # ("yes", 0.97)      "totally"
classify("laisse tomber")     # ("no", 0.98)       "forget it"
classify("je sais pas trop")  # ("unknown", 0.96)  "I'm not really sure"
classify("oui mais non")      # ("unknown", 0.92)  "yes but no"
classify("yeah right")        # ("no", 0.87)       sarcasm detected
classify("+1")                # ("yes", 1.00)      symbolic preprocessor
classify("👍")                # ("yes", 1.00)
```
Numbers
| Metric | Value |
|---|---|
| Adversarial accuracy (124 trap phrases, 22 categories) | 95.2 % |
| Surface-variant robustness (1227 variants) | 95.8 % |
| Test set accuracy (1178 phrases) | 91.7 % |
| Calibration ECE | 0.012 |
| CPU latency p50 | 1.8 ms |
| ONNX int8 size | 113 MB (multilingual) · 24 MB (FR+EN pruned variant) |
Head-to-head on the 124-case adversarial bench:
| Classifier | Accuracy | p50 latency | API cost |
|---|---|---|---|
| ForSureLLM | 95.2 % | 1.8 ms | 0 |
| Haiku 4.5 zero-shot | 75.0 % | 602 ms | $$ |
| Cosine MiniLM-L12 (no fine-tune) | 67.7 % | 8 ms | 0 |
ForSureLLM beats Haiku 4.5 zero-shot by +20.2 pts while running ~330× faster.
Strengths
Categories where ForSureLLM crushes a generalist LLM (Haiku 4.5):
| Category | Example phrases | ForSureLLM | Haiku 4.5 |
|---|---|---|---|
| `modern_slang` (Gen-Z) | no cap, bet, say less, deadass | 100 % | 43 % |
| `negated_verb` | I wouldn't say no, ce n'est pas un non | 83 % | 17 % |
| `sarcasm` | oui bien sûr..., yeah right | 100 % | 40 % |
| `symbolic` | +1, 100%, 👍, 10/10 (deterministic preprocessor, sketched below) | 100 % | 40 % |
| `slang_abbrev` | np, tkt, kk, nope | 100 % | 50 % |
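The perfect score on `symbolic` comes from a deterministic preprocessor that short-circuits the model. A minimal sketch of that idea, with an illustrative mapping (the real table ships inside the package):

```python
# Illustrative symbolic shortcut table; applied before the model runs.
SYMBOLIC = {"+1": "yes", "100%": "yes", "👍": "yes", "10/10": "yes"}

def preclassify(reply: str):
    label = SYMBOLIC.get(reply.strip())
    # Return None to fall through to the MiniLM classifier.
    return (label, 1.00) if label else None
```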
Files in this repo
- `forsurellm-int8.onnx`: full multilingual model, 113 MB (50+ languages supported via shared subwords, FR+EN tuned)
- (Optional) `forsurellm-int8_fr-en.onnx`: vocab-pruned FR+EN variant, 24 MB. Same predictions as the full model on FR+EN inputs, 5× lighter on disk and in RAM (+85 MB process memory vs +418 MB), latency unchanged. Tokens outside FR+EN become `<unk>`.
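If you only need French and English, the pruned variant is a drop-in replacement in the loading snippet below; only the file name changes:

```python
# Pruned FR+EN variant: same repo, different file name; the rest of the
# "How to use" snippet below stays the same.
onnx_path = hf_hub_download("jcfossati/ForSureLLM", "forsurellm-int8_fr-en.onnx")
```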
How to use (without the package)
```python
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
import numpy as np

onnx_path = hf_hub_download("jcfossati/ForSureLLM", "forsurellm-int8.onnx")
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

# tokenizer.json must be downloaded from the GitHub repo (space/tokenizer.json)
# or installed via the forsurellm package once published.
```
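From there, a bare-bones forward pass could look like the sketch below. The tensor names, the single-logits output, and the `["yes", "no", "unknown"]` class order are assumptions about the export rather than a documented contract; the packaged `classify` remains the reference path.

```python
tokenizer = Tokenizer.from_file("tokenizer.json")  # see the note above

def classify_raw(reply: str):
    # Assumed tensor names and a single logits output, as in a standard
    # transformer ONNX export; check session.get_inputs()/get_outputs()
    # if your copy differs.
    enc = tokenizer.encode(reply)
    ids = np.array([enc.ids[:64]], dtype=np.int64)   # replies are truncated at 64 tokens
    mask = np.ones_like(ids)
    logits = session.run(None, {"input_ids": ids, "attention_mask": mask})[0][0]
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()              # softmax over the 3 classes
    labels = ["yes", "no", "unknown"]                # assumed class order
    return labels[int(probs.argmax())], float(probs.max())
```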
For the full preprocessing pipeline (case normalisation, symbolic shortcuts, sarcasm-aware threshold), use the forsurellm Python package — see the GitHub repo for installation.
Training procedure
- Backbone: `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` (12 layers, 384 hidden)
- Teacher: Claude Sonnet 4.6 (generation) + Claude Haiku 4.5 (labeling, with Sonnet fallback when confidence < 0.6)
- Loss: KL-divergence on soft labels (3 classes)
- Dataset: ~5,800 hand-curated + LLM-generated EN+FR phrases, balanced across 22 adversarial categories
- Training: 8 epochs, batch 32, lr 2e-5, warmup 10%, weight decay 0.01 (~2 min on RTX Blackwell)
- Calibration: temperature scaling (T = 0.680, fitted by LBFGS on val set NLL)
- Export: ONNX dynamic quantization (avx512-vnni, int8)
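As a reference point, here is a minimal sketch of the two fully specified pieces above, the soft-label KL loss and the LBFGS temperature fit. PyTorch is assumed; tensor shapes and hyperparameters other than those listed are illustrative.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_probs):
    # KL-divergence between the teacher's soft labels and the student's
    # predicted distribution over the 3 classes (yes / no / unknown).
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    teacher_probs, reduction="batchmean")

def fit_temperature(val_logits, val_labels):
    # Post-hoc temperature scaling: a single scalar T fitted with LBFGS by
    # minimizing NLL on the validation set (the reported fit gave T = 0.680).
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        opt.zero_grad()
        loss = nll(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```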
Limitations
- EN + FR only. The full model (113 MB) keeps the multilingual vocab and may produce reasonable cross-lingual outputs on related Latin-script languages (Spanish/Italian/German), but is not trained for them. The pruned variant (24 MB) drops non-FR/EN tokens entirely.
- Short replies. Optimized for 1-30 word answers. Long passages will be truncated at 64 tokens.
- Sarcasm detection has cultural priors. `yeah right` defaults to "no" because it's overwhelmingly sarcastic in modern English usage; a sincere user without punctuation might get the wrong call. Use `threshold=0.85` for action-confirmation contexts to fall back to `unknown` on borderline cases.
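A short illustration of that fallback; the `threshold` keyword is taken from the note above, while `user_reply` and `ask_again` are placeholders for the host application:

```python
# With a stricter threshold, borderline calls come back as "unknown" so the
# calling flow can re-ask instead of acting on a possibly sincere reply.
label, conf = classify(user_reply, threshold=0.85)
if label == "unknown":
    ask_again()  # hypothetical re-prompt in the host application
```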
License
Apache 2.0 — same as the base MiniLM model.