MedTextBERT
A Hebrew medical document classifier fine-tuned from AlephBERT.
Classifies extracted text into 24 document categories covering a wide range of medical specialties.
Built as part of a privacy-first Android app that performs 100% offline OCR on Hebrew medical documents.
Performance
| Metric | Score |
|---|---|
| Accuracy | 93.8% |
| F1 | 93.75% |
Evaluated on a held-out test set after 20 epochs of fine-tuning.
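The card does not say which F1 averaging produced the score above, so the following is only a minimal sketch of how accuracy and two common multi-class F1 variants (macro and support-weighted) could be computed from true and predicted labels. The function names are illustrative and not part of this repository.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted category matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_per_class(y_true, y_pred):
    """Per-category F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gets a false positive
            fn[t] += 1  # true category gets a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 (every category counts equally)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-category F1 weighted by each category's true count."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * n for label, n in support.items()) / len(y_true)
```

Macro and weighted F1 coincide when every category has the same number of test documents and diverge when the test set is imbalanced.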
Categories
family_medicine cardiology cardiology_procedures imaging
diabetes_endocrinology pathology pediatrics orthopedics
neurology psychiatry urology surgery gastroenterology
hematology pulmonology dermatology infections_inflammation
gynecology oncology pharmacy emergency_medicine
geriatrics_rehabilitation administration_general lab_results
Training Data
Fine-tuned on a synthetically generated dataset of 4,500+ labeled Hebrew medical documents, covering edge cases and category variations to improve generalization across real-world formats.
Usage
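from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
classifier = pipeline(
    "text-classification",
    model="annaadar/MedTextBERT",
    tokenizer="annaadar/MedTextBERT",
)

# Example input: "After a routine blood test, normal values were found"
result = classifier("לאחר בדיקת דם שגרתית, נמצאו ערכים תקינים")
print(result)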
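Since the model was trained on synthetic data, a caller might prefer to treat low-confidence predictions as unclassified rather than trust the top label. The sketch below assumes the pipeline returns a list of `{"label", "score"}` dicts for a single string, which is the usual text-classification output. The helper names and the 0.5 threshold are hypothetical and should be tuned on real documents.

```python
from typing import Dict, List, Optional


def pick_label(predictions: List[Dict], min_score: float = 0.5) -> Optional[str]:
    """Return the highest-scoring label, or None if it falls below min_score."""
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else None


def classify(text: str, classifier, min_score: float = 0.5) -> Optional[str]:
    """Classify OCR text with a text-classification pipeline.

    `classifier` is expected to be a transformers pipeline (or any callable
    with the same return shape). truncation=True asks the tokenizer to cut
    long OCR output down to the model's maximum input length.
    """
    predictions = classifier(text, truncation=True)
    return pick_label(predictions, min_score)
```

With the pipeline created in the Usage section, `classify(text, classifier)` would return either a label string or `None` for inputs the model is unsure about.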
Limitations
- Trained on synthetic data, so performance on real-world clinical documents may vary
- Designed for Hebrew text only
- Not validated for clinical or diagnostic use
Intended Use
Research and portfolio purposes only.
Not intended for clinical or commercial use.
License: CC BY-NC 4.0
Base model: onlplab/alephbert-base