MedTextBERT

A Hebrew medical document classifier fine-tuned from AlephBERT.
Classifies extracted text into 24 document categories covering a wide range of medical specialties.

Built as part of a privacy-first Android app that runs OCR on Hebrew medical documents entirely offline.

Performance

Metric     Score
Accuracy   93.8%
F1         93.75%

Evaluated on a held-out test set after 20 epochs of fine-tuning.
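
For readers who want to reproduce comparable numbers on their own labeled data, here is a minimal sketch of how accuracy and F1 can be computed from true and predicted category labels. The card does not say how F1 is averaged across the 24 categories, so macro averaging is an assumption here, and the function names `accuracy` and `macro_f1` are illustrative, not part of this model's code.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted category equals the true one."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 (macro averaging is an assumption)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category p, but it was wrong
            fn[t] += 1  # true category t was missed

    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is non-zero because
        # every label here appears at least once as a truth or a prediction.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom)
    return sum(scores) / len(scores)
```

Both functions take two equal-length lists of category strings, such as the names listed under Categories. If the reported F1 was computed with weighted or micro averaging instead, the result will differ from what this sketch produces.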

Categories

family_medicine cardiology cardiology_procedures imaging
diabetes_endocrinology pathology pediatrics orthopedics
neurology psychiatry urology surgery gastroenterology
hematology pulmonology dermatology infections_inflammation
gynecology oncology pharmacy emergency_medicine
geriatrics_rehabilitation administration_general lab_results

Training Data

Fine-tuned on a synthetically generated dataset of 4,500+ labeled Hebrew medical documents, covering edge cases and category variations to improve generalization across real-world formats.

Usage

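from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
classifier = pipeline(
    "text-classification",
    model="annaadar/MedTextBERT",
    tokenizer="annaadar/MedTextBERT"
)

# Input: "After a routine blood test, normal values were found"
result = classifier("לאחר בדיקת דם שגרתית, נמצאו ערכים תקינים")
print(result)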
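
The text-classification pipeline returns a list of dictionaries, each with a `label` and a `score` key. A minimal sketch of how an app might turn that into a single category follows. The helper name `top_category` and the 0.5 default threshold are hypothetical choices for illustration, and it assumes the returned labels are the category names listed above, which depends on the model's configuration.

```python
def top_category(predictions, min_score=0.5):
    """Pick the highest-scoring prediction from pipeline-style output.

    predictions: list of {"label": str, "score": float} dicts, the shape
    returned by a transformers text-classification pipeline.
    min_score: hypothetical confidence cut-off; tune it on your own data.

    Returns (label, score), or (None, score) when the best score falls
    below min_score so the caller can route the document to manual review.
    """
    if not predictions:
        raise ValueError("predictions is empty")
    best = max(predictions, key=lambda p: p["score"])
    if best["score"] < min_score:
        return None, best["score"]
    return best["label"], best["score"]
```

With the snippet above, this could be called as `top_category(result)`. Returning `None` for low-confidence predictions is one possible design for a model trained on synthetic data, since real documents may fall outside what it has seen.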

Limitations

  • Trained on synthetic data only; performance on real-world clinical documents may vary
  • Designed for Hebrew text only
  • Not validated for clinical or diagnostic use

Intended Use

Research and portfolio purposes only.
Not intended for clinical or commercial use.
License: CC BY-NC 4.0
