MedTextBERT

A Hebrew medical document classifier fine-tuned from AlephBERT (onlplab/alephbert-base).
Classifies extracted text into 24 document categories covering a wide range of medical specialties.

Built as part of a privacy-first Android app that performs 100% offline OCR on Hebrew medical documents.

Performance

Metric	Score
Accuracy	93.8%
F1	93.75%

Evaluated on a held-out test set after 20 epochs of fine-tuning.
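For readers who want to reproduce this kind of evaluation on their own labeled set, the following is a minimal sketch of how accuracy and F1 can be computed from gold and predicted labels. The card does not state which F1 averaging was used, so macro averaging here is an assumption, and the category names and predictions are hypothetical placeholders rather than the model's real labels.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted category matches the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 scores (macro averaging is an assumption)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category p, but it was wrong
            fn[t] += 1  # gold category t was missed
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical category names and predictions, for illustration only.
gold = ["lab_result", "referral", "lab_result", "prescription"]
pred = ["lab_result", "referral", "prescription", "prescription"]

acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

With 24 categories, macro averaging weights rare categories as heavily as common ones. A weighted average would instead track accuracy more closely when the test set is imbalanced.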

Training Data

Fine-tuned on a synthetically generated dataset of 4,500+ labeled Hebrew medical documents, covering edge cases and category variations to improve generalization across real-world formats.

Usage

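from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="annaadar/MedTextBERT",
    tokenizer="annaadar/MedTextBERT",
)

# Sample input: "After a routine blood test, the values were found to be normal."
result = classifier("לאחר בדיקת דם שגרתית, נמצאו ערכים תקינים")
print(result)  # a list holding one {"label": ..., "score": ...} dict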
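The text-classification pipeline returns a list of dictionaries with "label" and "score" keys. One possible way to consume that output is to accept a prediction only when its score clears a threshold and otherwise treat the document as uncategorized. The sketch below assumes that approach. The helper name `top_prediction`, the threshold values, and the label names in the sample are hypothetical, since the card does not list the model's 24 label names.

```python
def top_prediction(predictions, min_score=0.5):
    """Return (label, score) for the best-scoring entry, or None if it is below min_score.

    `predictions` is expected to look like text-classification pipeline output:
    a list of {"label": str, "score": float} dicts.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    if best["score"] < min_score:
        return None
    return best["label"], best["score"]


# Stand-in for pipeline output; these label names and scores are made up.
sample = [
    {"label": "LABEL_3", "score": 0.91},
    {"label": "LABEL_7", "score": 0.06},
]

confident = top_prediction(sample, min_score=0.8)
uncertain = top_prediction(sample, min_score=0.95)
print(confident)
print(uncertain)
```

With the classifier defined above, `top_prediction(classifier(text))` would apply the same rule to a real prediction. A suitable threshold would need to be chosen on validation data rather than taken from this sketch.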

Limitations

Trained on synthetic data — performance on real-world clinical documents may vary
Designed for Hebrew text only
Not validated for clinical or diagnostic use
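Because the model is designed for Hebrew text only, a caller may want to check the script of the OCR output before classifying it. The following is a rough sketch of one such guard, which measures the share of letters that fall in the Unicode Hebrew block (U+0590 to U+05FF). The function names and the 0.6 cutoff are hypothetical choices for illustration and are not part of the model or the app.

```python
def hebrew_ratio(text):
    """Share of alphabetic characters that belong to the Unicode Hebrew block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hebrew = sum(1 for ch in letters if "\u0590" <= ch <= "\u05FF")
    return hebrew / len(letters)


def looks_hebrew(text, min_ratio=0.6):
    """Heuristic gate: True when most letters are Hebrew (the cutoff is an assumption)."""
    return hebrew_ratio(text) >= min_ratio


print(looks_hebrew("בדיקת דם"))            # Hebrew-only text
print(looks_hebrew("Blood test results"))  # Latin-only text
print(looks_hebrew("בדיקת CBC"))           # mixed text, 5 of 8 letters are Hebrew
```

Medical documents often mix Hebrew with Latin abbreviations and numbers, which is why the sketch uses a ratio over letters instead of requiring every character to be Hebrew.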

Intended Use

Research and portfolio purposes only.
Not intended for clinical or commercial use.
License: CC BY-NC 4.0

Model Details

Format: Safetensors
Model size: 0.1B params
Tensor type: F32
Base model: onlplab/alephbert-base
