Kokoro German — HUI Multispeaker Base (Stage 1)

German multispeaker Stage 1 base model built on Kokoro-82M, trained on the HUI Audio Corpus German (CC0, 51 speakers).

This is a base model, not a voice. Use it as the foundation for Stage 2 fine-tuning with your own speaker data. Compared with a single-speaker Stage 1 base, a multispeaker base is less tied to one speaker identity, but it is not speaker-neutral in any strict sense.

Training repo: semidark/kokoro-deutsch · Discussion: Issue #9 · Dataset: dida-80b/hui-german-51speakers


License Note

This checkpoint is built on top of Kokoro-82M, which is released under apache-2.0.

The training dataset used here, hui-german-51speakers, is cc0-1.0.


Dataset

Corpus

Source: HUI Audio Corpus German
License: CC0
Raw speakers: 122
After filtering: 51 (24 male / 27 female)
Raw speaker duration: strongly imbalanced
Total effective: ~51 h

Train / Val Split

Train samples: 20,495
Val samples: 418
Clip duration: 1–20 s
Sample rate: 24 kHz
Format: WAV (mono)

Quality Filter

| Check | Threshold |
|---|---|
| Min RMS | −42 dB |
| Max clipping | 0.1% |
| Max silence | 50% |
| Min per speaker | 5 min |
| Manual review | ✓ all 51 |
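
The per-clip checks in the table can be sketched as follows. This is a minimal illustration and not the filter actually used for this dataset: the three thresholds come from the table, while the clipping level, the silence level, the frame length, and all function names are assumptions made for the example. The per-speaker minimum and the manual review are not covered.

```python
import math

# Thresholds taken from the Quality Filter table.
MIN_RMS_DB = -42.0     # minimum overall RMS level
MAX_CLIPPING = 0.001   # at most 0.1% clipped samples
MAX_SILENCE = 0.50     # at most 50% silent frames

# Assumptions for illustration only (not stated in the model card).
CLIP_LEVEL = 0.999     # |sample| at or above this counts as clipped
SILENCE_DB = -50.0     # frames quieter than this count as silent
FRAME_LEN = 480        # 20 ms at 24 kHz


def rms_db(samples):
    """RMS level in dB relative to full scale (1.0)."""
    if not samples:
        return -math.inf
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_sq) if mean_sq > 0 else -math.inf


def clipping_ratio(samples):
    """Fraction of samples at or above the assumed clipping level."""
    return sum(abs(s) >= CLIP_LEVEL for s in samples) / len(samples)


def silence_ratio(samples):
    """Fraction of fixed-length frames below the assumed silence level."""
    frames = [samples[i:i + FRAME_LEN] for i in range(0, len(samples), FRAME_LEN)]
    return sum(rms_db(f) < SILENCE_DB for f in frames) / len(frames)


def passes_quality_filter(samples):
    """Apply the three per-clip checks to mono samples in [-1.0, 1.0]."""
    if not samples:
        return False
    return (
        rms_db(samples) >= MIN_RMS_DB
        and clipping_ratio(samples) <= MAX_CLIPPING
        and silence_ratio(samples) <= MAX_SILENCE
    )
```

For example, one second of a 220 Hz tone at amplitude 0.1 passes all three checks, while an all-zero clip fails the RMS check.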

Duration cap: 60 min/speaker prevents dominant speakers from biasing the model (Bernd_Ungerer alone contributed 81h raw).
Weighted sampling: Clips from speakers with less data are duplicated so that all 51 speakers appear about equally often in batches, which gives each speaker embedding a comparable number of gradient updates during training.
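
One way to picture the duration cap and the duplication step is the sketch below. It is an illustration under assumptions, not the training repo's code: the tuple layout, the function names, the choice to keep clips in input order until the cap is reached, and balancing by clip count rather than by duration are all hypothetical.

```python
import math
from collections import defaultdict

CAP_SECONDS = 60 * 60  # 60 min per speaker, as stated above


def cap_per_speaker(clips, cap_seconds=CAP_SECONDS):
    """Keep each speaker's clips, in input order, while they fit under the cap.

    `clips` is a list of (speaker, clip_id, duration_seconds) tuples.
    """
    used = defaultdict(float)
    kept = []
    for speaker, clip_id, duration in clips:
        if used[speaker] + duration <= cap_seconds:
            used[speaker] += duration
            kept.append((speaker, clip_id, duration))
    return kept


def balance_by_duplication(clips):
    """Repeat each speaker's clips until every speaker has as many
    entries as the speaker with the most clips."""
    if not clips:
        return []
    by_speaker = defaultdict(list)
    for clip in clips:
        by_speaker[clip[0]].append(clip)
    target = max(len(items) for items in by_speaker.values())
    balanced = []
    for items in by_speaker.values():
        repeats = math.ceil(target / len(items))
        balanced.extend((items * repeats)[:target])
    return balanced
```

A sampler that draws clips with probability inversely proportional to the speaker's clip count would have a similar effect without materializing the duplicates.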


Training

Setup

Base model: kokoro_base.pth
GPU: NVIDIA A40 (48 GB)
Batch size: 8
Epochs: 10
Workers: 7

Avg. Mel Loss per Epoch

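| Epoch | Mel Loss (avg) | Δ vs. previous epoch |
|---|---|---|
| 1 | 0.5826 | |
| 2 | 0.4131 | −29% |
| 3 | 0.3743 | −9% |
| 4 | 0.3577 | −4% |
| 5 | 0.3483 | −3% |
| 6 | 0.3414 | −2% |
| 7 | 0.3364 | −1% |
| 8 | 0.3321 | −1% |
| 9 | 0.3288 | −1% |
| 10 | 0.3264 | −1% (−44% total vs. epoch 1) |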
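
The Δ values can be reproduced from the per-epoch averages. The short sketch below recomputes the relative change between consecutive epochs and the total change from epoch 1 to epoch 10; the variable and function names are chosen for the example only.

```python
# Average mel loss per epoch, copied from the table above.
mel_loss = [0.5826, 0.4131, 0.3743, 0.3577, 0.3483,
            0.3414, 0.3364, 0.3321, 0.3288, 0.3264]


def relative_change(previous, current):
    """Relative change from `previous` to `current`, in percent."""
    return 100.0 * (current - previous) / previous


# Change of each epoch relative to the one before it (epochs 2 to 10).
per_epoch = [relative_change(a, b) for a, b in zip(mel_loss, mel_loss[1:])]

# Change of the last epoch relative to the first.
total = relative_change(mel_loss[0], mel_loss[-1])

for epoch, delta in enumerate(per_epoch, start=2):
    print(f"epoch {epoch}: {delta:+.1f}%")
print(f"total (epoch 1 to 10): {total:+.1f}%")
```

Most of the improvement happens in the first two or three epochs; from epoch 7 onward each epoch lowers the average by roughly 1% or less.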

Final Losses (Ep 10)

| Loss | Value |
|---|---|
| Mel Loss | 0.333 |
| Disc Loss | 3.983 |
| Mono Loss | 0.030 |
| S2S Loss | 0.523 |
| SLM Loss | 1.422 |

Mono Loss stayed low throughout training, which indicates stable attention alignment; unstable alignment, the most common Stage 1 failure mode, did not occur.


Files

| File | Description |
|---|---|
| first_stage.pth | Final Stage 1 checkpoint (Epoch 10); use this for Stage 2 fine-tuning |

Audio Samples — Epoch 10

Seven phonetically diverse test sentences synthesized from the final checkpoint.
Note: this is a multispeaker base checkpoint, not a single-speaker voice model.

| # | Text |
|---|---|
| 1 | Schön, dass du da bist. Die Bücher liegen auf dem großen Tisch. |
| 2 | Ich mache mich auf den Weg nach Aachen, um auch nachts wach zu sein. |
| 3 | Er aß die Maße in der Straße, aber das Maß war voll. |
| 4 | Zwei weiße Zwerge zwängen sich zwischen zwei Zweige. |
| 5 | Ein Pfau pflegt seine Federn an der Pfütze. |
| 6 | Warum hast du das getan? Das ist ja unglaublich! |
| 7 | Das kostet genau einhundertdreiundzwanzig Millionen Euro. |