Kokoro German — HUI Multispeaker Base (Stage 1)

German multispeaker Stage 1 base model built on Kokoro-82M, trained on the HUI Audio Corpus German (CC0, 51 speakers).

This is a base model, not a voice. Use it as the foundation for Stage 2 fine-tuning with your own speaker data. Compared with a single-speaker Stage 1 base, a multispeaker base is less tied to one speaker identity, but it is not speaker-neutral in any strict sense.

Training repo: semidark/kokoro-deutsch · Discussion: Issue #9 · Dataset: dida-80b/hui-german-51speakers

License Note

This checkpoint is built on top of Kokoro-82M, which is released under Apache-2.0.

The training dataset, hui-german-51speakers, is released under CC0-1.0.

Dataset

Corpus


Source	HUI Audio Corpus German
License	CC0
Raw speakers	122
After filtering	51 (24m / 27f)
Raw speaker duration	strongly imbalanced
Total effective	~51h

Train / Val Split


Train samples	20,495
Val samples	418
Clip duration	1–20s
Sample rate	24 kHz
Format	WAV (mono)

Quality Filter

Check	Threshold
Min RMS	−42 dB
Max clipping	0.1%
Max silence	50%
Min per speaker	5 min
Manual review	✓ all 51
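
The three automatic checks in the table can be sketched roughly as follows. This is a minimal illustration, not the filter actually used for this dataset. The thresholds come from the table, but several details are assumptions: that the RMS level is measured in dB relative to full scale on float samples in [-1, 1], that a sample counts as clipped at an absolute value of 0.999 or more, and that silence is measured per 20 ms frame with a −50 dB cutoff. All function names are hypothetical.

```python
import math

# Thresholds taken from the Quality Filter table.
MIN_RMS_DB = -42.0
MAX_CLIPPING = 0.001   # 0.1% of samples
MAX_SILENCE = 0.5      # 50% of frames

# Assumptions for this sketch (not stated in the model card):
CLIP_LEVEL = 0.999     # |sample| at or above this counts as clipped
SILENCE_DB = -50.0     # frames with RMS below this count as silent
FRAME_LEN = 480        # 20 ms at 24 kHz


def rms_db(samples):
    """RMS level in dB relative to full scale (1.0)."""
    if not samples:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    # 10*log10(mean square) equals 20*log10(RMS).
    return 10.0 * math.log10(mean_sq) if mean_sq > 0 else float("-inf")


def clipping_ratio(samples):
    """Fraction of samples at or above the assumed clip level."""
    return sum(abs(s) >= CLIP_LEVEL for s in samples) / len(samples)


def silence_ratio(samples):
    """Fraction of fixed-length frames whose RMS is below the silence cutoff."""
    frames = [samples[i:i + FRAME_LEN] for i in range(0, len(samples), FRAME_LEN)]
    return sum(rms_db(f) < SILENCE_DB for f in frames) / len(frames)


def passes_quality_filter(samples):
    """Apply the three per-clip checks; an empty clip fails the RMS check first."""
    return (
        rms_db(samples) >= MIN_RMS_DB
        and clipping_ratio(samples) <= MAX_CLIPPING
        and silence_ratio(samples) <= MAX_SILENCE
    )
```

The per-speaker minimum of 5 minutes and the manual review are dataset-level steps and are not covered by this per-clip sketch.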

Duration cap: capping each speaker at 60 min keeps dominant speakers from biasing the model (Bernd_Ungerer alone contributed 81h raw).
Weighted sampling: clips from speakers with less data are duplicated so that all 51 speakers appear about equally often in batches, which gives each speaker embedding a similar number of gradient updates over training.
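
The two balancing steps can be illustrated with a short sketch. It is an assumption-laden example rather than the training repo's code: the clip tuples, the random selection under the cap, and the choice to equalize clip counts (not durations) are all hypothetical, as are the function names.

```python
import random
from collections import defaultdict

CAP_SECONDS = 60 * 60  # 60 min per speaker, as stated above


def _group_by_speaker(clips):
    """Group (speaker, clip_id, duration_s) tuples by speaker."""
    by_speaker = defaultdict(list)
    for clip in clips:
        by_speaker[clip[0]].append(clip)
    return by_speaker


def cap_per_speaker(clips, cap_seconds=CAP_SECONDS, seed=0):
    """Keep at most cap_seconds of audio per speaker."""
    rng = random.Random(seed)
    kept = []
    by_speaker = _group_by_speaker(clips)
    for speaker in sorted(by_speaker):
        pool = by_speaker[speaker][:]
        rng.shuffle(pool)  # assumption: clips under the cap are chosen at random
        total = 0.0
        for clip in pool:
            if total + clip[2] > cap_seconds:
                continue  # skip clips that would exceed the cap
            kept.append(clip)
            total += clip[2]
    return kept


def balance_by_duplication(clips, seed=0):
    """Duplicate clips of smaller speakers until all speakers have equal clip counts."""
    rng = random.Random(seed)
    by_speaker = _group_by_speaker(clips)
    target = max(len(pool) for pool in by_speaker.values())
    balanced = []
    for speaker in sorted(by_speaker):
        pool = by_speaker[speaker]
        balanced.extend(pool)
        # Draw the missing clips with replacement from the speaker's own pool.
        balanced.extend(rng.choices(pool, k=target - len(pool)))
    return balanced
```

With equal clip counts per speaker, uniform shuffling over the balanced list draws every speaker with the same probability. A weighted sampler with per-clip weights of 1 / (clips for that speaker) would give the same expected balance without duplicating entries.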

Training

Setup


Base model	kokoro_base.pth
GPU	NVIDIA A40 (48 GB)
Batch size	8
Epochs	10
Workers	7

Avg. Mel Loss per Epoch

Epoch	Mel Loss (avg)	Δ
1	0.5826	—
2	0.4131	−29%
3	0.3743	−9%
4	0.3577	−4%
5	0.3483	−3%
6	0.3414	−2%
7	0.3364	−1%
8	0.3321	−1%
9	0.3288	−1%
10	0.3264	−1% (−44% total)
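
The Δ column is the relative change in average mel loss from the previous epoch, and the total compares epoch 10 with epoch 1. The values can be recomputed from the table with a few lines. The variable and function names here are only illustrative.

```python
# Average mel loss per epoch, copied from the table above (epochs 1 to 10).
mel_loss = [0.5826, 0.4131, 0.3743, 0.3577, 0.3483,
            0.3414, 0.3364, 0.3321, 0.3288, 0.3264]


def pct_change(new, old):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Change from each epoch to the next, rounded to whole percent.
step_deltas = [round(pct_change(b, a)) for a, b in zip(mel_loss, mel_loss[1:])]

# Change from the first epoch to the last.
total_delta = round(pct_change(mel_loss[-1], mel_loss[0]))

print(step_deltas)  # [-29, -9, -4, -3, -2, -1, -1, -1, -1]
print(total_delta)  # -44
```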

Final Losses (Ep 10)

Loss	Value
Mel Loss	0.333
Disc Loss	3.983
Mono Loss	0.030
S2S Loss	0.523
SLM Loss	1.422

Mono Loss stayed low throughout training, which indicates stable attention alignment: the most common Stage 1 failure mode, a breakdown of alignment, did not occur.

Files

File	Description
`first_stage.pth`	Final Stage 1 checkpoint (Epoch 10) — use this for Stage 2 fine-tuning

Audio Samples — Epoch 10

Seven phonetically diverse test sentences, synthesized from the final checkpoint.
Note: this is a multispeaker base checkpoint, not a single-speaker voice model.

#	Text	Audio
1	Schön, dass du da bist. Die Bücher liegen auf dem großen Tisch.
2	Ich mache mich auf den Weg nach Aachen, um auch nachts wach zu sein.
3	Er aß die Maße in der Straße, aber das Maß war voll.
4	Zwei weiße Zwerge zwängen sich zwischen zwei Zweige.
5	Ein Pfau pflegt seine Federn an der Pfütze.
6	Warum hast du das getan? Das ist ja unglaublich!
7	Das kostet genau einhundertdreiundzwanzig Millionen Euro.
