Trim training-procedure detail and tighten narrative

README.md

```diff
@@ -166,41 +166,11 @@ this restriction; see [License](#license).
 
 ## Training procedure
 
-
-
-| Parameter | Value |
-|---|---|
-| Base model | `mlx-community/whisper-large-v3-mlx` |
-| Dtype | float32 |
-| Encoder | frozen |
-| Decoder | trainable (~67M params, 92.5% of total) |
-| Optimizer | AdamW |
-| Peak learning rate | 5e-5 |
-| LR schedule | linear warmup 500 → cosine decay |
-| Min LR | 1e-6 |
-| Batch size | 10 |
-| Gradient accumulation | 1 |
-| Gradient clipping | global max-norm 1.0 |
-| Validation cadence | every 1,000 steps |
-| Validation batch size | 4 |
-| Steps configured | 30,000 |
-| Steps actually run | 30,000 (no early stop) |
-| Random seed | 42 |
-| MLX allocator cache cap | 20 GB |
+Decoder-only fine-tune, encoder frozen, AdamW with linear warmup and cosine decay, fp32, on a single Apple M3 Ultra with [MLX](https://github.com/ml-explore/mlx). Full hyperparameters, launchers, and reproduction commands are in the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).
 
 ### Training-time language token
 
-All training samples use `<|en|>` as the start-of-transcript prefix
-regardless of source-audio language; the token is overloaded as
-"emit IPA". This is intentional — phonetic transcription is meant to
-be language-agnostic, so the decoder is trained without a per-language
-signal. **Pass `language="en"` at inference.**
-
-### Hardware and runtime
-
-Trained on a single Apple Mac Studio M3 Ultra (96 GB unified memory).
-Total wall-clock: 1,629 minutes (~27 hours). Step time ≈ 3.0 s/step
-average at fp32, batch 10, on whisper-large-v3.
+All training samples use `<|en|>` as the start-of-transcript prefix regardless of source-audio language; the token is overloaded as "emit IPA". This is intentional — phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. **Pass `language="en"` at inference.**
 
 ## Evaluation
 
```
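The trimmed table compressed the LR schedule into one cell ("linear warmup 500 → cosine decay", peak 5e-5, min 1e-6, 30,000 steps). A minimal sketch of that schedule, assuming warmup rises from zero and the cosine decay spans the remaining steps (the repository's exact implementation may differ):

```python
import math

# Values from the (removed) hyperparameter table
WARMUP_STEPS = 500
TOTAL_STEPS = 30_000
PEAK_LR = 5e-5
MIN_LR = 1e-6

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR over WARMUP_STEPS, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate hits the 5e-5 peak exactly at step 500 and decays along a half-cosine to the 1e-6 floor at step 30,000.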
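The bolded instruction, pass `language="en"` at inference, is the one detail a caller must carry over. A sketch of the decoder prefix this corresponds to, with a hypothetical helper around the upstream `mlx_whisper.transcribe` API; the package usage and the repo placeholder are assumptions, not taken from this repository:

```python
# Whisper conditions its decoder on a start-of-transcript prefix; in this
# fine-tune the <|en|> slot is overloaded to mean "emit IPA" for any language.
SOT_PREFIX = ("<|startoftranscript|>", "<|en|>", "<|transcribe|>")

def transcribe_ipa(audio_path: str, repo: str = "<model-repo>") -> str:
    """Hypothetical wrapper: always force the <|en|> token at inference."""
    import mlx_whisper  # assumed package: pip install mlx-whisper (Apple Silicon)
    result = mlx_whisper.transcribe(
        audio_path,
        path_or_hf_repo=repo,  # local path or Hugging Face repo id
        language="en",         # required: other language tokens were never trained
    )
    return result["text"]
```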