barathanasln committed
Commit 2ab9d7b · verified · 1 Parent(s): cf5e3e8

Trim training-procedure detail and tighten narrative

Files changed (1)
  1. README.md +2 -32
README.md CHANGED
@@ -166,41 +166,11 @@ this restriction; see [License](#license).
 
 ## Training procedure
 
-### Hyperparameters
-
-| Parameter | Value |
-|---|---|
-| Base model | `mlx-community/whisper-large-v3-mlx` |
-| Dtype | float32 |
-| Encoder | frozen |
-| Decoder | trainable (~67M params, 92.5% of total) |
-| Optimizer | AdamW |
-| Peak learning rate | 5e-5 |
-| LR schedule | linear warmup 500 → cosine decay |
-| Min LR | 1e-6 |
-| Batch size | 10 |
-| Gradient accumulation | 1 |
-| Gradient clipping | global max-norm 1.0 |
-| Validation cadence | every 1,000 steps |
-| Validation batch size | 4 |
-| Steps configured | 30,000 |
-| Steps actually run | 30,000 (no early stop) |
-| Random seed | 42 |
-| MLX allocator cache cap | 20 GB |
+Decoder-only fine-tune, encoder frozen, AdamW with linear warmup and cosine decay, fp32, on a single Apple M3 Ultra with [MLX](https://github.com/ml-explore/mlx). Full hyperparameters, launchers, and reproduction commands are in the [GitHub repository](https://github.com/barathanaslan/phonetic-whisper-mlx).
 
 ### Training-time language token
 
-All training samples use `<|en|>` as the start-of-transcript prefix
-regardless of source-audio language; the token is overloaded as
-"emit IPA". This is intentional — phonetic transcription is meant to
-be language-agnostic, so the decoder is trained without a per-language
-signal. **Pass `language="en"` at inference.**
-
-### Hardware and runtime
-
-Trained on a single Apple Mac Studio M3 Ultra (96 GB unified memory).
-Total wall-clock: 1,629 minutes (~27 hours). Step time ≈ 3.0 s/step
-average at fp32, batch 10, on whisper-large-v3.
+All training samples use `<|en|>` as the start-of-transcript prefix regardless of source-audio language; the token is overloaded as "emit IPA". This is intentional — phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. **Pass `language="en"` at inference.**
 
 ## Evaluation
 
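
The removed hyperparameter table states the schedule only as "linear warmup 500 → cosine decay" with a 5e-5 peak, a 1e-6 floor, and 30,000 steps. A minimal sketch of such a schedule, assuming the common warmup-then-cosine formulation; the function and argument names are illustrative, not taken from the training code:

```python
import math

def lr_at(step, peak=5e-5, min_lr=1e-6, warmup=500, total=30_000):
    """Linear warmup to `peak` over `warmup` steps, then cosine decay
    to `min_lr` over the remaining steps (values from the table above)."""
    if step < warmup:
        return peak * (step + 1) / warmup            # linear warmup
    progress = (step - warmup) / (total - warmup)    # 0 -> 1 after warmup
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return min_lr + (peak - min_lr) * cosine         # decay toward the floor

print(lr_at(0), lr_at(499), lr_at(29_999))
```

The exact warmup interpolation (e.g. whether step 0 starts at zero or at one warmup increment) is a guess; only the peak, floor, and step counts come from the table.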
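
The training-time language token note refers to Whisper's forced decoder prefix, in which `<|en|>` follows the start-of-transcript token. A hedged sketch of what that prefix looks like, using Whisper's standard special-token strings; the helper name is hypothetical and this is not the repository's tokenizer code:

```python
def decoder_prefix(language="en", timestamps=False):
    """Build the forced decoder-prompt tokens Whisper conditions on.
    Here `<|en|>` is overloaded to mean "emit IPA" per the note above."""
    tokens = ["<|startoftranscript|>", f"<|{language}|>", "<|transcribe|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

print(decoder_prefix())
```

Because every training sample was prefixed this way, passing `language="en"` at inference reproduces the prompt the decoder was trained on.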