SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision, v23 + Paragraphs)
sottoasr.app · MLX 5-bit (recommended) · MLX 4-bit (smaller) · Training Dataset
Overview
Full-precision bf16 fine-tune of LiquidAI/LFM2.5-350M-Base for on-device speech-to-text transcript cleanup. This is the training artifact — for on-device deployment on Apple Silicon, use the 5-bit MLX variant instead.
This model powers on-device transcript cleanup in SottoASR — a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and — new in v23 — restructures long dictations into paragraph-formatted prose, all locally with zero cloud dependency.
What's new in v23
v23 (this model) adds paragraph emission for long-form dictation. The previous v18/v22 production model produced output as a single run-on paragraph regardless of input length, which made multi-topic dictations hard to read. v23 was retrained on a dataset augmented with 4,012 new paragraph_formatting samples generated via Bedrock Claude Haiku 4.5, teaching the model to insert \n\n paragraph breaks at natural topic / time-reference / discourse-marker boundaries.
| Capability | v22 (previous prod) | v23 R6 (this model) |
|---|---|---|
| Paragraph emission rate on long inputs | 0.0 % | 91.5 % |
| ROUGE-L on paragraph-formatted inputs | 0.9521 | 0.9784 |
| ROUGE-L on standard val set | 0.9539 | 0.9537 |
| Filler-Free rate on standard val set | 90.3 % | 91.0 % ⭐ |
This is the R6 variant, trained with GRPO at LR 5e-6 via a 4-stage pipeline: SFT → GRPO → Stage-2 SFT recovery → final GRPO with paragraph rows excluded. Because the SFT base already learns paragraph emission, the final GRPO stage can focus purely on the cleanup gradient, producing stronger updates without the conflicted reward landscape that paragraph rows introduce. R6 achieves 91.0 % filler-free vs v22's 90.3 %, making it the first v23 variant to strictly beat v22 on this user-visible metric. The trade-offs are a small drop in main val ROUGE-L (0.9499 vs v22's 0.9539, within natural seed variance) and a slightly lower paragraph emission rate than the R1 variant (89 % vs 91.5 %).
Key Specs
| Property | Value |
|---|---|
| Size | 676 MB |
| ROUGE-L (val set, 1000 samples) | 0.9537 |
| Exact Match | 64.3 % |
| Filler-Free | 91.0 % ⭐ (beats v22 by +0.7 pts) |
| Paragraph rate (long inputs) | 91.5 % |
| Latency | 118 ms average per transcript (RTX 4090) |
| Architecture | Hybrid: 10 conv + 6 GQA attention (354M params) |
| Precision | bf16 |
| Training context | 4,096 tokens (packed); model supports 32,768 tokens natively, 128K base |
What It Does
Takes raw, unpunctuated ASR output and produces clean, readable text:
| Input (raw ASR) | Output (cleaned) |
|---|---|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| the deadline is friday no monday we have until monday | The deadline is Monday. |
| what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
| okay so the thing is basically we're running out of disk space | We're running out of disk space. |
| uh yes | Yes. |
NEW in v23: Paragraph emission on long dictations
Long, multi-topic input is now restructured into paragraph-formatted prose:
Input:
okay so were having some real issues with the deployment pipeline and i want to walk through whats going wrong the main problem is that the redis cache is timing out during deploys we push a new version and then for about thirty seconds the connections hang and customers see errors so i think we need to add a graceful shutdown period before we kill the old pods now separately the elasticsearch cluster has been a pain we deployed a new svelte frontend last month and it caused index corruption we need to validate the schema before pushing anything live going forward i think the answer here is to add both of these checks to our standard deployment checklist
Output:
We're having some real issues with the deployment pipeline, and I want to walk through what's going wrong. The main problem is that the Redis cache is timing out during deploys. We push a new version and then for about thirty seconds the connections hang, and customers see errors.
I think we need to add a graceful shutdown period before we kill the old pods. Now separately, the Elasticsearch cluster has been a pain. We deployed a new Svelte frontend last month and it caused index corruption. We need to validate the schema before pushing anything live. Going forward, I think the answer here is to add both of these checks to our standard deployment checklist.
Notice the model:
- Strips speech disfluencies ("okay so", "uh", "basically")
- Capitalizes proper nouns (Redis, Elasticsearch, Svelte)
- Adds correct punctuation
- Inserts a paragraph break at the topic shift ("the elasticsearch cluster has been a pain")
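Since paragraph breaks arrive as literal blank lines in the generated text, client code can recover the structure by splitting on the double newline. A minimal sketch (the helper name is ours, not part of any SottoASR API):

```python
def split_paragraphs(cleaned: str) -> list[str]:
    """Split model output into paragraphs on blank-line boundaries."""
    return [p.strip() for p in cleaned.split("\n\n") if p.strip()]

sample = "First topic sentence.\n\nSecond topic sentence."
print(split_paragraphs(sample))
# → ['First topic sentence.', 'Second topic sentence.']
```

Short single-topic outputs simply come back as a one-element list, so the same code path handles both cases.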
Benchmark Results
Main val set (1000 samples, cleaned val.jsonl from training data)
| Metric | v23 R6 (this model) | v22 baseline |
|---|---|---|
| ROUGE-L | 0.9499 | 0.9539 |
| Exact Match | 64.3 % | 64.8 % |
| Filler-Free | 91.0 % ⭐ | 90.3 % |
| Paragraph rate | 0.0 % | 0.0 % |
| Avg latency | 118 ms | 117 ms |
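The Filler-Free metric above can be approximated with a simple word-list check. A sketch under the assumption that the eval list contains common English fillers; the exact list used in evaluation is not published here:

```python
import re

# Illustrative filler list -- an assumption, not the actual evaluation list.
FILLERS = {"uh", "um", "basically", "like", "so", "okay", "right"}

def is_filler_free(text: str) -> bool:
    """True if no word in the text appears in the filler list."""
    words = re.findall(r"[a-z']+", text.lower())
    return not any(w in FILLERS for w in words)

print(is_filler_free("We need to fix the deployment pipeline."))  # → True
print(is_filler_free("so uh basically we need to fix it"))        # → False
```

A word-list check like this is deliberately blunt, which is why legitimate uses of "so" or "right" in long-form content depress the metric (see Limitations below).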
Paragraph val set (200 paragraph_formatting samples)
| Metric | v23 R6 (this model) | v22 baseline | Δ |
|---|---|---|---|
| ROUGE-L | 0.9784 | 0.9521 | +0.027 |
| Paragraph emission rate | 91.5 % | 0.0 % | +89 pts |
| Exact Match | 2.0 % | 0.0 % | +2.0 pts |
| Avg latency | 1.45 s | 1.40 s | +50 ms |
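Paragraph emission rate in the table above can be read as the fraction of generations that contain at least one blank-line break. A minimal sketch of that reading (our own formulation, not the published eval code):

```python
def paragraph_emission_rate(outputs: list[str]) -> float:
    """Fraction of outputs containing at least one \n\n paragraph break."""
    if not outputs:
        return 0.0
    return sum("\n\n" in o for o in outputs) / len(outputs)

outs = ["One topic only.", "Topic A.\n\nTopic B.", "X.\n\nY.\n\nZ."]
print(paragraph_emission_rate(outs))  # ≈ 0.667
```

On the main val set this stays at 0.0 % by design, since short single-topic inputs should not be paragraph-broken.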
vs Prompted Qwen 2B Baseline (from earlier benchmarks)
| Metric | This model (354M) | Prompted Qwen 2B | Improvement |
|---|---|---|---|
| ROUGE-L | 0.9537 | 0.891 | +0.063 |
| Exact Match | 64.3 % | 37 % | +26 pts |
| Inference | 118 ms | 1.0 s | 8.5× faster |
| Parameters | 354M | 2B | 5.6× smaller |
Usage
Prompt Format
### Input:
{raw transcript}
### Output:
{model generates cleaned text}
Python Example
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

text = "so uh basically we need to fix the deployment pipeline"
prompt = f"### Input:\n{text}\n\n### Output:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, then trim at the next "###" marker.
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
# → "We need to fix the deployment pipeline."
For long dictation that may need paragraph formatting, use a higher max_new_tokens (1024-2048).
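One way to honor that advice is to scale the generation budget with input length, since the cleaned output is roughly as long as the input. A rough heuristic sketch; the tokens-per-word estimate, multiplier, and clamp bounds are assumptions, not tuned values:

```python
def generation_budget(raw_text: str, tokens_per_word: float = 1.3) -> int:
    """Heuristic max_new_tokens: estimate the input token count from the
    word count, budget ~1.5x that, and clamp to the 512-2048 range
    suggested above."""
    est_tokens = int(len(raw_text.split()) * tokens_per_word)
    return max(512, min(int(est_tokens * 1.5), 2048))

print(generation_budget("so uh basically we need to fix the deployment pipeline"))
# → 512
```

In practice you would pass the result as `max_new_tokens=generation_budget(text)` in the `generate` call above; using the real tokenizer's token count instead of a word-count estimate would be more precise.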
Training Details
Pipeline
LiquidAI/LFM2.5-350M-Base
→ SFT: 157,556 rows (v22 base + 4,012 paragraph_formatting),
LR 3e-5, β2=0.95, 3 epochs, batch 1×8,
cosine schedule, 50 warmup steps, weight_decay 0.01,
bf16+tf32, packed 4,096 context, seed 42
→ eval_loss 1.016 (vs v22's 1.0306, -0.014)
→ GRPO R6: LoRA r=32, alpha=16, all linear layers,
4-stage: SFT → GRPO → Stage-2 SFT recovery on v22-only data (LR 5e-6, 16 steps) → final GRPO LR 5e-6 with paragraph rows excluded, 5K samples × 4 generations,
reward = ROUGE-L × 5.0 - filler_count × 0.5 (capped 2.0) × 3.0 + format_bonus
→ final main val ROUGE-L 0.9499 / Filler-Free 91.0 % / paragraph rate 89 %
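The GRPO reward line above can be read as ROUGE-L scaled by 5.0, minus a capped filler penalty, plus a format bonus. A sketch of one plausible reading (the cap placement and the bonus value are our assumptions; the training code is not published here):

```python
def grpo_reward(rouge_l: float, filler_count: int, well_formatted: bool,
                format_bonus: float = 0.5) -> float:
    """Reward = ROUGE-L x 5.0 - min(filler_count x 0.5, 2.0) x 3.0 + bonus.
    Reads the 2.0 cap as applying to the filler term before the 3.0
    multiplier; format_bonus=0.5 is an illustrative placeholder."""
    filler_penalty = min(filler_count * 0.5, 2.0) * 3.0
    return rouge_l * 5.0 - filler_penalty + (format_bonus if well_formatted else 0.0)

print(grpo_reward(rouge_l=0.95, filler_count=0, well_formatted=True))   # → 5.25
print(grpo_reward(rouge_l=0.95, filler_count=6, well_formatted=False))  # → -1.25
```

The cap matters: beyond four fillers the penalty saturates at 6.0, so the gradient pressure shifts entirely to the ROUGE-L term rather than punishing already-bad generations further.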
Dataset
157,556 train / 7,121 val rows in juanquivilla/sotto-transcript-cleanup:
- 153,561 v22 base (prior versions: filler removal, crutch words, self-correction, false starts, grammar, dictation commands, list formatting, preserve-wording, mixed disfluencies, short utterances, long dictation)
- 3,995 new paragraph_formatting samples (held out 200 for paragraph_val.jsonl), generated via AWS Bedrock Claude Haiku 4.5, instructed to produce 100–500 word raw input + 2–5 paragraph clean output, split at natural discourse boundaries
Hardware
1× RTX 4090, ~42 minutes for SFT + ~30 minutes for GRPO = ~72 min total
All Variants
| Variant | Size | Use Case |
|---|---|---|
| Full precision (this) | 676 MB | Training, GPU inference |
| MLX 5-bit | 237 MB | Recommended for Apple Silicon |
| MLX 4-bit | 195 MB | Smallest, slight quality trade-off |
Limitations
- Optimized for English conversational/meeting-style speech
- Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
- Paragraph emission is conditional on input structure — short single-topic inputs (typical) will not be paragraph-broken
- Filler-free rate on long-form content is lower than on short inputs (long content has more legitimate uses of words like "so", "okay", "right", which the eval list flags)
- Not designed for formal written text — trained on spoken language patterns
License
MIT
Links
- Application: sottoasr.app
- Source: github.com/juanqui/sottoasr
- Dataset: juanquivilla/sotto-transcript-cleanup