SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision, v23 + Paragraphs)

sottoasr.app · MLX 5-bit (recommended) · MLX 4-bit (smaller) · Training Dataset

Overview

Full-precision bf16 fine-tune of LiquidAI/LFM2.5-350M-Base for on-device speech-to-text transcript cleanup. This is the training artifact — for on-device deployment on Apple Silicon, use the 5-bit MLX variant instead.

This model powers on-device transcript cleanup in SottoASR — a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and — new in v23 — restructures long dictations into paragraph-formatted prose, all locally with zero cloud dependency.

What's new in v23

v23 (this model) adds paragraph emission for long-form dictation. The previous v18/v22 production model produced output as a single run-on paragraph regardless of input length, which made multi-topic dictations hard to read. v23 was retrained on a dataset augmented with 4,012 new paragraph_formatting samples generated via Bedrock Claude Haiku 4.5, teaching the model to insert \n\n paragraph breaks at natural topic / time-reference / discourse-marker boundaries.

| Capability | v22 (previous prod) | v23 R6 (this model) |
|---|---|---|
| Paragraph emission rate on long inputs | 0.0 % | 91.5 % |
| ROUGE-L on paragraph-formatted inputs | 0.9521 | 0.9784 |
| ROUGE-L on standard val set | 0.9539 | 0.9537 |
| Filler-Free rate on standard val set | 90.3 % | 91.0 % |

This is the R6 variant, trained with a 4-stage pipeline (SFT → GRPO → Stage-2 SFT recovery → final GRPO at LR 5e-6 with paragraph rows excluded). Because the SFT base already learns paragraph emission, the final GRPO stage can focus purely on the cleanup gradient, producing stronger updates without the conflicted reward landscape that paragraph rows introduce. R6 achieves 91.0 % filler-free vs v22's 90.3 %, making it the first v23 variant to strictly beat v22 on this user-visible metric. The trade-offs are a small drop in main val ROUGE-L (0.9499 vs v22's 0.9539, within natural seed variance) and a slightly lower paragraph emission rate than the R1 variant (89 % vs 91.5 %).

Key Specs

| Property | Value |
|---|---|
| Size | 676 MB |
| ROUGE-L (val set, 1000 samples) | 0.9537 |
| Exact Match | 64.3 % |
| Filler-Free | 91.0 % ⭐ (beats v22 by +0.7 pts) |
| Paragraph rate (long inputs) | 91.5 % |
| Latency | 118 ms average per transcript (RTX 4090) |
| Architecture | Hybrid: 10 conv + 6 GQA attention (354M params) |
| Precision | bf16 |
| Training context | 4,096 tokens (packed); model supports 32,768 tokens natively, 128K base |

What It Does

Takes raw, unpunctuated ASR output and produces clean, readable text:

| Input (raw ASR) | Output (cleaned) |
|---|---|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| the deadline is friday no monday we have until monday | The deadline is Monday. |
| what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
| okay so the thing is basically we're running out of disk space | We're running out of disk space. |
| uh yes | Yes. |

NEW in v23: Paragraph emission on long dictations

Long, multi-topic input is now restructured into paragraph-formatted prose:

Input:

okay so were having some real issues with the deployment pipeline and i want to walk through whats going wrong the main problem is that the redis cache is timing out during deploys we push a new version and then for about thirty seconds the connections hang and customers see errors so i think we need to add a graceful shutdown period before we kill the old pods now separately the elasticsearch cluster has been a pain we deployed a new svelte frontend last month and it caused index corruption we need to validate the schema before pushing anything live going forward i think the answer here is to add both of these checks to our standard deployment checklist

Output:

We're having some real issues with the deployment pipeline, and I want to walk through what's going wrong. The main problem is that the Redis cache is timing out during deploys. We push a new version and then for about thirty seconds the connections hang, and customers see errors.

I think we need to add a graceful shutdown period before we kill the old pods. Now separately, the Elasticsearch cluster has been a pain. We deployed a new Svelte frontend last month and it caused index corruption. We need to validate the schema before pushing anything live. Going forward, I think the answer here is to add both of these checks to our standard deployment checklist.

Notice the model:

  • Strips speech disfluencies ("okay so", "uh", "basically")
  • Capitalizes proper nouns (Redis, Elasticsearch, Svelte)
  • Adds correct punctuation
  • Inserts a paragraph break at the topic shift ("the elasticsearch cluster has been a pain")
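
The double-newline delimiter makes paragraph handling trivial downstream. A minimal sketch (an illustrative helper, not code from the SottoASR app) for splitting a cleaned transcript into paragraphs:

```python
# Split a cleaned transcript on the "\n\n" delimiter the model emits
# at topic boundaries. Illustrative helper; not SottoASR's actual code.
def split_paragraphs(cleaned: str) -> list[str]:
    return [p.strip() for p in cleaned.split("\n\n") if p.strip()]

paragraphs = split_paragraphs("First topic.\n\nSecond topic.")
# → ["First topic.", "Second topic."]
```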

Benchmark Results

Main val set (1000 samples, cleaned val.jsonl from training data)

| Metric | v23 R6 (this model) | v22 baseline |
|---|---|---|
| ROUGE-L | 0.9499 | 0.9539 |
| Exact Match | 64.3 % | 64.8 % |
| Filler-Free | 91.0 % | 90.3 % |
| Paragraph rate | 0.0 % | 0.0 % |
| Avg latency | 118 ms | 117 ms |

Paragraph val set (200 paragraph_formatting samples)

| Metric | v23 R6 (this model) | v22 baseline | Δ |
|---|---|---|---|
| ROUGE-L | 0.9784 | 0.9521 | +0.026 |
| Paragraph emission rate | 91.5 % | 0.0 % | +91.5 pts |
| Exact Match | 2.0 % | 0.0 % | +2.0 pts |
| Avg latency | 1.45 s | 1.40 s | +50 ms |
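
The paragraph emission rate and filler-free rate can be reproduced with simple checks. A sketch under assumed definitions: an output "emits paragraphs" if it contains a blank-line break, and is "filler-free" if no word from a fixed filler list survives cleanup. The filler list below is illustrative; the exact eval list is not published in this card.

```python
# Hypothetical filler list; the eval's actual list is not published here.
FILLERS = {"uh", "um", "basically", "so", "okay", "right"}

def emits_paragraphs(output: str) -> bool:
    # A "\n\n" break means the model restructured the text into paragraphs.
    return "\n\n" in output

def is_filler_free(output: str) -> bool:
    # Normalize words (strip trailing punctuation, lowercase) before checking.
    words = {w.strip(".,!?").lower() for w in output.split()}
    return not (words & FILLERS)

def rates(outputs: list[str]) -> tuple[float, float]:
    n = len(outputs)
    para = sum(emits_paragraphs(o) for o in outputs) / n
    clean = sum(is_filler_free(o) for o in outputs) / n
    return para, clean
```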

vs Prompted Qwen 2B Baseline (from earlier benchmarks)

| Metric | This model (354M) | Prompted Qwen 2B | Improvement |
|---|---|---|---|
| ROUGE-L | 0.9537 | 0.891 | +0.063 |
| Exact Match | 64.3 % | 37 % | +27 pts |
| Inference | 118 ms | 1.0 s | 8.5× faster |
| Parameters | 354M | 2B | 5.6× smaller |

Usage

Prompt Format

### Input:
{raw transcript}

### Output:
{model generates cleaned text}

Python Example

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

text = "so uh basically we need to fix the deployment pipeline"
prompt = f"### Input:\n{text}\n\n### Output:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding; the model is trained for deterministic cleanup.
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Drop the prompt tokens, keeping only the generated continuation.
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# The model may begin a new "### Input:" block; truncate at the marker.
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
# → "We need to fix the deployment pipeline."

For long dictation that may need paragraph formatting, use a higher max_new_tokens (1024-2048).
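
Inputs longer than the 4,096-token packed training context are untested territory. One pragmatic workaround (a sketch, not an official SottoASR recommendation) is to chunk the raw transcript by word count and clean each chunk separately:

```python
# Chunk a raw transcript into pieces that stay safely under the
# 4,096-token training context. Word count is a cheap proxy for token
# count here; use the tokenizer for an exact budget.
def chunk_words(text: str, max_words: int = 1500) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

Each chunk is then wrapped in the `### Input:` prompt and cleaned independently; the cleaned chunks can be rejoined with `\n\n` so per-chunk paragraph breaks survive.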

Training Details

Pipeline

LiquidAI/LFM2.5-350M-Base
  → SFT: 157,556 rows (v22 base + 4,012 paragraph_formatting),
         LR 3e-5, β2=0.95, 3 epochs, batch 1×8,
         cosine schedule, 50 warmup steps, weight_decay 0.01,
         bf16+tf32, packed 4,096 context, seed 42
         → eval_loss 1.016 (vs v22's 1.0306, -0.014)
  → GRPO R6: LoRA r=32, alpha=16, all linear layers,
             4-stage: SFT → GRPO → Stage-2 SFT recovery on v22-only data (LR 5e-6, 16 steps) → final GRPO LR 5e-6 with paragraph rows excluded, 5K samples × 4 generations,
             reward = ROUGE-L × 5.0 - min(filler_count × 0.5, 2.0) × 3.0 + format_bonus
             → final main val ROUGE-L 0.9499 / Filler-Free 91.0 % / paragraph rate 89 %
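
The reward above can be sketched as follows. The reading of the cap is an assumption (penalty = min(filler_count × 0.5, 2.0) × 3.0), and the LCS-based ROUGE-L F1 here is a plain re-implementation, not the training code's:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    # Standard dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(hyp: str, ref: str) -> float:
    # ROUGE-L F1 over whitespace tokens.
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    l = lcs_len(h, r)
    p, rec = l / len(h), l / len(r)
    return 2 * p * rec / (p + rec) if p + rec else 0.0

def reward(hyp: str, ref: str, filler_count: int, format_bonus: float = 0.0) -> float:
    # Assumed cap reading: the 0.5-per-filler penalty saturates at 2.0,
    # then is scaled by 3.0.
    return 5.0 * rouge_l_f1(hyp, ref) - 3.0 * min(filler_count * 0.5, 2.0) + format_bonus
```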

Dataset

157,556 train / 7,121 val rows in juanquivilla/sotto-transcript-cleanup:

  • 153,561 v22 base (prior versions: filler removal, crutch words, self-correction, false starts, grammar, dictation commands, list formatting, preserve-wording, mixed disfluencies, short utterances, long dictation)
  • 3,995 new paragraph_formatting samples (held out 200 for paragraph_val.jsonl) — generated via AWS Bedrock Claude Haiku 4.5, instructed to produce 100–500 word raw input + 2–5 paragraph clean output, split at natural discourse boundaries
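
The 200-sample hold-out can be carved out deterministically. A sketch with a hypothetical row schema (the `category` field name is an assumption; the published dataset's schema may differ):

```python
import random

def hold_out(rows: list[dict], category: str, n_val: int, seed: int = 42):
    # Select a fixed-seed validation sample from one category,
    # keeping everything else (plus unsampled rows) for training.
    tagged = [r for r in rows if r.get("category") == category]
    rng = random.Random(seed)
    val = rng.sample(tagged, n_val)
    val_ids = {id(r) for r in val}
    train = [r for r in rows if id(r) not in val_ids]
    return train, val
```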

Hardware

1× RTX 4090, ~42 minutes for SFT + ~30 minutes for GRPO = ~72 min total

All Variants

| Variant | Size | Use Case |
|---|---|---|
| Full precision (this) | 676 MB | Training, GPU inference |
| MLX 5-bit | 237 MB | Recommended for Apple Silicon |
| MLX 4-bit | 195 MB | Smallest, slight quality trade-off |

Limitations

  • Optimized for English conversational/meeting-style speech
  • Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
  • Paragraph emission is conditional on input structure: short, single-topic inputs (the common case) are returned as a single paragraph
  • Filler-free rate on long-form content is lower than on short inputs (long content has more legitimate uses of words like "so", "okay", "right", which the eval list flags)
  • Not designed for formal written text — trained on spoken language patterns

License

MIT
