SottoASR Transcript Cleanup — LFM2.5-350M (Full Precision, v23 + Paragraphs)

sottoasr.app · MLX 5-bit (recommended) · MLX 4-bit (smaller) · Training Dataset

Overview

Full-precision bf16 fine-tune of LiquidAI/LFM2.5-350M-Base for on-device speech-to-text transcript cleanup. This is the training artifact — for on-device deployment on Apple Silicon, use the 5-bit MLX variant instead.

This model powers on-device transcript cleanup in SottoASR — a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and — new in v23 — restructures long dictations into paragraph-formatted prose, all locally with zero cloud dependency.

What's new in v23

v23 (this model) adds paragraph emission for long-form dictation. The previous v18/v22 production model produced output as a single run-on paragraph regardless of input length, which made multi-topic dictations hard to read. v23 was retrained on a dataset augmented with 4,012 new paragraph_formatting samples generated via Bedrock Claude Haiku 4.5, teaching the model to insert \n\n paragraph breaks at natural topic / time-reference / discourse-marker boundaries.

| Capability | v22 (previous prod) | v23 R6 (this model) |
|---|---|---|
| Paragraph emission rate on long inputs | 0.0 % | 91.5 % |
| ROUGE-L on paragraph-formatted inputs | 0.9521 | 0.9784 |
| ROUGE-L on standard val set | 0.9539 | 0.9537 |
| Filler-Free rate on standard val set | 90.3 % | 91.0 % |

This is the R6 variant, trained with a 4-stage pipeline (SFT → GRPO → Stage-2 SFT recovery → final GRPO at LR 5e-6 with paragraph rows excluded). Because the SFT base already learns paragraph emission, the final GRPO stage can focus purely on the cleanup gradient, producing stronger updates without the conflicted reward landscape that paragraph rows introduce. R6 achieves 91.0 % filler-free vs v22's 90.3 %, making it the first v23 variant to strictly beat v22 on this user-visible metric. The trade-offs are a small drop in main val ROUGE-L (0.9499 vs v22's 0.9539, within natural seed variance) and a slightly lower paragraph emission rate than the R1 variant (89 % vs 91.5 %).

Key Specs

| Property | Value |
|---|---|
| Size | 676 MB |
| ROUGE-L (val set, 1000 samples) | 0.9537 |
| Exact Match | 64.3 % |
| Filler-Free | 91.0 % ⭐ (beats v22 by +0.7 pts) |
| Paragraph rate (long inputs) | 91.5 % |
| Latency | 118 ms average per transcript (RTX 4090) |
| Architecture | Hybrid: 10 conv + 6 GQA attention (354M params) |
| Precision | bf16 |
| Training context | 4,096 tokens (packed); model supports 32,768 tokens natively, 128K base |

What It Does

Takes raw, unpunctuated ASR output and produces clean, readable text:

| Input (raw ASR) | Output (cleaned) |
|---|---|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| the deadline is friday no monday we have until monday | The deadline is Monday. |
| what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
| okay so the thing is basically we're running out of disk space | We're running out of disk space. |
| uh yes | Yes. |

NEW in v23: Paragraph emission on long dictations

Long, multi-topic input is now restructured into paragraph-formatted prose:

Input:

okay so were having some real issues with the deployment pipeline and i want to walk through whats going wrong the main problem is that the redis cache is timing out during deploys we push a new version and then for about thirty seconds the connections hang and customers see errors so i think we need to add a graceful shutdown period before we kill the old pods now separately the elasticsearch cluster has been a pain we deployed a new svelte frontend last month and it caused index corruption we need to validate the schema before pushing anything live going forward i think the answer here is to add both of these checks to our standard deployment checklist

Output:

We're having some real issues with the deployment pipeline, and I want to walk through what's going wrong. The main problem is that the Redis cache is timing out during deploys. We push a new version and then for about thirty seconds the connections hang, and customers see errors.

I think we need to add a graceful shutdown period before we kill the old pods. Now separately, the Elasticsearch cluster has been a pain. We deployed a new Svelte frontend last month and it caused index corruption. We need to validate the schema before pushing anything live. Going forward, I think the answer here is to add both of these checks to our standard deployment checklist.

Notice the model:

  • Strips speech disfluencies ("okay so", "uh", "basically")
  • Capitalizes proper nouns (Redis, Elasticsearch, Svelte)
  • Adds correct punctuation
  • Inserts a paragraph break at the topic shift ("the elasticsearch cluster has been a pain")
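
The double-newline delimiter makes paragraph handling trivial downstream. A minimal sketch (an illustrative helper, not code from the SottoASR app) for splitting a cleaned transcript into paragraphs:

```python
# Split a cleaned transcript on the "\n\n" delimiter the model emits
# at topic boundaries. Illustrative helper; not SottoASR's actual code.
def split_paragraphs(cleaned: str) -> list[str]:
    return [p.strip() for p in cleaned.split("\n\n") if p.strip()]

paragraphs = split_paragraphs("First topic.\n\nSecond topic.")
# → ["First topic.", "Second topic."]
```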

Benchmark Results

Main val set (1000 samples, cleaned val.jsonl from training data)

| Metric | v23 R6 (this model) | v22 baseline |
|---|---|---|
| ROUGE-L | 0.9499 | 0.9539 |
| Exact Match | 64.3 % | 64.8 % |
| Filler-Free | 91.0 % | 90.3 % |
| Paragraph rate | 0.0 % | 0.0 % |
| Avg latency | 118 ms | 117 ms |

Paragraph val set (200 paragraph_formatting samples)

| Metric | v23 R6 (this model) | v22 baseline | Δ |
|---|---|---|---|
| ROUGE-L | 0.9784 | 0.9521 | +0.026 |
| Paragraph emission rate | 91.5 % | 0.0 % | +91.5 pts |
| Exact Match | 2.0 % | 0.0 % | +2.0 pts |
| Avg latency | 1.45 s | 1.40 s | +50 ms |
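
The paragraph emission rate and filler-free rate can be reproduced with simple checks. A sketch under assumed definitions: an output "emits paragraphs" if it contains a blank-line break, and is "filler-free" if no word from a fixed filler list survives cleanup. The filler list below is illustrative; the exact eval list is not published in this card.

```python
# Hypothetical filler list; the eval's actual list is not published here.
FILLERS = {"uh", "um", "basically", "so", "okay", "right"}

def emits_paragraphs(output: str) -> bool:
    # A "\n\n" break means the model restructured the text into paragraphs.
    return "\n\n" in output

def is_filler_free(output: str) -> bool:
    # Normalize words (strip trailing punctuation, lowercase) before checking.
    words = {w.strip(".,!?").lower() for w in output.split()}
    return not (words & FILLERS)

def rates(outputs: list[str]) -> tuple[float, float]:
    n = len(outputs)
    para = sum(emits_paragraphs(o) for o in outputs) / n
    clean = sum(is_filler_free(o) for o in outputs) / n
    return para, clean
```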

vs Prompted Qwen 2B Baseline (from earlier benchmarks)

| Metric | This model (354M) | Prompted Qwen 2B | Improvement |
|---|---|---|---|
| ROUGE-L | 0.9537 | 0.891 | +0.063 |
| Exact Match | 64.3 % | 37 % | +27 pts |
| Inference | 118 ms | 1.0 s | 8.5× faster |
| Parameters | 354M | 2B | 5.6× smaller |

Usage

Prompt Format

### Input:
{raw transcript}

### Output:
{model generates cleaned text}

Python Example

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "juanquivilla/sotto-cleanup-lfm25-350m",
    dtype=torch.bfloat16, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("juanquivilla/sotto-cleanup-lfm25-350m")

text = "so uh basically we need to fix the deployment pipeline"
prompt = f"### Input:\n{text}\n\n### Output:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding; the model is trained for deterministic cleanup.
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Drop the prompt tokens, keeping only the generated continuation.
output = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# The model may begin a new "### Input:" block; truncate at the marker.
if "###" in output:
    output = output[:output.index("###")]
print(output.strip())
# → "We need to fix the deployment pipeline."

For long dictation that may need paragraph formatting, use a higher max_new_tokens (1024-2048).
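
Inputs longer than the 4,096-token packed training context are untested territory. One pragmatic workaround (a sketch, not an official SottoASR recommendation) is to chunk the raw transcript by word count and clean each chunk separately:

```python
# Chunk a raw transcript into pieces that stay safely under the
# 4,096-token training context. Word count is a cheap proxy for token
# count here; use the tokenizer for an exact budget.
def chunk_words(text: str, max_words: int = 1500) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

Each chunk is then wrapped in the `### Input:` prompt and cleaned independently; the cleaned chunks can be rejoined with `\n\n` so per-chunk paragraph breaks survive.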

Training Details

Pipeline

LiquidAI/LFM2.5-350M-Base
  → SFT: 157,556 rows (v22 base + 4,012 paragraph_formatting),
         LR 3e-5, β2=0.95, 3 epochs, batch 1×8,
         cosine schedule, 50 warmup steps, weight_decay 0.01,
         bf16+tf32, packed 4,096 context, seed 42
         → eval_loss 1.016 (vs v22's 1.0306, -0.014)
  → GRPO R6: LoRA r=32, alpha=16, all linear layers,
             4-stage: SFT → GRPO → Stage-2 SFT recovery on v22-only data (LR 5e-6, 16 steps) → final GRPO LR 5e-6 with paragraph rows excluded, 5K samples × 4 generations,
             reward = ROUGE-L × 5.0 - min(filler_count × 0.5, 2.0) × 3.0 + format_bonus
             → final main val ROUGE-L 0.9499 / Filler-Free 91.0 % / paragraph rate 89 %
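
The reward above can be sketched as follows. The reading of the cap is an assumption (penalty = min(filler_count × 0.5, 2.0) × 3.0), and the LCS-based ROUGE-L F1 here is a plain re-implementation, not the training code's:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    # Standard dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(hyp: str, ref: str) -> float:
    # ROUGE-L F1 over whitespace tokens.
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    l = lcs_len(h, r)
    p, rec = l / len(h), l / len(r)
    return 2 * p * rec / (p + rec) if p + rec else 0.0

def reward(hyp: str, ref: str, filler_count: int, format_bonus: float = 0.0) -> float:
    # Assumed cap reading: the 0.5-per-filler penalty saturates at 2.0,
    # then is scaled by 3.0.
    return 5.0 * rouge_l_f1(hyp, ref) - 3.0 * min(filler_count * 0.5, 2.0) + format_bonus
```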

Dataset

157,556 train / 7,121 val rows in juanquivilla/sotto-transcript-cleanup:

  • 153,561 v22 base (prior versions: filler removal, crutch words, self-correction, false starts, grammar, dictation commands, list formatting, preserve-wording, mixed disfluencies, short utterances, long dictation)
  • 3,995 new paragraph_formatting samples (held out 200 for paragraph_val.jsonl) — generated via AWS Bedrock Claude Haiku 4.5, instructed to produce 100–500 word raw input + 2–5 paragraph clean output, split at natural discourse boundaries
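
The 200-sample hold-out can be carved out deterministically. A sketch with a hypothetical row schema (the `category` field name is an assumption; the published dataset's schema may differ):

```python
import random

def hold_out(rows: list[dict], category: str, n_val: int, seed: int = 42):
    # Select a fixed-seed validation sample from one category,
    # keeping everything else (plus unsampled rows) for training.
    tagged = [r for r in rows if r.get("category") == category]
    rng = random.Random(seed)
    val = rng.sample(tagged, n_val)
    val_ids = {id(r) for r in val}
    train = [r for r in rows if id(r) not in val_ids]
    return train, val
```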

Hardware

1× RTX 4090, ~42 minutes for SFT + ~30 minutes for GRPO = ~72 min total

All Variants

| Variant | Size | Use Case |
|---|---|---|
| Full precision (this) | 676 MB | Training, GPU inference |
| MLX 5-bit | 237 MB | Recommended for Apple Silicon |
| MLX 4-bit | 195 MB | Smallest, slight quality trade-off |

Limitations

  • Optimized for English conversational/meeting-style speech
  • Domain-specific jargon (medical, legal) may not be corrected without additional fine-tuning
  • Paragraph emission is conditional on input structure: short, single-topic inputs (the common case) are returned as a single paragraph
  • Filler-free rate on long-form content is lower than on short inputs (long content has more legitimate uses of words like "so", "okay", "right", which the eval list flags)
  • Not designed for formal written text — trained on spoken language patterns

License

MIT
