# SottoASR Transcript Cleanup — LFM2.5-350M MLX 5-bit (v23 + Paragraphs)
## Overview
MLX 5-bit affine quantization of juanquivilla/sotto-cleanup-lfm25-350m for on-device deployment on Apple Silicon. This is the recommended variant for SottoASR's on-device transcript cleanup — minimal quality loss vs full precision, fits in 237 MB, runs at ~85 ms per typical transcript on M-series chips.
This model powers on-device transcript cleanup in SottoASR — a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and — new in v23 — restructures long dictations into paragraph-formatted prose, all locally with zero cloud dependency.
## What's new in v23
v23 adds paragraph emission for long-form dictation. The previous v18/v22 production model produced output as a single run-on paragraph regardless of input length. v23 was retrained on a dataset augmented with 4,012 new `paragraph_formatting` samples, teaching the model to insert `\n\n` paragraph breaks at natural topic, time-reference, and discourse-marker boundaries.
| Capability | v22 (previous prod) | v23 R6 (this model) |
|---|---|---|
| Paragraph emission rate on long inputs | 0.0 % | 91.5 % |
| ROUGE-L on paragraph-formatted inputs | 0.9521 | 0.9784 |
| ROUGE-L on standard val set | 0.9539 | 0.9537 |
| Filler-Free rate on standard val set | 90.3 % | 91.0 % ⭐ |
## Key Specs
| Property | Value |
|---|---|
| Size | 237 MB |
| Quantization | 5-bit affine, group_size=64 |
| Effective bits/weight | 5.502 |
| ROUGE-L (val set) | ~0.9505 (≈ bf16) |
| Paragraph rate (long inputs) | ~89.5 % |
| Architecture | Hybrid: 10 conv + 6 GQA attention (354M params) |
| Latency | ~85 ms average per transcript (M-series) |
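The effective bits/weight figure follows from the quantization layout: each group of 64 weights stores 5-bit values plus a shared scale and bias per group. A minimal sketch of that arithmetic (assuming 16-bit scale and bias per group, as in MLX's affine scheme):

```python
def effective_bits(q_bits: int, group_size: int,
                   scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Bits per weight for affine group quantization:
    q_bits per value plus one shared scale and bias per group."""
    return q_bits + (scale_bits + bias_bits) / group_size

print(effective_bits(5, 64))  # → 5.5
```

The table's 5.502 is marginally higher than this 5.5 floor, presumably because a few small tensors (e.g. norms) remain unquantized.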
## Quantization Recipe
```shell
mlx_lm.convert \
  --hf-path juanquivilla/sotto-cleanup-lfm25-350m \
  --mlx-path sotto-cleanup-lfm25-350m-mlx-5bit \
  -q --q-bits 5 --q-group-size 64 \
  --trust-remote-code
```
## Usage

### Python (mlx_lm)
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit")
sampler = make_sampler(temp=0.0)  # greedy decoding for deterministic cleanup

text = "so uh basically we need to fix the deployment pipeline"
prompt = f"### Input:\n{text}\n\n### Output:\n"

output = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)

# Trim at the first "###" in case the model begins a new template section
if "###" in output:
    output = output[:output.index("###")].strip()
print(output)
# → "We need to fix the deployment pipeline."
```
For long dictation that may need paragraph formatting, raise `max_tokens` to 1024–2048.
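The prompt template and stop handling above can be factored into small helpers for reuse across an application. A sketch using plain string operations (the helper names are illustrative, not part of any package):

```python
def build_prompt(raw: str) -> str:
    """Wrap raw ASR text in the model's instruction template."""
    return f"### Input:\n{raw}\n\n### Output:\n"

def trim_output(generated: str) -> str:
    """Cut at the first '###' (the model may start a new template
    section after its answer) and strip surrounding whitespace."""
    if "###" in generated:
        generated = generated[:generated.index("###")]
    return generated.strip()

# Example with a canned generation (no model call):
raw_generation = "We need to fix the deployment pipeline.\n\n### Input:"
print(trim_output(raw_generation))
# → We need to fix the deployment pipeline.
```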
## What It Does
| Input (raw ASR) | Output (cleaned) |
|---|---|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| the deadline is friday no monday we have until monday | The deadline is Monday. |
| what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
| okay so the thing is basically we're running out of disk space | We're running out of disk space. |
| uh yes | Yes. |
### NEW in v23: Paragraph emission on long dictations
Multi-topic input is now restructured into paragraphed prose with `\n\n` breaks at natural topic boundaries. See the bf16 model card for a full example.
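Because paragraph breaks arrive as literal `\n\n` sequences in the cleaned text, downstream display code can split them into a list of paragraphs. A minimal sketch:

```python
def split_paragraphs(cleaned: str) -> list[str]:
    """Split cleaned model output on blank lines, dropping empty fragments."""
    return [p.strip() for p in cleaned.split("\n\n") if p.strip()]

cleaned = "First topic sentence.\n\nSecond topic, later in the dictation."
print(split_paragraphs(cleaned))
# → ['First topic sentence.', 'Second topic, later in the dictation.']
```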
## Benchmark Results
Identical to the bf16 model within MLX 5-bit quantization noise.
## All Variants
| Variant | Size | Use Case |
|---|---|---|
| Full precision (bf16) | 676 MB | Training, GPU inference |
| MLX 5-bit (this) | 237 MB | Recommended for Apple Silicon |
| MLX 4-bit | 195 MB | Smallest, slight quality trade-off |
## License
MIT
## Links
- Application: sottoasr.app
- Source: github.com/juanqui/sottoasr
- Dataset: juanquivilla/sotto-transcript-cleanup
## Model tree for juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit

Base model: LiquidAI/LFM2.5-350M-Base