# SottoASR Transcript Cleanup — LFM2.5-350M MLX 4-bit (v23 + Paragraphs)
sottoasr.app · Full precision (bf16) · MLX 5-bit (recommended) · Training Dataset
## Overview
MLX 4-bit affine quantization of juanquivilla/sotto-cleanup-lfm25-350m. The smallest variant — 195 MB — for memory-constrained Apple Silicon devices. The 5-bit MLX variant is recommended for most users (slightly better quality at 237 MB).
This model powers on-device transcript cleanup in SottoASR — a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and — new in v23 — restructures long dictations into paragraph-formatted prose, all locally with zero cloud dependency.
## What's new in v23
v23 adds paragraph emission for long-form dictation. The previous v18/v22 production models produced output as a single run-on paragraph regardless of input length. v23 was retrained on a dataset augmented with 4,012 new `paragraph_formatting` samples generated via Bedrock Claude Haiku 4.5, teaching the model to insert `\n\n` paragraph breaks at natural topic, time-reference, and discourse-marker boundaries.
| Capability | v22 (previous prod) | v23 R6 (this model) |
|---|---|---|
| Paragraph emission rate on long inputs | 0.0 % | 91.5 % |
| ROUGE-L on paragraph-formatted inputs | 0.9521 | 0.9784 |
| ROUGE-L on standard val set | 0.9539 | 0.9537 |
| Filler-Free rate on standard val set | 90.3 % | 91.0 % ⭐ |
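The paragraph emission rate reported above can be measured by checking cleaned outputs for double-newline breaks. A minimal sketch, with an illustrative function name not taken from the SottoASR codebase:

```python
def paragraph_emission_rate(outputs: list[str]) -> float:
    """Fraction of cleaned transcripts containing at least one
    paragraph break (a literal \\n\\n), as emitted by the v23 model."""
    if not outputs:
        return 0.0
    with_breaks = sum(1 for text in outputs if "\n\n" in text)
    return with_breaks / len(outputs)

# Toy example: two of three long-input cleanups contain paragraph breaks.
cleaned = [
    "First topic.\n\nSecond topic.",
    "One run-on paragraph with no breaks.",
    "Morning notes.\n\nAfternoon notes.",
]
print(f"{paragraph_emission_rate(cleaned):.1%}")  # → 66.7%
```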
## Key Specs
| Property | Value |
|---|---|
| Size | 195 MB |
| Quantization | 4-bit affine, group_size=64 |
| Effective bits/weight | 4.502 |
| ROUGE-L (val set) | ~0.9495 (slight loss vs bf16) |
| Paragraph rate (long inputs) | ~89.5 % |
| Architecture | Hybrid: 10 conv + 6 GQA attention (354M params) |
| Latency | ~75 ms average per transcript (M-series) |
## Quantization Recipe
```shell
mlx_lm.convert \
  --hf-path juanquivilla/sotto-cleanup-lfm25-350m \
  --mlx-path sotto-cleanup-lfm25-350m-mlx-4bit \
  -q --q-bits 4 --q-group-size 64 \
  --trust-remote-code
```
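The 4.502 effective bits/weight in the specs follows from this recipe: each group of 64 weights stores 64 four-bit codes plus one scale and one bias for the group. Assuming 16-bit (fp16) scale and bias values, which is an assumption about the storage format rather than a documented detail of this card, the per-weight cost works out as:

```python
q_bits = 4        # bits per quantized weight
group_size = 64   # weights sharing one scale/bias pair
scale_bits = 16   # fp16 scale per group (assumed)
bias_bits = 16    # fp16 bias per group (assumed)

# Total bits for one group, amortized over its 64 weights.
bits_per_weight = (group_size * q_bits + scale_bits + bias_bits) / group_size
print(bits_per_weight)  # → 4.5
```

The reported 4.502 sits slightly above this 4.5 floor, consistent with a few tensors being kept at higher precision.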
## Usage
### Python (mlx_lm)
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("juanquivilla/sotto-cleanup-lfm25-350m-mlx-4bit")
sampler = make_sampler(temp=0.0)  # greedy decoding

text = "so uh basically we need to fix the deployment pipeline"
prompt = f"### Input:\n{text}\n\n### Output:\n"
output = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)

# Truncate at any trailing "###" marker the model may emit.
if "###" in output:
    output = output[:output.index("###")].strip()

print(output)
# → "We need to fix the deployment pipeline."
```
For long dictation that may need paragraph formatting, raise `max_tokens` to 1024–2048.
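The prompt construction and the truncate-at-`###` step above can be factored into small helpers that also split long v23 outputs on their paragraph breaks. A sketch under those assumptions; the helper names are illustrative, and only the `### Input:`/`### Output:` prompt format comes from this card:

```python
def build_prompt(raw_transcript: str) -> str:
    """Wrap a raw ASR transcript in the prompt format the model expects."""
    return f"### Input:\n{raw_transcript}\n\n### Output:\n"

def postprocess(generated: str) -> list[str]:
    """Truncate at any stray '###' marker, then split the cleaned text
    into paragraphs on the \\n\\n breaks emitted for long dictations."""
    if "###" in generated:
        generated = generated[:generated.index("###")]
    text = generated.strip()
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Short outputs yield one paragraph; long v23 outputs may yield several.
paras = postprocess("First topic.\n\nSecond topic.\n\n### Input:")
print(paras)  # → ['First topic.', 'Second topic.']
```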
## What It Does
| Input (raw ASR) | Output (cleaned) |
|---|---|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| the deadline is friday no monday we have until monday | The deadline is Monday. |
| what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
| okay so the thing is basically we're running out of disk space | We're running out of disk space. |
| uh yes | Yes. |
### NEW in v23: Paragraph emission on long dictations
Multi-topic input is now restructured into paragraphed prose with `\n\n` breaks at natural topic boundaries. See the bf16 model card for a full example.
## All Variants
| Variant | Size | Use Case |
|---|---|---|
| Full precision (bf16) | 676 MB | Training, GPU inference |
| MLX 5-bit | 237 MB | Recommended for Apple Silicon |
| MLX 4-bit (this) | 195 MB | Smallest, slight quality trade-off |
## License
MIT
## Links
- Application: sottoasr.app
- Source: github.com/juanqui/sottoasr
- Dataset: juanquivilla/sotto-transcript-cleanup
## Base model

- LiquidAI/LFM2.5-350M-Base