SottoASR Transcript Cleanup — LFM2.5-350M MLX 5-bit (v23 + Paragraphs)

sottoasr.app · Full precision (bf16) · MLX 4-bit (smaller) · Training Dataset

Overview

MLX 5-bit affine quantization of juanquivilla/sotto-cleanup-lfm25-350m for on-device deployment on Apple Silicon. This is the recommended variant for SottoASR's on-device transcript cleanup — minimal quality loss vs full precision, fits in 237 MB, runs at ~85 ms per typical transcript on M-series chips.

This model powers on-device transcript cleanup in SottoASR — a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and — new in v23 — restructures long dictations into paragraph-formatted prose, all locally with zero cloud dependency.

What's new in v23

v23 adds paragraph emission for long-form dictation. The previous production models (v18 and v22) emitted output as a single run-on paragraph regardless of input length. v23 was retrained on a dataset augmented with 4,012 new paragraph_formatting samples, teaching the model to insert \n\n paragraph breaks at natural topic, time-reference, and discourse-marker boundaries.

| Capability | v22 (previous prod) | v23 R6 (this model) |
|---|---|---|
| Paragraph emission rate on long inputs | 0.0 % | 91.5 % |
| ROUGE-L on paragraph-formatted inputs | 0.9521 | 0.9784 |
| ROUGE-L on standard val set | 0.9539 | 0.9537 |
| Filler-free rate on standard val set | 90.3 % | 91.0 % |

Key Specs

| Property | Value |
|---|---|
| Size | 237 MB |
| Quantization | 5-bit affine, group_size=64 |
| Effective bits/weight | 5.502 |
| ROUGE-L (val set) | ~0.9505 (≈ bf16) |
| Paragraph rate (long inputs) | ~89.5 % |
| Architecture | Hybrid: 10 conv + 6 GQA attention layers (354M params) |
| Latency | ~85 ms average per transcript (M-series) |
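The effective bits/weight figure follows directly from the recipe: with 5-bit affine quantization at group_size=64, each group of 64 weights stores its 5-bit codes plus a per-group scale and bias. A minimal sketch of the arithmetic, assuming a 16-bit (fp16) scale and bias per group:

```python
# Approximate effective bits per weight for 5-bit affine quantization
# with group_size=64, assuming one fp16 scale and one fp16 bias per group.
bits, group_size = 5, 64
scale_bits = bias_bits = 16  # per-group metadata, assumed fp16

payload = bits * group_size            # quantized codes: 320 bits per group
overhead = scale_bits + bias_bits      # scale + bias:     32 bits per group
effective = (payload + overhead) / group_size

print(f"{effective:.3f} bits/weight")  # → 5.500
```

The reported 5.502 sits just above this per-group floor, which likely reflects a small amount of additional metadata or tensors kept at higher precision.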

Quantization Recipe

mlx_lm.convert \
  --hf-path juanquivilla/sotto-cleanup-lfm25-350m \
  --mlx-path sotto-cleanup-lfm25-350m-mlx-5bit \
  -q --q-bits 5 --q-group-size 64 \
  --trust-remote-code
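mlx_lm.convert records the quantization settings in the output directory's config.json under a "quantization" key. A small guard like the sketch below (the helper name and defaults are illustrative, not part of mlx_lm) can keep a build script from accidentally shipping the wrong variant:

```python
import json
from pathlib import Path

def check_quantization(config: dict, bits: int = 5, group_size: int = 64) -> bool:
    """True if a parsed config.json declares the expected quantization recipe."""
    q = config.get("quantization", {})
    return q.get("bits") == bits and q.get("group_size") == group_size

# Typical use: parse <mlx-path>/config.json and fail the build on a mismatch, e.g.
# config = json.loads(Path("sotto-cleanup-lfm25-350m-mlx-5bit/config.json").read_text())
# assert check_quantization(config), "wrong quantization recipe"
```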

Usage

Python (mlx_lm)

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit")
sampler = make_sampler(temp=0.0)  # greedy

text = "so uh basically we need to fix the deployment pipeline"
prompt = f"### Input:\n{text}\n\n### Output:\n"

output = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)
if "###" in output:
    output = output[:output.index("###")].strip()
print(output)
# → "We need to fix the deployment pipeline."

For long dictation that may need paragraph formatting, raise max_tokens to 1024–2048.
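Because the model is trained on an "### Input / ### Output" template, generation can occasionally run past the answer into a stray "###" header; the snippet above truncates at the first marker inline. The same prompt-building and post-processing can be factored into small helpers (the function names here are illustrative) for reuse across call sites:

```python
def build_prompt(raw: str) -> str:
    """Wrap a raw ASR transcript in the model's instruction template."""
    return f"### Input:\n{raw}\n\n### Output:\n"

def trim_output(generated: str, stop_marker: str = "###") -> str:
    """Cut generation at the first stray section marker and strip whitespace."""
    if stop_marker in generated:
        generated = generated[: generated.index(stop_marker)]
    return generated.strip()
```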

What It Does

| Input (raw ASR) | Output (cleaned) |
|---|---|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| the deadline is friday no monday we have until monday | The deadline is Monday. |
| what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
| okay so the thing is basically we're running out of disk space | We're running out of disk space. |
| uh yes | Yes. |

NEW in v23: Paragraph emission on long dictations

Multi-topic input is now restructured into paragraphed prose with \n\n breaks at natural topic boundaries. See the bf16 model card for a full example.
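On the consumer side, the \n\n breaks make paragraph handling straightforward; a display layer can split the cleaned text with something like this (a sketch, not SottoASR's actual rendering code):

```python
def paragraphs(cleaned: str) -> list[str]:
    """Split v23 output on double-newline paragraph breaks, dropping empties."""
    return [p.strip() for p in cleaned.split("\n\n") if p.strip()]

sample = "First topic covered here.\n\nThen the dictation moves on."
print(paragraphs(sample))  # → ['First topic covered here.', 'Then the dictation moves on.']
```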

Benchmark Results

Benchmark scores are identical to the bf16 model's to within MLX 5-bit quantization noise.

All Variants

| Variant | Size | Use Case |
|---|---|---|
| Full precision (bf16) | 676 MB | Training, GPU inference |
| MLX 5-bit (this) | 237 MB | Recommended for Apple Silicon |
| MLX 4-bit | 195 MB | Smallest, slight quality trade-off |

License

MIT
