SottoASR Transcript Cleanup — LFM2.5-350M MLX 5-bit (v23 + Paragraphs)

sottoasr.app · Full precision (bf16) · MLX 4-bit (smaller) · Training Dataset

Overview

MLX 5-bit affine quantization of juanquivilla/sotto-cleanup-lfm25-350m for on-device deployment on Apple Silicon. This is the recommended variant for SottoASR's on-device transcript cleanup — minimal quality loss vs full precision, fits in 237 MB, runs at ~85 ms per typical transcript on M-series chips.

This model powers on-device transcript cleanup in SottoASR — a local, privacy-first speech-to-text application for macOS. It removes filler words, corrects grammar, formats punctuation, handles false starts and self-corrections, and — new in v23 — restructures long dictations into paragraph-formatted prose, all locally with zero cloud dependency.

What's new in v23

v23 adds paragraph emission for long-form dictation. The previous production models (v18 and v22) emitted output as a single run-on paragraph regardless of input length. v23 was retrained on a dataset augmented with 4,012 new paragraph_formatting samples, teaching the model to insert \n\n paragraph breaks at natural topic, time-reference, and discourse-marker boundaries.

| Capability | v22 (previous prod) | v23 R6 (this model) |
|---|---|---|
| Paragraph emission rate on long inputs | 0.0 % | 91.5 % |
| ROUGE-L on paragraph-formatted inputs | 0.9521 | 0.9784 |
| ROUGE-L on standard val set | 0.9539 | 0.9537 |
| Filler-free rate on standard val set | 90.3 % | 91.0 % |

Key Specs

| Property | Value |
|---|---|
| Size | 237 MB |
| Quantization | 5-bit affine, group_size=64 |
| Effective bits/weight | 5.502 |
| ROUGE-L (val set) | ~0.9505 (≈ bf16) |
| Paragraph rate (long inputs) | ~89.5 % |
| Architecture | Hybrid: 10 conv + 6 GQA attention layers (354M params) |
| Latency | ~85 ms average per transcript (M-series) |
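The effective bits/weight figure follows directly from the recipe: with 5-bit affine quantization at group_size=64, each group of 64 weights stores its 5-bit codes plus a per-group scale and bias. A minimal sketch of the arithmetic, assuming a 16-bit (fp16) scale and bias per group:

```python
# Approximate effective bits per weight for 5-bit affine quantization
# with group_size=64, assuming one fp16 scale and one fp16 bias per group.
bits, group_size = 5, 64
scale_bits = bias_bits = 16  # per-group metadata, assumed fp16

payload = bits * group_size            # quantized codes: 320 bits per group
overhead = scale_bits + bias_bits      # scale + bias:     32 bits per group
effective = (payload + overhead) / group_size

print(f"{effective:.3f} bits/weight")  # → 5.500
```

The reported 5.502 sits just above this per-group floor, which likely reflects a small amount of additional metadata or tensors kept at higher precision.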

Quantization Recipe

mlx_lm.convert \
  --hf-path juanquivilla/sotto-cleanup-lfm25-350m \
  --mlx-path sotto-cleanup-lfm25-350m-mlx-5bit \
  -q --q-bits 5 --q-group-size 64 \
  --trust-remote-code
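mlx_lm.convert records the quantization settings in the output directory's config.json under a "quantization" key. A small guard like the sketch below (the helper name and defaults are illustrative, not part of mlx_lm) can keep a build script from accidentally shipping the wrong variant:

```python
import json
from pathlib import Path

def check_quantization(config: dict, bits: int = 5, group_size: int = 64) -> bool:
    """True if a parsed config.json declares the expected quantization recipe."""
    q = config.get("quantization", {})
    return q.get("bits") == bits and q.get("group_size") == group_size

# Typical use: parse <mlx-path>/config.json and fail the build on a mismatch, e.g.
# config = json.loads(Path("sotto-cleanup-lfm25-350m-mlx-5bit/config.json").read_text())
# assert check_quantization(config), "wrong quantization recipe"
```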

Usage

Python (mlx_lm)

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit")
sampler = make_sampler(temp=0.0)  # greedy

text = "so uh basically we need to fix the deployment pipeline"
prompt = f"### Input:\n{text}\n\n### Output:\n"

output = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)
if "###" in output:
    output = output[:output.index("###")].strip()
print(output)
# → "We need to fix the deployment pipeline."

For long dictation that may need paragraph formatting, raise max_tokens to 1024–2048.
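Because the model is trained on an "### Input / ### Output" template, generation can occasionally run past the answer into a stray "###" header; the snippet above truncates at the first marker inline. The same prompt-building and post-processing can be factored into small helpers (the function names here are illustrative) for reuse across call sites:

```python
def build_prompt(raw: str) -> str:
    """Wrap a raw ASR transcript in the model's instruction template."""
    return f"### Input:\n{raw}\n\n### Output:\n"

def trim_output(generated: str, stop_marker: str = "###") -> str:
    """Cut generation at the first stray section marker and strip whitespace."""
    if stop_marker in generated:
        generated = generated[: generated.index(stop_marker)]
    return generated.strip()
```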

What It Does

| Input (raw ASR) | Output (cleaned) |
|---|---|
| so uh basically we need to fix the deployment pipeline | We need to fix the deployment pipeline. |
| the deadline is friday no monday we have until monday | The deadline is Monday. |
| what we what i wanted to say is the tests pass | What I wanted to say is the tests pass. |
| okay so the thing is basically we're running out of disk space | We're running out of disk space. |
| uh yes | Yes. |

NEW in v23: Paragraph emission on long dictations

Multi-topic input is now restructured into paragraphed prose with \n\n breaks at natural topic boundaries. See the bf16 model card for a full example.
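On the consumer side, the \n\n breaks make paragraph handling straightforward; a display layer can split the cleaned text with something like this (a sketch, not SottoASR's actual rendering code):

```python
def paragraphs(cleaned: str) -> list[str]:
    """Split v23 output on double-newline paragraph breaks, dropping empties."""
    return [p.strip() for p in cleaned.split("\n\n") if p.strip()]

sample = "First topic covered here.\n\nThen the dictation moves on."
print(paragraphs(sample))  # → ['First topic covered here.', 'Then the dictation moves on.']
```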

Benchmark Results

Benchmark scores are identical to the bf16 model's to within MLX 5-bit quantization noise.

All Variants

| Variant | Size | Use Case |
|---|---|---|
| Full precision (bf16) | 676 MB | Training, GPU inference |
| MLX 5-bit (this) | 237 MB | Recommended for Apple Silicon |
| MLX 4-bit | 195 MB | Smallest, slight quality trade-off |

License

MIT
