# lilfugu
A Japanese ASR model fine-tuned for software development.
Based on Qwen3-ASR-1.7B. Designed to produce clean, usable transcriptions for developers: not just programming-term recognition, but also proper Arabic numerals (e.g. 3000, not 三千), consistent punctuation, and overall higher-quality Japanese output.
## What's improved over the base model
- Programming terms in English: `useEffect`, `Docker`, `Vercel`, `Prisma`, `Tailwind CSS`, etc., not katakana
- Arabic numerals: 3000番ポート (port 3000), 200ms, 8GB, not kanji numerals
- Punctuation and formatting: cleaner, more consistent output
- General Japanese quality: improvements not fully captured by existing benchmarks (JSUT, etc.) because of their normalization
## Benchmarks
### ADLIB
| Model | CER | Term Accuracy (Exact) | Composite |
|---|---|---|---|
| lilfugu | 26.3% | 51.6% | 0.6272 |
| Qwen3-ASR-1.7B (base) | 41.1% | 24.6% | 0.4203 |
| Whisper large-v3-turbo | 41.9% | 20.2% | 0.3935 |
| kotoba-whisper-v2.0 | 61.1% | 7.0% | 0.2256 |
| SenseVoice Small | 56.8% | 0.0% | 0.2090 |
Composite = 0.4 × (1 - CER) + 0.6 × Term Accuracy, where the term accuracy used in the composite includes both exact and flexible matches (the table column shows exact matches only).

Benchmark: ADLIB, a language-aware ASR benchmark for Japanese.
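As a quick sanity check, the composite can be reproduced in a few lines (the function name below is mine, not part of the benchmark; back-solving lilfugu's row gives a combined exact + flexible term accuracy of about 55.4%):

```python
def composite(cer: float, term_accuracy: float) -> float:
    """Composite = 0.4 * (1 - CER) + 0.6 * term accuracy (exact + flexible)."""
    return 0.4 * (1 - cer) + 0.6 * term_accuracy

# lilfugu's row: CER 26.3%, combined term accuracy ~55.4%
print(round(composite(0.263, 0.554), 4))  # → 0.6272
```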
### JSUT
| Model | CER |
|---|---|
| Qwen3-ASR-1.7B (base) | 10.7% |
| lilfugu | 10.8% |
| Whisper large-v3-turbo | 12.0% |
| kotoba-whisper-v2.0 | 15.7% |
| SenseVoice Small | 16.2% |
Dataset: JSUT
Note: Existing Japanese ASR benchmarks are not designed to properly evaluate Japanese language quality; they normalize numbers, punctuation, and whitespace before scoring. These scores should be taken as a rough reference only.
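To make the normalization point concrete, here is a toy scorer (the `normalize` rules below are a deliberately minimal sketch of my own, not any benchmark's actual implementation). A kanji-numeral transcription has a high raw CER against the Arabic-numeral reference, but scores perfectly once numerals are normalized:

```python
import re

def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein DP over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / len(ref)

# Toy normalizer: map two kanji numerals to digits, strip punctuation/whitespace.
KANJI = str.maketrans({"三": "3", "千": "000"})
def normalize(s: str) -> str:
    return re.sub(r"[、。\s]", "", s.translate(KANJI))

ref, hyp = "3000番ポート", "三千番ポート"   # "port 3000": Arabic vs. kanji numerals
print(cer(ref, hyp))                        # 0.5: 4 edits over 8 reference chars
print(cer(normalize(ref), normalize(hyp)))  # 0.0: the difference disappears
```

This is why a model that consistently emits Arabic numerals can look no better than the base model on a normalized benchmark like JSUT while being clearly better in raw output.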
## Variants
| Repository | Size | Format |
|---|---|---|
| lilfugu (this) | 4.1 GB | MLX bfloat16 |
| lilfugu-8bit | 2.8 GB | MLX 8bit quantized |
| lilfugu-transformers | 4.1 GB | safetensors fp16 (CUDA/Linux) |
| lilfugu-lora | ~49 MB | LoRA adapter |
See also: lilfugu-experimental, which has higher term accuracy but may over-convert in some cases.
## Usage
### MLX (Apple Silicon)

```bash
pip install -U mlx-audio
```

```python
from mlx_audio.stt import load

model = load("holotherapper/lilfugu")
result = model.generate("audio.wav", language="Japanese")
print(result.text)
```
For the 8bit version:

```python
model = load("holotherapper/lilfugu-8bit")
```
### CUDA / Linux

```python
from qwen_asr import Qwen3ASR

model = Qwen3ASR.from_pretrained("holotherapper/lilfugu-transformers")
result = model.transcribe("audio.wav")
```
### LoRA adapter (custom scale tuning)

```python
from mlx_tune.stt import FastSTTModel
from mlx_lm.tuner.lora import LoRALinear

model, _ = FastSTTModel.from_pretrained("mlx-community/Qwen3-ASR-1.7B-bf16")
model.load_adapter("holotherapper/lilfugu-lora")

# Adjust the adapter scale (0.0-1.0). Higher = stronger term conversion.
for _, module in model.model.named_modules():
    if isinstance(module, LoRALinear):
        module.scale = 1.0

text = model.transcribe("audio.wav", language="ja")
```
## License

Apache 2.0 (following Qwen3-ASR-1.7B)