<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width,initial-scale=1" />
<title>Mega-ASR — pure browser ASR</title>
<style>
:root { color-scheme: light dark; --fg:#1a1a1a; --muted:#666; --bg:#f6f7fb; --panel:#fff; --border:#d8dadf;
--green:#2ec27e; --orange:#e8a23a; --yellow:#e0c34a; --red:#e0524c; --accent:#4c6ef5; }
@media (prefers-color-scheme: dark){
:root { --fg:#e8eaed; --muted:#9ba0a8; --bg:#171a1f; --panel:#22262e; --border:#3a3f48; }
}
* { box-sizing: border-box; }
body { margin:0; background:var(--bg); color:var(--fg); font-family: ui-sans-serif,system-ui,-apple-system,Segoe UI,Roboto,sans-serif; line-height:1.5; }
.wrap { max-width: 980px; margin: 0 auto; padding: 24px; }
h1 { font-size: 28px; margin: 0 0 8px; }
.sub { color: var(--muted); margin-bottom: 20px; }
.panel { background: var(--panel); border:1px solid var(--border); border-radius: 12px; padding: 16px 18px; margin-bottom: 16px; }
label { display:block; font-weight: 600; margin-bottom: 6px; }
textarea { width:100%; min-height: 72px; padding: 8px 10px; border-radius: 8px; border:1px solid var(--border); background:var(--panel); color:var(--fg); font-family: inherit; resize: vertical; }
input[type=file] { font-family: inherit; }
.examples { display: grid; grid-template-columns: repeat(auto-fill, minmax(140px, 1fr)); gap: 8px; margin-top: 10px; }
.examples button { background: var(--panel); border:1px solid var(--border); border-radius: 8px; padding: 8px 10px; cursor: pointer; color: var(--fg); font-size: 13px; }
.examples button:hover:not(:disabled) { border-color: var(--accent); }
.examples button:disabled { opacity: 0.4; cursor: not-allowed; }
.primary { background: var(--accent); color: white; border: none; border-radius: 8px; padding: 10px 16px; font-size: 15px; font-weight: 600; cursor: pointer; }
.primary:disabled { opacity: 0.5; cursor: not-allowed; }
.row { display: flex; gap: 12px; flex-wrap: wrap; align-items: center; }
audio { width:100%; margin-top: 8px; }
.result { padding: 14px 16px; border-radius: 10px; font-size: 15px; }
.result .label { font-size: 18px; margin-bottom: 6px; }
.result.green { background: rgba(46,194,126,0.13); border: 2px solid var(--green); }
.result.green .label, .result.green .pct { color: var(--green); }
.result.orange { background: rgba(232,162,58,0.13); border: 2px solid var(--orange); }
.result.orange .label, .result.orange .pct { color: var(--orange); }
.result.yellow { background: rgba(224,195,74,0.18); border: 2px solid var(--yellow); }
.result.yellow .label, .result.yellow .pct { color: var(--yellow); }
.result.red { background: rgba(224,82,76,0.13); border: 2px solid var(--red); }
.result.red .label, .result.red .pct { color: var(--red); }
.result.neutral { background: var(--bg); border: 1px solid var(--border); }
.ref-line { font-size: 13px; color: var(--muted); margin-top: 8px; }
.progress { height: 8px; background: var(--border); border-radius: 4px; overflow: hidden; margin-top: 8px; }
.progress > div { height: 100%; background: var(--accent); width: 0%; transition: width 0.2s; }
.log { font-family: ui-monospace, SF Mono, Menlo, monospace; font-size: 12px; color: var(--muted); max-height: 180px; overflow-y: auto; margin-top: 8px; padding: 6px 8px; background: var(--bg); border-radius: 4px; border: 1px solid var(--border); }
code { font-family: ui-monospace, SF Mono, Menlo, monospace; font-size: 13px; }
.muted { color: var(--muted); font-size: 13px; }
.grid2 { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; }
@media (max-width: 720px) { .grid2 { grid-template-columns: 1fr; } }
details summary { cursor: pointer; padding: 4px 0; font-weight: 600; }
</style>
</head>
<body>
<div class="wrap">
<h1>🎙️ Mega-ASR — robust ASR in your browser</h1>
<div class="sub">
INT4-quantized ONNX export of <a href="https://huggingface.co/zhifeixie/Mega-ASR" target="_blank">Mega-ASR</a> (1.7B params)
running entirely on your device via <code>onnxruntime-web</code> + WebGPU.
First load fetches ~2&nbsp;GB of model weights (cached by the browser for subsequent runs).
Models hosted at <a href="https://huggingface.co/Reza2kn/mega-asr-onnx" target="_blank">Reza2kn/mega-asr-onnx</a>.
</div>
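<!--
Editor's sketch (not taken from mega-asr.js): the subtitle above says inference runs through
onnxruntime-web with WebGPU, and the notes further down mention a WASM CPU fallback. One
plausible way to express that choice is to build the execution-provider list from a feature
check. The helper name pickExecutionProviders and the modelUrl variable are hypothetical; the
real script may select its backend differently.

```javascript
// Hypothetical helper: list onnxruntime-web execution providers in priority order.
// `nav` is injectable (defaulting to the global navigator when one exists) so the
// function can also be exercised outside a browser.
function pickExecutionProviders(nav = globalThis.navigator) {
  // Browsers with WebGPU support expose it as navigator.gpu.
  const hasWebGPU = Boolean(nav && nav.gpu);
  // Keep 'wasm' last so the CPU backend is still available if WebGPU fails to initialise.
  return hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'];
}

// Assumed usage with onnxruntime-web (not executed here):
//   const session = await ort.InferenceSession.create(modelUrl, {
//     executionProviders: pickExecutionProviders(),
//   });
```

Note that the presence of navigator.gpu does not guarantee a usable adapter, so a more careful
check could also await navigator.gpu.requestAdapter() and treat a null result as "no WebGPU".
-->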
<div class="panel" id="loader-panel">
<div class="row" style="justify-content:space-between">
<div><b>Models</b> &middot; <span id="loader-status">not loaded</span></div>
<button class="primary" id="load-btn">Load model</button>
</div>
<div class="progress"><div id="loader-bar"></div></div>
<div class="log" id="log"></div>
</div>
<div class="grid2">
<div class="panel">
<label for="audio-file">Audio (any format your browser can decode)</label>
<input type="file" id="audio-file" accept="audio/*" />
<audio id="audio-player" controls></audio>
<label for="lang-select" style="margin-top:14px">Force language (auto-detect can fail at INT4)</label>
<select id="lang-select" style="padding:8px 10px;border-radius:8px;border:1px solid var(--border);background:var(--panel);color:var(--fg);font-family:inherit;width:100%">
<option value="english" selected>English</option>
<option value="chinese">Chinese</option>
<option value="japanese">Japanese</option>
<option value="korean">Korean</option>
<option value="auto">Auto-detect</option>
</select>
<label for="ref-text" style="margin-top:14px">Reference transcript (optional)</label>
<textarea id="ref-text" placeholder="Paste the ground-truth text for scoring."></textarea>
<div style="margin-top: 12px;" class="row">
<button class="primary" id="transcribe-btn" disabled>Transcribe</button>
<span class="muted" id="status"></span>
</div>
<div style="margin-top: 14px;"><b>Try a noisy example</b></div>
<div class="examples" id="examples"></div>
</div>
<div class="panel">
<label>Result</label>
<div id="result" class="result neutral">Load the model, pick an audio clip, and hit Transcribe.</div>
<details style="margin-top: 12px;">
<summary class="muted">How agreement is computed</summary>
<p class="muted">
Hypothesis and reference are lowercased and stripped of punctuation. The word-level Levenshtein
distance divided by the number of reference words gives WER; agreement = max(0, 1 − WER) × 100%.
Bands: <b style="color:var(--green)">≥70%</b>, <b style="color:var(--orange)">50-70%</b>,
<b style="color:var(--yellow)">25-50%</b>, <b style="color:var(--red)">&lt;25%</b>.
</p>
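<!--
Editor's sketch of the scoring described in the paragraph above, assuming whitespace-separated
words and the standard WER definition (edit distance divided by reference word count). The
function names are hypothetical, and the exact punctuation rule is an assumption; the demo's
own script may normalise text differently, especially for languages without spaces.

```javascript
// Lowercase, drop everything that is not a letter, digit, or whitespace, then split.
function normalizeWords(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '')
    .split(/\s+/)
    .filter(Boolean);
}

// Word-level Levenshtein distance using two rolling rows.
function wordEditDistance(ref, hyp) {
  // prev[j] holds the distance between the ref words seen so far and the first j hyp words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i += 1) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j += 1) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[hyp.length];
}

// agreement = max(0, 1 - WER) * 100; returns null when the reference has no words.
function agreementPercent(reference, hypothesis) {
  const ref = normalizeWords(reference);
  const hyp = normalizeWords(hypothesis);
  if (ref.length === 0) return null;
  const wer = wordEditDistance(ref, hyp) / ref.length;
  return Math.max(0, 1 - wer) * 100;
}

// Map a percentage to the colour bands listed above (lower bound inclusive).
function band(pct) {
  if (pct >= 70) return 'green';
  if (pct >= 50) return 'orange';
  if (pct >= 25) return 'yellow';
  return 'red';
}
```

For example, scoring "the quick red fox" against the reference "the quick brown fox" has one
substitution over four reference words, so WER is 0.25 and the agreement falls in the green band.
-->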
</details>
</div>
</div>
<div class="panel">
<details open>
<summary>About this demo</summary>
<ul class="muted">
<li>Loads three ONNX files (audio encoder, decoder prefill, and decoder step), the Qwen3 tokenizer, and an embedding table, all directly from the HF Hub.</li>
<li>Audio is resampled to 16 kHz via the Web Audio API, then log-mel features (128 bins, Whisper-style) are extracted in pure JS.</li>
<li>WebGPU inference where available; falls back to WASM CPU.</li>
<li>First load downloads ~2&nbsp;GB. Later loads reuse the browser cache instead of downloading the weights again.</li>
<li>Max audio per pass: 30&nbsp;seconds (longer audio is truncated to the first 30&nbsp;s).</li>
</ul>
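<!--
Editor's sketch of the audio preparation described in the list above: resample to 16 kHz with
the Web Audio API, then keep at most the first 30 seconds. The function names are hypothetical
and this is one common pattern, not necessarily what mega-asr.js does. decodeTo16kMono needs a
browser; truncateToMaxDuration is plain JavaScript.

```javascript
const TARGET_SAMPLE_RATE = 16000; // Hz, as stated in the list above
const MAX_SECONDS = 30;           // per-pass limit, as stated in the list above

// Browser-only: decode an audio file and render it as 16 kHz mono samples.
async function decodeTo16kMono(arrayBuffer) {
  const decodeCtx = new AudioContext();
  const decoded = await decodeCtx.decodeAudioData(arrayBuffer);
  await decodeCtx.close();
  // An OfflineAudioContext with one channel at 16 kHz resamples and downmixes on render.
  const frames = Math.max(1, Math.ceil(decoded.duration * TARGET_SAMPLE_RATE));
  const offline = new OfflineAudioContext(1, frames, TARGET_SAMPLE_RATE);
  const source = offline.createBufferSource();
  source.buffer = decoded;
  source.connect(offline.destination);
  source.start();
  const rendered = await offline.startRendering();
  return rendered.getChannelData(0); // Float32Array
}

// Keep only the first MAX_SECONDS of audio; shorter clips are returned unchanged.
function truncateToMaxDuration(samples, sampleRate = TARGET_SAMPLE_RATE) {
  const maxSamples = MAX_SECONDS * sampleRate;
  return samples.length > maxSamples ? samples.subarray(0, maxSamples) : samples;
}
```

At 16 kHz the 30-second limit corresponds to 480,000 samples, which is the input that would
then go to log-mel feature extraction.
-->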
</details>
</div>
</div>
<script type="module" src="./mega-asr.js"></script>
</body>
</html>