title: WhisperMath
emoji: 🧮
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false

WhisperMath Web Demo

WhisperMath is an interactive demo for converting spoken mathematical phrases into rendered math notation.

browser microphone
  -> faster-whisper transcript
  -> ByT5 math decoder
  -> rendered KaTeX output

The demo has two useful modes:

  • Record audio: speak a math expression in the browser.
  • Edit transcript: correct Whisper's transcript and click Decode Transcript to test only the ByT5 decoder.

This separation is important because spoken-math errors can come from two different places:

  • Whisper may hear the audio incorrectly.
  • The ByT5 decoder may convert a correct transcript incorrectly.
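
When reviewing test utterances, one way to make this split concrete is to compare each stage against a hand-written reference. The sketch below is illustrative only: `attribute_error` and its arguments are hypothetical names that do not come from app.py, and the normalization rules are an assumption about what counts as "the same" transcript.

```python
def _normalize(text: str) -> str:
    """Lowercase, drop trailing periods, and collapse whitespace before comparing."""
    return " ".join(text.lower().rstrip(".").split())


def attribute_error(ref_transcript: str, whisper_transcript: str,
                    ref_math: str, decoder_output: str) -> str:
    """Return which stage most likely caused a wrong result.

    Hypothetical helper for manual evaluation; not part of the demo backend.
    """
    if _normalize(ref_transcript) != _normalize(whisper_transcript):
        # The decoder never saw the intended words, so check transcription first.
        return "whisper"
    if ref_math.replace(" ", "") != decoder_output.replace(" ", ""):
        # The transcript was right, so the mismatch comes from the decoder.
        return "decoder"
    return "ok"
```

A "whisper" result suggests re-recording or trying a larger Whisper model; a "decoder" result is a candidate failure case for the ByT5 checkpoint.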

Live Demo

Hugging Face Space:

https://huggingface.co/spaces/vibhuiitj/whispermath-webdemo

Direct app URL:

https://vibhuiitj-whispermath-webdemo.hf.space

The public Space runs on the free CPU tier (cpu-basic) with this configuration:

Whisper: small.en
Decoder: vibhuiitj/byt5-base-whispermath-a100-checkpoint-10724
Device: CPU

Files

webdemo/
  app.py              FastAPI backend
  static/index.html   Browser UI with recorder and KaTeX rendering
  requirements.txt    Python dependencies
  Dockerfile          Hugging Face Spaces Docker image
  .dockerignore       Files ignored by Docker build
  README.md           Space metadata and this guide

How It Works

  1. The browser records audio using MediaRecorder.
  2. The frontend uploads the recording to POST /api/transcribe.
  3. The backend saves the audio to a temporary file.
  4. faster-whisper transcribes the audio into English text.
  5. The transcript is passed to the ByT5 checkpoint.
  6. The ByT5 output is returned as raw math/LaTeX-like text.
  7. The frontend renders the output using KaTeX and also shows the raw model output for debugging.
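
Steps 3 to 6 can be sketched as a small function with the two model stages passed in as callables. This is an assumption-laden outline rather than the code in app.py: `run_pipeline`, `transcribe`, `decode`, and the `.webm` suffix are all hypothetical, and the real backend wires in faster-whisper and the ByT5 checkpoint directly.

```python
import os
import tempfile
from typing import Callable, Dict


def run_pipeline(audio_bytes: bytes,
                 transcribe: Callable[[str], str],
                 decode: Callable[[str], str],
                 suffix: str = ".webm") -> Dict[str, str]:
    """Save uploaded audio to a temp file, transcribe it, then decode the transcript."""
    # Step 3: persist the upload so the transcriber can read it from disk.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(audio_bytes)
        # Step 4: audio file -> English transcript.
        transcript = transcribe(path).strip()
        # Steps 5-6: transcript -> raw math text.
        math_text = decode(transcript)
    finally:
        # Always remove the temp file, even if a stage raised.
        os.remove(path)
    return {"transcript": transcript, "math_text": math_text}
```

Passing the stages in as callables keeps the sketch runnable without downloading any model, and mirrors the demo's split between transcription and decoding.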

Local Setup

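From this folder (the absolute paths in this guide come from the author's machine; substitute your own checkout location):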

cd /Users/vaibhav/Desktop/beyond/whispermath/webdemo

You can use the existing Phase 3 virtualenv:

/Users/vaibhav/Desktop/beyond/whispermath/phase-3-decoder/.venv/bin/python -m pip install -r requirements.txt

Or create a fresh virtualenv:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Run Locally

CPU-friendly default:

uvicorn app:app --host 127.0.0.1 --port 8766

Then open:

http://127.0.0.1:8766

If using the Phase 3 virtualenv directly:

/Users/vaibhav/Desktop/beyond/whispermath/phase-3-decoder/.venv/bin/python \
  -m uvicorn app:app --host 127.0.0.1 --port 8766

Model Configuration

The app is configured through environment variables. For example:

export WHISPERMATH_WHISPER_MODEL=small.en
export WHISPERMATH_WHISPER_DEVICE=cpu
export WHISPERMATH_WHISPER_COMPUTE_TYPE=int8
export WHISPERMATH_DECODER_MODEL=vibhuiitj/byt5-base-whispermath-a100-checkpoint-10724
export WHISPERMATH_DECODER_DEVICE=auto
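
For reference, the variables above could be read in Python roughly as follows. This is a sketch, not the code in app.py: `load_config` and `DEFAULTS` are hypothetical names, and the fallback values simply mirror the export lines above, so the real defaults may differ.

```python
import os
from typing import Mapping, Optional

# Fallbacks copied from the export lines above; the real defaults in app.py may differ.
DEFAULTS = {
    "WHISPERMATH_WHISPER_MODEL": "small.en",
    "WHISPERMATH_WHISPER_DEVICE": "cpu",
    "WHISPERMATH_WHISPER_COMPUTE_TYPE": "int8",
    "WHISPERMATH_DECODER_MODEL": "vibhuiitj/byt5-base-whispermath-a100-checkpoint-10724",
    "WHISPERMATH_DECODER_DEVICE": "auto",
}


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the five settings, using DEFAULTS for unset or empty variables."""
    env = os.environ if env is None else env
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}
```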

Whisper model options:

tiny.en    Fastest, weakest transcription
base.en    Better quality, still fairly light
small.en   Good CPU default for the public Space
medium.en  Better transcription, slower and heavier

For local testing with medium Whisper:

WHISPERMATH_WHISPER_MODEL=medium.en \
/Users/vaibhav/Desktop/beyond/whispermath/phase-3-decoder/.venv/bin/python \
  -m uvicorn app:app --host 127.0.0.1 --port 8766

For the free Hugging Face CPU Space, keep:

WHISPERMATH_WHISPER_MODEL=small.en
WHISPERMATH_DECODER_DEVICE=cpu

API Endpoints

Health

curl http://127.0.0.1:8766/api/health

Example response:

{
  "status": "ok",
  "whisper_model": "small.en",
  "decoder_model": "vibhuiitj/byt5-base-whispermath-a100-checkpoint-10724",
  "decoder_device": "cpu"
}

Decode Text Only

Use this when you want to test the ByT5 decoder without audio:

curl -X POST http://127.0.0.1:8766/api/decode \
  -H "Content-Type: application/json" \
  -d '{"text":"x squared minus y squared equals four","num_beams":1,"max_new_tokens":128}'

Example response:

{
  "transcript": "x squared minus y squared equals four",
  "math_text": "x^2-y^2=4",
  "decoder_model": "vibhuiitj/byt5-base-whispermath-a100-checkpoint-10724"
}
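
The same call can be made from Python using only the standard library. The helper names `build_decode_request` and `decode_text` are hypothetical; the endpoint path and JSON fields are taken from the curl example above.

```python
import json
import urllib.request


def build_decode_request(base_url: str, text: str,
                         num_beams: int = 1,
                         max_new_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/decode."""
    payload = {"text": text, "num_beams": num_beams, "max_new_tokens": max_new_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/decode",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_text(base_url: str, text: str, timeout: float = 60.0) -> dict:
    """Send the request to a running server and return the parsed JSON response."""
    request = build_decode_request(base_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the local server running, `decode_text("http://127.0.0.1:8766", "x squared minus y squared equals four")` should return a dictionary with the keys shown in the example response.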

Transcribe Audio

curl -X POST http://127.0.0.1:8766/api/transcribe \
  -F audio=@/path/to/audio.wav \
  -F num_beams=4 \
  -F max_new_tokens=256

Example response:

{
  "transcript": "integral from zero to pi of sine x dx.",
  "math_text": "\\int_0^\\pi \\sin x dx.",
  "segments": [
    {
      "start": 0.0,
      "end": 2.72,
      "text": "integral from zero to pi of sine x dx."
    }
  ],
  "whisper_model": "small.en",
  "decoder_model": "vibhuiitj/byt5-base-whispermath-a100-checkpoint-10724"
}
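
A response in this shape can be reduced to the fields most useful for debugging. The function below is a hypothetical helper, written against the example response above rather than a documented schema.

```python
import json


def summarize_transcription(response_json: str) -> dict:
    """Extract debugging fields from a /api/transcribe response body."""
    data = json.loads(response_json)
    segments = data.get("segments", [])
    # Speech span: from the first segment's start to the last segment's end.
    duration = segments[-1]["end"] - segments[0]["start"] if segments else 0.0
    return {
        "transcript": data["transcript"],
        # The JSON escape "\\int" decodes to a single backslash here.
        "math_text": data["math_text"],
        "segment_count": len(segments),
        "duration_seconds": round(duration, 2),
    }
```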

Deploy To Hugging Face Spaces

This folder is ready for a Docker Space.

1. Login

Do not commit or paste tokens into files.

huggingface-cli login

Or use the Python API with your local authenticated session.

2. Create The Space

from huggingface_hub import HfApi

api = HfApi()
api.create_repo(
    repo_id="vibhuiitj/whispermath-webdemo",
    repo_type="space",
    space_sdk="docker",
    private=False,
    exist_ok=True,
)

3. Upload The Folder

from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    repo_id="vibhuiitj/whispermath-webdemo",
    repo_type="space",
    folder_path="/Users/vaibhav/Desktop/beyond/whispermath/webdemo",
    path_in_repo=".",
    ignore_patterns=[
        "__pycache__/*",
        "*.pyc",
        ".DS_Store",
        ".venv/*",
        "audio/*",
        "outputs/*",
    ],
    commit_message="Deploy WhisperMath web demo",
)

4. Check Runtime

from huggingface_hub import HfApi

runtime = HfApi().get_space_runtime("vibhuiitj/whispermath-webdemo")
print(runtime)

Expected final state:

stage='RUNNING'
hardware='cpu-basic'
requested_hardware='cpu-basic'
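
Because the build and first start take a while, it can help to poll until the Space reaches its final state. The sketch below is hypothetical: `wait_until_running` is not part of huggingface_hub, the stage getter is passed in as a callable, and the set of failure stages reflects my understanding of the stage names rather than an exhaustive list.

```python
import time
from typing import Callable

# Stages treated as terminal failures (assumed; other non-running stages keep polling).
FAILED_STAGES = {"BUILD_ERROR", "CONFIG_ERROR", "RUNTIME_ERROR"}


def wait_until_running(get_stage: Callable[[], str],
                       timeout_s: float = 900.0,
                       interval_s: float = 15.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_stage() until it returns RUNNING or a failed stage, or time runs out."""
    waited = 0.0
    while True:
        stage = get_stage()
        if stage == "RUNNING" or stage in FAILED_STAGES:
            return stage
        if waited >= timeout_s:
            raise TimeoutError(f"Space still in stage {stage!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

It could be driven with `lambda: HfApi().get_space_runtime("vibhuiitj/whispermath-webdemo").stage` as the `get_stage` argument.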

5. Test The Deployed Space

Health:

curl https://vibhuiitj-whispermath-webdemo.hf.space/api/health

Text decode:

curl -L -X POST https://vibhuiitj-whispermath-webdemo.hf.space/api/decode \
  -H "Content-Type: application/json" \
  -d '{"text":"x squared minus y squared equals four","num_beams":1,"max_new_tokens":128}'

Expected:

{
  "transcript": "x squared minus y squared equals four",
  "math_text": "x^2-y^2=4",
  "decoder_model": "vibhuiitj/byt5-base-whispermath-a100-checkpoint-10724"
}

Docker Notes

The Docker image:

  • Uses python:3.11-slim
  • Installs libgomp1, needed by some CPU inference dependencies
  • Runs on port 7860, the Hugging Face Spaces default
  • Sets WHISPERMATH_WHISPER_MODEL=small.en
  • Sets WHISPERMATH_DECODER_DEVICE=cpu
  • Sets HF_HUB_DISABLE_XET=1

HF_HUB_DISABLE_XET=1 is included because local testing showed large model downloads could get stuck with incomplete Xet-backed cache files.

Troubleshooting

Space Takes A Long Time To Start

The first start downloads:

  • Systran/faster-whisper-small.en
  • vibhuiitj/byt5-base-whispermath-a100-checkpoint-10724

On the free CPU tier, startup can take a few minutes.

Audio Transcription Is Wrong

Try these in order:

  1. Speak shorter phrases.
  2. Use clearer operator words, for example "over" instead of "by".
  3. Check the editable transcript box.
  4. Correct the transcript manually.
  5. Click Decode Transcript.
  6. Try base.en, small.en, or medium.en locally.

ByT5 Output Is Wrong But Transcript Is Correct

This points to the decoder rather than Whisper, and usually means the decoder needs more targeted training data. Common weak phrases include:

by / divided by / over
whole square
derivative of ...
limit as ...
fraction with grouped numerator and denominator

Use the editable transcript box to collect failure cases.
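
Collected cases are easier to reuse as training data if they are stored in a consistent format. A minimal sketch, assuming a JSON Lines file; `log_failure` and the field names are hypothetical and not something the app writes itself.

```python
import json
from pathlib import Path


def log_failure(path, transcript: str, model_output: str, expected: str) -> None:
    """Append one decoder failure case to a JSON Lines file."""
    record = {
        "transcript": transcript,      # the corrected transcript that was decoded
        "model_output": model_output,  # what ByT5 actually returned
        "expected": expected,          # the math text you wanted instead
    }
    with Path(path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```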

KaTeX Rendering Fails

The app still shows the raw ByT5 output under Raw ByT5 Output. If the raw output is malformed LaTeX-like text, KaTeX may render an error-colored expression or show fallback text.
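
A quick way to spot one common kind of malformed output is to check that curly braces balance. The function below is a hypothetical heuristic, not something the app does, and it catches only brace mismatches, not every input KaTeX rejects.

```python
def find_brace_problem(latex: str):
    """Return None if curly braces balance, else a description of the first problem."""
    depth = 0
    i = 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\":
            # Skip the escaped character so \{ and \} are not counted as grouping braces.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return f"unmatched '}}' at index {i}"
        i += 1
    if depth > 0:
        return f"{depth} unclosed '{{'"
    return None
```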

Medium Whisper Hangs During Download

If a local download stalls and leaves an incomplete cache file, re-download the model with Xet disabled and a single worker:

HF_HUB_DISABLE_XET=1 python - <<'PY'
from huggingface_hub import snapshot_download
snapshot_download("Systran/faster-whisper-medium.en", max_workers=1)
PY

Then restart:

HF_HUB_DISABLE_XET=1 WHISPERMATH_WHISPER_MODEL=medium.en \
python -m uvicorn app:app --host 127.0.0.1 --port 8766

Security

Never commit Hugging Face tokens or API keys into this folder.

If a token is pasted into a chat or ends up in terminal history by mistake, revoke it in your Hugging Face account settings and create a new one.