# SignBridge – technical walkthrough

Internal technical record of the build. Not a submission deliverable (the Build-in-Public extra challenge was dropped on 2026-05-07). Kept because it documents the AMD-specific engineering thinking and is useful if anyone later asks "why these design choices?".
## What we built

A real-time webcam-based ASL → English speech translator. A deaf user signs into the webcam; the pipeline (MediaPipe Holistic → trained sign classifier → Llama-3.1-8B sentence composer → Coqui XTTS-v2) returns spoken English in under 2 seconds. Designed to fit Track 3 (Vision & Multimodal AI), with the entire model stack running concurrently on a single AMD Instinct MI300X.
## Why AMD MI300X

- 192 GB HBM3: the trained classifier (~20 MB), Llama-3.1-8B (16 GB FP16), XTTS-v2 (2 GB), and (V2 stretch) Whisper-large-v3 (3 GB) all fit concurrently, with margin left for KV cache. A back-of-envelope budget is sketched after this list.
- 5.3 TB/s memory bandwidth: this is a bandwidth-bound streaming workload (many small classifier inferences per second, chunked XTTS decode, LLM next-token generation), which is exactly where bandwidth wins.
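
The fit claim is just arithmetic. A minimal sanity-check sketch using the figures quoted above; the KV-cache reserve is an assumed working number, not a measurement:

```python
# Back-of-envelope VRAM budget for the full (V2-stretch) stack on one MI300X.
# Model sizes are the figures quoted in the list above; the KV-cache reserve
# is an assumption, not a measured value.
HBM3_GB = 192

stack_gb = {
    "sign classifier (TorchScript)": 0.02,  # ~20 MB
    "Llama-3.1-8B (FP16)":           16.0,
    "XTTS-v2":                        2.0,
    "Whisper-large-v3 (V2 stretch)":  3.0,
}
kv_cache_reserve_gb = 40.0  # assumed generous vLLM KV-cache allocation

used = sum(stack_gb.values()) + kv_cache_reserve_gb
print(f"~{used:.1f} GB used of {HBM3_GB} GB -> {HBM3_GB - used:.1f} GB free")
```

Even with a deliberately generous KV-cache reserve, roughly two-thirds of the card stays free, which is what makes the 70B V2 upgrade (see the comparison table below) plausible.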
## Architecture

```
webcam frames → MediaPipe Holistic → trained classifier
                    (CPU-fast)       (TorchScript on MI300X)
                                              │
                                              ▼
                              Llama-3.1-8B sentence composer
                                     (vLLM on MI300X)
                                              │
                                              ▼
                                      XTTS-v2 → audio
                                     (XTTS on MI300X)
```
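
The glue between stages is deliberately thin. A minimal orchestration sketch of the diagram above; every helper name (`extract_landmarks`, `classify_window`, `compose_sentence`, `synthesize`) is a hypothetical stand-in for the real module, and the 30-frame window is an assumed value:

```python
# Hypothetical glue for the pipeline above; helper names are placeholders,
# not the project's actual APIs.
from collections import deque

WINDOW = 30  # assumed classifier window length, in frames

def run_pipeline(frames, extract_landmarks, classify_window,
                 compose_sentence, synthesize):
    """webcam frames -> landmarks -> glosses -> English sentence -> audio."""
    window, glosses = deque(maxlen=WINDOW), []
    for frame in frames:
        window.append(extract_landmarks(frame))    # MediaPipe Holistic (CPU)
        if len(window) == WINDOW:
            gloss = classify_window(list(window))  # TorchScript on MI300X
            if gloss is not None:
                glosses.append(gloss)
    sentence = compose_sentence(glosses)           # Llama-3.1-8B via vLLM
    return synthesize(sentence)                    # XTTS-v2 -> waveform
```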
## Models

| Component | Source | Notes |
|---|---|---|
| Pose extractor | MediaPipe Holistic (Google) | CPU-fast preprocessing; not GPU-bound |
| Sign classifier | Trained from scratch on WLASL Top-100 + ASL fingerspelling alphabet | 3-layer transformer encoder over 543-dim landmark sequences; published to HF Hub at lucas-loo/signbridge-classifier (sketch after this table) |
| Sentence composer | meta-llama/Llama-3.1-8B-Instruct | Pulled from HF Hub; served on MI300X via vLLM |
| Text-to-speech | coqui/XTTS-v2 | Multilingual; we use English in V1 |
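
For reference, a minimal PyTorch sketch of what "3-layer transformer encoder over 543-dim landmark sequences" could look like. Only the input width and layer count come from the table; `d_model`, head count, pooling, and the 126-class head (100 WLASL glosses + 26 letters) are assumptions, not the trained configuration:

```python
import torch
import torch.nn as nn

class SignClassifier(nn.Module):
    """3-layer transformer encoder over per-frame landmark vectors.

    input_dim=543 matches the walkthrough; d_model, nhead, and
    num_classes (assumed 100 glosses + 26 letters) are guesses.
    """
    def __init__(self, input_dim=543, d_model=256, num_classes=126):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                  # x: (batch, frames, 543)
        h = self.encoder(self.proj(x))     # (batch, frames, d_model)
        return self.head(h.mean(dim=1))    # mean-pool over time -> logits
```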
## Datasets
- WLASL Top-100 subset
- ASL fingerspelling alphabet (open dataset)
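
Both datasets are video, so they run through the same landmark extraction as the live webcam feed. A sketch of that step (543 = 33 pose + 468 face + 2 × 21 hand landmarks from MediaPipe Holistic; taking one scalar per landmark to get a 543-dim vector is our assumed flattening, chosen only to match the dimension quoted above):

```python
import numpy as np
import mediapipe as mp

holistic = mp.solutions.holistic.Holistic(static_image_mode=False)

def frame_to_vector(rgb_frame):
    """One 543-dim vector per RGB frame: 33 pose + 468 face + 21 + 21
    hand landmarks, zero-filled when a part is not detected.
    One scalar (x) per landmark is an assumed flattening."""
    res = holistic.process(rgb_frame)
    parts = [(res.pose_landmarks, 33), (res.face_landmarks, 468),
             (res.left_hand_landmarks, 21), (res.right_hand_landmarks, 21)]
    vec = []
    for lms, n in parts:
        vec.extend([p.x for p in lms.landmark] if lms else [0.0] * n)
    return np.asarray(vec, dtype=np.float32)
```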
## ROCm / AMD Developer Cloud experience

Filled in across Days 1–3.
### Day 1 – environment + sanity

TODO

### Day 2 – training the classifier

TODO

### Day 3 – serving + latency tuning

TODO

### What worked well

TODO

### What we'd flag as friction

TODO
## Latency

Target: ≤ 2 s from end-of-sign to start of speech.

Measured on a single MI300X (Day 3):

- MediaPipe Holistic per frame: TODO ms
- Classifier per window: TODO ms
- Llama-3.1-8B sentence composition (≤ 30 tokens): TODO ms
- XTTS-v2 first-audio-chunk: TODO ms

A harness for filling these in is sketched after this list.
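
A minimal wall-clock timing harness along these lines; the stage names are illustrative, and the commented calls refer to the hypothetical helpers from the orchestration sketch above:

```python
import time
from contextlib import contextmanager

timings_ms = {}

@contextmanager
def stage(name):
    """Accumulate wall-clock milliseconds per pipeline stage."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[name] = timings_ms.get(name, 0.0) \
            + (time.perf_counter() - t0) * 1e3

# Usage: wrap each step, then read timings_ms after a run, e.g.
#   with stage("holistic"):   lm = extract_landmarks(frame)
#   with stage("classifier"): gloss = classify_window(window)
#   with stage("llm"):        sent = compose_sentence(glosses)
#   with stage("tts"):        audio = synthesize(sent)
```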
## MI300X vs NVIDIA H100 – the AMD pitch

| Item | MI300X (1 GPU) | H100 (1 GPU) | H100 cluster needed |
|---|---|---|---|
| Llama-3.1-8B FP16 weights | ✅ fits with margin | ✅ fits with margin | 1× |
| + XTTS-v2 + Whisper-large-v3 + classifier | ✅ all concurrent | ⚠️ tight (~28 GB total + KV) | likely 1×, but no headroom |
| + 70B reasoner upgrade (V2) | ✅ 70B FP8 (~70 GB) still fits | ❌ doesn't fit at all | ≥ 3× |
The single-GPU concurrency story is the AMD pitch: this V1 also fits on an H100, but on the MI300X the same architecture has clear headroom for higher-quality V2 models.
## License

MIT.