# SignBridge – technical walkthrough

> Internal technical record of the build. Not a submission deliverable
> (the Build-in-Public extra challenge was dropped on 2026-05-07).
> Kept around because it documents the AMD-specific engineering thinking
> and is useful if anyone later asks "why these design choices?".
## What we built

A real-time webcam-based ASL → English speech translator. A deaf user signs
into the webcam; the pipeline (MediaPipe Holistic → trained sign classifier
→ Llama-3.1-8B sentence composer → Coqui XTTS-v2) returns spoken English
in under 2 seconds. Designed to fit Track 3 (Vision & Multimodal AI) with
the entire model stack running concurrently on a single AMD Instinct MI300X.
## Why AMD MI300X

- 192 GB HBM3 – the trained classifier (~20 MB), Llama-3.1-8B (~16 GB FP16),
  XTTS-v2 (~2 GB), and (V2 stretch) Whisper-large-v3 (~3 GB) all fit
  concurrently with margin for KV cache.
- 5.3 TB/s memory bandwidth – the workload is bandwidth-bound streaming (many
  small classifier inferences per second, chunked TTS decode, and LLM
  next-token generation), which is exactly where memory bandwidth wins.
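
A back-of-envelope version of that fit, using the approximate weight sizes above (a sketch with rounded numbers, not measured allocations):

```python
# Rough VRAM budget for concurrent serving on one MI300X (192 GB HBM3).
# Sizes are the approximate FP16 footprints quoted above, not measurements.
weights_gb = {
    "sign_classifier":   0.02,  # ~20 MB TorchScript
    "llama_3_1_8b":     16.0,
    "xtts_v2":           2.0,
    "whisper_large_v3":  3.0,   # V2 stretch goal
}
total = sum(weights_gb.values())
print(f"weights ≈ {total:.1f} GB, ≈ {192 - total:.0f} GB left for KV cache + activations")
# -> weights ≈ 21.0 GB, ≈ 171 GB left for KV cache + activations
```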
## Architecture

```
webcam frames → MediaPipe Holistic → trained classifier
                (CPU-fast)           (TorchScript on MI300X)
                                              │
                                              ▼
                                Llama-3.1-8B sentence composer
                                      (vLLM on MI300X)
                                              │
                                              ▼
                                       XTTS-v2 → audio
                                     (XTTS on MI300X)
```
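
A minimal sketch of the capture side of this diagram: MediaPipe Holistic turns each frame into the 543-landmark vector the classifier consumes, buffered into fixed-size windows. The 32-frame window size and the downstream calls (shown as comments) are illustrative assumptions, not the shipped code.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_holistic = mp.solutions.holistic

def landmarks_from_frame(results) -> np.ndarray:
    """Flatten Holistic output into 543 landmarks
    (33 pose + 468 face + 21 left hand + 21 right hand); zeros when a part is missing."""
    def block(lms, n):
        if lms is None:
            return np.zeros((n, 3), dtype=np.float32)
        return np.array([[p.x, p.y, p.z] for p in lms.landmark], dtype=np.float32)
    return np.concatenate([
        block(results.pose_landmarks, 33),
        block(results.face_landmarks, 468),
        block(results.left_hand_landmarks, 21),
        block(results.right_hand_landmarks, 21),
    ])

cap = cv2.VideoCapture(0)
window = []  # rolling buffer of per-frame landmark arrays

with mp_holistic.Holistic(model_complexity=1) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        window.append(landmarks_from_frame(results))
        if len(window) == 32:                 # assumed ~1 s sign window
            seq = np.stack(window)            # (32, 543, 3) landmark tensor
            # 1) gloss    = classifier(seq)            TorchScript on MI300X
            # 2) sentence = compose(gloss)             Llama-3.1-8B via vLLM
            # 3) audio    = xtts.synthesize(sentence)  XTTS-v2 on MI300X
            window.clear()
```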
## Models

| Component | Source | Notes |
|---|---|---|
| Pose extractor | MediaPipe Holistic (Google) | CPU-fast preprocessing; not GPU-bound |
| Sign classifier | trained from scratch on WLASL Top-100 + ASL fingerspelling alphabet | 3-layer transformer encoder over 543-dim landmark sequences (sketch below); published to HF Hub at `lucas-loo/signbridge-classifier` |
| Sentence composer | `meta-llama/Llama-3.1-8B-Instruct` | Pulled from HF Hub; served on MI300X via vLLM |
| Text-to-speech | `coqui/XTTS-v2` | Multilingual; V1 uses the English voice only |
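
For concreteness, a PyTorch sketch in the shape of the classifier row above: a 3-layer transformer encoder over per-frame landmark features. The hidden width, head count, flattening of the 543 landmarks to x/y/z features, and the 100 + 26 class count are our assumptions, not the published `lucas-loo/signbridge-classifier` checkpoint.

```python
import torch
import torch.nn as nn

class SignClassifier(nn.Module):
    """3-layer transformer encoder over MediaPipe Holistic landmark sequences."""
    def __init__(self, in_dim=543 * 3, d_model=256, n_heads=4,
                 n_layers=3, n_classes=100 + 26):  # WLASL Top-100 + fingerspelled A-Z
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                  # x: (batch, frames, 543 * 3)
        h = self.encoder(self.proj(x))     # (batch, frames, d_model)
        return self.head(h.mean(dim=1))    # mean-pool over time -> logits

# One 32-frame window of flattened x/y/z landmarks
logits = SignClassifier()(torch.randn(1, 32, 543 * 3))
```

A TorchScript export (e.g. `torch.jit.trace(model, example_input)`) is what the "TorchScript on MI300X" box in the architecture diagram assumes.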
## Datasets

- [WLASL](https://github.com/dxli94/WLASL) Top-100 subset (selection sketch below)
- ASL fingerspelling alphabet (open dataset)
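
As a reference for the first dataset above, a sketch of cutting a Top-100 subset from the WLASL index file (`WLASL_v0.3.json` in the linked repo). The field names follow that file's published schema; selecting by instance count here is a stand-in, since the repo also ships canonical 100-gloss split files.

```python
import json

# WLASL_v0.3.json: [{"gloss": "book", "instances": [{"video_id": ..., "split": ...}, ...]}, ...]
with open("WLASL_v0.3.json") as f:
    entries = json.load(f)

# Keep the 100 glosses with the most video instances as a Top-100 subset.
top100 = sorted(entries, key=lambda e: len(e["instances"]), reverse=True)[:100]
label_map = {e["gloss"]: i for i, e in enumerate(top100)}
print(len(label_map), "glosses, e.g.", list(label_map)[:5])
```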
## ROCm / AMD Developer Cloud experience

> *Filled in across Day 1–3.*

### Day 1 – environment + sanity
TODO

### Day 2 – training the classifier
TODO

### Day 3 – serving + latency tuning
TODO

### What worked well
TODO

### What we'd flag as friction
TODO
## Why AMD MI300X – concretely

The pipeline (MediaPipe Holistic + Qwen3-VL-8B + Llama-3.1-8B + Coqui XTTS-v2)
fits comfortably on a single MI300X with KV-cache headroom. The same workload
on NVIDIA forces sharding once we add the V2 reasoner.

| Component | Weights (FP16) | MI300X 1× (192 GB) | H100 80 GB | H200 141 GB |
|---|---|---|---|---|
| Qwen3-VL-8B (vision) | ~16 GB | ✅ fits | ✅ | ✅ |
| Llama-3.1-8B (composer) | ~16 GB | ✅ fits | ✅ | ✅ |
| Whisper-large-v3 (V2 reverse direction) | ~3 GB | ✅ fits | ⚠️ tight | ✅ |
| Coqui XTTS-v2 (TTS) | ~2 GB | ✅ fits | ⚠️ tight | ✅ |
| (V2) Llama-3.1-70B FP8 reasoner upgrade | ~70 GB | ✅ still fits | ❌ doesn't fit at all | ⚠️ FP8 only, no headroom |
| **Concurrent serving + KV cache** | – | ✅ comfortable | ❌ requires sharding | ⚠️ tight |

The single-GPU concurrency story is the AMD pitch. V1 fits anywhere; the
architecture has clear MI300X headroom for V2 model upgrades that NVIDIA
H100 cannot match without sharding across multiple cards.
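
What the single-GPU concurrency claim looks like in code, as a sketch: cap vLLM's share of the 192 GB so the classifier and XTTS-v2 can sit on the same device, and let the rest of vLLM's budget become KV cache. The 0.30 fraction, sampling settings, and gloss prompt are illustrative assumptions, not the shipped configuration.

```python
from vllm import LLM, SamplingParams

# Give vLLM ~30% of the MI300X's 192 GB; the classifier and XTTS-v2 share the rest.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          gpu_memory_utilization=0.30,
          max_model_len=4096)

params = SamplingParams(temperature=0.2, max_tokens=30)  # short, <= 30-token compositions

def compose(glosses: list[str]) -> str:
    prompt = ("Rewrite this ASL gloss sequence as one natural English sentence: "
              + " ".join(glosses) + "\nSentence:")
    return llm.generate([prompt], params)[0].outputs[0].text.strip()

print(compose(["ME", "GO", "STORE", "TOMORROW"]))  # e.g. "I am going to the store tomorrow."
```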
## Deployment ethics

SignBridge is a *substrate*, not a finished product. We ship the open-source
multi-modal pipeline so Deaf-led organisations – schools-for-the-Deaf, regional
NGOs, ministries of social services – can deploy on their own AMD compute,
fine-tune for their dialect, and own the deployment.
Three principles, drawn from the Deaf-led literature on sign-language AI:

1. **ASL-only V1** is a scope decision. Sign languages are not interchangeable
   – BSL, ISL, MSL, CSL, and 200+ other sign languages each deserve their own
   teams, training data, and Deaf community leadership. Bragg et al.,
   ["Systemic Biases in Sign Language AI Research"](https://arxiv.org/html/2403.02563v1)
   (2024, Deaf-led position paper), is direct on this point.
2. **Deaf community engagement before deployment.** Per the ACM ASSETS 2025
   paper ["Exploring Collaboration to Center the Deaf Community in Sign Language AI"](https://dl.acm.org/doi/10.1145/3663547.3746390),
   the productive ML/Deaf collaboration question isn't "how do we build this?"
   but "*should* we build this, *for whom*, *with whom*?". Any deployment
   downstream of this code must answer that locally.
3. **Privacy by default.** SignBridge sessions are ephemeral: webcam frames
   and audio are processed in memory and not persisted server-side beyond the
   request lifetime, in the spirit of [Privacy-Aware Sign Language Translation
   at Scale](https://aclanthology.org/2024.acl-long.467.pdf) (ACL 2024).
## Future work – academic foundations we'd build on next

- **SignCLIP** ([EMNLP 2024](https://aclanthology.org/2024.emnlp-main.518.pdf)) –
  learned text–sign embeddings; replaces the prompt-only composer with a
  CLIP-style alignment head for higher-quality sign-to-English mapping.
- **SL-SLR** ([arXiv 2509.05188v1](https://arxiv.org/html/2509.05188v1)) –
  self-supervised representation learning with motion-aware data augmentation;
  the right path if we ever train a custom classifier on raw signer footage.
- **Continuous SLT trained models** (Swin-MSTP, Stack Transformer) – the
  current trained-from-scratch ceiling on WLASL is ~93.5% Top-1. The VLM
  zero-shot path we ship here is a *deployment-cost* play, not an
  accuracy-ceiling play; SignCLIP-style learned embeddings are the natural
  V2 step toward that ceiling.
## Latency

Target: ≤ 2 s from end-of-sign to start of speech.

Measured on a single MI300X (Day 3):

- MediaPipe Holistic per frame: TODO ms
- Classifier per window: TODO ms
- Llama-3.1-8B sentence composition (≤ 30 tokens): TODO ms
- XTTS-v2 first-audio-chunk: TODO ms
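
For filling in those numbers, a minimal per-stage timer we could drop around each call. Stage bodies are placeholders here; on ROCm builds of PyTorch the HIP device is exposed through the `torch.cuda` namespace, so the same synchronize call applies on MI300X.

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def stage(name: str, timings: dict):
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # works on ROCm: HIP is exposed via torch.cuda
    t0 = time.perf_counter()
    yield
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    timings[name] = (time.perf_counter() - t0) * 1000  # ms

timings: dict = {}
with stage("classifier_per_window", timings):
    pass  # classifier(seq) placeholder
with stage("llama_compose", timings):
    pass  # compose(gloss) placeholder
with stage("xtts_first_chunk", timings):
    pass  # first XTTS-v2 audio chunk placeholder
print({k: f"{v:.2f} ms" for k, v in timings.items()})
```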
## MI300X vs NVIDIA H100 – the AMD pitch

| Item | MI300X (1 GPU) | H100 (1 GPU) | H100 cluster needed |
|---|---|---|---|
| Llama-3.1-8B FP16 weights | ✅ fits with margin | ✅ fits with margin | 1× |
| + XTTS-v2 + Whisper-large-v3 + classifier | ✅ all concurrent | ⚠️ tight (~28 GB total + KV) | likely 1× but no headroom |
| + 70B reasoner upgrade (V2) | ✅ 70B FP8 ~70 GB still fits | ❌ doesn't fit at all | ≥3× |

The single-GPU concurrency story is the AMD pitch. This V1 fits on H100;
the architecture has clear headroom on MI300X for higher-quality V2 models.
## License

MIT.