# SignBridge: technical walkthrough
> Internal technical record of the build. Not a submission deliverable
> (Build-in-Public extra challenge was dropped on 2026-05-07).
> Kept around because it documents the AMD-specific engineering thinking
> and is useful if anyone later asks "why these design choices?".
## What we built
A real-time webcam-based ASL → English speech translator. A deaf user signs
into the webcam; the pipeline (MediaPipe Holistic → trained sign classifier
→ Llama-3.1-8B sentence composer → Coqui XTTS-v2) returns spoken English
in under 2 seconds. Designed to fit Track 3 (Vision & Multimodal AI), with
the entire model stack running concurrently on a single AMD Instinct MI300X.
## Why AMD MI300X
- 192 GB HBM3: the trained classifier (~20 MB), Llama-3.1-8B (~16 GB FP16),
  XTTS-v2 (~2 GB), and (V2 stretch) Whisper-large-v3 (~3 GB) all fit
  concurrently with margin for KV cache (see the budget sketch below).
- 5.3 TB/s memory bandwidth: the streaming workload (many small classifier
  inferences per second + chunked TTS decode + LLM next-token generation) is
  bandwidth-bound, which is exactly where bandwidth wins.
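
To make "fits with margin" checkable, here is a back-of-envelope budget using
the weight sizes quoted above; the KV-cache and runtime-overhead figures are
our assumptions, not measurements:

```python
# Back-of-envelope VRAM budget for the stack above, in GB.
weights_gb = {
    "signbridge-classifier (TorchScript)": 0.02,  # ~20 MB
    "llama-3.1-8b (FP16)": 16,
    "xtts-v2": 2,
    "whisper-large-v3 (V2 stretch)": 3,
}
kv_cache_gb = 24          # assumption: generous vLLM KV-cache allocation
runtime_overhead_gb = 8   # assumption: HIP context, activations, buffers

total_gb = sum(weights_gb.values()) + kv_cache_gb + runtime_overhead_gb
print(f"weights {sum(weights_gb.values()):.1f} GB, "
      f"total {total_gb:.1f} GB of 192 GB -> "
      f"{192 - total_gb:.1f} GB headroom")
```

Even with deliberately generous cache and overhead assumptions, the V1 stack
uses well under half the card.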
## Architecture
```
webcam frames → MediaPipe Holistic → trained classifier
                    (CPU-fast)     (TorchScript on MI300X)
                                            │
                                            ▼
                            Llama-3.1-8B sentence composer
                                   (vLLM on MI300X)
                                            │
                                            ▼
                                    XTTS-v2 → audio
                                 (XTTS-v2 on MI300X)
```
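
A minimal sketch of the loop the diagram implies. The stage callables
(`extract_landmarks`, `classify_window`, `compose_sentence`, `synthesize`,
`play_audio`) and the `<blank>` / `<end-of-sign>` sentinel glosses are
hypothetical placeholders for the real stage interfaces:

```python
from collections import deque

WINDOW = 30  # assumption: ~1 s of landmarks at 30 fps per classification window

def run_pipeline(camera, extract_landmarks, classify_window,
                 compose_sentence, synthesize, play_audio):
    """Webcam -> landmarks -> glosses -> English sentence -> speech.

    The five callables are hypothetical stand-ins for MediaPipe Holistic,
    the TorchScript classifier, the vLLM composer, XTTS-v2, and the audio
    sink respectively.
    """
    window = deque(maxlen=WINDOW)   # sliding window of per-frame landmarks
    glosses = []
    for frame in camera:                            # blocking frame iterator
        window.append(extract_landmarks(frame))     # CPU-side MediaPipe
        if len(window) == WINDOW:
            gloss = classify_window(list(window))   # MI300X, TorchScript
            if gloss == "<end-of-sign>":            # assumed sentinel gloss
                sentence = compose_sentence(glosses)   # MI300X, vLLM
                play_audio(synthesize(sentence))       # MI300X, XTTS-v2
                glosses.clear()
            elif gloss != "<blank>":
                glosses.append(gloss)
```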
## Models
| Component | Source | Notes |
|---|---|---|
| Pose extractor | MediaPipe Holistic (Google) | CPU-fast preprocessing; not GPU-bound |
| Sign classifier | trained from scratch on WLASL Top-100 + ASL fingerspelling alphabet | 3-layer transformer encoder over 543-landmark Holistic sequences; published to HF Hub at `lucas-loo/signbridge-classifier` |
| Sentence composer | `meta-llama/Llama-3.1-8B-Instruct` | Pulled from HF Hub; served on MI300X via vLLM |
| Text-to-speech | `coqui/XTTS-v2` | Multilingual; we use the English voice in V1 |
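
For concreteness, a minimal PyTorch sketch of the classifier shape described
in the table. Only the 3 encoder layers and the 543-landmark input come from
the table; `d_model`, head count, class count, and the mean-pooling are
illustrative assumptions, not the published `lucas-loo/signbridge-classifier`
configuration:

```python
import torch
import torch.nn as nn

class SignClassifier(nn.Module):
    """3-layer transformer encoder over MediaPipe Holistic landmark sequences."""

    def __init__(self, n_landmarks=543, coords=3, d_model=256,
                 n_layers=3, n_heads=4, n_classes=128):
        super().__init__()
        # Flattened per-frame landmarks (543 x 3) projected to d_model.
        self.proj = nn.Linear(n_landmarks * coords, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)  # gloss logits

    def forward(self, x):                    # x: (batch, frames, 543, 3)
        b, t = x.shape[:2]
        h = self.proj(x.reshape(b, t, -1))   # (batch, frames, d_model)
        h = self.encoder(h).mean(dim=1)      # mean-pool over time
        return self.head(h)                  # (batch, n_classes)

# TorchScript export for serving on the MI300X, as noted in the table:
# scripted = torch.jit.script(SignClassifier())
```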
## Datasets
- [WLASL](https://github.com/dxli94/WLASL) Top-100 subset
- ASL fingerspelling alphabet (open dataset)
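
A sketch of how the Top-100 subset is carved out, assuming the
`WLASL_v0.3.json` index from the WLASL repo (a list of
`{"gloss": ..., "instances": [...]}` entries, conventionally ordered so the
first 100 glosses form the WLASL100 subset):

```python
import json

# Assumption: WLASL_v0.3.json is the index file shipped with dxli94/WLASL.
with open("WLASL_v0.3.json") as f:
    index = json.load(f)

top100 = index[:100]   # first 100 glosses = WLASL100 subset
videos = [(entry["gloss"], inst["video_id"])
          for entry in top100
          for inst in entry["instances"]]
print(f"{len(top100)} glosses, {len(videos)} video instances")
```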
## ROCm / AMD Developer Cloud experience
> *Filled in across Days 1–3.*
### Day 1: environment + sanity
TODO
### Day 2: training the classifier
TODO
### Day 3: serving + latency tuning
TODO
### What worked well
TODO
### What we'd flag as friction
TODO
## Why AMD MI300X, concretely
The pipeline (MediaPipe Holistic + Qwen3-VL-8B + Llama-3.1-8B + Coqui XTTS-v2)
fits comfortably on a single MI300X with KV-cache headroom. The same workload
on NVIDIA hardware forces sharding once the V2 reasoner is added.
| Component | Weights (FP16) | MI300X 1× (192 GB) | H100 80 GB | H200 141 GB |
|---|---|---|---|---|
| Qwen3-VL-8B (vision) | ~16 GB | ✅ fits | ✅ | ✅ |
| Llama-3.1-8B (composer) | ~16 GB | ✅ fits | ✅ | ✅ |
| Whisper-large-v3 (V2 reverse direction) | ~3 GB | ✅ fits | ⚠ tight | ✅ |
| Coqui XTTS-v2 (TTS) | ~2 GB | ✅ fits | ⚠ tight | ✅ |
| (V2) Llama-3.1-70B FP8 reasoner upgrade | ~70 GB | ✅ still fits | ❌ doesn't fit | ⚠ FP8 only, no headroom |
| **Concurrent serving + KV cache** | | ✅ comfortable | ❌ requires sharding | ⚠ tight |
The single-GPU concurrency story is the AMD pitch: V1 fits anywhere, but the
architecture has clear MI300X headroom for V2 model upgrades that an NVIDIA
H100 cannot match without sharding across multiple cards.
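
The KV-cache headroom claim is easy to sanity-check for the 8B composer:
Llama-3.1-8B uses grouped-query attention with 32 layers and 8 KV heads of
dimension 128, so FP16 KV cache costs 128 KiB per token. The context length
and concurrency below are illustrative assumptions:

```python
# FP16 KV-cache cost per token for Llama-3.1-8B (32 layers, 8 KV heads,
# head_dim 128, grouped-query attention): keys + values, 2 bytes each.
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # 131072 B = 128 KiB

ctx, sessions = 8192, 16   # assumption: 8k context, 16 concurrent sessions
total_gib = per_token * ctx * sessions / 1024**3
print(f"{per_token / 1024:.0f} KiB/token -> {total_gib:.1f} GiB KV cache")
# ~16 GiB: trivial next to 192 GB, but exactly the margin that disappears
# on an 80 GB H100 once the 70B reasoner weights move in.
```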
## Deployment ethics
SignBridge is a *substrate*, not a finished product. We ship the open-source
multimodal pipeline so Deaf-led organisations (schools for the Deaf, regional
NGOs, ministries of social services) can deploy on their own AMD compute,
fine-tune for their dialect, and own the deployment.
Three principles, drawn from the Deaf-led literature on sign-language AI:
1. **ASL-only V1** is a scope decision. Sign languages are not interchangeable:
   BSL, ISL, MSL, CSL, and 200+ other sign languages each deserve their own
   teams, training data, and Deaf community leadership. Bragg et al.,
   ["Systemic Biases in Sign Language AI Research"](https://arxiv.org/html/2403.02563v1)
   (2024, a Deaf-led position paper), is direct on this point.
2. **Deaf community engagement before deployment.** Per the ACM ASSETS 2025
   paper ["Exploring Collaboration to Center the Deaf Community in Sign Language AI"](https://dl.acm.org/doi/10.1145/3663547.3746390),
   the productive ML/Deaf collaboration question isn't "how do we build this?"
   but "*should* we build this, *for whom*, *with whom*?". Any deployment
   downstream of this code must answer that locally.
3. **Privacy by default.** SignBridge sessions are ephemeral: webcam frames
   and audio are processed in memory and not persisted server-side beyond the
   request lifetime, in the spirit of [Privacy-Aware Sign Language Translation
   at Scale](https://aclanthology.org/2024.acl-long.467.pdf) (ACL 2024).
## Future work: academic foundations we'd build on next
- **SignCLIP** ([EMNLP 2024](https://aclanthology.org/2024.emnlp-main.518.pdf)):
  learned text↔sign embeddings; replaces the prompt-only composer with a
  CLIP-style alignment head for higher-quality sign-to-English mapping (see
  the sketch after this list).
- **SL-SLR** ([arXiv 2509.05188v1](https://arxiv.org/html/2509.05188v1)):
  self-supervised representation learning with motion-aware data augmentation;
  the right path if we ever train a custom classifier on raw signer footage.
- **Continuous SLT models** (Swin-MSTP, Stack Transformer): the current
  trained-from-scratch ceiling on WLASL is ~93.5% Top-1. The VLM
  zero-shot path we ship here is a *deployment-cost* play, not an
  accuracy-ceiling play; SignCLIP-style learned embeddings are the natural
  V2 step toward that ceiling.
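
For concreteness, the SignCLIP direction boils down to a symmetric CLIP-style
contrastive objective between sign-clip embeddings and text embeddings. This
is a generic InfoNCE sketch under that reading, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(sign_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (sign clip, text) embeddings.

    sign_emb, text_emb: (batch, dim) outputs of a sign encoder and a text
    encoder; matching rows are positive pairs. Generic CLIP objective,
    not SignCLIP's exact recipe.
    """
    sign = F.normalize(sign_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    logits = sign @ text.T / temperature            # (batch, batch) similarities
    labels = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, labels)          # sign -> text
            + F.cross_entropy(logits.T, labels)) / 2  # text -> sign
```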
## Latency
Target: ≤ 2 s from end-of-sign to start of speech.
Measured on a single MI300X (Day 3):
- MediaPipe Holistic per frame: TODO ms
- Classifier per window: TODO ms
- Llama-3.1-8B sentence composition (≤ 30 tokens): TODO ms
- XTTS-v2 first-audio-chunk: TODO ms
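
A minimal harness we'd use to fill these in, assuming each stage can be called
in isolation (for GPU stages, the callable must synchronize before returning,
or the wall-clock numbers are meaningless):

```python
import time
import statistics

def time_stage(fn, inputs, warmup=5, runs=50):
    """Wall-clock a pipeline stage; returns (p50_ms, p95_ms) over warm runs."""
    for x in inputs[:warmup]:      # warm-up: JIT compilation, caches, clocks
        fn(x)
    samples = []
    for x in inputs[:runs]:
        t0 = time.perf_counter()
        fn(x)                      # stage must block until work completes
        samples.append((time.perf_counter() - t0) * 1000)
    return (statistics.median(samples),
            statistics.quantiles(samples, n=20)[-1])   # 95th percentile
```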
## MI300X vs NVIDIA H100: the AMD pitch
| Item | MI300X (1 GPU) | H100 (1 GPU) | H100 cluster needed |
|---|---|---|---|
| Llama-3.1-8B FP16 weights | ✅ fits with margin | ✅ fits with margin | 1× |
| + XTTS-v2 + Whisper-large-v3 + classifier | ✅ all concurrent | ⚠️ tight (~21 GB weights + KV) | likely 1×, but no headroom |
| + 70B reasoner upgrade (V2) | ✅ 70B FP8 (~70 GB) still fits | ❌ doesn't fit | ≥3× |
In short: this V1 fits on an H100, but only the MI300X has clear headroom for
the higher-quality V2 models while staying on a single GPU.
## License
MIT.