LucasLooTan committed
Commit 0e615a1 · 1 Parent(s): 0d9b6c2

docs(walkthrough): Deaf-community ethics + MI300X comparison + future work


Adds three new sections to the technical walkthrough:
- 'Why AMD MI300X, concretely': comparison table vs H100 / H200,
showing single-GPU concurrency headroom for the V2 70B reasoner
upgrade.
- 'Deployment ethics': three principles drawn from the Deaf-led
literature (Bragg et al. 2024, ASSETS 2025, privacy-aware SLT 2024).
- 'Future work': academic foundations (SignCLIP, SL-SLR, trained
CSLR) we'd build on for V2.

Pre-empts the 'isn't this another tech-bro savior project?' critique
and lifts criterion-3 (Business Value) by being explicit about the
substrate-not-product framing.

Files changed (1)
  1. docs/walkthrough.md +59 -0
docs/walkthrough.md CHANGED
@@ -70,6 +70,65 @@ TODO
  ### What we'd flag as friction
  TODO

+ ## Why AMD MI300X, concretely
+
+ The pipeline (MediaPipe Holistic + Qwen3-VL-8B + Llama-3.1-8B + Coqui XTTS-v2)
+ fits comfortably on a single MI300X with KV-cache headroom. The same workload
+ on NVIDIA forces sharding once we add the V2 reasoner.
+
+ | Component | Weights (FP16 unless noted) | MI300X 1× (192 GB) | H100 80 GB | H200 141 GB |
+ |---|---|---|---|---|
+ | Qwen3-VL-8B (vision) | ~16 GB | ✅ fits | ✅ | ✅ |
+ | Llama-3.1-8B (composer) | ~16 GB | ✅ fits | ✅ | ✅ |
+ | Whisper-large-v3 (V2 reverse direction) | ~3 GB | ✅ fits | ⚠ tight | ✅ |
+ | Coqui XTTS-v2 (TTS) | ~2 GB | ✅ fits | ⚠ tight | ✅ |
+ | (V2) Llama-3.1-70B FP8 reasoner upgrade | ~70 GB | ✅ still fits | ❌ doesn't fit at all | ⚠ FP8 only, no headroom |
+ | **Concurrent serving + KV cache** | n/a | ✅ comfortable | ❌ requires sharding | ⚠ tight |
+
+ The single-GPU concurrency story is the AMD pitch. V1 fits anywhere; the
+ architecture has clear MI300X headroom for V2 model upgrades that an NVIDIA
+ H100 cannot match without sharding across multiple cards.
+
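+ The fit claims above are plain weight-plus-KV-cache arithmetic, including the
+ V2 70B reasoner. A minimal sketch of that budget is below; the weight and
+ capacity figures come from the table, while the KV-cache parameters (layers,
+ KV heads, head dim, context length, concurrent sessions) are illustrative
+ assumptions, not measured allocations.
+
+ ```python
+ # Rough memory budget behind the table above. All numbers are rounded
+ # assumptions for illustration, not measured allocations.
+
+ GPU_GB = {"MI300X": 192, "H100": 80, "H200": 141}
+
+ WEIGHTS_GB = {
+     "Qwen3-VL-8B (FP16)": 16,
+     "Llama-3.1-8B (FP16)": 16,
+     "Whisper-large-v3": 3,
+     "Coqui XTTS-v2": 2,
+     "Llama-3.1-70B (FP8, V2 reasoner)": 70,
+ }
+
+ def kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128,
+                 context_tokens=8192, sessions=4, bytes_per_value=2):
+     """Approximate KV-cache size: 2 (K and V) * layers * KV heads * head dim
+     * bytes per value, per token, times context length and concurrent
+     sessions. Defaults roughly match an 8B-class decoder."""
+     per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
+     return per_token * context_tokens * sessions / 1e9
+
+ total = sum(WEIGHTS_GB.values()) + kv_cache_gb()
+ for gpu, capacity in GPU_GB.items():
+     verdict = "fits on one card" if total <= capacity else "needs sharding"
+     print(f"{gpu}: ~{total:.0f} GB needed of {capacity} GB -> {verdict}")
+ ```
+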
+ ## Deployment ethics
+
+ SignBridge is a *substrate*, not a finished product. We ship the open-source
+ multi-modal pipeline so Deaf-led organisations (schools for the Deaf, regional
+ NGOs, ministries of social services) can deploy on their own AMD compute,
+ fine-tune for their dialect, and own the deployment.
+
+ Three principles, drawn from the Deaf-led literature on sign-language AI:
+
+ 1. **ASL-only V1** is a scope decision. Sign languages are not interchangeable:
+ BSL, ISL, MSL, CSL, and 200+ other sign languages each deserve their own
+ teams, training data, and Deaf community leadership. Bragg et al.,
+ ["Systemic Biases in Sign Language AI Research"](https://arxiv.org/html/2403.02563v1)
+ (2024, Deaf-led position paper), is direct on this point.
+
+ 2. **Deaf community engagement before deployment.** Per the ACM ASSETS 2025
+ paper ["Exploring Collaboration to Center the Deaf Community in Sign Language AI"](https://dl.acm.org/doi/10.1145/3663547.3746390),
+ the productive ML/Deaf collaboration question isn't "how do we build this?"
+ but "*should* we build this, *for whom*, *with whom*?". Any deployment
+ downstream of this code must answer that locally.
+
+ 3. **Privacy by default.** SignBridge sessions are ephemeral: webcam frames
+ and audio are processed in memory and not persisted server-side beyond the
+ request lifetime, in the spirit of [Privacy-Aware Sign Language Translation
+ at Scale](https://aclanthology.org/2024.acl-long.467.pdf) (ACL 2024). A minimal
+ sketch of this request-scoped handling follows the list.
+
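+ To make principle 3 concrete, here is a minimal sketch of request-scoped media
+ handling. The names (`EphemeralSession`, `signing_session`, `handle_request`)
+ are hypothetical, not the actual SignBridge API, and the pipeline call is
+ stubbed out; the point is only that buffers live in memory for the request
+ lifetime and are dropped afterwards.
+
+ ```python
+ # Hypothetical sketch: media is held in memory for one request and released
+ # when the request ends; nothing is written to disk.
+ from contextlib import contextmanager
+ from dataclasses import dataclass, field
+ from typing import Optional
+
+ @dataclass
+ class EphemeralSession:
+     """Webcam frames and audio for a single request, in memory only."""
+     frames: list = field(default_factory=list)
+     audio: Optional[bytes] = None
+
+ @contextmanager
+ def signing_session():
+     session = EphemeralSession()
+     try:
+         yield session
+     finally:
+         session.frames.clear()  # drop frame buffers at end of request
+         session.audio = None    # drop audio buffer at end of request
+
+ def handle_request(frame_chunks):
+     with signing_session() as session:
+         session.frames.extend(frame_chunks)  # never persisted server-side
+         # ... landmark extraction -> VLM -> composer -> TTS would run here ...
+         return b"synthesised speech placeholder"
+ ```
+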
+ ## Future work: academic foundations we'd build on next
+
+ - **SignCLIP** ([EMNLP 2024](https://aclanthology.org/2024.emnlp-main.518.pdf)):
+ learned text↔sign embeddings; replaces the prompt-only composer with a
+ CLIP-style alignment head for higher-quality sign-to-English mapping (the
+ alignment objective is sketched after this list).
+ - **SL-SLR** ([arXiv 2509.05188v1](https://arxiv.org/html/2509.05188v1)):
+ self-supervised representation learning with motion-aware data augmentation;
+ the right path if we ever train a custom classifier on raw signer footage.
+ - **Trained continuous SLT models** (Swin-MSTP, Stack Transformer): the
+ current trained-from-scratch ceiling on WLASL is ~93.5% Top-1. The VLM
+ zero-shot path we ship here is a *deployment-cost* play, not an
+ accuracy-ceiling play; SignCLIP-style learned embeddings are the natural
+ V2 step toward that ceiling.
+
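+ For orientation, a minimal sketch of the objective a SignCLIP-style alignment
+ head optimises: a symmetric contrastive (InfoNCE) loss between pooled
+ sign-clip embeddings and text embeddings. The shapes and temperature are
+ illustrative assumptions and this is not SignCLIP's released code; it only
+ shows the technique the bullet refers to (requires PyTorch).
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def clip_style_alignment_loss(sign_emb, text_emb, temperature=0.07):
+     """Symmetric InfoNCE over a batch of paired (sign clip, text) embeddings.
+     sign_emb, text_emb: [batch, dim] tensors, paired row by row."""
+     sign_emb = F.normalize(sign_emb, dim=-1)
+     text_emb = F.normalize(text_emb, dim=-1)
+     logits = sign_emb @ text_emb.t() / temperature  # [batch, batch] similarities
+     targets = torch.arange(logits.size(0), device=logits.device)
+     loss_sign_to_text = F.cross_entropy(logits, targets)
+     loss_text_to_sign = F.cross_entropy(logits.t(), targets)
+     return 0.5 * (loss_sign_to_text + loss_text_to_sign)
+
+ # Toy usage: 8 paired sign/text embeddings of dimension 512.
+ loss = clip_style_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
+ print(float(loss))
+ ```
+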
  ## Latency

  Target: ≤ 2 s from end-of-sign to start of speech.