Commit · 0e615a1
Parent(s): 0d9b6c2
docs(walkthrough): Deaf-community ethics + MI300X comparison + future work
Adds three new sections to the technical walkthrough:

- "Why AMD MI300X, concretely": comparison table vs H100 / H200, showing
  single-GPU concurrency headroom for the V2 70B reasoner upgrade.
- "Deployment ethics": three principles drawn from the Deaf-led literature
  (Bragg et al. 2024, ASSETS 2025, privacy-aware SLT 2024).
- "Future work": academic foundations (SignCLIP, SL-SLR, trained CSLR) we'd
  build on for V2.

Pre-empts the "isn't this another tech-bro savior project?" critique and lifts
criterion 3 (Business Value) by being explicit about the substrate-not-product
framing.
- docs/walkthrough.md +59 -0
docs/walkthrough.md (CHANGED)
@@ -70,6 +70,65 @@
### What we'd flag as friction

TODO

## Why AMD MI300X, concretely

The pipeline (MediaPipe Holistic + Qwen3-VL-8B + Llama-3.1-8B + Coqui XTTS-v2)
fits comfortably on a single MI300X with KV-cache headroom. The same workload
on NVIDIA forces sharding once we add the V2 reasoner.

| Component | Weights (FP16) | MI300X 1× (192 GB) | H100 80 GB | H200 141 GB |
|---|---|---|---|---|
| Qwen3-VL-8B (vision) | ~16 GB | ✅ fits | ✅ | ✅ |
| Llama-3.1-8B (composer) | ~16 GB | ✅ fits | ✅ | ✅ |
| Whisper-large-v3 (V2 reverse direction) | ~3 GB | ✅ fits | ⚠️ tight | ✅ |
| Coqui XTTS-v2 (TTS) | ~2 GB | ✅ fits | ⚠️ tight | ✅ |
| (V2) Llama-3.1-70B FP8 reasoner upgrade | ~70 GB | ✅ still fits | ❌ doesn't fit at all | ⚠️ FP8 only, no headroom |
| **Concurrent serving + KV cache** | n/a | ✅ comfortable | ❌ requires sharding | ⚠️ tight |

The single-GPU concurrency story is the AMD pitch. V1 fits anywhere; the
architecture has clear MI300X headroom for V2 model upgrades that an NVIDIA
H100 cannot match without sharding across multiple cards.
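A back-of-envelope budget makes the table's fit claims concrete. The sketch
below is illustrative only: the per-model sizes are the rough estimates from
the table, and `KV_CACHE_GB` is an assumed serving allowance, not a measured
number.

```python
# Illustrative VRAM budget for the V1 pipeline plus the V2 reasoner upgrade.
# Sizes are the rough estimates from the table above; KV_CACHE_GB is an
# assumed allowance for concurrent-request KV cache, not a measured figure.
WEIGHTS_GB = {
    "Qwen3-VL-8B (vision)": 16,
    "Llama-3.1-8B (composer)": 16,
    "Whisper-large-v3 (V2 reverse)": 3,
    "Coqui XTTS-v2 (TTS)": 2,
    "Llama-3.1-70B FP8 (V2 reasoner)": 70,
}
KV_CACHE_GB = 25  # assumption: modest headroom for concurrent requests

GPU_VRAM_GB = {"MI300X": 192, "H100": 80, "H200": 141}

need = sum(WEIGHTS_GB.values()) + KV_CACHE_GB
for gpu, vram in GPU_VRAM_GB.items():
    margin = vram - need
    verdict = f"fits, ~{margin} GB spare" if margin >= 0 else "needs sharding"
    print(f"{gpu}: need ~{need} GB of {vram} GB -> {verdict}")
```

Dropping the 70B reasoner from the dictionary puts V1 at roughly 60 GB under
the same assumption, which is the sense in which "V1 fits anywhere".
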
## Deployment ethics

SignBridge is a *substrate*, not a finished product. We ship the open-source
multi-modal pipeline so Deaf-led organisations (schools for the Deaf, regional
NGOs, ministries of social services) can deploy on their own AMD compute,
fine-tune for their dialect, and own the deployment.

Three principles, drawn from the Deaf-led literature on sign-language AI:

1. **ASL-only in V1 is a scope decision.** Sign languages are not
   interchangeable: BSL, ISL, MSL, CSL, and 200+ other sign languages each
   deserve their own teams, training data, and Deaf community leadership.
   Bragg et al., ["Systemic Biases in Sign Language AI Research"](https://arxiv.org/html/2403.02563v1)
   (2024, Deaf-led position paper), is direct on this point.

2. **Deaf community engagement before deployment.** Per the ACM ASSETS 2025
   paper ["Exploring Collaboration to Center the Deaf Community in Sign Language AI"](https://dl.acm.org/doi/10.1145/3663547.3746390),
   the productive ML/Deaf collaboration question isn't "how do we build this?"
   but "*should* we build this, *for whom*, *with whom*?" Any deployment
   downstream of this code must answer that locally.

3. **Privacy by default.** SignBridge sessions are ephemeral: webcam frames
   and audio are processed in memory and not persisted server-side beyond the
   request lifetime, in the spirit of [Privacy-Aware Sign Language Translation
   at Scale](https://aclanthology.org/2024.acl-long.467.pdf) (ACL 2024). A
   minimal sketch of what "ephemeral" means at the serving layer follows below.

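As a concrete illustration of the third principle, here is a minimal sketch of
an "ephemeral by default" handler. It assumes a FastAPI-style endpoint and a
hypothetical `translate_frames()` entry point standing in for the real
pipeline; the actual SignBridge server may be organised differently.

```python
# Illustrative only: an "ephemeral by default" request handler.
# `translate_frames` is a hypothetical stand-in for the real pipeline call
# (MediaPipe -> Qwen3-VL -> Llama composer -> XTTS); the real server may differ.
import base64

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

async def translate_frames(frames: bytes) -> bytes:
    """Hypothetical pipeline entry point: signed video in, speech audio out."""
    raise NotImplementedError

@app.post("/translate")
async def translate(video: UploadFile = File(...)) -> dict:
    frames = await video.read()            # held in memory only
    audio = await translate_frames(frames)
    # Nothing is written to disk and payloads are never logged; once the
    # response goes out, frames and audio simply fall out of scope.
    return {"audio_b64": base64.b64encode(audio).decode("ascii")}
```
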
## Future work: academic foundations we'd build on next

- **SignCLIP** ([EMNLP 2024](https://aclanthology.org/2024.emnlp-main.518.pdf)):
  learned text-sign embeddings; replaces the prompt-only composer with a
  CLIP-style alignment head for higher-quality sign-to-English mapping (a toy
  sketch of the retrieval step follows this list).
- **SL-SLR** ([arXiv 2509.05188v1](https://arxiv.org/html/2509.05188v1)):
  self-supervised representation learning with motion-aware data augmentation;
  the right path if we ever train a custom classifier on raw signer footage.
- **Continuous SLT trained models** (Swin-MSTP, Stack Transformer): the current
  trained-from-scratch ceiling on WLASL is ~93.5% Top-1. The VLM zero-shot path
  we ship here is a *deployment-cost* play, not an accuracy-ceiling play;
  SignCLIP-style learned embeddings are the natural V2 step toward that
  ceiling.

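To make the SignCLIP bullet concrete, here is a toy sketch of the CLIP-style
retrieval step: L2-normalised sign-clip and text embeddings compared by cosine
similarity. The random vectors stand in for trained encoders; this shows the
shape of the computation, not the SignCLIP API.

```python
# Toy CLIP-style sign-to-text retrieval. Random vectors stand in for trained
# sign and text encoders; only the similarity/argmax step is the point here.
import numpy as np

rng = np.random.default_rng(0)
DIM = 512

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical embedding bank for candidate English glosses.
glosses = ["hello", "thank you", "where", "bathroom", "help"]
text_bank = l2_normalize(rng.normal(size=(len(glosses), DIM)))

# Hypothetical embedding of one signed clip (e.g. pooled pose features).
clip_embedding = l2_normalize(rng.normal(size=DIM))

# CLIP-style alignment: cosine similarity against the bank, highest score wins.
scores = text_bank @ clip_embedding
best = int(np.argmax(scores))
print(f"best match: {glosses[best]!r} (cosine {scores[best]:.3f})")
```
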
## Latency

Target: ≤ 2 s from end-of-sign to start of speech.