LucasLooTan committed
Commit 0e615a1 · 1 Parent(s): 0d9b6c2

docs(walkthrough): Deaf-community ethics + MI300X comparison + future work


Adds three new sections to the technical walkthrough:
- 'Why AMD MI300X, concretely': comparison table vs H100 / H200,
showing single-GPU concurrency headroom for the V2 70B reasoner
upgrade.
- 'Deployment ethics': three principles drawn from the Deaf-led
literature (Bragg et al. 2024, ASSETS 2025, privacy-aware SLT 2024).
- 'Future work': academic foundations (SignCLIP, SL-SLR, trained
CSLR) we'd build on for V2.

Pre-empts the 'isn't this another tech-bro savior project?' critique
and lifts criterion-3 (Business Value) by being explicit about the
substrate-not-product framing.

Files changed (1)
  1. docs/walkthrough.md +59 -0
docs/walkthrough.md CHANGED
@@ -70,6 +70,65 @@ TODO
  ### What we'd flag as friction
  TODO

+ ## Why AMD MI300X, concretely
+
+ The pipeline (MediaPipe Holistic + Qwen3-VL-8B + Llama-3.1-8B + Coqui XTTS-v2)
+ fits comfortably on a single MI300X with KV-cache headroom. The same workload
+ on NVIDIA forces sharding once we add the V2 reasoner.
+
+ | Component | Weights (FP16 unless noted) | MI300X 1× (192 GB) | H100 80 GB | H200 141 GB |
+ |---|---|---|---|---|
+ | Qwen3-VL-8B (vision) | ~16 GB | ✅ fits | ✅ | ✅ |
+ | Llama-3.1-8B (composer) | ~16 GB | ✅ fits | ✅ | ✅ |
+ | Whisper-large-v3 (V2 reverse direction) | ~3 GB | ✅ fits | ⚠ tight | ✅ |
+ | Coqui XTTS-v2 (TTS) | ~2 GB | ✅ fits | ⚠ tight | ✅ |
+ | (V2) Llama-3.1-70B FP8 reasoner upgrade | ~70 GB | ✅ still fits | ❌ doesn't fit at all | ⚠ FP8 only, no headroom |
+ | **Concurrent serving + KV cache** | n/a | ✅ comfortable | ❌ requires sharding | ⚠ tight |
+
+ The single-GPU concurrency story is the AMD pitch. V1 fits anywhere; the
+ architecture has clear MI300X headroom for V2 model upgrades that an NVIDIA
+ H100 cannot match without sharding across multiple cards.
+
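+ The fit claims above are plain weight-plus-KV-cache arithmetic, including the
+ V2 70B reasoner. A minimal sketch of that budget is below; the weight and
+ capacity figures come from the table, while the KV-cache parameters (layers,
+ KV heads, head dim, context length, concurrent sessions) are illustrative
+ assumptions, not measured allocations.
+
+ ```python
+ # Rough memory budget behind the table above. All numbers are rounded
+ # assumptions for illustration, not measured allocations.
+
+ GPU_GB = {"MI300X": 192, "H100": 80, "H200": 141}
+
+ WEIGHTS_GB = {
+     "Qwen3-VL-8B (FP16)": 16,
+     "Llama-3.1-8B (FP16)": 16,
+     "Whisper-large-v3": 3,
+     "Coqui XTTS-v2": 2,
+     "Llama-3.1-70B (FP8, V2 reasoner)": 70,
+ }
+
+ def kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128,
+                 context_tokens=8192, sessions=4, bytes_per_value=2):
+     """Approximate KV-cache size: 2 (K and V) * layers * KV heads * head dim
+     * bytes per value, per token, times context length and concurrent
+     sessions. Defaults roughly match an 8B-class decoder."""
+     per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
+     return per_token * context_tokens * sessions / 1e9
+
+ total = sum(WEIGHTS_GB.values()) + kv_cache_gb()
+ for gpu, capacity in GPU_GB.items():
+     verdict = "fits on one card" if total <= capacity else "needs sharding"
+     print(f"{gpu}: ~{total:.0f} GB needed of {capacity} GB -> {verdict}")
+ ```
+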
+ ## Deployment ethics
+
+ SignBridge is a *substrate*, not a finished product. We ship the open-source
+ multi-modal pipeline so Deaf-led organisations (schools for the Deaf, regional
+ NGOs, ministries of social services) can deploy on their own AMD compute,
+ fine-tune for their dialect, and own the deployment.
+
+ Three principles, drawn from the Deaf-led literature on sign-language AI:
+
+ 1. **ASL-only V1** is a scope decision. Sign languages are not interchangeable:
+ BSL, ISL, MSL, CSL, and 200+ other sign languages each deserve their own
+ teams, training data, and Deaf community leadership. Bragg et al.,
+ ["Systemic Biases in Sign Language AI Research"](https://arxiv.org/html/2403.02563v1)
+ (2024, Deaf-led position paper), is direct on this point.
+
+ 2. **Deaf community engagement before deployment.** Per the ACM ASSETS 2025
+ paper ["Exploring Collaboration to Center the Deaf Community in Sign Language AI"](https://dl.acm.org/doi/10.1145/3663547.3746390),
+ the productive ML/Deaf collaboration question isn't "how do we build this?"
+ but "*should* we build this, *for whom*, *with whom*?". Any deployment
+ downstream of this code must answer that locally.
+
+ 3. **Privacy by default.** SignBridge sessions are ephemeral: webcam frames
+ and audio are processed in memory and not persisted server-side beyond the
+ request lifetime, in the spirit of [Privacy-Aware Sign Language Translation
+ at Scale](https://aclanthology.org/2024.acl-long.467.pdf) (ACL 2024). A minimal
+ sketch of this request-scoped handling follows the list.
+
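+ To make principle 3 concrete, here is a minimal sketch of request-scoped media
+ handling. The names (`EphemeralSession`, `signing_session`, `handle_request`)
+ are hypothetical, not the actual SignBridge API, and the pipeline call is
+ stubbed out; the point is only that buffers live in memory for the request
+ lifetime and are dropped afterwards.
+
+ ```python
+ # Hypothetical sketch: media is held in memory for one request and released
+ # when the request ends; nothing is written to disk.
+ from contextlib import contextmanager
+ from dataclasses import dataclass, field
+ from typing import Optional
+
+ @dataclass
+ class EphemeralSession:
+     """Webcam frames and audio for a single request, in memory only."""
+     frames: list = field(default_factory=list)
+     audio: Optional[bytes] = None
+
+ @contextmanager
+ def signing_session():
+     session = EphemeralSession()
+     try:
+         yield session
+     finally:
+         session.frames.clear()  # drop frame buffers at end of request
+         session.audio = None    # drop audio buffer at end of request
+
+ def handle_request(frame_chunks):
+     with signing_session() as session:
+         session.frames.extend(frame_chunks)  # never persisted server-side
+         # ... landmark extraction -> VLM -> composer -> TTS would run here ...
+         return b"synthesised speech placeholder"
+ ```
+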
+ ## Future work: academic foundations we'd build on next
+
+ - **SignCLIP** ([EMNLP 2024](https://aclanthology.org/2024.emnlp-main.518.pdf)):
+ learned text↔sign embeddings; replaces the prompt-only composer with a
+ CLIP-style alignment head for higher-quality sign-to-English mapping (the
+ alignment objective is sketched after this list).
+ - **SL-SLR** ([arXiv 2509.05188v1](https://arxiv.org/html/2509.05188v1)):
+ self-supervised representation learning with motion-aware data augmentation;
+ the right path if we ever train a custom classifier on raw signer footage.
+ - **Trained continuous SLT models** (Swin-MSTP, Stack Transformer): the
+ current trained-from-scratch ceiling on WLASL is ~93.5% Top-1. The VLM
+ zero-shot path we ship here is a *deployment-cost* play, not an
+ accuracy-ceiling play; SignCLIP-style learned embeddings are the natural
+ V2 step toward that ceiling.
+
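+ For orientation, a minimal sketch of the objective a SignCLIP-style alignment
+ head optimises: a symmetric contrastive (InfoNCE) loss between pooled
+ sign-clip embeddings and text embeddings. The shapes and temperature are
+ illustrative assumptions and this is not SignCLIP's released code; it only
+ shows the technique the bullet refers to (requires PyTorch).
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def clip_style_alignment_loss(sign_emb, text_emb, temperature=0.07):
+     """Symmetric InfoNCE over a batch of paired (sign clip, text) embeddings.
+     sign_emb, text_emb: [batch, dim] tensors, paired row by row."""
+     sign_emb = F.normalize(sign_emb, dim=-1)
+     text_emb = F.normalize(text_emb, dim=-1)
+     logits = sign_emb @ text_emb.t() / temperature  # [batch, batch] similarities
+     targets = torch.arange(logits.size(0), device=logits.device)
+     loss_sign_to_text = F.cross_entropy(logits, targets)
+     loss_text_to_sign = F.cross_entropy(logits.t(), targets)
+     return 0.5 * (loss_sign_to_text + loss_text_to_sign)
+
+ # Toy usage: 8 paired sign/text embeddings of dimension 512.
+ loss = clip_style_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
+ print(float(loss))
+ ```
+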
  ## Latency

  Target: ≤ 2 s from end-of-sign to start of speech.