LucasLooTan committed
Commit a5ffd9e · 1 Parent(s): 7dc8ce6

docs: README — fine-tune narrative + hybrid pipeline diagram + model URLs

Files changed (1)
  1. README.md +19 -4
README.md CHANGED
@@ -22,11 +22,26 @@ Submission for the **AMD Developer Hackathon** (LabLab.ai, May 2026) — **Track
  ## How it works

  ```
- webcam frames → Qwen3-VL-32B → Qwen3-8B → Coqui XTTS-v2 speech
-                 (sign vision)  (composer) (TTS)
+                  ┌─► MediaPipe Hand trained MLP (90% acc, 50 ms CPU)
+ webcam frame ────┤         │
+                  └─► fine-tuned Qwen3-VL-8B (LoRA on AMD MI300X)
+                            │ (92% acc, motion + fallback)
+                            ▼
+                  Qwen3-8B sentence composer
+                            │ (AMD MI300X)
+                            ▼
+                  Coqui XTTS-v2 TTS
+                            │
+                            ▼
+                  🔊 speech
  ```

- All three stages run **concurrently on a single AMD Instinct MI300X** via AMD Developer Cloud. Total weights ~34 GB (Qwen3-VL-32B + Qwen3-8B + XTTS-v2) on a 192 GB GPU fits with margin for KV cache + serving overhead. Both LLMs are Qwen-family, served via vLLM 0.17.1 on ROCm 7.2.
+ A hybrid pipeline: a small classical-ML classifier handles static fingerspelling at 90% accuracy with 50 ms CPU latency; a LoRA-fine-tuned Qwen3-VL-8B handles motion-dependent signs and ambiguous static frames; Qwen3-8B turns sign tokens into natural English. The two LLMs run **concurrently on a single AMD Instinct MI300X** via vLLM 0.17.1 on ROCm 7.2 — combined ~34 GB on a 192 GB GPU.
+
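The routing the diagram implies is plain confidence gating. A minimal sketch, assuming an sklearn-style MLP and vLLM's OpenAI-compatible endpoint; the names, prompt, and threshold here are illustrative, not the repo's actual code:

```python
import base64

from openai import OpenAI  # vLLM exposes an OpenAI-compatible API

CONF_THRESHOLD = 0.85  # assumed cut-off for trusting the cheap MLP path
vlm = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def classify_frame(jpeg_bytes: bytes, landmarks, mlp, is_moving: bool) -> str:
    """Return one sign token for a single webcam frame."""
    # Static pose + confident MLP → take the ~50 ms CPU path.
    if not is_moving:
        probs = mlp.predict_proba([landmarks])[0]
        if probs.max() >= CONF_THRESHOLD:
            return mlp.classes_[probs.argmax()]
    # Motion-dependent or ambiguous → fall back to the fine-tuned VLM.
    b64 = base64.b64encode(jpeg_bytes).decode()
    resp = vlm.chat.completions.create(
        model="LucasLooTan/signbridge-qwen3vl-8b-asl",
        messages=[{"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Which ASL sign is shown? Answer with one word."},
        ]}],
    )
    return resp.choices[0].message.content.strip()
```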
+ The fine-tune itself was trained on a single MI300X in **54 minutes** with LoRA (rank 16, target q/k/v/o, 2 epochs on 9,786 ASL Alphabet samples). Final eval loss 0.48; gold-set accuracy 92.3% — a 4.8× lift over the 19.2% zero-shot baseline.
+
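For reference, the stated hyperparameters map onto a PEFT config roughly as below; the base-model id, alpha, and dropout are assumptions, since the text above only fixes rank 16, the q/k/v/o targets, and 2 epochs:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Base checkpoint id is an assumption; the README says only "Qwen3-VL-8B".
base = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

lora = LoraConfig(
    r=16,                                                     # stated: rank 16
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # stated: q/k/v/o
    lora_alpha=32,             # assumed (common 2×r convention)
    lora_dropout=0.05,         # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a small % of the 8B weights train
```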
+ - Fine-tuned model: `huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl`
+ - Landmark classifier: `huggingface.co/LucasLooTan/signbridge-asl-classifier`
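Both artifacts pull down with `huggingface_hub`; the classifier file name below is a placeholder, check the model card for the real one:

```python
from huggingface_hub import hf_hub_download, snapshot_download

# Merged fine-tune: full checkpoint directory, ready for vLLM to serve.
vlm_dir = snapshot_download("LucasLooTan/signbridge-qwen3vl-8b-asl")

# Landmark classifier: small single-file model ("classifier.joblib" is
# a guessed file name, not confirmed by the README).
clf_path = hf_hub_download(
    "LucasLooTan/signbridge-asl-classifier", filename="classifier.joblib"
)
```
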
  ## V1 use cases

@@ -37,7 +52,7 @@ V1 is **one-way**: deaf signs → hearing hears. Reverse direction (speech → o

  ## Why AMD

- The MI300X's 192 GB HBM3 fits the entire pipeline (Qwen3-VL-32B + Llama-3.1-8B + XTTS-v2) on one GPU with margin. NVIDIA H100 (80 GB) requires sharding, and the V2 plan to upgrade to a 70B reasoner is impossible on H100 without a 3-GPU cluster. Single-GPU concurrency + 5.3 TB/s memory bandwidth is the actual AMD pitch — practical accessibility tools running globally need the cost-and-availability profile that AMD enables.
+ A single MI300X did three jobs in this project: (1) ran the LoRA fine-tune of Qwen3-VL-8B in 54 minutes; (2) served the merged model for inference via vLLM; (3) served the Qwen3-8B composer in parallel for sentence composition. 192 GB HBM3 means we never had to reload weights, swap, or shard between training and serving. An NVIDIA H100 (80 GB) would need a 3-GPU cluster just for the planned V2 upgrade to a 70B reasoner — practical accessibility tools running globally need the cost-and-availability profile that AMD enables.
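Co-hosting both models comes down to two vLLM processes with capped memory fractions; a sketch, with ports and splits assumed rather than taken from the project's launch script:

```python
import subprocess

# Two vLLM servers share the MI300X; 25% + 25% of 192 GB still leaves
# ~96 GB of HBM3 free for KV cache growth or a future 70B reasoner.
for model, port in [
    ("LucasLooTan/signbridge-qwen3vl-8b-asl", "8000"),  # sign vision
    ("Qwen/Qwen3-8B", "8001"),                          # sentence composer
]:
    subprocess.Popen([
        "vllm", "serve", model,
        "--port", port,
        "--gpu-memory-utilization", "0.25",
    ])
```
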
  ## Why this matters (business case)