# SignBridge — technical walkthrough

> Internal technical record of the build. Not a submission deliverable
> (Build-in-Public extra challenge was dropped on 2026-05-07).
> Kept around because it documents the AMD-specific engineering thinking
> and is useful if anyone later asks "why these design choices?".

## What we built

A real-time webcam-based ASL → English speech translator. A deaf user signs
into the webcam; the pipeline (MediaPipe Holistic → trained sign classifier
→ Llama-3.1-8B sentence composer → Coqui XTTS-v2) returns spoken English
in under 2 seconds. Designed to fit Track 3 (Vision & Multimodal AI), with
the entire model stack running concurrently on a single AMD Instinct MI300X.

## Why AMD MI300X

- 192 GB HBM3 — the trained classifier (~20 MB), Llama-3.1-8B (~16 GB FP16),
  XTTS-v2 (~2 GB), and (V2 stretch) Whisper-large-v3 (~3 GB) all fit
  concurrently, with margin left over for KV cache (back-of-envelope sketch
  below).
- 5.3 TB/s memory bandwidth — the workload is bandwidth-bound streaming
  (many small classifier inferences per second, chunked TTS decode, LLM
  next-token generation), which is exactly where memory bandwidth wins.
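
The back-of-envelope version of that budget, using only the rough sizes
quoted above (these are this doc's estimates, not measured allocations):

```python
# Back-of-envelope VRAM budget on one MI300X, using the sizes above.
weights_gb = {
    "sign classifier (TorchScript)": 0.02,  # ~20 MB
    "Llama-3.1-8B (FP16)": 16,
    "XTTS-v2": 2,
    "Whisper-large-v3 (V2 stretch)": 3,
}

HBM_GB = 192  # MI300X HBM3 capacity
total = sum(weights_gb.values())

print(f"weights total:   {total:.2f} GB")            # 21.02 GB
print(f"KV-cache margin: {HBM_GB - total:.2f} GB")   # ~171 GB headroom
```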

## Architecture

```
webcam frames → MediaPipe Holistic → trained classifier
                  (CPU-fast)           (TorchScript on MI300X)
                                              │
                                              ▼
                                  Llama-3.1-8B sentence composer
                                       (vLLM on MI300X)
                                              │
                                              ▼
                                          XTTS-v2 → audio
                                       (XTTS on MI300X)
```
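
A minimal sketch of the loop this diagram implies. The window length, the
checkpoint path (`classifier.pt`), the vLLM endpoint URL, and the `GLOSSES`
label list are all illustrative assumptions, and the landmark flattening
(543 Holistic landmarks × x/y/z) is our reading of "543-dim", not the
actual implementation:

```python
import collections

import cv2
import mediapipe as mp
import numpy as np
import requests
import torch

WINDOW = 30   # frames per classifier window (illustrative, not the real value)
GLOSSES = ["hello", "thank-you"]   # placeholder; real list has one entry per class

holistic = mp.solutions.holistic.Holistic()                  # CPU-side pose extractor
classifier = torch.jit.load("classifier.pt").cuda().eval()   # .cuda() maps to ROCm/HIP
                                                             # on AMD PyTorch builds

def landmarks_of(frame):
    """Flatten Holistic output (33 pose + 468 face + 2x21 hand landmarks)."""
    res = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    vec = []
    for part, n in zip((res.pose_landmarks, res.face_landmarks,
                        res.left_hand_landmarks, res.right_hand_landmarks),
                       (33, 468, 21, 21)):
        if part is None:
            vec.extend([0.0] * (n * 3))    # body part out of frame
        else:
            vec.extend(v for lm in part.landmark for v in (lm.x, lm.y, lm.z))
    return np.array(vec, dtype=np.float32)

def compose_sentence(glosses):
    """Turn recognized glosses into English via vLLM's OpenAI-compatible API."""
    r = requests.post("http://localhost:8000/v1/chat/completions", json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content":
                      "Compose one natural English sentence from these ASL "
                      "glosses: " + " ".join(glosses)}],
        "max_tokens": 30,
    })
    return r.json()["choices"][0]["message"]["content"]

cap = cv2.VideoCapture(0)
frames = collections.deque(maxlen=WINDOW)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(landmarks_of(frame))
    if len(frames) == WINDOW:
        with torch.no_grad():
            logits = classifier(torch.from_numpy(np.stack(frames))[None].cuda())
        sentence = compose_sentence([GLOSSES[logits.argmax(-1).item()]])
        print(sentence)   # hand off to XTTS-v2 for the audio leg
        frames.clear()
```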

## Models

| Component | Source | Notes |
|---|---|---|
| Pose extractor | MediaPipe Holistic (Google) | CPU-fast preprocessing — not GPU-bound |
| Sign classifier | trained from scratch on WLASL Top-100 + ASL fingerspelling alphabet | 3-layer transformer encoder over 543-dim landmark sequences; published to HF Hub at `lucas-loo/signbridge-classifier` |
| Sentence composer | `meta-llama/Llama-3.1-8B-Instruct` | Pulled from HF Hub; served on MI300X via vLLM |
| Text-to-speech | `coqui/XTTS-v2` | Multilingual; V1 uses English only |
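
For reference, the classifier row above implies roughly the following shape.
Only the 3-layer encoder over landmark sequences comes from the table; the
hidden size, head count, and class count (100 WLASL glosses + 26 fingerspelled
letters) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SignClassifier(nn.Module):
    """3-layer transformer encoder over landmark sequences (per the table).

    feat_dim assumes 543 Holistic landmarks flattened as (x, y, z);
    d_model, nhead, and n_classes are illustrative assumptions.
    """
    def __init__(self, feat_dim=543 * 3, d_model=256, n_classes=126):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)   # per-frame landmarks → model dim
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        h = self.encoder(self.proj(x))
        return self.head(h.mean(dim=1))    # mean-pool over time → class logits
```

The published checkpoint lives at `lucas-loo/signbridge-classifier` on the
HF Hub; the LLM row is served through vLLM's OpenAI-compatible endpoint,
which is what the pipeline sketch above assumes.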

## Datasets

- [WLASL](https://github.com/dxli94/WLASL) Top-100 subset
- ASL fingerspelling alphabet (open dataset)

## ROCm / AMD Developer Cloud experience

> *To be filled in across Day 1–3.*

### Day 1 — environment + sanity
TODO

### Day 2 — training the classifier
TODO

### Day 3 — serving + latency tuning
TODO

### What worked well
TODO

### What we'd flag as friction
TODO

## Latency

Target: ≤ 2 s from end-of-sign to start of speech.

Measured on a single MI300X (Day 3; harness sketch below):
- MediaPipe Holistic per frame: TODO ms
- Classifier per window: TODO ms
- Llama-3.1-8B sentence composition (≤ 30 tokens): TODO ms
- XTTS-v2 first-audio-chunk: TODO ms
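
The numbers are still TODO; a minimal harness for filling them in could look
like this (stage names mirror the list above, and the wrapped calls reference
the hypothetical pipeline sketch earlier, not real measurement code):

```python
import time
from contextlib import contextmanager

timings_ms = {}

@contextmanager
def stage(name):
    """Record wall-clock latency of one pipeline stage in milliseconds."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[name] = (time.perf_counter() - t0) * 1000

# Wrapped around the real pipeline calls, e.g.:
#   with stage("holistic-per-frame"):  landmarks = landmarks_of(frame)
#   with stage("classifier-window"):   logits = classifier(window)
#   with stage("llm-compose"):         sentence = compose_sentence(glosses)
#   with stage("tts-first-chunk"):     chunk = next(audio_stream)
# print(timings_ms)
```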

## MI300X vs NVIDIA H100 — the AMD pitch

| Item | MI300X (1 GPU) | H100 (1 GPU) | H100 cluster needed |
|---|---|---|---|
| Llama-3.1-8B FP16 weights | ✅ fits with margin | ✅ fits with margin | 1× |
| + XTTS-v2 + Whisper-large-v3 + classifier | ✅ all concurrent | ⚠️ tight (~28 GB total + KV) | likely 1×, but no headroom |
| + 70B reasoner upgrade (V2) | ✅ 70B FP8 ~70 GB still fits | ❌ no room once KV cache + the rest of the stack are counted | ≥3× |

The single-GPU concurrency story is the AMD pitch. This V1 fits on H100;
the architecture has clear headroom on MI300X for higher-quality V2 models.

## License
MIT.