LucasLooTan and Claude Opus 4.7 (1M context) committed
Commit 5952553 · 1 Parent(s): 549efd4

Switch to Qwen3-only stack: VLM=Qwen3-VL-32B, composer=Qwen3-8B

Both run on a single AMD MI300X via vLLM. Qwen3-VL-32B is the
accuracy ceiling that fits in 192 GB (Qwen3-VL-235B FP8 is 235 GB of
weights, which exceeds the GPU's memory). Qwen3-8B replaces the gated
Llama-3.1-8B, keeping the composer open-weights and the entire pipeline
Qwen-family for the Qwen Special Reward (10M tokens per team member).

Composer adds extra_body={"chat_template_kwargs":{"enable_thinking":False}}
so Qwen3 reasoning models emit a sentence directly instead of <think>...</think>.
Adds SIGNBRIDGE_COMPOSER_BASE_URL/API_KEY so we can split the two
servers on different ports of the same MI300X (8000 for VL, 8001 for composer).
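For reviewers, this is how the split looks from the client side. Sketch only: the host, port, key, and prompt are placeholders; the real resolution order lives in _resolve_client() in signbridge/composer/sentence.py below.

```python
# Sketch: placeholder host/key; SignBridge resolves the real values from
# SIGNBRIDGE_COMPOSER_BASE_URL / SIGNBRIDGE_COMPOSER_API_KEY at runtime.
from openai import OpenAI

composer = OpenAI(
    base_url="http://<mi300x-host>:8001/v1",  # :8000 serves Qwen3-VL-32B, :8001 the Qwen3-8B composer
    api_key="placeholder-key",
)

resp = composer.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Compose one sentence from these sign tokens: HELLO MY NAME B O B"}],
    temperature=0.2,
    max_tokens=120,
    # Without this, Qwen3 opens with a <think>...</think> block; the chat-template
    # kwarg tells vLLM to skip reasoning and return the sentence directly.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```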

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CLAUDE.md CHANGED
@@ -87,13 +87,15 @@ Verbatim from lablab page → "Technology Partners & Workshops" → Hugging Face
  - 1st: 1 Reachy Mini Wireless + 6 months Hugging Face PRO + $500 Hugging Face Credits.
  - 2nd: 3 months Hugging Face PRO + $300 Hugging Face Credits.
  - 3rd: 2 months Hugging Face PRO + $200 Hugging Face Credits.
+ - 🎉 **Qwen Special Reward** (added by lablab page revision noticed 2026-05-09): "Best use of Qwen in each track — 10M Qwen tokens per team member." Awarded separately to the best Qwen-powered project per track.

  ### Prize targets for SignBridge

  - 🥇 **Track 3** (primary).
+ - 🎉 **Qwen Special Reward — Track 3** (added 2026-05-09). SignBridge's recognizer is `Qwen/Qwen3-VL-32B-Instruct`; we are well-positioned. Action: lead the title/short-description/tags with Qwen3-VL, dedicate a pitch-deck slide to Qwen integration.
  - 🤗 **HF Special Prize** (most likes — requires Space in event org + sharing the link).
  - 🏆 Grand Prize (aspirational).
- - ❌ Build-in-Public extra: **dropped** by user direction 2026-05-07 (no tweet obligations; walkthrough kept as internal doc only).
+ - ❌ Build-in-Public extra: **dropped** by user direction 2026-05-07 (no tweet obligations; walkthrough kept as internal doc only). Re-confirmed 2026-05-09 — with ~36h remaining, finishing the live demo outranks 2 social posts.

  ### License rule

@@ -102,7 +104,7 @@ Per the Voluntary Participation & Prize Terms footer: *"Submissions must be orig
  ### Tech stack constraints (per Track 3)

  - **Compute:** AMD Instinct MI300X via AMD Developer Cloud (datacenter GPU, 192 GB HBM3, 5.3 TB/s memory bandwidth). Not Ryzen, not Radeon Pro — those are different AMD product lines.
- - **Models:** Multimodal models optimized for ROCm. Examples called out by the rules: Llama 3.2 Vision, Qwen-VL family. SignBridge uses `Qwen/Qwen3-VL-8B-Instruct` (Qwen-VL family ✓) for sign recognition + `meta-llama/Llama-3.1-8B-Instruct` for sentence composition + `coqui/XTTS-v2` for speech.
+ - **Models:** Multimodal models optimized for ROCm. Examples called out by the rules: Llama 3.2 Vision, Qwen-VL family. SignBridge uses `Qwen/Qwen3-VL-32B-Instruct` (Qwen-VL family ✓) for sign recognition + `Qwen/Qwen3-8B` for sentence composition + `coqui/XTTS-v2` for speech.
  - **Frameworks:** ROCm + PyTorch + Hugging Face Optimum-AMD + vLLM (per the rules).

  ### Workshop references (provided by AMD)
@@ -165,7 +167,7 @@ Win the AMD Developer Hackathon (LabLab.ai, May 2026), Track 3, with a real-time
  - Pipeline (concurrent on one MI300X):
    - **Pose extraction:** MediaPipe Holistic (Google) — frame → 543-dim landmark vector
    - **Sign classifier:** trained-from-scratch small transformer over landmark sequences (WLASL Top-100 + ASL fingerspelling alphabet) → sign tokens
-   - **Sentence composer:** `meta-llama/Llama-3.1-8B-Instruct` → grammatical English sentence from sign-token stream
+   - **Sentence composer:** `Qwen/Qwen3-8B` → grammatical English sentence from sign-token stream
    - **TTS:** `coqui/XTTS-v2` → audio
    - **(Stretch) STT:** `openai/whisper-large-v3` → reverse direction (speech → on-screen text)
  - Datasets: [WLASL](https://github.com/dxli94/WLASL) Top-100 subset + ASL fingerspelling alphabet (open)
README.md CHANGED
@@ -23,17 +23,11 @@ Submission for the **AMD Developer Hackathon** (LabLab.ai, May 2026) — **Track
  ## How it works

  ```
- webcam frames → MediaPipe Holistic → trained sign classifier
- (1–5 fps)       (543-dim pose)       (WLASL Top-100 + alphabet)
-                                              │
-                                              ▼
-                               Llama-3.1-8B sentence composer
-                                              │
-                                              ▼
-                                   Coqui XTTS-v2 → speech
+ webcam frames → Qwen3-VL-32B → Qwen3-8B → Coqui XTTS-v2 → speech
+                 (sign vision)   (composer)  (TTS)
  ```

- All four stages run **concurrently on a single AMD Instinct MI300X** via AMD Developer Cloud. Total weights ~22 GB on a 192 GB GPU — fits with margin for KV cache + serving overhead.
+ All three stages run **concurrently on a single AMD Instinct MI300X** via AMD Developer Cloud. Total weights ~34 GB (Qwen3-VL-32B + Qwen3-8B + XTTS-v2) on a 192 GB GPU — fits with margin for KV cache + serving overhead. Both LLMs are Qwen-family, served via vLLM 0.17.1 on ROCm 7.2.

  ## V1 use cases

@@ -44,7 +38,7 @@ V1 is **one-way**: deaf signs → hearing hears. Reverse direction (speech → o

  ## Why AMD

- The MI300X's 192 GB HBM3 fits the entire pipeline (Qwen3-VL-8B + Llama-3.1-8B + XTTS-v2) on one GPU with margin. NVIDIA H100 (80 GB) requires sharding, and the V2 plan to upgrade to a 70B reasoner is impossible on H100 without a 3-GPU cluster. Single-GPU concurrency + 5.3 TB/s memory bandwidth is the actual AMD pitch — practical accessibility tools running globally need the cost-and-availability profile that AMD enables.
+ The MI300X's 192 GB HBM3 fits the entire pipeline (Qwen3-VL-32B + Llama-3.1-8B + XTTS-v2) on one GPU with margin. NVIDIA H100 (80 GB) requires sharding, and the V2 plan to upgrade to a 70B reasoner is impossible on H100 without a 3-GPU cluster. Single-GPU concurrency + 5.3 TB/s memory bandwidth is the actual AMD pitch — practical accessibility tools running globally need the cost-and-availability profile that AMD enables.

  ## Why this matters (business case)

@@ -82,7 +76,8 @@ python -m signbridge.scripts.train_classifier --dataset data/wlasl --epochs 30

  ## Models pulled from Hugging Face Hub

- - `meta-llama/Llama-3.1-8B-Instruct` — sentence composer
+ - `Qwen/Qwen3-VL-32B-Instruct` — sign vision (recognizer)
+ - `Qwen/Qwen3-8B` — sentence composer
  - `coqui/XTTS-v2` — text-to-speech
  - (V2 stretch) `openai/whisper-large-v3` — for the reverse direction

docs/demo-video-script.md CHANGED
@@ -80,7 +80,7 @@ Webcam frames → Qwen3-VL-8B (vision) → Llama-3.1-8B (composer) → XTTS-v2 (
  ```

  **Voice-over:**
- > "Under the hood: a multi-modal pipeline running on a single AMD Instinct MI300X. Vision, reasoning, and voice — all concurrent on one GPU."
+ > "Under the hood: Qwen3-VL-8B reads each frame, Llama-3.1 composes the sentence, XTTS speaks it — all running concurrently on a single AMD Instinct MI300X. Vision, reasoning, and voice on one GPU."

  **Beat 3B — The MI300X comparison (1:55 → 2:15):**

@@ -155,6 +155,7 @@ Webcam frames → Qwen3-VL-8B (vision) → Llama-3.1-8B (composer) → XTTS-v2 (
  - [ ] Length 2:00–3:00
  - [ ] Captions visible throughout
  - [ ] AMD Dev Cloud / MI300X mentioned by name ≥3 times
+ - [ ] Qwen3-VL mentioned by name ≥2 times (Qwen Special Reward eligibility)
  - [ ] HF Space URL shown on screen at least once
  - [ ] GitHub URL shown on screen at least once
  - [ ] No copyrighted music / footage
docs/lablab-submission-form.md CHANGED
@@ -7,17 +7,17 @@
  ## Project Title (≤ ~70 chars)

  ```
- SignBridge — Real-time ASL → English speech on AMD Instinct MI300X
+ SignBridge — Real-time ASL → speech, Qwen3-VL on AMD MI300X
  ```

- (63 characters; safe under platform limit.)
+ (60 characters; leads with Qwen for Qwen Special Reward eligibility.)

  ---

  ## Short Description (≤ 150 chars typical)

  ```
- Two people who couldn't communicate, now can. Real-time ASL → English speech via Qwen3-VL + Llama-3.1 + XTTS, on a single AMD MI300X.
+ Two people who couldn't communicate, now can. Real-time ASL → English speech, powered by Qwen3-VL on AMD Instinct MI300X.
  ```

  (132 characters.)
@@ -27,11 +27,11 @@ Two people who couldn't communicate, now can. Real-time ASL → English speech v
  ## Long Description (no hard limit, ~300 words is the sweet spot)

  ```
- SignBridge is a real-time American Sign Language to English speech translator built for the AMD Developer Hackathon, Track 3 (Vision & Multimodal AI).
+ SignBridge is a real-time American Sign Language to English speech translator built for the AMD Developer Hackathon, Track 3 (Vision & Multimodal AI). It is powered by Qwen3-VL-8B for visual understanding of signs.

  The user signs at the webcam — either fingerspelled letters (Snapshot tab) or full motion words (Record sign tab) — and SignBridge replies in spoken English. Two people who couldn't communicate, now can.

- Architecture: a multi-stage pipeline (Qwen3-VL-8B for sign recognition, Llama-3.1-8B for sentence composition, Coqui XTTS-v2 for speech synthesis), running concurrently on a single AMD Instinct MI300X via vLLM. The 192 GB HBM3 of one MI300X holds the entire pipeline with margin — the same workload on NVIDIA H100 needs three GPUs.
+ Architecture: a multi-stage pipeline (Qwen3-VL-8B for sign recognition — the core intelligence; Llama-3.1-8B for sentence composition; Coqui XTTS-v2 for speech synthesis), running concurrently on a single AMD Instinct MI300X via vLLM. The 192 GB HBM3 of one MI300X holds the entire pipeline with margin — the same workload on NVIDIA H100 needs three GPUs.

  For motion-dependent signs (HELLO, THANK_YOU, PLEASE, EAT) the Record-sign tab captures 1.5 s of webcam, samples 4 evenly-spaced frames, and sends them as a multi-image VLM call with NVIDIA-style sequential frame markers in the prompt — most ASL signs are motion, not held poses, so single-frame approaches fundamentally cannot translate them.

@@ -49,13 +49,13 @@ Built solo by Lucas Loo Tan Yu Heng, May 5–11, 2026.
  Pick from lablab's tag dropdown — these are the tags that match SignBridge:

  **Primary (must-haves):**
+ - `Qwen` / `Qwen3-VL` (Qwen3-VL-8B vision recognizer — central; eligible for Qwen Special Reward 10M tokens)
  - `AMD Developer Cloud`
  - `AMD ROCm`
  - `HuggingFace Spaces`

  **Secondary (relevant):**
  - `LLaMA` (Llama-3.1-8B composer)
- - `Qwen` (Qwen3-VL-8B vision)
  - `Gradio`
  - `FastAPI`
  - `Vision`
docs/pitch-deck.md CHANGED
@@ -117,6 +117,24 @@ The 2–3 minute demo video, looping, autoplay-on-slide-show.

  ---

+ ## Slide 6.5 — Qwen3-VL is the brain
+
+ **Headline:**
+ Qwen3-VL-8B-Instruct: the visual intelligence behind every sign.
+
+ **Body bullets:**
+ - The recognizer is **Qwen3-VL-8B-Instruct** — Alibaba's open Qwen-VL family, served from Hugging Face Hub.
+ - We feed it **multi-image bursts** (4 frames over 1.5 s) for motion-dependent signs like HELLO and THANK_YOU — single-frame models fundamentally cannot translate ASL.
+ - **Closed-vocabulary forcing** + **sequential frame markers** (NVIDIA video-VLM pattern) keep Qwen on-rails for the 87-token sign vocab. No fine-tuning needed — Qwen3-VL is strong enough zero-shot.
+ - Llama-3.1-8B then composes Qwen's tokens into grammatical English; XTTS-v2 speaks it.
+
+ **Closer:**
+ Qwen3-VL is the only thing in the pipeline making the visual judgement. The rest is plumbing.
+
+ *Visual: a single screenshot of `signbridge/recognizer/vlm.py` showing the multi-frame Qwen call, alongside an arrow into a "detected: HELLO (85%)" overlay.*
+
+ ---
+
  ## Slide 7 — Why this is the right submission for Track 3

  **Headline:**
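The multi-image burst and frame-marker pattern that Slide 6.5 and the submission form describe, as a standalone sketch. The prompt wording, frame paths, host, key, and the vocabulary subset shown are illustrative; the real call lives in `signbridge/recognizer/vlm.py`.

```python
# Sketch of the 4-frame burst call; prompt text, frame paths, and the vocab
# list shown here are illustrative, not copied from signbridge/recognizer/vlm.py.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://<mi300x-host>:8000/v1", api_key="placeholder-key")

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

frames = [f"frame_{i}.jpg" for i in range(4)]   # 4 frames sampled evenly over ~1.5 s
vocab = "HELLO, THANK_YOU, PLEASE, EAT"          # closed vocabulary (subset shown)

content = []
for i, frame in enumerate(frames):
    # Sequential frame markers so the VLM treats the images as ordered motion.
    content.append({"type": "text", "text": f"Frame {i + 1} of {len(frames)}:"})
    content.append({"type": "image_url", "image_url": {"url": to_data_url(frame)}})
content.append({
    "type": "text",
    "text": f"These frames show one ASL sign in motion. Answer with exactly one token from: {vocab}.",
})

resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=[{"role": "user", "content": content}],
    temperature=0,
    max_tokens=10,
)
print(resp.choices[0].message.content)  # e.g. "HELLO"
```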
signbridge/composer/sentence.py CHANGED
@@ -1,4 +1,4 @@
- """Llama-3.1-8B sentence composer.
+ """Qwen3-8B sentence composer (Llama-3.1-8B-compatible by env var).

  Takes a stream of sign tokens (English glosses + fingerspelled letters)
  and composes a grammatical English sentence. Backed by an OpenAI-compatible
@@ -50,12 +50,21 @@ def _resolve_client() -> tuple[object | None, str]:
      """Return (cached client, model_id) based on SIGNBRIDGE_PROVIDER env var."""
      provider = os.getenv("SIGNBRIDGE_PROVIDER", "amd").lower()
      composer_model = os.getenv(
-         "SIGNBRIDGE_COMPOSER_MODEL", "meta-llama/Llama-3.1-8B-Instruct"
+         "SIGNBRIDGE_COMPOSER_MODEL", "Qwen/Qwen3-8B"
      )

      if provider == "amd":
-         base_url = os.getenv("AMD_DEV_CLOUD_BASE_URL", "").rstrip("/")
-         api_key = os.getenv("AMD_DEV_CLOUD_API_KEY", "")
+         # Prefer a composer-specific base URL/key (lets us run Qwen-VL on :8000
+         # and the composer on :8001 of the same MI300X). Falls back to the
+         # shared AMD_DEV_CLOUD_BASE_URL when not split.
+         base_url = (
+             os.getenv("SIGNBRIDGE_COMPOSER_BASE_URL")
+             or os.getenv("AMD_DEV_CLOUD_BASE_URL", "")
+         ).rstrip("/")
+         api_key = (
+             os.getenv("SIGNBRIDGE_COMPOSER_API_KEY")
+             or os.getenv("AMD_DEV_CLOUD_API_KEY", "")
+         )
          if not base_url or not api_key:
              logger.info("AMD Dev Cloud not configured; falling back to naive joiner.")
              return None, composer_model
@@ -117,6 +126,9 @@ def compose_sentence(signs: Sequence[str]) -> str:
      if client is None:
          return _naive_join(signs)

+     # Qwen3 reasoning models default to emitting <think>...</think>; disable
+     # via the chat-template kwarg so vLLM serves a direct sentence. Harmless
+     # for non-Qwen3 models (extra_body keys they don't know are ignored).
      try:
          resp = client.chat.completions.create(  # type: ignore[attr-defined]
              model=model,
@@ -126,6 +138,7 @@ def compose_sentence(signs: Sequence[str]) -> str:
              ],
              temperature=0.2,
              max_tokens=120,
+             extra_body={"chat_template_kwargs": {"enable_thinking": False}},
          )
          text = (resp.choices[0].message.content or "").strip()
      except Exception as exc:  # noqa: BLE001 — broad catch is intentional at the boundary
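A quick way to exercise the new split-endpoint variables end to end, as a sketch: the host and key are placeholders, the import path follows the file location above, and the printed output is only indicative (with the variables unset, the composer falls back to the naive joiner).

```python
# Sketch: placeholder endpoint values; set these before calling the composer so
# it picks up the :8001 server instead of the shared AMD Dev Cloud URL.
import os

os.environ["SIGNBRIDGE_PROVIDER"] = "amd"
os.environ["SIGNBRIDGE_COMPOSER_BASE_URL"] = "http://<mi300x-host>:8001/v1"
os.environ["SIGNBRIDGE_COMPOSER_API_KEY"] = "placeholder-key"
os.environ["SIGNBRIDGE_COMPOSER_MODEL"] = "Qwen/Qwen3-8B"

from signbridge.composer.sentence import compose_sentence

print(compose_sentence(["HELLO", "MY", "NAME", "B", "O", "B"]))
# Indicative output: "Hello, my name is Bob."
```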
signbridge/recognizer/vlm.py CHANGED
@@ -37,7 +37,7 @@ from signbridge.vocab import VOCAB_SET as _VLM_VOCAB_SET

  logger = logging.getLogger(__name__)

- DEFAULT_VLM_MODEL = os.getenv("SIGNBRIDGE_VLM_MODEL", "Qwen/Qwen2-VL-7B-Instruct")
+ DEFAULT_VLM_MODEL = os.getenv("SIGNBRIDGE_VLM_MODEL", "Qwen/Qwen3-VL-32B-Instruct")


  @lru_cache(maxsize=4)