LucasLooTan and Claude Opus 4.7 (1M context) committed
Commit 5952553 · 1 Parent(s): 549efd4

Switch to Qwen3-only stack: VLM=Qwen3-VL-32B, composer=Qwen3-8B

Both run on a single AMD MI300X via vLLM. Qwen3-VL-32B is the
accuracy ceiling that fits in 192 GB (Qwen3-VL-235B FP8 is 235 GB of
weights, which exceeds the GPU's memory). Qwen3-8B replaces the gated
Llama-3.1-8B, keeping the composer open-weights and the entire pipeline
Qwen-family for the Qwen Special Reward (10M tokens per team member).

Composer adds extra_body={"chat_template_kwargs":{"enable_thinking":False}}
so Qwen3 reasoning models emit a sentence directly instead of <think>...</think>.
Adds SIGNBRIDGE_COMPOSER_BASE_URL/API_KEY so we can split the two
servers on different ports of the same MI300X (8000 for VL, 8001 for composer).
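For reviewers, this is how the split looks from the client side. Sketch only: the host, port, key, and prompt are placeholders; the real resolution order lives in _resolve_client() in signbridge/composer/sentence.py below.

```python
# Sketch: placeholder host/key; SignBridge resolves the real values from
# SIGNBRIDGE_COMPOSER_BASE_URL / SIGNBRIDGE_COMPOSER_API_KEY at runtime.
from openai import OpenAI

composer = OpenAI(
    base_url="http://<mi300x-host>:8001/v1",  # :8000 serves Qwen3-VL-32B, :8001 the Qwen3-8B composer
    api_key="placeholder-key",
)

resp = composer.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Compose one sentence from these sign tokens: HELLO MY NAME B O B"}],
    temperature=0.2,
    max_tokens=120,
    # Without this, Qwen3 opens with a <think>...</think> block; the chat-template
    # kwarg tells vLLM to skip reasoning and return the sentence directly.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```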

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CLAUDE.md CHANGED
@@ -87,13 +87,15 @@ Verbatim from lablab page → "Technology Partners & Workshops" → Hugging Face
  - 1st: 1 Reachy Mini Wireless + 6 months Hugging Face PRO + $500 Hugging Face Credits.
  - 2nd: 3 months Hugging Face PRO + $300 Hugging Face Credits.
  - 3rd: 2 months Hugging Face PRO + $200 Hugging Face Credits.
+ - 🎉 **Qwen Special Reward** (added by lablab page revision noticed 2026-05-09): "Best use of Qwen in each track — 10M Qwen tokens per team member." Awarded separately to the best Qwen-powered project per track.

  ### Prize targets for SignBridge

  - 🥇 **Track 3** (primary).
+ - 🎉 **Qwen Special Reward — Track 3** (added 2026-05-09). SignBridge's recognizer is `Qwen/Qwen3-VL-32B-Instruct`; we are well-positioned. Action: lead the title/short-description/tags with Qwen3-VL, dedicate a pitch-deck slide to Qwen integration.
  - 🤗 **HF Special Prize** (most likes — requires Space in event org + sharing the link).
  - 🏆 Grand Prize (aspirational).
- - ❌ Build-in-Public extra: **dropped** by user direction 2026-05-07 (no tweet obligations; walkthrough kept as internal doc only).
+ - ❌ Build-in-Public extra: **dropped** by user direction 2026-05-07 (no tweet obligations; walkthrough kept as internal doc only). Re-confirmed 2026-05-09 — with ~36h remaining, finishing the live demo outranks 2 social posts.

  ### License rule

@@ -102,7 +104,7 @@ Per the Voluntary Participation & Prize Terms footer: *"Submissions must be orig
  ### Tech stack constraints (per Track 3)

  - **Compute:** AMD Instinct MI300X via AMD Developer Cloud (datacenter GPU, 192 GB HBM3, 5.3 TB/s memory bandwidth). Not Ryzen, not Radeon Pro — those are different AMD product lines.
- - **Models:** Multimodal models optimized for ROCm. Examples called out by the rules: Llama 3.2 Vision, Qwen-VL family. SignBridge uses `Qwen/Qwen3-VL-8B-Instruct` (Qwen-VL family ✓) for sign recognition + `meta-llama/Llama-3.1-8B-Instruct` for sentence composition + `coqui/XTTS-v2` for speech.
+ - **Models:** Multimodal models optimized for ROCm. Examples called out by the rules: Llama 3.2 Vision, Qwen-VL family. SignBridge uses `Qwen/Qwen3-VL-32B-Instruct` (Qwen-VL family ✓) for sign recognition + `Qwen/Qwen3-8B` for sentence composition + `coqui/XTTS-v2` for speech.
  - **Frameworks:** ROCm + PyTorch + Hugging Face Optimum-AMD + vLLM (per the rules).

  ### Workshop references (provided by AMD)
@@ -165,7 +167,7 @@ Win the AMD Developer Hackathon (LabLab.ai, May 2026), Track 3, with a real-time
  - Pipeline (concurrent on one MI300X):
    - **Pose extraction:** MediaPipe Holistic (Google) — frame → 543-dim landmark vector
    - **Sign classifier:** trained-from-scratch small transformer over landmark sequences (WLASL Top-100 + ASL fingerspelling alphabet) → sign tokens
-   - **Sentence composer:** `meta-llama/Llama-3.1-8B-Instruct` → grammatical English sentence from sign-token stream
+   - **Sentence composer:** `Qwen/Qwen3-8B` → grammatical English sentence from sign-token stream
    - **TTS:** `coqui/XTTS-v2` → audio
    - **(Stretch) STT:** `openai/whisper-large-v3` → reverse direction (speech → on-screen text)
  - Datasets: [WLASL](https://github.com/dxli94/WLASL) Top-100 subset + ASL fingerspelling alphabet (open)
README.md CHANGED
@@ -23,17 +23,11 @@ Submission for the **AMD Developer Hackathon** (LabLab.ai, May 2026) — **Track
  ## How it works

  ```
- webcam frames → MediaPipe Holistic → trained sign classifier
- (1–5 fps)       (543-dim pose)       (WLASL Top-100 + alphabet)
-                                              │
-                                              ▼
-                               Llama-3.1-8B sentence composer
-                                              │
-                                              ▼
-                                   Coqui XTTS-v2 → speech
+ webcam frames → Qwen3-VL-32B → Qwen3-8B → Coqui XTTS-v2 → speech
+                 (sign vision)   (composer)  (TTS)
  ```

- All four stages run **concurrently on a single AMD Instinct MI300X** via AMD Developer Cloud. Total weights ~22 GB on a 192 GB GPU — fits with margin for KV cache + serving overhead.
+ All three stages run **concurrently on a single AMD Instinct MI300X** via AMD Developer Cloud. Total weights ~34 GB (Qwen3-VL-32B + Qwen3-8B + XTTS-v2) on a 192 GB GPU — fits with margin for KV cache + serving overhead. Both LLMs are Qwen-family, served via vLLM 0.17.1 on ROCm 7.2.

  ## V1 use cases

@@ -44,7 +38,7 @@ V1 is **one-way**: deaf signs → hearing hears. Reverse direction (speech → o

  ## Why AMD

- The MI300X's 192 GB HBM3 fits the entire pipeline (Qwen3-VL-8B + Llama-3.1-8B + XTTS-v2) on one GPU with margin. NVIDIA H100 (80 GB) requires sharding, and the V2 plan to upgrade to a 70B reasoner is impossible on H100 without a 3-GPU cluster. Single-GPU concurrency + 5.3 TB/s memory bandwidth is the actual AMD pitch — practical accessibility tools running globally need the cost-and-availability profile that AMD enables.
+ The MI300X's 192 GB HBM3 fits the entire pipeline (Qwen3-VL-32B + Llama-3.1-8B + XTTS-v2) on one GPU with margin. NVIDIA H100 (80 GB) requires sharding, and the V2 plan to upgrade to a 70B reasoner is impossible on H100 without a 3-GPU cluster. Single-GPU concurrency + 5.3 TB/s memory bandwidth is the actual AMD pitch — practical accessibility tools running globally need the cost-and-availability profile that AMD enables.

  ## Why this matters (business case)

@@ -82,7 +76,8 @@ python -m signbridge.scripts.train_classifier --dataset data/wlasl --epochs 30

  ## Models pulled from Hugging Face Hub

- - `meta-llama/Llama-3.1-8B-Instruct` — sentence composer
+ - `Qwen/Qwen3-VL-32B-Instruct` — sign vision (recognizer)
+ - `Qwen/Qwen3-8B` — sentence composer
  - `coqui/XTTS-v2` — text-to-speech
  - (V2 stretch) `openai/whisper-large-v3` — for the reverse direction

docs/demo-video-script.md CHANGED
@@ -80,7 +80,7 @@ Webcam frames → Qwen3-VL-8B (vision) → Llama-3.1-8B (composer) → XTTS-v2 (
  ```

  **Voice-over:**
- > "Under the hood: a multi-modal pipeline running on a single AMD Instinct MI300X. Vision, reasoning, and voice — all concurrent on one GPU."
+ > "Under the hood: Qwen3-VL-8B reads each frame, Llama-3.1 composes the sentence, XTTS speaks it — all running concurrently on a single AMD Instinct MI300X. Vision, reasoning, and voice on one GPU."

  **Beat 3B — The MI300X comparison (1:55 → 2:15):**

@@ -155,6 +155,7 @@ Webcam frames → Qwen3-VL-8B (vision) → Llama-3.1-8B (composer) → XTTS-v2 (
  - [ ] Length 2:00–3:00
  - [ ] Captions visible throughout
  - [ ] AMD Dev Cloud / MI300X mentioned by name ≥3 times
+ - [ ] Qwen3-VL mentioned by name ≥2 times (Qwen Special Reward eligibility)
  - [ ] HF Space URL shown on screen at least once
  - [ ] GitHub URL shown on screen at least once
  - [ ] No copyrighted music / footage
docs/lablab-submission-form.md CHANGED
@@ -7,17 +7,17 @@
  ## Project Title (≤ ~70 chars)

  ```
- SignBridge — Real-time ASL → English speech on AMD Instinct MI300X
+ SignBridge — Real-time ASL → speech, Qwen3-VL on AMD MI300X
  ```

- (63 characters; safe under platform limit.)
+ (60 characters; leads with Qwen for Qwen Special Reward eligibility.)

  ---

  ## Short Description (≤ 150 chars typical)

  ```
- Two people who couldn't communicate, now can. Real-time ASL → English speech via Qwen3-VL + Llama-3.1 + XTTS, on a single AMD MI300X.
+ Two people who couldn't communicate, now can. Real-time ASL → English speech, powered by Qwen3-VL on AMD Instinct MI300X.
  ```

  (132 characters.)
@@ -27,11 +27,11 @@ Two people who couldn't communicate, now can. Real-time ASL → English speech v
  ## Long Description (no hard limit, ~300 words is the sweet spot)

  ```
- SignBridge is a real-time American Sign Language to English speech translator built for the AMD Developer Hackathon, Track 3 (Vision & Multimodal AI).
+ SignBridge is a real-time American Sign Language to English speech translator built for the AMD Developer Hackathon, Track 3 (Vision & Multimodal AI). It is powered by Qwen3-VL-8B for visual understanding of signs.

  The user signs at the webcam — either fingerspelled letters (Snapshot tab) or full motion words (Record sign tab) — and SignBridge replies in spoken English. Two people who couldn't communicate, now can.

- Architecture: a multi-stage pipeline (Qwen3-VL-8B for sign recognition, Llama-3.1-8B for sentence composition, Coqui XTTS-v2 for speech synthesis), running concurrently on a single AMD Instinct MI300X via vLLM. The 192 GB HBM3 of one MI300X holds the entire pipeline with margin — the same workload on NVIDIA H100 needs three GPUs.
+ Architecture: a multi-stage pipeline (Qwen3-VL-8B for sign recognition — the core intelligence; Llama-3.1-8B for sentence composition; Coqui XTTS-v2 for speech synthesis), running concurrently on a single AMD Instinct MI300X via vLLM. The 192 GB HBM3 of one MI300X holds the entire pipeline with margin — the same workload on NVIDIA H100 needs three GPUs.

  For motion-dependent signs (HELLO, THANK_YOU, PLEASE, EAT) the Record-sign tab captures 1.5 s of webcam, samples 4 evenly-spaced frames, and sends them as a multi-image VLM call with NVIDIA-style sequential frame markers in the prompt — most ASL signs are motion, not held poses, so single-frame approaches fundamentally cannot translate them.

@@ -49,13 +49,13 @@ Built solo by Lucas Loo Tan Yu Heng, May 5–11, 2026.
  Pick from lablab's tag dropdown — these are the tags that match SignBridge:

  **Primary (must-haves):**
+ - `Qwen` / `Qwen3-VL` (Qwen3-VL-8B vision recognizer — central; eligible for Qwen Special Reward 10M tokens)
  - `AMD Developer Cloud`
  - `AMD ROCm`
  - `HuggingFace Spaces`

  **Secondary (relevant):**
  - `LLaMA` (Llama-3.1-8B composer)
- - `Qwen` (Qwen3-VL-8B vision)
  - `Gradio`
  - `FastAPI`
  - `Vision`
docs/pitch-deck.md CHANGED
@@ -117,6 +117,24 @@ The 2–3 minute demo video, looping, autoplay-on-slide-show.

  ---

+ ## Slide 6.5 — Qwen3-VL is the brain
+
+ **Headline:**
+ Qwen3-VL-8B-Instruct: the visual intelligence behind every sign.
+
+ **Body bullets:**
+ - The recognizer is **Qwen3-VL-8B-Instruct** — Alibaba's open Qwen-VL family, served from Hugging Face Hub.
+ - We feed it **multi-image bursts** (4 frames over 1.5 s) for motion-dependent signs like HELLO and THANK_YOU — single-frame models fundamentally cannot translate ASL.
+ - **Closed-vocabulary forcing** + **sequential frame markers** (NVIDIA video-VLM pattern) keep Qwen on-rails for the 87-token sign vocab. No fine-tuning needed — Qwen3-VL is strong enough zero-shot.
+ - Llama-3.1-8B then composes Qwen's tokens into grammatical English; XTTS-v2 speaks it.
+
+ **Closer:**
+ Qwen3-VL is the only thing in the pipeline making the visual judgement. The rest is plumbing.
+
+ *Visual: a single screenshot of `signbridge/recognizer/vlm.py` showing the multi-frame Qwen call, alongside an arrow into a "detected: HELLO (85%)" overlay.*
+
+ ---
+
  ## Slide 7 — Why this is the right submission for Track 3

  **Headline:**
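The multi-image burst and frame-marker pattern that Slide 6.5 and the submission form describe, as a standalone sketch. The prompt wording, frame paths, host, key, and the vocabulary subset shown are illustrative; the real call lives in `signbridge/recognizer/vlm.py`.

```python
# Sketch of the 4-frame burst call; prompt text, frame paths, and the vocab
# list shown here are illustrative, not copied from signbridge/recognizer/vlm.py.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://<mi300x-host>:8000/v1", api_key="placeholder-key")

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

frames = [f"frame_{i}.jpg" for i in range(4)]   # 4 frames sampled evenly over ~1.5 s
vocab = "HELLO, THANK_YOU, PLEASE, EAT"          # closed vocabulary (subset shown)

content = []
for i, frame in enumerate(frames):
    # Sequential frame markers so the VLM treats the images as ordered motion.
    content.append({"type": "text", "text": f"Frame {i + 1} of {len(frames)}:"})
    content.append({"type": "image_url", "image_url": {"url": to_data_url(frame)}})
content.append({
    "type": "text",
    "text": f"These frames show one ASL sign in motion. Answer with exactly one token from: {vocab}.",
})

resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-32B-Instruct",
    messages=[{"role": "user", "content": content}],
    temperature=0,
    max_tokens=10,
)
print(resp.choices[0].message.content)  # e.g. "HELLO"
```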
signbridge/composer/sentence.py CHANGED
@@ -1,4 +1,4 @@
- """Llama-3.1-8B sentence composer.
+ """Qwen3-8B sentence composer (Llama-3.1-8B-compatible by env var).

  Takes a stream of sign tokens (English glosses + fingerspelled letters)
  and composes a grammatical English sentence. Backed by an OpenAI-compatible
@@ -50,12 +50,21 @@ def _resolve_client() -> tuple[object | None, str]:
      """Return (cached client, model_id) based on SIGNBRIDGE_PROVIDER env var."""
      provider = os.getenv("SIGNBRIDGE_PROVIDER", "amd").lower()
      composer_model = os.getenv(
-         "SIGNBRIDGE_COMPOSER_MODEL", "meta-llama/Llama-3.1-8B-Instruct"
+         "SIGNBRIDGE_COMPOSER_MODEL", "Qwen/Qwen3-8B"
      )

      if provider == "amd":
-         base_url = os.getenv("AMD_DEV_CLOUD_BASE_URL", "").rstrip("/")
-         api_key = os.getenv("AMD_DEV_CLOUD_API_KEY", "")
+         # Prefer a composer-specific base URL/key (lets us run Qwen-VL on :8000
+         # and the composer on :8001 of the same MI300X). Falls back to the
+         # shared AMD_DEV_CLOUD_BASE_URL when not split.
+         base_url = (
+             os.getenv("SIGNBRIDGE_COMPOSER_BASE_URL")
+             or os.getenv("AMD_DEV_CLOUD_BASE_URL", "")
+         ).rstrip("/")
+         api_key = (
+             os.getenv("SIGNBRIDGE_COMPOSER_API_KEY")
+             or os.getenv("AMD_DEV_CLOUD_API_KEY", "")
+         )
          if not base_url or not api_key:
              logger.info("AMD Dev Cloud not configured; falling back to naive joiner.")
              return None, composer_model
@@ -117,6 +126,9 @@ def compose_sentence(signs: Sequence[str]) -> str:
      if client is None:
          return _naive_join(signs)

+     # Qwen3 reasoning models default to emitting <think>...</think>; disable
+     # via the chat-template kwarg so vLLM serves a direct sentence. Harmless
+     # for non-Qwen3 models (extra_body keys they don't know are ignored).
      try:
          resp = client.chat.completions.create(  # type: ignore[attr-defined]
              model=model,
@@ -126,6 +138,7 @@ def compose_sentence(signs: Sequence[str]) -> str:
              ],
              temperature=0.2,
              max_tokens=120,
+             extra_body={"chat_template_kwargs": {"enable_thinking": False}},
          )
          text = (resp.choices[0].message.content or "").strip()
      except Exception as exc:  # noqa: BLE001 — broad catch is intentional at the boundary
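A quick way to exercise the new split-endpoint variables end to end, as a sketch: the host and key are placeholders, the import path follows the file location above, and the printed output is only indicative (with the variables unset, the composer falls back to the naive joiner).

```python
# Sketch: placeholder endpoint values; set these before calling the composer so
# it picks up the :8001 server instead of the shared AMD Dev Cloud URL.
import os

os.environ["SIGNBRIDGE_PROVIDER"] = "amd"
os.environ["SIGNBRIDGE_COMPOSER_BASE_URL"] = "http://<mi300x-host>:8001/v1"
os.environ["SIGNBRIDGE_COMPOSER_API_KEY"] = "placeholder-key"
os.environ["SIGNBRIDGE_COMPOSER_MODEL"] = "Qwen/Qwen3-8B"

from signbridge.composer.sentence import compose_sentence

print(compose_sentence(["HELLO", "MY", "NAME", "B", "O", "B"]))
# Indicative output: "Hello, my name is Bob."
```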
signbridge/recognizer/vlm.py CHANGED
@@ -37,7 +37,7 @@ from signbridge.vocab import VOCAB_SET as _VLM_VOCAB_SET

  logger = logging.getLogger(__name__)

- DEFAULT_VLM_MODEL = os.getenv("SIGNBRIDGE_VLM_MODEL", "Qwen/Qwen2-VL-7B-Instruct")
+ DEFAULT_VLM_MODEL = os.getenv("SIGNBRIDGE_VLM_MODEL", "Qwen/Qwen3-VL-32B-Instruct")


  @lru_cache(maxsize=4)