Commit 5952553 · Parent(s): 549efd4
Switch to Qwen3-only stack: VLM=Qwen3-VL-32B, composer=Qwen3-8B
Both run on a single AMD MI300X via vLLM. Qwen3-VL-32B is the accuracy ceiling that fits in 192 GB (Qwen3-VL-235B in FP8 is ~235 GB, which exceeds the GPU). Qwen3-8B replaces the gated Llama-3.1-8B, keeping the composer open-weights and the entire pipeline Qwen-family for the Qwen Special Reward (10M tokens per team member).

The composer adds extra_body={"chat_template_kwargs":{"enable_thinking":False}} so Qwen3 reasoning models emit a sentence directly instead of <think>...</think>.

Adds SIGNBRIDGE_COMPOSER_BASE_URL/API_KEY so the two vLLM servers can be split across different ports of the same MI300X (8000 for the VL model, 8001 for the composer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
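For reference, a minimal client-side sketch of the split-server setup this commit describes: recognizer endpoint on port 8000, composer on 8001, and the composer call sent with thinking disabled. The localhost defaults, the `EMPTY` API key, and the prompt are illustrative assumptions; only the env-var names, the model IDs, and the `extra_body` kwarg come from this commit.

```python
import os
from openai import OpenAI

# Two vLLM OpenAI-compatible servers on one MI300X (assumed local ports).
vl_client = OpenAI(
    base_url=os.getenv("AMD_DEV_CLOUD_BASE_URL", "http://localhost:8000/v1"),
    api_key=os.getenv("AMD_DEV_CLOUD_API_KEY", "EMPTY"),
)  # the recognizer would talk to this one
composer_client = OpenAI(
    base_url=os.getenv("SIGNBRIDGE_COMPOSER_BASE_URL", "http://localhost:8001/v1"),
    api_key=os.getenv("SIGNBRIDGE_COMPOSER_API_KEY", "EMPTY"),
)

# Compose a sentence from sign tokens; disable Qwen3's <think> block so the
# reply is the sentence itself (illustrative prompt, not the project's real one).
resp = composer_client.chat.completions.create(
    model=os.getenv("SIGNBRIDGE_COMPOSER_MODEL", "Qwen/Qwen3-8B"),
    messages=[{"role": "user", "content": "Signs: ME WANT EAT. Compose one English sentence."}],
    temperature=0.2,
    max_tokens=120,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```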
- CLAUDE.md +5 -3
- README.md +6 -11
- docs/demo-video-script.md +2 -1
- docs/lablab-submission-form.md +6 -6
- docs/pitch-deck.md +18 -0
- signbridge/composer/sentence.py +17 -4
- signbridge/recognizer/vlm.py +1 -1
CLAUDE.md
CHANGED

```diff
@@ -87,13 +87,15 @@ Verbatim from lablab page → "Technology Partners & Workshops" → Hugging Face
 - 1st: 1 Reachy Mini Wireless + 6 months Hugging Face PRO + $500 Hugging Face Credits.
 - 2nd: 3 months Hugging Face PRO + $300 Hugging Face Credits.
 - 3rd: 2 months Hugging Face PRO + $200 Hugging Face Credits.
+- 🏅 **Qwen Special Reward** (added by lablab page revision noticed 2026-05-09): "Best use of Qwen in each track → 10M Qwen tokens per team member." Awarded separately to the best Qwen-powered project per track.

 ### Prize targets for SignBridge

 - 🥇 **Track 3** (primary).
+- 🏅 **Qwen Special Reward – Track 3** (added 2026-05-09). SignBridge's recognizer is `Qwen/Qwen3-VL-32B-Instruct`; we are well-positioned. Action: lead the title/short-description/tags with Qwen3-VL, dedicate a pitch-deck slide to Qwen integration.
 - 🤗 **HF Special Prize** (most likes – requires Space in event org + sharing the link).
 - 🏆 Grand Prize (aspirational).
-- ❌ Build-in-Public extra: **dropped** by user direction 2026-05-07 (no tweet obligations; walkthrough kept as internal doc only).
+- ❌ Build-in-Public extra: **dropped** by user direction 2026-05-07 (no tweet obligations; walkthrough kept as internal doc only). Re-confirmed 2026-05-09 – with ~36h remaining, finishing the live demo outranks 2 social posts.

 ### License rule

@@ -102,7 +104,7 @@ Per the Voluntary Participation & Prize Terms footer: *"Submissions must be orig…
 ### Tech stack constraints (per Track 3)

 - **Compute:** AMD Instinct MI300X via AMD Developer Cloud (datacenter GPU, 192 GB HBM3, 5.3 TB/s memory bandwidth). Not Ryzen, not Radeon Pro – those are different AMD product lines.
-- **Models:** Multimodal models optimized for ROCm. Examples called out by the rules: Llama 3.2 Vision, Qwen-VL family. SignBridge uses `Qwen/Qwen3-VL-…
+- **Models:** Multimodal models optimized for ROCm. Examples called out by the rules: Llama 3.2 Vision, Qwen-VL family. SignBridge uses `Qwen/Qwen3-VL-32B-Instruct` (Qwen-VL family ✓) for sign recognition + `Qwen/Qwen3-8B` for sentence composition + `coqui/XTTS-v2` for speech.
 - **Frameworks:** ROCm + PyTorch + Hugging Face Optimum-AMD + vLLM (per the rules).

 ### Workshop references (provided by AMD)
@@ -165,7 +167,7 @@ Win the AMD Developer Hackathon (LabLab.ai, May 2026), Track 3, with a real-time…
 - Pipeline (concurrent on one MI300X):
 - **Pose extraction:** MediaPipe Holistic (Google) → frame → 543-dim landmark vector
 - **Sign classifier:** trained-from-scratch small transformer over landmark sequences (WLASL Top-100 + ASL fingerspelling alphabet) → sign tokens
-- **Sentence composer:** `…
+- **Sentence composer:** `Qwen/Qwen3-8B` → grammatical English sentence from sign-token stream
 - **TTS:** `coqui/XTTS-v2` → audio
 - **(Stretch) STT:** `openai/whisper-large-v3` → reverse direction (speech → on-screen text)
 - Datasets: [WLASL](https://github.com/dxli94/WLASL) Top-100 subset + ASL fingerspelling alphabet (open)
```
README.md
CHANGED

````diff
@@ -23,17 +23,11 @@ Submission for the **AMD Developer Hackathon** (LabLab.ai, May 2026) – **Track…
 ## How it works

 ```
-webcam frames →
-…
-   │
-   ▼
-Llama-3.1-8B sentence composer
-   │
-   ▼
-Coqui XTTS-v2 → speech
+webcam frames → Qwen3-VL-32B → Qwen3-8B → Coqui XTTS-v2 → speech
+                (sign vision)  (composer)  (TTS)
 ```

-All…
+All three stages run **concurrently on a single AMD Instinct MI300X** via AMD Developer Cloud. Total weights ~34 GB (Qwen3-VL-32B + Qwen3-8B + XTTS-v2) on a 192 GB GPU – fits with margin for KV cache + serving overhead. Both LLMs are Qwen-family, served via vLLM 0.17.1 on ROCm 7.2.

 ## V1 use cases

@@ -44,7 +38,7 @@ V1 is **one-way**: deaf signs → hearing hears. Reverse direction (speech → o…

 ## Why AMD

-The MI300X's 192 GB HBM3 fits the entire pipeline (Qwen3-VL-…
+The MI300X's 192 GB HBM3 fits the entire pipeline (Qwen3-VL-32B + Llama-3.1-8B + XTTS-v2) on one GPU with margin. NVIDIA H100 (80 GB) requires sharding, and the V2 plan to upgrade to a 70B reasoner is impossible on H100 without a 3-GPU cluster. Single-GPU concurrency + 5.3 TB/s memory bandwidth is the actual AMD pitch – practical accessibility tools running globally need the cost-and-availability profile that AMD enables.

 ## Why this matters (business case)

@@ -82,7 +76,8 @@ python -m signbridge.scripts.train_classifier --dataset data/wlasl --epochs 30

 ## Models pulled from Hugging Face Hub

-- `…
+- `Qwen/Qwen3-VL-32B-Instruct` – sign vision (recognizer)
+- `Qwen/Qwen3-8B` – sentence composer
 - `coqui/XTTS-v2` – text-to-speech
 - (V2 stretch) `openai/whisper-large-v3` – for the reverse direction

````
docs/demo-video-script.md
CHANGED

````diff
@@ -80,7 +80,7 @@ Webcam frames → Qwen3-VL-8B (vision) → Llama-3.1-8B (composer) → XTTS-v2 (…
 ```

 **Voice-over:**
-> "Under the hood:…
+> "Under the hood: Qwen3-VL-8B reads each frame, Llama-3.1 composes the sentence, XTTS speaks it – all running concurrently on a single AMD Instinct MI300X. Vision, reasoning, and voice on one GPU."

 **Beat 3B – The MI300X comparison (1:55 – 2:15):**

@@ -155,6 +155,7 @@ Webcam frames → Qwen3-VL-8B (vision) → Llama-3.1-8B (composer) → XTTS-v2 (…
 - [ ] Length 2:00–3:00
 - [ ] Captions visible throughout
 - [ ] AMD Dev Cloud / MI300X mentioned by name ≥3 times
+- [ ] Qwen3-VL mentioned by name ≥2 times (Qwen Special Reward eligibility)
 - [ ] HF Space URL shown on screen at least once
 - [ ] GitHub URL shown on screen at least once
 - [ ] No copyrighted music / footage
````
docs/lablab-submission-form.md
CHANGED

````diff
@@ -7,17 +7,17 @@
 ## Project Title (≤ ~70 chars)

 ```
-SignBridge – Real-time ASL →…
+SignBridge – Real-time ASL → speech, Qwen3-VL on AMD MI300X
 ```

-(…
+(60 characters; leads with Qwen for Qwen Special Reward eligibility.)

 ---

 ## Short Description (≤ 150 chars typical)

 ```
-Two people who couldn't communicate, now can. Real-time ASL → English speech…
+Two people who couldn't communicate, now can. Real-time ASL → English speech, powered by Qwen3-VL on AMD Instinct MI300X.
 ```

 (132 characters.)
@@ -27,11 +27,11 @@ Two people who couldn't communicate, now can. Real-time ASL → English speech v…
 ## Long Description (no hard limit, ~300 words is the sweet spot)

 ```
-SignBridge is a real-time American Sign Language to English speech translator built for the AMD Developer Hackathon, Track 3 (Vision & Multimodal AI).
+SignBridge is a real-time American Sign Language to English speech translator built for the AMD Developer Hackathon, Track 3 (Vision & Multimodal AI). It is powered by Qwen3-VL-8B for visual understanding of signs.

 The user signs at the webcam – either fingerspelled letters (Snapshot tab) or full motion words (Record sign tab) – and SignBridge replies in spoken English. Two people who couldn't communicate, now can.

-Architecture: a multi-stage pipeline (Qwen3-VL-8B for sign recognition…
+Architecture: a multi-stage pipeline (Qwen3-VL-8B for sign recognition – the core intelligence; Llama-3.1-8B for sentence composition; Coqui XTTS-v2 for speech synthesis), running concurrently on a single AMD Instinct MI300X via vLLM. The 192 GB HBM3 of one MI300X holds the entire pipeline with margin – the same workload on NVIDIA H100 needs three GPUs.

 For motion-dependent signs (HELLO, THANK_YOU, PLEASE, EAT) the Record-sign tab captures 1.5 s of webcam, samples 4 evenly-spaced frames, and sends them as a multi-image VLM call with NVIDIA-style sequential frame markers in the prompt – most ASL signs are motion, not held poses, so single-frame approaches fundamentally cannot translate them.

@@ -49,13 +49,13 @@ Built solo by Lucas Loo Tan Yu Heng, May 5–11, 2026.
 Pick from lablab's tag dropdown – these are the tags that match SignBridge:

 **Primary (must-haves):**
+- `Qwen` / `Qwen3-VL` (Qwen3-VL-8B vision recognizer – central; eligible for Qwen Special Reward 10M tokens)
 - `AMD Developer Cloud`
 - `AMD ROCm`
 - `HuggingFace Spaces`

 **Secondary (relevant):**
 - `LLaMA` (Llama-3.1-8B composer)
-- `Qwen` (Qwen3-VL-8B vision)
 - `Gradio`
 - `FastAPI`
 - `Vision`
````
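As a concrete illustration of the Record-sign flow described above (4 evenly-spaced frames, sequential frame markers, closed vocabulary), here is a minimal sketch of such a multi-image call against a vLLM OpenAI-compatible endpoint. The helper names, prompt wording, frame-marker text, and endpoint defaults are assumptions for illustration; this is not the project's actual `signbridge/recognizer/vlm.py` code.

```python
import base64
import os
from openai import OpenAI

def _data_url(jpeg_bytes: bytes) -> str:
    """Encode one sampled webcam frame as a data URL for the chat API."""
    return "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode()

def recognize_burst(frames: list[bytes], vocab: list[str]) -> str:
    """Send sampled frames with sequential frame markers; force a closed sign vocabulary."""
    client = OpenAI(
        base_url=os.getenv("AMD_DEV_CLOUD_BASE_URL", "http://localhost:8000/v1"),
        api_key=os.getenv("AMD_DEV_CLOUD_API_KEY", "EMPTY"),
    )
    content = [
        {
            "type": "text",
            "text": (
                "These are sequential frames of one ASL sign, in temporal order. "
                f"Answer with exactly one token from this vocabulary: {', '.join(vocab)}."
            ),
        }
    ]
    for i, frame in enumerate(frames, start=1):
        content.append({"type": "text", "text": f"Frame {i}:"})  # sequential frame marker
        content.append({"type": "image_url", "image_url": {"url": _data_url(frame)}})
    resp = client.chat.completions.create(
        model=os.getenv("SIGNBRIDGE_VLM_MODEL", "Qwen/Qwen3-VL-32B-Instruct"),
        messages=[{"role": "user", "content": content}],
        temperature=0.0,
        max_tokens=8,
    )
    return (resp.choices[0].message.content or "").strip()
```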
docs/pitch-deck.md
CHANGED

```diff
@@ -117,6 +117,24 @@ The 2–3 minute demo video, looping, autoplay-on-slide-show.

 ---

+## Slide 6.5 – Qwen3-VL is the brain
+
+**Headline:**
+Qwen3-VL-8B-Instruct: the visual intelligence behind every sign.
+
+**Body bullets:**
+- The recognizer is **Qwen3-VL-8B-Instruct** – Alibaba's open Qwen-VL family, served from Hugging Face Hub.
+- We feed it **multi-image bursts** (4 frames over 1.5 s) for motion-dependent signs like HELLO and THANK_YOU – single-frame models fundamentally cannot translate ASL.
+- **Closed-vocabulary forcing** + **sequential frame markers** (NVIDIA video-VLM pattern) keep Qwen on-rails for the 87-token sign vocab. No fine-tuning needed – Qwen3-VL is strong enough zero-shot.
+- Llama-3.1-8B then composes Qwen's tokens into grammatical English; XTTS-v2 speaks it.
+
+**Closer:**
+Qwen3-VL is the only thing in the pipeline making the visual judgement. The rest is plumbing.
+
+*Visual: a single screenshot of `signbridge/recognizer/vlm.py` showing the multi-frame Qwen call, alongside an arrow into a "detected: HELLO (85%)" overlay.*
+
+---
+
 ## Slide 7 – Why this is the right submission for Track 3

 **Headline:**
```
signbridge/composer/sentence.py
CHANGED

```diff
@@ -1,4 +1,4 @@
-"""Llama-3.1-8B…
+"""Qwen3-8B sentence composer (Llama-3.1-8B-compatible by env var).

 Takes a stream of sign tokens (English glosses + fingerspelled letters)
 and composes a grammatical English sentence. Backed by an OpenAI-compatible
@@ -50,12 +50,21 @@ def _resolve_client() -> tuple[object | None, str]:
     """Return (cached client, model_id) based on SIGNBRIDGE_PROVIDER env var."""
     provider = os.getenv("SIGNBRIDGE_PROVIDER", "amd").lower()
     composer_model = os.getenv(
-        "SIGNBRIDGE_COMPOSER_MODEL", "…
+        "SIGNBRIDGE_COMPOSER_MODEL", "Qwen/Qwen3-8B"
     )

     if provider == "amd":
-        …
-        …
+        # Prefer a composer-specific base URL/key (lets us run Qwen-VL on :8000
+        # and the composer on :8001 of the same MI300X). Falls back to the
+        # shared AMD_DEV_CLOUD_BASE_URL when not split.
+        base_url = (
+            os.getenv("SIGNBRIDGE_COMPOSER_BASE_URL")
+            or os.getenv("AMD_DEV_CLOUD_BASE_URL", "")
+        ).rstrip("/")
+        api_key = (
+            os.getenv("SIGNBRIDGE_COMPOSER_API_KEY")
+            or os.getenv("AMD_DEV_CLOUD_API_KEY", "")
+        )
         if not base_url or not api_key:
             logger.info("AMD Dev Cloud not configured; falling back to naive joiner.")
             return None, composer_model
@@ -117,6 +126,9 @@ def compose_sentence(signs: Sequence[str]) -> str:
     if client is None:
         return _naive_join(signs)

+    # Qwen3 reasoning models default to emitting <think>...</think>; disable
+    # via the chat-template kwarg so vLLM serves a direct sentence. Harmless
+    # for non-Qwen3 models (extra_body keys they don't know are ignored).
     try:
         resp = client.chat.completions.create(  # type: ignore[attr-defined]
             model=model,
@@ -126,6 +138,7 @@ def compose_sentence(signs: Sequence[str]) -> str:
             ],
             temperature=0.2,
             max_tokens=120,
+            extra_body={"chat_template_kwargs": {"enable_thinking": False}},
         )
         text = (resp.choices[0].message.content or "").strip()
     except Exception as exc:  # noqa: BLE001 – broad catch is intentional at the boundary
```
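A quick illustration of the composer's contract after this change. The input list mirrors the docstring (glosses plus fingerspelled letters); the printed sentence is a plausible example rather than captured output, and the Qwen3-8B path is only taken when the AMD provider env vars are configured (otherwise the naive joiner runs).

```python
from signbridge.composer.sentence import compose_sentence

# Glosses plus fingerspelled letters, as produced by the recognizer.
signs = ["HELLO", "ME", "WANT", "EAT", "P", "I", "Z", "Z", "A"]

sentence = compose_sentence(signs)
print(sentence)  # e.g. "Hello, I want to eat pizza." (illustrative output)
```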
signbridge/recognizer/vlm.py
CHANGED

```diff
@@ -37,7 +37,7 @@ from signbridge.vocab import VOCAB_SET as _VLM_VOCAB_SET

 logger = logging.getLogger(__name__)

-DEFAULT_VLM_MODEL = os.getenv("SIGNBRIDGE_VLM_MODEL", "Qwen/…
+DEFAULT_VLM_MODEL = os.getenv("SIGNBRIDGE_VLM_MODEL", "Qwen/Qwen3-VL-32B-Instruct")


 @lru_cache(maxsize=4)
```
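Note that `DEFAULT_VLM_MODEL` is resolved with `os.getenv` at module import time, so an override has to be in the environment before `signbridge.recognizer.vlm` is imported. A minimal sketch (the smaller checkpoint is just an example):

```python
import os

# Must be set before the recognizer module is imported; the default is read once at import.
os.environ["SIGNBRIDGE_VLM_MODEL"] = "Qwen/Qwen3-VL-8B-Instruct"  # example override

from signbridge.recognizer import vlm

print(vlm.DEFAULT_VLM_MODEL)  # -> Qwen/Qwen3-VL-8B-Instruct
```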