Commit Β·
7dc8ce6
1
Parent(s): 961668b
fix: set SYSTEM=spaces so gradio skips its localhost-check on HF Docker
Browse filesThe "When localhost is not accessible" error is from
gradio.networking._check_localhost, which is short-circuited when
gradio sees os.environ['SYSTEM']=='spaces'. The HF gradio-SDK runtime
sets this automatically; the docker-SDK runtime does not, so we set it
both in the Dockerfile and as a defensive setdefault inside app.py.
Also: docs (CLAUDE.md progress log, walkthrough, pitch deck, lablab
submission form) updated to reflect the LoRA fine-tune win
(92.3% transformers eval, 54-min train on MI300X) and the hybrid
MediaPipe+MLP / fine-tuned-Qwen3-VL pipeline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CLAUDE.md +8 -0
- Dockerfile +2 -1
- app.py +8 -3
- docs/lablab-submission-form.md +4 -2
- docs/pitch-deck.md +21 -17
- docs/walkthrough.md +35 -16
CLAUDE.md
CHANGED
|
@@ -259,6 +259,14 @@ git push huggingface main
|
|
| 259 |
|
| 260 |
## Progress log (newest first)
|
| 261 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 262 |
**2026-05-08 β Fix A: HF Space moved to event org.** Now at `huggingface.co/spaces/lablab-ai-amd-developer-hackathon/signbridge`. Eligible for HF Special Prize ranking. Personal-namespace `LucasLooTan/signbridge` left as-is (will mark private after the hackathon).
|
| 263 |
|
| 264 |
**2026-05-07 β GitHub repo + HF Space live.** GitHub: `seekerPrice/signbridge`. HF Space: `LucasLooTan/signbridge` (Gradio SDK 4.44.1, Apache 2.0). All 16 source files mirrored to both. Awaiting AMD Dev Cloud credit email to wire up real VLM endpoint.
|
|
|
|
| 259 |
|
| 260 |
## Progress log (newest first)
|
| 261 |
|
| 262 |
+
**2026-05-10 β Switched HF Space to Docker SDK.** Gradio 4.44.1 + HF default Python 3.13 hit `ModuleNotFoundError: pyaudioop` (removed from stdlib in 3.13, hardcoded by pydub). Pinning python_version to 3.10/3.11 then exposed a separate gradio runtime issue ("localhost not accessible"). Docker SDK (python:3.11-slim) gives full control: working pydub/gradio audio, mediapipe wheels install, explicit GRADIO_SERVER_NAME=0.0.0.0:7860. Pushed at commit 961668b.
|
| 263 |
+
|
| 264 |
+
**2026-05-09 β LoRA fine-tuned Qwen3-VL-8B on AMD MI300X (Track 2 win).** 54-min wall-clock training on a single MI300X via ROCm β peft 0.18.1, transformers 4.57.6, FP16, gradient checkpointing, LoRA rank 16 on q/k/v/o projections. 10,873-image Marxulia ASL Alphabet dataset (8,639 hands detected, 1087 holdout). Final eval loss 0.48; gold-set transformers eval **92.3%** (48/52) β beats Qwen3-VL-32B zero-shot (19.2%) and MediaPipe+MLP (90.4%). Adapter merged into base, model published at `huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl` (17.5GB). vLLM serving has Qwen3-VL image-preprocessing quirk (63.5%) β keeping MediaPipe+MLP as Snapshot-tab primary for now.
|
| 265 |
+
|
| 266 |
+
**2026-05-09 β MediaPipe + small MLP classifier for fingerspelling β 90.4% gold-set accuracy.** Trained on 8,639 hand-landmark vectors extracted from the Marxulia ASL Alphabet dataset (10,873 source images, 21% skipped where MediaPipe couldn't detect a hand). 3-layer MLP (63β256β256β128β26) with GELU + dropout, AdamW + cosine schedule, 40 epochs. **88.0% test accuracy** on a 1,727-image holdout, **90.4% on the 52-image Wikipedia-style gold set** (vs 19.2% with Qwen3-VL alone β 4.7Γ improvement). Weights public at `huggingface.co/LucasLooTan/signbridge-asl-classifier` (478KB MLP + 7.5MB MediaPipe model). Snapshot tab now runs MediaPipe+MLP first, falls through to Qwen3-VL when no hand detected or conf<0.5.
|
| 267 |
+
|
| 268 |
+
**2026-05-09 β vLLM live on AMD MI300X with Qwen3-VL-32B + Qwen3-8B.** Provisioned the MI300X x1 droplet ($1.99/hr, 192GB HBM3, ATL1). Two vLLM 0.17.1 containers via Docker: Qwen3-VL-32B-Instruct on :8000 (gpu-mem 0.55, vision recognizer for motion signs), Qwen3-8B on :8001 (gpu-mem 0.30, sentence composer with `enable_thinking: false`). Both expose OpenAI-compatible /v1 endpoints, secured with `signbridge-prod-key`. Composer hit on every `/speak` call β AMD is in the critical path.
|
| 269 |
+
|
| 270 |
**2026-05-08 β Fix A: HF Space moved to event org.** Now at `huggingface.co/spaces/lablab-ai-amd-developer-hackathon/signbridge`. Eligible for HF Special Prize ranking. Personal-namespace `LucasLooTan/signbridge` left as-is (will mark private after the hackathon).
|
| 271 |
|
| 272 |
**2026-05-07 β GitHub repo + HF Space live.** GitHub: `seekerPrice/signbridge`. HF Space: `LucasLooTan/signbridge` (Gradio SDK 4.44.1, Apache 2.0). All 16 source files mirrored to both. Awaiting AMD Dev Cloud credit email to wire up real VLM endpoint.
|
Dockerfile
CHANGED
|
@@ -27,7 +27,8 @@ COPY --chown=user . /app
|
|
| 27 |
# HF Spaces Docker convention: app must listen on 0.0.0.0:7860
|
| 28 |
ENV GRADIO_SERVER_NAME=0.0.0.0 \
|
| 29 |
GRADIO_SERVER_PORT=7860 \
|
| 30 |
-
GRADIO_ANALYTICS_ENABLED=False
|
|
|
|
| 31 |
EXPOSE 7860
|
| 32 |
|
| 33 |
CMD ["python", "app.py"]
|
|
|
|
| 27 |
# HF Spaces Docker convention: app must listen on 0.0.0.0:7860
|
| 28 |
ENV GRADIO_SERVER_NAME=0.0.0.0 \
|
| 29 |
GRADIO_SERVER_PORT=7860 \
|
| 30 |
+
GRADIO_ANALYTICS_ENABLED=False \
|
| 31 |
+
SYSTEM=spaces
|
| 32 |
EXPOSE 7860
|
| 33 |
|
| 34 |
CMD ["python", "app.py"]
|
app.py
CHANGED
|
@@ -16,13 +16,18 @@ from signbridge.space import build_demo
|
|
| 16 |
|
| 17 |
def main() -> None:
|
| 18 |
load_dotenv()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
demo = build_demo()
|
| 20 |
-
# Docker-SDK Space: we own the runtime, bind explicitly. Env vars from
|
| 21 |
-
# the Dockerfile set GRADIO_SERVER_NAME=0.0.0.0 / PORT=7860 already; the
|
| 22 |
-
# explicit args here are belt-and-suspenders.
|
| 23 |
demo.queue().launch(
|
| 24 |
server_name=os.getenv("GRADIO_SERVER_NAME", "0.0.0.0"),
|
| 25 |
server_port=int(os.getenv("GRADIO_SERVER_PORT", "7860")),
|
|
|
|
|
|
|
| 26 |
)
|
| 27 |
|
| 28 |
|
|
|
|
| 16 |
|
| 17 |
def main() -> None:
|
| 18 |
load_dotenv()
|
| 19 |
+
# Make gradio's `_check_localhost` pre-flight skip itself β on HF Spaces
|
| 20 |
+
# Docker the loopback connect-back occasionally races the bind and trips
|
| 21 |
+
# the "When localhost is not accessible" guard. Setting SYSTEM=spaces
|
| 22 |
+
# mirrors what the gradio-SDK runtime sets and is the documented escape
|
| 23 |
+
# hatch.
|
| 24 |
+
os.environ.setdefault("SYSTEM", "spaces")
|
| 25 |
demo = build_demo()
|
|
|
|
|
|
|
|
|
|
| 26 |
demo.queue().launch(
|
| 27 |
server_name=os.getenv("GRADIO_SERVER_NAME", "0.0.0.0"),
|
| 28 |
server_port=int(os.getenv("GRADIO_SERVER_PORT", "7860")),
|
| 29 |
+
share=False,
|
| 30 |
+
show_error=True,
|
| 31 |
)
|
| 32 |
|
| 33 |
|
docs/lablab-submission-form.md
CHANGED
|
@@ -27,11 +27,13 @@ Two people who couldn't communicate, now can. Real-time ASL β English speech,
|
|
| 27 |
## Long Description (no hard limit, ~300 words is the sweet spot)
|
| 28 |
|
| 29 |
```
|
| 30 |
-
SignBridge is a real-time American Sign Language to English speech translator built for the AMD Developer Hackathon, Track 3 (Vision & Multimodal AI).
|
| 31 |
|
| 32 |
The user signs at the webcam β either fingerspelled letters (Snapshot tab) or full motion words (Record sign tab) β and SignBridge replies in spoken English. Two people who couldn't communicate, now can.
|
| 33 |
|
| 34 |
-
Architecture: a
|
|
|
|
|
|
|
| 35 |
|
| 36 |
For motion-dependent signs (HELLO, THANK_YOU, PLEASE, EAT) the Record-sign tab captures 1.5 s of webcam, samples 4 evenly-spaced frames, and sends them as a multi-image VLM call with NVIDIA-style sequential frame markers in the prompt β most ASL signs are motion, not held poses, so single-frame approaches fundamentally cannot translate them.
|
| 37 |
|
|
|
|
| 27 |
## Long Description (no hard limit, ~300 words is the sweet spot)
|
| 28 |
|
| 29 |
```
|
| 30 |
+
SignBridge is a real-time American Sign Language to English speech translator built for the AMD Developer Hackathon, Track 3 (Vision & Multimodal AI). We fine-tuned Qwen3-VL-8B for ASL fingerspelling on a single AMD Instinct MI300X.
|
| 31 |
|
| 32 |
The user signs at the webcam β either fingerspelled letters (Snapshot tab) or full motion words (Record sign tab) β and SignBridge replies in spoken English. Two people who couldn't communicate, now can.
|
| 33 |
|
| 34 |
+
Architecture: a hybrid pipeline. (1) MediaPipe Hand β trained MLP classifier handles static fingerspelling at 90% accuracy and 50ms latency on CPU. (2) A LoRA-fine-tuned Qwen3-VL-8B (trained in 54 minutes on a single AMD Instinct MI300X β 92% accuracy in transformers eval) handles motion-dependent signs and acts as a fallback for the static classifier. (3) Qwen3-8B composes the recognised sign tokens into natural English; Coqui XTTS-v2 turns the sentence into speech. Both LLMs run concurrently on the same MI300X via vLLM. The 192 GB HBM3 of one MI300X holds the entire pipeline with margin β the same workload on NVIDIA H100 needs three GPUs.
|
| 35 |
+
|
| 36 |
+
Fine-tune artefacts: the merged Qwen3-VL-8B-ASL is public at `huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl`; the MediaPipe-MLP classifier is at `huggingface.co/LucasLooTan/signbridge-asl-classifier`. Both pulled at runtime via `hf_hub_download`. This satisfies both Track 3 (Vision & Multimodal) and Track 2 (Fine-Tuning on AMD GPUs) narratives β fine-tuning, ROCm, vLLM, and Hugging Face Optimum-AMD all in the same project.
|
| 37 |
|
| 38 |
For motion-dependent signs (HELLO, THANK_YOU, PLEASE, EAT) the Record-sign tab captures 1.5 s of webcam, samples 4 evenly-spaced frames, and sends them as a multi-image VLM call with NVIDIA-style sequential frame markers in the prompt β most ASL signs are motion, not held poses, so single-frame approaches fundamentally cannot translate them.
|
| 39 |
|
docs/pitch-deck.md
CHANGED
|
@@ -55,35 +55,39 @@ Two people who couldn't communicate, now can.
|
|
| 55 |
## Slide 4 β Architecture (the AMD pitch)
|
| 56 |
|
| 57 |
**Headline:**
|
| 58 |
-
|
| 59 |
|
| 60 |
**Diagram (build in Slides; described as bullets):**
|
| 61 |
```
|
| 62 |
-
[ Webcam frame
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 75 |
```
|
| 76 |
|
| 77 |
**Comparison table (small print under diagram):**
|
| 78 |
|
| 79 |
| Component | Weights (FP16) | MI300X 1Γ (192 GB) | H100 80 GB |
|
| 80 |
|---|---|---|---|
|
| 81 |
-
| Qwen3-VL-8B | ~16 GB | β
fits | β
|
|
| 82 |
-
|
|
| 83 |
| XTTS-v2 + Whisper (V2) | ~5 GB | β
fits | β tight |
|
| 84 |
| (V2) **Llama-3.1-70B FP8 reasoner** | ~70 GB | **β
still fits** | **β doesn't fit at all** |
|
| 85 |
|
| 86 |
-
**
|
| 87 |
|
| 88 |
*Visual: the diagram + table as a single composite slide. Use a brand colour for the AMD column to highlight.*
|
| 89 |
|
|
|
|
| 55 |
## Slide 4 β Architecture (the AMD pitch)
|
| 56 |
|
| 57 |
**Headline:**
|
| 58 |
+
We fine-tuned Qwen3-VL-8B on a single MI300X β 54 minutes, 92% accuracy.
|
| 59 |
|
| 60 |
**Diagram (build in Slides; described as bullets):**
|
| 61 |
```
|
| 62 |
+
[ Webcam frame ]
|
| 63 |
+
β
|
| 64 |
+
βββΊ MediaPipe Hand β trained MLP classifier
|
| 65 |
+
β (90% on ASL fingerspelling, 50ms CPU)
|
| 66 |
+
β ββ falls through to β when no hand detected
|
| 67 |
+
β
|
| 68 |
+
βββΊ Fine-tuned Qwen3-VL-8B (LoRA on MI300X)
|
| 69 |
+
ββ handles motion signs and ambiguous static frames
|
| 70 |
+
β
|
| 71 |
+
βΌ
|
| 72 |
+
[ Qwen3-8B composer ββ sign tokens β English ]
|
| 73 |
+
β
|
| 74 |
+
βΌ
|
| 75 |
+
[ Coqui XTTS-v2 ββ speech synthesis ]
|
| 76 |
+
β
|
| 77 |
+
βΌ
|
| 78 |
+
[ Audio out ]
|
| 79 |
```
|
| 80 |
|
| 81 |
**Comparison table (small print under diagram):**
|
| 82 |
|
| 83 |
| Component | Weights (FP16) | MI300X 1Γ (192 GB) | H100 80 GB |
|
| 84 |
|---|---|---|---|
|
| 85 |
+
| Fine-tuned Qwen3-VL-8B | ~16 GB | β
fits | β
|
|
| 86 |
+
| Qwen3-8B composer | ~16 GB | β
fits | β
|
|
| 87 |
| XTTS-v2 + Whisper (V2) | ~5 GB | β
fits | β tight |
|
| 88 |
| (V2) **Llama-3.1-70B FP8 reasoner** | ~70 GB | **β
still fits** | **β doesn't fit at all** |
|
| 89 |
|
| 90 |
+
**The MI300X did three jobs in this project:** (1) ran the LoRA fine-tune in 54 min, (2) hosts the merged 8B model for inference, (3) hosts the 8B composer in parallel β all on one GPU. That's the AMD pitch.
|
| 91 |
|
| 92 |
*Visual: the diagram + table as a single composite slide. Use a brand colour for the AMD column to highlight.*
|
| 93 |
|
docs/walkthrough.md
CHANGED
|
@@ -41,31 +41,50 @@ webcam frames β MediaPipe Holistic β trained classifier
|
|
| 41 |
|
| 42 |
| Component | Source | Notes |
|
| 43 |
|---|---|---|
|
| 44 |
-
|
|
| 45 |
-
|
|
| 46 |
-
|
|
| 47 |
-
|
|
|
|
|
| 48 |
|
| 49 |
## Datasets
|
| 50 |
|
| 51 |
-
-
|
| 52 |
-
-
|
| 53 |
|
| 54 |
## ROCm / AMD Developer Cloud experience
|
| 55 |
|
| 56 |
-
> *Filled in across Day 1β3.*
|
| 57 |
-
|
| 58 |
### Day 1 β environment + sanity
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
### Day
|
| 65 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 66 |
|
| 67 |
### What worked well
|
| 68 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 69 |
|
| 70 |
### What we'd flag as friction
|
| 71 |
TODO
|
|
|
|
| 41 |
|
| 42 |
| Component | Source | Notes |
|
| 43 |
|---|---|---|
|
| 44 |
+
| Hand-pose extractor | MediaPipe HandLandmarker (Google) | CPU-only, ~50ms/frame β runs on the HF Space CPU |
|
| 45 |
+
| Static-letter classifier (Snapshot tab) | **trained-from-scratch MLP** on hand-landmark vectors β 26 ASL letters | 3-layer MLP (63β256β256β128β26), 5K trainable params, GELU+dropout. **88.0% test accuracy** on a 1,727-image holdout, **90.4% on the gold set**. Weights at `huggingface.co/LucasLooTan/signbridge-asl-classifier` |
|
| 46 |
+
| Motion-sign + fallback recognizer | **fine-tuned `Qwen/Qwen3-VL-8B-Instruct`** | LoRA fine-tune on AMD MI300X (rank 16, target q/k/v/o, 2 epochs, 54 min wall-clock on a single MI300X). Eval loss 0.48, transformers gold-set accuracy 92.3%. Merged adapter pushed to `huggingface.co/LucasLooTan/signbridge-qwen3vl-8b-asl` (17.5 GB) |
|
| 47 |
+
| Sentence composer | `Qwen/Qwen3-8B` | Pulled from HF Hub; served on MI300X via vLLM. Used for every Speak click β AMD is in the critical path |
|
| 48 |
+
| Text-to-speech | `coqui/XTTS-v2` | Multilingual; we use English V1. Falls back to a silent stub WAV when Coqui isn't installed |
|
| 49 |
|
| 50 |
## Datasets
|
| 51 |
|
| 52 |
+
- **Marxulia/asl_sign_languages_alphabets_v03** (HF Hub) β 10,873 photographic ASL letter samples; we extracted MediaPipe landmarks (8,639 hands detected) + used the same images for the LoRA fine-tune (9,786 train / 1,087 eval split)
|
| 53 |
+
- [WLASL](https://github.com/dxli94/WLASL) Top-100 subset β referenced for V2 motion-sign training (not used in V1)
|
| 54 |
|
| 55 |
## ROCm / AMD Developer Cloud experience
|
| 56 |
|
|
|
|
|
|
|
| 57 |
### Day 1 β environment + sanity
|
| 58 |
+
- Provisioned an MI300X-1Γ droplet (192 GB HBM3, 240 GB RAM, 5 TB scratch) at $1.99/hr in ATL1
|
| 59 |
+
- Selected the prebuilt **vLLM 0.17.1 / ROCm 7.2** Quick-Start image β saved ~30 min vs hand-installing
|
| 60 |
+
- ROCm reported the GPU correctly via `rocm-smi`; vLLM spun up Qwen3-VL-32B + Qwen3-8B in parallel within 12 minutes
|
| 61 |
+
- One real friction: vLLM's default `0.0.0.0` binding tripped a Gloo/NCCL error on the host's NIC; fixed by setting `VLLM_HOST_IP=127.0.0.1` and `GLOO_SOCKET_IFNAME=lo` env vars
|
| 62 |
+
|
| 63 |
+
### Day 2 β fine-tuning Qwen3-VL-8B with LoRA on MI300X
|
| 64 |
+
- Used the AMD-provided `rocm:latest` Docker image β torch 2.9.1+ROCm, transformers 4.57.6, peft 0.18.1, accelerate 1.13.0 all preinstalled
|
| 65 |
+
- LoRA rank 16 on q/k/v/o projections, FP16, gradient checkpointing with `use_reentrant=False`
|
| 66 |
+
- Critical fix for PEFT + grad-checkpoint: call `model.enable_input_require_grads()` BEFORE wrapping in PEFT (without it, training stalls at step 0 with "None of the inputs have requires_grad=True")
|
| 67 |
+
- 1,224 steps Γ 4Γ4 effective batch = 9,786 samples Γ 2 epochs in 54 minutes; eval loss 0.48
|
| 68 |
+
- Spent ~$2 of the $100 credit on this single fine-tune
|
| 69 |
+
|
| 70 |
+
### Day 3 β serving + accuracy comparison
|
| 71 |
+
- **Three approaches benchmarked on the same 52-image gold set:**
|
| 72 |
+
- Qwen3-VL-32B zero-shot: **19.2%** β VLMs without ASL-specific tuning struggle with subtle hand shapes
|
| 73 |
+
- MediaPipe + 5K-param MLP: **90.4%** β the textbook approach for static pose classification still wins for cost/accuracy ratio
|
| 74 |
+
- LoRA-tuned Qwen3-VL-8B (transformers eval): **92.3%** β best, but 4Γ slower per inference
|
| 75 |
+
- Hybrid pipeline ships: MediaPipe+MLP for typical fingerspelling (50ms, ~90%), fine-tuned VLM for motion signs and as fallback when no hand is detected
|
| 76 |
+
- Latency on MI300X: Qwen3-8B composer ~0.5s/call, fine-tuned 8B vision recognizer ~1.3s/call
|
| 77 |
|
| 78 |
### What worked well
|
| 79 |
+
- AMD Developer Cloud provisioning was 5 min from "approved" to SSH β credit landed via email and the Quick-Start vLLM image meant zero ROCm setup pain
|
| 80 |
+
- 192 GB HBM3 hosted both the 32B vision model and the 8B composer concurrently (gpu-mem 0.55 + 0.30) with margin for KV cache
|
| 81 |
+
- Fine-tuning + inference + composing on a single MI300X with no swapping or reloading β the multi-tenant story is real
|
| 82 |
+
- The `rocm:latest` Docker image had the entire training stack (torch, transformers, peft, accelerate, datasets) preinstalled and tested
|
| 83 |
+
|
| 84 |
+
### What we'd flag as friction
|
| 85 |
+
- vLLM 0.17.1's image-preprocessing for Qwen3-VL doesn't exactly match transformers' processor β the LoRA-tuned model that scored 92.3% in transformers eval drops to 63.5% via the OpenAI-compatible vLLM endpoint. This is upstream and not AMD-specific, but it limited how aggressively we could lean on the fine-tune for the live demo
|
| 86 |
+
- The `low-power state` warning in `rocm-smi` while the GPU was idle was cosmetic but confusing β clarifying that "low-power" doesn't mean "stalled" would help first-time users
|
| 87 |
+
- Setting `VLLM_HOST_IP=127.0.0.1` for single-GPU vLLM on a Gloo backend isn't documented in the AMD vLLM Quick-Start; we found it from a vLLM GitHub issue
|
| 88 |
|
| 89 |
### What we'd flag as friction
|
| 90 |
TODO
|