# Qwopus3.5-27B-v3-FP8 — Fixed Metadata Fork
## About this fork
This repository is a metadata- and tensor-key-name-fixed copy of
`Jackrong/Qwopus3.5-27B-v3-FP8`. The FP8 weights are bit-identical to the
upstream — only the safetensors tensor names and three small JSON files were
changed so that the checkpoint loads cleanly out of the box in modern serving
stacks (vLLM, SGLang, transformers 4.5x).

### Why a fork?

The upstream FP8 release shipped with VL-style `model.language_model.*`
tensor key prefixes despite being declared as the text-only
`Qwen3_5ForCausalLM` architecture, plus a few transformers 5.x-only metadata
fields that broke loaders running transformers 4.5x.x. None of this reflects
on the upstream training pipeline — these are quantization/export artifacts
that the FP8 conversion picked up from the VL parent structure.
## What was changed (4 things, weights untouched)
- `model.safetensors` — every tensor key renamed `model.language_model.*` → `model.*` and rewritten with `safetensors.torch.save_file`. The tensor values are byte-for-byte identical to the upstream FP8 release.
- `tokenizer_config.json` — dropped the transformers 5.x-only `backend` field, dropped the `TokenizersBackend` class reference, dropped audio / vision / image special-token entries (this is a text-only model), and pinned `tokenizer_class: PreTrainedTokenizerFast` so the tokenizer loads on transformers 4.5x.x.
- `generation_config.json` — removed the `transformers_version: "5.5.0"` pin so older transformers no longer reject the file.
- `quantization_info.json` — replaced the phantom `TokenizersBackend` references inside `tokenizer_patch` with `PreTrainedTokenizerFast`.
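The tensor-key rename described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual conversion script: it assumes the checkpoint fits in host memory and ignores safetensors metadata handling.

```python
import os

def fix_key(key: str) -> str:
    """Rewrite a VL-style 'model.language_model.*' key to 'model.*'."""
    prefix = "model.language_model."
    if key.startswith(prefix):
        return "model." + key[len(prefix):]
    return key  # lm_head.weight etc. pass through unchanged

def rename_checkpoint(src: str, dst: str) -> None:
    # Heavy imports kept local so the pure renaming logic above stays
    # importable without torch/safetensors installed.
    from safetensors.torch import load_file, save_file
    tensors = load_file(src)                      # {name: torch.Tensor}
    fixed = {fix_key(k): v for k, v in tensors.items()}
    save_file(fixed, dst)                         # values untouched, only keys change

if __name__ == "__main__" and os.path.exists("model.safetensors"):
    rename_checkpoint("model.safetensors", "model.fixed.safetensors")
```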
`config.json`, `chat_template.jinja`, `tokenizer.json`, `recipe.yaml`, and the
README content below are passed through unchanged. The framework-side fixes
(registering `Qwen3_5ForCausalLM` / `qwen3_5_text` in vLLM/SGLang) are
deliberately not baked into this repo — they belong in the serving stacks,
not in the checkpoint.
## Validated runtimes
Tested on a single NVIDIA RTX 5090 (32 GB), bf16 activations, FP8 weights,
single-slot, `--enforce-eager`:
| Engine | Throughput |
|---|---|
| vLLM 0.19.0 | 36.23 tok/s |
| SGLang | 23.6 tok/s |
vLLM additionally needed a one-line patch to `is_tma_supported` in
`vllm/model_executor/layers/fla/ops/utils.py`, because consumer Blackwell
(RTX 5090, CC 12.0) reports compute capability ≥ 9 but the FLA `solve_tril`
TMA descriptors fail with a misleading
`Triton Error [CUDA]: out of memory` at kernel launch. Restricting
`is_tma_supported` to `cc[0] == 9` (actual Hopper datacenter only) falls back
to the non-TMA kernel and inference works. This is a runtime-side fix and is
unrelated to the repository contents.
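The tightened capability check amounts to something like the sketch below. Names and signature are illustrative — the real function in vLLM queries the device itself rather than taking the capability as an argument.

```python
def is_tma_supported(compute_capability: tuple[int, int]) -> bool:
    # Upstream effectively treats cc >= (9, 0) as TMA-capable, which wrongly
    # includes consumer Blackwell (RTX 5090 reports CC 12.0). Restricting the
    # check to major version 9 keeps only Hopper datacenter parts and lets
    # everything else fall back to the non-TMA kernel.
    return compute_capability[0] == 9
```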
## Running on RTX 5090 / consumer Blackwell
If you're on Hopper (H100), Ampere (A100), or any older consumer card (RTX 4090, etc.), point your existing vLLM (≥ 0.19.0) at this repo and you're done — nothing else is needed.
If you're on a consumer Blackwell GPU (RTX 5090 / 5080), you will hit a
misleading `Triton Error [CUDA]: out of memory` from `solve_tril` at the very
first inference, regardless of your `--gpu-memory-utilization` setting. This
is a real consumer-Blackwell silicon/driver quirk that the FLA upstream
hasn't picked up yet (TMA descriptors are validated only on Hopper). There
are two ways to work around it:
**Option 1 — One-shot patch script** (works for any venv-based vLLM install):

```bash
pip install --upgrade "vllm>=0.19.0"
curl -sS https://huggingface.co/KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready/raw/main/patches/apply_blackwell_patches.py \
  | python3 -
```

The script is idempotent, makes two surgical edits to
`vllm/model_executor/layers/fla/ops/`, and exits 0 if everything is already
applied. See `patches/` for the script source and a full description of what
each patch does.
**Option 2 — Docker image** (build once, deploy anywhere):

A drop-in Dockerfile at `docker/Dockerfile` takes the upstream
`vllm/vllm-openai:latest` image and applies the patches on top.
Build and run:

```bash
docker build -t qwopus-fp8-blackwell ./docker
docker run --rm --gpus all -p 8000:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  qwopus-fp8-blackwell \
  --model KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready \
  --served-model-name qwopus-fp8 \
  --gpu-memory-utilization 0.93 \
  --max-model-len 2048 \
  --max-num-seqs 1 \
  --enforce-eager \
  --dtype bfloat16 \
  --trust-remote-code
```
See `docker/README.md` for the full recipe, recommended flags,
troubleshooting table, and a curl smoke-test.
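A minimal smoke test against the container started above might look like the following sketch. The port and served model name are taken from the `docker run` flags; the request is only sent when executed as a script, so the payload-building logic stands on its own.

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    # Standard OpenAI-compatible /v1/chat/completions request body.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }

def smoke_test(base_url: str = "http://127.0.0.1:8000") -> str:
    payload = build_chat_payload("qwopus-fp8", "Reply with the single word PING.")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    try:
        print(smoke_test())
    except OSError:
        print("server not reachable -- start the container first")
```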
Both options are runtime-side fixes for vLLM on consumer Blackwell. The weights and metadata in this repo are identical regardless of which path you take. When the upstream FLA fix lands, both will become unnecessary.
## Production readiness
A 15-test smoke / regression suite was run against this checkpoint served by
vLLM 0.19.0 on the configuration above. All 15 tests pass (total wall time
~128 s). Repo integrity was independently verified: the SHA-256 of
`model.safetensors` on the Hub matches the local fixed copy byte-for-byte
(`0e17a0ae27f8852201198241de493b6b14191e584ac5b196a97a5bc67822e0f8`), and all
seven small text files (configs, tokenizer files, README) are byte-identical
between the local staging dir and the Hub copy.
| # | Test | Result |
|---|---|---|
| 1 | Server health (`/v1/models`) | ✅ |
| 2 | Tokenizer roundtrip (EN + 中文 + 한국어) | ✅ 18 tokens, exact decode |
| 3 | Basic instruction following | ✅ returns PING |
| 4 | Simple factual QA | ✅ "capital of France" → contains Paris |
| 5 | Math reasoning (multi-step word problem) | ✅ correct meeting time ~11:06 AM |
| 6 | Code generation + execution (`is_prime`) | ✅ correct on 11 primes & 8 composites |
| 7 | Multilingual: Chinese | ✅ 503 CJK chars in response |
| 8 | Multilingual: Korean | ✅ 376 Hangul chars in response |
| 9 | Multi-turn context retention | ✅ recalled fact across turns |
| 10 | Determinism @ temperature=0 | ✅ bit-exact across two calls |
| 11 | Custom stop token honored | ✅ finish_reason=stop |
| 12 | Clean EOS termination on short answer | ✅ stops at EOS, not max_tokens |
| 13 | Streaming (SSE) | ✅ 61 chunks, content correct |
| 14 | Long-form coherence (~1200 tok) | ✅ 4/4 required keywords covered |
| 15 | Throughput consistency (4 runs) | ✅ 36.9 tok/s avg, ±0.1 tok/s spread |
The test script lives at `tests/prod_readiness_qwopus_fp8.py` in this repo —
it talks to a local vLLM server on `http://127.0.0.1:8003` and exits non-zero
if any test fails.
**Note on reasoning behavior:** the model emits a `<think>...</think>`
reasoning block before its final answer, even on trivial prompts. Downstream
consumers should either render the `<think>` block as collapsed reasoning or
strip everything before `</think>`. Reasoning-heavy prompts (math, code) need
`max_tokens ≥ 1000` to leave room for the `<think>` block plus the final
answer.
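Stripping the reasoning block can be done with a small helper. This is a sketch: it assumes at most one leading `<think>` block and returns the text unchanged when no closing tag is present (e.g. a truncated generation):

```python
def strip_think(text: str) -> str:
    """Return only the final answer, dropping a leading <think>...</think> block."""
    end = text.find("</think>")
    if end == -1:
        return text.strip()  # no closing tag -> leave the text as-is
    return text[end + len("</think>"):].strip()
```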
## Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype="auto",  # FP8 weights, bf16 activations
    device_map="auto",
    trust_remote_code=True,
)
```
For vLLM:

```bash
vllm serve KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.93 \
  --max-model-len 2048 \
  --enforce-eager \
  --trust-remote-code
```
## Credits
All training, evaluation, and the underlying FP8 quantization were done by Jackrong. This fork only fixes loading metadata. For the full model card — motivation, training pipeline, HumanEval benchmarks, intended use, limitations, and citation — see the upstream repository, `Jackrong/Qwopus3.5-27B-v3-FP8`.