# Qwopus3.5-27B-v3-FP8 — Fixed Metadata Fork

## About this fork
This repository is a metadata- and tensor-key-name-fixed copy of
Jackrong/Qwopus3.5-27B-v3-FP8. The FP8 weights are bit-identical to the
upstream release; only the safetensors tensor names and three small JSON
files were changed so that the checkpoint loads cleanly out of the box in
modern serving stacks (vLLM, SGLang, transformers 4.5x).

## Why a fork?

The upstream FP8 release shipped with VL-style `model.language_model.*`
tensor key prefixes despite being declared as the text-only
`Qwen3_5ForCausalLM` architecture, plus a few transformers 5.x-only metadata
fields that broke loaders running transformers 4.5x.x. None of this reflects
on the upstream training pipeline: these are quantization/export artifacts
that the FP8 conversion picked up from the VL parent structure.
## What was changed (4 things, weights untouched)
- `model.safetensors`: every tensor key renamed `model.language_model.*` → `model.*` and rewritten with `safetensors.torch.save_file`. The tensor values are byte-for-byte identical to the upstream FP8 release.
- `tokenizer_config.json`: dropped the transformers 5.x-only `backend` field, dropped the `TokenizersBackend` class reference, dropped audio / vision / image special-token entries (this is a text-only model), and pinned `tokenizer_class: PreTrainedTokenizerFast` so the tokenizer loads on transformers 4.5x.x.
- `generation_config.json`: removed the `transformers_version: "5.5.0"` pin so older transformers no longer reject the file.
- `quantization_info.json`: replaced the phantom `TokenizersBackend` references inside `tokenizer_patch` with `PreTrainedTokenizerFast`.
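The key rename in the first item can be sketched as follows. This is an illustration, not the exact script used for this repo; `fix_checkpoint` and `strip_vl_prefix` are hypothetical names, and a single-shard `model.safetensors` is assumed.

```python
def strip_vl_prefix(key: str) -> str:
    """Rename model.language_model.* -> model.*; leave other keys alone."""
    prefix = "model.language_model."
    if key.startswith(prefix):
        return "model." + key[len(prefix):]
    return key

def fix_checkpoint(src: str, dst: str) -> None:
    # Lazy import so the pure rename logic above works without safetensors.
    from safetensors.torch import load_file, save_file
    tensors = load_file(src)
    # Values are passed through untouched; only the keys change.
    save_file({strip_vl_prefix(k): v for k, v in tensors.items()}, dst)
```

Rewriting with `save_file` also regenerates the safetensors header, which is why the fixed file carries a new SHA-256 even though the tensor bytes are unchanged only at the value level.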
`config.json`, `chat_template.jinja`, `tokenizer.json`, `recipe.yaml`, and the
README content below are passed through unchanged. The framework-side fixes
(registering `Qwen3_5ForCausalLM` / `qwen3_5_text` in vLLM/SGLang) are
deliberately not baked into this repo — they belong in the serving stacks,
not in the checkpoint.
## Validated runtimes

Tested on a single NVIDIA RTX 5090 (32 GB), bf16 activations, FP8 weights,
single-slot, `--enforce-eager`:
| Engine | Throughput |
|---|---|
| vLLM 0.19.0 | 36.23 tok/s |
| SGLang | 23.6 tok/s |
vLLM additionally needed a one-line patch to `is_tma_supported` in
`vllm/model_executor/layers/fla/ops/utils.py`, because consumer Blackwell
(RTX 5090, CC 12.0) reports compute capability ≥ 9 but the FLA `solve_tril`
TMA descriptors fail with a misleading
`Triton Error [CUDA]: out of memory` at kernel launch. Restricting
`is_tma_supported` to `cc[0] == 9` (actual Hopper datacenter only) falls back
to the non-TMA kernel and inference works. This is a runtime-side fix and is
unrelated to the repository contents.
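The shape of that guard change can be sketched like this. The simplified signature is an assumption — the real function in `vllm/model_executor/layers/fla/ops/utils.py` may take different arguments in your vLLM version, so locate it and adapt rather than copy:

```python
def is_tma_supported(cc: tuple[int, int]) -> bool:
    """Decide whether to use the TMA kernel path (sketch, not vLLM source).

    Before the patch the check was effectively `cc[0] >= 9`, which also
    matches consumer Blackwell (CC 12.0), where the solve_tril TMA
    descriptors fail at launch. Restrict to actual Hopper (CC 9.x).
    """
    return cc[0] == 9
```

With this guard, CC 12.0 devices take the non-TMA fallback kernel and inference proceeds.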
## Production readiness

A 15-test smoke / regression suite was run against this checkpoint served by
vLLM 0.19.0 on the configuration above. All 15 tests pass (total wall time
~128 s). Repo integrity was independently verified: the SHA-256 of
`model.safetensors` on the Hub matches the local fixed copy byte-for-byte
(`0e17a0ae27f8852201198241de493b6b14191e584ac5b196a97a5bc67822e0f8`), and all
seven small text files (configs, tokenizer files, README) are byte-identical
between the local staging dir and the Hub copy.
| # | Test | Result |
|---|---|---|
| 1 | Server health (`/v1/models`) | ✅ |
| 2 | Tokenizer roundtrip (EN + 中文 + 한국어) | ✅ 18 tokens, exact decode |
| 3 | Basic instruction following | ✅ returns PING |
| 4 | Simple factual QA | ✅ "capital of France" → contains Paris |
| 5 | Math reasoning (multi-step word problem) | ✅ correct meeting time ~11:06 AM |
| 6 | Code generation + execution (`is_prime`) | ✅ correct on 11 primes & 8 composites |
| 7 | Multilingual: Chinese | ✅ 503 CJK chars in response |
| 8 | Multilingual: Korean | ✅ 376 Hangul chars in response |
| 9 | Multi-turn context retention | ✅ recalled fact across turns |
| 10 | Determinism @ temperature=0 | ✅ bit-exact across two calls |
| 11 | Custom stop token honored | ✅ finish_reason=stop |
| 12 | Clean EOS termination on short answer | ✅ stops at EOS, not max_tokens |
| 13 | Streaming (SSE) | ✅ 61 chunks, content correct |
| 14 | Long-form coherence (~1200 tok) | ✅ 4/4 required keywords covered |
| 15 | Throughput consistency (4 runs) | ✅ 36.9 tok/s avg, ±0.1 tok/s spread |
The test script lives at `tests/prod_readiness_qwopus_fp8.py` in this repo;
it talks to a local vLLM server on `http://127.0.0.1:8003` and exits
non-zero if any test fails.
**Note on reasoning behavior:** the model emits a `<think>...</think>`
reasoning block before its final answer, even on trivial prompts. Downstream
consumers should either render the `<think>` block as collapsed reasoning or
strip everything up to and including `</think>`. Reasoning-heavy prompts
(math, code) need `max_tokens ≥ 1000` to leave room for the `<think>` block
plus the final answer.
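The stripping option above can be a one-liner. A minimal sketch (`strip_think` is a hypothetical helper name); if the closing tag is absent, generation was likely truncated mid-reasoning, so the text is returned unchanged:

```python
def strip_think(text: str) -> str:
    """Return only the final answer after a <think>...</think> block."""
    head, sep, tail = text.partition("</think>")
    # No closing tag found: reasoning was cut off, keep the raw text.
    return tail.lstrip() if sep else text
```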
## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype="auto",  # FP8 weights, bf16 activations
    device_map="auto",
    trust_remote_code=True,
)
```
For vLLM:

```bash
vllm serve KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.93 \
  --max-model-len 2048 \
  --enforce-eager \
  --trust-remote-code
```
## Credits
All training, evaluation, and the underlying FP8 quantization were done by Jackrong. This fork only fixes loading metadata. For the full model card (motivation, training pipeline, HumanEval benchmarks, intended use, limitations, and citation), see the upstream repository: Jackrong/Qwopus3.5-27B-v3-FP8.