Qwopus3.5-27B-v3-FP8 — Fixed Metadata Fork

About this fork

This repository is a metadata- and tensor-key-name-fixed copy of Jackrong/Qwopus3.5-27B-v3-FP8. The FP8 weights are bit-identical to the upstream — only the safetensors tensor names and three small JSON files were changed so that the checkpoint loads cleanly out-of-the-box in modern serving stacks (vLLM, SGLang, transformers 4.5x.x).

Why a fork? The upstream FP8 release shipped with VL-style model.language_model.* tensor key prefixes despite being declared as the text-only Qwen3_5ForCausalLM architecture, plus a few transformers 5.x-only metadata fields that broke loaders running transformers 4.5x.x. None of this reflects on the upstream training pipeline — these are quantization/export artifacts that the FP8 conversion picked up from the VL parent structure.

What was changed (4 things, weights untouched)

  1. model.safetensors — every tensor key renamed from model.language_model.* to model.* and rewritten with safetensors.torch.save_file. The tensor values are byte-for-byte identical to the upstream FP8 release.
  2. tokenizer_config.json — dropped the transformers 5.x-only backend field, dropped the TokenizersBackend class reference, dropped audio / vision / image special-token entries (this is a text-only model), and pinned tokenizer_class: PreTrainedTokenizerFast so the tokenizer loads on transformers 4.5x.x.
  3. generation_config.json — removed the transformers_version: "5.5.0" pin so older transformers no longer reject the file.
  4. quantization_info.json — replaced the phantom TokenizersBackend references inside tokenizer_patch with PreTrainedTokenizerFast.

config.json, chat_template.jinja, tokenizer.json, recipe.yaml, and the README content below are passed through unchanged. The framework-side fixes (registering Qwen3_5ForCausalLM / qwen3_5_text in vLLM/SGLang) are deliberately not baked into this repo — they belong in the serving stacks, not in the checkpoint.
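The key-rename pass in item 1 above can be sketched as follows. This is a minimal illustration, not the actual conversion script; `strip_vl_prefix` is a name invented here for the sketch:

```python
# Illustrative sketch of the tensor-key rename. The real pass loads
# model.safetensors with safetensors.torch.load_file, maps every key
# through a function like this, and rewrites the file with
# safetensors.torch.save_file — tensor values are never touched.
VL_PREFIX = "model.language_model."

def strip_vl_prefix(name: str) -> str:
    """Map a VL-style key like model.language_model.layers.0...
    to the text-only form model.layers.0..."""
    if name.startswith(VL_PREFIX):
        return "model." + name[len(VL_PREFIX):]
    return name  # un-prefixed keys (e.g. lm_head.weight) pass through
```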

Validated runtimes

Tested on a single NVIDIA RTX 5090 (32 GB), bf16 activations, FP8 weights, single-slot, --enforce-eager:

| Engine      | Throughput  |
|-------------|-------------|
| vLLM 0.19.0 | 36.23 tok/s |
| SGLang      | 23.6 tok/s  |

vLLM additionally needed a one-line patch to is_tma_supported in vllm/model_executor/layers/fla/ops/utils.py because consumer Blackwell (RTX 5090, CC 12.0) reports compute capability ≥ 9 but the FLA solve_tril TMA descriptors fail with a misleading Triton Error [CUDA]: out of memory at kernel launch. Restricting is_tma_supported to cc[0] == 9 (actual Hopper datacenter only) falls back to the non-TMA kernel and inference works. This is a runtime-side fix and is unrelated to the repository contents.
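The shape of that one-line change looks roughly like this (a sketch only; the actual function body in vllm/model_executor/layers/fla/ops/utils.py differs):

```python
def is_tma_supported(cc: tuple[int, int]) -> bool:
    # Upstream effectively checks cc[0] >= 9, which also matches consumer
    # Blackwell (CC 12.0), where the solve_tril TMA descriptors fail at
    # kernel launch with a misleading CUDA out-of-memory error.
    # Gating on cc[0] == 9 (datacenter Hopper only) makes FLA fall back
    # to the working non-TMA kernel everywhere else.
    return cc[0] == 9
```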

Running on RTX 5090 / consumer Blackwell

If you're on Hopper (H100), Ampere (A100), or any older consumer card (RTX 4090, etc.), point your existing vLLM (≥ 0.19.0) at this repo and you're done — nothing else is needed.

If you're on a consumer Blackwell GPU (RTX 5090 / 5080), you will hit a misleading Triton Error [CUDA]: out of memory from solve_tril at the very first inference, regardless of your --gpu-memory-utilization setting. This is a real consumer-Blackwell silicon/driver quirk that the FLA upstream hasn't picked up yet (TMA descriptors are validated only on Hopper). To work around it, two options:

Option 1 — One-shot patch script (works for any venv-based vLLM install):

pip install --upgrade "vllm>=0.19.0"
curl -sS https://huggingface.co/KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready/raw/main/patches/apply_blackwell_patches.py \
    | python3 -

The script is idempotent, makes two surgical edits to vllm/model_executor/layers/fla/ops/, and exits 0 if everything is already applied. See patches/ for the script source and a full description of what each patch does.

Option 2 — Docker image (build once, deploy anywhere):

A drop-in Dockerfile at docker/Dockerfile takes the upstream vllm/vllm-openai:latest image and applies the patches on top. Build and run:

docker build -t qwopus-fp8-blackwell ./docker

docker run --rm --gpus all -p 8000:8000 \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    qwopus-fp8-blackwell \
    --model KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready \
    --served-model-name qwopus-fp8 \
    --gpu-memory-utilization 0.93 \
    --max-model-len 2048 \
    --max-num-seqs 1 \
    --enforce-eager \
    --dtype bfloat16 \
    --trust-remote-code

See docker/README.md for the full recipe, recommended flags, troubleshooting table, and a curl smoke-test.

Both options are runtime-side fixes for vLLM on consumer Blackwell. The weights and metadata in this repo are identical regardless of which path you take. When the upstream FLA fix lands, both will become unnecessary.

Production readiness

A 15-test smoke / regression suite was run against this checkpoint served by vLLM 0.19.0 on the configuration above. All 15 tests pass (total wall time ~128 s). Repo integrity was independently verified: the SHA-256 of model.safetensors on the Hub matches the local fixed copy byte-for-byte (0e17a0ae27f8852201198241de493b6b14191e584ac5b196a97a5bc67822e0f8), and all seven small text files (configs, tokenizer files, README) are byte-identical between the local staging dir and the Hub copy.
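The integrity check can be reproduced locally with nothing but the stdlib; the expected digest for model.safetensors is the one quoted above, and the file path is whatever your download landed at:

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks (safetensors
    checkpoints are too large to read into memory at once)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Expected for model.safetensors, per the verification above:
EXPECTED = "0e17a0ae27f8852201198241de493b6b14191e584ac5b196a97a5bc67822e0f8"
```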

| #  | Test                                      | Result                                   |
|----|-------------------------------------------|------------------------------------------|
| 1  | Server health (/v1/models)                | ✅                                        |
| 2  | Tokenizer roundtrip (EN + 中文 + 한국어)    | ✅ 18 tokens, exact decode                |
| 3  | Basic instruction following               | ✅ returns PING                           |
| 4  | Simple factual QA                         | ✅ "capital of France" → contains Paris   |
| 5  | Math reasoning (multi-step word problem)  | ✅ correct meeting time ~11:06 AM         |
| 6  | Code generation + execution (is_prime)    | ✅ correct on 11 primes & 8 composites    |
| 7  | Multilingual: Chinese                     | ✅ 503 CJK chars in response              |
| 8  | Multilingual: Korean                      | ✅ 376 Hangul chars in response           |
| 9  | Multi-turn context retention              | ✅ recalled fact across turns             |
| 10 | Determinism @ temperature=0               | ✅ bit-exact across two calls             |
| 11 | Custom stop token honored                 | ✅ finish_reason=stop                     |
| 12 | Clean EOS termination on short answer     | ✅ stops at EOS, not max_tokens           |
| 13 | Streaming (SSE)                           | ✅ 61 chunks, content correct             |
| 14 | Long-form coherence (~1200 tok)           | ✅ 4/4 required keywords covered          |
| 15 | Throughput consistency (4 runs)           | ✅ 36.9 tok/s avg, ±0.1 tok/s spread      |

The test script lives at tests/prod_readiness_qwopus_fp8.py in this repo — it talks to a local vLLM server on http://127.0.0.1:8003 and exits non-zero if any test fails.
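The first check in such a suite is a plain health probe; a minimal stdlib-only sketch against the script's documented default endpoint (this is not the actual test code):

```python
import json
import urllib.request

def server_healthy(base: str = "http://127.0.0.1:8003") -> bool:
    """Return True iff /v1/models answers and lists at least one model."""
    try:
        with urllib.request.urlopen(f"{base}/v1/models", timeout=5) as resp:
            models = json.load(resp).get("data", [])
            return any(m.get("id") for m in models)
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False
```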

Note on reasoning behavior: the model emits a <think>...</think> reasoning block before its final answer, even on trivial prompts. Downstream consumers should either render the <think> block as collapsed reasoning or strip everything before </think>. Reasoning-heavy prompts (math, code) need max_tokens ≥ 1000 to leave room for the <think> block plus the final answer.

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype="auto",   # FP8 weights, bf16 activations
    device_map="auto",
    trust_remote_code=True,
)

For vLLM:

vllm serve KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.93 \
    --max-model-len 2048 \
    --enforce-eager \
    --trust-remote-code
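Once the server is up, a minimal OpenAI-compatible chat request needs only the stdlib. The served model name below assumes the serve command above, which passes no --served-model-name, so the repo id is used:

```python
import json
import urllib.request

payload = {
    "model": "KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 1024,  # leave headroom for the <think> block + answer
    "temperature": 0,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```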

Credits

All training, evaluation, and the underlying FP8 quantization were done by Jackrong. This fork only fixes loading metadata. For the full model card — motivation, training pipeline, HumanEval benchmarks, intended use, limitations, and citation — see the upstream repository:

👉 Jackrong/Qwopus3.5-27B-v3-FP8

Model tree for KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready

Base model: Qwen/Qwen3.5-27B → quantized → this model