Qwopus3.5-27B-v3-FP8 — Fixed Metadata Fork

About this fork

This repository is a metadata- and tensor-key-name-fixed copy of Jackrong/Qwopus3.5-27B-v3-FP8. The FP8 weights are bit-identical to the upstream release — only the safetensors tensor names and three small JSON files were changed so that the checkpoint loads cleanly out of the box in modern serving stacks (vLLM, SGLang, transformers 4.5x).

Why a fork? The upstream FP8 release shipped with VL-style model.language_model.* tensor key prefixes despite being declared as the text-only Qwen3_5ForCausalLM architecture, plus a few transformers 5.x-only metadata fields that broke loaders running transformers 4.5x.x. None of this reflects on the upstream training pipeline — these are quantization/export artifacts that the FP8 conversion picked up from the VL parent structure.

What was changed (4 things, weights untouched)

  1. model.safetensors — every tensor key renamed from model.language_model.* to model.* and rewritten with safetensors.torch.save_file. The tensor values are byte-for-byte identical to the upstream FP8 release.
  2. tokenizer_config.json — dropped the transformers 5.x-only backend field, dropped the TokenizersBackend class reference, dropped audio / vision / image special-token entries (this is a text-only model), and pinned tokenizer_class: PreTrainedTokenizerFast so the tokenizer loads on transformers 4.5x.x.
  3. generation_config.json — removed the transformers_version: "5.5.0" pin so older transformers no longer reject the file.
  4. quantization_info.json — replaced the phantom TokenizersBackend references inside tokenizer_patch with PreTrainedTokenizerFast.

config.json, chat_template.jinja, tokenizer.json, recipe.yaml, and the README content below are passed through unchanged. The framework-side fixes (registering Qwen3_5ForCausalLM / qwen3_5_text in vLLM/SGLang) are deliberately not baked into this repo — they belong in the serving stacks, not in the checkpoint.
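For illustration, the key rename in item 1 amounts to a prefix swap. A minimal sketch — the helper name and the load/save plumbing are assumptions about the (unpublished) export script, not a copy of it:

```python
# Hypothetical sketch of the tensor-key fix: strip the VL-style
# "model.language_model." prefix down to the text-only "model." prefix.
VL_PREFIX = "model.language_model."

def fix_key(key: str) -> str:
    """Map a VL-style tensor key to its Qwen3_5ForCausalLM equivalent."""
    if key.startswith(VL_PREFIX):
        return "model." + key[len(VL_PREFIX):]
    return key  # keys like lm_head.weight pass through unchanged

# The actual rewrite would then be roughly:
#   state = safetensors.torch.load_file("model.safetensors")
#   safetensors.torch.save_file({fix_key(k): v for k, v in state.items()},
#                               "model.safetensors")
```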

Validated runtimes

Tested on a single NVIDIA RTX 5090 (32 GB), bf16 activations, FP8 weights, single-slot, --enforce-eager:

| Engine      | Throughput  |
|-------------|-------------|
| vLLM 0.19.0 | 36.23 tok/s |
| SGLang      | 23.6 tok/s  |

vLLM additionally needed a one-line patch to is_tma_supported in vllm/model_executor/layers/fla/ops/utils.py: consumer Blackwell (RTX 5090, CC 12.0) passes the compute capability ≥ 9 check, but the FLA solve_tril TMA descriptors then fail at kernel launch with a misleading Triton Error [CUDA]: out of memory. Restricting is_tma_supported to cc[0] == 9 (actual Hopper datacenter parts only) falls back to the non-TMA kernel, and inference works. This is a runtime-side fix and is unrelated to the repository contents.
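The shape of that patch is roughly the following — the exact signature of is_tma_supported varies across vLLM versions, so treat this as an illustrative sketch, not a drop-in diff:

```python
def is_tma_supported(cc: tuple) -> bool:
    # Before: `cc[0] >= 9` also matched consumer Blackwell (CC 12.0),
    # where the FLA solve_tril TMA path crashes at kernel launch.
    # After: only Hopper datacenter parts (CC 9.x) take the TMA path;
    # everything else falls back to the working non-TMA kernel.
    return cc[0] == 9
```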

Production readiness

A 15-test smoke / regression suite was run against this checkpoint served by vLLM 0.19.0 on the configuration above. All 15 tests pass (total wall time ~128 s). Repo integrity was independently verified: the SHA-256 of model.safetensors on the Hub matches the local fixed copy byte-for-byte (0e17a0ae27f8852201198241de493b6b14191e584ac5b196a97a5bc67822e0f8), and all seven small text files (configs, tokenizer files, README) are byte-identical between the local staging dir and the Hub copy.
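The integrity check is plain SHA-256 over the file bytes; a standard-library-only example (the helper name is ours, not part of the test suite):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks (safe for multi-GB shards)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()
```

Comparing sha256_of("model.safetensors") against the digest quoted above reproduces the verification.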

| #  | Test | Result |
|----|------|--------|
| 1  | Server health (/v1/models) | ✅ |
| 2  | Tokenizer roundtrip (EN + 中文 + 한국어) | ✅ 18 tokens, exact decode |
| 3  | Basic instruction following | ✅ returns PING |
| 4  | Simple factual QA | ✅ "capital of France" → contains Paris |
| 5  | Math reasoning (multi-step word problem) | ✅ correct meeting time ~11:06 AM |
| 6  | Code generation + execution (is_prime) | ✅ correct on 11 primes & 8 composites |
| 7  | Multilingual: Chinese | ✅ 503 CJK chars in response |
| 8  | Multilingual: Korean | ✅ 376 Hangul chars in response |
| 9  | Multi-turn context retention | ✅ recalled fact across turns |
| 10 | Determinism @ temperature=0 | ✅ bit-exact across two calls |
| 11 | Custom stop token honored | ✅ finish_reason=stop |
| 12 | Clean EOS termination on short answer | ✅ stops at EOS, not max_tokens |
| 13 | Streaming (SSE) | ✅ 61 chunks, content correct |
| 14 | Long-form coherence (~1200 tok) | ✅ 4/4 required keywords covered |
| 15 | Throughput consistency (4 runs) | ✅ 36.9 tok/s avg, ±0.1 tok/s spread |

The test script lives at tests/prod_readiness_qwopus_fp8.py in this repo — it talks to a local vLLM server on http://127.0.0.1:8003 and exits non-zero if any test fails.

Note on reasoning behavior: the model emits a <think>...</think> reasoning block before its final answer, even on trivial prompts. Downstream consumers should either render the <think> block as collapsed reasoning or strip everything before </think>. Reasoning-heavy prompts (math, code) need max_tokens ≥ 1000 to leave room for the <think> block plus the final answer.
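A minimal sketch of the stripping approach (function name is ours; assumes at most one closing tag, which holds for this model's output format):

```python
def strip_reasoning(text: str) -> str:
    """Drop everything up to and including the first </think> tag."""
    before, sep, after = text.partition("</think>")
    return after.lstrip() if sep else text  # no tag -> return unchanged
```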

Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype="auto",   # FP8 weights, bf16 activations
    device_map="auto",
    trust_remote_code=True,
)
```

For vLLM:

```shell
vllm serve KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.93 \
    --max-model-len 2048 \
    --enforce-eager \
    --trust-remote-code
```

Credits

All training, evaluation, and the underlying FP8 quantization were done by Jackrong. This fork only fixes loading metadata. For the full model card — motivation, training pipeline, HumanEval benchmarks, intended use, limitations, and citation — see the upstream repository:

👉 Jackrong/Qwopus3.5-27B-v3-FP8

Model lineage: quantized from the base model Qwen/Qwen3.5-27B.