Qwopus3.5-27B-v3-FP8 — Fixed Metadata Fork

About this fork

This repository is a metadata- and tensor-key-name-fixed copy of Jackrong/Qwopus3.5-27B-v3-FP8. The FP8 weights are bit-identical to the upstream release — only the safetensors tensor names and three small JSON files were changed so that the checkpoint loads cleanly out of the box in modern serving stacks (vLLM, SGLang, transformers 4.5x).

Why a fork? The upstream FP8 release shipped with VL-style model.language_model.* tensor key prefixes despite being declared as the text-only Qwen3_5ForCausalLM architecture, plus a few transformers 5.x-only metadata fields that broke loaders running transformers 4.5x.x. None of this reflects on the upstream training pipeline — these are quantization/export artifacts that the FP8 conversion picked up from the VL parent structure.

What was changed (4 things, weights untouched)

  1. model.safetensors — every tensor key renamed from model.language_model.* to model.* and rewritten with safetensors.torch.save_file. The tensor values are byte-for-byte identical to the upstream FP8 release.
  2. tokenizer_config.json — dropped the transformers 5.x-only backend field, dropped the TokenizersBackend class reference, dropped audio / vision / image special-token entries (this is a text-only model), and pinned tokenizer_class: PreTrainedTokenizerFast so the tokenizer loads on transformers 4.5x.x.
  3. generation_config.json — removed the transformers_version: "5.5.0" pin so older transformers no longer reject the file.
  4. quantization_info.json — replaced the phantom TokenizersBackend references inside tokenizer_patch with PreTrainedTokenizerFast.

config.json, chat_template.jinja, tokenizer.json, recipe.yaml, and the README content below are passed through unchanged. The framework-side fixes (registering Qwen3_5ForCausalLM / qwen3_5_text in vLLM/SGLang) are deliberately not baked into this repo — they belong in the serving stacks, not in the checkpoint.
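For illustration, the key rename in item 1 amounts to a prefix swap. A minimal sketch — the helper name and the load/save plumbing are assumptions about the (unpublished) export script, not a copy of it:

```python
# Hypothetical sketch of the tensor-key fix: strip the VL-style
# "model.language_model." prefix down to the text-only "model." prefix.
VL_PREFIX = "model.language_model."

def fix_key(key: str) -> str:
    """Map a VL-style tensor key to its Qwen3_5ForCausalLM equivalent."""
    if key.startswith(VL_PREFIX):
        return "model." + key[len(VL_PREFIX):]
    return key  # keys like lm_head.weight pass through unchanged

# The actual rewrite would then be roughly:
#   state = safetensors.torch.load_file("model.safetensors")
#   safetensors.torch.save_file({fix_key(k): v for k, v in state.items()},
#                               "model.safetensors")
```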

Validated runtimes

Tested on a single NVIDIA RTX 5090 (32 GB), bf16 activations, FP8 weights, single-slot, --enforce-eager:

| Engine      | Throughput  |
|-------------|-------------|
| vLLM 0.19.0 | 36.23 tok/s |
| SGLang      | 23.6 tok/s  |

vLLM additionally needed a one-line patch to is_tma_supported in vllm/model_executor/layers/fla/ops/utils.py: consumer Blackwell (RTX 5090, CC 12.0) passes the compute capability ≥ 9 check, but the FLA solve_tril TMA descriptors then fail at kernel launch with a misleading Triton Error [CUDA]: out of memory. Restricting is_tma_supported to cc[0] == 9 (actual Hopper datacenter parts only) falls back to the non-TMA kernel, and inference works. This is a runtime-side fix and is unrelated to the repository contents.
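The shape of that patch is roughly the following — the exact signature of is_tma_supported varies across vLLM versions, so treat this as an illustrative sketch, not a drop-in diff:

```python
def is_tma_supported(cc: tuple) -> bool:
    # Before: `cc[0] >= 9` also matched consumer Blackwell (CC 12.0),
    # where the FLA solve_tril TMA path crashes at kernel launch.
    # After: only Hopper datacenter parts (CC 9.x) take the TMA path;
    # everything else falls back to the working non-TMA kernel.
    return cc[0] == 9
```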

Production readiness

A 15-test smoke / regression suite was run against this checkpoint served by vLLM 0.19.0 on the configuration above. All 15 tests pass (total wall time ~128 s). Repo integrity was independently verified: the SHA-256 of model.safetensors on the Hub matches the local fixed copy byte-for-byte (0e17a0ae27f8852201198241de493b6b14191e584ac5b196a97a5bc67822e0f8), and all seven small text files (configs, tokenizer files, README) are byte-identical between the local staging dir and the Hub copy.
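The integrity check is plain SHA-256 over the file bytes; a standard-library-only example (the helper name is ours, not part of the test suite):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks (safe for multi-GB shards)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()
```

Comparing sha256_of("model.safetensors") against the digest quoted above reproduces the verification.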

| #  | Test | Result |
|----|------|--------|
| 1  | Server health (/v1/models) | ✅ |
| 2  | Tokenizer roundtrip (EN + 中文 + 한국어) | ✅ 18 tokens, exact decode |
| 3  | Basic instruction following | ✅ returns PING |
| 4  | Simple factual QA | ✅ "capital of France" → contains Paris |
| 5  | Math reasoning (multi-step word problem) | ✅ correct meeting time ~11:06 AM |
| 6  | Code generation + execution (is_prime) | ✅ correct on 11 primes & 8 composites |
| 7  | Multilingual: Chinese | ✅ 503 CJK chars in response |
| 8  | Multilingual: Korean | ✅ 376 Hangul chars in response |
| 9  | Multi-turn context retention | ✅ recalled fact across turns |
| 10 | Determinism @ temperature=0 | ✅ bit-exact across two calls |
| 11 | Custom stop token honored | ✅ finish_reason=stop |
| 12 | Clean EOS termination on short answer | ✅ stops at EOS, not max_tokens |
| 13 | Streaming (SSE) | ✅ 61 chunks, content correct |
| 14 | Long-form coherence (~1200 tok) | ✅ 4/4 required keywords covered |
| 15 | Throughput consistency (4 runs) | ✅ 36.9 tok/s avg, ±0.1 tok/s spread |

The test script lives at tests/prod_readiness_qwopus_fp8.py in this repo — it talks to a local vLLM server on http://127.0.0.1:8003 and exits non-zero if any test fails.

Note on reasoning behavior: the model emits a <think>...</think> reasoning block before its final answer, even on trivial prompts. Downstream consumers should either render the <think> block as collapsed reasoning or strip everything before </think>. Reasoning-heavy prompts (math, code) need max_tokens ≥ 1000 to leave room for the <think> block plus the final answer.
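A minimal sketch of the stripping approach (function name is ours; assumes at most one closing tag, which holds for this model's output format):

```python
def strip_reasoning(text: str) -> str:
    """Drop everything up to and including the first </think> tag."""
    before, sep, after = text.partition("</think>")
    return after.lstrip() if sep else text  # no tag -> return unchanged
```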

Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype="auto",   # FP8 weights, bf16 activations
    device_map="auto",
    trust_remote_code=True,
)
```

For vLLM:

```shell
vllm serve KyleHessling1/Qwopus3.5-27B-v3-FP8-vllm-ready \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.93 \
    --max-model-len 2048 \
    --enforce-eager \
    --trust-remote-code
```

Credits

All training, evaluation, and the underlying FP8 quantization were done by Jackrong. This fork only fixes loading metadata. For the full model card — motivation, training pipeline, HumanEval benchmarks, intended use, limitations, and citation — see the upstream repository:

👉 Jackrong/Qwopus3.5-27B-v3-FP8

Model lineage: quantized from the base model Qwen/Qwen3.5-27B.