Huihui-Qwen3.5-9B-abliterated-NVFP4

This repository provides an NVFP4-format derivative based on huihui-ai/Huihui-Qwen3.5-9B-abliterated, which itself is derived from Qwen/Qwen3.5-9B.

The main purpose of this release is research on inference depth, runtime behavior, and deployment characteristics of NVFP4 models. It is intended for study, benchmarking, and controlled experimentation rather than general-purpose safe deployment.

This model card intentionally preserves the provenance of the original Huihui release. The original fine-tuning characteristics, output tendencies, and risk profile should be assumed to carry over unless you independently verify otherwise in your own environment.

Provenance

  • Original base model: Qwen/Qwen3.5-9B
  • Fine-tuned source model: huihui-ai/Huihui-Qwen3.5-9B-abliterated
  • Quantization scope: text-side mlp.gate_proj and mlp.up_proj layers converted to NVFP4 with calibration
  • Packaging strategy: original multimodal wrapper and processor files retained, with the language model weights routed to the quantized checkpoint
  • Visual stack and non-targeted weights remain sourced from the original 9B release
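A minimal sketch of the per-block quantization scheme behind NVFP4 may help clarify the quantization scope above. The E2M1 codebook and 16-element micro-block size follow NVIDIA's published NVFP4 description; the scale handling is simplified here to a plain float (real NVFP4 stores FP8 E4M3 block scales plus a tensor-level scale), and the rounding is illustrative rather than the exact kernel behavior.

```python
import numpy as np

# Non-negative values representable by FP4 E2M1 (sign stored separately).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # NVFP4 micro-block size

def quantize_block(x: np.ndarray):
    """Map one 16-element block to signed E2M1 codes plus one scale."""
    amax = np.abs(x).max()
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    scaled = np.abs(x) / scale
    # Round each magnitude to the nearest representable E2M1 value.
    idx = np.abs(scaled[:, None] - E2M1_VALUES[None, :]).argmin(axis=1)
    return np.sign(x), idx, scale

def dequantize_block(sign, idx, scale):
    return sign * E2M1_VALUES[idx] * scale

x = np.linspace(-3.0, 3.0, BLOCK)
sign, idx, scale = quantize_block(x)
xq = dequantize_block(sign, idx, scale)
print("max abs error:", float(np.abs(x - xq).max()))
```

Because the scale is shared per 16-element block, outliers inside a block directly widen the quantization step for their neighbors, which is why calibration on the targeted gate_proj and up_proj layers matters.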

Intended Use

  • Research on inference depth and reasoning behavior
  • Local benchmarking and runtime tuning
  • Controlled experiments on NVFP4 deployment
  • Long-context and VRAM-budget studies on RTX PRO 6000 Blackwell-class GPUs

Responsibility Notice

Use of this model is entirely at your own risk.

  • You are solely responsible for how you use the model and how you handle its outputs.
  • The model may produce unsafe, controversial, incorrect, or otherwise unsuitable content.
  • This repository is published for research purposes and does not provide safety guarantees.
  • Do not assume fitness for production, commercial, legal, medical, educational, or public-facing use without your own review and safeguards.

Notes

The fine-tuning dataset operates in non-think mode, which may make the model's thinking behavior simpler.

This 9B release keeps the original multimodal configuration and processor files, but only the text-side language model gate_proj and up_proj layers were processed into NVFP4. The image stack was not quantized.

down_proj, attention layers, and the visual stack remain unquantized in this package.

After re-saving, some weight names differ from the original export layout, and the packaged checkpoint does not preserve the original MTP weights in the same form. Treat this repository as an inference-focused research artifact rather than a drop-in archival mirror of the source release.
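
Because weight names differ from the original export, it can be useful to inspect tensor names and dtypes directly rather than assume the source layout. A safetensors file starts with an 8-byte little-endian header length followed by a JSON header, so a stdlib-only reader suffices; the tiny file written below (including the `weight_packed` tensor name) is a synthetic stand-in for the real shards, not a claim about this checkpoint's exact naming.

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file (names, dtypes, shapes)."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))  # 8-byte LE header length
        return json.loads(f.read(n))

# Build a minimal synthetic .safetensors file for demonstration.
header = {
    "model.layers.0.mlp.gate_proj.weight_packed": {
        "dtype": "U8", "shape": [4, 2], "data_offsets": [0, 8]
    }
}
blob = json.dumps(header).encode()
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(blob)))
    f.write(blob)
    f.write(b"\x00" * 8)  # tensor data region

for name, meta in read_safetensors_header("demo.safetensors").items():
    print(name, meta["dtype"], meta["shape"])
```

Running the same reader over the real shards shows which tensors carry packed U8/FP8 quantized data versus untouched BF16 weights.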

Reference Runtime Results

The following values are provided as reference measurements from local experiments on an NVIDIA RTX PRO 6000 Blackwell Workstation Edition GPU. They should be treated as environment-specific guidance rather than strict guarantees.

BF16 vs NVFP4 at 8K context

  Format   Idle VRAM   Peak VRAM   TTFT       tok/s
  BF16     57.2 GB     63.1 GB     44.84 ms   102.06
  NVFP4    57.4 GB     63.2 GB     50.87 ms   121.01

The initial 8K comparison showed that model-load memory dropped, but total end-to-end VRAM stayed close because multimodal wrapper costs, runtime reservation, and KV cache overhead dominated at short context lengths.
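
A back-of-envelope calculation illustrates why weight savings alone need not move peak VRAM much. The quantized fraction below is an assumption for illustration (only text-side gate_proj and up_proj are NVFP4 here), not a figure from this release; NVFP4 storage is modeled as 4 bits per weight plus one FP8 scale byte per 16 weights.

```python
total_params = 9e9
quantized_fraction = 1 / 3          # assumption, not from the model card
bf16_bytes = 2.0                    # bytes per BF16 weight
nvfp4_bytes = 0.5 + 1 / 16          # 4-bit value + shared FP8 block scale

baseline = total_params * bf16_bytes
quantized = (total_params * (1 - quantized_fraction) * bf16_bytes
             + total_params * quantized_fraction * nvfp4_bytes)
print(f"BF16 weights : {baseline / 1e9:.1f} GB")
print(f"Mixed NVFP4  : {quantized / 1e9:.1f} GB")
```

Even a multi-gigabyte drop in weight storage can be masked in end-to-end measurements when the visual stack, runtime reservation, and KV cache are held constant, consistent with the 8K numbers above.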

NVFP4 long-context budget sweep

  KV budget   Idle VRAM   Peak VRAM   Notes
  10 GB       25.3 GB     31.2 GB     Stable at 32K, 64K, 128K, 256K runtime settings
  8 GB        23.3 GB     29.1 GB     Stable at 32K, 64K, 128K, 256K runtime settings
  6 GB        21.2 GB     27.1 GB     Stable at 32K, 64K, 128K, 256K runtime settings

Actual long-prompt probes were also run at roughly 32K, 64K, and 128K token input lengths. All runs completed, and retrieval checks at 64K and 128K were stable across the tested budgets.

Usage Warnings

  • Risk of Sensitive or Controversial Outputs: This model's safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.
  • Not Suitable for All Audiences: Due to limited content filtering, the model's outputs may be inappropriate for public settings, underage users, or applications requiring high security.
  • Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.
  • Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.
  • Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.
  • No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. The original fine-tuned source and this NVFP4 derivative should both be treated as research artifacts, and the publisher bears no responsibility for any consequences arising from their use.

Donation

Your donation helps us continue development and improvement; even the price of a cup of coffee makes a difference.
  • bitcoin:
bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
  • Support our work on Ko-fi!