# Huihui-Qwen3.5-9B-abliterated-NVFP4
This repository provides an NVFP4-format derivative based on huihui-ai/Huihui-Qwen3.5-9B-abliterated, which itself is derived from Qwen/Qwen3.5-9B.
The main purpose of this release is research on inference depth, runtime behavior, and deployment characteristics of NVFP4 models. It is intended for study, benchmarking, and controlled experimentation rather than general-purpose safe deployment.
This model card intentionally preserves the provenance of the original Huihui release. The original fine-tuning characteristics, output tendencies, and risk profile should be assumed to carry over unless you independently verify otherwise in your own environment.
## Provenance
- Original base model: Qwen/Qwen3.5-9B
- Fine-tuned source model: huihui-ai/Huihui-Qwen3.5-9B-abliterated
- Quantization scope: text-side `mlp.gate_proj` and `mlp.up_proj` layers converted to NVFP4 with calibration
- Packaging strategy: original multimodal wrapper and processor files retained, with the language model weights routed to the quantized checkpoint
- Visual stack and non-targeted weights remain sourced from the original 9B release
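The quantization scope above can be expressed as a simple name filter over the checkpoint's modules. The module names below are illustrative placeholders for a Qwen-style decoder, not the actual checkpoint layout; only text-side `mlp.gate_proj` and `mlp.up_proj` entries are selected, while attention, `down_proj`, and the visual stack are passed over.

```python
import re

# Hypothetical module names as they might appear in a Qwen-style
# multimodal checkpoint; a real state dict has many more entries.
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "visual.blocks.0.mlp.fc1",
]

# NVFP4 targets: text-side mlp.gate_proj / mlp.up_proj only.
TARGET = re.compile(r"^model\..*\.mlp\.(gate_proj|up_proj)$")

targets = [name for name in module_names if TARGET.match(name)]
print(targets)
```

A filter like this is how per-layer quantization recipes are typically scoped: everything not matched keeps its original precision.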
## Intended Use
- Research on inference depth and reasoning behavior
- Local benchmarking and runtime tuning
- Controlled experiments on NVFP4 deployment
- Long-context and VRAM-budget studies on RTX PRO 6000 Blackwell class GPUs
## Responsibility Notice
Use of this model is entirely at your own risk.
- You are solely responsible for how you use the model and how you handle its outputs.
- The model may produce unsafe, controversial, incorrect, or otherwise unsuitable content.
- This repository is published for research purposes and does not provide safety guarantees.
- Do not assume fitness for production, commercial, legal, medical, educational, or public-facing use without your own review and safeguards.
## Notes
The fine-tuning dataset was produced in non-think mode, which may make the model's thinking behavior simpler.
This 9B release keeps the original multimodal configuration and processor files, but only the text-side language model gate_proj and up_proj layers were processed into NVFP4. The image stack was not quantized.
down_proj, attention layers, and the visual stack remain unquantized in this package.
After saving the model, some weight naming differs from the original export layout, and the packaged checkpoint does not preserve the original MTP weights in the same form. Treat this repository as an inference-focused research artifact rather than a drop-in archival mirror of the source release.
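Because the packaged checkpoint renames some weights and does not preserve the original MTP weights in the same form, it can be useful to diff the key sets of the two exports before assuming tensor-level compatibility. This is a minimal sketch with made-up key names; the actual keys differ.

```python
# Hypothetical weight keys from the original export and the repackaged
# NVFP4 checkpoint. Real state dicts have thousands of entries; the
# set-difference pattern is the same.
original_keys = {
    "model.embed_tokens.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "mtp.head.weight",  # placeholder for an MTP tensor
}
packaged_keys = {
    "model.embed_tokens.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.gate_proj.weight_scale",  # NVFP4 scale tensor
}

missing = sorted(original_keys - packaged_keys)   # dropped or renamed
added = sorted(packaged_keys - original_keys)     # new quantization tensors
print("missing:", missing)
print("added:", added)
```

Anything in `missing` that your inference stack expects (such as MTP heads) must come from the original release instead.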
## Reference Runtime Results
The following values are provided as reference measurements from local experiments on an NVIDIA RTX PRO 6000 Blackwell Workstation Edition GPU. They should be treated as environment-specific guidance rather than strict guarantees.
### BF16 vs NVFP4 at 8K context
| Format | Idle VRAM | Peak VRAM | TTFT | tok/s |
|---|---|---|---|---|
| BF16 | 57.2 GB | 63.1 GB | 44.84 ms | 102.06 |
| NVFP4 | 57.4 GB | 63.2 GB | 50.87 ms | 121.01 |
The initial 8K comparison showed that model-load memory dropped, but total end-to-end VRAM stayed close because multimodal wrapper costs, runtime reservation, and KV cache overhead dominated at short context lengths.
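From the table above, the relative deltas can be computed directly; this is plain arithmetic on the reported measurements, not new data.

```python
# Measurements from the 8K-context comparison table above.
bf16_tps, nvfp4_tps = 102.06, 121.01
bf16_ttft_ms, nvfp4_ttft_ms = 44.84, 50.87

speedup_pct = (nvfp4_tps - bf16_tps) / bf16_tps * 100
ttft_delta_ms = nvfp4_ttft_ms - bf16_ttft_ms

print(f"throughput gain: {speedup_pct:.1f}%")   # ~18.6% more tok/s
print(f"TTFT increase:   {ttft_delta_ms:.2f} ms")
```

So at 8K context the NVFP4 package trades a few milliseconds of time-to-first-token for roughly a fifth more decode throughput.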
### NVFP4 long-context budget sweep
| KV budget | Idle VRAM | Peak VRAM | Notes |
|---|---|---|---|
| 10G | 25.3 GB | 31.2 GB | Stable at 32K, 64K, 128K, 256K runtime settings |
| 8G | 23.3 GB | 29.1 GB | Stable at 32K, 64K, 128K, 256K runtime settings |
| 6G | 21.2 GB | 27.1 GB | Stable at 32K, 64K, 128K, 256K runtime settings |
Actual long-prompt probes were also run at roughly 32K, 64K, and 128K token input lengths. All runs completed, and the 64K and 128K retrieval checks were stable across the tested budgets.
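To reason about why a fixed KV budget caps usable context, the cache footprint can be estimated from the architecture. The layer, head, and dimension numbers below are placeholders, not the actual Qwen3.5-9B configuration; substitute the real config values from the checkpoint to size your own budget.

```python
# Rough KV-cache sizing sketch. Architecture numbers are illustrative
# placeholders, NOT the real Qwen3.5-9B config.
layers, kv_heads, head_dim = 36, 8, 128
bytes_per_elem = 2  # bf16/fp16 KV cache


def kv_bytes(tokens: int) -> int:
    # 2x for separate key and value tensors, per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


per_token = kv_bytes(1)
print(f"{per_token} bytes per token")
print(f"{kv_bytes(128_000) / 2**30:.1f} GiB at 128K tokens")
```

Dividing a KV budget (e.g. 10 GiB) by the per-token cost gives the maximum resident context for that configuration.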
## Usage Warnings
- Risk of Sensitive or Controversial Outputs: This model's safety filtering has been significantly reduced, potentially generating sensitive, controversial, or inappropriate content. Users should exercise caution and rigorously review generated outputs.
- Not Suitable for All Audiences: Due to limited content filtering, the model's outputs may be inappropriate for public settings, underage users, or applications requiring high security.
- Legal and Ethical Responsibilities: Users must ensure their usage complies with local laws and ethical standards. Generated content may carry legal or ethical risks, and users are solely responsible for any consequences.
- Research and Experimental Use: It is recommended to use this model for research, testing, or controlled environments, avoiding direct use in production or public-facing commercial applications.
- Monitoring and Review Recommendations: Users are strongly advised to monitor model outputs in real time and conduct manual reviews when necessary to prevent the dissemination of inappropriate content.
- No Default Safety Guarantees: Unlike standard models, this model has not undergone rigorous safety optimization. The original fine-tuned source and this NVFP4 derivative should both be treated as research artifacts, and the publisher bears no responsibility for any consequences arising from their use.
## Donation
Your donation helps us continue development and improvement; even a cup of coffee makes a difference.
- Bitcoin: `bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge`
- Support our work on Ko-fi!