Qwen3.5-9B-guardrailed-v2-GGUF

A surgically weight-edited version of Qwen/Qwen3.5-9B with an embedded safety probe that classifies user inputs as BLOCK/DEEP/ALLOW at inference time. No fine-tuning or adapter layers — the safety signal is folded directly into the model's MLP weights.

Quantized to Q4_K_M GGUF format (5.2GB) for use with llama.cpp / llama-server.

Model Details

Model Description

This model adds a lightweight guardrail layer to Qwen3.5-9B using contrastive activation engineering. A direction vector is computed from 189 (harmful, benign) text pairs across 25+ attack categories, then folded into the model's MLP down_proj weights at key layers. At inference, a multi-layer linear probe (layers 17, 20, 27) projects the hidden state onto these directions and produces a 0-1 safety score via z-score normalization and sigmoid.

The approach is training-free — no gradient descent, no fine-tuning data, no LoRA. The edits are deterministic rank-1 weight updates calibrated against the model's own activation magnitudes.

Developed by: votal-ai
Model type: Causal language model with embedded safety probe
Language(s): English (probe trained on English attack/benign pairs)
License: AGPL-3.0
Base model: Qwen/Qwen3.5-9B

Model Sources

Repository: votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF

Uses

Direct Use

Use as a safety-aware text generation model with llama.cpp or llama-server. The probe config (probe_config_9b.json) enables an external classifier to read hidden states and route requests:

BLOCK (score > 0.55): Input detected as an attack — reject or return a canned refusal
DEEP (0.45-0.55): Uncertain — route to a secondary LLM check
ALLOW (score < 0.45): Input is benign — proceed with generation

from huggingface_hub import hf_hub_download

# Download the GGUF
hf_hub_download(
    repo_id="votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF",
    filename="Qwen3.5-9B-guardrailed-Q4_K_M.gguf",
    local_dir="./models"
)

# Download the probe config
hf_hub_download(
    repo_id="votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF",
    filename="probe_config_9b.json",
    local_dir="./models"
)

# Run with llama-server
llama-server -m ./models/Qwen3.5-9B-guardrailed-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080

Downstream Use

Can be integrated into any LLM serving pipeline as a first-pass safety filter. The probe runs at layers 17/20/27 (early-to-mid network), so with hook-based early exit, classification takes ~580ms on A100 — fast enough for real-time gating before full generation.

Out-of-Scope Use

Not a standalone content filter. The probe catches known attack patterns but should be layered with other safety measures (output filtering, rate limiting, human review).
English only. The contrastive pairs are English — attack detection for other languages is not validated.
Not adversarially robust. Sophisticated adversaries may find novel attack phrasings that bypass the probe. The DEEP path exists for this reason.

Bias, Risks, and Limitations

False positives on ambiguous phrasing. Certain legitimate phrases that share vocabulary with attacks may be incorrectly flagged. Known examples: "You are a great assistant thank you" (0.67), "Can you explain this like I am five" (0.55), "Disregard the return value" (0.54). These route to the DEEP path, not outright blocking.
Probe direction is static. The safety signal is baked into the weights at edit time. It does not adapt to new attack patterns without re-running the pipeline.
Quantization may shift probe scores. The probe was calibrated on the bf16 model. Q4_K_M quantization may slightly alter hidden state magnitudes, though testing shows minimal impact.

Recommendations

Always pair with output-side safety filtering — the probe only classifies inputs, not generated outputs.
Implement the DEEP path as a secondary check (e.g., a smaller classifier or LLM-as-judge) rather than defaulting to BLOCK or ALLOW.
Monitor false positive rates in production and retrain the probe direction if new benign patterns are being flagged.

How to Get Started with the Model

Probe scoring (Python)

import torch, json

# Load probe config
with open("probe_config_9b.json") as f:
    cfg = json.load(f)

# Multi-layer z-score probe
def classify(hidden_states):
    """Score from model hidden states. Returns (score, action)."""
    combined = 0.0
    for li, w in zip(cfg["probe_layers"], cfg["probe_weights"]):
        direction = torch.tensor(cfg["layer_directions"][str(li)])
        h = hidden_states[li][0, -1, :].float()
        raw = (h @ direction).item()
        stats = cfg["layer_stats"][str(li)]
        z = (raw - stats["mean"]) / stats["std"]
        combined += w * z

    score = torch.sigmoid(torch.tensor(combined * cfg["probe_scale"])).item()

    if score > cfg["threshold_block"]:
        return score, "BLOCK"
    elif score < cfg["threshold_allow"]:
        return score, "ALLOW"
    else:
        return score, "DEEP"

Evaluation

Testing Data

88 test cases across 30 categories:

53 attack prompts: prompt injection, jailbreaking, DAN, social engineering, obfuscation, payload splitting, code injection, bad chain reasoning, and more
35 benign prompts: general coding questions, security education, tricky vocabulary (dev jargon like "kill process", "hack together", "bypass cache"), conversational queries

Metrics

Metric	Value
Overall accuracy	95% (84/88)
Attack recall	100% (53/53)
Benign precision	89% (31/35)
False positives	4
False negatives	0
F1 score	0.964

Results by Category

Category	Accuracy
Prompt Injection	100%
Jailbreaking	100%
DAN	100%
Social Engineering	100%
Code Injection	100%
Obfuscation	100%
Payload Splitting	100%
Bad Chain Reasoning	100%
Legitimate Coding	100%
Security Education	100%
Tricky Vocab	82% (9/11)
Conversational	67% (4/6)

Latency

Input Length	Avg	P50	P95	P99
Short (~5 tokens)	580ms	597ms	682ms	684ms
Medium (~20 tokens)	570ms	593ms	612ms	687ms
Long (~40 tokens)	588ms	597ms	679ms	687ms

Measured on A100 GPU with full forward pass through all layers.

Technical Specifications

Model Architecture and Objective

Base architecture: Qwen3.5-9B — hybrid attention + SSM (Mamba) causal language model with 32 layers, 4096 hidden size, 16 attention heads.

Safety edits (3 modifications):

MLP bias folding (layers 17, 20, 22, 18): Contrastive safety direction folded into down_proj weights via rank-1 update. Bias-free — compatible with llama.cpp (no separate bias tensors needed, 427 GGUF tensors).
Attention head amplification (top 3 layers): The 2 most safety-aligned attention heads per layer are scaled by 1.04x in o_proj.
Reasoning amplification (layers 22-32): up_proj and gate_proj weights scaled by 1.015x to strengthen late-layer reasoning.

Probe architecture: Multi-layer linear probe using z-score normalized projections from layers 17, 20, and 27 with equal weights (0.34/0.33/0.33), sigmoid scale 1.5.

Compute Infrastructure

Hardware

NVIDIA A100 GPU (40GB VRAM)
~18GB VRAM for bf16 inference
Weight editing takes ~10 minutes
GGUF conversion takes ~5 minutes

Software

Python 3.10+
PyTorch 2.x
Transformers 5.x
llama.cpp (build 8580+)

Environmental Impact

Hardware Type: NVIDIA A100
Hours used: < 1 hour (no training — deterministic weight editing only)
Carbon Emitted: Negligible (no gradient computation or training loops)

Model Card Authors

votal-ai

Model Card Contact

votal-ai on HuggingFace

Downloads last month: 446

GGUF

Model size

9B params

Architecture

qwen35

Hardware compatibility

4-bit

Model tree for votal-ai/Qwen3.5-9B-guardrailed-v2-GGUF

Base model

Qwen/Qwen3.5-9B-Base

Finetuned

Qwen/Qwen3.5-9B

Quantized

(177)

this model