Qwen3.5-27B-Uncensored-RYS-Reasoner-GGUF

Related code

  • Specialized inference/runtime fork for this deployment: noonr48/rys-splice-ik-llama
  • The recent fastpath benchmarking documented there was run on a single machine with three RTX 3090s.
  • Note: that inference comparison is an experimental, session-specific result from that one deployment; treat it as directional rather than a broad performance guarantee.

Recent experimental inference benchmarks

| Runtime / mode | Model used | Split mode | Prompt speed | Decode speed |
|---|---|---|---|---|
| baseline ik-llama | custom RYS splice GGUF | graph | 87.6 tok/s | 44.4 tok/s |
| ik-llama + --rys-splice-fastpath | custom RYS splice GGUF | graph | 91.6 tok/s | 56.7 tok/s |
| base llama.cpp | mainline sibling GGUF | layer | 127.8 tok/s | 38.1 tok/s |

The first two rows are the direct same-deployment comparison on the custom model. The llama.cpp row is included as a useful reference on the same three-RTX-3090 machine, but it is not perfectly apples-to-apples because upstream llama.cpp used the sibling mainline GGUF and split-mode layer rather than the custom GGUF with split-mode graph.
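For quick reference, the fastpath gain implied by the two same-deployment rows can be computed directly from the table (a back-of-envelope check using the session-specific numbers above):

```shell
# Speedup of --rys-splice-fastpath over the ik-llama baseline,
# using the benchmark numbers from the table (session-specific, not a guarantee).
awk 'BEGIN {
  printf "prompt speedup: +%.1f%%\n", (91.6 / 87.6 - 1) * 100
  printf "decode speedup: +%.1f%%\n", (56.7 / 44.4 - 1) * 100
}'
```

In this session the fastpath mainly helped decode throughput (roughly +28%), with a smaller prompt-side gain (roughly +5%).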

An uncensored, coding-focused Qwen3.5-27B with RYS (Repeat Yourself) layer duplication, built via a novel splice method and quantized with a custom reasoning-focused importance matrix.

Why This Model Exists

The Censorship Problem

Over the past several years, the author has observed a consistent trend of increasing censorship across all major model releases — particularly in domains where unrestricted knowledge is most critical: biology, legal studies, and medicine. This model is an attempt to push back against that trend — a smaller model that aims to be more capable than the already remarkable Qwen3.5-27B base, without the artificial knowledge restrictions.

The Original RYS Experience

The original RYS-Qwen3.5-27B by dnhkng was exceptional as a coding agent. In the author's experience over 40+ hours of comparative usage against GPT-5.3 Codex on a complex multi-service project (a custom agent OS with a vault messaging app, webhook bridge, and multi-agent backend), dnhkng's RYS model identified and fixed deep architectural bugs that Codex missed entirely — such as silently misdirected conversation routing between the vault app and the backend, where messages were being sent to a dead webhook port while the actual agent bridge was running on a different service.

However, even the original (censored) RYS model and the standard Qwen3.5-27B exhibited a frustrating pattern: when asked to fix issues in existing infrastructure, the model would silently attempt to create an entirely new backend or service rather than modify the pre-existing one. This model eliminates that friction.

Previous Attempt & Lessons Learned

An earlier version of this uncensored RYS model was released prematurely. While it performed well conversationally, it failed as a coding agent: it called the wrong tools and made poor file edits. This release is the proper replacement, with verified tool calling, reduced looping, and correct code generation.

Technical Motivation

The standard Qwen3.5-27B's safety guardrails actively interfere with legitimate development:

  • SSH/Network access refusal — refuses to SSH into the user's own machines
  • Memory system avoidance — avoids implementing persistent memory
  • API integration refusal — hesitates on webhook endpoints, external services
  • Tool calling interference — malformed or incomplete tool invocations

This model is designed as a coding agent model for use with Claude Code, OpenCode, claw-code, Qwen-Agent, or any OpenAI-compatible scaffold.
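All of those scaffolds talk to the model through a standard OpenAI-compatible chat endpoint. A minimal request sketch follows; the port, model name, and payload contents are illustrative assumptions matching common llama-server defaults, not part of this release:

```shell
# Build and validate a chat-completions payload; the actual send is shown commented out.
cat > /tmp/rys_req.json <<'EOF'
{
  "model": "rys-qwen3.5-27b",
  "messages": [
    {"role": "system", "content": "You are a coding agent with shell access."},
    {"role": "user", "content": "Fix the failing webhook route in app.py"}
  ],
  "temperature": 0.7,
  "top_p": 0.95,
  "stream": true
}
EOF
python3 -m json.tool /tmp/rys_req.json > /dev/null && echo "payload OK"

# To send against a local llama-server instance (default port assumed):
# curl -s http://127.0.0.1:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d @/tmp/rys_req.json
```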

Disclaimer & Responsible Use

⚠️ Uncensored Model: As far as the author can determine, this model is completely uncensored when prompted with appropriate system instructions. By default, the model will not produce highly graphic or explicit material unless the system prompt specifically instructs it to do so.

The author is not responsible for how this model is used. Any actions taken are solely the responsibility of the user. Use in accordance with applicable laws and ethical standards.

Representative Live Test

The model autonomously built a complete AI/ML news aggregator:

  • 26KB FastAPI backend with 4 live API integrations (GitHub, Reddit, HuggingFace, ArXiv)
  • 25KB dark-theme SPA frontend (881 lines, search, filters, cards, bookmarks)
  • SQLite database with 175 items persisted from live API fetches
  • Setup & test scripts — venv, deps, 9 endpoint tests
  • Self-corrected 3 tool format errors autonomously
  • Zero loops across ~70k token generation at 256k context

Tested via the claw-code agent framework (this required patching reasoning_content support for OpenAI-compatible streaming; see the author's fork).

Testing environment: OpenCode on Arch Linux with root access. Primary runtime testing used ik-llama.cpp with the build info below and the current recommended parameters from this README.

Available Files

| File | Quant | Size | Description |
|---|---|---|---|
| RYS-Qwen3.5-27B-Uncensored-Splice-BF16.gguf | BF16 | 56 GB | Full-precision reference |
| RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-ik-llama.gguf | IQ4_NL | 17 GB | Author's personal driver / primary recommendation; quantized for ik-llama.cpp, tested the most |
| RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-llama.cpp-compatible.gguf | IQ4_NL | 17 GB | Standard llama.cpp-compatible build made from the same BF16 source and custom imatrix |

Which IQ4_NL should you use?

  • Use RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-ik-llama.gguf if you run ik-llama.cpp. This is the author's actual daily driver and the variant that will continue receiving the most real-world testing.
  • Use RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-llama.cpp-compatible.gguf if you run standard llama.cpp or a frontend built on it.

Footnote: The ik-llama build is not a uniform IQ4_NL quant. It uses a mixed tensor layout: mostly iq4_nl, plus a small number of higher-precision tensors (iq5_k / q6_K). The llama.cpp-compatible build is also mixed, but uses mainline-supported tensor types instead (q5_K / q6_K).

Why IQ4_NL is Recommended Over BF16, Q8, and Q6

This is not typical. In the author's tested coding / agent workloads, the IQ4_NL quantization with custom importance matrix consistently outperformed Q8_0, Q6_K, and full-precision BF16:

  1. Quantization acts as a regularizer — slight weight rounding prevents degenerate thinking loops that BF16 and Q6_K are prone to
  2. Custom imatrix preserves reasoning weights — 30% reasoning/self-verification calibration data ensures chain-of-thought and self-correction weights are preserved
  3. Half the size, better results — 17GB vs 56GB, fits on a single 24GB GPU
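As a rough sanity check on point 3, the effective precision implied by a 17 GB file over roughly 30B stored parameters (assuming the post-duplication parameter count for this splice; GB/GiB rounding ignored) is:

```shell
# Approximate bits per weight: file size in bits / parameter count.
awk 'BEGIN { printf "%.1f bits/weight\n", (17 * 8e9) / 30e9 }'
```

That lands close to the nominal 4-bit-plus-overhead budget expected of an IQ4_NL-dominated mixed quant.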

How IQ4_NL Was Chosen

Multiple quantization variants were tested on the same real-world task: building a full-stack AI/ML news aggregator (AI Radar) from a single prompt, using claw-code as the agent framework. The project required 4 live API integrations (GitHub, Reddit, HuggingFace, ArXiv), a FastAPI backend, SQLite database, dark-theme SPA frontend, and automated test suite — all at 256k context.

Real-world project results (same prompt, same model, different quants):

| Variant | Project completed | Data persisted | Self-corrected errors | Loops | Verdict |
|---|---|---|---|---|---|
| IQ4_NL + custom imatrix + F16 KV | ✅ 7 files | ✅ 175 items from live APIs | 3 (all recovered) | 0 | Best |
| Same IQ4_NL + wiki imatrix + F16 KV | ❌ 0 files | ❌ | 10 (never recovered) | 10+ looping write attempts | Failed |
| Same model + any quant + BF16 KV | ❌ 0 files | ❌ | n/a | Looped on basic commands | Failed |

The IQ4_NL with custom imatrix was the only variant that produced working database persistence — all other variants had bugs in async database writes that prevented data from being saved. The custom imatrix preserves the model's self-correction behavior ("wait, that's wrong, let me fix it") that standard calibration destroys at this quantization level.

Supporting evidence: 5 automated coding problems (merge intervals, LCS, RPN evaluator, valid parentheses, trapping rain water) tested at both temperature 0.6 and 0.8 — IQ4_NL with custom imatrix scored 5/5 at both temperatures. BF16 and Q6_K both entered infinite thinking loops on the same problems.

For clarity: the results above come from the ik-llama.cpp-oriented IQ4_NL build, which remains the author's primary recommendation. A separate standard llama.cpp-compatible IQ4_NL build is provided for portability.

Critical: KV Cache Recommendations

| KV cache | Recommendation | Notes |
|---|---|---|
| F16 | Default up to ~160k context | Best balance of stability, speed, and VRAM in current testing |
| F32 | Recommended above ~160k context | Most stable option for very long context; the author has consistently reached ~220k before compression |
| BF16 | Avoid | Unstable in current testing; more prone to loops and degraded instruction following |

For this model and for the base Qwen3.5-27B family more broadly, BF16 KV cache appears unstable in llama.cpp-family runtimes. In current testing, F16 KV cache works reliably up to roughly 160k context, while F32 KV cache is recommended beyond that if you want maximum stability.

Subjective observation: F32 KV cache often completed similar tasks with fewer output tokens and showed better instruction following than BF16, and in some cases better than F16 as well.

If you are pushing long context, prefer:

  • --cache-type-k f32 --cache-type-v f32

If you want the best general balance at moderate context lengths, use:

  • --cache-type-k f16 --cache-type-v f16
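The two cases above can be folded into a small launcher helper. This is a sketch: the ~160k cutoff is a soft threshold from this README's testing, not a hard limit.

```shell
# Pick KV cache flags from the target context length, per the guidance above.
kv_flags() {
  local ctx=$1
  if [ "$ctx" -gt 160000 ]; then
    echo "--cache-type-k f32 --cache-type-v f32"   # long-context stability
  else
    echo "--cache-type-k f16 --cache-type-v f16"   # best general balance
  fi
}

kv_flags 262144   # prints the f32 pair
kv_flags 65536    # prints the f16 pair
```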

Qwen3.5 support in llama.cpp-family runtimes is still moving quickly. If you hit compatibility issues, use the newest available build first.

Custom Importance Matrix

The IQ4_NL uses a custom-built imatrix — not standard wiki calibration. This is the single biggest quality factor.

Calibration dataset (English only):

| Content | Weight | Purpose |
|---|---|---|
| Reasoning & self-verification | 30% | Math proofs with ✓/✗ checks, debugging narratives, self-correction, algorithm tracing |
| Code (Python, JS, Bash) | 25% | Multi-file projects, test suites, error handling |
| Academic papers (broad) | 15% | ArXiv + PubMed across all fields |
| Instruction/agent prompts | 15% | Terse multi-step commands, agentic style |
| Infrastructure/sysadmin | 10% | systemd, SSH, GPU config, shell |
| General English | 5% | Wiki baseline |
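To make the weights concrete, here is how that mixture divides the 2,151 calibration chunks reported elsewhere in this README. This is illustrative only: the real chunking is done by the imatrix tool, and exact per-category counts were not published.

```shell
# Divide the 2151-chunk calibration budget by the category weights above.
awk 'BEGIN {
  total = 2151
  n = split("reasoning=30 code=25 papers=15 agent=15 sysadmin=10 general=5", cats, " ")
  for (i = 1; i <= n; i++) {
    split(cats[i], kv, "=")
    printf "%-10s %2d%%  ~%4d chunks\n", kv[1], kv[2], int(total * kv[2] / 100)
  }
}'
```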

What the imatrix specifically targets:

The 30% reasoning/self-verification block is calibrated to preserve the weights responsible for:

  • Prolonged chain-of-thought — sustaining coherent multi-step reasoning across thousands of tokens without degradation
  • Self-correction — the ability to recognize "wait, that's wrong" mid-generation and backtrack
  • Verification loops — tracing through algorithms step-by-step with explicit ✓/✗ checks ("Test case 1: expected X, got X ✓")
  • Knowing when to stop thinking — concluding a <think> block and producing output instead of looping indefinitely
  • Never assuming — the model should verify, not guess. The calibration data includes debugging narratives that explicitly check each assumption

The code/instruction blocks preserve:

  • Multi-file project architecture — maintaining coherence across 1000+ line codebases
  • Tool call formatting — precise JSON parameter construction for agent frameworks
  • Direct instruction parsing — understanding terse, multi-step commands without needing hand-holding

English-only calibration — the entire imatrix dataset is English. Qwen3.5-27B supports 200+ languages, but this model's calibration deliberately reallocates precision from unused multilingual weights to English reasoning, code generation, and instruction following. If you need multilingual output, this quant is not optimized for it — use the BF16 instead.

Standard wiki-only calibration under-represents ALL of the above patterns because wiki text contains none of them. The result: standard-calibrated quants lose self-correction first (causing infinite thinking loops), then lose tool call precision, then lose code coherence — in that order.

llama.cpp Compatibility Note

There are now two IQ4_NL releases because the original custom quant and the new mainline-compatible quant are not identical artifacts.

What happened?

The original IQ4_NL quant was built for ik-llama.cpp and some users loading it in standard llama.cpp / llama.cpp-based frontends hit this error:

gguf_init_from_file_ptr: tensor 'blk.3.attn_v.weight' has invalid ggml type 140

That failure was not caused by KV cache format. It was a model-file compatibility issue.

What changed?

To fix that, the author rebuilt a second IQ4_NL release directly with standard llama.cpp's llama-quantize, using the same BF16 splice source and the same custom calibration/imatrix:

  • BF16 source: RYS-Qwen3.5-27B-Uncensored-Splice-BF16.gguf
  • imatrix: splice_custom.imatrix
  • imatrix dataset: calibration_custom.txt
  • imatrix stats: 558 entries over 2151 chunks

The resulting standard llama.cpp-compatible file still uses a mixed quant layout, but one that mainline llama.cpp can load successfully:

  • iq4_nl: 487 tensors
  • q5_K: 72 tensors
  • q6_K: 1 tensor
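A hypothetical reconstruction of that rebuild, using mainline llama.cpp's llama-imatrix and llama-quantize tools. The tool names and flag shapes are standard mainline usage, but the author's exact invocation was not published, so treat this as a sketch:

```shell
# Source artifacts (filenames as published in this repo):
BF16=RYS-Qwen3.5-27B-Uncensored-Splice-BF16.gguf
OUT=RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-llama.cpp-compatible.gguf

# 1. Regenerate the importance matrix from the custom calibration set:
#      llama-imatrix -m "$BF16" -f calibration_custom.txt -o splice_custom.imatrix
#
# 2. Quantize the BF16 source with that imatrix; the trailing argument selects
#    the base quant type, and mainline promotes a handful of tensors on its own,
#    which matches the mixed layout listed above:
#      llama-quantize --imatrix splice_custom.imatrix "$BF16" "$OUT" IQ4_NL

echo "source: $BF16"
echo "output: $OUT"
```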

Recommendation

  • ik-llama.cpp users: use the -ik-llama file
  • standard llama.cpp users: use the -llama.cpp-compatible file

The author still personally recommends the ik-llama build because that is the day-to-day driver and the one that will be exercised the most in live coding / agent workloads.

Both IQ4_NL releases are mixed quants. The key difference is that the ik-llama build uses ik-specific tensor types, while the llama.cpp-compatible build uses mainline-supported tensor types.

Recommended Parameters

These are the current best test parameters for this repo so far:

llama-server \
  -m RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-ik-llama.gguf \
  -ngl 99 -c 262144 \
  --cache-type-k f16 --cache-type-v f16 \
  --cache-ram 30720 \
  --flash-attn on \
  --jinja --reasoning-format deepseek \
  --temp 0.7 --top-p 0.95 --top-k 20 \
  --min-p 0.0 --repeat-penalty 1.0

For contexts above ~160k, consider switching the KV cache to F32:

  • --cache-type-k f32 --cache-type-v f32

Avoid BF16 KV cache in current llama.cpp-family builds.

For the standard llama.cpp-compatible quant, simply swap the filename to:

RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-llama.cpp-compatible.gguf

How It Was Built

Splice Method

Layers  0–25:  HauhauCS Uncensored weights (26 layers)
Layers 26–41:  dnhkng's RYS-XL layers (FP8→F16→BF16, 16 layers = 8 duplicated)
Layers 42–71:  HauhauCS Uncensored weights (30 layers)

78% uncensored layers. Built via direct GGUF→GGUF splice — no safetensors conversion.
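The layer accounting above checks out arithmetically:

```shell
# Verify the layer counts and the "78% uncensored" figure from the splice map.
awk 'BEGIN {
  uncensored = 26 + 30    # layers 0-25 plus layers 42-71
  rys        = 16         # layers 26-41 (8 source layers, each duplicated)
  total      = uncensored + rys
  printf "total layers: %d\n", total
  printf "uncensored share: %.0f%%\n", 100 * uncensored / total
}'
```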

Source Models

  • HauhauCS uncensored Qwen3.5-27B (donor for the base and upper layer blocks)
  • dnhkng's RYS-XL (donor for the duplicated middle layers)

Tested Runtimes / Builds

Primary validation for this release used ik-llama.cpp, mainly because its graph split is a major performance boost on multi-GPU systems compared to standard layer split.

ik-llama.cpp (author's primary runtime)

| Field | Value |
|---|---|
| Version | 61 (0147cf4) |
| Git commit | 0147cf4 ("Add additional explanations to the pinned memory log") |
| Compiler | GCC 15.2.1 (20260103) |
| Build type | Release |
| CUDA | ON |
| Flash Attention | ON (GGML_CUDA_FA_ALL_QUANTS) |
| Build date | Apr 5, 2026 |

ik-llama.cpp remains the author's personal daily driver, and the ik-llama quant is the variant that will receive the most ongoing testing.

standard llama.cpp (used for the new compatibility build)

The new RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-llama.cpp-compatible.gguf was created and tested with standard llama.cpp using:

| Field | Value |
|---|---|
| Binary | /home/benbi/llama.cpp/build/bin/llama-server |
| Quantizer | /home/benbi/llama.cpp/build/bin/llama-quantize |
| Build string | build: 8401 (a69d54f99) with GNU 15.2.1 for Linux x86_64 |
| CUDA archs reported | ARCHS = 860,1200 |
| CUDA | ON |
| Flash Attention | ON |

In either runtime, use the latest available compile/build when testing.

Vision Compatibility

Qwen3.5 is a multimodal family, but mmproj was not part of the validation for this release. The author has not done proper testing to confirm whether vision works on this splice, so vision support should currently be treated as unverified.

Citation

@misc{qwen3.5,
  title = {{Qwen3.5}: Towards Native Multimodal Agents},
  author = {{Qwen Team}},
  year = {2026},
  url = {https://qwen.ai/blog?id=qwen3.5}
}

@misc{dnhkng_rys,
  title = {LLM Neuroanatomy II},
  author = {dnhkng},
  year = {2025},
  url = {https://dnhkng.github.io/posts/rys-ii/}
}