Qwen3.5-27B-Uncensored-RYS-Reasoner-GGUF
Related code
- Specialized inference/runtime fork for this deployment: noonr48/rys-splice-ik-llama
- The recent fastpath benchmarking documented there was run on three RTX 3090s.
- Experimental benchmark note: that inference comparison is an experimental, session-specific result from one three-RTX-3090 deployment and should be treated as directional rather than a broad performance guarantee.
Recent experimental inference benchmarks
| Runtime / mode | Model used | Split mode | Prompt speed | Decode speed |
|---|---|---|---|---|
| baseline ik-llama | custom RYS splice GGUF | graph | 87.6 tok/s | 44.4 tok/s |
| ik-llama + `--rys-splice-fastpath` | custom RYS splice GGUF | graph | 91.6 tok/s | 56.7 tok/s |
| base llama.cpp | mainline sibling GGUF | layer | 127.8 tok/s | 38.1 tok/s |
The first two rows are the direct same-deployment comparison on the custom model. The llama.cpp row is included as a useful reference on the same three-RTX-3090 machine, but it is not perfectly apples-to-apples because upstream llama.cpp used the sibling mainline GGUF and split-mode layer rather than the custom GGUF with split-mode graph.
An uncensored, coding-focused Qwen3.5-27B with RYS (Repeat Your Self) layer duplication, built via a novel splice method and quantized with a custom reasoning-focused importance matrix.
Why This Model Exists
The Censorship Problem
Over the past several years, the author has observed a consistent trend of increasing censorship across all major model releases — particularly in domains where unrestricted knowledge is most critical: biology, legal studies, and medicine. This model is an attempt to push back against that trend — a smaller model that aims to be more capable than the already remarkable Qwen3.5-27B base, without the artificial knowledge restrictions.
The Original RYS Experience
The original RYS-Qwen3.5-27B by dnhkng was exceptional as a coding agent. In the author's experience over 40+ hours of comparative usage against GPT-5.3 Codex on a complex multi-service project (a custom agent OS with a vault messaging app, webhook bridge, and multi-agent backend), dnhkng's RYS model identified and fixed deep architectural bugs that Codex missed entirely — such as silently misdirected conversation routing between the vault app and the backend, where messages were being sent to a dead webhook port while the actual agent bridge was running on a different service.
However, even the original (censored) RYS model and the standard Qwen3.5-27B exhibited a frustrating pattern: when asked to fix issues in existing infrastructure, the model would silently attempt to create an entirely new backend or service rather than modify the pre-existing one. This model eliminates that friction.
Previous Attempt & Lessons Learned
An earlier version of this uncensored RYS model was released prematurely. While that version performed well conversationally, it failed as a coding agent: it called the wrong tools and made poor file edits. This release is the proper replacement, with verified tool calling, reduced looping, and correct code generation.
Technical Motivation
The standard Qwen3.5-27B's safety guardrails actively interfere with legitimate development:
- SSH/Network access refusal — refuses to SSH into the user's own machines
- Memory system avoidance — avoids implementing persistent memory
- API integration refusal — hesitates on webhook endpoints, external services
- Tool calling interference — malformed or incomplete tool invocations
This model is designed as a coding agent model for use with Claude Code, OpenCode, claw-code, Qwen-Agent, or any OpenAI-compatible scaffold.
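All of these scaffolds speak the OpenAI chat-completions protocol, so the model can also be driven from any plain HTTP client against llama-server's `/v1/chat/completions` endpoint. A minimal stdlib-only Python sketch (the base URL, port, and model name are placeholders, not part of this release):

```python
import json
import urllib.request

def build_chat_request(prompt: str, temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat.completions payload for llama-server."""
    return {
        # llama-server largely ignores the model name, but most
        # OpenAI-compatible clients require the field to be present.
        "model": "rys-qwen3.5-27b",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "stream": False,
    }

def send(payload: dict, base_url: str = "http://localhost:8080") -> dict:
    """POST the payload to the server's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Write a Python function that reverses a string.")
print(payload["messages"][0]["role"])  # user
```

Agent frameworks layer tool calling and streaming on top of this same endpoint, so if the raw request above works, the scaffold-level integration usually does too.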
Disclaimer & Responsible Use
⚠️ Uncensored Model: As far as the author can determine, this model is completely uncensored when prompted with appropriate system instructions. By default, the model will not produce highly graphic or explicit material unless the system prompt specifically instructs it to do so.
The author is not responsible for how this model is used. Any actions taken are solely the responsibility of the user. Use in accordance with applicable laws and ethical standards.
Representative Live Test
The model autonomously built a complete AI/ML news aggregator:
- 26KB FastAPI backend with 4 live API integrations (GitHub, Reddit, HuggingFace, ArXiv)
- 25KB dark-theme SPA frontend (881 lines, search, filters, cards, bookmarks)
- SQLite database with 175 items persisted from live API fetches
- Setup & test scripts — venv, deps, 9 endpoint tests
- Self-corrected 3 tool format errors autonomously
- Zero loops across ~70k token generation at 256k context
Tested via claw-code agent framework (required patching reasoning_content support for OpenAI-compatible streaming — see our fork).
Testing environment: OpenCode on Arch Linux with root access. Primary runtime testing used ik-llama.cpp with the build info below and the current recommended parameters from this README.
Available Files
| File | Quant | Size | Description |
|---|---|---|---|
| RYS-Qwen3.5-27B-Uncensored-Splice-BF16.gguf | BF16 | 56 GB | Full precision reference |
| RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-ik-llama.gguf | IQ4_NL | 17 GB | Author's personal driver / primary recommendation — quantized for ik-llama.cpp, tested the most |
| RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-llama.cpp-compatible.gguf | IQ4_NL | 17 GB | Standard llama.cpp-compatible build made from the same BF16 source and custom imatrix |
Which IQ4_NL should you use?
- Use `RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-ik-llama.gguf` if you run ik-llama.cpp. This is the author's actual daily driver and the variant that will continue receiving the most real-world testing.
- Use `RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-llama.cpp-compatible.gguf` if you run standard llama.cpp or a frontend built on it.

Footnote: The `ik-llama` build is not a uniform IQ4_NL quant. It uses a mixed tensor layout: mostly `iq4_nl`, plus a small number of higher-precision tensors (`iq5_k`/`q6_K`). The `llama.cpp-compatible` build is also mixed, but uses mainline-supported tensor types instead (`q5_K`/`q6_K`).
Why IQ4_NL is Recommended Over BF16, Q8, and Q6
This is not typical. In the author's tested coding / agent workloads, the preferred IQ4_NL quantization with custom importance matrix consistently outperformed Q8_0, Q6_K, and full-precision BF16:
- Quantization acts as a regularizer — slight weight rounding prevents degenerate thinking loops that BF16 and Q6_K are prone to
- Custom imatrix preserves reasoning weights — 30% reasoning/self-verification calibration data ensures chain-of-thought and self-correction weights are preserved
- Half the size, better results — 17GB vs 56GB, fits on a single 24GB GPU
How IQ4_NL Was Chosen
Multiple quantization variants were tested on the same real-world task: building a full-stack AI/ML news aggregator (AI Radar) from a single prompt, using claw-code as the agent framework. The project required 4 live API integrations (GitHub, Reddit, HuggingFace, ArXiv), a FastAPI backend, SQLite database, dark-theme SPA frontend, and automated test suite — all at 256k context.
Real-world project results (same prompt, same model, different quants):
| Variant | Project Completed | Database Persisted Data | Self-Corrected Errors | Loops | Verdict |
|---|---|---|---|---|---|
| IQ4_NL + custom imatrix + F16 KV | ✅ 7 files | ✅ 175 items from live APIs | 3 (all recovered) | 0 | Best |
| Same IQ4_NL + wiki imatrix + F16 KV | ❌ 0 files | — | 10 (never recovered) | 10+ looping write attempts | Failed |
| Same model + any quant + BF16 KV | ❌ 0 files | — | — | Looped on basic commands | Failed |
The IQ4_NL with custom imatrix was the only variant that produced working database persistence — all other variants had bugs in async database writes that prevented data from being saved. The custom imatrix preserves the model's self-correction behavior ("wait, that's wrong, let me fix it") that standard calibration destroys at this quantization level.
Supporting evidence: 5 automated coding problems (merge intervals, LCS, RPN evaluator, valid parentheses, trapping rain water) tested at both temperature 0.6 and 0.8 — IQ4_NL with custom imatrix scored 5/5 at both temperatures. BF16 and Q6_K both entered infinite thinking loops on the same problems.
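The harness for those automated problems is not published; as a sketch of how such problems can be scored, here is a reference implementation for one of the five (trapping rain water) that a model's answer can be checked against:

```python
def trap(heights: list[int]) -> int:
    """Reference solution for trapping rain water: the water held
    above each bar is bounded by the shorter of the tallest bars
    to its left and to its right."""
    if not heights:
        return 0
    n = len(heights)
    left_max = [0] * n
    right_max = [0] * n
    left_max[0] = heights[0]
    for i in range(1, n):
        left_max[i] = max(left_max[i - 1], heights[i])
    right_max[-1] = heights[-1]
    for i in range(n - 2, -1, -1):
        right_max[i] = max(right_max[i + 1], heights[i])
    return sum(min(left_max[i], right_max[i]) - heights[i] for i in range(n))

# Classic test vector for this problem; the expected answer is 6.
print(trap([0, 1, 0, 2, 1, 0, 1, 3, 2, 1, 2, 1]))  # 6
```

A harness in this style runs the model's generated function against fixed vectors like the one above and counts exact matches, which is consistent with the 5/5 scoring described.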
For clarity: the results above come from the ik-llama.cpp-oriented IQ4_NL build, which remains the author's primary recommendation. A separate standard llama.cpp-compatible IQ4_NL build is provided for portability.
Critical: KV Cache Recommendations
| KV Cache | Recommended Use | Notes |
|---|---|---|
| F16 | Default up to ~160k context | Best balance of stability, speed, and VRAM in current testing |
| F32 | Recommended above ~160k context | Most stable option for very long context; the author has consistently reached ~220k before compression |
| BF16 | Avoid | Unstable in current testing; more prone to loops and degraded instruction following |
For this model and for the base Qwen3.5-27B family more broadly, BF16 KV cache appears unstable in llama.cpp-family runtimes. In current testing, F16 KV cache works reliably up to roughly 160k context, while F32 KV cache is recommended beyond that if you want maximum stability.
Subjective observation: F32 KV cache often completed similar tasks with fewer output tokens and showed better instruction following than BF16, and in some cases better than F16 as well.
If you are pushing long context, prefer:
`--cache-type-k f32 --cache-type-v f32`
If you want the best general balance at moderate context lengths, use:
`--cache-type-k f16 --cache-type-v f16`
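The guidance above can be condensed into a small helper (the ~160k cutoff is the author's empirical figure from this README, not a hard runtime limit):

```python
def kv_cache_type(ctx_tokens: int) -> str:
    """Pick a KV cache type per the table above:
    F16 up to ~160k context, F32 beyond that, never BF16."""
    return "f32" if ctx_tokens > 160_000 else "f16"

# Build the llama-server flags from a planned context size.
ctx = 262_144
flags = f"--cache-type-k {kv_cache_type(ctx)} --cache-type-v {kv_cache_type(ctx)}"
print(flags)  # --cache-type-k f32 --cache-type-v f32
```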
Qwen3.5 support in llama.cpp-family runtimes is still moving quickly. If you hit compatibility issues, use the newest available build first.
Custom Importance Matrix
The IQ4_NL uses a custom-built imatrix — not standard wiki calibration. This is the single biggest quality factor.
Calibration dataset (English only):
| Content | Weight | Purpose |
|---|---|---|
| Reasoning & self-verification | 30% | Math proofs with ✓/✗ checks, debugging narratives, self-correction, algorithm tracing |
| Code (Python, JS, Bash) | 25% | Multi-file projects, test suites, error handling |
| Academic papers (broad) | 15% | ArXiv + PubMed across all fields |
| Instruction/agent prompts | 15% | Terse multi-step commands, agentic style |
| Infrastructure/sysadmin | 10% | systemd, SSH, GPU config, shell |
| General English | 5% | Wiki baseline |
What the imatrix specifically targets:
The 30% reasoning/self-verification block is calibrated to preserve the weights responsible for:
- Prolonged chain-of-thought — sustaining coherent multi-step reasoning across thousands of tokens without degradation
- Self-correction — the ability to recognize "wait, that's wrong" mid-generation and backtrack
- Verification loops — tracing through algorithms step-by-step with explicit ✓/✗ checks ("Test case 1: expected X, got X ✓")
- Knowing when to stop thinking — concluding a `<think>` block and producing output instead of looping indefinitely
- Never assuming — the model should verify, not guess. The calibration data includes debugging narratives that explicitly check each assumption
The code/instruction blocks preserve:
- Multi-file project architecture — maintaining coherence across 1000+ line codebases
- Tool call formatting — precise JSON parameter construction for agent frameworks
- Direct instruction parsing — understanding terse, multi-step commands without needing hand-holding
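Scaffolds differ in their exact tool-call schema; as an illustration of the JSON precision being preserved, here is a minimal validator for one common shape (the `name`/`arguments` layout is an assumption for illustration, not a spec from this repo):

```python
import json

def parse_tool_call(raw: str) -> dict:
    """Validate a model-emitted tool call: it must be well-formed JSON
    with a string tool name and an object of arguments. Exact schemas
    vary by agent framework."""
    call = json.loads(raw)
    if not isinstance(call.get("name"), str):
        raise ValueError("tool call missing string 'name'")
    if not isinstance(call.get("arguments"), dict):
        raise ValueError("tool call 'arguments' must be a JSON object")
    return call

good = parse_tool_call(
    '{"name": "write_file", "arguments": {"path": "app.py", "content": "print(1)"}}'
)
print(good["name"])  # write_file
```

The failure modes described in this README (malformed or incomplete invocations) are exactly what a validator like this rejects before the call ever reaches a tool.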
English-only calibration — the entire imatrix dataset is English. Qwen3.5-27B supports 200+ languages, but this model's calibration deliberately reallocates precision from unused multilingual weights to English reasoning, code generation, and instruction following. If you need multilingual output, this quant is not optimized for it — use the BF16 instead.
Standard wiki-only calibration under-represents ALL of the above patterns because wiki text contains none of them. The result: standard-calibrated quants lose self-correction first (causing infinite thinking loops), then lose tool call precision, then lose code coherence — in that order.
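The mixture table above can be sketched as a chunk-allocation step. The source filenames here are hypothetical (the actual dataset ships as a single `calibration_custom.txt`), and 2151 matches the chunk count quoted in the imatrix stats later in this README:

```python
# Mixture weights from the calibration table above.
MIXTURE = {
    "reasoning_selfverify.txt": 0.30,
    "code_py_js_bash.txt": 0.25,
    "academic_papers.txt": 0.15,
    "agent_instructions.txt": 0.15,
    "infra_sysadmin.txt": 0.10,
    "general_english.txt": 0.05,
}

def sample_counts(total_chunks: int, mixture: dict[str, float]) -> dict[str, int]:
    """Allocate calibration chunks per source according to the weights."""
    counts = {src: int(total_chunks * w) for src, w in mixture.items()}
    # Hand any rounding remainder to the highest-weighted source.
    remainder = total_chunks - sum(counts.values())
    top = max(mixture, key=mixture.get)
    counts[top] += remainder
    return counts

counts = sample_counts(2151, MIXTURE)
print(counts["reasoning_selfverify.txt"])  # 648
```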
llama.cpp Compatibility Note
There are now two IQ4_NL releases because the original custom quant and the new mainline-compatible quant are not identical artifacts.
What happened?
The original IQ4_NL quant was built for ik-llama.cpp and some users loading it in standard llama.cpp / llama.cpp-based frontends hit this error:
`gguf_init_from_file_ptr: tensor 'blk.3.attn_v.weight' has invalid ggml type 140`
That failure was not caused by KV cache format. It was a model-file compatibility issue.
What changed?
To fix that, the author rebuilt a second IQ4_NL release directly with standard llama.cpp's llama-quantize, using the same BF16 splice source and the same custom calibration/imatrix:
- BF16 source: `RYS-Qwen3.5-27B-Uncensored-Splice-BF16.gguf`
- imatrix: `splice_custom.imatrix`
- imatrix dataset: `calibration_custom.txt`
- imatrix stats: `558` entries over `2151` chunks
The resulting standard llama.cpp-compatible file still uses a mixed quant layout, but one that mainline llama.cpp can load successfully:
- `iq4_nl`: 487 tensors
- `q5_K`: 72 tensors
- `q6_K`: 1 tensor
Recommendation
- ik-llama.cpp users: use the `-ik-llama` file
- standard llama.cpp users: use the `-llama.cpp-compatible` file
The author still personally recommends the ik-llama build because that is the day-to-day driver and the one that will be exercised the most in live coding / agent workloads.
Both IQ4_NL releases are mixed quants. The key difference is that the ik-llama build uses ik-specific tensor types, while the llama.cpp-compatible build uses mainline-supported tensor types.
Recommended Parameters
These are the current best test parameters for this repo so far:
```shell
llama-server \
  -m RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-ik-llama.gguf \
  -ngl 99 -c 262144 \
  --cache-type-k f16 --cache-type-v f16 \
  --cache-ram 30720 \
  --flash-attn on \
  --jinja --reasoning-format deepseek \
  --temp 0.7 --top-p 0.95 --top-k 20 \
  --min-p 0.0 --repeat-penalty 1.0
```
For contexts above ~160k, consider switching the KV cache to F32:
`--cache-type-k f32 --cache-type-v f32`
Avoid BF16 KV cache in current llama.cpp-family builds.
For the standard llama.cpp-compatible quant, simply swap the filename to:
`RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-llama.cpp-compatible.gguf`
How It Was Built
Splice Method
Layers 0–25: HauhauCS Uncensored weights (26 layers)
Layers 26–41: dnhkng's RYS-XL layers (FP8→F16→BF16, 16 layers = 8 duplicated)
Layers 42–71: HauhauCS Uncensored weights (30 layers)
78% uncensored layers. Built via direct GGUF→GGUF splice — no safetensors conversion.
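The layout above can be written down as simple bookkeeping to verify the layer counts and the 78% figure (illustrative only; the actual splice operates tensor-by-tensor on GGUF files):

```python
# Which source model supplies each of the 72 target layers.
layer_map = (
    ["uncensored"] * 26   # layers 0-25: HauhauCS Uncensored
    + ["rys_xl"] * 16     # layers 26-41: dnhkng RYS-XL duplicate zone
    + ["uncensored"] * 30 # layers 42-71: HauhauCS Uncensored
)

assert len(layer_map) == 72
uncensored_share = layer_map.count("uncensored") / len(layer_map)
print(f"{uncensored_share:.0%}")  # 78%
```

56 of 72 layers come from the uncensored donor, which rounds to the 78% figure stated above.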
Source Models
- Qwen/Qwen3.5-27B — Base architecture (Apache 2.0)
- HauhauCS/Qwen3.5-27B-Uncensored-HauhauCS-Aggressive — 78% of layer weights
- dnhkng/RYS-Qwen3.5-27B-FP8-XL — 22% of layer weights (duplicate zone)
- dnhkng/RYS — Method & research (Blog)
- ikawrakow/ik_llama.cpp — Quantization tooling
Tested Runtimes / Builds
Primary validation for this release used ik-llama.cpp, mainly because its graph split is a major performance boost on multi-GPU systems compared to standard layer split.
ik-llama.cpp (author's primary runtime)
| Field | Value |
|---|---|
| Version | 61 (0147cf4) |
| Git Commit | 0147cf4 - "Add additional explanations to the pinned memory log" |
| Compiler | GCC 15.2.1 (20260103) |
| Build Type | Release |
| CUDA | ON |
| Flash Attention | ON (GGML_CUDA_FA_ALL_QUANTS) |
| Build Date | Apr 5, 2026 |
The author uses ik-llama because graph split performs much better on multi-GPU setups. It remains the author's personal daily driver and the variant that will receive the most testing.
standard llama.cpp (used for the new compatibility build)
The new RYS-Qwen3.5-27B-Uncensored-Splice-IQ4_NL-llama.cpp-compatible.gguf was created and tested with standard llama.cpp using:
| Field | Value |
|---|---|
| Binary | /home/benbi/llama.cpp/build/bin/llama-server |
| Quantizer | /home/benbi/llama.cpp/build/bin/llama-quantize |
| Build string | build: 8401 (a69d54f99) with GNU 15.2.1 for Linux x86_64 |
| CUDA archs reported | ARCHS = 860,1200 |
| CUDA | ON |
| Flash Attention | ON |
In either runtime, use the latest available compile/build when testing.
Vision Compatibility
Qwen3.5 is a multimodal family, but mmproj was not part of the validation for this release. The author has not done proper testing to confirm whether vision works on this splice, so vision support should currently be treated as unverified.
Citation
```bibtex
@misc{qwen3.5,
  title  = {{Qwen3.5}: Towards Native Multimodal Agents},
  author = {{Qwen Team}},
  year   = {2026},
  url    = {https://qwen.ai/blog?id=qwen3.5}
}

@misc{dnhkng_rys,
  title  = {LLM Neuroanatomy II},
  author = {dnhkng},
  year   = {2025},
  url    = {https://dnhkng.github.io/posts/rys-ii/}
}
```