---
base_model: Qwen/Qwen3.5-9B
tags:
- text-generation-inference
- transformers
- unsloth
- qwen3_5
license: apache-2.0
language:
- en
datasets:
- TeichAI/claude-4.5-opus-high-reasoning-250x
- armand0e/badlogicgames-pi-mono-opus-filtered
---

# Qwen3.5 9B - Opus Agent

This is a finetune of Qwen3.5 9B on Opus traces and a small dataset. Reasoning was left untouched.

Total train time: 4 hours

## Benchmarks

### General benchmarks

Benchmarks provided by [@nightmedia](https://huggingface.co/nightmedia). As always, thanks for taking the time :)

```
                                 arc   arc/e boolq
armand0e/Qwen3.5-9B-Opus-Agent   0.589 0.747 0.901
Jackrong/Qwopus3.5-9B-Coder      0.561 0.721 0.89
Qwen3.5-9B                       0.571 0.719 0.895
```

### Targeted benchmarks

Conducted via [BenchLocal](https://github.com/stevibe/BenchLocal).

All benchmarks are 2-shot (1 retry on failure) for ease of comparison with the numbers found in [Jackrong's Qwopus3.5 Coder](https://huggingface.co/Jackrong/Qwopus3.5-9B-Coder-GGUF).

Benchmarks for the other models were run in Q8_0; this model's benchmarks were run in Q4_K_M.
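
To make the "2-shot (1 retry on failure)" protocol concrete, here is a minimal sketch of how such scoring could work. It is not BenchLocal's actual implementation: `passes_within_attempts`, `pass_rate`, and the idea that each test case is a zero-argument callable returning `True` on success are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def passes_within_attempts(run_once: Callable[[], bool], max_attempts: int = 2) -> bool:
    """Return True if any of up to `max_attempts` runs succeeds.

    With max_attempts=2 this models "2-shot": one initial try plus
    one retry if the first try fails. (Hypothetical helper.)
    """
    for _ in range(max_attempts):
        if run_once():
            return True
    return False


def pass_rate(tasks: Sequence[Callable[[], bool]], max_attempts: int = 2) -> float:
    """Percentage of tasks that pass within the allowed attempts."""
    if not tasks:
        raise ValueError("at least one task is required")
    passed = sum(passes_within_attempts(task, max_attempts) for task in tasks)
    return 100.0 * passed / len(tasks)


if __name__ == "__main__":
    # A task that fails on the first try and succeeds on the retry.
    outcomes = iter([False, True])
    print(passes_within_attempts(lambda: next(outcomes)))  # prints True
```

Under this reading, a task counts as passed if either attempt succeeds, so 2-shot scores are an upper bound on what single-attempt scores would be.
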
#### Instruction Following - InstructFollow-15 Metrics

InstructFollow-15 evaluates formatting, count, numbering, sentence, and length constraints.

| Model | Test Set | Comprehensive Score | Dimension Scores (A/B/C/D/E) |
|---|---|---|---|
| armand0e/Qwen3.5-9B-Opus-Agent | InstructFollow-15 | 97 | 100 / 100 / 100 / 85 / 100 |
| Jackrong/Qwopus3.5-9B-Coder | InstructFollow-15 | 93 | 100 / 100 / 100 / 67 / 100 |

#### Code Debugging & Bug Fixing - BugFind-15 Metrics

BugFind-15 evaluates real debugging capability across syntax bugs, logic errors, and trap code.

| Model | Test Set | Comprehensive Score | Dimension Scores (A/B/C/D/E) |
|---|---|---|---|
| armand0e/Qwen3.5-9B-Opus-Agent | BugFind-15 | 84 | 67 / 100 / 87 / 67 / 90 |
| Jackrong/Qwopus3.5-9B-Coder | BugFind-15 | 79 | 67 / 87 / 100 / 77 / 43 |
| Jackrong/MLX-Qwen3.5-9B-DeepSeek-V4-Flash | BugFind-15 | 75 | 67 / 100 / 67 / 57 / 80 |
| armand0e/Qwen3.5-9B-Agent | BugFind-15 | 58 | 29 / 87 / 73 / 20 / 67 |

#### Tool Call Stability - ToolCall-15 Metrics

ToolCall-15 targets stability and precision in direct tool-calling behavior.

| Model | Test Set | Comprehensive Score | Dimension Scores (A/B/C/D/E) |
|---|---|---|---|
| armand0e/Qwen3.5-9B-Opus-Agent | ToolCall-15 | 100 | 100 / 100 / 100 / 100 / 100 |
| Jackrong/Qwopus3.5-9B-Coder | ToolCall-15 | 100 | 100 / 100 / 100 / 100 / 100 |
| Qwen/Qwen3.5-9B | ToolCall-15 | 100 | 100 / 100 / 100 / 100 / 100 |
| armand0e/Qwen3.5-9B-Agent | ToolCall-15 | 93 | 100 / 100 / 100 / 67 / 100 |

#### Complex Agent Performance - HermesAgent-20 Metrics

HermesAgent-20 evaluates complex agent behavior across memory, orchestration, skill use, scheduling, and delegation.

| Model | Test Set | Comprehensive Score | Core Dimensions (Memory / Orchestration / Skills / Scheduling / Boundaries) |
|---|---|---|---|
| Jackrong/Qwopus3.5-9B-Coder | HermesAgent-20 | 85 | 84 / 93 / 88 / 75 / 84 |
| armand0e/Qwen3.5-9B-Opus-Agent | HermesAgent-20 | 80 | 100 / 93 / 80 / 75 / 50 |
| Qwen/Qwen3.5-9B | HermesAgent-20 | 71 | 75 / 58 / 100 / 53 / 69 |
| armand0e/Qwen3.5-9B-Agent | HermesAgent-20 | 68 | 71 / 83 / 43 / 61 / 80 |
| DJLougen/Harmonic-Hermes-9B | HermesAgent-20 | 47 | 60 / 45 / 23 / 69 / 38 |
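
The slash-separated dimension cells are easier to compare once they are split into named values. The sketch below parses two HermesAgent-20 rows from the table above and reports the per-dimension gap; `parse_dimensions` and `dimension_gaps` are hypothetical helpers written for this card, not part of BenchLocal.

```python
from typing import Dict, Sequence

# Dimension order as given in the HermesAgent-20 table header.
HERMES_DIMENSIONS = ("Memory", "Orchestration", "Skills", "Scheduling", "Boundaries")


def parse_dimensions(cell: str, names: Sequence[str] = HERMES_DIMENSIONS) -> Dict[str, int]:
    """Split a cell such as '100 / 93 / 80 / 75 / 50' into named integer scores."""
    values = [int(part) for part in cell.split("/")]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} scores, got {len(values)}")
    return dict(zip(names, values))


def dimension_gaps(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Per-dimension difference a - b (positive means `a` scored higher)."""
    return {name: a[name] - b[name] for name in a}


if __name__ == "__main__":
    opus_agent = parse_dimensions("100 / 93 / 80 / 75 / 50")  # armand0e/Qwen3.5-9B-Opus-Agent
    coder = parse_dimensions("84 / 93 / 88 / 75 / 84")        # Jackrong/Qwopus3.5-9B-Coder
    for name, gap in dimension_gaps(opus_agent, coder).items():
        print(f"{name}: {gap:+d}")
```

For these two rows, the largest gaps are in Memory (+16 for this model) and Boundaries (-34).
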