voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a 9B-scale model based on Qwen/Qwen3.5-9B.
This model is designed to improve reasoning-oriented multiple-choice performance while preserving strong general capability.
In our zero-shot evaluation, the model achieves the best overall aggregate performance among the following three models:
- Qwen/Qwen3.5-9B
- DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
- voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
The largest gains appear on ARC-Challenge, ARC-Easy, and BoolQ.
These results suggest that the model improves structured reasoning and calibrated answer selection.
Model Summary
- Base model: Qwen/Qwen3.5-9B
- Model type: Causal language model
- Primary focus: Reasoning, multiple-choice QA, and general zero-shot evaluation
- Strengths: ARC, BoolQ, aggregate benchmark performance
- Trade-offs: Slightly weaker than some baselines on HellaSwag and OpenBookQA
Evaluation Setup
We compare three models under the same zero-shot setting:
- Zero-shot (no few-shot examples)
- Identical benchmark suite for all models
- Scores reported with standard error
Compared Models
- Qwen/Qwen3.5-9B
- DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
- voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
Main Results
Representative 7-task Average
We use acc_norm when available, and acc otherwise.
| Model | Avg. Score |
|---|---|
| Qwen/Qwen3.5-9B | 0.7041 |
| DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT | 0.6927 |
| voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning | 0.7133 |
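The 7-task average can be reproduced from the per-task scores by taking one score per task, preferring acc_norm where it is reported. A minimal sketch using this model's values from the Raw Results section:

```python
# Per-task metrics for voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning,
# copied from the Raw Results table in this card.
results = {
    "arc_challenge": {"acc": 0.5631, "acc_norm": 0.5836},
    "arc_easy": {"acc": 0.8354, "acc_norm": 0.7950},
    "boolq": {"acc": 0.8792},
    "hellaswag": {"acc": 0.5882, "acc_norm": 0.7856},
    "openbookqa": {"acc": 0.3240, "acc_norm": 0.4260},
    "piqa": {"acc": 0.7949, "acc_norm": 0.7992},
    "winogrande": {"acc": 0.7245},
}

def task_average(results: dict) -> float:
    """Average one score per task, preferring acc_norm over acc."""
    scores = [m.get("acc_norm", m.get("acc")) for m in results.values()]
    return sum(scores) / len(scores)

print(round(task_average(results), 4))  # 0.7133
```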
Macro Average over All 12 Reported Metrics
| Model | Macro Avg. |
|---|---|
| Qwen/Qwen3.5-9B | 0.6655 |
| DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT | 0.6587 |
| voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning | 0.6749 |
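The macro average simply weights each of the 12 reported metrics equally (two metrics for most tasks, one each for BoolQ and Winogrande). A sketch using this model's raw results:

```python
# All 12 reported metrics for voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
# (values from the Raw Results table in this card).
metrics = [
    0.5631, 0.5836,  # arc_challenge: acc, acc_norm
    0.8354, 0.7950,  # arc_easy: acc, acc_norm
    0.8792,          # boolq: acc
    0.5882, 0.7856,  # hellaswag: acc, acc_norm
    0.3240, 0.4260,  # openbookqa: acc, acc_norm
    0.7949, 0.7992,  # piqa: acc, acc_norm
    0.7245,          # winogrande: acc
]
macro_avg = sum(metrics) / len(metrics)
print(round(macro_avg, 4))  # 0.6749
```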
These results indicate that voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is the strongest overall model in this comparison.
Benchmark Results
| Task | Metric | Qwen3.5-9B | DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT | voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning |
|---|---|---|---|---|
| arc_challenge | acc | 0.5427 | 0.5205 | 0.5631 |
| arc_challenge | acc_norm | 0.5555 | 0.5469 | 0.5836 |
| arc_easy | acc | 0.8140 | 0.8018 | 0.8354 |
| arc_easy | acc_norm | 0.7433 | 0.7348 | 0.7950 |
| boolq | acc | 0.8927 | 0.7878 | 0.8792 |
| hellaswag | acc | 0.5827 | 0.6062 | 0.5882 |
| hellaswag | acc_norm | 0.7806 | 0.7944 | 0.7856 |
| openbookqa | acc | 0.3280 | 0.3360 | 0.3240 |
| openbookqa | acc_norm | 0.4280 | 0.4520 | 0.4260 |
| piqa | acc | 0.7905 | 0.7905 | 0.7949 |
| piqa | acc_norm | 0.8014 | 0.8036 | 0.7992 |
| winogrande | acc | 0.7269 | 0.7293 | 0.7245 |
Key Observations
Strengths
- The model achieves the best overall average score across the compared models.
- The model shows clear improvements on ARC-Challenge and ARC-Easy.
- The model strongly outperforms DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on BoolQ.
- The gains are especially visible on reasoning-oriented benchmarks.
Trade-offs
- The model is not the top model on every benchmark.
- The model is slightly behind DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on HellaSwag and OpenBookQA.
- The model is close to the baselines on PIQA and Winogrande.
Overall, the model improves the reasoning profile of the base model without uniformly dominating all commonsense tasks.
Interpretation
The benchmark pattern suggests that this model improves:
- structured answer selection
- reasoning-oriented multiple-choice QA
- calibration on science and reading-style benchmarks
At the same time, the gains are smaller on tasks that rely more heavily on narrative continuation or broad commonsense completion priors.
This behavior is consistent with a model that is optimized more toward reasoning quality than pure completion fluency.
Limitations
- The evaluation here is limited to a small set of common zero-shot benchmarks.
- Some benchmark differences are small and may fall within the reported standard error.
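One way to gauge this is to compare a score difference against the combined standard error of the two scores. The z approximation below is only a rough heuristic (it treats the two errors as independent even though the models are scored on the same examples), but it illustrates the point:

```python
import math

def approx_z(score_a: float, se_a: float, score_b: float, se_b: float) -> float:
    """Rough z-statistic for the difference between two benchmark scores,
    treating the two standard errors as independent (an approximation)."""
    return (score_a - score_b) / math.sqrt(se_a**2 + se_b**2)

# ARC-Challenge acc_norm: this model vs. the base model (values from Raw Results).
z = approx_z(0.5836, 0.0144, 0.5555, 0.0145)
print(round(z, 2))  # ~1.38, below the ~1.96 threshold for 95% confidence
```

Even one of the larger gains in the table sits below the conventional 95% threshold under this rough check, which is why the hedging above matters.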
- The model should not be described as universally better on every task.
- Additional evaluations on instruction following, long-context reasoning, coding, multilingual performance, and open-ended generation are still needed.
Conclusion
voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a strong 9B reasoning-oriented model built on top of Qwen/Qwen3.5-9B.
In this comparison, it delivers:
- the best overall aggregate benchmark score
- the strongest ARC performance
- strong BoolQ performance
- competitive general capability on other zero-shot commonsense tasks
This makes it a good choice for users who care about reasoning-oriented zero-shot performance in a compact 9B model.
Raw Results
Qwen/Qwen3.5-9B
| Task | Metric | Value | Stderr |
|---|---|---|---|
| arc_challenge | acc | 0.5427 | 0.0146 |
| arc_challenge | acc_norm | 0.5555 | 0.0145 |
| arc_easy | acc | 0.8140 | 0.0080 |
| arc_easy | acc_norm | 0.7433 | 0.0090 |
| boolq | acc | 0.8927 | 0.0054 |
| hellaswag | acc | 0.5827 | 0.0049 |
| hellaswag | acc_norm | 0.7806 | 0.0041 |
| openbookqa | acc | 0.3280 | 0.0210 |
| openbookqa | acc_norm | 0.4280 | 0.0221 |
| piqa | acc | 0.7905 | 0.0095 |
| piqa | acc_norm | 0.8014 | 0.0093 |
| winogrande | acc | 0.7269 | 0.0125 |
DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
| Task | Metric | Value | Stderr |
|---|---|---|---|
| arc_challenge | acc | 0.5205 | 0.0146 |
| arc_challenge | acc_norm | 0.5469 | 0.0145 |
| arc_easy | acc | 0.8018 | 0.0082 |
| arc_easy | acc_norm | 0.7348 | 0.0091 |
| boolq | acc | 0.7878 | 0.0072 |
| hellaswag | acc | 0.6062 | 0.0049 |
| hellaswag | acc_norm | 0.7944 | 0.0040 |
| openbookqa | acc | 0.3360 | 0.0211 |
| openbookqa | acc_norm | 0.4520 | 0.0223 |
| piqa | acc | 0.7905 | 0.0095 |
| piqa | acc_norm | 0.8036 | 0.0093 |
| winogrande | acc | 0.7293 | 0.0125 |
voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
| Task | Metric | Value | Stderr |
|---|---|---|---|
| arc_challenge | acc | 0.5631 | 0.0145 |
| arc_challenge | acc_norm | 0.5836 | 0.0144 |
| arc_easy | acc | 0.8354 | 0.0076 |
| arc_easy | acc_norm | 0.7950 | 0.0083 |
| boolq | acc | 0.8792 | 0.0057 |
| hellaswag | acc | 0.5882 | 0.0049 |
| hellaswag | acc_norm | 0.7856 | 0.0041 |
| openbookqa | acc | 0.3240 | 0.0210 |
| openbookqa | acc_norm | 0.4260 | 0.0221 |
| piqa | acc | 0.7949 | 0.0094 |
| piqa | acc_norm | 0.7992 | 0.0093 |
| winogrande | acc | 0.7245 | 0.0126 |
See axolotl config
axolotl version: 0.16.0.dev0
# Example config for RCCA-TR A+ (Reliability-Calibrated Conflict-Aware Trust-Region) fine-tuning
# A+ variant: only 1 model in GPU memory (active model)
# Prior = offline cache, EMA = drift buffer
base_model: Qwen/Qwen3.5-9B
plugins:
- axolotl.integrations.rcca_tr.RCCATRPlugin
- axolotl.integrations.liger.LigerPlugin
liger_rms_norm: true
liger_glu_activation: true
# Enable RCCA-TR trainer
rcca_tr_trainer: true
# Conflict score hyperparameters
rcca_tr_conflict_lambda1: 1.0 # weight for surprisal in conflict score
rcca_tr_conflict_lambda2: 0.5 # weight for margin-based conflict
rcca_tr_conflict_tau: 1.0 # temperature for conflict sigmoid
# Reliability score hyperparameters
rcca_tr_reliability_beta: 0.5 # balance between stability and evidence
rcca_tr_reliability_tau: 1.0 # temperature for reliability sigmoid
# Trust-region hyperparameters
rcca_tr_epsilon_min: 0.01 # minimum trust-region radius
rcca_tr_epsilon_max: 1.0 # maximum trust-region radius
rcca_tr_kl_lambda: 1.0 # Lagrange multiplier for KL penalty
rcca_tr_use_smooth_objective: true # smooth g(r_t)*KL vs hinge
# Drift buffer (replaces EMA model)
rcca_tr_ema_decay: 0.999 # decay rate for drift buffer
rcca_tr_drift_gamma: 1.0 # drift → reliability scaling
# Prior cache (optional; omit to use fallback mode)
# rcca_tr_prior_cache_path: ./prior_cache/prior_cache.pt
# Dataset
datasets:
- path: voidful/gemini-3.1-opus-4.6-reasoning-merged
type: chat_template
split: train
dataset_prepared_path: ./prepared_data/rcca_tr
chat_template: qwen3_5
# Training settings
sequence_len: 16384
sample_packing: true
pad_to_sequence_len: true
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5
bf16: true
gradient_checkpointing: true
flash_attention: true
dataloader_num_workers: 0
deepspeed: deepspeed_configs/zero2.json
val_set_size: 0.05
save_strategy: epoch
output_dir: ./outputs/rcca-tr-fft
hub_model_id: voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
push_to_hub: true
hub_strategy: end
log_on_each_node: false
logging_steps: 1
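The RCCA-TR conflict score itself is not documented in this card. Purely as an illustration of how the `rcca_tr_conflict_lambda1`, `rcca_tr_conflict_lambda2`, and `rcca_tr_conflict_tau` hyperparameters above could plausibly combine, here is a hypothetical sketch; this is not the plugin's actual implementation:

```python
import math

def conflict_score(surprisal: float, margin_conflict: float,
                   lambda1: float = 1.0, lambda2: float = 0.5,
                   tau: float = 1.0) -> float:
    """Hypothetical reading of the config: a weighted sum of surprisal and
    margin-based conflict, squashed to [0, 1] by a temperature-scaled sigmoid.
    NOT the RCCA-TR plugin's actual code."""
    raw = lambda1 * surprisal + lambda2 * margin_conflict
    return 1.0 / (1.0 + math.exp(-raw / tau))

print(conflict_score(0.0, 0.0))  # 0.5 when there is no evidence either way
```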
Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
This model is a fine-tuned version of Qwen/Qwen3.5-9B on the voidful/gemini-3.1-opus-4.6-reasoning-merged dataset.
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 80
- gradient_accumulation_steps: 4
- total_train_batch_size: 320
- total_eval_batch_size: 80
- optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- training_steps: 6
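The effective batch size follows directly from the per-device settings above:

```python
# Effective (total) train batch size from the hyperparameters above.
micro_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 80

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)  # 320
```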
Training results
Framework versions
- Transformers 5.3.0
- Pytorch 2.10.0+cu128
- Datasets 4.5.0
- Tokenizers 0.22.2