voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a 9B-scale model based on Qwen/Qwen3.5-9B.
This model is designed to improve reasoning-oriented multiple-choice performance while preserving strong general capability.

In our zero-shot evaluation, the model achieves the best overall aggregate performance among the following three models:

  • Qwen/Qwen3.5-9B
  • DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
  • voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

The largest gains appear on ARC-Challenge, ARC-Easy, and BoolQ.
These results suggest that the model improves structured reasoning and calibrated answer selection.


Model Summary

  • Base model: Qwen/Qwen3.5-9B
  • Model type: Causal language model
  • Primary focus: Reasoning, multiple-choice QA, and general zero-shot evaluation
  • Strengths: ARC, BoolQ, aggregate benchmark performance
  • Trade-offs: Slightly weaker than some baselines on HellaSwag and OpenBookQA

Evaluation Setup

We compare the three models under the same zero-shot setting:

  • Zero-shot (no few-shot examples)
  • Identical benchmark suite for all models
  • Scores reported with standard error
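A setting like this can be reproduced with EleutherAI's lm-evaluation-harness. The invocation below is a sketch under that assumption; the exact harness version and flags used for the numbers in this card are not recorded here.

```shell
# Zero-shot evaluation sketch (assumed tooling: lm-evaluation-harness).
# --num_fewshot 0 matches the 0-shot setting; stderr is reported by default.
lm_eval --model hf \
  --model_args pretrained=voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning,dtype=bfloat16 \
  --tasks arc_challenge,arc_easy,boolq,hellaswag,openbookqa,piqa,winogrande \
  --num_fewshot 0 \
  --batch_size auto
```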

Compared Models

  1. Qwen/Qwen3.5-9B
  2. DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
  3. voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

Main Results

Representative 7-task Average

We use acc_norm when available, and acc otherwise.

| Model | Avg. Score |
|---|---|
| Qwen/Qwen3.5-9B | 0.7041 |
| DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT | 0.6927 |
| voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning | 0.7133 |

Macro Average over All 12 Reported Metrics

| Model | Macro Avg. |
|---|---|
| Qwen/Qwen3.5-9B | 0.6655 |
| DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT | 0.6587 |
| voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning | 0.6749 |
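Both aggregates can be reproduced directly from the per-task numbers reported in the benchmark table (illustrative script; the values are copied from this card):

```python
# Per-task scores for voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning,
# copied from the benchmark results table below.
scores = {
    ("arc_challenge", "acc"): 0.5631, ("arc_challenge", "acc_norm"): 0.5836,
    ("arc_easy", "acc"): 0.8354,      ("arc_easy", "acc_norm"): 0.7950,
    ("boolq", "acc"): 0.8792,
    ("hellaswag", "acc"): 0.5882,     ("hellaswag", "acc_norm"): 0.7856,
    ("openbookqa", "acc"): 0.3240,    ("openbookqa", "acc_norm"): 0.4260,
    ("piqa", "acc"): 0.7949,          ("piqa", "acc_norm"): 0.7992,
    ("winogrande", "acc"): 0.7245,
}

tasks = ["arc_challenge", "arc_easy", "boolq", "hellaswag",
         "openbookqa", "piqa", "winogrande"]

# Representative 7-task average: acc_norm when available, acc otherwise.
rep = sum(scores.get((t, "acc_norm"), scores[(t, "acc")]) for t in tasks) / len(tasks)

# Macro average over all 12 reported metrics.
macro = sum(scores.values()) / len(scores)

print(round(rep, 4), round(macro, 4))  # 0.7133 0.6749
```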

These results indicate that voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is the strongest overall model in this comparison.


Benchmark Results

| Task | Metric | Qwen/Qwen3.5-9B | DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT | voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning |
|---|---|---|---|---|
| arc_challenge | acc | 0.5427 | 0.5205 | 0.5631 |
| arc_challenge | acc_norm | 0.5555 | 0.5469 | 0.5836 |
| arc_easy | acc | 0.8140 | 0.8018 | 0.8354 |
| arc_easy | acc_norm | 0.7433 | 0.7348 | 0.7950 |
| boolq | acc | 0.8927 | 0.7878 | 0.8792 |
| hellaswag | acc | 0.5827 | 0.6062 | 0.5882 |
| hellaswag | acc_norm | 0.7806 | 0.7944 | 0.7856 |
| openbookqa | acc | 0.3280 | 0.3360 | 0.3240 |
| openbookqa | acc_norm | 0.4280 | 0.4520 | 0.4260 |
| piqa | acc | 0.7905 | 0.7905 | 0.7949 |
| piqa | acc_norm | 0.8014 | 0.8036 | 0.7992 |
| winogrande | acc | 0.7269 | 0.7293 | 0.7245 |

Key Observations

Strengths

  • The model achieves the best overall average score across the compared models.
  • The model shows clear improvements on ARC-Challenge and ARC-Easy.
  • The model strongly outperforms DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on BoolQ.
  • The gains are especially visible on reasoning-oriented benchmarks.

Trade-offs

  • The model is not the top model on every benchmark.
  • The model is slightly behind DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on HellaSwag and OpenBookQA.
  • The model is close to the baselines on PIQA and Winogrande.

Overall, the model improves the reasoning profile of the base model without uniformly dominating all commonsense tasks.


Interpretation

The benchmark pattern suggests that this model improves:

  • structured answer selection
  • reasoning-oriented multiple-choice QA
  • calibration on science and reading-style benchmarks

At the same time, the gains are smaller on tasks that rely more heavily on narrative continuation or broad commonsense completion priors.

This behavior is consistent with a model that is optimized more toward reasoning quality than pure completion fluency.


Limitations

  • The evaluation here is limited to a small set of common zero-shot benchmarks.
  • Some benchmark differences are small and may fall within the reported standard error.
  • The model should not be described as universally better on every task.
  • Additional evaluations on instruction following, long-context reasoning, coding, multilingual performance, and open-ended generation are still needed.
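For example, the OpenBookQA acc_norm gap between this model and DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT (0.4260 vs. 0.4520) is smaller than the combined standard error, while the ARC-Easy acc_norm gain over the base model is not. A rough check, treating the reported stderrs as independent:

```python
import math

def gap_in_se(a, se_a, b, se_b):
    """Gap between two scores in units of combined standard error
    (rough significance check; assumes independent errors)."""
    return abs(a - b) / math.sqrt(se_a ** 2 + se_b ** 2)

# openbookqa acc_norm: DavidAU vs. this model -> within ~2 SE, not clearly significant
print(gap_in_se(0.4520, 0.0223, 0.4260, 0.0221))  # ≈ 0.83

# arc_easy acc_norm: this model vs. the base -> well beyond 2 SE
print(gap_in_se(0.7950, 0.0083, 0.7433, 0.0090))  # ≈ 4.22
```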

Conclusion

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a strong 9B reasoning-oriented model built on top of Qwen/Qwen3.5-9B.

In this comparison, it delivers:

  • the best overall aggregate benchmark score
  • the strongest ARC performance
  • strong BoolQ performance
  • competitive general capability on other zero-shot commonsense tasks

This makes it a good choice for users who care about reasoning-oriented zero-shot performance in a compact 9B model.


Raw Results

Qwen/Qwen3.5-9B

| Task | Metric | Value | Stderr |
|---|---|---|---|
| arc_challenge | acc | 0.5427 | 0.0146 |
| arc_challenge | acc_norm | 0.5555 | 0.0145 |
| arc_easy | acc | 0.8140 | 0.0080 |
| arc_easy | acc_norm | 0.7433 | 0.0090 |
| boolq | acc | 0.8927 | 0.0054 |
| hellaswag | acc | 0.5827 | 0.0049 |
| hellaswag | acc_norm | 0.7806 | 0.0041 |
| openbookqa | acc | 0.3280 | 0.0210 |
| openbookqa | acc_norm | 0.4280 | 0.0221 |
| piqa | acc | 0.7905 | 0.0095 |
| piqa | acc_norm | 0.8014 | 0.0093 |
| winogrande | acc | 0.7269 | 0.0125 |

DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT

| Task | Metric | Value | Stderr |
|---|---|---|---|
| arc_challenge | acc | 0.5205 | 0.0146 |
| arc_challenge | acc_norm | 0.5469 | 0.0145 |
| arc_easy | acc | 0.8018 | 0.0082 |
| arc_easy | acc_norm | 0.7348 | 0.0091 |
| boolq | acc | 0.7878 | 0.0072 |
| hellaswag | acc | 0.6062 | 0.0049 |
| hellaswag | acc_norm | 0.7944 | 0.0040 |
| openbookqa | acc | 0.3360 | 0.0211 |
| openbookqa | acc_norm | 0.4520 | 0.0223 |
| piqa | acc | 0.7905 | 0.0095 |
| piqa | acc_norm | 0.8036 | 0.0093 |
| winogrande | acc | 0.7293 | 0.0125 |

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

| Task | Metric | Value | Stderr |
|---|---|---|---|
| arc_challenge | acc | 0.5631 | 0.0145 |
| arc_challenge | acc_norm | 0.5836 | 0.0144 |
| arc_easy | acc | 0.8354 | 0.0076 |
| arc_easy | acc_norm | 0.7950 | 0.0083 |
| boolq | acc | 0.8792 | 0.0057 |
| hellaswag | acc | 0.5882 | 0.0049 |
| hellaswag | acc_norm | 0.7856 | 0.0041 |
| openbookqa | acc | 0.3240 | 0.0210 |
| openbookqa | acc_norm | 0.4260 | 0.0221 |
| piqa | acc | 0.7949 | 0.0094 |
| piqa | acc_norm | 0.7992 | 0.0093 |
| winogrande | acc | 0.7245 | 0.0126 |

Built with Axolotl

See axolotl config

axolotl version: 0.16.0.dev0

```yaml
# Example config for RCCA-TR A+ (Reliability-Calibrated Conflict-Aware Trust-Region) fine-tuning
# A+ variant: only 1 model in GPU memory (active model)
# Prior = offline cache, EMA = drift buffer

base_model: Qwen/Qwen3.5-9B

plugins:
  - axolotl.integrations.rcca_tr.RCCATRPlugin
  - axolotl.integrations.liger.LigerPlugin

liger_rms_norm: true
liger_glu_activation: true

# Enable RCCA-TR trainer
rcca_tr_trainer: true

# Conflict score hyperparameters
rcca_tr_conflict_lambda1: 1.0      # weight for surprisal in conflict score
rcca_tr_conflict_lambda2: 0.5      # weight for margin-based conflict
rcca_tr_conflict_tau: 1.0          # temperature for conflict sigmoid

# Reliability score hyperparameters
rcca_tr_reliability_beta: 0.5      # balance between stability and evidence
rcca_tr_reliability_tau: 1.0       # temperature for reliability sigmoid

# Trust-region hyperparameters
rcca_tr_epsilon_min: 0.01          # minimum trust-region radius
rcca_tr_epsilon_max: 1.0           # maximum trust-region radius
rcca_tr_kl_lambda: 1.0             # Lagrange multiplier for KL penalty
rcca_tr_use_smooth_objective: true  # smooth g(r_t)*KL vs hinge

# Drift buffer (replaces EMA model)
rcca_tr_ema_decay: 0.999           # decay rate for drift buffer
rcca_tr_drift_gamma: 1.0           # drift → reliability scaling

# Prior cache (optional; omit to use fallback mode)
# rcca_tr_prior_cache_path: ./prior_cache/prior_cache.pt

# Dataset
datasets:
  - path: voidful/gemini-3.1-opus-4.6-reasoning-merged
    type: chat_template
    split: train

dataset_prepared_path: ./prepared_data/rcca_tr

chat_template: qwen3_5

# Training settings
sequence_len: 16384
sample_packing: true
pad_to_sequence_len: true

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5

bf16: true
gradient_checkpointing: true
flash_attention: true

dataloader_num_workers: 0

deepspeed: deepspeed_configs/zero2.json

val_set_size: 0.05

save_strategy: epoch

output_dir: ./outputs/rcca-tr-fft

hub_model_id: voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
push_to_hub: true
hub_strategy: end

log_on_each_node: false
logging_steps: 1
```
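The hyperparameter names above suggest the rough shape of the gating logic: temperature-scaled sigmoids for conflict and reliability, a trust-region radius bounded by epsilon_min/epsilon_max, and an EMA-style drift buffer in place of a second model. The sketch below is hypothetical, inferred only from the config key names; the actual RCCA-TR update is defined by the plugin, not by this code.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def conflict_score(surprisal, margin, lam1=1.0, lam2=0.5, tau=1.0):
    # Hypothetical: weighted surprisal + margin-based conflict, squashed by a
    # temperature-tau sigmoid (mirrors rcca_tr_conflict_lambda1/lambda2/tau).
    return sigmoid((lam1 * surprisal + lam2 * margin) / tau)

def trust_region_radius(reliability, eps_min=0.01, eps_max=1.0):
    # Hypothetical: interpolate the KL trust-region radius between
    # rcca_tr_epsilon_min and rcca_tr_epsilon_max as reliability grows.
    return eps_min + (eps_max - eps_min) * reliability

def update_drift_buffer(buffer, params, decay=0.999):
    # EMA-style drift buffer (rcca_tr_ema_decay): tracks parameter drift
    # without keeping a second full model in GPU memory.
    return [decay * b + (1.0 - decay) * p for b, p in zip(buffer, params)]
```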

Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

This model is a fine-tuned version of Qwen/Qwen3.5-9B on the voidful/gemini-3.1-opus-4.6-reasoning-merged dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 80
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 320
  • total_eval_batch_size: 80
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • training_steps: 6
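The total train batch size follows from the per-device settings:

```python
micro_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 80

# total_train_batch_size reported above
total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)  # 320
```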

Training results

Framework versions

  • Transformers 5.3.0
  • Pytorch 2.10.0+cu128
  • Datasets 4.5.0
  • Tokenizers 0.22.2