voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a 9B-scale model based on Qwen/Qwen3.5-9B.
This model is designed to improve reasoning-oriented multiple-choice performance while preserving strong general capability.

In our zero-shot evaluation, the model achieves the best overall aggregate performance among the following three models:

  • Qwen/Qwen3.5-9B
  • DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
  • voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

The largest gains appear on ARC-Challenge, ARC-Easy, and BoolQ.
These results suggest that the model improves structured reasoning and calibrated answer selection.


Model Summary

  • Base model: Qwen/Qwen3.5-9B
  • Model type: Causal language model
  • Primary focus: Reasoning, multiple-choice QA, and general zero-shot evaluation
  • Strengths: ARC, BoolQ, aggregate benchmark performance
  • Trade-offs: Slightly weaker than some baselines on HellaSwag and OpenBookQA

Evaluation Setup

We compare the three models under the same zero-shot setting:

  • Zero-shot (no few-shot examples)
  • Identical benchmark suite for all models
  • Scores reported with standard error
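A setting like this can be reproduced with EleutherAI's lm-evaluation-harness. The invocation below is a sketch under that assumption; the exact harness version and flags used for the numbers in this card are not recorded here.

```shell
# Zero-shot evaluation sketch (assumed tooling: lm-evaluation-harness).
# --num_fewshot 0 matches the 0-shot setting; stderr is reported by default.
lm_eval --model hf \
  --model_args pretrained=voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning,dtype=bfloat16 \
  --tasks arc_challenge,arc_easy,boolq,hellaswag,openbookqa,piqa,winogrande \
  --num_fewshot 0 \
  --batch_size auto
```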

Compared Models

  1. Qwen/Qwen3.5-9B
  2. DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT
  3. voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

Main Results

Representative 7-task Average

We use acc_norm when available, and acc otherwise.

| Model | Avg. Score |
|---|---|
| Qwen/Qwen3.5-9B | 0.7041 |
| DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT | 0.6927 |
| voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning | 0.7133 |

Macro Average over All 12 Reported Metrics

| Model | Macro Avg. |
|---|---|
| Qwen/Qwen3.5-9B | 0.6655 |
| DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT | 0.6587 |
| voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning | 0.6749 |
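Both aggregates can be reproduced directly from the per-task numbers reported in the benchmark table (illustrative script; the values are copied from this card):

```python
# Per-task scores for voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning,
# copied from the benchmark results table below.
scores = {
    ("arc_challenge", "acc"): 0.5631, ("arc_challenge", "acc_norm"): 0.5836,
    ("arc_easy", "acc"): 0.8354,      ("arc_easy", "acc_norm"): 0.7950,
    ("boolq", "acc"): 0.8792,
    ("hellaswag", "acc"): 0.5882,     ("hellaswag", "acc_norm"): 0.7856,
    ("openbookqa", "acc"): 0.3240,    ("openbookqa", "acc_norm"): 0.4260,
    ("piqa", "acc"): 0.7949,          ("piqa", "acc_norm"): 0.7992,
    ("winogrande", "acc"): 0.7245,
}

tasks = ["arc_challenge", "arc_easy", "boolq", "hellaswag",
         "openbookqa", "piqa", "winogrande"]

# Representative 7-task average: acc_norm when available, acc otherwise.
rep = sum(scores.get((t, "acc_norm"), scores[(t, "acc")]) for t in tasks) / len(tasks)

# Macro average over all 12 reported metrics.
macro = sum(scores.values()) / len(scores)

print(round(rep, 4), round(macro, 4))  # 0.7133 0.6749
```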

These results indicate that voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is the strongest overall model in this comparison.


Benchmark Results

| Task | Metric | Qwen/Qwen3.5-9B | DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT | voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning |
|---|---|---|---|---|
| arc_challenge | acc | 0.5427 | 0.5205 | 0.5631 |
| arc_challenge | acc_norm | 0.5555 | 0.5469 | 0.5836 |
| arc_easy | acc | 0.8140 | 0.8018 | 0.8354 |
| arc_easy | acc_norm | 0.7433 | 0.7348 | 0.7950 |
| boolq | acc | 0.8927 | 0.7878 | 0.8792 |
| hellaswag | acc | 0.5827 | 0.6062 | 0.5882 |
| hellaswag | acc_norm | 0.7806 | 0.7944 | 0.7856 |
| openbookqa | acc | 0.3280 | 0.3360 | 0.3240 |
| openbookqa | acc_norm | 0.4280 | 0.4520 | 0.4260 |
| piqa | acc | 0.7905 | 0.7905 | 0.7949 |
| piqa | acc_norm | 0.8014 | 0.8036 | 0.7992 |
| winogrande | acc | 0.7269 | 0.7293 | 0.7245 |

Key Observations

Strengths

  • The model achieves the best overall average score across the compared models.
  • The model shows clear improvements on ARC-Challenge and ARC-Easy.
  • The model strongly outperforms DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on BoolQ.
  • The gains are especially visible on reasoning-oriented benchmarks.

Trade-offs

  • The model is not the top model on every benchmark.
  • The model is slightly behind DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT on HellaSwag and OpenBookQA.
  • The model is close to the baselines on PIQA and Winogrande.

Overall, the model improves the reasoning profile of the base model without uniformly dominating all commonsense tasks.


Interpretation

The benchmark pattern suggests that this model improves:

  • structured answer selection
  • reasoning-oriented multiple-choice QA
  • calibration on science and reading-style benchmarks

At the same time, the gains are smaller on tasks that rely more heavily on narrative continuation or broad commonsense completion priors.

This behavior is consistent with a model that is optimized more toward reasoning quality than pure completion fluency.


Limitations

  • The evaluation here is limited to a small set of common zero-shot benchmarks.
  • Some benchmark differences are small and may fall within the reported standard error.
  • The model should not be described as universally better on every task.
  • Additional evaluations on instruction following, long-context reasoning, coding, multilingual performance, and open-ended generation are still needed.
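For example, the OpenBookQA acc_norm gap between this model and DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT (0.4260 vs. 0.4520) is smaller than the combined standard error, while the ARC-Easy acc_norm gain over the base model is not. A rough check, treating the reported stderrs as independent:

```python
import math

def gap_in_se(a, se_a, b, se_b):
    """Gap between two scores in units of combined standard error
    (rough significance check; assumes independent errors)."""
    return abs(a - b) / math.sqrt(se_a ** 2 + se_b ** 2)

# openbookqa acc_norm: DavidAU vs. this model -> within ~2 SE, not clearly significant
print(gap_in_se(0.4520, 0.0223, 0.4260, 0.0221))  # ≈ 0.83

# arc_easy acc_norm: this model vs. the base -> well beyond 2 SE
print(gap_in_se(0.7950, 0.0083, 0.7433, 0.0090))  # ≈ 4.22
```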

Conclusion

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning is a strong 9B reasoning-oriented model built on top of Qwen/Qwen3.5-9B.

In this comparison, it delivers:

  • the best overall aggregate benchmark score
  • the strongest ARC performance
  • strong BoolQ performance
  • competitive general capability on other zero-shot commonsense tasks

This makes it a good choice for users who care about reasoning-oriented zero-shot performance in a compact 9B model.


Raw Results

Qwen/Qwen3.5-9B

| Task | Metric | Value | Stderr |
|---|---|---|---|
| arc_challenge | acc | 0.5427 | 0.0146 |
| arc_challenge | acc_norm | 0.5555 | 0.0145 |
| arc_easy | acc | 0.8140 | 0.0080 |
| arc_easy | acc_norm | 0.7433 | 0.0090 |
| boolq | acc | 0.8927 | 0.0054 |
| hellaswag | acc | 0.5827 | 0.0049 |
| hellaswag | acc_norm | 0.7806 | 0.0041 |
| openbookqa | acc | 0.3280 | 0.0210 |
| openbookqa | acc_norm | 0.4280 | 0.0221 |
| piqa | acc | 0.7905 | 0.0095 |
| piqa | acc_norm | 0.8014 | 0.0093 |
| winogrande | acc | 0.7269 | 0.0125 |

DavidAU/Qwen3.5-9B-Claude-4.6-HighIQ-INSTRUCT

| Task | Metric | Value | Stderr |
|---|---|---|---|
| arc_challenge | acc | 0.5205 | 0.0146 |
| arc_challenge | acc_norm | 0.5469 | 0.0145 |
| arc_easy | acc | 0.8018 | 0.0082 |
| arc_easy | acc_norm | 0.7348 | 0.0091 |
| boolq | acc | 0.7878 | 0.0072 |
| hellaswag | acc | 0.6062 | 0.0049 |
| hellaswag | acc_norm | 0.7944 | 0.0040 |
| openbookqa | acc | 0.3360 | 0.0211 |
| openbookqa | acc_norm | 0.4520 | 0.0223 |
| piqa | acc | 0.7905 | 0.0095 |
| piqa | acc_norm | 0.8036 | 0.0093 |
| winogrande | acc | 0.7293 | 0.0125 |

voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

| Task | Metric | Value | Stderr |
|---|---|---|---|
| arc_challenge | acc | 0.5631 | 0.0145 |
| arc_challenge | acc_norm | 0.5836 | 0.0144 |
| arc_easy | acc | 0.8354 | 0.0076 |
| arc_easy | acc_norm | 0.7950 | 0.0083 |
| boolq | acc | 0.8792 | 0.0057 |
| hellaswag | acc | 0.5882 | 0.0049 |
| hellaswag | acc_norm | 0.7856 | 0.0041 |
| openbookqa | acc | 0.3240 | 0.0210 |
| openbookqa | acc_norm | 0.4260 | 0.0221 |
| piqa | acc | 0.7949 | 0.0094 |
| piqa | acc_norm | 0.7992 | 0.0093 |
| winogrande | acc | 0.7245 | 0.0126 |

Built with Axolotl

See axolotl config

axolotl version: 0.16.0.dev0

```yaml
# Example config for RCCA-TR A+ (Reliability-Calibrated Conflict-Aware Trust-Region) fine-tuning
# A+ variant: only 1 model in GPU memory (active model)
# Prior = offline cache, EMA = drift buffer

base_model: Qwen/Qwen3.5-9B

plugins:
  - axolotl.integrations.rcca_tr.RCCATRPlugin
  - axolotl.integrations.liger.LigerPlugin

liger_rms_norm: true
liger_glu_activation: true

# Enable RCCA-TR trainer
rcca_tr_trainer: true

# Conflict score hyperparameters
rcca_tr_conflict_lambda1: 1.0      # weight for surprisal in conflict score
rcca_tr_conflict_lambda2: 0.5      # weight for margin-based conflict
rcca_tr_conflict_tau: 1.0          # temperature for conflict sigmoid

# Reliability score hyperparameters
rcca_tr_reliability_beta: 0.5      # balance between stability and evidence
rcca_tr_reliability_tau: 1.0       # temperature for reliability sigmoid

# Trust-region hyperparameters
rcca_tr_epsilon_min: 0.01          # minimum trust-region radius
rcca_tr_epsilon_max: 1.0           # maximum trust-region radius
rcca_tr_kl_lambda: 1.0             # Lagrange multiplier for KL penalty
rcca_tr_use_smooth_objective: true  # smooth g(r_t)*KL vs hinge

# Drift buffer (replaces EMA model)
rcca_tr_ema_decay: 0.999           # decay rate for drift buffer
rcca_tr_drift_gamma: 1.0           # drift → reliability scaling

# Prior cache (optional; omit to use fallback mode)
# rcca_tr_prior_cache_path: ./prior_cache/prior_cache.pt

# Dataset
datasets:
  - path: voidful/gemini-3.1-opus-4.6-reasoning-merged
    type: chat_template
    split: train

dataset_prepared_path: ./prepared_data/rcca_tr

chat_template: qwen3_5

# Training settings
sequence_len: 16384
sample_packing: true
pad_to_sequence_len: true

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5

bf16: true
gradient_checkpointing: true
flash_attention: true

dataloader_num_workers: 0

deepspeed: deepspeed_configs/zero2.json

val_set_size: 0.05

save_strategy: epoch

output_dir: ./outputs/rcca-tr-fft

hub_model_id: voidful/Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning
push_to_hub: true
hub_strategy: end

log_on_each_node: false
logging_steps: 1
```
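The hyperparameter names above suggest the rough shape of the gating logic: temperature-scaled sigmoids for conflict and reliability, a trust-region radius bounded by epsilon_min/epsilon_max, and an EMA-style drift buffer in place of a second model. The sketch below is hypothetical, inferred only from the config key names; the actual RCCA-TR update is defined by the plugin, not by this code.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def conflict_score(surprisal, margin, lam1=1.0, lam2=0.5, tau=1.0):
    # Hypothetical: weighted surprisal + margin-based conflict, squashed by a
    # temperature-tau sigmoid (mirrors rcca_tr_conflict_lambda1/lambda2/tau).
    return sigmoid((lam1 * surprisal + lam2 * margin) / tau)

def trust_region_radius(reliability, eps_min=0.01, eps_max=1.0):
    # Hypothetical: interpolate the KL trust-region radius between
    # rcca_tr_epsilon_min and rcca_tr_epsilon_max as reliability grows.
    return eps_min + (eps_max - eps_min) * reliability

def update_drift_buffer(buffer, params, decay=0.999):
    # EMA-style drift buffer (rcca_tr_ema_decay): tracks parameter drift
    # without keeping a second full model in GPU memory.
    return [decay * b + (1.0 - decay) * p for b, p in zip(buffer, params)]
```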

Qwen3.5-9B-gemini-3.1-opus-4.6-reasoning

This model is a fine-tuned version of Qwen/Qwen3.5-9B on the voidful/gemini-3.1-opus-4.6-reasoning-merged dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 80
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 320
  • total_eval_batch_size: 80
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • training_steps: 6
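The total train batch size follows from the per-device settings:

```python
micro_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 80

# total_train_batch_size reported above
total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)  # 320
```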

Training results

Framework versions

  • Transformers 5.3.0
  • Pytorch 2.10.0+cu128
  • Datasets 4.5.0
  • Tokenizers 0.22.2