# Qwen3.5-122B-A10B-ReActsft-NVFP4
A reinforcement-trained Qwen3.5-122B-A10B with enhanced browser agent interleaved reasoning capabilities, quantized to NVFP4 for efficient deployment.
## Model Summary
| Attribute | Value |
|---|---|
| Base Model | Qwen3.5-122B-A10B (MoE, 122B total / 10B active) |
| Training Method | Reinforcement Learning (browser agent focus) |
| Quantization | NVFP4 (NVIDIA TensorRT-Model-Optimizer, custom-patched) |
| Training Data | <500 fully human-crafted samples |
| Training Time | ~5 hours |
| Training Hardware | NVIDIA RTX PRO 6000 (Blackwell, 96 GB) |
| Context Length | 262,144 tokens |
| MTP Layers | Excluded (acceptance rate too low post-quantization) |
## What's Different
This model is reinforcement-trained over the base Qwen3.5-122B-A10B, with the primary focus on browser agent interleaved reasoning. The reinforcement is not a general-purpose enhancement — it specifically targets the model's ability to operate as a browser automation agent through multi-step thinking-action loops.
### Core Strength: Browser Agent Operations
The base Qwen3.5-122B-A10B can call browser tools, but it does so in a shallow, rapid-fire manner — typically 5 steps with no intermediate reasoning, just blind tool calls. After reinforcement training, this model can sustain 20-30 consecutive browser action rounds in a single session, with reflective thinking between each step.
The key improvements:
- Deep Interleaved Thinking-Action Loops — Instead of firing off tool calls without thinking, the model now pauses to reason about what it observed, what went wrong, and what to do next before each browser action. This transforms shallow 5-step browsing into sustained 20-30 step workflows with reflection at every stage.
- Page State Comprehension — Improved ability to parse and reason about DOM snapshots, accessibility trees, and visual page structures returned by browser tools.
- Error Recovery in Browser Context — When a browser action fails (element not found, page not loaded, unexpected state), the model recovers by re-analyzing the situation rather than blindly retrying.
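The loop described above can be sketched as follows. This is a minimal illustration, not the actual training or inference code: `llm_think`, `llm_choose_action`, and `run_browser_tool` are hypothetical stand-ins for a real model client and browser-tool backend, but the control flow (think, act, observe, re-analyze on failure) mirrors the interleaved pattern this model targets.

```python
# Minimal sketch of an interleaved thinking-action loop with error recovery.
# All callables here are hypothetical stand-ins, not part of this release.

def run_agent(task, llm_think, llm_choose_action, run_browser_tool, max_steps=30):
    transcript = [("task", task)]
    for _ in range(max_steps):
        # 1. Reflect on everything observed so far before acting.
        thought = llm_think(transcript)
        transcript.append(("thinking", thought))
        # 2. Pick the next browser action, or decide to finish.
        action = llm_choose_action(transcript)
        if action["tool"] == "final_response":
            transcript.append(("final_response", action["args"]))
            return transcript
        # 3. Execute; on failure, record the error so the next thinking
        #    step can re-analyze the situation instead of blindly retrying.
        try:
            observation = run_browser_tool(action["tool"], action["args"])
        except Exception as err:
            observation = f"ERROR: {err}"
        transcript.append(("observation", observation))
    return transcript
```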
### Not a General-Purpose Enhancement
This model is not broadly reinforced across all task types. General reasoning, coding, math, and non-browser tool-calling capabilities remain at base model level. The training data and reward signal are specifically designed around browser agent sessions.
The training data consists of fewer than 500 entirely human-crafted samples (not synthetic, not auto-generated — every single sample was manually created from real browser agent sessions). Each sample follows the interleaved pattern:
```
thinking → action(browser_tool) → observation → thinking → action(browser_tool) → ... → thinking → final_response
```
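For concreteness, one such sample might look like the structure below. The field names and content are illustrative assumptions, not the actual dataset schema; the point is the alternation of reflective thinking, tool calls, and tool observations ending in a final response.

```python
# Illustrative shape of one interleaved training sample.
# Field names are hypothetical, not the real dataset schema.
sample = {
    "messages": [
        {"role": "user", "content": "Find the pricing page and report the cheapest plan."},
        {"role": "assistant", "thinking": "I should open the site first.",
         "tool_call": {"name": "browser_navigate", "args": {"url": "https://example.com"}}},
        {"role": "tool", "content": "<accessibility-tree snapshot ...>"},
        {"role": "assistant", "thinking": "A 'Pricing' link is visible; click it.",
         "tool_call": {"name": "browser_click", "args": {"selector": "a[href='/pricing']"}}},
        {"role": "tool", "content": "<pricing page snapshot ...>"},
        {"role": "assistant", "thinking": "Plans are listed; the first is cheapest.",
         "content": "The cheapest plan is Basic at $5/month."},
    ]
}
```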
## Quantization Notes
- Quantized using NVIDIA's official TensorRT-Model-Optimizer repository.
- Not the latest release — a custom-patched version was used to adjust the quantization strategy for compatibility with this architecture.
- MTP (Multi-Token Prediction) speculative decoding layers were excluded due to critically low acceptance rates after FP4 quantization, which degraded throughput rather than improving it.
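Why a low acceptance rate turns speculative decoding into a net loss can be seen from a simplified cost model (the standard geometric analysis of speculative decoding; the numbers below are illustrative assumptions, not measurements of this checkpoint):

```python
# Simplified speculative-decoding cost model: k draft tokens per cycle,
# per-token acceptance rate a, each draft step costing c relative to one
# full forward pass. Illustrative only; not measured on this model.

def speculative_speedup(a, k=4, c=0.1):
    expected_tokens = (1 - a ** (k + 1)) / (1 - a)  # tokens emitted per cycle
    cycle_cost = 1 + k * c                          # one verify pass + k drafts
    return expected_tokens / cycle_cost

print(round(speculative_speedup(0.8), 2))  # ~2.4x with healthy acceptance
print(round(speculative_speedup(0.2), 2))  # ~0.89x: slower than plain decoding
```

With acceptance driven low enough by FP4 quantization, the draft overhead exceeds the tokens saved, which is why the MTP layers were dropped.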
## Deployment
### SGLang (Recommended)
Single-GPU deployment on a high-memory card (96 GB RTX PRO 6000, 80 GB A100 with NVLink, etc.):
```shell
python -m sglang.launch_server \
  --model-path /path/to/Qwen3.5-122B-A10B-ReActsft-nvfp4 \
  --tp 1 \
  --context-length 262144 \
  --host 0.0.0.0 --port 8000
```
Measured throughput: ~83 tok/s (single RTX PRO 6000, FP4, TP1).
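Once launched, the server speaks the OpenAI-compatible chat API. As a minimal sketch (the model name is an assumption; SGLang accepts whatever name the served model registers, and no network call is made here, only payload construction):

```python
import json

# Build a request body for POST http://localhost:8000/v1/chat/completions
# (assumption: server launched as above; model name is illustrative).
def chat_request(messages, max_tokens=1024):
    return json.dumps({
        "model": "Qwen3.5-122B-A10B-ReActsft-nvfp4",
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": False,
    })

payload = chat_request([{"role": "user", "content": "Open example.com and summarize it."}])
# Send with Content-Type: application/json, e.g. via curl or the `openai` client.
```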
### vLLM (Requires Code Modification)
vLLM does not support the NVFP4 LoRA-merged weight format out of the box. You will need to patch the model loader to handle the quantized LoRA-merged checkpoints. This is not a drop-in replacement; expect to modify code under vllm/model_executor/ for weight deserialization.
## Limitations
- MTP disabled — No speculative decoding benefit; throughput is purely autoregressive.
- Browser-focused — Reinforcement targets browser agent scenarios specifically. General reasoning, coding, and non-browser tool use remain at base model level.
- Small training set — <500 samples, but every sample is fully human-crafted. Sufficient for the targeted browser agent domain but not a broad capability boost.
- Quantization artifacts — FP4 introduces minor quality degradation on edge-case reasoning compared to FP16/BF16.
- vLLM compatibility — Requires manual code patches; not officially supported.
## Roadmap
- Next target: Qwen3.5-27B with the same browser-focused reinforcement training, enabling deployment on consumer GPUs (24 GB VRAM with FP4).
## License
Same as the base model: Qwen License.