# Qwen3.5-122B-A10B-ReActsft-NVFP4
A reinforcement-trained Qwen3.5-122B-A10B with enhanced browser agent interleaved reasoning capabilities, quantized to NVFP4 for efficient deployment.
## Model Summary
| Attribute | Value |
|---|---|
| Base Model | Qwen3.5-122B-A10B (MoE, 122B total / 10B active) |
| Training Method | Reinforcement Learning (browser agent focus) |
| Quantization | NVFP4 (NVIDIA TensorRT-Model-Optimizer, custom-patched) |
| Training Data | <500 fully human-crafted samples |
| Training Time | ~5 hours |
| Training Hardware | NVIDIA RTX PRO 6000 (Blackwell, 96 GB) |
| Context Length | 262,144 tokens |
| MTP Layers | Excluded (acceptance rate too low post-quantization) |
## What's Different
This model is reinforcement-trained over the base Qwen3.5-122B-A10B, with the primary focus on browser agent interleaved reasoning. The reinforcement is not a general-purpose enhancement — it specifically targets the model's ability to operate as a browser automation agent through multi-step thinking-action loops.
### Core Strength: Browser Agent Operations
The base Qwen3.5-122B-A10B can call browser tools, but it does so in a shallow, rapid-fire manner — typically 5 steps with no intermediate reasoning, just blind tool calls. After reinforcement training, this model can sustain 20-30 consecutive browser action rounds in a single session, with reflective thinking between each step.
The key improvements:
- Deep Interleaved Thinking-Action Loops — Instead of firing off tool calls without thinking, the model now pauses to reason about what it observed, what went wrong, and what to do next before each browser action. This transforms shallow 5-step browsing into sustained 20-30 step workflows with reflection at every stage.
- Page State Comprehension — Improved ability to parse and reason about DOM snapshots, accessibility trees, and visual page structures returned by browser tools.
- Error Recovery in Browser Context — When a browser action fails (element not found, page not loaded, unexpected state), the model recovers by re-analyzing the situation rather than blindly retrying.
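The loop described above can be sketched as follows. This is a minimal illustration, not the actual training or inference code: `llm_think`, `llm_choose_action`, and `run_browser_tool` are hypothetical stand-ins for a real model client and browser-tool backend, but the control flow (think, act, observe, re-analyze on failure) mirrors the interleaved pattern this model targets.

```python
# Minimal sketch of an interleaved thinking-action loop with error recovery.
# All callables here are hypothetical stand-ins, not part of this release.

def run_agent(task, llm_think, llm_choose_action, run_browser_tool, max_steps=30):
    transcript = [("task", task)]
    for _ in range(max_steps):
        # 1. Reflect on everything observed so far before acting.
        thought = llm_think(transcript)
        transcript.append(("thinking", thought))
        # 2. Pick the next browser action, or decide to finish.
        action = llm_choose_action(transcript)
        if action["tool"] == "final_response":
            transcript.append(("final_response", action["args"]))
            return transcript
        # 3. Execute; on failure, record the error so the next thinking
        #    step can re-analyze the situation instead of blindly retrying.
        try:
            observation = run_browser_tool(action["tool"], action["args"])
        except Exception as err:
            observation = f"ERROR: {err}"
        transcript.append(("observation", observation))
    return transcript
```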
### Not a General-Purpose Enhancement
This model is not broadly reinforced across all task types. General reasoning, coding, math, and non-browser tool-calling capabilities remain at base model level. The training data and reward signal are specifically designed around browser agent sessions.
The training data consists of fewer than 500 entirely human-crafted samples (not synthetic, not auto-generated — every single sample was manually created from real browser agent sessions). Each sample follows the interleaved pattern:
```
thinking → action(browser_tool) → observation → thinking → action(browser_tool) → ... → thinking → final_response
```
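For concreteness, one such sample might look like the structure below. The field names and content are illustrative assumptions, not the actual dataset schema; the point is the alternation of reflective thinking, tool calls, and tool observations ending in a final response.

```python
# Illustrative shape of one interleaved training sample.
# Field names are hypothetical, not the real dataset schema.
sample = {
    "messages": [
        {"role": "user", "content": "Find the pricing page and report the cheapest plan."},
        {"role": "assistant", "thinking": "I should open the site first.",
         "tool_call": {"name": "browser_navigate", "args": {"url": "https://example.com"}}},
        {"role": "tool", "content": "<accessibility-tree snapshot ...>"},
        {"role": "assistant", "thinking": "A 'Pricing' link is visible; click it.",
         "tool_call": {"name": "browser_click", "args": {"selector": "a[href='/pricing']"}}},
        {"role": "tool", "content": "<pricing page snapshot ...>"},
        {"role": "assistant", "thinking": "Plans are listed; the first is cheapest.",
         "content": "The cheapest plan is Basic at $5/month."},
    ]
}
```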
## Quantization Notes
- Quantized using NVIDIA's official TensorRT-Model-Optimizer repository.
- Not the latest release — a custom-patched version was used to adjust the quantization strategy for compatibility with this architecture.
- MTP (Multi-Token Prediction) speculative decoding layers were excluded due to critically low acceptance rates after FP4 quantization, which degraded throughput rather than improving it.
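Why a low acceptance rate turns speculative decoding into a net loss can be seen from a simplified cost model (the standard geometric analysis of speculative decoding; the numbers below are illustrative assumptions, not measurements of this checkpoint):

```python
# Simplified speculative-decoding cost model: k draft tokens per cycle,
# per-token acceptance rate a, each draft step costing c relative to one
# full forward pass. Illustrative only; not measured on this model.

def speculative_speedup(a, k=4, c=0.1):
    expected_tokens = (1 - a ** (k + 1)) / (1 - a)  # tokens emitted per cycle
    cycle_cost = 1 + k * c                          # one verify pass + k drafts
    return expected_tokens / cycle_cost

print(round(speculative_speedup(0.8), 2))  # ~2.4x with healthy acceptance
print(round(speculative_speedup(0.2), 2))  # ~0.89x: slower than plain decoding
```

With acceptance driven low enough by FP4 quantization, the draft overhead exceeds the tokens saved, which is why the MTP layers were dropped.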
## Deployment
### SGLang (Recommended)
Single-GPU deployment on a high-memory card (96 GB RTX PRO 6000, 80 GB A100 with NVLink, etc.):
```shell
python -m sglang.launch_server \
  --model-path /path/to/Qwen3.5-122B-A10B-ReActsft-nvfp4 \
  --tp 1 \
  --context-length 262144 \
  --host 0.0.0.0 --port 8000
```
Measured throughput: ~83 tok/s (single RTX PRO 6000, FP4, TP1).
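Once launched, the server speaks the OpenAI-compatible chat API. As a minimal sketch (the model name is an assumption; SGLang accepts whatever name the served model registers, and no network call is made here, only payload construction):

```python
import json

# Build a request body for POST http://localhost:8000/v1/chat/completions
# (assumption: server launched as above; model name is illustrative).
def chat_request(messages, max_tokens=1024):
    return json.dumps({
        "model": "Qwen3.5-122B-A10B-ReActsft-nvfp4",
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": False,
    })

payload = chat_request([{"role": "user", "content": "Open example.com and summarize it."}])
# Send with Content-Type: application/json, e.g. via curl or the `openai` client.
```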
### vLLM (Requires Code Modification)
vLLM does not support the NVFP4 LoRA-merged weight format out of the box. You will need to patch the model loader to handle the quantized LoRA-merged checkpoints. This is not a drop-in replacement; expect to modify code under vllm/model_executor/ for weight deserialization.
## Limitations
- MTP disabled — No speculative decoding benefit; throughput is purely autoregressive.
- Browser-focused — Reinforcement targets browser agent scenarios specifically. General reasoning, coding, and non-browser tool use remain at base model level.
- Small training set — <500 samples, but every sample is fully human-crafted. Sufficient for the targeted browser agent domain but not a broad capability boost.
- Quantization artifacts — FP4 introduces minor quality degradation on edge-case reasoning compared to FP16/BF16.
- vLLM compatibility — Requires manual code patches; not officially supported.
## Roadmap
- Next target: Qwen3.5-27B with the same browser-focused reinforcement training, enabling deployment on consumer GPUs (24 GB VRAM with FP4).
## License
Same as the base model: Qwen License.