darkolorin committed
Commit e340d72 · verified · 1 Parent(s): 626e786

Update README for v5 router

Files changed (1): README.md +45 -30
README.md CHANGED
@@ -9,7 +9,7 @@ tags:
  library_name: mlx
  ---
 
- # Vibe Coding Router v4
+ # Vibe Coding Router v5
 
  A three-tier cascaded router for coding tasks that routes prompts between:
 
@@ -17,59 +17,74 @@ A three-tier cascaded router for coding tasks that routes prompts between:
  - **Sonnet**: Claude Sonnet 4.6 (medium-complexity cloud)
  - **Opus**: Claude Opus 4.6 (max-capability cloud)
 
+ ## What's New in v5
+
+ v4 suffered from **inverted routing** — simple queries went to cloud while complex ones stayed local. Root cause: length-quality anti-correlation in training data combined with PID loss reward-weight amplification. v5 fixes this with:
+
+ 1. **7 new complexity features** (45 handcrafted total): `is_coding_task`, `junk_score`, `scope_breadth`, `imperative_verb_density`, `noun_phrase_density`, `interaction_complexity`, `requirement_clause_count`
+ 2. **Centered complexity premium**: Adjusts training margins by `premium * (complexity_score - center)` so complex tasks push toward cloud and simple tasks push toward local
+ 3. **Junk prompt clamping**: 75 junk/greeting prompts neutralized (p_teacher=0.5, margin=0.0)
+ 4. **Reward weight cap**: PID loss reward_weight capped at 0.5 to prevent outlier margin dominance
+
  ## Architecture
 
  Two cascaded binary MLP routers trained with **Privileged Information Distillation (PID)**:
 
- - **Router A** (local vs cloud): 70-dim input -> [64, 32] -> 1, dropout=0.2
- - **Router B** (sonnet vs opus): 70-dim input -> [32, 16] -> 1, dropout=0.0
+ - **Router A** (local vs cloud): 77-dim input -> [32, 16] -> 1, dropout=0.2, LayerNorm+ReLU
+ - **Router B** (sonnet vs opus): 77-dim input -> [128, 64] -> 1, dropout=0.0, LayerNorm+ReLU
 
- Features: 38 handcrafted code features + 32 PCA-reduced sentence embeddings (all-MiniLM-L6-v2).
+ Features: 45 handcrafted code features + 32 PCA-reduced sentence embeddings (all-MiniLM-L6-v2).
 
  ## Training
 
  - **Data**: 1,644 coding prompts with real quality scores from all three models
  - **Judge**: GPT-5.4 scoring correctness, completeness, code quality, explanation
- - **Loss**: PID (reward-weighted CE + KL divergence), β_kl=0.02
- - **Label smoothing**: epsilon=0.05, cost-aware margin for Router B (cost_premium=0.03)
+ - **Loss**: PID (reward-weighted CE + KL divergence), β_kl=0.02, reward_cap=0.5
+ - **Label smoothing**: ε=0.05, cost-aware margin for Router B (cost_premium=0.03)
+ - **Complexity premium**: 2.0, centered at 0.3
  - **HP sweep**: 108 configurations, 3-way split (1150 train / 247 val / 247 test)
- - **Thresholds**: calibrated on validation set only
+ - **Threshold A**: 0.60 (manually tuned for routing behavior — see note below)
+ - **Threshold B**: 0.474 (calibrated on validation set)
 
- ## Test Set Results
-
- | Metric | Value |
- |--------|-------|
- | Utility | 0.6349 |
- | Regret | 0.0830 |
- | vs Always-Opus | +0.63% utility, 40.9% cost savings |
-
- ## Routing Distribution (test set)
-
- | Tier | Rate | Use Case |
- |------|------|----------|
- | Local | 19.4% | Simple tasks, explanations, basic code gen |
- | Sonnet | 21.5% | Medium complexity, standard debugging |
- | Opus | 59.1% | Architecture, complex multi-file tasks |
-
- ## Thresholds
-
- - Router A: 0.474 (p(cloud) >= threshold -> route to cloud)
- - Router B: 0.474 (p(opus) >= threshold -> route to Opus, else Sonnet)
+ ### Threshold Note
+
+ The utility-optimal Router A threshold (0.01) routes almost nothing to local because cloud quality is genuinely equal or better on nearly all prompts. The manual threshold of 0.60 trades ~1.4% utility for correct routing intuition: simple/fast tasks run locally with zero latency, while complex tasks go to cloud.
+
+ ## Real-World Routing (28 test queries, threshold_a=0.60)
+
+ | Category | Local | Sonnet | Opus |
+ |----------|-------|--------|------|
+ | Simple (8) | 5 (62%) | 0 | 3 (38%) |
+ | Medium (8) | 3 (38%) | 0 | 5 (62%) |
+ | Complex (6) | 1 (17%) | 1 (17%) | 4 (67%) |
+
+ v4 comparison: simple→local was 0/8 (now 5/8), complex→local was 6/6 (now 1/6).
+
+ ## Test Set Results (calibrated thresholds)
+
+ | Metric | Value |
+ |--------|-------|
+ | Utility | 0.6205 |
+ | Oracle Utility | 0.7179 |
+ | Regret | 0.0973 |
 
  ## Files
 
- - `router_a.safetensors` - Router A weights (64x32 MLP)
- - `router_b.safetensors` - Router B weights (32x16 MLP)
- - `config.json` - Model config, thresholds, training results
- - `scaler.pkl` - StandardScaler for feature normalization
- - `embedding_extractor.pkl` - PCA-reduced sentence-transformers extractor
+ - `router_a.safetensors` — Router A weights (32×16 MLP, 13KB)
+ - `router_b.safetensors` — Router B weights (128×64 MLP, 76KB)
+ - `config.json` — Model config, thresholds, HP, training results
+ - `scaler.pkl` — StandardScaler for feature normalization
+ - `embedding_extractor.pkl` — PCA-reduced sentence-transformers extractor
+ - `sweep_results.json` — Full 108-config HP sweep results
 
  ## Usage
 
  ```python
  from router.three_tier_inference import ThreeTierRouter
 
- router = ThreeTierRouter("models/three_tier_v4")
- tier, probs = router.route("Write a Python function to sort a list")
- # tier: "local", "sonnet", or "opus"
+ router = ThreeTierRouter("models/three_tier_v5")
+ result = router.route("Write a Python function to sort a list")
+ # result.decision: "local", "sonnet", or "opus"
+ # result.p_cloud: probability of cloud routing
+ # result.p_opus: probability of opus (if routed to cloud)
  ```
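
The two decision rules the updated README describes (the two-stage threshold cascade and the centered complexity premium) can be sketched in a few lines. This is an illustration, not the actual `router` package code: the function names and pure-Python shape are assumptions; only the threshold values (0.60, 0.474) and premium settings (2.0, centered at 0.3) come from the README.

```python
# Illustrative sketch of the v5 routing rules. The real routers are trained
# MLPs over 77-dim features; here p_cloud and p_opus are taken as given.

THRESHOLD_A = 0.60   # manually tuned: p(cloud) >= 0.60 leaves the local model
THRESHOLD_B = 0.474  # calibrated on validation: p(opus) >= 0.474 picks Opus

def route(p_cloud: float, p_opus: float) -> str:
    """Cascade: Router A picks local vs cloud; Router B runs only for cloud."""
    if p_cloud < THRESHOLD_A:
        return "local"
    return "opus" if p_opus >= THRESHOLD_B else "sonnet"

def complexity_margin(complexity_score: float,
                      premium: float = 2.0, center: float = 0.3) -> float:
    """Centered complexity premium applied to training margins: positive for
    complex prompts (pushes toward cloud), negative for simple ones."""
    return premium * (complexity_score - center)
```

Note that Router B never fires once Router A votes below 0.60, so `route(0.45, 0.9)` stays local even with a high Opus probability.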