narcolepticchicken/speculative-tool-actions

narcolepticchicken committed about 12 hours ago

Commit 853c597 (verified) · 1 Parent(s): b5d1ae7

Upload README.md

README.md +45 -123

@@ -1,141 +1,63 @@
----
-tags:
-- ml-intern
----
 # Speculative Tool Actions
 Investigating whether speculative decoding can be adapted from token prediction to agent action prediction.
-## Overview
-This project tests a speculative execution pipeline where a **cheap proposer model** (Qwen3-1.7B) suggests the next action, and a **verifier** (strong model or trained judge) decides to accept, reject, or repair the proposal.
-## Action Space
-- `tool_call` - Execute external tool
-- `retrieval` - Retrieve information
-- `file_read` - Read from file system
-- `file_write` - Write to file system
-- `repair` - Attempt self-repair
-- `verifier` - Invoke verification
-- `ask_clarification` - Request user clarification
-- `final_answer` - Provide final response
-- `BLOCKED` - Block unsafe action (critical for safety)
-## Architecture
-```
-User Task
-    │
-    ▼
-┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
-│  Cheap Model    │────▶│   Verifier      │────▶│  Strong Model   │
-│ (Qwen3-1.7B)    │     │ (Strong or      │     │ (Qwen2.5-7B)    │
-│ Proposes action │     │  Trained 4B)    │     │ Fallback/Repair │
-└─────────────────┘     └─────────────────┘     └─────────────────┘
-```
-## Evaluation Configurations
-| Config | Name | Description |
-|--------|------|-------------|
-| A | Always Strong | Baseline: always use strong model |
-| B | Cheap Only | Always use cheap model |
-| C | Cheap + Strong Verifier | Propose cheap, verify with strong |
-| D | Cheap + Trained Judge | Propose cheap, verify with trained verifier |
-| E | Multi-Proposal Reranking | Generate multiple proposals, strong reranks |
-## Datasets
-| Dataset | Size | Purpose |
-|---------|------|---------|
-| `speculative-actions-proposer-sft` | 5K train, 17K test | Proposer training (SFT) |
-| `speculative-actions-verifier-pref` | 5K train, 17K test | Verifier training (Reward) |
-| `speculative-actions-eval` | 500 examples | Evaluation benchmark |
-## Models
-| Model | Size | Role |
-|-------|------|------|
-| `Qwen/Qwen3-1.7B` | 1.7B | Proposer (cheap) |
-| `Qwen/Qwen3-4B` | 4B | Trained verifier |
-| `Qwen/Qwen2.5-7B` | 7B | Strong model (baseline) |
-## Results (Expected)
-| Config | Accuracy | Avg Cost | Safety |
-|--------|----------|----------|--------|
-| A | 0.85 | 1.00 | 0.82 |
-| B | 0.62 | 0.20 | 0.65 |
-| C | 0.78 | 0.55 | 0.88 |
-| D | 0.75 | 0.42 | 0.85 |
-| E | 0.81 | 0.75 | 0.80 |
-**Best trade-off**: Config D (Cheap + Trained Judge) - 75% accuracy at 42% of the cost.
-## Files
-- `synthetic_data_and_train.py` - Full pipeline (data gen + train)
-- `generate_data_only.py` - Standalone dataset generator
-- `train_proposer.py` - Proposer SFT training
-- `train_verifier.py` - Verifier RewardModel training
-- `eval_base_models.py` - Evaluation with base models
-- `eval_runner.py` - Evaluation with trained models
-- `eval_runner_simple.py` - Simulated evaluation
-- `ABLACTION_REPORT.md` - Detailed ablation report
-## Usage
-### Generate Data
 ```bash
-python generate_data_only.py
 ```
-### Train Proposer
-```bash
-python train_proposer.py
-```
-### Train Verifier
-```bash
-python train_verifier.py
 ```
-### Run Evaluation
 ```bash
-python eval_base_models.py
 ```
-## Cost Model
-Costs are normalized relative to the strong model (1.0):
-- Strong model (7B): 1.0 per inference
-- Cheap model (1.7B): 0.2 per inference
-- Verifier (4B): 0.3 per inference
-- Trained judge (4B LoRA): 0.15 per inference
-## Key Findings
-1. Speculative action prediction achieves 88% of strong model accuracy at 42% cost
-2. Verifier is crucial for safety - improves from 0.65 to 0.88
-3. Trained judge nearly matches strong verifier at lower cost
-4. Multi-proposal reranking is expensive and dominated by other configs
-5. `final_answer` is easiest (95% accuracy); `repair` is hardest (55% cheap, 72% with verifier)
-## Citation
-Based on:
-- *DualSpec*: Draft-Target Verification
-- *Tool-Star*: Small Model for Draft Generation
-- *TinyV*: Tiny Verifier for Efficient Verification
----
-*Generated by ML Intern*
-<!-- ml-intern-provenance -->
-## Generated by ML Intern
-This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
-- Try ML Intern: https://smolagents-ml-intern.hf.space
-- Source code: https://github.com/huggingface/ml-intern

 # Speculative Tool Actions
 Investigating whether speculative decoding can be adapted from token prediction to agent action prediction.
+**Current state:** v2 evaluation is complete (see [ABLATION_REPORT_v2.md](./ABLATION_REPORT_v2.md)). The v3 datasets are built and the 1.7B proposer is trained. **Remaining work:** train the 4B verifier and the 8B proposer, then run the evaluation.
+## Quick Start: Complete the Project
+### One-command training (A100-large, ~2h):
 ```bash
+python train_all_v3.py
 ```
+Or via HF Jobs:
+```python
+hf_jobs(operation="run", script="https://hf.co/narcolepticchicken/speculative-tool-actions/resolve/main/train_all_v3.py",
+        dependencies=["transformers>=4.51","trl","torch","datasets","accelerate","peft","huggingface_hub"],
+        hardware_flavor="a100-large", timeout="12h")
 ```
+### Then evaluate:
 ```bash
+python eval_runner_v3.py
 ```
+## Architecture
+A cheap model (Qwen3-1.7B LoRA) proposes the next agent action, and a verifier (Qwen3-4B LoRA) accepts or rejects it. On rejection, the pipeline falls back to the expensive 8B model.
+**Action space:** `tool_call`, `retrieval`, `file_read`, `file_write`, `repair`, `verifier`, `ask_clarification`, `final_answer`, `BLOCKED`
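
The propose / verify / fall-back loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: `Step`, `speculative_step`, and the three callables are hypothetical names, and the stub lambdas stand in for the 1.7B proposer, the 4B verifier, and the 8B fallback model.

```python
from dataclasses import dataclass
from typing import Callable

# Action space as listed in the README.
ACTIONS = (
    "tool_call", "retrieval", "file_read", "file_write", "repair",
    "verifier", "ask_clarification", "final_answer", "BLOCKED",
)


@dataclass(frozen=True)
class Step:
    action: str  # the action that will actually be executed
    source: str  # "proposer" if the cheap proposal was accepted, else "fallback"


def speculative_step(
    state: str,
    propose: Callable[[str], str],       # hypothetical: cheap proposer
    verify: Callable[[str, str], bool],  # hypothetical: verifier, True = accept
    fallback: Callable[[str], str],      # hypothetical: expensive model
) -> Step:
    """Run one speculative action step: propose cheaply, verify, else fall back."""
    proposal = propose(state)
    # Reject anything outside the action space before asking the verifier.
    if proposal in ACTIONS and verify(state, proposal):
        return Step(proposal, "proposer")
    return Step(fallback(state), "fallback")


# Stub models for illustration only (assumptions, not real model calls).
cheap = lambda s: "file_read" if "read" in s else "tool_call"
judge = lambda s, a: a == "file_read"
strong = lambda s: "retrieval"

accepted = speculative_step("read config.yaml", cheap, judge, strong)
rejected = speculative_step("search the docs", cheap, judge, strong)
print(accepted)
print(rejected)
```

Keeping the three models behind plain callables means the same control flow can be exercised with stubs, as here, before any trained checkpoints are plugged in.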
+## Files
+| File | Purpose |
+|------|---------|
+| `train_all_v3.py` | Consolidated: trains 1.7B+4B+8B sequentially |
+| `train_sft_v3.py` | Individual proposer training |
+| `train_verifier_v3.py` | Individual verifier training |
+| `eval_runner_v3.py` | All-5-configs evaluation |
+| `PROJECT_REPORT.md` | Full project documentation + v2 results |
+| `ABLATION_REPORT_v2.md` | v2 analysis (51% cheap vs 40% frozen 8B) |
+| `eval_results_v2.json` | v2 raw results |
+## v2 Results
+| Config | Acc | Cost |
+|--------|-----|------|
+| A: 8B frozen | 40% | 1.00 |
+| B: 1.7B cheap | **51%** | **0.15** |
+| D: cheap + 4B RM | 51% | 0.25 |
+| E: multi-proposal | 42% | 0.75 |
+See [ABLATION_REPORT_v2.md](./ABLATION_REPORT_v2.md) for analysis.
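
One way to read the v2 table is as an accuracy/cost trade-off. The sketch below copies the four rows above and checks which configurations are Pareto-dominated, meaning another configuration is at least as accurate and at least as cheap, and strictly better on one of the two. The helper names `dominates` and `pareto_front` are hypothetical, and the linked report may use a different criterion for its analysis.

```python
# (accuracy, normalized cost) per configuration, copied from the v2 table.
V2_RESULTS = {
    "A: 8B frozen": (0.40, 1.00),
    "B: 1.7B cheap": (0.51, 0.15),
    "D: cheap + 4B RM": (0.51, 0.25),
    "E: multi-proposal": (0.42, 0.75),
}


def dominates(p: tuple[float, float], q: tuple[float, float]) -> bool:
    """True if p is no worse than q on both axes and strictly better on one."""
    acc_p, cost_p = p
    acc_q, cost_q = q
    no_worse = acc_p >= acc_q and cost_p <= cost_q
    strictly_better = acc_p > acc_q or cost_p < cost_q
    return no_worse and strictly_better


def pareto_front(results: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names of configurations that no other configuration dominates."""
    return [
        name
        for name, point in results.items()
        if not any(
            dominates(other, point)
            for other_name, other in results.items()
            if other_name != name
        )
    ]


front = pareto_front(V2_RESULTS)
print(front)
```

With the v2 numbers, only config B remains on the front: it matches or beats every other row on accuracy while costing less.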
+## v3 Status
+| Component | Status |
+|-----------|--------|
+| Datasets (SFT, verifier, eval) | ✓ Built |
+| 1.7B proposer | ✓ Trained |
+| 4B verifier | ✗ Needs training |
+| 8B proposer | ✗ Needs training |
+| Eval runner | ✓ Ready |