Safetensors
qwen3
Chaox72 committed on
Commit
e5e8b76
·
verified ·
1 Parent(s): 8c6574b

Update README.md

Files changed (1)
  1. README.md +53 -0
README.md CHANGED
@@ -1,3 +1,56 @@
1
  ---
2
  license: mit
3
  ---
4
+
5
+ # STT-Agent-SFT
6
+
7
+ This repository contains the **STT-Agent-RL** model, obtained through online RL training starting from **STT-Agent-SFT**.
8
+
9
+ ## 📊 Performance on STT-Arena
10
+
11
+ Below is the overall Pass@1 performance of STT-Agent compared to other frontier models:
12
+
13
+
14
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/66fa30dee6210a5175235a3c/jEVVEMz_uIFeGpNirY2vh.png)
15
+
16
+ ![STT-Arena Results](images/stt_arena_results.png)
17
+
18
+
19
+ ### Ablation: Effect of Iterative Trajectory Refinement
20
+
21
+ | Model | Easy | Medium | Hard | Impossible | Overall | Avg. Calls |
22
+ |-------|------|--------|------|------------|---------|-------------|
23
+ | Qwen-3-4B (baseline) | 18.31 | 9.46 | 2.82 | 10.00 | 10.57 | 7.63 |
24
+ | STT-Agent (w/o refine) | 28.17 | 16.92 | 11.86 | 47.01 | 23.10 | 32.70 |
25
+ | **{model_name} (with refine)** | **26.76** | **17.41** | **13.56** | **61.11** | **25.11** | **15.30** |
26
+
27
+ Trajectory refinement raises the Overall score from 23.10 to 25.11 and cuts average API calls from 32.70 to 15.30, although the Easy score dips slightly (28.17 to 26.76).
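
The accuracy and efficiency figures in the ablation table can be compared directly. The following is a minimal sketch that computes the absolute gain in the Overall column and the relative change in average calls. The numbers are copied from the two STT-Agent rows of the table; the dictionary keys and the `relative_change` helper are illustrative names, not part of any released code.

```python
# Values copied from the ablation table (w/o refine vs. with refine).
# The key names below are illustrative assumptions.
without_refine = {"overall": 23.10, "avg_calls": 32.70}
with_refine = {"overall": 25.11, "avg_calls": 15.30}


def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Absolute difference in the Overall column, in percentage points.
overall_gain = with_refine["overall"] - without_refine["overall"]

# Relative change in average API calls (negative means fewer calls).
calls_change = relative_change(without_refine["avg_calls"], with_refine["avg_calls"])

print(f"Overall: {overall_gain:+.2f} points")
print(f"Avg. calls: {calls_change:+.1f}%")
```

Under these numbers, refinement adds about two points to the Overall score while roughly halving the number of calls per task.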
28
+
29
+ ## 🚀 Usage
30
+
31
+ ```python
32
+ from transformers import AutoModelForCausalLM, AutoTokenizer
33
+
34
+ model_name = "{model_name}"  # placeholder: replace with this model's Hub repository ID
35
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
36
+ model = AutoModelForCausalLM.from_pretrained(model_name)
37
+
38
+ # Example tool-use prompt
39
+ prompt = "User: Book the cheapest flight from PVG to CDG.\n"
40
+ inputs = tokenizer(prompt, return_tensors="pt")
41
+ outputs = model.generate(**inputs, max_new_tokens=256)  # cap the response length
42
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
43
+ ```
44
+
45
+ ## 🧪 Training Details
46
+
47
+ - Base model: Qwen-3-4B-Base
48
+ - SFT: 2,212 refined trajectories
49
+ - RL strategy: REINFORCE++
50
+ - Compute: 4× NVIDIA H200 GPUs
51
+
52
+ ## 📄 Citation
53
+
54
+ ```bibtex
55
+ xxx
56
+ ```