Update README.md
README.md
CHANGED
@@ -4,10 +4,14 @@ base_model:
 - Qwen/Qwen2.5-VL-7B-Instruct
 pipeline_tag: image-text-to-text
 ---
+<p align="center">
+<img src="https://github.com/HansenHua/EAPO-ICML26/raw/main/introduction.png" width="90%"></img>
+</p>
+
 EAPO (Exploration-Aware Policy Optimization) is a reinforcement learning framework for training agentic large language models to perform adaptive exploration during test-time interaction. Unlike prior methods that apply exploration uniformly across all states, EAPO enables agents to selectively explore only when environmental uncertainty is high, improving long-horizon reasoning and decision making in interactive environments such as GUI control, web navigation, and embodied tasks.
 
 <p align="center">
-<img src="https://github.com/HansenHua/EAPO-ICML26/
+<img src="https://github.com/HansenHua/EAPO-ICML26/raw/main/performance.jpg" width="50%"></img>
 </p>
 
 Paper: Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization (https://arxiv.org/abs/2605.08978)