---
license: mit
base_model: Qwen/Qwen3-4B-Instruct-2507
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- hydrology
- agent
- tool-use
- grpo
- reinforcement-learning
- qwen3
- ef5
- crest
- function-calling
datasets:
- anonymousOwl/HydroAgent-dataset
---

# HydroAgent: Qwen3-4B-Instruct fine-tuned for hydrologic model calibration

**HydroAgent** is a tool-using language model that calibrates the
[EF5/CREST](https://github.com/HyDROSLab/EF5) distributed hydrologic model.
Given a USGS streamflow gage and a precipitation-driven simulation, the agent
iteratively proposes physically plausible parameter sets, runs the simulator,
inspects the resulting NSE / peak / volume metrics, and revises until the
model fits the observations.

This release is the **GRPO step-100 checkpoint** of the SFT + RL pipeline
described in [chrimerss/HydroLLM](https://github.com/chrimerss/HydroLLM).

- **Base model:** [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
- **Training:** full fine-tuning, BF16, FSDP, no LoRA
- **RL framework:** [verl 0.5](https://github.com/volcengine/verl) GRPO with [SGLang](https://github.com/sgl-project/sglang) rollouts
- **Tool format:** Hermes-style `<tool_call>` JSON (Qwen3-Instruct native)
- **Hardware:** 4× H100, ~30 min/step, K=6 rollouts × max 50 multi-turn calls

## How the agent works

The model has access to three tools and runs a multi-turn calibration loop:

| Tool | Purpose |
|---|---|
| `set_parameters` | Set 11 tunable CREST multipliers: `wm`, `b`, `im`, `ke`, `fc`, `under`, `leaki`, `alpha`, `beta`, `alpha0`, `iwu` |
| `run_simulation` | Execute EF5 with the current parameters and produce a hydrograph |
| `evaluate` | Score the latest run vs. observations: NSE, CC, KGE, peak ratio, lag |
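
For reference, NSE (Nash-Sutcliffe efficiency) is the headline metric reported by `evaluate`. The sketch below uses the standard textbook definition; the function name `nse` is illustrative, and the actual tool may differ in details such as missing-value handling.

```python
from statistics import fmean
from typing import Sequence


def nse(sim: Sequence[float], obs: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations.

    1.0 is a perfect fit; 0.0 means the simulation is no better than
    predicting the mean of the observations; negative values are worse.
    """
    if len(sim) != len(obs) or len(obs) == 0:
        raise ValueError("sim and obs must be non-empty and equally long")
    mean_obs = fmean(obs)
    sse = sum((o - s) ** 2 for s, o in zip(sim, obs))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    if spread == 0.0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - sse / spread
```

Because NSE is unbounded below, a badly mis-scaled hydrograph can score far below zero.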

Each rollout typically follows: `set_parameters → run_simulation → evaluate → set_parameters → …`
until NSE plateaus or the agent runs out of turns. Inputs to the agent are a
short system prompt describing the calibration task and a per-gage user
message with watershed metadata (basin area, lat/lon, time window).
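
The loop above can be sketched in a few lines. This is an illustration only, not the HydroLLM implementation: `propose`, `run_ef5`, and `score` are hypothetical stand-ins for the policy, the `run_simulation` tool, and the `evaluate` tool, and the plateau rule (stop after `patience` evaluations without improvement) and the `max_rounds` default are assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Params = Dict[str, float]
History = List[Tuple[Params, float]]


def calibrate(
    propose: Callable[[History], Params],
    run_ef5: Callable[[Params], List[float]],
    score: Callable[[List[float]], float],
    max_rounds: int = 16,   # assumed: ~3 tool calls per round within 50 turns
    patience: int = 3,      # assumed plateau rule
) -> Tuple[Optional[Params], float]:
    """Repeat set_parameters -> run_simulation -> evaluate until NSE plateaus."""
    history: History = []
    best_params: Optional[Params] = None
    best_nse = float("-inf")
    stale = 0
    for _ in range(max_rounds):
        params = propose(history)      # set_parameters
        hydrograph = run_ef5(params)   # run_simulation
        nse_value = score(hydrograph)  # evaluate
        history.append((params, nse_value))
        if nse_value > best_nse:
            best_params, best_nse, stale = params, nse_value, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_params, best_nse
```

In the real system the "policy" is the language model itself, which sees the full conversation (including earlier metrics) rather than a compact history list.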

## Training data

The agent is trained on **10 CONUS USGS gages** (basin areas
539–2401 km²), each driven by **MRMS 1 km hourly precipitation** and
**hourly USGS streamflow observations** from 60-day windows selected to
contain a clear flood event (rising + receding limbs, edge-buffered).

| Gage ID | Basin (km²) | Lat | Lon | Window (UTC) |
|---|---:|---:|---:|---|
| 11383500 |  539 | 40.0140 | -121.9483 | 2018-05-19 → 2018-07-17 |
| 11043000 |  575 | 33.4798 | -117.1439 | 2019-03-15 → 2019-05-13 |
| 11152000 |  632 | 36.2805 | -121.3227 | 2018-05-29 → 2018-07-27 |
| 02294781 | 1064 | 27.8245 |  -81.8017 | 2018-04-29 → 2018-06-27 |
| 02312000 | 1476 | 28.4800 |  -82.1776 | 2018-11-15 → 2019-01-13 |
| 07195430 | 1489 | 36.1086 |  -94.5333 | 2018-01-04 → 2018-03-04 |
| 11179000 | 1639 | 37.5871 | -121.9608 | 2018-06-03 → 2018-08-01 |
| 14301000 | 1727 | 45.7040 | -123.7554 | 2018-09-11 → 2018-11-09 |
| 14207500 | 1828 | 45.3507 | -122.6762 | 2018-04-09 → 2018-06-07 |
| 11376000 | 2401 | 40.3871 | -122.2386 | 2018-09-21 → 2018-11-19 |

**Held-out evaluation gages** (never seen during training):

| Gage ID | Basin (km²) | Lat | Lon | Window (UTC) |
|---|---:|---:|---:|---|
| 02338660 |   329 | 33.2357 |  -84.9876 | 2018-07-01 → 2018-08-31 |
| 01403060 |  2033 | 40.5511 |  -74.5483 | 2018-11-11 → 2019-01-09 |
| 06279500 | 40792 | 44.7585 | -108.1816 | 2018-06-13 → 2018-08-11 |
| 07144100 |  3209 | 37.8831 |  -97.4245 | 2019-03-30 → 2019-05-28 |

The full training dataset (CONUS terrain rasters, per-gage MRMS hourly
precipitation clips, USGS hourly streamflow observations, daily PET, the
EF5 control template, and the 73 GPT-4o calibration trajectories that seed
the SFT phase) is published as
[**anonymousOwl/HydroAgent-dataset**](https://huggingface.co/datasets/anonymousOwl/HydroAgent-dataset).
See that repo's README for the per-folder layout and provenance.

## Reward

Two reward layers shape the policy:

**Per-turn (returned by tools):**

| Tool call | Reward |
|---|---|
| `set_parameters` (valid) | `+0.02` |
| `run_simulation` (valid) | `+0.05` |
| `evaluate` (valid) | `ΔNSE` (this turn − previous best) |
| Any tool (invalid) | `−0.5` |

**Terminal (returned at end of trajectory):**

| Component | Value |
|---|---|
| Best NSE (clipped) | `[−1, 1]` |
| Target-met bonus | `+0.5` if best NSE > gage target |
| Iteration bonus | `+0.02 × n_evaluates` |
| Improvement bonus | `+0.10 × max(0, n_improvements − 1)` |
| Empty-trajectory penalty | `−1.0` |
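
The two tables can be read as the following functions. This is a sketch under stated assumptions, not the repository's reward code: it assumes the terminal components are summed, that an "empty" trajectory is one with no valid `evaluate` call, and that an "improvement" is an evaluation that beats the running best NSE. The function names are illustrative.

```python
from typing import Sequence


def turn_reward(tool: str, valid: bool,
                nse: float = 0.0, prev_best: float = 0.0) -> float:
    """Per-turn reward. The caller supplies prev_best; how the first
    evaluate call is baselined is not specified here."""
    if not valid:
        return -0.5
    if tool == "set_parameters":
        return 0.02
    if tool == "run_simulation":
        return 0.05
    if tool == "evaluate":
        return nse - prev_best  # delta NSE vs. previous best
    raise ValueError(f"unknown tool: {tool}")


def terminal_reward(nse_history: Sequence[float], target_nse: float) -> float:
    """Terminal reward from the NSE of each valid evaluate call, in order."""
    if not nse_history:
        return -1.0  # empty-trajectory penalty (assumed trigger)
    best = max(nse_history)
    reward = max(-1.0, min(1.0, best))   # best NSE clipped to [-1, 1]
    if best > target_nse:
        reward += 0.5                    # target-met bonus
    reward += 0.02 * len(nse_history)    # iteration bonus
    n_improvements, running = 0, float("-inf")
    for value in nse_history:            # assumed definition of improvement
        if value > running:
            n_improvements += 1
            running = value
    reward += 0.10 * max(0, n_improvements - 1)
    return reward
```

Under these assumptions the iteration and improvement bonuses favor trajectories that keep evaluating and keep improving over ones that stop after a single lucky guess.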

## GRPO settings

| Setting | Value |
|---|---|
| Algorithm | GRPO (group-relative advantages) |
| K (rollouts per prompt) | 6 |
| Train batch size | 4 prompts (24 trajectories per step) |
| Max assistant turns | 50 |
| Learning rate | 1e-6 with 5% warmup |
| Entropy coefficient | 0.01 |
| KL loss coefficient | 0.05 (anchored to base policy) |
| Sampling | `temperature=1.0`, `top_p=0.95` |
| Steps in this checkpoint | **100** |
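
"Group-relative" means each trajectory's reward is standardized against the other K rollouts for the same prompt. A minimal sketch of that step follows; it uses the population standard deviation and a small `eps`, both of which are assumptions (frameworks differ on these details, and this is not verl's code).

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardize the rewards of one prompt's K rollouts.

    Rollouts above the group mean get a positive advantage, those below
    get a negative one; a group with identical rewards gets all zeros.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # assumed: population std, not sample std
    return [(r - mean) / (std + eps) for r in rewards]
```

With K = 6 and 4 prompts per batch, this would be applied to four groups of six trajectory rewards at each step.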

## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "anonymousOwl/HydroAgent"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="bfloat16", device_map="auto")
```

The model emits Hermes-style tool calls, e.g.:

```
<tool_call>
{"name": "set_parameters", "arguments": {"wm": 1.0, "b": 1.0, "im": 0.5, ...}}
</tool_call>
```
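
To get calls in this format, the prompt has to advertise the tools. Below is a hypothetical `HYDRO_TOOLS` list in the JSON-schema function format that `apply_chat_template` accepts through its `tools` argument. The descriptions, the `_tool` helper, and the choice to mark all 11 multipliers as required are assumptions; the schemas used in training live in the HydroLLM repository.

```python
from typing import Any, Dict, List, Optional

# The 11 tunable CREST multipliers listed in the tool table above.
PARAM_NAMES = ["wm", "b", "im", "ke", "fc", "under",
               "leaki", "alpha", "beta", "alpha0", "iwu"]


def _tool(name: str, description: str,
          properties: Optional[Dict[str, Any]] = None,
          required: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build one function-style tool schema (hypothetical helper)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties or {},
                "required": required or [],
            },
        },
    }


HYDRO_TOOLS = [
    _tool("set_parameters", "Set the CREST parameter multipliers.",
          {p: {"type": "number"} for p in PARAM_NAMES},
          list(PARAM_NAMES)),  # assumed: every multiplier is required
    _tool("run_simulation", "Run EF5 with the current parameters."),
    _tool("evaluate", "Score the latest run against observations."),
]
```

The list can then be passed as `tokenizer.apply_chat_template(messages, tools=HYDRO_TOOLS, add_generation_prompt=True, tokenize=False)`.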

Pass the tool schemas to `tokenizer.apply_chat_template(..., tools=HYDRO_TOOLS)`
when building the prompt, then parse each `<tool_call>` block from the
generated text and dispatch it to your EF5 sandbox. See
[`modal_app/eval.py`](https://github.com/chrimerss/HydroLLM/blob/main/modal_app/eval.py)
for a reference SGLang loop with retry-on-parse-failure logic.
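
On the output side, extracting the calls from generated text can be as simple as the sketch below. It is not the logic in `modal_app/eval.py`: `extract_tool_calls` is an illustrative name, and silently skipping malformed blocks (so the caller can decide whether to retry) is an assumed design choice.

```python
import json
import re
from typing import Any, Dict, List

# Non-greedy match so several <tool_call> blocks in one reply stay separate.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Return every well-formed tool call found in `text`, in order."""
    calls: List[Dict[str, Any]] = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: leave retry policy to the caller
        if isinstance(call, dict) and isinstance(call.get("name"), str):
            call.setdefault("arguments", {})
            calls.append(call)
    return calls
```

An empty result for a reply that was expected to contain a call is the natural signal to re-sample or to return an "invalid tool call" observation to the model.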

For full reproduction (image, EF5 binary, multi-turn rollout, reward
computation), use the
[HydroLLM repository](https://github.com/chrimerss/HydroLLM).

## Limitations

- Trained on **10 small/medium CONUS basins** (≤ 2401 km²) over short flood
  windows. Generalization to large basins (> 3000 km²), arid catchments, or
  out-of-CONUS regions is unverified.
- Calibrates **CREST parameter multipliers only**; it does not modify routing,
  initial conditions, or sub-basin structure.
- The agent depends on a working EF5 toolchain; the weights alone do not
  perform calibration without the simulation environment in the loop.
- This is a research checkpoint, not a production tool. NSE on held-out
  gages varies substantially with basin and event.

## License

MIT, the same as the upstream [HydroLLM repository](https://github.com/chrimerss/HydroLLM).
The base [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
carries its own license (Apache 2.0 per its model card).

## Citation

```bibtex
@software{hydrollm2026,
  title  = {HydroLLM: Reinforcement Learning Fine-Tuning of LLMs with Hydrologic Simulation Feedback},
  year   = {2026},
  url    = {https://github.com/chrimerss/HydroLLM}
}
```

## Acknowledgement

Compute for this research was sponsored by [Modal](https://modal.com).