---
license: mit
base_model: Qwen/Qwen3-4B-Instruct-2507
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- hydrology
- agent
- tool-use
- grpo
- reinforcement-learning
- qwen3
- ef5
- crest
- function-calling
datasets:
- anonymousOwl/HydroAgent-dataset
---
# HydroAgent: Qwen3-4B-Instruct fine-tuned for hydrologic model calibration
**HydroAgent** is a tool-using language model that calibrates the
[EF5/CREST](https://github.com/HyDROSLab/EF5) distributed hydrologic model.
Given a USGS streamflow gage and a precipitation-driven simulation, the agent
iteratively proposes physically plausible parameter sets, runs the simulator,
inspects the resulting NSE / peak / volume metrics, and revises until the
model fits the observations.
This release is the **GRPO step-100 checkpoint** of the SFT + RL pipeline
described in [chrimerss/HydroLLM](https://github.com/chrimerss/HydroLLM).
- **Base model:** [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
- **Training:** full fine-tuning, BF16, FSDP, no LoRA
- **RL framework:** [verl 0.5](https://github.com/volcengine/verl) GRPO with [SGLang](https://github.com/sgl-project/sglang) rollouts
- **Tool format:** Hermes-style `<tool_call>` JSON (Qwen3-Instruct native)
- **Hardware:** 4× H100, ~30 min/step, K=6 rollouts × max 50 multi-turn calls
## How the agent works
The model has access to three tools and runs a multi-turn calibration loop:
| Tool | Purpose |
|---|---|
| `set_parameters` | Set 11 tunable CREST multipliers: `wm`, `b`, `im`, `ke`, `fc`, `under`, `leaki`, `alpha`, `beta`, `alpha0`, `iwu` |
| `run_simulation` | Execute EF5 with the current parameters and produce a hydrograph |
| `evaluate` | Score the latest run vs. observations: NSE, CC, KGE, peak ratio, lag |
Each rollout typically follows: `set_parameters → run_simulation → evaluate → set_parameters → …`
until NSE plateaus or the agent runs out of turns. Inputs to the agent are a
short system prompt describing the calibration task and a per-gage user
message with watershed metadata (basin area, lat/lon, time window).
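The loop above can be sketched in Python. This is a hedged sketch, not the agent's actual control code: `run_ef5`, `score_run`, and `propose` are hypothetical stand-ins for the EF5 sandbox and the model's `set_parameters` proposals, replaced here by toy implementations.

```python
# Toy stand-ins for the EF5 sandbox (hypothetical; the real tools run the
# simulator and score the hydrograph against USGS observations).
def run_ef5(params):                      # run_simulation
    return {"peak": 100 * params["wm"]}   # fake hydrograph summary

def score_run(hydrograph):                # evaluate
    # Toy score: NSE is best when the fake peak matches 100.
    return {"nse": 1.0 - abs(hydrograph["peak"] - 100) / 100}

def calibration_loop(propose, max_turns=50, tol=1e-3):
    """Multi-turn loop: set_parameters -> run_simulation -> evaluate -> repeat
    until NSE stops improving or the turn budget runs out."""
    best_nse, best_params, params = float("-inf"), None, None
    for _ in range(max_turns):
        params = propose(params, best_nse)    # model's set_parameters call
        metrics = score_run(run_ef5(params))  # run, then score the run
        if metrics["nse"] <= best_nse + tol:  # NSE plateaued
            break
        best_nse, best_params = metrics["nse"], params
    return best_params, best_nse
```

The real agent makes these decisions in natural language via `<tool_call>` blocks; `propose` here stands in for that policy.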
## Training data
Training calibrates the agent on **10 CONUS USGS gages** (basin areas
539–2401 km²), each driven by **MRMS 1 km hourly precipitation** and
**hourly USGS streamflow observations** from 60-day windows selected to
contain a clear flood event (rising + receding limbs, edge-buffered).
| Gage ID | Basin (km²) | Lat | Lon | Window (UTC) |
|---|---:|---:|---:|---|
| 11383500 | 539 | 40.0140 | -121.9483 | 2018-05-19 – 2018-07-17 |
| 11043000 | 575 | 33.4798 | -117.1439 | 2019-03-15 – 2019-05-13 |
| 11152000 | 632 | 36.2805 | -121.3227 | 2018-05-29 – 2018-07-27 |
| 02294781 | 1064 | 27.8245 | -81.8017 | 2018-04-29 – 2018-06-27 |
| 02312000 | 1476 | 28.4800 | -82.1776 | 2018-11-15 – 2019-01-13 |
| 07195430 | 1489 | 36.1086 | -94.5333 | 2018-01-04 – 2018-03-04 |
| 11179000 | 1639 | 37.5871 | -121.9608 | 2018-06-03 – 2018-08-01 |
| 14301000 | 1727 | 45.7040 | -123.7554 | 2018-09-11 – 2018-11-09 |
| 14207500 | 1828 | 45.3507 | -122.6762 | 2018-04-09 – 2018-06-07 |
| 11376000 | 2401 | 40.3871 | -122.2386 | 2018-09-21 – 2018-11-19 |
**Held-out evaluation gages** (never seen during training):
| Gage ID | Basin (km²) | Lat | Lon | Window (UTC) |
|---|---:|---:|---:|---|
| 02338660 | 329 | 33.2357 | -84.9876 | 2018-07-01 – 2018-08-31 |
| 01403060 | 2033 | 40.5511 | -74.5483 | 2018-11-11 – 2019-01-09 |
| 06279500 | 40792 | 44.7585 | -108.1816 | 2018-06-13 – 2018-08-11 |
| 07144100 | 3209 | 37.8831 | -97.4245 | 2019-03-30 – 2019-05-28 |
The full training dataset (CONUS terrain rasters, per-gage MRMS hourly
precipitation clips, USGS hourly streamflow observations, daily PET, the
EF5 control template, and the 73 GPT-4o calibration trajectories that seed
the SFT phase) is published as
[**anonymousOwl/HydroAgent-dataset**](https://huggingface.co/datasets/anonymousOwl/HydroAgent-dataset).
See that repo's README for the per-folder layout and provenance.
## Reward
Two reward layers shape the policy:
**Per-turn (returned by tools):**
| Tool call | Reward |
|---|---|
| `set_parameters` (valid) | `+0.02` |
| `run_simulation` (valid) | `+0.05` |
| `evaluate` (valid) | `ΔNSE` (this turn − previous best) |
| Any tool (invalid) | `−0.5` |
**Terminal (returned at end of trajectory):**
| Component | Value |
|---|---|
| Best NSE (clipped) | `[−1, 1]` |
| Target-met bonus | `+0.5` if best NSE > gage target |
| Iteration bonus | `+0.02 × n_evaluates` |
| Improvement bonus | `+0.10 × max(0, n_improvements − 1)` |
| Empty-trajectory penalty | `−1.0` |
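Combined, the two layers can be sketched in plain Python. This is a reconstruction from the tables above, not the training code; in particular, how the first `evaluate` (with no previous best) and the empty-trajectory check are handled is an assumption.

```python
def per_turn_reward(tool, valid, nse=None, prev_best=None):
    """Per-turn shaping reward, following the tool-call table above."""
    if not valid:
        return -0.5                        # any invalid tool call
    if tool == "set_parameters":
        return 0.02
    if tool == "run_simulation":
        return 0.05
    if tool == "evaluate":
        # Delta-NSE vs the previous best; the first evaluate scoring the
        # raw NSE is an assumption (no previous best exists yet).
        return nse if prev_best is None else nse - prev_best
    return 0.0

def terminal_reward(best_nse, target, n_evaluates, n_improvements):
    """Terminal reward combining the components in the table above."""
    if n_evaluates == 0:                    # empty-trajectory penalty
        return -1.0
    r = max(-1.0, min(1.0, best_nse))       # best NSE, clipped to [-1, 1]
    if best_nse > target:
        r += 0.5                            # target-met bonus
    r += 0.02 * n_evaluates                 # iteration bonus
    r += 0.10 * max(0, n_improvements - 1)  # improvement bonus
    return r
```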
## GRPO settings
| Setting | Value |
|---|---|
| Algorithm | GRPO (group-relative advantages) |
| K (rollouts per prompt) | 6 |
| Train batch size | 4 prompts (24 trajectories per step) |
| Max assistant turns | 50 |
| Learning rate | 1e-6 with 5% warmup |
| Entropy coefficient | 0.01 |
| KL loss coefficient | 0.05 (anchored to base policy) |
| Sampling | `temperature=1.0`, `top_p=0.95` |
| Steps in this checkpoint | **100** |
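For orientation, the settings above roughly map onto a verl config fragment like the following. The key paths are an assumption based on verl's published GRPO examples; the authoritative config lives in the HydroLLM repository.

```yaml
# Illustrative only: key paths follow verl's GRPO recipes and may differ
# from the actual HydroLLM config.
algorithm:
  adv_estimator: grpo        # group-relative advantages
data:
  train_batch_size: 4        # prompts per step (x6 rollouts = 24 trajectories)
actor_rollout_ref:
  actor:
    optim:
      lr: 1.0e-6             # with 5% warmup
    entropy_coeff: 0.01
    use_kl_loss: true        # KL anchored to the base policy
    kl_loss_coef: 0.05
  rollout:
    n: 6                     # K rollouts per prompt
    temperature: 1.0
    top_p: 0.95
```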
## Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
repo = "anonymousOwl/HydroAgent"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="bfloat16", device_map="auto")
```
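To drive the agent, the tool schemas are passed through the chat template. Below is a sketch of what a `HYDRO_TOOLS` list might look like; the field layout follows the OpenAI/Hermes function-calling convention, and the names and descriptions are illustrative (the authoritative schemas live in the HydroLLM repo).

```python
# Hypothetical tool schemas; parameter names follow the tool table above.
CREST_MULTIPLIERS = ["wm", "b", "im", "ke", "fc", "under",
                     "leaki", "alpha", "beta", "alpha0", "iwu"]

HYDRO_TOOLS = [
    {"type": "function", "function": {
        "name": "set_parameters",
        "description": "Set the 11 tunable CREST multipliers.",
        "parameters": {"type": "object",
                       "properties": {k: {"type": "number"}
                                      for k in CREST_MULTIPLIERS}}}},
    {"type": "function", "function": {
        "name": "run_simulation",
        "description": "Execute EF5 with the current parameters.",
        "parameters": {"type": "object", "properties": {}}}},
    {"type": "function", "function": {
        "name": "evaluate",
        "description": "Score the latest run against observations.",
        "parameters": {"type": "object", "properties": {}}}},
]

messages = [
    {"role": "system", "content": "You calibrate EF5/CREST for a USGS gage."},
    {"role": "user", "content": "Gage 11383500, basin 539 km2, "
                                "window 2018-05-19 to 2018-07-17."},
]
# prompt = tokenizer.apply_chat_template(
#     messages, tools=HYDRO_TOOLS, add_generation_prompt=True, tokenize=False)
```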
The model emits Hermes-style tool calls, e.g.:
```
<tool_call>
{"name": "set_parameters", "arguments": {"wm": 1.0, "b": 1.0, "im": 0.5, ...}}
</tool_call>
```
Parse with `tokenizer.apply_chat_template(..., tools=HYDRO_TOOLS)` and
dispatch each call to your EF5 sandbox. See
[`modal_app/eval.py`](https://github.com/chrimerss/HydroLLM/blob/main/modal_app/eval.py)
for a reference SGLang loop with retry-on-parse-failure logic.
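A minimal extraction step for these blocks might look like the following sketch; the reference loop in `modal_app/eval.py` adds retry-on-parse-failure on top of this.

```python
import json
import re

# Hermes-style tool calls are fenced in <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Extract (name, arguments) pairs from generated text.

    Malformed JSON blocks are skipped, leaving them for a retry path.
    """
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(raw)
            calls.append((obj["name"], obj.get("arguments", {})))
        except (json.JSONDecodeError, KeyError):
            continue  # leave malformed calls for the retry path
    return calls
```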
For full reproduction (image, EF5 binary, multi-turn rollout, reward
computation), use the
[HydroLLM repository](https://github.com/chrimerss/HydroLLM).
## Limitations
- Trained on **10 small/medium CONUS basins** (≤ 2401 km²) over short flood
windows. Generalization to large basins (> 3000 km²), arid catchments, or
out-of-CONUS regions is unverified.
- Calibrates **CREST parameter multipliers only**; it does not modify routing,
initial conditions, or sub-basin structure.
- The agent depends on a working EF5 toolchain; the weights alone do not
perform calibration without the simulation environment in the loop.
- This is a research checkpoint, not a production tool. NSE on held-out
gages varies substantially with basin and event.
## License
MIT, the same as the upstream [HydroLLM repository](https://github.com/chrimerss/HydroLLM)
and the base [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).
## Citation
```bibtex
@software{hydrollm2026,
title = {HydroLLM: Reinforcement Learning Fine-Tuning of LLMs with Hydrologic Simulation Feedback},
year = {2026},
url = {https://github.com/chrimerss/HydroLLM}
}
```
## Acknowledgement
Compute for this research was sponsored by [Modal](https://modal.com).