# Phase 3 Implementation Notes

Phase 3 goal: train a Recurrent PPO agent (LSTM policy) that captures temporal dependencies such as SLA trends and escalation history.

## Implemented Components

- `rl/train_recurrent.py`
  - RecurrentPPO training with `MlpLstmPolicy`
  - LSTM hidden size configurable (default 128)
  - curriculum sampling retained (easy -> medium -> hard)
  - optional transfer of compatible policy tensors from best Phase 2 checkpoint
- `rl/configs/recurrent.yaml`
  - declarative recurrent training and curriculum settings
- `rl/evaluate.py`
  - model loading modes: `auto`, `maskable`, `recurrent`
  - recurrent inference path with LSTM state handling + action-mask sanitization
  - helper `compare_recurrent_vs_flat(...)`
- `rl/callbacks.py`
  - `RecurrentEvalCallback` for periodic grader-based checkpointing in Phase 3
  - recurrent best checkpoints saved as `best_grader_recurrent_<task>.zip` (no collision with Phase 2 files)
- `rl/gym_wrapper.py`
  - optional `hard_action_mask` mode (default off) for safe action execution
- `tests/test_rl_evaluate.py`
  - recurrent hidden-state persistence
  - LSTM reset behavior on episode boundary
  - recurrent >= flat comparison utility check
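
The recurrent inference path and the two LSTM tests above come down to three behaviors: carry the hidden state from one step to the next, reset it at an episode boundary, and replace any action the mask forbids. The sketch below illustrates those behaviors in plain Python and is not the code in `rl/evaluate.py`. `RecurrentRunner`, `sanitize_action`, and `toy_predict` are hypothetical names, and the fallback rule (pick the first allowed action) is an assumption. With sb3-contrib's `RecurrentPPO`, the equivalent call would be `model.predict(obs, state=..., episode_start=..., deterministic=...)`, which returns `(action, state)`.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

# Hypothetical signature: (obs, state, episode_start) -> (action, new_state).
PredictFn = Callable[[Any, Optional[Any], bool], Tuple[int, Any]]


def sanitize_action(action: int, mask: Sequence[bool]) -> int:
    """Return `action` if the mask allows it, otherwise the first allowed action."""
    if 0 <= action < len(mask) and mask[action]:
        return action
    for candidate, allowed in enumerate(mask):
        if allowed:
            return candidate
    raise ValueError("action mask allows no action")


class RecurrentRunner:
    """Threads recurrent state through predict calls within one episode."""

    def __init__(self, predict: PredictFn) -> None:
        self.predict = predict
        self.state: Optional[Any] = None
        self.episode_start = True

    def act(self, obs: Any, mask: Sequence[bool]) -> int:
        # Pass the previous state in and keep the new one for the next step.
        action, self.state = self.predict(obs, self.state, self.episode_start)
        self.episode_start = False
        return sanitize_action(int(action), mask)

    def end_episode(self) -> None:
        # Drop the hidden state so the next episode starts from scratch.
        self.state = None
        self.episode_start = True


def toy_predict(obs: Any, state: Optional[int], episode_start: bool) -> Tuple[int, int]:
    """Stand-in policy: the 'hidden state' is a step counter, action = counter % 3."""
    steps = 0 if (episode_start or state is None) else state
    steps += 1
    return steps % 3, steps


if __name__ == "__main__":
    runner = RecurrentRunner(toy_predict)
    print(runner.act(obs=None, mask=[True, True, True]))   # state persists after this call
    print(runner.act(obs=None, mask=[True, True, False]))  # forbidden action is replaced
    runner.end_episode()
    print(runner.state)  # None after the reset
```

Keeping the state inside a small runner object means the evaluation loop only has to call `end_episode()` when the environment reports termination or truncation, which is the boundary the reset test exercises.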

## Commands (using the existing `.venv313`)

- Train Phase 3:
  - `.\.venv313\Scripts\python.exe -m rl.train_recurrent --timesteps 600000 --n-envs 4 --seed 42 --config rl/configs/recurrent.yaml`
- Train Phase 3-v2 (recommended tuning run):
  - `.\.venv313\Scripts\python.exe -m rl.train_recurrent --timesteps 700000 --n-envs 4 --seed 42 --config rl/configs/recurrent_v2.yaml`
- Evaluate Phase 3 model:
  - `.\.venv313\Scripts\python.exe -m rl.evaluate --model results/best_model/phase3_final.zip --episodes 3 --model-type recurrent`
- Evaluate best recurrent checkpoint (saved during Phase 3 eval):
  - `.\.venv313\Scripts\python.exe -m rl.evaluate --model results/best_model/best_grader_recurrent_mixed_urgency_medium.zip --episodes 3 --model-type recurrent`
- Compare recurrent vs flat on medium task:
  - `.\.venv313\Scripts\python.exe -c "from rl.evaluate import compare_recurrent_vs_flat; print(compare_recurrent_vs_flat('results/best_model/phase2_final.zip','results/best_model/phase3_final.zip'))"`