# Project Mahoraga — Complete System Report

> **Version**: 1.0 (Post-Merge)
> **Branch**: `main` (fully merged from `phase1-env-setup`)
> **Tests**: 143/143 passing
> **Date**: 2026-04-25

---

## 1. Project Overview

**Project Mahoraga** is a reinforcement learning environment where an AI agent ("Mahoraga") learns adaptive combat through a resistance trade-off system. Named after Jujutsu Kaisen's Mahoraga — a shikigami that adapts to any attack — the system trains an LLM (Qwen 2.5 3B) to make tactical decisions in a turn-based combat loop.

**Core Loop**: `Observe → Adapt → Accumulate → Punish`

The agent must:
1. Observe enemy attack patterns
2. Build resistance to the correct attack category
3. Accumulate adaptation stacks
4. Execute Judgment Strike for burst damage at the right moment

**This is NOT a game.** It is a clean, testable RL environment designed for LLM fine-tuning via reward-weighted SFT.

---

## 2. Architecture Breakdown

```
project_mahoraga/
├── env/
│   ├── mahoraga_env.py      # Main environment orchestrator
│   ├── mechanics.py         # Resistance, damage, action math
│   ├── enemy.py             # CurriculumEnemy (3-phase AI)
│   ├── rewards.py           # 6-component composable reward system
│   ├── state.py             # State dict builder
│   └── gym_wrapper.py       # Gymnasium-compatible wrapper
├── utils/
│   ├── constants.py         # All game constants and mappings
│   └── validators.py        # Action validation
├── tests/
│   ├── test_env.py          # 110 core tests
│   └── test_gym_wrapper.py  # 33 wrapper tests
├── notebooks/
│   ├── mahoraga_training.py     # Training notebook (source)
│   └── mahoraga_training.ipynb  # Training notebook (Kaggle)
├── scripts/
│   └── random_agent_gym.py  # Random agent demo
├── app.py                   # Gradio interactive UI
├── main.py                  # CLI episode runner
└── README.md
```

### Module Details

#### `env/mahoraga_env.py` — Environment Orchestrator
- `MahoragaEnv(debug=False)` — main class
- `reset()` → returns state dict
- `step(action)` → returns `(state, reward, done, info)`
- Coordinates enemy attacks, agent actions, reward computation
- Tracks: HP, resistances, adaptation stack, heal cooldown, last adapted category

#### `env/mechanics.py` — Core Math
- `new_resistances()` — creates `{PHYSICAL: 0, CE: 0, TECHNIQUE: 0}`
- `apply_resistance_change(res, type)` — +40 target, -20 others, clamp [0,80]
- `compute_enemy_damage(category, res, ignore_armor)` — damage formula
- `compute_judgment_damage(last_adapted, enemy_cat)` — adaptation-match burst
- `apply_action_effects(...)` — dispatches action 0-4
- `check_correct_adaptation(action, category)` — validates adaptation

#### `env/enemy.py` — CurriculumEnemy
- Single `CurriculumEnemy` class with 3-phase behavior
- `get_attack(turn_number, resistances)` → `{category, subtype, damage, ignore_armor}`
- Phase selection based on turn number

#### `env/rewards.py` — Composable Rewards
- 6 independent functions + 1 aggregator
- Returns dict, NOT a single scalar
- `compute_rewards(info, state, action, done)` → dict

#### `env/state.py` — State Builder
- Converts internal uppercase keys to lowercase for RL observation
- `build_state_dict(...)` → dict with 7 keys

#### `env/gym_wrapper.py` — Gymnasium Interface
- `MahoragaGymEnv(gym.Env)` wraps `MahoragaEnv`
- `Discrete(5)` action space, `Dict` observation space
- Encodes categoricals to integers for neural networks
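
A minimal usage sketch, assuming the wrapper exposes the standard Gymnasium `reset`/`step` signatures (the import path follows the tree above):

```python
from env.gym_wrapper import MahoragaGymEnv  # path per the project tree above

env = MahoragaGymEnv()
obs, info = env.reset()                     # Gymnasium: (observation, info)
done = False
while not done:
    action = env.action_space.sample()      # uniform draw from Discrete(5)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```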

#### `app.py` — Gradio UI
- Interactive combat arena with 5 action buttons
- Displays HP, resistances, stack, cooldown, combat log
- Launch: `python app.py`

---

## 3. Core Mechanics

### Resistance System
Three categories: **PHYSICAL**, **CE**, **TECHNIQUE**. Range: [0, 80].

When agent adapts to a category:
- Target category: **+40**
- Other categories: **-20**
- All clamped to [0, 80]

Higher resistance = less damage from that category.
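
A minimal sketch of the update, assuming the dict shape from `new_resistances()` (the real `apply_resistance_change` in `env/mechanics.py` may differ in detail):

```python
ADAPT_INCREASE = 40
ADAPT_DECREASE = 20
RESISTANCE_MAX = 80

def apply_resistance_change(resistances: dict, target: str) -> dict:
    """+40 to the adapted category, -20 to the others, clamped to [0, 80]."""
    return {
        cat: max(0, min(RESISTANCE_MAX,
                        val + (ADAPT_INCREASE if cat == target else -ADAPT_DECREASE)))
        for cat, val in resistances.items()
    }
```

Adapting PHYSICAL from all zeros gives `{'PHYSICAL': 40, 'CE': 0, 'TECHNIQUE': 0}` — the off-categories clamp at the floor.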

### Action Space (0–4)

| Action | Name | Effect |
|--------|------|--------|
| 0 | Adapt PHYSICAL | +40 PHYSICAL res, -20 others |
| 1 | Adapt CE | +40 CE res, -20 others |
| 2 | Adapt TECHNIQUE | +40 TECHNIQUE res, -20 others |
| 3 | Judgment Strike | Deal damage, consume stacks, reset res |
| 4 | Regeneration | +300 HP, 3-turn cooldown |

### Adaptation Stack
- +1 when agent correctly adapts to current enemy attack category
- Consumed by Judgment Strike: each stack adds +50 damage
- Reset to 0 after Judgment Strike

### Judgment Strike Logic
**Condition**: Burst (350 dmg) if `last_adapted_category == current_enemy_category`
**Otherwise**: Base (100 dmg)
**Total**: `burst_or_base + (stacks × 50)`
**After**: Resistances reset to 0, stacks reset to 0
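
A sketch that folds the stack bonus into one function for illustration; the real `compute_judgment_damage(last_adapted, enemy_cat)` may apply stacks separately:

```python
JUDGMENT_BASE_DAMAGE = 100
JUDGMENT_BURST_DAMAGE = 350
STACK_BONUS = 50  # per-stack bonus from the rules above (name is illustrative)

def judgment_total(last_adapted: str, enemy_category: str, stacks: int) -> int:
    """Burst when the last adaptation matches the enemy's current category."""
    base = (JUDGMENT_BURST_DAMAGE if last_adapted == enemy_category
            else JUDGMENT_BASE_DAMAGE)
    return base + stacks * STACK_BONUS
```

With 3 stacks and a category match: 350 + 3 × 50 = 500 damage.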

### Heal Cooldown
- Heals +300 HP (capped at MAX_HP=1200)
- 3-turn cooldown after use
- Does NOT reset resistances
- If used while on cooldown → wasted turn (action nullified)
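
A sketch of the heal bookkeeping under these rules (exactly when the environment decrements the cooldown each turn is an assumption):

```python
MAX_HP = 1200
HEAL_AMOUNT = 300
HEAL_COOLDOWN = 3

def apply_heal(hp: int, cooldown: int) -> tuple:
    """Heal, or waste the turn if still on cooldown."""
    if cooldown > 0:
        return hp, cooldown                           # action nullified
    return min(hp + HEAL_AMOUNT, MAX_HP), HEAL_COOLDOWN
```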

### Damage Formula
```python
resistance = category_resistance
if ignore_armor:
    resistance = resistance * 0.8    # 20% bypass (PIERCE only)
damage = base_damage * (1 - resistance / 100)
```
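
For example, a 150-damage PIERCE hit into 50 PHYSICAL resistance resolves to `150 * (1 - 40 / 100) = 90`, because PIERCE first cuts the effective resistance from 50 to 40.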

### HP Configuration
| Entity | HP |
|--------|----|
| Agent (Mahoraga) | 1200 |
| Enemy | 1000 |

---

## 4. Enemy System — CurriculumEnemy

Three-phase curriculum designed for progressive learning:

### Phase 1: Tutorial (Turns 1–5)
- Always attacks with **PHYSICAL**
- Agent learns basic adaptation against a single category
- Predictable — builds confidence

### Phase 2: Pattern (Turns 6–15)
- Cycles: **PHYSICAL → CE → TECHNIQUE**
- 15% random injection (picks random category instead of pattern)
- Agent learns to predict cycling patterns and handle surprises

### Phase 3: Adaptive (Turns 16–25)
- **Targets the agent's lowest resistance category**
- Reads `resistances` dict, picks `min(resistances, key=resistances.get)`
- Agent must learn balanced defense or get exploited
- If no resistances provided, falls back to random
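
A sketch of the dispatch inside `get_attack`, using the constants from Section 14 (exactly where the Phase 2 cycle restarts is an assumption):

```python
import random

PHASE_1_END = 5
PHASE_2_END = 15
PHASE_2_DEVIATION = 0.15
CATEGORIES = ["PHYSICAL", "CE", "TECHNIQUE"]

def select_category(turn_number, resistances=None):
    if turn_number <= PHASE_1_END:                  # Phase 1: tutorial
        return "PHYSICAL"
    if turn_number <= PHASE_2_END:                  # Phase 2: pattern
        if random.random() < PHASE_2_DEVIATION:     # 15% random injection
            return random.choice(CATEGORIES)
        return CATEGORIES[(turn_number - PHASE_1_END - 1) % 3]
    if resistances:                                 # Phase 3: adaptive
        return min(resistances, key=resistances.get)
    return random.choice(CATEGORIES)                # fallback without res info
```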

### Subtypes
Each category has 3 subtypes (cosmetic variation, with one exception):

| Category | Subtypes |
|----------|----------|
| PHYSICAL | SLASH, IMPACT, **PIERCE** |
| CE | BLAST, WAVE, BEAM |
| TECHNIQUE | SPIKE, DELAYED, PATTERN |

**PIERCE** is special: `ignore_armor=True` → bypasses 20% of resistance.

### Attack Dict Schema (LOCKED)
```python
{
    "category": "PHYSICAL" | "CE" | "TECHNIQUE",
    "subtype": "SLASH" | "IMPACT" | ... ,
    "damage": int,
    "ignore_armor": bool
}
```

---

## 5. Reward System

Six independent components computed per step. Final reward = sum of all components.

| Component | Formula | Purpose | Typical Range |
|-----------|---------|---------|---------------|
| **Survival** | `-(damage_taken / 100)` | Penalize taking damage | [-2.2, 0] |
| **Combat** | `+(damage_dealt / 100)` | Reward dealing damage | [0, 4.5] |
| **Adaptation** | `+1.5` if correct, else `0` | **Strongest signal** — correct resistance match | {0, 1.5} |
| **Anti-Cowardice** | `-1.0` if heal at >70% HP | Prevent heal spam exploit | {-1.0, 0} |
| **Efficiency** | `+0.5` if damage >= 200 | Encourage big hits | {0, 0.5} |
| **Terminal** | `+5.0` win / `-5.0` loss | Strong episode-end signal | {-5.0, 0, 5.0} |

### Why Each Exists
- **Survival**: Without it, agent ignores defense
- **Combat**: Without it, agent never attacks
- **Adaptation**: Core learning signal — the entire point of Mahoraga
- **Anti-Cowardice**: Agent discovers healing is "safe" and spams it; this prevents that
- **Efficiency**: Encourages building stacks before striking instead of weak Judgments
- **Terminal**: Large signal at episode boundary for credit assignment

### Reward Breakdown
Every `step()` returns `info["reward_breakdown"]` with all 6 components as a dict. This is critical for debugging and analysis.
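
A sketch of the aggregator's shape, matching the table above; the `info`/`state` field names here are assumptions:

```python
MAX_HP = 1200

def compute_rewards(info: dict, state: dict, action: int, done: bool) -> dict:
    """Return the 6-component breakdown; the scalar reward is sum(r.values())."""
    r = {
        "survival": -info["damage_taken"] / 100,
        "combat": info["damage_dealt"] / 100,
        "adaptation": 1.5 if info["correct_adaptation"] else 0.0,
        "anti_cowardice": -1.0 if action == 4 and state["hp"] > 0.7 * MAX_HP else 0.0,
        "efficiency": 0.5 if info["damage_dealt"] >= 200 else 0.0,
        # win/loss only; how a 25-turn timeout is scored is not specified here
        "terminal": (5.0 if info["agent_won"] else -5.0) if done else 0.0,
    }
    return r
```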

---

## 6. Training Pipeline

### Model: Qwen 2.5 3B Instruct (via Unsloth)
- 4-bit quantized loading
- LoRA: r=16, targets q/k/v/o_proj, no bias
- max_seq_length: 1024

### Prompt Design
```
You are Mahoraga, an adaptive combat agent...
Current State: HP, resistances, last attack, turn
Available Actions: 0-4 with descriptions + strategy hints
→ Return ONLY a single integer (0-4)
```
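
A hypothetical `build_prompt` along these lines (the notebook's exact wording and state keys may differ):

```python
def build_prompt(state: dict) -> str:
    """Render the decision prompt from the state dict; key names are illustrative."""
    return (
        "You are Mahoraga, an adaptive combat agent.\n"
        f"HP: {state['hp']} | Enemy HP: {state['enemy_hp']} | Turn: {state['turn']}\n"
        f"Resistances: {state['resistances']}\n"
        f"Last enemy attack: {state['last_attack']}\n"
        "Actions: 0=Adapt PHYSICAL  1=Adapt CE  2=Adapt TECHNIQUE  "
        "3=Judgment Strike  4=Regeneration\n"
        "Return ONLY a single integer (0-4)."
    )
```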

### Rollout Loop
1. Reset env
2. For each turn: build prompt → generate → parse action → env.step() (sketched after this list)
3. Collect trajectory: `{prompt, response, action, reward, state, info}`
4. Track: total reward, correct adaptation rate, win/loss
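
A sketch of one rollout against `MahoragaEnv`; `generate(prompt) -> str` stands in for the LLM call (an assumption), and `build_prompt` is the sketch from Prompt Design above:

```python
import re

def collect_episode(env, generate):
    """Run one episode and collect the trajectory."""
    trajectory, total_reward = [], 0.0
    state, done = env.reset(), False
    while not done:
        prompt = build_prompt(state)
        response = generate(prompt)
        match = re.search(r"[0-4]", response)           # parse the action
        action = int(match.group()) if match else 0     # fallback to 0
        state, reward, done, info = env.step(action)    # 4-tuple per Section 2
        trajectory.append({"prompt": prompt, "response": response,
                           "action": action, "reward": reward,
                           "state": state, "info": info})
        total_reward += reward
    return trajectory, total_reward
```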

### Reward-Weighted SFT (GRPO-style)
Instead of PPO (complex and unstable on T4s), training uses reward-weighted supervised fine-tuning:
- Collect episodes with current model
- Weight actions by reward: **>1.0 → 3 copies**, **>0 → 2**, **>-1.5 → 1**, **else → skip** (see the sketch after this list)
- Fine-tune via SFTTrainer on weighted dataset
- Repeat for N iterations
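
A minimal sketch of the weighting step over the trajectories collected above (the `prompt`/`completion` record format fed to SFTTrainer is an assumption):

```python
def replication_count(reward: float) -> int:
    """Map a step's reward to its number of dataset copies (GRPO-style)."""
    if reward > 1.0:
        return 3
    if reward > 0:
        return 2
    if reward > -1.5:
        return 1
    return 0  # strongly negative steps are skipped entirely

def reward_weight(episodes: list) -> list:
    dataset = []
    for trajectory in episodes:
        for step in trajectory:
            sample = {"prompt": step["prompt"], "completion": step["response"]}
            dataset.extend([sample] * replication_count(step["reward"]))
    return dataset
```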

### Training Loop
```
for iteration in range(5):
    episodes = collect_episodes(10)
    dataset = reward_weight(episodes)
    sft_train(model, dataset)
    save_checkpoint()
    log_metrics()
```

### Checkpoints & Metrics
- LoRA weights saved per iteration: `/kaggle/working/checkpoints/iteration_N/`
- Metrics JSON: avg_reward, win_rate, avg_steps, adapt_rate
- Plot: 3-panel chart (reward, win rate, adaptation rate vs iteration)

---

## 7. UI System (Gradio)

### Structure
- 5 action buttons (Adapt×3, Judgment, Heal) + Reset
- Two columns: Agent stats (HP, resistances, stack, cooldown) | Enemy stats (HP, turn, reward)
- Monospace combat log

### State Mapping
UI reads directly from `MahoragaEnv` instance — no intermediary layer.

### Log Format
```
Turn X:
  Enemy:
    → [Subtype] ([Category])
  Mahoraga:
    → [Action]
  Result:
    → Damage: Y | Correct Adaptation: YES/NO | Stack: Z
    → Reward: R.RR
```

---

## 8. Data Flow

```
┌──────────┐    ┌──────────┐    ┌───────┐    ┌────────┐    ┌─────┐
│  State   │───▶│  Prompt  │───▶│ Model │───▶│ Action │───▶│ Env │
│  Dict    │    │ Builder  │    │ (LLM) │    │ Parser │    │     │
└──────────┘    └──────────┘    └───────┘    └────────┘    └──┬──┘
                                                              │
     ┌────────────────────────────────────────────────────────┘
     ▼
┌──────────┐    ┌──────────┐    ┌──────────────┐
│ Rewards  │───▶│ Dataset  │───▶│ SFT Trainer  │
│ (6 comp) │    │ (weight) │    │ (LoRA update)│
└──────────┘    └──────────┘    └──────────────┘
```

1. **State** → 7-key dict (HP, resistances, last attack, turn, etc.)
2. **Prompt** → Natural language with state + action descriptions
3. **Model** → Generates single integer 0-4
4. **Parser** → Extracts int, fallback to 0
5. **Env** → Applies action, computes damage, checks termination
6. **Rewards** → 6 independent components, summed to scalar
7. **Dataset** → High-reward actions duplicated, low-reward filtered
8. **Training** → SFT on weighted dataset updates LoRA weights

---

## 9. Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| **Unified schema** (`category/damage/ignore_armor`) | Two teams used different field names; unified to prevent silent bugs |
| **CurriculumEnemy** | Progressive difficulty prevents early collapse; Phase 3 forces balanced play |
| **Adaptation-match Judgment** | Old threshold-based burst was exploitable; matching requires tactical awareness |
| **Composable rewards (NOT monolithic)** | Debugging, tuning, and analysis require visibility into individual signals |
| **Reward-weighted SFT over PPO** | PPO on T4 GPUs with LLMs is unstable; GRPO-style SFT is simpler and proven |
| **Asymmetric HP (1200 vs 1000)** | Slight agent advantage encourages exploration; symmetric HP led to the agent always losing |
| **Heal does NOT reset resistances** | Prevents heal+reset exploit that nullifies adaptation investment |

---

## 10. Known Risks / Edge Cases

| Risk | Description | Mitigation |
|------|-------------|------------|
| **Reward imbalance** | Adaptation (+1.5) may dominate over combat signals | Monitor adapt_rate; if >80%, reduce adaptation reward |
| **Over-adaptation** | Agent may only adapt, never attack | Terminal reward (-5.0 loss) penalizes passive play |
| **Phase 3 exploit** | Agent could learn to keep all resistances equal to confuse Phase 3 | Phase 3 picks min, so equal res still gets attacked |
| **Training instability** | SFT on small datasets can overfit | Use gradient accumulation, low LR (2e-5), 1 epoch per iter |
| **Heal spam** | Agent learns heal is safe | Anti-cowardice penalty (-1.0) + cooldown (3 turns) |
| **Wasted turns** | Heal on cooldown wastes a turn | Action nullified, no positive rewards possible |
| **PIERCE bypass** | 20% resistance bypass can surprise agent | Only 1/3 chance of PIERCE subtype, negligible long-term |
| **Zero reward on notebook** | Cloning wrong branch (main vs phase1-env-setup) | Notebook has `--branch phase1-env-setup` + assertion check |

---

## 11. How to Run

### Local Environment
```bash
cd project_mahoraga
python main.py                  # Run random episode
python tests/test_env.py        # Run 110 core tests
python tests/test_gym_wrapper.py  # Run 33 gym tests
```

### Gradio UI
```bash
cd project_mahoraga
python app.py                   # Opens browser at localhost:7860
```

### Kaggle Training
1. Upload `notebooks/mahoraga_training.ipynb` to Kaggle
2. Enable **GPU** (2× T4)
3. Run all 14 cells in order
4. Model saves to `/kaggle/working/mahoraga_lora_final`

### Debug Mode
```python
env = MahoragaEnv(debug=True)
# Prints reward breakdown every step
```

---

## 12. Future Improvements

| Area | Improvement | Effort |
|------|-------------|--------|
| **Training** | Replace reward-weighted SFT with true GRPO/PPO | High |
| **Enemy** | Add Phase 4: combo attacks (multi-type per turn) | Medium |
| **Enemy** | Better randomness model (Markov chain instead of uniform) | Medium |
| **Rewards** | Dynamic reward scaling based on training progress | Medium |
| **Multi-agent** | Two Mahoraga agents competing | High |
| **Observation** | Add enemy history buffer (last N attacks) to state | Low |
| **UI** | Add resistance bar charts, HP progress graphs | Low |
| **Eval** | Automated benchmark suite (win rate vs each phase) | Medium |
| **Deploy** | HuggingFace Spaces deployment for Gradio UI | Low |

---

## 13. Git History

```
ec92cdd MERGE: Unified schema, CurriculumEnemy, Gradio UI
c8f2f7c CRITICAL FIX: Clone correct branch + debug mode
cfb710a Phase 5: Kaggle training notebook
e9f91da Phase 4: Gymnasium wrapper
fd4d842 Phase 3: Composable reward system
b27a5b7 Phase 2: Enemy subtypes
5ed57fe Patch: Judgment/heal/HP fixes
832e7c6 Phase 1: Core environment
22712d1 Initial commit
```

---

## 14. Constants Reference

```python
MAX_HP = 1200           # Agent HP
ENEMY_HP = 1000         # Enemy HP
MAX_TURNS = 25
ADAPT_INCREASE = 40     # Resistance gain on adapt
ADAPT_DECREASE = 20     # Resistance loss on others
RESISTANCE_MAX = 80
JUDGMENT_BASE_DAMAGE = 100
JUDGMENT_BURST_DAMAGE = 350
HEAL_AMOUNT = 300
HEAL_COOLDOWN = 3
ARMOR_BYPASS_RATIO = 0.2  # PIERCE effect
PHASE_1_END = 5
PHASE_2_END = 15
PHASE_2_DEVIATION = 0.15
```

---

*This report is a complete knowledge transfer document. A new engineer or AI model should be able to understand, modify, and extend the system using only this document and the source code.*