MridulNegi2005 committed on
Commit 4f85e94 · 1 Parent(s): a20c035

fix: correct GitHub repo link to Atishay9828/meta_Mahoraga

Files changed (2):
  1. BLOG.md +1 -1
  2. project_mahoraga_full_report.md +0 -429
BLOG.md CHANGED
@@ -15,7 +15,7 @@
  [![Reward](https://img.shields.io/badge/Avg_Reward-18.55-blue?style=for-the-badge)]()
  [![Survived](https://img.shields.io/badge/Players_Who_Survived-Good_Luck-black?style=for-the-badge)]()
 
- 📓 [**Training Notebook**](https://www.kaggle.com/code/atishay9828/meta-mahoraga/edit) · 🤗 [**Live Demo**](https://huggingface.co/spaces/MridulNegi2005/Project-Mahoraga) · 🏠 [**GitHub**](https://github.com/MridulNegi2005/Project_Mahoraga)
+ 📓 [**Training Notebook**](https://www.kaggle.com/code/atishay9828/meta-mahoraga/edit) · 🤗 [**Live Demo**](https://huggingface.co/spaces/MridulNegi2005/Project-Mahoraga) · 🏠 [**GitHub**](https://github.com/Atishay9828/meta_Mahoraga)
 
  </div>
project_mahoraga_full_report.md DELETED
@@ -1,429 +0,0 @@
# Project Mahoraga — Complete System Report

> **Version**: 1.0 (Post-Merge)
> **Branch**: `main` (fully merged from `phase1-env-setup`)
> **Tests**: 143/143 passing
> **Date**: 2026-04-25

---

## 1. Project Overview

**Project Mahoraga** is a reinforcement learning environment where an AI agent ("Mahoraga") learns adaptive combat through a resistance trade-off system. Named after Jujutsu Kaisen's Mahoraga — a shikigami that adapts to any attack — the system trains an LLM (Qwen 2.5 3B) to make tactical decisions in a turn-based combat loop.

**Core Loop**: `Observe → Adapt → Accumulate → Punish`

The agent must:
1. Observe enemy attack patterns
2. Build resistance to the correct attack category
3. Accumulate adaptation stacks
4. Execute Judgment Strike for burst damage at the right moment

**This is NOT a game.** It is a clean, testable RL environment designed for LLM fine-tuning via reward-weighted SFT.

---

## 2. Architecture Breakdown

```
project_mahoraga/
├── env/
│   ├── mahoraga_env.py         # Main environment orchestrator
│   ├── mechanics.py            # Resistance, damage, action math
│   ├── enemy.py                # CurriculumEnemy (3-phase AI)
│   ├── rewards.py              # 6-component composable reward system
│   ├── state.py                # State dict builder
│   └── gym_wrapper.py          # Gymnasium-compatible wrapper
├── utils/
│   ├── constants.py            # All game constants and mappings
│   └── validators.py           # Action validation
├── tests/
│   ├── test_env.py             # 110 core tests
│   └── test_gym_wrapper.py     # 33 wrapper tests
├── notebooks/
│   ├── mahoraga_training.py    # Training notebook (source)
│   └── mahoraga_training.ipynb # Training notebook (Kaggle)
├── scripts/
│   └── random_agent_gym.py     # Random agent demo
├── app.py                      # Gradio interactive UI
├── main.py                     # CLI episode runner
└── README.md
```

### Module Details

#### `env/mahoraga_env.py` — Environment Orchestrator
- `MahoragaEnv(debug=False)` — main class
- `reset()` → returns state dict
- `step(action)` → returns `(state, reward, done, info)`
- Coordinates enemy attacks, agent actions, reward computation
- Tracks: HP, resistances, adaptation stack, heal cooldown, last adapted category

#### `env/mechanics.py` — Core Math
- `new_resistances()` — creates `{PHYSICAL: 0, CE: 0, TECHNIQUE: 0}`
- `apply_resistance_change(res, type)` — +40 target, -20 others, clamp [0, 80]
- `compute_enemy_damage(category, res, ignore_armor)` — damage formula
- `compute_judgment_damage(last_adapted, enemy_cat)` — adaptation-match burst
- `apply_action_effects(...)` — dispatches actions 0–4
- `check_correct_adaptation(action, category)` — validates adaptation

#### `env/enemy.py` — CurriculumEnemy
- Single `CurriculumEnemy` class with 3-phase behavior
- `get_attack(turn_number, resistances)` → `{category, subtype, damage, ignore_armor}`
- Phase selection based on turn number

#### `env/rewards.py` — Composable Rewards
- 6 independent functions + 1 aggregator
- Returns a dict, NOT a single scalar
- `compute_rewards(info, state, action, done)` → dict

#### `env/state.py` — State Builder
- Converts internal uppercase keys to lowercase for RL observation
- `build_state_dict(...)` → dict with 7 keys

#### `env/gym_wrapper.py` — Gymnasium Interface
- `MahoragaGymEnv(gym.Env)` wraps `MahoragaEnv`
- `Discrete(5)` action space, `Dict` observation space
- Encodes categoricals to integers for neural networks (see the sketch below)

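For orientation, the wrapper pattern can be pictured with a minimal sketch, assuming Gymnasium's five-tuple `step()` API; the observation keys, bounds, and encodings here are illustrative assumptions, not the repository's exact code.

```python
# Hypothetical sketch of a Gymnasium wrapper like the one described above.
# Key names, bounds, and encodings are illustrative assumptions.
import gymnasium as gym
import numpy as np
from gymnasium import spaces

CATEGORIES = ["PHYSICAL", "CE", "TECHNIQUE"]

class MahoragaGymEnvSketch(gym.Env):
    def __init__(self, inner_env):
        super().__init__()
        self.inner = inner_env  # the MahoragaEnv being wrapped
        self.action_space = spaces.Discrete(5)  # actions 0-4
        self.observation_space = spaces.Dict({
            "agent_hp": spaces.Box(0, 1200, shape=(1,), dtype=np.float32),
            "enemy_hp": spaces.Box(0, 1000, shape=(1,), dtype=np.float32),
            "resistances": spaces.Box(0, 80, shape=(3,), dtype=np.float32),
            "last_attack_category": spaces.Discrete(len(CATEGORIES)),
        })

    def _encode(self, state):
        # Map the env's dict state onto the numeric observation space.
        return {
            "agent_hp": np.array([state["agent_hp"]], dtype=np.float32),
            "enemy_hp": np.array([state["enemy_hp"]], dtype=np.float32),
            "resistances": np.array(
                [state["resistances"][c.lower()] for c in CATEGORIES],
                dtype=np.float32,
            ),
            "last_attack_category": CATEGORIES.index(state["last_attack"]["category"]),
        }

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self._encode(self.inner.reset()), {}

    def step(self, action):
        state, reward, done, info = self.inner.step(int(action))
        # Gymnasium splits done into (terminated, truncated).
        return self._encode(state), reward, done, False, info
```
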
#### `app.py` — Gradio UI
- Interactive combat arena with 5 action buttons
- Displays HP, resistances, stack, cooldown, combat log
- Launch: `python app.py`

---

## 3. Core Mechanics

### Resistance System
Three categories: **PHYSICAL**, **CE**, **TECHNIQUE**. Range: [0, 80].

When the agent adapts to a category:
- Target category: **+40**
- Other categories: **-20**
- All clamped to [0, 80]

Higher resistance means less damage taken from that category.

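A minimal sketch of this update rule, assuming the dict layout from `new_resistances()`; the variable names are illustrative (the real `apply_resistance_change(res, type)` lives in `env/mechanics.py`):

```python
# Illustrative implementation of the +40 / -20 / clamp rule described above.
RESISTANCE_MAX = 80

def apply_resistance_change(res: dict, target: str) -> dict:
    """+40 to the adapted category, -20 to the others, clamped to [0, 80]."""
    for category in res:
        delta = 40 if category == target else -20
        res[category] = max(0, min(RESISTANCE_MAX, res[category] + delta))
    return res

# Example: adapting to PHYSICAL twice from zero
res = {"PHYSICAL": 0, "CE": 0, "TECHNIQUE": 0}
apply_resistance_change(res, "PHYSICAL")  # {'PHYSICAL': 40, 'CE': 0, 'TECHNIQUE': 0}
apply_resistance_change(res, "PHYSICAL")  # {'PHYSICAL': 80, 'CE': 0, 'TECHNIQUE': 0}
```
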
### Action Space (0–4)

| Action | Name | Effect |
|--------|------|--------|
| 0 | Adapt PHYSICAL | +40 PHYSICAL res, -20 others |
| 1 | Adapt CE | +40 CE res, -20 others |
| 2 | Adapt TECHNIQUE | +40 TECHNIQUE res, -20 others |
| 3 | Judgment Strike | Deal damage, consume stacks, reset res |
| 4 | Regeneration | +300 HP, 3-turn cooldown |

### Adaptation Stack
- +1 when the agent correctly adapts to the current enemy attack category
- Consumed by Judgment Strike: each stack adds +50 damage
- Reset to 0 after Judgment Strike

### Judgment Strike Logic
- **Condition**: Burst (350 dmg) if `last_adapted_category == current_enemy_category`
- **Otherwise**: Base (100 dmg)
- **Total**: `burst_or_base + (stacks × 50)` (see the sketch below)
- **After**: Resistances reset to 0, stacks reset to 0

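Putting those rules together, a hedged sketch of the damage computation (constants come from the Constants Reference in §14; passing `stacks` as an argument is an assumption, since the real `compute_judgment_damage(last_adapted, enemy_cat)` may apply the stack bonus elsewhere):

```python
# Sketch of Judgment Strike damage per the rules above; illustrative only.
JUDGMENT_BASE_DAMAGE = 100
JUDGMENT_BURST_DAMAGE = 350
STACK_BONUS = 50  # assumed name for the +50-per-stack bonus

def judgment_damage(last_adapted: str, enemy_category: str, stacks: int) -> int:
    matched = last_adapted == enemy_category
    base = JUDGMENT_BURST_DAMAGE if matched else JUDGMENT_BASE_DAMAGE
    return base + stacks * STACK_BONUS

# A matched strike with 3 stacks: 350 + 3 * 50 = 500 damage
assert judgment_damage("CE", "CE", 3) == 500
```
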
### Heal Cooldown
- Heals +300 HP (capped at MAX_HP = 1200)
- 3-turn cooldown after use
- Does NOT reset resistances
- If used while on cooldown → wasted turn (action nullified; see the sketch below)

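The bookkeeping for this can be pictured with a short sketch (constant values match §14; the tuple-returning helper is an assumption):

```python
# Illustrative heal handling; constant values match the Constants Reference.
MAX_HP, HEAL_AMOUNT, HEAL_COOLDOWN = 1200, 300, 3

def try_heal(hp: int, cooldown: int) -> tuple[int, int]:
    """Return (new_hp, new_cooldown); on cooldown the turn is simply wasted."""
    if cooldown > 0:
        return hp, cooldown  # action nullified: no heal, no resistance reset
    return min(MAX_HP, hp + HEAL_AMOUNT), HEAL_COOLDOWN
```
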
### Damage Formula
```python
resistance = category_resistance
if ignore_armor:
    resistance = resistance * 0.8  # 20% bypass (PIERCE only)
damage = base_damage * (1 - resistance / 100)
```

### HP Configuration

| Entity | HP |
|--------|----|
| Agent (Mahoraga) | 1200 |
| Enemy | 1000 |

---

## 4. Enemy System — CurriculumEnemy

Three-phase curriculum designed for progressive learning:

### Phase 1: Tutorial (Turns 1–5)
- Always attacks with **PHYSICAL**
- Agent learns basic adaptation against a single category
- Predictable — builds confidence

### Phase 2: Pattern (Turns 6–15)
- Cycles: **PHYSICAL → CE → TECHNIQUE**
- 15% random injection (picks a random category instead of the pattern)
- Agent learns to predict cycling patterns and handle surprises

### Phase 3: Adaptive (Turns 16–25)
- **Targets the agent's lowest-resistance category**
- Reads the `resistances` dict and picks `min(resistances, key=resistances.get)` (see the sketch below)
- Agent must learn balanced defense or get exploited
- If no resistances are provided, falls back to random

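A compact sketch of the phase selection; the cutoffs and behaviors come from this section and §14, while the helper structure and cycle offset are assumptions:

```python
# Illustrative CurriculumEnemy category choice; structure is assumed.
import random

PHASE_1_END, PHASE_2_END = 5, 15
PHASE_2_DEVIATION = 0.15
CATEGORIES = ["PHYSICAL", "CE", "TECHNIQUE"]

def choose_category(turn: int, resistances: dict | None) -> str:
    if turn <= PHASE_1_END:                      # Phase 1: tutorial
        return "PHYSICAL"
    if turn <= PHASE_2_END:                      # Phase 2: cycling pattern
        if random.random() < PHASE_2_DEVIATION:  # 15% random injection
            return random.choice(CATEGORIES)
        return CATEGORIES[(turn - PHASE_1_END - 1) % len(CATEGORIES)]
    if resistances:                              # Phase 3: exploit the weakest resistance
        return min(resistances, key=resistances.get)
    return random.choice(CATEGORIES)             # fallback when no resistances given
```
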
### Subtypes
Each category has 3 subtypes (visual variation only):

| Category | Subtypes |
|----------|----------|
| PHYSICAL | SLASH, IMPACT, **PIERCE** |
| CE | BLAST, WAVE, BEAM |
| TECHNIQUE | SPIKE, DELAYED, PATTERN |

**PIERCE** is special: `ignore_armor=True` → bypasses 20% of resistance.

### Attack Dict Schema (LOCKED)
```python
{
    "category": "PHYSICAL" | "CE" | "TECHNIQUE",
    "subtype": "SLASH" | "IMPACT" | ...,
    "damage": int,
    "ignore_armor": bool
}
```

---

## 5. Reward System

Six independent components are computed per step. Final reward = sum of all components.

| Component | Formula | Purpose | Typical Range |
|-----------|---------|---------|---------------|
| **Survival** | `-(damage_taken / 100)` | Penalize taking damage | [-2.2, 0] |
| **Combat** | `+(damage_dealt / 100)` | Reward dealing damage | [0, 4.5] |
| **Adaptation** | `+1.5` if correct, else `0` | **Strongest signal** — correct resistance match | {0, 1.5} |
| **Anti-Cowardice** | `-1.0` if heal at >70% HP | Prevent heal-spam exploit | {-1.0, 0} |
| **Efficiency** | `+0.5` if damage >= 200 | Encourage big hits | {0, 0.5} |
| **Terminal** | `+5.0` win / `-5.0` loss | Strong episode-end signal | {-5.0, 0, 5.0} |

### Why Each Exists
- **Survival**: Without it, the agent ignores defense
- **Combat**: Without it, the agent never attacks
- **Adaptation**: Core learning signal — the entire point of Mahoraga
- **Anti-Cowardice**: The agent discovers healing is "safe" and spams it; this prevents that
- **Efficiency**: Encourages building stacks before striking instead of weak Judgments
- **Terminal**: Large signal at episode boundary for credit assignment

### Reward Breakdown
Every `step()` returns `info["reward_breakdown"]` with all 6 components as a dict. This is critical for debugging and analysis.

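As a reference point, a hedged sketch of the aggregator follows; the component formulas mirror the table above, but the `info`/`state` key names are assumptions:

```python
# Sketch of the 6-component reward dict; info/state key names are assumed.
MAX_HP = 1200

def compute_rewards(info: dict, state: dict, action: int, done: bool) -> dict:
    return {
        "survival": -info["damage_taken"] / 100,
        "combat": info["damage_dealt"] / 100,
        "adaptation": 1.5 if info["correct_adaptation"] else 0.0,
        "anti_cowardice": -1.0 if action == 4 and state["hp"] > 0.7 * MAX_HP else 0.0,
        "efficiency": 0.5 if info["damage_dealt"] >= 200 else 0.0,
        "terminal": (5.0 if info["won"] else -5.0) if done else 0.0,
    }

# The scalar step reward is then: sum(compute_rewards(...).values())
```
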
---

## 6. Training Pipeline

### Model: Qwen 2.5 3B Instruct (via Unsloth)
- 4-bit quantized loading
- LoRA: r=16, targets q/k/v/o_proj, no bias
- max_seq_length: 1024

### Prompt Design
```
You are Mahoraga, an adaptive combat agent...
Current State: HP, resistances, last attack, turn
Available Actions: 0-4 with descriptions + strategy hints
→ Return ONLY a single integer (0-4)
```

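A minimal sketch of a prompt builder matching that template; the state keys and wording here are placeholders, not the notebook's exact prompt:

```python
# Hypothetical prompt builder; field names and phrasing are illustrative.
def build_prompt(state: dict) -> str:
    return (
        "You are Mahoraga, an adaptive combat agent.\n"
        f"Current State: HP={state['hp']}, resistances={state['resistances']}, "
        f"last attack={state['last_attack']}, turn={state['turn']}\n"
        "Available Actions:\n"
        "0: Adapt PHYSICAL   1: Adapt CE   2: Adapt TECHNIQUE\n"
        "3: Judgment Strike  4: Regeneration (heal, 3-turn cooldown)\n"
        "Return ONLY a single integer (0-4)."
    )
```
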
### Rollout Loop
1. Reset the env
2. For each turn: build prompt → generate → parse action → `env.step()`
3. Collect trajectory: `{prompt, response, action, reward, state, info}`
4. Track: total reward, correct adaptation rate, win/loss

### Reward-Weighted SFT (GRPO-style)
Instead of PPO (complex, unstable on T4s), training uses reward-weighted supervised fine-tuning (see the sketch after this list):
- Collect episodes with the current model
- Weight actions by reward: **>1.0 → 3 copies**, **>0 → 2**, **>-1.5 → 1**, **else → skip**
- Fine-tune via SFTTrainer on the weighted dataset
- Repeat for N iterations

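A sketch of that weighting rule, using the thresholds from the list above (the per-step record format is an assumption):

```python
# Reward-weighted dataset construction per the thresholds above.
def reward_weight(episodes: list) -> list:
    """Duplicate high-reward (prompt, response) pairs; drop strongly negative ones."""
    dataset = []
    for episode in episodes:
        for step in episode:  # each step assumed: {"prompt", "response", "reward", ...}
            r = step["reward"]
            if r > 1.0:
                copies = 3
            elif r > 0:
                copies = 2
            elif r > -1.5:
                copies = 1
            else:
                continue  # skip strongly negative actions
            dataset += [{"prompt": step["prompt"], "response": step["response"]}] * copies
    return dataset
```
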
### Training Loop
```
for iteration in range(5):
    episodes = collect_episodes(10)
    dataset = reward_weight(episodes)
    sft_train(model, dataset)
    save_checkpoint()
    log_metrics()
```

### Checkpoints & Metrics
- LoRA weights saved per iteration: `/kaggle/working/checkpoints/iteration_N/`
- Metrics JSON: avg_reward, win_rate, avg_steps, adapt_rate
- Plot: 3-panel chart (reward, win rate, adaptation rate vs iteration)

---

## 7. UI System (Gradio)

### Structure
- 5 action buttons (Adapt×3, Judgment, Heal) + Reset
- Two columns: Agent stats (HP, resistances, stack, cooldown) | Enemy stats (HP, turn, reward)
- Monospace combat log

### State Mapping
UI reads directly from the `MahoragaEnv` instance — no intermediary layer.

### Log Format
```
Turn X:
  Enemy:
    → [Subtype] ([Category])
  Mahoraga:
    → [Action]
  Result:
    → Damage: Y | Correct Adaptation: YES/NO | Stack: Z
    → Reward: R.RR
```

---

## 8. Data Flow

```
┌─────────┐    ┌──────────┐    ┌───────┐    ┌────────┐    ┌─────┐
│  State  │───▶│  Prompt  │───▶│ Model │───▶│ Action │───▶│ Env │
│  Dict   │    │ Builder  │    │ (LLM) │    │ Parser │    │     │
└─────────┘    └──────────┘    └───────┘    └────────┘    └──┬──┘
                                                             │
     ┌───────────────────────────────────────────────────────┘
     ▼
┌──────────┐    ┌──────────┐    ┌──────────────┐
│ Rewards  │───▶│ Dataset  │───▶│ SFT Trainer  │
│ (6 comp) │    │ (weight) │    │ (LoRA update)│
└──────────┘    └──────────┘    └──────────────┘
```

1. **State** → 7-key dict (HP, resistances, last attack, turn, etc.)
2. **Prompt** → Natural language with state + action descriptions
3. **Model** → Generates a single integer 0-4
4. **Parser** → Extracts the int, falls back to 0 (sketched below)
5. **Env** → Applies action, computes damage, checks termination
6. **Rewards** → 6 independent components, summed to a scalar
7. **Dataset** → High-reward actions duplicated, low-reward filtered
8. **Training** → SFT on weighted dataset updates LoRA weights

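Step 4's parser can be as small as the sketch below; the report only specifies "extracts int, fallback to 0", so the regex strategy is an assumption:

```python
# Minimal action parser for step 4 above; the regex approach is an assumption.
import re

def parse_action(response: str) -> int:
    match = re.search(r"[0-4]", response)      # first digit in the valid range
    return int(match.group()) if match else 0  # fallback to action 0

assert parse_action("I choose 3 (Judgment Strike)") == 3
assert parse_action("no digits here") == 0
```
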
---

## 9. Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| **Unified schema** (`category/damage/ignore_armor`) | Two teams used different field names; unified to prevent silent bugs |
| **CurriculumEnemy** | Progressive difficulty prevents early collapse; Phase 3 forces balanced play |
| **Adaptation-match Judgment** | Old threshold-based burst was exploitable; matching requires tactical awareness |
| **Composable rewards (NOT monolithic)** | Debugging, tuning, and analysis require visibility into individual signals |
| **Reward-weighted SFT over PPO** | PPO on T4 GPUs with LLMs is unstable; GRPO-style SFT is simpler and proven |
| **Asymmetric HP (1200 vs 1000)** | Slight agent advantage encourages exploration; symmetric HP led to the agent always losing |
| **Heal does NOT reset resistances** | Prevents a heal+reset exploit that nullifies adaptation investment |

---

## 10. Known Risks / Edge Cases

| Risk | Description | Mitigation |
|------|-------------|------------|
| **Reward imbalance** | Adaptation (+1.5) may dominate over combat signals | Monitor adapt_rate; if >80%, reduce adaptation reward |
| **Over-adaptation** | Agent may only adapt, never attack | Terminal reward (-5.0 loss) penalizes passive play |
| **Phase 3 exploit** | Agent could learn to keep all resistances equal to confuse Phase 3 | Phase 3 picks the min, so equal res still gets attacked |
| **Training instability** | SFT on small datasets can overfit | Use gradient accumulation, low LR (2e-5), 1 epoch per iter |
| **Heal spam** | Agent learns heal is "safe" | Anti-cowardice penalty (-1.0) + cooldown (3 turns) |
| **Wasted turns** | Heal on cooldown wastes a turn | Action nullified, no positive rewards possible |
| **PIERCE bypass** | 20% resistance bypass can surprise the agent | Only 1/3 chance of PIERCE subtype, negligible long-term |
| **Zero reward on notebook** | Cloning the wrong branch (main vs phase1-env-setup) | Notebook has `--branch phase1-env-setup` + assertion check |

---

## 11. How to Run

### Local Environment
```bash
cd project_mahoraga
python main.py                    # Run a random episode
python tests/test_env.py          # Run 110 core tests
python tests/test_gym_wrapper.py  # Run 33 gym tests
```

### Gradio UI
```bash
cd project_mahoraga
python app.py  # Opens browser at localhost:7860
```

### Kaggle Training
1. Upload `notebooks/mahoraga_training.ipynb` to Kaggle
2. Enable **GPU** (2× T4)
3. Run all 14 cells in order
4. Model saves to `/kaggle/working/mahoraga_lora_final`

### Debug Mode
```python
env = MahoragaEnv(debug=True)
# Prints reward breakdown every step
```

---

## 12. Future Improvements

| Area | Improvement | Effort |
|------|-------------|--------|
| **Training** | Replace reward-weighted SFT with true GRPO/PPO | High |
| **Enemy** | Add Phase 4: combo attacks (multi-type per turn) | Medium |
| **Enemy** | Better randomness model (Markov chain instead of uniform) | Medium |
| **Rewards** | Dynamic reward scaling based on training progress | Medium |
| **Multi-agent** | Two Mahoraga agents competing | High |
| **Observation** | Add enemy history buffer (last N attacks) to state | Low |
| **UI** | Add resistance bar charts, HP progress graphs | Low |
| **Eval** | Automated benchmark suite (win rate vs each phase) | Medium |
| **Deploy** | HuggingFace Spaces deployment for Gradio UI | Low |

---

## 13. Git History

```
ec92cdd  MERGE: Unified schema, CurriculumEnemy, Gradio UI
c8f2f7c  CRITICAL FIX: Clone correct branch + debug mode
cfb710a  Phase 5: Kaggle training notebook
e9f91da  Phase 4: Gymnasium wrapper
fd4d842  Phase 3: Composable reward system
b27a5b7  Phase 2: Enemy subtypes
5ed57fe  Patch: Judgment/heal/HP fixes
832e7c6  Phase 1: Core environment
22712d1  Initial commit
```

---

## 14. Constants Reference

```python
MAX_HP = 1200                # Agent HP
ENEMY_HP = 1000              # Enemy HP
MAX_TURNS = 25
ADAPT_INCREASE = 40          # Resistance gain on adapt
ADAPT_DECREASE = 20          # Resistance loss on others
RESISTANCE_MAX = 80
JUDGMENT_BASE_DAMAGE = 100
JUDGMENT_BURST_DAMAGE = 350
HEAL_AMOUNT = 300
HEAL_COOLDOWN = 3
ARMOR_BYPASS_RATIO = 0.2     # PIERCE effect
PHASE_1_END = 5
PHASE_2_END = 15
PHASE_2_DEVIATION = 0.15
```

---

*This report is a complete knowledge-transfer document. A new engineer or AI model should be able to understand, modify, and extend the system using only this document and the source code.*