Commit 8760c42 · ScottzillaSystems · verified · 1 Parent(s): c6f9619

Upload README.md

Files changed (1): README.md (added, +208 -0)

# Self-Healing Training System (SHTS)

> **Fully autonomous debugging and error recovery for Hugging Face TRL trainers. Add one callback, wrap with `SelfHealingTrainer`, and cut debugging costs to near zero.**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![HF Hub](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Space-blue)](https://huggingface.co/ScottzillaSystems/self-healing-training)

---

## The Problem

ML training fails constantly:
- **CUDA OOM** kills a job at step 847/1000, forcing a restart from scratch
- **NaN loss** silently corrupts the model and is only discovered hours later
- **Loss spikes** cascade into divergence and require manual intervention
- **DPO plateau** at a loss of 0.693 (≈ ln 2, meaning the policy ranks chosen vs. rejected responses no better than random chance) wastes GPU hours
- **No postmortem**: "what step did it die on?"

Each failure costs **developer time + GPU credits + schedule delay**. At scale, the wasted compute adds up quickly.

## The Solution

SHTS wraps any Hugging Face TRL trainer with four autonomous layers:

```
┌──────────────────────────────────────────┐
│ LAYER 4: ORCHESTRATION                   │
│   SelfHealingTrainer retry loop          │
│   while not converged: try → recover     │
├──────────────────────────────────────────┤
│ LAYER 3: RECOVERY                        │
│   HealingActions: rollback, halve LR,    │
│   halve batch, reclip, clear cache       │
├──────────────────────────────────────────┤
│ LAYER 2: DIAGNOSIS                       │
│   Root-cause classifier: NaN/divergence/ │
│   OOM/data/API, with literature refs     │
├──────────────────────────────────────────┤
│ LAYER 1: DETECTION                       │
│   SelfHealingCallback: loss, gradients,  │
│   memory, ZClip adaptive clipping        │
└──────────────────────────────────────────┘
```

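As an illustration of the recovery layer's "halve batch" action, here is a minimal sketch of halving the per-device batch size while raising gradient accumulation so the effective batch size does not shrink. The function name `halve_batch_size` and its signature are assumptions made for this example; they are not taken from the SHTS source.

```python
def halve_batch_size(per_device_batch_size: int, grad_accum_steps: int) -> tuple[int, int]:
    """Halve the per-device batch size and raise gradient accumulation to compensate.

    Hypothetical helper for illustration; not the SHTS implementation.
    """
    if per_device_batch_size <= 1:
        raise ValueError("batch size is already 1; cannot halve further")
    effective = per_device_batch_size * grad_accum_steps
    new_batch = per_device_batch_size // 2
    # Ceiling division, so new_batch * new_accum is never below the old effective size.
    new_accum = -(-effective // new_batch)
    return new_batch, new_accum


print(halve_batch_size(4, 1))  # (2, 2)
print(halve_batch_size(8, 4))  # (4, 8)
```

With an odd batch size the effective batch can only be matched approximately from above, which is why the sketch rounds the accumulation steps up.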
## Quick Start

```bash
pip install git+https://huggingface.co/ScottzillaSystems/self-healing-training
```

```python
from self_healing import SelfHealingTrainer, HealingConfig
from trl import SFTTrainer, SFTConfig

# Your normal training setup
# (`model`, `dataset`, and `tokenizer` are assumed to be loaded already)
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./output",
        learning_rate=2e-5,
        per_device_train_batch_size=4,
    ),
    train_dataset=dataset,
    tokenizer=tokenizer,  # newer TRL releases name this argument `processing_class`
)

# Wrap with self-healing: that's it!
sh = SelfHealingTrainer(
    trainer,
    HealingConfig(
        max_recovery_attempts=5,
        zclip_enabled=True,
    ),
)

# Optional: dry-run to catch config errors before full training
sh.dry_run(num_steps=2)

# Train with full autonomy
result = sh.train()
```

## What Handles What

| Failure | Detection | Recovery | Paper |
|---------|-----------|----------|-------|
| **NaN loss** | `math.isnan(loss)` after each step | Rollback → halve LR → enable grad clip | ZClip arXiv:2504.02507 |
| **CUDA OOM** | `on_exception` catches `OutOfMemoryError` | Halve batch size (preserve effective batch size via gradient accumulation) → gradient checkpointing → clear cache | Unicron arXiv:2401.00134 |
| **Loss spike** | Loss > 5× running mean over window | ZClip adaptive gradient clipping → emergency checkpoint | ZClip arXiv:2504.02507 |
| **Divergence** | Loss increasing for N consecutive steps | Rollback → halve LR | Pioneer Agent arXiv:2604.09791 |
| **Gradient explosion** | `grad_norm > 100` | ZClip → enable `max_grad_norm=1.0` | AdaGC arXiv:2502.11034 |
| **DPO plateau** | `loss ≈ 0.693` (≈ ln 2, random chance) | Increase LR 2-5× → check data quality | Rafailov et al. (2023) |
| **Overfitting** | `eval_loss - train_loss > 2.0` | Alert with actionable recommendation | Standard practice |
| **API errors** | Exception with "api/network/timeout" | Exponential backoff (30s → 60s → 120s → ...) | Standard pattern |
| **Data errors** | Exception with "shape/dimension/index" | Skip batch → log bad sample | Deep Researcher arXiv:2604.05854 |
| **Crash postmortem** | Always | `postmortem.json` with exit reason, last step, metrics, recovery history | PTT pattern |

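The NaN and loss-spike rows of the table can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the SHTS detector: the class name `LossMonitor`, the window size, and the choice to also flag infinite losses are all hypothetical.

```python
import math
from collections import deque


class LossMonitor:
    """Hypothetical detector for NaN losses and loss spikes (illustration only)."""

    def __init__(self, window: int = 50, spike_factor: float = 5.0):
        self.history = deque(maxlen=window)  # recent healthy losses
        self.spike_factor = spike_factor

    def check(self, loss: float) -> str:
        """Return "nan", "spike", or "ok" for the latest loss value."""
        # Assumption: infinite losses are treated the same way as NaN.
        if math.isnan(loss) or math.isinf(loss):
            return "nan"
        if self.history:
            running_mean = sum(self.history) / len(self.history)
            if loss > self.spike_factor * running_mean:
                # Spikes are kept out of the baseline so one outlier
                # does not raise the threshold for the next one.
                return "spike"
        self.history.append(loss)
        return "ok"


monitor = LossMonitor(window=3, spike_factor=5.0)
statuses = [monitor.check(x) for x in [2.0, 2.1, 1.9, 11.0, float("nan"), 2.0]]
print(statuses)  # ['ok', 'ok', 'ok', 'spike', 'nan', 'ok']
```

Comparing against a running mean assumes the loss is positive, which holds for the cross-entropy style losses used by the trainers listed in this README.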
## Crash Postmortem

Every training interruption produces a `postmortem.json`:

```json
{
  "exit_reason": "exception",
  "exception_type": "OutOfMemoryError",
  "last_step": 847,
  "timestamp": "2026-04-30T15:26:04Z",
  "final_metrics": {"loss": 2.15, "grad_norm": 42.3},
  "recovery_actions": [
    {
      "failure": "oom",
      "diagnosis": "CUDA Out of Memory. Batch size exceeds GPU capacity.",
      "actions": ["halve_batch_size", "enable_gradient_checkpointing", "clear_cache"]
    }
  ],
  "running_time_seconds": 1847.3
}
```

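Because the postmortem is plain JSON, it can be inspected with the standard library alone. The sketch below assumes only the field names shown in the example above; `summarize_postmortem` is a hypothetical helper, not part of SHTS.

```python
import json


def summarize_postmortem(text: str) -> str:
    """Build a one-line summary from the contents of a postmortem.json file."""
    report = json.loads(text)
    # Flatten the action lists of every recorded recovery attempt.
    actions = [
        action
        for entry in report.get("recovery_actions", [])
        for action in entry.get("actions", [])
    ]
    return (
        f"{report['exit_reason']} ({report.get('exception_type', 'n/a')}) "
        f"at step {report['last_step']} "
        f"after {report['running_time_seconds']:.0f}s; "
        f"actions tried: {', '.join(actions) or 'none'}"
    )


example = """
{
  "exit_reason": "exception",
  "exception_type": "OutOfMemoryError",
  "last_step": 847,
  "recovery_actions": [
    {"failure": "oom",
     "actions": ["halve_batch_size", "enable_gradient_checkpointing", "clear_cache"]}
  ],
  "running_time_seconds": 1847.3
}
"""
summary = summarize_postmortem(example)
print(summary)
```

In practice the text would come from the file itself, for example `Path("output/postmortem.json").read_text()`; the exact location of the file is an assumption here.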
## Trackio Integration

Set `report_to="trackio"` in your training args. SHTS emits:

- **Alerts** at every decision point (INFO/WARN/ERROR)
- **Metrics**: `healing/recovery_attempts`, `healing/nan_count`, `healing/loss_spike_ratio`, `healing/eval_gap`
- **ZClip metrics**: `zclip/raw_grad_norm`, `zclip/clipped_grad_norm`, `zclip/z_score`, `zclip/total_clips`

Dashboard URL: `https://huggingface.co/spaces/<username>/<trackio-space>`

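To make the four ZClip metrics concrete, here is a simplified sketch of z-score based clipping of gradient norms. It is not the exact rule from the ZClip paper and not the SHTS implementation: the class name `ZScoreClipper`, the warmup scheme, the EMA factor `alpha`, and the "clip to mean + threshold × std" rule are assumptions chosen for illustration.

```python
import math


class ZScoreClipper:
    """Simplified z-score clipping on gradient norms (illustrative assumption)."""

    def __init__(self, z_threshold: float = 3.0, alpha: float = 0.97, warmup: int = 5):
        self.z_threshold = z_threshold
        self.alpha = alpha        # EMA smoothing factor
        self.warmup = warmup      # number of steps used to seed the statistics
        self.seed = []
        self.mean = 0.0
        self.var = 0.0
        self.total_clips = 0

    def step(self, raw_grad_norm: float) -> dict:
        clipped, z_score = raw_grad_norm, 0.0
        if len(self.seed) < self.warmup:
            # Warmup: collect norms, then initialise mean and variance once.
            self.seed.append(raw_grad_norm)
            if len(self.seed) == self.warmup:
                self.mean = sum(self.seed) / self.warmup
                self.var = sum((g - self.mean) ** 2 for g in self.seed) / self.warmup
        else:
            std = math.sqrt(self.var) + 1e-8
            z_score = (raw_grad_norm - self.mean) / std
            if z_score > self.z_threshold:
                clipped = self.mean + self.z_threshold * std
                self.total_clips += 1
            # Update the statistics with the clipped norm so spikes do not inflate them.
            self.mean = self.alpha * self.mean + (1 - self.alpha) * clipped
            self.var = self.alpha * self.var + (1 - self.alpha) * (clipped - self.mean) ** 2
        return {
            "zclip/raw_grad_norm": raw_grad_norm,
            "zclip/clipped_grad_norm": clipped,
            "zclip/z_score": z_score,
            "zclip/total_clips": self.total_clips,
        }


clipper = ZScoreClipper(z_threshold=3.0, warmup=5)
for norm in [1.0, 1.2, 0.8, 1.1, 0.9]:
    clipper.step(norm)
spike = clipper.step(50.0)
print(spike)
```

In a real training loop the clipped value would be used as the maximum norm when rescaling gradients, for example via `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clipped)`.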
## HealingConfig Presets

```python
# Aggressive: for unstable training, low tolerance
config = HealingConfig.aggressive()
# nan_patience=1, zclip_z_threshold=2.0, max_recovery_attempts=10

# Conservative: only intervene on clear failures
config = HealingConfig.conservative()
# nan_patience=10, loss_spike_factor=10.0, zclip_z_threshold=4.0, max_recovery_attempts=2

# Custom
config = HealingConfig(
    nan_patience=5,
    loss_spike_factor=8.0,
    divergence_patience=100,
    max_recovery_attempts=3,
    zclip_enabled=True,
    zclip_z_threshold=3.0,
)
```

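One common way to structure presets like these is a dataclass with classmethod constructors. The sketch below is a guess at the pattern, not the actual `HealingConfig`: the class is deliberately named `HealingConfigSketch`, and every default value is an assumption except for the preset values quoted in the comments above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HealingConfigSketch:
    # Defaults below are assumptions for illustration, not documented SHTS defaults.
    nan_patience: int = 3
    loss_spike_factor: float = 5.0
    divergence_patience: int = 50
    max_recovery_attempts: int = 5
    zclip_enabled: bool = True
    zclip_z_threshold: float = 3.0

    @classmethod
    def aggressive(cls) -> "HealingConfigSketch":
        # Preset values as listed in the README comments.
        return cls(nan_patience=1, zclip_z_threshold=2.0, max_recovery_attempts=10)

    @classmethod
    def conservative(cls) -> "HealingConfigSketch":
        return cls(
            nan_patience=10,
            loss_spike_factor=10.0,
            zclip_z_threshold=4.0,
            max_recovery_attempts=2,
        )


# Start from a preset and override a single field.
tuned = replace(HealingConfigSketch.conservative(), max_recovery_attempts=3)
print(tuned)
```

Making the dataclass frozen keeps presets immutable, so a tweak always produces a new object through `dataclasses.replace` instead of mutating a shared preset.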
## Compatibility

| Trainer | Status | Notes |
|---------|--------|-------|
| `SFTTrainer` (TRL) | ✅ Full | All metrics captured |
| `DPOTrainer` (TRL) | ✅ Full | DPO plateau detection (loss ≈ 0.693) |
| `GRPOTrainer` (TRL) | ✅ Full | Group reward monitoring |
| `PPOTrainer` (TRL) | ✅ Full | KL divergence tracking |
| `ORPOTrainer` (TRL) | ✅ Full | Odds ratio monitoring |
| `KTOTrainer` (TRL) | ✅ Full | Desirable/undesirable logps |
| `CPOTrainer` (TRL) | ✅ Full | Contrastive preference |
| `Trainer` (Transformers) | ✅ Full | Standard ML training |

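The 0.693 figure used for DPO plateau detection follows from the shape of the DPO loss. For one preference pair the loss is `-log(sigmoid(beta * margin))`, where the margin is the difference between the policy-vs-reference log-ratios of the chosen and rejected responses. When the margin is zero the loss is `-log(0.5) = ln 2 ≈ 0.693`. The sketch below checks this numerically; `is_plateaued` and its tolerance are hypothetical and not taken from SHTS.

```python
import math


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)) = log(1 + exp(-beta * margin))."""
    return math.log1p(math.exp(-beta * margin))


def is_plateaued(recent_losses, tol: float = 0.005) -> bool:
    """Hypothetical check: every recent loss sits within tol of ln 2."""
    return len(recent_losses) > 0 and all(
        abs(loss - math.log(2)) < tol for loss in recent_losses
    )


print(round(dpo_loss(0.0), 3))               # 0.693
print(is_plateaued([0.693, 0.694, 0.692]))   # True
print(is_plateaued([0.693, 0.61, 0.55]))     # False
```

A positive margin pushes the loss below ln 2, so a loss that stays pinned at 0.693 suggests the policy has not yet moved away from the reference model on the preference data.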
## Architecture

```
SelfHealingTrainer.train()
│
├── dry_run()                  ← Validate setup first
│
└── while not converged:
    │
    ├── trainer.train()        ← Run training
    │   │
    │   ├── on_step_end        ← Detect NaN, spikes, divergence
    │   ├── on_log             ← Monitor gradients (ZClip)
    │   ├── on_evaluate        ← Check overfitting
    │   └── on_exception       ← Catch OOM, API, data errors
    │
    ├── [recovery needed?]
    │   ├── diagnose           ← Classify failure type
    │   ├── heal               ← Apply recovery actions
    │   └── retry              ← resume_from_checkpoint=True
    │
    └── [converged]            ← Done!
```

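The retry loop in the diagram can be sketched without any ML dependencies. Note that the standard `transformers.TrainerCallback` interface has no `on_exception` event, so this sketch assumes the wrapper catches exceptions around the training call itself. All names here (`run_with_recovery`, `train_once`, `diagnose`, `heal`) are hypothetical and only illustrate the control flow.

```python
def run_with_recovery(train_once, diagnose, heal, max_recovery_attempts=5):
    """Call train_once() until it succeeds or the recovery budget is spent."""
    history = []
    for attempt in range(max_recovery_attempts + 1):
        try:
            # Retries resume from the last checkpoint, mirroring
            # resume_from_checkpoint=True in the diagram.
            return train_once(resume=attempt > 0), history
        except Exception as exc:
            if attempt == max_recovery_attempts:
                raise  # budget exhausted: surface the original error
            failure = diagnose(exc)   # classify the failure type
            heal(failure)             # apply recovery actions
            history.append(failure)


# Toy demonstration: "training" fails with a simulated OOM until the batch size is 2.
state = {"batch_size": 8}


def fake_train(resume):
    if state["batch_size"] > 2:
        raise MemoryError("simulated OOM")
    return f"finished with batch_size={state['batch_size']} (resume={resume})"


def classify(exc):
    return "oom" if isinstance(exc, MemoryError) else "unknown"


def apply_fix(failure):
    if failure == "oom":
        state["batch_size"] //= 2


result, history = run_with_recovery(fake_train, classify, apply_fix)
print(result)   # finished with batch_size=2 (resume=True)
print(history)  # ['oom', 'oom']
```

Bounding the loop with `max_recovery_attempts` and re-raising on the last attempt keeps a failure that cannot be healed from retrying forever.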
## References

| Paper | ID | Contribution |
|-------|-----|-------------|
| Unicron | arXiv:2401.00134 | Cost-aware self-healing at cluster scale, error taxonomy (4 types), elastic scaling |
| ZClip | arXiv:2504.02507 | Z-score adaptive gradient clipping, eliminates catastrophic loss spikes |
| AdaGC | arXiv:2502.11034 | Per-tensor adaptive gradient clipping, optimizer-agnostic |
| Pioneer Agent | arXiv:2604.09791 | Structured decision tree by score buckets for autonomous iteration |
| Deep Researcher | arXiv:2604.05854 | Dry-run validation, zero-cost monitoring, constant-size memory |
| CheckFree | arXiv:2506.15461 | Pipeline-parallel recovery via neighbor averaging |
| DPO | Rafailov et al. (2023) | DPO plateau at 0.693 = random chance (Section 4.2) |
| PTT | [post-training-toolkit](https://github.com/microsoft/post-training-toolkit) | DiagnosticsCallback + postmortem pattern |

## License

MIT. Use freely; attribution appreciated.

---

Built autonomously by ML Intern. Questions? Open an issue on the Hub.