# Self-Healing Training System (SHTS)

> **Fully autonomous debugging and error recovery for Hugging Face TRL trainers. Add one callback, wrap with `SelfHealingTrainer`, and cut debugging costs to near zero.**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![HF Hub](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Space-blue)](https://huggingface.co/ScottzillaSystems/self-healing-training)

---

## The Problem

ML training fails constantly:
- **CUDA OOM** kills jobs at step 847/1000 — restart from scratch
- **NaN loss** silently corrupts models — discovered hours later
- **Loss spikes** cascade into divergence — manual intervention required
- **DPO plateau** at 0.693 loss (= random chance) — wasted GPU hours
- **No postmortem** — "what step did it die on?"

Each failure costs **developer time + GPU credits + schedule delay**. At scale, this is millions in wasted compute.

## The Solution

SHTS wraps any Hugging Face TRL trainer with four autonomous layers:

```
┌─────────────────────────────────────────┐
│  LAYER 4: ORCHESTRATION                 │
│  SelfHealingTrainer retry loop          │
│  while not converged: try → recover     │
├─────────────────────────────────────────┤
│  LAYER 3: RECOVERY                      │
│  HealingActions: rollback, halve LR,    │
│  halve batch, reclip, clear cache       │
├─────────────────────────────────────────┤
│  LAYER 2: DIAGNOSIS                     │
│  Root-cause classifier: NaN/divergence/ │
│  OOM/data/API — with literature refs    │
├─────────────────────────────────────────┤
│  LAYER 1: DETECTION                     │
│  SelfHealingCallback: loss, gradients,  │
│  memory, ZClip adaptive clipping        │
└─────────────────────────────────────────┘
```
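Layer 1's adaptive clipping can be sketched as a z-score filter over an exponential moving average of gradient norms, in the spirit of ZClip (arxiv:2504.02507). This is a hedged illustration, not the library's implementation: the class name, EMA coefficient, and warm-up behavior here are our assumptions.

```python
class ZClipSketch:
    """Sketch of z-score adaptive gradient clipping (after ZClip).

    Keeps an EMA of the gradient-norm mean and variance; a norm whose
    z-score exceeds the threshold is clipped back to mean + z * std.
    """

    def __init__(self, z_threshold=3.0, alpha=0.97):
        self.z = z_threshold
        self.alpha = alpha      # EMA smoothing coefficient (assumed)
        self.mean = None        # running mean of gradient norms
        self.var = 0.0          # running variance of gradient norms
        self.total_clips = 0

    def clip_value(self, grad_norm):
        if self.mean is None:   # warm-up: first observation seeds the mean
            self.mean = grad_norm
            return grad_norm
        std = self.var ** 0.5
        z = (grad_norm - self.mean) / std if std > 0 else 0.0
        clipped = grad_norm
        if z > self.z:          # positive outlier: clip to the adaptive ceiling
            clipped = self.mean + self.z * std
            self.total_clips += 1
        # update EMA statistics with the clipped value, so one spike
        # does not inflate the running mean
        self.mean = self.alpha * self.mean + (1 - self.alpha) * clipped
        self.var = self.alpha * self.var + (1 - self.alpha) * (clipped - self.mean) ** 2
        return clipped
```

Updating the statistics with the *clipped* norm is the design choice that keeps a single spike from raising the ceiling for subsequent steps.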

## Quick Start

```bash
pip install git+https://huggingface.co/ScottzillaSystems/self-healing-training
```

```python
from self_healing import SelfHealingTrainer, HealingConfig
from trl import SFTTrainer, SFTConfig

# Your normal training setup
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./output",
        learning_rate=2e-5,
        per_device_train_batch_size=4,
    ),
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# Wrap with self-healing — that's it!
sh = SelfHealingTrainer(
    trainer,
    HealingConfig(
        max_recovery_attempts=5,
        zclip_enabled=True,
    ),
)

# Optional: dry-run to catch config errors before full training
sh.dry_run(num_steps=2)

# Train with full autonomy
result = sh.train()
```

## What Handles What

| Failure | Detection | Recovery | Paper |
|---------|-----------|----------|-------|
| **NaN loss** | `math.isnan(loss)` after each step | Rollback → halve LR → enable grad clip | ZClip arxiv:2504.02507 |
| **CUDA OOM** | `on_exception` catches `OutOfMemoryError` | Halve batch (effective batch preserved via gradient accumulation) → gradient checkpointing → clear cache | Unicron arxiv:2401.00134 |
| **Loss spike** | Loss > 5× running mean over window | ZClip adaptive gradient clipping → emergency checkpoint | ZClip arxiv:2504.02507 |
| **Divergence** | Loss increasing for N consecutive steps | Rollback → halve LR | Pioneer Agent arxiv:2604.09791 |
| **Gradient explosion** | `grad_norm > 100` | ZClip → enable max_grad_norm=1.0 | AdaGC arxiv:2502.11034 |
| **DPO plateau** | `loss ≈ 0.693` (random chance) | Increase LR 2-5× → check data quality | Rafailov et al. (2023) |
| **Overfitting** | `eval_loss - train_loss > 2.0` | Alert with actionable recommendation | Standard practice |
| **API errors** | Exception with "api/network/timeout" | Exponential backoff (30s → 60s → 120s → ...) | Standard pattern |
| **Data errors** | Exception with "shape/dimension/index" | Skip batch → log bad sample | Deep Researcher arxiv:2604.05854 |
| **Crash postmortem** | Always | `postmortem.json` with exit reason, last step, metrics, recovery history | PTT pattern |
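The detection rules in the table can be sketched as a small root-cause classifier. This is a hedged illustration of the logic, not the library's API: the `Failure` enum, the `diagnose` function, and its parameter names are ours.

```python
import math
from enum import Enum

class Failure(Enum):
    NAN_LOSS = "nan_loss"
    OOM = "oom"
    LOSS_SPIKE = "loss_spike"
    DIVERGENCE = "divergence"
    GRAD_EXPLOSION = "grad_explosion"
    DPO_PLATEAU = "dpo_plateau"
    UNKNOWN = "unknown"

def diagnose(loss, grad_norm, loss_history, spike_factor=5.0,
             divergence_patience=50, is_dpo=False, exc=None):
    """Classify a training failure using the thresholds from the table."""
    if exc is not None and "out of memory" in str(exc).lower():
        return Failure.OOM
    if math.isnan(loss):
        return Failure.NAN_LOSS
    if grad_norm is not None and grad_norm > 100:
        return Failure.GRAD_EXPLOSION
    # loss spike: current loss exceeds spike_factor times the running mean
    if loss_history and loss > spike_factor * (sum(loss_history) / len(loss_history)):
        return Failure.LOSS_SPIKE
    # divergence: loss strictly increasing for N consecutive steps
    recent = loss_history[-divergence_patience:]
    if len(recent) == divergence_patience and all(b > a for a, b in zip(recent, recent[1:])):
        return Failure.DIVERGENCE
    if is_dpo and abs(loss - 0.693) < 0.005:  # ln(2): random-chance DPO loss
        return Failure.DPO_PLATEAU
    return Failure.UNKNOWN
```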

## Crash Postmortem

Every training interruption produces a `postmortem.json`:

```json
{
  "exit_reason": "exception",
  "exception_type": "OutOfMemoryError",
  "last_step": 847,
  "timestamp": "2026-04-30T15:26:04Z",
  "final_metrics": {"loss": 2.15, "grad_norm": 42.3},
  "recovery_actions": [
    {
      "failure": "oom",
      "diagnosis": "CUDA Out of Memory. Batch size exceeds GPU capacity.",
      "actions": ["halve_batch_size", "enable_gradient_checkpointing", "clear_cache"]
    }
  ],
  "running_time_seconds": 1847.3
}
```
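A consumer of the postmortem file only needs the schema shown above. As a minimal sketch (the `summarize_postmortem` helper is ours, not part of SHTS):

```python
import json
from pathlib import Path

def summarize_postmortem(path="postmortem.json"):
    """Return a short text summary of the last run's exit and recovery actions."""
    report = json.loads(Path(path).read_text())
    lines = [f"Exited at step {report['last_step']} ({report['exit_reason']})"]
    for event in report.get("recovery_actions", []):
        lines.append(f"  {event['failure']}: {', '.join(event['actions'])}")
    return "\n".join(lines)
```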

## Trackio Integration

Set `report_to="trackio"` in your training args. SHTS emits:

- **Alerts** at every decision point (INFO/WARN/ERROR)
- **Metrics**: `healing/recovery_attempts`, `healing/nan_count`, `healing/loss_spike_ratio`, `healing/eval_gap`
- **ZClip metrics**: `zclip/raw_grad_norm`, `zclip/clipped_grad_norm`, `zclip/z_score`, `zclip/total_clips`

Dashboard URL: `https://huggingface.co/spaces/<username>/<trackio-space>`

## HealingConfig Presets

```python
# Aggressive — for unstable training, low tolerance
config = HealingConfig.aggressive()
# nan_patience=1, zclip_z_threshold=2.0, max_recovery_attempts=10

# Conservative — only intervene on clear failures
config = HealingConfig.conservative()
# nan_patience=10, loss_spike_factor=10.0, zclip_z_threshold=4.0, max_recovery_attempts=2

# Custom
config = HealingConfig(
    nan_patience=5,
    loss_spike_factor=8.0,
    divergence_patience=100,
    max_recovery_attempts=3,
    zclip_enabled=True,
    zclip_z_threshold=3.0,
)
```
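The presets above could be implemented as classmethod constructors on a dataclass. A minimal sketch, assuming the field names shown in this README; the default values for fields the presets do not override are our assumptions:

```python
from dataclasses import dataclass

@dataclass
class HealingConfig:
    nan_patience: int = 3
    loss_spike_factor: float = 5.0
    divergence_patience: int = 50
    max_recovery_attempts: int = 5
    zclip_enabled: bool = True
    zclip_z_threshold: float = 3.0

    @classmethod
    def aggressive(cls):
        # Low tolerance: intervene on the first NaN, clip early, retry often.
        return cls(nan_patience=1, zclip_z_threshold=2.0, max_recovery_attempts=10)

    @classmethod
    def conservative(cls):
        # Only intervene on clear failures; tolerate noise.
        return cls(nan_patience=10, loss_spike_factor=10.0,
                   zclip_z_threshold=4.0, max_recovery_attempts=2)
```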

## Compatibility

| Trainer | Status | Notes |
|---------|--------|-------|
| `SFTTrainer` (TRL) | ✅ Full | All metrics captured |
| `DPOTrainer` (TRL) | ✅ Full | DPO plateau detection (loss ≈ 0.693) |
| `GRPOTrainer` (TRL) | ✅ Full | Group reward monitoring |
| `PPOTrainer` (TRL) | ✅ Full | KL divergence tracking |
| `ORPOTrainer` (TRL) | ✅ Full | Odds ratio monitoring |
| `KTOTrainer` (TRL) | ✅ Full | Desirable/undesirable logps |
| `CPOTrainer` (TRL) | ✅ Full | Contrastive preference |
| `Trainer` (Transformers) | ✅ Full | Standard ML training |

## Architecture

```
SelfHealingTrainer.train()
  │
  ├── dry_run()              ← Validate setup first
  │
  └── while not converged:
      │
      ├── trainer.train()    ← Run training
      │     │
      │     ├── on_step_end  ← Detect NaN, spikes, divergence
      │     ├── on_log       ← Monitor gradients (ZClip)
      │     ├── on_evaluate  ← Check overfitting
      │     └── on_exception ← Catch OOM, API, data errors
      │
      ├── [recovery needed?]
      │     ├── diagnose     ← Classify failure type
      │     ├── heal         ← Apply recovery actions
      │     └── retry        ← resume_from_checkpoint=True
      │
      └── [converged]        ← Done!
```
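The orchestration loop above can be sketched in a few lines; this is an illustration of the control flow under stated assumptions, not the library's code (the `SelfHealingLoop` class and the three injected callables are ours):

```python
class SelfHealingLoop:
    """Sketch of the orchestration layer: run, diagnose, heal, resume."""

    def __init__(self, train_fn, diagnose_fn, heal_fn, max_attempts=5):
        self.train_fn = train_fn        # runs one training attempt
        self.diagnose_fn = diagnose_fn  # classifies a caught exception
        self.heal_fn = heal_fn          # applies recovery actions
        self.max_attempts = max_attempts
        self.history = []               # recovery actions taken so far

    def run(self):
        resume = False
        for _ in range(self.max_attempts + 1):
            try:
                return self.train_fn(resume_from_checkpoint=resume)
            except Exception as exc:
                failure = self.diagnose_fn(exc)
                self.heal_fn(failure)
                self.history.append(failure)
                resume = True  # retry from the last good checkpoint
        raise RuntimeError(f"gave up after {self.max_attempts} recovery attempts")
```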

## References

| Paper | ID | Contribution |
|-------|-----|-------------|
| Unicron | arxiv:2401.00134 | Cost-aware self-healing at cluster scale, error taxonomy (4 types), elastic scaling |
| ZClip | arxiv:2504.02507 | Z-score adaptive gradient clipping, eliminates catastrophic loss spikes |
| AdaGC | arxiv:2502.11034 | Per-tensor adaptive gradient clipping, optimizer-agnostic |
| Pioneer Agent | arxiv:2604.09791 | Structured decision tree by score buckets for autonomous iteration |
| Deep Researcher | arxiv:2604.05854 | Dry-run validation, zero-cost monitoring, constant-size memory |
| CheckFree | arxiv:2506.15461 | Pipeline-parallel recovery via neighbor averaging |
| DPO | Rafailov et al. (2023) | DPO plateau at 0.693 = random chance (Section 4.2) |
| PTT | [post-training-toolkit](https://github.com/microsoft/post-training-toolkit) | DiagnosticsCallback + postmortem pattern |

## License

MIT — use freely, attribution appreciated.

---

Built autonomously by ML Intern. Questions? Open an issue on the Hub.