"""
Self-Healing Training System (SHTS)
===================================

A fully autonomous self-healing layer for Hugging Face TRL trainers.

Architecture:
  1. DETECTION  — SelfHealingCallback monitors loss, gradients, OOM, memory
  2. DIAGNOSIS  — Root-cause classifier: NaN/divergence/OOM/data/API errors
  3. RECOVERY   — HealingActions applies fixes: rollback, reduce LR, halve batch
  4. ORCHESTRATION — SelfHealingTrainer retry loop with state persistence

Based on:
  - Unicron (arxiv:2401.00134): Cost-aware self-healing at cluster scale
  - ZClip (arxiv:2504.02507): Z-score adaptive gradient clipping
  - PTT (post-training-toolkit): DiagnosticsCallback + postmortem pattern
  - Pioneer Agent (arxiv:2604.09791): Structured decision tree for iteration
  - Deep Researcher (arxiv:2604.05854): Dry-run validation pattern

Usage:
    from self_healing import SelfHealingTrainer, HealingConfig
    from trl import SFTTrainer, SFTConfig

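    # model, ds, tok, and training_args (an SFTConfig) are assumed to be defined already.
    # Newer TRL releases name the tokenizer argument `processing_class`.
    trainer = SFTTrainer(model=model, args=training_args, train_dataset=ds, tokenizer=tok)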
    sh = SelfHealingTrainer(trainer, HealingConfig())
    sh.train()

Author: Autonomous ML Intern
"""

from .core import (
    HealingConfig,
    SelfHealingCallback,
    HealingActions,
    SelfHealingTrainer,
    ZClip,
    FailureType,
    FAILURE_RECIPES,
)

__version__ = "1.0.0"
__all__ = [
    "HealingConfig",
    "SelfHealingCallback",
    "HealingActions",
    "SelfHealingTrainer",
    "ZClip",
    "FailureType",
    "FAILURE_RECIPES",
]