"""
Self-Healing Training System (SHTS)
===================================

A fully autonomous self-healing layer for Hugging Face TRL trainers.

Architecture:
    1. DETECTION — SelfHealingCallback monitors loss, gradients, OOM, memory
    2. DIAGNOSIS — Root-cause classifier: NaN/divergence/OOM/data/API errors
    3. RECOVERY — HealingActions applies fixes: rollback, reduce LR, halve batch
    4. ORCHESTRATION — SelfHealingTrainer retry loop with state persistence
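
The four stages above can be read as a retry loop around a training run.
The following is a minimal, self-contained sketch of that control flow, not
the implementation in .core. Failure, diagnose, recover, train_with_healing,
the 10x divergence threshold, and the use of MemoryError as a stand-in for a
GPU out-of-memory error are all hypothetical choices made for illustration.

```python
import math
from enum import Enum


class Failure(Enum):
    # Hypothetical failure classes; the real FailureType may differ.
    NAN_LOSS = "nan_loss"
    DIVERGENCE = "divergence"
    OOM = "oom"


def diagnose(error=None, loss=None, best_loss=None):
    # DIAGNOSIS: map a raised error or a suspicious loss to a failure class.
    if isinstance(error, MemoryError):
        return Failure.OOM
    if loss is not None and math.isnan(loss):
        return Failure.NAN_LOSS
    if loss is not None and best_loss is not None and loss > 10 * best_loss:
        return Failure.DIVERGENCE
    return None


def recover(failure, settings):
    # RECOVERY: return adjusted settings; the input dict is left untouched.
    new = dict(settings)
    if failure is Failure.OOM:
        new["batch_size"] = max(1, settings["batch_size"] // 2)
    else:
        new["lr"] = settings["lr"] * 0.5
    return new


def train_with_healing(run_once, settings, max_retries=3):
    # ORCHESTRATION: run_once(settings) returns a final loss or raises.
    for _ in range(max_retries + 1):
        try:
            loss = run_once(settings)          # DETECTION happens on the result
            failure = diagnose(loss=loss)
        except MemoryError as exc:
            failure = diagnose(error=exc)
        if failure is None:
            return loss, settings
        settings = recover(failure, settings)
    raise RuntimeError("training did not recover after retries")
```

Returning a new settings dict from recover, rather than mutating in place,
keeps each attempt's configuration available for a postmortem.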

Based on:
    - Unicron (arxiv:2401.00134): Cost-aware self-healing at cluster scale
    - ZClip (arxiv:2504.02507): Z-score adaptive gradient clipping
    - PTT (post-training-toolkit): DiagnosticsCallback + postmortem pattern
    - Pioneer Agent (arxiv:2604.09791): Structured decision tree for iteration
    - Deep Researcher (arxiv:2604.05854): Dry-run validation pattern
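
As a rough illustration of the z-score clipping idea credited to ZClip above,
the sketch below tracks an exponential moving average of the gradient norm and
caps any norm whose z-score exceeds a threshold. It is a simplified
approximation, not the paper's exact algorithm and not the ZClip class
exported by this package. ZScoreClipper and its defaults (alpha, z_thresh,
warmup) are assumptions.

```python
import math


class ZScoreClipper:
    # Hypothetical helper for illustration only.
    def __init__(self, alpha=0.97, z_thresh=2.5, warmup=5):
        self.alpha = alpha          # EMA decay for mean and variance
        self.z_thresh = z_thresh    # z-score above which a norm is a spike
        self.warmup = warmup        # steps used to seed the statistics
        self.history = []
        self.mean = None
        self.var = None

    def _update(self, value):
        # EMA update of the running mean and variance.
        delta = value - self.mean
        self.mean = self.alpha * self.mean + (1 - self.alpha) * value
        self.var = self.alpha * self.var + (1 - self.alpha) * delta * delta

    def clip(self, grad_norm):
        # Return the norm that gradients should be rescaled to.
        if self.mean is None:
            # Warmup: collect norms, then seed mean and variance from them.
            self.history.append(grad_norm)
            if len(self.history) >= self.warmup:
                n = len(self.history)
                self.mean = sum(self.history) / n
                self.var = sum((g - self.mean) ** 2 for g in self.history) / n
            return grad_norm
        std = math.sqrt(self.var) + 1e-8
        z = (grad_norm - self.mean) / std
        if z > self.z_thresh:
            # Spike: cap the norm at mean + z_thresh * std.
            grad_norm = self.mean + self.z_thresh * std
        # Statistics are updated with the (possibly clipped) norm.
        self._update(grad_norm)
        return grad_norm
```

In a PyTorch loop, the returned value could be passed as max_norm to
torch.nn.utils.clip_grad_norm_ before the optimizer step.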
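
Usage:
    from self_healing import SelfHealingTrainer, HealingConfig
    from trl import SFTTrainer, SFTConfig

    # model, ds, and tok are assumed to be loaded already.
    training_args = SFTConfig(output_dir="outputs")  # example output directory
    # Recent TRL releases name the tokenizer argument `processing_class`.
    trainer = SFTTrainer(model=model, args=training_args, train_dataset=ds, tokenizer=tok)
    sh = SelfHealingTrainer(trainer, HealingConfig())
    sh.train()

Author: Autonomous ML Intern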
"""

from .core import (
    HealingConfig,
    SelfHealingCallback,
    HealingActions,
    SelfHealingTrainer,
    ZClip,
    FailureType,
    FAILURE_RECIPES,
)

__version__ = "1.0.0"

__all__ = [
    "HealingConfig",
    "SelfHealingCallback",
    "HealingActions",
    "SelfHealingTrainer",
    "ZClip",
    "FailureType",
    "FAILURE_RECIPES",
]