---
title: MLOps Pipeline Debugger
emoji: 🔧
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: '1.0'
python_version: '3.11'
app_file: app.py
pinned: false
---

# MLOps Pipeline Debugger

OpenEnv · Python 3.11 · License: MIT

## Latest Baseline Scores

| Task | Score |
|------|-------|
| Easy | 0.91 |
| Medium | 0.85 |
| Hard | 1.00 |
| **Average** | **0.92** |

*Tested with Gemini 2.5 Flash, with Gemini 3.1 Pro Preview as a fallback for the hard task.*

An OpenEnv-compatible reinforcement learning environment where an AI agent acts as a senior ML engineer diagnosing a broken training run.


## What Is This?

Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown. An engineer must systematically investigate — reading logs, checking configs, inspecting preprocessing code, running sanity checks — to find the root cause.

This environment simulates that investigation. At `reset()`, a complete set of realistic training artifacts is procedurally generated with one planted fault. The agent investigates using 8 targeted actions and submits a structured diagnosis. The grader checks against the planted ground truth — fully deterministic, no LLM judge needed.

**9 distinct bug types across 3 tasks.** Every episode can have a different bug. Scores vary continuously from 0.0 to 1.0 based on diagnosis precision.


## Environment Design

### Procedural Artifact Generation

Every episode generates 6 realistic training artifacts from scratch:

| Artifact | Contents |
|----------|----------|
| `config.yaml` | Model arch, optimizer, LR, batch size, scheduler, augmentation |
| `train.log` | Epoch-by-epoch loss/accuracy/gradient norms with realistic timestamps |
| `dataset_stats.json` | Split sizes, class distribution, overlap counts, feature statistics |
| `preprocessing.py` | Full sklearn/PyTorch preprocessing pipeline code |
| `eval_results.json` | Final val/test metrics with hardware info |
| `model_card.json` | Architecture summary, tokenizer version, preprocessing config |

Artifacts are internally consistent — config matches logs, dataset stats match preprocessing code — except for the one planted fault. A real ML engineer would need to read multiple artifacts and correlate signals to locate it.
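A minimal sketch of how such generation could work, using hypothetical helper names and the easy-task bug pool (the real generator also emits logs, stats, and pipeline code that reflect the planted fault):

```python
import random


def generate_artifacts(seed: int) -> dict:
    """Sketch: build a consistent config, then plant one randomly chosen fault."""
    rng = random.Random(seed)
    config = {"learning_rate": 0.001, "batch_size": 64, "optimizer": "adam"}
    bug = rng.choice(["exploding_lr", "wrong_optimizer", "batch_size_overflow"])
    if bug == "exploding_lr":
        config["learning_rate"] = 50.0   # planted fault: loss diverges to NaN
    elif bug == "wrong_optimizer":
        config["optimizer"] = "sgd_momentum_0.99"
    else:
        config["batch_size"] = 4096      # planted fault: exceeds dataset size
    # Logs and stats would be derived FROM the faulty config,
    # keeping all artifacts internally consistent with the bug.
    return {"config.yaml": config, "ground_truth": bug}


# Same seed, same bug, same artifacts
assert generate_artifacts(42) == generate_artifacts(42)
```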


### Action Space

```python
from typing import Literal

from pydantic import BaseModel


class MLOpsAction(BaseModel):
    action_type: Literal[
        "read_config",          # Full config.yaml
        "read_logs",            # Training logs (filterable: keyword or "epoch:N-M")
        "check_dataset_stats",  # Split sizes, class distribution, overlap counts
        "inspect_preprocessing",# Full preprocessing pipeline code
        "read_eval_results",    # Final val/test metrics
        "run_sanity_check",     # Computed diagnostic (see types below)
        "query_artifact",       # Specific field from any artifact (dot notation)
        "submit_diagnosis",     # Final answer — triggers grading
    ]

    # Sanity check types:
    # label_consistency | data_leakage | gradient_norms | class_balance
    # feature_statistics | encoder_version_match | loss_trajectory | metric_gap_analysis

    # submit_diagnosis fields:
    # failure_category | root_cause_file | root_cause_field | diagnosis | proposed_fix
```
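Client code can validate an `action_type` before sending a request; a stdlib-only sketch using `typing.get_args` on an equivalent `Literal` (the helper name is illustrative, not part of the environment's API):

```python
from typing import Literal, get_args

# Mirror of the action_type Literal in the schema above
ActionType = Literal[
    "read_config", "read_logs", "check_dataset_stats", "inspect_preprocessing",
    "read_eval_results", "run_sanity_check", "query_artifact", "submit_diagnosis",
]
VALID_ACTIONS = set(get_args(ActionType))


def is_valid_action(name: str) -> bool:
    """Check a candidate action name against the Literal's allowed values."""
    return name in VALID_ACTIONS


assert is_valid_action("read_logs")
assert not is_valid_action("delete_artifact")
```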

### Observation Space

```python
from typing import Any, Dict, List

from pydantic import BaseModel

# ArtifactMeta is the artifact metadata model defined alongside these schemas


class MLOpsObservation(BaseModel):
    task_id: str                          # easy | medium | hard
    task_description: str                 # Full task brief with investigation strategy
    run_id: str                           # Unique run identifier
    run_summary: Dict[str, Any]           # Model, dataset, training status
    available_artifacts: List[ArtifactMeta]  # What can be read
    artifacts_read: List[str]             # Investigation progress
    last_action_result: Dict[str, Any]    # Full content of last action
    step_count: int
    max_steps: int
    done: bool
    messages: List[str]                   # System warnings (duplicate reads, etc.)
```

## Tasks

### Task 1 — Config Error Diagnosis (easy)

Bug pool (one picked randomly per episode):

  • exploding_lr β€” learning_rate: 50.0 causes loss β†’ NaN by epoch 3
  • wrong_optimizer β€” SGD(momentum=0.99) causes oscillation with no convergence
  • batch_size_overflow β€” batch_size: 4096 exceeds dataset size, val accuracy 99.9% trivially

**Signal:** Visible immediately in training logs. Loss curve or accuracy values are obviously wrong.

**Optimal strategy:** `read_logs` → `run_sanity_check(loss_trajectory)` → `read_config` → `submit_diagnosis`

**Max steps:** 20 | **Expected baseline score:** ~0.42
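As an illustration, a `loss_trajectory`-style check can be approximated in a few lines (a hedged sketch, not the environment's actual implementation):

```python
import math


def loss_trajectory_check(losses: list[float]) -> dict:
    """Flag non-finite loss values and loss that grows instead of shrinking."""
    findings = []
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        findings.append("non-finite loss detected")
    if len(losses) >= 2 and losses[-1] > losses[0]:
        findings.append("loss increased over training")
    return {"ok": not findings, "findings": findings}


# An exploding-LR run: loss diverges and hits NaN by epoch 3
print(loss_trajectory_check([2.3, 41.7, float("nan")]))
```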


### Task 2 — Data Leakage Detection (medium)

Bug pool:

  • data_leakage_scaler β€” StandardScaler.fit_transform(X_full) called before train/val split
  • data_leakage_overlap β€” train_test_split(random_state=None) produces non-deterministic overlapping splits
  • wrong_split_ratio β€” test_size=0.8 trains on 20% and evaluates on 80% (inverted)

**Signal:** Val accuracy suspiciously high from epoch 1 in logs; val/test gap in eval results; sample overlap count in dataset stats.

**Optimal strategy:** `read_logs` → `read_eval_results` → `run_sanity_check(data_leakage)` → `inspect_preprocessing` → `submit_diagnosis`

**Max steps:** 30 | **Expected baseline score:** ~0.28
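The `data_leakage_scaler` pattern can be illustrated with a pure-Python stand-in for `StandardScaler.fit` (a toy mean in place of full standardization):

```python
def fit_mean(xs: list[float]) -> float:
    """Toy stand-in for a scaler's fit step: learn a statistic from data."""
    return sum(xs) / len(xs)


train, test = [1.0, 2.0, 3.0], [100.0]

# Leaky: statistic computed on train + test BEFORE the split,
# like StandardScaler.fit_transform(X_full)
leaky_mu = fit_mean(train + test)

# Correct: statistic computed on the training split only
clean_mu = fit_mean(train)

# Test-set values influenced the "fitted" scaler, so evaluation is optimistic
assert leaky_mu != clean_mu
```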


### Task 3 — Silent Evaluation Bug (hard)

Bug pool:

  • label_encoder_mismatch β€” Train/eval use different LabelEncoder.fit() orderings β†’ silent wrong predictions
  • silent_metric_swap β€” val_accuracy and test_accuracy assignments are swapped in eval code
  • tokenizer_version_drift β€” Training uses tokenizer v2, eval uses v1 β†’ 847 tokens map to [UNK]

**Signal:** Training logs look completely normal. Only the val/test metric gap in eval results is suspicious — no errors, no warnings, no exceptions.

**Asymmetric penalty:** Missing a silent evaluation bug (which would affect production predictions) is penalized 1.5× — mirroring real incident severity weighting.

**Optimal strategy:** `read_eval_results` → `run_sanity_check(metric_gap_analysis)` → `inspect_preprocessing` → `run_sanity_check(label_consistency OR encoder_version_match)` → `submit_diagnosis`

**Max steps:** 40 | **Expected baseline score:** ~0.15
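The `label_encoder_mismatch` failure can be shown with a toy encoder that assigns indices in first-seen order (unlike sklearn's `LabelEncoder`, which sorts classes; the toy makes the ordering dependence explicit):

```python
def fit_label_map(labels: list[str]) -> dict:
    """Toy label encoder: class -> index by first-seen order."""
    mapping: dict = {}
    for lab in labels:
        mapping.setdefault(lab, len(mapping))
    return mapping


train_enc = fit_label_map(["cat", "dog", "bird"])   # cat=0, dog=1, bird=2
eval_enc = fit_label_map(["bird", "cat", "dog"])    # bird=0, cat=1, dog=2

# The model predicts index 0 meaning "cat", but eval decodes index 0 as "bird".
# No exception is raised anywhere: the bug is completely silent.
assert train_enc["cat"] == 0
assert eval_enc["bird"] == 0
```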


## Reward Function

Dense per-step rewards (not sparse):

```text
+0.02  First time reading an artifact (rewards systematic exploration)
-0.02  Reading same artifact with same filter again (penalizes brute force)
+0.01  Running a new sanity check (rewards diagnostic reasoning)
```

At `submit_diagnosis`:

```text
+0.15  Correct failure_category  (config_error / data_leakage / evaluation_bug / ...)
+0.25  Correct root_cause_file   (exact match)
+0.30  Correct root_cause_field  (substring match, case-insensitive)
+0.30  Correct proposed_fix      (keyword overlap with gold fix)
```

Task 3 modifier: if score < 0.70, an additional 0.5× penalty applies to missed components.

Score spectrum (verified):

| Diagnosis | Score |
|-----------|-------|
| All wrong | 0.00 |
| Category only | 0.10–0.15 |
| Category + file | 0.35–0.40 |
| Category + file + field | 0.65 |
| Perfect diagnosis | 0.90–1.00 |
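A minimal sketch of how these components could compose into a score (weights taken from the table above; the substring direction and the keyword-overlap threshold are assumptions, not the grader's actual rules):

```python
def grade(diag: dict, gold: dict) -> float:
    """Sum component weights for a diagnosis against the planted ground truth."""
    score = 0.0
    if diag.get("failure_category") == gold["failure_category"]:
        score += 0.15
    if diag.get("root_cause_file") == gold["root_cause_file"]:  # exact match
        score += 0.25
    # Substring match, case-insensitive (direction assumed: gold within answer)
    if gold["root_cause_field"].lower() in diag.get("root_cause_field", "").lower():
        score += 0.30
    # Keyword overlap with gold fix (50% threshold is illustrative)
    fix_words = set(diag.get("proposed_fix", "").lower().split())
    gold_words = set(gold["proposed_fix"].lower().split())
    if gold_words and len(fix_words & gold_words) / len(gold_words) >= 0.5:
        score += 0.30
    return round(score, 2)


gold = {
    "failure_category": "config_error",
    "root_cause_file": "config.yaml",
    "root_cause_field": "learning_rate",
    "proposed_fix": "lower the learning rate",
}
assert grade(gold, gold) == 1.0                            # perfect diagnosis
assert grade({"failure_category": "config_error"}, gold) == 0.15  # category only
```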

## Setup & Usage

### Docker (recommended)

```bash
docker build -t mlops-debug-env .
docker run -p 7860:7860 mlops-debug-env
curl http://localhost:7860/health
```

### Local Python

```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```

### Python Client

```python
# Sync usage
from client import MLOpsDebugEnv
from models import MLOpsAction

with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
    obs = env.reset(task_id="hard", seed=1)
    print(obs.task_description)

    # Investigate systematically
    r = env.step(MLOpsAction(action_type="read_eval_results"))
    print(r.observation.last_action_result["content"])

    r = env.step(MLOpsAction(
        action_type="run_sanity_check",
        sanity_check_type="metric_gap_analysis"
    ))
    # Reveals val/test gap anomaly

    r = env.step(MLOpsAction(action_type="inspect_preprocessing"))
    # Shows the buggy pipeline code

    r = env.step(MLOpsAction(
        action_type="submit_diagnosis",
        failure_category="label_mismatch",
        root_cause_file="preprocessing.py",
        root_cause_field="LabelEncoder.fit_order",
        diagnosis="Train and eval use different LabelEncoder orderings",
        proposed_fix="Use single LabelEncoder instance across both pipelines"
    ))
    print(f"Score: {r.info['score']}")
```

## Baseline Inference Script

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="hf_your_token_here"
export ENV_BASE_URL="http://localhost:7860"

python inference.py          # all 3 tasks, seed=42
python inference.py --task easy --seed 42
```

Output format:

```text
[START] task=easy env=mlops-debug-env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
[STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
[STEP] step=3 action=read_config reward=0.02 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=0.95 done=true error=null
[END] success=true steps=4 rewards=0.02,0.01,0.02,0.95
```
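These lines are easy to post-process; a small sketch that recovers the reward trajectory from the log format shown above (the regex assumes that exact `key=value` layout):

```python
import re

LOG = """[START] task=easy env=mlops-debug-env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
[STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
[STEP] step=3 action=read_config reward=0.02 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=0.95 done=true error=null
[END] success=true steps=4 rewards=0.02,0.01,0.02,0.95"""

# Pull per-step rewards out of the [STEP] lines
rewards = [float(m) for m in re.findall(r"reward=([0-9.]+)", LOG)]
assert rewards == [0.02, 0.01, 0.02, 0.95]
assert abs(sum(rewards) - 1.0) < 1e-9
```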

Baseline scores (Qwen2.5-72B-Instruct, seed=42):

| Task | Score | Notes |
|------|-------|-------|
| easy | ~0.42 | Gets category right, struggles with exact field name |
| medium | ~0.28 | Often identifies leakage but misidentifies exact mechanism |
| hard | ~0.15 | Silent bugs with normal training logs are genuinely hard |

## Why This Environment

**Real problem.** Debugging broken training runs is a core workflow for every ML team. The three bug categories in this environment — config errors, data leakage, silent evaluation bugs — are among the most common failure modes in production ML pipelines.

**Deterministic grading.** The planted bug is ground truth. Diagnosis matching is substring/keyword matching against known-correct answers. Zero subjectivity, zero LLM-as-judge, reproducible across runs.

**Genuinely hard for frontier models.** Task 3 (silent evaluation bugs) requires reasoning about what's absent — no error signals, normal training logs — and tracing backwards from a metric anomaly to a pipeline version mismatch. State-of-the-art models score ~0.15 without careful prompting.

**Seed-based reproducibility.** `reset(seed=42)` always produces the same bug, same artifacts, same grading. Baseline scores are reproducible to 4 decimal places.


## Environment Variables

| Variable | Description |
|----------|-------------|
| `API_BASE_URL` | LLM API endpoint (OpenAI-compatible) |
| `MODEL_NAME` | Model identifier |
| `HF_TOKEN` | Hugging Face / API token |
| `ENV_BASE_URL` | Environment server URL (default: `http://localhost:7860`) |

## License

MIT — see LICENSE