---
title: MLOps Pipeline Debugger
emoji: 🔧
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: '1.0'
python_version: '3.11'
app_file: app.py
pinned: false
---
# MLOps Pipeline Debugger

## Latest Baseline Scores
| Task | Score |
|---|---|
| Easy | 0.91 |
| Medium | 0.85 |
| Hard | 1.00 |
| Average | 0.92 |
*Tested with Gemini 2.5 Flash, with Gemini 3.1 Pro Preview as fallback for the hard task.*
An OpenEnv-compatible reinforcement learning environment where an AI agent acts as a senior ML engineer diagnosing a broken training run.
## What Is This?

Every ML team has experienced it: a training job finishes overnight and something is wrong. Loss exploded to NaN. Validation accuracy is suspiciously perfect at epoch 1. Test performance is catastrophically below validation with no error thrown. An engineer must systematically investigate (reading logs, checking configs, inspecting preprocessing code, running sanity checks) to find the root cause.

This environment simulates that investigation. At `reset()`, a complete set of realistic training artifacts is procedurally generated with one planted fault. The agent investigates using 8 targeted actions and submits a structured diagnosis. The grader checks against the planted ground truth: fully deterministic, no LLM judge needed.

**9 distinct bug types across 3 tasks.** Every episode can have a different bug. Scores vary continuously from 0.0 to 1.0 based on diagnosis precision.
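The seeded, deterministic setup can be sketched in a few lines. This is a minimal illustration of the idea, not the environment's actual generator; the pool contents mirror the task descriptions below.

```python
import random

# Illustrative bug pools, mirroring the per-task bug lists in this README.
BUG_POOLS = {
    "easy": ["exploding_lr", "wrong_optimizer", "batch_size_overflow"],
    "medium": ["data_leakage_scaler", "data_leakage_overlap", "wrong_split_ratio"],
    "hard": ["label_encoder_mismatch", "silent_metric_swap", "tokenizer_version_drift"],
}

def plant_bug(task_id: str, seed: int) -> str:
    """Pick one bug deterministically from the task's pool."""
    rng = random.Random(seed)  # seeded RNG -> reproducible episodes
    return rng.choice(BUG_POOLS[task_id])

# The same seed always plants the same bug; different seeds may differ.
assert plant_bug("hard", 42) == plant_bug("hard", 42)
```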
## Environment Design

### Procedural Artifact Generation
Every episode generates 6 realistic training artifacts from scratch:
| Artifact | Contents |
|---|---|
| `config.yaml` | Model arch, optimizer, LR, batch size, scheduler, augmentation |
| `train.log` | Epoch-by-epoch loss/accuracy/gradient norms with realistic timestamps |
| `dataset_stats.json` | Split sizes, class distribution, overlap counts, feature statistics |
| `preprocessing.py` | Full sklearn/PyTorch preprocessing pipeline code |
| `eval_results.json` | Final val/test metrics with hardware info |
| `model_card.json` | Architecture summary, tokenizer version, preprocessing config |
Artifacts are internally consistent (config matches logs, dataset stats match preprocessing code) except for the one planted fault. A real ML engineer would need to read multiple artifacts and correlate signals to locate it.
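Correlating artifacts can be as simple as checking whether an aggressive learning rate in the config lines up with a NaN loss in the logs. A hand-rolled sketch, using hypothetical artifact excerpts rather than the environment's real files:

```python
import math

# Hypothetical artifact excerpts for illustration.
config = {"optimizer": "adam", "learning_rate": 50.0, "batch_size": 32}
train_log = [
    {"epoch": 1, "loss": 2.31},
    {"epoch": 2, "loss": 418.7},
    {"epoch": 3, "loss": float("nan")},
]

def correlate_lr_with_nan(config, log, lr_threshold=1.0):
    """Flag an exploding-LR fault when a huge LR co-occurs with a NaN loss."""
    lr_suspicious = config["learning_rate"] > lr_threshold
    loss_diverged = any(math.isnan(row["loss"]) for row in log)
    return lr_suspicious and loss_diverged

print(correlate_lr_with_nan(config, train_log))  # True: both signals present
```

Neither signal alone is conclusive; it is the correlation across two artifacts that points at the planted fault.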
### Action Space
```python
from typing import Literal

from pydantic import BaseModel

class MLOpsAction(BaseModel):
    action_type: Literal[
        "read_config",            # Full config.yaml
        "read_logs",              # Training logs (filterable: keyword or "epoch:N-M")
        "check_dataset_stats",    # Split sizes, class distribution, overlap counts
        "inspect_preprocessing",  # Full preprocessing pipeline code
        "read_eval_results",      # Final val/test metrics
        "run_sanity_check",       # Computed diagnostic (see types below)
        "query_artifact",         # Specific field from any artifact (dot notation)
        "submit_diagnosis",       # Final answer -> triggers grading
    ]

# Sanity check types:
#   label_consistency | data_leakage | gradient_norms | class_balance
#   feature_statistics | encoder_version_match | loss_trajectory | metric_gap_analysis

# submit_diagnosis fields:
#   failure_category | root_cause_file | root_cause_field | diagnosis | proposed_fix
```
### Observation Space
```python
class MLOpsObservation(BaseModel):
    task_id: str                             # easy | medium | hard
    task_description: str                    # Full task brief with investigation strategy
    run_id: str                              # Unique run identifier
    run_summary: Dict[str, Any]              # Model, dataset, training status
    available_artifacts: List[ArtifactMeta]  # What can be read
    artifacts_read: List[str]                # Investigation progress
    last_action_result: Dict[str, Any]       # Full content of last action
    step_count: int
    max_steps: int
    done: bool
    messages: List[str]                      # System warnings (duplicate reads, etc.)
```
## Tasks

### Task 1 – Config Error Diagnosis (easy)

**Bug pool** (one picked randomly per episode):
- `exploding_lr`: `learning_rate: 50.0` causes loss → NaN by epoch 3
- `wrong_optimizer`: `SGD(momentum=0.99)` causes oscillation with no convergence
- `batch_size_overflow`: `batch_size: 4096` exceeds dataset size, so val accuracy hits 99.9% trivially
**Signal:** Visible immediately in the training logs; the loss curve or accuracy values are obviously wrong.

**Optimal strategy:** `read_logs` → `run_sanity_check(loss_trajectory)` → `read_config` → `submit_diagnosis`

**Max steps:** 20 | **Expected baseline score:** ~0.42
### Task 2 – Data Leakage Detection (medium)

**Bug pool:**
- `data_leakage_scaler`: `StandardScaler.fit_transform(X_full)` called before the train/val split
- `data_leakage_overlap`: `train_test_split(random_state=None)` produces non-deterministic overlapping splits
- `wrong_split_ratio`: `test_size=0.8` trains on 20% and evaluates on 80% (inverted)
**Signal:** Val accuracy suspiciously high from epoch 1 in logs; val/test gap in eval results; sample overlap count in dataset stats.

**Optimal strategy:** `read_logs` → `read_eval_results` → `run_sanity_check(data_leakage)` → `inspect_preprocessing` → `submit_diagnosis`

**Max steps:** 30 | **Expected baseline score:** ~0.28
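The `data_leakage_scaler` pattern is easy to reproduce in miniature: a statistic fitted on the full dataset leaks held-out information into training. A stdlib-only sketch with illustrative numbers:

```python
def mean(xs):
    return sum(xs) / len(xs)

# Toy feature column; the last two points belong to the held-out split.
X_full = [1.0, 2.0, 3.0, 4.0, 100.0, 200.0]
X_train, X_test = X_full[:4], X_full[4:]

# Buggy: scaler statistics computed on ALL data, including the held-out split.
leaky_mean = mean(X_full)
# Correct: statistics computed on the training split only.
clean_mean = mean(X_train)

# The leaked statistic is wildly different: held-out outliers shifted it,
# so training already "knows" about the test distribution.
print(leaky_mean, clean_mean)
```

The same reasoning applies to `StandardScaler`: calling `fit_transform` before the split bakes test-set statistics into every training example.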
### Task 3 – Silent Evaluation Bug (hard)

**Bug pool:**
- `label_encoder_mismatch`: train and eval use different `LabelEncoder.fit()` orderings → silently wrong predictions
- `silent_metric_swap`: the `val_accuracy` and `test_accuracy` assignments are swapped in eval code
- `tokenizer_version_drift`: training uses tokenizer v2, eval uses v1 → 847 tokens map to `[UNK]`
**Signal:** Training logs look completely normal. Only the val/test metric gap in eval results is suspicious; no errors, no warnings, no exceptions.

**Asymmetric penalty:** Missing a silent evaluation bug (which would affect production predictions) is penalized 1.5×, mirroring real incident severity weighting.

**Optimal strategy:** `read_eval_results` → `run_sanity_check(metric_gap_analysis)` → `inspect_preprocessing` → `run_sanity_check(label_consistency OR encoder_version_match)` → `submit_diagnosis`

**Max steps:** 40 | **Expected baseline score:** ~0.15
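The `tokenizer_version_drift` failure can be illustrated with two toy vocabularies: tokens added in v2 silently collapse to `[UNK]` when eval encodes with v1. These vocabularies are hypothetical, not the environment's actual tokenizer:

```python
# Toy vocabularies: v2 is a superset of v1.
VOCAB_V1 = {"the": 0, "model": 1, "[UNK]": 2}
VOCAB_V2 = {"the": 0, "model": 1, "[UNK]": 2, "mlops": 3, "debugger": 4}

def encode(tokens, vocab):
    """Map tokens to ids; unknown tokens silently become [UNK]."""
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]

text = ["the", "mlops", "debugger", "model"]
train_ids = encode(text, VOCAB_V2)  # [0, 3, 4, 1] -- all tokens known
eval_ids = encode(text, VOCAB_V1)   # [0, 2, 2, 1] -- two tokens collapsed to [UNK]

# No exception is raised anywhere: the bug is completely silent.
assert train_ids != eval_ids
```

This is why the task is hard: nothing in the logs fails, and only the downstream metric gap betrays the mismatch.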
## Reward Function
Dense per-step rewards (not sparse):
- **+0.02** first time reading an artifact (rewards systematic exploration)
- **-0.02** reading the same artifact with the same filter again (penalizes brute force)
- **+0.01** running a new sanity check (rewards diagnostic reasoning)
At `submit_diagnosis`:

- **+0.15** correct `failure_category` (`config_error` / `data_leakage` / `evaluation_bug` / ...)
- **+0.25** correct `root_cause_file` (exact match)
- **+0.30** correct `root_cause_field` (substring match, case-insensitive)
- **+0.30** correct `proposed_fix` (keyword overlap with gold fix)
**Task 3 modifier:** if score < 0.70, an additional 0.5× penalty applies to missed components.
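The submit-time scoring can be approximated in a few lines. This is a sketch of the component rules above, not the actual grader; the keyword-overlap threshold of 50% is an assumption for illustration:

```python
def grade(diagnosis: dict, gold: dict) -> float:
    """Approximate the component scores described above."""
    score = 0.0
    if diagnosis["failure_category"] == gold["failure_category"]:
        score += 0.15
    if diagnosis["root_cause_file"] == gold["root_cause_file"]:  # exact match
        score += 0.25
    if gold["root_cause_field"].lower() in diagnosis["root_cause_field"].lower():
        score += 0.30  # substring match, case-insensitive
    gold_kw = set(gold["proposed_fix"].lower().split())
    fix_kw = set(diagnosis["proposed_fix"].lower().split())
    if gold_kw and len(gold_kw & fix_kw) / len(gold_kw) >= 0.5:  # keyword overlap
        score += 0.30
    return round(score, 2)

gold = {
    "failure_category": "evaluation_bug",
    "root_cause_file": "preprocessing.py",
    "root_cause_field": "LabelEncoder.fit_order",
    "proposed_fix": "use single LabelEncoder instance across pipelines",
}
print(grade(gold, gold))  # 1.0 for a perfect diagnosis
```

Note how the four components sum to 1.00, matching the score spectrum below a perfect diagnosis.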
Score spectrum (verified):

- All wrong → 0.00
- Category only → 0.10–0.15
- Category + file → 0.35–0.40
- Category + file + field → 0.65
- Perfect diagnosis → 0.90–1.00
## Setup & Usage

### Docker (recommended)

```bash
docker build -t mlops-debug-env .
docker run -p 7860:7860 mlops-debug-env
curl http://localhost:7860/health
```
### Local Python

```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```
### Python Client

```python
# Sync usage
from client import MLOpsDebugEnv
from models import MLOpsAction

with MLOpsDebugEnv(base_url="http://localhost:7860").sync() as env:
    obs = env.reset(task_id="hard", seed=1)
    print(obs.task_description)

    # Investigate systematically
    r = env.step(MLOpsAction(action_type="read_eval_results"))
    print(r.observation.last_action_result["content"])

    r = env.step(MLOpsAction(
        action_type="run_sanity_check",
        sanity_check_type="metric_gap_analysis"
    ))
    # Reveals the val/test gap anomaly

    r = env.step(MLOpsAction(action_type="inspect_preprocessing"))
    # Shows the buggy pipeline code

    r = env.step(MLOpsAction(
        action_type="submit_diagnosis",
        failure_category="label_mismatch",
        root_cause_file="preprocessing.py",
        root_cause_field="LabelEncoder.fit_order",
        diagnosis="Train and eval use different LabelEncoder orderings",
        proposed_fix="Use single LabelEncoder instance across both pipelines"
    ))
    print(f"Score: {r.info['score']}")
```
### Baseline Inference Script

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="hf_your_token_here"
export ENV_BASE_URL="http://localhost:7860"

python inference.py                        # all 3 tasks, seed=42
python inference.py --task easy --seed 42
```
Output format:
```
[START] task=easy env=mlops-debug-env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
[STEP] step=2 action=run_sanity_check reward=0.01 done=false error=null
[STEP] step=3 action=read_config reward=0.02 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=0.95 done=true error=null
[END] success=true steps=4 rewards=0.02,0.01,0.02,0.95
```
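The `[STEP]` lines are easy to post-process. A small parser that totals per-step rewards, assuming exactly the `key=value` format shown above:

```python
def total_reward(log_text: str) -> float:
    """Sum the reward=... fields from [STEP] lines."""
    total = 0.0
    for line in log_text.splitlines():
        if line.startswith("[STEP]"):
            # Split "key=value" tokens after the [STEP] tag into a dict.
            fields = dict(kv.split("=") for kv in line.split()[1:])
            total += float(fields["reward"])
    return total

log = """[START] task=easy env=mlops-debug-env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=read_logs reward=0.02 done=false error=null
[STEP] step=4 action=submit_diagnosis reward=0.95 done=true error=null
[END] success=true steps=4 rewards=0.02,0.95"""

# The two steps sum to 0.97 (within float tolerance).
assert abs(total_reward(log) - (0.02 + 0.95)) < 1e-12
```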
Baseline scores (Qwen2.5-72B-Instruct, seed=42):
| Task | Score | Notes |
|---|---|---|
| easy | ~0.42 | Gets category right, struggles with exact field name |
| medium | ~0.28 | Often identifies leakage but misidentifies exact mechanism |
| hard | ~0.15 | Silent bugs with normal training logs are genuinely hard |
## Why This Environment

**Real problem.** Debugging broken training runs is a core workflow for every ML team. The three bug categories in this environment (config errors, data leakage, silent evaluation bugs) are among the most common failure modes in production ML pipelines.
**Deterministic grading.** The planted bug is ground truth. Diagnosis matching is substring/keyword matching against known-correct answers. Zero subjectivity, no LLM-as-judge, reproducible across runs.
**Genuinely hard for frontier models.** Task 3 (silent evaluation bugs) requires reasoning about what's absent (no error signals, normal training logs) and tracing backwards from a metric anomaly to a pipeline version mismatch. State-of-the-art models score ~0.15 without careful prompting.
**Seed-based reproducibility.** `reset(seed=42)` always produces the same bug, same artifacts, same grading. Baseline scores are reproducible to 4 decimal places.
## Environment Variables

| Variable | Description |
|---|---|
| `API_BASE_URL` | LLM API endpoint (OpenAI-compatible) |
| `MODEL_NAME` | Model identifier |
| `HF_TOKEN` | Hugging Face / API token |
| `ENV_BASE_URL` | Environment server URL (default: `http://localhost:7860`) |
## License

MIT – see LICENSE