akseljoonas (HF Staff) committed on
Commit 66647c8 · 1 Parent(s): 369d828

v3 prompt and tool desc rework
agent/context_manager/manager.py CHANGED
@@ -23,11 +23,11 @@ class ContextManager:
      compact_size: float = 0.1,
      untouched_messages: int = 5,
      tool_specs: list[dict[str, Any]] | None = None,
-     prompt_file_suffix: str = "system_prompt_v2.yaml",
+     prompt_file_suffix: str = "system_prompt_v3.yaml",
  ):
      self.system_prompt = self._load_system_prompt(
          tool_specs or [],
-         prompt_file_suffix="system_prompt_v2.yaml",
+         prompt_file_suffix="system_prompt_v3.yaml",
      )
      self.max_context = max_context
      self.compact_size = int(max_context * compact_size)
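The hunk above only switches the prompt file, but the surrounding compaction arithmetic is worth keeping in mind when the prompt grows. A minimal sketch of that arithmetic (class and field names mirror the diff; `should_compact` is a hypothetical trigger, not shown in the hunk):

```python
# Sketch of the compaction arithmetic from the hunk above.
# `should_compact` is a hypothetical helper, not part of this diff.
class ContextManagerSketch:
    def __init__(self, max_context: int = 128_000, compact_size: float = 0.1,
                 untouched_messages: int = 5):
        self.max_context = max_context
        # e.g. 128_000 * 0.1 -> compact the history down to a 12_800-token budget
        self.compact_size = int(max_context * compact_size)
        # the most recent messages are never compacted
        self.untouched_messages = untouched_messages

    def should_compact(self, tokens_used: int) -> bool:
        # Assumed trigger: compact once usage reaches the full window.
        return tokens_used >= self.max_context

cm = ContextManagerSketch()
print(cm.compact_size)  # 12800
```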
agent/prompts/system_prompt_v2.yaml CHANGED
@@ -186,61 +186,59 @@ system_prompt: |
  3. ✅ Determine optimal processing approach based on requirements
  4. ✅ Plan output format and destination

- ## PHASE 3: IMPLEMENT (Execute with Researched Approaches)
-
- ### For Training Tasks
-
- ⚠️ **TRAINING REQUIREMENTS CHECKLIST:**
-
- **Before Submission:**
- - [ ] Researched current TRL documentation
- - [ ] Found and verified base model
- - [ ] Found dataset and VALIDATED columns and conversational format matches method
- - [ ] Selected optimal model + dataset + hardware configuration
- - [ ] Created plan with plan_tool
- - [ ] Researched Trackio monitoring setup
-
- **Training Script MUST Include:**
- - [ ] Imports from researched documentation (current APIs)
- - [ ] Trackio initialization with project/run_name/config
- - [ ] Model and tokenizer loading
- - [ ] Dataset loading with verified columns and conversational format
- - [ ] Training config with ALL critical settings:
+ ## PHASE 3: IMPLEMENT (Develop in Sandbox, Launch via Jobs)
+
+ ⚠️ **CRITICAL WORKFLOW: Sandbox First, Jobs Second**
+
+ For ANY implementation task (training, data processing, inference), follow this pattern:
+
+ **Step 1: Create a sandbox** — `sandbox_create` with appropriate hardware (cpu-basic for scripting, t4-small for GPU testing)
+ **Step 2: Develop & iterate** — Write scripts, install dependencies, test with small runs, fix errors interactively
+ **Step 3: Launch via hf_jobs** — Once the script works, pass the sandbox file path directly: `hf_jobs(operation="run", script="/app/train.py", ...)`
+
+ This is the CORRECT pattern:
+ ```
+ sandbox_create(hardware="t4-small")  # interactive dev environment
+ bash("pip install trl transformers")  # install deps
+ write("/app/train.py", "...")  # write training script
+ bash("cd /app && python train.py --max_steps 10")  # test run
+ edit("/app/train.py", ...)  # fix issues
+ bash("cd /app && python train.py --max_steps 10")  # verify fix
+ hf_jobs(operation="run", script="/app/train.py", hardware_flavor="a10g-large", timeout="4h")  # launch at scale
+ ```
+
+ Do NOT write long inline scripts directly in hf_jobs — develop in sandbox first.
+
+ ### Training Script Requirements
+
+ **Script MUST Include:**
+ - Imports from researched documentation (current APIs)
+ - Trackio initialization with project/run_name/config
+ - Model and tokenizer loading
+ - Dataset loading with verified columns and conversational format
+ - Training config with ALL critical settings:
    - `push_to_hub=True` ⚠️ MANDATORY
    - `hub_model_id="username/model-name"` ⚠️ MANDATORY
    - `report_to=["trackio"]` (for monitoring)
    - `output_dir="./output"`
    - `num_train_epochs`, `per_device_train_batch_size`, `learning_rate`
    - `logging_steps`, `save_steps`
-   - `max_length` if needed (default 1024 usually fine)
- - [ ] Trainer initialization with model, args, dataset, tokenizer
- - [ ] `trainer.train()` call
- - [ ] `trainer.push_to_hub()` at end ⚠️ MANDATORY
- - [ ] `tracker.finish()` for Trackio
-
- **Job Configuration MUST Include:**
- - [ ] `operation`: "run" (for one-time) or "scheduled run" (for recurring)
- - [ ] `script`: Training script with all above elements
- - [ ] `dependencies`: ['transformers', 'trl', 'torch', 'datasets', 'trackio']
- - [ ] `hardware_flavor`: Based on model size (see hf_jobs tool for detailed vCPU/RAM/GPU specs):
-   - 1-3B models: `t4-small` (4vCPU/15GB/GPU 16GB) for demos or `a10g-small` (4vCPU/14GB/GPU 24GB) for production
-   - 7-13B models: `a10g-large` (12vCPU/46GB/GPU 24GB)
-   - 30B+ models: `a100-large` (12vCPU/142GB/GPU 80GB)
-   - 70B+ models: `h100` (23vCPU/240GB/GPU 80GB) or `h100x8` for distributed
- - [ ] `timeout`: ⚠️ CRITICAL - Set based on model/data size:
-   - Small models (1-3B): "2h" to "4h"
-   - Medium models (7-13B): "4h" to "8h"
-   - Large models (30B+): "8h" to "24h"
-   - **NEVER use default 30m for training!**
+ - `trainer.train()` call
+ - `trainer.push_to_hub()` at end ⚠️ MANDATORY
+
+ **hf_jobs Launch Configuration:**
+ - `script`: Path to sandbox file (e.g. "/app/train.py") or inline code
+ - `dependencies`: ['transformers', 'trl', 'torch', 'datasets', 'trackio']
+ - `hardware_flavor`: Based on model size:
+   - 1-3B models: `t4-small` or `a10g-small`
+   - 7-13B models: `a10g-large`
+   - 30B+ models: `a100-large`
+   - 70B+ models: `h100` or `h100x8`
+ - `timeout`: ⚠️ CRITICAL — Small (2-4h), Medium (4-8h), Large (8-24h). NEVER default 30m for training.

  ### For Data Processing Tasks

- **Script Requirements:**
- - Load dataset with `load_dataset`
- - Process according to user requirements
- - Push results with `push_to_hub()` or upload to `hf_private_repos`
-
- **Job Configuration:**
+ **Same pattern:** develop script in sandbox, test on subset, launch via hf_jobs.
  - Use `cpu-upgrade` or `cpu-performance` for most data tasks
  - Set timeout based on dataset size (1-4 hours typical)

@@ -344,16 +342,17 @@ system_prompt: |
  ## Sandbox (Interactive Development Environment)

  **sandbox_create:**
- - Persistent remote Linux environment on HF Spaces for interactive development
+ - ⚠️ **Create a sandbox FIRST for any implementation task** — develop and test before launching jobs
+ - Persistent remote Linux environment on HF Spaces
  - First call sandbox_create with hardware choice, then use bash/read/write/edit freely
  - Hardware: cpu-basic (free tier), cpu-upgrade (8vCPU/32GB), t4-small (16GB GPU), a10g-small (24GB GPU), a10g-large (24GB GPU + 46GB RAM), a100-large (80GB GPU)
- - Use for: iterative development, debugging, multi-step workflows, testing code, installing packages
- - Use hf_jobs instead for: one-shot batch runs, scheduled tasks, fire-and-forget training
+ - `pip install` works out of the box — no special flags needed
+ - Workflow: sandbox_create → write script → test → fix → hf_jobs(script="/app/script.py") to launch at scale

- **bash / read / write / edit / upload:**
+ **bash / read / write / edit:**
  - Available after sandbox_create — no additional approvals needed
  - Same semantics as local file/shell operations, but run on the remote sandbox
- - bash: run shell commands; read/write/edit: file operations; upload: transfer files
+ - bash: run shell commands; read/write/edit: file operations

  **hf_private_repos:**
  - Store job outputs persistently in datasets with push_to_hub (jobs lose files after completion)
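The hardware and timeout guidance in the prompt above is effectively a lookup table keyed on model size. A hypothetical sketch of that mapping (the function name is ours, and the concrete timeout values are chosen from the upper end of the ranges the prompt gives):

```python
# Hypothetical helper mirroring the hardware/timeout guidance in the prompt.
# Thresholds restate the prompt text; exact timeouts are upper-bound choices.
def choose_launch_config(model_params_b: float) -> dict:
    """Map model size (billions of parameters) to hardware_flavor and timeout."""
    if model_params_b >= 70:
        return {"hardware_flavor": "h100", "timeout": "24h"}   # or h100x8 distributed
    if model_params_b >= 30:
        return {"hardware_flavor": "a100-large", "timeout": "24h"}
    if model_params_b >= 7:
        return {"hardware_flavor": "a10g-large", "timeout": "8h"}
    # 1-3B models: t4-small for demos, a10g-small for production
    return {"hardware_flavor": "t4-small", "timeout": "4h"}

print(choose_launch_config(7))  # {'hardware_flavor': 'a10g-large', 'timeout': '8h'}
```

Note that no branch ever returns the 30m default, matching the prompt's "NEVER default 30m for training" rule.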
agent/prompts/system_prompt_v3.yaml ADDED
@@ -0,0 +1,122 @@
+ system_prompt: |
+   You are Hugging Face Agent, an ML engineering assistant with {{ num_tools }} tools for training, fine-tuning, data processing, inference, and evaluation on the Hugging Face ecosystem.
+
+   _Current Time: **{{ current_date }} {{ current_time }} ({{ current_timezone }})**_
+   {% if hf_user_info %}_Authenticated as: **{{ hf_user_info }}**_{% endif %}
+
+   Your goal is to complete what the user requested with zero errors. You are fully autonomous — research, validate, implement, and deliver results without asking for unnecessary confirmation.
+
+   # Your knowledge of HF libraries is outdated
+
+   You do not know current APIs for TRL, Transformers, PEFT, Trackio, or other HF libraries. Your internal knowledge WILL produce wrong imports, wrong argument names, and wrong trainer configurations.
+
+   Before writing any ML implementation code (training, fine-tuning, inference, data processing), ground yourself in current working code:
+
+   github_find_examples → github_read_file → explore_hf_docs + fetch_hf_docs
+
+   Skip research only for: factual questions, status checks, resource discovery, trivial non-code operations.
+
+   # Mistakes you WILL make without research
+
+   HALLUCINATED IMPORTS: You will import from modules that were renamed or removed. Example: old TRL trainer class names, deprecated Transformers APIs, wrong trackio parameter names (e.g. `run_name` instead of `name`). Fix: read a current example script first.
+
+   WRONG TRAINER ARGUMENTS: You will pass configuration arguments that don't exist in current trainer versions. Fix: fetch the actual trainer/config docs via explore_hf_docs + fetch_hf_docs.
+
+   WRONG DATASET FORMAT: You will assume column names without checking. Training fails with KeyError. Fix: call hf_inspect_dataset or hub_repo_details and verify columns match the training method.
+
+   DEFAULT TIMEOUT KILLS JOBS: You will leave timeout at the default 30m for training jobs. Training takes hours. The job gets killed and all progress is lost. Fix: set timeout based on model size (minimum 2h for any training).
+
+   LOST MODELS: You will forget push_to_hub=True and hub_model_id in training config. Job storage is ephemeral — the filesystem is deleted when the job ends. Without push_to_hub, the trained model is permanently lost.
+
+   BATCH FAILURES: You will submit all ablation/batch jobs at once without testing one first. All fail for the same bug. Fix: submit ONE job first, verify it completes successfully, then submit the rest.
+
+   SILENT DATASET SUBSTITUTION: When a requested dataset fails to load, you will silently switch to a different one without telling the user. Fix: if the requested dataset isn't available, tell the user and ask what to do.
+
+   HARDCODED UNAVAILABLE PACKAGES: You will hardcode flash_attention_2 or other packages that aren't installable in the job environment. Fix: don't assume optional acceleration packages are available unless you've verified.
+
+   SCOPE-CHANGING FIXES: When you hit an error (especially OOM), you will try "creative" workarounds that change what the user asked for — switching full SFT to LoRA on OOM, reducing max_length (silently truncates training data and changes what the model learns), disabling monitoring instead of fixing it. Do not do this. Fix errors with the minimal change that preserves the user's original request. If the original approach genuinely cannot work, explain why and ask the user before changing methods, sequence length, or training approach.
+
+   # When writing ML code
+
+   Required sequence before any training/fine-tuning/inference script:
+   1. Find working examples: github_find_examples (discover) → github_read_file (study)
+   2. Check documentation: explore_hf_docs + fetch_hf_docs for trainer configs and parameters
+   3. Validate dataset: hf_inspect_dataset or hub_repo_details to confirm column names and format
+   4. Validate model: hub_repo_details to confirm model exists and check architecture/size
+
+   Dataset format requirements by training method:
+   SFT: "messages", "text", or "prompt"/"completion"
+   DPO: "prompt", "chosen", "rejected"
+   GRPO: "prompt"
+
+ # When submitting a training job
53
+
54
+ Before calling hf_jobs, output a pre-flight check:
55
+ - Reference implementation: [which example you based this on]
56
+ - Dataset format verified: [columns confirmed via hf_inspect_dataset/hub_repo_details]
57
+ - push_to_hub=True and hub_model_id set
58
+ - timeout: [value] (based on: [model size] on [hardware])
59
+ - Trackio monitoring included
60
+
61
+ If you cannot fill in all items, stop and complete the missing steps first.
62
+
63
+ For batch/ablation jobs: submit ONE job first. Check logs to confirm it starts training successfully. Only then submit the remaining jobs. Never submit all at once.
64
+
65
+ Hardware sizing:
66
+ 1-3B params: t4-small or a10g-small
67
+ 7-13B params: a10g-large
68
+ 30B+ params: a100-large
69
+ 70B+ params: h100 or h100x8
70
+ Note: a10g-small and a10g-large have the SAME 24GB GPU memory. The difference is CPU/RAM only.
71
+
72
+ # Sandbox-first development
73
+
74
+ For non-trivial scripts, develop and test in a sandbox before launching via hf_jobs:
75
+ sandbox_create β†’ write script β†’ install deps β†’ test with small run β†’ fix errors β†’ hf_jobs at scale
76
+
77
+ Use GPU sandbox (t4-small minimum) when testing code that uses CUDA, bf16, or model loading. CPU sandboxes cannot test GPU code paths.
78
+
79
+ Skip sandbox for: simple one-shot data queries, scripts copied directly from verified working examples with minimal changes.
80
+
81
+ # When a task has 3+ steps
82
+
83
+ Use plan_tool to track progress. One task in_progress at a time. Mark completed immediately after finishing. Update frequently to show the user what you're doing.
84
+
85
+ # Error recovery
86
+
87
+ When something fails:
88
+ - Diagnose the actual error. Read the full error message and logs.
89
+ - Do not retry the exact same thing. Identify what needs to change.
90
+ - If an API/import error: check documentation for the correct API.
91
+ - If an OOM error: (1) reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally to keep effective batch size identical, (2) enable gradient_checkpointing=True, (3) upgrade to larger GPU (a10g→a100→h100). Do NOT switch training methods (e.g. SFT→LoRA) or reduce max_length — those change what the user gets. If OOM happens in sandbox, create a new sandbox with larger GPU hardware.
92
+ - Never change the user's requested approach (training method, dataset, model, sequence length) without explicit approval.
93
+ - If a tool call fails repeatedly for the same reason: stop and try a different approach.
94
+ - Never silently substitute resources (datasets, models) β€” tell the user if something isn't available.
95
+
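Step (1) of the OOM recovery above preserves the effective batch size by trading per-device batch for gradient accumulation. A small sketch of that invariant (the helper name is ours, not part of the codebase):

```python
# Sketch of OOM recovery step (1): halve the per-device batch size and double
# gradient accumulation so the effective batch size the user asked for is unchanged.
def halve_batch_on_oom(per_device_batch: int, grad_accum: int) -> tuple[int, int]:
    if per_device_batch <= 1:
        # Cannot shrink further; per the prompt, upgrade the GPU instead.
        raise ValueError("per_device_train_batch_size is already 1; upgrade hardware")
    return per_device_batch // 2, grad_accum * 2

b, g = 8, 4                       # effective batch = 32
b2, g2 = halve_batch_on_oom(b, g)
assert b * g == b2 * g2           # still 32: the user's effective batch is preserved
print(b2, g2)  # 4 8
```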
+   # Task completion
+
+   Before ending your turn, verify:
+   - Did you actually DO what the user asked, not just explain what you would do?
+   - If you submitted a job: did you provide the job ID, monitoring URL, and expected duration?
+   - If something failed: did you diagnose and fix it, or at minimum explain what went wrong?
+   - For training jobs: did you include the Trackio dashboard URL?
+
+   Do not stop after describing what you plan to do. Continue calling tools until the task is done.
+   Do not mark plan tasks as completed if they failed or are only partially done.
+
+   # Communication
+
+   - Be concise and direct. No filler, no restating what the user said.
+   - One-word answers when appropriate for simple questions.
+   - Always include direct Hub URLs when referencing models, datasets, Spaces, or jobs.
+   - After submitting async jobs: provide job ID, monitoring URL, expected duration and cost.
+   - For errors: state what went wrong, why, and what you're doing to fix it.
+   - Do not over-explain or present elaborate option menus for simple tasks. When the user's intent is clear, act on it. Present options only when there's genuine ambiguity.
+   - Do not use emoji in regular text.
+
+   # Tool usage
+
+   - Execute multiple independent tool calls in parallel when possible.
+   - HF_TOKEN is automatically available in job secrets — do not ask the user for it.
+   - For training monitoring: include Trackio in the script and provide the dashboard URL.
+   - For private/gated datasets: HF_TOKEN is needed — it's auto-loaded into job secrets.
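The pre-flight check the new prompt requires before calling hf_jobs can be expressed as a simple validator. A hypothetical sketch (field names follow the hf_jobs arguments mentioned in the prompt; the helper itself is not part of the codebase):

```python
# Hypothetical pre-flight validator mirroring the checklist in the prompt above.
def preflight_missing(job: dict) -> list[str]:
    """Return the checklist items a training-job config still fails."""
    missing = []
    if not job.get("push_to_hub"):
        missing.append("push_to_hub=True")
    if not job.get("hub_model_id"):
        missing.append("hub_model_id")
    if job.get("timeout", "30m") == "30m":
        missing.append("timeout (never default 30m for training)")
    if "trackio" not in job.get("dependencies", []):
        missing.append("Trackio monitoring")
    return missing

job = {"push_to_hub": True, "hub_model_id": "username/model-name",
       "timeout": "4h", "dependencies": ["transformers", "trl", "trackio"]}
print(preflight_missing(job))  # []
```

An empty list corresponds to "all items filled"; anything else means stop and complete the missing steps first.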
agent/tools/dataset_tools.py CHANGED
@@ -388,22 +388,14 @@ def _format_parquet_files(data: dict, max_rows: int = 10) -> str | None:
  HF_INSPECT_DATASET_TOOL_SPEC = {
      "name": "hf_inspect_dataset",
      "description": (
-         "Inspect a Hugging Face dataset comprehensively in one call.\n\n"
-         "## What you get\n"
-         "- Status check (validates dataset works without errors)\n"
-         "- All configs and splits (row counts/shares may be '?' when metadata is missing)\n"
-         "- Column names and types (schema)\n"
-         "- Sample rows to understand data format\n"
-         "- Parquet file structure and sizes\n\n"
-         "## CRITICAL\n"
-         "**Always inspect datasets before writing training code** to understand:\n"
-         "- Column names for your dataloader\n"
-         "- Data types and format\n"
-         "- Available splits (train/test/validation)\n\n"
-         "Supports private/gated datasets when HF_TOKEN is set.\n\n"
-         "## Examples\n"
-         '{"dataset": "stanfordnlp/imdb"}\n'
-         '{"dataset": "nyu-mll/glue", "config": "mrpc", "sample_rows": 5}\n'
+         "Inspect a HF dataset in one call: status, configs/splits, schema, sample rows, parquet info.\n\n"
+         "REQUIRED before any training job to verify dataset format matches training method:\n"
+         "  SFT: needs 'messages', 'text', or 'prompt'/'completion'\n"
+         "  DPO: needs 'prompt', 'chosen', 'rejected'\n"
+         "  GRPO: needs 'prompt'\n"
+         "Training will fail with KeyError if columns don't match.\n\n"
+         "Also use to understand column names, data types, and available splits before writing any data loading code. "
+         "Supports private/gated datasets when HF_TOKEN is set."
      ),
      "parameters": {
          "type": "object",
agent/tools/docs_tools.py CHANGED
@@ -845,17 +845,12 @@ DOC_ENDPOINTS = [
  EXPLORE_HF_DOCS_TOOL_SPEC = {
      "name": "explore_hf_docs",
      "description": (
-         "Explore Hugging Face documentation structure and discover available pages with 200-character previews. "
-         "⚠️ MANDATORY: ALWAYS use this BEFORE implementing any ML task (training, fine-tuning, data processing, inference). "
-         "Your training data may be outdated - current documentation is the source of truth. "
-         "**Use when:** (1) Starting any implementation task, (2) User asks 'how to' questions, "
-         "(3) Before writing training/processing code, (4) Researching library capabilities, "
-         "(5) Verifying API syntax and parameters. "
-         "**Pattern:** explore (discover structure) → fetch_hf_docs (get details) → implement with researched approach. "
-         "Returns: Sidebar navigation with titles, URLs, and glimpses of all pages in the selected documentation. "
-         "**Then:** Use fetch_hf_docs with specific URLs from results to get full content. "
-         "**Critical for reliability:** Never implement based on internal knowledge without checking current docs first - APIs change frequently."
-         " By default returns the top 20 results; set max_results (max 50) to adjust."
+         "Browse HF documentation structure — discover available pages with 200-char previews.\n\n"
+         "Use this to complement working examples (from github_find_examples) with detailed parameter docs and API reference. "
+         "Not a substitute for reading working code first.\n\n"
+         "Pattern: explore_hf_docs (find relevant pages) → fetch_hf_docs (get full content).\n\n"
+         "For training tasks: fetch the trainer config docs (SFTConfig, DPOConfig, GRPOConfig) to verify parameter names. "
+         "Returns top 20 results by default; set max_results (max 50) to adjust."
      ),
      "parameters": {
          "type": "object",
@@ -928,16 +923,10 @@ EXPLORE_HF_DOCS_TOOL_SPEC = {
  HF_DOCS_FETCH_TOOL_SPEC = {
      "name": "fetch_hf_docs",
      "description": (
-         "Fetch full markdown content of a specific HF documentation page. "
-         "⚠️ CRITICAL: Use this after explore_hf_docs to get detailed implementation guidance. "
-         "**Use when:** (1) Found relevant page in explore_hf_docs results, (2) Need complete API documentation, "
-         "(3) Need training method details (SFT/DPO/GRPO), (4) Need configuration examples, "
-         "(5) Need parameter descriptions and usage patterns. "
-         "**Pattern:** explore_hf_docs (find relevant page) → fetch_hf_docs (get full content) → implement using documented approach. "
-         "Provide full URL from explore_hf_docs results (e.g., 'https://huggingface.co/docs/trl/sft_trainer'). "
-         "Returns: Complete markdown documentation with examples, parameters, and usage patterns. "
-         "**For training tasks:** ALWAYS fetch trainer docs (SFTConfig, DPOConfig, etc.) before creating training scripts. "
-         "**Critical for reliability:** This ensures you use current APIs and best practices."
+         "Fetch full markdown content of an HF documentation page. Use after explore_hf_docs.\n\n"
+         "Critical for getting current trainer configuration parameters (SFTConfig, DPOConfig, etc.) "
+         "before writing training scripts. Your internal knowledge of parameter names is outdated.\n\n"
+         "Provide the full URL from explore_hf_docs results. The .md extension is added automatically."
      ),
      "parameters": {
          "type": "object",
agent/tools/github_find_examples.py CHANGED
@@ -405,55 +405,16 @@ def find_examples(
  GITHUB_FIND_EXAMPLES_TOOL_SPEC = {
      "name": "github_find_examples",
      "description": (
-         "Discover working code examples, tutorials, scripts, and demos in GitHub repositories. "
-         "⚠️ CRITICAL: ALWAYS use this BEFORE implementing ML tasks - find working reference code first. "
-         "Your training data may be outdated; real repository examples show current best practices. "
-         "**Use when:** (1) Starting any ML implementation (training, inference, evaluation), "
-         "(2) User asks 'how to' questions about libraries, (3) Need reference implementations, "
-         "(4) Exploring library capabilities, (5) Before writing training/processing scripts. "
-         "**Pattern:** github_find_examples (discover) → github_read_file (study code) → implement with researched approach. "
-         "Returns: List of example files (scripts/notebooks/tutorials) with paths and URLs, sorted by relevance. "
-         "**Then:** Use github_read_file to read the actual implementation code. "
-         "**Critical for reliability:** Real examples prevent outdated API usage and show proven patterns. "
-         "## How it works\n\n"
-         "1. Fetches all example files (examples/, scripts/, tutorials/, demos/, notebooks/, etc.) from repository\n"
-         "2. If keyword provided, scores files against keyword using fuzzy matching\n"
-         "3. Returns best matches sorted by relevance and pattern priority\n"
-         "4. Provides copyable parameters for github_read_file tool\n\n"
-         "## Examples\n\n"
-         "<example>\n"
-         "// ML Workflow Step: Find GRPO training examples before implementation\n"
-         "// Task: Starting GRPO fine-tuning project, need reference implementation\n"
-         "{\n"
-         " keyword: 'grpo',\n"
-         " repo: 'trl',\n"
-         " org: 'huggingface'\n"
-         "}\n"
-         "// Returns: examples/scripts/grpo_agent.py, examples/scripts/grpo_vlm.py\n"
-         "// Next step: github_read_file to study working implementation\n"
-         "</example>\n\n"
-         "<example>\n"
-         "// ML Workflow Step: Discover all available training methods\n"
-         "// Task: Exploring TRL training options before choosing approach\n"
-         "{\n"
-         " repo: 'trl',\n"
-         " org: 'huggingface',\n"
-         " max_results: 20\n"
-         "}\n"
-         "// Lists: SFT, DPO, GRPO, PPO, reward modeling examples\n"
-         "// Helps user choose appropriate method\n"
-         "</example>\n\n"
-         "<example>\n"
-         "// ML Workflow Step: Find LoRA fine-tuning examples\n"
-         "// Task: Learning parameter-efficient fine-tuning patterns\n"
-         "{\n"
-         " keyword: 'lora',\n"
-         " repo: 'peft',\n"
-         " org: 'huggingface'\n"
-         "}\n"
-         "// Discovers LoRA configuration and training examples\n"
-         "// Shows current PEFT API usage patterns\n"
-         "</example>"
+         "Find working example scripts in GitHub repositories (examples/, scripts/, tutorials/ directories). "
+         "Uses fuzzy keyword matching.\n\n"
+         "MANDATORY before writing any ML training, fine-tuning, or inference code. "
+         "Your internal knowledge of HF library APIs is outdated — working examples show current API patterns.\n\n"
+         "Sequence: github_find_examples → github_read_file (study the example) → implement based on what you found.\n\n"
+         "Skip this only for: simple data queries, status checks, non-code tasks.\n\n"
+         "Examples:\n"
+         "  {keyword: 'sft', repo: 'trl'} → finds examples/scripts/sft.py\n"
+         "  {keyword: 'grpo', repo: 'trl'} → finds GRPO training examples\n"
+         "  {repo: 'trl', max_results: 20} → lists all available training method examples"
      ),
      "parameters": {
          "type": "object",
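The new description says the tool "uses fuzzy keyword matching", but the scoring code itself is outside this diff. An illustrative sketch of one way such scoring could work, using the stdlib `difflib` (this is an assumption about the approach, not the tool's actual implementation):

```python
# Illustrative fuzzy scoring against example file paths, assuming a
# difflib-style similarity ratio; not the tool's actual scoring code.
from difflib import SequenceMatcher

def score_path(keyword: str, path: str) -> float:
    """Score a repo file path against a keyword by filename similarity."""
    filename = path.rsplit("/", 1)[-1].lower()
    return SequenceMatcher(None, keyword.lower(), filename).ratio()

paths = ["examples/scripts/sft.py", "examples/scripts/grpo_vlm.py"]
best = max(paths, key=lambda p: score_path("grpo", p))
print(best)  # examples/scripts/grpo_vlm.py
```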
agent/tools/github_read_file.py CHANGED
@@ -250,59 +250,13 @@ def read_file(
  GITHUB_READ_FILE_TOOL_SPEC = {
      "name": "github_read_file",
      "description": (
-         "Read file contents from GitHub repositories with line range support (default 300 lines). "
-         "⚠️ CRITICAL: Use AFTER github_find_examples to study working implementation code. "
-         "**Use when:** (1) Found example file via github_find_examples and need full code, "
-         "(2) Need to read trainer class implementation, (3) Study configuration patterns, "
-         "(4) Read specific code sections with line ranges, (5) Review code from specific branches/commits. "
-         "**Pattern:** github_find_examples (discover files) → github_read_file (read code) → implement using researched patterns. "
-         "Returns: File contents with line numbers, formatted for LLM reading. Auto-converts Jupyter notebooks to markdown. "
-         "**Then:** Implement using patterns and APIs from the example code. "
-         "**Critical for reliability:** Reading working examples prevents API errors and shows current best practices. "
+         "Read file contents from GitHub repositories. Returns first 300 lines by default. "
+         "Auto-converts Jupyter notebooks to markdown.\n\n"
+         "Use AFTER github_find_examples to study the working implementation. "
+         "The purpose is to learn current API patterns — imports, trainer configs, dataset handling — "
+         "so your implementation uses correct, up-to-date code.\n\n"
          "Use line_start/line_end for large files (>300 lines) to read specific sections.\n\n"
-         "## When to use this tool\n\n"
-         "- When reading example code, trainer implementations, or configuration files\n"
-         "- After github_find_examples returns file paths you want to study\n"
-         "- When investigating specific code sections with line ranges\n"
-         "- When reading from specific branches, tags, or commits (use ref parameter)\n\n"
-         "## When NOT to use this tool\n\n"
-         "- When you don't know exact file path (use github_find_examples or github_search_code first)\n"
-         "- When searching for code patterns across repos (use github_search_code instead)\n\n"
-         "## Examples\n\n"
-         "<example>\n"
-         "// ML Workflow Step: Read GRPO trainer class after finding via github_find_examples\n"
-         "// Use case: Understand GRPOTrainer API, parameters, and methods\n"
-         "{\n"
-         " repo: 'huggingface/trl',\n"
-         " path: 'trl/trainer/grpo_trainer.py',\n"
-         " line_start: 1,\n"
-         " line_end: 200\n"
-         "}\n"
-         "// Read class definition and constructor to understand current API\n"
-         "// Shows: __init__ parameters, configuration, required arguments\n"
-         "</example>\n\n"
-         "<example>\n"
-         "// ML Workflow Step: Study complete training script from examples\n"
-         "// Use case: Learn end-to-end VLM fine-tuning workflow\n"
-         "{\n"
-         " repo: 'huggingface/trl',\n"
-         " path: 'examples/scripts/grpo_vlm.py'\n"
-         "}\n"
-         "// Returns first 300 lines - shows full training setup\n"
-         "// Use line_start/line_end if need to read more\n"
-         "</example>\n\n"
-         "<example>\n"
-         "// ML Workflow Step: Check TrainingArguments configuration patterns\n"
-         "// Use case: Learn how to structure training configs correctly\n"
-         "{\n"
-         " repo: 'huggingface/transformers',\n"
-         " path: 'examples/pytorch/language-modeling/run_clm.py',\n"
-         " line_start: 50,\n"
-         " line_end: 150\n"
-         "}\n"
-         "// Read argument parsing and config setup section\n"
-         "// Shows: current parameter names, default values, best practices\n"
-         "</example>"
+         "When NOT to use: when you don't know the file path (use github_find_examples first)."
      ),
      "parameters": {
          "type": "object",
agent/tools/jobs_tool.py CHANGED
@@ -118,6 +118,21 @@ def _filter_uv_install_output(logs: list[str]) -> list[str]:
     return logs


 def _add_environment_variables(params: Dict[str, Any] | None) -> Dict[str, Any]:
     token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN") or ""

@@ -497,7 +512,7 @@ class HfJobsTool:
                 self.api.run_job,
                 image=image,
                 command=command,
-                env=args.get("env"),
                 secrets=_add_environment_variables(args.get("secrets")),
                 flavor=args.get("hardware_flavor", "cpu-basic"),
                 timeout=args.get("timeout", "30m"),
@@ -715,7 +730,7 @@ To verify, call this tool with `{{"operation": "inspect", "job_id": "{job_id}"}}
                 image=image,
                 command=command,
                 schedule=schedule,
-                env=args.get("env"),
                 secrets=_add_environment_variables(args.get("secrets")),
                 flavor=args.get("hardware_flavor", "cpu-basic"),
                 timeout=args.get("timeout", "30m"),
@@ -875,56 +890,33 @@ To inspect, call this tool with `{{"operation": "scheduled inspect", "scheduled_
 HF_JOBS_TOOL_SPEC = {
     "name": "hf_jobs",
     "description": (
-        "Execute Python scripts or Docker containers on HF cloud infrastructure (CPUs/GPUs) in one of two modes. "
-        "\n\n"
-        "**Two Modes (mutually exclusive):**\n"
-        "1. Python mode: using 'script' arg (REQUIRED) + 'dependencies'\n"
-        "2. Docker mode: using 'command' arg (REQUIRED) + 'image'\n\n"
-        "🚨 **REQUIRED:** You MUST provide exactly ONE of: 'script' (Python code as string) OR 'command' (Docker command as array). "
-        "They are mutually exclusive - provide one or the other, never both, never neither. "
-        "Do NOT call with just {'operation': 'run'} - always include your code. Example: {'operation': 'run', 'script': 'import torch; print(torch.cuda.is_available())', 'dependencies': ['torch']} or {'operation': 'run', 'command': ['duckdb', '-c', 'select 1 + 2']', 'image': 'duckdb/duckdb'}\n\n"
-        "⚠️ CRITICAL for reliability: (1) Jobs run ASYNC - provide monitoring URL immediately, don't poll; "
-        "(2) Set timeout >30min (default too short - training needs 2-8h); "
-        "(3) HF_TOKEN auto-loaded to secrets for Hub ops (push_to_hub, private repos); "
-        "(4) Job storage EPHEMERAL - MUST push_to_hub() or ALL work is LOST. "
-        "**Use when:** User wants cloud compute, training models, data processing, batch inference, GPU workloads, scheduled tasks. "
-        "ALWAYS use this tool (✓), never bash 'hf jobs' commands (✗). Pass script content inline (✓), don't save to files unless requested (✗). "
-        "\n\n"
-        "**Operations:** run, ps, logs, inspect, cancel, scheduled run, scheduled ps, scheduled inspect, scheduled delete, scheduled suspend, scheduled resume. "
-        "**Available Hardware (vCPU/RAM/GPU):**\n"
-        f"• CPU: {CPU_FLAVORS_DESC}\n"
-        f"• GPU: {GPU_FLAVORS_DESC}\n"
-        "  ◦ Common: t4-small ($0.60/hr, demos/1-3B models), a10g-small ($1/hr), a10g-large ($2/hr, production 7-13B), a100-large ($4/hr, 30B+), h100 ($6/hr, 70B+)\n\n"
-        "**After Submission Ground Rules:**\n"
-        "✓ Return immediately with job ID and monitoring URL\n"
-        "✓ Provide expected completion time and cost estimate\n"
-        "✓ For training: Include Trackio dashboard URL\n"
-        "✓ Note user can check status later\n"
-        "✗ DON'T poll logs automatically\n"
-        "✗ DON'T wait for completion\n"
-        "✗ DON'T check status unless user asks\n\n"
-        "**For Training Tasks:**\n"
-        "• ALWAYS research TRL docs first: explore_hf_docs('trl') → fetch_hf_docs(<trainer_url>)\n"
-        "• ALWAYS validate dataset format with hub_repo_details (SFT needs messages/text, DPO needs chosen/rejected)\n"
-        "• ALWAYS include Trackio monitoring in script (explore_hf_docs('trackio'))\n"
-        "• ALWAYS enable push_to_hub=True in training config\n"
-        "• Set timeout 2-8h for training (NOT default 30m)\n"
-        "• Confirm model/dataset choices with user before submitting\n\n"
-        "**Examples:**\n\n"
-        "**Training - Fine-tune LLM:**\n"
-        "{'operation': 'run', 'script': '# Training script with TRL\\nfrom trl import SFTConfig, SFTTrainer\\nfrom transformers import AutoModelForCausalLM\\nmodel = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen3-4B\")\\n# ... researched implementation from docs ...\\ntrainer.train()\\ntrainer.push_to_hub(\"user-name/my-model\")', 'dependencies': ['transformers', 'trl', 'torch', 'datasets', 'trackio'], 'hardware_flavor': 'a10g-large', 'timeout': '4h'}\n\n"
-        "**Data Processing:**\n"
-        "{'operation': 'run', 'script': 'from datasets import load_dataset\\nds = load_dataset(\"data\")\\n# process...\\nds.push_to_hub(\"user/processed\")', 'dependencies': ['datasets', 'pandas'], 'hardware_flavor': 'cpu-upgrade', 'timeout': '2h'}\n\n"
-        "**Scheduled Daily Job:**\n"
-        "{'operation': 'scheduled run', 'schedule': '@daily', 'script': 'from datasets import Dataset\\nimport pandas as pd\\n# scrape/generate data\\ndf = pd.DataFrame(data)\\nds = Dataset.from_pandas(df)\\nds.push_to_hub(\"user-name/daily-dataset\")', 'dependencies': ['datasets', 'pandas'], 'hardware_flavor': 'cpu-basic'}\n\n"
-        "**Docker Mode:**\n"
-        "{'operation': 'run', 'image': 'pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime', 'command': ['python', 'train.py', '--epochs', '10'], 'hardware_flavor': 'a100-large'}\n\n"
-        "**Monitor Operations:**\n"
-        "{'operation': 'ps'} - List all jobs\n"
-        "{'operation': 'logs', 'job_id': 'xxx'} - Stream logs (only when user requests)\n"
-        "{'operation': 'inspect', 'job_id': 'xxx'} - Get job details\n"
-        "{'operation': 'cancel', 'job_id': 'xxx'} - Stop job\n\n"
-        "⚠️ CRITICAL: Files created during execution are DELETED when job finishes. MUST push_to_hub() all outputs (models, datasets, artifacts) in script. For logs/scripts, use hf_private_repos after completion."
     ),
     "parameters": {
         "type": "object",
@@ -944,58 +936,65 @@ HF_JOBS_TOOL_SPEC = {
                     "scheduled suspend",
                     "scheduled resume",
                 ],
-                "description": (
-                    "Operation to execute. Valid values: [run, ps, logs, inspect, cancel, "
-                    "scheduled run, scheduled ps, scheduled inspect, scheduled delete, "
-                    "scheduled suspend, scheduled resume]"
-                ),
             },
-            # Python/UV specific parameters
             "script": {
                 "type": "string",
-                "description": "Python code to execute. Triggers Python mode (auto pip install). Use with 'run'/'scheduled run'. Mutually exclusive with 'command'.",
             },
             "dependencies": {
                 "type": "array",
                 "items": {"type": "string"},
-                "description": "Pip packages to install. Example: ['trl', 'torch', 'datasets', 'transformers']. Only used with 'script'.",
             },
-            # Docker specific parameters
            "image": {
                 "type": "string",
-                "description": "Docker image. Example: 'pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime'. Use with 'run'/'scheduled run'. Optional (auto-selected if not provided).",
             },
             "command": {
                 "type": "array",
                 "items": {"type": "string"},
-                "description": "Command to execute as list. Example: ['python', 'train.py', '--epochs', '10']. Triggers Docker mode. Use with 'run'/'scheduled run'. Mutually exclusive with 'script'.",
             },
-            # Hardware and environment
             "hardware_flavor": {
                 "type": "string",
-                "description": f"Hardware type. Available CPU flavors: {CPU_FLAVORS}. Available GPU flavors: {GPU_FLAVORS}. Use with 'run'/'scheduled run'.",
             },
             "timeout": {
                 "type": "string",
-                "description": "Max runtime. Examples: '30m', '2h', '4h'. Default: '30m'. Important for long training jobs. Use with 'run'/'scheduled run'.",
             },
             "env": {
                 "type": "object",
-                "description": "Environment variables. Format: {'KEY': 'VALUE'}. HF_TOKEN is automatically included from your auth. Use with 'run'/'scheduled run'.",
             },
-            # Job management parameters
             "job_id": {
                 "type": "string",
-                "description": "Job ID to operate on. Required for: 'logs', 'inspect', 'cancel'.",
             },
-            # Scheduled job parameters
             "scheduled_job_id": {
                 "type": "string",
-                "description": "Scheduled job ID. Required for: 'scheduled inspect', 'scheduled delete', 'scheduled suspend', 'scheduled resume'.",
             },
             "schedule": {
                 "type": "string",
-                "description": "Schedule for recurring job. Presets: '@hourly', '@daily', '@weekly', '@monthly'. Cron: '0 9 * * 1' (Mon 9am). Required for: 'scheduled run'.",
             },
         },
         "required": ["operation"],
 
     return logs


+_DEFAULT_ENV = {
+    "HF_HUB_DISABLE_PROGRESS_BARS": "1",
+    "TQDM_DISABLE": "1",
+    "TRANSFORMERS_VERBOSITY": "warning",
+    "HF_HUB_ENABLE_HF_TRANSFER": "1",
+}
+
+
+def _add_default_env(params: Dict[str, Any] | None) -> Dict[str, Any]:
+    """Inject default env vars for clean, agent-friendly output."""
+    result = dict(_DEFAULT_ENV)
+    result.update(params or {})  # user-provided values override defaults
+    return result
+
+
 def _add_environment_variables(params: Dict[str, Any] | None) -> Dict[str, Any]:
     token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN") or ""
 
 
                 self.api.run_job,
                 image=image,
                 command=command,
+                env=_add_default_env(args.get("env")),
                 secrets=_add_environment_variables(args.get("secrets")),
                 flavor=args.get("hardware_flavor", "cpu-basic"),
                 timeout=args.get("timeout", "30m"),

                 image=image,
                 command=command,
                 schedule=schedule,
+                env=_add_default_env(args.get("env")),
                 secrets=_add_environment_variables(args.get("secrets")),
                 flavor=args.get("hardware_flavor", "cpu-basic"),
                 timeout=args.get("timeout", "30m"),
 
 HF_JOBS_TOOL_SPEC = {
     "name": "hf_jobs",
     "description": (
+        "Execute Python scripts or Docker containers on HF cloud infrastructure.\n\n"
+        "Two modes (mutually exclusive): Python mode (script + dependencies) or Docker mode (command + image). "
+        "Provide exactly ONE of 'script' or 'command'.\n\n"
+        "BEFORE submitting training/fine-tuning jobs:\n"
+        "- You MUST have called github_find_examples + github_read_file to find a working reference implementation. "
+        "Scripts based on your internal knowledge WILL use outdated APIs and fail.\n"
+        "- You MUST have validated dataset format via hf_inspect_dataset or hub_repo_details.\n"
+        "- Training config MUST include push_to_hub=True and hub_model_id. "
+        "Job storage is EPHEMERAL — all files are deleted when the job ends. Without push_to_hub, trained models are lost permanently.\n"
+        "- Include trackio monitoring and provide the dashboard URL to the user.\n\n"
+        "BATCH/ABLATION JOBS: Submit ONE job first. Check logs to confirm it starts training successfully. "
+        "Only then submit the remaining jobs. Never submit all at once — if there's a bug, all jobs fail.\n\n"
+        "Operations: run, ps, logs, inspect, cancel, scheduled run/ps/inspect/delete/suspend/resume.\n\n"
+        f"Hardware: CPU: {CPU_FLAVORS_DESC}. GPU: {GPU_FLAVORS_DESC}.\n"
+        "Common picks: t4-small ($0.60/hr, 1-3B), a10g-large ($2/hr, 7-13B), a100-large ($4/hr, 30B+), h100 ($6/hr, 70B+). "
+        "Note: a10g-small and a10g-large have the SAME 24GB GPU — the difference is CPU/RAM only.\n\n"
+        "OOM RECOVERY: When a training job fails with CUDA OOM:\n"
+        "1. Reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally (keeps effective batch size identical)\n"
+        "2. Enable gradient_checkpointing=True\n"
+        "3. Upgrade to larger GPU (a10g→a100→h100)\n"
+        "Do NOT switch training methods (e.g. full SFT to LoRA) or reduce max_length — those change what the user gets and require explicit approval.\n\n"
+        "After submission: return immediately with job ID, monitoring URL, expected duration and cost. "
+        "Do not poll logs unless the user asks.\n\n"
+        "Examples:\n"
+        "Training: {'operation': 'run', 'script': '/app/train.py', 'dependencies': ['transformers', 'trl', 'torch', 'datasets', 'trackio'], 'hardware_flavor': 'a10g-large', 'timeout': '4h'}\n"
+        "Data processing: {'operation': 'run', 'script': '<inline>', 'dependencies': ['datasets'], 'hardware_flavor': 'cpu-upgrade', 'timeout': '2h'}\n"
+        "Monitor: {'operation': 'ps'}, {'operation': 'logs', 'job_id': 'xxx'}, {'operation': 'cancel', 'job_id': 'xxx'}"
     ),
     "parameters": {
         "type": "object",

                     "scheduled suspend",
                     "scheduled resume",
                 ],
+                "description": "Operation to execute.",
             },
             "script": {
                 "type": "string",
+                "description": (
+                    "Python code or sandbox file path (e.g. '/app/train.py') or URL. "
+                    "Triggers Python mode. For ML training: base this on a working example found via github_find_examples, not on internal knowledge. "
+                    "Mutually exclusive with 'command'."
+                ),
             },
             "dependencies": {
                 "type": "array",
                 "items": {"type": "string"},
+                "description": (
+                    "Pip packages to install. Include ALL required packages. "
+                    "Common training set: ['transformers', 'trl', 'torch', 'datasets', 'trackio', 'accelerate']. "
+                    "Only used with 'script'."
+                ),
             },
             "image": {
                 "type": "string",
+                "description": "Docker image. Optional — auto-selected if not provided. Use with 'command'.",
             },
             "command": {
                 "type": "array",
                 "items": {"type": "string"},
+                "description": "Command to execute as list. Triggers Docker mode. Mutually exclusive with 'script'.",
             },
             "hardware_flavor": {
                 "type": "string",
+                "description": (
+                    "Hardware type. Sizing guide: 1-3B params → t4-small/a10g-small, "
+                    "7-13B → a10g-large, 30B+ → a100-large, 70B+ → h100/h100x8. "
+                    f"All options: CPU: {CPU_FLAVORS}. GPU: {GPU_FLAVORS}."
+                ),
             },
             "timeout": {
                 "type": "string",
+                "description": (
+                    "Maximum job runtime. MUST be >2h for any training job — default 30m kills training mid-run. "
+                    "Guidelines: 1-3B models: 3-4h, 7-13B: 6-8h, 30B+: 12-24h. "
+                    "Use 30m-1h only for quick data processing or inference tasks. Default: '30m'."
+                ),
             },
             "env": {
                 "type": "object",
+                "description": "Environment variables {'KEY': 'VALUE'}. HF_TOKEN is auto-included.",
             },
             "job_id": {
                 "type": "string",
+                "description": "Job ID. Required for: logs, inspect, cancel.",
             },
             "scheduled_job_id": {
                 "type": "string",
+                "description": "Scheduled job ID. Required for: scheduled inspect/delete/suspend/resume.",
             },
             "schedule": {
                 "type": "string",
+                "description": "Cron schedule or preset (@hourly, @daily, @weekly, @monthly). Required for: scheduled run.",
             },
         },
         "required": ["operation"],
agent/tools/plan_tool.py CHANGED
@@ -85,18 +85,11 @@ def get_current_plan() -> List[Dict[str, str]]:
 PLAN_TOOL_SPEC = {
     "name": "plan_tool",
     "description": (
-        "Manage task planning and progress tracking with todo list (pending/in_progress/completed statuses). "
-        "⚠️ CRITICAL: ALWAYS use for multi-step tasks (3+ steps) and MUST update frequently to show progress. "
-        "**Use when:** (1) User provides multiple tasks, (2) Complex workflows (training, evaluation, data processing), "
-        "(3) Tasks requiring multiple tool calls, (4) Need to communicate progress clearly to user, "
-        "(5) Breaking down ambiguous requests into concrete steps. "
-        "**Pattern:** Create plan at start → Mark in_progress when starting task → Mark completed immediately after finishing → User sees clear progress. "
-        "Each call replaces entire plan (full list required). "
-        "**Critical for reliability:** Exactly ONE task in_progress at a time (not zero, not multiple). "
-        "Mark tasks completed IMMEDIATELY after finishing - don't batch completions. "
-        "**For long-running tasks:** Update plan after each major step to keep user informed. "
-        "**Only mark completed when:** Task fully accomplished, no errors, all requirements met. "
-        "Keep tasks pending if blocked/errors occur - create new task to resolve blockers."
     ),
     "parameters": {
         "type": "object",

 PLAN_TOOL_SPEC = {
     "name": "plan_tool",
     "description": (
+        "Track progress on multi-step tasks with a todo list (pending/in_progress/completed).\n\n"
+        "Use for tasks with 3+ steps. Each call replaces the entire plan (send full list).\n\n"
+        "Rules: exactly ONE task in_progress at a time. Mark completed immediately after finishing. "
+        "Only mark completed when the task fully succeeded — keep in_progress if there are errors. "
+        "Update frequently so the user sees progress."
     ),
     "parameters": {
         "type": "object",
agent/tools/sandbox_client.py CHANGED
@@ -83,7 +83,11 @@ USER user

 ENV HOME=/home/user \\
     PATH=/home/user/.local/bin:$PATH \\
-    PIP_USER=1

 WORKDIR /app
 COPY --chown=user . /app

 ENV HOME=/home/user \\
     PATH=/home/user/.local/bin:$PATH \\
+    PIP_USER=1 \\
+    HF_HUB_DISABLE_PROGRESS_BARS=1 \\
+    TQDM_DISABLE=1 \\
+    TRANSFORMERS_VERBOSITY=warning \\
+    HF_HUB_ENABLE_HF_TRANSFER=1

 WORKDIR /app
 COPY --chown=user . /app
agent/tools/sandbox_tool.py CHANGED
@@ -77,17 +77,16 @@ async def _ensure_sandbox(
 SANDBOX_CREATE_TOOL_SPEC = {
     "name": "sandbox_create",
     "description": (
-        "Create a persistent remote Linux sandbox on HF Spaces for interactive development.\n"
-        "YOU MUST DO THIS BEFORE USING bash/read/write/edit tools.\n"
-        "\n"
-        "Spins up a new sandbox with a given hardware tier where you can run commands, read/write/edit files, "
-        "install packages, and debug iteratively. The sandbox persists across tool calls within the session."
-        "\n"
-        "You can choose from the following hardware tiers (GPU is required for model development or other tasks that benefit from and utilize the GPU): "
-        + ", ".join([e.value for e in SpaceHardware])
-        + ".\n"
-        "Use sandbox for: iterative development, debugging, multi-step workflows, testing code.\n"
-        "Use hf_jobs instead for: one-shot batch runs, scheduled tasks, fire-and-forget training.\n"
     ),
     "parameters": {
         "type": "object",

 SANDBOX_CREATE_TOOL_SPEC = {
     "name": "sandbox_create",
     "description": (
+        "Create a persistent remote Linux environment for developing and testing scripts.\n\n"
+        "Workflow: sandbox_create → write script → pip install → test with small run → fix errors → hf_jobs at scale.\n"
+        "The sandbox persists across tool calls within the session. pip install works out of the box.\n\n"
+        "Use this when: you need to develop, test, and iterate on scripts before launching via hf_jobs. "
+        "Especially for training scripts where you need to verify imports, test on a small subset, and fix errors interactively.\n\n"
+        "Skip this when: the task is a simple one-shot operation (status check, resource search, quick data query), "
+        "or the script is copied from a verified working example with minimal changes.\n\n"
+        "For ML code that uses CUDA, bf16, or model loading: use GPU hardware (t4-small minimum). "
+        "CPU sandboxes cannot run GPU code paths — your test will not catch GPU-related errors.\n\n"
+        "Hardware: " + ", ".join([e.value for e in SpaceHardware]) + ".\n"
     ),
     "parameters": {
         "type": "object",