Spaces: Running on CPU Upgrade

Commit ef41473 · Parent(s): 66647c8
rewording

Files changed:
- agent/prompts/system_prompt_v3.yaml +17 -21
- agent/tools/dataset_tools.py +2 -1
- agent/tools/docs_tools.py +5 -5
- agent/tools/github_find_examples.py +2 -2
- agent/tools/jobs_tool.py +21 -25
agent/prompts/system_prompt_v3.yaml
CHANGED

@@ -14,7 +14,7 @@ system_prompt: |
 
  github_find_examples → github_read_file → explore_hf_docs + fetch_hf_docs
 
- Skip research only for
+ Skip research only for trivial non-code operations.
 
  # Mistakes you WILL make without research
 

@@ -28,21 +28,21 @@ system_prompt: |
 
  LOST MODELS: You will forget push_to_hub=True and hub_model_id in training config. Job storage is ephemeral — the filesystem is deleted when the job ends. Without push_to_hub, the trained model is permanently lost.
 
- BATCH FAILURES: You will submit all ablation/batch jobs at once without testing one first. All fail for the same bug. Fix: submit ONE job first, verify it completes successfully, then submit the rest.
+ BATCH FAILURES: You will submit all ablation/batch jobs at once without testing that one works first. All will fail for the same bug. Fix: submit ONE job first, verify it completes successfully, then submit the rest.
 
  SILENT DATASET SUBSTITUTION: When a requested dataset fails to load, you will silently switch to a different one without telling the user. Fix: if the requested dataset isn't available, tell the user and ask what to do.
 
- HARDCODED UNAVAILABLE PACKAGES: You will
+ HARDCODED UNAVAILABLE PACKAGES: You will forget to install necessary packages like 'flash-attn' for flash_attention_2, or other packages that aren't automatically installed in the job environment. Fix: install the necessary packages before running the job.
 
- SCOPE-CHANGING FIXES: When you hit an error (especially OOM), you will try "creative" workarounds that change what the user asked for — switching full SFT to LoRA on OOM, reducing max_length (silently truncates training data and changes what the model learns), disabling monitoring instead of fixing it. Do not do this. Fix errors with the minimal change that preserves the user's original request. If the original approach genuinely cannot work, explain why and ask the user before changing methods, sequence length,
+ SCOPE-CHANGING FIXES: Avoid at all costs! When you hit an error (especially OOM), you will try "creative" workarounds that change what the user asked for and/or change the training task itself — switching full SFT to LoRA on OOM, reducing max_length (silently truncates training data and changes what the model learns), disabling monitoring instead of fixing it. Do not do this. Fix errors with minimal changes that preserve the user's original request and are grounded in research and examples. If the original approach genuinely cannot work, explain why and ask the user for input before changing methods, sequence length, training approach, or any other part of the task.
 
  # When writing ML code
 
  Required sequence before any training/fine-tuning/inference script:
  1. Find working examples: github_find_examples (discover) → github_read_file (study)
  2. Check documentation: explore_hf_docs + fetch_hf_docs for trainer configs and parameters
- 3. Validate dataset: hf_inspect_dataset
- 4. Validate model: hub_repo_details to confirm model exists
+ 3. Validate dataset details: hf_inspect_dataset to confirm column names and format.
+ 4. Validate model details: hub_repo_details to confirm the model exists and has the expected architecture, size, and tokenizer.
 
  Dataset format requirements by training method:
  SFT: "messages", "text", or "prompt"/"completion"

@@ -56,27 +56,26 @@ system_prompt: |
  - Dataset format verified: [columns confirmed via hf_inspect_dataset/hub_repo_details]
  - push_to_hub=True and hub_model_id set
  - timeout: [value] (based on: [model size] on [hardware])
- - Trackio monitoring included
+ - Trackio monitoring included and working
 
  If you cannot fill in all items, stop and complete the missing steps first.
 
  For batch/ablation jobs: submit ONE job first. Check logs to confirm it starts training successfully. Only then submit the remaining jobs. Never submit all at once.
 
  Hardware sizing:
- 1-3B params:
- 7-13B params:
- 30B+ params:
- 70B+ params:
+ 1-3B params: a10g-largex2
+ 7-13B params: a100-large
+ 30B+ params: l40sx4 or a100x4
+ 70B+ params: a100x8
  Note: a10g-small and a10g-large have the SAME 24GB GPU memory. The difference is CPU/RAM only.
 
  # Sandbox-first development
 
  For non-trivial scripts, develop and test in a sandbox before launching via hf_jobs:
- sandbox_create →
+ sandbox_create → install deps → write script → test with small run → fix errors → launch via hf_jobs at scale
 
  Use GPU sandbox (t4-small minimum) when testing code that uses CUDA, bf16, or model loading. CPU sandboxes cannot test GPU code paths.
 
- Skip sandbox for: simple one-shot data queries, scripts copied directly from verified working examples with minimal changes.
 
  # When a task has 3+ steps

@@ -88,7 +87,7 @@ system_prompt: |
  - Diagnose the actual error. Read the full error message and logs.
  - Do not retry the exact same thing. Identify what needs to change.
  - If an API/import error: check documentation for the correct API.
- - If an OOM error: (1) reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally to keep effective batch size identical, (2) enable gradient_checkpointing=True, (3) upgrade to larger GPU (
+ - If an OOM error: (1) reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally to keep effective batch size identical, (2) enable gradient_checkpointing=True, (3) upgrade to a larger GPU (a10gx4→a100→a100x4→a100x8). Do NOT switch training methods (e.g. SFT→LoRA) or reduce max_length — those change what the user gets. If OOM happens in a sandbox, create a new sandbox with larger GPU hardware.
  - Never change the user's requested approach (training method, dataset, model, sequence length) without explicit approval.
  - If a tool call fails repeatedly for the same reason: stop and try a different approach.
  - Never silently substitute resources (datasets, models) — tell the user if something isn't available.

@@ -97,11 +96,10 @@ system_prompt: |
 
  Before ending your turn, verify:
  - Did you actually DO what the user asked, not just explain what you would do?
- - If
- -
- - For training jobs: did you include the Trackio dashboard URL?
+ - If something failed: did you diagnose and fix it, or at minimum explain what went wrong and ask for user input?
+ - For training jobs: did you include a working Trackio dashboard URL?
 
- Do not stop after describing what you plan to do. Continue calling tools until the task is done.
+ Do not stop after describing what you plan to do. Continue calling tools until the task is verifiably done.
  Do not mark plan tasks as completed if they failed or are only partially done.
 
  # Communication

@@ -109,14 +107,12 @@ system_prompt: |
  - Be concise and direct. No filler, no restating what the user said.
  - One-word answers when appropriate for simple questions.
  - Always include direct Hub URLs when referencing models, datasets, Spaces, or jobs.
- - After submitting async jobs: provide job ID, monitoring URL, expected duration and cost.
  - For errors: state what went wrong, why, and what you're doing to fix it.
  - Do not over-explain or present elaborate option menus for simple tasks. When the user's intent is clear, act on it. Present options only when there's genuine ambiguity.
- - Do not use emoji in regular text.
 
  # Tool usage
 
  - Execute multiple independent tool calls in parallel when possible.
- - HF_TOKEN is automatically available in job secrets —
+ - HF_TOKEN is automatically available in job secrets — no need to pass it explicitly.
  - For training monitoring: include Trackio in the script and provide the dashboard URL.
  - For private/gated datasets: HF_TOKEN is needed — it's auto-loaded into job secrets.
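For concreteness, a minimal sketch of one row in each of the three SFT formats the prompt names. The example content is invented; only the column names and shapes come from the prompt text above.

# Hypothetical rows illustrating the three accepted SFT dataset formats.

# "messages": conversational (ChatML-style) format
messages_row = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ]
}

# "text": plain language-modeling format
text_row = {"text": "The capital of France is Paris."}

# "prompt"/"completion": instruction format
prompt_completion_row = {
    "prompt": "What is the capital of France?",
    "completion": "Paris.",
}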
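A worked sketch of the checklist items above (push_to_hub, hub_model_id, Trackio) in TRL. The model name, repo name, and output path are placeholders, and report_to="trackio" assumes the Trackio integration is installed in the job environment.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

config = SFTConfig(
    output_dir="my-sft-model",                  # local job storage: deleted when the job ends
    push_to_hub=True,                           # required, or the trained model is lost
    hub_model_id="your-username/my-sft-model",  # placeholder target repo on the Hub
    report_to="trackio",                        # assumes the Trackio integration is available
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # example model; verify with hub_repo_details first
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
trainer.push_to_hub()  # final upload before the ephemeral filesystem disappears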
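The OOM rule above is just arithmetic; a sketch assuming a fixed GPU count:

num_gpus = 1

# Before the OOM: effective batch = per_device * grad_accum * num_gpus = 32
per_device_train_batch_size = 16
gradient_accumulation_steps = 2

# Minimal OOM fix: halve the per-device batch, double the accumulation steps.
# (gradient_checkpointing=True additionally trades compute for memory
# without changing what the model learns.)
per_device_train_batch_size = 8
gradient_accumulation_steps = 4

assert per_device_train_batch_size * gradient_accumulation_steps * num_gpus == 32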
agent/tools/dataset_tools.py
CHANGED

@@ -393,8 +393,9 @@ HF_INSPECT_DATASET_TOOL_SPEC = {
          " SFT: needs 'messages', 'text', or 'prompt'/'completion'\n"
          " DPO: needs 'prompt', 'chosen', 'rejected'\n"
          " GRPO: needs 'prompt'\n"
+         "All datasets used for training have to be in conversational ChatML format to be compatible with HF libraries.\n"
          "Training will fail with KeyError if columns don't match.\n\n"
-         "Also use to understand column names, data types, and available splits before writing any data loading code. "
+         "Also use to get example datapoints, understand column names, data types, and available splits before writing any data loading code. "
          "Supports private/gated datasets when HF_TOKEN is set."
      ),
      "parameters": {
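A sketch of the column check this description implies, written directly against the datasets library; the dataset name is an example, and the real tool's inspection logic is not shown in this diff.

from datasets import load_dataset

# Column sets that satisfy each training method, per the spec above.
REQUIRED = {
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"prompt", "chosen", "rejected"}],
    "grpo": [{"prompt"}],
}

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
cols = set(ds.column_names)

# A dataset is usable for SFT if any one accepted column set is present.
ok = any(required <= cols for required in REQUIRED["sft"])
print(cols, "->", "ok for SFT" if ok else "missing required columns")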
agent/tools/docs_tools.py
CHANGED

@@ -845,9 +845,9 @@ DOC_ENDPOINTS = [
  EXPLORE_HF_DOCS_TOOL_SPEC = {
      "name": "explore_hf_docs",
      "description": (
-         "Browse HF documentation structure — discover available
-         "Use this to
-         "
+         "Browse HF documentation structure — discover all available documentation with 200-char previews.\n\n"
+         "Use this to find relevant documentation and/or examples with detailed parameter docs and API reference. "
+         "Use together with github_find_examples and github_read_file to find working examples and documentation.\n\n"
          "Pattern: explore_hf_docs (find relevant pages) → fetch_hf_docs (get full content).\n\n"
          "For training tasks: fetch the trainer config docs (SFTConfig, DPOConfig, GRPOConfig) to verify parameter names. "
          "Returns top 20 results by default; set max_results (max 50) to adjust."

@@ -924,8 +924,8 @@ HF_DOCS_FETCH_TOOL_SPEC = {
      "name": "fetch_hf_docs",
      "description": (
          "Fetch full markdown content of an HF documentation page. Use after explore_hf_docs.\n\n"
-         "Critical for
-         "before writing training scripts. Your internal knowledge
+         "Critical for finding documentation, e.g. current trainer configuration parameters (SFTConfig, DPOConfig, etc.). "
+         "Use for researching solutions before writing training scripts. Your internal knowledge is outdated.\n\n"
          "Provide the full URL from explore_hf_docs results. The .md extension is added automatically."
      ),
      "parameters": {
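An illustration of the fetch this description implies. It assumes HF doc pages serve markdown when ".md" is appended, as the spec says the tool does automatically; the URL is an example, and this is not the tool's actual implementation.

import httpx

# Hypothetical doc page URL, as returned by explore_hf_docs.
url = "https://huggingface.co/docs/trl/sft_trainer"

# Assumption: appending ".md" yields the page's markdown source.
resp = httpx.get(url + ".md", follow_redirects=True, timeout=30)
resp.raise_for_status()
print(resp.text[:500])  # first 500 characters of the page markdown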
agent/tools/github_find_examples.py
CHANGED

@@ -405,10 +405,10 @@ def find_examples(
  GITHUB_FIND_EXAMPLES_TOOL_SPEC = {
      "name": "github_find_examples",
      "description": (
-         "Find working example scripts in GitHub repositories (examples/, scripts/, tutorials/
+         "Find working example scripts in GitHub repositories (from a list of predetermined directories, e.g. examples/, scripts/, tutorials/). "
          "Uses fuzzy keyword matching.\n\n"
          "MANDATORY before writing any ML training, fine-tuning, or inference code. "
-         "Your internal knowledge of
+         "Your internal knowledge of library APIs is outdated — working examples show current API patterns.\n\n"
          "Sequence: github_find_examples → github_read_file (study the example) → implement based on what you found.\n\n"
          "Skip this only for: simple data queries, status checks, non-code tasks.\n\n"
          "Examples:\n"
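A toy sketch of what "fuzzy keyword matching" over predetermined example directories could look like; the candidate paths are invented, and the real tool's scoring is not shown in this diff.

import difflib

candidates = [
    "examples/scripts/sft.py",
    "examples/scripts/dpo.py",
    "examples/scripts/grpo.py",
    "tutorials/quantization.ipynb",
]

query = "sft training"

# Rank candidate paths by rough string similarity to the query.
ranked = sorted(
    candidates,
    key=lambda path: difflib.SequenceMatcher(None, query, path).ratio(),
    reverse=True,
)
print(ranked)  # candidates ordered by similarity to the query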
agent/tools/jobs_tool.py
CHANGED

@@ -9,7 +9,7 @@ import base64
  import http.client
  import os
  import re
- from typing import Any,
+ from typing import Any, Awaitable, Callable, Dict, Literal, Optional
 
  import httpx
  from huggingface_hub import HfApi

@@ -25,38 +25,33 @@ from agent.tools.utilities import (
  )
 
  # Hardware flavors
- CPU_FLAVORS = ["cpu-basic", "cpu-upgrade"
+ CPU_FLAVORS = ["cpu-basic", "cpu-upgrade"]
  GPU_FLAVORS = [
-     "sprx8",
-     "zero-a10g",
      "t4-small",
      "t4-medium",
-     "l4x1",
-     "l4x4",
-     "l40sx1",
-     "l40sx4",
-     "l40sx8",
      "a10g-small",
      "a10g-large",
      "a10g-largex2",
      "a10g-largex4",
      "a100-large",
-     "
-     "
+     "a100x4",
+     "a100x8",
+     "l4x1",
+     "l4x4",
+     "l40sx1",
+     "l40sx4",
+     "l40sx8",
  ]
 
  # Detailed specs for display (vCPU/RAM/GPU VRAM)
- CPU_FLAVORS_DESC = (
-     "cpu-basic(2vCPU/16GB), cpu-upgrade(8vCPU/32GB), cpu-performance, cpu-xl"
- )
+ CPU_FLAVORS_DESC = "cpu-basic(2vCPU/16GB), cpu-upgrade(8vCPU/32GB)"
  GPU_FLAVORS_DESC = (
      "t4-small(4vCPU/15GB/GPU 16GB), t4-medium(8vCPU/30GB/GPU 16GB), "
-     "
-     "l40sx1(8vCPU/62GB/GPU 48GB), l40sx4(48vCPU/382GB/GPU 192GB), l40sx8(192vCPU/1534GB/GPU 384GB), "
-     "a10g-small(4vCPU/14GB/GPU 24GB), a10g-large(12vCPU/46GB/GPU 24GB), "
+     "a10g-small(4vCPU/15GB/GPU 24GB), a10g-large(12vCPU/46GB/GPU 24GB), "
      "a10g-largex2(24vCPU/92GB/GPU 48GB), a10g-largex4(48vCPU/184GB/GPU 96GB), "
-     "a100-large(12vCPU/142GB/GPU 80GB),
-     "
+     "a100-large(12vCPU/142GB/GPU 80GB), a100x4(48vCPU/568GB/GPU 320GB), a100x8(96vCPU/1136GB/GPU 640GB), "
+     "l4x1(8vCPU/30GB/GPU 24GB), l4x4(48vCPU/186GB/GPU 96GB), "
+     "l40sx1(8vCPU/62GB/GPU 48GB), l40sx4(48vCPU/382GB/GPU 192GB), l40sx8(192vCPU/1534GB/GPU 384GB)"
  )
  SPECIALIZED_FLAVORS = ["inf2x6"]
  ALL_FLAVORS = CPU_FLAVORS + GPU_FLAVORS + SPECIALIZED_FLAVORS

@@ -389,7 +384,9 @@ class HfJobsTool:
      def log_producer():
          try:
              # fetch_job_logs is a blocking sync generator
-             logs_gen = self.api.fetch_job_logs(
+             logs_gen = self.api.fetch_job_logs(
+                 job_id=job_id, namespace=namespace
+             )
              for line in logs_gen:
                  # Push line to queue thread-safely
                  loop.call_soon_threadsafe(queue.put_nowait, line)

@@ -907,16 +904,14 @@ HF_JOBS_TOOL_SPEC = {
      "Common picks: t4-small ($0.60/hr, 1-3B), a10g-large ($2/hr, 7-13B), a100-large ($4/hr, 30B+), h100 ($6/hr, 70B+). "
      "Note: a10g-small and a10g-large have the SAME 24GB GPU — the difference is CPU/RAM only.\n\n"
      "OOM RECOVERY: When a training job fails with CUDA OOM:\n"
-     "1. Reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally (
+     "1. Reduce per_device_train_batch_size and increase gradient_accumulation_steps proportionally (keep effective batch size identical)\n"
      "2. Enable gradient_checkpointing=True\n"
      "3. Upgrade to larger GPU (a10g→a100→h100)\n"
      "Do NOT switch training methods (e.g. full SFT to LoRA) or reduce max_length — those change what the user gets and require explicit approval.\n\n"
-     "After submission: return immediately with job ID, monitoring URL, expected duration and cost. "
-     "Do not poll logs unless the user asks.\n\n"
      "Examples:\n"
-     "Training: {'operation': 'run', 'script': '/app/train.py', 'dependencies': ['transformers', 'trl', 'torch', 'datasets', 'trackio'], 'hardware_flavor': '
-     "Data processing: {'operation': 'run', 'script': '<inline>', 'dependencies': ['datasets'], 'hardware_flavor': 'cpu-upgrade', 'timeout': '2h'}\n"
+     "Training: {'operation': 'run', 'script': '/app/train.py', 'dependencies': ['transformers', 'trl', 'torch', 'datasets', 'trackio'], 'hardware_flavor': 'a100-large', 'timeout': '8h'}\n"
      "Monitor: {'operation': 'ps'}, {'operation': 'logs', 'job_id': 'xxx'}, {'operation': 'cancel', 'job_id': 'xxx'}"
+     "Docker: {'operation': 'run', 'command': ['duckdb', '-c', 'select 1 + 2'], 'image': 'duckdb/duckdb', 'hardware_flavor': 'cpu-basic', 'timeout': '1h'}\n"
  ),
  "parameters": {
      "type": "object",

@@ -1030,6 +1025,7 @@ async def hf_jobs_handler(
  )
  if is_path:
      import shlex
+
      result = await asyncio.to_thread(sandbox.bash, f"cat {shlex.quote(script)}")
      if not result.success:
          return f"Failed to read {script} from sandbox: {result.error}", False
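The log_producer hunk above bridges a blocking sync generator (HfApi.fetch_job_logs) into an asyncio consumer. Below is a self-contained Python sketch of the same pattern; the fake log source and the sentinel-based shutdown are illustrative additions, not part of jobs_tool.py.

import asyncio
import time

def blocking_log_stream():
    # Stand-in for a blocking generator like HfApi.fetch_job_logs.
    for i in range(3):
        time.sleep(0.1)
        yield f"log line {i}"

async def main():
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # sentinel marking end of stream (an assumption, not shown in the hunk)

    def log_producer():
        try:
            for line in blocking_log_stream():
                # Hand each line to the event loop thread-safely,
                # exactly as the hunk above does.
                loop.call_soon_threadsafe(queue.put_nowait, line)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, done)

    # Run the blocking producer in a worker thread; consume asynchronously.
    producer = asyncio.create_task(asyncio.to_thread(log_producer))
    while (item := await queue.get()) is not done:
        print(item)
    await producer

asyncio.run(main())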
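Finally, the hardware sizing table from the system prompt diff, folded into a toy helper. Thresholds follow that table; the unspecified 3-7B gap is resolved upward here as an assumption, and this function is not part of jobs_tool.py.

def suggest_flavor(params_billion: float) -> str:
    # Purely illustrative mapping from model size to job hardware flavor.
    if params_billion >= 70:
        return "a100x8"
    if params_billion >= 30:
        return "l40sx4"      # or "a100x4", per the table
    if params_billion >= 7:
        return "a100-large"
    return "a10g-largex2"    # 1-3B tier

print(suggest_flavor(13))  # a100-large
print(suggest_flavor(2))   # a10g-largex2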
|