Ashira Pitchayapakayakul committed on
Commit
f89906d
·
1 Parent(s): a12a88b

feat(civo): L40S 48GB training launcher with auto-teardown


User has $250 of Civo credit (expires in 1 month). A single L40S 48GB
GPU is the sweet spot for our hardware-aware picker:

per_gpu_gb=48 ≥ 30 → Qwen2.5-Coder-32B-Instruct (no ZeRO-3 needed)

vs Kaggle's T4×2 at 16GB/card, which OOM'd on the 32B (V4) and forced a
fallback to 14B (V5). The L40S unlocks the proper 32B v1.5 target.
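The picker rule above, as a hedged sketch (the function name and the 14B fallback identifier are illustrative, not the actual picker code):

```shell
# Hypothetical picker sketch: per-GPU VRAM >= 30 GB selects the 32B base,
# anything smaller falls back to 14B (the Kaggle T4 path).
pick_base_model() {
  local per_gpu_gb="$1"
  if [ "$per_gpu_gb" -ge 30 ]; then
    echo "Qwen/Qwen2.5-Coder-32B-Instruct"   # L40S 48GB lands here
  else
    echo "Qwen/Qwen2.5-Coder-14B-Instruct"   # T4 16GB/card fallback
  fi
}

pick_base_model 48   # -> Qwen/Qwen2.5-Coder-32B-Instruct
pick_base_model 16   # -> Qwen/Qwen2.5-Coder-14B-Instruct
```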

Cost projection at ~$1.50-2/hr:
  v1.5 32B SFT      12-15 hr   $24-30
  v1.5 32B SDFT      6-8 hr    $12-16
  Tool-SFT           6-8 hr    $12-16
  Code-DPO           4-6 hr    $8-12
  Bench 3-way        6-8 hr    $12-16
  ──────────────────────────────────
  Subtotal         ~35-45 hr   $70-90
Remaining buffer $160 → v2 72B trial on H100
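Spelled out, the line items sum as follows (the table rounds 34 hr/$68 up to ~35/$70):

```shell
# Sum the low and high ends of each line item, then the leftover credit.
hours_lo=$((12 + 6 + 6 + 4 + 6)); hours_hi=$((15 + 8 + 8 + 6 + 8))
cost_lo=$((24 + 12 + 12 + 8 + 12)); cost_hi=$((30 + 16 + 16 + 12 + 16))
echo "${hours_lo}-${hours_hi} hr"      # -> 34-45 hr
echo "\$${cost_lo}-${cost_hi}"         # -> $68-90
echo "buffer \$$((250 - cost_hi))"     # -> buffer $160
```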

Workflow:
1. Provision L40S instance via civo CLI
2. Cloud-init installs transformers/peft/trl/bnb stack
3. Pulls train.py from axentx/surrogate-1 Space (latest sha)
4. Runs SFTTrainer in tmux session (survives ssh disconnect)
5. Pushes LoRA to axentx/surrogate-1-coder-32B-v1.5 at every save_steps checkpoint
6. Auto-teardown instance when job exits (TEARDOWN=0 to keep alive)
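One subtlety in step 6, shown in isolation: the keep-alive check has to be an explicit comparison, because `${TEARDOWN:+...}` fires for any non-empty value, including 0 (helper name is illustrative):

```shell
# Teardown guard: only TEARDOWN=1 (the default) removes the instance;
# TEARDOWN=0 keeps it alive. An explicit test, not ${TEARDOWN:+...}.
should_teardown() { [ "${TEARDOWN:-1}" = "1" ]; }

TEARDOWN=1; should_teardown && echo "remove instance"
TEARDOWN=0; should_teardown || echo "keep alive"
```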

Discord notification on instance-up so the user knows where to ssh
if they want to watch the loss curve in real time.

Usage:
CIVO_API_KEY=... bash bin/v2/civo-train-launcher.sh

For v2 training, override SHAPE=gpu-h100-80, BASE_MODEL=Qwen/Qwen2.5-72B-Instruct,
and HUB_MODEL_ID=axentx/surrogate-1-coder-72b-v2-sft.
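Every knob lands via the same `${VAR:-default}` pattern, so an exported variable wins and anything unset falls back to the launcher's default; a minimal illustration (the `demo_shape` helper is hypothetical):

```shell
# ${VAR:-default}: use $SHAPE if set and non-empty, else the default.
demo_shape() { echo "${SHAPE:-gpu-l40s-48}"; }

unset SHAPE
demo_shape                            # -> gpu-l40s-48
( SHAPE=gpu-h100-80; demo_shape )     # -> gpu-h100-80
```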

Files changed (1)
  1. bin/v2/civo-train-launcher.sh +132 -0
bin/v2/civo-train-launcher.sh ADDED
@@ -0,0 +1,132 @@
+ #!/usr/bin/env bash
+ # Surrogate-1 — Civo GPU instance launcher for v1.5/v2 training.
+ #
+ # Provisions an L40S 48GB instance on Civo (~$1.50-2/hr), bootstraps the
+ # training stack via cloud-init, runs the train.py embedded in
+ # kaggle-trainer.sh against it, pushes the LoRA back to HF Hub, then tears
+ # the instance down so credit isn't burned overnight.
+ #
+ # Why L40S, not Kaggle T4×2:
+ # - 48GB per card → fits 32B QLoRA on a single GPU (no ZeRO-3 needed)
+ # - 5-10× faster per training step than a T4 (Ada vs Turing)
+ # - $1.50-2/hr vs Kaggle's free-but-9hr-quota sessions
+ # - $250 Civo credit covers ~125 hr of training = full v1.5 chain + v2 trial
+ #
+ # Usage:
+ #   CIVO_API_KEY=... bash civo-train-launcher.sh \
+ #     --shape gpu-l40s-48 \
+ #     --base Qwen/Qwen2.5-Coder-32B-Instruct \
+ #     --hub axentx/surrogate-1-coder-32B-v1.5 \
+ #     --max-samples 100000 --epochs 1
+ #
+ # Auto-teardown after the job completes (set TEARDOWN=0 to keep alive).
+ set -uo pipefail
+
+ CIVO_API_KEY="${CIVO_API_KEY:?need CIVO_API_KEY env}"
+ SHAPE="${SHAPE:-gpu-l40s-48}"
+ REGION="${REGION:-NYC1}"
+ BASE_MODEL="${BASE_MODEL:-Qwen/Qwen2.5-Coder-32B-Instruct}"
+ HUB_MODEL_ID="${HUB_MODEL_ID:-axentx/surrogate-1-coder-32B-v1.5}"
+ MAX_SAMPLES="${MAX_SAMPLES:-100000}"
+ EPOCHS="${EPOCHS:-1}"
+ SEQ_LEN="${SEQ_LEN:-4096}"
+ TEARDOWN="${TEARDOWN:-1}"
+ SSH_KEY="${SSH_KEY:-$HOME/.ssh/surrogate.key}"
+
+ # Resolve HF_TOKEN from env file
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
+ HF_TOKEN_VAL="${HF_TOKEN_PRO_WRITE:-${HF_TOKEN:-}}"  # :- default keeps set -u from aborting here
+ [[ -z "$HF_TOKEN_VAL" ]] && { echo " ❌ no HF_TOKEN — abort"; exit 1; }
+
+ LOG="$HOME/.surrogate/logs/civo-train.log"
+ mkdir -p "$(dirname "$LOG")"
+ log() { echo "[$(date +%H:%M:%S)] $*" | tee -a "$LOG"; }
+
+ log "civo-train-launcher start: $BASE_MODEL → $HUB_MODEL_ID on $SHAPE/$REGION"
+
+ # Use civo CLI with key from env
+ export CIVO_TOKEN="$CIVO_API_KEY"
+ civo apikey save surrogate-1 "$CIVO_API_KEY" >/dev/null 2>&1 || true
+ civo apikey current surrogate-1 >/dev/null 2>&1 || true
+ civo region current "$REGION" >/dev/null 2>&1
+
+ # SSH key (reuse or create)
+ if [[ ! -f "$SSH_KEY" ]]; then
+   mkdir -p "$(dirname "$SSH_KEY")"
+   ssh-keygen -t ed25519 -N "" -f "$SSH_KEY" -C "surrogate-1-civo"
+ fi
+ SSH_KEY_NAME="surrogate-1-key"
+ civo sshkey create "$SSH_KEY_NAME" --key "${SSH_KEY}.pub" --region "$REGION" 2>/dev/null || true
+
+ # Cloud-init: install training stack, run job, push, shut down
+ USER_DATA=$(base64 -w0 <<EOF
+ #cloud-config
+ package_update: true
+ package_upgrade: true
+ packages:
+   - python3-pip
+   - python3-venv
+   - git
+   - tmux
+   - htop
+   - build-essential
+   - nvidia-cuda-toolkit
+ runcmd:
+   # 1. install training deps (specs quoted: an unquoted >= is a shell redirect)
+   - pip install --upgrade pip
+   - pip install 'transformers>=4.46.0,<4.50.0' 'datasets>=3.0.0'
+       'peft>=0.13.0,<0.15.0' 'accelerate>=1.0.0,<1.3.0'
+       'bitsandbytes>=0.44.0' 'trl>=0.12.0,<0.16.0'
+       'huggingface_hub>=0.25.0,<0.27.0'
+   # 2. write env (CIVO_TOKEN included so the instance can remove itself)
+   - echo "HF_TOKEN=${HF_TOKEN_VAL}" >> /etc/environment
+   - echo "CIVO_TOKEN=${CIVO_API_KEY}" >> /etc/environment
+   - echo "BASE_MODEL=${BASE_MODEL}" >> /etc/environment
+   - echo "HUB_MODEL_ID=${HUB_MODEL_ID}" >> /etc/environment
+   - echo "MAX_SAMPLES=${MAX_SAMPLES}" >> /etc/environment
+   - echo "EPOCHS=${EPOCHS}" >> /etc/environment
+   - echo "SEQ_LEN=${SEQ_LEN}" >> /etc/environment
+   - curl -sL https://civo.com/get | sh  # civo CLI on-instance for self-teardown
+   # 3. fetch train.py from axentx Space repo (HF git)
+   - cd /root && git clone --depth 1 https://huggingface.co/spaces/axentx/surrogate-1 src
+   # 4. extract embedded train.py (strip the heredoc marker lines)
+   - sed -n '/cat > "\$WORK_DIR\/train.py"/,/^PYEOF$/p' /root/src/bin/kaggle-trainer.sh | sed '1d;\$d' > /root/train.py
+   # 5. run training in tmux (survives ssh disconnect); teardown only if TEARDOWN=1
+   - su -c "tmux new-session -d -s train 'set -a; source /etc/environment; set +a;
+       python3 /root/train.py 2>&1 | tee /root/train.log;
+       $(if [[ "$TEARDOWN" == 1 ]]; then echo "civo instance remove --region $REGION \$(hostname) --yes"; fi)'" root
+ EOF
+ )
+
+ INSTANCE_NAME="surrogate-train-$(date +%s)"
+ log "creating instance: $INSTANCE_NAME ($SHAPE)"
+
+ INSTANCE_ID=$(civo instance create \
+   --hostname "$INSTANCE_NAME" \
+   --size "$SHAPE" \
+   --diskimage ubuntu-2204 \
+   --network default \
+   --ssh-key "$SSH_KEY_NAME" \
+   --region "$REGION" \
+   --initial-user root \
+   --script <(echo "$USER_DATA" | base64 -d) \
+   --wait \
+   --output id 2>/dev/null | tail -1)  # stderr dropped so a failed create leaves this empty
+
+ if [[ -z "$INSTANCE_ID" ]]; then
+   log "❌ instance create failed"
+   exit 1
+ fi
+
+ log "✓ instance up: $INSTANCE_ID"
+ PUBLIC_IP=$(civo instance show "$INSTANCE_ID" --region "$REGION" --output public_ip 2>/dev/null | tail -1)
+ log "  public IP: $PUBLIC_IP"
+ log "  ssh -i $SSH_KEY root@$PUBLIC_IP"
+ log "  tmux attach -t train  # to watch training"
+ log ""
+ log "Training runs in tmux; the instance tears itself down when the job exits."
+ log "Override: TEARDOWN=0 ${0##*/} ... to keep it alive after training."
+
+ [[ -n "${DISCORD_WEBHOOK:-}" ]] && curl -s -X POST -H "Content-Type: application/json" \
+   -d "{\"content\":\"🚀 Civo training instance up: $INSTANCE_NAME ($SHAPE) IP $PUBLIC_IP\"}" \
+   "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
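The sed step that pulls train.py out of kaggle-trainer.sh is a generic extract-a-heredoc-body pattern: print the marker-to-marker range, then drop the two marker lines themselves. A standalone demo against a toy file (paths are illustrative):

```shell
# Toy outer script embedding a payload in a heredoc, the way
# kaggle-trainer.sh embeds train.py. <<'OUTER' is quoted so $WORK_DIR
# stays literal in the written file.
cat > /tmp/demo-outer.sh <<'OUTER'
echo "before"
cat > "$WORK_DIR/train.py" <<'PYEOF'
print("hello from train.py")
PYEOF
echo "after"
OUTER

# Print the start..end marker range inclusive, then strip the first and
# last line (the markers), leaving only the payload.
sed -n '/cat > "\$WORK_DIR\/train.py"/,/^PYEOF$/p' /tmp/demo-outer.sh \
  | sed '1d;$d' > /tmp/train.py

cat /tmp/train.py   # -> print("hello from train.py")
```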