Ashira Pitchayapakayakul committed on
Commit
f89906d
·
1 Parent(s): a12a88b

feat(civo): L40S 48GB training launcher with auto-teardown


User has $250 of Civo credit (expires in 1 month). A single L40S 48GB
GPU is the sweet spot for our hardware-aware picker:

per_gpu_gb=48 ≥ 30 → Qwen2.5-Coder-32B-Instruct (no ZeRO-3 needed)

vs Kaggle's T4×2 at 16GB/card, which OOM'd on the 32B (V4) and forced a
fallback to 14B (V5). The L40S unlocks the proper 32B v1.5 target.
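The picker rule above, as a hedged sketch (the function name and the 14B fallback identifier are illustrative, not the actual picker code):

```shell
# Hypothetical picker sketch: per-GPU VRAM >= 30 GB selects the 32B base,
# anything smaller falls back to 14B (the Kaggle T4 path).
pick_base_model() {
  local per_gpu_gb="$1"
  if [ "$per_gpu_gb" -ge 30 ]; then
    echo "Qwen/Qwen2.5-Coder-32B-Instruct"   # L40S 48GB lands here
  else
    echo "Qwen/Qwen2.5-Coder-14B-Instruct"   # T4 16GB/card fallback
  fi
}

pick_base_model 48   # -> Qwen/Qwen2.5-Coder-32B-Instruct
pick_base_model 16   # -> Qwen/Qwen2.5-Coder-14B-Instruct
```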

Cost projection at ~$1.50-2/hr:
  v1.5 32B SFT      12-15 hr   $24-30
  v1.5 32B SDFT      6-8 hr    $12-16
  Tool-SFT           6-8 hr    $12-16
  Code-DPO           4-6 hr    $8-12
  Bench 3-way        6-8 hr    $12-16
  ──────────────────────────────────
  Subtotal         ~35-45 hr   $70-90
Remaining buffer $160 → v2 72B trial on H100
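Spelled out, the line items sum as follows (the table rounds 34 hr/$68 up to ~35/$70):

```shell
# Sum the low and high ends of each line item, then the leftover credit.
hours_lo=$((12 + 6 + 6 + 4 + 6)); hours_hi=$((15 + 8 + 8 + 6 + 8))
cost_lo=$((24 + 12 + 12 + 8 + 12)); cost_hi=$((30 + 16 + 16 + 12 + 16))
echo "${hours_lo}-${hours_hi} hr"      # -> 34-45 hr
echo "\$${cost_lo}-${cost_hi}"         # -> $68-90
echo "buffer \$$((250 - cost_hi))"     # -> buffer $160
```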

Workflow:
1. Provision L40S instance via civo CLI
2. Cloud-init installs transformers/peft/trl/bnb stack
3. Pulls train.py from axentx/surrogate-1 Space (latest sha)
4. Runs SFTTrainer in tmux session (survives ssh disconnect)
5. Pushes LoRA to axentx/surrogate-1-coder-32B-v1.5 at every save_steps checkpoint
6. Auto-teardown instance when job exits (TEARDOWN=0 to keep alive)
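One subtlety in step 6, shown in isolation: the keep-alive check has to be an explicit comparison, because `${TEARDOWN:+...}` fires for any non-empty value, including 0 (helper name is illustrative):

```shell
# Teardown guard: only TEARDOWN=1 (the default) removes the instance;
# TEARDOWN=0 keeps it alive. An explicit test, not ${TEARDOWN:+...}.
should_teardown() { [ "${TEARDOWN:-1}" = "1" ]; }

TEARDOWN=1; should_teardown && echo "remove instance"
TEARDOWN=0; should_teardown || echo "keep alive"
```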

Discord notification on instance-up so the user knows where to ssh
if they want to watch the loss curve in real time.

Usage:
CIVO_API_KEY=... bash bin/v2/civo-train-launcher.sh

For v2 training, override SHAPE=gpu-h100-80, BASE_MODEL=Qwen/Qwen2.5-72B-Instruct,
and HUB_MODEL_ID=axentx/surrogate-1-coder-72b-v2-sft.
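Every knob lands via the same `${VAR:-default}` pattern, so an exported variable wins and anything unset falls back to the launcher's default; a minimal illustration (the `demo_shape` helper is hypothetical):

```shell
# ${VAR:-default}: use $SHAPE if set and non-empty, else the default.
demo_shape() { echo "${SHAPE:-gpu-l40s-48}"; }

unset SHAPE
demo_shape                            # -> gpu-l40s-48
( SHAPE=gpu-h100-80; demo_shape )     # -> gpu-h100-80
```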

Files changed (1)
  1. bin/v2/civo-train-launcher.sh +132 -0
bin/v2/civo-train-launcher.sh ADDED
@@ -0,0 +1,132 @@
+ #!/usr/bin/env bash
+ # Surrogate-1 — Civo GPU instance launcher for v1.5/v2 training.
+ #
+ # Provisions an L40S 48GB instance on Civo (~$1.50-2/hr), bootstraps the
+ # training stack via cloud-init, runs the train.py embedded in
+ # kaggle-trainer.sh against it, pushes the LoRA back to HF Hub, then tears
+ # the instance down so credit isn't burned overnight.
+ #
+ # Why L40S, not Kaggle T4×2:
+ # - 48GB per card → fits 32B QLoRA on a single GPU (no ZeRO-3 needed)
+ # - 5-10× faster per training step than a T4 (Ada vs Turing)
+ # - $1.50-2/hr vs Kaggle's free-but-9hr-quota sessions
+ # - $250 Civo credit covers ~125 hr of training = full v1.5 chain + v2 trial
+ #
+ # Usage:
+ #   CIVO_API_KEY=... bash civo-train-launcher.sh \
+ #     --shape gpu-l40s-48 \
+ #     --base Qwen/Qwen2.5-Coder-32B-Instruct \
+ #     --hub axentx/surrogate-1-coder-32B-v1.5 \
+ #     --max-samples 100000 --epochs 1
+ #
+ # Auto-teardown after the job completes (set TEARDOWN=0 to keep alive).
+ set -uo pipefail
+
+ CIVO_API_KEY="${CIVO_API_KEY:?need CIVO_API_KEY env}"
+ SHAPE="${SHAPE:-gpu-l40s-48}"
+ REGION="${REGION:-NYC1}"
+ BASE_MODEL="${BASE_MODEL:-Qwen/Qwen2.5-Coder-32B-Instruct}"
+ HUB_MODEL_ID="${HUB_MODEL_ID:-axentx/surrogate-1-coder-32B-v1.5}"
+ MAX_SAMPLES="${MAX_SAMPLES:-100000}"
+ EPOCHS="${EPOCHS:-1}"
+ SEQ_LEN="${SEQ_LEN:-4096}"
+ TEARDOWN="${TEARDOWN:-1}"
+ SSH_KEY="${SSH_KEY:-$HOME/.ssh/surrogate.key}"
+
+ # Resolve HF_TOKEN from env file
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
+ HF_TOKEN_VAL="${HF_TOKEN_PRO_WRITE:-${HF_TOKEN:-}}"  # :- default keeps set -u from aborting here
+ [[ -z "$HF_TOKEN_VAL" ]] && { echo " ❌ no HF_TOKEN — abort"; exit 1; }
+
+ LOG="$HOME/.surrogate/logs/civo-train.log"
+ mkdir -p "$(dirname "$LOG")"
+ log() { echo "[$(date +%H:%M:%S)] $*" | tee -a "$LOG"; }
+
+ log "civo-train-launcher start: $BASE_MODEL → $HUB_MODEL_ID on $SHAPE/$REGION"
+
+ # Use civo CLI with key from env
+ export CIVO_TOKEN="$CIVO_API_KEY"
+ civo apikey save surrogate-1 "$CIVO_API_KEY" >/dev/null 2>&1 || true
+ civo apikey current surrogate-1 >/dev/null 2>&1 || true
+ civo region current "$REGION" >/dev/null 2>&1
+
+ # SSH key (reuse or create)
+ if [[ ! -f "$SSH_KEY" ]]; then
+   mkdir -p "$(dirname "$SSH_KEY")"
+   ssh-keygen -t ed25519 -N "" -f "$SSH_KEY" -C "surrogate-1-civo"
+ fi
+ SSH_KEY_NAME="surrogate-1-key"
+ civo sshkey create "$SSH_KEY_NAME" --key "${SSH_KEY}.pub" --region "$REGION" 2>/dev/null || true
+
+ # Cloud-init: install training stack, run job, push, shut down
+ USER_DATA=$(base64 -w0 <<EOF
+ #cloud-config
+ package_update: true
+ package_upgrade: true
+ packages:
+   - python3-pip
+   - python3-venv
+   - git
+   - tmux
+   - htop
+   - build-essential
+   - nvidia-cuda-toolkit
+ runcmd:
+   # 1. install training deps (specs quoted: an unquoted >= is a shell redirect)
+   - pip install --upgrade pip
+   - pip install 'transformers>=4.46.0,<4.50.0' 'datasets>=3.0.0'
+       'peft>=0.13.0,<0.15.0' 'accelerate>=1.0.0,<1.3.0'
+       'bitsandbytes>=0.44.0' 'trl>=0.12.0,<0.16.0'
+       'huggingface_hub>=0.25.0,<0.27.0'
+   # 2. write env (CIVO_TOKEN included so the instance can remove itself)
+   - echo "HF_TOKEN=${HF_TOKEN_VAL}" >> /etc/environment
+   - echo "CIVO_TOKEN=${CIVO_API_KEY}" >> /etc/environment
+   - echo "BASE_MODEL=${BASE_MODEL}" >> /etc/environment
+   - echo "HUB_MODEL_ID=${HUB_MODEL_ID}" >> /etc/environment
+   - echo "MAX_SAMPLES=${MAX_SAMPLES}" >> /etc/environment
+   - echo "EPOCHS=${EPOCHS}" >> /etc/environment
+   - echo "SEQ_LEN=${SEQ_LEN}" >> /etc/environment
+   - curl -sL https://civo.com/get | sh  # civo CLI on-instance for self-teardown
+   # 3. fetch train.py from axentx Space repo (HF git)
+   - cd /root && git clone --depth 1 https://huggingface.co/spaces/axentx/surrogate-1 src
+   # 4. extract embedded train.py (strip the heredoc marker lines)
+   - sed -n '/cat > "\$WORK_DIR\/train.py"/,/^PYEOF$/p' /root/src/bin/kaggle-trainer.sh | sed '1d;\$d' > /root/train.py
+   # 5. run training in tmux (survives ssh disconnect); teardown only if TEARDOWN=1
+   - su -c "tmux new-session -d -s train 'set -a; source /etc/environment; set +a;
+       python3 /root/train.py 2>&1 | tee /root/train.log;
+       $(if [[ "$TEARDOWN" == 1 ]]; then echo "civo instance remove --region $REGION \$(hostname) --yes"; fi)'" root
+ EOF
+ )
+
+ INSTANCE_NAME="surrogate-train-$(date +%s)"
+ log "creating instance: $INSTANCE_NAME ($SHAPE)"
+
+ INSTANCE_ID=$(civo instance create \
+   --hostname "$INSTANCE_NAME" \
+   --size "$SHAPE" \
+   --diskimage ubuntu-2204 \
+   --network default \
+   --ssh-key "$SSH_KEY_NAME" \
+   --region "$REGION" \
+   --initial-user root \
+   --script <(echo "$USER_DATA" | base64 -d) \
+   --wait \
+   --output id 2>/dev/null | tail -1)  # stderr dropped so a failed create leaves this empty
+
+ if [[ -z "$INSTANCE_ID" ]]; then
+   log "❌ instance create failed"
+   exit 1
+ fi
+
+ log "✓ instance up: $INSTANCE_ID"
+ PUBLIC_IP=$(civo instance show "$INSTANCE_ID" --region "$REGION" --output public_ip 2>/dev/null | tail -1)
+ log "  public IP: $PUBLIC_IP"
+ log "  ssh -i $SSH_KEY root@$PUBLIC_IP"
+ log "  tmux attach -t train  # to watch training"
+ log ""
+ log "Training runs in tmux; the instance tears itself down when the job exits."
+ log "Override: TEARDOWN=0 ${0##*/} ... to keep it alive after training."
+
+ [[ -n "${DISCORD_WEBHOOK:-}" ]] && curl -s -X POST -H "Content-Type: application/json" \
+   -d "{\"content\":\"🚀 Civo training instance up: $INSTANCE_NAME ($SHAPE) IP $PUBLIC_IP\"}" \
+   "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
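The sed step that pulls train.py out of kaggle-trainer.sh is a generic extract-a-heredoc-body pattern: print the marker-to-marker range, then drop the two marker lines themselves. A standalone demo against a toy file (paths are illustrative):

```shell
# Toy outer script embedding a payload in a heredoc, the way
# kaggle-trainer.sh embeds train.py. <<'OUTER' is quoted so $WORK_DIR
# stays literal in the written file.
cat > /tmp/demo-outer.sh <<'OUTER'
echo "before"
cat > "$WORK_DIR/train.py" <<'PYEOF'
print("hello from train.py")
PYEOF
echo "after"
OUTER

# Print the start..end marker range inclusive, then strip the first and
# last line (the markers), leaving only the payload.
sed -n '/cat > "\$WORK_DIR\/train.py"/,/^PYEOF$/p' /tmp/demo-outer.sh \
  | sed '1d;$d' > /tmp/train.py

cat /tmp/train.py   # -> print("hello from train.py")
```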