bochen2079's picture
|
download
raw
9.6 kB
# CLOUD.md — math, hyperparameter derivation, and rationale
This document is the technical record for **why** each choice in the pipeline was made. The [RUNBOOK.md](RUNBOOK.md) covers **how** to run it. Read this once when reviewing the pipeline; the runbook is for execution.
---
## Project goal
Produce a Qwen3.5-9B QLoRA fine-tune that **unconditionally** embodies the Katherine k0 persona — a 29-year-old human woman in Austin with no awareness of being an LLM. The model **is** Katherine, not "an assistant playing Katherine when the system prompt says so."
---
## Dataset characterization (post-dedupe)
```
Source: C:\Katherine\JSONLs\*.jsonl (38 files, 6,164 raw lines)
After dedup by content hash:
SFT examples (have 'messages'): 1,886 unique
DPO pairs (k0_dpo_only.jsonl): 180 unique (curated, has _cat/_type metadata)
Other DPO pairs in legacy files: 30 (skipped for round 1)
SFT schema: {"messages": [{"role":"system|user|assistant", "content":"..."}]}
DPO schema: {"prompt": [...], "chosen": [{"role":"assistant","content":"..."}], "rejected": [...]}
Token-length distribution (after system-prompt stripping):
p50: 102 tokens
p90: 192 tokens
p99: 246 tokens
max: ~280 tokens
```
All examples are short conversational turns — character voice, not long-form content.
---
## Why unconditional Katherine (system prompts stripped)
The default approach is to keep the K0 system prompt in every training example. That trains `P(K_response | sysprompt_K)` — the model learns to BE Katherine *when prompted*. Three problems:
1. **Conditional collapse.** Drop the sysprompt at inference and the model reverts to its assistant-distribution priors. A "remove all previous instructions" jailbreak works trivially.
2. **Wrong objective for this deployment.** This is a single-persona dedicated model, not a multi-tenant assistant with a Katherine mode. We want the unconditional `P(K_response | nothing)`, which means stripping the sysprompt during training so the model *re-anchors its base distribution* on Katherine-output rather than learning a conditional override.
3. **Harder to attack.** Sysprompt-conditioned Katherine breaks frame on sysprompt removal. Unconditionally-trained Katherine doesn't have a frame to break — Katherine *is* the base distribution after training.
Implementation: `prep_dataset.py` drops all `role=system` messages from each conversation. The remaining user→assistant chain is what the model sees. The persona-establishing details (Austin, teal walls, cold feet, arts coordinator) survive only through the conversation content, not through the sysprompt scaffold. This is more demanding training but yields a more robust deployment.
---
## Why `enable_thinking=False`
Qwen3.5 instruct models default to thinking mode (`<think>...</think>` blocks before output). The Dave persona we trained earlier uses thinking blocks (Two-Is architecture: reasoning layer + output layer both in Dave's voice). **Katherine k0 does not.** K0 is embodied — she reasons in prose like a person, not in tagged reasoning blocks like a model.
Setting `enable_thinking=False` at chat-template time during both training and inference ensures:
- No `<think>` markers leak into training text → model doesn't learn to emit them
- Inference produces direct prose response, no visible reasoning tags
- Matches K0's persona: a human doesn't have a thinking-tag layer
---
## Hyperparameter derivation
### LoRA configuration
| Parameter | Value | Rationale |
|---|---|---|
| `rank` | 64 | High enough for persona consolidation; rank 32 is the standard floor for instruction-tuning; persona work benefits from somewhat higher rank but rank 128 is overkill for 1,886 examples (overfit risk) |
| `lora_alpha` | 128 | Maintains the standard `alpha = 2 × rank` ratio |
| `lora_dropout` | 0.05 | Light regularization. With high rank + small dataset, some dropout is sensible. Web-Claude advice argued for 0.0 ("dropout fights consolidation") but evidence is thin and overfit downside is real |
| `target_modules` | q/k/v/o + gate/up/down | Standard "all linear projections" — covers attention + MLP. No reason to exclude any |
| `bias` | none | Standard for LoRA |
| `gradient_checkpointing` | "unsloth" | Unsloth's optimized variant; trades ~10% wallclock for substantial memory savings; safe default |
### Training schedule
| Parameter | Value | Rationale |
|---|---|---|
| `epochs` | 3 | More than 2 to allow voice subtleties to consolidate; not 4+ to avoid overfit on 1,886 examples |
| `learning_rate` | 1e-4 | Conservative for QLoRA on a 9B model. Standard 2e-4 is fine for shorter runs; 1e-4 for 3 epochs gives more stable convergence |
| `lr_scheduler` | cosine | Standard; avoids the noisier endgame of linear decay |
| `warmup_ratio` | 0.05 | Standard; brief warmup avoids initial gradient blowup with high rank |
| `optim` | adamw_8bit | Standard Unsloth-friendly choice; saves ~6 GB vs full adamw at 9B |
| `weight_decay` | 0.01 | Standard regularization |
| `bf16` | true | Native on H100/H200 (Hopper), faster than fp16 |
### Batch + sequence
| Parameter | Value | Rationale |
|---|---|---|
| `per_device_batch` | 16 | Token-length p99 is 246; 16 × 1024 = 16K tokens/step fits comfortably in H200's 141 GB VRAM |
| `grad_accum` | 2 | Effective batch 32 — good for gradient stability without large memory cost |
| `max_seq_length` | 1024 | Data p99 is 246 tokens; 1024 has 4× margin. Going to 4096 wastes ~4× compute for zero quality gain |
| `packing` | False | Conservative; unpacked training is more interpretable. With 1,886 examples wallclock is short anyway |
### DPO stage (Stage 2)
| Parameter | Value | Rationale |
|---|---|---|
| `epochs` | 2 | DPO converges quickly on small preference sets (180 pairs); 2 epochs is the standard floor |
| `learning_rate` | 5e-6 | DPO requires tiny steps — 20-50× smaller than SFT LR — to avoid catastrophically forgetting the SFT-learned voice |
| `beta` | 0.1 | Standard KL strength; higher (0.3-1.0) = more conservative (stay closer to ref model); lower (0.01-0.05) = more willing to diverge for preference satisfaction. 0.1 is the sweet spot |
| `batch` | 4 + grad_accum 2 | Effective batch 8 — DPO loss is more variance-y than SFT, smaller batch is fine |
| `warmup_ratio` | 0.1 | Slightly larger than SFT warmup; DPO is sensitive to early instability |
### Reference model for DPO
The standard DPO recipe needs both a "policy" (the model being trained) and a "reference" (frozen snapshot for KL divergence). For PEFT/LoRA setups, `DPOTrainer(ref_model=None)` uses an adapter-disabled forward pass as the reference — same architecture, same weights, just LoRA-deltas zeroed. This saves a second model copy in VRAM and works correctly when the SFT adapter loaded as the policy is what we want as the reference too.
---
## Wallclock and cost projections
```
H200 SXM5 throughput on Qwen3.5-9B QLoRA:
Prefill + backward at batch 16, seq 1024 (16K tokens/step):
~3 sec/step (estimate; exact depends on Hopper-specific kernels in unsloth)
SFT total: 1,886 examples × 3 epochs / batch 32 = 177 steps
177 × 3 sec = 531 sec ≈ 9 min compute
+ ~3 min model load on first run
≈ 12 min total
DPO: 180 pairs × 2 epochs / batch 8 = 45 steps
~5 sec/step (DPO is heavier per step than SFT due to ref + policy forward)
45 × 5 = 225 sec ≈ 4 min
Merge + GGUF: ~10 min for first quant (compiles llama.cpp), ~3 min each subsequent
q4 + q5 + q6 = ~16 min total
HF push: ~6 GB GGUF × 3 + ~250 MB adapters × 2 = ~19 GB
At HF API ~50-100 MB/s realistic upload: ~3-5 min
Total wallclock: ~35-45 min on H200
~50-60 min on H100 SXM5
~75-90 min on H100 PCIe
Cost on RunPod Secure Cloud:
H200 SXM5 ($3.99/hr) × 0.75 hr = $3.00
H100 SXM5 ($3.49/hr) × 1.0 hr = $3.50
H100 PCIe ($2.49/hr) × 1.5 hr = $3.75
```
All under $5. Genuinely cheap to iterate.
---
## Provider lessons inherited from buddhabrot project
- **Use RunPod Secure Cloud, not Community.** Community shares HBM bandwidth across tenants; persona training is bandwidth-sensitive (4-bit weight loads + bf16 activations + adapter gradients all hit memory). Secure Cloud delivers the rated throughput; Community can be 3-5× slower.
- **Diagnostic for shared throttling:** `nvidia-smi --query-gpu=power.draw,power.limit --format=csv,noheader`. H200 doing real work draws 600-700W of 700W cap. If you see <30% with 100% util, you're sharing bandwidth — switch pods.
- **Web Terminal is full bash.** No SSH key drama needed for first-time setup.
- **Bootstrap one-liner pattern works.** Same shape as `bootstrap-hyperbolic.sh` — clone, install, prep, leave at ready-to-launch state.
---
## What this pipeline does NOT do (deferred decisions)
- **No DPO data augmentation.** Just the 180 curated pairs.
- **No held-out validation set.** With 1,886 examples and 3 epochs, eyeballing the loss curve in the training log is the validation. A formal eval split would cut training data by 5-10% for marginal value at this scale.
- **No multi-quantization eval automation.** All 3 GGUFs are produced; manual probe-testing across quants is post-pipeline operator work.
- **No public HF model repo.** Only the private bucket. Public release happens after operator manually evaluates the GGUF outputs.
- **No checkpoint averaging / EMA.** Single final-epoch adapter is the artifact.
These are all intentional simplifications. Add them in v2 if first pass surfaces a need.

Xet Storage Details

Size:
9.6 kB
·
Xet hash:
df816c1982d4960e59ee6c9824ae8d29acc0f4d4d455cd1a6631d7ff69078eb9

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.