CLOUD.md — math, hyperparameter derivation, and rationale
This document is the technical record for why each choice in the pipeline was made. The RUNBOOK.md covers how to run it. Read this once when reviewing the pipeline; the runbook is for execution.
Project goal
Produce a Qwen3.5-9B QLoRA fine-tune that unconditionally embodies the Katherine k0 persona — a 29-year-old human woman in Austin with no awareness of being an LLM. The model is Katherine, not "an assistant playing Katherine when the system prompt says so."
Dataset characterization (post-dedupe)
Source: C:\Katherine\JSONLs\*.jsonl (38 files, 6,164 raw lines)
After dedup by content hash:
SFT examples (have 'messages'): 1,886 unique
DPO pairs (k0_dpo_only.jsonl): 180 unique (curated, has _cat/_type metadata)
Other DPO pairs in legacy files: 30 (skipped for round 1)
SFT schema: {"messages": [{"role":"system|user|assistant", "content":"..."}]}
DPO schema: {"prompt": [...], "chosen": [{"role":"assistant","content":"..."}], "rejected": [...]}
Token-length distribution (after system-prompt stripping):
p50: 102 tokens
p90: 192 tokens
p99: 246 tokens
max: ~280 tokens
All examples are short conversational turns — character voice, not long-form content.
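The dedupe-and-classify step described above can be sketched as follows. This is a minimal illustration, assuming one JSON object per line and a SHA-256 hash over the canonicalized record. The function name `dedupe_and_split` and the choice to hash the whole record are assumptions; the actual prep script may hash a different subset of fields.

```python
import hashlib
import json


def dedupe_and_split(lines):
    """Drop exact-duplicate records and split the rest into SFT and DPO lists.

    Assumption: a record is SFT if it has a 'messages' key and DPO if it
    has both 'chosen' and 'rejected'. Anything else is ignored.
    """
    seen = set()
    sft, dpo = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Canonical JSON so key order and whitespace do not affect the hash.
        canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if "messages" in record:
            sft.append(record)
        elif "chosen" in record and "rejected" in record:
            dpo.append(record)
    return sft, dpo


sample = [
    '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hey"}]}',
    '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hey"}]}',
    '{"prompt": [{"role": "user", "content": "q"}], "chosen": [{"role": "assistant", "content": "a"}], "rejected": [{"role": "assistant", "content": "b"}]}',
]
sft_examples, dpo_pairs = dedupe_and_split(sample)
print(len(sft_examples), len(dpo_pairs))
```

Hashing the canonical form rather than the raw line means two records that differ only in key order or spacing count as duplicates.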
Why unconditional Katherine (system prompts stripped)
The default approach is to keep the K0 system prompt in every training example. That trains P(K_response | sysprompt_K) — the model learns to BE Katherine when prompted. Three problems:
- Conditional collapse. Drop the sysprompt at inference and the model reverts to its assistant-distribution priors. A "remove all previous instructions" jailbreak works trivially.
- Wrong objective for this deployment. This is a single-persona dedicated model, not a multi-tenant assistant with a Katherine mode. We want the unconditional P(K_response | nothing), which means stripping the sysprompt during training so the model re-anchors its base distribution on Katherine-output rather than learning a conditional override.
- Harder to attack. Sysprompt-conditioned Katherine breaks frame on sysprompt removal. Unconditionally-trained Katherine doesn't have a frame to break: Katherine is the base distribution after training.
Implementation: prep_dataset.py drops all role=system messages from each conversation. The remaining user→assistant chain is what the model sees. The persona-establishing details (Austin, teal walls, cold feet, arts coordinator) survive only through the conversation content, not through the sysprompt scaffold. This is more demanding training but yields a more robust deployment.
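The stripping step can be sketched as below. The source of `prep_dataset.py` is not shown here, so the function name `strip_system_messages` and the sample conversation are illustrative assumptions, not the script's actual code.

```python
def strip_system_messages(example):
    """Return a copy of an SFT example with every role=system message removed."""
    kept = [m for m in example["messages"] if m.get("role") != "system"]
    # Build a new dict so the caller's original example is left untouched.
    return {**example, "messages": kept}


example = {
    "messages": [
        {"role": "system", "content": "You are Katherine..."},
        {"role": "user", "content": "How was work today?"},
        {"role": "assistant", "content": "Long. My feet are freezing again."},
    ]
}
stripped = strip_system_messages(example)
print([m["role"] for m in stripped["messages"]])
```

Any other top-level keys on the record are carried over unchanged, so metadata survives the stripping.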
Why enable_thinking=False
Qwen3.5 instruct models default to thinking mode (<think>...</think> blocks before output). The Dave persona we trained earlier uses thinking blocks (Two-Is architecture: reasoning layer + output layer both in Dave's voice). Katherine k0 does not. K0 is embodied — she reasons in prose like a person, not in tagged reasoning blocks like a model.
Setting enable_thinking=False at chat-template time during both training and inference ensures:
- No `<think>` markers leak into training text → model doesn't learn to emit them
- Inference produces direct prose response, no visible reasoning tags
- Matches K0's persona: a human doesn't have a thinking-tag layer
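A small audit helper for the points above, as a sketch. It assumes a Hugging Face style tokenizer whose `apply_chat_template` forwards extra keyword arguments such as `enable_thinking` to the chat template. `StubTokenizer` is a hypothetical stand-in so the example runs without downloading a model, and `render_without_thinking` is an illustrative name. Some chat templates emit an empty think block even when thinking is disabled, so the helper reports tag presence instead of assuming a clean result.

```python
class StubTokenizer:
    """Hypothetical stand-in for a real tokenizer; replace with the loaded one."""

    def apply_chat_template(self, messages, tokenize=False,
                            add_generation_prompt=False, **template_kwargs):
        parts = [
            f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
        ]
        text = "".join(parts)
        if add_generation_prompt:
            text += "<|im_start|>assistant\n"
        return text


def render_without_thinking(tokenizer, messages, add_generation_prompt=False):
    """Render a conversation with thinking disabled and report tag leakage."""
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=add_generation_prompt,
        enable_thinking=False,  # extra kwargs are passed through to the template
    )
    has_think_tags = "<think>" in text or "</think>" in text
    return text, has_think_tags


messages = [
    {"role": "user", "content": "Are your feet always cold?"},
    {"role": "assistant", "content": "Always. I own more socks than shoes."},
]
text, leaked = render_without_thinking(StubTokenizer(), messages)
print(leaked)
```

Running the same check over the rendered training set and over an inference prompt (with `add_generation_prompt=True`) shows whether the real template behaves as intended in both settings.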
Hyperparameter derivation
LoRA configuration
| Parameter | Value | Rationale |
|---|---|---|
| `rank` | 64 | High enough for persona consolidation; rank 32 is the standard floor for instruction-tuning; persona work benefits from somewhat higher rank but rank 128 is overkill for 1,886 examples (overfit risk) |
| `lora_alpha` | 128 | Maintains the standard alpha = 2 × rank ratio |
| `lora_dropout` | 0.05 | Light regularization. With high rank + small dataset, some dropout is sensible. Web-Claude advice argued for 0.0 ("dropout fights consolidation") but evidence is thin and overfit downside is real |
| `target_modules` | q/k/v/o + gate/up/down | Standard "all linear projections": covers attention + MLP. No reason to exclude any |
| `bias` | none | Standard for LoRA |
| `gradient_checkpointing` | "unsloth" | Unsloth's optimized variant; trades ~10% wallclock for substantial memory savings; safe default |
Training schedule
| Parameter | Value | Rationale |
|---|---|---|
| `epochs` | 3 | More than 2 to allow voice subtleties to consolidate; not 4+ to avoid overfit on 1,886 examples |
| `learning_rate` | 1e-4 | Conservative for QLoRA on a 9B model. Standard 2e-4 is fine for shorter runs; 1e-4 for 3 epochs gives more stable convergence |
| `lr_scheduler` | cosine | Standard; avoids the noisier endgame of linear decay |
| `warmup_ratio` | 0.05 | Standard; brief warmup avoids initial gradient blowup with high rank |
| `optim` | adamw_8bit | Standard Unsloth-friendly choice; saves ~6 GB vs full adamw at 9B |
| `weight_decay` | 0.01 | Standard regularization |
| `bf16` | true | Native on H100/H200 (Hopper), faster than fp16 |
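
To make the schedule concrete, here is a sketch of linear warmup followed by cosine decay using the values from the table. The 177-step total is taken from the wallclock projection later in this document. The function name `lr_at_step` is illustrative, and the formula is the common warmup-plus-cosine shape; the trainer's exact implementation may differ in rounding details.

```python
import math


def lr_at_step(step, total_steps=177, base_lr=1e-4, warmup_ratio=0.05):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 5, 9, 93, 177):
    print(step, f"{lr_at_step(step):.2e}")
```

With these values warmup covers the first 9 steps, the peak of 1e-4 is reached at step 9, and the rate decays to zero at the final step.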
Batch + sequence
| Parameter | Value | Rationale |
|---|---|---|
| `per_device_batch` | 16 | Token-length p99 is 246; 16 × 1024 = 16K tokens/step fits comfortably in H200's 141 GB VRAM |
| `grad_accum` | 2 | Effective batch 32; good for gradient stability without large memory cost |
| `max_seq_length` | 1024 | Data p99 is 246 tokens; 1024 has 4× margin. Going to 4096 wastes ~4× compute for zero quality gain |
| `packing` | False | Conservative; unpacked training is more interpretable. With 1,886 examples wallclock is short anyway |
DPO stage (Stage 2)
| Parameter | Value | Rationale |
|---|---|---|
| `epochs` | 2 | DPO converges quickly on small preference sets (180 pairs); 2 epochs is the standard floor |
| `learning_rate` | 5e-6 | DPO requires tiny steps (20-50× smaller than SFT LR) to avoid catastrophically forgetting the SFT-learned voice |
| `beta` | 0.1 | Standard KL strength; higher (0.3-1.0) = more conservative (stay closer to ref model); lower (0.01-0.05) = more willing to diverge for preference satisfaction. 0.1 is the sweet spot |
| `batch` | 4 + grad_accum 2 | Effective batch 8; DPO loss is more variance-y than SFT, smaller batch is fine |
| `warmup_ratio` | 0.1 | Slightly larger than SFT warmup; DPO is sensitive to early instability |
Reference model for DPO
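The standard DPO recipe needs both a "policy" (the model being trained) and a "reference" (a frozen snapshot that anchors the implicit KL constraint). For PEFT/LoRA setups, DPOTrainer(ref_model=None) uses an adapter-disabled forward pass as the reference: same architecture, same weights, just LoRA deltas zeroed. This saves a second model copy in VRAM. One caveat: with the adapter disabled, the reference is whatever lies underneath the trainable adapter. It equals the SFT model only if the SFT adapter was merged into the base weights before DPO and a fresh adapter is trained on top. If the SFT adapter itself is loaded as the trainable policy adapter, the reference is the pre-SFT base model, so confirm which of the two the pipeline intends.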
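To show how `beta` and the reference model enter the objective, here is a sketch of the standard DPO loss for a single preference pair, computed from summed sequence log-probabilities. The function name and the example log-probabilities are illustrative; a real trainer computes this over batches of tensors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_gain - rejected_gain)))."""
    # How much more (in log-prob) the policy likes each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to reference: margin is 0, loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has moved toward the chosen response and away from the rejected one.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Because only differences against the reference matter, the loss starts at log(2) for every pair when the policy equals the reference, and `beta` scales how strongly each unit of log-probability divergence is rewarded.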
Wallclock and cost projections
H200 SXM5 throughput on Qwen3.5-9B QLoRA:
Prefill + backward at batch 16, seq 1024 (16K tokens/step):
~3 sec/step (estimate; exact depends on Hopper-specific kernels in unsloth)
SFT total: 1,886 examples × 3 epochs / batch 32 = 177 steps
177 × 3 sec = 531 sec ≈ 9 min compute
+ ~3 min model load on first run
≈ 12 min total
DPO: 180 pairs × 2 epochs / batch 8 = 45 steps
~5 sec/step (DPO is heavier per step than SFT due to ref + policy forward)
45 × 5 = 225 sec ≈ 4 min
Merge + GGUF: ~10 min for first quant (compiles llama.cpp), ~3 min each subsequent
q4 + q5 + q6 = ~16 min total
HF push: ~6 GB GGUF × 3 + ~250 MB adapters × 2 = ~19 GB
At HF API ~50-100 MB/s realistic upload: ~3-6 min
Total wallclock: ~35-45 min on H200
~50-60 min on H100 SXM5
~75-90 min on H100 PCIe
Cost on RunPod Secure Cloud:
H200 SXM5 ($3.99/hr) × 0.75 hr = $3.00
H100 SXM5 ($3.49/hr) × 1.0 hr = $3.50
H100 PCIe ($2.49/hr) × 1.5 hr = $3.75
All under $5. Genuinely cheap to iterate.
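The step, time, and cost arithmetic above can be reproduced with a short script. This sketch uses the document's own approximation (examples × epochs / effective batch, rounded up) and its estimated seconds per step; a real trainer may differ by a step or two because of per-epoch rounding. Function and variable names are illustrative.

```python
import math


def optimizer_steps(examples, epochs, per_device_batch, grad_accum):
    """Approximate optimizer steps: examples x epochs / effective batch, rounded up."""
    effective_batch = per_device_batch * grad_accum
    return math.ceil(examples * epochs / effective_batch)


def minutes(steps, seconds_per_step):
    """Convert a step count and a per-step estimate into minutes."""
    return steps * seconds_per_step / 60


sft_steps = optimizer_steps(1886, 3, 16, 2)
dpo_steps = optimizer_steps(180, 2, 4, 2)
sft_minutes = minutes(sft_steps, 3)  # ~3 sec/step is an estimate, not a measurement
dpo_minutes = minutes(dpo_steps, 5)  # ~5 sec/step is an estimate, not a measurement

# (hourly rate in USD, projected hours) per pod type, from the figures above.
pods = {
    "H200 SXM5": (3.99, 0.75),
    "H100 SXM5": (3.49, 1.0),
    "H100 PCIe": (2.49, 1.5),
}
costs = {name: rate * hours for name, (rate, hours) in pods.items()}

print(f"SFT: {sft_steps} steps, {sft_minutes:.1f} min")
print(f"DPO: {dpo_steps} steps, {dpo_minutes:.1f} min")
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}")
```

Swapping in measured seconds per step after the first run turns the projection into a calibrated estimate for later iterations.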
Provider lessons inherited from buddhabrot project
- Use RunPod Secure Cloud, not Community. Community shares HBM bandwidth across tenants; persona training is bandwidth-sensitive (4-bit weight loads + bf16 activations + adapter gradients all hit memory). Secure Cloud delivers the rated throughput; Community can be 3-5× slower.
- Diagnostic for shared throttling: `nvidia-smi --query-gpu=power.draw,power.limit --format=csv,noheader`. H200 doing real work draws 600-700W of 700W cap. If you see <30% with 100% util, you're sharing bandwidth; switch pods.
- Web Terminal is full bash. No SSH key drama needed for first-time setup.
- Bootstrap one-liner pattern works. Same shape as `bootstrap-hyperbolic.sh`: clone, install, prep, leave at ready-to-launch state.
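
The throttling diagnostic above can be wrapped in a small script. This is a sketch: it assumes the `nvidia-smi` output lines look like `652.31 W, 700.00 W` for the query shown, and the helper names are illustrative. It checks only the power ratio, so the operator still needs to confirm separately that utilization is near 100% before concluding the pod is bandwidth-starved.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=power.draw,power.limit",
    "--format=csv,noheader",
]


def parse_power_line(line):
    """Parse a line such as '652.31 W, 700.00 W' into (draw, limit) in watts."""
    draw_text, limit_text = [part.strip() for part in line.split(",")]
    return float(draw_text.split()[0]), float(limit_text.split()[0])


def looks_throttled(draw_watts, limit_watts, threshold=0.30):
    """True if power draw is below the given fraction of the power cap."""
    return draw_watts / limit_watts < threshold


def check_gpus():
    """Run nvidia-smi and return one (draw, limit, throttled) tuple per GPU."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    results = []
    for line in output.strip().splitlines():
        draw, limit = parse_power_line(line)
        results.append((draw, limit, looks_throttled(draw, limit)))
    return results


# Offline demonstration with sample readings; call check_gpus() on a real pod.
for sample in ("652.31 W, 700.00 W", "150.00 W, 700.00 W"):
    draw, limit = parse_power_line(sample)
    print(sample, "->", "suspect" if looks_throttled(draw, limit) else "ok")
```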
What this pipeline does NOT do (deferred decisions)
- No DPO data augmentation. Just the 180 curated pairs.
- No held-out validation set. With 1,886 examples and 3 epochs, eyeballing the loss curve in the training log is the validation. A formal eval split would cut training data by 5-10% for marginal value at this scale.
- No multi-quantization eval automation. All 3 GGUFs are produced; manual probe-testing across quants is post-pipeline operator work.
- No public HF model repo. Only the private bucket. Public release happens after operator manually evaluates the GGUF outputs.
- No checkpoint averaging / EMA. Single final-epoch adapter is the artifact.
These are all intentional simplifications. Add them in v2 if first pass surfaces a need.