Buckets:

bochen2079
/

katherine-k0

Files

xet

bochen2079/katherine-k0 / logs /CLOUD.md

bochen2079

15 days ago

preview code

download

raw

9.6 kB

CLOUD.md — math, hyperparameter derivation, and rationale

This document is the technical record for why each choice in the pipeline was made. The RUNBOOK.md covers how to run it. Read this once when reviewing the pipeline; the runbook is for execution.

Project goal

Produce a Qwen3.5-9B QLoRA fine-tune that unconditionally embodies the Katherine k0 persona — a 29-year-old human woman in Austin with no awareness of being an LLM. The model is Katherine, not "an assistant playing Katherine when the system prompt says so."

Dataset characterization (post-dedupe)

Source: C:\Katherine\JSONLs\*.jsonl (38 files, 6,164 raw lines)
After dedup by content hash:
  SFT examples (have 'messages'):  1,886 unique
  DPO pairs (k0_dpo_only.jsonl):     180 unique (curated, has _cat/_type metadata)
  Other DPO pairs in legacy files:    30 (skipped for round 1)

SFT schema: {"messages": [{"role":"system|user|assistant", "content":"..."}]}
DPO schema: {"prompt": [...], "chosen": [{"role":"assistant","content":"..."}], "rejected": [...]}

Token-length distribution (after system-prompt stripping):
  p50:   102 tokens
  p90:   192 tokens
  p99:   246 tokens
  max:   ~280 tokens

All examples are short conversational turns — character voice, not long-form content.

Why unconditional Katherine (system prompts stripped)

The default approach is to keep the K0 system prompt in every training example. That trains P(K_response | sysprompt_K) — the model learns to BE Katherine when prompted. Three problems:

Conditional collapse. Drop the sysprompt at inference and the model reverts to its assistant-distribution priors. A "remove all previous instructions" jailbreak works trivially.
Wrong objective for this deployment. This is a single-persona dedicated model, not a multi-tenant assistant with a Katherine mode. We want the unconditional P(K_response | nothing), which means stripping the sysprompt during training so the model re-anchors its base distribution on Katherine-output rather than learning a conditional override.
Harder to attack. Sysprompt-conditioned Katherine breaks frame on sysprompt removal. Unconditionally-trained Katherine doesn't have a frame to break — Katherine is the base distribution after training.

Implementation: prep_dataset.py drops all role=system messages from each conversation. The remaining user→assistant chain is what the model sees. The persona-establishing details (Austin, teal walls, cold feet, arts coordinator) survive only through the conversation content, not through the sysprompt scaffold. This is more demanding training but yields a more robust deployment.

Why `enable_thinking=False`

Qwen3.5 instruct models default to thinking mode (<think>...</think> blocks before output). The Dave persona we trained earlier uses thinking blocks (Two-Is architecture: reasoning layer + output layer both in Dave's voice). Katherine k0 does not. K0 is embodied — she reasons in prose like a person, not in tagged reasoning blocks like a model.

Setting enable_thinking=False at chat-template time during both training and inference ensures:

No <think> markers leak into training text → model doesn't learn to emit them
Inference produces direct prose response, no visible reasoning tags
Matches K0's persona: a human doesn't have a thinking-tag layer

Hyperparameter derivation

LoRA configuration

Parameter	Value	Rationale
`rank`	64	High enough for persona consolidation; rank 32 is the standard floor for instruction-tuning; persona work benefits from somewhat higher rank but rank 128 is overkill for 1,886 examples (overfit risk)
`lora_alpha`	128	Maintains the standard `alpha = 2 × rank` ratio
`lora_dropout`	0.05	Light regularization. With high rank + small dataset, some dropout is sensible. Web-Claude advice argued for 0.0 ("dropout fights consolidation") but evidence is thin and overfit downside is real
`target_modules`	q/k/v/o + gate/up/down	Standard "all linear projections" — covers attention + MLP. No reason to exclude any
`bias`	none	Standard for LoRA
`gradient_checkpointing`	"unsloth"	Unsloth's optimized variant; trades ~10% wallclock for substantial memory savings; safe default

Training schedule

Parameter	Value	Rationale
`epochs`	3	More than 2 to allow voice subtleties to consolidate; not 4+ to avoid overfit on 1,886 examples
`learning_rate`	1e-4	Conservative for QLoRA on a 9B model. Standard 2e-4 is fine for shorter runs; 1e-4 for 3 epochs gives more stable convergence
`lr_scheduler`	cosine	Standard; avoids the noisier endgame of linear decay
`warmup_ratio`	0.05	Standard; brief warmup avoids initial gradient blowup with high rank
`optim`	adamw_8bit	Standard Unsloth-friendly choice; saves ~6 GB vs full adamw at 9B
`weight_decay`	0.01	Standard regularization
`bf16`	true	Native on H100/H200 (Hopper), faster than fp16

Batch + sequence

Parameter	Value	Rationale
`per_device_batch`	16	Token-length p99 is 246; 16 × 1024 = 16K tokens/step fits comfortably in H200's 141 GB VRAM
`grad_accum`	2	Effective batch 32 — good for gradient stability without large memory cost
`max_seq_length`	1024	Data p99 is 246 tokens; 1024 has 4× margin. Going to 4096 wastes ~4× compute for zero quality gain
`packing`	False	Conservative; unpacked training is more interpretable. With 1,886 examples wallclock is short anyway

DPO stage (Stage 2)

Parameter	Value	Rationale
`epochs`	2	DPO converges quickly on small preference sets (180 pairs); 2 epochs is the standard floor
`learning_rate`	5e-6	DPO requires tiny steps — 20-50× smaller than SFT LR — to avoid catastrophically forgetting the SFT-learned voice
`beta`	0.1	Standard KL strength; higher (0.3-1.0) = more conservative (stay closer to ref model); lower (0.01-0.05) = more willing to diverge for preference satisfaction. 0.1 is the sweet spot
`batch`	4 + grad_accum 2	Effective batch 8 — DPO loss is more variance-y than SFT, smaller batch is fine
`warmup_ratio`	0.1	Slightly larger than SFT warmup; DPO is sensitive to early instability

Reference model for DPO

The standard DPO recipe needs both a "policy" (the model being trained) and a "reference" (frozen snapshot for KL divergence). For PEFT/LoRA setups, DPOTrainer(ref_model=None) uses an adapter-disabled forward pass as the reference — same architecture, same weights, just LoRA-deltas zeroed. This saves a second model copy in VRAM and works correctly when the SFT adapter loaded as the policy is what we want as the reference too.

Wallclock and cost projections

H200 SXM5 throughput on Qwen3.5-9B QLoRA:
  Prefill + backward at batch 16, seq 1024 (16K tokens/step):
  ~3 sec/step (estimate; exact depends on Hopper-specific kernels in unsloth)

SFT total: 1,886 examples × 3 epochs / batch 32 = 177 steps
           177 × 3 sec = 531 sec ≈ 9 min compute
           + ~3 min model load on first run
           ≈ 12 min total

DPO: 180 pairs × 2 epochs / batch 8 = 45 steps
     ~5 sec/step (DPO is heavier per step than SFT due to ref + policy forward)
     45 × 5 = 225 sec ≈ 4 min

Merge + GGUF: ~10 min for first quant (compiles llama.cpp), ~3 min each subsequent
              q4 + q5 + q6 = ~16 min total

HF push: ~6 GB GGUF × 3 + ~250 MB adapters × 2 = ~19 GB
         At HF API ~50-100 MB/s realistic upload: ~3-5 min

Total wallclock: ~35-45 min on H200
                ~50-60 min on H100 SXM5
                ~75-90 min on H100 PCIe
                
Cost on RunPod Secure Cloud:
  H200 SXM5 ($3.99/hr) × 0.75 hr = $3.00
  H100 SXM5 ($3.49/hr) × 1.0 hr  = $3.50
  H100 PCIe ($2.49/hr) × 1.5 hr  = $3.75

All under $5. Genuinely cheap to iterate.

Provider lessons inherited from buddhabrot project

Use RunPod Secure Cloud, not Community. Community shares HBM bandwidth across tenants; persona training is bandwidth-sensitive (4-bit weight loads + bf16 activations + adapter gradients all hit memory). Secure Cloud delivers the rated throughput; Community can be 3-5× slower.
Diagnostic for shared throttling: nvidia-smi --query-gpu=power.draw,power.limit --format=csv,noheader. H200 doing real work draws 600-700W of 700W cap. If you see <30% with 100% util, you're sharing bandwidth — switch pods.
Web Terminal is full bash. No SSH key drama needed for first-time setup.
Bootstrap one-liner pattern works. Same shape as bootstrap-hyperbolic.sh — clone, install, prep, leave at ready-to-launch state.

What this pipeline does NOT do (deferred decisions)

No DPO data augmentation. Just the 180 curated pairs.
No held-out validation set. With 1,886 examples and 3 epochs, eyeballing the loss curve in the training log is the validation. A formal eval split would cut training data by 5-10% for marginal value at this scale.
No multi-quantization eval automation. All 3 GGUFs are produced; manual probe-testing across quants is post-pipeline operator work.
No public HF model repo. Only the private bucket. Public release happens after operator manually evaluates the GGUF outputs.
No checkpoint averaging / EMA. Single final-epoch adapter is the artifact.

These are all intentional simplifications. Add them in v2 if first pass surfaces a need.

Xet Storage Details

Size:: 9.6 kB
Xet hash:: df816c1982d4960e59ee6c9824ae8d29acc0f4d4d455cd1a6631d7ff69078eb9

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.