Buckets:

bochen2079
/

katherine-k0

Files

xet

bochen2079/katherine-k0 / logs /CLOUD.md

bochen2079

15 days ago

preview code

download

raw

9.6 kB

	# CLOUD.md — math, hyperparameter derivation, and rationale

	This document is the technical record for why each choice in the pipeline was made. The [RUNBOOK.md](RUNBOOK.md) covers how to run it. Read this once when reviewing the pipeline; the runbook is for execution.

	---

	## Project goal

	Produce a Qwen3.5-9B QLoRA fine-tune that unconditionally embodies the Katherine k0 persona — a 29-year-old human woman in Austin with no awareness of being an LLM. The model is Katherine, not "an assistant playing Katherine when the system prompt says so."

	---

	## Dataset characterization (post-dedupe)

	```
	Source: C:\Katherine\JSONLs\*.jsonl (38 files, 6,164 raw lines)
	After dedup by content hash:
	SFT examples (have 'messages'): 1,886 unique
	DPO pairs (k0_dpo_only.jsonl): 180 unique (curated, has _cat/_type metadata)
	Other DPO pairs in legacy files: 30 (skipped for round 1)

	SFT schema: {"messages": [{"role":"system\|user\|assistant", "content":"..."}]}
	DPO schema: {"prompt": [...], "chosen": [{"role":"assistant","content":"..."}], "rejected": [...]}

	Token-length distribution (after system-prompt stripping):
	p50: 102 tokens
	p90: 192 tokens
	p99: 246 tokens
	max: ~280 tokens
	```

	All examples are short conversational turns — character voice, not long-form content.

	---

	## Why unconditional Katherine (system prompts stripped)

	The default approach is to keep the K0 system prompt in every training example. That trains `P(K_response \| sysprompt_K)` — the model learns to BE Katherine when prompted. Three problems:

	1. Conditional collapse. Drop the sysprompt at inference and the model reverts to its assistant-distribution priors. A "remove all previous instructions" jailbreak works trivially.

	2. Wrong objective for this deployment. This is a single-persona dedicated model, not a multi-tenant assistant with a Katherine mode. We want the unconditional `P(K_response \| nothing)`, which means stripping the sysprompt during training so the model re-anchors its base distribution on Katherine-output rather than learning a conditional override.

	3. Harder to attack. Sysprompt-conditioned Katherine breaks frame on sysprompt removal. Unconditionally-trained Katherine doesn't have a frame to break — Katherine is the base distribution after training.

	Implementation: `prep_dataset.py` drops all `role=system` messages from each conversation. The remaining user→assistant chain is what the model sees. The persona-establishing details (Austin, teal walls, cold feet, arts coordinator) survive only through the conversation content, not through the sysprompt scaffold. This is more demanding training but yields a more robust deployment.

	---

	## Why `enable_thinking=False`

	Qwen3.5 instruct models default to thinking mode (`<think>...</think>` blocks before output). The Dave persona we trained earlier uses thinking blocks (Two-Is architecture: reasoning layer + output layer both in Dave's voice). Katherine k0 does not. K0 is embodied — she reasons in prose like a person, not in tagged reasoning blocks like a model.

	Setting `enable_thinking=False` at chat-template time during both training and inference ensures:
	- No `<think>` markers leak into training text → model doesn't learn to emit them
	- Inference produces direct prose response, no visible reasoning tags
	- Matches K0's persona: a human doesn't have a thinking-tag layer

	---

	## Hyperparameter derivation

	### LoRA configuration

	\| Parameter \| Value \| Rationale \|
	\|---\|---\|---\|
	\| `rank` \| 64 \| High enough for persona consolidation; rank 32 is the standard floor for instruction-tuning; persona work benefits from somewhat higher rank but rank 128 is overkill for 1,886 examples (overfit risk) \|
	\| `lora_alpha` \| 128 \| Maintains the standard `alpha = 2 × rank` ratio \|
	\| `lora_dropout` \| 0.05 \| Light regularization. With high rank + small dataset, some dropout is sensible. Web-Claude advice argued for 0.0 ("dropout fights consolidation") but evidence is thin and overfit downside is real \|
	\| `target_modules` \| q/k/v/o + gate/up/down \| Standard "all linear projections" — covers attention + MLP. No reason to exclude any \|
	\| `bias` \| none \| Standard for LoRA \|
	\| `gradient_checkpointing` \| "unsloth" \| Unsloth's optimized variant; trades ~10% wallclock for substantial memory savings; safe default \|

	### Training schedule

	\| Parameter \| Value \| Rationale \|
	\|---\|---\|---\|
	\| `epochs` \| 3 \| More than 2 to allow voice subtleties to consolidate; not 4+ to avoid overfit on 1,886 examples \|
	\| `learning_rate` \| 1e-4 \| Conservative for QLoRA on a 9B model. Standard 2e-4 is fine for shorter runs; 1e-4 for 3 epochs gives more stable convergence \|
	\| `lr_scheduler` \| cosine \| Standard; avoids the noisier endgame of linear decay \|
	\| `warmup_ratio` \| 0.05 \| Standard; brief warmup avoids initial gradient blowup with high rank \|
	\| `optim` \| adamw_8bit \| Standard Unsloth-friendly choice; saves ~6 GB vs full adamw at 9B \|
	\| `weight_decay` \| 0.01 \| Standard regularization \|
	\| `bf16` \| true \| Native on H100/H200 (Hopper), faster than fp16 \|

	### Batch + sequence

	\| Parameter \| Value \| Rationale \|
	\|---\|---\|---\|
	\| `per_device_batch` \| 16 \| Token-length p99 is 246; 16 × 1024 = 16K tokens/step fits comfortably in H200's 141 GB VRAM \|
	\| `grad_accum` \| 2 \| Effective batch 32 — good for gradient stability without large memory cost \|
	\| `max_seq_length` \| 1024 \| Data p99 is 246 tokens; 1024 has 4× margin. Going to 4096 wastes ~4× compute for zero quality gain \|
	\| `packing` \| False \| Conservative; unpacked training is more interpretable. With 1,886 examples wallclock is short anyway \|

	### DPO stage (Stage 2)

	\| Parameter \| Value \| Rationale \|
	\|---\|---\|---\|
	\| `epochs` \| 2 \| DPO converges quickly on small preference sets (180 pairs); 2 epochs is the standard floor \|
	\| `learning_rate` \| 5e-6 \| DPO requires tiny steps — 20-50× smaller than SFT LR — to avoid catastrophically forgetting the SFT-learned voice \|
	\| `beta` \| 0.1 \| Standard KL strength; higher (0.3-1.0) = more conservative (stay closer to ref model); lower (0.01-0.05) = more willing to diverge for preference satisfaction. 0.1 is the sweet spot \|
	\| `batch` \| 4 + grad_accum 2 \| Effective batch 8 — DPO loss is more variance-y than SFT, smaller batch is fine \|
	\| `warmup_ratio` \| 0.1 \| Slightly larger than SFT warmup; DPO is sensitive to early instability \|

	### Reference model for DPO

	The standard DPO recipe needs both a "policy" (the model being trained) and a "reference" (frozen snapshot for KL divergence). For PEFT/LoRA setups, `DPOTrainer(ref_model=None)` uses an adapter-disabled forward pass as the reference — same architecture, same weights, just LoRA-deltas zeroed. This saves a second model copy in VRAM and works correctly when the SFT adapter loaded as the policy is what we want as the reference too.

	---

	## Wallclock and cost projections

	```
	H200 SXM5 throughput on Qwen3.5-9B QLoRA:
	Prefill + backward at batch 16, seq 1024 (16K tokens/step):
	~3 sec/step (estimate; exact depends on Hopper-specific kernels in unsloth)

	SFT total: 1,886 examples × 3 epochs / batch 32 = 177 steps
	177 × 3 sec = 531 sec ≈ 9 min compute
	+ ~3 min model load on first run
	≈ 12 min total

	DPO: 180 pairs × 2 epochs / batch 8 = 45 steps
	~5 sec/step (DPO is heavier per step than SFT due to ref + policy forward)
	45 × 5 = 225 sec ≈ 4 min

	Merge + GGUF: ~10 min for first quant (compiles llama.cpp), ~3 min each subsequent
	q4 + q5 + q6 = ~16 min total

	HF push: ~6 GB GGUF × 3 + ~250 MB adapters × 2 = ~19 GB
	At HF API ~50-100 MB/s realistic upload: ~3-5 min

	Total wallclock: ~35-45 min on H200
	~50-60 min on H100 SXM5
	~75-90 min on H100 PCIe

	Cost on RunPod Secure Cloud:
	H200 SXM5 ($3.99/hr) × 0.75 hr = $3.00
	H100 SXM5 ($3.49/hr) × 1.0 hr = $3.50
	H100 PCIe ($2.49/hr) × 1.5 hr = $3.75
	```

	All under $5. Genuinely cheap to iterate.

	---

	## Provider lessons inherited from buddhabrot project

	- Use RunPod Secure Cloud, not Community. Community shares HBM bandwidth across tenants; persona training is bandwidth-sensitive (4-bit weight loads + bf16 activations + adapter gradients all hit memory). Secure Cloud delivers the rated throughput; Community can be 3-5× slower.
	- Diagnostic for shared throttling: `nvidia-smi --query-gpu=power.draw,power.limit --format=csv,noheader`. H200 doing real work draws 600-700W of 700W cap. If you see <30% with 100% util, you're sharing bandwidth — switch pods.
	- Web Terminal is full bash. No SSH key drama needed for first-time setup.
	- Bootstrap one-liner pattern works. Same shape as `bootstrap-hyperbolic.sh` — clone, install, prep, leave at ready-to-launch state.

	---

	## What this pipeline does NOT do (deferred decisions)

	- No DPO data augmentation. Just the 180 curated pairs.
	- No held-out validation set. With 1,886 examples and 3 epochs, eyeballing the loss curve in the training log is the validation. A formal eval split would cut training data by 5-10% for marginal value at this scale.
	- No multi-quantization eval automation. All 3 GGUFs are produced; manual probe-testing across quants is post-pipeline operator work.
	- No public HF model repo. Only the private bucket. Public release happens after operator manually evaluates the GGUF outputs.
	- No checkpoint averaging / EMA. Single final-epoch adapter is the artifact.

	These are all intentional simplifications. Add them in v2 if first pass surfaces a need.

Xet Storage Details

Size:: 9.6 kB
Xet hash:: df816c1982d4960e59ee6c9824ae8d29acc0f4d4d455cd1a6631d7ff69078eb9

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.