# Autotrain: Qwen3.5-2B Japanese adaptation
## Objective

Improve `Qwen/Qwen3.5-2B` at Japanese instruction following and Japanese-only responses.
Success means lower held-out Japanese instruction-tuning loss on a fixed test split built from multiple Japanese datasets.
## Hardware & Framework

- Hardware: Apple M4 Max, 128 GB unified memory
- Framework: `mlx-lm`
- Detection: `uname -m` returned `arm64` on macOS 15.7.4, so this session uses MLX/Apple Silicon fine-tuning.
## Base Model

- Upstream model: `Qwen/Qwen3.5-2B`
- Local training artifact: `mlx-community/Qwen3.5-2B-bf16` in `./model`
- Size: ~2.27B BF16 parameters
- Quantization: BF16
- Architecture: `Qwen3_5ForConditionalGeneration`, used in text-only mode via chat template
## HF Session Targets

- Adapter repo: `victor/qwen3_5_2b-japanese-lora`
- Session bucket: `hf://buckets/victor/autotrain-japanese-qwen35-2b`
## Metrics

- Primary: `test_loss` (unitless, lower is better)
- Secondary: `val_loss`, `test_ppl`, `train_loss`, `peak_mem_gb`
- Track `val_loss` alongside `test_loss`; divergence suggests overfitting or test-set gaming
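Since `test_ppl` is just the exponential of the mean cross-entropy loss (in nats), the primary and secondary numbers can be sanity-checked against each other. A minimal sketch:

```python
import math

def perplexity(loss: float) -> float:
    """Perplexity is exp(mean cross-entropy loss in nats)."""
    return math.exp(loss)

# The reported pairs line up with ppl = exp(loss), up to rounding of the logged loss:
print(round(perplexity(2.988), 3))  # ≈ 19.846 (reported test_ppl: 19.837)
print(round(perplexity(2.495), 3))  # ≈ 12.122 (reported test_ppl: 12.127)
```

If a logged `test_ppl` ever drifts far from `exp(test_loss)`, the metric extraction is picking up the wrong lines.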
## Dataset Splits

- Sources: `kunishou/databricks-dolly-15k-ja`, `kunishou/oasst1-89k-ja`, `kunishou/hh-rlhf-49k-ja`
- Phase-1 curation already applied:
  - drop `ng_translation` rows where available
  - prompt length 4–1200 chars
  - completion length 20–1200 chars
  - Japanese character ratio >= 0.15
  - dedupe to one canonical completion per prompt hash to avoid prompt leakage across splits
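The curation rules above can be sketched as a single filter pass. This is illustrative only: the field names `prompt`, `completion`, and `ng_translation` are assumptions about the normalized row format, not the actual schema in `src/prepare_data.py`, and the Japanese-ratio heuristic here only counts kana and CJK ideograph blocks.

```python
import hashlib

def japanese_ratio(text: str) -> float:
    """Fraction of characters in Hiragana/Katakana/CJK blocks (rough heuristic)."""
    if not text:
        return 0.0
    jp = sum(
        1
        for ch in text
        if "\u3040" <= ch <= "\u30ff"   # Hiragana + Katakana
        or "\u4e00" <= ch <= "\u9fff"   # CJK Unified Ideographs
    )
    return jp / len(text)

def curate(rows):
    """Apply the phase-1 filters, then keep one canonical completion per prompt hash."""
    seen = set()
    for row in rows:
        if row.get("ng_translation"):            # drop flagged translations
            continue
        prompt, completion = row["prompt"], row["completion"]
        if not (4 <= len(prompt) <= 1200):       # prompt length bounds
            continue
        if not (20 <= len(completion) <= 1200):  # completion length bounds
            continue
        if japanese_ratio(completion) < 0.15:    # must be substantially Japanese
            continue
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in seen:                          # prompt-level dedupe
            continue
        seen.add(key)
        yield row
```

Deduping on the prompt hash before splitting is what prevents the same prompt from landing in both train and test.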
- Available after filtering/dedup:
  - Dolly: 12,791
  - OASST: 20,595
  - HH-RLHF: 23,534
- Train: 4,096 examples
  - Dolly 1,024 / HH 1,536 / OASST 1,536
- Val: 800 examples
  - Dolly 250 / HH 275 / OASST 275
- Test: 800 examples — FIXED at session start
  - Dolly 250 / HH 275 / OASST 275
- Fresh validation: sample a new 800-example validation subset from the fixed 2,400-example validation pool by changing `FRESH_VAL_SEED`
- Sacred split manifests: `splits/test_ids.txt`, `splits/val_pool_ids.txt`
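Fresh-validation rotation is just deterministic sampling from the fixed pool. A minimal sketch, assuming the pool ids come from `splits/val_pool_ids.txt` (the helper name is hypothetical, not part of the harness):

```python
import random

def fresh_val_subset(val_pool_ids: list, seed: int, k: int = 800) -> list:
    """Deterministically sample a fresh validation subset from the fixed pool.

    Changing the seed (FRESH_VAL_SEED) rotates the subset; the sacred test
    split is never touched by this operation.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(val_pool_ids, k))
```

Same seed gives the same subset, so a rotation is reproducible; a new seed gives a fresh subset without ever enlarging or contaminating the pool.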
## How to Run

- Baseline eval only: `SKIP_TRAIN=1 ./autotrain.sh`
- Fine-tuning run: `./autotrain.sh`
- Fresh validation rotation: `FRESH_VAL_SEED=1 ./autotrain.sh`
- Output: one `METRIC name=value` line per metric
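Anything consuming the harness output only needs to scrape the `METRIC name=value` lines. A minimal parser sketch (independent of the actual implementation in `src/parse_mlx_lora_log.py`):

```python
import re

METRIC_RE = re.compile(r"^METRIC\s+(\w+)=([-+]?[\d.]+(?:[eE][-+]?\d+)?)$")

def parse_metrics(lines):
    """Collect METRIC name=value lines from harness output into a dict of floats."""
    metrics = {}
    for line in lines:
        m = METRIC_RE.match(line.strip())
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics

print(parse_metrics(["METRIC test_loss=2.495", "some log noise", "METRIC test_ppl=12.127"]))
# {'test_loss': 2.495, 'test_ppl': 12.127}
```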
## Files in Scope

- `.gitignore` — ignore large/generated artifacts
- `autotrain.md` — living session memory
- `autotrain.sh` — benchmark/train/eval entrypoint
- `autotrain.ideas.md` — deferred ideas
- `config/data_config.json` — data curation and split policy
- `splits/test_ids.txt` — fixed sacred test split ids
- `splits/val_pool_ids.txt` — fixed validation pool ids
- `src/prepare_data.py` — data normalization, curation, split generation
- `src/parse_mlx_lora_log.py` — metric extraction from MLX logs
## Off Limits

- `model/` — downloaded base model snapshot
- `data/raw/` — downloaded dataset sources
- global system Python / Homebrew packages
## Constraints

- Apple Silicon shares memory with display; keep training stable and avoid needlessly aggressive batch sizes
- Prefer runs that fit comfortably on-device and finish in roughly <= 15 minutes while searching
- Keep the fixed test split unchanged
- Do not replace the base model family unless the user asks
## Experiment Priority Order

- Phase 1: Data Quality — source mix, stricter HH/OASST filtering, smaller higher-quality subsets
- Phase 2: Prompt & Output Format — Japanese-only prefix/system wording, formatting consistency
- Phase 3: LoRA Architecture — `num_layers`, maybe all layers vs 16
- Phase 4: Training Hyperparameters — LR, iters, batch size, max seq length
- Phase 5: Regularization — only if overfitting appears
## Anti-Thrash Rules

- 5+ consecutive discards → pivot phase
- 8 near-tie metrics in a row → make a structural change
- If only hyperparameter ideas remain, go back to data/prompt work first
- Every keep, or every 10 experiments → run fresh validation with a new `FRESH_VAL_SEED`
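The first two rules can be encoded as a tiny check over the experiment history. This is a sketch only; the outcome labels `keep`/`discard`/`tie` are assumptions for illustration, not something the harness emits:

```python
def anti_thrash_action(history: list) -> str:
    """Apply the anti-thrash rules to a history of outcomes, newest last.

    Outcomes (assumed labels): 'keep', 'discard', or 'tie' (near-tie metric).
    """
    def trailing(kind: str) -> int:
        # Count how many of the most recent outcomes match `kind`.
        n = 0
        for outcome in reversed(history):
            if outcome != kind:
                break
            n += 1
        return n

    if trailing("discard") >= 5:   # 5+ consecutive discards -> pivot phase
        return "pivot-phase"
    if trailing("tie") >= 8:       # 8 near-ties in a row -> structural change
        return "structural-change"
    return "continue"
```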
## What's Been Tried

### Key Wins

- Baseline established: base `Qwen3.5-2B` with no adapter scored `test_loss=2.988` / `test_ppl=19.837` on the fixed 800-example Japanese test split. This is the number to beat.
- First LoRA win: a 4,096-example balanced Japanese mixture (Dolly 1024 / HH 1536 / OASST 1536) with `num_layers=16`, `lr=2e-5`, `batch_size=2`, `grad_acc=8`, `iters=400`, `max_seq_length=1024`, `val_batches=100` improved to `test_loss=2.495` / `test_ppl=12.127` with final `val_loss=2.448`.
### Dead Ends

- Two initial eval-only harness attempts crashed because `mlx_lm.lora` tries to load adapters by default during `--test`; baseline mode must pass `--adapter-path ""` explicitly.
- A fixed prompt prefix (`日本語で自然かつ簡潔に答えてください。`, "Please answer naturally and concisely in Japanese.") did not help: `test_loss` moved from `2.495` to `2.497`.
- An OASST-heavy source mix (1024 Dolly / 768 HH / 2304 OASST) crashed around iter 200 with `Abort trap: 6`, so any further source-mix changes should isolate whether the instability is MLX-runtime-related or data-shape-related.
### Observations

- The repo started empty, so the full autotrain harness was created from scratch.
- `Qwen/Qwen3.5-2B` is exposed on the Hub as an image-text-to-text model, but the MLX-converted checkpoint loads and tokenizes text-only prompts correctly.
- Raw translated Japanese datasets contain many duplicate prompts and a non-trivial tail of too-short or too-long completions; prompt-level dedup is the first major data-quality choice.
- Cross-source prompt overlap is tiny, so the current multi-source mixture should mainly diversify Japanese response style rather than duplicate the same prompts.
- Full test evaluation over 800 examples currently takes about 7m20s at `batch_size=2`, so in-loop validation must stay cheaper than the final test pass.
- The first real LoRA run peaked around 50 GB unified memory, so there is headroom for modestly longer runs or slightly larger batches on this machine.