# Autotrain: Qwen3.5-2B Japanese adaptation
## Objective

Improve `Qwen/Qwen3.5-2B` at Japanese instruction following and Japanese-only responses.
Success means lower held-out Japanese instruction-tuning loss on a fixed test split built from multiple Japanese datasets.
## Hardware & Framework

- Hardware: Apple M4 Max, 128 GB unified memory
- Framework: `mlx-lm`
- Detection: `uname -m` returned `arm64` on macOS 15.7.4, so this session uses MLX/Apple Silicon fine-tuning.
## Base Model

- Upstream model: `Qwen/Qwen3.5-2B`
- Local training artifact: `mlx-community/Qwen3.5-2B-bf16` in `./model`
- Size: ~2.27B BF16 parameters
- Quantization: BF16
- Architecture: `Qwen3_5ForConditionalGeneration`, used in text-only mode via chat template
## HF Session Targets

- Adapter repo: `victor/qwen3_5_2b-japanese-lora`
- Session bucket: `hf://buckets/victor/autotrain-japanese-qwen35-2b`
## Metrics

- Primary: `test_loss` (unitless, lower is better)
- Secondary: `val_loss`, `test_ppl`, `train_loss`, `peak_mem_gb`
- Track `val_loss` alongside `test_loss`; divergence suggests overfitting or test-set gaming
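Since `test_ppl` is just the exponential of the mean cross-entropy loss (in nats), the primary and secondary numbers can be sanity-checked against each other. A minimal sketch:

```python
import math

def perplexity(loss: float) -> float:
    """Perplexity is exp(mean cross-entropy loss in nats)."""
    return math.exp(loss)

# The reported pairs line up with ppl = exp(loss), up to rounding of the logged loss:
print(round(perplexity(2.988), 3))  # ≈ 19.846 (reported test_ppl: 19.837)
print(round(perplexity(2.495), 3))  # ≈ 12.122 (reported test_ppl: 12.127)
```

If a logged `test_ppl` ever drifts far from `exp(test_loss)`, the metric extraction is picking up the wrong lines.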
## Dataset Splits

- Sources: `kunishou/databricks-dolly-15k-ja`, `kunishou/oasst1-89k-ja`, `kunishou/hh-rlhf-49k-ja`
- Phase-1 curation already applied:
  - drop `ng_translation` rows where available
  - prompt length 4–1200 chars
  - completion length 20–1200 chars
  - Japanese character ratio >= 0.15
  - dedupe to one canonical completion per prompt hash to avoid prompt leakage across splits
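The curation rules above can be sketched as a single filter pass. This is illustrative only: the field names `prompt`, `completion`, and `ng_translation` are assumptions about the normalized row format, not the actual schema in `src/prepare_data.py`, and the Japanese-ratio heuristic here only counts kana and CJK ideograph blocks.

```python
import hashlib

def japanese_ratio(text: str) -> float:
    """Fraction of characters in Hiragana/Katakana/CJK blocks (rough heuristic)."""
    if not text:
        return 0.0
    jp = sum(
        1
        for ch in text
        if "\u3040" <= ch <= "\u30ff"   # Hiragana + Katakana
        or "\u4e00" <= ch <= "\u9fff"   # CJK Unified Ideographs
    )
    return jp / len(text)

def curate(rows):
    """Apply the phase-1 filters, then keep one canonical completion per prompt hash."""
    seen = set()
    for row in rows:
        if row.get("ng_translation"):            # drop flagged translations
            continue
        prompt, completion = row["prompt"], row["completion"]
        if not (4 <= len(prompt) <= 1200):       # prompt length bounds
            continue
        if not (20 <= len(completion) <= 1200):  # completion length bounds
            continue
        if japanese_ratio(completion) < 0.15:    # must be substantially Japanese
            continue
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in seen:                          # prompt-level dedupe
            continue
        seen.add(key)
        yield row
```

Deduping on the prompt hash before splitting is what prevents the same prompt from landing in both train and test.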
- Available after filtering/dedup:
  - Dolly: 12,791
  - OASST: 20,595
  - HH-RLHF: 23,534
- Train: 4,096 examples
  - Dolly 1,024 / HH 1,536 / OASST 1,536
- Val: 800 examples
  - Dolly 250 / HH 275 / OASST 275
- Test: 800 examples — FIXED at session start
  - Dolly 250 / HH 275 / OASST 275
- Fresh validation: sample a new 800-example validation subset from the fixed 2,400-example validation pool by changing `FRESH_VAL_SEED`
- Sacred split manifests: `splits/test_ids.txt`, `splits/val_pool_ids.txt`
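Fresh-validation rotation is just deterministic sampling from the fixed pool. A minimal sketch, assuming the pool ids come from `splits/val_pool_ids.txt` (the helper name is hypothetical, not part of the harness):

```python
import random

def fresh_val_subset(val_pool_ids: list, seed: int, k: int = 800) -> list:
    """Deterministically sample a fresh validation subset from the fixed pool.

    Changing the seed (FRESH_VAL_SEED) rotates the subset; the sacred test
    split is never touched by this operation.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(val_pool_ids, k))
```

Same seed gives the same subset, so a rotation is reproducible; a new seed gives a fresh subset without ever enlarging or contaminating the pool.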
## How to Run

- Baseline eval only: `SKIP_TRAIN=1 ./autotrain.sh`
- Fine-tuning run: `./autotrain.sh`
- Fresh validation rotation: `FRESH_VAL_SEED=1 ./autotrain.sh`
- Output: one `METRIC name=value` line per metric
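Anything consuming the harness output only needs to scrape the `METRIC name=value` lines. A minimal parser sketch (independent of the actual implementation in `src/parse_mlx_lora_log.py`):

```python
import re

METRIC_RE = re.compile(r"^METRIC\s+(\w+)=([-+]?[\d.]+(?:[eE][-+]?\d+)?)$")

def parse_metrics(lines):
    """Collect METRIC name=value lines from harness output into a dict of floats."""
    metrics = {}
    for line in lines:
        m = METRIC_RE.match(line.strip())
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics

print(parse_metrics(["METRIC test_loss=2.495", "some log noise", "METRIC test_ppl=12.127"]))
# {'test_loss': 2.495, 'test_ppl': 12.127}
```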
## Files in Scope

- `.gitignore` — ignore large/generated artifacts
- `autotrain.md` — living session memory
- `autotrain.sh` — benchmark/train/eval entrypoint
- `autotrain.ideas.md` — deferred ideas
- `config/data_config.json` — data curation and split policy
- `splits/test_ids.txt` — fixed sacred test split ids
- `splits/val_pool_ids.txt` — fixed validation pool ids
- `src/prepare_data.py` — data normalization, curation, split generation
- `src/parse_mlx_lora_log.py` — metric extraction from MLX logs
## Off Limits

- `model/` — downloaded base model snapshot
- `data/raw/` — downloaded dataset sources
- global system Python / Homebrew packages
## Constraints

- Apple Silicon shares memory with display; keep training stable and avoid needlessly aggressive batch sizes
- Prefer runs that fit comfortably on-device and finish in roughly <= 15 minutes while searching
- Keep the fixed test split unchanged
- Do not replace the base model family unless the user asks
## Experiment Priority Order

- Phase 1: Data Quality — source mix, stricter HH/OASST filtering, smaller higher-quality subsets
- Phase 2: Prompt & Output Format — Japanese-only prefix/system wording, formatting consistency
- Phase 3: LoRA Architecture — `num_layers`, maybe all layers vs 16
- Phase 4: Training Hyperparameters — LR, iters, batch size, max seq length
- Phase 5: Regularization — only if overfitting appears
## Anti-Thrash Rules

- 5+ consecutive discards → pivot phase
- 8 near-tie metrics in a row → make a structural change
- If only hyperparameter ideas remain, go back to data/prompt work first
- Every keep, or every 10 experiments → run fresh validation with a new `FRESH_VAL_SEED`
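The first two rules can be encoded as a tiny check over the experiment history. This is a sketch only; the outcome labels `keep`/`discard`/`tie` are assumptions for illustration, not something the harness emits:

```python
def anti_thrash_action(history: list) -> str:
    """Apply the anti-thrash rules to a history of outcomes, newest last.

    Outcomes (assumed labels): 'keep', 'discard', or 'tie' (near-tie metric).
    """
    def trailing(kind: str) -> int:
        # Count how many of the most recent outcomes match `kind`.
        n = 0
        for outcome in reversed(history):
            if outcome != kind:
                break
            n += 1
        return n

    if trailing("discard") >= 5:   # 5+ consecutive discards -> pivot phase
        return "pivot-phase"
    if trailing("tie") >= 8:       # 8 near-ties in a row -> structural change
        return "structural-change"
    return "continue"
```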
## What's Been Tried

### Key Wins

- Baseline established: base `Qwen3.5-2B` with no adapter scored `test_loss=2.988` / `test_ppl=19.837` on the fixed 800-example Japanese test split. This is the number to beat.
- First LoRA win: a 4,096-example balanced Japanese mixture (Dolly 1024 / HH 1536 / OASST 1536) with `num_layers=16`, `lr=2e-5`, `batch_size=2`, `grad_acc=8`, `iters=400`, `max_seq_length=1024`, `val_batches=100` improved to `test_loss=2.495` / `test_ppl=12.127` with final `val_loss=2.448`.
### Dead Ends

- Two initial eval-only harness attempts crashed because `mlx_lm.lora` tries to load adapters by default during `--test`; baseline mode must pass `--adapter-path ""` explicitly.
- A fixed prompt prefix (`日本語で自然かつ簡潔に答えてください。`, "Please answer naturally and concisely in Japanese.") did not help: `test_loss` moved from `2.495` to `2.497`.
- An OASST-heavy source mix (1024 Dolly / 768 HH / 2304 OASST) crashed around iter 200 with `Abort trap: 6`, so any further source-mix changes should isolate whether the instability is MLX-runtime-related or data-shape-related.
### Observations

- The repo started empty, so the full autotrain harness was created from scratch.
- `Qwen/Qwen3.5-2B` is exposed on the Hub as an image-text-to-text model, but the MLX-converted checkpoint loads and tokenizes text-only prompts correctly.
- Raw translated Japanese datasets contain many duplicate prompts and a non-trivial tail of too-short or too-long completions; prompt-level dedup is the first major data-quality choice.
- Cross-source prompt overlap is tiny, so the current multi-source mixture should mainly diversify Japanese response style rather than duplicate the same prompts.
- Full test evaluation over 800 examples currently takes about 7m20s at `batch_size=2`, so in-loop validation must stay cheaper than the final test pass.
- The first real LoRA run peaked around 50 GB unified memory, so there is headroom for modestly longer runs or slightly larger batches on this machine.