
Autotrain: Qwen3.5-2B Japanese adaptation

Objective

Improve Qwen/Qwen3.5-2B at Japanese instruction following and Japanese-only responses. Success means lower held-out Japanese instruction tuning loss on a fixed test split built from multiple Japanese datasets.

Hardware & Framework

  • Hardware: Apple M4 Max, 128 GB unified memory
  • Framework: mlx-lm
  • Detection: uname -m returned arm64 on macOS 15.7.4, so this session uses MLX/Apple Silicon fine-tuning.

Base Model

  • Upstream model: Qwen/Qwen3.5-2B
  • Local training artifact: mlx-community/Qwen3.5-2B-bf16 in ./model
  • Size: ~2.27B BF16 parameters
  • Quantization: BF16
  • Architecture: Qwen3_5ForConditionalGeneration used in text-only mode via chat template

HF Session Targets

  • Adapter repo: victor/qwen3_5_2b-japanese-lora
  • Session bucket: hf://buckets/victor/autotrain-japanese-qwen35-2b

Metrics

  • Primary: test_loss (unitless, lower is better)
  • Secondary: val_loss, test_ppl, train_loss, peak_mem_gb
  • Track val_loss alongside test_loss — divergence suggests overfitting or test-set gaming
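The secondary metric test_ppl is just the exponential of the mean token-level cross-entropy loss (natural log), so the two primary numbers are redundant checks on each other. A quick sanity check, assuming that standard relationship (the reported pairs match it to within loss rounding):

```python
import math

def perplexity(loss: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(loss)

# Reported session numbers are consistent with this relation:
print(round(perplexity(2.988), 1))  # baseline: 19.8 (reported test_ppl=19.837 from a higher-precision loss)
print(round(perplexity(2.495), 1))  # first LoRA win: 12.1 (reported 12.127)
```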

Dataset Splits

  • Sources:
    • kunishou/databricks-dolly-15k-ja
    • kunishou/oasst1-89k-ja
    • kunishou/hh-rlhf-49k-ja
  • Phase-1 curation already applied:
    • drop ng_translation rows where available
    • prompt length 4–1200 chars
    • completion length 20–1200 chars
    • Japanese character ratio >= 0.15
    • dedupe to one canonical completion per prompt hash to avoid prompt leakage across splits
  • Available after filtering/dedup:
    • Dolly: 12,791
    • OASST: 20,595
    • HH-RLHF: 23,534
  • Train: 4,096 examples
    • Dolly 1,024 / HH 1,536 / OASST 1,536
  • Val: 800 examples
    • Dolly 250 / HH 275 / OASST 275
  • Test: 800 examples — FIXED at session start
    • Dolly 250 / HH 275 / OASST 275
  • Fresh validation: sample a new 800-example validation subset from the fixed 2,400-example validation pool by changing FRESH_VAL_SEED
  • Sacred split manifests:
    • splits/test_ids.txt
    • splits/val_pool_ids.txt
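The phase-1 curation rules above can be sketched as a single filter pass. This is an illustrative reimplementation, not the code in src/prepare_data.py: the `japanese_ratio` character ranges and the `(prompt, completion)` row shape are assumptions, while the thresholds come from the list above.

```python
import hashlib

def japanese_ratio(text: str) -> float:
    """Fraction of characters in the Hiragana, Katakana, or CJK ideograph blocks."""
    if not text:
        return 0.0
    jp = sum(
        1 for ch in text
        if "\u3040" <= ch <= "\u30ff"   # Hiragana + Katakana
        or "\u4e00" <= ch <= "\u9fff"   # CJK unified ideographs
    )
    return jp / len(text)

def curate(rows):
    """Apply the phase-1 filters: length bounds, Japanese ratio, prompt-hash dedup."""
    seen = set()
    kept = []
    for prompt, completion in rows:
        if not (4 <= len(prompt) <= 1200 and 20 <= len(completion) <= 1200):
            continue
        if japanese_ratio(prompt + completion) < 0.15:
            continue
        h = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if h in seen:   # keep one canonical completion per prompt hash
            continue
        seen.add(h)
        kept.append((prompt, completion))
    return kept
```

Deduping on the prompt hash before splitting is what prevents the same prompt from landing in both train and the sacred test split.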

How to Run

  • Baseline eval only: SKIP_TRAIN=1 ./autotrain.sh
  • Fine-tuning run: ./autotrain.sh
  • Fresh validation rotation: FRESH_VAL_SEED=1 ./autotrain.sh
  • Output: one METRIC name=value line per metric
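The METRIC lines are deliberately machine-parseable so downstream tooling never has to scrape free-form MLX logs. A minimal parser in the spirit of src/parse_mlx_lora_log.py (this sketch is illustrative; the regex and function name are not taken from that file):

```python
import re

# Matches exactly one "METRIC name=value" line; everything else is ignored.
METRIC_RE = re.compile(r"^METRIC\s+(\w+)=([-+0-9.eE]+)\s*$")

def parse_metrics(lines):
    """Collect {name: float_value} from METRIC lines in a log stream."""
    out = {}
    for line in lines:
        m = METRIC_RE.match(line.strip())
        if m:
            out[m.group(1)] = float(m.group(2))
    return out

log = """\
Iter 400: Val loss 2.448
METRIC test_loss=2.495
METRIC test_ppl=12.127
METRIC peak_mem_gb=50.2
"""
print(parse_metrics(log.splitlines()))
```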

Files in Scope

  • .gitignore — ignore large/generated artifacts
  • autotrain.md — living session memory
  • autotrain.sh — benchmark/train/eval entrypoint
  • autotrain.ideas.md — deferred ideas
  • config/data_config.json — data curation and split policy
  • splits/test_ids.txt — fixed sacred test split ids
  • splits/val_pool_ids.txt — fixed validation pool ids
  • src/prepare_data.py — data normalization, curation, split generation
  • src/parse_mlx_lora_log.py — metric extraction from MLX logs

Off Limits

  • model/ — downloaded base model snapshot
  • data/raw/ — downloaded dataset sources
  • global system Python / Homebrew packages

Constraints

  • Apple Silicon unified memory is shared with the display; keep training stable and avoid needlessly aggressive batch sizes
  • Prefer runs that fit comfortably on-device and finish in roughly <= 15 minutes while searching
  • Keep the fixed test split unchanged
  • Do not replace the base model family unless the user asks

Experiment Priority Order

  1. Phase 1: Data Quality — source mix, stricter HH/OASST filtering, smaller higher-quality subsets
  2. Phase 2: Prompt & Output Format — Japanese-only prefix/system wording, formatting consistency
  3. Phase 3: LoRA Architecture — num_layers, maybe all layers vs 16
  4. Phase 4: Training Hyperparameters — LR, iters, batch size, max seq length
  5. Phase 5: Regularization — only if overfitting appears

Anti-Thrash Rules

  1. 5+ consecutive discards → pivot phase
  2. 8 near-tie metrics in a row → make a structural change
  3. If only hyperparameter ideas remain, go back to data/prompt work first
  4. After every keep, or every 10 experiments, run fresh validation with a new FRESH_VAL_SEED
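The fresh-validation rotation in rule 4 only works if reseeding is deterministic: the same FRESH_VAL_SEED must always pick the same 800 ids from the fixed 2,400-id pool, so comparisons across runs stay honest. A sketch with stdlib random (the pool file name comes from the splits section; the function itself is illustrative):

```python
import random

def fresh_val_ids(pool_ids, seed, k=800):
    """Deterministically sample a fresh validation subset from the fixed pool.

    Same seed -> same subset, so any run is reproducible; bumping the seed
    rotates the subset without ever touching the sacred test split.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(list(pool_ids), k))

# Stand-in ids; in the harness these would be read from splits/val_pool_ids.txt.
pool = [f"ex-{i:05d}" for i in range(2400)]
subset = fresh_val_ids(pool, seed=1)
print(len(subset))  # 800
```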

What's Been Tried

Key Wins

  • Baseline established: base Qwen3.5-2B with no adapter scored test_loss=2.988 / test_ppl=19.837 on the fixed 800-example Japanese test split. This is the number to beat.
  • First LoRA win: a 4,096-example balanced Japanese mixture (Dolly 1024 / HH 1536 / OASST 1536), num_layers=16, lr=2e-5, batch_size=2, grad_acc=8, iters=400, max_seq_length=1024, val_batches=100 improved to test_loss=2.495 / test_ppl=12.127 with final val_loss=2.448.

Dead Ends

  • Two initial eval-only harness attempts crashed because mlx_lm.lora tries to load adapters by default during --test; baseline mode must pass --adapter-path "" explicitly.
  • A fixed prompt prefix (日本語で自然かつ簡潔に答えてください。, i.e. "Please answer naturally and concisely in Japanese.") did not help: test_loss moved from 2.495 to 2.497.
  • An OASST-heavy source mix (1024 Dolly / 768 HH / 2304 OASST) crashed around iter 200 with Abort trap: 6, so any further source-mix changes should isolate whether the instability is MLX-runtime-related or data-shape-related.
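Given the adapter-loading dead end above, it is worth pinning down the baseline invocation in one place. This sketch only assembles the command; the `--test` and `--adapter-path ""` flags come from the notes above, while `--model` and `--data` are assumptions about the mlx_lm.lora CLI that may differ across mlx-lm versions:

```python
def baseline_eval_cmd(model_dir="./model", data_dir="./data"):
    """Build the no-adapter baseline eval command for mlx_lm.lora.

    The explicit empty --adapter-path is the fix for the crash noted above:
    without it, mlx_lm.lora tries to load adapters by default during --test.
    """
    return [
        "mlx_lm.lora",
        "--model", model_dir,
        "--data", data_dir,
        "--test",
        "--adapter-path", "",   # explicitly disable adapter loading
    ]

cmd = baseline_eval_cmd()
print(" ".join(repr(a) if a == "" else a for a in cmd))
```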

Observations

  • The repo started empty, so the full autotrain harness was created from scratch.
  • Qwen/Qwen3.5-2B is exposed on the Hub as an image-text-to-text model, but the MLX-converted checkpoint loads and tokenizes text-only prompts correctly.
  • Raw translated Japanese datasets contain many duplicate prompts and a non-trivial tail of too-short or too-long completions; prompt-level dedup is the first major data-quality choice.
  • Cross-source prompt overlap is tiny, so the current multi-source mixture should mainly diversify Japanese response style rather than duplicate the same prompts.
  • Full test evaluation over 800 examples currently takes about 7m20s at batch_size=2, so in-loop validation must stay cheaper than the final test pass.
  • The first real LoRA run peaked around 50 GB unified memory, so there is headroom for modestly longer runs or slightly larger batches on this machine.
