Qwen3-1.8B Stage 1: Vocabulary Expansion for Semantic IDs

Overview

Qwen3-1.8B after Stage 1 (vocabulary expansion): 1,027 SID tokens were added to the tokenizer, and only the embedding matrices were trained; all other parameters of the 1.7B base model remain frozen.

This checkpoint serves as the starting point for Stage 2 full fine-tuning experiments.

Training

  • Base model: Qwen/Qwen3-1.7B (tied embeddings)
  • New tokens: 1,027 (3 structural + 4×256 codebook tokens)
  • Trainable parameters: 312M / 1.7B (18.2%)
  • Dataset: Amazon Pet Supplies (64K samples from 4.7M conversations)
  • Steps: 2,000
  • LR: 1×10⁻³, cosine scheduler
  • Optimizer: adamw_torch_fused
  • Batch: 32 × 2 = 64 effective
  • Hardware: NVIDIA H100 80GB
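The recipe above, embedding-only training on a model with tied embeddings, can be sketched on a toy module. This is an illustrative assumption of the setup, not the thesis code; the class, dimensions, and names below are made up, and with tied weights the single embedding matrix serves as both input embedding and output head, so unfreezing it alone is enough.

```python
import torch
import torch.nn as nn

class ToyTiedLM(nn.Module):
    """Toy language model with a tied input/output embedding."""
    def __init__(self, vocab: int, hidden: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.body = nn.Linear(hidden, hidden)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.body(self.embed(ids))
        return h @ self.embed.weight.T  # tied output head

model = ToyTiedLM(vocab=1000, hidden=64)

# Stage 1 freezing recipe: freeze everything, then unfreeze only
# the (tied) embedding matrix so new SID rows can be learned.
for p in model.parameters():
    p.requires_grad = False
model.embed.weight.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

With Qwen3-1.7B's real dimensions the same count works out to the ~312M trainable parameters listed above.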

SID Token Format

<|sid_start|><|A42|><|B128|><|C64|><|D0|><|sid_end|>
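The expanded vocabulary can be enumerated directly from this format: 3 structural tokens plus 4 codebooks (A–D) of 256 codes each. A minimal sketch; the card names only <|sid_start|> and <|sid_end|>, so the third structural token below (<|sid_sep|>) is a placeholder, not the actual token.

```python
# 3 structural tokens; "<|sid_sep|>" is a hypothetical stand-in for
# the third token, which is not named on this card.
structural = ["<|sid_start|>", "<|sid_end|>", "<|sid_sep|>"]

# 4 codebooks (A-D) x 256 codes each = 1,024 codebook tokens.
codebook = [f"<|{cb}{i}|>" for cb in "ABCD" for i in range(256)]

sid_tokens = structural + codebook  # 1,027 tokens in total

def to_sid(a: int, b: int, c: int, d: int) -> str:
    """Render four codebook indices as one SID string."""
    return f"<|sid_start|><|A{a}|><|B{b}|><|C{c}|><|D{d}|><|sid_end|>"
```

For example, to_sid(42, 128, 64, 0) reproduces the SID shown above.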

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("kalistratov/qwen3-1.8b-stage1-sid")
tokenizer = AutoTokenizer.from_pretrained("kalistratov/qwen3-1.8b-stage1-sid")

# Verify SID tokens
sid = "<|sid_start|><|A10|><|B20|><|C30|><|D0|><|sid_end|>"
ids = tokenizer.encode(sid, add_special_tokens=False)
assert tokenizer.decode(ids, skip_special_tokens=False) == sid

Citation

Master's thesis, Moscow Institute of Physics and Technology (MIPT), 2026.
