Qwen3-1.8B Stage 1: Vocabulary Expansion for Semantic IDs
Overview
Qwen3-1.8B after Stage 1 (vocabulary expansion): 1,027 semantic-ID (SID) tokens were added to the tokenizer, and only the embedding matrix was trained (input and output embeddings are tied); all other parameters were kept frozen.
This checkpoint serves as the starting point for the Stage 2 full fine-tuning experiments.
Training
- Base model: Qwen/Qwen3-1.7B (tied embeddings)
- New tokens: 1,027 (3 structural + 4×256 codebook tokens)
- Trainable parameters: 312M / 1.7B (18.2%)
- Dataset: Amazon Pet Supplies (64K samples from 4.7M conversations)
- Steps: 2,000
- LR: 1×10⁻³, cosine scheduler
- Optimizer: adamw_torch_fused
- Batch: 32 per device × 2 gradient-accumulation steps = 64 effective
- Hardware: NVIDIA H100 80GB
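The token arithmetic above (3 structural + 4×256 codebook tokens = 1,027) can be sketched directly. Note that only two of the three structural tokens (`<|sid_start|>` and `<|sid_end|>`) are named in this card; the spelling of the third is not specified here.

```python
# Sketch: enumerate the SID codebook vocabulary described above.
# Four codebook levels (prefixes A-D, matching the SID format example),
# 256 codes per level.
codebook_tokens = [
    f"<|{level}{code}|>" for level in "ABCD" for code in range(256)
]
# Two of the three structural tokens are named in this card; the third
# exists in the tokenizer but is not spelled out here.
known_structural = ["<|sid_start|>", "<|sid_end|>"]

print(len(codebook_tokens))      # 1024 codebook tokens
print(len(codebook_tokens) + 3)  # plus 3 structural tokens = 1027 total
```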
SID Token Format
```
<|sid_start|><|A42|><|B128|><|C64|><|D0|><|sid_end|>
```
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("kalistratov/qwen3-1.8b-stage1-sid")
tokenizer = AutoTokenizer.from_pretrained("kalistratov/qwen3-1.8b-stage1-sid")

# Verify that SID tokens round-trip through the tokenizer
sid = "<|sid_start|><|A10|><|B20|><|C30|><|D0|><|sid_end|>"
ids = tokenizer.encode(sid, add_special_tokens=False)
assert tokenizer.decode(ids, skip_special_tokens=False) == sid
```
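Stage 1 trains only the embedding matrix while everything else stays frozen. A minimal sketch of that freezing logic, using a toy module rather than the actual Qwen3 model (parameter names and sizes here are illustrative, not the real ones):

```python
import torch.nn as nn

# Toy stand-in for a causal LM. The real Qwen3 parameter names differ,
# but the freezing pattern is the same: unfreeze only embedding weights.
# With tied embeddings, this is a single shared matrix.
model = nn.ModuleDict({
    "embed_tokens": nn.Embedding(1000, 64),  # 64,000 params
    "layer": nn.Linear(64, 64),              # 4,160 params
})

for name, param in model.named_parameters():
    param.requires_grad = "embed_tokens" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total}")  # trainable: 64000 / 68160
```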
Citation
Master's thesis, Moscow Institute of Physics and Technology (MIPT), 2026.