---
language:
  - en
tags:
  - qwen3
  - telecom
  - vocabulary-expansion
  - continual-pretraining
  - 5G
  - 3GPP
base_model: Qwen/Qwen3-1.7B
license: apache-2.0
---

# Qwen3-1.7B — Telecom Vocabulary Injection

This model is Qwen/Qwen3-1.7B with 5,226 telecom-domain tokens added to its tokenizer and corresponding rows injected into its embedding matrix, prepared as a starting checkpoint for Continual Pre-Training (CPT) on telecom datasets.

## Vocabulary Injection Details

| Parameter | Value |
| --- | --- |
| Base model | Qwen/Qwen3-1.7B |
| Base vocab size | 151,669 |
| Tokens injected | 5,226 |
| Final vocab size | 156,928 |
| Embedding init strategy | Mean-of-subpieces |

Note that 151,669 + 5,226 = 156,895, while the final size is 156,928, the next multiple of 128; this suggests the resized embedding matrix was padded for hardware efficiency.

## Why Vocabulary Injection?

Telecom terms such as PDSCH, gNB, mmWave, eMBB, and HARQ are split into multiple BPE subpieces by the base tokenizer, which wastes context window and makes it harder for the model to learn domain concepts. After injection, each such term is a single atomic token.
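
The recipe this implies is standard: append the new terms to the tokenizer, resize the embedding matrix, and initialize each new row from the pieces the term used to split into. Below is a minimal sketch using the usual `transformers` APIs; it is a reconstruction, not the exact script used for this checkpoint, and the token list is an illustrative subset of the 5,226-term vocabulary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Illustrative subset of the injected telecom vocabulary.
new_terms = ["PDSCH", "gNB", "mmWave", "eMBB", "HARQ"]

# Record how the *base* tokenizer splits each term, before adding it.
old_splits = {t: tokenizer.encode(t, add_special_tokens=False) for t in new_terms}

tokenizer.add_tokens(new_terms)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for term, piece_ids in old_splits.items():
        new_id = tokenizer.convert_tokens_to_ids(term)
        # Mean-of-subpieces: average the base embeddings of the old pieces.
        emb[new_id] = emb[piece_ids].mean(dim=0)

# Qwen3-1.7B ties input and output embeddings, so the LM head picks up the
# new rows automatically; an untied model would need the same treatment on
# model.get_output_embeddings().
```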

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")
model = AutoModelForCausalLM.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

# Verify injection: each telecom term should come back as a single token
tokens = tokenizer.encode("PDSCH HARQ gNB mmWave", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(tokens))
```
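
For comparison, encoding the same string with the base tokenizer shows the subpiece splitting that injection removes (exact counts depend on tokenizer versions):

```python
base_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
text = "PDSCH HARQ gNB mmWave"
print(len(base_tok.encode(text, add_special_tokens=False)))   # many BPE pieces
print(len(tokenizer.encode(text, add_special_tokens=False)))  # fewer after injection
```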

## Intended Use

This model is a pre-CPT checkpoint — the embeddings for new tokens are initialized via mean-of-subpieces but have not yet been trained on telecom data. Use this as the starting point for Continual Pre-Training.
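
A minimal CPT setup might look like the sketch below, using a plain causal-LM objective so the newly initialized rows receive gradient signal. `telecom_corpus.txt` and all hyperparameters are placeholders, not a recommended recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

repo = "vimalgupta/qwen-telecom-injected_v3"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# Hypothetical local corpus: 3GPP specs, telecom docs, etc., one text per line.
dataset = load_dataset("text", data_files={"train": "telecom_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_set = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen3-telecom-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=5e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```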

## License

Apache 2.0 — same as the base Qwen3 model.