---
language:
  - en
tags:
  - qwen3
  - telecom
  - vocabulary-expansion
  - continual-pretraining
  - 5G
  - 3GPP
base_model: Qwen/Qwen3-1.7B
license: apache-2.0
---

# Qwen3-1.7B — Telecom Vocabulary Injection

This model is Qwen/Qwen3-1.7B with 5,226 telecom-domain tokens added to its tokenizer and corresponding rows injected into its embedding matrix, prepared as a starting checkpoint for Continual Pre-Training (CPT) on telecom datasets.

## Vocabulary Injection Details

| Parameter | Value |
| --- | --- |
| Base model | Qwen/Qwen3-1.7B |
| Base vocab size | 151,669 |
| Tokens injected | 5,226 |
| Final vocab size | 156,928 |
| Embedding init strategy | Mean-of-subpieces |

Note that 151,669 + 5,226 = 156,895, while the final size is 156,928, the next multiple of 128; this suggests the resized embedding matrix was padded for hardware efficiency.

## Why Vocabulary Injection?

Telecom terms such as PDSCH, gNB, mmWave, eMBB, and HARQ are split into multiple BPE subpieces by the base tokenizer, which wastes context window and makes it harder for the model to learn domain concepts. After injection, each such term is a single atomic token.
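
The recipe this implies is standard: append the new terms to the tokenizer, resize the embedding matrix, and initialize each new row from the pieces the term used to split into. Below is a minimal sketch using the usual `transformers` APIs; it is a reconstruction, not the exact script used for this checkpoint, and the token list is an illustrative subset of the 5,226-term vocabulary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Illustrative subset of the injected telecom vocabulary.
new_terms = ["PDSCH", "gNB", "mmWave", "eMBB", "HARQ"]

# Record how the *base* tokenizer splits each term, before adding it.
old_splits = {t: tokenizer.encode(t, add_special_tokens=False) for t in new_terms}

tokenizer.add_tokens(new_terms)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for term, piece_ids in old_splits.items():
        new_id = tokenizer.convert_tokens_to_ids(term)
        # Mean-of-subpieces: average the base embeddings of the old pieces.
        emb[new_id] = emb[piece_ids].mean(dim=0)

# Qwen3-1.7B ties input and output embeddings, so the LM head picks up the
# new rows automatically; an untied model would need the same treatment on
# model.get_output_embeddings().
```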

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")
model = AutoModelForCausalLM.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

# Verify injection: each telecom term should come back as a single token
tokens = tokenizer.encode("PDSCH HARQ gNB mmWave", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(tokens))
```
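
For comparison, encoding the same string with the base tokenizer shows the subpiece splitting that injection removes (exact counts depend on tokenizer versions):

```python
base_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
text = "PDSCH HARQ gNB mmWave"
print(len(base_tok.encode(text, add_special_tokens=False)))   # many BPE pieces
print(len(tokenizer.encode(text, add_special_tokens=False)))  # fewer after injection
```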

## Intended Use

This model is a pre-CPT checkpoint — the embeddings for new tokens are initialized via mean-of-subpieces but have not yet been trained on telecom data. Use this as the starting point for Continual Pre-Training.
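
A minimal CPT setup might look like the sketch below, using a plain causal-LM objective so the newly initialized rows receive gradient signal. `telecom_corpus.txt` and all hyperparameters are placeholders, not a recommended recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

repo = "vimalgupta/qwen-telecom-injected_v3"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# Hypothetical local corpus: 3GPP specs, telecom docs, etc., one text per line.
dataset = load_dataset("text", data_files={"train": "telecom_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_set = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen3-telecom-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=5e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```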

## License

Apache 2.0 — same as the base Qwen3 model.