# Qwen3-1.7B: Telecom Vocabulary Injection

This model is Qwen/Qwen3-1.7B with 5,226 telecom-domain tokens injected into its tokenizer and embedding matrices, prepared for Continual Pre-Training (CPT) on telecom datasets.

## Vocabulary Injection Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-1.7B |
| Base vocab size | 151,669 |
| Tokens injected | 5,226 |
| Final vocab size | 156,928 |
| Embedding init strategy | Mean-of-subpieces |
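The mean-of-subpieces strategy initializes each injected token's embedding row as the mean of the embedding rows of the subpieces the base tokenizer would split it into. A minimal NumPy sketch of the idea (the toy vocabulary and the split are made up for illustration):

```python
import numpy as np

def mean_of_subpieces(base_embeddings, subpiece_ids):
    """Initialize a new token's embedding as the mean of the
    embedding rows of its base-tokenizer subpieces."""
    return base_embeddings[subpiece_ids].mean(axis=0)

# Toy base embedding matrix: 4 tokens, 3 dimensions.
base_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0],
                     [1.0, 1.0, 1.0]])

# Pretend "PDSCH" splits into base subpieces with ids [0, 3];
# its new row is the mean of those rows.
new_row = mean_of_subpieces(base_emb, [0, 3])
print(new_row)  # [1.  0.5 0.5]
```

In the real model the new rows are appended after resizing the token embeddings, and both the input and output embedding matrices are initialized the same way.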

## Why Vocabulary Injection?

Telecom terms such as PDSCH, gNB, mmWave, eMBB, and HARQ are split into multiple BPE subpieces by the base tokenizer, which wastes context window and makes it harder for the model to learn domain concepts. After injection, each term is a single atomic token.
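As a back-of-the-envelope illustration of the saving (the subpiece counts below are hypothetical; the actual splits depend on the base tokenizer's BPE merges):

```python
# Hypothetical subpiece counts under the base tokenizer.
base_split = {"PDSCH": 3, "HARQ": 2, "gNB": 2, "mmWave": 3}

cost_before = sum(base_split.values())  # tokens used by the base tokenizer
cost_after = len(base_split)            # one atomic token per term after injection
print(cost_before, cost_after)  # 10 4
```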

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")
model = AutoModelForCausalLM.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

# Verify injection
tokens = tokenizer.encode("PDSCH HARQ gNB mmWave", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(tokens))
# Each term should be a single token
```

## Intended Use

This model is a pre-CPT checkpoint: the embeddings for new tokens are initialized via mean-of-subpieces but have not yet been trained on telecom data. Use this as the starting point for Continual Pre-Training.
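One common warm-up before full CPT is to train only the newly injected embedding rows for a while, keeping the base vocabulary's rows frozen so pre-trained knowledge is not disturbed. A minimal sketch of such a masked update with plain SGD (the warm-up itself and all names here are assumptions, not something this checkpoint prescribes):

```python
import numpy as np

def sgd_step_new_rows_only(emb, grad, lr, num_base_rows):
    """Apply an SGD update only to embedding rows appended after
    num_base_rows; the base rows stay frozen."""
    out = emb.copy()
    out[num_base_rows:] -= lr * grad[num_base_rows:]
    return out

# Toy example: 3 base rows plus 2 injected rows, 2 dimensions.
emb = np.ones((5, 2))
grad = np.full((5, 2), 0.5)
updated = sgd_step_new_rows_only(emb, grad, lr=0.1, num_base_rows=3)
# Base rows stay at 1.0; injected rows move to 0.95.
```

In a framework like PyTorch the same effect is usually achieved with a gradient hook or separate parameter groups rather than a manual step.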

## License

Apache 2.0, the same as the base Qwen3 model.
