# Qwen3-1.7B: Telecom Vocabulary Injection

This model is Qwen/Qwen3-1.7B with 5,226 telecom-domain tokens injected into its tokenizer and embedding matrices, prepared for Continual Pre-Training (CPT) on telecom datasets.

## Vocabulary Injection Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-1.7B |
| Base vocab size | 151,669 |
| Tokens injected | 5,226 |
| Final vocab size | 156,928 |
| Embedding init strategy | Mean-of-subpieces |
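The mean-of-subpieces strategy initializes each injected token's embedding row as the mean of the embedding rows of the subpieces the base tokenizer would split it into. A minimal NumPy sketch of the idea (the toy vocabulary and the split are made up for illustration):

```python
import numpy as np

def mean_of_subpieces(base_embeddings, subpiece_ids):
    """Initialize a new token's embedding as the mean of the
    embedding rows of its base-tokenizer subpieces."""
    return base_embeddings[subpiece_ids].mean(axis=0)

# Toy base embedding matrix: 4 tokens, 3 dimensions.
base_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0],
                     [1.0, 1.0, 1.0]])

# Pretend "PDSCH" splits into base subpieces with ids [0, 3];
# its new row is the mean of those rows.
new_row = mean_of_subpieces(base_emb, [0, 3])
print(new_row)  # [1.  0.5 0.5]
```

In the real model the new rows are appended after resizing the token embeddings, and both the input and output embedding matrices are initialized the same way.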

## Why Vocabulary Injection?

Telecom terms such as PDSCH, gNB, mmWave, eMBB, and HARQ are split into multiple BPE subpieces by the base tokenizer, which wastes context window and makes it harder for the model to learn domain concepts. After injection, each term is a single atomic token.
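As a back-of-the-envelope illustration of the saving (the subpiece counts below are hypothetical; the actual splits depend on the base tokenizer's BPE merges):

```python
# Hypothetical subpiece counts under the base tokenizer.
base_split = {"PDSCH": 3, "HARQ": 2, "gNB": 2, "mmWave": 3}

cost_before = sum(base_split.values())  # tokens used by the base tokenizer
cost_after = len(base_split)            # one atomic token per term after injection
print(cost_before, cost_after)  # 10 4
```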

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")
model = AutoModelForCausalLM.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

# Verify injection
tokens = tokenizer.encode("PDSCH HARQ gNB mmWave", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(tokens))
# Each term should be a single token
```

## Intended Use

This model is a pre-CPT checkpoint: the embeddings for new tokens are initialized via mean-of-subpieces but have not yet been trained on telecom data. Use this as the starting point for Continual Pre-Training.
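One common warm-up before full CPT is to train only the newly injected embedding rows for a while, keeping the base vocabulary's rows frozen so pre-trained knowledge is not disturbed. A minimal sketch of such a masked update with plain SGD (the warm-up itself and all names here are assumptions, not something this checkpoint prescribes):

```python
import numpy as np

def sgd_step_new_rows_only(emb, grad, lr, num_base_rows):
    """Apply an SGD update only to embedding rows appended after
    num_base_rows; the base rows stay frozen."""
    out = emb.copy()
    out[num_base_rows:] -= lr * grad[num_base_rows:]
    return out

# Toy example: 3 base rows plus 2 injected rows, 2 dimensions.
emb = np.ones((5, 2))
grad = np.full((5, 2), 0.5)
updated = sgd_step_new_rows_only(emb, grad, lr=0.1, num_base_rows=3)
# Base rows stay at 1.0; injected rows move to 0.95.
```

In a framework like PyTorch the same effect is usually achieved with a gradient hook or separate parameter groups rather than a manual step.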

## License

Apache 2.0, the same as the base Qwen3 model.
