# Qwen3-1.7B: Telecom Vocabulary Injection
This model is `Qwen/Qwen3-1.7B` with 5,226 telecom-domain tokens injected into its tokenizer and embedding matrices, prepared for Continual Pre-Training (CPT) on telecom datasets.
## Vocabulary Injection Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-1.7B |
| Base vocab size | 151,669 |
| Tokens injected | 5,226 |
| Final vocab size | 156,928 |
| Embedding init strategy | Mean-of-subpieces |
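The mean-of-subpieces strategy initializes each new token's embedding row as the average of the embeddings of the BPE subpieces the base tokenizer would have used for that term. The sketch below illustrates the idea; it is not the actual injection script for this checkpoint, and the token list is just a small sample of the 5,226 injected terms:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE = "Qwen/Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(BASE)   # receives the new tokens
base_tok = AutoTokenizer.from_pretrained(BASE)    # untouched copy, used for subpiece lookup

# Illustrative sample only; the real injection uses the full 5,226-term list
new_tokens = ["PDSCH", "gNB", "mmWave", "eMBB", "HARQ"]
tokenizer.add_tokens(new_tokens)

model = AutoModelForCausalLM.from_pretrained(BASE)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # Subpiece IDs the *base* tokenizer would have produced for this term
        sub_ids = base_tok.encode(tok, add_special_tokens=False)
        # New row = mean of the subpiece embeddings it replaces
        # (if tie_word_embeddings is set, this row is shared with the LM head)
        emb[new_id] = emb[sub_ids].mean(dim=0)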
## Why Vocabulary Injection?
Telecom terms such as PDSCH, gNB, mmWave, eMBB, and HARQ are split into multiple BPE subpieces by the base tokenizer, which wastes context window and makes it harder for the model to learn domain concepts. After injection, each term is a single atomic token.
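A quick way to see the savings is to compare token counts between the base and injected tokenizers. Exact counts depend on how the base BPE happens to split each term, so the comments below are indicative rather than guaranteed:

```python
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
injected_tok = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

text = "PDSCH HARQ gNB mmWave eMBB"
print(len(base_tok.encode(text, add_special_tokens=False)))      # multiple subpieces per term
print(len(injected_tok.encode(text, add_special_tokens=False)))  # noticeably fewer after injection
```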
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")
model = AutoModelForCausalLM.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

# Verify injection: each telecom term should come back as a single token
tokens = tokenizer.encode("PDSCH HARQ gNB mmWave", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(tokens))
```
## Intended Use
This model is a pre-CPT checkpoint: the embeddings for the new tokens are initialized via mean-of-subpieces but have not yet been trained on telecom data. Use this as the starting point for Continual Pre-Training.
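A minimal CPT starting-point sketch using the standard `transformers` Trainer. The corpus file name and all hyperparameters below are placeholders, not recommendations from this repo:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

repo = "vimalgupta/qwen-telecom-injected_v3"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# Placeholder corpus: swap in your own telecom text data
raw = load_dataset("text", data_files={"train": "telecom_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen-telecom-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=train,
    # Causal LM objective (mlm=False) for continual pre-training
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```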
## License

Apache 2.0, the same as the base Qwen3 model.