---
language:
- en
tags:
- qwen3
- telecom
- vocabulary-expansion
- continual-pretraining
- 5G
- 3GPP
base_model: Qwen/Qwen3-1.7B
license: apache-2.0
---
# Qwen3-1.7B — Telecom Vocabulary Injection
This model is [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) with **5,226 telecom-domain tokens** injected into its tokenizer and embedding matrices, prepared for Continual Pre-Training (CPT) on telecom datasets.
## Vocabulary Injection Details
| Parameter | Value |
|---|---|
| Base model | `Qwen/Qwen3-1.7B` |
| Base vocab size | 151,669 |
| Tokens injected | **5,226** |
| Final vocab size | **156,928** |
| Embedding init strategy | Mean-of-subpieces |
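The final size of 156,928 is slightly larger than 151,669 + 5,226 = 156,895, which is consistent with the embedding matrix being padded to a multiple of 128. Below is a minimal sketch of the mean-of-subpieces strategy under that padding assumption, using a hypothetical five-term sample in place of the full 5,226-term list; each new embedding row is set to the mean of the embeddings of the subpieces the *base* tokenizer produces for the term:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")  # kept around for subpiece lookups
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")

telecom_terms = ["PDSCH", "gNB", "mmWave", "eMBB", "HARQ"]  # hypothetical sample of the full term list
tokenizer.add_tokens(telecom_terms)

# Grow the embedding matrix. pad_to_multiple_of=128 is the assumption that
# would reconcile 151,669 + 5,226 = 156,895 with the reported 156,928.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=128)

embeddings = model.get_input_embeddings().weight
output = model.get_output_embeddings()
with torch.no_grad():
    for term in telecom_terms:
        sub_ids = base_tokenizer.encode(term, add_special_tokens=False)
        new_id = tokenizer.convert_tokens_to_ids(term)
        embeddings[new_id] = embeddings[sub_ids].mean(dim=0)
        # If the LM head is untied, initialize its new row the same way.
        if output is not None and output.weight is not embeddings:
            output.weight[new_id] = output.weight[sub_ids].mean(dim=0)
```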
## Why Vocabulary Injection?
Telecom terms such as `PDSCH`, `gNB`, `mmWave`, `eMBB`, and `HARQ` are split into multiple BPE subpieces by the base tokenizer, which wastes context window and makes it harder for the model to learn domain concepts. After injection, each term is a **single atomic token**.
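A quick way to see the fragmentation is to compare the base tokenizer against this repo's injected one (the exact subpiece counts depend on the base BPE merges):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
injected = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

for term in ["PDSCH", "gNB", "mmWave", "eMBB", "HARQ"]:
    n_base = len(base.encode(term, add_special_tokens=False))
    n_inj = len(injected.encode(term, add_special_tokens=False))
    print(f"{term}: {n_base} subpieces before, {n_inj} token(s) after")
```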
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")
model = AutoModelForCausalLM.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

# Verify injection: each telecom term should come back as a single token
tokens = tokenizer.encode("PDSCH HARQ gNB mmWave", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(tokens))
```
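Depending on how the added tokens handle surrounding whitespace, the spaces between terms may show up as separate tokens in the printed list; the check that matters is that each term itself comes back as one ID rather than several subpieces.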
## Intended Use
This model is a **pre-CPT checkpoint**: the embeddings for the new tokens are initialized via mean-of-subpieces but have not yet been trained on telecom data. Use it as the starting point for Continual Pre-Training, for example along the lines of the sketch below.
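A hedged sketch of a CPT run with the Hugging Face `Trainer`, assuming a plain-text telecom corpus at the hypothetical path `telecom_corpus.txt`; all hyperparameters are placeholders to tune for your data, not a recipe from this repo:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "vimalgupta/qwen-telecom-injected_v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# telecom_corpus.txt is a hypothetical placeholder for your CPT data.
dataset = load_dataset("text", data_files={"train": "telecom_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen3-telecom-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,  # placeholder; tune for your corpus
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```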
## License
Apache 2.0 — same as the base Qwen3 model.