---
language:
- en
tags:
- qwen3
- telecom
- vocabulary-expansion
- continual-pretraining
- 5G
- 3GPP
base_model: Qwen/Qwen3-1.7B
license: apache-2.0
---

# Qwen3-1.7B — Telecom Vocabulary Injection

This model is [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) with **5,226 telecom-domain tokens** injected into its tokenizer and embedding matrices, prepared for Continual Pre-Training (CPT) on telecom datasets.

## Vocabulary Injection Details

| Parameter | Value |
|---|---|
| Base model | `Qwen/Qwen3-1.7B` |
| Base vocab size | 151,669 |
| Tokens injected | **5,226** |
| Final vocab size | **156,928** |
| Embedding init strategy | Mean-of-subpieces (sketched in the appendix below) |

## Why Vocabulary Injection?

Telecom terms such as `PDSCH`, `gNB`, `mmWave`, `eMBB`, and `HARQ` are split into multiple BPE subpieces by the base tokenizer, which wastes context window and makes it harder for the model to learn domain concepts. After injection, each term is a **single atomic token**.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")
model = AutoModelForCausalLM.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

# Verify injection: each telecom term should map to a single token
tokens = tokenizer.encode("PDSCH HARQ gNB mmWave", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(tokens))
```

## Intended Use

This model is a **pre-CPT checkpoint**: the embeddings for the new tokens are initialized via mean-of-subpieces but have not yet been trained on telecom data. Use it as the starting point for Continual Pre-Training; a minimal CPT sketch appears in the appendix.

## License

Apache 2.0, the same license as the base Qwen3 model.
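
## Appendix: Mean-of-Subpieces Initialization (Sketch)

The injection script itself is not shipped with this card. The snippet below is a minimal, illustrative sketch of how mean-of-subpieces initialization works; the term list, variable names, and dtype are assumptions for demonstration, not the exact procedure used to build this checkpoint.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE = "Qwen/Qwen3-1.7B"
# Illustrative stand-ins for the full 5,226-term telecom list
new_terms = ["PDSCH", "HARQ", "gNB", "mmWave", "eMBB"]

# Keep an untouched copy of the base tokenizer so the old subpiece splits stay visible
base_tok = AutoTokenizer.from_pretrained(BASE)
new_tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# Register the new tokens and grow the embedding matrices to match
new_tok.add_tokens(new_terms)
model.resize_token_embeddings(len(new_tok))

in_emb = model.get_input_embeddings().weight    # token ids -> hidden states
out_emb = model.get_output_embeddings().weight  # hidden states -> logits

with torch.no_grad():
    for term in new_terms:
        new_id = new_tok.convert_tokens_to_ids(term)
        # How the *base* tokenizer used to split this term
        sub_ids = base_tok.encode(term, add_special_tokens=False)
        # The mean of the old subpiece rows becomes the new token's starting embedding
        in_emb[new_id] = in_emb[sub_ids].mean(dim=0)
        out_emb[new_id] = out_emb[sub_ids].mean(dim=0)
```

If the checkpoint ties its input and output embeddings (as some smaller Qwen models do), the second assignment is redundant but harmless, since both names refer to the same tensor.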
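
## Appendix: Continual Pre-Training Sketch

A minimal sketch of using this checkpoint as a CPT starting point with the Hugging Face `Trainer`. The corpus file, sequence length, and hyperparameters are placeholders; substitute your own telecom corpus and training configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

REPO = "vimalgupta/qwen-telecom-injected_v3"
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(REPO)

# Placeholder corpus: one JSON object per line with a "text" field of raw telecom text
ds = load_dataset("json", data_files="telecom_corpus.jsonl", split="train")
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen3-telecom-cpt",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```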