---
language:
- en
tags:
- qwen3
- telecom
- vocabulary-expansion
- continual-pretraining
- 5G
- 3GPP
base_model: Qwen/Qwen3-1.7B
license: apache-2.0
---

# Qwen3-1.7B — Telecom Vocabulary Injection

This model is [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) with **5,226 telecom-domain tokens** injected into its tokenizer and embedding matrices, prepared for Continual Pre-Training (CPT) on telecom datasets.

## Vocabulary Injection Details

| Parameter | Value |
|---|---|
| Base model | `Qwen/Qwen3-1.7B` |
| Base vocab size | 151,669 |
| Tokens injected | **5,226** |
| Final vocab size | **156,928** |
| Embedding init strategy | Mean-of-subpieces |

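For reference, the sketch below shows what mean-of-subpieces initialization typically looks like. It is illustrative only: the exact injection script for this checkpoint is not included here, the term list is a tiny placeholder subset, and `add_tokens` stands in for however the 5,226 tokens were actually merged into the vocabulary.

```python
# Illustrative sketch of mean-of-subpieces initialization (not the exact script
# used to build this checkpoint). Each new term's embedding row is set to the
# mean of the rows of the sub-pieces it maps to under the *base* tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "Qwen/Qwen3-1.7B"
new_terms = ["PDSCH", "gNB", "mmWave", "eMBB", "HARQ"]  # placeholder subset

tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name)

# Record how each term splits under the base vocabulary *before* adding it,
# because encode() returns the new single id afterwards.
subpieces = {t: tokenizer.encode(t, add_special_tokens=False) for t in new_terms}

tokenizer.add_tokens(new_terms)
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight
with torch.no_grad():
    for term, piece_ids in subpieces.items():
        new_id = tokenizer.convert_tokens_to_ids(term)
        emb[new_id] = emb[piece_ids].mean(dim=0)  # mean of sub-piece rows
# If the lm_head is untied, its new rows would be initialized the same way.
```
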
## Why Vocabulary Injection?

Telecom terms like `PDSCH`, `gNB`, `mmWave`, `eMBB`, and `HARQ` are split into multiple BPE subpieces by the base tokenizer, which wastes context window and makes it harder for the model to learn these domain concepts as units. After injection, each term is a **single atomic token**.

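As a rough illustration of the savings, the snippet below compares how the same sentence tokenizes under the base tokenizer and under this repo's injected tokenizer (exact token counts depend on the string and the base BPE merges):

```python
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
injected_tok = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

text = "The gNB schedules PDSCH with HARQ feedback over mmWave"
print(len(base_tok.encode(text, add_special_tokens=False)))      # terms split into several BPE pieces
print(len(injected_tok.encode(text, add_special_tokens=False)))  # each injected term is one id
```
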
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")
model = AutoModelForCausalLM.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

# Verify injection
tokens = tokenizer.encode("PDSCH HARQ gNB mmWave", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(tokens))
# Each term should be a single token
```

## Intended Use

This model is a **pre-CPT checkpoint** — the embeddings for the new tokens are initialized via mean-of-subpieces but have not yet been trained on telecom data. Use this as the starting point for Continual Pre-Training.

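As a rough sketch of that next step (not a recipe from this repo: the dataset path, sequence length, and hyperparameters below are placeholders), CPT is an ordinary causal-LM training run on top of this checkpoint:

```python
# Minimal CPT sketch with the Hugging Face Trainer; the corpus file and all
# hyperparameters are placeholders to be replaced with your own telecom data
# and tuned values.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

repo = "vimalgupta/qwen-telecom-injected_v3"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# Placeholder: a plain-text telecom corpus, one document or passage per line.
raw = load_dataset("text", data_files={"train": "telecom_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen3-telecom-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
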
## License

Apache 2.0 — same as the base Qwen3 model.
|