---
language:
- en
tags:
- qwen3
- telecom
- vocabulary-expansion
- continual-pretraining
- 5G
- 3GPP
base_model: Qwen/Qwen3-1.7B
license: apache-2.0
---

# Qwen3-1.7B — Telecom Vocabulary Injection

This model is [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) with **5,226 telecom-domain tokens** injected into its tokenizer and embedding matrices, prepared for Continual Pre-Training (CPT) on telecom datasets.

## Vocabulary Injection Details

| Parameter | Value |
|---|---|
| Base model | `Qwen/Qwen3-1.7B` |
| Base vocab size | 151,669 |
| Tokens injected | **5,226** |
| Final vocab size | **156,928** |
| Embedding init strategy | Mean-of-subpieces (sketched in the appendix below) |

## Why Vocabulary Injection?

Telecom terms such as `PDSCH`, `gNB`, `mmWave`, `eMBB`, and `HARQ` are split into multiple BPE subpieces by the base tokenizer, which wastes context window and makes it harder for the model to learn domain concepts. After injection, each term is a **single atomic token**.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")
model = AutoModelForCausalLM.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

# Verify injection: each telecom term should map to a single token
tokens = tokenizer.encode("PDSCH HARQ gNB mmWave", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(tokens))
```

## Intended Use

This model is a **pre-CPT checkpoint**: the embeddings for the new tokens are initialized via mean-of-subpieces but have not yet been trained on telecom data. Use it as the starting point for Continual Pre-Training; a minimal CPT sketch appears in the appendix.

## License

Apache 2.0, the same license as the base Qwen3 model.
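
## Appendix: Mean-of-Subpieces Initialization (Sketch)

The injection script itself is not shipped with this card. The snippet below is a minimal, illustrative sketch of how mean-of-subpieces initialization works; the term list, variable names, and dtype are assumptions for demonstration, not the exact procedure used to build this checkpoint.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE = "Qwen/Qwen3-1.7B"
# Illustrative stand-ins for the full 5,226-term telecom list
new_terms = ["PDSCH", "HARQ", "gNB", "mmWave", "eMBB"]

# Keep an untouched copy of the base tokenizer so the old subpiece splits stay visible
base_tok = AutoTokenizer.from_pretrained(BASE)
new_tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# Register the new tokens and grow the embedding matrices to match
new_tok.add_tokens(new_terms)
model.resize_token_embeddings(len(new_tok))

in_emb = model.get_input_embeddings().weight    # token ids -> hidden states
out_emb = model.get_output_embeddings().weight  # hidden states -> logits

with torch.no_grad():
    for term in new_terms:
        new_id = new_tok.convert_tokens_to_ids(term)
        # How the *base* tokenizer used to split this term
        sub_ids = base_tok.encode(term, add_special_tokens=False)
        # The mean of the old subpiece rows becomes the new token's starting embedding
        in_emb[new_id] = in_emb[sub_ids].mean(dim=0)
        out_emb[new_id] = out_emb[sub_ids].mean(dim=0)
```

If the checkpoint ties its input and output embeddings (as some smaller Qwen models do), the second assignment is redundant but harmless, since both names refer to the same tensor.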
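
## Appendix: Continual Pre-Training Sketch

A minimal sketch of using this checkpoint as a CPT starting point with the Hugging Face `Trainer`. The corpus file, sequence length, and hyperparameters are placeholders; substitute your own telecom corpus and training configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

REPO = "vimalgupta/qwen-telecom-injected_v3"
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(REPO)

# Placeholder corpus: one JSON object per line with a "text" field of raw telecom text
ds = load_dataset("json", data_files="telecom_corpus.jsonl", split="train")
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen3-telecom-cpt",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```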