---
language:
- en
tags:
- qwen3
- telecom
- vocabulary-expansion
- continual-pretraining
- 5G
- 3GPP
base_model: Qwen/Qwen3-1.7B
license: apache-2.0
---
# Qwen3-1.7B — Telecom Vocabulary Injection
This model is [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) with **5,226 telecom-domain tokens** injected into its tokenizer and embedding matrices, prepared for Continual Pre-Training (CPT) on telecom datasets.
## Vocabulary Injection Details
| Parameter | Value |
|---|---|
| Base model | `Qwen/Qwen3-1.7B` |
| Base vocab size | 151,669 |
| Tokens injected | **5,226** |
| Final vocab size | **156,928** |
| Embedding init strategy | Mean-of-subpieces |
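The final size of 156,928 is slightly larger than 151,669 + 5,226 = 156,895, which is consistent with the embedding matrix being padded to a multiple of 128. Below is a minimal sketch of the mean-of-subpieces strategy under that padding assumption, using a hypothetical five-term sample in place of the full 5,226-term list; each new embedding row is set to the mean of the embeddings of the subpieces the *base* tokenizer produces for the term:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")  # kept around for subpiece lookups
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")

telecom_terms = ["PDSCH", "gNB", "mmWave", "eMBB", "HARQ"]  # hypothetical sample of the full term list
tokenizer.add_tokens(telecom_terms)

# Grow the embedding matrix. pad_to_multiple_of=128 is the assumption that
# would reconcile 151,669 + 5,226 = 156,895 with the reported 156,928.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=128)

embeddings = model.get_input_embeddings().weight
output = model.get_output_embeddings()
with torch.no_grad():
    for term in telecom_terms:
        sub_ids = base_tokenizer.encode(term, add_special_tokens=False)
        new_id = tokenizer.convert_tokens_to_ids(term)
        embeddings[new_id] = embeddings[sub_ids].mean(dim=0)
        # If the LM head is untied, initialize its new row the same way.
        if output is not None and output.weight is not embeddings:
            output.weight[new_id] = output.weight[sub_ids].mean(dim=0)
```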
## Why Vocabulary Injection?
Telecom terms such as `PDSCH`, `gNB`, `mmWave`, `eMBB`, and `HARQ` are split into multiple BPE subpieces by the base tokenizer, which wastes context window and makes it harder for the model to learn domain concepts. After injection, each term is a **single atomic token**.
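A quick way to see the fragmentation is to compare the base tokenizer against this repo's injected one (the exact subpiece counts depend on the base BPE merges):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
injected = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

for term in ["PDSCH", "gNB", "mmWave", "eMBB", "HARQ"]:
    n_base = len(base.encode(term, add_special_tokens=False))
    n_inj = len(injected.encode(term, add_special_tokens=False))
    print(f"{term}: {n_base} subpieces before, {n_inj} token(s) after")
```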
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vimalgupta/qwen-telecom-injected_v3")
model = AutoModelForCausalLM.from_pretrained("vimalgupta/qwen-telecom-injected_v3")

# Verify injection: each telecom term should come back as a single token
tokens = tokenizer.encode("PDSCH HARQ gNB mmWave", add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(tokens))
```
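Depending on how the added tokens handle surrounding whitespace, the spaces between terms may show up as separate tokens in the printed list; the check that matters is that each term itself comes back as one ID rather than several subpieces.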
## Intended Use
This model is a **pre-CPT checkpoint**: the embeddings for the new tokens are initialized via mean-of-subpieces but have not yet been trained on telecom data. Use it as the starting point for Continual Pre-Training, for example along the lines of the sketch below.
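A hedged sketch of a CPT run with the Hugging Face `Trainer`, assuming a plain-text telecom corpus at the hypothetical path `telecom_corpus.txt`; all hyperparameters are placeholders to tune for your data, not a recipe from this repo:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "vimalgupta/qwen-telecom-injected_v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# telecom_corpus.txt is a hypothetical placeholder for your CPT data.
dataset = load_dataset("text", data_files={"train": "telecom_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qwen3-telecom-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,  # placeholder; tune for your corpus
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```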
## License
Apache 2.0 — same as the base Qwen3 model.