hexa-forge: extended tokenizer (Qwen2.5-Coder base + hexa-lang ext v1)
Repository: dancinlab/hexa-forge-tokenizer-qwen-hexa-v1
Phase: v0.1.3 G-BASE
Forge commit: 6197ea4315
Generated: 2026-05-11
What this is
The base Qwen/Qwen2.5-Coder-7B tokenizer with a hexa-lang token
extension applied per tool/tokenizer_extension.toml. Produced by
tool/extend_tokenizer.py; round-trip verified to 100% and tuned to a
≤ 0.5× compression ratio on the hexa source sample (per runbook §4.4 /
D-008 close criterion).
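The two acceptance checks named above (round-trip fidelity and compression ratio) can be sketched as follows. This is a minimal illustration, not the logic of tool/extend_tokenizer.py. The function names, the toy tokenizers, and the sample strings are all hypothetical. The compression ratio is assumed here to mean the extended tokenizer's token count divided by the base tokenizer's token count on the same sample; the runbook's exact definition may differ.

```python
from typing import Callable, List, Sequence

Encode = Callable[[str], List[str]]
Decode = Callable[[List[str]], str]


def round_trip_rate(samples: Sequence[str], encode: Encode, decode: Decode) -> float:
    """Fraction of samples for which decode(encode(s)) reproduces s exactly."""
    if not samples:
        raise ValueError("need at least one sample")
    exact = sum(1 for s in samples if decode(encode(s)) == s)
    return exact / len(samples)


def compression_ratio(samples: Sequence[str], encode_base: Encode, encode_ext: Encode) -> float:
    """ASSUMED definition: extended token count / base token count (lower is better)."""
    base_tokens = sum(len(encode_base(s)) for s in samples)
    ext_tokens = sum(len(encode_ext(s)) for s in samples)
    if base_tokens == 0:
        raise ValueError("base tokenizer produced no tokens")
    return ext_tokens / base_tokens


# Toy stand-ins for the real tokenizers (hypothetical, for illustration only).
EXTRA_TOKENS = ("let ", "fn ", "return ")


def encode_base(text: str) -> List[str]:
    # Character-level "base" tokenizer.
    return list(text)


def encode_ext(text: str) -> List[str]:
    # Same as the base, but emits one token for each known extension string.
    tokens: List[str] = []
    i = 0
    while i < len(text):
        match = next((t for t in EXTRA_TOKENS if text.startswith(t, i)), None)
        piece = match if match is not None else text[i]
        tokens.append(piece)
        i += len(piece)
    return tokens


def decode(tokens: List[str]) -> str:
    return "".join(tokens)


SAMPLES = ["let x = 1", "fn add(a, b)", "return a + b"]

if __name__ == "__main__":
    print("round-trip rate:", round_trip_rate(SAMPLES, encode_ext, decode))
    print("compression ratio:", compression_ratio(SAMPLES, encode_base, encode_ext))
```

With the real artefact, the `encode` and `decode` callables would wrap the base and extended tokenizers, and the samples would come from the hexa source sample referenced in the runbook.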
Files
- 5 tokenizer artefacts (tokenizer.json, tokenizer_config.json, vocab + merges if applicable)
- total size: 10.92 MB
- extension manifest hash recorded in the extension_manifest_hash field of tokenizer_config.json
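A consumer could use the recorded hash to confirm that a local copy of the extension manifest matches the one used to build these files. The sketch below is an assumption-laden illustration: this document does not state the hash algorithm, so hex-encoded SHA-256 over the raw manifest bytes is assumed, and the helper name `manifest_hash_matches` is hypothetical.

```python
import hashlib
import json
from pathlib import Path


def manifest_hash_matches(config_path: Path, manifest_path: Path) -> bool:
    """Compare the hash recorded in tokenizer_config.json with a fresh one.

    ASSUMPTION: the recorded value is the hex SHA-256 digest of the manifest
    file's raw bytes. Adjust if the producing tool uses another scheme.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    recorded = config.get("extension_manifest_hash")
    if not isinstance(recorded, str):
        # Field missing or not a string: nothing to compare against.
        return False
    actual = hashlib.sha256(Path(manifest_path).read_bytes()).hexdigest()
    return recorded.lower() == actual


if __name__ == "__main__":
    ok = manifest_hash_matches(
        Path("tokenizer_config.json"),
        Path("tool/tokenizer_extension.toml"),
    )
    print("manifest hash matches:", ok)
```

If the comparison fails on an unmodified manifest, the likeliest cause is that the hashing scheme differs from the one assumed here.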
License
This artefact is Apache-2.0. The base Qwen2.5-Coder tokenizer carries
Qwen's own license terms; this extension does not redistribute the
base weights, only the tokenizer files derived from the base
tokenizer + the extension TOML.
Provenance
No local license audit was found alongside the source path.
Before publishing, run the following for a permissive-only attestation:
  tool/license_clean_scan.py --path /home/summer/runs/tokenizer-qwen-hexa-v1 --report /home/summer/runs/tokenizer-qwen-hexa-v1/license_audit.json
Cross-reference
- papers/plan-runbook-v0.1.3.md §4.4: generation step
- tool/tokenizer_extension.toml: the extension manifest (hexa-lang tokens)
- tool/extend_tokenizer.py: the tool that produced this artefact
Compatibility
Loadable via transformers.AutoTokenizer.from_pretrained("dancinlab/hexa-forge-tokenizer-qwen-hexa-v1")
once the upload completes.
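A minimal loading sketch, assuming the `transformers` package is installed and the repository has been uploaded. The helper names `load_tokenizer` and `encode_decode` are hypothetical, and the sample string is a placeholder rather than verified hexa-lang source.

```python
from typing import Any, List, Tuple

REPO_ID = "dancinlab/hexa-forge-tokenizer-qwen-hexa-v1"


def load_tokenizer(repo_id: str = REPO_ID) -> Any:
    # Imported lazily so encode_decode() can be used without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def encode_decode(tokenizer: Any, text: str) -> Tuple[List[int], str]:
    """Encode text without special tokens, then decode the ids back to text."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return ids, tokenizer.decode(ids)


if __name__ == "__main__":
    tok = load_tokenizer()
    # Placeholder input; substitute real hexa-lang source here.
    ids, text = encode_decode(tok, "let x = 1")
    print(len(ids), repr(text))
```

Comparing `len(ids)` against the same call on the unmodified Qwen/Qwen2.5-Coder-7B tokenizer gives a quick view of how much the extension shortens a given snippet.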