hexa-forge — extended tokenizer (Qwen2.5-Coder base + hexa-lang ext v1)

Repository: dancinlab/hexa-forge-tokenizer-qwen-hexa-v1
Phase: v0.1.3 G-BASE
Forge commit: 6197ea4315
Generated: 2026-05-11

What this is

This is the base Qwen/Qwen2.5-Coder-7B tokenizer with a hexa-lang token extension applied as specified in tool/tokenizer_extension.toml. It was produced by tool/extend_tokenizer.py, verified to round-trip at 100%, and tuned to a compression ratio of ≤ 0.5× on the hexa source sample (per runbook §4.4 / D-008 close criterion).
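The round-trip and compression-ratio checks mentioned above could be reproduced along the following lines. This is a minimal sketch, not the logic in tool/extend_tokenizer.py: the function names are hypothetical, and it assumes that "compression ratio" means the extended tokenizer's token count divided by the base tokenizer's token count on the same sample, which the runbook may define differently.

```python
from typing import Callable, List, Sequence


def round_trip_rate(
    encode: Callable[[str], List],
    decode: Callable[[List], str],
    samples: Sequence[str],
) -> float:
    """Fraction of samples for which decode(encode(text)) == text."""
    if not samples:
        raise ValueError("samples must not be empty")
    ok = sum(1 for text in samples if decode(encode(text)) == text)
    return ok / len(samples)


def token_ratio(
    encode_extended: Callable[[str], List],
    encode_base: Callable[[str], List],
    samples: Sequence[str],
) -> float:
    """Extended token count divided by base token count.

    Assumption: this is the quantity the "<= 0.5x" criterion refers to.
    """
    extended_total = sum(len(encode_extended(text)) for text in samples)
    base_total = sum(len(encode_base(text)) for text in samples)
    if base_total == 0:
        raise ValueError("base tokenizer produced no tokens")
    return extended_total / base_total
```

With Hugging Face tokenizers, the callables could be supplied as `lambda s: tok.encode(s, add_special_tokens=False)` and `tok.decode`. Exact round-trip equality may also depend on decode options such as `clean_up_tokenization_spaces`.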

Files

5 tokenizer artefacts (tokenizer.json, tokenizer_config.json, vocab + merges if applicable)
total size: 10.92 MB
extension manifest hash recorded in the extension_manifest_hash field of tokenizer_config.json
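To inspect the recorded manifest hash, the field can be read directly from the config file. The sketch below only reads the value. The helper name is hypothetical, and the hash algorithm is not stated here, so no recomputation against tool/tokenizer_extension.toml is attempted.

```python
import json
from pathlib import Path
from typing import Optional


def read_manifest_hash(config_path: str) -> Optional[str]:
    """Return the extension_manifest_hash field, or None if it is absent."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return config.get("extension_manifest_hash")
```

For example, `read_manifest_hash("tokenizer_config.json")` run inside a local copy of the repository would return the stored value.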

License

This artefact is licensed under Apache-2.0. The base Qwen2.5-Coder tokenizer carries Qwen's own license terms; this extension does not redistribute the base weights, only the tokenizer files derived from the base tokenizer and the extension TOML.

Provenance

No local license audit was found alongside the source path. Before publishing, run tool/license_clean_scan.py --path /home/summer/runs/tokenizer-qwen-hexa-v1 --report /home/summer/runs/tokenizer-qwen-hexa-v1/license_audit.json to obtain a permissive-only attestation.

Cross-reference

papers/plan-runbook-v0.1.3.md §4.4 — generation step
tool/tokenizer_extension.toml — the extension manifest (hexa-lang tokens)
tool/extend_tokenizer.py — the tool that produced this artefact

Compatibility

Loadable via transformers.AutoTokenizer.from_pretrained("dancinlab/hexa-forge-tokenizer-qwen-hexa-v1") once the upload completes.
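A minimal loading sketch, assuming the transformers package is installed and the repository upload has completed. The helper names `load_extended_tokenizer` and `count_tokens` are illustrative and not part of this repository.

```python
REPO_ID = "dancinlab/hexa-forge-tokenizer-qwen-hexa-v1"


def load_extended_tokenizer(repo_id: str = REPO_ID):
    """Load the extended tokenizer from the Hugging Face Hub."""
    # Imported lazily so the helpers below stay usable without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def count_tokens(tokenizer, text: str) -> int:
    """Number of tokens for `text`, excluding special tokens."""
    return len(tokenizer.encode(text, add_special_tokens=False))
```

Calling `count_tokens(load_extended_tokenizer(), source_text)` on a hexa-lang snippet, and comparing it with the count from the base Qwen/Qwen2.5-Coder-7B tokenizer, gives a quick view of the extension's effect.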
