hexa-forge: extended tokenizer (Qwen2.5-Coder base + hexa-lang ext v1)

Repository: dancinlab/hexa-forge-tokenizer-qwen-hexa-v1 · Phase: v0.1.3 G-BASE · Forge commit: 6197ea4315 · Generated: 2026-05-11

What this is

This is the base Qwen/Qwen2.5-Coder-7B tokenizer with a hexa-lang token extension applied per tool/tokenizer_extension.toml. It was produced by tool/extend_tokenizer.py, round-trip verified to 100%, and tuned to a compression ratio of ≤ 0.5× on the hexa source sample (per runbook §4.4 / D-008 close criterion).
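
The two close criteria above can be checked with a few lines of Python. This is a minimal sketch, not the logic of tool/extend_tokenizer.py: the helper names are hypothetical, and it assumes the compression ratio means the extended tokenizer's token count divided by the base tokenizer's token count on the same text, which the card does not spell out.

```python
from typing import Callable, Iterable, List

# Plain callables keep the sketch independent of any tokenizer library.
Encode = Callable[[str], List[int]]
Decode = Callable[[List[int]], str]


def round_trip_rate(samples: Iterable[str], encode: Encode, decode: Decode) -> float:
    """Fraction of samples for which decode(encode(text)) == text."""
    items = list(samples)
    if not items:
        return 1.0
    exact = sum(1 for text in items if decode(encode(text)) == text)
    return exact / len(items)


def token_ratio(samples: Iterable[str], base_encode: Encode, ext_encode: Encode) -> float:
    """Extended token count divided by base token count (ASSUMED definition)."""
    items = list(samples)
    base_total = sum(len(base_encode(text)) for text in items)
    ext_total = sum(len(ext_encode(text)) for text in items)
    if base_total == 0:
        raise ValueError("base tokenizer produced no tokens for the sample")
    return ext_total / base_total
```

With Hugging Face tokenizers, the callables could be `lambda s: tok.encode(s, add_special_tokens=False)` and `tok.decode`. Under the assumed definition, the criterion would read `round_trip_rate(...) == 1.0` and `token_ratio(...) <= 0.5`.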

Files

  • 5 tokenizer artefacts (tokenizer.json, tokenizer_config.json, vocab + merges if applicable)
  • total size: 10.92 MB
  • extension manifest hash recorded in the extension_manifest_hash field of tokenizer_config.json
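
To confirm that a downloaded copy matches a given extension manifest, the recorded hash can be read back and compared. The sketch below is illustrative only: the function names are hypothetical, and SHA-256 over the raw bytes of the TOML file is an assumption, since the card does not say how the hash is computed.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def read_manifest_hash(tokenizer_dir: str) -> Optional[str]:
    """Return the extension_manifest_hash field of tokenizer_config.json, if present."""
    config_path = Path(tokenizer_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("extension_manifest_hash")


def sha256_of_file(path: str) -> str:
    """Hex SHA-256 of the file's raw bytes (ASSUMED hashing scheme)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def manifest_matches(tokenizer_dir: str, manifest_path: str) -> bool:
    """True if the recorded hash equals the hash of the given manifest file."""
    recorded = read_manifest_hash(tokenizer_dir)
    return recorded is not None and recorded.lower() == sha256_of_file(manifest_path)
```

If the comparison fails for a manifest known to be correct, the likelier explanation is a different hashing scheme rather than a corrupted artefact; tool/extend_tokenizer.py is the authority on how the field is produced.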

License

This artefact is licensed under Apache-2.0. The base Qwen2.5-Coder tokenizer carries Qwen's own license terms; this extension does not redistribute the base model weights, only the tokenizer files derived from the base tokenizer and the extension TOML.

Provenance

No local license audit was found alongside the source path. Run tool/license_clean_scan.py --path /home/summer/runs/tokenizer-qwen-hexa-v1 --report /home/summer/runs/tokenizer-qwen-hexa-v1/license_audit.json before publishing for a permissive-only attestation.

Cross-reference

  • papers/plan-runbook-v0.1.3.md §4.4: generation step
  • tool/tokenizer_extension.toml: the extension manifest (hexa-lang tokens)
  • tool/extend_tokenizer.py: the tool that produced this artefact

Compatibility

Loadable via transformers.AutoTokenizer.from_pretrained("dancinlab/hexa-forge-tokenizer-qwen-hexa-v1") once the upload completes.
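
A minimal loading sketch follows, assuming the transformers package is installed and the upload has completed. The helper names `load_hexa_tokenizer` and `token_count` are hypothetical conveniences, not part of this repository.

```python
REPO_ID = "dancinlab/hexa-forge-tokenizer-qwen-hexa-v1"


def load_hexa_tokenizer(repo_id: str = REPO_ID):
    """Download (or load from cache) the extended tokenizer from the Hub."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


def token_count(tokenizer, text: str) -> int:
    """Number of tokens for text, excluding any special tokens."""
    return len(tokenizer.encode(text, add_special_tokens=False))
```

Typical use would be `tok = load_hexa_tokenizer()` followed by `token_count(tok, source_text)` on a hexa source file. Comparing the result against the same count from Qwen/Qwen2.5-Coder-7B gives a rough view of what the extension buys.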
