arubeh/as2s-lexicon-ja
Pre-built Japanese lexicon binary consumed by as2s (Arubeh's Speech-to-Speech) to drive
shallow-fusion logit biasing on the Qwen3-ASR greedy decode path. The binary is an FST
trie keyed by Qwen3-ASR tokenizer token-id sequences plus a per-entry Tier tag, mmap'd
read-only at daemon boot and consulted on every decode step via a BiasCursor.
Initial release (v0.0.1) is Sudachi-only. Wikidata Q-item integration is tracked
in a follow-up issue; see the Future work section below.
On-disk format
[ magic: 9 bytes = b"AS2SLEXv2" ]
[ version: u32 LE ]
[ FST length: u64 LE ][ FST bytes ]
[ token table length: u64 LE ][ zstd-compressed token table ]
[ manifest length: u64 LE ][ manifest JSON ]
The runtime parser is in crates/as2s-lexicon/src/load.rs. AS2SLEXv1 files surface
LexiconError::LegacyFormatV1 and refuse to load; re-download with
as2s download lexicon-ja --force to pull the current v2 artefact.
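To make the layout concrete, here is a minimal stdlib-only sketch of how a reader could split such a file into its sections. It is not the code in crates/as2s-lexicon/src/load.rs: the names `parse`, `Sections`, and `ParseError` are hypothetical, the legacy magic is assumed to be the 9 bytes b"AS2SLEXv1", and the token table is returned still zstd-compressed because decompression would need an external crate.

```rust
/// Magic for the current format, as documented above.
const MAGIC_V2: &[u8] = b"AS2SLEXv2";
/// Assumed magic for the legacy format (same length, "v1" suffix).
const MAGIC_V1: &[u8] = b"AS2SLEXv1";

#[derive(Debug, PartialEq)]
pub enum ParseError {
    Truncated,      // input ends before a declared field or section does
    BadMagic,       // not a lexicon file at all
    LegacyFormatV1, // old layout: the caller should re-download
}

/// Borrowed views into the three length-prefixed sections.
#[derive(Debug)]
pub struct Sections<'a> {
    pub version: u32,
    pub fst: &'a [u8],
    pub token_table_zstd: &'a [u8], // still compressed in this sketch
    pub manifest_json: &'a [u8],
}

/// Split `n` bytes off the front of `rest`, or fail if it is too short.
fn take<'a>(rest: &mut &'a [u8], n: usize) -> Result<&'a [u8], ParseError> {
    let whole: &'a [u8] = *rest;
    if whole.len() < n {
        return Err(ParseError::Truncated);
    }
    let (head, tail) = whole.split_at(n);
    *rest = tail;
    Ok(head)
}

/// Read a u64 LE length followed by that many bytes.
fn take_section<'a>(rest: &mut &'a [u8]) -> Result<&'a [u8], ParseError> {
    let mut len = [0u8; 8];
    len.copy_from_slice(take(rest, 8)?);
    let n = u64::from_le_bytes(len);
    // Compare in u64 first so the cast below cannot truncate.
    if n > rest.len() as u64 {
        return Err(ParseError::Truncated);
    }
    take(rest, n as usize)
}

/// Walk the header in the documented order: magic, version, FST,
/// token table, manifest.
pub fn parse(bytes: &[u8]) -> Result<Sections<'_>, ParseError> {
    let mut rest = bytes;
    let magic = take(&mut rest, MAGIC_V2.len())?;
    if magic == MAGIC_V1 {
        return Err(ParseError::LegacyFormatV1);
    }
    if magic != MAGIC_V2 {
        return Err(ParseError::BadMagic);
    }
    let mut v = [0u8; 4];
    v.copy_from_slice(take(&mut rest, 4)?);
    Ok(Sections {
        version: u32::from_le_bytes(v),
        fst: take_section(&mut rest)?,
        token_table_zstd: take_section(&mut rest)?,
        manifest_json: take_section(&mut rest)?,
    })
}
```

Because every section is returned as a borrowed slice, the same walk works unchanged over a read-only memory map; only the token table needs a separate decompression step.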
Manifest
| Field | Value |
|---|---|
| builder_version | 0.0.1 |
| sudachi_dict | core_lex.csv (SudachiDict 20260428) |
| wikidata_dump | null (initial release is Sudachi-only) |
| entry_count | 783,648 |
| entries_by_tier.proper | 510,945 |
| entries_by_tier.generic | 272,703 |
| Binary size | 12 MiB |
| Magic | AS2SLEXv2 |
| SHA-256 | 6002fa57f9be208681f50c81d7d2aec195f233e3bee387bab7d576548decebcf |
Proper Tier entries receive a stronger biasing weight than Generic (per-tier α is
tuned by the daemon at runtime; see as2s-lexicon runtime docs).
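The following sketch illustrates the general idea of tier-weighted shallow-fusion biasing during greedy decoding. It is an illustration under stated assumptions, not the as2s implementation: a plain HashMap trie stands in for the FST, the `Trie` type and the `apply_bias` and `argmax` helpers are hypothetical, this `BiasCursor` only borrows the name used above, each node simply remembers the strongest tier of any entry passing through it, and a mismatch resets the cursor to the root instead of following failure links.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Tier {
    Proper,
    Generic,
}

#[derive(Default)]
struct Node {
    children: HashMap<u32, usize>, // token id -> index of child node
    tier: Option<Tier>,            // strongest tier of entries through this node
}

/// Toy stand-in for the FST: a trie keyed by token-id sequences.
pub struct Trie {
    nodes: Vec<Node>,
}

impl Trie {
    pub fn new() -> Self {
        Trie { nodes: vec![Node::default()] } // index 0 is the root
    }

    pub fn insert(&mut self, tokens: &[u32], tier: Tier) {
        let mut cur = 0;
        for &t in tokens {
            cur = match self.nodes[cur].children.get(&t).copied() {
                Some(next) => next,
                None => {
                    let next = self.nodes.len();
                    self.nodes.push(Node::default());
                    self.nodes[cur].children.insert(t, next);
                    next
                }
            };
            // Proper dominates Generic when entries share a prefix.
            let merged = match (self.nodes[cur].tier, tier) {
                (Some(Tier::Proper), _) | (_, Tier::Proper) => Tier::Proper,
                _ => Tier::Generic,
            };
            self.nodes[cur].tier = Some(merged);
        }
    }
}

/// Tracks how far the decoded prefix has matched into the trie.
pub struct BiasCursor<'a> {
    trie: &'a Trie,
    node: usize,
}

impl<'a> BiasCursor<'a> {
    pub fn new(trie: &'a Trie) -> Self {
        BiasCursor { trie, node: 0 }
    }

    /// Follow the emitted token; on a mismatch, restart from the root.
    pub fn advance(&mut self, token: u32) {
        let here = &self.trie.nodes[self.node];
        let root = &self.trie.nodes[0];
        self.node = match here.children.get(&token) {
            Some(&next) => next,
            None => root.children.get(&token).copied().unwrap_or(0),
        };
    }

    /// Add a per-tier bonus to the logit of every token that extends the match.
    pub fn apply_bias(&self, logits: &mut [f32], alpha_proper: f32, alpha_generic: f32) {
        for (&token, &child) in &self.trie.nodes[self.node].children {
            let alpha = match self.trie.nodes[child].tier {
                Some(Tier::Proper) => alpha_proper,
                Some(Tier::Generic) => alpha_generic,
                None => 0.0,
            };
            if let Some(logit) = logits.get_mut(token as usize) {
                *logit += alpha;
            }
        }
    }
}

/// Greedy choice: index of the largest logit (first one wins on ties).
pub fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}
```

In a decode loop built on this sketch, each step would call `apply_bias` on the raw logits, pick a token with `argmax`, and pass that token to `advance` so the next step is biased toward continuations of the same entry.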
Usage
The recommended path is the bundled CLI fetcher, which lands the binary under
<assets_dir>/lexicon-ja/ and the upstream attribution NOTICE under
<licenses>/lexicon-ja/:
cargo run -p as2s-cli -- download lexicon-ja
cargo run -p as2s-cli -- serve --asr-engine qwen3-asr-rs --lexicon-dir <assets>/lexicon-ja
Manual download is also supported: clone this repo (with Git LFS) and point the daemon
at the resulting directory via --lexicon-dir.
Reproduction
The binary is regenerated by the offline as2s-lexicon-build CLI from the as2s
workspace:
cargo run --release -p as2s-lexicon-build -- \
--sudachi-dict <path-to-SudachiDict-core>/core_lex.csv \
--tokenizer <as2s-assets>/qwen3-asr-rs-ja/tokenizer.json \
--output lexicon.aslex
Required inputs:
- SudachiDict core: fetch the 20260428 core lex CSV from the SudachiDict CDN (d2ej7fkh96fzlu.cloudfront.net/sudachidict-raw/20260428/core_lex.zip), a 21 MB zip expanding to a ~158 MB CSV. SudachiDict is published by Works Applications Co., Ltd. under the Apache License, Version 2.0.
- Qwen3-ASR tokenizer: tokenizer.json from Qwen/Qwen3-ASR-1.7B or the JA fine-tune; both share the same vocabulary. The builder calls tokenizer.encode() on every surface so the trie is keyed by the post-tokenizer token-id sequence, side-stepping the NFC vs NFKC normalisation pitfall (Issue #319 R7).
The builder lives at crates/as2s-lexicon-build/. The daemon's transitive dependency
closure (as2s-cli) does not include sudachi.rs or any Wikidata stream parser;
the heavy parsing dependencies are confined to the builder crate (architectural review
H2 from Issue #319).
Future work
- Wikidata Q-item integration: adds ~360k named-entity entries (people, places,
  products) on top of the SudachiDict base. Tracked as a follow-up issue
  (#324b: lexicon-ja Wikidata extension or similar; see the as2s repository).
  Deferred from v0.0.1 because the upstream dump (wikidata-<date>-all.json.gz) is
  ~153 GB and the immediate priority is closing the daemon-startup gap.
Attribution
This artefact embeds entries derived from SudachiDict:
SudachiDict, © Works Applications Co., Ltd., Apache License, Version 2.0
https://github.com/WorksApplications/SudachiDict
The full upstream LICENSE / NOTICE files are reproduced under <licenses>/lexicon-ja/
when the binary is installed via as2s download lexicon-ja. The biasing trie shipped
here is a derived work referencing SudachiDict entries by token-id sequence and Tier
tag and retains the original Apache-2.0 licence terms.
"Sudachi" is a trade mark of Works Applications Co., Ltd.; this NOTICE attributes the upstream work as required by Apache-2.0 §4 and is not an endorsement.
License
Apache License, Version 2.0. See LICENSE-2.0.txt for the full text (or
https://www.apache.org/licenses/LICENSE-2.0).