metadata
license: apache-2.0
language:
  - ja
library_name: as2s
tags:
  - lexicon
  - japanese
  - asr
  - speech-recognition
  - biasing
  - shallow-fusion

arubeh/as2s-lexicon-ja

Pre-built Japanese lexicon binary consumed by as2s (Arubeh's Speech-to-Speech) to drive shallow-fusion logit biasing on the Qwen3-ASR greedy decode path. The binary is an FST trie keyed by Qwen3-ASR tokenizer token-id sequences, with a Tier tag on each entry. The daemon mmaps it read-only at boot and consults it on every decode step through a BiasCursor.

Initial release (v0.0.1) is Sudachi-only. Wikidata Q-item integration is tracked in a follow-up issue; see the Future work section below.

On-disk format

[ magic: 9 bytes = b"AS2SLEXv2" ]
[ version: u32 LE ]
[ FST length: u64 LE ][ FST bytes ]
[ token table length: u64 LE ][ zstd-compressed token table ]
[ manifest length: u64 LE ][ manifest JSON ]

The runtime parser is in crates/as2s-lexicon/src/load.rs. Loading an AS2SLEXv1 file fails with LexiconError::LegacyFormatV1; re-download with as2s download lexicon-ja --force to pull the current v2 artefact.
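
As a rough illustration of the layout above, the following std-only sketch splits a buffer into its sections. It is not the code in load.rs: the names SketchError, Sections, take, take_section and parse_lexicon are hypothetical, and only the magic strings, the field order, the integer widths and the LegacyFormatV1 name are taken from this README. The token table is returned still zstd-compressed, and the FST bytes are not interpreted.

```rust
// Sketch only: mirrors the layout described in this README, not the real load.rs.
const MAGIC_V1: &[u8] = b"AS2SLEXv1";
const MAGIC_V2: &[u8] = b"AS2SLEXv2";

/// Hypothetical error type; only the LegacyFormatV1 name comes from the README.
#[derive(Debug, PartialEq)]
enum SketchError {
    Truncated,
    BadMagic,
    LegacyFormatV1,
}

/// Borrowed views of the sections that follow the magic.
#[derive(Debug)]
struct Sections<'a> {
    version: u32,
    fst: &'a [u8],
    token_table_zstd: &'a [u8], // left compressed in this sketch
    manifest_json: &'a [u8],
}

/// Split `n` bytes off the front of `buf` and advance `buf` past them.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Result<&'a [u8], SketchError> {
    let data: &'a [u8] = *buf;
    if data.len() < n {
        return Err(SketchError::Truncated);
    }
    let (head, tail) = data.split_at(n);
    *buf = tail;
    Ok(head)
}

/// Read one `[length: u64 LE][bytes]` section.
fn take_section<'a>(buf: &mut &'a [u8]) -> Result<&'a [u8], SketchError> {
    let mut len = [0u8; 8];
    len.copy_from_slice(take(buf, 8)?);
    let len = u64::from_le_bytes(len);
    // Compare as u64 first so the cast below cannot truncate.
    if len > buf.len() as u64 {
        return Err(SketchError::Truncated);
    }
    take(buf, len as usize)
}

/// Parse magic, version, then the FST, token table and manifest sections.
fn parse_lexicon(mut buf: &[u8]) -> Result<Sections<'_>, SketchError> {
    let magic = take(&mut buf, MAGIC_V2.len())?;
    if magic == MAGIC_V1 {
        return Err(SketchError::LegacyFormatV1);
    }
    if magic != MAGIC_V2 {
        return Err(SketchError::BadMagic);
    }
    let mut version = [0u8; 4];
    version.copy_from_slice(take(&mut buf, 4)?);
    Ok(Sections {
        version: u32::from_le_bytes(version),
        fst: take_section(&mut buf)?,
        token_table_zstd: take_section(&mut buf)?,
        manifest_json: take_section(&mut buf)?,
    })
}
```

Every length is checked against the remaining bytes before slicing, so a truncated download yields an error instead of a panic.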

Manifest

| Field | Value |
| --- | --- |
| builder_version | 0.0.1 |
| sudachi_dict | core_lex.csv (SudachiDict 20260428) |
| wikidata_dump | null (initial release is Sudachi-only) |
| entry_count | 783,648 |
| entries_by_tier.proper | 510,945 |
| entries_by_tier.generic | 272,703 |
| Binary size | 12 MiB |
| Magic | AS2SLEXv2 |
| SHA-256 | 6002fa57f9be208681f50c81d7d2aec195f233e3bee387bab7d576548decebcf |
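
The manifest fields can be pictured as a small Rust struct with a sanity check that the per-tier counts add up to entry_count. This is a sketch under assumptions: the field names follow the table, but the Rust types, the struct names and the tiers_sum_to_total helper are hypothetical and are not taken from the as2s crates.

```rust
/// Per-tier entry counts, named after entries_by_tier.* in the table.
#[derive(Debug)]
struct EntriesByTier {
    proper: u64,
    generic: u64,
}

/// Hypothetical in-memory view of the manifest; the types are assumptions.
#[derive(Debug)]
struct Manifest {
    builder_version: &'static str,
    sudachi_dict: &'static str,
    wikidata_dump: Option<&'static str>, // null in the Sudachi-only release
    entry_count: u64,
    entries_by_tier: EntriesByTier,
}

impl Manifest {
    /// True when the tier counts account for every entry.
    fn tiers_sum_to_total(&self) -> bool {
        self.entries_by_tier.proper + self.entries_by_tier.generic == self.entry_count
    }
}

/// The values published for v0.0.1 in the table above.
fn v0_0_1_manifest() -> Manifest {
    Manifest {
        builder_version: "0.0.1",
        sudachi_dict: "core_lex.csv (SudachiDict 20260428)",
        wikidata_dump: None,
        entry_count: 783_648,
        entries_by_tier: EntriesByTier {
            proper: 510_945,
            generic: 272_703,
        },
    }
}
```

For the published numbers the check holds: 510,945 + 272,703 = 783,648.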

Proper Tier entries receive a stronger biasing weight than Generic (per-tier α is tuned by the daemon at runtime; see as2s-lexicon runtime docs).
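
To make the idea of tier-weighted biasing concrete, here is a toy sketch of a cursor walking a token-id trie and adding a bonus to the logits of tokens that would continue a lexicon entry. It rests on several assumptions that this README does not confirm: the bias is modelled as an additive per-tier bonus (alpha_proper, alpha_generic), a prefix shared by both tiers takes the Proper weight, and a token that matches nothing restarts the walk from the root. SketchTrie and SketchCursor are hypothetical names; the real BiasCursor and the FST-backed trie may behave differently.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Tier {
    Proper,
    Generic,
}

struct Node {
    children: HashMap<u32, usize>, // token id -> index of child node
    strongest: Tier,               // strongest tier among entries through this node
}

/// Toy HashMap-backed trie standing in for the FST.
struct SketchTrie {
    nodes: Vec<Node>,
}

impl SketchTrie {
    fn new() -> Self {
        let root = Node { children: HashMap::new(), strongest: Tier::Generic };
        SketchTrie { nodes: vec![root] }
    }

    /// Insert one entry given as a token-id sequence and its tier.
    fn insert(&mut self, tokens: &[u32], tier: Tier) {
        let mut cur = 0;
        for &tok in tokens {
            let existing = self.nodes[cur].children.get(&tok).copied();
            cur = match existing {
                Some(next) => next,
                None => {
                    let next = self.nodes.len();
                    self.nodes.push(Node { children: HashMap::new(), strongest: tier });
                    self.nodes[cur].children.insert(tok, next);
                    next
                }
            };
            // A Proper entry upgrades every node on its path.
            if tier == Tier::Proper {
                self.nodes[cur].strongest = Tier::Proper;
            }
        }
    }
}

/// Tracks the position in the trie across decode steps.
struct SketchCursor<'a> {
    trie: &'a SketchTrie,
    node: usize,
}

impl<'a> SketchCursor<'a> {
    fn new(trie: &'a SketchTrie) -> Self {
        SketchCursor { trie, node: 0 }
    }

    /// Add the per-tier bonus to every token that continues an entry from here.
    fn apply_bias(&self, logits: &mut [f32], alpha_proper: f32, alpha_generic: f32) {
        for (&tok, &child) in &self.trie.nodes[self.node].children {
            if let Some(logit) = logits.get_mut(tok as usize) {
                *logit += match self.trie.nodes[child].strongest {
                    Tier::Proper => alpha_proper,
                    Tier::Generic => alpha_generic,
                };
            }
        }
    }

    /// Follow the emitted token; on a miss, retry from the root, else reset.
    fn advance(&mut self, tok: u32) {
        let nodes = &self.trie.nodes;
        self.node = nodes[self.node]
            .children
            .get(&tok)
            .or_else(|| nodes[0].children.get(&tok))
            .copied()
            .unwrap_or(0);
    }
}
```

In a greedy decode loop under these assumptions, apply_bias would run on the logits before the argmax and advance would run on the chosen token afterwards.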

Usage

The recommended path is the bundled CLI fetcher, which lands the binary under <assets_dir>/lexicon-ja/ and the upstream attribution NOTICE under <licenses>/lexicon-ja/:

cargo run -p as2s-cli -- download lexicon-ja
cargo run -p as2s-cli -- serve --asr-engine qwen3-asr-rs --lexicon-dir <assets_dir>/lexicon-ja

Manual download is also supported — clone this repo (with Git LFS) and point the daemon at the resulting directory via --lexicon-dir.

Reproduction

The binary is regenerated by the offline as2s-lexicon-build CLI from the as2s workspace:

cargo run --release -p as2s-lexicon-build -- \
    --sudachi-dict <path-to-SudachiDict-core>/core_lex.csv \
    --tokenizer <as2s-assets>/qwen3-asr-rs-ja/tokenizer.json \
    --output lexicon.aslex

Required inputs:

  • SudachiDict core: fetch the 20260428 core lex CSV from the SudachiDict CDN (d2ej7fkh96fzlu.cloudfront.net/sudachidict-raw/20260428/core_lex.zip), a 21 MB zip that expands to a ~158 MB CSV. SudachiDict is published by Works Applications Co., Ltd. under the Apache License, Version 2.0.
  • Qwen3-ASR tokenizer: tokenizer.json from Qwen/Qwen3-ASR-1.7B or the JA fine-tune; both share the same vocabulary. The builder calls tokenizer.encode() on every surface form, so the trie is keyed by the post-tokenizer token-id sequence, which side-steps the NFC vs NFKC normalisation pitfall (Issue #319 R7).

The builder lives at crates/as2s-lexicon-build/. The daemon's transitive dependency closure (as2s-cli) does not include sudachi.rs or any Wikidata stream parser — the heavy parsing dependencies are confined to the builder crate (architectural review H2 from Issue #319).

Future work

  • Wikidata Q-item integration: adds ~360k named-entity entries (people, places, products) on top of the SudachiDict base. Tracked as a follow-up issue (#324b: lexicon-ja Wikidata extension, or similar; see the as2s repository). Deferred from v0.0.1 because the upstream dump (wikidata-<date>-all.json.gz) is ~153 GB and the immediate priority is closing the daemon-startup gap.

Attribution

This artefact embeds entries derived from SudachiDict:

SudachiDict — © Works Applications Co., Ltd. — Apache License, Version 2.0 https://github.com/WorksApplications/SudachiDict

The full upstream LICENSE / NOTICE files are reproduced under <licenses>/lexicon-ja/ when the binary is installed via as2s download lexicon-ja. The biasing trie shipped here is a derived work referencing SudachiDict entries by token-id sequence and Tier tag and retains the original Apache-2.0 licence terms.

"Sudachi" is a trade mark of Works Applications Co., Ltd.; this NOTICE attributes the upstream work as required by Apache-2.0 §4 and is not an endorsement.

License

Apache License, Version 2.0. See LICENSE-2.0.txt for the full text (or https://www.apache.org/licenses/LICENSE-2.0).