---
license: apache-2.0
language:
- ja
library_name: as2s
tags:
- lexicon
- japanese
- asr
- speech-recognition
- biasing
- shallow-fusion
---

# arubeh/as2s-lexicon-ja

Pre-built Japanese lexicon binary consumed by [as2s] (Arubeh's Speech-to-Speech) to drive shallow-fusion logit biasing on the Qwen3-ASR greedy decode path. The binary is an FST trie keyed by Qwen3-ASR tokenizer token-id sequences plus a per-entry Tier tag, mmap'd read-only at daemon boot and consulted on every decode step via a `BiasCursor`.

Initial release (`v0.0.1`) is **Sudachi-only**. Wikidata Q-item integration is tracked in a follow-up issue; see the [Future work](#future-work) section below.

[as2s]: https://github.com/arubeh/as2s

## On-disk format

```
[ magic: 9 bytes = b"AS2SLEXv2" ]
[ version: u32 LE ]
[ FST length: u64 LE ][ FST bytes ]
[ token table length: u64 LE ][ zstd-compressed token table ]
[ manifest length: u64 LE ][ manifest JSON ]
```

The runtime parser is in `crates/as2s-lexicon/src/load.rs`. `AS2SLEXv1` files surface `LexiconError::LegacyFormatV1` and refuse to load; re-download with `as2s download lexicon-ja --force` to pull the current `v2` artefact.

## Manifest

| Field | Value |
|---|---|
| `builder_version` | `0.0.1` |
| `sudachi_dict` | `core_lex.csv` (SudachiDict 20260428) |
| `wikidata_dump` | `null` (initial release is Sudachi-only) |
| `entry_count` | `783,648` |
| `entries_by_tier.proper` | `510,945` |
| `entries_by_tier.generic` | `272,703` |
| Binary size | 12 MiB |
| Magic | `AS2SLEXv2` |
| SHA-256 | `6002fa57f9be208681f50c81d7d2aec195f233e3bee387bab7d576548decebcf` |

`Proper` Tier entries receive a stronger biasing weight than `Generic` (per-tier α is tuned by the daemon at runtime; see `as2s-lexicon` runtime docs).

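The layout under "On-disk format" can be walked with nothing but the standard library. The following is a minimal sketch, not the code in `crates/as2s-lexicon/src/load.rs`: `HeaderError`, `Sections`, `parse_sections`, and `sample_file` are hypothetical names introduced for illustration, the version value `2` in the sample is an assumption, and the token table is returned still zstd-compressed.

```rust
/// Magic prefixes as documented in the format description.
const MAGIC_V2: &[u8] = b"AS2SLEXv2";
const MAGIC_V1: &[u8] = b"AS2SLEXv1";

/// Hypothetical error type for this sketch (not the crate's `LexiconError`).
#[derive(Debug, PartialEq)]
enum HeaderError {
    Truncated,
    LegacyV1,
    BadMagic,
}

/// Borrowed views of the sections; nothing is copied or decompressed.
#[derive(Debug)]
struct Sections<'a> {
    version: u32,
    fst: &'a [u8],
    token_table_zstd: &'a [u8],
    manifest_json: &'a [u8],
}

/// Split `n` bytes off the front of `buf`, advancing it past them.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Result<&'a [u8], HeaderError> {
    let whole: &'a [u8] = *buf;
    if whole.len() < n {
        return Err(HeaderError::Truncated);
    }
    let (head, tail) = whole.split_at(n);
    *buf = tail;
    Ok(head)
}

/// Read one `[ length: u64 LE ][ bytes ]` section.
fn read_section<'a>(buf: &mut &'a [u8]) -> Result<&'a [u8], HeaderError> {
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(take(buf, 8)?);
    let len = u64::from_le_bytes(len_bytes);
    // Bounds-check before narrowing to usize so the cast cannot truncate.
    if len > buf.len() as u64 {
        return Err(HeaderError::Truncated);
    }
    take(buf, len as usize)
}

/// Parse magic, version, and the three length-prefixed sections in order.
fn parse_sections(mut buf: &[u8]) -> Result<Sections<'_>, HeaderError> {
    let magic = take(&mut buf, MAGIC_V2.len())?;
    if magic == MAGIC_V1 {
        return Err(HeaderError::LegacyV1);
    }
    if magic != MAGIC_V2 {
        return Err(HeaderError::BadMagic);
    }
    let mut version_bytes = [0u8; 4];
    version_bytes.copy_from_slice(take(&mut buf, 4)?);
    let version = u32::from_le_bytes(version_bytes);
    let fst = read_section(&mut buf)?;
    let token_table_zstd = read_section(&mut buf)?;
    let manifest_json = read_section(&mut buf)?;
    Ok(Sections { version, fst, token_table_zstd, manifest_json })
}

/// Build a tiny synthetic file with placeholder payloads (illustration only).
fn sample_file() -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(MAGIC_V2);
    out.extend_from_slice(&2u32.to_le_bytes()); // assumed version value
    let sections: [&[u8]; 3] = [b"fst", b"tok", b"{}"];
    for section in sections.iter() {
        out.extend_from_slice(&(section.len() as u64).to_le_bytes());
        out.extend_from_slice(section);
    }
    out
}
```

Passing the bytes of a real artefact (for example, read with `std::fs::read`) to `parse_sections` yields the FST, compressed token table, and manifest JSON as slices of the input; the real loader mmaps the file instead of reading it into memory.
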
## Usage

The recommended path is the bundled CLI fetcher, which lands the binary under `/lexicon-ja/` and the upstream attribution NOTICE under `/lexicon-ja/`:

```
cargo run -p as2s-cli -- download lexicon-ja
cargo run -p as2s-cli -- serve --asr-engine qwen3-asr-rs --lexicon-dir /lexicon-ja
```

Manual download is also supported: clone this repo (with Git LFS) and point the daemon at the resulting directory via `--lexicon-dir`.

## Reproduction

The binary is regenerated by the offline `as2s-lexicon-build` CLI from the as2s workspace:

```
cargo run --release -p as2s-lexicon-build -- \
  --sudachi-dict /core_lex.csv \
  --tokenizer /qwen3-asr-rs-ja/tokenizer.json \
  --output lexicon.aslex
```

Required inputs:

- **SudachiDict core**: fetch the `20260428` core lex CSV from the SudachiDict CDN (`d2ej7fkh96fzlu.cloudfront.net/sudachidict-raw/20260428/core_lex.zip`), 21 MB zip expanding to ~158 MB CSV. SudachiDict is published by Works Applications Co., Ltd. under the Apache License, Version 2.0.
- **Qwen3-ASR tokenizer**: `tokenizer.json` from `Qwen/Qwen3-ASR-1.7B` or the JA fine-tune; both share the same vocabulary. The builder calls `tokenizer.encode()` on every surface so the trie is keyed by the post-tokenizer token-id sequence, side-stepping the NFC vs NFKC normalisation pitfall (Issue #319 R7).

The builder lives at `crates/as2s-lexicon-build/`. The daemon's transitive dependency closure (`as2s-cli`) **does not** include `sudachi.rs` or any Wikidata stream parser; the heavy parsing dependencies are confined to the builder crate (architectural review H2 from Issue #319).

## Future work

- **Wikidata Q-item integration**: adds ~360k named-entity entries (people, places, products) on top of the SudachiDict base. Tracked as a follow-up issue (`#324b: lexicon-ja Wikidata extension` or similar; see the as2s repository). Deferred from `v0.0.1` because the upstream dump (`wikidata--all.json.gz`) is ~153 GB and the immediate priority is closing the daemon-startup gap.

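To make the token-id keying from the Reproduction section more concrete, here is a rough sketch of a trie keyed by token ids, with a cursor that adds a per-tier bonus to the logits of tokens extending the current match. Everything in it is an assumption for illustration: a plain `HashMap` trie stands in for the shipped FST, `TokenTrie`, `Cursor`, and the α parameters are hypothetical names, and the rule of using the strongest tier reachable through a token is one plausible policy, not necessarily what the daemon's `BiasCursor` does.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    Generic,
    Proper,
}

/// One trie node; `best_tier` is the strongest tier of any entry passing through it.
#[derive(Default)]
struct Node {
    children: HashMap<u32, Node>,
    best_tier: Option<Tier>,
}

#[derive(Default)]
struct TokenTrie {
    root: Node,
}

/// `Proper` outranks `Generic` (assumed ordering, following the manifest notes).
fn stronger(current: Option<Tier>, new: Tier) -> Tier {
    if current == Some(Tier::Proper) { Tier::Proper } else { new }
}

impl TokenTrie {
    /// Insert one entry, given as the token ids the tokenizer produced for its surface.
    fn insert(&mut self, token_ids: &[u32], tier: Tier) {
        let mut node = &mut self.root;
        for &id in token_ids {
            node = node.children.entry(id).or_default();
            node.best_tier = Some(stronger(node.best_tier, tier));
        }
    }
}

/// Tracks how far the decoded tokens have matched into the trie.
struct Cursor<'a> {
    root: &'a Node,
    node: &'a Node,
}

impl<'a> Cursor<'a> {
    fn new(trie: &'a TokenTrie) -> Self {
        Cursor { root: &trie.root, node: &trie.root }
    }

    /// Follow `id` if it extends the match; otherwise restart from the root.
    fn advance(&mut self, id: u32) {
        let (root, current) = (self.root, self.node);
        self.node = current
            .children
            .get(&id)
            .or_else(|| root.children.get(&id))
            .unwrap_or(root);
    }

    /// Add the per-tier bonus to every logit whose token extends the match.
    fn bias(&self, logits: &mut [f32], alpha_proper: f32, alpha_generic: f32) {
        for (&id, child) in &self.node.children {
            if let Some(slot) = logits.get_mut(id as usize) {
                *slot += match child.best_tier {
                    Some(Tier::Proper) => alpha_proper,
                    _ => alpha_generic,
                };
            }
        }
    }
}
```

In a greedy decode loop, `bias` would run on the logits before the argmax and `advance` on the chosen token afterwards. Because the keys are the ids the tokenizer emits, lookup never compares Unicode strings, which is why the normalisation form of the surface does not matter at decode time.
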
## Attribution

This artefact embeds entries derived from **SudachiDict**:

> SudachiDict — © Works Applications Co., Ltd. — Apache License, Version 2.0
>

The full upstream LICENSE / NOTICE files are reproduced under `/lexicon-ja/` when the binary is installed via `as2s download lexicon-ja`. The biasing trie shipped here is a derived work referencing SudachiDict entries by token-id sequence and Tier tag and retains the original Apache-2.0 licence terms.

"Sudachi" is a trade mark of Works Applications Co., Ltd.; this NOTICE attributes the upstream work as required by Apache-2.0 §4 and is not an endorsement.

## License

Apache License, Version 2.0. See `LICENSE-2.0.txt` for the full text (or ).