| --- |
| license: apache-2.0 |
| language: |
| - ja |
| library_name: as2s |
| tags: |
| - lexicon |
| - japanese |
| - asr |
| - speech-recognition |
| - biasing |
| - shallow-fusion |
| --- |
| |
| # arubeh/as2s-lexicon-ja |
|
|
| Pre-built Japanese lexicon binary consumed by [as2s] (Arubeh's Speech-to-Speech) to drive |
| shallow-fusion logit biasing on the Qwen3-ASR greedy decode path. The binary is an FST |
| trie keyed by Qwen3-ASR tokenizer token-id sequences plus a per-entry Tier tag, mmap'd |
| read-only at daemon boot and consulted on every decode step via a `BiasCursor`. |
|
|
| Initial release (`v0.0.1`) is **Sudachi-only**. Wikidata Q-item integration is tracked |
| in a follow-up issue; see the [Future work](#future-work) section below. |
|
|
| [as2s]: https://github.com/arubeh/as2s |
|
|
| ## On-disk format |
|
|
| ``` |
| [ magic: 9 bytes = b"AS2SLEXv2" ] |
| [ version: u32 LE ] |
| [ FST length: u64 LE ][ FST bytes ] |
| [ token table length: u64 LE ][ zstd-compressed token table ] |
| [ manifest length: u64 LE ][ manifest JSON ] |
| ``` |
|
|
| The runtime parser is in `crates/as2s-lexicon/src/load.rs`. `AS2SLEXv1` files surface |
| `LexiconError::LegacyFormatV1` and refuse to load; re-download with |
| `as2s download lexicon-ja --force` to pull the current `v2` artefact. |
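
To make the layout above concrete, here is a minimal Python sketch that splits a lexicon file into its sections. It is not the as2s loader (that lives in `load.rs`); it assumes the sections are packed back to back with no padding, exactly as the diagram shows, and it leaves the FST bytes and the zstd-compressed token table undecoded. The names `parse_container` and `read_section` are hypothetical.

```python
import json
import struct

MAGIC_V2 = b"AS2SLEXv2"
MAGIC_V1 = b"AS2SLEXv1"


def read_section(buf: bytes, offset: int) -> tuple[bytes, int]:
    """Read one `[length: u64 LE][payload]` section; return (payload, next offset)."""
    if offset + 8 > len(buf):
        raise ValueError("truncated section length")
    (length,) = struct.unpack_from("<Q", buf, offset)
    start, end = offset + 8, offset + 8 + length
    if end > len(buf):
        raise ValueError("truncated section payload")
    return buf[start:end], end


def parse_container(buf: bytes) -> dict:
    """Split a lexicon file into sections (assumes no padding between them)."""
    magic = buf[:9]
    if magic == MAGIC_V1:
        raise ValueError("legacy AS2SLEXv1 file; re-download the v2 artefact")
    if magic != MAGIC_V2:
        raise ValueError(f"unexpected magic: {magic!r}")
    if len(buf) < 13:
        raise ValueError("truncated version field")
    (version,) = struct.unpack_from("<I", buf, 9)  # u32 LE right after the magic
    fst, off = read_section(buf, 13)
    token_table_zstd, off = read_section(buf, off)
    manifest_raw, off = read_section(buf, off)
    return {
        "version": version,
        "fst": fst,                            # raw FST bytes, not decoded here
        "token_table_zstd": token_table_zstd,  # still zstd-compressed
        "manifest": json.loads(manifest_raw),
    }
```

Reading the file with `Path(...).read_bytes()` and passing the result to `parse_container` is enough to inspect the manifest; decompressing the token table would need a third-party zstd binding, which this sketch deliberately avoids.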
|
|
| ## Manifest |
|
|
| | Field | Value | |
| |---|---| |
| | `builder_version` | `0.0.1` | |
| | `sudachi_dict` | `core_lex.csv` (SudachiDict 20260428) | |
| | `wikidata_dump` | `null` (initial release is Sudachi-only) | |
| | `entry_count` | `783,648` | |
| | `entries_by_tier.proper` | `510,945` | |
| | `entries_by_tier.generic` | `272,703` | |
| | Binary size | 12 MiB | |
| | Magic | `AS2SLEXv2` | |
| | SHA-256 | `6002fa57f9be208681f50c81d7d2aec195f233e3bee387bab7d576548decebcf` | |
|
|
| `Proper` Tier entries receive a stronger biasing weight than `Generic` (per-tier α is |
| tuned by the daemon at runtime; see `as2s-lexicon` runtime docs). |
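
As a rough illustration of how a per-tier weight could enter a greedy decode step, the toy sketch below walks a plain dictionary trie keyed by token ids and adds a bonus to the logits of tokens that extend the current prefix. Everything here is an assumption made for illustration: the α values, the `ToyBiasCursor` class, the rule that a token receives the largest α of any entry reachable through it, and the reset-on-mismatch behaviour. The real `BiasCursor` operates on the mmap'd FST and may behave differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical weights; the daemon tunes the real per-tier α at runtime.
ALPHA = {"proper": 2.0, "generic": 1.0}


@dataclass
class Node:
    children: Dict[int, "Node"] = field(default_factory=dict)
    tier: Optional[str] = None  # set when an entry ends at this node
    max_alpha: float = 0.0      # largest α of any entry passing through this node


class Trie:
    def __init__(self) -> None:
        self.root = Node()

    def insert(self, token_ids: List[int], tier: str) -> None:
        node = self.root
        for tok in token_ids:
            node = node.children.setdefault(tok, Node())
            node.max_alpha = max(node.max_alpha, ALPHA[tier])
        node.tier = tier


class ToyBiasCursor:
    """Tracks the current trie position across decode steps (toy version)."""

    def __init__(self, trie: Trie) -> None:
        self.trie = trie
        self.node = trie.root

    def bias(self, logits: List[float]) -> List[float]:
        """Return a copy of `logits` with bonuses for tokens extending the prefix."""
        out = list(logits)
        for tok, child in self.node.children.items():
            out[tok] += child.max_alpha
        return out

    def advance(self, tok: int) -> None:
        """Follow the emitted token; restart from the root if the prefix breaks."""
        nxt = self.node.children.get(tok)
        if nxt is None:
            # The token may still begin a fresh entry.
            nxt = self.trie.root.children.get(tok)
        self.node = nxt if nxt is not None else self.trie.root
```

In a decode loop one would call `bias(logits)` before taking the argmax and `advance(token)` after emitting it, so that the bonus only applies to tokens that keep a lexicon entry alive.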
|
|
| ## Usage |
|
|
| The recommended path is the bundled CLI fetcher, which lands the binary under |
| `<assets_dir>/lexicon-ja/` and the upstream attribution NOTICE under |
| `<licenses>/lexicon-ja/`: |
|
|
| ``` |
| cargo run -p as2s-cli -- download lexicon-ja |
| cargo run -p as2s-cli -- serve --asr-engine qwen3-asr-rs --lexicon-dir <assets_dir>/lexicon-ja |
| ``` |
|
|
| Manual download is also supported: clone this repo (with Git LFS) and point the daemon |
| at the resulting directory via `--lexicon-dir`. |
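
After a manual download it can be worth checking the file against the SHA-256 listed in the Manifest table before starting the daemon. The sketch below uses only the Python standard library and assumes that the published digest covers the whole binary file; the helper names are hypothetical.

```python
import hashlib
from pathlib import Path

# Digest copied from the Manifest table above.
EXPECTED_SHA256 = "6002fa57f9be208681f50c81d7d2aec195f233e3bee387bab7d576548decebcf"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so it never has to fit in memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA-256 matches `expected` (case-insensitive)."""
    return sha256_of(path) == expected.lower()
```

For example, `verify(Path("lexicon-ja/lexicon.aslex"))` would check a local copy. The file name here is borrowed from the `--output` argument in the Reproduction section; whether the published file uses that name is an assumption.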
|
|
| ## Reproduction |
|
|
| The binary is regenerated by the offline `as2s-lexicon-build` CLI from the as2s |
| workspace: |
|
|
| ``` |
| cargo run --release -p as2s-lexicon-build -- \ |
| --sudachi-dict <path-to-SudachiDict-core>/core_lex.csv \ |
| --tokenizer <as2s-assets>/qwen3-asr-rs-ja/tokenizer.json \ |
| --output lexicon.aslex |
| ``` |
|
|
| Required inputs: |
|
|
| - **SudachiDict core**: fetch the `20260428` core lex CSV from the SudachiDict CDN |
| (`d2ej7fkh96fzlu.cloudfront.net/sudachidict-raw/20260428/core_lex.zip`), a 21 MB zip |
| that expands to a ~158 MB CSV. SudachiDict is published by Works Applications Co., Ltd. |
| under the Apache License, Version 2.0. |
| - **Qwen3-ASR tokenizer**: `tokenizer.json` from `Qwen/Qwen3-ASR-1.7B` or the JA |
| fine-tune; both share the same vocabulary. The builder calls `tokenizer.encode()` |
| on every surface so the trie is keyed by the post-tokenizer token-id sequence, |
| side-stepping the NFC vs NFKC normalisation pitfall (Issue #319 R7). |
|
|
| The builder lives at `crates/as2s-lexicon-build/`. The daemon's transitive dependency |
| closure (`as2s-cli`) **does not** include `sudachi.rs` or any Wikidata stream parser; |
| the heavy parsing dependencies are confined to the builder crate (architectural review |
| H2 from Issue #319). |
|
|
| ## Future work |
|
|
| - **Wikidata Q-item integration**: adds ~360k named-entity entries (people, places, |
| products) on top of the SudachiDict base. Tracked as a follow-up issue |
| (`#324b: lexicon-ja Wikidata extension` or similar; see the as2s repository). |
| Deferred from `v0.0.1` because the upstream dump (`wikidata-<date>-all.json.gz`) is |
| ~153 GB and the immediate priority is closing the daemon-startup gap. |
|
|
| ## Attribution |
|
|
| This artefact embeds entries derived from **SudachiDict**: |
|
|
| > SudachiDict - © Works Applications Co., Ltd. - Apache License, Version 2.0 |
| > <https://github.com/WorksApplications/SudachiDict> |
|
|
| The full upstream LICENSE / NOTICE files are reproduced under `<licenses>/lexicon-ja/` |
| when the binary is installed via `as2s download lexicon-ja`. The biasing trie shipped |
| here is a derived work referencing SudachiDict entries by token-id sequence and Tier |
| tag and retains the original Apache-2.0 licence terms. |
|
|
| "Sudachi" is a trade mark of Works Applications Co., Ltd.; this NOTICE attributes the |
| upstream work as required by Apache-2.0 §4 and is not an endorsement. |
|
|
| ## License |
|
|
| Apache License, Version 2.0. See `LICENSE-2.0.txt` for the full text (or |
| <https://www.apache.org/licenses/LICENSE-2.0>). |
|
|