---
license: apache-2.0
language:
- ja
library_name: as2s
tags:
- lexicon
- japanese
- asr
- speech-recognition
- biasing
- shallow-fusion
---
# arubeh/as2s-lexicon-ja
Pre-built Japanese lexicon binary consumed by [as2s] (Arubeh's Speech-to-Speech) to drive
shallow-fusion logit biasing on the Qwen3-ASR greedy decode path. The binary is an FST
trie keyed by Qwen3-ASR tokenizer token-id sequences plus a per-entry Tier tag, mmap'd
read-only at daemon boot and consulted on every decode step via a `BiasCursor`.
Initial release (`v0.0.1`) is **Sudachi-only**. Wikidata Q-item integration is tracked
in a follow-up issue; see the [Future work](#future-work) section below.
[as2s]: https://github.com/arubeh/as2s
## On-disk format
```
[ magic: 9 bytes = b"AS2SLEXv2" ]
[ version: u32 LE ]
[ FST length: u64 LE ][ FST bytes ]
[ token table length: u64 LE ][ zstd-compressed token table ]
[ manifest length: u64 LE ][ manifest JSON ]
```
The runtime parser is in `crates/as2s-lexicon/src/load.rs`. Loading an `AS2SLEXv1` file
fails with `LexiconError::LegacyFormatV1`; re-download with
`as2s download lexicon-ja --force` to pull the current `v2` artefact.
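For inspecting a file outside the Rust runtime, the layout above can be walked with a short Python sketch. It assumes the sections are packed back to back with no padding, exactly as drawn, and that the manifest is UTF-8 JSON. `read_sections` is an illustrative name, not part of as2s, and `load.rs` remains the authoritative parser. The token table is returned still zstd-compressed, since decompressing it needs a zstd binding that is not assumed here.

```python
import json
import struct

MAGIC = b"AS2SLEXv2"  # 9-byte magic from the layout diagram


def read_sections(blob: bytes) -> dict:
    """Split a lexicon blob into its sections (assumes no padding between them)."""
    if blob[:len(MAGIC)] != MAGIC:
        raise ValueError(f"unexpected magic: {blob[:len(MAGIC)]!r}")
    offset = len(MAGIC)

    # version: u32 little-endian
    (version,) = struct.unpack_from("<I", blob, offset)
    offset += 4

    def take_section() -> bytes:
        """Read one [u64 LE length][payload] pair and advance the offset."""
        nonlocal offset
        (length,) = struct.unpack_from("<Q", blob, offset)
        offset += 8
        payload = blob[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated section")
        offset += length
        return payload

    fst_bytes = take_section()
    token_table_zstd = take_section()  # left compressed in this sketch
    manifest = json.loads(take_section().decode("utf-8"))

    return {
        "version": version,
        "fst_bytes": fst_bytes,
        "token_table_zstd": token_table_zstd,
        "manifest": manifest,
    }
```

Passing the bytes of a downloaded binary to `read_sections` should, if these assumptions hold, yield a manifest whose fields line up with the table in the Manifest section.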
## Manifest
| Field | Value |
|---|---|
| `builder_version` | `0.0.1` |
| `sudachi_dict` | `core_lex.csv` (SudachiDict 20260428) |
| `wikidata_dump` | `null` (initial release is Sudachi-only) |
| `entry_count` | `783,648` |
| `entries_by_tier.proper` | `510,945` |
| `entries_by_tier.generic` | `272,703` |
| Binary size | 12 MiB |
| Magic | `AS2SLEXv2` |
| SHA-256 | `6002fa57f9be208681f50c81d7d2aec195f233e3bee387bab7d576548decebcf` |
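The SHA-256 in the table can be compared against a local copy before pointing the daemon at it. The sketch below assumes the digest covers the whole binary file; `sha256_of_file`, `verify`, and the path passed to them are illustrative, not part of as2s.

```python
import hashlib

# Digest copied from the manifest table above.
EXPECTED_SHA256 = "6002fa57f9be208681f50c81d7d2aec195f233e3bee387bab7d576548decebcf"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so the whole binary never has to sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's digest matches the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

A mismatch most likely means a partial download or a different release of the lexicon.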
`Proper` Tier entries receive a stronger biasing weight than `Generic` (per-tier α is
tuned by the daemon at runtime; see `as2s-lexicon` runtime docs).
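To make the tier weighting concrete, here is a toy Python model of trie-driven logit biasing on a greedy decode step. It is not the as2s implementation: `TrieNode`, `ToyBiasCursor`, `biased_greedy_step`, and the α values are all hypothetical, and the real `BiasCursor` walks an FST instead of nested dicts. The only ideas taken from this README are that entries are token-id sequences tagged with a tier and that `Proper` receives a larger bonus than `Generic`.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)  # token id -> TrieNode
    tiers: set = field(default_factory=set)       # tiers of entries passing through here


def insert(root: TrieNode, token_ids: list, tier: str) -> None:
    """Add one entry, tagging every node on its path with the entry's tier."""
    node = root
    for token_id in token_ids:
        node = node.children.setdefault(token_id, TrieNode())
        node.tiers.add(tier)


class ToyBiasCursor:
    """Tracks how far the tokens decoded so far have matched into the trie."""

    def __init__(self, root: TrieNode):
        self.root = root
        self.node = root

    def bonuses(self, alpha: dict) -> dict:
        """Bonus per candidate next token: the strongest tier reachable through it."""
        return {
            token_id: max(alpha[tier] for tier in child.tiers)
            for token_id, child in self.node.children.items()
        }

    def advance(self, token_id: int) -> None:
        nxt = self.node.children.get(token_id)
        if nxt is None:
            # Fell off the trie: let this token start a fresh match if it can.
            nxt = self.root.children.get(token_id, self.root)
        if not nxt.children:
            # Reached the end of an entry: start over from the root.
            nxt = self.root
        self.node = nxt


def biased_greedy_step(logits: list, cursor: ToyBiasCursor, alpha: dict) -> int:
    """Add trie bonuses to the logits, pick the argmax, and move the cursor."""
    biased = list(logits)
    for token_id, bonus in cursor.bonuses(alpha).items():
        biased[token_id] += bonus
    best = max(range(len(biased)), key=biased.__getitem__)
    cursor.advance(best)
    return best


# Hypothetical weights; the README only says Proper outweighs Generic.
ALPHA = {"proper": 2.0, "generic": 0.5}
```

With entries `[1, 2]` (proper) and `[1, 3]` (generic) in the trie, a step whose raw logits slightly favour an unrelated token can still resolve to token `2` once the cursor has consumed token `1`. Production biasing schemes often also retract the bonus when a partial match is abandoned; that refinement is left out here.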
## Usage
The recommended path is the bundled CLI fetcher, which lands the binary under
`<assets_dir>/lexicon-ja/` and the upstream attribution NOTICE under
`<licenses>/lexicon-ja/`:
```
cargo run -p as2s-cli -- download lexicon-ja
cargo run -p as2s-cli -- serve --asr-engine qwen3-asr-rs --lexicon-dir <assets>/lexicon-ja
```
Manual download is also supported: clone this repo (with Git LFS) and point the daemon
at the resulting directory via `--lexicon-dir`.
## Reproduction
The binary is regenerated by the offline `as2s-lexicon-build` CLI from the as2s
workspace:
```
cargo run --release -p as2s-lexicon-build -- \
    --sudachi-dict <path-to-SudachiDict-core>/core_lex.csv \
    --tokenizer <as2s-assets>/qwen3-asr-rs-ja/tokenizer.json \
    --output lexicon.aslex
```
Required inputs:
- **SudachiDict core**: fetch the `20260428` core lex CSV from the SudachiDict CDN
  (`d2ej7fkh96fzlu.cloudfront.net/sudachidict-raw/20260428/core_lex.zip`). The archive is a
  21 MB zip that expands to a ~158 MB CSV. SudachiDict is published by Works Applications
  Co., Ltd. under the Apache License, Version 2.0.
- **Qwen3-ASR tokenizer**: `tokenizer.json` from `Qwen/Qwen3-ASR-1.7B` or the JA
  fine-tune; both share the same vocabulary. The builder calls `tokenizer.encode()`
  on every surface, so the trie is keyed by the post-tokenizer token-id sequence,
  side-stepping the NFC vs NFKC normalisation pitfall (Issue #319 R7).
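The normalisation pitfall mentioned above is easy to reproduce with Python's standard `unicodedata` module: NFC leaves full-width and half-width compatibility characters alone, while NFKC folds them, so a trie keyed by normalised strings would depend on which form the builder and the runtime each picked. The sample surfaces and the `normalised_forms` helper are illustrative, not taken from SudachiDict or as2s.

```python
import unicodedata

# Full-width "ABC" (U+FF21..U+FF23) and half-width katakana KA followed by
# the half-width voiced sound mark (U+FF76 U+FF9E).
SAMPLES = ["\uff21\uff22\uff23", "\uff76\uff9e"]


def normalised_forms(surface: str) -> dict:
    """Return the NFC and NFKC forms of a surface string."""
    return {form: unicodedata.normalize(form, surface) for form in ("NFC", "NFKC")}


for surface in SAMPLES:
    forms = normalised_forms(surface)
    # ascii() prints code points as escapes so the difference is visible in any terminal.
    print(ascii(surface), "NFC:", ascii(forms["NFC"]), "NFKC:", ascii(forms["NFKC"]))
```

Both samples survive NFC unchanged but become different strings under NFKC (plain ASCII `ABC` and the single code point U+30AC). Keying the trie by the ids the tokenizer itself emits means the builder and the decoder agree by construction, whatever normalisation, if any, the tokenizer applies internally.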
The builder lives at `crates/as2s-lexicon-build/`. The daemon's transitive dependency
closure (`as2s-cli`) **does not** include `sudachi.rs` or any Wikidata stream parser;
the heavy parsing dependencies are confined to the builder crate (architectural review
H2 from Issue #319).
## Future work
- **Wikidata Q-item integration**: adds ~360k named-entity entries (people, places,
  products) on top of the SudachiDict base. Tracked as a follow-up issue
  (`#324b: lexicon-ja Wikidata extension` or similar; see the as2s repository).
  Deferred from `v0.0.1` because the upstream dump (`wikidata-<date>-all.json.gz`) is
  ~153 GB and the immediate priority is closing the daemon-startup gap.
## Attribution
This artefact embeds entries derived from **SudachiDict**:
> SudachiDict - © Works Applications Co., Ltd. - Apache License, Version 2.0
> <https://github.com/WorksApplications/SudachiDict>
The full upstream LICENSE / NOTICE files are reproduced under `<licenses>/lexicon-ja/`
when the binary is installed via `as2s download lexicon-ja`. The biasing trie shipped
here is a derived work referencing SudachiDict entries by token-id sequence and Tier
tag and retains the original Apache-2.0 licence terms.
"Sudachi" is a trade mark of Works Applications Co., Ltd.; this NOTICE attributes the
upstream work as required by Apache-2.0 §4 and is not an endorsement.
## License
Apache License, Version 2.0. See `LICENSE-2.0.txt` for the full text (or
<https://www.apache.org/licenses/LICENSE-2.0>).