---
license: apache-2.0
language:
- ja
library_name: as2s
tags:
- lexicon
- japanese
- asr
- speech-recognition
- biasing
- shallow-fusion
---

# arubeh/as2s-lexicon-ja

Pre-built Japanese lexicon binary consumed by [as2s] (Arubeh's Speech-to-Speech) to drive
shallow-fusion logit biasing on the Qwen3-ASR greedy decode path. The binary is an FST
trie keyed by Qwen3-ASR tokenizer token-id sequences, with a Tier tag stored per entry.
The daemon mmaps it read-only at boot and consults it on every decode step via a
`BiasCursor`.

The initial release (`v0.0.1`) is **Sudachi-only**. Wikidata Q-item integration is tracked
in a follow-up issue; see the [Future work](#future-work) section below.

[as2s]: https://github.com/arubeh/as2s

## On-disk format

```
[ magic: 9 bytes = b"AS2SLEXv2" ]
[ version: u32 LE ]
[ FST length: u64 LE ][ FST bytes ]
[ token table length: u64 LE ][ zstd-compressed token table ]
[ manifest length: u64 LE ][ manifest JSON ]
```

The runtime parser is in `crates/as2s-lexicon/src/load.rs`. Files carrying the legacy
`AS2SLEXv1` magic are rejected with `LexiconError::LegacyFormatV1`. To fix this,
re-download with `as2s download lexicon-ja --force`, which pulls the current `v2` artefact.
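
For readers who want to inspect a downloaded file without the Rust workspace, the layout above can be walked with a few lines of Python. This is a minimal sketch, not the runtime parser: the function names are hypothetical, the manifest is assumed to be UTF-8 JSON, and the token table is returned still zstd-compressed because decompressing it would need a third-party package.

```python
import io
import json
import struct

MAGIC = b"AS2SLEXv2"
LEGACY_MAGIC = b"AS2SLEXv1"


def read_section(buf: io.BytesIO) -> bytes:
    """Read one length-prefixed section: a u64 LE length, then that many bytes."""
    raw_len = buf.read(8)
    if len(raw_len) != 8:
        raise ValueError("truncated section length")
    (length,) = struct.unpack("<Q", raw_len)
    data = buf.read(length)
    if len(data) != length:
        raise ValueError("truncated section body")
    return data


def parse_lexicon(blob: bytes) -> dict:
    """Split a lexicon blob into its header fields and three sections."""
    buf = io.BytesIO(blob)
    magic = buf.read(len(MAGIC))
    if magic == LEGACY_MAGIC:
        raise ValueError("legacy AS2SLEXv1 file; re-download the v2 artefact")
    if magic != MAGIC:
        raise ValueError(f"bad magic: {magic!r}")
    raw_version = buf.read(4)
    if len(raw_version) != 4:
        raise ValueError("truncated version field")
    (version,) = struct.unpack("<I", raw_version)
    fst = read_section(buf)
    token_table = read_section(buf)  # left zstd-compressed in this sketch
    manifest = json.loads(read_section(buf).decode("utf-8"))  # assumed UTF-8
    return {
        "version": version,
        "fst": fst,
        "token_table_zstd": token_table,
        "manifest": manifest,
    }
```

Calling `parse_lexicon(open(path, "rb").read())` on a 12 MiB file is cheap enough for a one-off check; the daemon itself mmaps the file rather than reading it into memory.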

## Manifest

| Field | Value |
|---|---|
| `builder_version` | `0.0.1` |
| `sudachi_dict` | `core_lex.csv` (SudachiDict 20260428) |
| `wikidata_dump` | `null` (initial release is Sudachi-only) |
| `entry_count` | `783,648` |
| `entries_by_tier.proper` | `510,945` |
| `entries_by_tier.generic` | `272,703` |
| Binary size | 12 MiB |
| Magic | `AS2SLEXv2` |
| SHA-256 | `6002fa57f9be208681f50c81d7d2aec195f233e3bee387bab7d576548decebcf` |

`Proper` Tier entries receive a stronger biasing weight than `Generic` (per-tier α is
tuned by the daemon at runtime; see `as2s-lexicon` runtime docs).
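
To make the tier weighting concrete, here is a toy illustration of trie-driven logit biasing. Everything in it is an assumption for illustration: the `Node` and `SketchCursor` names, the dict-based trie (the shipped binary is an FST), the reset-on-miss behaviour, and the α values. It is not the `BiasCursor` API, which is documented in the `as2s-lexicon` runtime docs.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical per-tier weights; the real values are tuned by the daemon.
ALPHA = {"proper": 2.0, "generic": 1.0}


@dataclass
class Node:
    children: Dict[int, "Node"] = field(default_factory=dict)
    boost: float = 0.0  # largest alpha of any entry passing through this node


def insert(root: Node, token_ids: List[int], tier: str) -> None:
    """Add one entry, recording the strongest tier weight along its path."""
    node = root
    for tid in token_ids:
        node = node.children.setdefault(tid, Node())
        node.boost = max(node.boost, ALPHA[tier])


class SketchCursor:
    """Tracks how far the decoded tokens have matched into the trie."""

    def __init__(self, root: Node) -> None:
        self.root = root
        self.node = root

    def bias(self, logits: Dict[int, float]) -> Dict[int, float]:
        """Add a boost to every candidate token that continues an entry."""
        out = {}
        for tid, score in logits.items():
            child = self.node.children.get(tid)
            out[tid] = score + (child.boost if child is not None else 0.0)
        return out

    def advance(self, token_id: int) -> None:
        """Follow the emitted token; on a miss, restart from the root."""
        nxt = self.node.children.get(token_id)
        if nxt is None:
            nxt = self.root.children.get(token_id)  # token may start a new entry
        self.node = nxt if nxt is not None else self.root


root = Node()
insert(root, [1, 2], "proper")
insert(root, [1, 3], "generic")
cursor = SketchCursor(root)
cursor.advance(1)
biased = cursor.bias({2: 0.5, 3: 0.5, 9: 0.5})  # {2: 2.5, 3: 1.5, 9: 0.5}
```

After token `1` is emitted, the continuation belonging to the `Proper` entry gets the larger boost, the `Generic` continuation a smaller one, and the unrelated token is left untouched.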

## Usage

The recommended path is the bundled CLI fetcher, which lands the binary under
`<assets_dir>/lexicon-ja/` and the upstream attribution NOTICE under
`<licenses>/lexicon-ja/`:

```
cargo run -p as2s-cli -- download lexicon-ja
cargo run -p as2s-cli -- serve --asr-engine qwen3-asr-rs --lexicon-dir <assets_dir>/lexicon-ja
```

Manual download is also supported — clone this repo (with Git LFS) and point the daemon
at the resulting directory via `--lexicon-dir`.
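
After a manual download it may be worth checking the file against the SHA-256 listed in the Manifest table before pointing the daemon at it. The sketch below assumes that digest covers the lexicon binary as shipped; the helper names and the path passed to `verify` are placeholders, since the on-disk file name is not fixed by this README.

```python
import hashlib

# Digest copied from the Manifest table above.
EXPECTED_SHA256 = "6002fa57f9be208681f50c81d7d2aec195f233e3bee387bab7d576548decebcf"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so it never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's digest matches the expected hex string."""
    return sha256_of(path) == expected.lower()
```

A `False` result from `verify("<path-to-downloaded-binary>")` most likely means a partial download or a Git LFS pointer file that was never resolved.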

## Reproduction

The binary is regenerated by the offline `as2s-lexicon-build` CLI from the as2s
workspace:

```
cargo run --release -p as2s-lexicon-build -- \
    --sudachi-dict <path-to-SudachiDict-core>/core_lex.csv \
    --tokenizer <as2s-assets>/qwen3-asr-rs-ja/tokenizer.json \
    --output lexicon.aslex
```

Required inputs:

- **SudachiDict core** — fetch the `20260428` core lex CSV from the SudachiDict CDN
  (`d2ej7fkh96fzlu.cloudfront.net/sudachidict-raw/20260428/core_lex.zip`); the 21 MB zip
  expands to a ~158 MB CSV. SudachiDict is published by Works Applications Co., Ltd.
  under the Apache License, Version 2.0.
- **Qwen3-ASR tokenizer** — `tokenizer.json` from `Qwen/Qwen3-ASR-1.7B` or the JA
  fine-tune; both share the same vocabulary. The builder calls `tokenizer.encode()`
  on every surface so the trie is keyed by the post-tokenizer token-id sequence,
  side-stepping the NFC vs NFKC normalisation pitfall (Issue #319 R7).
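
As a generic illustration of why the normalisation form matters (this is not taken from Issue #319, whose details are not reproduced here), the two forms disagree on full-width characters that are common in Japanese text. A lookup key built from one form would not match text normalised with the other, which is the kind of mismatch that keying by tokenizer output avoids.

```python
import unicodedata


def normal_forms(surface: str) -> dict:
    """Return the NFC and NFKC normalisations of a surface string."""
    return {form: unicodedata.normalize(form, surface) for form in ("NFC", "NFKC")}


# Full-width Latin letters: NFC leaves them alone, NFKC folds them to ASCII.
forms = normal_forms("ＡＳＲ")
mismatch = forms["NFC"] != forms["NFKC"]  # True for this input
```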

The builder lives at `crates/as2s-lexicon-build/`. The daemon's transitive dependency
closure (`as2s-cli`) **does not** include `sudachi.rs` or any Wikidata stream parser —
the heavy parsing dependencies are confined to the builder crate (architectural review
H2 from Issue #319).

## Future work

- **Wikidata Q-item integration** — adds ~360k named-entity entries (people, places,
  products) on top of the SudachiDict base. Tracked as a follow-up issue
  (`#324b: lexicon-ja Wikidata extension` or similar — see the as2s repository).
  Deferred from `v0.0.1` because the upstream dump (`wikidata-<date>-all.json.gz`) is
  ~153 GB and the immediate priority is closing the daemon-startup gap.

## Attribution

This artefact embeds entries derived from **SudachiDict**:

> SudachiDict — © Works Applications Co., Ltd. — Apache License, Version 2.0
> <https://github.com/WorksApplications/SudachiDict>

The full upstream LICENSE / NOTICE files are reproduced under `<licenses>/lexicon-ja/`
when the binary is installed via `as2s download lexicon-ja`. The biasing trie shipped
here is a derivative work that references SudachiDict entries by token-id sequence and
Tier tag, and it remains under the original Apache-2.0 licence terms.

"Sudachi" is a trade mark of Works Applications Co., Ltd.; this NOTICE attributes the
upstream work as required by Apache-2.0 §4 and is not an endorsement.

## License

Apache License, Version 2.0. See `LICENSE-2.0.txt` for the full text (or
<https://www.apache.org/licenses/LICENSE-2.0>).