	---
	license: apache-2.0
	language:
	- ja
	library_name: as2s
	tags:
	- lexicon
	- japanese
	- asr
	- speech-recognition
	- biasing
	- shallow-fusion
	---

	# arubeh/as2s-lexicon-ja

	Pre-built Japanese lexicon binary consumed by [as2s] (Arubeh's Speech-to-Speech) to drive
	shallow-fusion logit biasing on the Qwen3-ASR greedy decode path. The binary is an FST
	trie keyed by Qwen3-ASR tokenizer token-id sequences plus a per-entry Tier tag, mmap'd
	read-only at daemon boot and consulted on every decode step via a `BiasCursor`.

	Initial release (`v0.0.1`) is Sudachi-only. Wikidata Q-item integration is tracked
	in a follow-up issue; see the [Future work](#future-work) section below.

	[as2s]: https://github.com/arubeh/as2s

	## On-disk format

	```
	[ magic: 9 bytes = b"AS2SLEXv2" ]
	[ version: u32 LE ]
	[ FST length: u64 LE ][ FST bytes ]
	[ token table length: u64 LE ][ zstd-compressed token table ]
	[ manifest length: u64 LE ][ manifest JSON ]
	```

	The runtime parser is in `crates/as2s-lexicon/src/load.rs`. Loading an `AS2SLEXv1` file
	fails with `LexiconError::LegacyFormatV1`; re-download with
	`as2s download lexicon-ja --force` to pull the current `v2` artefact.
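As a rough illustration of the layout above, the following Rust sketch splits a byte buffer into its sections using only the standard library. It is not the code in `load.rs`: the names `parse_container`, `Sections`, and `ParseError` are hypothetical, the version value is returned without being validated, trailing bytes are ignored, and the token table is left zstd-compressed because decompressing it would need an external crate.

```rust
/// Magic prefixes as given in the layout above.
const MAGIC_V2: &[u8] = b"AS2SLEXv2";
const MAGIC_V1: &[u8] = b"AS2SLEXv1";

/// Hypothetical error type for this sketch (the real crate exposes `LexiconError`).
#[derive(Debug, PartialEq)]
enum ParseError {
    Truncated,
    LegacyFormatV1,
    BadMagic,
}

/// Borrowed views of the container sections; nothing is copied.
#[derive(Debug, PartialEq)]
struct Sections<'a> {
    version: u32,
    fst: &'a [u8],
    token_table_zstd: &'a [u8], // still zstd-compressed
    manifest_json: &'a [u8],
}

/// Split `n` bytes off the front of `buf`, or report truncation.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Result<&'a [u8], ParseError> {
    let whole: &'a [u8] = *buf;
    if whole.len() < n {
        return Err(ParseError::Truncated);
    }
    let (head, tail) = whole.split_at(n);
    *buf = tail;
    Ok(head)
}

/// Read a `u64 LE` length prefix followed by that many payload bytes.
fn take_section<'a>(buf: &mut &'a [u8]) -> Result<&'a [u8], ParseError> {
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(take(buf, 8)?);
    let len = u64::from_le_bytes(len_bytes);
    if len > buf.len() as u64 {
        return Err(ParseError::Truncated);
    }
    take(buf, len as usize)
}

/// Walk the container front to back: magic, version, then three sections.
fn parse_container(mut buf: &[u8]) -> Result<Sections<'_>, ParseError> {
    let magic = take(&mut buf, 9)?;
    if magic == MAGIC_V1 {
        return Err(ParseError::LegacyFormatV1);
    }
    if magic != MAGIC_V2 {
        return Err(ParseError::BadMagic);
    }
    let mut version_bytes = [0u8; 4];
    version_bytes.copy_from_slice(take(&mut buf, 4)?);
    let version = u32::from_le_bytes(version_bytes);
    let fst = take_section(&mut buf)?;
    let token_table_zstd = take_section(&mut buf)?;
    let manifest_json = take_section(&mut buf)?;
    Ok(Sections { version, fst, token_table_zstd, manifest_json })
}
```

Passing the bytes of a downloaded file to `parse_container` would yield the three sections as borrowed slices, while a buffer starting with the v1 magic maps to an error analogous to `LexiconError::LegacyFormatV1`.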

	## Manifest

	| Field | Value |
	|---|---|
	| `builder_version` | `0.0.1` |
	| `sudachi_dict` | `core_lex.csv` (SudachiDict 20260428) |
	| `wikidata_dump` | `null` (initial release is Sudachi-only) |
	| `entry_count` | `783,648` |
	| `entries_by_tier.proper` | `510,945` |
	| `entries_by_tier.generic` | `272,703` |
	| Binary size | 12 MiB |
	| Magic | `AS2SLEXv2` |
	| SHA-256 | `6002fa57f9be208681f50c81d7d2aec195f233e3bee387bab7d576548decebcf` |

	`Proper` Tier entries receive a stronger biasing weight than `Generic` (per-tier α is
	tuned by the daemon at runtime; see `as2s-lexicon` runtime docs).
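To make the tier weighting concrete, here is a toy sketch of trie-guided logit biasing in Rust. Everything in it is an assumption made for illustration: the `ToyLexicon` and `ToyCursor` types, the rule that a continuation token inherits the strongest tier of any entry beneath it, and the α values passed in by the caller. The real `BiasCursor` walks the mmap'd FST and may score matches differently.

```rust
use std::collections::HashMap;

/// Tier tag; the derived ordering makes `Proper` compare greater than `Generic`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Tier {
    Generic,
    Proper,
}

#[derive(Default)]
struct Node {
    children: HashMap<u32, usize>, // token id -> index of child node
    best_tier: Option<Tier>,       // strongest tier of any entry at or below this node
}

/// Hypothetical in-memory stand-in for the FST trie.
struct ToyLexicon {
    nodes: Vec<Node>, // index 0 is the root
}

impl ToyLexicon {
    fn new() -> Self {
        ToyLexicon { nodes: vec![Node::default()] }
    }

    /// Insert one entry keyed by its token-id sequence.
    fn insert(&mut self, tokens: &[u32], tier: Tier) {
        let mut at = 0;
        for &tok in tokens {
            let existing = self.nodes[at].children.get(&tok).copied();
            let next = match existing {
                Some(idx) => idx,
                None => {
                    let idx = self.nodes.len();
                    self.nodes.push(Node::default());
                    self.nodes[at].children.insert(tok, idx);
                    idx
                }
            };
            let node = &mut self.nodes[next];
            node.best_tier = node.best_tier.max(Some(tier));
            at = next;
        }
    }
}

/// Hypothetical cursor tracking how far the decoded tokens have matched.
struct ToyCursor<'a> {
    lex: &'a ToyLexicon,
    at: usize,
}

impl<'a> ToyCursor<'a> {
    fn new(lex: &'a ToyLexicon) -> Self {
        ToyCursor { lex, at: 0 }
    }

    /// Add the per-tier bonus to every logit whose token continues the current match.
    fn apply(&self, logits: &mut [f32], alpha_proper: f32, alpha_generic: f32) {
        for (&tok, &child) in &self.lex.nodes[self.at].children {
            let tier = self.lex.nodes[child].best_tier;
            if let (Some(logit), Some(tier)) = (logits.get_mut(tok as usize), tier) {
                *logit += match tier {
                    Tier::Proper => alpha_proper,
                    Tier::Generic => alpha_generic,
                };
            }
        }
    }

    /// Follow the emitted token; on a miss, retry from the root, else reset to it.
    fn advance(&mut self, tok: u32) {
        let lex = self.lex;
        self.at = lex.nodes[self.at]
            .children
            .get(&tok)
            .or_else(|| lex.nodes[0].children.get(&tok))
            .copied()
            .unwrap_or(0);
    }
}
```

In a greedy decode loop, this sketch would call `apply` on the raw logits before taking the argmax and then `advance` with the chosen token, so the bonus only ever nudges tokens that extend a lexicon entry.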

	## Usage

	The recommended path is the bundled CLI fetcher, which lands the binary under
	`<assets_dir>/lexicon-ja/` and the upstream attribution NOTICE under
	`<licenses>/lexicon-ja/`:

	```
	cargo run -p as2s-cli -- download lexicon-ja
	cargo run -p as2s-cli -- serve --asr-engine qwen3-asr-rs --lexicon-dir <assets_dir>/lexicon-ja
	```

	Manual download is also supported — clone this repo (with Git LFS) and point the daemon
	at the resulting directory via `--lexicon-dir`.

	## Reproduction

	The binary is regenerated by the offline `as2s-lexicon-build` CLI from the as2s
	workspace:

	```
	cargo run --release -p as2s-lexicon-build -- \
	  --sudachi-dict <path-to-SudachiDict-core>/core_lex.csv \
	  --tokenizer <as2s-assets>/qwen3-asr-rs-ja/tokenizer.json \
	  --output lexicon.aslex
	```

	Required inputs:

	- SudachiDict core: fetch the `20260428` `core_lex.csv` from the SudachiDict CDN
	  (`d2ej7fkh96fzlu.cloudfront.net/sudachidict-raw/20260428/core_lex.zip`), a 21 MB zip
	  that expands to a ~158 MB CSV. SudachiDict is published by Works Applications Co., Ltd.
	  under the Apache License, Version 2.0.
	- Qwen3-ASR tokenizer: `tokenizer.json` from `Qwen/Qwen3-ASR-1.7B` or the JA
	  fine-tune; both share the same vocabulary. The builder calls `tokenizer.encode()`
	  on every surface form, so the trie is keyed by the post-tokenizer token-id sequence.
	  This side-steps the NFC vs NFKC normalisation pitfall (Issue #319 R7).

	The builder lives at `crates/as2s-lexicon-build/`. The daemon's transitive dependency
	closure (`as2s-cli`) does not include `sudachi.rs` or any Wikidata stream parser —
	the heavy parsing dependencies are confined to the builder crate (architectural review
	H2 from Issue #319).

	## Future work

	- Wikidata Q-item integration: adds ~360k named-entity entries (people, places,
	  products) on top of the SudachiDict base. Tracked as a follow-up issue
	  (`#324b: lexicon-ja Wikidata extension` or similar; see the as2s repository).
	  Deferred from `v0.0.1` because the upstream dump (`wikidata-<date>-all.json.gz`) is
	  ~153 GB and the immediate priority is closing the daemon-startup gap.

	## Attribution

	This artefact embeds entries derived from SudachiDict:

	> SudachiDict — © Works Applications Co., Ltd. — Apache License, Version 2.0
	> <https://github.com/WorksApplications/SudachiDict>

	The full upstream LICENSE / NOTICE files are reproduced under `<licenses>/lexicon-ja/`
	when the binary is installed via `as2s download lexicon-ja`. The biasing trie shipped
	here is a derived work that references SudachiDict entries by token-id sequence and Tier
	tag; it retains the original Apache-2.0 licence terms.

	"Sudachi" is a trade mark of Works Applications Co., Ltd.; this NOTICE attributes the
	upstream work as required by Apache-2.0 §4 and is not an endorsement.

	## License

	Apache License, Version 2.0. See `LICENSE-2.0.txt` for the full text (or
	<https://www.apache.org/licenses/LICENSE-2.0>).