
# Radar-1 Technical Report

## 1. Overview

Radar-1 is a language detection module for the underthesea NLP ecosystem. It provides fast, accurate language identification by reimplementing Facebook's FastText inference pipeline in pure Rust with PyO3 Python bindings, integrated into the underthesea_core native extension.

Key results:

| Metric | Value |
|---|---|
| Accuracy (97 test cases, 25+ languages) | 95.9% |
| Prediction match vs C++ fasttext | 100% |
| Batch throughput | 110,001 predictions/sec |
| vs fasttext-predict (C++ stripped) | 2.14x faster |
| vs fasttext-wheel (C++ full) | 2.42x faster |

## 2. Model

Radar-1 uses the pre-trained FastText language identification model lid.176.ftz, which supports 176 languages.

| Parameter | Value |
|---|---|
| Architecture | FastText supervised |
| Loss function | Hierarchical softmax |
| Dimensions | 16 |
| Vocabulary | 7,235 words + 176 labels |
| Character n-grams | minn=2, maxn=4 |
| Hash buckets | 2,000,000 |
| Format | .ftz (quantized with Product Quantization) |
| File size | ~917 KB |

## 3. Architecture

### 3.1 Inference Pipeline

```text
Input Text
    |
    v
Tokenization (whitespace split)
    |
    v
Feature Extraction
  - Word IDs (vocabulary lookup via FNV-1a hash table)
  - Character n-grams (minn=2..maxn=4, with <BOW>/<EOW> markers)
  - Word n-grams (bigrams via hash)
  - EOS token (</s>)
    |
    v
Input Embedding (average of feature vectors)
  - Quantized input matrix (Product Quantization)
  - 50,000 rows x 16 dims, 8 sub-quantizers, 256 centroids each
    |
    v
Prediction (Hierarchical Softmax)
  - Huffman tree with 176 leaves (labels) and 175 internal nodes
  - DFS traversal with log-space scoring and min-heap pruning
  - Dense output matrix (176 x 16)
    |
    v
Top-k (label, score) pairs
```

### 3.2 Rust Module Structure

The implementation lives in `underthesea_core/src/fasttext/` with six files:

| File | Lines | Responsibility |
|---|---|---|
| `hash.rs` | ~40 | FNV-1a hash with C++ sign-extension compatibility |
| `args.rs` | ~100 | Model hyperparameter deserialization (magic, version, 12 args) |
| `dictionary.rs` | ~240 | Vocabulary; word, char, and word n-gram feature extraction; prune index |
| `matrix.rs` | ~305 | Dense matrix + Product Quantization (PQ) matrix with norm PQ |
| `inference.rs` | ~285 | Hierarchical softmax (Huffman tree, DFS), softmax, sigmoid |
| `mod.rs` | ~200 | `FastTextModel` struct: load .bin/.ftz, predict, public API |

Python bindings are exposed via PyO3 in `lib.rs`:

```python
from underthesea_core import FastText

model = FastText.load("lid.176.ftz")
results = model.predict("Xin chao Viet Nam", k=3)
# [("vi", 0.98), ("id", 0.005), ("ms", 0.003)]
```

## 4. Implementation Details

### 4.1 FNV-1a Hash (C++ Compatibility)

FastText uses FNV-1a hashing for vocabulary lookup and n-gram bucket assignment. A critical compatibility detail: C++ `char` is signed on most platforms, so bytes >= 0x80 are sign-extended when cast to `uint32_t`. The Rust implementation matches this:

```rust
h ^= byte as i8 as u32;  // sign-extend: 0xFF -> 0xFFFFFFFF
```
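For illustration, the same hash with its sign-extension quirk can be sketched in Python (a toy reimplementation for reference, not the shipping Rust code):

```python
def fnv1a_signed(s: str) -> int:
    """FNV-1a over UTF-8 bytes, replicating C++'s signed-char sign extension."""
    h = 2166136261                       # FNV-1a 32-bit offset basis
    for b in s.encode("utf-8"):
        if b >= 0x80:                    # a C++ signed char is negative here,
            b |= 0xFFFFFF00              # so it sign-extends when cast to uint32_t
        h ^= b
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, wrapping as a u32
    return h

# ASCII input matches plain FNV-1a; bytes >= 0x80 diverge from the unsigned variant.
print(hex(fnv1a_signed("xin chào")))
```

For pure-ASCII strings this is identical to standard 32-bit FNV-1a; only multi-byte UTF-8 input exercises the sign-extension path.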

### 4.2 Product Quantization

The .ftz format uses Product Quantization to compress the input embedding matrix:

- The 50,000 x 16 input matrix is split into 8 sub-quantizers of dimension 2
- Each sub-quantizer has 256 centroids (KSUB=256)
- Row norms are separately quantized (qnorm=true)
- Lookup: for each row, 8 code bytes select centroids, which are concatenated and scaled by the norm
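The lookup can be illustrated with a small NumPy sketch using the shapes above; the random codes, centroids, and norms here are toy stand-ins for what the .ftz file actually stores:

```python
import numpy as np

NSUBQ, DSUB, KSUB = 8, 2, 256    # 8 sub-quantizers of dim 2, 256 centroids each
ROWS = 50_000                    # input-matrix rows (words + n-gram buckets)

rng = np.random.default_rng(0)
centroids = rng.standard_normal((NSUBQ, KSUB, DSUB)).astype(np.float32)
codes = rng.integers(0, KSUB, size=(ROWS, NSUBQ), dtype=np.uint8)  # 8 code bytes/row
norms = rng.random(ROWS).astype(np.float32)                        # dequantized norms

def pq_row(i: int) -> np.ndarray:
    """Reconstruct row i: each code byte selects one centroid; the selected
    centroids are concatenated and scaled by the row norm (qnorm=true)."""
    parts = [centroids[q, codes[i, q]] for q in range(NSUBQ)]
    return norms[i] * np.concatenate(parts)

vec = pq_row(3)       # a reconstructed 16-dimensional embedding row
```

Storage drops from 16 floats (64 bytes) to 8 code bytes plus one norm code per row, which is what makes the ~917 KB .ftz possible.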

### 4.3 Hierarchical Softmax

Correct HS prediction requires matching three C++ behaviors:

1. **Log-space scoring:** the DFS accumulates `score + log(p + 1e-5)` rather than multiplying probabilities, preventing numerical underflow on deep tree paths.

2. **Left/right sigmoid assignment:** in C++ FastText v0.9.2 (`loss.cc`), the left child receives `std_log(1 - sigmoid(f))` and the right child receives `std_log(sigmoid(f))`. This is counterintuitive and opposite to some documentation.

3. **Min-heap pruning:** a `BinaryHeap<Reverse<...>>` maintains the top-k scores. Branches whose score falls below the current minimum are pruned, exploiting the monotonically decreasing property of log-space scores.
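The three behaviors combine as in this Python sketch of the traversal (a toy tree of nested dicts; in the real model each internal node's score `f` is a dot product between the hidden vector and an output-matrix row):

```python
import heapq
import math

def std_log(x: float) -> float:          # C++ std_log: ln(x + 1e-5)
    return math.log(x + 1e-5)

def sigmoid(f: float) -> float:
    return 1.0 / (1.0 + math.exp(-f))

def predict(tree: dict, k: int):
    heap = []                            # min-heap: worst kept score on top
    def dfs(node: dict, score: float) -> None:
        # Prune: log-space scores only decrease down the tree, so once the
        # heap is full, a branch below its minimum can never reach the top-k.
        if len(heap) == k and score < heap[0][0]:
            return
        if "label" in node:              # leaf: a candidate label
            heapq.heappush(heap, (score, node["label"]))
            if len(heap) > k:
                heapq.heappop(heap)
            return
        f = node["f"]                    # raw score at this internal node
        dfs(node["left"],  score + std_log(1.0 - sigmoid(f)))  # left: 1 - sigmoid
        dfs(node["right"], score + std_log(sigmoid(f)))        # right: sigmoid
    dfs(tree, 0.0)
    return sorted(((math.exp(s), lbl) for s, lbl in heap), reverse=True)

# Toy tree: 3 labels, 2 internal nodes.
tree = {"f": 2.0,
        "left": {"label": "a"},
        "right": {"f": 0.0, "left": {"label": "b"}, "right": {"label": "c"}}}
top2 = predict(tree, 2)
```

Scores are converted back to probabilities with `exp` only at the end, after pruning has been done entirely in log space.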

### 4.4 EOS Token Handling

C++ FastText's `initNgrams()` explicitly skips character n-gram computation for the `</s>` (EOS) token. Only the EOS word ID is included as a feature, not its character n-grams (`<`, `</`, `</s`, etc.). This is replicated in `dictionary.rs`:

```rust
if self.minn > 0 && *token != "</s>" {
    self.compute_char_ngrams(token, &mut features);
}
```
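The surrounding n-gram extraction can be sketched in Python; iterating a Python `str` walks code points, much as FastText walks UTF-8 characters. This sketch only enumerates the n-gram strings (the real code hashes each one into a bucket):

```python
def char_ngrams(token: str, minn: int = 2, maxn: int = 4) -> list[str]:
    """All length-minn..maxn substrings of '<' + token + '>'."""
    if token == "</s>":              # C++ initNgrams() skips EOS entirely
        return []
    w = f"<{token}>"                 # <BOW>/<EOW> markers
    return [w[i:i + n]
            for i in range(len(w))
            for n in range(minn, maxn + 1)
            if i + n <= len(w)]

print(char_ngrams("hi"))    # ['<h', '<hi', '<hi>', 'hi', 'hi>', 'i>']
print(char_ngrams("</s>"))  # []
```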

### 4.5 Prune Index

Quantized .ftz models use a prune index (`pruneidx`) to remap n-gram hash buckets to a smaller set of active rows in the compressed input matrix. N-grams whose bucket hash is not in `pruneidx` are silently dropped.
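A minimal sketch of the remap, with toy bucket ids; the `NWORDS` offset reflects the assumption that n-gram rows follow the word rows in the input matrix:

```python
# Toy prune index: surviving bucket id -> compressed-row index (illustrative values).
pruneidx = {104_593: 0, 900_112: 1, 1_998_241: 2}
NWORDS = 7235                       # word rows come first in the input matrix

def remap_ngram(bucket: int):
    """Matrix row for an n-gram bucket, or None if the bucket was pruned."""
    idx = pruneidx.get(bucket)
    return None if idx is None else NWORDS + idx

print(remap_ngram(104_593))   # 7235
print(remap_ngram(42))        # None: pruned n-grams are silently dropped
```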

## 5. Correctness Verification

### 5.1 Component-Level Verification

Each pipeline stage was verified against C++ FastText (fasttext-wheel 0.9.2):

| Component | Method | Result |
|---|---|---|
| FNV-1a hash | Compared with compiled C program output | Exact match |
| Feature IDs | `get_features("hello")` vs C++ `get_subwords` | Exact match: [7165, 0] |
| Hidden vector | `get_hidden("hello")` vs C++ `get_sentence_vector` | Max diff: 1.49e-08 |
| Output matrix | Raw binary parse vs Rust dot products | Max diff: 4.8e-07 |
| HS tree structure | Python rebuild from same counts | All 351 nodes match |
| Predictions | Top-k labels and scores | 100% match (97/97 texts) |

### 5.2 End-to-End Accuracy

Tested on 97 hand-crafted sentences across 25+ languages:

| Metric | Value |
|---|---|
| Rust accuracy | 93/97 (95.9%) |
| Python C++ accuracy | 93/97 (95.9%) |
| Rust vs Python label match | 97/97 (100.0%) |

Languages tested: vi, en, fr, de, es, ru, ja, zh, ko, ar, pt, it, tr, pl, nl, sv, el, cs, hu, fi, he, th, id, hi, uk, ro, da, no, ca, bg, fa.

The 4 misclassified cases are inherent model errors (both Rust and Python produce the same wrong label), not implementation bugs.

## 6. Performance

### 6.1 Library Comparison

Benchmarked against all major Python FastText libraries using lid.176.ftz on 16 multilingual sentences:

| Library | Type | Load (ms) | Avg latency (µs) | Throughput (pred/s) | vs Rust |
|---|---|---|---|---|---|
| underthesea_core | Rust/PyO3 | 51.8 | 8.3 | 110,001 | 1.00x |
| fasttext-langdetect | C++ wrapper | 0.0* | 8.9 | 89,038 | 0.81x |
| fast-langdetect | C++ wrapper | 0.0* | 14.6 | 57,399 | 0.52x |
| fasttext-predict | C++ stripped | 29.9 | 17.1 | 51,493 | 0.47x |
| fasttext-wheel | C++ full | 28.4 | 19.2 | 45,547 | 0.41x |

\* Wrappers keep the model loaded globally, so load time is 0 after warmup.

Libraries tested:

| Package | Version | Description |
|---|---|---|
| underthesea_core | 3.3.0 | Pure Rust FastText inference (this project) |
| fasttext-predict | 0.9.2.4 | C++ predict-only fork, no numpy, <1MB wheel |
| fasttext-wheel | 0.9.2 | Full Facebook C++ fasttext with numpy/pybind11 |
| fast-langdetect | 1.0.0 | Wrapper around fasttext-predict, bundles lid.176.ftz |
| fasttext-langdetect | 1.0.5 | Wrapper around full fasttext |

### 6.2 Prediction Latency by Input

Median latency over 500 runs per sentence (top-3 prediction):

| Input | Rust (µs) | C++ fasttext-predict (µs) | Speedup |
|---|---|---|---|
| "hello" (5 chars) | 3.0 | 6.3 | 2.10x |
| Vietnamese, medium (36 chars) | 6.2 | 13.0 | 2.10x |
| English, medium (44 chars) | 6.7 | 14.7 | 2.19x |
| French, medium (49 chars) | 7.2 | 15.2 | 2.11x |
| Chinese (10 chars) | 4.6 | 10.7 | 2.33x |
| Japanese (11 chars) | 4.0 | 8.3 | 2.08x |
| Vietnamese, long (185 chars) | 29.1 | 59.3 | 2.04x |
| **Average** | **8.3** | **17.1** | **2.06x** |

### 6.3 Prediction Verification

All implementations produce identical top-1 predictions on 16 test sentences:

| | underthesea_core | fasttext-predict | fasttext-wheel |
|---|---|---|---|
| Match rate | – | 16/16 (100%) | 16/16 (100%) |

Note: fast-langdetect and fasttext-langdetect show 15/16 match because they default to the larger lid.176.bin model instead of .ftz.

### 6.4 Model Loading

| Implementation | Load time |
|---|---|
| fasttext-predict (C++) | 29.9 ms |
| fasttext-wheel (C++) | 28.4 ms |
| underthesea_core (Rust) | 51.8 ms |

Model loading is slower in Rust due to element-by-element float parsing (vs C++ bulk memcpy). This is a one-time cost and does not affect prediction performance.

## 7. Bugs Found and Fixed

During development, three bugs were identified and fixed:

### Bug 1: EOS Character N-grams

**Symptom:** 0% accuracy; all predictions wrong.

**Root cause:** Rust included character n-grams for the `</s>` token (e.g., `<`, `</`, `</s`, `s>`, `>`), but C++ `initNgrams()` has an explicit `if (words_[i].word != EOS)` check that skips n-gram computation for EOS.

**Fix:** Added a `*token != "</s>"` guard before `compute_char_ngrams()`.

### Bug 2: Probability Space vs Log Space

**Symptom:** 0% accuracy after fixing Bug 1.

**Root cause:** The HS DFS used probability-space multiplication (`score * sigmoid(f)`) with an arbitrary 0.001 pruning threshold. C++ uses log-space addition (`score + log(sigmoid(f) + 1e-5)`) with proper min-heap pruning.

**Fix:** Switched to log-space scoring with `std_log(x) = ln(x + 1e-5)` and a `BinaryHeap<Reverse<...>>` min-heap.

### Bug 3: Left/Right Sigmoid Swap

**Symptom:** 0% accuracy after fixing Bugs 1 and 2, even though the hidden vectors and output matrix were verified identical.

**Root cause:** The DFS assigned `std_log(sigmoid(f))` to the left child and `std_log(1 - sigmoid(f))` to the right child. The actual C++ v0.9.2 code (`loss.cc`) does the opposite: the left child gets `std_log(1 - sigmoid(f))` and the right child gets `std_log(sigmoid(f))`.

**Fix:** Swapped the left/right assignments to match C++.

**Discovery method:** Fetched the exact C++ source from github.com/facebookresearch/fastText/v0.9.2/src/loss.cc and compared line by line.

## 8. Binary Format

The .ftz file format (FastText quantized model, version 12):

```text
[Header]
  magic:    i32 = 0x2F4F16BA
  version:  i32 = 12

[Args]
  dim, ws, epoch, minCount, neg, wordNgrams: i32 x 6
  loss, model, bucket, minn, maxn, lrUpdateRate: i32 x 6
  t: f64

[Dictionary]
  size, nwords, nlabels: i32 x 3
  ntokens: i64
  pruneidx_size: i64
  entries[size]: { null-terminated string, count: i64, type: i8 }
  pruneidx[pruneidx_size]: { key: i32, value: i32 }

[Input Matrix - Quantized]
  quant_input: bool (1 byte)
  qnorm: bool
  rows, cols: i64 x 2
  codesize: i32
  codes: u8[codesize]
  PQ: { dim, nsubq, dsub, lastdsub: i32 x 4, centroids: f32[nsubq * 256 * dsub] }
  norm_codes: u8[rows]  (if qnorm)
  norm_PQ: { ... }       (if qnorm)

[Output Matrix - Dense]
  quant_output: bool (1 byte)
  rows, cols: i64 x 2
  data: f32[rows * cols]
```
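A minimal Python sketch of parsing the header and args (little-endian throughout, field order as listed above). The dim/bucket/minn/maxn values below come from Section 2; the remaining args (ws, epoch, minCount, neg, wordNgrams, the loss/model codes, lrUpdateRate, t) are placeholder values for the round-trip:

```python
import io
import struct

FASTTEXT_MAGIC = 0x2F4F16BA  # = 793712314

def read_args(buf) -> dict:
    """Parse magic/version plus the 12 int32 args and the f64 `t`."""
    magic, version = struct.unpack("<ii", buf.read(8))
    if magic != FASTTEXT_MAGIC or version != 12:
        raise ValueError("not a FastText v12 model")
    names = ["dim", "ws", "epoch", "minCount", "neg", "wordNgrams",
             "loss", "model", "bucket", "minn", "maxn", "lrUpdateRate"]
    args = dict(zip(names, struct.unpack("<12i", buf.read(48))))
    (args["t"],) = struct.unpack("<d", buf.read(8))
    return args

# Round-trip with illustrative values (dim=16, bucket=2,000,000, minn=2, maxn=4):
raw = (struct.pack("<ii", FASTTEXT_MAGIC, 12)
       + struct.pack("<12i", 16, 5, 5, 1, 10, 2, 1, 1, 2_000_000, 2, 4, 100)
       + struct.pack("<d", 1e-4))
args = read_args(io.BytesIO(raw))
```

The Rust `args.rs` performs the equivalent reads with `byteorder`'s little-endian readers.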

## 9. Dependencies

### Rust (underthesea_core)

| Crate | Version | Purpose |
|---|---|---|
| pyo3 | 0.25.1 | Python bindings |
| byteorder | 1.5 | Little-endian binary I/O |
| (std only) | – | BinaryHeap, HashMap, BufReader |

No additional crates were added for the FastText module.

### Python (radar-1)

| Package | Version | Purpose |
|---|---|---|
| underthesea | >= 9.2.9 | NLP ecosystem integration |
| underthesea_core | >= 3.3.0 | Rust FastText inference |

## 10. Future Work

- Batch prediction API (`predict_batch(texts: Vec<String>)`)
- SIMD-accelerated dot products for further speedup
- Bulk `read_exact` for dense matrix loading to improve load time
- Softmax-loss models (currently only hierarchical softmax is tested)