Radar-1 Technical Report
1. Overview
Radar-1 is a language detection module for the underthesea NLP ecosystem. It provides fast, accurate language identification by reimplementing Facebook's FastText inference pipeline in pure Rust with PyO3 Python bindings, integrated into the underthesea_core native extension.
Key results:
| Metric | Value |
|---|---|
| Accuracy (97 test cases, 25+ languages) | 95.9% |
| Prediction match vs C++ fasttext | 100% |
| Batch throughput | 110,001 predictions/sec |
| vs fasttext-predict (C++ stripped) | 2.14x faster |
| vs fasttext-wheel (C++ full) | 2.42x faster |
2. Model
Radar-1 uses the pre-trained FastText language identification model lid.176.ftz, which supports 176 languages.
| Parameter | Value |
|---|---|
| Architecture | FastText supervised |
| Loss function | Hierarchical softmax |
| Dimensions | 16 |
| Vocabulary | 7,235 words + 176 labels |
| Character n-grams | minn=2, maxn=4 |
| Hash buckets | 2,000,000 |
| Format | .ftz (quantized with Product Quantization) |
| File size | ~917 KB |
3. Architecture
3.1 Inference Pipeline
```
Input Text
    |
    v
Tokenization (whitespace split)
    |
    v
Feature Extraction
  - Word IDs (vocabulary lookup via FNV-1a hash table)
  - Character n-grams (minn=2..maxn=4, with <BOW>/<EOW> markers)
  - Word n-grams (bigrams via hash)
  - EOS token (</s>)
    |
    v
Input Embedding (average of feature vectors)
  - Quantized input matrix (Product Quantization)
  - 50,000 rows x 16 dims, 8 sub-quantizers, 256 centroids each
    |
    v
Prediction (Hierarchical Softmax)
  - Huffman tree with 176 leaves (labels) and 175 internal nodes
  - DFS traversal with log-space scoring and min-heap pruning
  - Dense output matrix (176 x 16)
    |
    v
Top-k (label, score) pairs
```
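The character n-gram stage above can be sketched in Python. This is a simplified illustration only: it operates on Unicode characters rather than UTF-8 bytes, skips hashing, and `char_ngrams` is a hypothetical helper, not part of the module's API.

```python
def char_ngrams(word: str, minn: int = 2, maxn: int = 4) -> list[str]:
    """All substrings of length minn..maxn of the padded word '<word>'."""
    padded = f"<{word}>"  # BOW/EOW markers, shown here as '<' and '>'
    return [
        padded[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(padded) - n + 1)
    ]
```

For example, `char_ngrams("ab")` yields `['<a', 'ab', 'b>', '<ab', 'ab>', '<ab>']`; each n-gram is then hashed into one of the 2,000,000 buckets.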
3.2 Rust Module Structure
The implementation lives in underthesea_core/src/fasttext/ with 6 files:
| File | Lines | Responsibility |
|---|---|---|
| hash.rs | ~40 | FNV-1a hash with C++ sign-extension compatibility |
| args.rs | ~100 | Model hyperparameter deserialization (magic, version, 12 args) |
| dictionary.rs | ~240 | Vocabulary; word/char/word n-gram feature extraction; prune index |
| matrix.rs | ~305 | Dense matrix + Product Quantization (PQ) matrix with norm PQ |
| inference.rs | ~285 | Hierarchical softmax (Huffman tree, DFS), softmax, sigmoid |
| mod.rs | ~200 | FastTextModel struct: load .bin/.ftz, predict, public API |
Python bindings are exposed via PyO3 in lib.rs:

```python
from underthesea_core import FastText

model = FastText.load("lid.176.ftz")
results = model.predict("Xin chao Viet Nam", k=3)
# [("vi", 0.98), ("id", 0.005), ("ms", 0.003)]
```
4. Implementation Details
4.1 FNV-1a Hash (C++ Compatibility)
FastText uses FNV-1a hashing for vocabulary lookup and n-gram bucket assignment. A critical compatibility detail: C++ char is signed on most platforms, so bytes >= 0x80 are sign-extended when cast to uint32_t. The Rust implementation matches this:
```rust
h ^= byte as i8 as u32; // sign-extend: 0xFF -> 0xFFFFFFFF
```
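A minimal Python model of the same hash makes the signed-char cast explicit. The constants are the standard 32-bit FNV offset basis and prime; the function name is illustrative, not part of the module:

```python
def fnv1a_signed(s: str) -> int:
    """FNV-1a over UTF-8 bytes, reproducing C++'s signed-char sign extension."""
    h = 0x811C9DC5                         # FNV-1a 32-bit offset basis
    for b in s.encode("utf-8"):
        if b >= 0x80:                      # int8_t cast: high bytes go negative,
            b -= 256                       # then sign-extend to uint32_t
        h ^= b & 0xFFFFFFFF
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, modulo 2^32
    return h
```

For pure-ASCII input this matches plain FNV-1a; the sign extension only changes the result for bytes >= 0x80, i.e. non-ASCII text, which is exactly where a mismatch would corrupt n-gram bucket assignment.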
4.2 Product Quantization
The .ftz format uses Product Quantization to compress the input embedding matrix:
- The 50,000 x 16 input matrix is split into 8 sub-quantizers of dimension 2
- Each sub-quantizer has 256 centroids (KSUB=256)
- Row norms are separately quantized (qnorm=true)
- Lookup: for each row, 8 code bytes select centroids, which are concatenated and scaled by the norm
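The lookup step can be sketched as follows. This is a toy illustration with hypothetical names; the real code also handles the last sub-quantizer's dimension (lastdsub) separately:

```python
def pq_decode_row(codes, centroids, norm):
    """Reconstruct one embedding row from its PQ codes.

    codes:     one byte per sub-quantizer (nsubq bytes per row)
    centroids: centroids[q][c] is the dsub-dim centroid c of sub-quantizer q
    norm:      the separately dequantized row norm
    """
    row = []
    for q, code in enumerate(codes):
        row.extend(centroids[q][code])   # concatenate the chosen centroids
    return [norm * x for x in row]       # scale by the row norm
```

With nsubq=8 and dsub=2 this reconstructs the full 16-dimensional row from just 8 code bytes plus one norm code.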
4.3 Hierarchical Softmax
Correct HS prediction required matching three C++ behaviors:
- Log-space scoring: the DFS accumulates score + log(p + 1e-5) rather than multiplying probabilities, preventing numerical underflow on deep tree paths.
- Left/right sigmoid assignment: in C++ FastText v0.9.2 (loss.cc), the left child receives std_log(1 - sigmoid(f)) and the right child receives std_log(sigmoid(f)). This is counterintuitive and the opposite of some documentation.
- Min-heap pruning: a BinaryHeap<Reverse<...>> maintains the top-k scores; branches with scores below the current minimum are pruned, exploiting the monotonically decreasing property of log-space scores.
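The three behaviors can be condensed into a short Python sketch. The tree, weights, and function names here are toy stand-ins (the real implementation walks the model's Huffman tree against the dense output matrix):

```python
import heapq
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def std_log(x: float) -> float:
    return math.log(x + 1e-5)   # underflow-guarded log, as in C++ FastText

def hs_dfs(node, score, tree, weights, hidden, heap, k):
    """Log-space DFS over a Huffman tree with min-heap top-k pruning."""
    # Prune: log-space scores only decrease along a path, so a branch
    # already below the current k-th best can never recover.
    if len(heap) == k and score < heap[0][0]:
        return
    left, right = tree[node]
    if left is None:            # leaf node = label
        heapq.heappush(heap, (score, node))
        if len(heap) > k:
            heapq.heappop(heap)
        return
    f = sigmoid(sum(w * h for w, h in zip(weights[node], hidden)))
    # C++ v0.9.2 loss.cc: left child gets log(1 - f), right gets log(f)
    hs_dfs(left, score + std_log(1.0 - f), tree, weights, hidden, heap, k)
    hs_dfs(right, score + std_log(f), tree, weights, hidden, heap, k)
```

On a two-leaf toy tree (nodes 0 and 1 under root 2) with hidden = [1.0] and root weights [2.0], the right leaf wins with score std_log(sigmoid(2.0)), i.e. a probability of about 0.88.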
4.4 EOS Token Handling
C++ FastText's initNgrams() explicitly skips character n-gram computation for the </s> (EOS) token. Only the EOS word ID is included as a feature, not its character n-grams (<, </, </s, etc.). This is replicated in dictionary.rs:
```rust
if self.minn > 0 && *token != "</s>" {
    self.compute_char_ngrams(token, &mut features);
}
```
4.5 Prune Index
Quantized .ftz models use a prune index (pruneidx) to remap n-gram hash buckets to a smaller set of active rows in the compressed input matrix. N-grams whose bucket hash is not in pruneidx are silently dropped.
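A sketch of the remapping, with hypothetical names (this assumes, as a simplification, that word rows precede the retained n-gram rows in the compressed matrix):

```python
def ngram_row(bucket: int, nwords: int, pruneidx: dict[int, int]):
    """Map an n-gram hash bucket to a row of the compressed input matrix."""
    if bucket not in pruneidx:
        return None                     # pruned bucket: feature silently dropped
    return nwords + pruneidx[bucket]    # retained n-gram rows follow word rows
```

This is why the quantized input matrix needs only 50,000 rows despite the model's 2,000,000 hash buckets: pruneidx keeps only the buckets that survived quantization.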
5. Correctness Verification
5.1 Component-Level Verification
Each pipeline stage was verified against C++ FastText (fasttext-wheel 0.9.2):
| Component | Method | Result |
|---|---|---|
| FNV-1a hash | Compared with compiled C program output | Exact match |
| Feature IDs | get_features("hello") vs C++ get_subwords | Exact match: [7165, 0] |
| Hidden vector | get_hidden("hello") vs C++ get_sentence_vector | Max diff: 1.49e-08 |
| Output matrix | Raw binary parse vs Rust dot products | Max diff: 4.8e-07 |
| HS tree structure | Python rebuild from same counts | All 351 nodes match |
| Predictions | Top-k labels and scores | 100% match (97/97 texts) |
5.2 End-to-End Accuracy
Tested on 97 hand-crafted sentences across 25+ languages:
| Metric | Value |
|---|---|
| Rust accuracy | 93/97 (95.9%) |
| Python C++ accuracy | 93/97 (95.9%) |
| Rust vs Python label match | 97/97 (100.0%) |
Languages tested: vi, en, fr, de, es, ru, ja, zh, ko, ar, pt, it, tr, pl, nl, sv, el, cs, hu, fi, he, th, id, hi, uk, ro, da, no, ca, bg, fa.
The 4 misclassified cases are inherent model errors (both Rust and Python produce the same wrong label), not implementation bugs.
6. Performance
6.1 Library Comparison
Benchmarked against all major Python FastText libraries using lid.176.ftz on 16 multilingual sentences:
| Library | Type | Load (ms) | Avg Latency (us) | Throughput (pred/s) | vs Rust |
|---|---|---|---|---|---|
| underthesea_core | Rust/PyO3 | 51.8 | 8.3 | 110,001 | 1.00x |
| fasttext-langdetect | C++ wrapper | 0.0* | 8.9 | 89,038 | 0.81x |
| fast-langdetect | C++ wrapper | 0.0* | 14.6 | 57,399 | 0.52x |
| fasttext-predict | C++ stripped | 29.9 | 17.1 | 51,493 | 0.47x |
| fasttext-wheel | C++ full | 28.4 | 19.2 | 45,547 | 0.41x |
* Wrappers keep model loaded globally, so load = 0 after warmup.
Libraries tested:
| Package | Version | Description |
|---|---|---|
| underthesea_core | 3.3.0 | Pure Rust FastText inference (this project) |
| fasttext-predict | 0.9.2.4 | C++ predict-only fork, no numpy, <1MB wheel |
| fasttext-wheel | 0.9.2 | Full Facebook C++ fasttext with numpy/pybind11 |
| fast-langdetect | 1.0.0 | Wrapper around fasttext-predict, bundles lid.176.ftz |
| fasttext-langdetect | 1.0.5 | Wrapper around full fasttext |
6.2 Prediction Latency by Input
Median latency over 500 runs per sentence (top-3 prediction):
| Input | Rust (us) | C++ fasttext-predict (us) | Speedup |
|---|---|---|---|
| "hello" (5 chars) | 3.0 | 6.3 | 2.10x |
| Vietnamese, medium (36 chars) | 6.2 | 13.0 | 2.10x |
| English, medium (44 chars) | 6.7 | 14.7 | 2.19x |
| French, medium (49 chars) | 7.2 | 15.2 | 2.11x |
| Chinese (10 chars) | 4.6 | 10.7 | 2.33x |
| Japanese (11 chars) | 4.0 | 8.3 | 2.08x |
| Vietnamese, long (185 chars) | 29.1 | 59.3 | 2.04x |
| Average | 8.3 | 17.1 | 2.06x |
6.3 Prediction Verification
All implementations produce identical top-1 predictions on 16 test sentences:
| underthesea_core vs | fasttext-predict | fasttext-wheel |
|---|---|---|
| Top-1 match rate | 16/16 (100%) | 16/16 (100%) |
Note: fast-langdetect and fasttext-langdetect show 15/16 match because they default to the larger lid.176.bin model instead of .ftz.
6.4 Model Loading
| Implementation | Load Time |
|---|---|
| fasttext-predict (C++) | 29.9 ms |
| fasttext-wheel (C++) | 28.4 ms |
| underthesea_core (Rust) | 51.8 ms |
Model loading is slower in Rust due to element-by-element float parsing (vs C++ bulk memcpy). This is a one-time cost and does not affect prediction performance.
7. Bugs Found and Fixed
During development, three bugs were identified and fixed:
Bug 1: EOS Character N-grams
Symptom: 0% accuracy - all predictions wrong.
Root cause: Rust included character n-grams for the </s> token (e.g., <, </, </s, s>, >), but C++ initNgrams() has an explicit if (words_[i].word != EOS) check that skips n-gram computation for EOS.
Fix: Added *token != "</s>" guard before compute_char_ngrams().
Bug 2: Probability Space vs Log Space
Symptom: 0% accuracy after fixing Bug 1.
Root cause: The HS DFS used probability-space multiplication (score * sigmoid(f)) with an arbitrary 0.001 pruning threshold. C++ uses log-space addition (score + log(sigmoid(f) + 1e-5)) with proper min-heap pruning.
Fix: Switched to log-space scoring with std_log(x) = ln(x + 1e-5) and BinaryHeap<Reverse<...>> min-heap.
Bug 3: Left/Right Sigmoid Swap
Symptom: 0% accuracy after fixing Bugs 1 and 2. Hidden vectors and output matrix verified identical.
Root cause: The DFS assigned std_log(sigmoid(f)) to the left child and std_log(1 - sigmoid(f)) to the right child. The actual C++ v0.9.2 code (loss.cc) does the opposite: the left child gets std_log(1 - sigmoid(f)) and the right child gets std_log(sigmoid(f)).
Fix: Swapped the left/right assignments to match C++.
Discovery method: Fetched the exact C++ source from github.com/facebookresearch/fastText/v0.9.2/src/loss.cc and compared line by line.
8. Binary Format
The .ftz file format (FastText quantized model, version 12):
```
[Header]
  magic: i32   = 0x2F4F16BA
  version: i32 = 12
[Args]
  dim, ws, epoch, minCount, neg, wordNgrams: i32 x 6
  loss, model, bucket, minn, maxn, lrUpdateRate: i32 x 6
  t: f64
[Dictionary]
  size, nwords, nlabels: i32 x 3
  ntokens: i64
  pruneidx_size: i64
  entries[size]: { null-terminated string, count: i64, type: i8 }
  pruneidx[pruneidx_size]: { key: i32, value: i32 }
[Input Matrix - Quantized]
  quant_input: bool (1 byte)
  qnorm: bool
  rows, cols: i64 x 2
  codesize: i32
  codes: u8[codesize]
  PQ: { dim, nsubq, dsub, lastdsub: i32 x 4, centroids: f32[nsubq * 256 * dsub] }
  norm_codes: u8[rows] (if qnorm)
  norm_PQ: { ... } (if qnorm)
[Output Matrix - Dense]
  quant_output: bool (1 byte)
  rows, cols: i64 x 2
  data: f32[rows * cols]
```
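For instance, the header fields can be read with Python's struct module. A minimal sketch, assuming all fields are little-endian as in the layout above; `read_header` is an illustrative helper, not the module's API:

```python
import io
import struct

FASTTEXT_MAGIC = 0x2F4F16BA   # 793712314

def read_header(f) -> tuple[int, int]:
    """Read and validate the magic number and format version (two i32s)."""
    magic, version = struct.unpack("<ii", f.read(8))
    if magic != FASTTEXT_MAGIC:
        raise ValueError(f"not a FastText model (magic={magic:#x})")
    return magic, version
```

The same pattern extends to the rest of the layout: fixed-size fields via struct format strings, length-prefixed arrays via a size field followed by a sized read.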
9. Dependencies
Rust (underthesea_core)
| Crate | Version | Purpose |
|---|---|---|
| pyo3 | 0.25.1 | Python bindings |
| byteorder | 1.5 | Little-endian binary I/O |
| (std only) | - | BinaryHeap, HashMap, BufReader |
No additional crates were added for the FastText module.
Python (radar-1)
| Package | Version | Purpose |
|---|---|---|
| underthesea | >= 9.2.9 | NLP ecosystem integration |
| underthesea_core | >= 3.3.0 | Rust FastText inference |
10. Future Work
- Batch prediction API (predict_batch(texts: Vec<String>))
- SIMD-accelerated dot products for further speedup
- Bulk read_exact for dense matrix loading to improve load time
- Softmax loss function models (currently only HS is tested)