
# Radar-1 Technical Report

## 1. Overview

Radar-1 is a language detection module for the underthesea NLP ecosystem. It provides fast, accurate language identification by reimplementing Facebook's FastText inference pipeline in pure Rust with PyO3 Python bindings, integrated into the underthesea_core native extension.

Key results:

| Metric | Value |
|---|---|
| Accuracy (97 test cases, 25+ languages) | 95.9% |
| Prediction match vs C++ fasttext | 100% |
| Batch throughput | 110,001 predictions/sec |
| vs fasttext-predict (C++ stripped) | 2.14x faster |
| vs fasttext-wheel (C++ full) | 2.42x faster |

## 2. Model

Radar-1 uses the pre-trained FastText language identification model lid.176.ftz, which supports 176 languages.

| Parameter | Value |
|---|---|
| Architecture | FastText supervised |
| Loss function | Hierarchical softmax |
| Dimensions | 16 |
| Vocabulary | 7,235 words + 176 labels |
| Character n-grams | minn=2, maxn=4 |
| Hash buckets | 2,000,000 |
| Format | .ftz (quantized with Product Quantization) |
| File size | ~917 KB |

## 3. Architecture

### 3.1 Inference Pipeline

```text
Input Text
    |
    v
Tokenization (whitespace split)
    |
    v
Feature Extraction
  - Word IDs (vocabulary lookup via FNV-1a hash table)
  - Character n-grams (minn=2..maxn=4, with <BOW>/<EOW> markers)
  - Word n-grams (bigrams via hash)
  - EOS token (</s>)
    |
    v
Input Embedding (average of feature vectors)
  - Quantized input matrix (Product Quantization)
  - 50,000 rows x 16 dims, 8 sub-quantizers, 256 centroids each
    |
    v
Prediction (Hierarchical Softmax)
  - Huffman tree with 176 leaves (labels) and 175 internal nodes
  - DFS traversal with log-space scoring and min-heap pruning
  - Dense output matrix (176 x 16)
    |
    v
Top-k (label, score) pairs
```

### 3.2 Rust Module Structure

The implementation lives in `underthesea_core/src/fasttext/` with six files:

| File | Lines | Responsibility |
|---|---|---|
| `hash.rs` | ~40 | FNV-1a hash with C++ sign-extension compatibility |
| `args.rs` | ~100 | Model hyperparameter deserialization (magic, version, 12 args) |
| `dictionary.rs` | ~240 | Vocabulary; word, char, and word n-gram feature extraction; prune index |
| `matrix.rs` | ~305 | Dense matrix + Product Quantization (PQ) matrix with norm PQ |
| `inference.rs` | ~285 | Hierarchical softmax (Huffman tree, DFS), softmax, sigmoid |
| `mod.rs` | ~200 | `FastTextModel` struct: load .bin/.ftz, predict, public API |

Python bindings are exposed via PyO3 in `lib.rs`:

```python
from underthesea_core import FastText

model = FastText.load("lid.176.ftz")
results = model.predict("Xin chao Viet Nam", k=3)
# [("vi", 0.98), ("id", 0.005), ("ms", 0.003)]
```

## 4. Implementation Details

### 4.1 FNV-1a Hash (C++ Compatibility)

FastText uses FNV-1a hashing for vocabulary lookup and n-gram bucket assignment. A critical compatibility detail: C++ `char` is signed on most platforms, so bytes >= 0x80 are sign-extended when cast to `uint32_t`. The Rust implementation matches this:

```rust
h ^= byte as i8 as u32;  // sign-extend: 0xFF -> 0xFFFFFFFF
```
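For illustration, the same hash with its sign-extension quirk can be sketched in Python (a toy reimplementation for reference, not the shipping Rust code):

```python
def fnv1a_signed(s: str) -> int:
    """FNV-1a over UTF-8 bytes, replicating C++'s signed-char sign extension."""
    h = 2166136261                       # FNV-1a 32-bit offset basis
    for b in s.encode("utf-8"):
        if b >= 0x80:                    # a C++ signed char is negative here,
            b |= 0xFFFFFF00              # so it sign-extends when cast to uint32_t
        h ^= b
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, wrapping as a u32
    return h

# ASCII input matches plain FNV-1a; bytes >= 0x80 diverge from the unsigned variant.
print(hex(fnv1a_signed("xin chào")))
```

For pure-ASCII strings this is identical to standard 32-bit FNV-1a; only multi-byte UTF-8 input exercises the sign-extension path.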

### 4.2 Product Quantization

The .ftz format uses Product Quantization to compress the input embedding matrix:

- The 50,000 x 16 input matrix is split into 8 sub-quantizers of dimension 2
- Each sub-quantizer has 256 centroids (KSUB=256)
- Row norms are separately quantized (qnorm=true)
- Lookup: for each row, 8 code bytes select centroids, which are concatenated and scaled by the norm
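The lookup can be illustrated with a small NumPy sketch using the shapes above; the random codes, centroids, and norms here are toy stand-ins for what the .ftz file actually stores:

```python
import numpy as np

NSUBQ, DSUB, KSUB = 8, 2, 256    # 8 sub-quantizers of dim 2, 256 centroids each
ROWS = 50_000                    # input-matrix rows (words + n-gram buckets)

rng = np.random.default_rng(0)
centroids = rng.standard_normal((NSUBQ, KSUB, DSUB)).astype(np.float32)
codes = rng.integers(0, KSUB, size=(ROWS, NSUBQ), dtype=np.uint8)  # 8 code bytes/row
norms = rng.random(ROWS).astype(np.float32)                        # dequantized norms

def pq_row(i: int) -> np.ndarray:
    """Reconstruct row i: each code byte selects one centroid; the selected
    centroids are concatenated and scaled by the row norm (qnorm=true)."""
    parts = [centroids[q, codes[i, q]] for q in range(NSUBQ)]
    return norms[i] * np.concatenate(parts)

vec = pq_row(3)       # a reconstructed 16-dimensional embedding row
```

Storage drops from 16 floats (64 bytes) to 8 code bytes plus one norm code per row, which is what makes the ~917 KB .ftz possible.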

### 4.3 Hierarchical Softmax

Correct HS prediction requires matching three C++ behaviors:

1. **Log-space scoring:** the DFS accumulates `score + log(p + 1e-5)` rather than multiplying probabilities, preventing numerical underflow on deep tree paths.

2. **Left/right sigmoid assignment:** in C++ FastText v0.9.2 (`loss.cc`), the left child receives `std_log(1 - sigmoid(f))` and the right child receives `std_log(sigmoid(f))`. This is counterintuitive and opposite to some documentation.

3. **Min-heap pruning:** a `BinaryHeap<Reverse<...>>` maintains the top-k scores. Branches whose score falls below the current minimum are pruned, exploiting the monotonically decreasing property of log-space scores.
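The three behaviors combine as in this Python sketch of the traversal (a toy tree of nested dicts; in the real model each internal node's score `f` is a dot product between the hidden vector and an output-matrix row):

```python
import heapq
import math

def std_log(x: float) -> float:          # C++ std_log: ln(x + 1e-5)
    return math.log(x + 1e-5)

def sigmoid(f: float) -> float:
    return 1.0 / (1.0 + math.exp(-f))

def predict(tree: dict, k: int):
    heap = []                            # min-heap: worst kept score on top
    def dfs(node: dict, score: float) -> None:
        # Prune: log-space scores only decrease down the tree, so once the
        # heap is full, a branch below its minimum can never reach the top-k.
        if len(heap) == k and score < heap[0][0]:
            return
        if "label" in node:              # leaf: a candidate label
            heapq.heappush(heap, (score, node["label"]))
            if len(heap) > k:
                heapq.heappop(heap)
            return
        f = node["f"]                    # raw score at this internal node
        dfs(node["left"],  score + std_log(1.0 - sigmoid(f)))  # left: 1 - sigmoid
        dfs(node["right"], score + std_log(sigmoid(f)))        # right: sigmoid
    dfs(tree, 0.0)
    return sorted(((math.exp(s), lbl) for s, lbl in heap), reverse=True)

# Toy tree: 3 labels, 2 internal nodes.
tree = {"f": 2.0,
        "left": {"label": "a"},
        "right": {"f": 0.0, "left": {"label": "b"}, "right": {"label": "c"}}}
top2 = predict(tree, 2)
```

Scores are converted back to probabilities with `exp` only at the end, after pruning has been done entirely in log space.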

### 4.4 EOS Token Handling

C++ FastText's `initNgrams()` explicitly skips character n-gram computation for the `</s>` (EOS) token. Only the EOS word ID is included as a feature, not its character n-grams (`<`, `</`, `</s`, etc.). This is replicated in `dictionary.rs`:

```rust
if self.minn > 0 && *token != "</s>" {
    self.compute_char_ngrams(token, &mut features);
}
```
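The surrounding n-gram extraction can be sketched in Python; iterating a Python `str` walks code points, much as FastText walks UTF-8 characters. This sketch only enumerates the n-gram strings (the real code hashes each one into a bucket):

```python
def char_ngrams(token: str, minn: int = 2, maxn: int = 4) -> list[str]:
    """All length-minn..maxn substrings of '<' + token + '>'."""
    if token == "</s>":              # C++ initNgrams() skips EOS entirely
        return []
    w = f"<{token}>"                 # <BOW>/<EOW> markers
    return [w[i:i + n]
            for i in range(len(w))
            for n in range(minn, maxn + 1)
            if i + n <= len(w)]

print(char_ngrams("hi"))    # ['<h', '<hi', '<hi>', 'hi', 'hi>', 'i>']
print(char_ngrams("</s>"))  # []
```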

### 4.5 Prune Index

Quantized .ftz models use a prune index (`pruneidx`) to remap n-gram hash buckets to a smaller set of active rows in the compressed input matrix. N-grams whose bucket hash is not in `pruneidx` are silently dropped.
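A minimal sketch of the remap, with toy bucket ids; the `NWORDS` offset reflects the assumption that n-gram rows follow the word rows in the input matrix:

```python
# Toy prune index: surviving bucket id -> compressed-row index (illustrative values).
pruneidx = {104_593: 0, 900_112: 1, 1_998_241: 2}
NWORDS = 7235                       # word rows come first in the input matrix

def remap_ngram(bucket: int):
    """Matrix row for an n-gram bucket, or None if the bucket was pruned."""
    idx = pruneidx.get(bucket)
    return None if idx is None else NWORDS + idx

print(remap_ngram(104_593))   # 7235
print(remap_ngram(42))        # None: pruned n-grams are silently dropped
```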

## 5. Correctness Verification

### 5.1 Component-Level Verification

Each pipeline stage was verified against C++ FastText (fasttext-wheel 0.9.2):

| Component | Method | Result |
|---|---|---|
| FNV-1a hash | Compared with compiled C program output | Exact match |
| Feature IDs | `get_features("hello")` vs C++ `get_subwords` | Exact match: [7165, 0] |
| Hidden vector | `get_hidden("hello")` vs C++ `get_sentence_vector` | Max diff: 1.49e-08 |
| Output matrix | Raw binary parse vs Rust dot products | Max diff: 4.8e-07 |
| HS tree structure | Python rebuild from same counts | All 351 nodes match |
| Predictions | Top-k labels and scores | 100% match (97/97 texts) |

### 5.2 End-to-End Accuracy

Tested on 97 hand-crafted sentences across 25+ languages:

| Metric | Value |
|---|---|
| Rust accuracy | 93/97 (95.9%) |
| Python C++ accuracy | 93/97 (95.9%) |
| Rust vs Python label match | 97/97 (100.0%) |

Languages tested: vi, en, fr, de, es, ru, ja, zh, ko, ar, pt, it, tr, pl, nl, sv, el, cs, hu, fi, he, th, id, hi, uk, ro, da, no, ca, bg, fa.

The 4 misclassified cases are inherent model errors (both Rust and Python produce the same wrong label), not implementation bugs.

## 6. Performance

### 6.1 Library Comparison

Benchmarked against all major Python FastText libraries using lid.176.ftz on 16 multilingual sentences:

| Library | Type | Load (ms) | Avg latency (µs) | Throughput (pred/s) | vs Rust |
|---|---|---|---|---|---|
| underthesea_core | Rust/PyO3 | 51.8 | 8.3 | 110,001 | 1.00x |
| fasttext-langdetect | C++ wrapper | 0.0* | 8.9 | 89,038 | 0.81x |
| fast-langdetect | C++ wrapper | 0.0* | 14.6 | 57,399 | 0.52x |
| fasttext-predict | C++ stripped | 29.9 | 17.1 | 51,493 | 0.47x |
| fasttext-wheel | C++ full | 28.4 | 19.2 | 45,547 | 0.41x |

\* Wrappers keep the model loaded globally, so load time is 0 after warmup.

Libraries tested:

| Package | Version | Description |
|---|---|---|
| underthesea_core | 3.3.0 | Pure Rust FastText inference (this project) |
| fasttext-predict | 0.9.2.4 | C++ predict-only fork, no numpy, <1MB wheel |
| fasttext-wheel | 0.9.2 | Full Facebook C++ fasttext with numpy/pybind11 |
| fast-langdetect | 1.0.0 | Wrapper around fasttext-predict, bundles lid.176.ftz |
| fasttext-langdetect | 1.0.5 | Wrapper around full fasttext |

### 6.2 Prediction Latency by Input

Median latency over 500 runs per sentence (top-3 prediction):

| Input | Rust (µs) | C++ fasttext-predict (µs) | Speedup |
|---|---|---|---|
| "hello" (5 chars) | 3.0 | 6.3 | 2.10x |
| Vietnamese, medium (36 chars) | 6.2 | 13.0 | 2.10x |
| English, medium (44 chars) | 6.7 | 14.7 | 2.19x |
| French, medium (49 chars) | 7.2 | 15.2 | 2.11x |
| Chinese (10 chars) | 4.6 | 10.7 | 2.33x |
| Japanese (11 chars) | 4.0 | 8.3 | 2.08x |
| Vietnamese, long (185 chars) | 29.1 | 59.3 | 2.04x |
| **Average** | **8.3** | **17.1** | **2.06x** |

### 6.3 Prediction Verification

All implementations produce identical top-1 predictions on 16 test sentences:

| | underthesea_core | fasttext-predict | fasttext-wheel |
|---|---|---|---|
| Match rate | – | 16/16 (100%) | 16/16 (100%) |

Note: fast-langdetect and fasttext-langdetect show 15/16 match because they default to the larger lid.176.bin model instead of .ftz.

### 6.4 Model Loading

| Implementation | Load time |
|---|---|
| fasttext-predict (C++) | 29.9 ms |
| fasttext-wheel (C++) | 28.4 ms |
| underthesea_core (Rust) | 51.8 ms |

Model loading is slower in Rust due to element-by-element float parsing (vs C++ bulk memcpy). This is a one-time cost and does not affect prediction performance.

## 7. Bugs Found and Fixed

During development, three bugs were identified and fixed:

### Bug 1: EOS Character N-grams

**Symptom:** 0% accuracy; all predictions wrong.

**Root cause:** Rust included character n-grams for the `</s>` token (e.g., `<`, `</`, `</s`, `s>`, `>`), but C++ `initNgrams()` has an explicit `if (words_[i].word != EOS)` check that skips n-gram computation for EOS.

**Fix:** Added a `*token != "</s>"` guard before `compute_char_ngrams()`.

### Bug 2: Probability Space vs Log Space

**Symptom:** 0% accuracy after fixing Bug 1.

**Root cause:** The HS DFS used probability-space multiplication (`score * sigmoid(f)`) with an arbitrary 0.001 pruning threshold. C++ uses log-space addition (`score + log(sigmoid(f) + 1e-5)`) with proper min-heap pruning.

**Fix:** Switched to log-space scoring with `std_log(x) = ln(x + 1e-5)` and a `BinaryHeap<Reverse<...>>` min-heap.

### Bug 3: Left/Right Sigmoid Swap

**Symptom:** 0% accuracy after fixing Bugs 1 and 2, even though the hidden vectors and output matrix were verified identical.

**Root cause:** The DFS assigned `std_log(sigmoid(f))` to the left child and `std_log(1 - sigmoid(f))` to the right child. The actual C++ v0.9.2 code (`loss.cc`) does the opposite: the left child gets `std_log(1 - sigmoid(f))` and the right child gets `std_log(sigmoid(f))`.

**Fix:** Swapped the left/right assignments to match C++.

**Discovery method:** Fetched the exact C++ source from github.com/facebookresearch/fastText/v0.9.2/src/loss.cc and compared line by line.

## 8. Binary Format

The .ftz file format (FastText quantized model, version 12):

```text
[Header]
  magic:    i32 = 0x2F4F16BA
  version:  i32 = 12

[Args]
  dim, ws, epoch, minCount, neg, wordNgrams: i32 x 6
  loss, model, bucket, minn, maxn, lrUpdateRate: i32 x 6
  t: f64

[Dictionary]
  size, nwords, nlabels: i32 x 3
  ntokens: i64
  pruneidx_size: i64
  entries[size]: { null-terminated string, count: i64, type: i8 }
  pruneidx[pruneidx_size]: { key: i32, value: i32 }

[Input Matrix - Quantized]
  quant_input: bool (1 byte)
  qnorm: bool
  rows, cols: i64 x 2
  codesize: i32
  codes: u8[codesize]
  PQ: { dim, nsubq, dsub, lastdsub: i32 x 4, centroids: f32[nsubq * 256 * dsub] }
  norm_codes: u8[rows]  (if qnorm)
  norm_PQ: { ... }       (if qnorm)

[Output Matrix - Dense]
  quant_output: bool (1 byte)
  rows, cols: i64 x 2
  data: f32[rows * cols]
```
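A minimal Python sketch of parsing the header and args (little-endian throughout, field order as listed above). The dim/bucket/minn/maxn values below come from Section 2; the remaining args (ws, epoch, minCount, neg, wordNgrams, the loss/model codes, lrUpdateRate, t) are placeholder values for the round-trip:

```python
import io
import struct

FASTTEXT_MAGIC = 0x2F4F16BA  # = 793712314

def read_args(buf) -> dict:
    """Parse magic/version plus the 12 int32 args and the f64 `t`."""
    magic, version = struct.unpack("<ii", buf.read(8))
    if magic != FASTTEXT_MAGIC or version != 12:
        raise ValueError("not a FastText v12 model")
    names = ["dim", "ws", "epoch", "minCount", "neg", "wordNgrams",
             "loss", "model", "bucket", "minn", "maxn", "lrUpdateRate"]
    args = dict(zip(names, struct.unpack("<12i", buf.read(48))))
    (args["t"],) = struct.unpack("<d", buf.read(8))
    return args

# Round-trip with illustrative values (dim=16, bucket=2,000,000, minn=2, maxn=4):
raw = (struct.pack("<ii", FASTTEXT_MAGIC, 12)
       + struct.pack("<12i", 16, 5, 5, 1, 10, 2, 1, 1, 2_000_000, 2, 4, 100)
       + struct.pack("<d", 1e-4))
args = read_args(io.BytesIO(raw))
```

The Rust `args.rs` performs the equivalent reads with `byteorder`'s little-endian readers.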

## 9. Dependencies

### Rust (underthesea_core)

| Crate | Version | Purpose |
|---|---|---|
| pyo3 | 0.25.1 | Python bindings |
| byteorder | 1.5 | Little-endian binary I/O |
| (std only) | – | BinaryHeap, HashMap, BufReader |

No additional crates were added for the FastText module.

### Python (radar-1)

| Package | Version | Purpose |
|---|---|---|
| underthesea | >= 9.2.9 | NLP ecosystem integration |
| underthesea_core | >= 3.3.0 | Rust FastText inference |

## 10. Future Work

- Batch prediction API (`predict_batch(texts: Vec<String>)`)
- SIMD-accelerated dot products for further speedup
- Bulk `read_exact` for dense matrix loading to improve load time
- Softmax-loss models (currently only hierarchical softmax is tested)