onnx-community/needle-onnx

+---
+license: mit
+tags:
+- onnx
+- function-calling
+- needle
+- cactus
+- browser
+- sentencepiece
+base_model: Cactus-Compute/needle
+library_name: onnxruntime
+---
+# Needle — ONNX export for in-browser inference
+Browser-ready ONNX export of [Cactus-Compute/needle](https://huggingface.co/Cactus-Compute/needle), a 26M-parameter function-calling model. Designed to run entirely client-side via `onnxruntime-web` (WASM backend) — no server required.
+## Files
+| File | Description | Size |
+|---|---|---|
+| `encoder.onnx` | Needle encoder. Input `input_ids:(B,T)`, output `encoder_out:(B,T,512)`. Single-pass. | ~55 MB |
+| `decoder_step.onnx` | One decoder step with explicit past-KV in / present-KV out. Run in a JS loop. | ~85 MB |
+| `needle.model` | SentencePiece BPE protobuf (vocab=8192, `byte_fallback=True`, `identity` normalization). Loadable by `sentencepiece-js` / `@huggingface/transformers`. | 125 KB |
+| `tokenizer-specials.json` | Special-token IDs: `{"pad":0,"eos":1,"bos":2,"tool_call":4,"tools":5}` | < 1 KB |
+## Origin
+The upstream Cactus Needle is implemented in **JAX/Flax**, not PyTorch — `torch.onnx.export` cannot run against the upstream model directly. This ONNX export was produced via a "port-and-copy" pipeline:
+1. Reimplemented the Simple Attention Network in PyTorch (parametric on `TransformerConfig`)
+2. Copied weights tensor-by-tensor from the upstream Flax checkpoint (handling Flax `(in, out)` → PyTorch `(out, in)` transposition for Linear kernels and the `nn.scan` layer-stacking convention)
+3. Verified Flax↔PyTorch parity at `<1e-3` max-abs-diff
+4. Exported encoder + decoder-step to ONNX via legacy TorchScript-based `torch.onnx.export`
+5. Verified PyTorch↔ONNX parity at `<1e-3`
+6. Verified end-to-end: Cactus's native `generate()` and a hand-rolled `onnxruntime` KV-cache loop produce **byte-identical** output token sequences
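The layout conversion in step 2 can be sketched as follows. This is a minimal, dependency-free illustration on nested lists, not the pipeline's actual code: the function names are hypothetical, and it assumes the `nn.scan`-stacked kernel carries the layer axis first, i.e. shape `(num_layers, in, out)`. A real conversion would do the same on arrays or tensors.

```python
from typing import List

Matrix = List[List[float]]


def flax_kernel_to_torch_weight(kernel: Matrix) -> Matrix:
    """Transpose a Flax Dense kernel (in, out) into a PyTorch Linear weight (out, in)."""
    return [list(column) for column in zip(*kernel)]


def unstack_scanned_kernel(stacked: List[Matrix]) -> List[Matrix]:
    """Split an nn.scan-stacked kernel into one PyTorch-layout weight per layer.

    Assumption: the layer axis is the leading axis, (num_layers, in, out).
    """
    return [flax_kernel_to_torch_weight(layer) for layer in stacked]


# Toy stacked kernel: 2 layers, in=2, out=3 (values are made up for illustration).
stacked = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],
]
per_layer = unstack_scanned_kernel(stacked)
```

Each entry of `per_layer` is then a `(out, in)` matrix that could be copied into the matching layer of the PyTorch port.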
+## Parity numbers (against Cactus's native `generate(constrained=False)`)
+| Stage | max-abs-diff |
+|---|---|
+| Flax encoder ↔ PyTorch port | 0.000010 |
+| Flax decoder step-0 ↔ PyTorch port | 0.000029 |
+| PyTorch encoder ↔ ONNX | 0.000004 |
+| PyTorch decoder step ↔ ONNX | 0.000014 (logits) |
+| End-to-end token sequence | byte-identical |
+Example: the query `set a 5 min timer` produces `' [{"name":"set_timer","arguments":{"time_human":"5 minutes"}}]'` both from Cactus's native `generate()` and in the browser using these artifacts.
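For readers unfamiliar with the metric in the table, max-abs-diff is the largest element-wise absolute difference between two flattened outputs. The sketch below shows the computation and the `<1e-3` threshold check; the helper names and the toy values are illustrative assumptions, not measurements from the model.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flattened outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def check_parity(reference: Sequence[float], candidate: Sequence[float],
                 tolerance: float = 1e-3) -> bool:
    """True when the two outputs agree within the given tolerance."""
    return max_abs_diff(reference, candidate) < tolerance


# Toy values standing in for flattened outputs of two implementations.
reference_out = [0.12, -1.5, 3.25]
candidate_out = [0.12001, -1.50002, 3.25]
diff = max_abs_diff(reference_out, candidate_out)
within_tolerance = check_parity(reference_out, candidate_out)
```

The end-to-end row is a stricter check: the generated token ID lists are compared for exact equality rather than against a tolerance.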
+## Usage in the browser
+Load both `.onnx` files with `onnxruntime-web` (WASM backend) and `needle.model` with `sentencepiece-js`. Run the encoder once, then call the decoder step in a JS loop, feeding each step's present-KV output back in as the next step's past-KV input.
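The control flow of that loop can be sketched in Python, the language of the `onnxruntime` parity check described under Origin; the browser loop would follow the same shape in JS. This is a sketch under stated assumptions: `run_encoder` and `run_decoder_step` are hypothetical wrappers around the two ONNX sessions (the graphs' actual input and output names are not shown here), decoding is assumed to be greedy and to start from the `bos` ID, and the stand-in functions at the bottom exist only to make the example runnable.

```python
import json
from typing import Any, Callable, List, Optional, Sequence, Tuple

# Special-token IDs as listed for tokenizer-specials.json.
SPECIALS = json.loads('{"pad":0,"eos":1,"bos":2,"tool_call":4,"tools":5}')

# Hypothetical wrappers: in practice each would call an ONNX session's run().
EncoderFn = Callable[[Sequence[int]], Any]
DecoderStepFn = Callable[[int, Any, Optional[Any]], Tuple[Sequence[float], Any]]


def greedy_decode(input_ids: Sequence[int], run_encoder: EncoderFn,
                  run_decoder_step: DecoderStepFn,
                  max_new_tokens: int = 64) -> List[int]:
    encoder_out = run_encoder(input_ids)  # the encoder runs exactly once
    # Assumption: None stands for "no cache yet"; how the empty initial cache
    # is represented depends on the exported graph.
    past_kv = None
    token = SPECIALS["bos"]  # assumption: decoding starts from BOS
    generated: List[int] = []
    for _ in range(max_new_tokens):
        # Each step consumes the previous cache and returns the updated one.
        logits, past_kv = run_decoder_step(token, encoder_out, past_kv)
        token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if token == SPECIALS["eos"]:
            break
        generated.append(token)  # a streaming UI could emit the token here
    return generated


# Stand-ins so the sketch runs without model files.
def fake_encoder(ids: Sequence[int]) -> List[int]:
    return list(ids)


def fake_decoder_step(token: int, encoder_out: Any,
                      past_kv: Optional[int]) -> Tuple[List[float], int]:
    step = 0 if past_kv is None else past_kv  # the fake "cache" is a step counter
    script = [7, 9, SPECIALS["eos"]]
    logits = [0.0] * 16
    logits[script[step]] = 1.0
    return logits, step + 1


tokens = greedy_decode([2, 10, 11], fake_encoder, fake_decoder_step)
```

Passing the two sessions in as callables keeps the loop independent of the runtime, so the same logic can be exercised with stubs before wiring in real sessions.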
+## Architecture
+Per the upstream model card: an encoder-decoder "Simple Attention Network" with d_model=512, GQA (8 query heads, 4 KV heads), 12 encoder layers, 8 decoder layers, no FFN, ZCRMSNorm (`(1+γ)·x/RMS(x)`, with γ initialized to zero), and RoPE on Q and K.
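The ZCRMSNorm formula above can be worked through on a small vector. The function below is an illustrative sketch, not the upstream implementation: the name `zc_rms_norm` is hypothetical, and the `eps` stabiliser is an assumption that the model card does not mention.

```python
import math
from typing import List, Sequence


def zc_rms_norm(x: Sequence[float], gamma: Sequence[float],
                eps: float = 1e-6) -> List[float]:
    """Compute (1 + gamma) * x / RMS(x) element-wise over one feature vector.

    eps is an assumed numerical stabiliser, not taken from the model card.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [(1.0 + g) * v / rms for v, g in zip(x, gamma)]


x = [3.0, -4.0]
# With gamma at its zero initialization the scale factor is exactly 1,
# so the layer reduces to plain RMS normalization.
y_init = zc_rms_norm(x, gamma=[0.0, 0.0])
# A learned gamma of 1.0 doubles the normalized output.
y_scaled = zc_rms_norm(x, gamma=[1.0, 1.0])
```

Parameterizing the scale as `1+γ` with zero-initialized γ means the layer starts out as an unscaled RMS normalization.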
+The decoder is exported as a **single step** with past/present KV as graph I/O — the JS side calls it in a loop, allowing streaming token output and avoiding ONNX symbolic control flow.
+## License
+MIT, matching the upstream Cactus Needle license.