Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,62 @@
---
license: mit
tags:
- onnx
- function-calling
- needle
- cactus
- browser
- sentencepiece
base_model: Cactus-Compute/needle
library_name: onnxruntime
---

# Needle: ONNX export for in-browser inference

Browser-ready ONNX export of [Cactus-Compute/needle](https://huggingface.co/Cactus-Compute/needle), a 26M-parameter function-calling model. It is designed to run entirely client-side via `onnxruntime-web` (WASM backend), with no server required.

## Files

| File | Description | Size |
|---|---|---|
| `encoder.onnx` | Needle encoder. Input `input_ids:(B,T)`, output `encoder_out:(B,T,512)`. Single-pass. | ~55 MB |
| `decoder_step.onnx` | One decoder step with explicit past-KV in / present-KV out. Run in a JS loop. | ~85 MB |
| `needle.model` | SentencePiece BPE protobuf (vocab=8192, `byte_fallback=True`, `identity` normalization). Loadable by `sentencepiece-js` / `@huggingface/transformers`. | 125 KB |
| `tokenizer-specials.json` | `{"pad":0,"eos":1,"bos":2,"tool_call":4,"tools":5}` | tiny |
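
The special-token map in the table above is small enough to validate on load. The following is a minimal sketch, assuming the file contains exactly the five keys shown; `NeedleSpecials` and `parseSpecials` are hypothetical names introduced here for illustration, not part of the published artifacts.

```typescript
// Shape of tokenizer-specials.json as listed in the Files table.
// (Hypothetical type name; the artifact itself is plain JSON.)
interface NeedleSpecials {
  pad: number;
  eos: number;
  bos: number;
  tool_call: number;
  tools: number;
}

const SPECIAL_KEYS = ["pad", "eos", "bos", "tool_call", "tools"] as const;

// Parse the JSON text and check that every expected key is a
// non-negative integer token id. Throws on a missing or malformed key.
function parseSpecials(jsonText: string): NeedleSpecials {
  const raw = JSON.parse(jsonText) as Record<string, unknown>;
  const out: Partial<NeedleSpecials> = {};
  for (const key of SPECIAL_KEYS) {
    const value = raw[key];
    if (typeof value !== "number" || !Number.isInteger(value) || value < 0) {
      throw new Error(`tokenizer-specials.json: bad or missing "${key}"`);
    }
    out[key] = value;
  }
  return out as NeedleSpecials;
}

// Usage with the exact content shown in the table.
const specials = parseSpecials('{"pad":0,"eos":1,"bos":2,"tool_call":4,"tools":5}');
```

In a browser the JSON text would typically come from `fetch(...)` followed by `.text()`; the sketch takes a string so it stays independent of how the file is served.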

## Origin

The upstream Cactus Needle is implemented in **JAX/Flax**, not PyTorch, so `torch.onnx.export` cannot run against the upstream model directly. This ONNX export was produced via a "port-and-copy" pipeline:

1. Reimplemented the Simple Attention Network in PyTorch (parameterized by `TransformerConfig`)
2. Copied weights tensor-by-tensor from the upstream Flax checkpoint (handling Flax `(in, out)` → PyTorch `(out, in)` transposition for Linear kernels and the `nn.scan` layer-stacking convention)
3. Verified Flax↔PyTorch parity at `<1e-3` max-abs-diff
4. Exported encoder + decoder-step to ONNX via legacy TorchScript-based `torch.onnx.export`
5. Verified PyTorch↔ONNX parity at `<1e-3` max-abs-diff
6. Verified end-to-end: Cactus's native `generate()` and a hand-rolled `onnxruntime` KV-cache loop produce **byte-identical** output token sequences
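
The layout change in step 2 can be illustrated on flat row-major buffers. This is a sketch of the idea only, not the conversion script used for this export (which ran in Python against Flax and PyTorch). It assumes the scan-stacked kernel has shape `(layers, in, out)` with the layer axis first; `transpose2d` and `unstackAndTranspose` are hypothetical helper names.

```typescript
// Transpose a row-major (rows, cols) matrix into a row-major (cols, rows) matrix.
function transpose2d(src: Float32Array, rows: number, cols: number): Float32Array {
  if (src.length !== rows * cols) throw new Error("shape mismatch");
  const dst = new Float32Array(src.length);
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      dst[c * rows + r] = src[r * cols + c];
    }
  }
  return dst;
}

// Split a scan-stacked kernel of shape (layers, in, out) into one kernel per
// layer, each transposed from the Flax (in, out) layout to the PyTorch
// (out, in) layout. Assumption: the layer axis is the leading axis.
function unstackAndTranspose(
  stacked: Float32Array,
  layers: number,
  inDim: number,
  outDim: number,
): Float32Array[] {
  const perLayer = inDim * outDim;
  if (stacked.length !== layers * perLayer) throw new Error("shape mismatch");
  const result: Float32Array[] = [];
  for (let l = 0; l < layers; l++) {
    const slice = stacked.subarray(l * perLayer, (l + 1) * perLayer);
    result.push(transpose2d(slice, inDim, outDim));
  }
  return result;
}

// Toy example: 2 layers, in=2, out=3, values 0..11.
const toyStacked = Float32Array.from({ length: 12 }, (_, i) => i);
const toyKernels = unstackAndTranspose(toyStacked, 2, 2, 3);
```

For the toy input, layer 0 goes from `[[0,1,2],[3,4,5]]` to `[[0,3],[1,4],[2,5]]`, which is the same element reordering a Linear kernel undergoes when moved between the two frameworks.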

## Parity numbers (against Cactus's native `generate(constrained=False)`)

| Stage | max-abs-diff |
|---|---|
| Flax encoder ↔ PyTorch port | 0.000010 |
| Flax decoder step-0 ↔ PyTorch port | 0.000029 |
| PyTorch encoder ↔ ONNX | 0.000004 |
| PyTorch decoder step ↔ ONNX | 0.000014 (logits) |
| End-to-end token sequence | byte-identical |

Example: `query="set a 5 min timer"` produces `' [{"name":"set_timer","arguments":{"time_human":"5 minutes"}}]'` both in Cactus's native implementation and in the browser using these artifacts.
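
The max-abs-diff metric in the table above is the largest element-wise absolute difference between two output tensors. A minimal sketch of that check follows, useful for re-verifying parity against a reference dump in the browser; `maxAbsDiff` and `withinTolerance` are hypothetical names, and the `1e-3` default mirrors the threshold quoted in the Origin section.

```typescript
// Largest element-wise absolute difference between two equally sized buffers.
// Returns NaN if any difference is NaN, so a broken tensor cannot pass silently.
function maxAbsDiff(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) {
    throw new Error(`length mismatch: ${a.length} vs ${b.length}`);
  }
  let worst = 0;
  for (let i = 0; i < a.length; i++) {
    const d = Math.abs(a[i] - b[i]);
    if (Number.isNaN(d)) return NaN;
    if (d > worst) worst = d;
  }
  return worst;
}

// True when the two buffers agree within the given tolerance.
function withinTolerance(
  a: ArrayLike<number>,
  b: ArrayLike<number>,
  tol: number = 1e-3,
): boolean {
  return maxAbsDiff(a, b) < tol;
}
```

A comparison against NaN is always false, so `withinTolerance` also rejects buffers that contain NaN.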

## Usage in the browser

Load both `.onnx` files via `onnxruntime-web` (WASM backend) and `needle.model` via `sentencepiece-js`. Run the encoder once, then call the decoder step in a JS loop, passing the KV cache from each step's outputs into the next step's inputs.
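
The loop described above can be sketched independently of the runtime. The example below is a hedged illustration, not the reference client: it assumes greedy (argmax) decoding, assumes decoding starts from the `bos` id and stops at `eos`, and hides the actual `onnxruntime-web` call behind a hypothetical `DecoderStep` callback, because the decoder graph's KV input and output names are not listed in this README.

```typescript
// Result of one decoder step: next-token logits plus the updated KV cache.
interface StepResult<Cache> {
  logits: Float32Array;
  cache: Cache;
}

// Hypothetical wrapper type. In a real client this would feed the token,
// the past-KV tensors, and encoder_out to decoder_step.onnx and return the
// logits together with the present-KV tensors.
type DecoderStep<Cache> = (token: number, cache: Cache) => Promise<StepResult<Cache>>;

// Index of the largest logit.
function argmax(v: Float32Array): number {
  if (v.length === 0) throw new Error("empty logits");
  let best = 0;
  for (let i = 1; i < v.length; i++) {
    if (v[i] > v[best]) best = i;
  }
  return best;
}

// Greedy decoding loop: the cache returned by each step is passed to the next.
// Assumptions: start token is bos, generation stops at eos or maxNewTokens.
async function greedyDecode<Cache>(
  step: DecoderStep<Cache>,
  initialCache: Cache,
  bos: number,
  eos: number,
  maxNewTokens: number,
): Promise<number[]> {
  const tokens: number[] = [];
  let token = bos;
  let cache = initialCache;
  for (let i = 0; i < maxNewTokens; i++) {
    const result = await step(token, cache);
    cache = result.cache;
    token = argmax(result.logits);
    if (token === eos) break;
    tokens.push(token); // a streaming UI could emit the token here
  }
  return tokens;
}

// Stand-in step for demonstration only: the "cache" is a step counter and the
// logits follow a fixed script (3, 2, then eos=1) over a 4-token vocabulary.
const demoStep: DecoderStep<number> = async (_token, stepIndex) => {
  const script = [3, 2, 1];
  const logits = new Float32Array(4);
  logits[script[Math.min(stepIndex, script.length - 1)]] = 1;
  return { logits, cache: stepIndex + 1 };
};

const demoTokens: Promise<number[]> = greedyDecode(demoStep, 0, 2, 1, 10);
```

Keeping the step behind a callback means the same loop works with any cache representation, for example a record of `ort.Tensor` objects keyed by the graph's KV names.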

## Architecture

Per the upstream model card: encoder-decoder "Simple Attention Network", d_model=512, GQA 8/4 heads, 12 encoder layers, 8 decoder layers, no FFN, ZCRMSNorm (`(1+γ)·x/RMS(x)`, γ initialized to zero), RoPE on Q and K.
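
Two of the components named above are easy to state in code. The sketch below implements the quoted ZCRMSNorm formula for a single vector and a query-to-KV head mapping for GQA. Several details are assumptions not confirmed by this README: the `eps` term (the quoted formula has none), reading "8/4" as 8 query heads and 4 KV heads, and consecutive query heads sharing a KV head. `zcRmsNorm` and `kvHeadFor` are hypothetical names.

```typescript
// ZCRMSNorm over one vector: (1 + gamma) * x / RMS(x).
// With gamma initialized to zero, the layer starts as a plain RMS normalization.
// eps is an assumption added for numerical safety; pass 0 to match the
// formula exactly as quoted.
function zcRmsNorm(x: Float32Array, gamma: Float32Array, eps: number = 1e-6): Float32Array {
  if (x.length !== gamma.length) throw new Error("x and gamma must have equal length");
  let sumSq = 0;
  for (let i = 0; i < x.length; i++) sumSq += x[i] * x[i];
  const rms = Math.sqrt(sumSq / x.length + eps);
  const out = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) {
    out[i] = ((1 + gamma[i]) * x[i]) / rms;
  }
  return out;
}

// GQA head mapping, assuming 8 query heads share 4 KV heads in consecutive
// groups (query heads 0 and 1 use KV head 0, heads 2 and 3 use KV head 1, ...).
function kvHeadFor(queryHead: number, numQueryHeads: number = 8, numKvHeads: number = 4): number {
  if (numQueryHeads % numKvHeads !== 0) throw new Error("heads must divide evenly");
  if (queryHead < 0 || queryHead >= numQueryHeads) throw new Error("query head out of range");
  return Math.floor(queryHead / (numQueryHeads / numKvHeads));
}

// Example: x = [3, 4] has RMS sqrt((9 + 16) / 2) = sqrt(12.5).
const normed = zcRmsNorm(new Float32Array([3, 4]), new Float32Array([0, 0]), 0);
```

Under these assumptions the KV cache holds 4 heads per layer rather than 8, which halves the cache tensors passed through the decoder step.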

The decoder is exported as a **single step** with past/present KV as graph I/O. The JS side calls it in a loop, which allows streaming token output and avoids ONNX symbolic control flow.

## License

MIT, matching the upstream Cactus Needle license.