--- license: mit library_name: jax tags: - function-calling - tool-use - encoder-decoder - edge - on-device - jax - flax datasets: - Cactus-Compute/tool-calls --- # Needle A 26M parameter encoder-decoder transformer for on-device function calling, built on a "Simple Attention Network" architecture (no feedforward layers). Distilled from Gemini 3.1 Flash Lite. Runs at 6000 tok/s prefill and 1200 tok/s decode on [Cactus](https://github.com/cactus-compute/cactus). ## Model Details | | | |---|---| | Parameters | 26M | | Architecture | Encoder-decoder, pure attention (no FFN) | | Encoder | 12 layers, GQA (8H/4KV), RoPE, gated residuals | | Decoder | 8 layers, self-attn + cross-attn, gated residuals | | d_model | 512 | | Vocab | 8192 (SentencePiece BPE) | | Norm | ZCRMSNorm (zero-centered, init=0) | | Precision | bfloat16 (INT4 QAT during training) | | Pretraining | 200B tokens on 16x TPU v6e (27hrs) | | Post-training | 2B tokens of function call data (45mins) | ## Architecture No feedforward layers. Each encoder block is gated self-attention; each decoder block is gated self-attention + gated cross-attention. The only nonlinearities are softmax and sigmoid. See [Simple Attention Networks](https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md) for the full architectural breakdown. ## Quickstart ```bash git clone https://github.com/cactus-compute/needle.git cd needle && source ./setup needle ui ``` Opens a web UI at http://127.0.0.1:7860 where you can test and finetune on your own tools. Weights are auto-downloaded. ## Usage (Python) ```python from src.model.run import load_checkpoint, generate from src.model.architecture import EncoderDecoderTransformer from src.dataset.dataset import get_tokenizer params, config = load_checkpoint("checkpoints/needle.pkl") model = EncoderDecoderTransformer(config) tokenizer = get_tokenizer() result = generate( model, params, tokenizer, query="What's the weather in San Francisco?", tools='[{"name":"get_weather","parameters":{"location":"string"}}]', stream=False, ) print(result) # [{"name":"get_weather","arguments":{"location":"San Francisco"}}] ``` ## Finetuning Finetune on your own tools via the web UI or CLI: ```bash # Web UI (generates data via Gemini, trains, evaluates, bundles result) needle ui # CLI python -m src.training.finetune data.jsonl --checkpoint checkpoints/needle.pkl ``` ## File Format The checkpoint is a Python pickle containing: ```python { "params": { ... }, # nested dict of numpy float16 arrays "config": { ... }, # TransformerConfig fields as dict } ``` Load with: ```python import pickle with open("needle.pkl", "rb") as f: data = pickle.load(f) ``` ## Training Data Post-trained on [Cactus-Compute/tool-calls](https://huggingface.co/datasets/Cactus-Compute/tool-calls), a synthesized dataset of 2M+ function calling examples spanning 15 tool categories (timers, messaging, media, navigation, smart home, fitness, etc.). ## License MIT ## Citation ``` @misc{ndubuaku2026needle, title={Simple Attention Networks}, author={Henry Ndubuaku}, year={2026}, url={https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md} } ```