Cactus-Compute/needle

@@ -13,9 +13,9 @@ tags:
 # Needle
-A 26M parameter encoder-decoder transformer for on-device function calling, built on a "Simple Attention Network" architecture (no feedforward layers).
-Distilled from Gemini 3.1 Flash Lite. Runs at 6000 tok/s prefill and 1200 tok/s decode on [Cactus](https://github.com/cactus-compute/cactus).
 | | |
 |---|---|
@@ -76,10 +76,6 @@ d=512, 8H/4KV, BPE=8192
     └───────────┘
 ```
-No feedforward layers. Each encoder block is gated self-attention; each decoder block is gated self-attention + gated cross-attention. The only nonlinearities are softmax and sigmoid.
-See [Simple Attention Networks](https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md) for the full architectural breakdown.
 ## Quickstart
 ```bash
@@ -119,8 +115,8 @@ Finetune on your own tools via the web UI or CLI:
 # Web UI (generates data via Gemini, trains, evaluates, bundles result)
 needle ui
-# CLI
-python -m src.training.finetune data.jsonl --checkpoint checkpoints/needle.pkl
 ```
 ## Links

 # Needle
+We distilled Gemini 3.1 into a 26M-parameter "[Simple Attention Network](docs/simple_attention_networks.md)" that you can finetune locally on your Mac or PC.
+In production, Needle runs on [Cactus](https://github.com/cactus-compute/cactus) at 6000 tok/s prefill and 1200 tok/s decode.
+Weights are fully open on [Cactus-Compute/needle](https://huggingface.co/Cactus-Compute/needle); the dataset generation is open as well.
 | | |
 |---|---|
     └───────────┘
 ```
 ## Quickstart
 ```bash
 # Web UI (generates data via Gemini, trains, evaluates, bundles result)
 needle ui
+# CLI (auto-downloads weights if not local)
+python -m src.training.finetune data.jsonl
 ```
 ## Links