*Part of the **Edge LLM Deployments** collection: LLMs quantized for Raspberry Pi and ARM edge. GGUF + ONNX. No cloud required.*
Quantized ONNX export of TinyLlama-1.1B-Chat-v1.0 for local inference on edge devices, single-board computers, and resource-constrained environments.
| Property | Value |
|---|---|
| Base Model | TinyLlama-1.1B-Chat-v1.0 |
| Parameters | 1.1B |
| Format | ONNX (quantized) |
| Archive | TinyLlama_TinyLlama-1.1B-Chat-v1.0_onnx.7z |
| Context Length | 2,048 tokens |
| Target Hardware | Raspberry Pi, ARM64, edge CPUs |
| License | MIT |
```bash
# Download
huggingface-cli download Makatia/TinyLlama_TinyLlama-1.1B-Chat-v1.0_onnx \
  TinyLlama_TinyLlama-1.1B-Chat-v1.0_onnx.7z --local-dir .

# Extract (requires 7-Zip)
7z x TinyLlama_TinyLlama-1.1B-Chat-v1.0_onnx.7z
```
```python
import onnxruntime as ort
from transformers import AutoTokenizer

# The tokenizer ships with the original checkpoint on the Hub.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Load the extracted ONNX graph on CPU.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CPUExecutionProvider"],
)

# TinyLlama-Chat uses Zephyr-style chat markers.
prompt = "<|user|>What is LSTM and how is it used in signal processing?</s><|assistant|>"
inputs = tokenizer(prompt, return_tensors="np")

# Single forward pass; outputs[0] holds the logits.
outputs = session.run(None, dict(inputs))
```
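The session returns raw logits rather than text: to generate, you pick a next token from the final sequence position and feed it back in a loop. The greedy step can be sketched in pure NumPy, independent of the session object (the toy shapes below are illustrative):

```python
import numpy as np

def greedy_next_token(logits: np.ndarray) -> int:
    """Pick the highest-probability token id from the last sequence position.

    logits: array of shape (batch, seq_len, vocab_size), as in outputs[0].
    """
    last_step = logits[0, -1, :]        # logits for the token after the prompt
    return int(np.argmax(last_step))

# Toy example: vocabulary of 5 tokens, 3-token prompt.
toy_logits = np.zeros((1, 3, 5))
toy_logits[0, -1, 2] = 10.0             # token id 2 is most likely
print(greedy_next_token(toy_logits))    # -> 2
```

Appending the chosen id to the input ids and re-running the session, until `</s>` is produced, yields a complete (if unbatched and cache-free) generation loop.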
At 1.1B parameters, TinyLlama runs on a Raspberry Pi 4/5 with acceptable latency while retaining useful conversational ability. The ONNX format enables hardware-accelerated inference through ONNX Runtime's execution providers on a wide range of platforms.
| Device | Memory Footprint | Inference Speed |
|---|---|---|
| Raspberry Pi 5 (8 GB) | ~2 GB | ~5 tokens/s |
| Raspberry Pi 4 (4 GB) | ~2 GB | ~2 tokens/s |
| Desktop x86 | 4 GB+ | ~20 tokens/s |
| Apple Silicon | 4 GB+ | ~30 tokens/s |
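The ~2 GB footprint is consistent with a back-of-envelope estimate: INT8 quantization stores roughly one byte per weight, with activations, the KV cache, and runtime buffers on top. A rough sketch (an estimate, not a measurement):

```python
params = 1.1e9                # TinyLlama parameter count
weight_bytes = params * 1     # INT8: ~1 byte per weight
weights_gb = weight_bytes / 1024**3
print(f"INT8 weights: ~{weights_gb:.1f} GB")  # ~1.0 GB

# Activations, the KV cache at the 2,048-token context, and ONNX Runtime
# buffers roughly double this in practice, landing near the ~2 GB figure.
```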
- `model.onnx` -- Full ONNX model
- `model_quantized.onnx` -- INT8 quantized variant
- `config.json` -- Model configuration
- `tokenizer.json` -- Tokenizer files

Maintainer: Makatia