*Part of the **Edge LLM Deployments** collection: LLMs quantized for Raspberry Pi and ARM edge. GGUF + ONNX. No cloud required.*
Quantized ONNX export of TinyLlama-1.1B-Chat-v1.0 for local inference on edge devices, single-board computers, and resource-constrained environments.
| Property | Value |
|---|---|
| Base Model | TinyLlama-1.1B-Chat-v1.0 |
| Parameters | 1.1B |
| Format | ONNX (quantized) |
| Archive | TinyLlama_TinyLlama-1.1B-Chat-v1.0_onnx.7z |
| Context Length | 2,048 tokens |
| Target Hardware | Raspberry Pi, ARM64, edge CPUs |
| License | MIT |
```bash
# Download
huggingface-cli download Makatia/TinyLlama_TinyLlama-1.1B-Chat-v1.0_onnx \
  TinyLlama_TinyLlama-1.1B-Chat-v1.0_onnx.7z --local-dir .

# Extract (requires 7-Zip)
7z x TinyLlama_TinyLlama-1.1B-Chat-v1.0_onnx.7z
```
```python
import onnxruntime as ort
from transformers import AutoTokenizer

# The tokenizer ships with the original checkpoint on the Hub.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Load the extracted ONNX graph on CPU.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CPUExecutionProvider"],
)

# TinyLlama-Chat uses Zephyr-style chat markers.
prompt = "<|user|>What is LSTM and how is it used in signal processing?</s><|assistant|>"
inputs = tokenizer(prompt, return_tensors="np")

# Single forward pass; outputs[0] holds the logits.
outputs = session.run(None, dict(inputs))
```
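The session returns raw logits rather than text: to generate, you pick a next token from the final sequence position and feed it back in a loop. The greedy step can be sketched in pure NumPy, independent of the session object (the toy shapes below are illustrative):

```python
import numpy as np

def greedy_next_token(logits: np.ndarray) -> int:
    """Pick the highest-probability token id from the last sequence position.

    logits: array of shape (batch, seq_len, vocab_size), as in outputs[0].
    """
    last_step = logits[0, -1, :]        # logits for the token after the prompt
    return int(np.argmax(last_step))

# Toy example: vocabulary of 5 tokens, 3-token prompt.
toy_logits = np.zeros((1, 3, 5))
toy_logits[0, -1, 2] = 10.0             # token id 2 is most likely
print(greedy_next_token(toy_logits))    # -> 2
```

Appending the chosen id to the input ids and re-running the session, until `</s>` is produced, yields a complete (if unbatched and cache-free) generation loop.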
At 1.1B parameters, TinyLlama runs on a Raspberry Pi 4/5 with acceptable latency while retaining useful conversational ability. The ONNX format enables hardware-accelerated inference through ONNX Runtime's execution providers on a wide range of platforms.
| Device | Memory Footprint | Inference Speed |
|---|---|---|
| Raspberry Pi 5 (8 GB) | ~2 GB | ~5 tokens/s |
| Raspberry Pi 4 (4 GB) | ~2 GB | ~2 tokens/s |
| Desktop x86 | 4 GB+ | ~20 tokens/s |
| Apple Silicon | 4 GB+ | ~30 tokens/s |
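The ~2 GB footprint is consistent with a back-of-envelope estimate: INT8 quantization stores roughly one byte per weight, with activations, the KV cache, and runtime buffers on top. A rough sketch (an estimate, not a measurement):

```python
params = 1.1e9                # TinyLlama parameter count
weight_bytes = params * 1     # INT8: ~1 byte per weight
weights_gb = weight_bytes / 1024**3
print(f"INT8 weights: ~{weights_gb:.1f} GB")  # ~1.0 GB

# Activations, the KV cache at the 2,048-token context, and ONNX Runtime
# buffers roughly double this in practice, landing near the ~2 GB figure.
```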
- `model.onnx` -- Full ONNX model
- `model_quantized.onnx` -- INT8 quantized variant
- `config.json` -- Model configuration
- `tokenizer.json` -- Tokenizer files

Maintainer: Makatia