# AMINI Q4_K_M GGUF – Tushe Foundry Edge Inference Pack

Quantised & packaged by Tushe – The Foundry Research Team
## What Is This?
We are Tushe – The Foundry Research Team, and we are building a bare-metal inference engine for African language AI on constrained hardware.
This repository is part of our open-source bare-metal inference engine package – a lightweight inference runtime we are releasing as:
| Format | Use case |
|---|---|
| 🐍 Python library (`pip install tushe-bare-metal`) | Servers, Raspberry Pi, edge Linux |
| 📦 npm package (`npm install tushe-bare-metal`) | Node.js apps, Electron, React Native |
| ⚙️ Compiled C executable | Bare-metal embedded, IoT, MCUs |
Every developer can drop this into their app and run offline African language inference right away – no internet, no cloud, no GPU required.
## 🎯 Why We Built This
Africa has some of the most resource-constrained connectivity environments in the world. Millions of people – rural doctors, farmers, teachers, students, traders, and tourists – need intelligent language tools but have no reliable internet access.
We took N-ATLaS, the Llama 3 8B model fine-tuned on Nigerian and African languages by Awarri Technologies in collaboration with NCAIR, and quantised it, with aggressive optimisation, to run on low-resource hardware including:
- 📱 Android & iOS phones
- 🌾 Edge IoT devices in agricultural fields
- 🏥 Offline clinical/medical support tools in rural clinics
- 🏫 Classrooms with no internet access
- Portable translator devices for traders and tourists
## Target Use Cases
| Domain | Description |
|---|---|
| 🏥 Rural & Edge Medical | Doctors and health workers in remote clinics – symptom triage, patient communication, drug info in local languages |
| 🌾 Farmer Support | Modern and rural farmers – crop advice, weather interpretation, market prices, pest identification in Hausa, Igbo, Yoruba |
| 🏫 Education | Teachers and students in areas without internet – explanations, tutoring, literacy support in local languages |
| Traders & Markets | Cross-language communication for traders and informal markets across Africa |
| ✈️ Tourists | Real-time offline translation across African languages |
## About the Base Model
This GGUF is derived from NCAIR1/N-ATLaS, an open-source multilingual LLM built on Llama 3 8B, fine-tuned by Awarri Technologies in collaboration with the National Centre for Artificial Intelligence & Robotics (NCAIR) and the Federal Ministry of Communications, Innovation and Digital Economy of Nigeria.

N-ATLaS was trained on approximately 392 million multilingual tokens spanning English, Hausa, Igbo, and Yoruba.
We did not change the weights. We quantised the original model and built a highly optimised inference engine to enable edge deployment.
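For reference, a quantisation like this can be reproduced with stock llama.cpp tooling. The sketch below is illustrative, not our exact build pipeline; the checkpoint path, output filenames, and the intermediate F16 step are assumptions:

```shell
# Convert the original Hugging Face checkpoint to an F16 GGUF
# (convert_hf_to_gguf.py ships with the llama.cpp repository)
python convert_hf_to_gguf.py ./N-ATLaS --outfile n-atlas-f16.gguf --outtype f16

# Quantise F16 -> Q4_K_M with the llama-quantize tool built from llama.cpp
./llama-quantize n-atlas-f16.gguf AMINI-q4_k_m.gguf Q4_K_M
```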
## 💾 This File
| File | Quant | Size | Min RAM |
|---|---|---|---|
| AMINI-q4_k_m.gguf | Q4_K_M | ~4.5 GB | 6–8 GB |
Q4_K_M is our recommended quant for edge deployment – the best balance of accuracy, speed, and memory. It runs on a phone with 8 GB of RAM or a Raspberry Pi 5.
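As a sanity check on the 6–8 GB figure, here is a back-of-envelope RAM estimate. The architecture numbers (32 layers, 8 KV heads via GQA, head dim 128) are assumptions taken from the public Llama 3 8B specs, and the 1 GiB overhead term is a rough guess:

```python
# Rough RAM estimate for the Q4_K_M file at n_ctx = 2048.
GIB = 1024 ** 3

weights_gib = 4.5                       # size of AMINI-q4_k_m.gguf on disk
n_layers, n_kv_heads, head_dim = 32, 8, 128   # Llama 3 8B (assumed)
n_ctx, fp16_bytes = 2048, 2

# K and V caches, one pair per layer, stored in f16
kv_cache_gib = 2 * n_layers * n_kv_heads * head_dim * n_ctx * fp16_bytes / GIB

overhead_gib = 1.0                      # compute buffers, OS, app (rough guess)
total_gib = weights_gib + kv_cache_gib + overhead_gib
print(f"KV cache: {kv_cache_gib:.2f} GiB, total: ~{total_gib:.1f} GiB")
```

The total lands well inside the 6–8 GB band, which is why shrinking `n_ctx` is the first lever on very low-RAM devices.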
## Inference Examples
### 1. Python – llama-cpp-python

```bash
pip install llama-cpp-python
```

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="AlaminI/AMINI-ASSISTANT-GGUF-Q4-B",
    filename="*.gguf",
    n_ctx=2048,
    n_gpu_layers=0,  # 0 = CPU only (edge/offline), -1 = full GPU
    verbose=False,
)
```
English – rural medical support:

```python
output = llm(
    "A patient presents with fever, headache, and joint pain for 3 days. What are the possible diagnoses and first-line management?",
    max_tokens=256,
    temperature=0.7,
    echo=False,
)
print(output["choices"][0]["text"])
```
Hausa – farmer support:

```python
output = llm(
    "Gonar hatsi na da kwari da yawa. Menene zan iya yi don kare amfanin gona na?",
    max_tokens=256,
    temperature=0.7,
    echo=False,
)
print(output["choices"][0]["text"])
```
Yoruba – student support:

```python
output = llm(
    "Ṣe alaye ohun ti photosynthesis jẹ ni ede Yoruba fun ọmọ ile-iwe.",
    max_tokens=256,
    temperature=0.7,
    echo=False,
)
print(output["choices"][0]["text"])
```
Igbo – trader/market support:

```python
output = llm(
    "Gwa m ọnụ ahịa nke ọka ugbu a n'ahịa Onitsha.",
    max_tokens=256,
    temperature=0.7,
    echo=False,
)
print(output["choices"][0]["text"])
```
### 2. Chat format – multilingual instruction

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="AlaminI/AMINI-ASSISTANT-GGUF-Q4-B",
    filename="*.gguf",
    n_ctx=2048,
    n_gpu_layers=0,
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": (
                "You are an offline African language assistant running on a local device. "
                "You support English, Hausa, Igbo, and Yoruba. "
                "Respond in the same language the user writes in. "
                "Be concise – this device has limited resources."
            ),
        },
        {
            "role": "user",
            "content": "Translate 'The child has a high fever and needs immediate care' into Hausa and Yoruba.",
        },
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```
### 3. Streaming (for responsive UIs on edge devices)

```python
stream = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain crop rotation to a farmer in Hausa."}
    ],
    max_tokens=256,
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"].get("content", "")
    print(delta, end="", flush=True)
```
### 4. Node.js – node-llama-cpp

```bash
npm install node-llama-cpp
```

```javascript
import { getLlama, LlamaChatSession } from "node-llama-cpp";
import path from "path";

const llama = await getLlama();
const model = await llama.loadModel({
  modelPath: path.join("models", "AMINI-q4_k_m.gguf"),
});
const context = await model.createContext({ contextSize: 2048 });
const session = new LlamaChatSession({ contextSequence: context.getSequence() });

const response = await session.prompt(
  "A farmer asks: my tomatoes are wilting despite regular watering. What could be wrong?",
  { maxTokens: 256 }
);
console.log(response);
```
### 5. llama.cpp CLI (bare-metal / embedded)

```bash
# Download
huggingface-cli download AlaminI/AMINI-ASSISTANT-GGUF-Q4-B \
  AMINI-q4_k_m.gguf --local-dir ./models/

# Run on CPU only (edge device)
./llama-cli -m ./models/AMINI-q4_k_m.gguf \
  --ctx-size 2048 \
  --threads 4 \
  --temp 0.7 \
  -i -r "User:" \
  -p "You are an offline assistant for African languages. Respond in the user's language.\nUser:"
```
### 6. Ollama (local server mode)

```bash
ollama run hf.co/AlaminI/AMINI-ASSISTANT-GGUF-Q4-B
```
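Once the model is pulled, Ollama also exposes a local REST API on port 11434, which is useful for offline apps. A minimal sketch using only the Python standard library and Ollama's `/api/generate` endpoint (the prompt is illustrative):

```python
import json
import urllib.request

# The model tag mirrors the `ollama run` command above.
OLLAMA_URL = "http://localhost:11434/api/generate"
payload = {
    "model": "hf.co/AlaminI/AMINI-ASSISTANT-GGUF-Q4-B",
    "prompt": "Explain malaria prevention in Hausa.",
    "stream": False,
}

def generate(payload: dict) -> str:
    """POST the prompt to the local Ollama server; return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# print(generate(payload))  # requires a running Ollama server
```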
## ⚙️ Recommended Settings for Edge Devices
| Parameter | Value | Notes |
|---|---|---|
| `n_ctx` | 512–1024 | Reduce on very low-RAM devices |
| `n_gpu_layers` | 0 | CPU-only for phones/IoT |
| `n_threads` | 4 | Match your device's core count |
| `temperature` | 0.7 | Balanced responses |
| `max_tokens` | 128–256 | Keep short for low-latency UX |
| `repeat_penalty` | 1.1 | Reduces looping on edge |
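A minimal sketch collecting the table's values as llama-cpp-python keyword arguments, split into load-time and sampling-time settings, with `n_threads` capped by the device's core count. The constant names and the model path in the usage comment are our own, not part of any API:

```python
import os

# Load-time settings (passed when constructing the model)
EDGE_MODEL_KWARGS = dict(
    n_ctx=1024,          # drop to 512 on very low-RAM devices
    n_gpu_layers=0,      # CPU-only for phones / IoT
    n_threads=min(4, os.cpu_count() or 1),
)

# Per-request sampling settings
EDGE_SAMPLING_KWARGS = dict(
    max_tokens=256,      # 128 for tighter latency budgets
    temperature=0.7,
    repeat_penalty=1.1,
)

# Usage sketch:
# llm = Llama("./models/AMINI-q4_k_m.gguf", **EDGE_MODEL_KWARGS)
# llm("Your prompt here", **EDGE_SAMPLING_KWARGS)
```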
## 🖥️ Tested Hardware Targets
| Device | RAM | n_gpu_layers | Speed (tok/s) |
|---|---|---|---|
| Raspberry Pi 5 | 8 GB | 0 (CPU) | ~2–4 tok/s |
| Android phone (8 GB) | 8 GB | 0 (CPU) | ~3–6 tok/s |
| Laptop (no GPU) | 16 GB | 0 (CPU) | ~8–15 tok/s |
| Laptop (GPU 6 GB) | 16 GB | -1 (GPU) | ~30–60 tok/s |
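These throughput figures are why we recommend short `max_tokens` on edge devices. Rough latency arithmetic for a full 256-token reply, assuming the midpoint of each speed range above:

```python
# End-to-end generation time for a 256-token reply at the table's speeds
# (midpoints of the measured ranges are our assumption).
speeds_tok_s = {
    "Raspberry Pi 5": 3,
    "Android phone (8 GB)": 4.5,
    "Laptop (no GPU)": 11,
    "Laptop (GPU 6 GB)": 45,
}
reply_tokens = 256
for device, tps in speeds_tok_s.items():
    print(f"{device}: ~{reply_tokens / tps:.0f} s")
```

A CPU-only phone spends roughly a minute on a 256-token answer, so capping replies at 128 tokens and streaming the output makes the UX far more responsive.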
## Related Repositories
| Repo | Description |
|---|---|
| NCAIR1/N-ATLaS | Original model by Awarri / NCAIR |
| AlaminI/AMINI-ASSISTANT-GGUF-Q4-B | F16 GGUF (full precision, for re-quantising) |
## ⚠️ License

This GGUF quantisation is an independent contribution by Tushe – The Foundry Research Team. We encourage developers to refer to the N-ATLaS licence for the model weights. Our inference engine can be used for any purpose, commercial or otherwise, with no limit on user numbers. We will also release models trained by us, to give developers fully open-source models and inference at the edge.