# AMINI Q4_K_M GGUF – Tushe Foundry Edge Inference Pack

Quantised & packaged by Tushe – The Foundry Research Team
## What Is This?
We are Tushe – The Foundry Research Team, and we are building a bare-metal inference engine for African language AI on constrained hardware.
This repository is part of our open-source bare-metal inference engine package – a lightweight inference runtime we are releasing as:
| Format | Use case |
|---|---|
| 🐍 Python library (`pip install tushe-bare-metal`) | Servers, Raspberry Pi, edge Linux |
| 📦 npm package (`npm install tushe-bare-metal`) | Node.js apps, Electron, React Native |
| ⚙️ Compiled C executable | Bare-metal embedded, IoT, MCUs |
Every developer can drop this into their app and run offline African language inference right away – no internet, no cloud, no GPU required.
## 🎯 Why We Built This
Africa has some of the most resource-constrained connectivity environments in the world. Millions of people – rural doctors, farmers, teachers, students, traders, and tourists – need intelligent language tools but have no reliable internet access.
We took N-ATLaS, the Llama 3 8B model fine-tuned on Nigerian and African languages by Awarri Technologies in collaboration with NCAIR, and quantised it, with aggressive optimisation, to run on low-resource hardware including:
- 📱 Android & iOS phones
- 🌾 Edge IoT devices in agricultural fields
- 🏥 Offline clinical/medical support tools in rural clinics
- 🏫 Classrooms with no internet access
- Portable translator devices for traders and tourists
## Target Use Cases
| Domain | Description |
|---|---|
| 🏥 Rural & Edge Medical | Doctors and health workers in remote clinics – symptom triage, patient communication, drug info in local languages |
| 🌾 Farmer Support | Modern and rural farmers – crop advice, weather interpretation, market prices, pest identification in Hausa, Igbo, Yoruba |
| 🏫 Education | Teachers and students in areas without internet – explanations, tutoring, literacy support in local languages |
| Traders & Markets | Cross-language communication for traders and informal markets across Africa |
| ✈️ Tourists | Real-time offline translation across African languages |
## About the Base Model
This GGUF is derived from NCAIR1/N-ATLaS, an open-source multilingual LLM built on Llama 3 8B, fine-tuned by Awarri Technologies in collaboration with the National Centre for Artificial Intelligence & Robotics (NCAIR) and the Federal Ministry of Communications, Innovation and Digital Economy of Nigeria.

N-ATLaS was trained on approximately 392 million multilingual tokens spanning English, Hausa, Igbo, and Yoruba.
We did not change the weights. We quantised the original model and built a highly optimised inference engine to enable edge deployment.
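For reference, a quantisation like this can be reproduced with stock llama.cpp tooling. The sketch below is illustrative, not our exact build pipeline; the checkpoint path, output filenames, and the intermediate F16 step are assumptions:

```shell
# Convert the original Hugging Face checkpoint to an F16 GGUF
# (convert_hf_to_gguf.py ships with the llama.cpp repository)
python convert_hf_to_gguf.py ./N-ATLaS --outfile n-atlas-f16.gguf --outtype f16

# Quantise F16 -> Q4_K_M with the llama-quantize tool built from llama.cpp
./llama-quantize n-atlas-f16.gguf AMINI-q4_k_m.gguf Q4_K_M
```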
## 💾 This File
| File | Quant | Size | Min RAM |
|---|---|---|---|
| AMINI-q4_k_m.gguf | Q4_K_M | ~4.5 GB | 6–8 GB |
Q4_K_M is our recommended quant for edge deployment – the best balance of accuracy, speed, and memory. It runs on a phone with 8 GB of RAM or a Raspberry Pi 5.
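As a sanity check on the 6–8 GB figure, here is a back-of-envelope RAM estimate. The architecture numbers (32 layers, 8 KV heads via GQA, head dim 128) are assumptions taken from the public Llama 3 8B specs, and the 1 GiB overhead term is a rough guess:

```python
# Rough RAM estimate for the Q4_K_M file at n_ctx = 2048.
GIB = 1024 ** 3

weights_gib = 4.5                       # size of AMINI-q4_k_m.gguf on disk
n_layers, n_kv_heads, head_dim = 32, 8, 128   # Llama 3 8B (assumed)
n_ctx, fp16_bytes = 2048, 2

# K and V caches, one pair per layer, stored in f16
kv_cache_gib = 2 * n_layers * n_kv_heads * head_dim * n_ctx * fp16_bytes / GIB

overhead_gib = 1.0                      # compute buffers, OS, app (rough guess)
total_gib = weights_gib + kv_cache_gib + overhead_gib
print(f"KV cache: {kv_cache_gib:.2f} GiB, total: ~{total_gib:.1f} GiB")
```

The total lands well inside the 6–8 GB band, which is why shrinking `n_ctx` is the first lever on very low-RAM devices.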
## Inference Examples
### 1. Python – llama-cpp-python

```bash
pip install llama-cpp-python
```

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="AlaminI/AMINI-ASSISTANT-GGUF-Q4-B",
    filename="*.gguf",
    n_ctx=2048,
    n_gpu_layers=0,  # 0 = CPU only (edge/offline), -1 = full GPU
    verbose=False,
)
```
English – rural medical support:

```python
output = llm(
    "A patient presents with fever, headache, and joint pain for 3 days. What are the possible diagnoses and first-line management?",
    max_tokens=256,
    temperature=0.7,
    echo=False,
)
print(output["choices"][0]["text"])
```
Hausa – farmer support:

```python
output = llm(
    "Gonar hatsi na da kwari da yawa. Menene zan iya yi don kare amfanin gona na?",
    max_tokens=256,
    temperature=0.7,
    echo=False,
)
print(output["choices"][0]["text"])
```
Yoruba – student support:

```python
output = llm(
    "Ṣe alaye ohun ti photosynthesis jẹ ni ede Yoruba fun ọmọ ile-iwe.",
    max_tokens=256,
    temperature=0.7,
    echo=False,
)
print(output["choices"][0]["text"])
```
Igbo – trader/market support:

```python
output = llm(
    "Gwa m ọnụ ahịa nke ọka ugbu a n'ahịa Onitsha.",
    max_tokens=256,
    temperature=0.7,
    echo=False,
)
print(output["choices"][0]["text"])
```
### 2. Chat format – multilingual instruction

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="AlaminI/AMINI-ASSISTANT-GGUF-Q4-B",
    filename="*.gguf",
    n_ctx=2048,
    n_gpu_layers=0,
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": (
                "You are an offline African language assistant running on a local device. "
                "You support English, Hausa, Igbo, and Yoruba. "
                "Respond in the same language the user writes in. "
                "Be concise – this device has limited resources."
            ),
        },
        {
            "role": "user",
            "content": "Translate 'The child has a high fever and needs immediate care' into Hausa and Yoruba.",
        },
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```
### 3. Streaming (for responsive UIs on edge devices)

```python
stream = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain crop rotation to a farmer in Hausa."}
    ],
    max_tokens=256,
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"].get("content", "")
    print(delta, end="", flush=True)
```
### 4. Node.js – node-llama-cpp

```bash
npm install node-llama-cpp
```

```javascript
import { getLlama, LlamaChatSession } from "node-llama-cpp";
import path from "path";

const llama = await getLlama();
const model = await llama.loadModel({
  modelPath: path.join("models", "AMINI-q4_k_m.gguf"),
});
const context = await model.createContext({ contextSize: 2048 });
const session = new LlamaChatSession({ contextSequence: context.getSequence() });

const response = await session.prompt(
  "A farmer asks: my tomatoes are wilting despite regular watering. What could be wrong?",
  { maxTokens: 256 }
);
console.log(response);
```
### 5. llama.cpp CLI (bare-metal / embedded)

```bash
# Download
huggingface-cli download AlaminI/AMINI-ASSISTANT-GGUF-Q4-B \
  AMINI-q4_k_m.gguf --local-dir ./models/

# Run on CPU only (edge device)
./llama-cli -m ./models/AMINI-q4_k_m.gguf \
  --ctx-size 2048 \
  --threads 4 \
  --temp 0.7 \
  -i -r "User:" \
  -p "You are an offline assistant for African languages. Respond in the user's language.\nUser:"
```
### 6. Ollama (local server mode)

```bash
ollama run hf.co/AlaminI/AMINI-ASSISTANT-GGUF-Q4-B
```
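Once the model is pulled, Ollama also exposes a local REST API on port 11434, which is useful for offline apps. A minimal sketch using only the Python standard library and Ollama's `/api/generate` endpoint (the prompt is illustrative):

```python
import json
import urllib.request

# The model tag mirrors the `ollama run` command above.
OLLAMA_URL = "http://localhost:11434/api/generate"
payload = {
    "model": "hf.co/AlaminI/AMINI-ASSISTANT-GGUF-Q4-B",
    "prompt": "Explain malaria prevention in Hausa.",
    "stream": False,
}

def generate(payload: dict) -> str:
    """POST the prompt to the local Ollama server; return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# print(generate(payload))  # requires a running Ollama server
```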
## ⚙️ Recommended Settings for Edge Devices
| Parameter | Value | Notes |
|---|---|---|
| `n_ctx` | 512–1024 | Reduce on very low-RAM devices |
| `n_gpu_layers` | 0 | CPU-only for phones/IoT |
| `n_threads` | 4 | Match your device's core count |
| `temperature` | 0.7 | Balanced responses |
| `max_tokens` | 128–256 | Keep short for low-latency UX |
| `repeat_penalty` | 1.1 | Reduces looping on edge |
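A minimal sketch collecting the table's values as llama-cpp-python keyword arguments, split into load-time and sampling-time settings, with `n_threads` capped by the device's core count. The constant names and the model path in the usage comment are our own, not part of any API:

```python
import os

# Load-time settings (passed when constructing the model)
EDGE_MODEL_KWARGS = dict(
    n_ctx=1024,          # drop to 512 on very low-RAM devices
    n_gpu_layers=0,      # CPU-only for phones / IoT
    n_threads=min(4, os.cpu_count() or 1),
)

# Per-request sampling settings
EDGE_SAMPLING_KWARGS = dict(
    max_tokens=256,      # 128 for tighter latency budgets
    temperature=0.7,
    repeat_penalty=1.1,
)

# Usage sketch:
# llm = Llama("./models/AMINI-q4_k_m.gguf", **EDGE_MODEL_KWARGS)
# llm("Your prompt here", **EDGE_SAMPLING_KWARGS)
```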
## 🖥️ Tested Hardware Targets
| Device | RAM | n_gpu_layers | Speed (tok/s) |
|---|---|---|---|
| Raspberry Pi 5 | 8 GB | 0 (CPU) | ~2–4 tok/s |
| Android phone (8 GB) | 8 GB | 0 (CPU) | ~3–6 tok/s |
| Laptop (no GPU) | 16 GB | 0 (CPU) | ~8–15 tok/s |
| Laptop (GPU 6 GB) | 16 GB | -1 (GPU) | ~30–60 tok/s |
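These throughput figures are why we recommend short `max_tokens` on edge devices. Rough latency arithmetic for a full 256-token reply, assuming the midpoint of each speed range above:

```python
# End-to-end generation time for a 256-token reply at the table's speeds
# (midpoints of the measured ranges are our assumption).
speeds_tok_s = {
    "Raspberry Pi 5": 3,
    "Android phone (8 GB)": 4.5,
    "Laptop (no GPU)": 11,
    "Laptop (GPU 6 GB)": 45,
}
reply_tokens = 256
for device, tps in speeds_tok_s.items():
    print(f"{device}: ~{reply_tokens / tps:.0f} s")
```

A CPU-only phone spends roughly a minute on a 256-token answer, so capping replies at 128 tokens and streaming the output makes the UX far more responsive.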
## Related Repositories
| Repo | Description |
|---|---|
| NCAIR1/N-ATLaS | Original model by Awarri / NCAIR |
| AlaminI/AMINI-ASSISTANT-GGUF-Q4-B | F16 GGUF (full precision, for re-quantising) |
## ⚠️ License

This GGUF quantisation is an independent contribution by Tushe – The Foundry Research Team. We encourage developers to refer to the N-ATLaS licence for the model weights. Our inference engine can be used for any purpose, commercial or otherwise, with no limit on user numbers. We will also release models trained by us, to give developers fully open-source models and inference at the edge.