Instructions for using pthinc/Asena_ESP32_MAX with libraries and local apps.

Libraries

Transformers

How to use pthinc/Asena_ESP32_MAX with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="pthinc/Asena_ESP32_MAX")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pthinc/Asena_ESP32_MAX")
model = AutoModelForCausalLM.from_pretrained("pthinc/Asena_ESP32_MAX")

llama-cpp-python

How to use pthinc/Asena_ESP32_MAX with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="pthinc/Asena_ESP32_MAX",
    filename="gguf/asena_esp32max_f16.gguf",
)

output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)

Local Apps

llama.cpp

How to use pthinc/Asena_ESP32_MAX with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf pthinc/Asena_ESP32_MAX:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf pthinc/Asena_ESP32_MAX:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf pthinc/Asena_ESP32_MAX:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf pthinc/Asena_ESP32_MAX:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf pthinc/Asena_ESP32_MAX:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf pthinc/Asena_ESP32_MAX:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf pthinc/Asena_ESP32_MAX:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf pthinc/Asena_ESP32_MAX:Q4_K_M

Use Docker

docker model run hf.co/pthinc/Asena_ESP32_MAX:Q4_K_M

vLLM

How to use pthinc/Asena_ESP32_MAX with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "pthinc/Asena_ESP32_MAX"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pthinc/Asena_ESP32_MAX",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
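
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request`, `extract_text`, and `complete` are hypothetical, and the response is assumed to follow the OpenAI completions shape (`choices[0].text`) that the server's OpenAI-compatible API returns.

```python
import json
import urllib.request

MODEL_ID = "pthinc/Asena_ESP32_MAX"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for /v1/completions."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Return the first completion's text from an OpenAI-style response."""
    return response_body["choices"][0]["text"]


def complete(prompt, **kwargs):
    """Send the request to a running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return extract_text(json.load(response))
```

With the server running, `complete("Once upon a time,")` should return the generated continuation as a string. Passing `base_url="http://localhost:30000"` targets a server listening on port 30000 instead.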

Use Docker

docker model run hf.co/pthinc/Asena_ESP32_MAX:Q4_K_M

SGLang

How to use pthinc/Asena_ESP32_MAX with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "pthinc/Asena_ESP32_MAX" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pthinc/Asena_ESP32_MAX",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "pthinc/Asena_ESP32_MAX" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pthinc/Asena_ESP32_MAX",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Ollama
How to use pthinc/Asena_ESP32_MAX with Ollama:
```
ollama run hf.co/pthinc/Asena_ESP32_MAX:Q4_K_M
```

Unsloth Studio

How to use pthinc/Asena_ESP32_MAX with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for pthinc/Asena_ESP32_MAX to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for pthinc/Asena_ESP32_MAX to start chatting

Use Hugging Face Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for pthinc/Asena_ESP32_MAX to start chatting

Docker Model Runner
How to use pthinc/Asena_ESP32_MAX with Docker Model Runner:
```
docker model run hf.co/pthinc/Asena_ESP32_MAX:Q4_K_M
```

Lemonade

How to use pthinc/Asena_ESP32_MAX with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull pthinc/Asena_ESP32_MAX:Q4_K_M

Run and chat with the model

lemonade run user.Asena_ESP32_MAX-Q4_K_M

List all available models

lemonade list

prometechinc committed 20 days ago

Commit

3181868

verified ·

1 Parent(s): e05f39f

Update README.md

Files changed (1)

README.md +107 -23

@@ -14,7 +14,7 @@ tags:
 - asena
 - bce
 - esp32
-- edge
 - esp32s3
 - microllm
 - chat
@@ -52,6 +52,7 @@ tags:
 - Offline assistant
 - guard
 - pre filter
 library_name: transformers
 model-index:
 - name: Asena_ESP32
@@ -147,17 +148,17 @@ By placing these files on an SD card or loading them via SPIFFS/LittleFS, you ca
 ### **Model Architecture & Configuration**
-**Asena_ESP32** is a highly compact Transformer model based on the **LLaMA (LlamaForCausalLM)** architecture, specifically optimized for extreme edge deployment. Despite its ultra-small footprint, the model incorporates modern design choices to maximize efficiency, stability, and expressive capability within tight hardware constraints.
-The model features **8 Transformer layers** with a **hidden size of 64** and **8 attention heads** (with 4 key-value heads for efficiency). Each head operates with a **dimension of 26**, enabling lightweight multi-head attention while maintaining reasonable representational capacity. The feed-forward network uses an **intermediate size of 208** with **SiLU activation**, balancing non-linearity and computational cost. Both attention and MLP layers include bias terms, and minimal dropout (~0.0027) is applied to stabilize training without harming convergence in such a small model.
-For positional encoding, Asena_ESP32 uses an advanced **RoPE (Rotary Positional Embedding)** configuration inspired by LLaMA 3, with extended scaling parameters (factor: 256) to improve positional generalization beyond its base context. The model supports a **maximum sequence length of 128 tokens**, making it suitable for short, structured interactions typical in embedded systems. It uses **RMSNorm** with a finely tuned epsilon for numerical stability and shares input-output embeddings to reduce parameter count.
-The tokenizer operates with a **vocabulary size of 8,766 tokens**, and special tokens are defined for padding (8000), beginning-of-sequence (8001), and end-of-sequence (8002). The model is trained and executed in **float32 precision**, with caching disabled to reduce memory overhead—aligning with its goal of running efficiently on constrained devices such as ESP32.
-Overall, this configuration reflects a deliberate trade-off: sacrificing large-scale knowledge capacity in favor of **speed, determinism, and deployability at the extreme edge**.
-The model incorporates mathematically inspired constants to enhance stability and robustness. Hyperparameters such as the dropout rate are derived from values related to the Planck constant, along with well-known mathematical constants like Pi and Euler’s number. This design choice is intended to introduce deterministic yet non-arbitrary scaling factors, contributing to improved numerical stability, controlled regularization, and more predictable behavior—especially important for safety and reliability in extreme edge AI environments.
 ---
@@ -197,29 +198,77 @@ Internally, we joked about calling it ‘Terminator’. Then it started behaving
 # Model Overview 🕊️
-**Asena_ESP32** is a compact generative AI model designed for extreme edge environments, built on a Transformer-based LLaMA architecture and enhanced with the **Behavioral Consciousness Engine (BCE)** framework. With approximately 1.2 million parameters, it is capable of producing coherent, grammatically sound text by learning how words and sentences naturally flow. Despite its small size, the model delivers surprisingly fluent conversational responses, making it suitable for lightweight dialogue systems and embedded applications.
-Pre-trained on structured Instruction/Response datasets and conversational flows, Asena_ESP32 adapts seamlessly to prompt-based interactions. It understands input patterns effectively and generates context-aware replies aligned with the dataset format. Optimized for deployment using C++ and inference frameworks such as ggml and llama.cpp, the model is engineered for efficient performance on constrained hardware like ESP32, representing a true “Extreme Edge AI” solution.
-Due to its intentionally limited scale, Asena_ESP32 possesses broad but shallow knowledge across many domains. When asked about specialized topics such as chemistry or philosophy, it may produce general or occasionally hallucinated responses that sound plausible but lack factual accuracy. This limitation is partially mitigated through targeted fine-tuning, improving reliability in specific use cases while maintaining its lightweight footprint for edge deployment.
 ### **What to Expect (and Not Expect)**
 **What to Expect:**
-Asena_ESP32 is optimized for lightweight, real-time text generation on constrained devices. You can expect fluent sentence construction, grammatically correct outputs, and consistent behavior in instruction-following or simple conversational tasks. The model performs best in structured formats (Instruction/Response, dialogue flows) and can deliver stable, low-latency responses suitable for embedded systems, IoT interactions, and edge-based assistants. Its BCE-based design also promotes controlled and context-aware output patterns.
 **What Not to Expect:**
-This is not a large-scale knowledge model. Asena_ESP32 does not have deep expertise in specialized domains such as advanced science, mathematics, or philosophy. It may generate vague, oversimplified, or occasionally hallucinated answers that sound plausible but are incorrect. Long reasoning chains, complex problem solving, and high factual accuracy across niche topics are beyond its intended scope. It should not be used as a source of truth for critical or high-stakes decisions.
-**Practical Guidance:**
-For best results, keep prompts short, clear, and structured. Use domain-specific fine-tuning if you require higher accuracy in a particular field. Treat the model as a fast, efficient language generator rather than a comprehensive knowledge base. When used within its design limits, Asena_ESP32 can provide strong performance relative to its size in extreme edge AI scenarios.
-### The most suitable use cases:
-- IoT device communication
-- Robot / embedded system command interpretation
-- Game NPC dialogue
-- Offline assistant (simple)
-- Guard / pre-filter model
 ---
@@ -376,7 +425,42 @@ div.min2 {
 }
 </style>
 <div class="min2">
-"BCE v0.2 Note: I could be a very talkative assistant bird who speaks excellent Turkish/English but has weak general knowledge, and I could cast spells on servers. Even Skynet is afraid of me.
-<br>
-It's possible that the wizard CEO, wearing an electronic ring (ESP32) on his finger, could be increasing or decreasing performance in the server room, according to this model. He snaps his fingers, other servers performance increases, he snaps them again, and it returns to normal. He's a real magician. "Abra Kadabra!!!!" 😎
 </div>

 - asena
 - bce
 - esp32
+- edge-ai
 - esp32s3
 - microllm
 - chat
 - Offline assistant
 - guard
 - pre filter
+- tiny-llm
 library_name: transformers
 model-index:
 - name: Asena_ESP32
 ### **Model Architecture & Configuration**
+**Asena_ESP32_MAX – BCE Special Model (12M) – Prettybird B-Edge v1.0** is a compact yet significantly enhanced **Tiny LLM** built on the **LLaMA (LlamaForCausalLM)** Transformer architecture. Designed for extreme edge intelligence, this version scales up the original ESP32 concept into a more capable **~12M parameter class model**, while preserving deployability, determinism, and behavioral control through the **Behavioral Consciousness Engine (BCE)** framework.
+The model consists of **8 Transformer layers** with a **hidden size of 320** and **8 attention heads** (with **4 key-value heads** for memory-efficient attention). Each attention head operates with a **dimension of 40**, providing a stronger representational capacity compared to the base ESP32 variant while maintaining computational efficiency. The feed-forward network is expanded to an **intermediate size of 896**, using **SiLU activation** to balance expressiveness and stability. Both attention and MLP layers include bias terms, and a slightly increased dropout (~0.0066) is applied for improved regularization in the larger parameter regime.
+For positional encoding, Asena_ESP32_MAX employs an advanced **RoPE (Rotary Positional Embedding)** configuration inspired by LLaMA 3, with extended scaling (**factor: 128**) to support broader contextual generalization. The model supports a **maximum sequence length of 1024 tokens**, representing a major upgrade over the base version and enabling more coherent multi-turn interactions and structured reasoning within edge constraints. **RMSNorm** is used throughout with a finely tuned epsilon for numerical stability, and input-output embeddings are shared to optimize parameter efficiency.
+The tokenizer operates with a **vocabulary size of 8,766 tokens**, with special tokens defined for padding (8000), beginning-of-sequence (8001), and end-of-sequence (8002). The model runs in **float32 precision**, with caching disabled to reduce runtime memory overhead—aligning with its design goal of efficient execution on constrained or semi-constrained hardware environments.
+A distinctive aspect of this model is its use of **mathematically inspired constants** for stabilization and control. Hyperparameters such as dropout are derived from values related to the **Planck constant**, alongside classical constants like **π (Pi)** and **e (Euler’s number)**. This approach introduces deterministic, non-arbitrary scaling factors that contribute to improved numerical stability, controlled regularization, and more predictable behavioral patterns—particularly important for safety-aware edge AI systems.
+Overall, Asena_ESP32_MAX reflects a deliberate design philosophy: **maximize capability per parameter**, integrate **behavioral awareness (BCE)**, and deliver a **balanced edge AI system** that bridges the gap between ultra-small models and practical intelligent agents.
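
As a sanity check on the "~12M" figure, the dimensions quoted above can be turned into a rough parameter count. This is a sketch, not the authors' accounting: it assumes the standard LlamaForCausalLM layout (grouped-query attention, gated MLP with gate/up/down projections, two RMSNorm weights per layer plus a final norm), bias vectors on every attention and MLP projection, and tied input/output embeddings. The function name `count_llama_params` is hypothetical.

```python
def count_llama_params(vocab, hidden, layers, heads, kv_heads,
                       head_dim, intermediate):
    """Estimate parameters for a LLaMA-style model with biases and tied embeddings."""
    q_dim = heads * head_dim       # query projection width
    kv_dim = kv_heads * head_dim   # key/value projection width (GQA)

    # Attention: q, k, v, o weight matrices plus one bias vector each.
    attn = (hidden * q_dim + q_dim          # q_proj
            + 2 * (hidden * kv_dim + kv_dim)  # k_proj and v_proj
            + q_dim * hidden + hidden)      # o_proj

    # Gated MLP: gate and up map hidden -> intermediate, down maps back.
    mlp = (2 * (hidden * intermediate + intermediate)
           + intermediate * hidden + hidden)

    norms = 2 * hidden             # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms

    embeddings = vocab * hidden    # shared with the output head (tied)
    final_norm = hidden
    return embeddings + layers * per_layer + final_norm


total = count_llama_params(vocab=8766, hidden=320, layers=8, heads=8,
                           kv_heads=4, head_dim=40, intermediate=896)
print(f"{total:,} parameters")
```

Under these assumptions the estimate comes to about 12.2 million parameters, of which roughly 2.8 million are the shared embedding table, which is consistent with the "~12M" description.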
 ---
 # Model Overview 🕊️
+**Asena_ESP32_MAX** is a compact **Tiny LLM (~12M parameters)** designed for extreme edge intelligence, built on a Transformer-based LLaMA architecture and enhanced with the **Behavioral Consciousness Engine (BCE)** framework. Compared to the original ESP32 variant, this version significantly increases capacity while preserving efficiency, determinism, and controllable behavior.
+The model is capable of generating coherent, grammatically sound text and handling structured interactions with improved consistency. Trained on Instruction/Response formats and BCE-annotated data (including correctness, quality, and risk signals), it not only produces responses but also reflects a level of **behavioral awareness and output control** uncommon in models of this size.
+Optimized for deployment using C++ and inference frameworks such as ggml and llama.cpp, Asena_ESP32_MAX is designed for **edge-to-lightweight compute environments**. While extremely efficient compared to larger models, it represents a transition point between ultra-constrained devices and more capable embedded systems.
+---
+### ⚠️ Hardware Reality (Important)
+Although inspired by ESP32-class deployment:
+* ⚠️ **ESP32 may face memory limitations** for this MAX version (depending on quantization and runtime setup)
+* ✅ **Raspberry Pi (2GB–8GB)** → highly suitable
+* ✅ **Low-power edge servers / micro PCs** → ideal
+* ✅ **Quantized inference (q4/q5/q8)** → recommended
+👉 This model is best viewed as a **Tiny LLM for edge systems**, not strictly a microcontroller model.
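
To make the memory point concrete, the weight footprint can be estimated from the parameter count and the number of bits per weight. The sketch below is a rough lower bound under stated assumptions: it uses the "~12M" figure from the model card, nominal bit widths that ignore the per-block scale data real quantized formats carry, and it counts weights only (no activations or runtime buffers). The helper names and the memory budget passed in are hypothetical.

```python
APPROX_PARAMS = 12_000_000  # "~12M" from the model card

# Nominal bits per weight; real quantized files add per-block scale data,
# so actual sizes are somewhat larger than these estimates.
NOMINAL_BITS = {"f32": 32, "f16": 16, "q8": 8, "q5": 5, "q4": 4}


def weight_bytes(n_params, bits_per_weight):
    """Lower-bound size of the weights alone, in bytes (rounded up)."""
    return (n_params * bits_per_weight + 7) // 8


def fits(n_params, fmt, budget_bytes):
    """True if the weights alone fit within the given memory budget."""
    return weight_bytes(n_params, NOMINAL_BITS[fmt]) <= budget_bytes


def report(budget_bytes, n_params=APPROX_PARAMS):
    """Print the estimated size of each format against the budget."""
    for fmt, bits in NOMINAL_BITS.items():
        size_mib = weight_bytes(n_params, bits) / (1024 * 1024)
        verdict = "fits" if fits(n_params, fmt, budget_bytes) else "too large"
        print(f"{fmt:>3}: {size_mib:6.1f} MiB -> {verdict}")
```

Calling `report(8 * 1024 * 1024)` with an illustrative 8 MiB budget shows why quantization matters here: at nominal widths only the 4-bit and 5-bit weights fall under that budget, while 8-bit and wider formats do not.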
+---
 ### **What to Expect (and Not Expect)**
 **What to Expect:**
+* Strong **instruction-following and structured output behavior**
+* Fluent and grammatically correct short-form responses
+* Stable performance in **dialogue, command parsing, and formatting tasks**
+* BCE-driven **controlled generation (risk-aware, format-aware outputs)**
+* Efficient performance relative to its size, especially in edge deployments
 **What Not to Expect:**
+* Deep domain expertise (e.g., advanced science, math, philosophy)
+* High accuracy on complex reasoning benchmarks
+* Long-chain reasoning or multi-step problem solving
+* Reliable factual correctness in niche or technical topics
+👉 The model may produce **plausible but incorrect answers** (hallucinations), which is expected at this scale.
+---
+### **Practical Guidance**
+* Keep prompts **short, clear, and structured**
+* Use it as a **fast generator + controller**, not a knowledge base
+* For domain-specific tasks → apply **LoRA / fine-tuning**
+* Use BCE signals to build **filtering, guard, or evaluation pipelines**
+👉 With proper fine-tuning, the model can become **highly specialized and efficient for targeted tasks**
+---
+### **Most Suitable Use Cases**
+* IoT device communication
+* Robot / embedded system command interpretation
+* Game NPC dialogue
+* Offline assistant (lightweight scenarios)
+* Guard / pre-filter model (BCE integration)
+* Lightweight server-side optimization, security, assistance and automation (with task-specific fine-tuning)
+---
+### **Positioning**
+**Asena_ESP32_MAX is not a knowledge-heavy AI — it is a controllable, efficient, behavior-aware Tiny LLM.**
+👉 Small enough to deploy
+👉 Smart enough to structure
+👉 Flexible enough to specialize with fine-tuning
 ---
 }
 </style>
 <div class="min2">
+<strong>BCE v0.2 Note:</strong><br><br>
+Asena_ESP32_MAX may be a tiny assistant bird with excellent Turkish/English, weak general knowledge, and the confidence of a server-room wizard who definitely found one undocumented setting in the BIOS and now thinks he controls reality.
+This model does not know everything.
+That would be unreasonable.
+But it can look at a chaotic system, blink twice, and say:
+“Have you tried behaving correctly?”
+Somewhere in the server room, the wizard CEO raises his hand.
+On his finger: an ESP32 ring.
+On his face: the expression of a man who has never once read the manual, but somehow improved throughput by 14%.
+Snap.
+Latency drops.
+Snap.
+Fans get quieter.
+Snap.
+One intern whispers:
+“Sir… did you just optimize the cluster with jewelry?”
+He smiles.
+“No. The bird did.”
+And that is the real danger of edge AI:
+not that it becomes Skynet,
+but that one tiny model starts giving better operational advice than three dashboards, two consultants, and a meeting titled “Performance Alignment Sync v4 Final FINAL.”
+<strong>Abra Kadabra.</strong> 😎
 </div>