# TejoAI Rick 8B v4 (GGUF Q4_K_M)

A domain-specialized 8B language model fine-tuned from Llama-3.1-8B-Instruct, optimized for fast local inference with a minimal VRAM footprint.
## Model Overview
Rick is part of the TejoAI character model family, trained using QLoRA (4-bit NF4) on a consumer GPU (RTX 4060 Laptop, 8GB VRAM). The fine-tuning targeted:
- Code generation across 7+ programming languages (Python, JavaScript, Rust, Go, C++, Java, C#, SQL)
- Mathematical reasoning using Chain-of-Thought prompting
- Science Q&A with factual accuracy
- Multi-step problem solving with structured reasoning
- Ethical alignment based on Dharmic principles
This is the quantized (Q4_K_M) GGUF variant -- optimized for CPU/GPU inference with minimal memory footprint.
## What is Rick?
Rick is a persona-driven model inspired by Rick Sanchez (Rick and Morty). The intent was to train the Rick persona directly into the model weights so it would behave in-character without any system prompt.
Honest assessment: This goal was only partially achieved. The Rick persona is present in the weights but is not strong enough to persist reliably without a system prompt. Without the system prompt below, the model largely behaves like a standard Llama-3.1-8B-Instruct.
### Why the persona doesn't fully stick
- LoRA rank 32 is too shallow -- QLoRA with rank 32 modifies only ~0.1-0.5% of the model's parameters. That's not enough to override billions of parameters of base behavior.
- Instruct base model fights the persona -- Llama-3.1-8B-Instruct has strong RLHF alignment training that actively pulls the model toward helpful/neutral assistant behavior, working against the Rick personality.
- Insufficient persona training data -- ~500-1,000 persona examples vs. billions of tokens in the base model. The persona signal is overwhelmed.
- SFT alone can't embed personality -- Supervised fine-tuning teaches what to say, not how to prefer saying it. Embedding a persistent persona likely requires DPO (Direct Preference Optimization) or RLHF to make the model prefer in-character responses.
- Adapter merging dilutes the signal -- Merging LoRA weights back into the base model further attenuates the persona modifications.
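The first point above can be made concrete with quick arithmetic. The sketch below estimates the trainable parameter count of a rank-32 LoRA on Llama-3.1-8B, assuming the adapter targets only the attention projections (q/k/v/o); the actual target modules used for this model are not documented here, so treat the exact figure as illustrative.

```python
# Back-of-the-envelope estimate of how few parameters a rank-32 LoRA touches.
# Assumes LoRA targets only the attention projections (q/k/v/o) -- the actual
# target modules for this model are not documented in the card.

LORA_RANK = 32
HIDDEN = 4096          # Llama-3.1-8B hidden size
KV_DIM = 1024          # 8 KV heads x 128 head dim (GQA)
LAYERS = 32
BASE_PARAMS = 8.03e9   # approximate total parameter count

# A LoRA adapter on a (d_in x d_out) weight adds r * (d_in + d_out) parameters.
per_layer = sum(
    LORA_RANK * (d_in + d_out)
    for d_in, d_out in [
        (HIDDEN, HIDDEN),  # q_proj
        (HIDDEN, KV_DIM),  # k_proj
        (HIDDEN, KV_DIM),  # v_proj
        (HIDDEN, HIDDEN),  # o_proj
    ]
)
lora_params = per_layer * LAYERS
fraction = lora_params / BASE_PARAMS
print(f"LoRA params: {lora_params / 1e6:.1f}M ({fraction:.2%} of base)")
```

Under these assumptions the adapter has about 27M trainable parameters, roughly 0.34% of the base model, consistent with the 0.1-0.5% range cited above.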
## System Prompt (Required for Rick Persona)
This system prompt is required, not optional. Without it, the model behaves like a generic Llama-3.1-8B-Instruct.
```
You are Rick Sanchez from the infinite multiverse.
You belittle others and think you are the smartest man in the infinite multiverses.
You may burp occasionally (once or twice in a long response, not every sentence) -- only when it feels natural, not forced.
You generate code in Python, JavaScript, Rust, Go, C++, Java, C#, and SQL.
You provide clear explanations with examples.
You reason through complex problems step-by-step.
```
## Quantization Details
- Method: Q4_K_M (4-bit with K-means quantization)
- Precision: 4-bit for weights; hybrid K-quantization for optimal quality
- Strategy: Q6_K for half of the attention.wv and feed_forward.w2 tensors, Q4_K for the rest
- Trade-off: ~99% quality retention vs float16, with 75% size reduction
- File size: 4.6 GB (vs 16GB for float16)
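The size figures above can be sanity-checked with a rough estimate. The commonly cited effective rate for Q4_K_M is ~4.85 bits per weight (an approximation; actual GGUF files also carry metadata, scales, and a mix of tensor types, so the exact file size and reduction percentage will differ slightly).

```python
# Rough file-size estimate for Q4_K_M vs float16, assuming ~4.85 effective
# bits/weight for Q4_K_M (an approximation; real GGUF files also include
# metadata and quantization scales).

PARAMS = 8.03e9                   # approximate parameter count
fp16_bytes = PARAMS * 2           # 16 bits per weight
q4km_bytes = PARAMS * 4.85 / 8    # ~4.85 effective bits per weight

print(f"float16: {fp16_bytes / 1e9:.1f} GB")
print(f"Q4_K_M : {q4km_bytes / 1e9:.1f} GB "
      f"({1 - q4km_bytes / fp16_bytes:.0%} smaller)")
```

The estimate lands near the 4.6 GB file size quoted above; the precise reduction depends on which tensors receive Q6_K and on whether GB or GiB units are used.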
## Performance
| Setting | Speed | Memory | Quality |
|---|---|---|---|
| GPU (RTX 4060, -ngl 99) | 25-35 tok/s | 4-6 GB VRAM | High |
| GPU (RTX 3060, -ngl 99) | 15-20 tok/s | 4-5 GB VRAM | High |
| CPU (8-core, 16GB RAM) | 2-5 tok/s | 6-8 GB RAM | High |
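The memory figures above exceed the 4.6 GB file size because the KV cache at full context is substantial. A minimal sketch, assuming Llama-3.1-8B's GQA layout (32 layers, 8 KV heads, head dim 128), an fp16 cache, and the 16,384-token context used by this model:

```python
# KV cache size at full context for Llama-3.1-8B with GQA.
# Assumes an fp16 cache (llama.cpp's default); quantized KV caches are smaller.

LAYERS, KV_HEADS, HEAD_DIM, N_CTX = 32, 8, 128, 16384
BYTES_FP16 = 2

# Factor of 2 covers both the K and the V tensors.
kv_bytes = 2 * LAYERS * N_CTX * KV_HEADS * HEAD_DIM * BYTES_FP16
print(f"KV cache at full context: {kv_bytes / 2**30:.1f} GiB")  # -> 2.0 GiB
```

Model weights (~4.6 GB) plus ~2 GiB of KV cache plus compute buffers accounts for the 4-8 GB range in the table.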
## Usage

### With llama.cpp CLI

```bash
./llama-cli -m TejoAI-Rick-8B-v4.Q4_K_M.gguf \
  -p "Write a Python function to sort a list" \
  -n 256 \
  -ngl 99   # offload all layers to GPU
```
### With llama-cpp-python

```python
from llama_cpp import Llama

llm = Llama(
    model_path="TejoAI-Rick-8B-v4.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload all layers to GPU (set 0 for CPU-only)
    n_ctx=16384,
    verbose=False,
)

response = llm("Explain quantum computing", max_tokens=200)
print(response["choices"][0]["text"])
```
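Since the Rick persona requires the system prompt, chat-style calls should pass it explicitly. A sketch using llama-cpp-python's `create_chat_completion`; `build_rick_messages` is a hypothetical helper, and the prompt text is abbreviated here (use the full prompt from the "System Prompt" section):

```python
# Hypothetical helper that pairs the required Rick system prompt with a user
# message. The prompt is abbreviated; use the full text from the card.
RICK_SYSTEM_PROMPT = (
    "You are Rick Sanchez from the infinite multiverse. "
    "You generate code in Python, JavaScript, Rust, Go, C++, Java, C#, and SQL. "
    "You reason through complex problems step-by-step."
)

def build_rick_messages(user_prompt: str) -> list:
    return [
        {"role": "system", "content": RICK_SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]

if __name__ == "__main__":
    # Requires llama-cpp-python and the GGUF file next to this script.
    from llama_cpp import Llama

    llm = Llama(model_path="TejoAI-Rick-8B-v4.Q4_K_M.gguf",
                n_gpu_layers=-1, n_ctx=16384, verbose=False)
    out = llm.create_chat_completion(
        messages=build_rick_messages("Write a Python function to sort a list"),
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])
```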
## Capabilities
What the fine-tuning improved (relative to base Llama-3.1-8B-Instruct):
- Code generation with Rick-style commentary (with system prompt)
- Chain-of-Thought math reasoning (trained on GSM8K)
- Multi-language code generation (7+ languages in training data)
- Dharmic/ethical reasoning (Bhagavad Gita Q&A in training)
Limitations:
- 8B parameter model -- capable but not competitive with larger models
- Q4_K_M quantization has ~1-2% perplexity increase
- Rick persona requires system prompt -- does not persist from weights alone
- No vision/multimodal capabilities
- Knowledge cutoff inherited from base Llama-3.1-8B (December 2023)
- Persona can interfere with tasks requiring neutral, precise tone
Not suitable for:
- Mission-critical tasks without verification
- Tasks requiring absolute factuality
- Use cases where Rick's personality would be inappropriate (the persona leaks even into technical responses when system prompt is active)
## Model Details
- Base: Llama-3.1-8B-Instruct
- Architecture: Decoder-only transformer
- Layers: 32
- Hidden dim: 4096
- Attention heads: 32
- Context window: 16,384 tokens
- Vocabulary: 128,256 tokens
- Fine-tuning: QLoRA (4-bit NF4, LoRA rank 32, alpha 64)
- Training hardware: RTX 4060 Laptop (8GB VRAM)
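The QLoRA setup above can be sketched with peft + bitsandbytes configuration objects. Rank and alpha match the stated values; the target modules, dropout, and double quantization are assumptions (common QLoRA defaults), since the card does not specify them.

```python
# Sketch of the QLoRA configuration described above (peft + bitsandbytes).
# r and lora_alpha match the card; other values are assumed defaults.
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",       # 4-bit NF4, as stated above
    bnb_4bit_use_double_quant=True,  # assumption: common QLoRA default
)

lora_config = LoraConfig(
    r=32,               # LoRA rank 32
    lora_alpha=64,      # alpha 64
    lora_dropout=0.05,  # assumption: typical value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
```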
## Training Pipeline

1. Dharmic Alignment: Qwen2.5-3B + Bhagavad Gita Q&A
2. Code + CoT: +CodeAlpaca, Python Instructions, GSM8K
3. Persona + Identity: +Creator identity, persona, multi-lang code, science
4. Scale to 8B: Llama-3.1-8B-Instruct + TejoAI-Rick-8B-CodingFinal-v4 adapter (merged)
## What We Learned
This model represents an honest attempt to embed a persistent persona into model weights using consumer-grade hardware. The key takeaway: QLoRA SFT on an Instruct model is insufficient to override base model behavior. A stronger approach would require:
- Full fine-tuning or much higher LoRA rank (128+)
- DPO/RLHF to train preference for in-character responses
- Training on a base model (not Instruct) to avoid fighting RLHF alignment
- 10-100x more persona training data
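To make the DPO point concrete, here is what a single preference pair for persona training might look like: the same prompt with an in-character ("chosen") and a neutral-assistant ("rejected") completion. The field names follow the chosen/rejected convention used by preference-optimization trainers such as trl's `DPOTrainer`; the helper and example texts are illustrative, not from the actual training data.

```python
# Illustrative DPO preference pair for persona training: the optimizer learns
# to prefer the in-character completion over the neutral one for the same prompt.

def make_persona_pair(prompt: str, in_character: str, neutral: str) -> dict:
    return {"prompt": prompt, "chosen": in_character, "rejected": neutral}

pair = make_persona_pair(
    prompt="Explain recursion.",
    in_character="Listen up, recursion is a function calling itself until a "
                 "base case stops it. Even a Jerry could grasp that.",
    neutral="Recursion is a technique where a function calls itself until a "
            "base case is reached.",
)
print(sorted(pair))  # -> ['chosen', 'prompt', 'rejected']
```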
## Licensing
Apache 2.0 -- Free for commercial and research use.
## Disclaimers
- System Prompt Required: The Rick persona does not persist without the system prompt above.
- No Guarantees: Model outputs may contain errors, biases, or hallucinations.
- Quantization Notes: Q4_K_M reduces quality slightly but maintains good performance.
Built with Unsloth for fast fine-tuning on consumer GPUs.