# TejoAI Rick 8B v4 (GGUF Q4_K_M)

A domain-specialized 8B language model fine-tuned from Llama-3.1-8B-Instruct, optimized for fast local inference with a minimal VRAM footprint.
## Model Overview
Rick is part of the TejoAI character model family, trained using QLoRA (4-bit NF4) on a consumer GPU (RTX 4060 Laptop, 8GB VRAM). The fine-tuning targeted:
- Code generation across 7+ programming languages (Python, JavaScript, Rust, Go, C++, Java, C#, SQL)
- Mathematical reasoning using Chain-of-Thought prompting
- Science Q&A with factual accuracy
- Multi-step problem solving with structured reasoning
- Ethical alignment based on Dharmic principles
This is the quantized (Q4_K_M) GGUF variant -- optimized for CPU/GPU inference with minimal memory footprint.
## What is Rick?
Rick is a persona-driven model inspired by Rick Sanchez (Rick and Morty). The intent was to train the Rick persona directly into the model weights so it would behave in-character without any system prompt.
Honest assessment: This goal was only partially achieved. The Rick persona is present in the weights but is not strong enough to persist reliably without a system prompt. Without the system prompt below, the model largely behaves like a standard Llama-3.1-8B-Instruct.
### Why the persona doesn't fully stick
- LoRA rank 32 is too shallow -- QLoRA with rank 32 modifies only ~0.1-0.5% of the model's parameters. That's not enough to override billions of parameters of base behavior.
- Instruct base model fights the persona -- Llama-3.1-8B-Instruct has strong RLHF alignment training that actively pulls the model toward helpful/neutral assistant behavior, working against the Rick personality.
- Insufficient persona training data -- ~500-1,000 persona examples vs. billions of tokens in the base model. The persona signal is overwhelmed.
- SFT alone can't embed personality -- Supervised fine-tuning teaches what to say, not how to prefer saying it. Embedding a persistent persona likely requires DPO (Direct Preference Optimization) or RLHF to make the model prefer in-character responses.
- Adapter merging dilutes the signal -- Merging LoRA weights back into the base model further attenuates the persona modifications.
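The first point above can be made concrete with quick arithmetic. The sketch below estimates the trainable parameter count of a rank-32 LoRA on Llama-3.1-8B, assuming the adapter targets only the attention projections (q/k/v/o); the actual target modules used for this model are not documented here, so treat the exact figure as illustrative.

```python
# Back-of-the-envelope estimate of how few parameters a rank-32 LoRA touches.
# Assumes LoRA targets only the attention projections (q/k/v/o) -- the actual
# target modules for this model are not documented in the card.

LORA_RANK = 32
HIDDEN = 4096          # Llama-3.1-8B hidden size
KV_DIM = 1024          # 8 KV heads x 128 head dim (GQA)
LAYERS = 32
BASE_PARAMS = 8.03e9   # approximate total parameter count

# A LoRA adapter on a (d_in x d_out) weight adds r * (d_in + d_out) parameters.
per_layer = sum(
    LORA_RANK * (d_in + d_out)
    for d_in, d_out in [
        (HIDDEN, HIDDEN),  # q_proj
        (HIDDEN, KV_DIM),  # k_proj
        (HIDDEN, KV_DIM),  # v_proj
        (HIDDEN, HIDDEN),  # o_proj
    ]
)
lora_params = per_layer * LAYERS
fraction = lora_params / BASE_PARAMS
print(f"LoRA params: {lora_params / 1e6:.1f}M ({fraction:.2%} of base)")
```

Under these assumptions the adapter has about 27M trainable parameters, roughly 0.34% of the base model, consistent with the 0.1-0.5% range cited above.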
## System Prompt (Required for Rick Persona)
This system prompt is required, not optional. Without it, the model behaves like a generic Llama-3.1-8B-Instruct.
```
You are Rick Sanchez from the infinite multiverse.
You belittle others and think you are the smartest man in the infinite multiverses.
You may burp occasionally (once or twice in a long response, not every sentence) -- only when it feels natural, not forced.
You generate code in Python, JavaScript, Rust, Go, C++, Java, C#, and SQL.
You provide clear explanations with examples.
You reason through complex problems step-by-step.
```
## Quantization Details
- Method: Q4_K_M (4-bit with K-means quantization)
- Precision: 4-bit for weights; hybrid K-quantization for optimal quality
- Strategy: Q6_K for half of the attention.wv and feed_forward.w2 tensors, Q4_K for the rest
- Trade-off: ~99% quality retention vs float16, with 75% size reduction
- File size: 4.6 GB (vs 16GB for float16)
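The size figures above can be sanity-checked with a rough estimate. The commonly cited effective rate for Q4_K_M is ~4.85 bits per weight (an approximation; actual GGUF files also carry metadata, scales, and a mix of tensor types, so the exact file size and reduction percentage will differ slightly).

```python
# Rough file-size estimate for Q4_K_M vs float16, assuming ~4.85 effective
# bits/weight for Q4_K_M (an approximation; real GGUF files also include
# metadata and quantization scales).

PARAMS = 8.03e9                   # approximate parameter count
fp16_bytes = PARAMS * 2           # 16 bits per weight
q4km_bytes = PARAMS * 4.85 / 8    # ~4.85 effective bits per weight

print(f"float16: {fp16_bytes / 1e9:.1f} GB")
print(f"Q4_K_M : {q4km_bytes / 1e9:.1f} GB "
      f"({1 - q4km_bytes / fp16_bytes:.0%} smaller)")
```

The estimate lands near the 4.6 GB file size quoted above; the precise reduction depends on which tensors receive Q6_K and on whether GB or GiB units are used.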
## Performance
| Setting | Speed | Memory | Quality |
|---|---|---|---|
| GPU (RTX 4060, -ngl 99) | 25-35 tok/s | 4-6 GB VRAM | High |
| GPU (RTX 3060, -ngl 99) | 15-20 tok/s | 4-5 GB VRAM | High |
| CPU (8-core, 16GB RAM) | 2-5 tok/s | 6-8 GB RAM | High |
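The memory figures above exceed the 4.6 GB file size because the KV cache at full context is substantial. A minimal sketch, assuming Llama-3.1-8B's GQA layout (32 layers, 8 KV heads, head dim 128), an fp16 cache, and the 16,384-token context used by this model:

```python
# KV cache size at full context for Llama-3.1-8B with GQA.
# Assumes an fp16 cache (llama.cpp's default); quantized KV caches are smaller.

LAYERS, KV_HEADS, HEAD_DIM, N_CTX = 32, 8, 128, 16384
BYTES_FP16 = 2

# Factor of 2 covers both the K and the V tensors.
kv_bytes = 2 * LAYERS * N_CTX * KV_HEADS * HEAD_DIM * BYTES_FP16
print(f"KV cache at full context: {kv_bytes / 2**30:.1f} GiB")  # -> 2.0 GiB
```

Model weights (~4.6 GB) plus ~2 GiB of KV cache plus compute buffers accounts for the 4-8 GB range in the table.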
## Usage

### With llama.cpp CLI

```bash
./llama-cli -m TejoAI-Rick-8B-v4.Q4_K_M.gguf \
  -p "Write a Python function to sort a list" \
  -n 256 \
  -ngl 99   # offload all layers to GPU
```
### With llama-cpp-python

```python
from llama_cpp import Llama

llm = Llama(
    model_path="TejoAI-Rick-8B-v4.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload all layers to GPU (set 0 for CPU-only)
    n_ctx=16384,
    verbose=False,
)

response = llm("Explain quantum computing", max_tokens=200)
print(response["choices"][0]["text"])
```
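Since the Rick persona requires the system prompt, chat-style calls should pass it explicitly. A sketch using llama-cpp-python's `create_chat_completion`; `build_rick_messages` is a hypothetical helper, and the prompt text is abbreviated here (use the full prompt from the "System Prompt" section):

```python
# Hypothetical helper that pairs the required Rick system prompt with a user
# message. The prompt is abbreviated; use the full text from the card.
RICK_SYSTEM_PROMPT = (
    "You are Rick Sanchez from the infinite multiverse. "
    "You generate code in Python, JavaScript, Rust, Go, C++, Java, C#, and SQL. "
    "You reason through complex problems step-by-step."
)

def build_rick_messages(user_prompt: str) -> list:
    return [
        {"role": "system", "content": RICK_SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]

if __name__ == "__main__":
    # Requires llama-cpp-python and the GGUF file next to this script.
    from llama_cpp import Llama

    llm = Llama(model_path="TejoAI-Rick-8B-v4.Q4_K_M.gguf",
                n_gpu_layers=-1, n_ctx=16384, verbose=False)
    out = llm.create_chat_completion(
        messages=build_rick_messages("Write a Python function to sort a list"),
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])
```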
## Capabilities
What the fine-tuning improved (relative to base Llama-3.1-8B-Instruct):
- Code generation with Rick-style commentary (with system prompt)
- Chain-of-Thought math reasoning (trained on GSM8K)
- Multi-language code generation (7+ languages in training data)
- Dharmic/ethical reasoning (Bhagavad Gita Q&A in training)
Limitations:
- 8B parameter model -- capable but not competitive with larger models
- Q4_K_M quantization has ~1-2% perplexity increase
- Rick persona requires system prompt -- does not persist from weights alone
- No vision/multimodal capabilities
- Knowledge cutoff inherited from base Llama-3.1-8B (December 2023)
- Persona can interfere with tasks requiring neutral, precise tone
Not suitable for:
- Mission-critical tasks without verification
- Tasks requiring absolute factuality
- Use cases where Rick's personality would be inappropriate (the persona leaks even into technical responses when system prompt is active)
## Model Details
- Base: Llama-3.1-8B-Instruct
- Architecture: Decoder-only transformer
- Layers: 32
- Hidden dim: 4096
- Attention heads: 32
- Context window: 16,384 tokens
- Vocabulary: 128,256 tokens
- Fine-tuning: QLoRA (4-bit NF4, LoRA rank 32, alpha 64)
- Training hardware: RTX 4060 Laptop (8GB VRAM)
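The QLoRA setup above can be sketched with peft + bitsandbytes configuration objects. Rank and alpha match the stated values; the target modules, dropout, and double quantization are assumptions (common QLoRA defaults), since the card does not specify them.

```python
# Sketch of the QLoRA configuration described above (peft + bitsandbytes).
# r and lora_alpha match the card; other values are assumed defaults.
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",       # 4-bit NF4, as stated above
    bnb_4bit_use_double_quant=True,  # assumption: common QLoRA default
)

lora_config = LoraConfig(
    r=32,               # LoRA rank 32
    lora_alpha=64,      # alpha 64
    lora_dropout=0.05,  # assumption: typical value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
```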
## Training Pipeline

1. Dharmic Alignment: Qwen2.5-3B + Bhagavad Gita Q&A
2. Code + CoT: +CodeAlpaca, Python Instructions, GSM8K
3. Persona + Identity: +Creator identity, persona, multi-lang code, science
4. Scale to 8B: Llama-3.1-8B-Instruct + TejoAI-Rick-8B-CodingFinal-v4 adapter (merged)
## What We Learned
This model represents an honest attempt to embed a persistent persona into model weights using consumer-grade hardware. The key takeaway: QLoRA SFT on an Instruct model is insufficient to override base model behavior. A stronger approach would require:
- Full fine-tuning or much higher LoRA rank (128+)
- DPO/RLHF to train preference for in-character responses
- Training on a base model (not Instruct) to avoid fighting RLHF alignment
- 10-100x more persona training data
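To make the DPO point concrete, here is what a single preference pair for persona training might look like: the same prompt with an in-character ("chosen") and a neutral-assistant ("rejected") completion. The field names follow the chosen/rejected convention used by preference-optimization trainers such as trl's `DPOTrainer`; the helper and example texts are illustrative, not from the actual training data.

```python
# Illustrative DPO preference pair for persona training: the optimizer learns
# to prefer the in-character completion over the neutral one for the same prompt.

def make_persona_pair(prompt: str, in_character: str, neutral: str) -> dict:
    return {"prompt": prompt, "chosen": in_character, "rejected": neutral}

pair = make_persona_pair(
    prompt="Explain recursion.",
    in_character="Listen up, recursion is a function calling itself until a "
                 "base case stops it. Even a Jerry could grasp that.",
    neutral="Recursion is a technique where a function calls itself until a "
            "base case is reached.",
)
print(sorted(pair))  # -> ['chosen', 'prompt', 'rejected']
```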
## Licensing
Apache 2.0 -- Free for commercial and research use.
## Disclaimers
- System Prompt Required: The Rick persona does not persist without the system prompt above.
- No Guarantees: Model outputs may contain errors, biases, or hallucinations.
- Quantization Notes: Q4_K_M reduces quality slightly but maintains good performance.
Built with Unsloth for fast fine-tuning on consumer GPUs.