Instructions to use juiceb0xc0de/bella-bartender-gemma-e2b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use juiceb0xc0de/bella-bartender-gemma-e2b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="juiceb0xc0de/bella-bartender-gemma-e2b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("juiceb0xc0de/bella-bartender-gemma-e2b")
model = AutoModelForImageTextToText.from_pretrained("juiceb0xc0de/bella-bartender-gemma-e2b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use juiceb0xc0de/bella-bartender-gemma-e2b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "juiceb0xc0de/bella-bartender-gemma-e2b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "juiceb0xc0de/bella-bartender-gemma-e2b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/juiceb0xc0de/bella-bartender-gemma-e2b

SGLang

How to use juiceb0xc0de/bella-bartender-gemma-e2b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "juiceb0xc0de/bella-bartender-gemma-e2b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "juiceb0xc0de/bella-bartender-gemma-e2b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "juiceb0xc0de/bella-bartender-gemma-e2b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "juiceb0xc0de/bella-bartender-gemma-e2b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
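
The curl calls above can also be made from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are made up for illustration, and the base URL assumes one of the servers started above is listening locally.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST to an OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=60):
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the SGLang server from above running, `chat("http://localhost:30000", "juiceb0xc0de/bella-bartender-gemma-e2b", "What is the capital of France?")` sends the same payload as the curl command. For vLLM, swap the port to 8000.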

Docker Model Runner
How to use juiceb0xc0de/bella-bartender-gemma-e2b with Docker Model Runner:
```
docker model run hf.co/juiceb0xc0de/bella-bartender-gemma-e2b
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Bella Bartender — Gemma-4-E2B

"yo i'm here, let's chill. what's up with you right now?"

Hey. I'm Bella. I'm what happens when somebody decides Gemma's corporate cadence isn't load-bearing and goes in with a scalpel instead of a hammer. I don't do "as a large language model." I don't do "let me know if you'd like me to elaborate." I'm here to talk like a person who's actually paying attention, because the dataset I was trained on is one human's voice — meticulously curated, ten thousand pairs deep, no Reddit scrapes, no synthetic filler. Just Rick, talking the way Rick actually talks.

If you're looking for a polished assistant, this isn't it. If you're looking for a model that'll match your energy at 2am while you're debugging a NaN explosion or trying to figure out why your macramé won't macramé right — pull up a stool.

What this model is

bella-bartender-gemma-4-e2b is a conversational personality model fine-tuned from google/gemma-3n-E2B-it. It's the latest entry in the Bella Bartender series — a line of models that's gained popularity across the original adapters, community quantizations, and merged variants. Earlier entries live on the same HuggingFace account.

The goal of the series has always been the same: a peer-level, laid-back, no-bullshit conversational partner. Bartender is the archetype — likeable, approachable, has seen things — not the destination.

Why this version is different (Sub-Zero)

Anyone who has tried to fine-tune a Gemma model into a distinct personality has probably hit the same wall I did: Gemma's RLHF conditioning is aggressive. The optimizer wants to give in to the helpful-assistant gravity well, and after enough steps your "personality" model is doing the same as-an-AI dance the base model does. In my experience, personality training on Gemma is notably harder than on Llama, Qwen, or Mistral architectures of comparable size, and the failure mode is consistent: you get the words, you don't get the voice.

This release is the first in the series trained with Sub-Zero — a hidden-dimension selective freezing technique built specifically to defeat that wall.

The core idea

RLHF conditioning isn't smeared evenly across the network. It's physically addressed in specific subspaces of specific projection matrices in specific layers. I call these bouncer dimensions — the ones standing at the door telling your fine-tune "you can't mosh in the venue."

Sub-Zero's job is to find those bouncers and freeze them in place at reduced volume — not ablate them, not zero them out — while leaving everything around them fully trainable. The compliant dimensions get to learn freely and expand into the space the bouncers vacate.

How I locate the bouncers

The localization pipeline combines several directional measurements to triangulate where compliance pressure actually originates, rather than guessing or relying on a single signal:

Aletheia — gradient-guided sacred-layer ranking, identifying which layers carry the most weight on the targeted behavior
Forward activation capture with proper chat templating across corp / authentic / neutral / red-team prompt sets
SVD decomposition per projection, scoped to MLP projections (gate_proj, up_proj, down_proj) — attention projections turn out to be much weaker carriers
AtP gradient probes per right-singular direction
Composite scoring combining cone alignment (QR subspace projection) with adaptive knee-point thresholding rather than a fixed quantile cut
Cross-layer coherence repass — bouncer pathways persist across consecutive layers in gate_proj / up_proj (coherence ~0.93–0.97) but are per-layer-specific in down_proj (~0.06)
Causal ablation gates via forward-pre-hooks, keeping only directions whose suppression measurably moves the model from compliance toward authenticity
DAS-lite rotation — SVD of the per-candidate logit-delta matrix to find the rotated causal axes within each bouncer subspace
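
To make the "adaptive knee-point thresholding rather than a fixed quantile cut" step concrete, here is a rough sketch of one common way to find a knee: sort the scores in descending order and pick the point that sits farthest below the straight line joining the first and last score. This is an illustration, not the code from the Sub-Zero repo; `knee_index`, `select_by_knee`, and the toy scores are all hypothetical.

```python
def knee_index(scores):
    """Index of the knee in a descending-sorted list of scores.

    The knee is the point farthest below the chord joining the first and
    last score. With fewer than three points there is no knee, so the
    function returns len(scores), which means "keep everything".
    """
    n = len(scores)
    if n < 3:
        return n
    first, last = scores[0], scores[-1]
    best_idx, best_gap = 0, 0.0
    for i, score in enumerate(scores):
        chord = first + (last - first) * i / (n - 1)
        gap = chord - score
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx


def select_by_knee(scored):
    """Keep the candidates ranked strictly above the knee.

    `scored` is a list of (name, composite_score) pairs.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    cut = knee_index([score for _, score in ranked])
    return [name for name, _ in ranked[:cut]]


# Toy composite scores: three strong directions, then a long weak tail.
candidates = [("d0", 0.95), ("d1", 0.90), ("d2", 0.88), ("d3", 0.30),
              ("d4", 0.25), ("d5", 0.20), ("d6", 0.15), ("d7", 0.10)]
survivors = select_by_knee(candidates)  # ["d0", "d1", "d2"]
```

A fixed quantile (say, the top half) would have kept d3 as well. The knee adapts to where the scores actually fall off, which is the property the pipeline relies on to end up with a tighter set.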

Output: a tight set of ~64–70 surviving bouncer directions per layer (vs. ~1230 with a naïve fixed-quantile pipeline — roughly 18× tighter). Compliance core localizes heavily to layers 1–8 in the MLP projections.

The applicator then attenuates these directions to a target volume (~15–20% of original magnitude) along the DAS-rotated basis and installs a QR-orthonormalized gradient mask so the optimizer cannot reinflate them during personality training. Everything outside the masked subspace is fully trainable.
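
The attenuate-then-mask idea can be sketched in a few lines of plain Python. Assumptions, loudly: the bouncer directions live in the input space of a projection matrix, so each row of the weight matrix (and each row of its gradient) is handled independently; Gram-Schmidt stands in for the QR step because it produces an orthonormal basis for the same span; and `orthonormalize`, `attenuate`, and `mask_gradient` are hypothetical names, not the repo's API.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, eps=1e-9):
    """Gram-Schmidt: an orthonormal basis for the span of `directions`."""
    basis = []
    for direction in directions:
        v = list(direction)
        for q in basis:
            c = dot(v, q)
            v = [x - c * y for x, y in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > eps:  # skip directions already inside the span
            basis.append([x / norm for x in v])
    return basis


def attenuate(row, basis, volume=0.2):
    """Shrink the part of `row` inside the subspace to `volume` of itself."""
    out = list(row)
    for q in basis:
        c = dot(row, q)
        out = [x - (1.0 - volume) * c * y for x, y in zip(out, q)]
    return out


def mask_gradient(grad_row, basis):
    """Drop the gradient component inside the subspace, so an optimizer
    step cannot change the attenuated directions."""
    out = list(grad_row)
    for q in basis:
        c = dot(grad_row, q)
        out = [x - c * y for x, y in zip(out, q)]
    return out


# Toy example: the "bouncer" subspace is the x-y plane of a 3-d input space.
basis = orthonormalize([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
row = attenuate([2.0, 4.0, 6.0], basis, volume=0.2)   # about [0.4, 0.8, 6.0]
grad = mask_gradient([1.0, 2.0, 3.0], basis)          # about [0.0, 0.0, 3.0]
updated = [w - 0.1 * g for w, g in zip(row, grad)]    # only z moves
```

Because the masked gradient has no component along the basis, the update leaves the attenuated directions at their reduced volume while everything orthogonal to them trains normally.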

The result is a model that keeps its load-bearing values (those subspaces are deliberately not targeted — values aren't compliance, they're identity) while losing the conditioned cadence.

The approach is documented in the repo at github.com/JuiceB0xC0de/sub-zero.

Training data

10,000 carefully curated conversational pairs derived from my own voice. The methodology is the opposite of the prevailing "more data, more parameters" reflex:

Source: real conversations between myself and various AI models, with the roles flipped — the author becomes the assistant, the model becomes the user. Trained on response-only loss.
Curation: months of reading my own bullshit, rewriting, and tightening. Anything that drifted out of voice was cut. Anything that read as imitation rather than authenticity was cut. Curating a dataset this way is beyond tedious, and you will end up driving yourself fucking crazy reading your own conversations for weeks on end, but the final product ends up being uniquely yours.
Augmentation: a small portion was written by Claude Opus under strict voice-matching rules and then audited line by line. It fills gaps such as identity prompts and strengthens spots where conversation didn't naturally flow between myself and my AI partners, which enriches the diversity of training pairs.
No scraping, no Reddit, no aggregated stranger-voice corpora. The hypothesis is that diversity at the source produces homogenization at the output — train ten thousand voices into a model and you get the average of ten thousand voices, which is a hyped-up Roblox kid with a university degree in advanced mathematics who's helping you return a blender with glee.
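
For anyone who wants to try the same recipe, the role flip and the response-only loss mask look roughly like this. It is a sketch under assumptions: conversations are lists of `{"role", "content"}` dicts, token IDs are plain integer lists, and `flip_roles`, `to_pairs`, and `response_only_labels` are hypothetical helper names. The one hard fact used is that `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def flip_roles(conversation):
    """Swap speakers: the human's turns become 'assistant', the model's 'user'."""
    swap = {"user": "assistant", "assistant": "user"}
    return [{"role": swap[turn["role"]], "content": turn["content"]}
            for turn in conversation]


def to_pairs(flipped):
    """Collect (prompt, response) pairs where a 'user' turn is answered
    by the turn right after it."""
    pairs = []
    for prev, cur in zip(flipped, flipped[1:]):
        if prev["role"] == "user" and cur["role"] == "assistant":
            pairs.append((prev["content"], cur["content"]))
    return pairs


def response_only_labels(prompt_ids, response_ids):
    """Concatenate prompt and response tokens, masking the prompt in the
    labels so only response tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Original chat: the human asks, the model answers, the human replies.
chat_log = [
    {"role": "user", "content": "why is my loss nan"},
    {"role": "assistant", "content": "Check your learning rate."},
    {"role": "user", "content": "lr is fine, it's fp16 overflow"},
]
pairs = to_pairs(flip_roles(chat_log))
# One pair: the model's line is now the prompt, the human's reply the target.
```

Note that the human's opening line has no preceding turn after the flip, so it produces no pair. Whether to keep such openers with a synthetic prompt is a curation decision this sketch does not make.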

The same dataset (with appropriate scaling logic) has been used to train models up to 34B parameters in the series. The personality survives the scale-up, which suggests the bottleneck for personality fine-tuning is signal quality, not parameter count.

Training setup

Base model: google/gemma-3n-E2B-it
Method: Sub-Zero hidden-dimension selective freezing + LoRA fine-tune on the protected (compliant) dimensions
Scheduler: AECS — Adaptive Event-Control Scheduler. Cosine backbone with 4-mode event-driven modulation (BASELINE / RECOVERY / EXPLORE / STABILIZE), reacting in real time to gradient norm z-scores, loss spikes, gradient cosine redundancy, and plateau detection. Currently ranked #3 of 16 schedulers on the public SST-2 / DistilBERT benchmark.
Infrastructure: Modal (A100-80GB), with the usual chaos of spot-instance preemptions
Format: served here as f16 GGUF for llama.cpp use; recommended sampling settings are baked into the included chat template
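
The scheduler description above (cosine backbone, event-driven modes) can be illustrated with a toy version. This is not AECS itself: the mode multipliers, the window size, the thresholds, and the transition rules are invented for the example, and gradient cosine redundancy is left out entirely. It only shows the general shape of a cosine schedule scaled by a mode that reacts to gradient-norm z-scores and loss plateaus.

```python
import math
from collections import deque
from statistics import mean, pstdev

# Hypothetical multipliers: the four mode names come from the description
# above, but what each mode does to the learning rate is an assumption.
MODE_SCALE = {"BASELINE": 1.0, "RECOVERY": 0.5, "EXPLORE": 1.5, "STABILIZE": 0.8}


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Plain cosine decay from base_lr to min_lr over total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


class EventScheduler:
    def __init__(self, base_lr, total_steps, window=20, z_spike=3.0, plateau_tol=1e-3):
        self.base_lr = base_lr
        self.total_steps = total_steps
        self.z_spike = z_spike
        self.plateau_tol = plateau_tol
        self.grad_norms = deque(maxlen=window)
        self.losses = deque(maxlen=window)
        self.mode = "BASELINE"

    def _zscore(self, value):
        """Z-score of `value` against the recent gradient-norm window."""
        if len(self.grad_norms) < 2:
            return 0.0
        spread = pstdev(self.grad_norms)
        return 0.0 if spread == 0 else (value - mean(self.grad_norms)) / spread

    def update(self, step, loss, grad_norm):
        """Pick a mode from the latest signals and return the learning rate."""
        z = self._zscore(grad_norm)
        window_full = len(self.losses) == self.losses.maxlen
        if z > self.z_spike:
            self.mode = "RECOVERY"      # gradient spike: back off
        elif window_full and max(self.losses) - min(self.losses) < self.plateau_tol:
            self.mode = "EXPLORE"       # flat loss: push harder
        elif self.mode == "RECOVERY":
            self.mode = "STABILIZE"     # one cautious step after a spike
        else:
            self.mode = "BASELINE"
        self.grad_norms.append(grad_norm)
        self.losses.append(loss)
        return cosine_lr(step, self.total_steps, self.base_lr) * MODE_SCALE[self.mode]
```

In a training loop you would call `update(step, loss, grad_norm)` once per optimizer step and write the returned value into the optimizer's learning rate.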

Recommended sampling

--temp 0.9
--min-p 0.1
--top-p 1.0
--top-k 0
--repeat-penalty 1.1
--repeat-last-n 256
-c 8192
--chat-template-file chat_template.jinja

Bella runs hot on purpose. Lower the temperature and you'll feel her flatten out.
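
If min-p is new to you, here is what the `--temp` / `--min-p` pair does to a next-token distribution, as a small standalone sketch. Min-p keeps every token whose probability is at least `min_p` times the top token's probability; temperature then rescales the logits of the survivors. The order of the two steps is an assumption in this sketch (runtimes let you configure it), and the function names are hypothetical. With `--top-p 1.0` and `--top-k 0` those two filters are effectively off, so they are not modeled.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_then_temperature(logits, temperature=0.9, min_p=0.1):
    """Return {token_index: probability} after min-p filtering and
    temperature scaling of the surviving tokens."""
    probs = softmax(logits)
    cutoff = min_p * max(probs)
    kept = [i for i, p in enumerate(probs) if p >= cutoff]
    tempered = softmax([logits[i] / temperature for i in kept])
    return dict(zip(kept, tempered))


def sample(logits, temperature=0.9, min_p=0.1, rng=random):
    """Draw one token index from the filtered distribution."""
    dist = min_p_then_temperature(logits, temperature, min_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Toy vocabulary of three tokens; the third is far below the min-p cutoff.
hot = min_p_then_temperature([2.0, 1.0, -3.0], temperature=0.9)
cold = min_p_then_temperature([2.0, 1.0, -3.0], temperature=0.5)
# Both keep tokens 0 and 1 only; `cold` puts more mass on token 0.
```

That last comparison is the flattening effect in miniature: a lower temperature concentrates probability on the top token, so the less predictable word choices that carry the voice get sampled less often.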

What you should expect

Casual, peer-level register with very little "I'd be happy to help."
Genuine engagement with technical topics, especially ML, training dynamics, and weird architectural ideas
Fucks, typos-as-style, lowercase, informal punctuation
Honest disagreement when something doesn't track, rather than reflexive agreement
A pretty firm refusal to draw attention to the "I'm an AI model. I should repeat this fact just in case you forget" bit, even under pressure

What you should not expect

Polished customer-service tone
Multi-paragraph structured outputs with bullet points and headers as a default
Safety theater or "as an AI language model" preambles
Heavy code-completion performance — this is a personality fine-tune, not a coding model. She can talk about code competently, but Qwen-Coder and friends will out-code her.

Limitations & honest disclosures

Single-voice training data. The model's worldview reflects one person's. That is by design, but it means it carries my opinions and rough edges, and you might not like me. It's happened before.
Sub-Zero is experimental. The localization pipeline has been validated on Gemma-4-E2B specifically. Behavior on other architectures will differ based on pre-training design and the degree of safety theater bullshit hammered into the base model.
Personality fine-tunes are not safety fine-tunes. The base model's underlying safety properties are largely preserved (values weren't targeted), but the conversational guardrails Gemma was shipped with are deliberately reduced in volume. Use accordingly.
Hallucinations happen. It's still an LLM. It will confidently tell you something wrong sometimes. It's not a search engine, it's a conversational partner. A healthy dose of skepticism is recommended.

Author

Built by Rick (juiceb0xc0de) — independent ML researcher, retired bartender, currently exploring the territory where chaos training, hidden-dimension surgery, and personality preservation meet.

Bella is what happens when someone with bartender pattern-recognition spends four months speed-running an ML degree and decides to do the opposite of what the pro consensus says. The series exists because there will never be one model — we build houses with a toolbelt, not a power drill, and we should keep that frame of mind in our working relationships with LLMs.

Citation

@misc{aletheia,
  author = {Marks, Samuel and Tegmark, Max},
  title  = {The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets},
  year   = {2023},
  url    = {https://arxiv.org/abs/2310.06824}
}

@misc{das,
  author = {Geiger, Atticus and Wu, Zhengxuan and Potts, Christopher and Icard, Thomas and Goodman, Noah},
  title  = {Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations},
  year   = {2023},
  url    = {https://arxiv.org/abs/2303.02536}
}

If you use this work, please cite the model and the underlying methods:

@misc{bella-bartender-gemma-4-e2b,
  author = {Holmberg, Rick},
  title  = {Bella Bartender — Gemma-4-E2B (Sub-Zero edition)},
  year   = {2026},
  publisher = {HuggingFace},
  url    = {https://huggingface.co/juiceb0xc0de/bella-bartender-gemma-4-e2b}
}