# gemma-4-31B-uncensored-heretic · MLX 8-bit

8-bit MLX conversion of llmfan46/gemma-4-31B-it-uncensored-heretic, a fine-tune of Google's Gemma 4 31B Instruct. Quantized to ~8.6 bits per weight using mlx-vlm v0.4.3 on Apple Silicon.

## Performance (Apple M4 Max, 128 GB)

- Peak memory: ~34 GB
- Prompt throughput: ~20.6 tok/s
- Generation speed: ~14.5 tok/s
## Requirements

```bash
pip install -U mlx-vlm
```

Gemma 4 support requires mlx-vlm >= 0.4.3. Standard mlx-lm does not yet support the gemma4 architecture.
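If you want to verify the minimum version from a script before loading the model, a simple check against installed package metadata works; this is a minimal sketch using only the standard library (the helper names are our own):

```python
from importlib.metadata import PackageNotFoundError, version

def version_tuple(v: str) -> tuple:
    """Turn a dotted version string like '0.4.3' into a comparable tuple."""
    return tuple(int(part) for part in v.split(".")[:3] if part.isdigit())

def has_min_version(package: str, minimum: str) -> bool:
    """Return True if `package` is installed at `minimum` or newer."""
    try:
        return version_tuple(version(package)) >= version_tuple(minimum)
    except PackageNotFoundError:
        return False

if __name__ == "__main__":
    print("mlx-vlm >= 0.4.3:", has_min_version("mlx-vlm", "0.4.3"))
```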
## Usage

### Text only

```bash
python -m mlx_vlm generate \
  --model TxemAI/gemma-4-31B-uncensored-heretic-mlx-8bit \
  --prompt "Your prompt here" \
  --max-tokens 512
```

### With image

```bash
python -m mlx_vlm generate \
  --model TxemAI/gemma-4-31B-uncensored-heretic-mlx-8bit \
  --prompt "Describe this image." \
  --image path/to/image.jpg \
  --max-tokens 512
```
Python API
pythonfrom mlx_vlm import load, generate
model, processor = load("TxemAI/gemma-4-31B-uncensored-heretic-mlx-8bit")
response = generate(
model,
processor,
prompt="Your prompt here",
max_tokens=512,
temperature=0.7,
)
print(response)
## Memory requirements

| Precision | VRAM |
|---|---|
| BF16 (full) | 62 GB |
| Q8 (this model) | 34 GB |
| Q4 | ~18 GB |
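These figures are roughly parameter count × bits per weight. A back-of-the-envelope sketch, assuming 31B parameters (taken from the model name) and the ~8.6 bpw effective rate stated above; the ~4.5 bpw figure for Q4 is our own assumption to match the table:

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 31e9  # from the "31B" in the model name

for label, bpw in [("BF16", 16), ("Q8, ~8.6 bpw effective", 8.6), ("Q4, ~4.5 bpw (assumed)", 4.5)]:
    print(f"{label}: {model_size_gb(N_PARAMS, bpw):.1f} GB")
```

Activation memory and KV cache come on top of these weight sizes, which is why peak usage lands slightly above the raw figures.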
## Notes

- The model activates Gemma 4's thinking channel (`<|channel>thought`) on reasoning-heavy prompts — this is expected behaviour.
- The mel filter warning on load is harmless; it relates to the audio encoder and does not affect text or vision inference.
- This is an unofficial community conversion. For the original fine-tune, see llmfan46/gemma-4-31B-it-uncensored-heretic.
## Conversion

```bash
python -m mlx_vlm convert \
  --hf-path llmfan46/gemma-4-31B-it-uncensored-heretic \
  --mlx-path ./gemma-4-31B-uncensored-heretic-mlx-8bit \
  --quantize --q-bits 8
```
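The "~8.6 bits per weight" figure can be reproduced by dividing the on-disk size of the converted checkpoint by the parameter count. A minimal sketch using only the standard library; the output path and the 31B parameter count are assumptions:

```python
from pathlib import Path

def dir_size_bytes(path: str) -> int:
    """Total size of all files under `path` (the converted checkpoint dir)."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())

def effective_bpw(total_bytes: int, n_params: float) -> float:
    """Effective bits per weight: total stored bits / parameter count."""
    return total_bytes * 8 / n_params

# Example (path is hypothetical; run after the conversion step above):
# size = dir_size_bytes("./gemma-4-31B-uncensored-heretic-mlx-8bit")
# print(f"{effective_bpw(size, 31e9):.2f} bpw")
```

The effective rate comes out above the nominal 8 bits because quantization scales, biases, and any unquantized layers (e.g. embeddings) are counted in the file size.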
## Credits

- Google DeepMind — Gemma 4 base model
- llmfan46 — uncensored-heretic fine-tune
- ml-explore — MLX framework
- Blaizzy — mlx-vlm library
## Model tree

- Base model: google/gemma-4-31B-it
- Fine-tune: llmfan46/gemma-4-31B-it-uncensored-heretic
- This model: TxemAI/gemma-4-31B-uncensored-heretic-mlx-8bit