# gemma-4-E2B-it-uncensored-mnn-int4

A 4-bit HQQ-quantized conversion of TrevorJS/gemma-4-E2B-it-uncensored to MNN format, for use with the MNN inference engine.
## Model Information
| Property | Value |
|---|---|
| Base Model | google/gemma-4-E2B-it |
| Uncensored Fork | TrevorJS/gemma-4-E2B-it-uncensored |
| Quantization | 4-bit HQQ (Half-Quadratic Quantization) |
| Embedding Quantization | 4-bit (PLE embeddings also quantized) |
| Total Size | ~3.5 GB |
| LLM Weights | 1.4 GB |
| PLE Embeddings | 1.4 GB |
## File Structure

| File | Size | Description |
|---|---|---|
| `llm.mnn` | 2.2 MB | LLM model structure (converted from ONNX) |
| `llm.mnn.weight` | 1.4 GB | LLM weights (4-bit HQQ quantized) |
| `ple_embeddings_int4.bin` | 1.4 GB | Per-Layer Embeddings (4-bit quantized) |
| `visual.mnn` / `.weight` | 217 MB | Vision tower (optional) |
| `audio.mnn` / `.weight` | 564 MB | Audio tower (optional) |
| `tokenizer.mtok` | 9.7 MB | MNN tokenizer |
| `config.json` | - | MNN runtime configuration |
| `llm_config.json` | - | LLM architecture configuration |
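Before running inference, it can be useful to verify that the files from the table above are all present in the model directory. A minimal sketch (the file names come from the table; the directory path is whatever you downloaded the model to; the vision and audio towers are treated as optional, as noted above):

```python
from pathlib import Path

# Required files for text-only inference; visual/audio towers are optional.
REQUIRED = [
    "llm.mnn",
    "llm.mnn.weight",
    "ple_embeddings_int4.bin",
    "tokenizer.mtok",
    "config.json",
    "llm_config.json",
]

def missing_files(model_dir):
    """Return the names of required files absent from model_dir."""
    d = Path(model_dir)
    return [name for name in REQUIRED if not (d / name).is_file()]

if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all required model files present")
```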
## Usage

### Local Inference (CPU)
```bash
# Clone MNN source
git clone https://github.com/alibaba/MNN.git

# Build with the CPU backend and LLM support
cd MNN
mkdir build && cd build
cmake .. -DMNN_LOW_MEMORY=true \
    -DMNN_CPU_WEIGHT_DEQUANT_GEMM=true \
    -DMNN_BUILD_LLM=true \
    -DMNN_SUPPORT_TRANSFORMER_FUSE=true
make -j$(nproc)

# Run inference
echo "Hello, who are you?" > prompt.txt
./llm_demo /path/to/config.json prompt.txt
```
### MNN Chat (Android)
A custom Android APK with Gemma 4 support is available in the companion GitHub repository:
https://github.com/Tiggy-Chan/gemma4-mnn-android
Download the pre-built APK directly: `app-standard-release-tiggy-gemma4.apk` (35 MB)
Or follow the build instructions to compile your own APK.
Download the model folder to your device and import it via "Add Local Model".
### Configuration Example
```json
{
  "llm_model": "llm.mnn",
  "llm_weight": "llm.mnn.weight",
  "tokenizer_file": "tokenizer.mtok",
  "backend_type": "cpu",
  "thread_num": 4,
  "precision": "low",
  "memory": "low",
  "sampler_type": "mixed",
  "temperature": 1.0,
  "top_k": 64,
  "top_p": 0.95
}
```
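When experimenting with sampler settings (e.g. `thread_num`, `temperature`, `top_k`, `top_p`), it can be convenient to generate the configuration programmatically rather than edit it by hand. A sketch, assuming only the keys and default values shown in the example above (`write_config` and its parameters are hypothetical helpers, not part of MNN):

```python
import json

def write_config(path, thread_num=4, temperature=1.0, top_k=64, top_p=0.95):
    """Write an MNN runtime config using the fields from the example above."""
    cfg = {
        "llm_model": "llm.mnn",
        "llm_weight": "llm.mnn.weight",
        "tokenizer_file": "tokenizer.mtok",
        "backend_type": "cpu",
        "thread_num": thread_num,
        "precision": "low",
        "memory": "low",
        "sampler_type": "mixed",
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
    }
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg
```

The resulting file can then be passed to `llm_demo` as shown in the Local Inference section.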
## Other Quantization Versions
| Version | Size | Repo |
|---|---|---|
| INT4 HQQ | 3.5 GB | this repo |
| INT8 | 5.7 GB | gemma-4-E2B-it-uncensored-mnn-int8 |
| BF16 (full precision) | 9.5 GB | gemma-4-E2B-it-uncensored-mnn-bf16 |
## Conversion

Converted using MNN's `llmexport.py`:

```bash
cd MNN/transformers/llm/export
python3 llmexport.py \
    --path TrevorJS/gemma-4-E2B-it-uncensored \
    --export mnn \
    --quant_bit 4 \
    --embed_bit 4 \
    --hqq \
    --dst_path ./output
```
## License
This model is distributed under the Apache License 2.0, inherited from the original Gemma 4 model by Google DeepMind. See the LICENSE file for the full license text.
Usage of Gemma models is also subject to the Gemma Terms of Use.
## Acknowledgments
- Base model: google/gemma-4-E2B-it by Google DeepMind
- Uncensored variant: TrevorJS/gemma-4-E2B-it-uncensored by TrevorJS
- MNN inference engine: Alibaba MNN