gemma-3-1b-it-FlashHead

Optimized version of gemma-3-1b-it using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

FlashHead
vLLM plugin via flash-head

FlashHead matches the gemma-3-1b-it baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.

Quickstart

pip install flash-head
vllm serve embedl/gemma-3-1b-it-FlashHead

Model Details

Field	Value
Base Model	gemma-3-1b-it
Input / Output	Text → Text
Release Date	2025-12-08
Version	1.0
Optimizations	FlashHead LM Head
Developers	Embedl
Licenses	Upstream: Gemma Terms of Use. Optimized components: Embedl Models Community Licence v1.0 (no redistribution)
Intended Use	Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs

Optimizations

FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
vLLM Plugin Integration - compatible with vLLM (0.14.0+) via the flash-head plugin.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

Precision	Tokens/sec	Speedup vs BF16
BF16 baseline	148	1.0×
FlashHead (Embedl)	178	1.20×
W4A16 baseline	243	1.64x×
FlashHead W4A16 (Embedl)	336	2.27×

FlashHead improves end-to-end speed by 1.38× over state-of-the-art, while maintaining full accuracy parity.

Measurement setup: vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.

Accuracy (Parity with Baseline)

Method	MMLU-Pro	IFEval	BBH	TruthfulQA	GSM8K
Baseline	0.15	0.55	0.38	0.31	0.42
FlashHead	0.15	0.49	0.38	0.31	0.39

FlashHead closely matches baseline accuracy.

Installation

pip install flash-head

The flash-head vLLM plugin is required. It activates automatically at startup.

Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).

vLLM Inference

from vllm import LLM, SamplingParams

model_id = "embedl/gemma-3-1b-it-FlashHead"

if __name__ == "__main__":
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
    
    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)