GenomeOcean-100M-v1.2-AWQ

This is a 4-bit AWQ (Activation-aware Weight Quantization) version of GenomeOcean-100M-v1.2.

Model Details

  • Base Model: GenomeOcean-100M-v1.2
  • Quantization Method: AWQ
  • Bits: 4-bit
  • Group Size: 128
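To make the "4-bit, group size 128" settings concrete, here is a minimal sketch of group-wise 4-bit quantization: weights are split into groups of 128, each group stored as 4-bit integers with its own scale and minimum. This is illustrative round-to-nearest only, not the actual AWQ algorithm, which additionally rescales salient channels using activation statistics.

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=128):
    """Quantize a flat weight vector in groups, one scale/offset per group.

    Illustrative round-to-nearest sketch; real AWQ also applies
    activation-aware per-channel scaling before quantizing.
    """
    qmax = 2 ** bits - 1  # 15 for 4-bit
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - w_min) / qmax
    q = np.round((w - w_min) / scale).clip(0, qmax)
    return q.astype(np.uint8), scale, w_min

def dequantize(q, scale, w_min):
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, scale, w_min = quantize_groupwise(w)
# Per-group scaling bounds the reconstruction error to half a quantization step
max_err = np.abs(dequantize(q, scale, w_min).reshape(-1) - w).max()
```

Smaller groups give finer scales (lower error) at the cost of more per-group metadata; 128 is a common middle ground.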

Benchmark Results

| Metric | Original (FP16) | Quantized (AWQ) | Change |
|---|---|---|---|
| VRAM Footprint | 228.3 MB | 68.4 MB | -70.0% |
| NLL Loss | 6.2110 | 6.2790 | - |
| Perplexity (PPL) | 498.1917 | 533.2738 | +7.04% (Drift) |
| Total Gen Time (50 seqs) | 28.5s | 21.4s | -25.0% (Faster) |
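The VRAM row can be sanity-checked with rough arithmetic. A sketch under simplifying assumptions (every parameter quantized, one fp16 scale plus a packed 4-bit zero-point per group of 128); the measured numbers are somewhat higher, partly because some layers stay unquantized and the parameter count is slightly above 100M:

```python
def fp16_size_mb(n_params):
    # 2 bytes per parameter
    return n_params * 2 / 1e6

def awq4_size_mb(n_params, group_size=128):
    packed = n_params * 0.5          # 4 bits = 0.5 bytes per weight
    n_groups = n_params / group_size
    overhead = n_groups * (2 + 0.5)  # fp16 scale + packed 4-bit zero per group
    return (packed + overhead) / 1e6

print(fp16_size_mb(100e6))  # 200.0 (measured: 228.3 MB)
print(awq4_size_mb(100e6))  # ~52   (measured: 68.4 MB)
```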

Analysis

  • Inference Speed: AWQ quantization provides a significant speedup (approx. 25%) when used with optimized kernels (e.g., vLLM). This is due to reduced memory bandwidth pressure and highly optimized CUDA kernels for 4-bit GEMM.
  • Precision: There is a moderate perplexity drift (+7.04%), which is higher than GGUF (Q4_K_M) but acceptable for most downstream tasks.
  • Repetitive Patterns: At this small scale (100M parameters), the AWQ build maintains a reasonable filter pass rate on generated sequences, though it escapes repetitive sequences slightly less reliably than the GGUF (Q4_K_M) build.
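Since perplexity is the exponential of the mean NLL, the precision figures above can be cross-checked directly from the benchmark table:

```python
import math

# NLL values from the benchmark table
nll_fp16, nll_awq = 6.2110, 6.2790

ppl_fp16 = math.exp(nll_fp16)   # ≈ 498.2, matches the table
ppl_awq = math.exp(nll_awq)     # ≈ 533.3, matches the table
drift = ppl_awq / ppl_fp16 - 1  # ≈ 0.0704, i.e. the +7.04% drift
```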

Quick Start

Using vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(model="models/GenomeOcean-100M-v1.2-AWQ", quantization="awq")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

outputs = llm.generate(["ATGCGATCGATCGATCGATCG"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Using Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("models/GenomeOcean-100M-v1.2-AWQ", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("models/GenomeOcean-100M-v1.2-AWQ")

# Move inputs to whichever device the model was mapped to
inputs = tokenizer("ATGCGATCGATCGATCGATCG", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```