GenomeOcean-100M-v1.2-AWQ

This is a 4-bit AWQ (Activation-aware Weight Quantization) version of GenomeOcean-100M-v1.2.

Model Details

  • Base Model: GenomeOcean-100M-v1.2
  • Quantization Method: AWQ
  • Bits: 4-bit
  • Group Size: 128
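To make the "4-bit, group size 128" settings concrete, here is a minimal sketch of group-wise 4-bit quantization: weights are split into groups of 128, each group stored as 4-bit integers with its own scale and minimum. This is illustrative round-to-nearest only, not the actual AWQ algorithm, which additionally rescales salient channels using activation statistics.

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=128):
    """Quantize a flat weight vector in groups, one scale/offset per group.

    Illustrative round-to-nearest sketch; real AWQ also applies
    activation-aware per-channel scaling before quantizing.
    """
    qmax = 2 ** bits - 1  # 15 for 4-bit
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - w_min) / qmax
    q = np.round((w - w_min) / scale).clip(0, qmax)
    return q.astype(np.uint8), scale, w_min

def dequantize(q, scale, w_min):
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, scale, w_min = quantize_groupwise(w)
# Per-group scaling bounds the reconstruction error to half a quantization step
max_err = np.abs(dequantize(q, scale, w_min).reshape(-1) - w).max()
```

Smaller groups give finer scales (lower error) at the cost of more per-group metadata; 128 is a common middle ground.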

Benchmark Results

| Metric | Original (FP16) | Quantized (AWQ) | Change |
|---|---|---|---|
| VRAM Footprint | 228.3 MB | 68.4 MB | -70.0% |
| NLL Loss | 6.2110 | 6.2790 | - |
| Perplexity (PPL) | 498.1917 | 533.2738 | +7.04% (Drift) |
| Total Gen Time (50 seqs) | 28.5s | 21.4s | -25.0% (Faster) |
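The VRAM row can be sanity-checked with rough arithmetic. A sketch under simplifying assumptions (every parameter quantized, one fp16 scale plus a packed 4-bit zero-point per group of 128); the measured numbers are somewhat higher, partly because some layers stay unquantized and the parameter count is slightly above 100M:

```python
def fp16_size_mb(n_params):
    # 2 bytes per parameter
    return n_params * 2 / 1e6

def awq4_size_mb(n_params, group_size=128):
    packed = n_params * 0.5          # 4 bits = 0.5 bytes per weight
    n_groups = n_params / group_size
    overhead = n_groups * (2 + 0.5)  # fp16 scale + packed 4-bit zero per group
    return (packed + overhead) / 1e6

print(fp16_size_mb(100e6))  # 200.0 (measured: 228.3 MB)
print(awq4_size_mb(100e6))  # ~52   (measured: 68.4 MB)
```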

Analysis

  • Inference Speed: AWQ quantization provides a significant speedup (approx. 25%) when used with optimized kernels (e.g., vLLM). This is due to reduced memory bandwidth pressure and highly optimized CUDA kernels for 4-bit GEMM.
  • Precision: There is a moderate perplexity drift (+7.04%), which is higher than GGUF (Q4_K_M) but acceptable for most downstream tasks.
  • Repetitive Patterns: At this small scale (100M parameters), the AWQ build maintains a reasonable filter pass rate on generated sequences, though it escapes repetitive sequences slightly less reliably than the GGUF (Q4_K_M) build.
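Since perplexity is the exponential of the mean NLL, the precision figures above can be cross-checked directly from the benchmark table:

```python
import math

# NLL values from the benchmark table
nll_fp16, nll_awq = 6.2110, 6.2790

ppl_fp16 = math.exp(nll_fp16)   # ≈ 498.2, matches the table
ppl_awq = math.exp(nll_awq)     # ≈ 533.3, matches the table
drift = ppl_awq / ppl_fp16 - 1  # ≈ 0.0704, i.e. the +7.04% drift
```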

Quick Start

Using vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(model="models/GenomeOcean-100M-v1.2-AWQ", quantization="awq")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

outputs = llm.generate(["ATGCGATCGATCGATCGATCG"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

Using Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("models/GenomeOcean-100M-v1.2-AWQ", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("models/GenomeOcean-100M-v1.2-AWQ")

# Move inputs to whichever device the model was mapped to
inputs = tokenizer("ATGCGATCGATCGATCGATCG", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```