# GenomeOcean-500M-v1.2-AWQ
This is a 4-bit AWQ (Activation-aware Weight Quantization) version of GenomeOcean-500M-v1.2.
## Model Details
- Base Model: GenomeOcean-500M-v1.2
- Quantization Method: AWQ
- Bits: 4-bit
- Group Size: 128
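The settings above could be reproduced with a toolkit such as AutoAWQ — an assumption, since the card does not say which tool produced these weights. A minimal sketch (`quantize_model` and the base-model path are illustrative names):

```python
# Hypothetical re-quantization sketch using the AutoAWQ library.
# 4-bit weights, group size 128, matching the "Model Details" above.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

def quantize_model(base_path: str, out_path: str) -> None:
    # Requires a CUDA GPU and the `autoawq` package.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(base_path)
    tokenizer = AutoTokenizer.from_pretrained(base_path)
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(out_path)
```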
## Benchmark Results
| Metric | Original (FP16) | Quantized (AWQ) | Change |
|---|---|---|---|
| VRAM Footprint | 1032.3 MB | 286.2 MB | -72.3% |
| NLL Loss | 5.9931 | 6.0442 | - |
| Perplexity (PPL) | 400.6447 | 421.6561 | +5.24% (Drift) |
| Total Gen Time (50 seqs) | 34.2s | 26.2s | -23.4% (Faster) |
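The NLL and PPL rows are internally consistent: perplexity is the exponential of the mean negative log-likelihood, so the drift figure can be checked directly from the table:

```python
import math

# PPL = exp(NLL): the table's perplexities follow from its NLL values.
ppl_fp16 = math.exp(5.9931)  # ~400.64
ppl_awq = math.exp(6.0442)   # ~421.66
drift_pct = (ppl_awq / ppl_fp16 - 1) * 100  # ~+5.24%
```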
## Analysis
- Inference Speed: Optimized for GPU inference. Under vLLM, it delivers a ~23% generation speedup over FP16 while using ~72% less VRAM.
- Fidelity/Speed Trade-off: A GGUF build offers slightly better PPL, but AWQ is the preferred choice for production deployment on CUDA devices thanks to its higher throughput and lower latency.
- Diversity: Generates diverse sequences well, passing complexity filters at an 80% rate and outperforming the base FP16 model at avoiding low-entropy repetitions.
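The card does not define its complexity filter. One common choice is a Shannon-entropy threshold over nucleotide frequencies, sketched below; the function names and the 1.5-bit cutoff are illustrative assumptions, not the card's actual metric:

```python
import math
from collections import Counter

def shannon_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy (bits) over k-mer frequencies of a DNA sequence."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def passes_complexity_filter(seq: str, min_bits: float = 1.5) -> bool:
    # Low-entropy repeats (e.g. "ATATAT...") fall below the threshold;
    # the 1.5-bit cutoff is an illustrative assumption.
    return shannon_entropy(seq) >= min_bits

passes_complexity_filter("ATATATATATAT")  # repetitive, ~1.0 bit -> False
```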
## Quick Start

### Using vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(model="models/GenomeOcean-500M-v1.2-AWQ", quantization="awq")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

outputs = llm.generate(["ATGCGATCGATCGATCGATCG"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
### Using Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("models/GenomeOcean-500M-v1.2-AWQ", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("models/GenomeOcean-500M-v1.2-AWQ")

inputs = tokenizer("ATGCGATCGATCGATCGATCG", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```