# GenomeOcean-500M-v1.2-AWQ
This is a 4-bit AWQ (Activation-aware Weight Quantization) version of GenomeOcean-500M-v1.2.
## Model Details
- Base Model: GenomeOcean-500M-v1.2
- Quantization Method: AWQ
- Bits: 4-bit
- Group Size: 128
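The settings above could be reproduced with a toolkit such as AutoAWQ — an assumption, since the card does not say which tool produced these weights. A minimal sketch (`quantize_model` and the base-model path are illustrative names):

```python
# Hypothetical re-quantization sketch using the AutoAWQ library.
# 4-bit weights, group size 128, matching the "Model Details" above.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

def quantize_model(base_path: str, out_path: str) -> None:
    # Requires a CUDA GPU and the `autoawq` package.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(base_path)
    tokenizer = AutoTokenizer.from_pretrained(base_path)
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(out_path)
```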
## Benchmark Results
| Metric | Original (FP16) | Quantized (AWQ) | Change |
|---|---|---|---|
| VRAM Footprint | 1032.3 MB | 286.2 MB | -72.3% |
| NLL Loss | 5.9931 | 6.0442 | - |
| Perplexity (PPL) | 400.6447 | 421.6561 | +5.24% (Drift) |
| Total Gen Time (50 seqs) | 34.2s | 26.2s | -23.4% (Faster) |
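The NLL and PPL rows are internally consistent: perplexity is the exponential of the mean negative log-likelihood, so the drift figure can be checked directly from the table:

```python
import math

# PPL = exp(NLL): the table's perplexities follow from its NLL values.
ppl_fp16 = math.exp(5.9931)  # ~400.64
ppl_awq = math.exp(6.0442)   # ~421.66
drift_pct = (ppl_awq / ppl_fp16 - 1) * 100  # ~+5.24%
```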
## Analysis
- Inference Speed: Optimized for GPU inference. Under vLLM, it delivers a ~23% generation speedup over FP16 while using ~72% less VRAM.
- Fidelity/Speed Trade-off: A GGUF build offers slightly better PPL, but AWQ is the preferred choice for production deployment on CUDA devices thanks to its higher throughput and lower latency.
- Diversity: Generates diverse sequences well, passing complexity filters at an 80% rate and outperforming the base FP16 model at avoiding low-entropy repetitions.
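The card does not define its complexity filter. One common choice is a Shannon-entropy threshold over nucleotide frequencies, sketched below; the function names and the 1.5-bit cutoff are illustrative assumptions, not the card's actual metric:

```python
import math
from collections import Counter

def shannon_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy (bits) over k-mer frequencies of a DNA sequence."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def passes_complexity_filter(seq: str, min_bits: float = 1.5) -> bool:
    # Low-entropy repeats (e.g. "ATATAT...") fall below the threshold;
    # the 1.5-bit cutoff is an illustrative assumption.
    return shannon_entropy(seq) >= min_bits

passes_complexity_filter("ATATATATATAT")  # repetitive, ~1.0 bit -> False
```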
## Quick Start

### Using vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(model="models/GenomeOcean-500M-v1.2-AWQ", quantization="awq")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

outputs = llm.generate(["ATGCGATCGATCGATCGATCG"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
### Using Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("models/GenomeOcean-500M-v1.2-AWQ", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("models/GenomeOcean-500M-v1.2-AWQ")

inputs = tokenizer("ATGCGATCGATCGATCGATCG", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```