# GenomeOcean-100M-v1.2-AWQ
This is a 4-bit AWQ (Activation-aware Weight Quantization) version of GenomeOcean-100M-v1.2.
## Model Details
- Base Model: GenomeOcean-100M-v1.2
- Quantization Method: AWQ
- Bits: 4-bit
- Group Size: 128
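The settings above map onto a typical AutoAWQ-style quantization config. The sketch below is an assumption for illustration: the key names follow AutoAWQ conventions and are not taken from this repository.

```python
# Assumed AutoAWQ-style config matching the card's settings (4-bit, group size 128).
# Key names follow AutoAWQ conventions; they are not copied from this repo.
quant_config = {
    "w_bit": 4,           # weight bit-width (4-bit)
    "q_group_size": 128,  # quantization group size
    "zero_point": True,   # asymmetric quantization with per-group zero points
    "version": "GEMM",    # kernel variant used for packed 4-bit matmuls
}
```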
## Benchmark Results
| Metric | Original (FP16) | Quantized (AWQ) | Change |
|---|---|---|---|
| VRAM Footprint | 228.3 MB | 68.4 MB | -70.0% |
| NLL Loss | 6.2110 | 6.2790 | +0.0680 |
| Perplexity (PPL) | 498.1917 | 533.2738 | +7.04% (Drift) |
| Total Gen Time (50 seqs) | 28.5s | 21.4s | -25.0% (Faster) |
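The Change column follows directly from the raw numbers: perplexity is `exp(NLL)`, and the percentages are simple ratios. A quick sanity check using the values from the table above:

```python
import math

# Perplexity is exp(mean NLL); the table's PPL values follow from its NLL values.
nll_fp16, nll_awq = 6.2110, 6.2790
ppl_fp16 = math.exp(nll_fp16)  # ~498.2
ppl_awq = math.exp(nll_awq)    # ~533.3

# Relative changes, matching the Change column.
ppl_drift = (ppl_awq / ppl_fp16 - 1) * 100  # ~ +7.04%
vram_change = (68.4 / 228.3 - 1) * 100      # ~ -70.0%
time_change = (21.4 / 28.5 - 1) * 100       # ~ -24.9% (table rounds to -25.0%)

print(f"PPL drift: {ppl_drift:+.2f}%")  # PPL drift: +7.04%
```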
## Analysis
- Inference Speed: AWQ quantization provides a significant speedup (approx. 25%) when used with optimized kernels (e.g., vLLM). This is due to reduced memory bandwidth pressure and highly optimized CUDA kernels for 4-bit GEMM.
- Precision: There is a moderate perplexity drift (+7.04%), which is higher than GGUF (Q4_K_M) but acceptable for most downstream tasks.
- Repetitive Patterns: In smaller models like this 100M version, AWQ quantization helps maintain a reasonable filter pass rate, although it performs slightly worse than GGUF in breaking repetitive sequences.
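The "filter pass rate" point concerns screening generated DNA for degenerate repeat loops. As an illustration only (this is a hypothetical filter, not the benchmark's actual implementation), a simple k-mer duplication score can flag repetitive generations:

```python
import random

def repetition_fraction(seq: str, k: int = 8) -> float:
    """Fraction of k-mers in `seq` that duplicate an earlier k-mer.

    Values near 1.0 indicate a degenerate repeat loop; values near 0.0
    indicate diverse output. Hypothetical filter, for illustration only.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    return 1.0 - len(set(kmers)) / len(kmers)

# A generation stuck in a repeat loop scores high...
assert repetition_fraction("ATGC" * 20) > 0.9
# ...while a diverse random sequence scores low.
random.seed(0)
diverse = "".join(random.choice("ACGT") for _ in range(200))
assert repetition_fraction(diverse) < 0.2
```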
## Quick Start

### Using vLLM
```python
from vllm import LLM, SamplingParams

# Load the AWQ-quantized model with vLLM's optimized AWQ kernels.
llm = LLM(model="models/GenomeOcean-100M-v1.2-AWQ", quantization="awq")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

# Generate a continuation for a DNA sequence prompt.
outputs = llm.generate(["ATGCGATCGATCGATCGATCG"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
### Using Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Transformers loads AWQ checkpoints via its AutoAWQ integration
# (requires the autoawq package to be installed).
model = AutoModelForCausalLM.from_pretrained(
    "models/GenomeOcean-100M-v1.2-AWQ", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("models/GenomeOcean-100M-v1.2-AWQ")

inputs = tokenizer("ATGCGATCGATCGATCGATCG", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```