LilaRest committed on
Commit
819f9e2
·
1 Parent(s): 2dac6d0

Update README

Browse files
Files changed (2) hide show
  1. README.md +23 -1
  2. config.json +1 -0
README.md CHANGED
@@ -17,6 +17,28 @@ tags:
17
  - Gemma-4-31B-IT
18
  - lighthouse
19
  pipeline_tag: text-generation
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
20
 
21
  ---
22
 
@@ -104,7 +126,7 @@ vllm serve LilaRest/gemma-4-31B-it-NVFP4-turbo \
104
 
105
  - `--quantization modelopt` — required, activates NVIDIA's optimized CUTLASS kernels
106
  - `--kv-cache-dtype fp8` — halves KV cache memory on Blackwell
107
- - `--max-model-len 16384` — maximum context length per request. Limited to ~30-40K on RTX 5090, full 262K on PRO 6000
108
 
109
  ## Compatibility
110
 
 
17
  - Gemma-4-31B-IT
18
  - lighthouse
19
  pipeline_tag: text-generation
20
+ model-index:
21
+ - name: gemma-4-31B-it-NVFP4-turbo
22
+ results:
23
+ - task:
24
+ type: text-generation
25
+ dataset:
26
+ name: GPQA Diamond
27
+ type: Idavidrein/gpqa
28
+ config: gpqa_diamond
29
+ metrics:
30
+ - name: Accuracy
31
+ type: accuracy
32
+ value: 72.73
33
+ - task:
34
+ type: text-generation
35
+ dataset:
36
+ name: MMLU Pro
37
+ type: TIGER-Lab/MMLU-Pro
38
+ metrics:
39
+ - name: Accuracy
40
+ type: accuracy
41
+ value: 83.93
42
 
43
  ---
44
 
 
126
 
127
  - `--quantization modelopt` — required, activates NVIDIA's optimized CUTLASS kernels
128
  - `--kv-cache-dtype fp8` — halves KV cache memory on Blackwell
129
+ - `--max-model-len 16384` — maximum context length per request. See [Compatibility](#compatibility) for max value per GPU.
130
 
131
  ## Compatibility
132
 
config.json CHANGED
@@ -154,6 +154,7 @@
154
  "model.embed_vision*",
155
  "model.vision_tower*"
156
  ],
 
157
  "quant_algo": "NVFP4",
158
  "kv_cache_scheme": {
159
  "dynamic": false,
 
154
  "model.embed_vision*",
155
  "model.vision_tower*"
156
  ],
157
+ "bits": 4,
158
  "quant_algo": "NVFP4",
159
  "kv_cache_scheme": {
160
  "dynamic": false,