LilaRest commited on
Commit
09d1ab0
·
1 Parent(s): b08582c

Add benchmark chart

Browse files
.gitignore ADDED
@@ -0,0 +1 @@
 
 
1
+ bench/chart/benchmark.png
README.md CHANGED
@@ -55,14 +55,13 @@ This variant is **text-only**, video/audio weights and encoders have been stripp
55
 
56
  ## Benchmark
57
 
 
 
58
  > [!NOTE]
59
  > RTX PRO 6000, `vllm bench` @ 1K input / 200 output tokens. See [bench.sh](/bench/bench.sh).
60
  >
61
  > Note: We also ran the ***⚡Turbo*** benchmark on an RTX 5090, and it performed identically: at a 16k context, performance is not limited by GPU memory.
62
 
63
- [CHART HERE]
64
-
65
-
66
  | | [Base model](https://huggingface.co/google/gemma-4-31B-it) | [NVIDIA quant](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) | ***⚡ Turbo*** (this model) |
67
  |------------------|------------------------------------------------------------|--------------------------------------------------------------------|---------------------------------------------|
68
  | GPU memory | 58.9 GiB | 31 GiB | **18.5 GiB** *(-68% base, -40% nvidia)* |
@@ -77,13 +76,13 @@ This variant is **text-only**, video/audio weights and encoders have been stripp
77
  Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:
78
 
79
 
80
- | | [prithivMLmods NVFP4 quant](https://huggingface.co/prithivMLmods/gemma-4-31B-it-NVFP4) | [cyankiwi AWQ quant](https://huggingface.co/cyankiwi/gemma-4-31B-it-AWQ-4bit) | ***⚡ Turbo*** (this model) |
81
- |------------------|----------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|----------------------------|
82
- | GPU memory | 19.6 GiB | 19.6 GiB | **18.5 GiB** |
83
- | Prefill | 6647 tok/s | 6626 tok/s | **15359 tok/s** |
84
- | Decode (single) | 64.3 tok/s | 64.4 tok/s | **51 tok/s** |
85
- | Decode (batched) | 757 tok/s | ??? tok/s | **1244 tok/s** |
86
- | Concurrency | 3.79 req/s | ??? req/s | **6.22 req/s** |
87
 
88
 
89
  ## Usage
 
55
 
56
  ## Benchmark
57
 
58
+ ![Benchmark chart](bench/chart/benchmark.png)
59
+
60
  > [!NOTE]
61
  > RTX PRO 6000, `vllm bench` @ 1K input / 200 output tokens. See [bench.sh](/bench/bench.sh).
62
  >
63
  > Note: We also ran the ***⚡Turbo*** benchmark on an RTX 5090, and it performed identically: at a 16k context, performance is not limited by GPU memory.
64
 
 
 
 
65
  | | [Base model](https://huggingface.co/google/gemma-4-31B-it) | [NVIDIA quant](https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4) | ***⚡ Turbo*** (this model) |
66
  |------------------|------------------------------------------------------------|--------------------------------------------------------------------|---------------------------------------------|
67
  | GPU memory | 58.9 GiB | 31 GiB | **18.5 GiB** *(-68% base, -40% nvidia)* |
 
76
  Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:
77
 
78
 
79
+ | | [prithivMLmods NVFP4](https://huggingface.co/prithivMLmods/gemma-4-31B-it-NVFP4) | [cyankiwi AWQ](https://huggingface.co/cyankiwi/gemma-4-31B-it-AWQ-4bit) | ***⚡ Turbo*** (this model) |
80
+ |------------------|----------------------------------------------------------------------------------|-------------------------------------------------------------------------|----------------------------|
81
+ | GPU memory | 19.6 GiB | 19.6 GiB | **18.5 GiB** |
82
+ | Prefill | 6647 tok/s | 6626 tok/s | **15359 tok/s** |
83
+ | Decode (single) | 64.3 tok/s | 64.4 tok/s | **51 tok/s** |
84
+ | Decode (batched) | 757 tok/s | 757 tok/s | **1244 tok/s** |
85
+ | Concurrency | 3.79 req/s | 3.78 req/s | **6.22 req/s** |
86
 
87
 
88
  ## Usage
bench/chart/chart.py ADDED
@@ -0,0 +1,125 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/usr/bin/env python3
"""Generate P95 E2E latency vs RPS benchmark chart."""

import csv
import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.patheffects as pe
from scipy.ndimage import uniform_filter1d

# Benchmark result CSVs live in ../results relative to this script.
RESULTS_DIR = os.path.join(os.path.dirname(__file__), "..", "results")
# The rendered chart is written next to this script.
OUTPUT = os.path.join(os.path.dirname(__file__), "benchmark.png")
14
+
15
+ MODELS = [
16
+ {
17
+ "file": "LilaRest--gemma-4-31B-it-NVFP4-turbo",
18
+ "label": "⚡ Turbo (this model)",
19
+ "color": "#58a6ff",
20
+ "linewidth": 3.2,
21
+ "zorder": 10,
22
+ "alpha": 1.0,
23
+ "glow": True,
24
+ },
25
+ {
26
+ "file": "nvidia--Gemma-4-31B-IT-NVFP4",
27
+ "label": "NVIDIA NVFP4",
28
+ "color": "#76b900",
29
+ "linewidth": 2.0,
30
+ "zorder": 5,
31
+ "alpha": 0.85,
32
+ },
33
+ {
34
+ "file": "google--gemma-4-31B-it",
35
+ "label": "Google BF16 (base)",
36
+ "color": "#f97316",
37
+ "linewidth": 2.0,
38
+ "zorder": 4,
39
+ "alpha": 0.85,
40
+ },
41
+ ]
42
+
43
+
44
def read_csv(filename, results_dir=None):
    """Load the (rps, p95_e2e_ms) columns of ``<results_dir>/<filename>.csv``.

    Args:
        filename: CSV basename without the ``.csv`` extension.
        results_dir: Directory containing the CSV.  Defaults to the
            repository's ``bench/results`` directory (``RESULTS_DIR``).

    Returns:
        Two ``np.ndarray``s: request rates and P95 end-to-end latencies
        (milliseconds), in file order.
    """
    directory = RESULTS_DIR if results_dir is None else results_dir
    rps, e2e = [], []
    # newline="" is required by the csv module for correct newline handling;
    # an explicit utf-8 keeps the read independent of the locale encoding.
    with open(os.path.join(directory, filename + ".csv"),
              newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            rps.append(float(row["rps"]))
            e2e.append(float(row["p95_e2e_ms"]))
    return np.array(rps), np.array(e2e)
51
+
52
+
53
def smooth(y, size=3):
    """Box-filter (running mean) of width *size*, edges padded with the
    nearest value so the output has the same length as the input."""
    smoothed = uniform_filter1d(y, size=size, mode="nearest")
    return smoothed
55
+
56
+
57
# --- Style ---
# Dark, GitHub-flavoured theme so the exported PNG blends into the model
# card README when viewed on Hugging Face / GitHub.
plt.rcParams.update({
    "figure.facecolor": "#0d1117",
    "axes.facecolor": "#0d1117",
    "axes.edgecolor": "#30363d",
    "axes.labelcolor": "#8b949e",
    "text.color": "#e6edf3",
    "xtick.color": "#8b949e",
    "ytick.color": "#8b949e",
    "grid.color": "#21262d",
    "legend.facecolor": "#161b22",
    "legend.edgecolor": "#30363d",
    "font.family": "sans-serif",
    "font.size": 12,
})

fig, ax = plt.subplots(figsize=(12, 6.5))

# One smoothed latency-vs-load curve per model.
for m in MODELS:
    rps, e2e = read_csv(m["file"])
    e2e_smooth = smooth(e2e)

    plot_kwargs = dict(
        label=m["label"], color=m["color"],
        linewidth=m["linewidth"], zorder=m["zorder"], alpha=m["alpha"],
        linestyle=m.get("linestyle", "-"),
    )

    # Subtle glow on turbo line (wide, mostly transparent stroke behind it)
    if m.get("glow"):
        plot_kwargs["path_effects"] = [
            pe.withStroke(linewidth=6, foreground=m["color"], alpha=0.15),
        ]

    ax.plot(rps, e2e_smooth, **plot_kwargs)

# Log y-scale: latency spans roughly two orders of magnitude once a model
# saturates, so a linear axis would flatten the low-latency region.
ax.set_yscale("log")
ax.set_xlabel("Request Rate (RPS)", fontsize=14, labelpad=10)
ax.set_ylabel("P95 End-to-End Latency", fontsize=14, labelpad=10)
ax.text(0.5, 0.97, "lower is better ↓", transform=ax.transAxes, fontsize=15,
        color="white", ha="center", va="top", alpha=1.0, style="italic",
        bbox=dict(boxstyle="round,pad=0.4", facecolor="#161b22", edgecolor="#30363d", alpha=0.8))

# Y-axis: clean labels — tick values are milliseconds, rendered as whole seconds
ax.yaxis.set_major_formatter(ticker.FuncFormatter(
    lambda x, _: f"{x/1000:.0f}s"))
ax.set_ylim(2_500, 200_000)

# X-axis: fixed window covering the benchmarked 0.5–15 RPS sweep, 1-RPS ticks
ax.set_xlim(0, 15.5)
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))

ax.grid(True, which="major", alpha=0.35)

# 10s reference threshold — dashed line with a small right-aligned label
ax.axhline(y=10_000, color="#f0883e", linestyle="--", linewidth=1, alpha=0.35)
ax.text(15.3, 11_500, "10s", color="#f0883e", fontsize=10, alpha=0.5,
        ha="right", va="bottom")

# Legend — larger, with padding; the Turbo entry is bolded for emphasis
legend = ax.legend(loc="upper left", fontsize=12, framealpha=0.95,
                   borderpad=0.8, labelspacing=0.6)
for text in legend.get_texts():
    if "Turbo" in text.get_text():
        text.set_fontweight("bold")

plt.tight_layout(pad=1.5)
plt.savefig(OUTPUT, dpi=200, bbox_inches="tight")
print(f"Saved to {OUTPUT}")
bench/results/cyankiwi--gemma-4-31B-it-AWQ-4bit.csv CHANGED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ rps,effective_rps,p95_ttft_ms,p95_tpot_ms,p95_e2e_ms
2
+ 0.5,0.45,58.19,16.35,3302.55
3
+ 1.0,0.90,60.98,17.21,3478.73
4
+ 1.5,1.34,391.80,26.34,5485.43
5
+ 2.0,1.75,616.11,34.33,6927.24
6
+ 2.5,2.22,449.66,36.03,7351.67
7
+ 3.0,2.60,1222.34,35.56,7326.99
8
+ 3.5,3.02,735.37,38.73,7776.74
9
+ 4.0,3.49,281.78,39.45,7941.53
10
+ 4.5,3.78,935.95,37.79,7586.05
11
+ 5.0,3.35,6610.82,185.39,37592.90
12
+ 5.5,2.82,10223.96,246.77,49863.89
13
+ 6.0,2.83,26880.89,254.89,51985.53
14
+ 6.5,2.88,31920.09,256.79,53213.68
15
+ 7.0,2.91,35328.45,255.08,55262.87
16
+ 7.5,2.90,38643.28,256.16,57394.03
17
+ 8.0,2.92,41402.93,251.52,59681.00
18
+ 8.5,2.94,44821.73,250.68,65312.49
19
+ 9.0,2.95,48704.06,252.64,71196.65
20
+ 9.5,2.96,51980.99,250.13,77310.91
21
+ 10.0,2.96,55519.13,250.40,83422.20
22
+ 10.5,2.84,58139.61,255.72,90126.81
23
+ 11.0,2.87,69658.86,253.41,92134.56
24
+ 11.5,2.89,83025.85,255.71,95146.62
25
+ 12.0,2.91,85717.79,253.46,97926.27
26
+ 12.5,2.91,89763.07,252.60,101964.48
27
+ 13.0,2.92,93397.80,252.92,108076.43
28
+ 13.5,2.93,95972.66,252.58,113769.49
29
+ 14.0,2.94,98589.56,248.68,118258.77
30
+ 14.5,2.96,102418.07,249.98,123505.64
31
+ 15.0,2.95,106032.85,249.78,129777.93