{/* Run Controls */}

Run Benchmark

Evaluate all 3 pipelines on 10 science questions from the Wikipedia corpus

Samples

setSamples(+e.target.value)} className="w-28 accent-[#FF6B00]" /> {samples}

{demoMode && hasResults && (

📊 Pre-computed Demo Results Set an API key for live benchmark data

)}

{hasResults && ( <> {/* Hero Metrics */}

{[ { label: "Token Reduction", value: `${data.tokenReductionVsBaseline}%`, delta: "GraphRAG vs Basic RAG", color: "#FF6B00", bg: "linear-gradient(135deg, #FFF4EB, #faf9f5)", }, { label: "GraphRAG F1", value: (data.graphrag.avgF1 * 100).toFixed(1) + "%", delta: `+${((data.graphrag.avgF1 - data.baseline.avgF1) * 100).toFixed(1)}% vs RAG`, color: "#5db872", bg: "linear-gradient(135deg, #ecf7ef, #faf9f5)", }, { label: "F1 Win Rate", value: (data.graphragF1WinRate * 100).toFixed(0) + "%", delta: `${(data.graphragF1WinRate * 100).toFixed(0)}% of queries`, color: "#0072CE", bg: "linear-gradient(135deg, #E6F4FF, #faf9f5)", }, { label: "Samples", value: data.numSamples.toString(), delta: "Science corpus", color: "#002B49", bg: "linear-gradient(135deg, #f5f0e8, #faf9f5)", }, ].map((m, i) => (

{m.value}

{m.label}

{m.delta}

))}

{/* Accuracy Evaluation — 30% of hackathon score */}

Answer Accuracy Evaluation

30% of hackathon score · LLM-as-a-Judge + BERTScore (semantic similarity)

{(data.bonusJudge && data.bonusBertscore) ? ( 🏆 Max Bonus Unlocked ) : (data.bonusJudge || data.bonusBertscore) ? ( ⭐ Partial Bonus ) : ( Below Bonus Threshold )}

{/* LLM-as-a-Judge */}

LLM-as-a-Judge

PASS/FAIL per answer

{(data.graphragJudgePassRate ?? 0) >= 0.90 ? ✓ Bonus ≥90% : Need ≥90%}

{((data.graphragJudgePassRate ?? 0) * 100).toFixed(0)}%

GraphRAG pass rate

{/* Progress bar */}

= 0.90 ? "#5db872" : "#FF6B00", transition: "width 0.5s ease", }} /> {/* 90% marker */}

Baseline: {((data.baselineJudgePassRate ?? 0) * 100).toFixed(0)}% Bonus threshold: 90%

{/* BERTScore */}

BERTScore

Semantic similarity via sentence embeddings

{(data.bonusBertscore) ? ✓ Bonus : Need ≥0.55R / ≥0.88}

{(data.avgBertscoreRaw ?? 0).toFixed(3)}

raw cosine F1

{/* Progress bar */}

= 0.88 ? "#5db872" : "#0072CE", transition: "width 0.5s ease", }} /> {/* 0.88 raw marker */}

Rescaled: {(data.avgBertscoreRescaled ?? 0).toFixed(3)} (need ≥0.55) Raw threshold: 0.88

{/* Bonus explanation */}

Bonus unlocked by:{" "} judge pass rate ≥ 90% and/or BERTScore rescaled ≥ 0.55 (or raw ≥ 0.88). Hitting both thresholds earns the maximum accuracy bonus. BERTScore uses cosine similarity of{" "} all-MiniLM-L6-v2 sentence embeddings (rescale baseline = 0.20).

{/* Charts Grid */}

{/* Radar */} {radarData.length > 0 && (

Multi-Metric Comparison

)} {/* F1 by Type */} {typeData.length > 0 && (

F1 Score by Question Type

)}

{/* Token Efficiency */}

Token Usage Breakdown

[`${v} tokens`, "Avg tokens/query"]} />

{/* Detailed Table — all 3 pipelines */}

Full 3-Pipeline Comparison

{["Metric", "LLM-Only", "Basic RAG", "GraphRAG", "Reduction (RAG→Graph)", "Winner"].map(h => ( ))} {[ { metric: "Average F1 Score", l: data.llmOnly.avgF1.toFixed(4), b: data.baseline.avgF1.toFixed(4), g: data.graphrag.avgF1.toFixed(4), delta: `+${((data.graphrag.avgF1 - data.baseline.avgF1) * 100).toFixed(1)}%`, winner: data.graphrag.avgF1 >= data.baseline.avgF1 ? "graphrag" : "baseline", }, { metric: "Average Exact Match", l: data.llmOnly.avgEM.toFixed(4), b: data.baseline.avgEM.toFixed(4), g: data.graphrag.avgEM.toFixed(4), delta: `+${((data.graphrag.avgEM - data.baseline.avgEM) * 100).toFixed(1)}%`, winner: data.graphrag.avgEM >= data.baseline.avgEM ? "graphrag" : "baseline", }, { metric: "Avg Tokens / Query", l: data.llmOnly.avgTokens.toLocaleString(), b: data.baseline.avgTokens.toLocaleString(), g: data.graphrag.avgTokens.toLocaleString(), delta: `−${data.tokenReductionVsBaseline}%`, winner: "graphrag", }, { metric: "Avg Cost / Query", l: "$" + data.llmOnly.avgCost.toFixed(6), b: "$" + data.baseline.avgCost.toFixed(6), g: "$" + data.graphrag.avgCost.toFixed(6), delta: data.baseline.avgCost > 0 ? `−${Math.round((1 - data.graphrag.avgCost / data.baseline.avgCost) * 100)}%` : "—", winner: "graphrag", }, { metric: "Avg Latency", l: data.llmOnly.avgLatency + "ms", b: data.baseline.avgLatency + "ms", g: data.graphrag.avgLatency + "ms", delta: data.baseline.avgLatency > 0 ? `${(data.graphrag.avgLatency / data.baseline.avgLatency).toFixed(1)}×` : "—", winner: data.graphrag.avgLatency <= data.baseline.avgLatency ? "graphrag" : "baseline", }, ].map((row, i) => ( ))}

{h}
{row.metric}	{row.l}	{row.b}	{row.g}	{row.delta}	{row.winner === "graphrag" ? "GraphRAG ✓" : "Baseline ✓"}

{/* Insight */}

💡 Key Finding

GraphRAG reduces tokens by {data.tokenReductionVsBaseline}% vs Basic RAG while achieving {((data.graphragJudgePassRate ?? 0) * 100).toFixed(0)}% LLM-judge accuracy{" "} and BERTScore {(data.avgBertscoreRaw ?? 0).toFixed(3)}. Entity descriptions pre-indexed at ingest time replace raw chunk text at query time — same knowledge, fraction of the tokens, maintained or improved answer quality.

Token reduction only counts if accuracy is maintained. Our GraphRAG pipeline outperforms Basic RAG on both the LLM-judge pass rate and semantic similarity — proving the graph isn't just cheaper, it's genuinely better.

)} {/* Report */} {report && (

benchmark_report.txt

            {report}

)}

); }