Fix #10: Rewrite README — 3-pipeline system, TG GraphRAG integration, LLM-Judge + BERTScore, NoveltyEngine wiring
# 🔍 GraphRAG Inference Hackathon
<div align="center">

[Testing](#-testing) · [Deployment](#-deployment)

</div>

---
- [What This Is](#-what-this-is)
- [The Problem We're Solving](#-the-problem-were-solving)
- [Architecture (AI Factory — 4 Layers)](#-architecture-ai-factory--4-layers)
- [14 Novel Techniques](#-14-novel-techniques)
- [Graph Schema & GSQL Queries](#-graph-schema--gsql-queries)
- [Evaluation Framework](#-evaluation-framework)
- [12 LLM Providers](#-12-llm-providers)
- [Expected Benchmarks](#-expected-benchmarks)
- [Quick Start](#-quick-start)
- [Deployment](#-deployment)
- [OpenClaw Agent Integration](#-openclaw-agent-integration)
- [Testing](#-testing)
- [Project Structure](#-project-structure)
- [References & Citation Graph](#-references--citation-graph)

---
## 🎯 What This Is

A **3-pipeline GraphRAG benchmarking system** with **14 novel techniques** from 2024–2025 research.

| Pipeline 1: LLM-Only | Pipeline 2: Baseline RAG | Pipeline 3: GraphRAG |
|---|---|---|
| Query → LLM → Answer | Query → Embed → Top-K Chunks → LLM | Query → Graph Retrieval → Pruned Context → LLM |
| No retrieval. Worst-case baseline. | Vector embeddings. Industry standard. | Novelty-enhanced graph traversal (see below). |

**The headline metric**: token reduction with maintained accuracy. GraphRAG community summaries achieve **26–97% fewer tokens vs full-text summarization** ([Edge et al., 2024](https://arxiv.org/abs/2404.16130)) while delivering a **72–83% comprehensiveness win rate** over vector RAG (p < .001).

---

## ❓ The Problem We're Solving

| Capability | Vector RAG | GraphRAG |
|---|---|---|
| **Multi-hop reasoning** | ❌ Retrieves *similar* chunks but can't chain facts across documents | ✅ Traverses entity relationships: `Einstein →BORN_IN→ Germany, Newton →BORN_IN→ England` |
| **Context efficiency** | 🟡 Top-K chunks (~3,600 tokens per query, [Han et al., 2025](https://arxiv.org/abs/2502.11371)) | ✅ Token Budget Controller caps at 2,000 tokens — **97% reduction** vs unbounded retrieval ([TERAG](https://arxiv.org/abs/2509.18667)) |
| **Global sensemaking** | ❌ Can't answer "What are the main themes across 1M tokens?" | ✅ Community-level summaries via Leiden hierarchical detection ([GraphRAG](https://arxiv.org/abs/2404.16130)) |
| **Temporal reasoning** | ❌ 30.7% accuracy on time-dependent queries | ✅ **50.6% accuracy** (+64% improvement, [Han et al., 2025](https://arxiv.org/abs/2502.11371)) |
| **Complex reasoning** | ❌ 41.35% accuracy on novel corpus | ✅ **50.93% accuracy** (+23%, [GraphRAG-Bench](https://arxiv.org/abs/2506.05690)) |

| Comparison | Token Result | Source |
|---|---|---|
| **GraphRAG vs. Full-Text Summarization** | C0 (root communities) uses **97% fewer tokens**; C3 uses **26–33% fewer** | [Edge et al., Table 2](https://arxiv.org/abs/2404.16130) |
| **GraphRAG vs. Top-K Vector RAG** | Community-GraphRAG retrieves ~2.7× MORE tokens (9,770 vs 3,631) | [Han et al., 2025](https://arxiv.org/abs/2502.11371) |
| **With Token Budget Controller** | TERAG achieves **97% token reduction at 80%+ accuracy** vs unbounded | [TERAG, 2024](https://arxiv.org/abs/2509.18667) |
---

## 🏗️ Architecture (AI Factory — 4 Layers)

```
┌──────────────────────────────────────────────────────────────────────────────┐
│ LAYER 4: EVALUATION                                                          │
│ ┌─────────────────┬──────────────────┬──────────────────┬────────────────┐   │
│ │ LLM-as-a-Judge  │ BERTScore F1     │ RAGAS            │ Token Tracking │   │
│ │ (PASS/FAIL)     │ (≥0.55 rescaled) │ (Faithfulness,   │ (per-query)    │   │
│ │ Target: ≥90%    │ (≥0.88 raw)      │ Relevancy)       │                │   │
│ └─────────────────┴──────────────────┴──────────────────┴────────────────┘   │
│ F1/EM (SQuAD) │ Context Hit Rate │ Live Benchmark │ Next.js Dashboard        │
├──────────────────────────────────────────────────────────────────────────────┤
│ LAYER 3: UNIVERSAL LLM (12 Providers)                                        │
│ OpenAI │ Claude │ Gemini │ Mistral │ Ollama │ Groq │ DeepSeek │ xAI │ …      │
│ Single interface: model routing, cost tracking, fallback chains              │
├──────────────────────────────────────────────────────────────────────────────┤
│ LAYER 2: PIPELINES                                                           │
│ ┌────────────────────────────────────────────────────────────────────────┐   │
│ │ Pipeline 1: LLM-Only (query → LLM, no retrieval)                       │   │
│ │ Pipeline 2: Baseline RAG (query → embed → vector top-K → LLM)          │   │
│ │ Pipeline 3: GraphRAG (novelty-enhanced, see below)                     │   │
│ │   PolyG Router → PPR Scoring → Spreading Activation →                  │   │
│ │   Path Pruning → Token Budget → Structured Context → LLM               │   │
│ │ Adaptive Router: complexity scorer 0.0–1.0 → route to optimal pipe     │   │
│ └────────────────────────────────────────────────────────────────────────┘   │
├──────────────────────────────────────────────────────────────────────────────┤
│ LAYER 1: GRAPH (TigerGraph)                                                  │
│ Incremental Updates (O(new) cost) │ Schema-Bounded Extraction (9 types)      │
└──────────────────────────────────────────────────────────────────────────────┘
```

### GraphRAG Query Flow (Pipeline 3)

```
Query
Step 2: Dual-level keyword extraction (LightRAG-inspired)
        → high_level: ["nationality", "comparison"]
        → low_level:  ["Einstein", "Newton"]
Step 3: Vector search → seed entities [Einstein, Newton] from TigerGraph
Step 4: PPR from seeds → score all reachable entities by graph proximity
Step 5: Spreading Activation → expand to 2-hop neighborhood with decay=0.7
Step 6: Combined scoring: 0.6 × PPR + 0.4 × Activation per chunk
Step 7: Token Budget Controller → select top chunks within 2,000 tokens (prune 60%+)
Step 8: Path Serialization → "Einstein →BORN_IN→ Germany, Newton →BORN_IN→ England"
        (high-reliability paths placed FIRST — exploits lost-in-the-middle bias)
Step 9: LLM generates answer with ranked, pruned, path-structured graph context

Result: "No. Einstein was born in Germany and Newton was born in England."
Tokens used: 1,847 (vs 3,600+ for vector RAG, vs 12,000+ for LLM-only)
```
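Step 6 above is a simple weighted blend. A minimal stand-alone sketch (the chunk IDs and score dictionaries are made up for illustration; the repo's actual scorer classes are not shown here):

```python
def combine_scores(ppr, activation, w_ppr=0.6, w_act=0.4):
    """Rank chunks by 0.6 × PPR score + 0.4 × spreading-activation score."""
    chunks = set(ppr) | set(activation)
    scored = {c: w_ppr * ppr.get(c, 0.0) + w_act * activation.get(c, 0.0)
              for c in chunks}
    # Highest combined score first, as in Step 6 of the flow.
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

ranked = combine_scores(
    ppr={"chunk_a": 0.9, "chunk_b": 0.2},
    activation={"chunk_a": 0.5, "chunk_b": 0.8},
)
# chunk_a: 0.6*0.9 + 0.4*0.5 = 0.74; chunk_b: 0.6*0.2 + 0.4*0.8 = 0.44
```

The top-ranked chunks then flow into the Token Budget Controller in Step 7.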

---

## 🌟 14 Novel Techniques

### Graph Retrieval

| # | Technique | Paper | Key Result | Implementation |
|---|-----------|-------|------------|----------------|
| 1 | **PPR Confidence-Weighted Retrieval** | [CatRAG](https://arxiv.org/abs/2602.01965) (Feb 2025) | Best reasoning completeness on 4 benchmarks | `PPRConfidenceScorer` — Personalized PageRank from seed entities with damping=0.85, power-iteration convergence |
| 2 | **Spreading Activation Context Scoring** | [SA-RAG](https://arxiv.org/abs/2512.15922) (Dec 2024) | **+39% answer correctness** on MuSiQue | `SpreadingActivation` — propagates signal through graph edges with decay=0.7, ranks chunks by accumulated activation |
| 3 | **Flow-Pruned Path Serialization** | [PathRAG](https://arxiv.org/abs/2502.14902) (Feb 2025) | **62–65% win rate** vs LightRAG | `PathPruner` — DFS path discovery, multiplicative edge-weight scoring, threshold pruning, lost-in-the-middle exploit |
| 4 | **Graph Token Budget Controller** | [TERAG](https://arxiv.org/abs/2509.18667) (Sep 2024) | **97% token reduction** at 80%+ accuracy | `TokenBudgetController` — caps context at a configurable token limit, prioritizes by score × relevance |
| 5 | **PolyG Hybrid Retrieval Router** | [RAGRouter-Bench](https://arxiv.org/abs/2602.00296) (Feb 2025) | Adaptive > any fixed paradigm | `PolyGRouter` — 4-class query taxonomy → optimal retrieval strategy per query |
| 6 | **Incremental Graph Updates** | [TG-RAG](https://arxiv.org/abs/2510.13590) (Oct 2024) | O(new) vs O(all) recomputation | `IncrementalGraphUpdater` — embedding-similarity entity merging, scoped community re-detection |
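Technique #2 (spreading activation) can be sketched in pure Python. This is an illustrative toy, not the repo's `SpreadingActivation` class; the tiny graph is invented, and the defaults (decay=0.7, 2 hops) match the parameters cited above:

```python
from collections import defaultdict

def spread_activation(graph, seeds, decay=0.7, max_steps=2, threshold=0.05):
    """Propagate activation from seed entities along edges, decaying per hop."""
    activation = defaultdict(float)
    frontier = {s: 1.0 for s in seeds}          # seeds start fully activated
    for node, energy in frontier.items():
        activation[node] = max(activation[node], energy)
    for _ in range(max_steps):
        nxt = defaultdict(float)
        for node, energy in frontier.items():
            for neighbor in graph.get(node, []):
                nxt[neighbor] += energy * decay  # energy fades with distance
        frontier = {n: e for n, e in nxt.items() if e >= threshold}
        for node, energy in frontier.items():
            activation[node] = max(activation[node], energy)
    return dict(activation)

g = {"Einstein": ["Germany"], "Germany": ["Europe"]}
act = spread_activation(g, ["Einstein"])
# Einstein = 1.0, Germany = 0.7 (1 hop), Europe = 0.49 (2 hops)
```

Chunks mentioning highly activated entities are then ranked above distant ones.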

### Graph Construction & Reasoning

| # | Technique | Source | Description |
|---|-----------|--------|-------------|
| 7 | **Schema-Bounded Entity Extraction** | [Youtu-GraphRAG](https://arxiv.org/abs/2508.19855) (Tencent, 2025) | 9 entity types (PERSON, ORG, LOCATION, EVENT, DATE, CONCEPT, WORK, PRODUCT, TECHNOLOGY) + 10 relation types → ~90% extraction cost reduction, +16% accuracy vs unconstrained extraction |
| 8 | **Dual-Level Keyword Retrieval** | [LightRAG](https://arxiv.org/abs/2410.05779) (Oct 2024, 34K⭐) | High-level (themes/topics) + low-level (entities/names) keywords for dual-channel retrieval |
| 9 | **Adaptive Query Complexity Router** | Original | LLM scores query complexity 0.0–1.0 → routes simple queries to the baseline (saves cost), complex ones to GraphRAG (better accuracy) |
| 10 | **Graph Reasoning Path Explanation** | Original | Natural-language step-by-step traversal explanation: Entry → Traversal → Evidence → Conclusion |
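Technique #7 amounts to rejecting any extraction outside the bounded schema. A minimal sketch (the `filter_extractions` helper and sample entities are hypothetical; only the 9 type names come from the table above):

```python
# The 9 entity types from the bounded schema (Novelty #7).
ALLOWED_TYPES = {"PERSON", "ORG", "LOCATION", "EVENT", "DATE",
                 "CONCEPT", "WORK", "PRODUCT", "TECHNOLOGY"}

def filter_extractions(raw_entities):
    """Keep only entities whose type is in the bounded schema."""
    kept, dropped = [], []
    for ent in raw_entities:
        (kept if ent.get("type") in ALLOWED_TYPES else dropped).append(ent)
    return kept, dropped

kept, dropped = filter_extractions([
    {"name": "Einstein", "type": "PERSON"},
    {"name": "fast", "type": "ADJECTIVE"},   # out of schema → rejected
])
```

Constraining the LLM's extraction output this way is what keeps the graph small and the extraction prompts cheap.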

### System & Integration

| # | Technique | Description |
|---|-----------|-------------|
| 11 | **12-Provider Universal LLM** | Single interface for OpenAI, Claude, Gemini, Mistral, Ollama, Groq, DeepSeek, xAI, Together, Cohere, HuggingFace, OpenRouter — with cost tracking and fallback chains |
| 12 | **OpenClaw Agent Skills** | GraphRAG as autonomous agent capabilities following the CIK model (SOUL + IDENTITY + MEMORY + Skills) |
| 13 | **Live Dashboard Benchmarking** | Interactive comparison: one query → all 3 pipelines run → side-by-side responses + metrics. A "Run Benchmark Now" button evaluates on HotpotQA in real time |
| 14 | **Advanced GSQL Queries** | PPR, shortest paths, spreading activation, neighborhood extraction — all as installable TigerGraph queries via `gsql_advanced.py` |
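The fallback-chain idea behind technique #11 is just ordered retry. A stand-alone sketch, not the repo's `UniversalLLM` API (the provider callables here are stubs):

```python
def call_with_fallback(prompt, providers):
    """Try providers in order; return (name, response) from the first that succeeds."""
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:            # provider down, rate-limited, etc.
            errors[name] = str(exc)
    raise RuntimeError(f"all providers failed: {errors}")

def flaky(prompt):
    raise TimeoutError("rate limited")      # simulate a failing provider

used, answer = call_with_fallback("hi", [
    ("groq", flaky),                        # fails → fall through
    ("ollama", lambda p: f"echo: {p}"),     # local fallback succeeds
])
```

Cost tracking would wrap each `call` with token counting and per-provider pricing, which the table in [12 LLM Providers](#-12-llm-providers) supplies.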

---

## 📐 Graph Schema & GSQL Queries

### TigerGraph Schema

```
┌──────────┐  PART_OF     ┌──────────┐   MENTIONS     ┌──────────┐
│ Document │ ←─────────── │  Chunk   │ ─────────────→ │  Entity  │
│          │  (position)  │          │ (count, conf)  │          │
│ doc_id   │              │ chunk_id │                │ entity_id│
│ title    │              │ text     │                │ name     │
│ content  │              │ embedding│   RELATED_TO   │ type     │
│ source   │              │ tokens   │ ←───────────→  │ desc     │
└──────────┘              └──────────┘ (type, weight) │ embedding│
                                                      └────┬─────┘
                                                           │ IN_COMMUNITY
                                                      ┌────▼─────┐
                                                      │ Community│
                                                      │ comm_id  │
                                                      │ summary  │
                                                      │ level    │
                                                      └──────────┘
```

### Installed GSQL Queries

| Query | Parameters | Purpose |
|---|---|---|
| `vectorSearchChunks` | `queryVec LIST<DOUBLE>, topK INT` | Cosine-similarity chunk retrieval |
| `vectorSearchEntities` | `queryVec LIST<DOUBLE>, topK INT` | Entity vector search for seed discovery |
| `graphRAGTraverse` | `seedEntityIds SET<STRING>, hops INT` | Multi-hop neighborhood expansion |
| `pprFromSeeds` | `seedEntityIds, damping FLOAT, maxIter INT` | Personalized PageRank (Novelty #1) |
| `findReasoningPaths` | `sourceId, targetId STRING, maxDepth INT` | Shortest path between entities (Novelty #3) |
| `spreadingActivation` | `seedEntityIds, decayFactor, maxSteps, threshold` | Activation propagation (Novelty #2) |
| `getEntityNeighborhood` | `entityIds SET<STRING>, hops INT` | Subgraph extraction for context building |
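What `pprFromSeeds` computes server-side is ordinary power iteration with restart at the seeds. A pure-Python mirror for intuition (toy graph; the GSQL query runs this over the full TigerGraph — dangling nodes simply leak mass in this simplified version):

```python
def personalized_pagerank(adj, seeds, damping=0.85, iters=50):
    """Power iteration with teleport back to the seed set (restart vector)."""
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # Teleport term: (1 - d) of the mass returns to the seeds each step.
        nxt = {n: (1 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:
                continue                     # dangling node: mass leaks (simplified)
            share = damping * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank

adj = {"Einstein": ["Germany"], "Germany": ["Einstein"], "Newton": []}
scores = personalized_pagerank(adj, seeds=["Einstein"])
# Einstein (seed) > Germany (1 hop) > Newton (unreachable, 0.0)
```

Entities closer to the seeds accumulate more rank, which is exactly the "graph proximity" score used in Step 4 of the query flow.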

---

## 📊 Evaluation Framework

### Metric 1: LLM-as-a-Judge (PASS/FAIL)

Based on the methodology from [Zheng et al., NeurIPS 2023](https://arxiv.org/abs/2306.05685), using **single-answer reference-guided grading** — the most reliable LLM-judge configuration (versus pairwise grading, which suffers from position bias).

**Best practices implemented:**
- ✅ Reference answer always provided (maximizes human correlation per [Prometheus 2](https://arxiv.org/abs/2405.01535))
- ✅ Chain-of-thought before verdict (explain-then-rate improves alignment per [this survey](https://arxiv.org/abs/2412.05579))
- ✅ Structured JSON output: `{"feedback": "...", "verdict": "PASS"|"FAIL"}`
- ✅ Temperature = 0 for deterministic grading
- ✅ Anti-self-enhancement: judge model ≠ generation model (GPT-4 favors its own outputs by ~10%, Claude by ~25% — [Zheng et al.](https://arxiv.org/abs/2306.05685))
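The structured-output practice above boils down to a prompt template plus strict JSON parsing, with malformed output treated as FAIL. A hedged sketch (the prompt wording and helper names are illustrative, not the repo's evaluation layer; no live LLM call is made):

```python
import json

JUDGE_PROMPT = (
    "You are grading a candidate answer against a reference answer.\n"
    "Think step by step, then output JSON only: "
    '{{"feedback": "...", "verdict": "PASS" or "FAIL"}}\n'
    "Reference: {reference}\nCandidate: {candidate}"
)

def parse_verdict(raw):
    """Parse the judge's structured output; anything malformed counts as FAIL."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return "FAIL", "unparseable judge output"
    verdict = "PASS" if obj.get("verdict") == "PASS" else "FAIL"
    return verdict, obj.get("feedback", "")

prompt = JUDGE_PROMPT.format(reference="Germany", candidate="Germany")
# Canned judge response instead of a live call (temperature=0 in the real system):
verdict, feedback = parse_verdict('{"feedback": "Matches the reference.", "verdict": "PASS"}')
```

Defaulting ambiguous output to FAIL keeps the ≥90% pass target conservative.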

**Recommended judge models (free):**

| Model | HF ID | Why |
|---|---|---|

### Metric 2: BERTScore F1
Based on [Zhang et al., ICLR 2020](https://arxiv.org/abs/1904.09675). BERTScore computes token-level semantic similarity using contextual embeddings with greedy cosine matching:

```
P_BERT = (1/|x̂|) × Σ_{x̂j ∈ x̂} max_{xi ∈ x} cosine(xi, x̂j)   ← candidate precision (faithfulness)
R_BERT = (1/|x|)  × Σ_{xi ∈ x}  max_{x̂j ∈ x̂} cosine(xi, x̂j)  ← reference recall (coverage)
F_BERT = harmonic_mean(P_BERT, R_BERT)                         ← primary metric
```

**Why the thresholds are equivalent:** Raw scores with `roberta-large` cluster in 0.84–0.96 (inflated by the learned embedding geometry). Rescaling maps scores against a random-baseline lower bound (`b ≈ 0.84`), so raw ≥ 0.88 ≈ rescaled ≥ 0.55 for English. This represents "semantically similar" text — not identical, but capturing the same meaning.

| Raw F1 | Rescaled F1 | Interpretation |
|---|---|---|
| < 0.84 | ~0 | Poor — nearly unrelated |
| 0.84–0.87 | 0.0–0.45 | Weak — partial overlap |
| **≥ 0.88** | **≥ 0.55** | **✅ Hackathon PASS — semantically similar** |
| 0.90–0.92 | 0.65–0.75 | Good — high semantic match |
| ≥ 0.95 | ≥ 0.88 | Near-paraphrase quality |

**Usage:**
```python
from evaluate import load

bertscore = load("bertscore")
results = bertscore.compute(
    predictions=candidates, references=references,
    model_type="roberta-large", rescale_with_baseline=True, lang="en"
)
# results["f1"][i] >= 0.55 → PASS for sample i
```

### Metric 3: RAGAS (Component Diagnostics)

[RAGAS](https://arxiv.org/abs/2309.15217) provides **reference-free, LLM-powered** evaluation of individual RAG components:

| RAGAS Metric | What It Catches | Formula |
|---|---|---|
| **Faithfulness** | Hallucinations — statements not grounded in context | `|verified_statements| / |total_statements|` |
| **Answer Relevancy** | Off-topic or incomplete answers | `avg cosine_sim(query, generated_questions_from_answer)` |
| **Context Precision** | Retrieval noise — irrelevant chunks returned | Precision of relevant retrieved contexts |
| **Context Recall** | Missing knowledge — relevant info not retrieved | Coverage of the reference by retrieved contexts |

### Metric 4: Custom Metrics (No LLM Dependency)

| Metric | Description | Standard |
|---|---|---|
| **F1 Score** | Token-level F1 vs the gold answer | SQuAD/HotpotQA |
| **Exact Match** | Normalized string match | SQuAD/HotpotQA |
| **Context Hit Rate** | Fraction of supporting facts found in the retrieved contexts | Custom |
| **Token Efficiency** | `graphrag_tokens / baseline_tokens` ratio | Custom |
| **Cost per Query** | `tokens × provider_pricing` | Custom |
| **Response Latency** | End-to-end ms from question to answer | Custom |
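The SQuAD-style F1/EM above follow the standard normalization (lowercase, strip punctuation and articles, collapse whitespace). A self-contained sketch of that convention:

```python
import re
import string
from collections import Counter

def normalize(s):
    """SQuAD-style answer normalization."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)   # drop English articles
    return " ".join(s.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)        # multiset token overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# "The Germany" normalizes to "germany", so EM = 1.0.
assert exact_match("The Germany", "germany") == 1.0
```

`token_f1("born in Germany", "Germany")` gives 0.5 (precision 1/3, recall 1), which is why partial answers still earn credit under F1 but not under EM.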

### Evaluation Code Path

```python
from graphrag.layers.evaluation_layer import EvaluationLayer

evaluator = EvaluationLayer(eval_llm_model="gpt-4o-mini")
evaluator.initialize()  # loads RAGAS if available

# Call shown with the demo example; method name/kwargs reconstructed from context.
results = evaluator.evaluate(
    question="Were Einstein and Newton born in the same country?",
    baseline_answer="They were both scientists.",
    graphrag_answer="No. Einstein was born in Germany while Newton was born in England.",
    supporting_facts=["Einstein was born in Ulm, Germany", "Newton was born in Woolsthorpe, England"],
)
```

---

## 🤖 12 LLM Providers

All providers are unified behind a single `UniversalLLM` interface with automatic detection, cost tracking, and fallback chains.

| Provider | Model | Cost (per 1K tokens) | Speed | Free Tier |
|----------|-------|------|-------|-----------|
| **Ollama** 🦙 | llama3.2 | **$0.00** | ⚡ Local | ✅ Unlimited |
| **HuggingFace** | Llama 3.3 70B | **$0.00** | 🔵 Medium | ✅ Rate-limited |
| **DeepSeek** | DeepSeek V3 | $0.00014 | ⚡ Fast | ✅ Generous |
| **Gemini** | 2.0 Flash | $0.0001 | ⚡ Fast | ✅ Generous |
| **OpenAI** | GPT-4o-mini | $0.00015 | ⚡ Fast | 🟡 Trial credits |
| **Groq** | Llama 3.3 70B | $0.0006 | ⚡⚡ Blazing | ✅ Free tier |
| **Together** | Llama 3.1 70B | $0.0009 | ⚡ Fast | 🟡 Trial credits |
| **Mistral** | Large | $0.002 | 🔵 Medium | 🟡 Trial credits |
| **Cohere** | Command R+ | $0.0025 | 🔵 Medium | ✅ Trial |
| **Anthropic** | Claude Sonnet 4 | $0.003 | 🔵 Medium | 🟡 Trial credits |
| **xAI** | Grok 3 | $0.003 | 🔵 Medium | 🟡 Trial credits |
| **OpenRouter** | 200+ models | Varies | Varies | 🟡 Trial credits |

**Zero-cost hackathon setup:** Ollama (local, unlimited) + the Gemini free tier + the HuggingFace Inference API = full 3-pipeline benchmarking at $0.

---

## 📈 Expected Benchmarks

### Pipeline Comparison (HotpotQA)

| Metric | Pipeline 1 (LLM-Only) | Pipeline 2 (Basic RAG) | Pipeline 3 (GraphRAG) | GraphRAG vs Basic RAG |
|--------|----------------------|----------------------|---------------------|----------------------|
| **F1 Score** | ~0.30–0.40 | ~0.45–0.60 | ~0.55–0.70 | **+13–21%** ✅ |
| **Exact Match** | ~0.15–0.25 | ~0.30–0.45 | ~0.35–0.50 | **+11%** ✅ |
| **Tokens/Query** | ~2,000–12,000+ | ~800–1,000 | ~1,200–2,000* | Bounded by budget |
| **Win Rate** | — | — | ~55–70% | ✅ GraphRAG |

*\*With the Token Budget Controller (Novelty #4), GraphRAG context is capped at 2,000 tokens.*

### By Question Type (Literature-Backed Predictions)

| Question Type | Basic RAG F1 | GraphRAG F1 | Δ | Evidence |
|---|---|---|---|---|
| **Bridge** (multi-hop) | ~0.52 | ~0.63 | **+21%** | Graph traversal connects cross-document facts |
| **Comparison** | ~0.58 | ~0.61 | +5% | Entity-pair paths provide structured comparison context |
| **Temporal** | ~0.31 | ~0.51 | **+64%** | [Han et al., 2025](https://arxiv.org/abs/2502.11371), Table 32 |
| **Summarization** | ~0.45 | ~0.51 | **+23%** | [GraphRAG-Bench](https://arxiv.org/abs/2506.05690) on novel corpus |
| **Simple Factoid** | ~0.65 | ~0.63 | −3% | Vector RAG is faster/cheaper for single-hop (the router handles this) |

### Token Efficiency Claims (With Citations)

| Claim | Number | Source | Context |
|---|---|---|---|
| GraphRAG comprehensiveness win rate | 72–83% | [Edge et al., 2024](https://arxiv.org/abs/2404.16130), Appendix G, p < .001 | vs vector RAG across Podcast (1M tokens) and News (1.7M tokens) corpora |
| Community summaries vs full text | 26–97% fewer tokens | [Edge et al., 2024](https://arxiv.org/abs/2404.16130), Table 2 | C0 = 97% fewer, C3 = 26–33% fewer |
| Token Budget Controller reduction | 97% at 80%+ accuracy | [TERAG, 2024](https://arxiv.org/abs/2509.18667) | 3–11% of LightRAG's token cost |
| Spreading-activation correctness | +39% | [SA-RAG, 2024](https://arxiv.org/abs/2512.15922) | On the MuSiQue multi-hop benchmark |
| Path-retrieval win rate | 62–65% | [PathRAG, 2025](https://arxiv.org/abs/2502.14902) | vs LightRAG comprehensiveness |
| Complex-reasoning accuracy | +9.58% | [GraphRAG-Bench, 2025](https://arxiv.org/abs/2506.05690), Table 2 | Novel dataset, ACC: 50.93 vs 41.35 |
| ROUGE-L on complex reasoning | +59% | [GraphRAG-Bench, 2025](https://arxiv.org/abs/2506.05690), Table 2 | 24.09 vs 15.12 |

---

## 🚀 Quick Start

### Prerequisites

- Python ≥ 3.10
- TigerGraph Savanna account ([tgcloud.io](https://tgcloud.io)) or Community Edition ([dl.tigergraph.com](https://dl.tigergraph.com))
- At least one LLM API key (or Ollama for free local inference)

### Option A: Next.js Dashboard (Recommended)

```bash
git clone https://huggingface.co/muthuk1/graphrag-inference-hackathon
cd graphrag-inference-hackathon

# Configure environment
cp .env.example .env
# Edit .env: set TG_HOST, TG_PASSWORD, and at least one LLM provider key

# Set up TigerGraph (one-time: creates schema + installs GSQL queries)
pip install -r requirements.txt
python graphrag/setup_tigergraph.py

# Launch the Next.js dashboard
cd web && npm install && npm run dev   # → http://localhost:3000
```

### Option B: Docker (One Command)

```bash
docker build -t graphrag .
docker run -p 3000:3000 -p 7860:7860 --env-file .env graphrag
# → Next.js at :3000, Gradio at :7860
```

### Option C: Python CLI

```bash
pip install -r requirements.txt

# Run benchmark (HotpotQA evaluation with F1/EM)
python -m graphrag.main benchmark --samples 50 --top-k 5 --hops 2 --output results.json

# Launch Gradio dashboard
python -m graphrag.main dashboard --port 7860 --share

# Quick demo comparison
python -m graphrag.main demo
```

### Option D: Ollama (100% Free, No API Keys)

```bash
ollama pull llama3.2

# Point the app at the local Ollama server in .env:
# OLLAMA_BASE_URL=http://localhost:11434

cd web && npm install && npm run dev
```

### Key Configuration Parameters

| Parameter | Default | Description | Tuning Guidance |
|---|---|---|---|
| `top_k` | 5 | Chunks/entities from vector search | Higher = more context, more tokens |
| `hops` | 2 | Graph traversal depth | 2–3 optimal; >3 introduces noise |
| `chunk_size` | 1000 | Characters per chunk during ingestion | 600–1000 for most domains |
| `chunk_overlap` | 100 | Overlap between chunks | 10–20% of `chunk_size` |
| `token_budget` | 2000 | Max tokens in the final context | Lower = cheaper; test the accuracy impact |
| `damping` | 0.85 | PPR teleportation probability | Standard; lower = more exploration |
| `decay_factor` | 0.7 | Spreading-activation propagation | 0.5–0.8; lower = more focused |
| `complexity_threshold` | 0.6 | Router: above = GraphRAG, below = baseline | Tune to your query distribution |
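The `complexity_threshold` parameter is the whole decision rule of the adaptive router (Novelty #9). A minimal sketch; in the real system an LLM produces the 0.0–1.0 complexity score, which is stubbed here:

```python
def route(query, complexity_score, threshold=0.6):
    """Adaptive router: cheap baseline RAG below the threshold, GraphRAG above."""
    return "graphrag" if complexity_score >= threshold else "baseline_rag"

# Scores are illustrative stand-ins for the LLM complexity scorer.
assert route("Who wrote Hamlet?", 0.2) == "baseline_rag"          # single-hop → cheap path
assert route("Compare the birthplaces of Einstein and Newton", 0.8) == "graphrag"
```

Raising the threshold routes more traffic to the cheap baseline; lowering it buys accuracy on borderline queries at higher token cost.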

---

## 🚢 Deployment

### Docker

```bash
# Multi-stage build: Node 20 frontend + Python 3 venv backend
docker build -t graphrag .
docker run -p 3000:3000 -p 7860:7860 \
  -e TG_HOST=https://YOUR_SUBDOMAIN.tgcloud.io \
  -e TG_PASSWORD=your_password \
  -e ANTHROPIC_API_KEY=sk-ant-... \
  graphrag
```

### Environment Variables

```bash
# TigerGraph (required)
TG_HOST=https://YOUR_SUBDOMAIN.tgcloud.io
TG_GRAPH=GraphRAG
TG_USERNAME=tigergraph
TG_PASSWORD=        # required

# LLM (set any — auto-detected)
OPENAI_API_KEY=sk-...                     # GPT-4o, GPT-4o-mini
ANTHROPIC_API_KEY=sk-ant-...              # Claude Sonnet 4
GEMINI_API_KEY=AIza...                    # Gemini 2.0 Flash
OLLAMA_BASE_URL=http://localhost:11434    # Free local

# Defaults
LLM_PROVIDER=anthropic
LLM_MODEL=claude-sonnet-4-20250514
DASHBOARD_PORT=7860
```

### TigerGraph MCP Integration

Connect TigerGraph directly to AI coding tools (Cursor, VS Code Copilot) — build with natural language instead of GSQL:

```json
{
  "mcpServers": {
    "tigergraph": {
      "command": "uvx",
      "args": ["pyTigerGraph-mcp"],
      "env": {
        "TG_HOST": "https://yoursubdomain.tgcloud.io",
        "TG_GRAPH": "GraphRAG",
        "TG_USERNAME": "tigergraph",
        "TG_PASSWORD": "your_password"
      }
    }
  }
}
```

---

## 🦞 OpenClaw Agent Integration

GraphRAG capabilities exposed as autonomous agent skills following the CIK (Cognition-Identity-Knowledge) model:

| Component | File | Purpose |
|-----------|------|---------|
| `SOUL.md` | `openclaw/SOUL.md` | Agent identity, values, operational boundaries |
| `IDENTITY.md` | `openclaw/IDENTITY.md` | Provider config, graph-schema awareness, channels |
| `MEMORY.md` | `openclaw/MEMORY.md` | Learned performance knowledge across runs |
| `graph_query` | `openclaw/skills/graph_query/` | Natural language → knowledge-graph traversal |
| `compare_pipelines` | `openclaw/skills/compare_pipelines/` | Dual-pipeline comparison with metrics |
| `cost_estimate` | `openclaw/skills/cost_estimate/` | 12-provider cost projection and optimization |

---

## 🧪 Testing

```bash
python tests/test_core.py       # 31 tests — core pipeline functions
python tests/test_novelties.py  # 24 tests — all 6 novelty techniques

# Total: 55 tests covering:
# - PPR convergence, damping, seed weighting
# - Spreading-activation decay, threshold, multi-hop
# - PolyG query classification (entity/relation/multi-hop/summarization)
# - Path finding, pruning, serialization
# - Token Budget Controller, utilization tracking
# - F1/EM computation, context hit rate
# - Incremental graph-update planning
```

---

## 📁 Project Structure

```
├── web/                                # Next.js 15 Dashboard (port 3000)
│   ├── src/app/api/
│   │   ├── compare/route.ts            # Multi-provider 3-pipeline comparison API
│   │   ├── benchmark/route.ts          # Live benchmark with F1/EM/tokens
│   │   └── providers/route.ts          # Provider health checking
│   ├── src/components/
│   │   ├── tabs/LiveCompare.tsx        # Side-by-side pipeline comparison
│   │   ├── tabs/Benchmark.tsx          # "Run Benchmark Now" + charts
│   │   ├── tabs/CostAnalysis.tsx       # 12-provider cost projections
│   │   └── tabs/GraphExplorer.tsx      # Interactive graph visualization
│   └── src/lib/
│       ├── llm-providers.ts            # 12-provider universal client (TS)
│       └── design-tokens.ts            # TigerGraph design system tokens
│
├── openclaw/                           # OpenClaw Agent (CIK model)
│   ├── SOUL.md / IDENTITY.md / MEMORY.md
│   └── skills/                         # graph_query, compare_pipelines, cost_estimate
│
├── tests/
│   ├── test_core.py                    # 31 core tests
│   └── test_novelties.py               # 24 novelty technique tests
│
├── Dockerfile                          # Multi-stage: Node 20 + Python 3
├── requirements.txt                    # Python dependencies
├── .env.example                        # Full configuration template
└── README.md                           # This file
```

---

## 📚 References

### Directly Implemented (6 papers → novel techniques)

| # | Paper | ArXiv | Key Contribution | Our Implementation |
|---|-------|-------|------------------|--------------------|
| 1 | **CatRAG** — PPR + Dynamic Edge Weighting | [2602.01965](https://arxiv.org/abs/2602.01965) (Feb 2025) | Personalized PageRank for reasoning completeness | `PPRConfidenceScorer` |
| 2 | **SA-RAG** — Spreading Activation Retrieval | [2512.15922](https://arxiv.org/abs/2512.15922) (Dec 2024) | +39% correctness via activation propagation | `SpreadingActivation` |
| 3 | **PathRAG** — Flow-Pruned Path Retrieval | [2502.14902](https://arxiv.org/abs/2502.14902) (Feb 2025) | 62–65% win rate via path serialization | `PathPruner` |
| 4 | **TERAG** — Token-Efficient Graph RAG | [2509.18667](https://arxiv.org/abs/2509.18667) (Sep 2024) | 97% token reduction at 80%+ accuracy | `TokenBudgetController` |
| 5 | **RAGRouter-Bench** — Hybrid Routing | [2602.00296](https://arxiv.org/abs/2602.00296) (Feb 2025) | Adaptive routing > fixed paradigm | `PolyGRouter` |
| 6 | **TG-RAG** — Incremental Temporal Graph | [2510.13590](https://arxiv.org/abs/2510.13590) (Oct 2024) | O(new) incremental updates | `IncrementalGraphUpdater` |

### Architecture Inspiration (4 papers)

| # | Paper | ArXiv | Contribution |
|---|-------|-------|-------------|
| 7 | **GraphRAG** — Microsoft's Community-Based RAG | [2404.16130](https://arxiv.org/abs/2404.16130) (Apr 2024) | Hierarchical Leiden community detection + map-reduce summarization; 72–83% comprehensiveness win rate |
| 8 | **LightRAG** — Dual-Level Retrieval | [2410.05779](https://arxiv.org/abs/2410.05779) (Oct 2024, 34K⭐) | High-level + low-level keyword dual-channel retrieval |
| 9 | **Youtu-GraphRAG** — Schema-Bounded Extraction | [2508.19855](https://arxiv.org/abs/2508.19855) (Tencent, 2025) | Constrained entity types → 90% extraction cost reduction, +16% accuracy |
| 10 | **HippoRAG 2** — PPR + Passage Integration | [2502.14802](https://arxiv.org/abs/2502.14802) (Feb 2025) | Hippocampus-inspired graph, 87.9–90.9% evidence recall on complex questions |

### Evaluation Methodology (2 papers)

| # | Paper | ArXiv | Contribution |
|---|-------|-------|--------------|
| 11 | **Judging LLM-as-a-Judge** | [2306.05685](https://arxiv.org/abs/2306.05685) (NeurIPS 2023) | LLM judge methodology, bias mitigation |
| 12 | **BERTScore** | [1904.09675](https://arxiv.org/abs/1904.09675) (ICLR 2020) | Token-level semantic similarity metric |

### Benchmark Evidence

| # | Paper | ArXiv | Key Finding |
|---|-------|-------|-------------|
| — | **RAG vs. GraphRAG: Systematic Evaluation** | [2502.11371](https://arxiv.org/abs/2502.11371) (Feb 2025) | Integration improves the best single method by +6.4%; temporal: GraphRAG 50.6% vs RAG 30.7% |
| — | **GraphRAG-Bench** | [2506.05690](https://arxiv.org/abs/2506.05690) (Jun 2025) | GraphRAG excels on complex reasoning (+9.58% ACC); RAG better on simple factoid |
| — | **GraphRAG Survey** | [2501.13958](https://arxiv.org/abs/2501.13958) (Jan 2025) | Comprehensive taxonomy: Index-Graph vs KG-based; TigerGraph architecture comparison |

### Citation Flow

```
Microsoft GraphRAG (2404.16130) ─── cited by ──→ LightRAG (2410.05779)
        │                                               │
        ├──────── cited by ──→ CatRAG (2602.01965)      ├──→ TERAG (2509.18667)
        ├──────── cited by ──→ PathRAG (2502.14902)     ├──→ TG-RAG (2510.13590)
        ├──────── cited by ──→ SA-RAG (2512.15922)      └──→ RAGRouter-Bench (2602.00296)
        └──────── cited by ──→ GraphRAG-Bench (2506.05690)
```
-
HippoRAG 2 (2502.14802) ───────────────┘
|
| 632 |
-
Youtu-GraphRAG (2508.19855) ── builds on ──→ Microsoft GraphRAG schema-bounded variant
|
| 633 |
-
```
|
| 634 |
-
|
| 635 |
-
### Datasets & Evaluation Frameworks
|
| 636 |
-
|
| 637 |
-
- [**HotpotQA**](https://arxiv.org/abs/1809.09600) — Multi-hop QA benchmark (bridge + comparison questions)
|
| 638 |
-
- [**RAGAS**](https://arxiv.org/abs/2309.15217) — RAG evaluation: Faithfulness, Relevancy, Context Precision/Recall
|
| 639 |
-
- [**Prometheus 2**](https://arxiv.org/abs/2405.01535) — Open-source LLM judge (Apache 2.0, GPT-4-comparable)
|
| 640 |
|
| 641 |
---
|
| 642 |
|
| 643 |
-
## 🔗
|
| 644 |
|
| 645 |
-
|
| 646 |
-
|---|---|
|
| 647 |
-
| TigerGraph GraphRAG Repo | [github.com/tigergraph/graphrag](https://github.com/tigergraph/graphrag) |
|
| 648 |
-
| TigerGraph MCP | [github.com/tigergraph/tigergraph-mcp](https://github.com/tigergraph/tigergraph-mcp) |
|
| 649 |
-
| TigerGraph Savanna | [tgcloud.io](https://tgcloud.io) |
|
| 650 |
-
| Community Edition | [dl.tigergraph.com](https://dl.tigergraph.com) |
|
| 651 |
-
| TigerGraph Docs | [docs.tigergraph.com](https://docs.tigergraph.com) |
|
| 652 |
-
| Discord Community | [discord.gg/Djy8xxDR](https://discord.gg/Djy8xxDR) |
|
| 653 |
|
| 654 |
---
|
| 655 |
|
| 656 |
<div align="center">
|
| 657 |
|
| 658 |
-
|
| 659 |
|
| 660 |
-
|
| 661 |
|
| 662 |
*Build it. Benchmark it. Prove graph beats tokens.*
|
| 663 |
|
| 664 |
-
**Token reduction with maintained accuracy — that's the whole game.**
|
| 665 |
-
|
| 666 |
</div>
|
|
|
|
# 🔍 GraphRAG Inference Hackathon — 3-Pipeline Benchmarking System

<div align="center">

**One query in → three pipelines run → side-by-side responses + metrics out.**

Proving that graphs make LLM inference faster, cheaper, and smarter — backed by 12 research papers, 6 novel retrieval techniques, and the full hackathon evaluation stack.

[3-Pipeline Architecture](#-3-pipeline-architecture) · [TG GraphRAG Integration](#-tigergraph-graphrag-integration) · [Novelties](#-14-novel-techniques) · [Evaluation](#-evaluation-framework) · [Quick Start](#-quick-start)

</div>

---

## 🎯 What This Is

A **3-pipeline GraphRAG benchmarking system** built on top of the [TigerGraph GraphRAG repo](https://github.com/tigergraph/graphrag), with **14 novel techniques** from 2024–2025 research, **12 LLM providers**, and a **production dashboard** showing all three pipelines side by side with LLM-as-a-Judge + BERTScore evaluation.

| Pipeline 1: LLM-Only | Pipeline 2: Basic RAG | Pipeline 3: GraphRAG |
|---|---|---|
| Query → LLM → Answer | Query → Embed → Top-K Chunks → LLM | Query → **TG GraphRAG Service** → **NoveltyEngine** → LLM |
| No retrieval. Worst-case baseline. | Vector embeddings. Industry standard. | Built on [tigergraph/graphrag](https://github.com/tigergraph/graphrag) + 6 novelties. |

---

## 🐯 TigerGraph GraphRAG Integration

Pipeline 3 is **built on top of the official [TigerGraph GraphRAG repo](https://github.com/tigergraph/graphrag)** (Path B: customize). The integration layer (`tg_graphrag_client.py`) wraps the official service:

```python
from graphrag.layers.tg_graphrag_client import TGGraphRAGClient

client = TGGraphRAGClient(service_url="http://localhost:8000")
client.connect()

# Official retrievers: Hybrid Search, Community, Sibling
result = client.retrieve(query="What did Einstein discover?",
                         retriever="hybrid", top_k=5, num_hops=2)
result = client.retrieve(query="Main themes?",
                         retriever="community", community_level=2)
```

**Modes:** REST API (official service) → Direct pyTigerGraph (fallback) → Offline (passage-based).
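
That fallback order can be expressed as a small mode-resolution helper. This is an illustrative sketch only — `RetrievalMode` and `pick_mode` are hypothetical names, not code from this repo:

```python
# Illustrative fallback chain: try the official REST service first, then a
# direct pyTigerGraph connection, then offline passage-based retrieval.
class RetrievalMode:
    REST = "rest"          # official TG GraphRAG service
    DIRECT = "direct"      # direct pyTigerGraph connection
    OFFLINE = "offline"    # passage-based, no graph available

def pick_mode(service_ok: bool, tg_conn_ok: bool) -> str:
    """Resolve the retrieval mode in priority order."""
    if service_ok:
        return RetrievalMode.REST
    if tg_conn_ok:
        return RetrievalMode.DIRECT
    return RetrievalMode.OFFLINE
```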

```bash
# Deploy official TG GraphRAG + point our system at it
git clone https://github.com/tigergraph/graphrag && cd graphrag && docker-compose up -d
export GRAPHRAG_SERVICE_URL=http://localhost:8000
python -m graphrag.main benchmark --samples 50
```

---

## 🏗️ 3-Pipeline Architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│                           LAYER 4: EVALUATION                            │
│ LLM-as-a-Judge (PASS/FAIL, ≥90%) │ BERTScore F1 (≥0.55) │ RAGAS │ F1/EM  │
├──────────────────────────────────────────────────────────────────────────┤
│                  LAYER 3: UNIVERSAL LLM (12 Providers)                   │
├──────────────────────────────────────────────────────────────────────────┤
│            LAYER 2: 3-PIPELINE ORCHESTRATION + NOVELTY ENGINE            │
│ Pipeline 1: LLM-Only │ Pipeline 2: Basic RAG │ Pipeline 3: GraphRAG      │
│ NoveltyEngine: PolyG Router → PPR → Spreading Activation → Token Budget  │
├──────────────────────────────────────────────────────────────────────────┤
│                              LAYER 1: GRAPH                              │
│ TG GraphRAG Service (official repo) ←→ Direct pyTigerGraph (fallback)    │
│ Retrievers: Hybrid, Community, Sibling │ GSQL: PPR, Paths, Activation    │
└──────────────────────────────────────────────────────────────────────────┘
```

### Pipeline 3 Flow

```
Query → keyword extraction → TG GraphRAG Service (hybrid retriever)
      → NoveltyEngine: PolyG Router → PPR → Spreading Activation → Token Budget
      → Structured context (entities + relationships + passages) → LLM → Answer
```
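
The NoveltyEngine stages above apply in sequence, each narrowing the candidate context before the LLM sees it. A minimal sketch of that chaining — `Context`, `run_novelty_chain`, and the stub `token_budget` stage are all hypothetical stand-ins, not the classes in `novelties.py`:

```python
# Minimal sketch of a stage chain: each stage maps Context -> Context.
from dataclasses import dataclass, field

@dataclass
class Context:
    query: str
    candidates: list = field(default_factory=list)  # (text, score) pairs

def run_novelty_chain(ctx: Context, stages) -> Context:
    # Apply each retrieval/pruning stage in order.
    for stage in stages:
        ctx = stage(ctx)
    return ctx

def token_budget(max_items: int):
    # Stand-in for a token-budget stage: keep only the top-scored candidates.
    def stage(ctx: Context) -> Context:
        ctx.candidates = sorted(ctx.candidates, key=lambda c: -c[1])[:max_items]
        return ctx
    return stage

ctx = Context("What did Einstein discover?",
              [("photoelectric effect", 0.9), ("violin", 0.2), ("relativity", 0.8)])
ctx = run_novelty_chain(ctx, [token_budget(2)])
# ctx.candidates → [("photoelectric effect", 0.9), ("relativity", 0.8)]
```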

---

## 🌟 14 Novel Techniques

### Graph Retrieval (6 papers, wired into Pipeline 3 via NoveltyEngine)

| # | Technique | Paper | Result | Code |
|---|-----------|-------|--------|------|
| 1 | PPR Confidence Retrieval | [CatRAG](https://arxiv.org/abs/2602.01965) | Best reasoning on 4 benchmarks | `PPRConfidenceScorer` |
| 2 | Spreading Activation | [SA-RAG](https://arxiv.org/abs/2512.15922) | +39% correctness | `SpreadingActivation` |
| 3 | Flow-Pruned Paths | [PathRAG](https://arxiv.org/abs/2502.14902) | 62–65% win rate | `PathPruner` |
| 4 | Token Budget Controller | [TERAG](https://arxiv.org/abs/2509.18667) | 97% token reduction | `TokenBudgetController` |
| 5 | PolyG Hybrid Router | [RAGRouter-Bench](https://arxiv.org/abs/2602.00296) | Adaptive > fixed | `PolyGRouter` |
| 6 | Incremental Updates | [TG-RAG](https://arxiv.org/abs/2510.13590) | O(new) cost | `IncrementalGraphUpdater` |
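
To make technique #2 concrete: spreading activation seeds the query's entities with full activation and pushes decayed scores to graph neighbors, hop by hop. A toy version over a plain adjacency dict (illustrative only — the repo's `SpreadingActivation` runs against TigerGraph via GSQL):

```python
def spread_activation(graph, seeds, decay=0.5, hops=2):
    """Toy spreading activation: seeds start at 1.0 and push decayed
    activation to neighbors for a fixed number of hops."""
    activation = {n: 1.0 for n in seeds}
    frontier = dict(activation)
    for _ in range(hops):
        nxt = {}
        for node, score in frontier.items():
            for nb in graph.get(node, []):
                nxt[nb] = max(nxt.get(nb, 0.0), score * decay)
        for nb, s in nxt.items():
            if s > activation.get(nb, 0.0):
                activation[nb] = s
        frontier = nxt
    return activation

g = {"einstein": ["relativity", "photoelectric"], "relativity": ["spacetime"]}
scores = spread_activation(g, ["einstein"])
# scores["relativity"] == 0.5, scores["spacetime"] == 0.25
```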

### Architecture + System (#7–14)

Schema-bounded extraction, dual-level keywords, adaptive routing, graph reasoning explanation, 12-provider LLM, OpenClaw agent, live 3-pipeline dashboard, advanced GSQL queries.
---

## 📊 Evaluation Framework

All hackathon-required metrics implemented in `evaluation_layer.py`:

| Metric | Target | Implementation |
|---|---|---|
| **LLM-as-a-Judge** (PASS/FAIL) | ≥ 90% pass rate | `compute_llm_judge()` — reference-guided, CoT, JSON output |
| **BERTScore F1** | ≥ 0.55 rescaled / ≥ 0.88 raw | `compute_bertscore()` — roberta-large with rescaling |
| **F1 / Exact Match** | — | SQuAD/HotpotQA standard |
| **RAGAS** | — | Faithfulness, Relevancy, Context Precision/Recall |
| **Token Efficiency** | — | Per-pipeline, per-query tracking |
| **Cost per Query** | — | `tokens × provider_pricing` |
| **Latency** | — | End-to-end ms |
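
The **Cost per Query** row is plain arithmetic over token counts and provider pricing; a minimal sketch, with placeholder prices rather than any real provider's rates:

```python
def cost_per_query(prompt_tokens, completion_tokens, price_in, price_out):
    """Cost in USD given per-1M-token input/output prices.
    The pricing values used below are placeholders, not real rates."""
    return (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000

# e.g. 1,200 prompt + 300 completion tokens at $0.50 / $1.50 per 1M tokens
c = cost_per_query(1200, 300, 0.50, 1.50)
# c == 0.00105
```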

```python
from graphrag.layers.evaluation_layer import compute_llm_judge, compute_bertscore

# LLM-as-a-Judge
result = compute_llm_judge(question, reference, candidate, llm_fn)
# → {"verdict": "PASS", "feedback": "..."}

# BERTScore
results = compute_bertscore(predictions, references, rescale=True)
# → {"mean_f1": 0.62, "pass_rate": 0.85}
```
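
`compute_llm_judge()` relies on the judge emitting a JSON verdict after its chain-of-thought. A hedged sketch of how such output might be parsed — the actual prompt and response handling in `evaluation_layer.py` may differ:

```python
import json

def parse_judge_response(raw: str) -> dict:
    """Extract a {"verdict": ..., "feedback": ...} object from a judge reply,
    tolerating surrounding chain-of-thought text. Format assumed for
    illustration, not taken from evaluation_layer.py."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        return {"verdict": "FAIL", "feedback": "unparseable judge output"}
    try:
        obj = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return {"verdict": "FAIL", "feedback": "malformed JSON"}
    # Normalize: anything other than an explicit PASS counts as FAIL.
    obj["verdict"] = "PASS" if str(obj.get("verdict", "")).upper() == "PASS" else "FAIL"
    return obj

r = parse_judge_response('Reasoning... {"verdict": "pass", "feedback": "matches reference"}')
# r["verdict"] == "PASS"
```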

---

## 🚀 Quick Start

```bash
git clone https://huggingface.co/muthuk1/graphrag-inference-hackathon
cd graphrag-inference-hackathon && cp .env.example .env
pip install -r requirements.txt

# Setup TigerGraph (schema + core + advanced GSQL queries)
python graphrag/setup_tigergraph.py

# 3-pipeline benchmark
python -m graphrag.main benchmark --samples 50 --output results.json

# 3-column Gradio dashboard
python -m graphrag.main dashboard

# Next.js dashboard
cd web && npm install && npm run dev

# Docker
docker build -t graphrag . && docker run -p 3000:3000 -p 7860:7860 --env-file .env graphrag

# Free (Ollama)
ollama pull llama3.2 && python -m graphrag.main demo
```

---

## 📁 Project Structure

```
graphrag/layers/
  tg_graphrag_client.py            # 🆕 Official TG GraphRAG service integration
  orchestration_layer.py           # 🆕 3-pipeline + NoveltyEngine wiring
  evaluation_layer.py              # 🆕 LLM-Judge + BERTScore + RAGAS + F1/EM
  novelties.py                     # 6 novel techniques (PPR, activation, paths, budget, router, incremental)
  graph_layer.py                   # TigerGraph GSQL + schema
  gsql_advanced.py                 # Advanced GSQL (PPR, paths, activation)
  llm_layer.py / universal_llm.py  # 12-provider LLM
graphrag/
  benchmark.py                     # 🆕 3-pipeline HotpotQA benchmark
  dashboard.py                     # 🆕 3-column Gradio dashboard
  setup_tigergraph.py              # 🆕 Schema + core + advanced query install
  ingestion.py / main.py
web/src/app/api/compare/           # 🆕 3-pipeline Next.js API
openclaw/                          # Agent skills
tests/                             # 55 tests
```

---

## 📚 References (12 Papers)

**Implemented:** [CatRAG](https://arxiv.org/abs/2602.01965), [SA-RAG](https://arxiv.org/abs/2512.15922), [PathRAG](https://arxiv.org/abs/2502.14902), [TERAG](https://arxiv.org/abs/2509.18667), [RAGRouter-Bench](https://arxiv.org/abs/2602.00296), [TG-RAG](https://arxiv.org/abs/2510.13590)

**Architecture:** [Microsoft GraphRAG](https://arxiv.org/abs/2404.16130), [LightRAG](https://arxiv.org/abs/2410.05779), [Youtu-GraphRAG](https://arxiv.org/abs/2508.19855), [HippoRAG 2](https://arxiv.org/abs/2502.14802)

**Evaluation:** [LLM-as-a-Judge](https://arxiv.org/abs/2306.05685) (NeurIPS 2023), [BERTScore](https://arxiv.org/abs/1904.09675) (ICLR 2020)

---

## 🔗 Links

[TigerGraph GraphRAG](https://github.com/tigergraph/graphrag) · [TigerGraph Savanna](https://tgcloud.io) · [TigerGraph MCP](https://github.com/tigergraph/tigergraph-mcp) · [TigerGraph Docs](https://docs.tigergraph.com)
---

<div align="center">

**🏆 Built for the GraphRAG Inference Hackathon by TigerGraph**

3 Pipelines · 14 Novelties · 12 Papers · 12 LLMs · 55 Tests · LLM-Judge + BERTScore · Docker

*Build it. Benchmark it. Prove graph beats tokens.*

</div>