muthuk1 committed on
Commit
c6818ea
·
verified ·
1 Parent(s): 9333670

Fix #10: Rewrite README — 3-pipeline system, TG GraphRAG integration, LLM-Judge + BERTScore, NoveltyEngine wiring

Files changed (1)
  1. README.md +109 -563
README.md CHANGED
@@ -1,533 +1,163 @@
1
- # 🔍 GraphRAG Inference Hackathon — Dual Pipeline System
2
 
3
  <div align="center">
4
 
5
- [![TigerGraph](https://img.shields.io/badge/Graph_DB-TigerGraph-FF6B00?style=for-the-badge&logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjQiIGhlaWdodD0iMjQiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyI+PGNpcmNsZSBjeD0iMTIiIGN5PSIxMiIgcj0iMTAiIGZpbGw9IiNGRjZCMDAiLz48L3N2Zz4=)](https://www.tigergraph.com/)
6
- [![14 Novelties](https://img.shields.io/badge/Novelties-14_Techniques-002B49?style=for-the-badge)](#-14-novel-techniques)
7
- [![12 LLMs](https://img.shields.io/badge/LLMs-12_Providers-0072CE?style=for-the-badge)](#-12-llm-providers)
8
- [![12 Papers](https://img.shields.io/badge/Papers-12_Cited-cc785c?style=for-the-badge)](#-references--citation-graph)
 
9
  [![55 Tests](https://img.shields.io/badge/Tests-55_Passing-5db872?style=for-the-badge)](#-testing)
10
- [![Docker](https://img.shields.io/badge/Deploy-Docker-2496ED?style=for-the-badge&logo=docker&logoColor=white)](#-deployment)
11
 
12
- **Proving that graphs make LLM inference faster, cheaper, and smarter — backed by 12 research papers and 6 novel retrieval techniques.**
13
 
14
- [Architecture](#-architecture-ai-factory--4-layers) · [Novelties](#-14-novel-techniques) · [Evaluation](#-evaluation-framework) · [Quick Start](#-quick-start) · [Benchmarks](#-expected-benchmarks) · [Papers](#-references--citation-graph)
15
 
16
- </div>
17
-
18
- ---
19
 
20
- ## 📋 Table of Contents
21
-
22
- - [What This Is](#-what-this-is)
23
- - [The Problem We're Solving](#-the-problem-were-solving)
24
- - [Architecture (AI Factory — 4 Layers)](#-architecture-ai-factory--4-layers)
25
- - [14 Novel Techniques](#-14-novel-techniques)
26
- - [Graph Schema & GSQL Queries](#-graph-schema--gsql-queries)
27
- - [Evaluation Framework](#-evaluation-framework)
28
- - [12 LLM Providers](#-12-llm-providers)
29
- - [Expected Benchmarks](#-expected-benchmarks)
30
- - [Quick Start](#-quick-start)
31
- - [Deployment](#-deployment)
32
- - [OpenClaw Agent Integration](#-openclaw-agent-integration)
33
- - [Testing](#-testing)
34
- - [Project Structure](#-project-structure)
35
- - [References & Citation Graph](#-references--citation-graph)
36
 
37
  ---
38
 
39
  ## 🎯 What This Is
40
 
41
- A **3-pipeline GraphRAG benchmarking system** with **14 novel techniques** from cutting-edge 2024–2025 research, **12 LLM providers** (including free local Ollama), **OpenClaw agent integration**, and a **production Next.js + Gradio dashboard** — all built on TigerGraph for the [GraphRAG Inference Hackathon](https://www.tigergraph.com/).
42
 
43
- | Pipeline 1 (LLM-Only) | Pipeline 2 (Basic RAG) | Pipeline 3 (GraphRAG) |
44
  |---|---|---|
45
- | Query → LLM → Answer | Query → Embed → Top-K Chunks → LLM → Answer | Query → **PolyG Router** → **PPR Scoring** → **Spreading Activation** → **Path Pruning** → **Token Budget** → LLM → Answer |
46
- | No retrieval. Worst-case baseline. | Vector embeddings. Industry standard. | Graph-enhanced, cost-controlled. |
47
-
48
- **The headline metric**: token reduction with maintained accuracy. GraphRAG community summaries achieve **26–97% fewer tokens vs full-text summarization** ([Edge et al., 2024](https://arxiv.org/abs/2404.16130)) while delivering **72–83% comprehensiveness win rate** over vector RAG (p < .001).
49
 
50
  ---
51
 
52
- ## 🧩 The Problem We're Solving
53
 
54
- LLMs burn through thousands of tokens to answer complex questions. At scale, that gets expensive fast:
55
 
56
- | Challenge | Vector RAG (Baseline) | GraphRAG (Our Approach) |
57
- |---|---|---|
58
- | **Multi-hop reasoning** | ❌ Retrieves *similar* chunks but can't chain facts across documents | ✅ Traverses entity relationships: `Einstein →BORN_IN→ Germany, Newton →BORN_IN→ England` |
59
- | **Context efficiency** | 🟡 Top-K chunks (~3,600 tokens per query, [Han et al., 2025](https://arxiv.org/abs/2502.11371)) | ✅ Token Budget Controller caps at 2,000 tokens — **97% reduction** vs unbounded retrieval ([TERAG](https://arxiv.org/abs/2509.18667)) |
60
- | **Global sensemaking** | ❌ Can't answer "What are the main themes across 1M tokens?" | ✅ Community-level summaries via Leiden hierarchical detection ([GraphRAG](https://arxiv.org/abs/2404.16130)) |
61
- | **Temporal reasoning** | ❌ 30.7% accuracy on time-dependent queries | ✅ **50.6% accuracy** (+64% improvement, [Han et al., 2025](https://arxiv.org/abs/2502.11371)) |
62
- | **Complex reasoning** | 41.35% accuracy on novel corpus | ✅ **50.93% accuracy** (+23%, [GraphRAG-Bench](https://arxiv.org/abs/2506.05690)) |
63
 
64
- ### ⚠️ Nuance: The Token Story
 
65
 
66
- The token efficiency claim has two distinct dimensions that the literature separates clearly:
 
 
 
 
 
67
 
68
- | Comparison | What the Data Shows | Source |
69
- |---|---|---|
70
- | **GraphRAG vs. Full-Text Summarization** | C0 (root communities) uses **97% fewer tokens**; C3 uses **26–33% fewer** | [Edge et al., Table 2](https://arxiv.org/abs/2404.16130) |
71
- | **GraphRAG vs. Top-K Vector RAG** | Community-GraphRAG retrieves ~2.7× MORE tokens (9,770 vs 3,631) | [Han et al., 2025](https://arxiv.org/abs/2502.11371) |
72
- | **With Token Budget Controller** | TERAG achieves **97% token reduction at 80%+ accuracy** vs unbounded | [TERAG, 2024](https://arxiv.org/abs/2509.18667) |
73
 
74
- **Our approach**: We use the Token Budget Controller (Novelty #4) to cap GraphRAG context at 2,000 tokens, combining the *structural advantage* of graph reasoning with the *cost advantage* of bounded context. This gives us both better answers AND controlled token cost.
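The budget capping described above reduces to a short greedy loop. A minimal sketch (the `select_within_budget` helper and the chunk tuples are hypothetical, not the project's `TokenBudgetController`):

```python
def select_within_budget(chunks, budget=2000):
    """Greedy selection: take highest-scoring chunks until the token budget is exhausted.

    `chunks` is a list of (score, token_count, text) tuples; in the real pipeline
    the scores would come from upstream graph ranking.
    """
    selected, used = [], 0
    for score, tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if used + tokens <= budget:
            selected.append(text)
            used += tokens
    return selected, used

chunks = [(0.9, 900, "A"), (0.8, 800, "B"), (0.7, 700, "C"), (0.2, 300, "D")]
context, used = select_within_budget(chunks, budget=2000)
# "A" and "B" fit; "C" would exceed the cap and is skipped; "D" (300 tokens) still fits.
```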
 
 
 
 
 
75
 
76
  ---
77
 
78
- ## 🏗️ Architecture (AI Factory — 4 Layers)
79
 
80
  ```
81
  ┌──────────────────────────────────────────────────────────────────────────────┐
82
  │ LAYER 4: EVALUATION │
83
- │ ┌─────────────────┬──────────────────┬──────────────────┬────────────────┐ │
84
- │ │ LLM-as-a-Judge │ BERTScore F1 │ RAGAS │ Token Tracking │ │
85
- │ │ (PASS/FAIL) │ (≥0.55 rescaled) │ (Faithfulness, │ (per-query) │ │
86
- │ │ Target: ≥90% │ (≥0.88 raw) │ Relevancy) │ │ │
87
- │ └─────────────────┴──────────────────┴──────────────────┴────────────────┘ │
88
- │ F1/EM (SQuAD) │ Context Hit Rate │ Live Benchmark │ Next.js Dashboard │
89
  ├──────────────────────────────────────────────────────────────────────────────┤
90
- │ LAYER 3: UNIVERSAL LLM (12 Providers via LiteLLM) │
91
- │ OpenAI │ Claude │ Gemini │ Mistral │ Ollama │ Groq │ DeepSeek │ xAI │ … │
92
- │ Single interface: model routing, cost tracking, fallback chains │
93
  ├──────────────────────────────────────────────────────────────────────────────┤
94
- │ LAYER 2: INFERENCE ORCHESTRATION + NOVELTY ENGINE │
95
- │ ┌────────────────────────────────────────────────────────────────────────┐ │
96
- │ │ Pipeline 1: LLM-Only (query → LLM → answer, no retrieval) │ │
97
- │ │ Pipeline 2: Baseline RAG (query → embed → vector top-K → LLM) │ │
98
- │ │ Pipeline 3: GraphRAG (novelty-enhanced, see below) │ │
99
- │ │ PolyG Router → PPR Scoring → Spreading Activation → │ │
100
- │ │ Path Pruning → Token Budget → Structured Context → LLM │ │
101
- │ │ Adaptive Router: complexity scorer 0.0–1.0 → route to optimal pipe │ │
102
- │ └────────────────────────────────────────────────────────────────────────┘ │
103
  ├──────────────────────────────────────────────────────────────────────────────┤
104
- │ LAYER 1: GRAPH (TigerGraph via pyTigerGraph ≥1.6) │
105
- │ GSQL: PPR │ Shortest Paths │ Spreading Activation │ Vector Search │
106
- │ Schema: Document │ Chunk │ Entity │ Community (Leiden hierarchy) │
107
- │ Incremental Updates (O(new) cost) │ Schema-Bounded Extraction (9 types) │
108
  └──────────────────────────────────────────────────────────────────────────────┘
109
  ```
110
 
111
- ### How Pipeline 3 (GraphRAG) Processes a Query
112
 
113
  ```
114
- Query: "Were Einstein and Newton of the same nationality?"
115
-
116
- Step 1: PolyG Router classifies as "multi_hop" (score=0.7) → graph_traversal strategy
117
- Step 2: Dual-level keyword extraction (LightRAG-inspired)
118
- → high_level: ["nationality", "comparison"]
119
- → low_level: ["Einstein", "Newton"]
120
- Step 3: Vector search → seed entities [Einstein, Newton] from TigerGraph
121
- Step 4: PPR from seeds → score all reachable entities by graph proximity
122
- Step 5: Spreading Activation → expand to 2-hop neighborhood with decay=0.7
123
- Step 6: Combined scoring: 0.6 × PPR + 0.4 × Activation per chunk
124
- Step 7: Token Budget Controller → select top chunks within 2,000 tokens (prune 60%+)
125
- Step 8: Path Serialization → "Einstein →BORN_IN→ Germany, Newton →BORN_IN→ England"
126
- (high-reliability paths placed FIRST — exploits lost-in-the-middle bias)
127
- Step 9: LLM generates answer with ranked, pruned, path-structured graph context
128
-
129
- Result: "No. Einstein was born in Germany and Newton was born in England."
130
- Tokens used: 1,847 (vs 3,600+ for vector RAG, vs 12,000+ for LLM-only)
131
  ```
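Steps 4–6 of the walkthrough (activation spreading and combined scoring) can be sketched on a toy graph; the adjacency map and PPR scores below are made up for illustration and are not the project's `SpreadingActivation` class:

```python
def spread_activation(graph, seeds, decay=0.7, hops=2):
    """Propagate activation from seed entities through graph edges with decay."""
    activation = {s: 1.0 for s in seeds}
    frontier = dict(activation)
    for _ in range(hops):
        next_frontier = {}
        for node, level in frontier.items():
            for neighbor in graph.get(node, []):
                passed = level * decay
                if passed > activation.get(neighbor, 0.0):
                    next_frontier[neighbor] = passed
        activation.update(next_frontier)
        frontier = next_frontier
    return activation

graph = {"Einstein": ["Germany"], "Newton": ["England"],
         "Germany": ["Europe"], "England": ["Europe"]}
act = spread_activation(graph, ["Einstein", "Newton"])

# Hypothetical PPR scores for the same entities; step 6 mixes the two signals.
ppr = {"Einstein": 0.30, "Newton": 0.30, "Germany": 0.15, "England": 0.15, "Europe": 0.10}
combined = {n: 0.6 * ppr.get(n, 0.0) + 0.4 * act.get(n, 0.0) for n in ppr}
```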
132
 
133
  ---
134
 
135
  ## 🌟 14 Novel Techniques
136
 
137
- ### Graph Retrieval Innovations (from 6 papers)
138
-
139
- | # | Technique | Paper | Key Result | Implementation |
140
- |---|-----------|-------|------------|----------------|
141
- | 1 | **PPR Confidence-Weighted Retrieval** | [CatRAG](https://arxiv.org/abs/2602.01965) (Feb 2025) | Best reasoning completeness on 4 benchmarks | `PPRConfidenceScorer` — Personalized PageRank from seed entities with damping=0.85, power iteration convergence |
142
- | 2 | **Spreading Activation Context Scoring** | [SA-RAG](https://arxiv.org/abs/2512.15922) (Dec 2024) | **+39% answer correctness** on MuSiQue | `SpreadingActivation` — propagates signal through graph edges with decay=0.7, ranks chunks by accumulated activation |
143
- | 3 | **Flow-Pruned Path Serialization** | [PathRAG](https://arxiv.org/abs/2502.14902) (Feb 2025) | **62–65% win rate** vs LightRAG | `PathPruner` — DFS path discovery, multiplicative edge-weight scoring, threshold pruning, lost-in-the-middle exploit |
144
- | 4 | **Graph Token Budget Controller** | [TERAG](https://arxiv.org/abs/2509.18667) (Sep 2024) | **97% token reduction** at 80%+ accuracy | `TokenBudgetController` — caps context at configurable token limit, prioritizes by score × relevance |
145
- | 5 | **PolyG Hybrid Retrieval Router** | [RAGRouter-Bench](https://arxiv.org/abs/2602.00296) (Feb 2025) | Adaptive > any fixed paradigm | `PolyGRouter` — 4-class query taxonomy → optimal retrieval strategy per query |
146
- | 6 | **Incremental Graph Updates** | [TG-RAG](https://arxiv.org/abs/2510.13590) (Oct 2024) | O(new) vs O(all) recomputation | `IncrementalGraphUpdater` — embedding-similarity entity merging, scoped community re-detection |
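For intuition, Novelty #1's Personalized PageRank is power iteration with teleportation restricted to the seed set (damping 0.85, as in the table). A toy sketch, not the `PPRConfidenceScorer` implementation; it assumes every node has at least one out-edge:

```python
def personalized_pagerank(graph, seeds, damping=0.85, max_iter=100, tol=1e-6):
    """Power iteration where the restart vector is concentrated on the seed entities."""
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(max_iter):
        new = {}
        for n in nodes:
            # Mass flowing into n from each node m that links to it.
            incoming = sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
            new[n] = (1 - damping) * restart[n] + damping * incoming
        if max(abs(new[n] - rank[n]) for n in nodes) < tol:
            rank = new
            break
        rank = new
    return rank

# Two mutually linked entities, seeded from "A": "A" keeps the higher score.
rank = personalized_pagerank({"A": ["B"], "B": ["A"]}, seeds={"A"})
```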
147
 
148
- ### Architecture Innovations
 
 
 
 
 
 
 
149
 
150
- | # | Technique | Inspiration | Description |
151
- |---|-----------|-------------|-------------|
152
- | 7 | **Schema-Bounded Entity Extraction** | [Youtu-GraphRAG](https://arxiv.org/abs/2508.19855) (Tencent, 2025) | 9 entity types (PERSON, ORG, LOCATION, EVENT, DATE, CONCEPT, WORK, PRODUCT, TECHNOLOGY) + 10 relation types → ~90% extraction cost reduction, +16% accuracy vs unconstrained extraction |
153
- | 8 | **Dual-Level Keyword Retrieval** | [LightRAG](https://arxiv.org/abs/2410.05779) (Oct 2024, 34K⭐) | High-level (themes/topics) + low-level (entities/names) keywords for dual-channel retrieval |
154
- | 9 | **Adaptive Query Complexity Router** | Original | LLM scores query complexity 0.0–1.0 → routes simple queries to baseline (saves cost), complex to GraphRAG (better accuracy) |
155
- | 10 | **Graph Reasoning Path Explanation** | Original | Natural language step-by-step traversal explanation: Entry → Traversal → Evidence → Conclusion |
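Novelty #9's routing rule is a threshold test over a complexity score. A sketch with a crude keyword heuristic standing in for the LLM scorer (both functions are illustrative, not the project's router):

```python
def route_query(complexity_score, threshold=0.6):
    """Route simple queries to cheap baseline RAG, complex ones to GraphRAG."""
    return "graphrag" if complexity_score >= threshold else "baseline_rag"

def heuristic_complexity(query):
    """Stand-in for the LLM scorer: count multi-hop cue words, map to 0.0-1.0."""
    multi_hop_cues = ("both", "compare", "same", "between", "and")
    hits = sum(cue in query.lower() for cue in multi_hop_cues)
    return min(1.0, 0.3 + 0.2 * hits)

q = "Were Einstein and Newton of the same nationality?"
# "and" + "same" → score 0.7 → routed to GraphRAG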
156
 
157
- ### System Innovations
158
-
159
- | # | Technique | Description |
160
- |---|-----------|-------------|
161
- | 11 | **12-Provider Universal LLM** | Single interface for OpenAI, Claude, Gemini, Mistral, Ollama, Groq, DeepSeek, xAI, Together, Cohere, HuggingFace, OpenRouter — with cost tracking and fallback chains |
162
- | 12 | **OpenClaw Agent Skills** | GraphRAG as autonomous agent capabilities following the CIK model (SOUL + IDENTITY + MEMORY + Skills) |
163
- | 13 | **Live Dashboard Benchmarking** | Interactive comparison: one query → all 3 pipelines run → side-by-side responses + metrics. "Run Benchmark Now" button evaluates on HotpotQA in real-time |
164
- | 14 | **Advanced GSQL Queries** | PPR, shortest paths, spreading activation, neighborhood extraction — all as installable TigerGraph queries via `gsql_advanced.py` |
165
-
166
- ---
167
-
168
- ## 📐 Graph Schema & GSQL Queries
169
-
170
- ### TigerGraph Schema
171
-
172
- ```
173
- ┌──────────┐ PART_OF ┌──────────┐ MENTIONS ┌──────────┐
174
- │ Document │ ←─────────── │ Chunk │ ─────────────→ │ Entity │
175
- │ │ (position) │ │ (count, conf) │ │
176
- │ doc_id │ │ chunk_id │ │ entity_id│
177
- │ title │ │ text │ │ name │
178
- │ content │ │ embedding│ RELATED_TO │ type │
179
- │ source │ │ tokens │ ←───────────→ │ desc │
180
- └──────────┘ └──────────┘ (type, weight) │ embedding│
181
- └────┬─────┘
182
- │ IN_COMMUNITY
183
- ┌────▼─────┐
184
- │ Community│
185
- │ comm_id │
186
- │ summary │
187
- │ level │
188
- └──────────┘
189
- ```
190
-
191
- ### Installed GSQL Queries
192
-
193
- | Query | Parameters | Purpose |
194
- |---|---|---|
195
- | `vectorSearchChunks` | `queryVec LIST<DOUBLE>, topK INT` | Cosine similarity chunk retrieval |
196
- | `vectorSearchEntities` | `queryVec LIST<DOUBLE>, topK INT` | Entity vector search for seed discovery |
197
- | `graphRAGTraverse` | `seedEntityIds SET<STRING>, hops INT` | Multi-hop neighborhood expansion |
198
- | `pprFromSeeds` | `seedEntityIds, damping FLOAT, maxIter INT` | Personalized PageRank (Novelty #1) |
199
- | `findReasoningPaths` | `sourceId, targetId STRING, maxDepth INT` | Shortest path between entities (Novelty #3) |
200
- | `spreadingActivation` | `seedEntityIds, decayFactor, maxSteps, threshold` | Activation propagation (Novelty #2) |
201
- | `getEntityNeighborhood` | `entityIds SET<STRING>, hops INT` | Subgraph extraction for context building |
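These installed queries are callable from Python via pyTigerGraph's `runInstalledQuery`. A sketch; the `ppr_params` helper is hypothetical, and the connection lines are shown commented out because they need a live instance:

```python
def ppr_params(seed_entity_ids, damping=0.85, max_iter=20):
    """Pack parameters for the `pprFromSeeds` installed query (names from the table above)."""
    return {"seedEntityIds": sorted(seed_entity_ids), "damping": damping, "maxIter": max_iter}

# With a live connection (requires pyTigerGraph and a running TigerGraph instance):
# from pyTigerGraph import TigerGraphConnection
# conn = TigerGraphConnection(host="https://YOUR_SUBDOMAIN.tgcloud.io",
#                             graphname="GraphRAG", username="tigergraph", password="...")
# scores = conn.runInstalledQuery("pprFromSeeds", params=ppr_params({"Einstein", "Newton"}))
```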
202
 
203
  ---
204
 
205
  ## 📊 Evaluation Framework
206
 
207
- This system implements the full evaluation stack required by the hackathon, grounded in established evaluation literature.
208
-
209
- ### Metric 1: LLM-as-a-Judge (PASS/FAIL)
210
 
211
- **Target: ≥ 90% pass rate** (Hackathon bonus threshold)
212
-
213
- Based on the methodology from [Zheng et al., NeurIPS 2023](https://arxiv.org/abs/2306.05685), using **single-answer reference-guided grading** — the most reliable LLM judge configuration (versus pairwise, which has position bias).
214
-
215
- **Best practices implemented:**
216
- - ✅ Reference answer always provided (maximizes human correlation per [Prometheus 2](https://arxiv.org/abs/2405.01535))
217
- - ✅ Chain-of-thought before verdict (Explain-then-Rate improves alignment per [survey](https://arxiv.org/abs/2412.05579))
218
- - ✅ Structured JSON output: `{"feedback": "...", "verdict": "PASS"|"FAIL"}`
219
- - ✅ Temperature = 0 for deterministic grading
220
- - ✅ Anti-self-enhancement: judge model ≠ generation model (GPT-4 self-favors 10%, Claude 25% — [Zheng et al.](https://arxiv.org/abs/2306.05685))
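A minimal sketch of the structured-output judging contract; the prompt wording and `parse_verdict` helper are illustrative, not the project's evaluation code:

```python
import json

JUDGE_PROMPT = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
First explain your reasoning, then output JSON only:
{{"feedback": "<one sentence>", "verdict": "PASS" or "FAIL"}}"""

def parse_verdict(raw):
    """Extract the structured verdict, tolerating chain-of-thought text before the JSON."""
    obj = json.loads(raw[raw.rfind("{"):])
    assert obj["verdict"] in ("PASS", "FAIL")
    return obj

# A plausible judge reply: reasoning first, JSON verdict last.
reply = ('The candidate matches the reference on both facts. '
         '{"feedback": "Correct and grounded.", "verdict": "PASS"}')
```

The judge call itself would run at temperature 0 against a model different from the generator, per the anti-self-enhancement rule above.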
221
-
222
- **Recommended judge models (free):**
223
- | Model | HF ID | Why |
224
  |---|---|---|
225
- | Prometheus 2 (7B) | `prometheus-eval/prometheus-2-7b-v2.0` | Best open-source judge, Apache 2.0, GPT-4-comparable correlation |
226
- | Llama 3.1 8B Instruct | `meta-llama/Llama-3.1-8B-Instruct` | Strong CoT, good at structured output |
227
-
228
- ### Metric 2: BERTScore (Semantic Similarity)
229
-
230
- **Targets: F1 rescaled ≥ 0.55 OR F1 raw ≥ 0.88** (treated as equivalent thresholds)
231
-
232
- Based on [Zhang et al., ICLR 2020](https://arxiv.org/abs/1904.09675). BERTScore computes token-level semantic similarity using contextual embeddings with greedy cosine matching:
233
-
234
- ```
235
- P_BERT = (1/|x̂|) × Σ_{x̂ⱼ∈x̂} max_{xᵢ∈x} cosine(xᵢ, x̂ⱼ) ← candidate precision (faithfulness)
236
- R_BERT = (1/|x|) × Σ_{xᵢ∈x} max_{x̂ⱼ∈x̂} cosine(xᵢ, x̂ⱼ) ← reference coverage
237
- F_BERT = 2 × P_BERT × R_BERT / (P_BERT + R_BERT) ← primary metric
238
- ```
239
-
240
- **Why the thresholds are equivalent:** Raw scores with `roberta-large` cluster in 0.84–0.96 (inflated by learned geometry). Rescaling maps against a random-baseline lower bound (`b ≈ 0.84`), so raw ≥ 0.88 ≈ rescaled ≥ 0.55 for English. This represents "semantically similar" text — not identical, but capturing the same meaning.
241
-
242
- | Raw F1 | Rescaled F1 | Interpretation |
243
- |---|---|---|
244
- | < 0.84 | ~0 | Poor — nearly unrelated |
245
- | 0.84–0.87 | 0.0–0.45 | Weak — partial overlap |
246
- | **≥ 0.88** | **≥ 0.55** | **✅ Hackathon PASS — semantically similar** |
247
- | 0.90–0.92 | 0.65–0.75 | Good — high semantic match |
248
- | ≥ 0.95 | ≥ 0.88 | Near-paraphrase quality |
249
-
250
- **Usage:**
251
- ```python
252
- from evaluate import load
253
- bertscore = load("bertscore")
254
- results = bertscore.compute(
255
- predictions=candidates, references=references,
256
- model_type="roberta-large", rescale_with_baseline=True, lang="en"
257
- )
258
- # results["f1"][i] >= 0.55 → PASS for sample i
259
- ```
260
-
261
- ### Metric 3: RAGAS (Component Diagnostics)
262
-
263
- [RAGAS](https://arxiv.org/abs/2309.15217) provides **reference-free, LLM-powered** evaluation of individual RAG components:
264
-
265
- | RAGAS Metric | What It Catches | Formula |
266
- |---|---|---|
267
- | **Faithfulness** | Hallucinations — statements not grounded in context | `verified_statements / total_statements` |
268
- | **Answer Relevancy** | Off-topic or incomplete answers | `avg cosine_sim(query, generated_questions_from_answer)` |
269
- | **Context Precision** | Retrieval noise — irrelevant chunks returned | Precision of relevant retrieved contexts |
270
- | **Context Recall** | Missing knowledge — relevant info not retrieved | Coverage of reference by retrieved contexts |
271
-
272
- ### Metric 4: Custom Metrics (No LLM Dependency)
273
-
274
- | Metric | Description | Standard |
275
- |---|---|---|
276
- | **F1 Score** | Token-level F1 vs gold answer | SQuAD/HotpotQA |
277
- | **Exact Match** | Normalized string match | SQuAD/HotpotQA |
278
- | **Context Hit Rate** | Fraction of supporting facts found in retrieved contexts | Custom |
279
- | **Token Efficiency** | `graphrag_tokens / baseline_tokens` ratio | Custom |
280
- | **Cost per Query** | `tokens × provider_pricing` | Custom |
281
- | **Response Latency** | End-to-end ms from question to answer | Custom |
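The F1 and Exact Match rows follow the standard SQuAD definitions, which fit in a few lines (an illustrative re-implementation, not the project's evaluator):

```python
import re
from collections import Counter

def normalize(s):
    """SQuAD-style normalization: lowercase, strip punctuation, articles, extra whitespace."""
    s = re.sub(r"[^\w\s]", " ", s.lower())
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def f1_em(prediction, gold):
    """Token-level F1 and normalized Exact Match against a gold answer."""
    p, g = normalize(prediction).split(), normalize(gold).split()
    em = float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0, em
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall), em

f1, em = f1_em("Einstein was German.", "einstein was german")
```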
282
-
283
- ### Evaluation Code Path
284
 
285
  ```python
286
- from graphrag.layers.evaluation_layer import EvaluationLayer, EvalSample
287
-
288
- evaluator = EvaluationLayer(eval_llm_model="gpt-4o-mini")
289
- evaluator.initialize() # loads RAGAS if available
290
 
291
- sample = EvalSample(
292
- query="Were Einstein and Newton of the same nationality?",
293
- reference_answer="No, Einstein was German and Newton was English.",
294
- baseline_answer="They were both scientists.",
295
- graphrag_answer="No. Einstein was born in Germany while Newton was born in England.",
296
- supporting_facts=["Einstein was born in Ulm, Germany", "Newton was born in Woolsthorpe, England"]
297
- )
298
 
299
- result = evaluator.evaluate_sample(sample, baseline_tokens=800, graphrag_tokens=1847)
300
- report = evaluator.generate_report()
 
301
  ```
302
 
303
  ---
304
 
305
- ## 🤖 12 LLM Providers
306
-
307
- All providers unified through a single `UniversalLLM` interface with automatic detection, cost tracking, and fallback chains.
308
-
309
- | Provider | Model | Cost (per 1K tokens) | Speed | Free Tier |
310
- |----------|-------|------|-------|-----------|
311
- | **Ollama** 🦙 | llama3.2 | **$0.00** | ⚡ Local | ✅ Unlimited |
312
- | **HuggingFace** | Llama 3.3 70B | **$0.00** | 🔵 Medium | ✅ Rate-limited |
313
- | **DeepSeek** | DeepSeek V3 | $0.00014 | ⚡ Fast | ✅ Generous |
314
- | **Gemini** | 2.0 Flash | $0.0001 | ⚡ Fast | ✅ Generous |
315
- | **OpenAI** | GPT-4o-mini | $0.00015 | ⚡ Fast | 🟡 Trial credits |
316
- | **Groq** | Llama 3.3 70B | $0.0006 | ⚡⚡ Blazing | ✅ Free tier |
317
- | **Together** | Llama 3.1 70B | $0.0009 | ⚡ Fast | 🟡 Trial credits |
318
- | **Mistral** | Large | $0.002 | 🔵 Medium | 🟡 Trial credits |
319
- | **Cohere** | Command R+ | $0.0025 | 🔵 Medium | ✅ Trial |
320
- | **Anthropic** | Claude Sonnet 4 | $0.003 | 🔵 Medium | 🟡 Trial credits |
321
- | **xAI** | Grok 3 | $0.003 | 🔵 Medium | 🟡 Trial credits |
322
- | **OpenRouter** | 200+ models | Varies | Varies | 🟡 Trial credits |
323
-
324
- **Zero-cost hackathon setup:** Ollama (local, unlimited) + Gemini free tier + HuggingFace Inference API = full 3-pipeline benchmarking at $0.
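Cost per query from the table is a straightforward multiplication; a sketch using a subset of the per-1K rates above (real providers price input and output tokens separately, so treat these as approximations):

```python
# Per-1K-token rates copied from the table above (blended approximation).
PRICE_PER_1K = {"ollama": 0.0, "gemini": 0.0001, "deepseek": 0.00014,
                "openai": 0.00015, "groq": 0.0006, "anthropic": 0.003}

def cost_per_query(provider, tokens):
    """Approximate dollar cost of one query on a given provider."""
    return tokens / 1000 * PRICE_PER_1K[provider]

# The 1,847-token GraphRAG walkthrough query:
# ollama → $0.00, gemini → ~$0.00018, anthropic → ~$0.0055
```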
325
-
326
- ---
327
-
328
- ## 📈 Expected Benchmarks
329
-
330
- ### Pipeline Comparison (HotpotQA)
331
-
332
- | Metric | Pipeline 1 (LLM-Only) | Pipeline 2 (Basic RAG) | Pipeline 3 (GraphRAG) | GraphRAG vs Basic RAG |
333
- |--------|----------------------|----------------------|---------------------|----------------------|
334
- | **F1 Score** | ~0.30–0.40 | ~0.45–0.60 | ~0.55–0.70 | **+13–21%** ✅ |
335
- | **Exact Match** | ~0.15–0.25 | ~0.30–0.45 | ~0.35–0.50 | **+11%** ✅ |
336
- | **Tokens/Query** | ~2,000–12,000+ | ~800–1,000 | ~1,200–2,000* | bounded by budget |
337
- | **Win Rate** | — | — | ~55–70% | ✅ GraphRAG |
338
-
339
- *\*With Token Budget Controller (Novelty #4), GraphRAG context is capped at 2,000 tokens.*
340
-
341
- ### By Question Type (Literature-Backed Predictions)
342
-
343
- | Question Type | Basic RAG F1 | GraphRAG F1 | Δ | Evidence |
344
- |---|---|---|---|---|
345
- | **Bridge** (multi-hop) | ~0.52 | ~0.63 | **+21%** | Graph traversal connects cross-document facts |
346
- | **Comparison** | ~0.58 | ~0.61 | +5% | Entity-pair paths provide structured comparison context |
347
- | **Temporal** | ~0.31 | ~0.51 | **+64%** | [Han et al., 2025](https://arxiv.org/abs/2502.11371) Table 32 |
348
- | **Summarization** | ~0.45 | ~0.51 | **+23%** | [GraphRAG-Bench](https://arxiv.org/abs/2506.05690) on novel corpus |
349
- | **Simple Factoid** | ~0.65 | ~0.63 | −3% | Vector RAG is faster/cheaper for single-hop (router handles this) |
350
-
351
- ### Token Efficiency Claims (With Citations)
352
-
353
- | Claim | Number | Source | Context |
354
- |---|---|---|---|
355
- | GraphRAG comprehensiveness win rate | 72–83% | [Edge et al., 2024](https://arxiv.org/abs/2404.16130), Appendix G, p < .001 | vs vector RAG across Podcast (1M tokens) and News (1.7M tokens) corpora |
356
- | Community summaries vs full-text | 26–97% fewer tokens | [Edge et al., 2024](https://arxiv.org/abs/2404.16130), Table 2 | C0 = 97% fewer, C3 = 26–33% fewer |
357
- | Token Budget Controller reduction | 97% at 80%+ accuracy | [TERAG, 2024](https://arxiv.org/abs/2509.18667) | 3–11% of LightRAG's token cost |
358
- | Spreading Activation correctness | +39% | [SA-RAG, 2024](https://arxiv.org/abs/2512.15922) | On MuSiQue multi-hop benchmark |
359
- | Path retrieval win rate | 62–65% | [PathRAG, 2025](https://arxiv.org/abs/2502.14902) | vs LightRAG comprehensiveness |
360
- | Complex reasoning accuracy | +9.58% | [GraphRAG-Bench, 2025](https://arxiv.org/abs/2506.05690), Table 2 | Novel dataset, ACC: 50.93 vs 41.35 |
361
- | ROUGE-L on complex reasoning | +59% | [GraphRAG-Bench, 2025](https://arxiv.org/abs/2506.05690), Table 2 | 24.09 vs 15.12 |
362
-
363
- ---
364
-
365
  ## 🚀 Quick Start
366
 
367
- ### Prerequisites
368
- - Python ≥ 3.10
369
- - TigerGraph Savanna account ([tgcloud.io](https://tgcloud.io)) or Community Edition ([dl.tigergraph.com](https://dl.tigergraph.com))
370
- - At least one LLM API key (or Ollama for free local inference)
371
-
372
- ### Option A: Next.js Dashboard (Recommended)
373
-
374
  ```bash
375
  git clone https://huggingface.co/muthuk1/graphrag-inference-hackathon
376
- cd graphrag-inference-hackathon
377
-
378
- # Configure environment
379
- cp .env.example .env
380
- # Edit .env: set TG_HOST, TG_PASSWORD, and at least one LLM provider key
381
-
382
- # Setup TigerGraph (one-time: creates schema + installs GSQL queries)
383
- pip install -r requirements.txt
384
- python graphrag/setup_tigergraph.py
385
-
386
- # Launch Next.js dashboard
387
- cd web && npm install && npm run dev # → http://localhost:3000
388
- ```
389
-
390
- ### Option B: Docker (One Command)
391
-
392
- ```bash
393
- docker build -t graphrag .
394
- docker run -p 3000:3000 -p 7860:7860 --env-file .env graphrag
395
- # → Next.js at :3000, Gradio at :7860
396
- ```
397
-
398
- ### Option C: Python CLI
399
-
400
- ```bash
401
  pip install -r requirements.txt
402
 
403
- # Ingest HotpotQA documents into graph
404
- python -m graphrag.main ingest --samples 100
405
-
406
- # Run benchmark (HotpotQA evaluation with F1/EM)
407
- python -m graphrag.main benchmark --samples 50 --top-k 5 --hops 2 --output results.json
408
-
409
- # Launch Gradio dashboard
410
- python -m graphrag.main dashboard --port 7860 --share
411
-
412
- # Quick demo comparison
413
- python -m graphrag.main demo
414
- ```
415
-
416
- ### Option D: Ollama (100% Free, No API Keys)
417
 
418
- ```bash
419
- # Install Ollama: https://ollama.ai
420
- ollama pull llama3.2
421
 
422
- # Set in .env:
423
- # LLM_PROVIDER=ollama
424
- # OLLAMA_BASE_URL=http://localhost:11434
425
 
 
426
  cd web && npm install && npm run dev
427
- ```
428
-
429
- ### Key Configuration Parameters
430
-
431
- | Parameter | Default | Description | Tuning Guidance |
432
- |---|---|---|---|
433
- | `top_k` | 5 | Chunks/entities from vector search | Higher = more context, more tokens |
434
- | `hops` | 2 | Graph traversal depth | 2–3 optimal; >3 introduces noise |
435
- | `chunk_size` | 1000 | Characters per chunk during ingestion | 600–1000 for most domains |
436
- | `chunk_overlap` | 100 | Overlap between chunks | 10–20% of chunk_size |
437
- | `token_budget` | 2000 | Max tokens in final context | Lower = cheaper, test accuracy impact |
438
- | `damping` | 0.85 | PPR teleportation probability | Standard; lower = more exploration |
439
- | `decay_factor` | 0.7 | Spreading activation propagation | 0.5–0.8; lower = more focused |
440
- | `complexity_threshold` | 0.6 | Router: above = GraphRAG, below = baseline | Tune based on your query distribution |
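These parameters could be bundled as a single config object; a hypothetical sketch whose field names mirror the table, not necessarily `configs/settings.py`:

```python
from dataclasses import dataclass

@dataclass
class GraphRAGConfig:
    """Defaults mirror the parameter table above."""
    top_k: int = 5
    hops: int = 2
    chunk_size: int = 1000
    chunk_overlap: int = 100
    token_budget: int = 2000
    damping: float = 0.85
    decay_factor: float = 0.7
    complexity_threshold: float = 0.6

# Cheaper runs: lower the budget, then re-check accuracy impact.
cfg = GraphRAGConfig(token_budget=1500)
```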
441
-
442
- ---
443
 
444
- ## 🚢 Deployment
 
445
 
446
- ### Docker
447
-
448
- ```bash
449
- # Multi-stage build: Node 20 frontend + Python 3 venv backend
450
- docker build -t graphrag .
451
- docker run -p 3000:3000 -p 7860:7860 \
452
- -e TG_HOST=https://YOUR_SUBDOMAIN.tgcloud.io \
453
- -e TG_PASSWORD=your_password \
454
- -e ANTHROPIC_API_KEY=sk-ant-... \
455
- graphrag
456
- ```
457
-
458
- ### Environment Variables
459
-
460
- ```bash
461
- # TigerGraph (required)
462
- TG_HOST=https://YOUR_SUBDOMAIN.tgcloud.io
463
- TG_GRAPH=GraphRAG
464
- TG_USERNAME=tigergraph
465
- TG_PASSWORD= # required
466
-
467
- # LLM (set any — auto-detected)
468
- OPENAI_API_KEY=sk-... # GPT-4o, GPT-4o-mini
469
- ANTHROPIC_API_KEY=sk-ant-... # Claude Sonnet 4
470
- GEMINI_API_KEY=AIza... # Gemini 2.0 Flash
471
- OLLAMA_BASE_URL=http://localhost:11434 # Free local
472
-
473
- # Defaults
474
- LLM_PROVIDER=anthropic
475
- LLM_MODEL=claude-sonnet-4-20250514
476
- DASHBOARD_PORT=7860
477
- ```
478
-
479
- ### TigerGraph MCP Integration
480
-
481
- Connect TigerGraph directly to AI coding tools (Cursor, VS Code Copilot) — build with natural language instead of GSQL:
482
-
483
- ```json
484
- {
485
- "mcpServers": {
486
- "tigergraph": {
487
- "command": "uvx",
488
- "args": ["pyTigerGraph-mcp"],
489
- "env": {
490
- "TG_HOST": "https://yoursubdomain.tgcloud.io",
491
- "TG_GRAPH": "GraphRAG",
492
- "TG_USERNAME": "tigergraph",
493
- "TG_PASSWORD": "your_password"
494
- }
495
- }
496
- }
497
- }
498
- ```
499
-
500
- ---
501
-
502
- ## 🦞 OpenClaw Agent Integration
503
-
504
- GraphRAG capabilities exposed as autonomous agent skills following the CIK (Cognition-Identity-Knowledge) model:
505
-
506
- | Component | File | Purpose |
507
- |-----------|------|---------|
508
- | `SOUL.md` | `openclaw/SOUL.md` | Agent identity, values, operational boundaries |
509
- | `IDENTITY.md` | `openclaw/IDENTITY.md` | Provider config, graph schema awareness, channels |
510
- | `MEMORY.md` | `openclaw/MEMORY.md` | Learned performance knowledge across runs |
511
- | `graph_query` | `openclaw/skills/graph_query/` | Natural language → knowledge graph traversal |
512
- | `compare_pipelines` | `openclaw/skills/compare_pipelines/` | Three-pipeline comparison with metrics |
513
- | `cost_estimate` | `openclaw/skills/cost_estimate/` | 12-provider cost projection and optimization |
514
-
515
- ---
516
-
517
- ## 🧪 Testing
518
-
519
- ```bash
520
- python tests/test_core.py # 31 tests — core pipeline functions
521
- python tests/test_novelties.py # 24 tests — all 6 novelty techniques
522
-
523
- # Total: 55 tests covering:
524
- # - PPR convergence, damping, seed weighting
525
- # - Spreading activation decay, threshold, multi-hop
526
- # - PolyG query classification (entity/relation/multi-hop/summarization)
527
- # - Path finding, pruning, serialization
528
- # - Token budget controller, utilization tracking
529
- # - F1/EM computation, context hit rate
530
- # - Incremental graph update planning
531
  ```
532
 
533
  ---
@@ -535,132 +165,48 @@ python tests/test_novelties.py # 24 tests — all 6 novelty techniques
535
  ## 📁 Project Structure
536
 
537
  ```
538
- ├── graphrag/ # Python backend (Layer 1–4)
539
- │ ├── layers/
540
- │ │ ├── graph_layer.py # Layer 1: TigerGraph connection + GSQL
541
- │ │ ├── gsql_advanced.py # Layer 1: PPR, paths, activation queries
542
- │ │ ├── orchestration_layer.py # Layer 2: 3-pipeline routing + comparison
543
- │ │ ├── novelties.py # Layer 2: 🌟 6 novel techniques engine
544
- │ │ ├── llm_layer.py # Layer 3: LLM interactions + prompts
545
- │ │ ├── universal_llm.py # Layer 3: 12-provider unified client
546
- │ │ └── evaluation_layer.py # Layer 4: RAGAS + F1/EM + BERTScore
547
- │ ├── configs/settings.py # Configuration management
548
- │ ├── benchmark.py # HotpotQA benchmark runner
549
- │ ├── dashboard.py # Gradio dashboard (port 7860)
550
- │ ├── ingestion.py # Document → Graph ingestion pipeline
551
- │ ├── setup_tigergraph.py # One-time schema + query installation
552
- │ └── main.py # CLI entry point
553
-
554
- ├── web/ # Next.js 15 Dashboard (port 3000)
555
- │ ├── src/app/api/
556
- │ │ ├── compare/route.ts # Multi-provider 3-pipeline comparison API
557
- │ │ ├── benchmark/route.ts # Live benchmark with F1/EM/tokens
558
- │ │ └── providers/route.ts # Provider health checking
559
- │ ├── src/components/
560
- │ │ ├── tabs/LiveCompare.tsx # Side-by-side pipeline comparison
561
- │ │ ├── tabs/Benchmark.tsx # "Run Benchmark Now" + charts
562
- │ │ ├── tabs/CostAnalysis.tsx # 12-provider cost projections
563
- │ │ └── tabs/GraphExplorer.tsx # Interactive graph visualization
564
- │ └── src/lib/
565
- │ ├── llm-providers.ts # 12-provider universal client (TS)
566
- │ └── design-tokens.ts # TigerGraph design system tokens
567
-
568
- ├── openclaw/ # OpenClaw Agent (CIK model)
569
- │ ├── SOUL.md / IDENTITY.md / MEMORY.md
570
- │ └── skills/ # graph_query, compare_pipelines, cost_estimate
571
-
572
- ├── tests/
573
- │ ├── test_core.py # 31 core tests
574
- │ └── test_novelties.py # 24 novelty technique tests
575
-
576
- ├── Dockerfile # Multi-stage: Node 20 + Python 3
577
- ├── requirements.txt # Python dependencies
578
- ├── .env.example # Full configuration template
579
- └── README.md # This file
580
  ```
581
 
582
  ---
583
 
584
- ## 📚 References & Citation Graph
-
- ### Directly Implemented (6 papers → novel techniques)
-
- | # | Paper | ArXiv | Key Contribution | Our Implementation |
- |---|-------|-------|------------------|--------------------|
- | 1 | **CatRAG** — PPR + Dynamic Edge Weighting | [2602.01965](https://arxiv.org/abs/2602.01965) (Feb 2025) | Personalized PageRank for reasoning completeness | `PPRConfidenceScorer` |
- | 2 | **SA-RAG** — Spreading Activation Retrieval | [2512.15922](https://arxiv.org/abs/2512.15922) (Dec 2024) | +39% correctness via activation propagation | `SpreadingActivation` |
- | 3 | **PathRAG** — Flow-Pruned Path Retrieval | [2502.14902](https://arxiv.org/abs/2502.14902) (Feb 2025) | 62–65% win rate via path serialization | `PathPruner` |
- | 4 | **TERAG** — Token-Efficient Graph RAG | [2509.18667](https://arxiv.org/abs/2509.18667) (Sep 2024) | 97% token reduction at 80%+ accuracy | `TokenBudgetController` |
- | 5 | **RAGRouter-Bench** — Hybrid Routing | [2602.00296](https://arxiv.org/abs/2602.00296) (Feb 2025) | Adaptive routing > fixed paradigm | `PolyGRouter` |
- | 6 | **TG-RAG** — Incremental Temporal Graph | [2510.13590](https://arxiv.org/abs/2510.13590) (Oct 2024) | O(new) incremental updates | `IncrementalGraphUpdater` |
-
- ### Architecture Inspiration (4 papers)
-
- | # | Paper | ArXiv | Contribution |
- |---|-------|-------|--------------|
- | 7 | **GraphRAG** — Microsoft's Community-Based RAG | [2404.16130](https://arxiv.org/abs/2404.16130) (Apr 2024) | Hierarchical Leiden community detection + map-reduce summarization; 72–83% comprehensiveness win rate |
- | 8 | **LightRAG** — Dual-Level Retrieval | [2410.05779](https://arxiv.org/abs/2410.05779) (Oct 2024, 34K⭐) | High-level + low-level keyword dual-channel retrieval |
- | 9 | **Youtu-GraphRAG** — Schema-Bounded Extraction | [2508.19855](https://arxiv.org/abs/2508.19855) (Tencent, 2025) | Constrained entity types → 90% extraction cost reduction, +16% accuracy |
- | 10 | **HippoRAG 2** — PPR + Passage Integration | [2502.14802](https://arxiv.org/abs/2502.14802) (Feb 2025) | Hippocampus-inspired graph, 87.9–90.9% evidence recall on complex questions |
-
- ### Evaluation Methodology (2 papers)
-
- | # | Paper | ArXiv | Used For |
- |---|-------|-------|----------|
- | 11 | **Judging LLM-as-a-Judge** | [2306.05685](https://arxiv.org/abs/2306.05685) (NeurIPS 2023) | LLM judge methodology, bias mitigation |
- | 12 | **BERTScore** | [1904.09675](https://arxiv.org/abs/1904.09675) (ICLR 2020) | Token-level semantic similarity metric |
-
- ### Benchmarking Evidence
-
- | # | Paper | ArXiv | Key Finding |
- |---|-------|-------|-------------|
- | — | **RAG vs. GraphRAG: Systematic Evaluation** | [2502.11371](https://arxiv.org/abs/2502.11371) (Feb 2025) | Integration improves best single-method by +6.4%; Temporal: GraphRAG 50.6% vs RAG 30.7% |
- | — | **GraphRAG-Bench** | [2506.05690](https://arxiv.org/abs/2506.05690) (Jun 2025) | GraphRAG excels on complex reasoning (+9.58% ACC), RAG better on simple factoid |
- | — | **GraphRAG Survey** | [2501.13958](https://arxiv.org/abs/2501.13958) (Jan 2025) | Comprehensive taxonomy: Index-Graph vs KG-based; TigerGraph architecture comparison |
-
- ### Citation Flow
-
- ```
- Microsoft GraphRAG (2404.16130) ─── cited by ──→ LightRAG (2410.05779)
-         │                                               │
-         ├──────── cited by ──→ CatRAG (2602.01965)      ├──→ TERAG (2509.18667)
-         ├──────── cited by ──→ PathRAG (2502.14902)     ├──→ TG-RAG (2510.13590)
-         ├──────── cited by ──→ SA-RAG (2512.15922)      └──→ RAGRouter-Bench (2602.00296)
-         └──────── cited by ──→ GraphRAG-Bench (2506.05690)
-
- HippoRAG 2 (2502.14802) ───────────────┘
- Youtu-GraphRAG (2508.19855) ── builds on ──→ Microsoft GraphRAG schema-bounded variant
- ```
-
- ### Datasets & Evaluation Frameworks
-
- [**HotpotQA**](https://arxiv.org/abs/1809.09600) — Multi-hop QA benchmark (bridge + comparison questions)
- [**RAGAS**](https://arxiv.org/abs/2309.15217) — RAG evaluation: Faithfulness, Relevancy, Context Precision/Recall
- [**Prometheus 2**](https://arxiv.org/abs/2405.01535) — Open-source LLM judge (Apache 2.0, GPT-4-comparable)

---

- ## 🔗 Important Links
-
- | Resource | Link |
- |---|---|
- | TigerGraph GraphRAG Repo | [github.com/tigergraph/graphrag](https://github.com/tigergraph/graphrag) |
- | TigerGraph MCP | [github.com/tigergraph/tigergraph-mcp](https://github.com/tigergraph/tigergraph-mcp) |
- | TigerGraph Savanna | [tgcloud.io](https://tgcloud.io) |
- | Community Edition | [dl.tigergraph.com](https://dl.tigergraph.com) |
- | TigerGraph Docs | [docs.tigergraph.com](https://docs.tigergraph.com) |
- | Discord Community | [discord.gg/Djy8xxDR](https://discord.gg/Djy8xxDR) |

---

<div align="center">

- ### 🏆 Built for the GraphRAG Inference Hackathon by TigerGraph

- **14 Novel Techniques** · **12 Research Papers** · **12 LLM Providers** · **55 Unit Tests** · **OpenClaw Agent** · **Docker-Ready**

*Build it. Benchmark it. Prove graph beats tokens.*

- **Token reduction with maintained accuracy — that's the whole game.**
-

</div>
 
+ # 🔍 GraphRAG Inference Hackathon — 3-Pipeline Benchmarking System

<div align="center">

+ [![TigerGraph](https://img.shields.io/badge/Built_On-TigerGraph_GraphRAG-FF6B00?style=for-the-badge)](https://github.com/tigergraph/graphrag)
+ [![3 Pipelines](https://img.shields.io/badge/Pipelines-3_(LLM+RAG+GraphRAG)-002B49?style=for-the-badge)](#-3-pipeline-architecture)
+ [![14 Novelties](https://img.shields.io/badge/Novelties-14_Techniques-0072CE?style=for-the-badge)](#-14-novel-techniques)
+ [![12 LLMs](https://img.shields.io/badge/LLMs-12_Providers-5865F2?style=for-the-badge)](#-12-llm-providers)
+ [![12 Papers](https://img.shields.io/badge/Papers-12_Cited-cc785c?style=for-the-badge)](#-references)
[![55 Tests](https://img.shields.io/badge/Tests-55_Passing-5db872?style=for-the-badge)](#-testing)

+ **One query in → three pipelines → side-by-side responses + metrics out.**

+ Proving that graphs make LLM inference faster, cheaper, and smarter, backed by 12 research papers, 6 novel retrieval techniques, and the full hackathon evaluation stack.

+ [3-Pipeline Architecture](#-3-pipeline-architecture) · [TG GraphRAG Integration](#-tigergraph-graphrag-integration) · [Novelties](#-14-novel-techniques) · [Evaluation](#-evaluation-framework) · [Quick Start](#-quick-start)

+ </div>

---

## 🎯 What This Is

+ A **3-pipeline GraphRAG benchmarking system** built on top of the [TigerGraph GraphRAG repo](https://github.com/tigergraph/graphrag), with **14 novel techniques** from 2024–2025 research, **12 LLM providers**, and a **production dashboard** showing all three pipelines side-by-side with LLM-as-a-Judge + BERTScore evaluation.

+ | Pipeline 1: LLM-Only | Pipeline 2: Basic RAG | Pipeline 3: GraphRAG |
|---|---|---|
+ | Query → LLM → Answer | Query → Embed → Top-K Chunks → LLM | Query → **TG GraphRAG Service** → **NoveltyEngine** → LLM |
+ | No retrieval. Worst-case baseline. | Vector embeddings. Industry standard. | Built on [tigergraph/graphrag](https://github.com/tigergraph/graphrag) + 6 novelties. |
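The side-by-side run behind this table can be sketched in a few lines. This is a minimal illustration only: the stub pipelines, names, and return shapes below are hypothetical, and the real wiring lives in `orchestration_layer.py`.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class PipelineResult:
    name: str
    answer: str
    tokens: int
    latency_ms: float

def run_side_by_side(query: str,
                     pipelines: Dict[str, Callable[[str], Tuple[str, int]]]) -> List[PipelineResult]:
    """Run every pipeline on the same query; collect answer, token count, and latency."""
    rows = []
    for name, fn in pipelines.items():
        start = time.perf_counter()
        answer, tokens = fn(query)  # each pipeline returns (answer, tokens_used)
        rows.append(PipelineResult(name, answer, tokens,
                                   (time.perf_counter() - start) * 1000))
    return rows

# Hypothetical stubs standing in for the three real pipelines:
stubs = {
    "llm_only":  lambda q: ("answer-a", 1200),
    "basic_rag": lambda q: ("answer-b", 800),
    "graphrag":  lambda q: ("answer-c", 150),
}
rows = run_side_by_side("What did Einstein discover?", stubs)
```

The dashboard renders exactly this kind of row set as three columns, one per pipeline.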

---

+ ## 🐯 TigerGraph GraphRAG Integration

+ Pipeline 3 is **built on top of the official [TigerGraph GraphRAG repo](https://github.com/tigergraph/graphrag)** (Path B: customize). The integration layer (`tg_graphrag_client.py`) wraps the official service:

+ ```python
+ from graphrag.layers.tg_graphrag_client import TGGraphRAGClient
+
+ client = TGGraphRAGClient(service_url="http://localhost:8000")
+ client.connect()
+
+ # Official retrievers: Hybrid Search, Community, Sibling
+ result = client.retrieve(query="What did Einstein discover?",
+                          retriever="hybrid", top_k=5, num_hops=2)
+ result = client.retrieve(query="Main themes?",
+                          retriever="community", community_level=2)
+ ```

+ **Modes:** REST API (official service) → Direct pyTigerGraph (fallback) → Offline (passage-based).

+ ```bash
+ # Deploy official TG GraphRAG + point our system at it
+ git clone https://github.com/tigergraph/graphrag && cd graphrag && docker-compose up -d
+ export GRAPHRAG_SERVICE_URL=http://localhost:8000
+ python -m graphrag.main benchmark --samples 50
+ ```
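The three retrieval modes degrade in the order listed above. A toy sketch of that fallback chain (function names here are illustrative; the actual logic is in `tg_graphrag_client.py`):

```python
def retrieve_with_fallback(query, rest_fn, graph_fn, offline_fn):
    """Try modes in the documented order: REST service, direct pyTigerGraph, offline passages."""
    for mode, fn in (("rest", rest_fn),
                     ("pytigergraph", graph_fn),
                     ("offline", offline_fn)):
        try:
            return mode, fn(query)
        except Exception:
            continue  # this mode is unavailable; fall through to the next one
    raise RuntimeError("all retrieval modes failed")

def rest_down(query):  # simulate an unreachable REST service
    raise ConnectionError("service unreachable")

mode, passages = retrieve_with_fallback("einstein", rest_down,
                                        lambda q: [f"graph passage for {q}"],
                                        lambda q: [])
# falls back to the direct-graph mode
```

The same ordering is why setting `GRAPHRAG_SERVICE_URL` is optional: without it, the system still answers from the graph or from passages.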
58
 
59
---

+ ## 🏗️ 3-Pipeline Architecture

```
┌──────────────────────────────────────────────────────────────────────────────┐
│ LAYER 4: EVALUATION                                                          │
+ │ LLM-as-a-Judge (PASS/FAIL, ≥90%) │ BERTScore F1 (≥0.55) │ RAGAS │ F1/EM      │
├──────────────────────────────────────────────────────────────────────────────┤
+ │ LAYER 3: UNIVERSAL LLM (12 Providers)                                        │
├──────────────────────────────────────────────────────────────────────────────┤
+ │ LAYER 2: 3-PIPELINE ORCHESTRATION + NOVELTY ENGINE                           │
+ │ Pipeline 1: LLM-Only │ Pipeline 2: Basic RAG │ Pipeline 3: GraphRAG          │
+ │ NoveltyEngine: PolyG Router → PPR → Spreading Activation → Token Budget      │
├──────────────────────────────────────────────────────────────────────────────┤
+ │ LAYER 1: GRAPH                                                               │
+ │ TG GraphRAG Service (official repo) ←→ Direct pyTigerGraph (fallback)        │
+ │ Retrievers: Hybrid, Community, Sibling │ GSQL: PPR, Paths, Activation        │
└──────────────────────────────────────────────────────────────────────────────┘
```

+ ### Pipeline 3 Flow

```
+ Query → keyword extraction → TG GraphRAG Service (hybrid retriever)
+   → NoveltyEngine: PolyG Router → PPR → Spreading Activation → Token Budget
+   → Structured context (entities + relationships + passages) → LLM → Answer
```
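The routing step at the front of the NoveltyEngine chain picks a retrieval strategy from surface features of the query. A toy sketch of that idea (the heuristics below are illustrative stand-ins, not the rules from `PolyGRouter`):

```python
def route_query(query: str) -> str:
    """Toy adaptive router: map query shape to a retrieval strategy."""
    q = query.lower()
    if any(w in q for w in ("compare", "versus", " vs ")):
        return "path"        # comparison questions want entity-to-entity paths
    if any(w in q for w in ("themes", "summary", "overall")):
        return "community"   # global questions want community summaries
    if len(q.split()) <= 6:
        return "hybrid"      # short factoid: hybrid vector + graph lookup
    return "ppr"             # multi-hop default: Personalized PageRank expansion

print(route_query("Main themes?"))                    # community
print(route_query("Einstein versus Bohr on light?"))  # path
```

The point of adaptive routing (per RAGRouter-Bench) is that no single retriever wins everywhere, so strategy selection itself becomes part of the pipeline.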

---

## 🌟 14 Novel Techniques

+ ### Graph Retrieval (6 papers, wired into Pipeline 3 via NoveltyEngine)

+ | # | Technique | Paper | Result | Code |
+ |---|-----------|-------|--------|------|
+ | 1 | PPR Confidence Retrieval | [CatRAG](https://arxiv.org/abs/2602.01965) | Best reasoning on 4 benchmarks | `PPRConfidenceScorer` |
+ | 2 | Spreading Activation | [SA-RAG](https://arxiv.org/abs/2512.15922) | +39% correctness | `SpreadingActivation` |
+ | 3 | Flow-Pruned Paths | [PathRAG](https://arxiv.org/abs/2502.14902) | 62–65% win rate | `PathPruner` |
+ | 4 | Token Budget Controller | [TERAG](https://arxiv.org/abs/2509.18667) | 97% token reduction | `TokenBudgetController` |
+ | 5 | PolyG Hybrid Router | [RAGRouter-Bench](https://arxiv.org/abs/2602.00296) | Adaptive > fixed | `PolyGRouter` |
+ | 6 | Incremental Updates | [TG-RAG](https://arxiv.org/abs/2510.13590) | O(new) cost | `IncrementalGraphUpdater` |

+ ### Architecture + System (#7–14)

+ Schema-bounded extraction, dual-level keywords, adaptive routing, graph reasoning explanation, 12-provider LLM, OpenClaw agent, live 3-pipeline dashboard, advanced GSQL queries.
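The token-budget idea behind technique #4 is the easiest of the six to show in miniature: keep the highest-scored passages until the budget is spent. This is a simplified sketch, assuming a whitespace token estimate; the real `TokenBudgetController` would use a proper tokenizer and the scoring signals from the earlier novelty steps.

```python
def pack_under_budget(passages, budget_tokens):
    """Greedy TERAG-style budget control: keep best-scored passages that still fit."""
    chosen, spent = [], 0
    for text, score in sorted(passages, key=lambda p: p[1], reverse=True):
        cost = len(text.split())  # crude whitespace token estimate; a real tokenizer belongs here
        if spent + cost <= budget_tokens:
            chosen.append(text)
            spent += cost
    return chosen

candidates = [("alpha beta gamma", 0.9),
              ("one two three four five", 0.8),
              ("x y", 0.5)]
print(pack_under_budget(candidates, 5))  # ['alpha beta gamma', 'x y']
```

Note the greedy step can skip a passage that is too large and still admit a cheaper, lower-scored one, which is exactly how a fixed budget keeps Pipeline 3's prompt small.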
---

## 📊 Evaluation Framework

+ All hackathon-required metrics are implemented in `evaluation_layer.py`:

+ | Metric | Target | Implementation |
|---|---|---|
+ | **LLM-as-a-Judge** (PASS/FAIL) | ≥ 90% pass rate | `compute_llm_judge()` (reference-guided, CoT, JSON output) |
+ | **BERTScore F1** | ≥ 0.55 rescaled / ≥ 0.88 raw | `compute_bertscore()` (roberta-large with rescaling) |
+ | **F1 / Exact Match** | — | SQuAD/HotpotQA standard |
+ | **RAGAS** | — | Faithfulness, Relevancy, Context Precision/Recall |
+ | **Token Efficiency** | — | Per-pipeline, per-query tracking |
+ | **Cost per Query** | — | `tokens × provider_pricing` |
+ | **Latency** | — | End-to-end ms |

```python
+ from graphrag.layers.evaluation_layer import compute_llm_judge, compute_bertscore
+
+ # LLM-as-a-Judge
+ result = compute_llm_judge(question, reference, candidate, llm_fn)
+ # → {"verdict": "PASS", "feedback": "..."}
+
+ # BERTScore
+ results = compute_bertscore(predictions, references, rescale=True)
+ # → {"mean_f1": 0.62, "pass_rate": 0.85}
```
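The F1/EM row in the table above follows the standard SQuAD/HotpotQA token-overlap definition. A minimal reference version is shown below; it is a sketch only, since the official scripts also normalize punctuation and articles before comparing.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the capital is paris", "paris"), 2))  # 0.4
```

Partial credit is the point: a verbose but correct answer scores well on F1 even when EM fails, which is why both numbers are tracked per pipeline.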
---

## 🚀 Quick Start

```bash
git clone https://huggingface.co/muthuk1/graphrag-inference-hackathon
+ cd graphrag-inference-hackathon && cp .env.example .env
pip install -r requirements.txt

+ # Setup TigerGraph (schema + core + advanced GSQL queries)
+ python graphrag/setup_tigergraph.py

+ # 3-pipeline benchmark
+ python -m graphrag.main benchmark --samples 50 --output results.json

+ # 3-column Gradio dashboard
+ python -m graphrag.main dashboard

+ # Next.js dashboard
cd web && npm install && npm run dev

+ # Docker
+ docker build -t graphrag . && docker run -p 3000:3000 -p 7860:7860 --env-file .env graphrag

+ # Free (Ollama)
+ ollama pull llama3.2 && python -m graphrag.main demo
```

---

## 📁 Project Structure

```
+ graphrag/layers/
+   tg_graphrag_client.py            # 🆕 Official TG GraphRAG service integration
+   orchestration_layer.py           # 🆕 3-pipeline + NoveltyEngine wiring
+   evaluation_layer.py              # 🆕 LLM-Judge + BERTScore + RAGAS + F1/EM
+   novelties.py                     # 6 novel techniques (PPR, activation, paths, budget, router, incremental)
+   graph_layer.py                   # TigerGraph GSQL + schema
+   gsql_advanced.py                 # Advanced GSQL (PPR, paths, activation)
+   llm_layer.py / universal_llm.py  # 12-provider LLM
+ graphrag/
+   benchmark.py                     # 🆕 3-pipeline HotpotQA benchmark
+   dashboard.py                     # 🆕 3-column Gradio dashboard
+   setup_tigergraph.py              # 🆕 Schema + core + advanced query install
+   ingestion.py / main.py
+ web/src/app/api/compare/           # 🆕 3-pipeline Next.js API
+ openclaw/                          # Agent skills
+ tests/                             # 55 tests
```

---

+ ## 📚 References (12 Papers)

+ **Implemented:** [CatRAG](https://arxiv.org/abs/2602.01965), [SA-RAG](https://arxiv.org/abs/2512.15922), [PathRAG](https://arxiv.org/abs/2502.14902), [TERAG](https://arxiv.org/abs/2509.18667), [RAGRouter-Bench](https://arxiv.org/abs/2602.00296), [TG-RAG](https://arxiv.org/abs/2510.13590)

+ **Architecture:** [Microsoft GraphRAG](https://arxiv.org/abs/2404.16130), [LightRAG](https://arxiv.org/abs/2410.05779), [Youtu-GraphRAG](https://arxiv.org/abs/2508.19855), [HippoRAG 2](https://arxiv.org/abs/2502.14802)

+ **Evaluation:** [LLM-as-a-Judge](https://arxiv.org/abs/2306.05685) (NeurIPS 2023), [BERTScore](https://arxiv.org/abs/1904.09675) (ICLR 2020)

---

+ ## 🔗 Links

+ [TigerGraph GraphRAG](https://github.com/tigergraph/graphrag) · [TigerGraph Savanna](https://tgcloud.io) · [TigerGraph MCP](https://github.com/tigergraph/tigergraph-mcp) · [TigerGraph Docs](https://docs.tigergraph.com)

---

<div align="center">

+ **🏆 Built for the GraphRAG Inference Hackathon by TigerGraph**

+ 3 Pipelines · 14 Novelties · 12 Papers · 12 LLMs · 55 Tests · LLM-Judge + BERTScore · Docker

*Build it. Benchmark it. Prove graph beats tokens.*

</div>