File size: 2,530 Bytes
14e5cbf
f8b04c3
14e5cbf
a7c976e
 
14e5cbf
a84bdca
 
c8b183a
 
f8b04c3
 
 
 
c8b183a
 
f8b04c3
 
 
 
 
 
 
 
a7c976e
 
 
 
 
 
 
 
f8b04c3
 
 
 
a7c976e
 
f8b04c3
 
14e5cbf
 
 
 
f8b04c3
14e5cbf
 
a7c976e
 
 
14e5cbf
 
 
 
f8b04c3
14e5cbf
 
f8b04c3
14e5cbf
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
"""
CivicSetu RAGAS evaluation — CLI entry point.

  Phase 1: Invoke RAG graph for every query → save to eval_phase1_results.json
  Phase 2: Score results with RAGAS (Faithfulness, AnswerRelevancy, ContextPrecision)

Usage:
    uv run python scripts/run_eval.py
    EVAL_LIMIT=5 uv run python scripts/run_eval.py          # quick smoke-test

Graph LLM:
    Uses the normal app routing from settings.py / .env for phase 1 generation.
    PRIMARY_MODEL, FALLBACK_MODEL_1, FALLBACK_MODEL_2, and FALLBACK_MODEL_3
    are read by civicsetu.agent.nodes when the graph is imported.

Judge (RAGAS scorer) model:
    # Default judge is Groq llama-3.3-70b-versatile via GROQ_API_KEY_2
    uv run python scripts/run_eval.py
    JUDGE_PROVIDER=groq JUDGE_MODEL=llama-3.3-70b-versatile uv run python scripts/run_eval.py
    JUDGE_PROVIDER=gemini JUDGE_MODEL=gemini/gemini-2.5-flash-lite uv run python scripts/run_eval.py
    JUDGE_PROVIDER=openrouter JUDGE_MODEL=nvidia/nemotron-3-super-120b-a12b:free uv run python scripts/run_eval.py
    JUDGE_PROVIDER=osmapi JUDGE_MODEL=qwen3.5-397b-a17b uv run python scripts/run_eval.py
    JUDGE_PROVIDER=nvidia JUDGE_MODEL=z-ai/glm4.7 uv run python scripts/run_eval.py
    JUDGE_GEMINI_API_KEY=<key> JUDGE_PROVIDER=gemini JUDGE_MODEL=gemini/gemma-4-31b-it uv run python scripts/run_eval.py

Resume after phase 2 failure (phase 1 cached in eval_phase1_results.json):
    uv run python scripts/run_eval.py
    # Phase 1 prints "all N rows loaded from cache" and is skipped

Force re-run phase 1:
    rm eval_phase1_results.json && uv run python scripts/run_eval.py

Run only one phase:
    EVAL_PHASE=1 uv run python scripts/run_eval.py   # graph invocation only
    EVAL_PHASE=2 uv run python scripts/run_eval.py   # RAGAS scoring only (requires phase 1 cache)

Disable no_reasoning (not recommended for Qwen3 thinking models):
    NO_REASONING=false uv run python scripts/run_eval.py

All logic lives in: src/civicsetu/evaluation/ragas_eval.py
"""
from __future__ import annotations

import io
import sys
from pathlib import Path

from dotenv import load_dotenv
load_dotenv()

if sys.stdout.encoding != "utf-8":
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
    sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding="utf-8")

# Ensure src/ is on path when running the script directly (outside of uv run)
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

from civicsetu.evaluation.ragas_eval import main  # noqa: E402

if __name__ == "__main__":
    main()