riprap-nyc / experiments /03_granite_reranker /RESULTS.md

seriffic

Backend evolution: Phases 1-10 specialists + agentic FSM + Mellea + LiteLLM router

6a82282 20 days ago

Phase 3 — Granite Embedding Reranker R2 (cross-encoder, 149 M)

Status

Working end-to-end on the existing 5-PDF corpus, both backends.

Model

Model: ibm-granite/granite-embedding-reranker-english-r2
Type: cross-encoder reranker (149 M params)
License: Apache-2.0 (verified, HF cardData)
Loader: sentence_transformers.CrossEncoder — sidecar pattern, no vLLM --task score per project decision
Library declared: sentence-transformers

Pipeline

rerank.py — loads the cross-encoder, scores [query, candidate] pairs, returns ranked top-K.
run_double_gate.py — calls the existing app.rag.retrieve_top_k-equivalent (with the per-doc dedup bypassed), gathers top-20, reranks to top-3, and runs both backends' reconciler against the top-1 passage.
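
The rerank step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of rerank.py: the names `make_cross_encoder_scorer` and `rerank` are hypothetical, and the scorer is passed in as a callable (an assumption made here so the ranking logic can be exercised without downloading the model). The only library calls used are `sentence_transformers.CrossEncoder(...)` and its `predict(...)` method.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (query, candidate) pairs to one relevance score per pair.
ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def make_cross_encoder_scorer(
    model_name: str = "ibm-granite/granite-embedding-reranker-english-r2",
) -> ScoreFn:
    """Load the cross-encoder once and return a scoring callable (hypothetical helper)."""
    # Imported lazily so the ranking logic below works without the library installed.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)

    def score(pairs: List[Tuple[str, str]]) -> List[float]:
        # predict() returns one score per (query, candidate) pair.
        return [float(s) for s in model.predict(pairs)]

    return score


def rerank(
    query: str,
    candidates: List[str],
    score_fn: ScoreFn,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every [query, candidate] pair and return the top_k by descending score."""
    pairs = [(query, candidate) for candidate in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return ranked[:top_k]
```

With the real model, usage would look like `rerank(query, top20_texts, make_cross_encoder_scorer(), top_k=3)`; any callable with the same shape can stand in for the scorer in tests.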

Validation

Hand-crafted query + 5 candidate paragraphs

Query: "What are flood risks in Hollis, Queens?"

The reranker correctly ranked the Hollis-Ida paragraph #1 (score 0.93), Sandy/Brighton #2 (0.77), and Rockaways #3 (0.73). The Newtown Creek WWTP and Bluebelt operations paragraphs (off-topic for a Hollis flood-risk query) were correctly demoted out of the top-3.

Real corpus end-to-end

Query: "What flood risk does Hollis, Queens face from heavy rainfall?"

Retriever (Granite Embedding 278 M) top-3:

Rank	Retriever score	Doc	Excerpt
1	0.760	rag_mta	"Urgent Call for Action 7 Climate Resilience Roadmap…"
2	0.749	rag_comptroller	"Forecast & Emergency Plan Activation Flash flooding…"
3	0.749	rag_comptroller	"Is New York City Ready for Rain? An Investigation…"

Reranker top-3 (from retriever's top-20):

Rank	Reranker score	Was retriever rank	Doc	Excerpt
1	0.886	6	rag_comptroller	"Is New York City Ready for Rain?… (preparedness section)"
2	0.869	4	rag_comptroller	"Heavy rains persisted for more than an hour in southern Brooklyn…"
3	0.869	1	rag_mta	"Urgent Call for Action 7 Climate Resilience Roadmap…"

The reranker is doing its job: it surfaced a query-specific preparedness paragraph (originally rank 6 — buried by the retriever) and demoted a generic MTA boilerplate paragraph (originally rank 1) to position 3.

Honesty under uncertainty

Neither selected paragraph specifically mentions Hollis. Both backends correctly refused to invent a Hollis-specific answer and said so plainly with a citation:

Backend	Latency	Output
Ollama (M-series MPS)	10.56 s	"The provided document…does not specifically mention Hollis, Queens…I cannot determine the flood risk for Hollis, Queens from heavy rainfall…[rag_comptroller]"
vLLM (AMD MI300X)	0.68 s	"The provided document does not contain specific information about the flood risk faced by Hollis, Queens from heavy rainfall. [rag_comptroller]"

This is the desired silence-over-confabulation behavior. The reranker + reconciler combination did not surface a false claim despite there being a temptation (the document discusses a 2024 storm in NYC generally).

Latency budget

Stage	Latency	Notes
Retriever (Granite Embedding 278 M) cold load + index	52.7 s	One-time at app boot; amortized in production
Retriever per-query	< 0.1 s	Already in production
Reranker cold load (149 M)	1.8 s	One-time at app boot
Reranker score 20 candidates	0.93 s	M3 Pro CPU, batched
Reconcile (Ollama, M-series)	10.6 s
Reconcile (vLLM, AMD MI300X)	0.7 s	~15× faster

The reranker adds ~1 s to the user-visible path on CPU. That is small next to the 10.6 s Ollama reconcile, though it is larger than the 0.7 s vLLM reconcile; in both cases the total stays well under the brief's demo budget.
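
As a quick check on the budget, the per-query numbers from the table can be summed per backend. This is a back-of-the-envelope sketch under two assumptions: the retriever's "< 0.1 s" is treated as an upper bound of 0.1 s, and the reranker's 0.93 s (measured on M3 Pro CPU) is reused unchanged for the vLLM row. The function names are illustrative only. One-time cold loads are excluded because they happen at app boot.

```python
# Per-query stage latencies in seconds, copied from the latency table.
RETRIEVER_S = 0.1   # table says "< 0.1 s"; treated as an upper bound (assumption)
RERANKER_S = 0.93   # score 20 candidates, M3 Pro CPU, batched
RECONCILE_S = {"ollama": 10.6, "vllm": 0.7}


def per_query_total(backend: str, with_reranker: bool = True) -> float:
    """User-visible latency for one query on the given reconciler backend."""
    total = RETRIEVER_S + RECONCILE_S[backend]
    if with_reranker:
        total += RERANKER_S
    return total


def reranker_share(backend: str) -> float:
    """Fraction of the per-query total spent in the reranker."""
    return RERANKER_S / per_query_total(backend)


for backend in RECONCILE_S:
    print(
        f"{backend}: {per_query_total(backend):.2f} s total, "
        f"reranker share {reranker_share(backend):.0%}"
    )
```

Under these assumptions the totals come to about 11.63 s for Ollama and 1.73 s for vLLM, so the reranker is a small slice of the Ollama path but roughly half of the vLLM path.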

Findings worth remembering

The retriever's per-doc dedup is in the wrong place. Currently app/rag.py:retrieve() keeps "at most 1 chunk per doc" and then returns top-K. For the reranker integration, this should be inverted: gather top-20 with duplicates, rerank, then dedup to top-3. Otherwise we're throwing away high-relevance chunks before the rerank ever sees them.
Cross-encoder cache_dir arg is deprecated in current sentence-transformers; it passes through with a warning. Move to the renamed cache_folder=... argument (the replacement listed in the sentence-transformers v4 migration notes) when integrating to silence it.
Reranker disagrees with the retriever in interesting ways. On the test query the retriever's rank-1 (a generic MTA roadmap intro) was a content-light string that scored high on lexical/embedding surface similarity to "flood risk heavy rainfall". The reranker correctly surfaced more specific content. This is the canonical reason cross-encoder reranking matters.
Sidecar deployment story. No GPU needed for the reranker; ~600 MB resident on CPU; loads in ~2 s after first download. Fits trivially in the HF Spaces T4 image. The vLLM-served alternative was explicitly out-of-scope per the project decision and isn't needed for these latencies.
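
The dedup-ordering finding above can be illustrated with toy data. The sketch below is not the code in app/rag.py: the chunk tuples, scores, and helper names (`dedup_per_doc`, `by_score`) are made up for illustration. It compares "dedup, then rerank" against "rerank, then dedup" on a case where one document contains both a generic chunk the retriever likes and a specific chunk the reranker likes.

```python
from typing import Dict, Iterable, List, Tuple

Chunk = Tuple[str, str]  # (doc_id, text); hypothetical shape


def by_score(chunks: Iterable[Chunk], scores: Dict[Chunk, float]) -> List[Tuple[Chunk, float]]:
    """Sort chunks by descending score."""
    return sorted(((c, scores[c]) for c in chunks), key=lambda cs: cs[1], reverse=True)


def dedup_per_doc(ranked: List[Tuple[Chunk, float]], top_k: int) -> List[Tuple[Chunk, float]]:
    """Keep the first (best-ranked) chunk per doc, in rank order, up to top_k."""
    seen = set()
    kept = []
    for chunk, score in ranked:
        if chunk[0] in seen:
            continue
        seen.add(chunk[0])
        kept.append((chunk, score))
        if len(kept) == top_k:
            break
    return kept


# Toy scores (invented): the retriever prefers docA's generic intro,
# the reranker prefers docA's specific paragraph.
retriever_scores = {
    ("docA", "generic intro"): 0.76,
    ("docB", "other"): 0.74,
    ("docA", "specific paragraph"): 0.70,
}
reranker_scores = {
    ("docA", "generic intro"): 0.40,
    ("docB", "other"): 0.50,
    ("docA", "specific paragraph"): 0.90,
}

# Order 1: dedup on retriever scores first, then rerank the survivors.
survivors = [c for c, _ in dedup_per_doc(by_score(retriever_scores, retriever_scores), top_k=3)]
dedup_first = by_score(survivors, reranker_scores)

# Order 2: rerank everything, then dedup.
rerank_first = dedup_per_doc(by_score(reranker_scores, reranker_scores), top_k=3)

print("dedup then rerank:", dedup_first)
print("rerank then dedup:", rerank_first)
```

In the first ordering the specific paragraph is discarded before the reranker sees it, so the best available result is the 0.50 chunk; in the second ordering the 0.90 chunk survives as rank 1.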

Files

03_granite_reranker/
  rerank.py           CrossEncoder load + predict wrapper
  run_double_gate.py  retriever -> reranker -> reconciler probe
  RESULTS.md          (this file)
  .cache/             reranker weights, double_gate_*.json

Conclusion

Specialist works on both backends with the expected behavior change (reranker reorders top-3 in a query-relevant way; reconciler refuses to fabricate when source content doesn't address the query).

Recommended path forward: integrate as a one-line addition to app/rag.py:retrieve(): take retriever top-K=20 (dropping the existing per-doc dedup), call the reranker, then dedup to top-3. Load the cross-encoder once at app boot in warm(). Gate the new behavior behind a single env var, RIPRAP_RERANKER_ENABLE=1, so the existing production path is unchanged by default.