# Phase 3 — Granite Embedding Reranker R2 (cross-encoder, 149 M)

## Status

**Working end-to-end on the existing 5-PDF corpus, both backends.**

## Model

- **Model:** `ibm-granite/granite-embedding-reranker-english-r2`
- **Type:** cross-encoder reranker (149 M params)
- **License:** Apache-2.0 (verified, HF cardData)
- **Loader:** `sentence_transformers.CrossEncoder` — sidecar pattern,
  no vLLM `--task score` per project decision
- **Library declared:** `sentence-transformers`

## Pipeline

1. **`rerank.py`** — loads the cross-encoder, scores
   `[query, candidate]` pairs, returns ranked top-K.
2. **`run_double_gate.py`** — calls the existing
   `app.rag.retrieve_top_k`-equivalent (with the per-doc dedup
   bypassed), gathers top-20, reranks to top-3, and runs both
   backends' reconciler against the top-1 passage.

## Validation

### Hand-crafted query + 5 candidate paragraphs

Query: *"What are flood risks in Hollis, Queens?"*

The reranker correctly ranked the Hollis-Ida paragraph #1 (score
0.93), Sandy/Brighton #2 (0.77), and Rockaways #3 (0.73). The
Newtown Creek WWTP and Bluebelt operations paragraphs (off-topic for a
Hollis flood-risk query) were correctly demoted out of the top-3.

### Real corpus end-to-end

Query: *"What flood risk does Hollis, Queens face from heavy
rainfall?"*

Retriever (Granite Embedding 278 M) top-3:

| Rank | Retriever score | Doc | Excerpt |
|-----:|----------------:|-----|---------|
| 1 | 0.760 | rag_mta | "Urgent Call for Action 7 Climate Resilience Roadmap…" |
| 2 | 0.749 | rag_comptroller | "Forecast & Emergency Plan Activation Flash flooding…" |
| 3 | 0.749 | rag_comptroller | "Is New York City Ready for Rain? An Investigation…" |

Reranker top-3 (from retriever's top-20):

| Rank | Reranker score | Was retriever rank | Doc | Excerpt |
|-----:|---------------:|-------------------:|-----|---------|
| 1 | 0.886 | 6 | rag_comptroller | "Is New York City Ready for Rain?… (preparedness section)" |
| 2 | 0.869 | 4 | rag_comptroller | "Heavy rains persisted for more than an hour in southern Brooklyn…" |
| 3 | 0.869 | 1 | rag_mta | "Urgent Call for Action 7 Climate Resilience Roadmap…" |

The reranker is **doing its job**: it surfaced a query-specific
preparedness paragraph (originally rank 6 — buried by the retriever)
and demoted a generic MTA boilerplate paragraph (originally rank 1)
to position 3.

### Honesty under uncertainty

Neither selected paragraph specifically mentions Hollis. Both backends
correctly **refused to invent a Hollis-specific answer** and said so
plainly with a citation:

| Backend | Latency | Output |
|---------|--------:|--------|
| Ollama (M-series MPS) | 10.56 s | "The provided document…does not specifically mention Hollis, Queens…I cannot determine the flood risk for Hollis, Queens from heavy rainfall…[rag_comptroller]" |
| vLLM (AMD MI300X)     |  0.68 s | "The provided document does not contain specific information about the flood risk faced by Hollis, Queens from heavy rainfall. [rag_comptroller]" |

This is the desired silence-over-confabulation behavior. The reranker
+ reconciler combination did not surface a false claim despite there
being a temptation (the document discusses a 2024 storm in NYC
generally).

## Latency budget

| Stage | Latency | Notes |
|------:|--------:|-------|
| Retriever (Granite Embedding 278 M) cold load + index | 52.7 s | One-time at app boot; amortized in production |
| Retriever per-query | < 0.1 s | Already in production |
| Reranker cold load (149 M) | 1.8 s | One-time at app boot |
| Reranker score 20 candidates | 0.93 s | M3 Pro CPU, batched |
| Reconcile (Ollama, M-series) | 10.6 s | |
| Reconcile (vLLM, AMD MI300X) | 0.7 s | ~15× faster |

The reranker adds ~1 s to the user-visible path on CPU. Negligible
relative to the existing reconciler latency, well under the brief's
demo budget.

## Findings worth remembering

1. **The retriever's per-doc dedup is in the wrong place.** Currently
   `app/rag.py:retrieve()` keeps "at most 1 chunk per doc" and then
   returns top-K. For the reranker integration, this should be
   inverted: gather top-20 *with duplicates*, rerank, then dedup to
   top-3. Otherwise we're throwing away high-relevance chunks before
   the rerank ever sees them.

2. **Cross-encoder `cache_dir` arg is deprecated** in current
   sentence-transformers; passes through with a warning. Move to
   `model_kwargs={"cache_dir": ...}` when integrating to silence it.

3. **Reranker disagrees with the retriever in interesting ways.** On
   the test query the retriever's rank-1 (a generic MTA roadmap intro)
   was a content-light string that scored high on lexical/embedding
   surface similarity to "flood risk heavy rainfall". The reranker
   correctly surfaced more specific content. This is the canonical
   reason cross-encoder reranking matters.

4. **Sidecar deployment story.** No GPU needed for the reranker; ~600
   MB resident on CPU; loads in ~2 s after first download. Fits
   trivially in the HF Spaces T4 image. The vLLM-served alternative
   was explicitly out-of-scope per the project decision and isn't
   needed for these latencies.

## Files

```
03_granite_reranker/
  rerank.py           CrossEncoder load + predict wrapper
  run_double_gate.py  retriever -> reranker -> reconciler probe
  RESULTS.md          (this file)
  .cache/             reranker weights, double_gate_*.json
```

## Conclusion

Specialist works on both backends with the expected behavior change
(reranker reorders top-3 in a query-relevant way; reconciler refuses to
fabricate when source content doesn't address the query).

**Recommended path forward:** integrate as a one-line addition to
`app/rag.py:retrieve()`: take retriever top-K=20 (drop the existing
per-doc dedup), call the reranker, then dedup to top-3. Load the
cross-encoder once at app boot in `warm()`. Single env var
`RIPRAP_RERANKER_ENABLE=1` to gate the new behavior so the existing
production path is unchanged by default.