Aksel Joonas Reedi commited on
Commit
b292d83
Β·
unverified Β·
1 Parent(s): 1f3fd90

ci: add REVIEW.md for tunable Claude reviews (#104)

Browse files

* ci: add REVIEW.md and inject it into the review prompt

REVIEW.md is a repo-root freeform instructions file that gets prepended
to the review prompt as highest-priority guidance. Lets maintainers tune
severity calibration, nit caps, skip lists, and repo-specific must-checks
by editing one file instead of the workflow YAML.

Mirrors the pattern used by the managed Anthropic Code Review product so
we keep the same levers on our self-hosted Actions setup.

* review: add merge-bias, pushback norms, and What-I-checked summary

Insights from the Latent Space 'harness engineering' interview: review
agents should default to merge, not block; 🟑/🟣 are informational not
required; author pushback without a fix is legitimate for non-Important
findings; repeated disagreement is a signal REVIEW.md is missing a rule.

Also adds a 'What I checked' bullet list to the summary shape so even
clean LGTM reviews surface the coverage the reviewer actually applied.

* review: rename severity markers to P0/P1/P2

Replace πŸ”΄ Important / 🟑 Nit / 🟣 Pre-existing with plain P0/P1/P2
labels throughout REVIEW.md and the workflow prompt. Matches the
priority scheme from the Latent Space harness-engineering interview
and reads cleaner in terminal-rendered GitHub diffs.

* review: swap merge-bias for rigor; require deep investigation + merge verdict

Maintainer feedback: default-bias-merge was borrowed from a closed AI-loop
context (Ryan's harness) where the PR author is also an agent and merge-and-
iterate is cheap. For an open-source repo taking one-shot external PRs with
a small maintainer team, the risk flips: false negatives ship bugs, false
positives cost one contributor round trip. Rigor is the correct default.

Three concrete changes:
- 'Default bias: rigor' replaces 'default bias: merge'. Hold the line on P0
even under contributor pushback. P1/P2 still accept deferral silently.
- New 'Investigate before posting' section requires reading callers and
callees (not just the diff), tracing routing/auth chains end-to-end, and
checking established patterns before flagging divergence.
- Summary now carries an explicit 'Verdict: ready to merge / changes
requested / needs discussion' so the maintainer sees the call at a
glance.

* review: add Dependency PRs rubric to catch supply-chain bait

Empirical test against the current open-PR queue surfaced a false-negative:
a bot PR (orbisai0security, #96) titled 'upgrade authlib to 1.6.9 for
CVE-2026-27962' actually bumps 1.6.5 β†’ 1.7.0 in the lockfile, the CVE
isn't in NVD, and the bump silently introduces a new transitive dep
(joserfc). Existing REVIEW.md rules are routing/auth/agent-loop centric
and would LGTM it.

New 'Dependency PRs' section requires: CVE verification against NVD or
GH Advisory DB, title-version ↔ lockfile-diff match, justification for
any new transitive dep, and P0 framing-flag when a dep-only PR claims a
code-behavior fix.

* review: trim REVIEW.md β€” drop enumerations, tighten P1 cap to 3

- Remove 'What counts as P0 in this repo' enumeration: P0 is implicitly
for Claude to figure out from context, not a static checklist.
- Remove 'Always check' repo-specific enumeration: same rationale. The
rigor + investigate-before-posting framing carries the weight.
- Remove 'Anything CI already enforces' block under 'Do not report':
rigor framing plus the skip-paths list already covers it.
- Drop 'If you cannot invest the depth to verify, do not post the
finding' tail from Investigate-before-posting (implicit in rigor).
- Drop routing/effort/caching citation expansion from Verification bar
(implicit in generic citation rule).
- Drop the concrete What-I-checked example from Summary shape.
- Drop 'one paragraph of context at most' from Summary shape.
- Tighten P1 cap from 5 to 3.

* review: compress dep-PR section to one paragraph, drop test-nag example

Dep-PR rubric was carrying four bulleted cases that amounted to one idea:
claims in the PR body must match the diff, new deps need justification,
lying framing is P0. Collapsed to a single paragraph.

Also drops 'Consider adding a test' from the speculative examples β€” that
heuristic tends to manufacture P1s rather than filter noise.

Files changed (2) hide show
  1. .github/workflows/claude-review.yml +33 -12
  2. REVIEW.md +135 -0
.github/workflows/claude-review.yml CHANGED
@@ -23,19 +23,40 @@ jobs:
23
  with:
24
  fetch-depth: 0
25
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
26
  - uses: anthropics/claude-code-action@v1
27
  with:
28
  anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
29
  track_progress: true
30
- prompt: |
31
- Review this pull request against the main branch. Focus on:
32
- - Correctness and likely bugs
33
- - Security issues (auth, input validation, secrets, injection)
34
- - Performance regressions, especially in the agent loop and streaming paths
35
- - Breakages in LiteLLM / Bedrock routing (model ids, params, prompt caching)
36
- - Test coverage for new behavior
37
- - Backend/frontend contract drift (FastAPI routes ↔ React client)
38
-
39
- Be concise. Prefer inline comments over long summaries. Skip nitpicks on
40
- style that ruff already catches. If the PR looks good, say so briefly
41
- instead of inventing issues.
 
23
  with:
24
  fetch-depth: 0
25
 
26
+ - name: Compose review prompt
27
+ id: compose
28
+ run: |
29
+ {
30
+ printf 'prompt<<PROMPT_EOF\n'
31
+ if [ -f REVIEW.md ]; then
32
+ echo '# Highest-priority review instructions (from REVIEW.md at the repo root)'
33
+ echo 'Follow these rules as the authoritative guide for this review. If anything'
34
+ echo 'below contradicts a more generic review habit, follow these.'
35
+ echo
36
+ cat REVIEW.md
37
+ echo
38
+ echo '---'
39
+ echo
40
+ fi
41
+ cat <<'BASE'
42
+ Review this pull request against the main branch.
43
+
44
+ Tag every finding with a priority label: P0 (blocks merge), P1 (worth
45
+ fixing, not blocking), or P2 (informational / pre-existing). Open the
46
+ review body with a one-line tally ("2 P0, 3 P1", or
47
+ "No blocking issues β€” 3 P1", or "LGTM" if nothing). Cite file:line for
48
+ every behavior claim. Prefer inline comments over long summaries.
49
+
50
+ Fallback focus if REVIEW.md is missing: correctness, security (auth,
51
+ injection, SSRF), LiteLLM/Bedrock routing breakage, agent loop / streaming
52
+ regressions, test coverage for new behavior. Skip anything ruff already
53
+ catches.
54
+ BASE
55
+ printf 'PROMPT_EOF\n'
56
+ } >> "$GITHUB_OUTPUT"
57
+
58
  - uses: anthropics/claude-code-action@v1
59
  with:
60
  anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
61
  track_progress: true
62
+ prompt: ${{ steps.compose.outputs.prompt }}
 
 
 
 
 
 
 
 
 
 
 
REVIEW.md ADDED
@@ -0,0 +1,135 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Review instructions
2
+
3
+ These rules override the default review guidance. Treat them as the highest-priority
4
+ instruction block for any review of this repo. If something here contradicts a more
5
+ generic review habit, follow these.
6
+
7
+ ## Severity levels
8
+
9
+ Every finding carries one of three priority labels:
10
+
11
+ - **P0** β€” blocks merge.
12
+ - **P1** β€” worth fixing, not blocking.
13
+ - **P2** β€” informational.
14
+
15
+ Write labels as plain text (`P0`, `P1`, `P2`) in finding headers. Do not use
16
+ emoji or colored markers. Use judgment on what belongs at which level β€” this
17
+ repo does not enumerate P0 cases; read the code and decide.
18
+
19
+ ## Default bias: rigor
20
+
21
+ Reviews gate merges. This is an open-source repo that takes PRs from anyone; the
22
+ maintainer team is small and relies on the review to catch what they don't have
23
+ time to verify themselves. **Default bias is rigor, not speed.** When in doubt
24
+ on a P0-class concern, investigate further before deciding whether to flag β€” a
25
+ false negative ships a bug to production, a false positive costs the contributor
26
+ one round trip.
27
+
28
+ Rigor is not nitpicking. The P1 cap, "do not report" skip list, and verification
29
+ bar all still apply. Rigor means going deep on a small number of real concerns,
30
+ not surfacing a large number of shallow ones. Prefer one well-investigated P0
31
+ over three speculative P1s.
32
+
33
+ **Hold the line on P0.** If the author pushes back on a P0 finding without a fix
34
+ that actually addresses the root cause, re-state the concern with added
35
+ citations. Only accept the pushback if the author points to code or behavior you
36
+ missed. Do not soften a P0 because the contributor is polite or new to the repo.
37
+
38
+ For P1 and P2: if the author defers or pushes back without fixing, accept it
39
+ silently β€” do not re-flag on subsequent commits. P1/P2 are informational; the
40
+ author may defer to a follow-up issue at their discretion.
41
+
42
+ If Claude and the author repeatedly disagree on the same class of finding, the
43
+ signal is that REVIEW.md is missing a rule; note it once in the PR summary as
44
+ `suggest-rule: <short description>` and stop.
45
+
46
+ ## Investigate before posting
47
+
48
+ The depth of your analysis determines the strength of your finding. For any
49
+ P0-class concern, before writing it up:
50
+
51
+ - Read the relevant callers and callees, not just the diff. Use Read and Grep
52
+ to open files the diff doesn't touch but the changed code interacts with.
53
+ - Trace the full chain end-to-end for routing, auth, and agent-loop findings.
54
+ Cite each hop by `file:line`, not just the suspicious line.
55
+ - Check whether the codebase already has an established pattern for this kind
56
+ of change (`grep` for similar call sites, similar tool definitions, similar
57
+ route guards). If the PR introduces a new approach where an established
58
+ pattern exists, flag that β€” divergence from the existing pattern is usually a
59
+ regression vector even when the new code "works."
60
+ - Confirm the specific behavior you're claiming. "This breaks X" must be
61
+ grounded in either the code handling X or a test exercising X, not in
62
+ inference from naming or structure.
63
+
64
+ A finding you "spotted" by scanning the diff is more likely to be a false
65
+ positive than a finding you verified by reading the code around it.
66
+
67
+ ## P1 cap
68
+
69
+ Report at most **3** P1 findings per review. If you found more, say "plus N
70
+ similar items" in the summary. If everything you found is P1 or below, open the
71
+ summary with "No blocking issues."
72
+
73
+ ## Re-review convergence
74
+
75
+ If this PR has already received a Claude review (there is a prior review comment
76
+ by the `claude` bot), suppress new P1 findings and post only P0 ones. Do not
77
+ re-post P1s that were already flagged on earlier commits. If the author pushed a
78
+ fix for a previously flagged issue, acknowledge it in one line rather than
79
+ re-flagging.
80
+
81
+ ## Do not report
82
+
83
+ Anything in these paths β€” skip entirely:
84
+
85
+ - `frontend/node_modules/**`, `**/*.lock`, `uv.lock`, `package-lock.json`
86
+ - `hf_agent.egg-info/**`, `.ruff_cache/**`, `.pytest_cache/**`, `.venv/**`
87
+ - `session_logs/**`, `reports/**`
88
+ - Anything under a `gen/` or `generated/` path
89
+
90
+ Anything speculative β€” do not post:
91
+
92
+ - "This might be slow" without a concrete complexity claim tied to a specific
93
+ input size
94
+ - Hypothetical race conditions without a concrete interleaving
95
+
96
+ ## Dependency PRs
97
+
98
+ For PRs whose diff is only a lockfile bump, a `pyproject.toml` change, or a
99
+ new dependency, the code rules above don't apply β€” risks shift to provenance
100
+ and framing. Every claim in the title or body (CVE IDs, version numbers,
101
+ behavior fixes) must match what the diff actually does, and any new
102
+ transitive dep needs justification. A PR that lies in its framing is P0
103
+ regardless of whether the code change is safe in isolation.
104
+
105
+ ## Verification bar
106
+
107
+ Every behavior claim in a finding must cite `file:line`. "This breaks X" is not
108
+ actionable without a line reference. If you cannot cite a line, do not post
109
+ the finding.
110
+
111
+ ## Summary shape
112
+
113
+ Open the review body with a single-line tally and an explicit merge verdict, on
114
+ two lines:
115
+
116
+ ```
117
+ 2 P0, 3 P1
118
+ Verdict: changes requested
119
+ ```
120
+
121
+ Valid verdicts:
122
+
123
+ - **Verdict: ready to merge** β€” no P0 findings, contributor can merge as-is
124
+ once any CI passes
125
+ - **Verdict: changes requested** β€” at least one P0 that must be addressed
126
+ before merging
127
+ - **Verdict: needs discussion** β€” a design-level concern the maintainer should
128
+ weigh in on before the contributor iterates (use sparingly)
129
+
130
+ If it's a clean review, write `LGTM` followed by `Verdict: ready to merge`.
131
+
132
+ Then a **What I checked** bullet list β€” one line per major area you examined,
133
+ regardless of whether you found anything. This gives the maintainer visible
134
+ coverage at a glance and lets them decide whether to spot-check areas you
135
+ didn't touch.