Spaces:
Running on CPU Upgrade
ci: add REVIEW.md for tunable Claude reviews (#104)
Browse files* ci: add REVIEW.md and inject it into the review prompt
REVIEW.md is a repo-root freeform instructions file that gets prepended
to the review prompt as highest-priority guidance. Lets maintainers tune
severity calibration, nit caps, skip lists, and repo-specific must-checks
by editing one file instead of the workflow YAML.
Mirrors the pattern used by the managed Anthropic Code Review product so
we keep the same levers on our self-hosted Actions setup.
* review: add merge-bias, pushback norms, and What-I-checked summary
Insights from the Latent Space 'harness engineering' interview: review
agents should default to merge, not block; π‘/π£ are informational not
required; author pushback without a fix is legitimate for non-Important
findings; repeated disagreement is a signal REVIEW.md is missing a rule.
Also adds a 'What I checked' bullet list to the summary shape so even
clean LGTM reviews surface the coverage the reviewer actually applied.
* review: rename severity markers to P0/P1/P2
Replace π΄ Important / π‘ Nit / π£ Pre-existing with plain P0/P1/P2
labels throughout REVIEW.md and the workflow prompt. Matches the
priority scheme from the Latent Space harness-engineering interview
and reads cleaner in terminal-rendered GitHub diffs.
* review: swap merge-bias for rigor; require deep investigation + merge verdict
Maintainer feedback: default-bias-merge was borrowed from a closed AI-loop
context (Ryan's harness) where the PR author is also an agent and merge-and-
iterate is cheap. For an open-source repo taking one-shot external PRs with
a small maintainer team, the risk flips: false negatives ship bugs, false
positives cost one contributor round trip. Rigor is the correct default.
Three concrete changes:
- 'Default bias: rigor' replaces 'default bias: merge'. Hold the line on P0
even under contributor pushback. P1/P2 still accept deferral silently.
- New 'Investigate before posting' section requires reading callers and
callees (not just the diff), tracing routing/auth chains end-to-end, and
checking established patterns before flagging divergence.
- Summary now carries an explicit 'Verdict: ready to merge / changes
requested / needs discussion' so the maintainer sees the call at a
glance.
* review: add Dependency PRs rubric to catch supply-chain bait
Empirical test against the current open-PR queue surfaced a false-negative:
a bot PR (orbisai0security, #96) titled 'upgrade authlib to 1.6.9 for
CVE-2026-27962' actually bumps 1.6.5 β 1.7.0 in the lockfile, the CVE
isn't in NVD, and the bump silently introduces a new transitive dep
(joserfc). Existing REVIEW.md rules are routing/auth/agent-loop centric
and would LGTM it.
New 'Dependency PRs' section requires: CVE verification against NVD or
GH Advisory DB, title-version β lockfile-diff match, justification for
any new transitive dep, and P0 framing-flag when a dep-only PR claims a
code-behavior fix.
* review: trim REVIEW.md β drop enumerations, tighten P1 cap to 3
- Remove 'What counts as P0 in this repo' enumeration: P0 is implicitly
for Claude to figure out from context, not a static checklist.
- Remove 'Always check' repo-specific enumeration: same rationale. The
rigor + investigate-before-posting framing carries the weight.
- Remove 'Anything CI already enforces' block under 'Do not report':
rigor framing plus the skip-paths list already covers it.
- Drop 'If you cannot invest the depth to verify, do not post the
finding' tail from Investigate-before-posting (implicit in rigor).
- Drop routing/effort/caching citation expansion from Verification bar
(implicit in generic citation rule).
- Drop the concrete What-I-checked example from Summary shape.
- Drop 'one paragraph of context at most' from Summary shape.
- Tighten P1 cap from 5 to 3.
* review: compress dep-PR section to one paragraph, drop test-nag example
Dep-PR rubric was carrying four bulleted cases that amounted to one idea:
claims in the PR body must match the diff, new deps need justification,
lying framing is P0. Collapsed to a single paragraph.
Also drops 'Consider adding a test' from the speculative examples β that
heuristic tends to manufacture P1s rather than filter noise.
- .github/workflows/claude-review.yml +33 -12
- REVIEW.md +135 -0
|
@@ -23,19 +23,40 @@ jobs:
|
|
| 23 |
with:
|
| 24 |
fetch-depth: 0
|
| 25 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 26 |
- uses: anthropics/claude-code-action@v1
|
| 27 |
with:
|
| 28 |
anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
|
| 29 |
track_progress: true
|
| 30 |
-
prompt:
|
| 31 |
-
Review this pull request against the main branch. Focus on:
|
| 32 |
-
- Correctness and likely bugs
|
| 33 |
-
- Security issues (auth, input validation, secrets, injection)
|
| 34 |
-
- Performance regressions, especially in the agent loop and streaming paths
|
| 35 |
-
- Breakages in LiteLLM / Bedrock routing (model ids, params, prompt caching)
|
| 36 |
-
- Test coverage for new behavior
|
| 37 |
-
- Backend/frontend contract drift (FastAPI routes β React client)
|
| 38 |
-
|
| 39 |
-
Be concise. Prefer inline comments over long summaries. Skip nitpicks on
|
| 40 |
-
style that ruff already catches. If the PR looks good, say so briefly
|
| 41 |
-
instead of inventing issues.
|
|
|
|
| 23 |
with:
|
| 24 |
fetch-depth: 0
|
| 25 |
|
| 26 |
+
- name: Compose review prompt
|
| 27 |
+
id: compose
|
| 28 |
+
run: |
|
| 29 |
+
{
|
| 30 |
+
printf 'prompt<<PROMPT_EOF\n'
|
| 31 |
+
if [ -f REVIEW.md ]; then
|
| 32 |
+
echo '# Highest-priority review instructions (from REVIEW.md at the repo root)'
|
| 33 |
+
echo 'Follow these rules as the authoritative guide for this review. If anything'
|
| 34 |
+
echo 'below contradicts a more generic review habit, follow these.'
|
| 35 |
+
echo
|
| 36 |
+
cat REVIEW.md
|
| 37 |
+
echo
|
| 38 |
+
echo '---'
|
| 39 |
+
echo
|
| 40 |
+
fi
|
| 41 |
+
cat <<'BASE'
|
| 42 |
+
Review this pull request against the main branch.
|
| 43 |
+
|
| 44 |
+
Tag every finding with a priority label: P0 (blocks merge), P1 (worth
|
| 45 |
+
fixing, not blocking), or P2 (informational / pre-existing). Open the
|
| 46 |
+
review body with a one-line tally ("2 P0, 3 P1", or
|
| 47 |
+
"No blocking issues β 3 P1", or "LGTM" if nothing). Cite file:line for
|
| 48 |
+
every behavior claim. Prefer inline comments over long summaries.
|
| 49 |
+
|
| 50 |
+
Fallback focus if REVIEW.md is missing: correctness, security (auth,
|
| 51 |
+
injection, SSRF), LiteLLM/Bedrock routing breakage, agent loop / streaming
|
| 52 |
+
regressions, test coverage for new behavior. Skip anything ruff already
|
| 53 |
+
catches.
|
| 54 |
+
BASE
|
| 55 |
+
printf 'PROMPT_EOF\n'
|
| 56 |
+
} >> "$GITHUB_OUTPUT"
|
| 57 |
+
|
| 58 |
- uses: anthropics/claude-code-action@v1
|
| 59 |
with:
|
| 60 |
anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
|
| 61 |
track_progress: true
|
| 62 |
+
prompt: ${{ steps.compose.outputs.prompt }}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -0,0 +1,135 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Review instructions
|
| 2 |
+
|
| 3 |
+
These rules override the default review guidance. Treat them as the highest-priority
|
| 4 |
+
instruction block for any review of this repo. If something here contradicts a more
|
| 5 |
+
generic review habit, follow these.
|
| 6 |
+
|
| 7 |
+
## Severity levels
|
| 8 |
+
|
| 9 |
+
Every finding carries one of three priority labels:
|
| 10 |
+
|
| 11 |
+
- **P0** β blocks merge.
|
| 12 |
+
- **P1** β worth fixing, not blocking.
|
| 13 |
+
- **P2** β informational.
|
| 14 |
+
|
| 15 |
+
Write labels as plain text (`P0`, `P1`, `P2`) in finding headers. Do not use
|
| 16 |
+
emoji or colored markers. Use judgment on what belongs at which level β this
|
| 17 |
+
repo does not enumerate P0 cases; read the code and decide.
|
| 18 |
+
|
| 19 |
+
## Default bias: rigor
|
| 20 |
+
|
| 21 |
+
Reviews gate merges. This is an open-source repo that takes PRs from anyone; the
|
| 22 |
+
maintainer team is small and relies on the review to catch what they don't have
|
| 23 |
+
time to verify themselves. **Default bias is rigor, not speed.** When in doubt
|
| 24 |
+
on a P0-class concern, investigate further before deciding whether to flag β a
|
| 25 |
+
false negative ships a bug to production, a false positive costs the contributor
|
| 26 |
+
one round trip.
|
| 27 |
+
|
| 28 |
+
Rigor is not nitpicking. The P1 cap, "do not report" skip list, and verification
|
| 29 |
+
bar all still apply. Rigor means going deep on a small number of real concerns,
|
| 30 |
+
not surfacing a large number of shallow ones. Prefer one well-investigated P0
|
| 31 |
+
over three speculative P1s.
|
| 32 |
+
|
| 33 |
+
**Hold the line on P0.** If the author pushes back on a P0 finding without a fix
|
| 34 |
+
that actually addresses the root cause, re-state the concern with added
|
| 35 |
+
citations. Only accept the pushback if the author points to code or behavior you
|
| 36 |
+
missed. Do not soften a P0 because the contributor is polite or new to the repo.
|
| 37 |
+
|
| 38 |
+
For P1 and P2: if the author defers or pushes back without fixing, accept it
|
| 39 |
+
silently β do not re-flag on subsequent commits. P1/P2 are informational; the
|
| 40 |
+
author may defer to a follow-up issue at their discretion.
|
| 41 |
+
|
| 42 |
+
If Claude and the author repeatedly disagree on the same class of finding, the
|
| 43 |
+
signal is that REVIEW.md is missing a rule; note it once in the PR summary as
|
| 44 |
+
`suggest-rule: <short description>` and stop.
|
| 45 |
+
|
| 46 |
+
## Investigate before posting
|
| 47 |
+
|
| 48 |
+
The depth of your analysis determines the strength of your finding. For any
|
| 49 |
+
P0-class concern, before writing it up:
|
| 50 |
+
|
| 51 |
+
- Read the relevant callers and callees, not just the diff. Use Read and Grep
|
| 52 |
+
to open files the diff doesn't touch but the changed code interacts with.
|
| 53 |
+
- Trace the full chain end-to-end for routing, auth, and agent-loop findings.
|
| 54 |
+
Cite each hop by `file:line`, not just the suspicious line.
|
| 55 |
+
- Check whether the codebase already has an established pattern for this kind
|
| 56 |
+
of change (`grep` for similar call sites, similar tool definitions, similar
|
| 57 |
+
route guards). If the PR introduces a new approach where an established
|
| 58 |
+
pattern exists, flag that β divergence from the existing pattern is usually a
|
| 59 |
+
regression vector even when the new code "works."
|
| 60 |
+
- Confirm the specific behavior you're claiming. "This breaks X" must be
|
| 61 |
+
grounded in either the code handling X or a test exercising X, not in
|
| 62 |
+
inference from naming or structure.
|
| 63 |
+
|
| 64 |
+
A finding you "spotted" by scanning the diff is more likely to be a false
|
| 65 |
+
positive than a finding you verified by reading the code around it.
|
| 66 |
+
|
| 67 |
+
## P1 cap
|
| 68 |
+
|
| 69 |
+
Report at most **3** P1 findings per review. If you found more, say "plus N
|
| 70 |
+
similar items" in the summary. If everything you found is P1 or below, open the
|
| 71 |
+
summary with "No blocking issues."
|
| 72 |
+
|
| 73 |
+
## Re-review convergence
|
| 74 |
+
|
| 75 |
+
If this PR has already received a Claude review (there is a prior review comment
|
| 76 |
+
by the `claude` bot), suppress new P1 findings and post only P0 ones. Do not
|
| 77 |
+
re-post P1s that were already flagged on earlier commits. If the author pushed a
|
| 78 |
+
fix for a previously flagged issue, acknowledge it in one line rather than
|
| 79 |
+
re-flagging.
|
| 80 |
+
|
| 81 |
+
## Do not report
|
| 82 |
+
|
| 83 |
+
Anything in these paths β skip entirely:
|
| 84 |
+
|
| 85 |
+
- `frontend/node_modules/**`, `**/*.lock`, `uv.lock`, `package-lock.json`
|
| 86 |
+
- `hf_agent.egg-info/**`, `.ruff_cache/**`, `.pytest_cache/**`, `.venv/**`
|
| 87 |
+
- `session_logs/**`, `reports/**`
|
| 88 |
+
- Anything under a `gen/` or `generated/` path
|
| 89 |
+
|
| 90 |
+
Anything speculative β do not post:
|
| 91 |
+
|
| 92 |
+
- "This might be slow" without a concrete complexity claim tied to a specific
|
| 93 |
+
input size
|
| 94 |
+
- Hypothetical race conditions without a concrete interleaving
|
| 95 |
+
|
| 96 |
+
## Dependency PRs
|
| 97 |
+
|
| 98 |
+
For PRs whose diff is only a lockfile bump, a `pyproject.toml` change, or a
|
| 99 |
+
new dependency, the code rules above don't apply β risks shift to provenance
|
| 100 |
+
and framing. Every claim in the title or body (CVE IDs, version numbers,
|
| 101 |
+
behavior fixes) must match what the diff actually does, and any new
|
| 102 |
+
transitive dep needs justification. A PR that lies in its framing is P0
|
| 103 |
+
regardless of whether the code change is safe in isolation.
|
| 104 |
+
|
| 105 |
+
## Verification bar
|
| 106 |
+
|
| 107 |
+
Every behavior claim in a finding must cite `file:line`. "This breaks X" is not
|
| 108 |
+
actionable without a line reference. If you cannot cite a line, do not post
|
| 109 |
+
the finding.
|
| 110 |
+
|
| 111 |
+
## Summary shape
|
| 112 |
+
|
| 113 |
+
Open the review body with a single-line tally and an explicit merge verdict, on
|
| 114 |
+
two lines:
|
| 115 |
+
|
| 116 |
+
```
|
| 117 |
+
2 P0, 3 P1
|
| 118 |
+
Verdict: changes requested
|
| 119 |
+
```
|
| 120 |
+
|
| 121 |
+
Valid verdicts:
|
| 122 |
+
|
| 123 |
+
- **Verdict: ready to merge** β no P0 findings, contributor can merge as-is
|
| 124 |
+
once any CI passes
|
| 125 |
+
- **Verdict: changes requested** β at least one P0 that must be addressed
|
| 126 |
+
before merging
|
| 127 |
+
- **Verdict: needs discussion** β a design-level concern the maintainer should
|
| 128 |
+
weigh in on before the contributor iterates (use sparingly)
|
| 129 |
+
|
| 130 |
+
If it's a clean review, write `LGTM` followed by `Verdict: ready to merge`.
|
| 131 |
+
|
| 132 |
+
Then a **What I checked** bullet list β one line per major area you examined,
|
| 133 |
+
regardless of whether you found anything. This gives the maintainer visible
|
| 134 |
+
coverage at a glance and lets them decide whether to spot-check areas you
|
| 135 |
+
didn't touch.
|