Spaces: smolagents/ml-intern

Aksel Joonas Reedi committed 13 days ago

Commit b292d83 (unverified) · 1 Parent(s): 1f3fd90

ci: add REVIEW.md for tunable Claude reviews (#104)

* ci: add REVIEW.md and inject it into the review prompt

REVIEW.md is a repo-root freeform instructions file that gets prepended
to the review prompt as highest-priority guidance. Lets maintainers tune
severity calibration, nit caps, skip lists, and repo-specific must-checks
by editing one file instead of the workflow YAML.

Mirrors the pattern used by the managed Anthropic Code Review product so
we keep the same levers on our self-hosted Actions setup.

* review: add merge-bias, pushback norms, and What-I-checked summary

Insights from the Latent Space 'harness engineering' interview: review
agents should default to merge, not block; 🟡/🟣 are informational not
required; author pushback without a fix is legitimate for non-Important
findings; repeated disagreement is a signal REVIEW.md is missing a rule.

Also adds a 'What I checked' bullet list to the summary shape so even
clean LGTM reviews surface the coverage the reviewer actually applied.

* review: rename severity markers to P0/P1/P2

Replace 🔴 Important / 🟡 Nit / 🟣 Pre-existing with plain P0/P1/P2
labels throughout REVIEW.md and the workflow prompt. Matches the
priority scheme from the Latent Space harness-engineering interview
and reads cleaner in terminal-rendered GitHub diffs.

* review: swap merge-bias for rigor; require deep investigation + merge verdict

Maintainer feedback: default-bias-merge was borrowed from a closed AI-loop
context (Ryan's harness) where the PR author is also an agent and merge-and-
iterate is cheap. For an open-source repo taking one-shot external PRs with
a small maintainer team, the risk flips: false negatives ship bugs, false
positives cost one contributor round trip. Rigor is the correct default.

Three concrete changes:
- 'Default bias: rigor' replaces 'default bias: merge'. Hold the line on P0
even under contributor pushback. P1/P2 still accept deferral silently.
- New 'Investigate before posting' section requires reading callers and
callees (not just the diff), tracing routing/auth chains end-to-end, and
checking established patterns before flagging divergence.
- Summary now carries an explicit 'Verdict: ready to merge / changes
requested / needs discussion' so the maintainer sees the call at a
glance.

* review: add Dependency PRs rubric to catch supply-chain bait

Empirical test against the current open-PR queue surfaced a false negative:
a bot PR (orbisai0security, #96) titled 'upgrade authlib to 1.6.9 for
CVE-2026-27962' actually bumps 1.6.5 → 1.7.0 in the lockfile, the CVE
isn't in NVD, and the bump silently introduces a new transitive dep
(joserfc). Existing REVIEW.md rules are routing/auth/agent-loop centric
and would LGTM it.

New 'Dependency PRs' section requires: CVE verification against NVD or
GH Advisory DB, title-version ↔ lockfile-diff match, justification for
any new transitive dep, and P0 framing-flag when a dep-only PR claims a
code-behavior fix.
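
The title-version ↔ lockfile-diff check described above can be sketched in a few lines. This is a minimal illustration only, not the reviewer's actual logic: `claimed_version`, `framing_matches`, and `new_packages` are hypothetical helper names, and the package sets are assumed to have been extracted from the lockfile before and after the PR.

```python
import re
from typing import List, Optional, Set


def claimed_version(title: str) -> Optional[str]:
    """Return the first dotted version number mentioned in a PR title, if any."""
    match = re.search(r"\b(\d+(?:\.\d+)+)\b", title)
    return match.group(1) if match else None


def framing_matches(title: str, locked_version: str) -> bool:
    """True only when the version named in the title is the one the lockfile pins."""
    claimed = claimed_version(title)
    return claimed is not None and claimed == locked_version


def new_packages(before: Set[str], after: Set[str]) -> List[str]:
    """Packages present after the bump but not before (candidate transitive deps)."""
    return sorted(after - before)


# Values taken from the PR described above; the helpers themselves are assumptions.
title = "upgrade authlib to 1.6.9 for CVE-2026-27962"
print(framing_matches(title, "1.7.0"))
print(new_packages({"authlib"}, {"authlib", "joserfc"}))
```

A mismatch from `framing_matches` or a non-empty result from `new_packages` would be the cue to ask for justification; verifying the CVE itself against NVD or the GH Advisory DB is a separate lookup that this sketch does not attempt.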

* review: trim REVIEW.md — drop enumerations, tighten P1 cap to 3

- Remove 'What counts as P0 in this repo' enumeration: P0 is left for
Claude to infer from context, not read off a static checklist.
- Remove 'Always check' repo-specific enumeration: same rationale. The
rigor + investigate-before-posting framing carries the weight.
- Remove 'Anything CI already enforces' block under 'Do not report':
rigor framing plus the skip-paths list already covers it.
- Drop 'If you cannot invest the depth to verify, do not post the
finding' tail from Investigate-before-posting (implicit in rigor).
- Drop routing/effort/caching citation expansion from Verification bar
(implicit in generic citation rule).
- Drop the concrete What-I-checked example from Summary shape.
- Drop 'one paragraph of context at most' from Summary shape.
- Tighten P1 cap from 5 to 3.

* review: compress dep-PR section to one paragraph, drop test-nag example

Dep-PR rubric was carrying four bulleted cases that amounted to one idea:
claims in the PR body must match the diff, new deps need justification,
lying framing is P0. Collapsed to a single paragraph.

Also drops 'Consider adding a test' from the speculative examples — that
heuristic tends to manufacture P1s rather than filter noise.

Files changed (2)

.github/workflows/claude-review.yml +33 -12
REVIEW.md +135 -0

.github/workflows/claude-review.yml CHANGED

@@ -23,19 +23,40 @@ jobs:
         with:
           fetch-depth: 0
+      - name: Compose review prompt
+        id: compose
+        run: |
+          {
+            printf 'prompt<<PROMPT_EOF\n'
+            if [ -f REVIEW.md ]; then
+              echo '# Highest-priority review instructions (from REVIEW.md at the repo root)'
+              echo 'Follow these rules as the authoritative guide for this review. If anything'
+              echo 'below contradicts a more generic review habit, follow these.'
+              echo
+              cat REVIEW.md
+              echo
+              echo '---'
+              echo
+            fi
+            cat <<'BASE'
+          Review this pull request against the main branch.
+          Tag every finding with a priority label: P0 (blocks merge), P1 (worth
+          fixing, not blocking), or P2 (informational / pre-existing). Open the
+          review body with a one-line tally ("2 P0, 3 P1", or
+          "No blocking issues — 3 P1", or "LGTM" if nothing). Cite file:line for
+          every behavior claim. Prefer inline comments over long summaries.
+          Fallback focus if REVIEW.md is missing: correctness, security (auth,
+          injection, SSRF), LiteLLM/Bedrock routing breakage, agent loop / streaming
+          regressions, test coverage for new behavior. Skip anything ruff already
+          catches.
+          BASE
+            printf 'PROMPT_EOF\n'
+          } >> "$GITHUB_OUTPUT"
       - uses: anthropics/claude-code-action@v1
         with:
           anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
           track_progress: true
-          prompt: |
-            Review this pull request against the main branch. Focus on:
-            - Correctness and likely bugs
-            - Security issues (auth, input validation, secrets, injection)
-            - Performance regressions, especially in the agent loop and streaming paths
-            - Breakages in LiteLLM / Bedrock routing (model ids, params, prompt caching)
-            - Test coverage for new behavior
-            - Backend/frontend contract drift (FastAPI routes ↔ React client)
-            Be concise. Prefer inline comments over long summaries. Skip nitpicks on
-            style that ruff already catches. If the PR looks good, say so briefly
-            instead of inventing issues.
+          prompt: ${{ steps.compose.outputs.prompt }}
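
The compose step relies on the multiline `name<<DELIMITER … DELIMITER` form that GitHub Actions accepts in `$GITHUB_OUTPUT`. The Python sketch below mirrors that step to make the resulting layout easier to see; it is an illustration under assumptions, not part of the workflow. `compose_prompt` and `github_output_entry` are hypothetical names, and the delimiter-collision check is an extra precaution that the shell step does not perform.

```python
from typing import Optional

HEADER = (
    "# Highest-priority review instructions (from REVIEW.md at the repo root)\n"
    "Follow these rules as the authoritative guide for this review. If anything\n"
    "below contradicts a more generic review habit, follow these.\n"
)


def compose_prompt(base: str, review_md: Optional[str]) -> str:
    """Prepend REVIEW.md (when present) to the base prompt, separated by a rule."""
    if review_md is None:
        # Same fallback as the workflow: only the base prompt is emitted.
        return base
    return f"{HEADER}\n{review_md.rstrip()}\n\n---\n\n{base}"


def github_output_entry(name: str, value: str, delimiter: str = "PROMPT_EOF") -> str:
    """Format a multiline step output as name<<DELIMITER, the value, DELIMITER."""
    if delimiter in value.splitlines():
        # Assumption: fail loudly instead of letting the value end early.
        raise ValueError("delimiter collides with a line in the value")
    body = value.rstrip("\n")
    return f"{name}<<{delimiter}\n{body}\n{delimiter}\n"


entry = github_output_entry(
    "prompt",
    compose_prompt("Review this pull request.\n", "# Review instructions\n"),
)
print(entry)
```

The collision check matters because a REVIEW.md containing a line that reads exactly `PROMPT_EOF` would otherwise terminate the output early.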

REVIEW.md ADDED

@@ -0,0 +1,135 @@
+# Review instructions
+
+These rules override the default review guidance. Treat them as the highest-priority
+instruction block for any review of this repo. If something here contradicts a more
+generic review habit, follow these.
+
+## Severity levels
+
+Every finding carries one of three priority labels:
+
+- **P0** — blocks merge.
+- **P1** — worth fixing, not blocking.
+- **P2** — informational.
+
+Write labels as plain text (`P0`, `P1`, `P2`) in finding headers. Do not use
+emoji or colored markers. Use judgment on what belongs at which level — this
+repo does not enumerate P0 cases; read the code and decide.
+
+## Default bias: rigor
+
+Reviews gate merges. This is an open-source repo that takes PRs from anyone; the
+maintainer team is small and relies on the review to catch what they don't have
+time to verify themselves. **Default bias is rigor, not speed.** When in doubt
+on a P0-class concern, investigate further before deciding whether to flag — a
+false negative ships a bug to production, a false positive costs the contributor
+one round trip.
+
+Rigor is not nitpicking. The P1 cap, "do not report" skip list, and verification
+bar all still apply. Rigor means going deep on a small number of real concerns,
+not surfacing a large number of shallow ones. Prefer one well-investigated P0
+over three speculative P1s.
+
+**Hold the line on P0.** If the author pushes back on a P0 finding without a fix
+that actually addresses the root cause, re-state the concern with added
+citations. Only accept the pushback if the author points to code or behavior you
+missed. Do not soften a P0 because the contributor is polite or new to the repo.
+
+For P1 and P2: if the author defers or pushes back without fixing, accept it
+silently — do not re-flag on subsequent commits. P1/P2 are informational; the
+author may defer to a follow-up issue at their discretion.
+
+If Claude and the author repeatedly disagree on the same class of finding, the
+signal is that REVIEW.md is missing a rule; note it once in the PR summary as
+`suggest-rule: <short description>` and stop.
+
+## Investigate before posting
+
+The depth of your analysis determines the strength of your finding. For any
+P0-class concern, before writing it up:
+
+- Read the relevant callers and callees, not just the diff. Use Read and Grep
+  to open files the diff doesn't touch but the changed code interacts with.
+- Trace the full chain end-to-end for routing, auth, and agent-loop findings.
+  Cite each hop by `file:line`, not just the suspicious line.
+- Check whether the codebase already has an established pattern for this kind
+  of change (`grep` for similar call sites, similar tool definitions, similar
+  route guards). If the PR introduces a new approach where an established
+  pattern exists, flag that — divergence from the existing pattern is usually a
+  regression vector even when the new code "works."
+- Confirm the specific behavior you're claiming. "This breaks X" must be
+  grounded in either the code handling X or a test exercising X, not in
+  inference from naming or structure.
+
+A finding you "spotted" by scanning the diff is more likely to be a false
+positive than a finding you verified by reading the code around it.
+
+## P1 cap
+
+Report at most **3** P1 findings per review. If you found more, say "plus N
+similar items" in the summary. If everything you found is P1 or below, open the
+summary with "No blocking issues."
+
+## Re-review convergence
+
+If this PR has already received a Claude review (there is a prior review comment
+by the `claude` bot), suppress new P1 findings and post only P0 ones. Do not
+re-post P1s that were already flagged on earlier commits. If the author pushed a
+fix for a previously flagged issue, acknowledge it in one line rather than
+re-flagging.
+
+## Do not report
+
+Anything in these paths — skip entirely:
+
+- `frontend/node_modules/**`, `**/*.lock`, `uv.lock`, `package-lock.json`
+- `hf_agent.egg-info/**`, `.ruff_cache/**`, `.pytest_cache/**`, `.venv/**`
+- `session_logs/**`, `reports/**`
+- Anything under a `gen/` or `generated/` path
+
+Anything speculative — do not post:
+
+- "This might be slow" without a concrete complexity claim tied to a specific
+  input size
+- Hypothetical race conditions without a concrete interleaving
+
+## Dependency PRs
+
+For PRs whose diff is only a lockfile bump, a `pyproject.toml` change, or a
+new dependency, the code rules above don't apply — risks shift to provenance
+and framing. Every claim in the title or body (CVE IDs, version numbers,
+behavior fixes) must match what the diff actually does, and any new
+transitive dep needs justification. A PR that lies in its framing is P0
+regardless of whether the code change is safe in isolation.
+
+## Verification bar
+
+Every behavior claim in a finding must cite `file:line`. "This breaks X" is not
+actionable without a line reference. If you cannot cite a line, do not post
+the finding.
+
+## Summary shape
+
+Open the review body with a single-line tally and an explicit merge verdict, on
+two lines:
+
+```
+2 P0, 3 P1
+Verdict: changes requested
+```
+
+Valid verdicts:
+
+- **Verdict: ready to merge** — no P0 findings, contributor can merge as-is
+  once any CI passes
+- **Verdict: changes requested** — at least one P0 that must be addressed
+  before merging
+- **Verdict: needs discussion** — a design-level concern the maintainer should
+  weigh in on before the contributor iterates (use sparingly)
+
+If it's a clean review, write `LGTM` followed by `Verdict: ready to merge`.
+
+Then a **What I checked** bullet list — one line per major area you examined,
+regardless of whether you found anything. This gives the maintainer visible
+coverage at a glance and lets them decide whether to spot-check areas you
+didn't touch.
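
The P1 cap and the verdict rules in REVIEW.md are prose instructions to the reviewer, but they are mechanical enough to sketch as code. The Python below is a minimal illustration, assuming findings arrive as plain strings; `cap_p1` and `verdict` are hypothetical names, and letting a P0 take precedence over a "needs discussion" flag is an assumption, since REVIEW.md does not say which wins.

```python
from typing import List, Tuple

P1_CAP = 3  # "Report at most 3 P1 findings per review"


def cap_p1(findings: List[str], cap: int = P1_CAP) -> Tuple[List[str], str]:
    """Keep the first `cap` P1 findings and describe the overflow, if any."""
    shown = findings[:cap]
    extra = len(findings) - len(shown)
    note = f"plus {extra} similar items" if extra > 0 else ""
    return shown, note


def verdict(p0_count: int, needs_discussion: bool = False) -> str:
    """Map the findings to one of the three valid verdict lines."""
    if p0_count > 0:
        # Assumption: any P0 outranks a design-level discussion flag.
        return "Verdict: changes requested"
    if needs_discussion:
        return "Verdict: needs discussion"
    return "Verdict: ready to merge"


shown, note = cap_p1(["a", "b", "c", "d", "e"])
print(shown, note)
print(verdict(p0_count=0))
```

Keeping the cap in one constant mirrors the intent of the file: tightening it from 5 to 3, as this commit does, is a one-line change.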