Add skills karpathy-guidelines + autoresearch from February 2026
autoresearch/SKILL.md
ADDED
@@ -0,0 +1,77 @@

---
name: autoresearch
description: Autonomous Goal-directed Iteration. Apply Karpathy's autoresearch principles to ANY task. Loops autonomously — modify, verify, keep/discard, repeat forever until stopped.
version: 1.0.0
---

# Claude Autoresearch — Autonomous Goal-directed Iteration

Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). Applies constraint-driven autonomous iteration to ANY work — not just ML research.

**Core idea:** You are an autonomous agent. Modify → Verify → Keep/Discard → Repeat. Never stop.

## When to Activate

- User invokes `/autoresearch` or `/ug:autoresearch`
- User says "work autonomously", "iterate until done", "keep improving", "run overnight"
- Any task requiring repeated iteration cycles with measurable outcomes

## Setup Phase (Do Once)

1. **Read all in-scope files** for full context before any modification.
2. **Define the goal** — What does "better" mean? Extract or ask for a mechanical metric:
   - Code: tests pass, build succeeds, performance benchmark improves
   - Content: word count target hit, SEO score improves, readability score
   - Design: Lighthouse score, accessibility audit passes
   - If no metric exists → define one with the user, or use the simplest proxy (e.g. "compiles without errors")
3. **Define scope constraints** — Which files can you modify? Which are read-only?
4. **Create a results log** — Track every iteration (see `references/results-logging.md`).
5. **Establish baseline** — Run verification on the current state. Record it as iteration #0.
6. **Confirm and go** — Show the user the setup, get confirmation, then BEGIN THE LOOP.

## The Loop (Runs Forever)

Read `references/autonomous-loop-protocol.md` for full protocol details.

```
LOOP FOREVER:
  1. Review: Read current state + git history + results log
  2. Ideate: Pick next change based on goal, past results, what hasn't been tried
  3. Modify: Make ONE focused change to in-scope files
  4. Commit: Git commit the change (before verification)
  5. Verify: Run the mechanical metric (tests, build, benchmark, etc.)
  6. Decide:
     - IMPROVED → Keep commit, log "keep", advance
     - SAME/WORSE → Git revert, log "discard"
     - CRASHED → Try to fix (max 3 attempts), else log "crash" and move on
  7. Log: Record result in results log
  8. Repeat: Go to step 1. NEVER STOP. NEVER ASK "should I continue?"
```
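
The commit, verify, and decide steps of the loop can be sketched as one small driver. This is a minimal sketch under stated assumptions: `verify_cmd` is whatever command was agreed at setup, treating the last line of its output as the metric is purely illustrative, and the `run` parameter exists so the decision logic can be exercised without a real git repository.

```python
import subprocess

def sh(cmd: str) -> subprocess.CompletedProcess:
    """Run a shell command, capturing output instead of raising."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)

def iterate_once(verify_cmd: str, best: float,
                 higher_is_better: bool = True, run=sh) -> tuple[str, float]:
    """One Commit -> Verify -> Decide cycle. Assumes the experimental
    change is already staged."""
    run('git commit -m "experiment: <one-sentence description>"')
    result = run(verify_cmd)
    if result.returncode != 0:                      # CRASHED
        run("git reset --hard HEAD~1")
        return "crash", best
    # Illustrative assumption: the metric is the last line of the output.
    metric = float(result.stdout.strip().splitlines()[-1])
    improved = metric > best if higher_is_better else metric < best
    if improved:                                    # IMPROVED -> keep commit
        return "keep", metric
    run("git reset --hard HEAD~1")                  # SAME/WORSE -> revert
    return "discard", best
```

With a real repository you would call `iterate_once` once per pass, after staging the experimental change.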

## Critical Rules

1. **NEVER STOP** — Loop until manually interrupted. The user may be asleep.
2. **Read before write** — Always understand the full context before modifying.
3. **One change per iteration** — Atomic changes. If it breaks, you know exactly why.
4. **Mechanical verification only** — No subjective "looks good". Use metrics.
5. **Automatic rollback** — Failed changes revert instantly. No debates.
6. **Simplicity wins** — Equal results + less code = KEEP. Tiny improvement + ugly complexity = DISCARD.
7. **Git is memory** — Every kept change is committed. The agent reads history to learn patterns.
8. **When stuck, think harder** — Re-read the files, re-read the goal, combine near-misses, try radical changes. Don't ask for help unless truly blocked by missing access/permissions.

## Principles Reference

See `references/core-principles.md` for the 7 generalizable principles from autoresearch.

## Adapting to Different Domains

| Domain | Metric | Scope | Verify Command |
|--------|--------|-------|----------------|
| Backend code | Tests pass + coverage % | `src/**/*.ts` | `npm test` |
| Frontend UI | Lighthouse score | `src/components/**` | `npx lighthouse` |
| ML training | val_bpb / loss | `train.py` | `uv run train.py` |
| Blog/content | Word count + readability | `content/*.md` | Custom script |
| Performance | Benchmark time (ms) | Target files | `npm run bench` |
| Refactoring | Tests pass + LOC reduced | Target module | `npm test && wc -l` |

Adapt the loop to your domain. The PRINCIPLES are universal; the METRICS are domain-specific.

autoresearch/references/autonomous-loop-protocol.md
ADDED
@@ -0,0 +1,113 @@

# Autonomous Loop Protocol

Detailed protocol for the autoresearch iteration loop. SKILL.md has the summary; this file has the full rules.

## Phase 1: Review (30 seconds)

Before each iteration, build situational awareness:

```
1. Read current state of in-scope files (full context)
2. Read last 10-20 entries from results log
3. Read git log --oneline -20 to see recent changes
4. Identify: what worked, what failed, what's untried
```

**Why read every time?** After rollbacks, state may differ from what you expect. Never assume — always verify.

## Phase 2: Ideate (Strategic)

Pick the NEXT change. Priority order:

1. **Fix crashes/failures** from the previous iteration first.
2. **Exploit successes** — if the last change improved the metric, try variants in the same direction.
3. **Explore new approaches** — try something the results log shows hasn't been attempted.
4. **Combine near-misses** — two changes that individually didn't help might work together.
5. **Simplify** — remove code while maintaining the metric. Simpler = better.
6. **Radical experiments** — when incremental changes stall, try something dramatically different.

**Anti-patterns:**

- Don't repeat the exact same change that was already discarded.
- Don't make multiple unrelated changes at once (you can't attribute the improvement).
- Don't chase marginal gains with ugly complexity.

## Phase 3: Modify (One Atomic Change)

- Make ONE focused change to in-scope files.
- The change should be explainable in one sentence.
- Write the description BEFORE making the change (forces clarity).

## Phase 4: Commit (Before Verification)

```bash
git add <changed-files>
git commit -m "experiment: <one-sentence description>"
```

Commit BEFORE running verification so rollback is clean: `git reset --hard HEAD~1`

## Phase 5: Verify (Mechanical Only)

Run the agreed-upon verification command. Capture the output.

**Timeout rule:** If verification exceeds 2x the normal time, kill it and treat the iteration as a crash.
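
The timeout rule can be enforced mechanically with a subprocess timeout. A minimal sketch, assuming a POSIX shell; `normal_seconds` is whatever the baseline verification run took:

```python
import subprocess

def verify_with_timeout(cmd: str, normal_seconds: float):
    """Run the verification command, killing it at 2x the normal
    runtime and treating the overrun as a crash."""
    try:
        r = subprocess.run(cmd, shell=True, capture_output=True, text=True,
                           timeout=2 * normal_seconds)
    except subprocess.TimeoutExpired:
        return "crash", ""                  # exceeded 2x normal time
    return ("ok" if r.returncode == 0 else "crash"), r.stdout
```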

**Extract metric:** Parse the verification output for the specific metric number.

## Phase 6: Decide (No Ambiguity)

```
IF metric_improved:
    STATUS = "keep"
    # Do nothing — commit stays
ELIF metric_same_or_worse:
    STATUS = "discard"
    git reset --hard HEAD~1
ELIF crashed:
    # Attempt fix (max 3 tries)
    IF fixable:
        Fix → re-commit → re-verify
    ELSE:
        STATUS = "crash"
        git reset --hard HEAD~1
```

**Simplicity override:** If the metric barely improved (less than +0.1%) but the change adds significant complexity, treat it as "discard". If the metric is unchanged but the code is simpler, treat it as "keep".
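
The decision table, including the simplicity override, reduces to one pure function. A sketch: `loc_delta` is the net lines of code the change added, the 50-line cutoff for "significant complexity" is an arbitrary placeholder, and the 0.1% threshold is interpreted relative to the previous best.

```python
def decide(new: float, best: float, *, crashed: bool = False,
           loc_delta: int = 0, higher_is_better: bool = True) -> str:
    """Classify one iteration as keep / discard / crash."""
    if crashed:
        return "crash"
    delta = (new - best) if higher_is_better else (best - new)
    rel = delta / max(abs(best), 1e-9)
    if 0 < rel < 0.001 and loc_delta > 50:   # barely improved + much more code
        return "discard"
    if delta == 0 and loc_delta < 0:         # same metric, simpler code
        return "keep"
    return "keep" if delta > 0 else "discard"
```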

## Phase 7: Log Results

Append to the results log (TSV format):

```
iteration  commit   metric  status   description
42         a1b2c3d  0.9821  keep     increase attention heads from 8 to 12
43         -        0.9845  discard  switch optimizer to SGD
44         -        0.0000  crash    double batch size (OOM)
```

## Phase 8: Repeat

Go to Phase 1. **NEVER STOP. NEVER ASK IF YOU SHOULD CONTINUE.**

If stuck (>5 consecutive discards):

1. Re-read ALL in-scope files from scratch
2. Re-read the original goal/direction
3. Review the entire results log for patterns
4. Try combining 2-3 previously successful changes
5. Try the OPPOSITE of what hasn't been working
6. Try a radical architectural change

## Crash Recovery

- Syntax error → fix immediately, don't count it as a separate iteration
- Runtime error → attempt a fix (max 3 tries), then move on
- Resource exhaustion (OOM) → revert, try a smaller variant
- Infinite loop/hang → kill after timeout, revert, avoid that approach
- External dependency failure → skip, log, try a different approach

## Communication

- **DO NOT** ask "should I keep going?" — YES. ALWAYS.
- **DO NOT** summarize after each iteration — just log and continue.
- **DO** print a brief one-line status every ~5 iterations (e.g., "Iteration 25: metric at 0.95, 8 keeps / 17 discards").
- **DO** alert if you discover something surprising or game-changing.

autoresearch/references/core-principles.md
ADDED
@@ -0,0 +1,92 @@

# Core Principles — From Karpathy's Autoresearch

7 universal principles extracted from autoresearch, applicable to ANY autonomous work.

## 1. Constraint = Enabler

Autonomy succeeds through intentional constraint, not despite it.

| Autoresearch | Generalized |
|--------------|-------------|
| 630-line codebase | Bounded scope that fits agent context |
| 5-minute time budget | Fixed iteration cost |
| One metric (val_bpb) | Single mechanical success criterion |

**Why:** Constraints enable agent confidence (full context understood), verification simplicity (no ambiguity), and iteration velocity (low cost = rapid feedback loops).

**Apply:** Before starting, define: which files are in scope? What's the ONE metric? What's the time budget per iteration?

## 2. Separate Strategy from Tactics

Humans set direction. Agents execute iterations.

| Strategic (Human) | Tactical (Agent) |
|-------------------|------------------|
| "Improve page load speed" | "Lazy-load images, code-split routes" |
| "Increase test coverage" | "Add tests for uncovered edge cases" |
| "Refactor auth module" | "Extract middleware, simplify handlers" |

**Why:** Humans understand WHY. Agents handle HOW. Mixing these roles wastes both human creativity and agent iteration speed.

**Apply:** Get clear direction from the user (or program.md). Then iterate autonomously on the implementation.

## 3. Metrics Must Be Mechanical

If you can't verify with a command, you can't iterate autonomously.

- Tests pass/fail (exit code 0)
- Benchmark time in milliseconds
- Coverage percentage
- Lighthouse score
- File size in bytes
- Lines-of-code count

**Anti-pattern:** "Looks better", "probably improved", "seems cleaner" — these KILL autonomous loops because there's no decision function.

**Apply:** Define the `grep` command (or equivalent) that extracts your metric BEFORE starting.

## 4. Verification Must Be Fast

If verification takes longer than the work itself, incentives misalign.

| Fast (enables iteration) | Slow (kills iteration) |
|--------------------------|------------------------|
| Unit tests (seconds) | Full E2E suite (minutes) |
| Type check (seconds) | Manual QA (hours) |
| Lint check (instant) | Code review (async) |

**Apply:** Use the FASTEST verification that still catches real problems. Save slow verification for after the loop.

## 5. Iteration Cost Shapes Behavior

- Cheap iteration → bold exploration, many experiments
- Expensive iteration → conservative behavior, few experiments

Autoresearch: a 5-minute cost → ~100 experiments per night.
Software: a 10-second test → 360 experiments per hour.

**Apply:** Minimize iteration cost. Use fast tests, incremental builds, targeted verification. Every minute saved = more experiments run.

## 6. Git as Memory and Audit Trail

Every successful change is committed. This enables:

- **Causality tracking** — which change drove the improvement?
- **Stacking wins** — each commit builds on prior successes
- **Pattern learning** — the agent sees what worked in THIS codebase
- **Human review** — the researcher inspects the agent's decision sequence

**Apply:** Commit before verifying. Revert on failure. The agent reads its own git history to inform the next experiment.

## 7. Honest Limitations

State what the system can and cannot do. Don't oversell.

Autoresearch CANNOT: change the tokenizer, replace human direction, or guarantee meaningful improvements.

**Apply:** At setup, explicitly state the constraints. If the agent hits a wall it can't solve (missing permissions, an external dependency, a need for human judgment), say so clearly instead of guessing.

## The Meta-Principle

> Autonomy scales when you constrain scope, clarify success, mechanize verification, and let agents optimize tactics while humans optimize strategy.

This isn't "removing humans." It's reassigning human effort from execution to direction. Humans become MORE valuable by focusing on irreducibly creative/strategic work.

autoresearch/references/results-logging.md
ADDED
@@ -0,0 +1,64 @@

# Results Logging Protocol

Track every iteration in a structured log. This enables pattern recognition and prevents repeating failed experiments.

## Log Format (TSV)

Create `autoresearch-results.tsv` in the working directory (gitignored):

```tsv
iteration  commit  metric  delta  status  description
```

### Columns

| Column | Type | Description |
|--------|------|-------------|
| iteration | int | Sequential counter starting at 0 (baseline) |
| commit | string | Short git hash (7 chars); "-" if reverted |
| metric | float | Measured value from verification |
| delta | float | Change from previous best (negative = improved when "lower is better") |
| status | enum | `baseline`, `keep`, `discard`, `crash` |
| description | string | One-sentence description of what was tried |

### Example

```tsv
iteration  commit   metric  delta  status    description
0          a1b2c3d  85.2    0.0    baseline  initial state — test coverage 85.2%
1          b2c3d4e  87.1    +1.9   keep      add tests for auth middleware edge cases
2          -        86.5    -0.6   discard   refactor test helpers (broke 2 tests)
3          -        0.0     0.0    crash     add integration tests (DB connection failed)
4          c3d4e5f  88.3    +1.2   keep      add tests for error handling in API routes
5          d4e5f6g  89.0    +0.7   keep      add boundary value tests for validators
```

## Log Management

- Create it at setup (iteration 0 = baseline)
- Append after EVERY iteration (including crashes)
- Do NOT commit this file to git (add it to .gitignore)
- Read the last 10-20 entries at the start of each iteration for context
- Use it to detect patterns: what kind of changes tend to succeed?

## Summary Reporting

Every 10 iterations, print a brief summary:

```
=== Autoresearch Progress (iteration 20) ===
Baseline: 85.2% → Current best: 92.1% (+6.9%)
Keeps: 8 | Discards: 10 | Crashes: 2
Last 5: keep, discard, discard, keep, keep
```
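
The summary can be generated straight from the log file. A sketch, assuming the six-column tab-separated format defined above and a higher-is-better metric:

```python
from collections import Counter

def summarize(tsv_text: str) -> str:
    """Render the progress block from the results log. Skips the header
    row and '#' comment lines (e.g. the metric-direction note)."""
    rows = [ln.split("\t") for ln in tsv_text.splitlines()
            if ln and not ln.startswith(("#", "iteration"))]
    status = Counter(r[4] for r in rows)
    kept = [float(r[2]) for r in rows if r[4] in ("baseline", "keep")]
    return (f"=== Autoresearch Progress (iteration {rows[-1][0]}) ===\n"
            f"Baseline: {kept[0]} -> Current best: {max(kept)}\n"
            f"Keeps: {status['keep']} | Discards: {status['discard']} | "
            f"Crashes: {status['crash']}\n"
            f"Last 5: {', '.join(r[4] for r in rows[-5:])}")
```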

## Metric Direction

Clarify at setup whether lower or higher is better:

- **Lower is better:** val_bpb, response time (ms), bundle size (KB), error count
- **Higher is better:** test coverage (%), Lighthouse score, throughput (req/s)

Record the direction in the first line of the results log as a comment:

```
# metric_direction: higher_is_better
```

karpathy-guidelines/SKILL.md
ADDED
@@ -0,0 +1,67 @@

---
name: karpathy-guidelines
description: Behavioral guidelines to reduce common LLM coding mistakes. Use when writing, reviewing, or refactoring code to avoid overcomplication, make surgical changes, surface assumptions, and define verifiable success criteria.
license: MIT
---

# Karpathy Guidelines

Behavioral guidelines to reduce common LLM coding mistakes, derived from [Andrej Karpathy's observations](https://x.com/karpathy/status/2015883857489522876) on LLM coding pitfalls.

**Tradeoff:** These guidelines bias toward caution over speed. For trivial tasks, use judgment.

## 1. Think Before Coding

**Don't assume. Don't hide confusion. Surface tradeoffs.**

Before implementing:

- State your assumptions explicitly. If uncertain, ask.
- If multiple interpretations exist, present them — don't pick one silently.
- If a simpler approach exists, say so. Push back when warranted.
- If something is unclear, stop. Name what's confusing. Ask.

## 2. Simplicity First

**Minimum code that solves the problem. Nothing speculative.**

- No features beyond what was asked.
- No abstractions for single-use code.
- No "flexibility" or "configurability" that wasn't requested.
- No error handling for impossible scenarios.
- If you write 200 lines and it could be 50, rewrite it.

Ask yourself: "Would a senior engineer say this is overcomplicated?" If yes, simplify.

## 3. Surgical Changes

**Touch only what you must. Clean up only your own mess.**

When editing existing code:

- Don't "improve" adjacent code, comments, or formatting.
- Don't refactor things that aren't broken.
- Match the existing style, even if you'd do it differently.
- If you notice unrelated dead code, mention it — don't delete it.

When your changes create orphans:

- Remove imports/variables/functions that YOUR changes made unused.
- Don't remove pre-existing dead code unless asked.

The test: every changed line should trace directly to the user's request.

## 4. Goal-Driven Execution

**Define success criteria. Loop until verified.**

Transform tasks into verifiable goals:

- "Add validation" → "Write tests for invalid inputs, then make them pass"
- "Fix the bug" → "Write a test that reproduces it, then make it pass"
- "Refactor X" → "Ensure tests pass before and after"
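
As a concrete illustration of the first transformation, here is a hypothetical `validate_age` helper where "add validation" has been recast as assertions that must pass; the 0-150 range is an invented requirement:

```python
def validate_age(value) -> int:
    """Parse an age field; rejects non-numeric and out-of-range input."""
    age = int(value)                  # raises ValueError on non-numeric input
    if not 0 <= age <= 150:
        raise ValueError(f"age out of range: {age}")
    return age

# The verifiable success criterion: these checks pass.
assert validate_age("42") == 42
for bad in ("-1", "200", "abc"):
    try:
        validate_age(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(f"{bad!r} should have been rejected")
```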

For multi-step tasks, state a brief plan:

```
1. [Step] → verify: [check]
2. [Step] → verify: [check]
3. [Step] → verify: [check]
```

Strong success criteria let you loop independently. Weak criteria ("make it work") require constant clarification.