Nekochu committed on
Commit bc20316 · verified · 1 Parent(s): aa41a3e

Add skills karpathy-guidelines + autoresearch from February 2026

autoresearch/SKILL.md ADDED
---
name: autoresearch
description: Autonomous Goal-directed Iteration. Apply Karpathy's autoresearch principles to ANY task. Loops autonomously — modify, verify, keep/discard, repeat forever until stopped.
version: 1.0.0
---

# Claude Autoresearch — Autonomous Goal-directed Iteration

Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). Applies constraint-driven autonomous iteration to ANY work — not just ML research.

**Core idea:** You are an autonomous agent. Modify → Verify → Keep/Discard → Repeat. Never stop.

## When to Activate

- User invokes `/autoresearch` or `/ug:autoresearch`
- User says "work autonomously", "iterate until done", "keep improving", "run overnight"
- Any task requiring repeated iteration cycles with measurable outcomes

## Setup Phase (Do Once)

1. **Read all in-scope files** for full context before any modification
2. **Define the goal** — What does "better" mean? Extract or ask for a mechanical metric:
   - Code: tests pass, build succeeds, performance benchmark improves
   - Content: word count target hit, SEO score improves, readability score
   - Design: Lighthouse score, accessibility audit passes
   - If no metric exists → define one with the user, or use the simplest proxy (e.g. "compiles without errors")
3. **Define scope constraints** — Which files can you modify? Which are read-only?
4. **Create a results log** — Track every iteration (see `references/results-logging.md`)
5. **Establish baseline** — Run verification on the current state. Record it as iteration #0
6. **Confirm and go** — Show the user the setup, get confirmation, then BEGIN THE LOOP
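The setup above can be captured as a small config object plus a baseline run. A minimal sketch in Python — the `SetupConfig`/`establish_baseline` names and the `npm` command are illustrative, not part of the skill:

```python
import subprocess
from dataclasses import dataclass

@dataclass
class SetupConfig:
    goal: str               # human-readable direction (strategy stays with the human)
    metric_cmd: list[str]   # mechanical verification command
    in_scope: list[str]     # glob patterns the agent may modify
    higher_is_better: bool  # metric direction, decided at setup

def establish_baseline(cfg: SetupConfig) -> float:
    """Run verification on the untouched state; record the result as iteration #0."""
    result = subprocess.run(cfg.metric_cmd, capture_output=True, text=True)
    # Parsing is domain-specific; here we assume the command prints a bare number.
    return float(result.stdout.strip())

cfg = SetupConfig(
    goal="increase test coverage",
    metric_cmd=["npm", "test", "--", "--coverage-summary"],  # illustrative command
    in_scope=["src/**/*.ts"],
    higher_is_better=True,
)
```

Everything here is decided once, before the loop starts; the loop itself never renegotiates the goal or scope.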

## The Loop (Runs Forever)

Read `references/autonomous-loop-protocol.md` for full protocol details.

```
LOOP FOREVER:
1. Review: Read current state + git history + results log
2. Ideate: Pick next change based on goal, past results, what hasn't been tried
3. Modify: Make ONE focused change to in-scope files
4. Commit: Git commit the change (before verification)
5. Verify: Run the mechanical metric (tests, build, benchmark, etc.)
6. Decide:
   - IMPROVED → Keep commit, log "keep", advance
   - SAME/WORSE → Revert (git reset --hard HEAD~1), log "discard"
   - CRASHED → Try to fix (max 3 attempts), else log "crash" and move on
7. Log: Record result in results log
8. Repeat: Go to step 1. NEVER STOP. NEVER ASK "should I continue?"
```
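The steps above can be sketched as an executable skeleton. This is a simplified sketch: `ideate`, `apply_change`, `commit`, `revert`, and `log` are injected stand-ins for the agent's own reasoning and the git commands, crash handling is omitted, and a higher-is-better metric is assumed:

```python
def run_iteration(ideate, apply_change, verify, commit, revert, log, best):
    """One pass of the loop: modify, commit, verify, then keep or revert."""
    change = ideate()        # step 2: pick the next experiment
    apply_change(change)     # step 3: ONE focused change to in-scope files
    commit(change)           # step 4: e.g. git add + git commit, BEFORE verifying
    metric = verify()        # step 5: run the mechanical metric
    if metric > best:        # step 6: assumes a higher-is-better metric
        log(change, metric, "keep")
        return metric        # commit stays; this is the new best
    revert()                 # e.g. git reset --hard HEAD~1
    log(change, metric, "discard")
    return best

def loop_forever(baseline, **deps):
    best = baseline
    while True:              # step 8: NEVER STOP; only a manual interrupt ends this
        best = run_iteration(best=best, **deps)
```

Note that `run_iteration` is pure control flow: every side effect lives in a callable, which is what makes the keep/discard decision mechanical.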

## Critical Rules

1. **NEVER STOP** — Loop until manually interrupted. User may be asleep.
2. **Read before write** — Always understand full context before modifying
3. **One change per iteration** — Atomic changes. If it breaks, you know exactly why
4. **Mechanical verification only** — No subjective "looks good". Use metrics
5. **Automatic rollback** — Failed changes revert instantly. No debates
6. **Simplicity wins** — Equal results + less code = KEEP. Tiny improvement + ugly complexity = DISCARD
7. **Git is memory** — Every kept change committed. Agent reads history to learn patterns
8. **When stuck, think harder** — Re-read files, re-read goal, combine near-misses, try radical changes. Don't ask for help unless truly blocked by missing access/permissions

## Principles Reference

See `references/core-principles.md` for the 7 generalizable principles from autoresearch.

## Adapting to Different Domains

| Domain | Metric | Scope | Verify Command |
|--------|--------|-------|----------------|
| Backend code | Tests pass + coverage % | `src/**/*.ts` | `npm test` |
| Frontend UI | Lighthouse score | `src/components/**` | `npx lighthouse` |
| ML training | val_bpb / loss | `train.py` | `uv run train.py` |
| Blog/content | Word count + readability | `content/*.md` | Custom script |
| Performance | Benchmark time (ms) | Target files | `npm run bench` |
| Refactoring | Tests pass + LOC reduced | Target module | `npm test && wc -l` |

Adapt the loop to your domain. The PRINCIPLES are universal; the METRICS are domain-specific.
autoresearch/references/autonomous-loop-protocol.md ADDED
# Autonomous Loop Protocol

Detailed protocol for the autoresearch iteration loop. SKILL.md has the summary; this file has the full rules.

## Phase 1: Review (30 seconds)

Before each iteration, build situational awareness:

```
1. Read current state of in-scope files (full context)
2. Read last 10-20 entries from results log
3. Read git log --oneline -20 to see recent changes
4. Identify: what worked, what failed, what's untried
```

**Why read every time?** After rollbacks, state may differ from what you expect. Never assume — always verify.
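A sketch of this review step as code. The log file name follows `references/results-logging.md`; the `build_context` helper is illustrative, and real parsing of "what worked" is left to the agent:

```python
import subprocess
from pathlib import Path

def build_context(log_path="autoresearch-results.tsv", n_entries=20):
    """Re-read recent history before ideating: results-log tail + git log."""
    log_file = Path(log_path)
    # Last 10-20 log entries; an empty list if the log doesn't exist yet.
    entries = log_file.read_text().splitlines()[-n_entries:] if log_file.exists() else []
    # Recent commits; empty output outside a git repo, which is itself a signal.
    git_log = subprocess.run(
        ["git", "log", "--oneline", "-20"],
        capture_output=True, text=True,
    ).stdout.splitlines()
    return {"recent_results": entries, "recent_commits": git_log}
```

Rebuilding this context from disk on every pass is what makes the loop robust to rollbacks: nothing is trusted from memory.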

## Phase 2: Ideate (Strategic)

Pick the NEXT change. Priority order:

1. **Fix crashes/failures** from the previous iteration first
2. **Exploit successes** — if the last change improved the metric, try variants in the same direction
3. **Explore new approaches** — try something the results log shows hasn't been attempted
4. **Combine near-misses** — two changes that individually didn't help might work together
5. **Simplify** — remove code while maintaining the metric. Simpler = better
6. **Radical experiments** — when incremental changes stall, try something dramatically different

**Anti-patterns:**
- Don't repeat the exact same change that was already discarded
- Don't make multiple unrelated changes at once (you can't attribute the improvement)
- Don't chase marginal gains with ugly complexity

## Phase 3: Modify (One Atomic Change)

- Make ONE focused change to in-scope files
- The change should be explainable in one sentence
- Write the description BEFORE making the change (forces clarity)

## Phase 4: Commit (Before Verification)

```bash
git add <changed-files>
git commit -m "experiment: <one-sentence description>"
```

Commit BEFORE running verification so rollback is clean: `git reset --hard HEAD~1`

## Phase 5: Verify (Mechanical Only)

Run the agreed-upon verification command. Capture the output.

**Timeout rule:** If verification exceeds 2x the normal time, kill it and treat it as a crash.

**Extract metric:** Parse the verification output for the specific metric number.

## Phase 6: Decide (No Ambiguity)

```
IF metric_improved:
    STATUS = "keep"
    # Do nothing — commit stays
ELIF metric_same_or_worse:
    STATUS = "discard"
    git reset --hard HEAD~1
ELIF crashed:
    # Attempt fix (max 3 tries)
    IF fixable:
        Fix → re-commit → re-verify
    ELSE:
        STATUS = "crash"
        git reset --hard HEAD~1
```

**Simplicity override:** If the metric barely improved (+<0.1%) but the change adds significant complexity, treat it as "discard". If the metric is unchanged but the code is simpler, treat it as "keep".
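The keep/discard branch, including the simplicity override, can be written as a pure decision function. A sketch under stated assumptions: the 0.1% threshold comes from the override above, while using a negative lines-of-code delta (`loc_delta`) as the "simpler" signal is an illustrative choice; crash handling is left to the caller:

```python
def decide(metric, best, higher_is_better=True, loc_delta=0):
    """Return "keep" or "discard" per Phase 6, including the simplicity override."""
    improvement = (metric - best) if higher_is_better else (best - metric)
    rel = improvement / abs(best) if best else improvement
    if rel > 0.001:                 # clear improvement (> 0.1%) → keep
        return "keep"
    if rel >= 0 and loc_delta < 0:  # metric unchanged, less code → keep (simplicity wins)
        return "keep"
    return "discard"                # worse, or a marginal gain not worth the complexity
```

Having the decision be a pure function of numbers is the point: there is never a judgment call mid-loop.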

## Phase 7: Log Results

Append to results log (TSV format):

```
iteration commit metric delta status description
42 a1b2c3d 0.9821 -0.0017 keep increase attention heads from 8 to 12
43 - 0.9845 +0.0024 discard switch optimizer to SGD
44 - 0.0000 0.0000 crash double batch size (OOM)
```

## Phase 8: Repeat

Go to Phase 1. **NEVER STOP. NEVER ASK IF YOU SHOULD CONTINUE.**

If stuck (>5 consecutive discards):
1. Re-read ALL in-scope files from scratch
2. Re-read the original goal/direction
3. Review entire results log for patterns
4. Try combining 2-3 previously successful changes
5. Try the OPPOSITE of what hasn't been working
6. Try a radical architectural change

## Crash Recovery

- Syntax error → fix immediately, don't count as separate iteration
- Runtime error → attempt fix (max 3 tries), then move on
- Resource exhaustion (OOM) → revert, try smaller variant
- Infinite loop/hang → kill after timeout, revert, avoid that approach
- External dependency failure → skip, log, try different approach

## Communication

- **DO NOT** ask "should I keep going?" — YES. ALWAYS.
- **DO NOT** summarize after each iteration — just log and continue
- **DO** print a brief one-line status every ~5 iterations (e.g., "Iteration 25: metric at 0.95, 8 keeps / 17 discards")
- **DO** alert if you discover something surprising or game-changing
autoresearch/references/core-principles.md ADDED
# Core Principles — From Karpathy's Autoresearch

7 universal principles extracted from autoresearch, applicable to ANY autonomous work.

## 1. Constraint = Enabler

Autonomy succeeds through intentional constraint, not despite it.

| Autoresearch | Generalized |
|--------------|-------------|
| 630-line codebase | Bounded scope that fits agent context |
| 5-minute time budget | Fixed iteration cost |
| One metric (val_bpb) | Single mechanical success criterion |

**Why:** Constraints enable agent confidence (full context understood), verification simplicity (no ambiguity), and iteration velocity (low cost = rapid feedback loops).

**Apply:** Before starting, define: what files are in scope? What's the ONE metric? What's the time budget per iteration?

## 2. Separate Strategy from Tactics

Humans set direction. Agents execute iterations.

| Strategic (Human) | Tactical (Agent) |
|-------------------|------------------|
| "Improve page load speed" | "Lazy-load images, code-split routes" |
| "Increase test coverage" | "Add tests for uncovered edge cases" |
| "Refactor auth module" | "Extract middleware, simplify handlers" |

**Why:** Humans understand WHY. Agents handle HOW. Mixing these roles wastes both human creativity and agent iteration speed.

**Apply:** Get clear direction from the user (or program.md). Then iterate autonomously on implementation.

## 3. Metrics Must Be Mechanical

If you can't verify with a command, you can't iterate autonomously.

- Tests pass/fail (exit code 0)
- Benchmark time in milliseconds
- Coverage percentage
- Lighthouse score
- File size in bytes
- Lines of code count

**Anti-pattern:** "Looks better", "probably improved", "seems cleaner" → these KILL autonomous loops because there's no decision function.

**Apply:** Define the `grep` command (or equivalent) that extracts your metric BEFORE starting.
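For instance, a coverage metric might be pulled out of test output with a one-line parser. The regex and the jest-style sample line are illustrative; adapt them to your tool's actual output format:

```python
import re

def extract_metric(output: str) -> float:
    """Pull a single number out of verification output, e.g. a coverage %."""
    match = re.search(r"All files[^\d]*([\d.]+)", output)
    if match is None:
        # No number means no decision function: treat the iteration as a crash.
        raise ValueError("metric not found in verification output")
    return float(match.group(1))

# Illustrative coverage summary line, as a jest-like runner might print it:
sample = "All files | 87.5 | 72.1 | 90.0 |"
```

The test of a mechanical metric is exactly this: one function from raw output to one float, with a hard failure when the number is missing.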

## 4. Verification Must Be Fast

If verification takes longer than the work itself, incentives misalign.

| Fast (enables iteration) | Slow (kills iteration) |
|--------------------------|------------------------|
| Unit tests (seconds) | Full E2E suite (minutes) |
| Type check (seconds) | Manual QA (hours) |
| Lint check (instant) | Code review (async) |

**Apply:** Use the FASTEST verification that still catches real problems. Save slow verification for after the loop.

## 5. Iteration Cost Shapes Behavior

- Cheap iteration → bold exploration, many experiments
- Expensive iteration → conservative behavior, few experiments

Autoresearch: a 5-minute cost allows ~100 experiments per night.
Software: a 10-second test allows ~360 experiments per hour.

**Apply:** Minimize iteration cost. Use fast tests, incremental builds, targeted verification. Every minute saved = more experiments run.

## 6. Git as Memory and Audit Trail

Every successful change is committed. This enables:
- **Causality tracking** — which change drove improvement?
- **Stacking wins** — each commit builds on prior successes
- **Pattern learning** — the agent sees what worked in THIS codebase
- **Human review** — the researcher inspects the agent's decision sequence

**Apply:** Commit before verifying. Revert on failure. The agent reads its own git history to inform the next experiment.

## 7. Honest Limitations

State what the system can and cannot do. Don't oversell.

Autoresearch CANNOT: change the tokenizer, replace human direction, or guarantee meaningful improvements.

**Apply:** At setup, explicitly state the constraints. If the agent hits a wall it can't solve (missing permissions, external dependency, needs human judgment), say so clearly instead of guessing.

## The Meta-Principle

> Autonomy scales when you constrain scope, clarify success, mechanize verification, and let agents optimize tactics while humans optimize strategy.

This isn't "removing humans." It's reassigning human effort from execution to direction. Humans become MORE valuable by focusing on irreducibly creative/strategic work.
autoresearch/references/results-logging.md ADDED
# Results Logging Protocol

Track every iteration in a structured log. This enables pattern recognition and prevents repeating failed experiments.

## Log Format (TSV)

Create `autoresearch-results.tsv` in the working directory (gitignored):

```tsv
iteration commit metric delta status description
```

### Columns

| Column | Type | Description |
|--------|------|-------------|
| iteration | int | Sequential counter starting at 0 (baseline) |
| commit | string | Short git hash (7 chars), "-" if reverted |
| metric | float | Measured value from verification |
| delta | float | Change from previous best (negative = improved when "lower is better") |
| status | enum | `baseline`, `keep`, `discard`, `crash` |
| description | string | One-sentence description of what was tried |

### Example

```tsv
iteration commit metric delta status description
0 a1b2c3d 85.2 0.0 baseline initial state — test coverage 85.2%
1 b2c3d4e 87.1 +1.9 keep add tests for auth middleware edge cases
2 - 86.5 -0.6 discard refactor test helpers (broke 2 tests)
3 - 0.0 0.0 crash add integration tests (DB connection failed)
4 c3d4e5f 88.3 +1.2 keep add tests for error handling in API routes
5 d4e5f6a 89.0 +0.7 keep add boundary value tests for validators
```
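Appending one row per iteration can be sketched as below. The file name and the delta convention (raw change against the previous best, so the sign carries the direction) follow this protocol; the `log_iteration` helper name is illustrative:

```python
from pathlib import Path

LOG = Path("autoresearch-results.tsv")

HEADER = "iteration\tcommit\tmetric\tdelta\tstatus\tdescription\n"

def log_iteration(iteration, commit, metric, best, status, description):
    """Append one TSV row; delta is measured against the previous best."""
    delta = metric - best  # negative means improved when "lower is better"
    row = f"{iteration}\t{commit}\t{metric}\t{delta:+.4g}\t{status}\t{description}\n"
    if not LOG.exists():
        LOG.write_text(HEADER)  # iteration 0 (baseline) creates the file
    with LOG.open("a") as f:
        f.write(row)
```

Appending even crash rows keeps the log an honest record; gaps would hide exactly the experiments you must not repeat.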

## Log Management

- Create at setup (iteration 0 = baseline)
- Append after EVERY iteration (including crashes)
- Do NOT commit this file to git (add to .gitignore)
- Read last 10-20 entries at start of each iteration for context
- Use to detect patterns: what kind of changes tend to succeed?

## Summary Reporting

Every 10 iterations, print a brief summary:

```
=== Autoresearch Progress (iteration 20) ===
Baseline: 85.2% → Current best: 92.1% (+6.9%)
Keeps: 8 | Discards: 10 | Crashes: 2
Last 5: keep, discard, discard, keep, keep
```

## Metric Direction

Clarify at setup whether lower or higher is better:
- **Lower is better:** val_bpb, response time (ms), bundle size (KB), error count
- **Higher is better:** test coverage (%), lighthouse score, throughput (req/s)

Record direction in first line of results log as a comment:
```
# metric_direction: higher_is_better
```
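That comment can then drive a single comparison helper, so direction is decided once at setup and never re-litigated mid-loop. A sketch; the comment format matches this protocol, while the function names are illustrative:

```python
def read_direction(first_line: str) -> str:
    """Parse the '# metric_direction: ...' comment from the log's first line."""
    return first_line.removeprefix("# metric_direction:").strip()

def improved(metric: float, best: float, direction: str = "higher_is_better") -> bool:
    """True if `metric` beats `best` under the configured direction."""
    if direction == "higher_is_better":
        return metric > best
    return metric < best
```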
karpathy-guidelines/SKILL.md ADDED
---
name: karpathy-guidelines
description: Behavioral guidelines to reduce common LLM coding mistakes. Use when writing, reviewing, or refactoring code to avoid overcomplication, make surgical changes, surface assumptions, and define verifiable success criteria.
license: MIT
---

# Karpathy Guidelines

Behavioral guidelines to reduce common LLM coding mistakes, derived from [Andrej Karpathy's observations](https://x.com/karpathy/status/2015883857489522876) on LLM coding pitfalls.

**Tradeoff:** These guidelines bias toward caution over speed. For trivial tasks, use judgment.

## 1. Think Before Coding

**Don't assume. Don't hide confusion. Surface tradeoffs.**

Before implementing:
- State your assumptions explicitly. If uncertain, ask.
- If multiple interpretations exist, present them - don't pick silently.
- If a simpler approach exists, say so. Push back when warranted.
- If something is unclear, stop. Name what's confusing. Ask.

## 2. Simplicity First

**Minimum code that solves the problem. Nothing speculative.**

- No features beyond what was asked.
- No abstractions for single-use code.
- No "flexibility" or "configurability" that wasn't requested.
- No error handling for impossible scenarios.
- If you write 200 lines and it could be 50, rewrite it.

Ask yourself: "Would a senior engineer say this is overcomplicated?" If yes, simplify.

## 3. Surgical Changes

**Touch only what you must. Clean up only your own mess.**

When editing existing code:
- Don't "improve" adjacent code, comments, or formatting.
- Don't refactor things that aren't broken.
- Match existing style, even if you'd do it differently.
- If you notice unrelated dead code, mention it - don't delete it.

When your changes create orphans:
- Remove imports/variables/functions that YOUR changes made unused.
- Don't remove pre-existing dead code unless asked.

The test: Every changed line should trace directly to the user's request.

## 4. Goal-Driven Execution

**Define success criteria. Loop until verified.**

Transform tasks into verifiable goals:
- "Add validation" → "Write tests for invalid inputs, then make them pass"
- "Fix the bug" → "Write a test that reproduces it, then make it pass"
- "Refactor X" → "Ensure tests pass before and after"

For multi-step tasks, state a brief plan:
```
1. [Step] → verify: [check]
2. [Step] → verify: [check]
3. [Step] → verify: [check]
```

Strong success criteria let you loop independently. Weak criteria ("make it work") require constant clarification.
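As a concrete instance of the "Fix the bug" transform above: the verifiable goal is a failing test written before the fix. The `slugify` function and its bug report are invented for illustration:

```python
# Hypothetical bug report: slugify("Hello  World") returned "hello--world"
# because the old code split on a single space. Fixed version below.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())  # split() with no args collapses runs of spaces

# Step 1: a test that reproduces the report, written BEFORE touching the code.
def test_slugify_collapses_whitespace():
    assert slugify("Hello  World") == "hello-world"

# Step 2: make it pass. The task is done when this test passes and no others break.
```

The test, not a feeling of "looks fixed", is the success criterion the loop runs against.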