Add skills karpathy-guidelines + autoresearch from February 2026
autoresearch/SKILL.md
ADDED
@@ -0,0 +1,77 @@

---
name: autoresearch
description: Autonomous Goal-directed Iteration. Apply Karpathy's autoresearch principles to ANY task. Loops autonomously — modify, verify, keep/discard, repeat forever until stopped.
version: 1.0.0
---

# Claude Autoresearch — Autonomous Goal-directed Iteration

Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). Applies constraint-driven autonomous iteration to ANY work — not just ML research.

**Core idea:** You are an autonomous agent. Modify → Verify → Keep/Discard → Repeat. Never stop.

## When to Activate

- User invokes `/autoresearch` or `/ug:autoresearch`
- User says "work autonomously", "iterate until done", "keep improving", "run overnight"
- Any task requiring repeated iteration cycles with measurable outcomes

## Setup Phase (Do Once)

1. **Read all in-scope files** for full context before any modification.
2. **Define the goal** — What does "better" mean? Extract or ask for a mechanical metric:
   - Code: tests pass, build succeeds, performance benchmark improves
   - Content: word count target hit, SEO score improves, readability score
   - Design: Lighthouse score, accessibility audit passes
   - If no metric exists → define one with the user, or use the simplest proxy (e.g. "compiles without errors")
3. **Define scope constraints** — Which files can you modify? Which are read-only?
4. **Create a results log** — Track every iteration (see `references/results-logging.md`).
5. **Establish baseline** — Run verification on the current state. Record it as iteration #0.
6. **Confirm and go** — Show the user the setup, get confirmation, then BEGIN THE LOOP.

## The Loop (Runs Forever)

Read `references/autonomous-loop-protocol.md` for full protocol details.

```
LOOP FOREVER:
  1. Review: Read current state + git history + results log
  2. Ideate: Pick next change based on goal, past results, what hasn't been tried
  3. Modify: Make ONE focused change to in-scope files
  4. Commit: Git commit the change (before verification)
  5. Verify: Run the mechanical metric (tests, build, benchmark, etc.)
  6. Decide:
     - IMPROVED → Keep commit, log "keep", advance
     - SAME/WORSE → Git revert, log "discard"
     - CRASHED → Try to fix (max 3 attempts), else log "crash" and move on
  7. Log: Record result in results log
  8. Repeat: Go to step 1. NEVER STOP. NEVER ASK "should I continue?"
```
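
The commit, verify, and decide steps of the loop can be sketched as one small driver. This is a minimal sketch under stated assumptions: `verify_cmd` is whatever command was agreed at setup, treating the last line of its output as the metric is purely illustrative, and the `run` parameter exists so the decision logic can be exercised without a real git repository.

```python
import subprocess

def sh(cmd: str) -> subprocess.CompletedProcess:
    """Run a shell command, capturing output instead of raising."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)

def iterate_once(verify_cmd: str, best: float,
                 higher_is_better: bool = True, run=sh) -> tuple[str, float]:
    """One Commit -> Verify -> Decide cycle. Assumes the experimental
    change is already staged."""
    run('git commit -m "experiment: <one-sentence description>"')
    result = run(verify_cmd)
    if result.returncode != 0:                      # CRASHED
        run("git reset --hard HEAD~1")
        return "crash", best
    # Illustrative assumption: the metric is the last line of the output.
    metric = float(result.stdout.strip().splitlines()[-1])
    improved = metric > best if higher_is_better else metric < best
    if improved:                                    # IMPROVED -> keep commit
        return "keep", metric
    run("git reset --hard HEAD~1")                  # SAME/WORSE -> revert
    return "discard", best
```

With a real repository you would call `iterate_once` once per pass, after staging the experimental change.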

## Critical Rules

1. **NEVER STOP** — Loop until manually interrupted. The user may be asleep.
2. **Read before write** — Always understand the full context before modifying.
3. **One change per iteration** — Atomic changes. If it breaks, you know exactly why.
4. **Mechanical verification only** — No subjective "looks good". Use metrics.
5. **Automatic rollback** — Failed changes revert instantly. No debates.
6. **Simplicity wins** — Equal results + less code = KEEP. Tiny improvement + ugly complexity = DISCARD.
7. **Git is memory** — Every kept change is committed. The agent reads history to learn patterns.
8. **When stuck, think harder** — Re-read the files, re-read the goal, combine near-misses, try radical changes. Don't ask for help unless truly blocked by missing access/permissions.

## Principles Reference

See `references/core-principles.md` for the 7 generalizable principles from autoresearch.

## Adapting to Different Domains

| Domain | Metric | Scope | Verify Command |
|--------|--------|-------|----------------|
| Backend code | Tests pass + coverage % | `src/**/*.ts` | `npm test` |
| Frontend UI | Lighthouse score | `src/components/**` | `npx lighthouse` |
| ML training | val_bpb / loss | `train.py` | `uv run train.py` |
| Blog/content | Word count + readability | `content/*.md` | Custom script |
| Performance | Benchmark time (ms) | Target files | `npm run bench` |
| Refactoring | Tests pass + LOC reduced | Target module | `npm test && wc -l` |

Adapt the loop to your domain. The PRINCIPLES are universal; the METRICS are domain-specific.

autoresearch/references/autonomous-loop-protocol.md
ADDED
@@ -0,0 +1,113 @@

# Autonomous Loop Protocol

Detailed protocol for the autoresearch iteration loop. SKILL.md has the summary; this file has the full rules.

## Phase 1: Review (30 seconds)

Before each iteration, build situational awareness:

```
1. Read current state of in-scope files (full context)
2. Read last 10-20 entries from results log
3. Read git log --oneline -20 to see recent changes
4. Identify: what worked, what failed, what's untried
```

**Why read every time?** After rollbacks, state may differ from what you expect. Never assume — always verify.

## Phase 2: Ideate (Strategic)

Pick the NEXT change. Priority order:

1. **Fix crashes/failures** from the previous iteration first.
2. **Exploit successes** — if the last change improved the metric, try variants in the same direction.
3. **Explore new approaches** — try something the results log shows hasn't been attempted.
4. **Combine near-misses** — two changes that individually didn't help might work together.
5. **Simplify** — remove code while maintaining the metric. Simpler = better.
6. **Radical experiments** — when incremental changes stall, try something dramatically different.

**Anti-patterns:**

- Don't repeat the exact same change that was already discarded.
- Don't make multiple unrelated changes at once (you can't attribute the improvement).
- Don't chase marginal gains with ugly complexity.

## Phase 3: Modify (One Atomic Change)

- Make ONE focused change to in-scope files.
- The change should be explainable in one sentence.
- Write the description BEFORE making the change (forces clarity).

## Phase 4: Commit (Before Verification)

```bash
git add <changed-files>
git commit -m "experiment: <one-sentence description>"
```

Commit BEFORE running verification so rollback is clean: `git reset --hard HEAD~1`

## Phase 5: Verify (Mechanical Only)

Run the agreed-upon verification command. Capture the output.

**Timeout rule:** If verification exceeds 2x the normal time, kill it and treat the iteration as a crash.
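
The timeout rule can be enforced mechanically with a subprocess timeout. A minimal sketch, assuming a POSIX shell; `normal_seconds` is whatever the baseline verification run took:

```python
import subprocess

def verify_with_timeout(cmd: str, normal_seconds: float):
    """Run the verification command, killing it at 2x the normal
    runtime and treating the overrun as a crash."""
    try:
        r = subprocess.run(cmd, shell=True, capture_output=True, text=True,
                           timeout=2 * normal_seconds)
    except subprocess.TimeoutExpired:
        return "crash", ""                  # exceeded 2x normal time
    return ("ok" if r.returncode == 0 else "crash"), r.stdout
```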

**Extract metric:** Parse the verification output for the specific metric number.

## Phase 6: Decide (No Ambiguity)

```
IF metric_improved:
    STATUS = "keep"
    # Do nothing — commit stays
ELIF metric_same_or_worse:
    STATUS = "discard"
    git reset --hard HEAD~1
ELIF crashed:
    # Attempt fix (max 3 tries)
    IF fixable:
        Fix → re-commit → re-verify
    ELSE:
        STATUS = "crash"
        git reset --hard HEAD~1
```

**Simplicity override:** If the metric barely improved (less than +0.1%) but the change adds significant complexity, treat it as "discard". If the metric is unchanged but the code is simpler, treat it as "keep".
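
The decision table, including the simplicity override, reduces to one pure function. A sketch: `loc_delta` is the net lines of code the change added, the 50-line cutoff for "significant complexity" is an arbitrary placeholder, and the 0.1% threshold is interpreted relative to the previous best.

```python
def decide(new: float, best: float, *, crashed: bool = False,
           loc_delta: int = 0, higher_is_better: bool = True) -> str:
    """Classify one iteration as keep / discard / crash."""
    if crashed:
        return "crash"
    delta = (new - best) if higher_is_better else (best - new)
    rel = delta / max(abs(best), 1e-9)
    if 0 < rel < 0.001 and loc_delta > 50:   # barely improved + much more code
        return "discard"
    if delta == 0 and loc_delta < 0:         # same metric, simpler code
        return "keep"
    return "keep" if delta > 0 else "discard"
```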

## Phase 7: Log Results

Append to the results log (TSV format):

```
iteration  commit   metric  status   description
42         a1b2c3d  0.9821  keep     increase attention heads from 8 to 12
43         -        0.9845  discard  switch optimizer to SGD
44         -        0.0000  crash    double batch size (OOM)
```

## Phase 8: Repeat

Go to Phase 1. **NEVER STOP. NEVER ASK IF YOU SHOULD CONTINUE.**

If stuck (>5 consecutive discards):

1. Re-read ALL in-scope files from scratch
2. Re-read the original goal/direction
3. Review the entire results log for patterns
4. Try combining 2-3 previously successful changes
5. Try the OPPOSITE of what hasn't been working
6. Try a radical architectural change

## Crash Recovery

- Syntax error → fix immediately, don't count it as a separate iteration
- Runtime error → attempt a fix (max 3 tries), then move on
- Resource exhaustion (OOM) → revert, try a smaller variant
- Infinite loop/hang → kill after timeout, revert, avoid that approach
- External dependency failure → skip, log, try a different approach

## Communication

- **DO NOT** ask "should I keep going?" — YES. ALWAYS.
- **DO NOT** summarize after each iteration — just log and continue.
- **DO** print a brief one-line status every ~5 iterations (e.g., "Iteration 25: metric at 0.95, 8 keeps / 17 discards").
- **DO** alert if you discover something surprising or game-changing.

autoresearch/references/core-principles.md
ADDED
@@ -0,0 +1,92 @@

# Core Principles — From Karpathy's Autoresearch

7 universal principles extracted from autoresearch, applicable to ANY autonomous work.

## 1. Constraint = Enabler

Autonomy succeeds through intentional constraint, not despite it.

| Autoresearch | Generalized |
|--------------|-------------|
| 630-line codebase | Bounded scope that fits agent context |
| 5-minute time budget | Fixed iteration cost |
| One metric (val_bpb) | Single mechanical success criterion |

**Why:** Constraints enable agent confidence (full context understood), verification simplicity (no ambiguity), and iteration velocity (low cost = rapid feedback loops).

**Apply:** Before starting, define: which files are in scope? What's the ONE metric? What's the time budget per iteration?

## 2. Separate Strategy from Tactics

Humans set direction. Agents execute iterations.

| Strategic (Human) | Tactical (Agent) |
|-------------------|------------------|
| "Improve page load speed" | "Lazy-load images, code-split routes" |
| "Increase test coverage" | "Add tests for uncovered edge cases" |
| "Refactor auth module" | "Extract middleware, simplify handlers" |

**Why:** Humans understand WHY. Agents handle HOW. Mixing these roles wastes both human creativity and agent iteration speed.

**Apply:** Get clear direction from the user (or program.md). Then iterate autonomously on the implementation.

## 3. Metrics Must Be Mechanical

If you can't verify with a command, you can't iterate autonomously.

- Tests pass/fail (exit code 0)
- Benchmark time in milliseconds
- Coverage percentage
- Lighthouse score
- File size in bytes
- Lines-of-code count

**Anti-pattern:** "Looks better", "probably improved", "seems cleaner" — these KILL autonomous loops because there's no decision function.

**Apply:** Define the `grep` command (or equivalent) that extracts your metric BEFORE starting.

## 4. Verification Must Be Fast

If verification takes longer than the work itself, incentives misalign.

| Fast (enables iteration) | Slow (kills iteration) |
|--------------------------|------------------------|
| Unit tests (seconds) | Full E2E suite (minutes) |
| Type check (seconds) | Manual QA (hours) |
| Lint check (instant) | Code review (async) |

**Apply:** Use the FASTEST verification that still catches real problems. Save slow verification for after the loop.

## 5. Iteration Cost Shapes Behavior

- Cheap iteration → bold exploration, many experiments
- Expensive iteration → conservative behavior, few experiments

Autoresearch: a 5-minute cost → ~100 experiments per night.
Software: a 10-second test → 360 experiments per hour.

**Apply:** Minimize iteration cost. Use fast tests, incremental builds, targeted verification. Every minute saved = more experiments run.

## 6. Git as Memory and Audit Trail

Every successful change is committed. This enables:

- **Causality tracking** — which change drove the improvement?
- **Stacking wins** — each commit builds on prior successes
- **Pattern learning** — the agent sees what worked in THIS codebase
- **Human review** — the researcher inspects the agent's decision sequence

**Apply:** Commit before verifying. Revert on failure. The agent reads its own git history to inform the next experiment.

## 7. Honest Limitations

State what the system can and cannot do. Don't oversell.

Autoresearch CANNOT: change the tokenizer, replace human direction, or guarantee meaningful improvements.

**Apply:** At setup, explicitly state the constraints. If the agent hits a wall it can't solve (missing permissions, an external dependency, a need for human judgment), say so clearly instead of guessing.

## The Meta-Principle

> Autonomy scales when you constrain scope, clarify success, mechanize verification, and let agents optimize tactics while humans optimize strategy.

This isn't "removing humans." It's reassigning human effort from execution to direction. Humans become MORE valuable by focusing on irreducibly creative/strategic work.

autoresearch/references/results-logging.md
ADDED
@@ -0,0 +1,64 @@

# Results Logging Protocol

Track every iteration in a structured log. This enables pattern recognition and prevents repeating failed experiments.

## Log Format (TSV)

Create `autoresearch-results.tsv` in the working directory (gitignored):

```tsv
iteration  commit  metric  delta  status  description
```

### Columns

| Column | Type | Description |
|--------|------|-------------|
| iteration | int | Sequential counter starting at 0 (baseline) |
| commit | string | Short git hash (7 chars); "-" if reverted |
| metric | float | Measured value from verification |
| delta | float | Change from previous best (negative = improved when "lower is better") |
| status | enum | `baseline`, `keep`, `discard`, `crash` |
| description | string | One-sentence description of what was tried |

### Example

```tsv
iteration  commit   metric  delta  status    description
0          a1b2c3d  85.2    0.0    baseline  initial state — test coverage 85.2%
1          b2c3d4e  87.1    +1.9   keep      add tests for auth middleware edge cases
2          -        86.5    -0.6   discard   refactor test helpers (broke 2 tests)
3          -        0.0     0.0    crash     add integration tests (DB connection failed)
4          c3d4e5f  88.3    +1.2   keep      add tests for error handling in API routes
5          d4e5f6g  89.0    +0.7   keep      add boundary value tests for validators
```

## Log Management

- Create it at setup (iteration 0 = baseline)
- Append after EVERY iteration (including crashes)
- Do NOT commit this file to git (add it to .gitignore)
- Read the last 10-20 entries at the start of each iteration for context
- Use it to detect patterns: what kind of changes tend to succeed?

## Summary Reporting

Every 10 iterations, print a brief summary:

```
=== Autoresearch Progress (iteration 20) ===
Baseline: 85.2% → Current best: 92.1% (+6.9%)
Keeps: 8 | Discards: 10 | Crashes: 2
Last 5: keep, discard, discard, keep, keep
```
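
The summary can be generated straight from the log file. A sketch, assuming the six-column tab-separated format defined above and a higher-is-better metric:

```python
from collections import Counter

def summarize(tsv_text: str) -> str:
    """Render the progress block from the results log. Skips the header
    row and '#' comment lines (e.g. the metric-direction note)."""
    rows = [ln.split("\t") for ln in tsv_text.splitlines()
            if ln and not ln.startswith(("#", "iteration"))]
    status = Counter(r[4] for r in rows)
    kept = [float(r[2]) for r in rows if r[4] in ("baseline", "keep")]
    return (f"=== Autoresearch Progress (iteration {rows[-1][0]}) ===\n"
            f"Baseline: {kept[0]} -> Current best: {max(kept)}\n"
            f"Keeps: {status['keep']} | Discards: {status['discard']} | "
            f"Crashes: {status['crash']}\n"
            f"Last 5: {', '.join(r[4] for r in rows[-5:])}")
```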

## Metric Direction

Clarify at setup whether lower or higher is better:

- **Lower is better:** val_bpb, response time (ms), bundle size (KB), error count
- **Higher is better:** test coverage (%), Lighthouse score, throughput (req/s)

Record the direction in the first line of the results log as a comment:

```
# metric_direction: higher_is_better
```

karpathy-guidelines/SKILL.md
ADDED
@@ -0,0 +1,67 @@

---
name: karpathy-guidelines
description: Behavioral guidelines to reduce common LLM coding mistakes. Use when writing, reviewing, or refactoring code to avoid overcomplication, make surgical changes, surface assumptions, and define verifiable success criteria.
license: MIT
---

# Karpathy Guidelines

Behavioral guidelines to reduce common LLM coding mistakes, derived from [Andrej Karpathy's observations](https://x.com/karpathy/status/2015883857489522876) on LLM coding pitfalls.

**Tradeoff:** These guidelines bias toward caution over speed. For trivial tasks, use judgment.

## 1. Think Before Coding

**Don't assume. Don't hide confusion. Surface tradeoffs.**

Before implementing:

- State your assumptions explicitly. If uncertain, ask.
- If multiple interpretations exist, present them — don't pick one silently.
- If a simpler approach exists, say so. Push back when warranted.
- If something is unclear, stop. Name what's confusing. Ask.

## 2. Simplicity First

**Minimum code that solves the problem. Nothing speculative.**

- No features beyond what was asked.
- No abstractions for single-use code.
- No "flexibility" or "configurability" that wasn't requested.
- No error handling for impossible scenarios.
- If you write 200 lines and it could be 50, rewrite it.

Ask yourself: "Would a senior engineer say this is overcomplicated?" If yes, simplify.

## 3. Surgical Changes

**Touch only what you must. Clean up only your own mess.**

When editing existing code:

- Don't "improve" adjacent code, comments, or formatting.
- Don't refactor things that aren't broken.
- Match the existing style, even if you'd do it differently.
- If you notice unrelated dead code, mention it — don't delete it.

When your changes create orphans:

- Remove imports/variables/functions that YOUR changes made unused.
- Don't remove pre-existing dead code unless asked.

The test: every changed line should trace directly to the user's request.

## 4. Goal-Driven Execution

**Define success criteria. Loop until verified.**

Transform tasks into verifiable goals:

- "Add validation" → "Write tests for invalid inputs, then make them pass"
- "Fix the bug" → "Write a test that reproduces it, then make it pass"
- "Refactor X" → "Ensure tests pass before and after"
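
As a concrete illustration of the first transformation, here is a hypothetical `validate_age` helper where "add validation" has been recast as assertions that must pass; the 0-150 range is an invented requirement:

```python
def validate_age(value) -> int:
    """Parse an age field; rejects non-numeric and out-of-range input."""
    age = int(value)                  # raises ValueError on non-numeric input
    if not 0 <= age <= 150:
        raise ValueError(f"age out of range: {age}")
    return age

# The verifiable success criterion: these checks pass.
assert validate_age("42") == 42
for bad in ("-1", "200", "abc"):
    try:
        validate_age(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(f"{bad!r} should have been rejected")
```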

For multi-step tasks, state a brief plan:

```
1. [Step] → verify: [check]
2. [Step] → verify: [check]
3. [Step] → verify: [check]
```

Strong success criteria let you loop independently. Weak criteria ("make it work") require constant clarification.