surrogate-1 / FEATURES.md
Ashira Pitchayapakayakul

Surrogate-1 Feature Roadmap

Updated: 2026-04-28

Status legend: ✅ shipped │ 🚧 in progress │ ⏳ planned │ 💡 idea


🟢 Already Shipped (Foundation)

Pipeline (parallel orchestrate)

  • ✅ 6-stage chain: SA → [Architect ∥ QA-TDD] → DEV → [QA-Verify ∥ OPS] → Reviewer
  • ✅ Direct LLM call (skip broken tool-loop)
  • ✅ Marker-extraction → real code blocks → real files in cwd
  • ✅ Auto-commit + git push on APPROVE
  • ✅ 12-rung LLM ladder (Cerebras / Groq / Gemini × 2 / Samba / GH Models / Chutes / OR × 2 / HF Router × 4)
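
The stage chain above mixes sequential steps with two parallel pairs. The following is a minimal sketch of that shape using `asyncio`, not the project's actual orchestrator: `run_stage` is a hypothetical stand-in for the direct LLM call, and the `context` dictionary is an assumed way of passing earlier outputs forward.

```python
import asyncio

# Stage groups in execution order; names inside one group run concurrently.
STAGE_GROUPS = [
    ["SA"],
    ["Architect", "QA-TDD"],
    ["DEV"],
    ["QA-Verify", "OPS"],
    ["Reviewer"],
]


async def run_stage(name: str, context: dict) -> str:
    """Hypothetical stand-in for one direct LLM call per stage."""
    await asyncio.sleep(0)  # placeholder for the real network call
    return f"{name} output (saw {len(context)} earlier results)"


async def run_chain(task: str) -> dict:
    """Run each group in order; stages within a group run in parallel."""
    context = {"task": task}
    for group in STAGE_GROUPS:
        # Every stage in a group reads the same snapshot of earlier results.
        snapshot = dict(context)
        outputs = await asyncio.gather(*(run_stage(n, snapshot) for n in group))
        context.update(zip(group, outputs))
    return context
```

Calling `asyncio.run(run_chain("build a CLI"))` returns one entry per stage, keyed by stage name, in the order the stages finished their groups.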

Data + Knowledge

  • ✅ 26 public datasets covering all SDLC domains
  • ✅ Training-pair feedback loop (every stage → ~/.surrogate/training-pairs.jsonl → HF dataset every 3 min)
  • ✅ Web research preamble (DDG search → context for PRD/orchestrate)
  • ✅ Agentic crawler (URL frontier + visited stamps + BFS link discovery, 6 workers)
  • ✅ Skill synthesis daemon (3-min cycles → ~/.surrogate/skills/{cat}/SKILL.md)
  • ✅ Continuous scrape (8 workers, 5-30s cool-down)

Models (Ollama on HF)

  • ✅ qwen3-coder:30b-a3b (primary, 16GB MoE)
  • ✅ devstral:24b (Mistral SWE-agent, 53.6% SWE-bench)
  • ✅ qwen2.5-coder:14b (fallback)
  • ✅ yi-coder:9b (128k context)
  • ✅ nomic-embed-text (RAG embeddings)

Agent Roster (19 SDLC experts)

  • ✅ solution-architect, tech-architect (design)
  • ✅ dev-frontend, dev-backend, dev-mobile, dev-fullstack, dev-database (impl)
  • ✅ qa-engineer, qa-perf, qa-security (test)
  • ✅ devops, sre, cloud-architect (infra)
  • ✅ devsecops, cloud-security (security)
  • ✅ data-engineer, ml-engineer (data/ML)
  • ✅ tech-writer, reviewer (docs/gate)

Infrastructure

  • ✅ HF Space (CPU 16GB free) running 24/7
  • ✅ /data persistent volume (state + logs + memory + skills + sessions + training-pairs)
  • ✅ Backward-compat symlinks (~/.claude/* → ~/.surrogate/*)
  • ✅ Mac CLI clean (20 essential files only, 118 daemons archived)
  • ✅ Status server: /, /health, /logs/{name}, /logs-list

🔴 Must-Have (next 30 days)

Reliability + Observability

  1. ⏳ Heartbeat alarm → Discord webhook if HF Space down >5 min
  2. ⏳ Auto-retry on transient errors (provider 429/503 → wait + retry next rung)
  3. ⏳ Cost meter per stage (tokens × $/1M, alert >$1/day)
  4. ⏳ Regression test suite (run nightly: orchestrate test fixtures, expect APPROVE)
  5. ⏳ Dataset upload deduplication (md5 of slice → skip if same as last)
  6. ⏳ Token-pool health check (rotate to next when 429)
  7. ⏳ Disk usage alert (>80% /data → cleanup oldest scrape state)
  8. ⏳ Memory leak watchdog (kill daemon RSS >1.5GB, restart)
  9. ⏳ Crash recovery (auto-resume cron loop on SIGCHLD)
  10. ⏳ Snapshot scrape ledger to HF dataset weekly
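
The auto-retry item (429/503, wait, then retry on the next rung) could look roughly like the sketch below. It assumes each rung is a plain callable and that providers surface their HTTP status through a hypothetical `ProviderError`; neither name comes from the project. The `sleep` parameter is injectable only so the wait can be skipped in tests.

```python
import time
from typing import Callable, Sequence

TRANSIENT_STATUS = {429, 503}  # statuses named in the roadmap item


class ProviderError(Exception):
    """Hypothetical error type carrying the provider's HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


def call_ladder(
    prompt: str,
    rungs: Sequence[Callable[[str], str]],
    wait_s: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
) -> str:
    """Try each rung in order, moving down only on transient errors."""
    last_error = None
    for i, rung in enumerate(rungs):
        try:
            return rung(prompt)
        except ProviderError as err:
            if err.status not in TRANSIENT_STATUS:
                raise  # non-transient errors are not retried
            last_error = err
            if i < len(rungs) - 1:
                sleep(wait_s)  # brief pause before the next rung
    raise RuntimeError("all rungs exhausted") from last_error
```

A fixed wait keeps the sketch short; a real version would probably honour a `Retry-After` header or back off exponentially.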

PRD + Project bootstrap

  1. ⏳ Claude Projects-style PRD wizard (single description input → auto-extract → 1-3 follow-ups → PRD)
  2. ⏳ PRD template library (web app / API / CLI / mobile / data pipeline / ML)
  3. ⏳ Auto-detect existing repo → reverse-engineer surrogate.md
  4. ⏳ PRD versioning (v1, v2 with diff)
  5. ⏳ "Spec mode": refine PRD interactively before any code

Pipeline quality

  1. ⏳ Self-critique loop (after dev: model A reviews model B output → re-dev if NEEDS-WORK)
  2. ⏳ Regression test on touched files (re-run existing tests)
  3. ⏳ Lint + type-check + security scan in pipeline (ruff, mypy, semgrep)
  4. ⏳ Diff approval UI (show changes before commit, esp. yolo mode)
  5. ⏳ Search-replace block edits (Aider-style, less risky than full rewrite)

Domain expert routing

  1. ⏳ Auto-route DEV stage to specialist (frontend/backend/mobile/iac) based on task keywords
  2. ⏳ Multi-specialist parallel work (e.g., backend API + frontend UI in same task → spawn both)
  3. ⏳ Specialist-specific eval (frontend agent → check WCAG; backend → check N+1)
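
Keyword-based routing, as described in the first two items, can be sketched as a set intersection per specialist. The agent names below come from the roster earlier in this document; the keyword sets and the `dev-fullstack` fallback are illustrative assumptions, not the project's configuration.

```python
import re

# Illustrative keyword sets; the real lists would need tuning.
SPECIALIST_KEYWORDS = {
    "dev-frontend": {"react", "css", "ui", "component"},
    "dev-backend": {"api", "endpoint", "rest", "grpc"},
    "dev-mobile": {"ios", "android", "flutter"},
    "dev-database": {"sql", "schema", "migration", "index"},
}
DEFAULT_SPECIALIST = "dev-fullstack"  # assumed fallback when nothing matches


def route_dev_stage(task: str) -> list[str]:
    """Return every specialist whose keywords appear in the task, best match first."""
    words = set(re.findall(r"[a-z0-9+#]+", task.lower()))
    scores = {agent: len(words & kws) for agent, kws in SPECIALIST_KEYWORDS.items()}
    matched = [a for a, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]
    return matched or [DEFAULT_SPECIALIST]
```

Returning a list rather than a single name leaves room for the multi-specialist case: a task that mentions both an API endpoint and a React component yields both agents.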

Memory + Context

  1. ⏳ Episodic memory (last 50 sessions retrieval for similar tasks)
  2. ⏳ Procedural memory (how-to library auto-generated from successful runs)
  3. ⏳ Project context cache (surrogate.md + repo-map persisted across sessions)
  4. ⏳ Cross-project pattern share (skill from project A → applicable to project B)
  5. ⏳ Long-term retention (key decisions → ADR auto-generation)

Self-improvement loop

  1. ⏳ Reflexion lessons → injected into next-similar-task prompt
  2. ⏳ Failed orchestrate → root-cause analysis → improvement queue
  3. ⏳ Weekly LoRA fine-tune trigger (on accumulated training pairs, autotrain)
  4. ⏳ A/B test prompts (variant A vs B, pick winner by APPROVE rate)
  5. ⏳ Voyager-style skill crystallization (pattern repeated 3+ times → permanent skill)
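
The crystallization rule in the last item (a pattern seen three or more times becomes a permanent skill) reduces to a counter with a threshold. This sketch assumes patterns are already reduced to a string key; how a run is turned into such a key is left open, and `SkillTracker` is a hypothetical name.

```python
from collections import Counter

CRYSTALLIZE_AFTER = 3  # "repeated 3+ times" from the roadmap item


class SkillTracker:
    """Counts successful uses of a pattern and promotes it once."""

    def __init__(self, threshold: int = CRYSTALLIZE_AFTER):
        self.threshold = threshold
        self.seen = Counter()
        self.skills = set()

    def record(self, pattern_key: str) -> bool:
        """Count one use; return True only when the pattern is newly promoted."""
        self.seen[pattern_key] += 1
        if self.seen[pattern_key] >= self.threshold and pattern_key not in self.skills:
            self.skills.add(pattern_key)
            return True
        return False
```

The `True` return is the hook where a caller would write the permanent skill file.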

Datasets + Training

  1. ⏳ SRE postmortem corpus (scrape danluu/post-mortems → ~600 incidents → instruction pairs)
  2. ⏳ AWS Well-Architected synthetic Q/A (PDFs → distilabel pipeline → 5k pairs)
  3. ⏳ Internal axentx code → instruction pairs (commit messages + diffs)
  4. ⏳ Training pair quality scoring (filter low-quality before HF upload)
  5. ⏳ DPO preference pairs from reviewer (chosen/rejected from REWORK cycles)
  6. ⏳ Synthetic ADR generation (real OSS examples → expand via distilabel)
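
For the DPO item, one plausible reading is that the finally approved output is "chosen" and each earlier output that drew a REWORK verdict is "rejected". The sketch below encodes that reading; the `prompt`/`chosen`/`rejected` keys follow a commonly used preference-pair layout, and the `(output, verdict)` input shape is an assumption.

```python
def dpo_pairs(prompt: str, attempts: list[tuple[str, str]]) -> list[dict]:
    """Build preference pairs from one task's attempts.

    attempts: (output, verdict) tuples in chronological order, where verdict
    is one of the reviewer's APPROVE / REWORK / REJECT labels.
    """
    approved = [out for out, verdict in attempts if verdict == "APPROVE"]
    if not approved:
        return []  # nothing trustworthy to prefer, so emit no pairs
    chosen = approved[-1]
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": out}
        for out, verdict in attempts
        if verdict == "REWORK"
    ]
```

Tasks that never reach APPROVE produce no pairs, which avoids training toward an output the reviewer did not accept.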

Tools + Integrations

  1. ⏳ MCP client support (Claude Desktop schema, to connect external tools)
  2. ⏳ ToolSearch lazy-load (don't blow context on full tool list)
  3. ⏳ Constitutional Critic from ~/.surrogate/agents/roster.json (auto-load)
  4. ⏳ Repo-map context (tree-sitter symbol graph → smarter file selection)
  5. ⏳ Tool-call traces saved as training data (every tool use → pair)

Security + Safety

  1. ⏳ Secret-scan pre-commit hook (gitleaks integration)
  2. ⏳ Rate limit per-IP (HF Space /chat endpoint)
  3. ⏳ Allowlist/denylist for git push (don't push to main without flag)
  4. ⏳ PII scrubber for training pairs (remove emails, IPs, names before upload)
  5. ⏳ Sandbox tool execution (no rm -rf, no curl | sh, no destructive ops)
  6. ⏳ Audit log for every orchestrate run (who/what/when/result)
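
Part of the PII scrubber item can be covered with regular expressions. The sketch below handles only emails and IPv4 addresses; personal names generally need a named-entity model and are out of scope here. The placeholder tokens `<EMAIL>` and `<IP>` are arbitrary choices.

```python
import re

# Deliberately simple patterns; they favour recall over strict validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub_pii(text: str) -> str:
    """Replace emails and IPv4-looking strings with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IPV4_RE.sub("<IP>", text)
```

Note that the IPv4 pattern also matches four-part version strings such as `1.2.3.4`, so some over-scrubbing of code samples should be expected.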

Multi-modal + I/O

  1. ⏳ Voice input (Whisper transcribe → surrogate)
  2. ⏳ Image input (architectural diagrams → analysis)
  3. ⏳ Screen recording → video → tutorial agent
  4. ⏳ Discord voice channel (TTS responses)

CLI UX

  1. ⏳ /resume (continue past session)
  2. ⏳ /diff (show pending changes before commit)
  3. ⏳ /undo (rollback last orchestrate via git stash)
  4. ⏳ /share (publish session as gist for review)
  5. ⏳ Tab autocomplete for slash commands
  6. ⏳ Cost-meter live in statusline (running $ this session)
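
The live cost meter can reuse the formula given under Reliability (tokens × $/1M). This is a minimal sketch: the `CostMeter` class is hypothetical, and the per-model prices must be supplied by the caller, since none are given in this document.

```python
from collections import defaultdict


class CostMeter:
    """Accumulates estimated spend per stage for one session."""

    def __init__(self, usd_per_million: dict[str, float]):
        self.usd_per_million = usd_per_million  # caller-supplied prices
        self.by_stage = defaultdict(float)

    def add(self, stage: str, model: str, tokens: int) -> float:
        """Record one call and return its estimated cost in USD."""
        cost = tokens * self.usd_per_million[model] / 1_000_000
        self.by_stage[stage] += cost
        return cost

    @property
    def total(self) -> float:
        return sum(self.by_stage.values())

    def statusline(self) -> str:
        """Short text suitable for a CLI status line."""
        return f"${self.total:.4f} this session"
```

Keeping the per-stage breakdown in `by_stage` means the same object could feed both the status line and a per-stage report.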

Cloud / multi-region

  1. ⏳ Mirror to Cloudflare Workers AI (free tier backup)
  2. ⏳ Egress whitelist for Discord on HF Pro tier
  3. ⏳ HF Space upgrade auto-scale (when load > 80%)
  4. ⏳ Backup strategy: weekly snapshot of /data → HF dataset

Codebase intelligence

  1. ⏳ Symbol search (tree-sitter index, not just text grep)
  2. ⏳ Cross-file refactor (rename across project safely)
  3. ⏳ Type-aware code completion (LSP integration)
  4. ⏳ Dead code detection (vulture, ts-prune)
  5. ⏳ Dependency graph viz (per-project)

Training data flywheel

  1. ⏳ Trace storage on HF (axentx/surrogate-1-traces dataset)
  2. ⏳ Auto-tag training pairs by domain (frontend/backend/etc)
  3. ⏳ Quality gate before training pair upload (≥ N tokens, well-formed)
  4. ⏳ Weekly eval on SWE-bench-Lite (track improvement)
  5. ⏳ DPO data generation (REWORK cycles → preference pairs)
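
The quality gate (≥ N tokens, well-formed) could be applied per JSONL line before upload. In this sketch "well-formed" is taken to mean valid JSON with string `prompt` and `response` fields, and tokens are approximated by whitespace-separated words; the field names, the word-count proxy and the default `MIN_TOKENS` value are all assumptions.

```python
import json

MIN_TOKENS = 20  # stand-in for the roadmap's unspecified N


def passes_quality_gate(line: str, min_tokens: int = MIN_TOKENS) -> bool:
    """Return True if one JSONL line is well-formed and long enough."""
    try:
        pair = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(pair, dict):
        return False
    prompt, response = pair.get("prompt"), pair.get("response")
    if not (isinstance(prompt, str) and isinstance(response, str)):
        return False
    if not prompt.strip():
        return False
    # Whitespace word count as a cheap proxy for a tokenizer.
    return len(response.split()) >= min_tokens
```

Swapping the word count for a real tokenizer would change only the last line.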

Discord + notifications

  1. ⏳ Discord webhook for every commit (axentx repo notifications)
  2. ⏳ Daily digest webhook (commits + pairs + scrape stats)
  3. ⏳ Failure alerts (orchestrate fail → ping)
  4. ⏳ Slash commands /orchestrate "task" from Discord

HF integrations

  1. ⏳ TEI server (text-embeddings-inference) for RAG
  2. ⏳ TGI server (text-generation-inference) for self-hosted LLM
  3. ⏳ autotrain weekly LoRA on training pairs
  4. ⏳ HF Inference Providers as primary (paid bypass)
  5. ⏳ HF Spaces gradio UI (visualize chain status)

Agent quality

  1. ⏳ Specialist eval per agent (e.g., dev-backend on RealWorld benchmark)
  2. ⏳ Multi-model consensus on critical decisions (architecture, security)
  3. ⏳ Constitutional rules (no hard-coded secrets, validate input)
  4. ⏳ Tool use tracking per agent (which tools each agent calls)
  5. ⏳ Persona consistency check (review for tone/style mid-thread)

Project management

  1. ⏳ Burndown chart per surrogate.md plan
  2. ⏳ Story-point estimation from PRD
  3. ⏳ Auto-create GitHub issues from - [ ] plan items
  4. ⏳ PR description auto-write from commit list
  5. ⏳ Sprint retrospective auto-summary
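
Creating issues from `- [ ]` plan items starts with finding the unchecked boxes. The sketch below covers only that extraction step and makes no API calls; `open_plan_items` is a hypothetical helper, and the returned strings would serve as issue titles.

```python
import re

# Matches "- [ ] text" or "* [ ] text" at any indentation; checked boxes are skipped.
UNCHECKED_RE = re.compile(r"^\s*[-*]\s+\[ \]\s+(.+?)\s*$")


def open_plan_items(markdown: str) -> list[str]:
    """Return the text of every unchecked task-list item, in document order."""
    items = []
    for line in markdown.splitlines():
        match = UNCHECKED_RE.match(line)
        if match:
            items.append(match.group(1))
    return items
```

To keep reruns idempotent, a caller would likely compare these titles against existing open issues before creating new ones.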

Performance

  1. ⏳ Profile + optimize orchestrate cycle time (target < 90s p50)
  2. ⏳ Streaming responses (LLM tokens flow live, don't wait for full)
  3. ⏳ Local cache for repeated identical prompts
  4. ⏳ Parallel model calls (race fastest-first, kill rest)
  5. ⏳ Edge inference (qwen3-coder on Cerebras WaferScale via API)
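
The "race fastest-first, kill rest" item maps onto `asyncio.wait` with `FIRST_COMPLETED`. The sketch below returns the first call that succeeds and cancels whatever is still running; treating a failed call as "keep waiting for the others" is an assumption about the intended behaviour, and `race` is a hypothetical name.

```python
import asyncio


async def race(coros):
    """Return the first successful result and cancel the remaining calls."""
    tasks = [asyncio.create_task(c) for c in coros]
    try:
        pending = set(tasks)
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
        raise RuntimeError("every model call failed")
    finally:
        for task in tasks:
            if not task.done():
                task.cancel()
        # Let cancelled tasks finish unwinding before returning.
        await asyncio.gather(*tasks, return_exceptions=True)
```

Cancelling the losers stops the local coroutines, but a provider may still bill for a request it has already started processing.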

Compliance + Governance

  1. ⏳ License audit per file generated (OSS license compatibility)
  2. ⏳ Commit signing (gpg/sigstore)

💡 Nice-to-Have (future)

Multi-agent collaboration

  1. 💡 MoA (Mixture of Agents): 3 LLMs propose, judge picks best
  2. 💡 Debate mode (2 agents argue, third synthesizes)
  3. 💡 Tournament-style code review (3 reviewers, majority verdict)
  4. 💡 Hierarchical agents (manager → workers → reporter)
  5. 💡 Autonomous research squad (3 agents split topics, merge findings)

UI / UX

  1. 💡 Web dashboard (real-time pipeline status, training pair count, model health)
  2. 💡 VSCode extension (surrogate /auto from editor)
  3. 💡 IntelliJ plugin
  4. 💡 Mobile app (iOS/Android) for on-the-go orchestrate
  5. 💡 Apple Watch glance (current task status)

Voice + Audio

  1. 💡 Whisper realtime transcription
  2. 💡 ElevenLabs TTS for status reports
  3. 💡 Daily audio briefing podcast
  4. 💡 Voice clone of user for replies

Visual

  1. 💡 Architecture diagram auto-generation (mermaid → SVG)
  2. 💡 Dependency graph live render
  3. 💡 Heat map of code changes per file
  4. 💡 3D codebase visualization (gource-style)

Integrations

  1. 💡 Linear / Jira sync (pull tickets, update status)
  2. 💡 Slack bot
  3. 💡 Microsoft Teams bot
  4. 💡 Notion sync (PRD ↔ Notion page)
  5. 💡 Figma plugin (design → code via DEV agent)
  6. 💡 Storybook integration (component dev)
  7. 💡 Sentry integration (errors → fix queue)
  8. 💡 PagerDuty integration (incident → SRE agent)
  9. 💡 GitHub Copilot bridge (delegate to Surrogate for complex tasks)
  10. 💡 Cursor IDE integration

ML / Self-improvement

  1. 💡 RLHF from APPROVE/REWORK signals
  2. 💡 RLAIF (AI feedback on agent outputs)
  3. 💡 Continual pre-training on axentx code corpus
  4. 💡 Distillation (qwen-coder-30B → 7B for edge)
  5. 💡 Quantization-aware fine-tuning
  6. 💡 Speculative decoding for faster inference
  7. 💡 Mixture-of-experts custom training

Datasets

  1. 💡 Real-time scrape of GitHub trending (every 1h)
  2. 💡 Scrape Hacker News top stories daily
  3. 💡 Scrape Reddit r/programming weekly
  4. 💡 Scrape Twitter dev threads (X API tier 1 = $100/month, skip)
  5. 💡 Curated YouTube transcripts (developer talks, RustConf, KubeCon)
  6. 💡 Scrape arxiv-sanity for AI papers
  7. 💡 Crawl AWS/GCP/Azure docs nightly
  8. 💡 PR diff archive (axentx's own PRs as training data)
  9. 💡 Stack Overflow accepted answers (dump filter)
  10. 💡 GitHub issue resolutions (closed issue → PR linkage)

Cloud / Deployment

  1. 💡 Multi-region HF Spaces (ap-southeast + us-east + eu-west)
  2. 💡 K8s deployment manifests (move beyond HF when scale demands)
  3. 💡 Kubernetes operator for axentx orchestration
  4. 💡 Lambda@Edge for global low-latency inference
  5. 💡 IPFS publish of PRDs (decentralized)

Privacy + Security

  1. 💡 E2E encryption for Discord chat
  2. 💡 Air-gapped mode (Mac-only, no cloud)
  3. 💡 Federated learning (multiple users contribute, no central data)
  4. 💡 Zero-knowledge proofs for code provenance
  5. 💡 Confidential computing (Intel SGX) for sensitive code
  6. 💡 GDPR compliance toolkit (PII scrub, right-to-delete)
  7. 💡 SOC 2 Type II readiness checklist
  8. 💡 ISO 27001 audit prep

Specialty agents

  1. 💡 Compiler engineer (LLVM, optimization passes)
  2. 💡 Embedded systems (microcontroller code, real-time)
  3. 💡 Game dev (Unity, Unreal, Godot)
  4. 💡 Blockchain (Solidity, smart contracts, security)
  5. 💡 Quantum computing (Qiskit, circuits)
  6. 💡 Robotics (ROS, motion planning)
  7. 💡 Bioinformatics (BLAST, sequence analysis)
  8. 💡 Quantitative finance (backtesting, risk)
  9. 💡 Climate modeling
  10. 💡 Legal tech (contract review)

Education

  1. 💡 Teach mode (explain decisions step-by-step for learners)
  2. 💡 Pair programming mode (turn-taking with user)
  3. 💡 Code review school (annotated learning examples)
  4. 💡 Daily challenge generator (LeetCode-style, personalized)
  5. 💡 Concept explainer (DDD, hexagonal, CAP theorem on demand)

Productivity

  1. 💡 Calendar integration (block focus time when in flow)
  2. 💡 Pomodoro mode
  3. 💡 Energy/mood tracker (suggest break when fatigued)
  4. 💡 Distraction blocker (no Twitter when Surrogate active)
  5. 💡 Focus music generator (lo-fi via Suno API)

Emerging tech

  1. 💡 ASI safety guardrails (per Anthropic Constitutional AI)
  2. 💡 World model simulation (test ideas in synth environment)
  3. 💡 Causal reasoning (vs correlation)
  4. 💡 Theorem prover integration (Lean, Coq for verified code)
  5. 💡 Differential privacy in training
  6. 💡 Explainable AI for code reviews

Localization

  1. 💡 Thai-native pipeline (code and comments in Thai)
  2. 💡 Japanese, Korean, Chinese support
  3. 💡 RTL languages (Arabic, Hebrew)
  4. 💡 Local LLM Thai-fluent (typhoon, openthaigpt)
  5. 💡 Cultural code review (idioms per locale)

Marketing + community

  1. 💡 Public Surrogate-1 demo Space (read-only)
  2. 💡 Twitter bot posts daily Surrogate-1 wins
  3. 💡 GitHub discussions for community
  4. 💡 Discord server for users
  5. 💡 Newsletter (weekly improvements)
  6. 💡 Blog (axentx engineering)

Speculative

  1. 💡 Surrogate-2 (full local inference, no cloud dep)
  2. 💡 Custom silicon (qwen-coder optimized FPGA)
  3. 💡 BCI integration (Neuralink-style direct intent)
  4. 💡 Physical robot (Boston Dynamics + Surrogate brain)
  5. 💡 ASI alignment research collaboration

Current Cadence (auto-running on HF)

| Task | Frequency | Status |
|---|---|---|
| Continuous scrape | 8 workers, 5-30s cool-down | ✅ |
| Agentic crawler | 6 workers, BFS frontier | ✅ |
| Skill synthesis | every 3 min | ✅ |
| surrogate-dev-loop | every 2 min | ✅ |
| work-queue producer | every 5 min | ✅ |
| training-pair push to HF | every 3 min | ✅ |
| auto-orchestrate-loop | every 20 min | ✅ |
| research-apply | every 30 min | ✅ |
| keyword tuner | every 60 min | ✅ |
| research-loop | every 6h | ✅ |
| dataset-enrich | every 12h | ✅ |

Verified working (2026-04-28)

  • 5 commits to HF dataset in 12 min (~4047 pairs uploaded)
  • Pipeline produces real Python/Go code with DDD patterns
  • Reviewer issues APPROVE / REWORK / REJECT verdicts
  • Training feedback loop closing (every stage → HF)