Ashira Pitchayapakayakul
# Surrogate-1 Feature Roadmap
**Updated**: 2026-04-28
**Status legend**: ✅ shipped │ 🚧 in progress │ ⏳ planned │ 💡 idea
---
## 🟢 Already Shipped (Foundation)
### Pipeline (parallel orchestrate)
- ✅ 6-stage chain: SA → [Architect ∥ QA-TDD] → DEV → [QA-Verify ∥ OPS] → Reviewer
- ✅ Direct LLM call (skip broken tool-loop)
- ✅ Marker-extraction → real code blocks → real files in cwd
- ✅ Auto-commit + git push on APPROVE
- ✅ 12-rung LLM ladder (Cerebras / Groq / Gemini × 2 / Samba / GH Models / Chutes / OR × 2 / **HF Router × 4**)
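
The chain above can be sketched as an ordered list of steps, where each bracketed group runs concurrently. This is a minimal sketch under stated assumptions, not the project's orchestrator: `run_stage` is a hypothetical stand-in for the direct LLM call, and threads stand in for whatever concurrency the real pipeline uses.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list is one step; a step with several names runs them in parallel.
CHAIN = [["SA"], ["Architect", "QA-TDD"], ["DEV"], ["QA-Verify", "OPS"], ["Reviewer"]]


def run_stage(name: str, context: dict) -> str:
    """Hypothetical stand-in for the direct LLM call of one stage."""
    return f"{name} output (saw {len(context)} prior results)"


def run_chain(chain=CHAIN, stage_fn=run_stage) -> dict:
    """Run steps in order; stages inside one step share the same input context."""
    context: dict = {}
    with ThreadPoolExecutor() as pool:
        for step in chain:
            snapshot = dict(context)  # parallel stages must not see each other
            futures = {name: pool.submit(stage_fn, name, snapshot) for name in step}
            for name, future in futures.items():
                context[name] = future.result()
    return context
```

Passing a snapshot to each step keeps parallel stages independent: Architect and QA-TDD both see only the SA result.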
### Data + Knowledge
- ✅ 26 public datasets covering all SDLC domains
- ✅ Training-pair feedback loop (every stage → ~/.surrogate/training-pairs.jsonl → HF dataset every 3 min)
- ✅ Web research preamble (DDG search → context for PRD/orchestrate)
- ✅ Agentic crawler (URL frontier + visited stamps + BFS link discovery, 6 workers)
- ✅ Skill synthesis daemon (3-min cycles → ~/.surrogate/skills/{cat}/SKILL.md)
- ✅ Continuous scrape (8 workers, 5-30s cool-down)
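
The crawler's frontier, visited stamps, and BFS link discovery can be illustrated with a single-threaded sketch. It is not the shipped crawler: `get_links` is a hypothetical callback standing in for fetch-and-parse, the `GRAPH` dictionary is made-up demo data, and the 6-worker concurrency is left out.

```python
from collections import deque


def crawl_bfs(seed: str, get_links, max_pages: int = 100) -> list:
    """Visit pages breadth-first; `get_links(url)` returns outgoing URLs."""
    frontier = deque([seed])
    visited = {seed}  # visited stamps: a URL is never enqueued twice
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order


# Hypothetical in-memory link graph standing in for real HTTP fetches.
GRAPH = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
print(crawl_bfs("a", lambda u: GRAPH.get(u, [])))  # ['a', 'b', 'c', 'd']
```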
### Models (Ollama on HF)
- ✅ qwen3-coder:30b-a3b (primary, 16GB MoE)
- ✅ devstral:24b (Mistral SWE-agent, 53.6% SWE-bench)
- ✅ qwen2.5-coder:14b (fallback)
- ✅ yi-coder:9b (128k context)
- ✅ nomic-embed-text (RAG embeddings)
### Agent Roster (19 SDLC experts)
- ✅ solution-architect, tech-architect (design)
- ✅ dev-frontend, dev-backend, dev-mobile, dev-fullstack, dev-database (impl)
- ✅ qa-engineer, qa-perf, qa-security (test)
- ✅ devops, sre, cloud-architect (infra)
- ✅ devsecops, cloud-security (security)
- ✅ data-engineer, ml-engineer (data/ML)
- ✅ tech-writer, reviewer (docs/gate)
### Infrastructure
- ✅ HF Space (CPU 16GB free) running 24/7
- ✅ /data persistent volume (state + logs + memory + skills + sessions + training-pairs)
- ✅ Backward-compat symlinks (~/.claude/* → ~/.surrogate/*)
- ✅ Mac CLI clean (20 essential files only, 118 daemons archived)
- ✅ Status server: /, /health, /logs/{name}, /logs-list
---
## 🔴 Must-Have (next 30 days)
### Reliability + Observability
1. ⏳ Heartbeat alarm → Discord webhook if HF Space down >5 min
2. ⏳ Auto-retry on transient errors (provider 429/503 → wait + retry next rung)
3. ⏳ Cost meter per stage (tokens × $/1M, alert >$1/day)
4. ⏳ Regression test suite (run nightly: orchestrate test fixtures, expect APPROVE)
5. ⏳ Dataset upload deduplication (md5 of slice → skip if same as last)
6. ⏳ Token-pool health check (rotate to next when 429)
7. ⏳ Disk usage alert (>80% /data → cleanup oldest scrape state)
8. ⏳ Memory leak watchdog (kill daemon RSS >1.5GB, restart)
9. ⏳ Crash recovery (auto-resume cron loop on SIGCHLD)
10. ⏳ Snapshot scrape ledger to HF dataset weekly
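
Item 2 (wait, then retry on the next rung) could look roughly like the sketch below. Everything here is an assumption made for illustration: `TransientError`, `call_ladder`, and the `(name, callable)` rung shape are hypothetical, and a real version would map each provider's HTTP 429/503 responses onto the exception.

```python
import time


class TransientError(Exception):
    """Hypothetical error a provider call raises on a retryable status (429, 503)."""

    def __init__(self, status: int):
        super().__init__(f"transient HTTP {status}")
        self.status = status


def call_ladder(prompt: str, rungs, wait_s: float = 1.0, sleep=time.sleep) -> str:
    """Try each (name, call) rung in order; on a transient error, wait and move on."""
    errors = []
    for name, call in rungs:
        try:
            return call(prompt)
        except TransientError as exc:
            errors.append(f"{name}: {exc}")
            sleep(wait_s)  # brief pause before falling through to the next rung
    raise RuntimeError("all rungs failed: " + "; ".join(errors))
```

Only `TransientError` triggers a fall-through; any other exception propagates, so genuine bugs are not hidden behind retries. The `sleep` parameter is injectable so tests do not have to wait.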
### PRD + Project bootstrap
11. ⏳ Claude Projects-style PRD wizard (single description input → auto-extract → 1-3 follow-ups → PRD)
12. ⏳ PRD template library (web app / API / CLI / mobile / data pipeline / ML)
13. ⏳ Auto-detect existing repo → reverse-engineer surrogate.md
14. ⏳ PRD versioning (v1, v2 with diff)
15. ⏳ "Spec mode": refine PRD interactively before any code
### Pipeline quality
16. ⏳ Self-critique loop (after dev: model A reviews model B output → re-dev if NEEDS-WORK)
17. ⏳ Regression test on touched files (re-run existing tests)
18. ⏳ Lint + type-check + security scan in pipeline (ruff, mypy, semgrep)
19. ⏳ Diff approval UI (show changes before commit, esp. yolo mode)
20. ⏳ Search-replace block edits (Aider-style, less risky than full rewrite)
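
For item 20, the core of a search-replace edit is that the search text must match exactly once, which is what makes it less risky than a full rewrite. A minimal sketch, assuming a hypothetical helper name `apply_search_replace` and leaving out parsing of the block markers themselves:

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of `search`; refuse missing or ambiguous matches."""
    if not search:
        raise ValueError("search block must not be empty")
    count = source.count(search)
    if count == 0:
        raise ValueError("search block not found; the file may have changed")
    if count > 1:
        raise ValueError(f"search block matches {count} places; add more context")
    return source.replace(search, replace, 1)
```

Failing loudly on zero or multiple matches lets the pipeline ask the model for a more specific block instead of silently editing the wrong place.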
### Domain expert routing
21. ⏳ Auto-route DEV stage to specialist (frontend/backend/mobile/iac) based on task keywords
22. ⏳ Multi-specialist parallel work (e.g., backend API + frontend UI in same task → spawn both)
23. ⏳ Specialist-specific eval (frontend agent → check WCAG; backend → check N+1)
### Memory + Context
24. ⏳ Episodic memory (last 50 sessions retrieval for similar tasks)
25. ⏳ Procedural memory (how-to library auto-generated from successful runs)
26. ⏳ Project context cache (surrogate.md + repo-map persisted across sessions)
27. ⏳ Cross-project pattern share (skill from project A β†’ applicable to project B)
28. ⏳ Long-term retention (key decisions β†’ ADR auto-generation)
### Self-improvement loop
29. ⏳ Reflexion lessons → injected into next-similar-task prompt
30. ⏳ Failed orchestrate → root-cause analysis → improvement queue
31. ⏳ Weekly LoRA fine-tune trigger (on accumulated training pairs, autotrain)
32. ⏳ A/B test prompts (variant A vs B, pick winner by APPROVE rate)
33. ⏳ Voyager-style skill crystallization (pattern repeated 3+ times → permanent skill)
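
Item 32 (pick the winning prompt variant by APPROVE rate) reduces to counting verdicts per variant. The sketch below is one possible reading: `pick_winner`, the `(variant, verdict)` tuple shape, and the `min_runs` threshold are assumptions, and it does no significance testing.

```python
from collections import Counter


def pick_winner(results, min_runs: int = 20):
    """Return the variant with the highest APPROVE rate, or None if undecided.

    `results` is an iterable of (variant, verdict) tuples.
    """
    runs, approvals = Counter(), Counter()
    for variant, verdict in results:
        runs[variant] += 1
        if verdict == "APPROVE":
            approvals[variant] += 1
    # Wait until at least two variants each have enough runs to compare.
    if len(runs) < 2 or min(runs.values()) < min_runs:
        return None
    rates = {v: approvals[v] / runs[v] for v in runs}
    best = max(rates, key=rates.get)
    # A tie on the top rate is not a win.
    if sum(1 for r in rates.values() if r == rates[best]) > 1:
        return None
    return best
```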
### Datasets + Training
34. ⏳ SRE postmortem corpus (scrape danluu/post-mortems → ~600 incidents → instruction pairs)
35. ⏳ AWS Well-Architected synthetic Q/A (PDFs → distilabel pipeline → 5k pairs)
36. ⏳ Internal axentx code → instruction pairs (commit messages + diffs)
37. ⏳ Training pair quality scoring (filter low-quality before HF upload)
38. ⏳ DPO preference pairs from reviewer (chosen/rejected from REWORK cycles)
39. ⏳ Synthetic ADR generation (real OSS examples → expand via distilabel)
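
Item 38 (DPO preference pairs from REWORK cycles) can be sketched as pairing every reworked output of a task with the output that was finally approved. The attempt record fields (`task_id`, `prompt`, `output`, `verdict`) are hypothetical; the `prompt` / `chosen` / `rejected` output keys follow a commonly used preference-data layout.

```python
def build_dpo_pairs(attempts):
    """Turn a chronological list of attempts into chosen/rejected pairs.

    Each attempt is a dict with task_id, prompt, output, and verdict keys.
    """
    pairs = []
    rejected_by_task = {}
    for attempt in attempts:
        task = attempt["task_id"]
        if attempt["verdict"] == "REWORK":
            rejected_by_task.setdefault(task, []).append(attempt["output"])
        elif attempt["verdict"] == "APPROVE":
            # Every earlier REWORK output of this task loses to the approved one.
            for bad in rejected_by_task.pop(task, []):
                pairs.append(
                    {"prompt": attempt["prompt"], "chosen": attempt["output"], "rejected": bad}
                )
    return pairs
```

Tasks approved on the first try yield no pairs, and tasks that never reach APPROVE are dropped, since there is no known-good output to prefer.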
### Tools + Integrations
40. ⏳ MCP client support (Claude Desktop schema; connect external tools)
41. ⏳ ToolSearch lazy-load (don't blow context on full tool list)
42. ⏳ Constitutional Critic from ~/.surrogate/agents/roster.json (auto-load)
43. ⏳ Repo-map context (tree-sitter symbol graph → smarter file selection)
44. ⏳ Tool-call traces saved as training data (every tool use → pair)
### Security + Safety
45. ⏳ Secret-scan pre-commit hook (gitleaks integration)
46. ⏳ Rate limit per-IP (HF Space /chat endpoint)
47. ⏳ Allowlist/denylist for git push (don't push to main without flag)
48. ⏳ PII scrubber for training pairs (remove emails, IPs, names before upload)
49. ⏳ Sandbox tool execution (no rm -rf, no curl |sh, no destructive ops)
50. ⏳ Audit log for every orchestrate run (who/what/when/result)
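
For item 48, emails and IPv4 addresses can be masked with regular expressions, as in the rough sketch below. The patterns and placeholder tokens are assumptions, not a vetted scrubber: names need entity recognition that regexes cannot provide, and the IPv4 pattern will also mask four-part version strings such as `1.2.3.4`.

```python
import re

# Deliberately simple patterns; they favor over-masking over leaking.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub_pii(text: str) -> str:
    """Mask email addresses and IPv4 addresses before a training pair is uploaded."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IPV4_RE.sub("<IP>", text)


print(scrub_pii("mail bob@example.com from 10.0.0.1"))  # mail <EMAIL> from <IP>
```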
### Multi-modal + I/O
51. ⏳ Voice input (Whisper transcribe → surrogate)
52. ⏳ Image input (architectural diagrams → analysis)
53. ⏳ Screen recording → video → tutorial agent
54. ⏳ Discord voice channel (TTS responses)
### CLI UX
55. ⏳ /resume <session-id> (continue past session)
56. ⏳ /diff (show pending changes before commit)
57. ⏳ /undo (rollback last orchestrate via git stash)
58. ⏳ /share (publish session as gist for review)
59. ⏳ Tab autocomplete for slash commands
60. ⏳ Cost-meter live in statusline (running $ this session)
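
The running session cost in item 60 follows from the per-stage formula listed under Reliability (tokens × $/1M, alert above $1/day). A minimal sketch, assuming a hypothetical `CostMeter` class and caller-supplied prices; no real provider prices are implied.

```python
from dataclasses import dataclass, field


@dataclass
class CostMeter:
    # model -> (input $/1M tokens, output $/1M tokens); supplied by the caller
    price_per_million: dict
    daily_limit_usd: float = 1.0
    by_stage: dict = field(default_factory=dict)

    def record(self, stage: str, model: str, tokens_in: int, tokens_out: int) -> float:
        """Add one LLM call to the stage total and return its cost in USD."""
        p_in, p_out = self.price_per_million[model]
        cost = (tokens_in * p_in + tokens_out * p_out) / 1_000_000
        self.by_stage[stage] = self.by_stage.get(stage, 0.0) + cost
        return cost

    @property
    def total(self) -> float:
        """Running total across all stages, e.g. for a statusline."""
        return sum(self.by_stage.values())

    def over_limit(self) -> bool:
        return self.total > self.daily_limit_usd
```

A statusline would read `meter.total` after every call, and the alert path would check `meter.over_limit()`.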
### Cloud / multi-region
61. ⏳ Mirror to Cloudflare Workers AI (free tier backup)
62. ⏳ Egress whitelist for Discord on HF Pro tier
63. ⏳ HF Space upgrade auto-scale (when load > 80%)
64. ⏳ Backup strategy: weekly snapshot of /data → HF dataset
### Codebase intelligence
65. ⏳ Symbol search (tree-sitter index, not just text grep)
66. ⏳ Cross-file refactor (rename across project safely)
67. ⏳ Type-aware code completion (LSP integration)
68. ⏳ Dead code detection (vulture, ts-prune)
69. ⏳ Dependency graph viz (per-project)
### Training data flywheel
70. ⏳ Trace storage on HF (axentx/surrogate-1-traces dataset)
71. ⏳ Auto-tag training pairs by domain (frontend/backend/etc)
72. ⏳ Quality gate before training pair upload (≥ N tokens, well-formed)
73. ⏳ Weekly eval on SWE-bench-Lite (track improvement)
74. ⏳ DPO data generation (REWORK cycles → preference pairs)
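
The quality gate in item 72 (at least N tokens, well-formed) could be a per-line check on the JSONL file before upload. This sketch makes several assumptions: the `prompt` and `response` field names are hypothetical, whitespace-separated words stand in for real tokenizer counts, and the default threshold is arbitrary.

```python
import json


def passes_quality_gate(line: str, min_tokens: int = 32) -> bool:
    """Accept a JSONL line only if it parses and the response is long enough."""
    try:
        pair = json.loads(line)
    except json.JSONDecodeError:
        return False  # not well-formed JSON
    if not isinstance(pair, dict):
        return False
    prompt, response = pair.get("prompt"), pair.get("response")
    if not (isinstance(prompt, str) and isinstance(response, str)):
        return False
    if not prompt.strip():
        return False
    # Whitespace split is a crude proxy for a tokenizer count.
    return len(response.split()) >= min_tokens
```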
### Discord + notifications
75. ⏳ Discord webhook for every commit (axentx repo notifications)
76. ⏳ Daily digest webhook (commits + pairs + scrape stats)
77. ⏳ Failure alerts (orchestrate fail → ping)
78. ⏳ Slash commands `/orchestrate "task"` from Discord
### HF integrations
79. ⏳ TEI server (text-embeddings-inference) for RAG
80. ⏳ TGI server (text-generation-inference) for self-hosted LLM
81. ⏳ autotrain weekly LoRA on training pairs
82. ⏳ HF Inference Providers as primary (paid bypass)
83. ⏳ HF Spaces gradio UI (visualize chain status)
### Agent quality
84. ⏳ Specialist eval per agent (e.g., dev-backend on RealWorld benchmark)
85. ⏳ Multi-model consensus on critical decisions (architecture, security)
86. ⏳ Constitutional rules (no hard-coded secrets, validate input)
87. ⏳ Tool use tracking per agent (which tools each agent calls)
88. ⏳ Persona consistency check (review for tone/style mid-thread)
### Project management
89. ⏳ Burndown chart per surrogate.md plan
90. ⏳ Story-point estimation from PRD
91. ⏳ Auto-create GitHub issues from `- [ ]` plan items
92. ⏳ PR description auto-write from commit list
93. ⏳ Sprint retrospective auto-summary
### Performance
94. ⏳ Profile + optimize orchestrate cycle time (target < 90s p50)
95. ⏳ Streaming responses (LLM tokens flow live, don't wait for full)
96. ⏳ Local cache for repeated identical prompts
97. ⏳ Parallel model calls (race fastest-first, kill rest)
98. ⏳ Edge inference (qwen3-coder on Cerebras WaferScale via API)
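
Item 97 (race providers, keep the fastest, kill the rest) maps naturally onto `asyncio`. The sketch below assumes each provider is an async callable taking the prompt; `race` and the two demo providers are hypothetical names, and the sleeps stand in for network latency.

```python
import asyncio


async def race(prompt: str, providers):
    """Return the first successful result and cancel the calls still running."""
    tasks = [asyncio.create_task(p(prompt)) for p in providers]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except Exception:
                continue  # this provider failed; wait for the next one
        raise RuntimeError("all providers failed")
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished


# Hypothetical demo providers.
async def slow(prompt: str) -> str:
    await asyncio.sleep(0.2)
    return "slow"


async def fast(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return "fast"


print(asyncio.run(race("hi", [slow, fast])))  # fast
```

Racing trades extra tokens for latency, since every provider is billed for work that is thrown away.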
### Compliance + Governance
99. ⏳ License audit per file generated (OSS license compatibility)
100. ⏳ Commit signing (gpg/sigstore)
---
## 💡 Nice-to-Have (future)
### Multi-agent collaboration
1. 💡 MoA (Mixture of Agents): 3 LLMs propose, judge picks best
2. 💡 Debate mode (2 agents argue, third synthesizes)
3. 💡 Tournament-style code review (3 reviewers, majority verdict)
4. 💡 Hierarchical agents (manager → workers → reporter)
5. 💡 Autonomous research squad (3 agents split topics, merge findings)
### UI / UX
6. 💡 Web dashboard (real-time pipeline status, training pair count, model health)
7. 💡 VSCode extension (`surrogate /auto` from editor)
8. 💡 IntelliJ plugin
9. 💡 Mobile app (iOS/Android) for on-the-go orchestrate
10. 💡 Apple Watch glance (current task status)
### Voice + Audio
11. 💡 Whisper realtime transcription
12. 💡 ElevenLabs TTS for status reports
13. 💡 Daily audio briefing podcast
14. 💡 Voice clone of user for replies
### Visual
15. 💡 Architecture diagram auto-generation (mermaid → SVG)
16. 💡 Dependency graph live render
17. 💡 Heat map of code changes per file
18. 💡 3D codebase visualization (gource-style)
### Integrations
19. 💡 Linear / Jira sync (pull tickets, update status)
20. 💡 Slack bot
21. 💡 Microsoft Teams bot
22. 💡 Notion sync (PRD ↔ Notion page)
23. 💡 Figma plugin (design → code via DEV agent)
24. 💡 Storybook integration (component dev)
25. 💡 Sentry integration (errors → fix queue)
26. 💡 PagerDuty integration (incident → SRE agent)
27. 💡 GitHub Copilot bridge (delegate to Surrogate for complex tasks)
28. 💡 Cursor IDE integration
### ML / Self-improvement
29. 💡 RLHF from APPROVE/REWORK signals
30. 💡 RLAIF (AI feedback on agent outputs)
31. 💡 Continual pre-training on axentx code corpus
32. 💡 Distillation (qwen-coder-30B → 7B for edge)
33. 💡 Quantization-aware fine-tuning
34. 💡 Speculative decoding for faster inference
35. 💡 Mixture-of-experts custom training
### Datasets
36. 💡 Real-time scrape of GitHub trending (every 1h)
37. 💡 Scrape Hacker News top stories daily
38. 💡 Scrape Reddit r/programming weekly
39. 💡 Scrape Twitter dev threads (X API tier 1 = $100/m, skip)
40. 💡 Curated YouTube transcripts (developer talks, RustConf, KubeCon)
41. 💡 Scrape arxiv-sanity for AI papers
42. 💡 Crawl AWS/GCP/Azure docs nightly
43. 💡 PR diff archive (axentx own PRs as training)
44. 💡 Stack Overflow accepted answers (dump filter)
45. 💡 GitHub issue resolutions (closed issue → PR linkage)
### Cloud / Deployment
46. 💡 Multi-region HF Spaces (ap-southeast + us-east + eu-west)
47. 💡 K8s deployment manifests (move beyond HF when scale demands)
48. 💡 Kubernetes operator for axentx orchestration
49. 💡 Lambda@Edge for global low-latency inference
50. 💡 IPFS publish of PRDs (decentralized)
### Privacy + Security
51. 💡 E2E encryption for Discord chat
52. 💡 Air-gapped mode (Mac-only, no cloud)
53. 💡 Federated learning (multiple users contribute, no central data)
54. 💡 Zero-knowledge proofs for code provenance
55. 💡 Confidential computing (Intel SGX) for sensitive code
56. 💡 GDPR compliance toolkit (PII scrub, right-to-delete)
57. 💡 SOC 2 Type II readiness checklist
58. 💡 ISO 27001 audit prep
### Specialty agents
59. 💡 Compiler engineer (LLVM, optimization passes)
60. 💡 Embedded systems (microcontroller code, real-time)
61. 💡 Game dev (Unity, Unreal, Godot)
62. 💡 Blockchain (Solidity, smart contracts, security)
63. 💡 Quantum computing (Qiskit, circuits)
64. 💡 Robotics (ROS, motion planning)
65. 💡 Bioinformatics (BLAST, sequence analysis)
66. 💡 Quantitative finance (backtesting, risk)
67. 💡 Climate modeling
68. 💡 Legal tech (contract review)
### Education
69. 💡 Teach mode (explain decisions step-by-step for learners)
70. 💡 Pair programming mode (turn-taking with user)
71. 💡 Code review school (annotated learning examples)
72. 💡 Daily challenge generator (LeetCode-style, personalized)
73. 💡 Concept explainer (DDD, hexagonal, CAP theorem on demand)
### Productivity
74. 💡 Calendar integration (block focus time when in flow)
75. 💡 Pomodoro mode
76. 💡 Energy/mood tracker (suggest break when fatigued)
77. 💡 Distraction blocker (no Twitter when Surrogate active)
78. 💡 Focus music generator (lo-fi via Suno API)
### Emerging tech
79. 💡 ASI safety guardrails (per Anthropic Constitutional AI)
80. 💡 World model simulation (test ideas in synth environment)
81. 💡 Causal reasoning (vs correlation)
82. 💡 Theorem prover integration (Lean, Coq for verified code)
83. 💡 Differential privacy in training
84. 💡 Explainable AI for code reviews
### Localization
85. 💡 Thai-native pipeline (code and comments in Thai)
86. 💡 Japanese, Korean, Chinese support
87. 💡 RTL languages (Arabic, Hebrew)
88. 💡 Local LLM Thai-fluent (typhoon, openthaigpt)
89. 💡 Cultural code review (idioms per locale)
### Marketing + community
90. 💡 Public Surrogate-1 demo Space (read-only)
91. 💡 Twitter bot posts daily Surrogate-1 wins
92. 💡 GitHub discussions for community
93. 💡 Discord server for users
94. 💡 Newsletter (weekly improvements)
95. 💡 Blog (axentx engineering)
### Speculative
96. 💡 Surrogate-2 (full local inference, no cloud dep)
97. 💡 Custom silicon (qwen-coder optimized FPGA)
98. 💡 BCI integration (Neuralink-style direct intent)
99. 💡 Physical robot (Boston Dynamics + Surrogate brain)
100. 💡 ASI alignment research collaboration
---
## Current Cadence (auto-running on HF)
| Task | Frequency | Status |
|---|---|---|
| Continuous scrape | 8 workers, 5-30s cool-down | ✅ |
| Agentic crawler | 6 workers, BFS frontier | ✅ |
| Skill synthesis | every 3 min | ✅ |
| surrogate-dev-loop | every 2 min | ✅ |
| work-queue producer | every 5 min | ✅ |
| training-pair push to HF | every 3 min | ✅ |
| auto-orchestrate-loop | every 20 min | ✅ |
| research-apply | every 30 min | ✅ |
| keyword tuner | every 60 min | ✅ |
| research-loop | every 6h | ✅ |
| dataset-enrich | every 12h | ✅ |
## Verified working (2026-04-28)
- 5 commits to HF dataset in 12 min (~4047 pairs uploaded)
- Pipeline produces real Python/Go code with DDD patterns
- Reviewer issues APPROVE / REWORK / REJECT verdicts
- Training feedback loop closing (every stage → HF)