# Surrogate-1 Feature Roadmap

**Updated**: 2026-04-28

**Status legend**: ✅ shipped · 🚧 in progress · ⏳ planned · 💡 idea

---

## 🟢 Already Shipped (Foundation)
### Pipeline (parallel orchestrate)
- ✅ 6-stage chain: SA → [Architect ∥ QA-TDD] → DEV → [QA-Verify ∥ OPS] → Reviewer
- ✅ Direct LLM call (skip broken tool-loop)
- ✅ Marker-extraction → real code blocks → real files in cwd
- ✅ Auto-commit + git push on APPROVE
- ✅ 12-rung LLM ladder (Cerebras / Groq / Gemini × 2 / Samba / GH Models / Chutes / OR × 2 / **HF Router × 4**)
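
The chain above alternates single stages with groups that run in parallel. A minimal sketch of that control flow, assuming each stage is an async call that receives the outputs of all earlier stages; `run_stage` is a hypothetical stand-in, not the project's actual stage runner:

```python
import asyncio

# Stage groups in execution order; names inside one inner list run concurrently.
CHAIN = [
    ["SA"],
    ["Architect", "QA-TDD"],
    ["DEV"],
    ["QA-Verify", "OPS"],
    ["Reviewer"],
]


async def run_stage(name: str, context: dict) -> str:
    """Hypothetical stand-in for one LLM-backed stage."""
    await asyncio.sleep(0)  # placeholder for the real model call
    return f"{name} saw {len(context)} prior outputs"


async def run_chain(chain=CHAIN) -> dict:
    context: dict = {}
    for group in chain:
        # Every stage in a group receives the same snapshot of earlier outputs.
        snapshot = dict(context)
        results = await asyncio.gather(*(run_stage(n, snapshot) for n in group))
        context.update(zip(group, results))
    return context


outputs = asyncio.run(run_chain())
```

Because both members of a parallel group read the same snapshot, neither can depend on the other's output, which is what makes running them side by side safe.
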
### Data + Knowledge
- ✅ 26 public datasets covering all SDLC domains
- ✅ Training-pair feedback loop (every stage → ~/.surrogate/training-pairs.jsonl → HF dataset every 3 min)
- ✅ Web research preamble (DDG search → context for PRD/orchestrate)
- ✅ Agentic crawler (URL frontier + visited stamps + BFS link discovery, 6 workers)
- ✅ Skill synthesis daemon (3-min cycles → ~/.surrogate/skills/{cat}/SKILL.md)
- ✅ Continuous scrape (8 workers, 5-30s cool-down)
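
The crawler's combination of URL frontier, visited stamps, and BFS link discovery can be illustrated with a single-worker sketch. It assumes a `get_links` callback in place of real HTTP fetching, and the toy `GRAPH` is invented for the example; the shipped crawler runs 6 workers and is not shown here:

```python
from collections import deque


def bfs_crawl(seeds, get_links, max_pages=100):
    """Visit pages breadth-first; get_links(url) returns outgoing URLs."""
    frontier = deque(seeds)
    visited = set(seeds)  # stamped when enqueued, so a URL is never queued twice
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order


# Toy in-memory link graph standing in for real pages.
GRAPH = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
crawl_order = bfs_crawl(["a"], lambda url: GRAPH.get(url, []))
```

Stamping a URL at enqueue time rather than at fetch time keeps the frontier free of duplicates even when many pages link to the same target.
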
### Models (Ollama on HF)
- ✅ qwen3-coder:30b-a3b (primary, 16GB MoE)
- ✅ devstral:24b (Mistral SWE-agent, 53.6% SWE-bench)
- ✅ qwen2.5-coder:14b (fallback)
- ✅ yi-coder:9b (128k context)
- ✅ nomic-embed-text (RAG embeddings)
### Agent Roster (19 SDLC experts)
- ✅ solution-architect, tech-architect (design)
- ✅ dev-frontend, dev-backend, dev-mobile, dev-fullstack, dev-database (impl)
- ✅ qa-engineer, qa-perf, qa-security (test)
- ✅ devops, sre, cloud-architect (infra)
- ✅ devsecops, cloud-security (security)
- ✅ data-engineer, ml-engineer (data/ML)
- ✅ tech-writer, reviewer (docs/gate)
### Infrastructure
- ✅ HF Space (CPU 16GB free) running 24/7
- ✅ /data persistent volume (state + logs + memory + skills + sessions + training-pairs)
- ✅ Backward-compat symlinks (~/.claude/* → ~/.surrogate/*)
- ✅ Mac CLI clean (20 essential files only, 118 daemons archived)
- ✅ Status server: /, /health, /logs/{name}, /logs-list

---
## 🔴 Must-Have (next 30 days)
### Reliability + Observability
1. ⏳ Heartbeat alarm → Discord webhook if HF Space down >5 min
2. ⏳ Auto-retry on transient errors (provider 429/503 → wait + retry next rung)
3. ⏳ Cost meter per stage (tokens × $/1M, alert >$1/day)
4. ⏳ Regression test suite (run nightly: orchestrate test fixtures, expect APPROVE)
5. ⏳ Dataset upload deduplication (md5 of slice → skip if same as last)
6. ⏳ Token-pool health check (rotate to next when 429)
7. ⏳ Disk usage alert (>80% /data → cleanup oldest scrape state)
8. ⏳ Memory leak watchdog (kill daemon RSS >1.5GB, restart)
9. ⏳ Crash recovery (auto-resume cron loop on SIGCHLD)
10. ⏳ Snapshot scrape ledger to HF dataset weekly
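
Item 2 (wait, then retry on the next rung after a 429/503) could look roughly like the sketch below. `ProviderError`, the rung callables, and the one-second wait are assumptions made for illustration; the real provider clients and their error types are not described in this roadmap:

```python
import time

TRANSIENT = {429, 503}


class ProviderError(Exception):
    """Hypothetical error type carrying the provider's HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


def call_with_ladder(prompt, rungs, wait_seconds=1.0, sleep=time.sleep):
    """Try each rung in order; on 429/503 wait, then move to the next rung."""
    last_error = None
    for name, call in rungs:
        try:
            return name, call(prompt)
        except ProviderError as err:
            if err.status not in TRANSIENT:
                raise  # non-transient errors are surfaced, not retried
            last_error = err
            sleep(wait_seconds)
    raise RuntimeError("all rungs exhausted") from last_error


# Fake rungs for demonstration only.
def busy(prompt):
    raise ProviderError(429)


def ok(prompt):
    return prompt.upper()


ladder_result = call_with_ladder(
    "hi", [("cerebras", busy), ("groq", ok)], sleep=lambda s: None
)
```

Passing `sleep` as a parameter keeps the wait out of tests; production code would leave the default in place.
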
### PRD + Project bootstrap
11. ⏳ Claude Projects-style PRD wizard (single description input → auto-extract → 1-3 follow-ups → PRD)
12. ⏳ PRD template library (web app / API / CLI / mobile / data pipeline / ML)
13. ⏳ Auto-detect existing repo → reverse-engineer surrogate.md
14. ⏳ PRD versioning (v1, v2 with diff)
15. ⏳ "Spec mode": refine PRD interactively before any code
### Pipeline quality
16. ⏳ Self-critique loop (after dev: model A reviews model B output → re-dev if NEEDS-WORK)
17. ⏳ Regression test on touched files (re-run existing tests)
18. ⏳ Lint + type-check + security scan in pipeline (ruff, mypy, semgrep)
19. ⏳ Diff approval UI (show changes before commit, esp. yolo mode)
20. ⏳ Search-replace block edits (Aider-style, less risky than full rewrite)
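
Item 20 is less risky than a full rewrite because an edit only applies when its search text is found unambiguously. A minimal sketch of that rule, assuming edits arrive as plain (search, replace) string pairs; the block syntax used by Aider itself is not reproduced here:

```python
def apply_edit(source: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of `search`; refuse missing or ambiguous matches."""
    count = source.count(search)
    if count != 1:
        raise ValueError(f"expected exactly 1 match for search block, found {count}")
    return source.replace(search, replace, 1)


original = "def add(a, b):\n    return a - b\n"
patched = apply_edit(original, "return a - b", "return a + b")
```

A failed match raises instead of silently writing the file, so the pipeline can ask the model for a corrected edit rather than committing a partial change.
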
### Domain expert routing
21. ⏳ Auto-route DEV stage to specialist (frontend/backend/mobile/iac) based on task keywords
22. ⏳ Multi-specialist parallel work (e.g., backend API + frontend UI in same task → spawn both)
23. ⏳ Specialist-specific eval (frontend agent → check WCAG; backend → check N+1)
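
Keyword-based routing (items 21 and 22) might be sketched as below. The agent names come from the roster earlier in this document, while the keyword sets and the `dev-fullstack` fallback are assumptions chosen only to make the example concrete:

```python
import re

# Hypothetical keyword lists; the real routing rules are not specified in this roadmap.
SPECIALISTS = {
    "dev-frontend": {"react", "css", "ui", "component"},
    "dev-backend": {"api", "endpoint", "database", "queue"},
    "dev-mobile": {"ios", "android", "flutter"},
    "devops": {"terraform", "kubernetes", "pipeline", "iac"},
}


def route(task: str, min_hits: int = 1) -> list[str]:
    """Return every specialist whose keywords appear in the task, best match first."""
    words = set(re.findall(r"[a-z0-9+#]+", task.lower()))
    scores = {name: len(words & kws) for name, kws in SPECIALISTS.items()}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    hits = [name for name, score in ranked if score >= min_hits]
    return hits or ["dev-fullstack"]  # fallback when nothing matches


mixed = route("Add a REST API endpoint and a React UI component")
```

Returning a list rather than a single name covers the multi-specialist case: a task that mentions both API and UI terms yields two agents to spawn.
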
### Memory + Context
24. ⏳ Episodic memory (last 50 sessions retrieval for similar tasks)
25. ⏳ Procedural memory (how-to library auto-generated from successful runs)
26. ⏳ Project context cache (surrogate.md + repo-map persisted across sessions)
27. ⏳ Cross-project pattern share (skill from project A → applicable to project B)
28. ⏳ Long-term retention (key decisions → ADR auto-generation)
### Self-improvement loop
29. ⏳ Reflexion lessons → injected into next-similar-task prompt
30. ⏳ Failed orchestrate → root-cause analysis → improvement queue
31. ⏳ Weekly LoRA fine-tune trigger (on accumulated training pairs, autotrain)
32. ⏳ A/B test prompts (variant A vs B, pick winner by APPROVE rate)
33. ⏳ Voyager-style skill crystallization (pattern repeated 3+ times → permanent skill)
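
The "repeated 3+ times → permanent skill" rule in item 33 reduces to a counter with a threshold. This sketch assumes patterns are already normalized to comparable strings, which is the hard part and is not addressed here; `SkillTracker` is a hypothetical name:

```python
from collections import Counter


class SkillTracker:
    """Promote a pattern to a permanent skill once it has been seen `threshold` times."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.seen = Counter()
        self.skills = set()

    def observe(self, pattern: str) -> bool:
        """Record one occurrence; return True only on the call that promotes the pattern."""
        self.seen[pattern] += 1
        if pattern not in self.skills and self.seen[pattern] >= self.threshold:
            self.skills.add(pattern)
            return True
        return False


tracker = SkillTracker()
promotions = [tracker.observe("retry-with-backoff") for _ in range(4)]
```
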
### Datasets + Training
34. ⏳ SRE postmortem corpus (scrape danluu/post-mortems → ~600 incidents → instruction pairs)
35. ⏳ AWS Well-Architected synthetic Q/A (PDFs → distilabel pipeline → 5k pairs)
36. ⏳ Internal axentx code → instruction pairs (commit messages + diffs)
37. ⏳ Training pair quality scoring (filter low-quality before HF upload)
38. ⏳ DPO preference pairs from reviewer (chosen/rejected from REWORK cycles)
39. ⏳ Synthetic ADR generation (real OSS examples → expand via distilabel)
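
For item 38, one plausible way to turn a REWORK cycle into preference pairs is to treat the finally approved output as "chosen" and each reworked draft as "rejected". The sketch assumes attempts are recorded as (output, verdict) tuples and uses the common `prompt`/`chosen`/`rejected` field layout; neither is confirmed by this roadmap:

```python
def dpo_pairs(prompt: str, attempts: list[tuple[str, str]]) -> list[dict]:
    """attempts: (output, verdict) in order. Pair each REWORK draft with the approved output."""
    approved = [out for out, verdict in attempts if verdict == "APPROVE"]
    if not approved:
        return []  # no accepted answer, so there is nothing to prefer
    chosen = approved[-1]
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": out}
        for out, verdict in attempts
        if verdict == "REWORK"
    ]


history = [("draft v1", "REWORK"), ("draft v2", "REWORK"), ("draft v3", "APPROVE")]
pairs = dpo_pairs("Implement the login endpoint", history)
```

Runs that end in REJECT with no approval produce no pairs, since there is no known-good output to prefer over the drafts.
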
### Tools + Integrations
40. ⏳ MCP client support (Claude Desktop schema → connect external tools)
41. ⏳ ToolSearch lazy-load (don't blow context on full tool list)
42. ⏳ Constitutional Critic from ~/.surrogate/agents/roster.json (auto-load)
43. ⏳ Repo-map context (tree-sitter symbol graph → smarter file selection)
44. ⏳ Tool-call traces saved as training data (every tool use → pair)
### Security + Safety
45. ⏳ Secret-scan pre-commit hook (gitleaks integration)
46. ⏳ Rate limit per-IP (HF Space /chat endpoint)
47. ⏳ Allowlist/denylist for git push (don't push to main without flag)
48. ⏳ PII scrubber for training pairs (remove emails, IPs, names before upload)
49. ⏳ Sandbox tool execution (no `rm -rf`, no `curl | sh`, no destructive ops)
50. ⏳ Audit log for every orchestrate run (who/what/when/result)
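
A first pass at the PII scrubber in item 48 could rely on regular expressions for emails and IPv4 addresses. The patterns and the `<EMAIL>`/`<IP>` placeholders below are illustrative assumptions; personal names cannot be caught this way and would need a separate NER step:

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub(text: str) -> str:
    """Mask emails and IPv4 addresses. Names are out of scope for this sketch."""
    text = EMAIL.sub("<EMAIL>", text)
    return IPV4.sub("<IP>", text)


cleaned = scrub("Contact ops@example.com from 10.0.0.12 about v1.2.3")
```

The IPv4 pattern requires four dot-separated groups, so three-part version strings such as `v1.2.3` are left alone.
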
### Multi-modal + I/O
51. ⏳ Voice input (Whisper transcribe → surrogate)
52. ⏳ Image input (architectural diagrams → analysis)
53. ⏳ Screen recording → video → tutorial agent
54. ⏳ Discord voice channel (TTS responses)
### CLI UX
55. ⏳ `/resume <session-id>` (continue past session)
56. ⏳ `/diff` (show pending changes before commit)
57. ⏳ `/undo` (rollback last orchestrate via git stash)
58. ⏳ `/share` (publish session as gist for review)
59. ⏳ Tab autocomplete for slash commands
60. ⏳ Cost-meter live in statusline (running $ this session)
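
The cost-meter items in this roadmap (tokens × $/1M per stage, an alert above $1/day, a running total for the statusline) share one calculation. A minimal sketch, in which the per-provider prices are placeholders and not real rates:

```python
# Hypothetical prices in USD per 1M tokens; real provider rates will differ.
PRICE_PER_1M = {"groq": 0.50, "gemini": 1.00}
DAILY_ALERT_USD = 1.00


def stage_cost(provider: str, tokens: int) -> float:
    """Cost of one call: tokens × price per million tokens."""
    return tokens * PRICE_PER_1M[provider] / 1_000_000


class CostMeter:
    def __init__(self):
        self.by_stage = {}

    def record(self, stage: str, provider: str, tokens: int) -> None:
        self.by_stage[stage] = self.by_stage.get(stage, 0.0) + stage_cost(provider, tokens)

    @property
    def total(self) -> float:
        return sum(self.by_stage.values())

    def over_budget(self) -> bool:
        return self.total > DAILY_ALERT_USD


meter = CostMeter()
meter.record("DEV", "groq", 500_000)         # 0.25 USD
meter.record("Reviewer", "gemini", 400_000)  # 0.40 USD
statusline = f"${meter.total:.2f} this session"
```
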
### Cloud / multi-region
61. ⏳ Mirror to Cloudflare Workers AI (free tier backup)
62. ⏳ Egress whitelist for Discord on HF Pro tier
63. ⏳ HF Space upgrade auto-scale (when load > 80%)
64. ⏳ Backup strategy: weekly snapshot of /data → HF dataset
### Codebase intelligence
65. ⏳ Symbol search (tree-sitter index, not just text grep)
66. ⏳ Cross-file refactor (rename across project safely)
67. ⏳ Type-aware code completion (LSP integration)
68. ⏳ Dead code detection (vulture, ts-prune)
69. ⏳ Dependency graph viz (per-project)
### Training data flywheel
70. ⏳ Trace storage on HF (axentx/surrogate-1-traces dataset)
71. ⏳ Auto-tag training pairs by domain (frontend/backend/etc)
72. ⏳ Quality gate before training pair upload (≥ N tokens, well-formed)
73. ⏳ Weekly eval on SWE-bench-Lite (track improvement)
74. ⏳ DPO data generation (REWORK cycles → preference pairs)
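
The quality gate in item 72 (at least N tokens, well-formed) can be sketched as a per-line filter over the JSONL file. The `prompt`/`response` field names and the whitespace-based token count are assumptions; the actual record schema is not given in this roadmap:

```python
import json


def passes_gate(line: str, min_tokens: int = 20) -> bool:
    """Accept a JSONL record only if it parses, has both fields, and is long enough."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    prompt, response = record.get("prompt"), record.get("response")
    if not (isinstance(prompt, str) and isinstance(response, str)):
        return False
    # Whitespace split is a crude token proxy; a real tokenizer would count differently.
    return len(prompt.split()) + len(response.split()) >= min_tokens


lines = [
    '{"prompt": "write a sort", "response": "def sort(xs): return sorted(xs)"}',
    '{"prompt": "missing response"}',
    "not json at all",
]
kept = [line for line in lines if passes_gate(line, min_tokens=5)]
```
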
### Discord + notifications
75. ⏳ Discord webhook for every commit (axentx repo notifications)
76. ⏳ Daily digest webhook (commits + pairs + scrape stats)
77. ⏳ Failure alerts (orchestrate fail → ping)
78. ⏳ Slash commands `/orchestrate "task"` from Discord
### HF integrations
79. ⏳ TEI server (text-embeddings-inference) for RAG
80. ⏳ TGI server (text-generation-inference) for self-hosted LLM
81. ⏳ autotrain weekly LoRA on training pairs
82. ⏳ HF Inference Providers as primary (paid bypass)
83. ⏳ HF Spaces gradio UI (visualize chain status)
### Agent quality
84. ⏳ Specialist eval per agent (e.g., dev-backend on RealWorld benchmark)
85. ⏳ Multi-model consensus on critical decisions (architecture, security)
86. ⏳ Constitutional rules (no hard-coded secrets, validate input)
87. ⏳ Tool use tracking per agent (which tools each agent calls)
88. ⏳ Persona consistency check (review for tone/style mid-thread)
### Project management
89. ⏳ Burndown chart per surrogate.md plan
90. ⏳ Story-point estimation from PRD
91. ⏳ Auto-create GitHub issues from `- [ ]` plan items
92. ⏳ PR description auto-write from commit list
93. ⏳ Sprint retrospective auto-summary
### Performance
94. ⏳ Profile + optimize orchestrate cycle time (target < 90s p50)
95. ⏳ Streaming responses (LLM tokens flow live, don't wait for full)
96. ⏳ Local cache for repeated identical prompts
97. ⏳ Parallel model calls (race fastest-first, kill rest)
98. ⏳ Edge inference (qwen3-coder on Cerebras WaferScale via API)
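
Item 97 (race the providers, keep the fastest, kill the rest) maps onto `asyncio.wait` with `FIRST_COMPLETED`. In this sketch `fake_model` stands in for real provider calls and is purely hypothetical; note that if the first task to finish raised an exception, `race` would re-raise it rather than fall back to a slower provider:

```python
import asyncio


async def race(calls):
    """Run all coroutines concurrently, return the first result, cancel the rest."""
    tasks = [asyncio.ensure_future(c) for c in calls]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Let the cancellations settle so no task is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()


async def fake_model(name: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return name


async def demo():
    return await race([fake_model("slow", 0.2), fake_model("fast", 0.01)])


first = asyncio.run(demo())
```
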
### Compliance + Governance
99. ⏳ License audit per file generated (OSS license compatibility)
100. ⏳ Commit signing (gpg/sigstore)

---
## 🟡 Nice-to-Have (future)
### Multi-agent collaboration
1. 💡 MoA (Mixture of Agents): 3 LLMs propose, judge picks best
2. 💡 Debate mode (2 agents argue, third synthesizes)
3. 💡 Tournament-style code review (3 reviewers, majority verdict)
4. 💡 Hierarchical agents (manager → workers → reporter)
5. 💡 Autonomous research squad (3 agents split topics, merge findings)
### UI / UX
6. 💡 Web dashboard (real-time pipeline status, training pair count, model health)
7. 💡 VSCode extension (`surrogate /auto` from editor)
8. 💡 IntelliJ plugin
9. 💡 Mobile app (iOS/Android) for on-the-go orchestrate
10. 💡 Apple Watch glance (current task status)
### Voice + Audio
11. 💡 Whisper realtime transcription
12. 💡 ElevenLabs TTS for status reports
13. 💡 Daily audio briefing podcast
14. 💡 Voice clone of user for replies
### Visual
15. 💡 Architecture diagram auto-generation (mermaid → SVG)
16. 💡 Dependency graph live render
17. 💡 Heat map of code changes per file
18. 💡 3D codebase visualization (gource-style)
### Integrations
19. 💡 Linear / Jira sync (pull tickets, update status)
20. 💡 Slack bot
21. 💡 Microsoft Teams bot
22. 💡 Notion sync (PRD → Notion page)
23. 💡 Figma plugin (design → code via DEV agent)
24. 💡 Storybook integration (component dev)
25. 💡 Sentry integration (errors → fix queue)
26. 💡 PagerDuty integration (incident → SRE agent)
27. 💡 GitHub Copilot bridge (delegate to Surrogate for complex)
28. 💡 Cursor IDE integration
### ML / Self-improvement
29. 💡 RLHF from APPROVE/REWORK signals
30. 💡 RLAIF (AI feedback on agent outputs)
31. 💡 Continual pre-training on axentx code corpus
32. 💡 Distillation (qwen-coder-30B → 7B for edge)
33. 💡 Quantization-aware fine-tuning
34. 💡 Speculative decoding for faster inference
35. 💡 Mixture-of-experts custom training
### Datasets
36. 💡 Real-time scrape of GitHub trending (every 1h)
37. 💡 Scrape Hacker News top stories daily
38. 💡 Scrape Reddit r/programming weekly
39. 💡 Scrape Twitter dev threads (X API tier 1 = $100/month, skip)
40. 💡 Curated YouTube transcripts (developer talks, RustConf, KubeCon)
41. 💡 Scrape arxiv-sanity for AI papers
42. 💡 Crawl AWS/GCP/Azure docs nightly
43. 💡 PR diff archive (axentx own PRs as training)
44. 💡 Stack Overflow accepted answers (dump filter)
45. 💡 GitHub issue resolutions (closed issue → PR linkage)
### Cloud / Deployment
46. 💡 Multi-region HF Spaces (ap-southeast + us-east + eu-west)
47. 💡 K8s deployment manifests (move beyond HF when scale demands)
48. 💡 Kubernetes operator for axentx orchestration
49. 💡 Lambda@Edge for global low-latency inference
50. 💡 IPFS publish of PRDs (decentralized)
### Privacy + Security
51. 💡 E2E encryption for Discord chat
52. 💡 Air-gapped mode (Mac-only, no cloud)
53. 💡 Federated learning (multiple users contribute, no central data)
54. 💡 Zero-knowledge proofs for code provenance
55. 💡 Confidential computing (Intel SGX) for sensitive code
56. 💡 GDPR compliance toolkit (PII scrub, right-to-delete)
57. 💡 SOC 2 Type II readiness checklist
58. 💡 ISO 27001 audit prep
### Specialty agents
59. 💡 Compiler engineer (LLVM, optimization passes)
60. 💡 Embedded systems (microcontroller code, real-time)
61. 💡 Game dev (Unity, Unreal, Godot)
62. 💡 Blockchain (Solidity, smart contracts, security)
63. 💡 Quantum computing (Qiskit, circuits)
64. 💡 Robotics (ROS, motion planning)
65. 💡 Bioinformatics (BLAST, sequence analysis)
66. 💡 Quantitative finance (backtesting, risk)
67. 💡 Climate modeling
68. 💡 Legal tech (contract review)
### Education
69. 💡 Teach mode (explain decisions step-by-step for learners)
70. 💡 Pair programming mode (turn-taking with user)
71. 💡 Code review school (annotated learning examples)
72. 💡 Daily challenge generator (LeetCode-style, personalized)
73. 💡 Concept explainer (DDD, hexagonal, CAP theorem on demand)
### Productivity
74. 💡 Calendar integration (block focus time when in flow)
75. 💡 Pomodoro mode
76. 💡 Energy/mood tracker (suggest break when fatigued)
77. 💡 Distraction blocker (no Twitter when Surrogate active)
78. 💡 Focus music generator (lo-fi via Suno API)
### Emerging tech
79. 💡 ASI safety guardrails (per Anthropic Constitutional AI)
80. 💡 World model simulation (test ideas in synth environment)
81. 💡 Causal reasoning (vs correlation)
82. 💡 Theorem prover integration (Lean, Coq for verified code)
83. 💡 Differential privacy in training
84. 💡 Explainable AI for code reviews
### Localization
85. 💡 Thai-native pipeline (code and comments in Thai)
86. 💡 Japanese, Korean, Chinese support
87. 💡 RTL languages (Arabic, Hebrew)
88. 💡 Local LLM Thai-fluent (typhoon, openthaigpt)
89. 💡 Cultural code review (idioms per locale)
### Marketing + community
90. 💡 Public Surrogate-1 demo Space (read-only)
91. 💡 Twitter bot posts daily Surrogate-1 wins
92. 💡 GitHub discussions for community
93. 💡 Discord server for users
94. 💡 Newsletter (weekly improvements)
95. 💡 Blog (axentx engineering)
### Speculative
96. 💡 Surrogate-2 (full local inference, no cloud dep)
97. 💡 Custom silicon (qwen-coder optimized FPGA)
98. 💡 BCI integration (Neuralink-style direct intent)
99. 💡 Physical robot (Boston Dynamics + Surrogate brain)
100. 💡 ASI alignment research collaboration

---
## Current Cadence (auto-running on HF)

| Task | Frequency | Status |
|---|---|---|
| Continuous scrape | 8 workers, 5-30s cool-down | ✅ |
| Agentic crawler | 6 workers, BFS frontier | ✅ |
| Skill synthesis | every 3 min | ✅ |
| surrogate-dev-loop | every 2 min | ✅ |
| work-queue producer | every 5 min | ✅ |
| training-pair push to HF | every 3 min | ✅ |
| auto-orchestrate-loop | every 20 min | ✅ |
| research-apply | every 30 min | ✅ |
| keyword tuner | every 60 min | ✅ |
| research-loop | every 6h | ✅ |
| dataset-enrich | every 12h | ✅ |

## Verified working (2026-04-28)
- 5 commits to HF dataset in 12 min (~4047 pairs uploaded)
- Pipeline produces real Python/Go code with DDD patterns
- Reviewer issues APPROVE / REWORK / REJECT verdicts
- Training feedback loop closing (every stage → HF)