surrogate-1 / FEATURES.md

Ashira Pitchayapakayakul

docs: 100 must-have + 100 nice-to-have feature roadmap

a49e89a 10 days ago

	# Surrogate-1 Feature Roadmap

	Updated: 2026-04-28
	Status legend: ✅ shipped │ 🚧 in progress │ ⏳ planned │ 💡 idea

	---

	## 🟢 Already Shipped (Foundation)

	### Pipeline (parallel orchestrate)
	- ✅ 6-stage chain: SA → [Architect ∥ QA-TDD] → DEV → [QA-Verify ∥ OPS] → Reviewer
	- ✅ Direct LLM call (skip broken tool-loop)
	- ✅ Marker-extraction → real code blocks → real files in cwd
	- ✅ Auto-commit + git push on APPROVE
	- ✅ 12-rung LLM ladder (Cerebras / Groq / Gemini × 2 / Samba / GH Models / Chutes / OR × 2 / HF Router × 4)

	### Data + Knowledge
	- ✅ 26 public datasets covering all SDLC domains
	- ✅ Training-pair feedback loop (every stage → ~/.surrogate/training-pairs.jsonl → HF dataset every 3 min)
	- ✅ Web research preamble (DDG search → context for PRD/orchestrate)
	- ✅ Agentic crawler (URL frontier + visited stamps + BFS link discovery, 6 workers)
	- ✅ Skill synthesis daemon (3-min cycles → ~/.surrogate/skills/{cat}/SKILL.md)
	- ✅ Continuous scrape (8 workers, 5-30s cool-down)
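
The agentic crawler entry above describes a URL frontier, visited stamps, and BFS link discovery. A minimal single-threaded sketch of that idea follows. The `crawl_bfs` function, the `fetch_links` callback, and the in-memory `LINK_GRAPH` are assumptions made for illustration; the real crawler runs 6 workers and its code is not shown in this document.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_bfs(
    seeds: Iterable[str],
    fetch_links: Callable[[str], list[str]],
    max_pages: int = 100,
) -> list[str]:
    """Visit URLs breadth-first, never visiting the same URL twice."""
    frontier = deque(seeds)      # URLs waiting to be visited, oldest first
    visited: set[str] = set()    # "visited stamps"
    order: list[str] = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return order


# Hypothetical link graph standing in for real HTTP fetches.
LINK_GRAPH = {
    "a": ["b", "c"],
    "b": ["d", "a"],
    "c": ["d"],
    "d": [],
}

order = crawl_bfs(["a"], lambda url: LINK_GRAPH.get(url, []))
print(order)  # ['a', 'b', 'c', 'd']
```

A multi-worker version would need a shared, lock-protected frontier and visited set, which this sketch leaves out.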

	### Models (Ollama on HF)
	- ✅ qwen3-coder:30b-a3b (primary, 16GB MoE)
	- ✅ devstral:24b (Mistral SWE-agent, 53.6% on SWE-bench Verified)
	- ✅ qwen2.5-coder:14b (fallback)
	- ✅ yi-coder:9b (128k context)
	- ✅ nomic-embed-text (RAG embeddings)

	### Agent Roster (19 SDLC experts)
	- ✅ solution-architect, tech-architect (design)
	- ✅ dev-frontend, dev-backend, dev-mobile, dev-fullstack, dev-database (impl)
	- ✅ qa-engineer, qa-perf, qa-security (test)
	- ✅ devops, sre, cloud-architect (infra)
	- ✅ devsecops, cloud-security (security)
	- ✅ data-engineer, ml-engineer (data/ML)
	- ✅ tech-writer, reviewer (docs/gate)

	### Infrastructure
	- ✅ HF Space (CPU 16GB free) running 24/7
	- ✅ /data persistent volume (state + logs + memory + skills + sessions + training-pairs)
	- ✅ Backward-compat symlinks (~/.claude/* → ~/.surrogate/*)
	- ✅ Mac CLI cleaned up (only 20 essential files kept, 118 daemons archived)
	- ✅ Status server: /, /health, /logs/{name}, /logs-list

	---

	## 🔴 Must-Have (next 30 days)

	### Reliability + Observability
	1. ⏳ Heartbeat alarm → Discord webhook if HF Space down >5 min
	2. ⏳ Auto-retry on transient errors (provider 429/503 → wait + retry next rung)
	3. ⏳ Cost meter per stage (tokens × $/1M, alert >$1/day)
	4. ⏳ Regression test suite (run nightly: orchestrate test fixtures, expect APPROVE)
	5. ⏳ Dataset upload deduplication (md5 of slice → skip if same as last)
	6. ⏳ Token-pool health check (rotate to next when 429)
	7. ⏳ Disk usage alert (>80% /data → cleanup oldest scrape state)
	8. ⏳ Memory leak watchdog (kill daemon RSS >1.5GB, restart)
	9. ⏳ Crash recovery (auto-resume cron loop on SIGCHLD)
	10. ⏳ Snapshot scrape ledger to HF dataset weekly
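
Item 2 (on a provider 429/503, wait and retry on the next rung) could look roughly like the sketch below. `ProviderError`, `call_ladder`, and the rung names are hypothetical; real providers signal errors through their own client libraries, so the exception handling would need adapting.

```python
import time
from typing import Callable

TRANSIENT_STATUSES = {429, 503}


class ProviderError(Exception):
    """Hypothetical error carrying the HTTP status a provider returned."""

    def __init__(self, status: int):
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


def call_ladder(
    rungs: list[tuple[str, Callable[[str], str]]],
    prompt: str,
    wait_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try each rung in order; on a transient error, wait and move on."""
    last_error = None
    for name, call in rungs:
        try:
            return name, call(prompt)
        except ProviderError as err:
            if err.status not in TRANSIENT_STATUSES:
                raise  # non-transient errors are not retried
            last_error = err
            sleep(wait_seconds)
    raise RuntimeError("all rungs exhausted") from last_error


def rate_limited(prompt: str) -> str:
    raise ProviderError(429)


def healthy(prompt: str) -> str:
    return f"echo: {prompt}"


rung, reply = call_ladder(
    [("primary", rate_limited), ("backup", healthy)],
    "hi",
    sleep=lambda seconds: None,  # skip the real wait in this demo
)
print(rung, reply)  # backup echo: hi
```

Passing `sleep` as a parameter keeps the wait testable; a production version would probably add exponential backoff and honor a `Retry-After` header where the provider sends one.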

	### PRD + Project bootstrap
	11. ⏳ Claude Projects-style PRD wizard (single description input → auto-extract → 1-3 follow-ups → PRD)
	12. ⏳ PRD template library (web app / API / CLI / mobile / data pipeline / ML)
	13. ⏳ Auto-detect existing repo → reverse-engineer surrogate.md
	14. ⏳ PRD versioning (v1, v2 with diff)
	15. ⏳ "Spec mode" — refine PRD interactively before any code

	### Pipeline quality
	16. ⏳ Self-critique loop (after dev: model A reviews model B output → re-dev if NEEDS-WORK)
	17. ⏳ Regression test on touched files (re-run existing tests)
	18. ⏳ Lint + type-check + security scan in pipeline (ruff, mypy, semgrep)
	19. ⏳ Diff approval UI (show changes before commit, esp. yolo mode)
	20. ⏳ Search-replace block edits (Aider-style, less risky than full rewrite)
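
Item 20 proposes search-replace block edits instead of full-file rewrites. One reason this can be less risky is that an edit can be refused when the search text is missing or ambiguous. The sketch below illustrates that check; `apply_search_replace` is a hypothetical helper, not the pipeline's or Aider's actual implementation.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Replace `search` with `replace` only if `search` occurs exactly once."""
    occurrences = source.count(search)
    if occurrences != 1:
        raise ValueError(
            f"expected exactly 1 match for the search block, found {occurrences}"
        )
    return source.replace(search, replace, 1)


original = "def add(a, b):\n    return a - b\n"
patched = apply_search_replace(original, "return a - b", "return a + b")
print(patched)
```

Rejecting zero or multiple matches means a stale or hallucinated search block fails loudly instead of silently corrupting the file.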

	### Domain expert routing
	21. ⏳ Auto-route DEV stage to specialist (frontend/backend/mobile/iac) based on task keywords
	22. ⏳ Multi-specialist parallel work (e.g., backend API + frontend UI in same task → spawn both)
	23. ⏳ Specialist-specific eval (frontend agent → check WCAG; backend → check N+1)

	### Memory + Context
	24. ⏳ Episodic memory (last 50 sessions retrieval for similar tasks)
	25. ⏳ Procedural memory (how-to library auto-generated from successful runs)
	26. ⏳ Project context cache (surrogate.md + repo-map persisted across sessions)
	27. ⏳ Cross-project pattern share (skill from project A → applicable to project B)
	28. ⏳ Long-term retention (key decisions → ADR auto-generation)

	### Self-improvement loop
	29. ⏳ Reflexion lessons → injected into next-similar-task prompt
	30. ⏳ Failed orchestrate → root-cause analysis → improvement queue
	31. ⏳ Weekly LoRA fine-tune trigger (on accumulated training pairs, autotrain)
	32. ⏳ A/B test prompts (variant A vs B, pick winner by APPROVE rate)
	33. ⏳ Voyager-style skill crystallization (pattern repeated 3+ times → permanent skill)
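
Item 33 promotes a pattern to a permanent skill once it has repeated 3 or more times. The counting step could be sketched as below; `crystallize` and the pattern labels are hypothetical, and how patterns get detected and named in the first place is outside this sketch.

```python
from collections import Counter
from typing import Iterable

CRYSTALLIZE_AFTER = 3  # threshold taken from the roadmap item


def crystallize(
    observed_patterns: Iterable[str],
    known_skills: frozenset[str] = frozenset(),
) -> list[str]:
    """Return patterns seen often enough to become permanent skills."""
    counts = Counter(observed_patterns)
    return sorted(
        pattern
        for pattern, seen in counts.items()
        if seen >= CRYSTALLIZE_AFTER and pattern not in known_skills
    )


runs = [
    "retry-with-backoff",
    "paginate-api",
    "retry-with-backoff",
    "retry-with-backoff",
    "paginate-api",
]
new_skills = crystallize(runs)
print(new_skills)  # ['retry-with-backoff']
```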

	### Datasets + Training
	34. ⏳ SRE postmortem corpus (scrape danluu/post-mortems → ~600 incidents → instruction pairs)
	35. ⏳ AWS Well-Architected synthetic Q/A (PDFs → distilabel pipeline → 5k pairs)
	36. ⏳ Internal axentx code → instruction pairs (commit messages + diffs)
	37. ⏳ Training pair quality scoring (filter low-quality before HF upload)
	38. ⏳ DPO preference pairs from reviewer (chosen/rejected from REWORK cycles)
	39. ⏳ Synthetic ADR generation (real OSS examples → expand via distilabel)
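
Item 38 derives DPO preference pairs from REWORK cycles. One plausible reading is that the output that finally earns APPROVE is "chosen" and each earlier REWORK output for the same task is "rejected". The sketch below assumes that reading; the `Attempt` record and `build_preference_pairs` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    prompt: str
    output: str
    verdict: str  # "APPROVE", "REWORK", or "REJECT"


def build_preference_pairs(attempts: list[Attempt]) -> list[dict[str, str]]:
    """Pair each approved output with earlier reworked outputs for the same prompt."""
    pairs: list[dict[str, str]] = []
    reworked: dict[str, list[str]] = {}
    for attempt in attempts:
        if attempt.verdict == "REWORK":
            reworked.setdefault(attempt.prompt, []).append(attempt.output)
        elif attempt.verdict == "APPROVE":
            for rejected in reworked.pop(attempt.prompt, []):
                pairs.append(
                    {"prompt": attempt.prompt, "chosen": attempt.output, "rejected": rejected}
                )
    return pairs


history = [
    Attempt("add login", "v1", "REWORK"),
    Attempt("add login", "v2", "APPROVE"),
    Attempt("fix bug", "patch", "APPROVE"),  # approved first time: no pair
]
pairs = build_preference_pairs(history)
print(pairs)
```

Tasks approved on the first attempt produce no pair, since there is nothing to contrast the approved output with.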

	### Tools + Integrations
	40. ⏳ MCP client support (Claude Desktop schema — connect external tools)
	41. ⏳ ToolSearch lazy-load (don't blow context on full tool list)
	42. ⏳ Constitutional Critic from ~/.surrogate/agents/roster.json (auto-load)
	43. ⏳ Repo-map context (tree-sitter symbol graph → smarter file selection)
	44. ⏳ Tool-call traces saved as training data (every tool use → pair)

	### Security + Safety
	45. ⏳ Secret-scan pre-commit hook (gitleaks integration)
	46. ⏳ Rate limit per-IP (HF Space /chat endpoint)
	47. ⏳ Allowlist/denylist for git push (don't push to main without flag)
	48. ⏳ PII scrubber for training pairs (remove emails, IPs, names before upload)
	49. ⏳ Sandbox tool execution (no `rm -rf`, no `curl | sh`, no destructive ops)
	50. ⏳ Audit log for every orchestrate run (who/what/when/result)

	### Multi-modal + I/O
	51. ⏳ Voice input (Whisper transcribe → surrogate)
	52. ⏳ Image input (architectural diagrams → analysis)
	53. ⏳ Screen recording → video → tutorial agent
	54. ⏳ Discord voice channel (TTS responses)

	### CLI UX
	55. ⏳ /resume <session-id> (continue past session)
	56. ⏳ /diff (show pending changes before commit)
	57. ⏳ /undo (rollback last orchestrate via git stash)
	58. ⏳ /share (publish session as gist for review)
	59. ⏳ Tab autocomplete for slash commands
	60. ⏳ Cost-meter live in statusline (running $ this session)

	### Cloud / multi-region
	61. ⏳ Mirror to Cloudflare Workers AI (free tier backup)
	62. ⏳ Egress whitelist for Discord on HF Pro tier
	63. ⏳ HF Space upgrade auto-scale (when load > 80%)
	64. ⏳ Backup strategy: weekly snapshot of /data → HF dataset

	### Codebase intelligence
	65. ⏳ Symbol search (tree-sitter index, not just text grep)
	66. ⏳ Cross-file refactor (rename across project safely)
	67. ⏳ Type-aware code completion (LSP integration)
	68. ⏳ Dead code detection (vulture, ts-prune)
	69. ⏳ Dependency graph viz (per-project)

	### Training data flywheel
	70. ⏳ Trace storage on HF (axentx/surrogate-1-traces dataset)
	71. ⏳ Auto-tag training pairs by domain (frontend/backend/etc)
	72. ⏳ Quality gate before training pair upload (≥ N tokens, well-formed)
	73. ⏳ Weekly eval on SWE-bench-Lite (track improvement)
	74. ⏳ DPO data generation (REWORK cycles → preference pairs)
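
Item 72 gates uploads on pairs being well-formed and at least N tokens long. A rough sketch for one line of a JSONL file is shown below. The `prompt` and `response` field names, the default of 8 tokens, and the whitespace-split token count are all assumptions; the roadmap leaves N and the pair schema unspecified.

```python
import json


def passes_gate(line: str, min_tokens: int = 8) -> bool:
    """Accept a JSONL line only if it parses and is long enough."""
    try:
        pair = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(pair, dict):
        return False
    prompt, response = pair.get("prompt"), pair.get("response")
    if not isinstance(prompt, str) or not isinstance(response, str):
        return False
    # Whitespace-separated words as a cheap stand-in for model tokens.
    return len(prompt.split()) + len(response.split()) >= min_tokens


good = '{"prompt": "Write a function that adds two numbers", "response": "def add(a, b): return a + b"}'
print(passes_gate(good))                                  # True
print(passes_gate('{"prompt": "hi", "response": "ok"}'))  # False
print(passes_gate("not json"))                            # False
```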

	### Discord + notifications
	75. ⏳ Discord webhook for every commit (axentx repo notifications)
	76. ⏳ Daily digest webhook (commits + pairs + scrape stats)
	77. ⏳ Failure alerts (orchestrate fail → ping)
	78. ⏳ Slash commands `/orchestrate "task"` from Discord

	### HF integrations
	79. ⏳ TEI server (text-embeddings-inference) for RAG
	80. ⏳ TGI server (text-generation-inference) for self-hosted LLM
	81. ⏳ autotrain weekly LoRA on training pairs
	82. ⏳ HF Inference Providers as primary (paid bypass)
	83. ⏳ HF Spaces gradio UI (visualize chain status)

	### Agent quality
	84. ⏳ Specialist eval per agent (e.g., dev-backend on RealWorld benchmark)
	85. ⏳ Multi-model consensus on critical decisions (architecture, security)
	86. ⏳ Constitutional rules (no hard-coded secrets, validate input)
	87. ⏳ Tool use tracking per agent (which tools each agent calls)
	88. ⏳ Persona consistency check (review for tone/style mid-thread)

	### Project management
	89. ⏳ Burndown chart per surrogate.md plan
	90. ⏳ Story-point estimation from PRD
	91. ⏳ Auto-create GitHub issues from `- [ ]` plan items
	92. ⏳ PR description auto-write from commit list
	93. ⏳ Sprint retrospective auto-summary

	### Performance
	94. ⏳ Profile + optimize orchestrate cycle time (target < 90s p50)
	95. ⏳ Streaming responses (LLM tokens flow live, without waiting for the full response)
	96. ⏳ Local cache for repeated identical prompts
	97. ⏳ Parallel model calls (race fastest-first, kill rest)
	98. ⏳ Edge inference (qwen3-coder on Cerebras WaferScale via API)

	### Compliance + Governance
	99. ⏳ License audit per file generated (OSS license compatibility)
	100. ⏳ Commit signing (gpg/sigstore)

	---

	## 💡 Nice-to-Have (future)

	### Multi-agent collaboration
	1. 💡 MoA (Mixture of Agents) — 3 LLMs propose, judge picks best
	2. 💡 Debate mode (2 agents argue, third synthesizes)
	3. 💡 Tournament-style code review (3 reviewers, majority verdict)
	4. 💡 Hierarchical agents (manager → workers → reporter)
	5. 💡 Autonomous research squad (3 agents split topics, merge findings)

	### UI / UX
	6. 💡 Web dashboard (real-time pipeline status, training pair count, model health)
	7. 💡 VSCode extension (`surrogate /auto` from editor)
	8. 💡 IntelliJ plugin
	9. 💡 Mobile app (iOS/Android) for on-the-go orchestrate
	10. 💡 Apple Watch glance (current task status)

	### Voice + Audio
	11. 💡 Whisper realtime transcription
	12. 💡 ElevenLabs TTS for status reports
	13. 💡 Daily audio briefing podcast
	14. 💡 Voice clone of user for replies

	### Visual
	15. 💡 Architecture diagram auto-generation (mermaid → SVG)
	16. 💡 Dependency graph live render
	17. 💡 Heat map of code changes per file
	18. 💡 3D codebase visualization (gource-style)

	### Integrations
	19. 💡 Linear / Jira sync (pull tickets, update status)
	20. 💡 Slack bot
	21. 💡 Microsoft Teams bot
	22. 💡 Notion sync (PRD ↔ Notion page)
	23. 💡 Figma plugin (design → code via DEV agent)
	24. 💡 Storybook integration (component dev)
	25. 💡 Sentry integration (errors → fix queue)
	26. 💡 PagerDuty integration (incident → SRE agent)
	27. 💡 GitHub Copilot bridge (delegate complex tasks to Surrogate)
	28. 💡 Cursor IDE integration

	### ML / Self-improvement
	29. 💡 RLHF from APPROVE/REWORK signals
	30. 💡 RLAIF (AI feedback on agent outputs)
	31. 💡 Continual pre-training on axentx code corpus
	32. 💡 Distillation (qwen-coder-30B → 7B for edge)
	33. 💡 Quantization-aware fine-tuning
	34. 💡 Speculative decoding for faster inference
	35. 💡 Mixture-of-experts custom training

	### Datasets
	36. 💡 Real-time scrape of GitHub trending (every 1h)
	37. 💡 Scrape Hacker News top stories daily
	38. 💡 Scrape Reddit r/programming weekly
	39. 💡 Scrape Twitter dev threads (X API tier 1 = $100/month; skip for now)
	40. 💡 Curated YouTube transcripts (developer talks, RustConf, KubeCon)
	41. 💡 Scrape arxiv-sanity for AI papers
	42. 💡 Crawl AWS/GCP/Azure docs nightly
	43. 💡 PR diff archive (axentx own PRs as training)
	44. 💡 Stack Overflow accepted answers (filtered from the data dump)
	45. 💡 GitHub issue resolutions (closed issue → PR linkage)

	### Cloud / Deployment
	46. 💡 Multi-region HF Spaces (ap-southeast + us-east + eu-west)
	47. 💡 K8s deployment manifests (move beyond HF when scale demands)
	48. 💡 Kubernetes operator for axentx orchestration
	49. 💡 Lambda@Edge for global low-latency inference
	50. 💡 IPFS publish of PRDs (decentralized)

	### Privacy + Security
	51. 💡 E2E encryption for Discord chat
	52. 💡 Air-gapped mode (Mac-only, no cloud)
	53. 💡 Federated learning (multiple users contribute, no central data)
	54. 💡 Zero-knowledge proofs for code provenance
	55. 💡 Confidential computing (Intel SGX) for sensitive code
	56. 💡 GDPR compliance toolkit (PII scrub, right-to-delete)
	57. 💡 SOC 2 Type II readiness checklist
	58. 💡 ISO 27001 audit prep

	### Specialty agents
	59. 💡 Compiler engineer (LLVM, optimization passes)
	60. 💡 Embedded systems (microcontroller code, real-time)
	61. 💡 Game dev (Unity, Unreal, Godot)
	62. 💡 Blockchain (Solidity, smart contracts, security)
	63. 💡 Quantum computing (Qiskit, circuits)
	64. 💡 Robotics (ROS, motion planning)
	65. 💡 Bioinformatics (BLAST, sequence analysis)
	66. 💡 Quantitative finance (backtesting, risk)
	67. 💡 Climate modeling
	68. 💡 Legal tech (contract review)

	### Education
	69. 💡 Teach mode (explain decisions step-by-step for learners)
	70. 💡 Pair programming mode (turn-taking with user)
	71. 💡 Code review school (annotated learning examples)
	72. 💡 Daily challenge generator (LeetCode-style, personalized)
	73. 💡 Concept explainer (DDD, hexagonal, CAP theorem on demand)

	### Productivity
	74. 💡 Calendar integration (block focus time when in flow)
	75. 💡 Pomodoro mode
	76. 💡 Energy/mood tracker (suggest break when fatigued)
	77. 💡 Distraction blocker (no Twitter when Surrogate active)
	78. 💡 Focus music generator (lo-fi via Suno API)

	### Emerging tech
	79. 💡 ASI safety guardrails (per Anthropic Constitutional AI)
	80. 💡 World model simulation (test ideas in synth environment)
	81. 💡 Causal reasoning (vs correlation)
	82. 💡 Theorem prover integration (Lean, Coq for verified code)
	83. 💡 Differential privacy in training
	84. 💡 Explainable AI for code reviews

	### Localization
	85. 💡 Thai-native pipeline (code and comments in Thai)
	86. 💡 Japanese, Korean, Chinese support
	87. 💡 RTL languages (Arabic, Hebrew)
	88. 💡 Local LLM Thai-fluent (typhoon, openthaigpt)
	89. 💡 Cultural code review (idioms per locale)

	### Marketing + community
	90. 💡 Public Surrogate-1 demo Space (read-only)
	91. 💡 Twitter bot posts daily Surrogate-1 wins
	92. 💡 GitHub discussions for community
	93. 💡 Discord server for users
	94. 💡 Newsletter (weekly improvements)
	95. 💡 Blog (axentx engineering)

	### Speculative
	96. 💡 Surrogate-2 (full local inference, no cloud dep)
	97. 💡 Custom silicon (qwen-coder optimized FPGA)
	98. 💡 BCI integration (Neuralink-style direct intent)
	99. 💡 Physical robot (Boston Dynamics + Surrogate brain)
	100. 💡 ASI alignment research collaboration

	---

	## Current Cadence (auto-running on HF)

	| Task | Frequency | Status |
	|---|---|---|
	| Continuous scrape | 8 workers, 5-30s cool-down | ✅ |
	| Agentic crawler | 6 workers, BFS frontier | ✅ |
	| Skill synthesis | every 3 min | ✅ |
	| surrogate-dev-loop | every 2 min | ✅ |
	| work-queue producer | every 5 min | ✅ |
	| training-pair push to HF | every 3 min | ✅ |
	| auto-orchestrate-loop | every 20 min | ✅ |
	| research-apply | every 30 min | ✅ |
	| keyword tuner | every 60 min | ✅ |
	| research-loop | every 6h | ✅ |
	| dataset-enrich | every 12h | ✅ |

	## Verified working (2026-04-28)
	- 5 commits to HF dataset in 12 min (~4047 pairs uploaded)
	- Pipeline produces real Python/Go code with DDD patterns
	- Reviewer issues APPROVE / REWORK / REJECT verdicts
	- Training feedback loop closing (every stage → HF)