Surrogate-1 Feature Roadmap
Updated: 2026-04-28
Status legend: ✅ shipped, 🚧 in progress, ⏳ planned, 💡 idea
🟢 Already Shipped (Foundation)
Pipeline (parallel orchestrate)
- ✅ 6-stage chain: SA → [Architect ∥ QA-TDD] → DEV → [QA-Verify ∥ OPS] → Reviewer
- ✅ Direct LLM call (skip broken tool-loop)
- ✅ Marker-extraction → real code blocks → real files in cwd
- ✅ Auto-commit + git push on APPROVE
- ✅ 12-rung LLM ladder (Cerebras / Groq / Gemini × 2 / Samba / GH Models / Chutes / OR × 2 / HF Router × 4)
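
The 6-stage chain above runs two pairs of stages side by side. A minimal sketch of that ordering with `asyncio`, assuming a hypothetical `run_stage` coroutine in place of the real LLM call; the stage names come from the roadmap, everything else is illustrative:

```python
import asyncio


async def run_stage(name: str, context: str) -> str:
    """Hypothetical stand-in for one pipeline stage (a real one would call an LLM)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{name}({context})"


async def orchestrate(task: str) -> dict[str, str]:
    """Run SA -> [Architect || QA-TDD] -> DEV -> [QA-Verify || OPS] -> Reviewer."""
    out: dict[str, str] = {}
    out["SA"] = await run_stage("SA", task)
    # Architect and QA-TDD depend only on the SA output, so they run concurrently.
    out["Architect"], out["QA-TDD"] = await asyncio.gather(
        run_stage("Architect", out["SA"]),
        run_stage("QA-TDD", out["SA"]),
    )
    out["DEV"] = await run_stage("DEV", out["Architect"] + "+" + out["QA-TDD"])
    # QA-Verify and OPS depend only on the DEV output.
    out["QA-Verify"], out["OPS"] = await asyncio.gather(
        run_stage("QA-Verify", out["DEV"]),
        run_stage("OPS", out["DEV"]),
    )
    out["Reviewer"] = await run_stage("Reviewer", out["QA-Verify"] + "+" + out["OPS"])
    return out
```

Calling `asyncio.run(orchestrate("some task"))` returns one entry per stage, in the order the stages were started.
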
Data + Knowledge
- ✅ 26 public datasets covering all SDLC domains
- ✅ Training-pair feedback loop (every stage → ~/.surrogate/training-pairs.jsonl → HF dataset every 3 min)
- ✅ Web research preamble (DDG search → context for PRD/orchestrate)
- ✅ Agentic crawler (URL frontier + visited stamps + BFS link discovery, 6 workers)
- ✅ Skill synthesis daemon (3-min cycles → ~/.surrogate/skills/{cat}/SKILL.md)
- ✅ Continuous scrape (8 workers, 5-30s cool-down)
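
The agentic crawler item combines a URL frontier, visited stamps, and BFS link discovery. A single-worker sketch of how those pieces could fit together, assuming a hypothetical `fetch_links` callable and a toy in-memory link graph instead of real HTTP fetches (the 6-worker pool is not modeled here):

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], list[str]],
    max_pages: int = 100,
) -> list[str]:
    """Visit URLs breadth-first, never fetching the same URL twice."""
    frontier = deque(seeds)      # URL frontier; FIFO order gives BFS
    visited: set[str] = set()    # stands in for the "visited stamps"
    order: list[str] = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return order


# Toy link graph used in place of network access (illustrative data only).
GRAPH = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
print(bfs_crawl(["a"], lambda u: GRAPH.get(u, [])))  # ['a', 'b', 'c', 'd']
```
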
Models (Ollama on HF)
- ✅ qwen3-coder:30b-a3b (primary, 16GB MoE)
- ✅ devstral:24b (Mistral SWE-agent, 53.6% SWE-bench)
- ✅ qwen2.5-coder:14b (fallback)
- ✅ yi-coder:9b (128k context)
- ✅ nomic-embed-text (RAG embeddings)
Agent Roster (19 SDLC experts)
- ✅ solution-architect, tech-architect (design)
- ✅ dev-frontend, dev-backend, dev-mobile, dev-fullstack, dev-database (impl)
- ✅ qa-engineer, qa-perf, qa-security (test)
- ✅ devops, sre, cloud-architect (infra)
- ✅ devsecops, cloud-security (security)
- ✅ data-engineer, ml-engineer (data/ML)
- ✅ tech-writer, reviewer (docs/gate)
Infrastructure
- ✅ HF Space (CPU 16GB free) running 24/7
- ✅ /data persistent volume (state + logs + memory + skills + sessions + training-pairs)
- ✅ Backward-compat symlinks (~/.claude/* → ~/.surrogate/*)
- ✅ Mac CLI clean (20 essential files only, 118 daemons archived)
- ✅ Status server: /, /health, /logs/{name}, /logs-list
🔴 Must-Have (next 30 days)
Reliability + Observability
- ⏳ Heartbeat alarm → Discord webhook if HF Space down >5 min
- ⏳ Auto-retry on transient errors (provider 429/503 → wait + retry next rung)
- ⏳ Cost meter per stage (tokens × $/1M, alert >$1/day)
- ⏳ Regression test suite (run nightly: orchestrate test fixtures, expect APPROVE)
- ⏳ Dataset upload deduplication (md5 of slice → skip if same as last)
- ⏳ Token-pool health check (rotate to next when 429)
- ⏳ Disk usage alert (>80% /data → cleanup oldest scrape state)
- ⏳ Memory leak watchdog (kill daemon RSS >1.5GB, restart)
- ⏳ Crash recovery (auto-resume cron loop on SIGCHLD)
- ⏳ Snapshot scrape ledger to HF dataset weekly
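
The auto-retry and token-pool items above both amount to falling through the provider ladder on transient errors. A minimal sketch, assuming each rung is a plain callable and that failures surface as a hypothetical `ProviderError` carrying the HTTP status; neither name comes from the Surrogate-1 codebase:

```python
import time
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error type carrying the HTTP status returned by a provider."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


TRANSIENT = {429, 503}  # statuses the roadmap treats as retryable


def call_with_ladder(
    prompt: str,
    rungs: Sequence[Callable[[str], str]],
    wait_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each rung in order; on 429/503 wait, then fall through to the next rung."""
    last_error: Optional[Exception] = None
    for call in rungs:
        try:
            return call(prompt)
        except ProviderError as err:
            if err.status not in TRANSIENT:
                raise  # non-transient errors are not retried
            last_error = err
            sleep(wait_s)
    raise RuntimeError("all rungs exhausted") from last_error
```

The `sleep` parameter is injectable so the back-off can be replaced in tests; a real version would probably add jitter or honor a `Retry-After` header where the provider sends one.
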
PRD + Project bootstrap
- ⏳ Claude Projects-style PRD wizard (single description input → auto-extract → 1-3 follow-ups → PRD)
- ⏳ PRD template library (web app / API / CLI / mobile / data pipeline / ML)
- ⏳ Auto-detect existing repo → reverse-engineer surrogate.md
- ⏳ PRD versioning (v1, v2 with diff)
- ⏳ "Spec mode": refine PRD interactively before any code
Pipeline quality
- ⏳ Self-critique loop (after dev: model A reviews model B output → re-dev if NEEDS-WORK)
- ⏳ Regression test on touched files (re-run existing tests)
- ⏳ Lint + type-check + security scan in pipeline (ruff, mypy, semgrep)
- ⏳ Diff approval UI (show changes before commit, esp. yolo mode)
- ⏳ Search-replace block edits (Aider-style, less risky than full rewrite)
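
The search-replace item is less risky than a full rewrite because an edit either applies to exactly one place or is refused. A minimal sketch of that idea; it is not Aider's actual edit format or implementation, and the function name is hypothetical:

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of `search`; refuse missing or ambiguous matches."""
    if not search:
        raise ValueError("search block must not be empty")
    count = source.count(search)
    if count != 1:
        raise ValueError(f"expected exactly 1 match for search block, found {count}")
    return source.replace(search, replace, 1)


original = "def add(a, b):\n    return a - b\n"
patched = apply_search_replace(original, "return a - b", "return a + b")
print(patched)
```

Rejecting zero or multiple matches means a stale or hallucinated search block fails loudly instead of silently editing the wrong location.
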
Domain expert routing
- ⏳ Auto-route DEV stage to specialist (frontend/backend/mobile/iac) based on task keywords
- ⏳ Multi-specialist parallel work (e.g., backend API + frontend UI in same task → spawn both)
- ⏳ Specialist-specific eval (frontend agent → check WCAG; backend → check N+1)
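
Keyword-based routing can return more than one specialist, which also covers the multi-specialist case above. A minimal sketch: the agent names come from the roster, while the keyword sets, the `devops` mapping for IaC, and the `route` function are assumptions made for illustration:

```python
import re

# Hypothetical keyword sets; a real router would likely tune or learn these.
SPECIALIST_KEYWORDS: dict[str, set[str]] = {
    "dev-frontend": {"react", "css", "ui", "component"},
    "dev-backend": {"api", "endpoint", "database", "auth"},
    "dev-mobile": {"ios", "android", "mobile"},
    "devops": {"terraform", "iac", "kubernetes"},
}


def route(task: str, default: str = "dev-fullstack") -> list[str]:
    """Return every specialist whose keywords appear in the task, best match first."""
    words = set(re.findall(r"[a-z0-9]+", task.lower()))
    hits = {agent: len(words & kws) for agent, kws in SPECIALIST_KEYWORDS.items()}
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    matched = [agent for agent, n in ranked if n > 0]
    return matched or [default]


print(route("Add a REST API endpoint and a React UI component"))
# ['dev-frontend', 'dev-backend']
```
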
Memory + Context
- ⏳ Episodic memory (last 50 sessions retrieval for similar tasks)
- ⏳ Procedural memory (how-to library auto-generated from successful runs)
- ⏳ Project context cache (surrogate.md + repo-map persisted across sessions)
- ⏳ Cross-project pattern share (skill from project A → applicable to project B)
- ⏳ Long-term retention (key decisions → ADR auto-generation)
Self-improvement loop
- ⏳ Reflexion lessons → injected into next-similar-task prompt
- ⏳ Failed orchestrate → root-cause analysis → improvement queue
- ⏳ Weekly LoRA fine-tune trigger (on accumulated training pairs, autotrain)
- ⏳ A/B test prompts (variant A vs B, pick winner by APPROVE rate)
- ⏳ Voyager-style skill crystallization (pattern repeated 3+ times → permanent skill)
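
The skill crystallization rule (a pattern repeated 3+ times becomes a permanent skill) reduces to a counter with a promotion threshold. A minimal sketch, assuming patterns are identified by a string key; the `SkillTracker` class is hypothetical:

```python
from collections import Counter


class SkillTracker:
    """Counts successful uses of a pattern and promotes it once a threshold is reached."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.counts: Counter = Counter()
        self.permanent: set[str] = set()

    def observe(self, pattern: str) -> bool:
        """Record one use; return True only at the moment the pattern is promoted."""
        self.counts[pattern] += 1
        if self.counts[pattern] >= self.threshold and pattern not in self.permanent:
            self.permanent.add(pattern)
            return True
        return False


tracker = SkillTracker()
print([tracker.observe("retry-with-backoff") for _ in range(4)])
# [False, False, True, False]
```
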
Datasets + Training
- ⏳ SRE postmortem corpus (scrape danluu/post-mortems → ~600 incidents → instruction pairs)
- ⏳ AWS Well-Architected synthetic Q/A (PDFs → distilabel pipeline → 5k pairs)
- ⏳ Internal axentx code → instruction pairs (commit messages + diffs)
- ⏳ Training pair quality scoring (filter low-quality before HF upload)
- ⏳ DPO preference pairs from reviewer (chosen/rejected from REWORK cycles)
- ⏳ Synthetic ADR generation (real OSS examples → expand via distilabel)
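
One way to read "chosen/rejected from REWORK cycles": the output that finally earned APPROVE is the chosen sample, and each earlier output sent back with REWORK is a rejected one. A minimal sketch under that assumption; the `prompt`/`chosen`/`rejected` field names follow a common preference-dataset convention and are not confirmed by the roadmap:

```python
def dpo_pairs(prompt: str, attempts: list[tuple[str, str]]) -> list[dict[str, str]]:
    """Build preference pairs from one task's (output, verdict) history, oldest first."""
    approved = [output for output, verdict in attempts if verdict == "APPROVE"]
    if not approved:
        return []  # no accepted output, so nothing to prefer
    chosen = approved[-1]
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": output}
        for output, verdict in attempts
        if verdict == "REWORK"
    ]


history = [("v1 code", "REWORK"), ("v2 code", "REWORK"), ("v3 code", "APPROVE")]
for pair in dpo_pairs("implement login", history):
    print(pair)
```
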
Tools + Integrations
- ⏳ MCP client support (Claude Desktop schema → connect external tools)
- ⏳ ToolSearch lazy-load (don't blow context on full tool list)
- ⏳ Constitutional Critic from ~/.surrogate/agents/roster.json (auto-load)
- ⏳ Repo-map context (tree-sitter symbol graph → smarter file selection)
- ⏳ Tool-call traces saved as training data (every tool use → pair)
Security + Safety
- ⏳ Secret-scan pre-commit hook (gitleaks integration)
- ⏳ Rate limit per-IP (HF Space /chat endpoint)
- ⏳ Allowlist/denylist for git push (don't push to main without flag)
- ⏳ PII scrubber for training pairs (remove emails, IPs, names before upload)
- ⏳ Sandbox tool execution (no rm -rf, no curl | sh, no destructive ops)
- ⏳ Audit log for every orchestrate run (who/what/when/result)
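
For the PII scrubber, emails and IPv4 addresses can be caught with regular expressions; names cannot, and would need something like an NER model. A minimal sketch covering only the first two, with hypothetical placeholder tokens:

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub(text: str) -> str:
    """Replace email addresses and IPv4-looking strings with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return IPV4_RE.sub("<IP>", text)


print(scrub("mail bob@example.com from 10.0.0.1"))  # mail <EMAIL> from <IP>
```

The IPv4 pattern is deliberately loose, so four-part version strings such as `1.2.3.4` are also replaced; over-scrubbing seems the safer failure mode for uploaded training data.
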
Multi-modal + I/O
- ⏳ Voice input (Whisper transcribe → surrogate)
- ⏳ Image input (architectural diagrams → analysis)
- ⏳ Screen recording → video → tutorial agent
- ⏳ Discord voice channel (TTS responses)
CLI UX
- ⏳ /resume (continue past session)
- ⏳ /diff (show pending changes before commit)
- ⏳ /undo (rollback last orchestrate via git stash)
- ⏳ /share (publish session as gist for review)
- ⏳ Tab autocomplete for slash commands
- ⏳ Cost-meter live in statusline (running $ this session)
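
Both the statusline cost meter and the per-stage cost meter under Reliability rest on the same arithmetic: tokens × $/1M, summed per stage and compared against the $1/day alert. A minimal sketch; the `CostMeter` class and the prices in the example are made up for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class CostMeter:
    price_per_million: dict[str, float]  # USD per 1M tokens, keyed by model name
    daily_alert_usd: float = 1.0         # the roadmap's ">$1/day" threshold
    by_stage: dict[str, float] = field(default_factory=dict)

    def record(self, stage: str, model: str, tokens: int) -> float:
        """Add the cost of one call to its stage and return that cost in USD."""
        cost = tokens / 1_000_000 * self.price_per_million[model]
        self.by_stage[stage] = self.by_stage.get(stage, 0.0) + cost
        return cost

    @property
    def total(self) -> float:
        return sum(self.by_stage.values())

    def over_budget(self) -> bool:
        return self.total > self.daily_alert_usd


meter = CostMeter({"example-model": 2.0})  # hypothetical price: $2 per 1M tokens
meter.record("DEV", "example-model", 500_000)
meter.record("Reviewer", "example-model", 250_000)
print(f"${meter.total:.2f}", meter.over_budget())  # $1.50 True
```
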
Cloud / multi-region
- ⏳ Mirror to Cloudflare Workers AI (free tier backup)
- ⏳ Egress whitelist for Discord on HF Pro tier
- ⏳ HF Space upgrade auto-scale (when load > 80%)
- ⏳ Backup strategy: weekly snapshot of /data → HF dataset
Codebase intelligence
- ⏳ Symbol search (tree-sitter index, not just text grep)
- ⏳ Cross-file refactor (rename across project safely)
- ⏳ Type-aware code completion (LSP integration)
- ⏳ Dead code detection (vulture, ts-prune)
- ⏳ Dependency graph viz (per-project)
Training data flywheel
- ⏳ Trace storage on HF (axentx/surrogate-1-traces dataset)
- ⏳ Auto-tag training pairs by domain (frontend/backend/etc)
- ⏳ Quality gate before training pair upload (≥ N tokens, well-formed)
- ⏳ Weekly eval on SWE-bench-Lite (track improvement)
- ⏳ DPO data generation (REWORK cycles → preference pairs)
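
The quality gate (≥ N tokens, well-formed) can be applied line by line to the JSONL file before upload. A minimal sketch, assuming each line is a JSON object with hypothetical `prompt` and `response` string fields and using whitespace-separated words as a crude stand-in for tokens:

```python
import json


def passes_quality_gate(line: str, min_tokens: int = 20) -> bool:
    """Accept a JSONL line only if it parses, has both fields, and is long enough."""
    try:
        pair = json.loads(line)
    except json.JSONDecodeError:
        return False  # not well-formed JSON
    if not isinstance(pair, dict):
        return False
    prompt, response = pair.get("prompt"), pair.get("response")
    if not (isinstance(prompt, str) and isinstance(response, str)):
        return False
    if not prompt.strip():
        return False
    return len(response.split()) >= min_tokens


good = json.dumps({"prompt": "Explain BFS", "response": "word " * 25})
print(passes_quality_gate(good), passes_quality_gate("{broken"))  # True False
```
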
Discord + notifications
- ⏳ Discord webhook for every commit (axentx repo notifications)
- ⏳ Daily digest webhook (commits + pairs + scrape stats)
- ⏳ Failure alerts (orchestrate fail → ping)
- ⏳ Slash commands `/orchestrate "task"` from Discord
HF integrations
- ⏳ TEI server (text-embeddings-inference) for RAG
- ⏳ TGI server (text-generation-inference) for self-hosted LLM
- ⏳ autotrain weekly LoRA on training pairs
- ⏳ HF Inference Providers as primary (paid bypass)
- ⏳ HF Spaces gradio UI (visualize chain status)
Agent quality
- ⏳ Specialist eval per agent (e.g., dev-backend on RealWorld benchmark)
- ⏳ Multi-model consensus on critical decisions (architecture, security)
- ⏳ Constitutional rules (no hard-coded secrets, validate input)
- ⏳ Tool use tracking per agent (which tools each agent calls)
- ⏳ Persona consistency check (review for tone/style mid-thread)
Project management
- ⏳ Burndown chart per surrogate.md plan
- ⏳ Story-point estimation from PRD
- ⏳ Auto-create GitHub issues from `- [ ]` plan items
- ⏳ PR description auto-write from commit list
- ⏳ Sprint retrospective auto-summary
Performance
- ⏳ Profile + optimize orchestrate cycle time (target < 90s p50)
- ⏳ Streaming responses (LLM tokens flow live, don't wait for full)
- ⏳ Local cache for repeated identical prompts
- ⏳ Parallel model calls (race fastest-first, kill rest)
- ⏳ Edge inference (qwen3-coder on Cerebras WaferScale via API)
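
"Race fastest-first, kill rest" maps onto `asyncio.wait` with `FIRST_COMPLETED` followed by cancelling whatever is still pending. A minimal sketch, with a hypothetical `fake_model` coroutine standing in for real provider calls:

```python
import asyncio
from typing import Awaitable, Iterable


async def race(calls: Iterable[Awaitable[str]]) -> str:
    """Return the first result to arrive and cancel the calls still running."""
    tasks = [asyncio.ensure_future(c) for c in calls]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Let the cancelled tasks unwind before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    winner = next(t for t in tasks if t in done)
    return winner.result()  # re-raises if the fastest call failed


async def fake_model(name: str, delay: float) -> str:
    """Hypothetical stand-in for a provider call with a given latency."""
    await asyncio.sleep(delay)
    return name


print(asyncio.run(race([fake_model("slow", 0.2), fake_model("fast", 0.01)])))  # fast
```

As written, a fast failure wins the race and is re-raised; a production version might instead keep waiting for the first successful result.
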
Compliance + Governance
- ⏳ License audit per file generated (OSS license compatibility)
- ⏳ Commit signing (gpg/sigstore)
🟡 Nice-to-Have (future)
Multi-agent collaboration
- 💡 MoA (Mixture of Agents): 3 LLMs propose, judge picks best
- 💡 Debate mode (2 agents argue, third synthesizes)
- 💡 Tournament-style code review (3 reviewers, majority verdict)
- 💡 Hierarchical agents (manager → workers → reporter)
- 💡 Autonomous research squad (3 agents split topics, merge findings)
UI / UX
- 💡 Web dashboard (real-time pipeline status, training pair count, model health)
- 💡 VSCode extension (`surrogate /auto` from editor)
- 💡 IntelliJ plugin
- 💡 Mobile app (iOS/Android) for on-the-go orchestrate
- 💡 Apple Watch glance (current task status)
Voice + Audio
- 💡 Whisper realtime transcription
- 💡 ElevenLabs TTS for status reports
- 💡 Daily audio briefing podcast
- 💡 Voice clone of user for replies
Visual
- 💡 Architecture diagram auto-generation (mermaid → SVG)
- 💡 Dependency graph live render
- 💡 Heat map of code changes per file
- 💡 3D codebase visualization (gource-style)
Integrations
- 💡 Linear / Jira sync (pull tickets, update status)
- 💡 Slack bot
- 💡 Microsoft Teams bot
- 💡 Notion sync (PRD → Notion page)
- 💡 Figma plugin (design → code via DEV agent)
- 💡 Storybook integration (component dev)
- 💡 Sentry integration (errors → fix queue)
- 💡 PagerDuty integration (incident → SRE agent)
- 💡 GitHub Copilot bridge (delegate to Surrogate for complex)
- 💡 Cursor IDE integration
ML / Self-improvement
- 💡 RLHF from APPROVE/REWORK signals
- 💡 RLAIF (AI feedback on agent outputs)
- 💡 Continual pre-training on axentx code corpus
- 💡 Distillation (qwen-coder-30B → 7B for edge)
- 💡 Quantization-aware fine-tuning
- 💡 Speculative decoding for faster inference
- 💡 Mixture-of-experts custom training
Datasets
- 💡 Real-time scrape of GitHub trending (every 1h)
- 💡 Scrape Hacker News top stories daily
- 💡 Scrape Reddit r/programming weekly
- 💡 Scrape Twitter dev threads (X API tier 1 = $100/m, skip)
- 💡 Curated YouTube transcripts (developer talks, RustConf, KubeCon)
- 💡 Scrape arxiv-sanity for AI papers
- 💡 Crawl AWS/GCP/Azure docs nightly
- 💡 PR diff archive (axentx own PRs as training)
- 💡 Stack Overflow accepted answers (dump filter)
- 💡 GitHub issue resolutions (closed issue → PR linkage)
Cloud / Deployment
- 💡 Multi-region HF Spaces (ap-southeast + us-east + eu-west)
- 💡 K8s deployment manifests (move beyond HF when scale demands)
- 💡 Kubernetes operator for axentx orchestration
- 💡 Lambda@Edge for global low-latency inference
- 💡 IPFS publish of PRDs (decentralized)
Privacy + Security
- 💡 E2E encryption for Discord chat
- 💡 Air-gapped mode (Mac-only, no cloud)
- 💡 Federated learning (multiple users contribute, no central data)
- 💡 Zero-knowledge proofs for code provenance
- 💡 Confidential computing (Intel SGX) for sensitive code
- 💡 GDPR compliance toolkit (PII scrub, right-to-delete)
- 💡 SOC 2 Type II readiness checklist
- 💡 ISO 27001 audit prep
Specialty agents
- 💡 Compiler engineer (LLVM, optimization passes)
- 💡 Embedded systems (microcontroller code, real-time)
- 💡 Game dev (Unity, Unreal, Godot)
- 💡 Blockchain (Solidity, smart contracts, security)
- 💡 Quantum computing (Qiskit, circuits)
- 💡 Robotics (ROS, motion planning)
- 💡 Bioinformatics (BLAST, sequence analysis)
- 💡 Quantitative finance (backtesting, risk)
- 💡 Climate modeling
- 💡 Legal tech (contract review)
Education
- 💡 Teach mode (explain decisions step-by-step for learners)
- 💡 Pair programming mode (turn-taking with user)
- 💡 Code review school (annotated learning examples)
- 💡 Daily challenge generator (LeetCode-style, personalized)
- 💡 Concept explainer (DDD, hexagonal, CAP theorem on demand)
Productivity
- 💡 Calendar integration (block focus time when in flow)
- 💡 Pomodoro mode
- 💡 Energy/mood tracker (suggest break when fatigued)
- 💡 Distraction blocker (no Twitter when Surrogate active)
- 💡 Focus music generator (lo-fi via Suno API)
Emerging tech
- 💡 ASI safety guardrails (per Anthropic Constitutional AI)
- 💡 World model simulation (test ideas in synth environment)
- 💡 Causal reasoning (vs correlation)
- 💡 Theorem prover integration (Lean, Coq for verified code)
- 💡 Differential privacy in training
- 💡 Explainable AI for code reviews
Localization
- 💡 Thai-native pipeline (code and comments in Thai)
- 💡 Japanese, Korean, Chinese support
- 💡 RTL languages (Arabic, Hebrew)
- 💡 Local LLM Thai-fluent (typhoon, openthaigpt)
- 💡 Cultural code review (idioms per locale)
Marketing + community
- 💡 Public Surrogate-1 demo Space (read-only)
- 💡 Twitter bot posts daily Surrogate-1 wins
- 💡 GitHub discussions for community
- 💡 Discord server for users
- 💡 Newsletter (weekly improvements)
- 💡 Blog (axentx engineering)
Speculative
- 💡 Surrogate-2 (full local inference, no cloud dep)
- 💡 Custom silicon (qwen-coder optimized FPGA)
- 💡 BCI integration (Neuralink-style direct intent)
- 💡 Physical robot (Boston Dynamics + Surrogate brain)
- 💡 ASI alignment research collaboration
Current Cadence (auto-running on HF)
| Task | Frequency | Status |
|---|---|---|
| Continuous scrape | 8 workers, 5-30s cool-down | ✅ |
| Agentic crawler | 6 workers, BFS frontier | ✅ |
| Skill synthesis | every 3 min | ✅ |
| surrogate-dev-loop | every 2 min | ✅ |
| work-queue producer | every 5 min | ✅ |
| training-pair push to HF | every 3 min | ✅ |
| auto-orchestrate-loop | every 20 min | ✅ |
| research-apply | every 30 min | ✅ |
| keyword tuner | every 60 min | ✅ |
| research-loop | every 6h | ✅ |
| dataset-enrich | every 12h | ✅ |
Verified working (2026-04-28)
- 5 commits to HF dataset in 12 min (~4047 pairs uploaded)
- Pipeline produces real Python/Go code with DDD patterns
- Reviewer issues APPROVE / REWORK / REJECT verdicts
- Training feedback loop closing (every stage → HF)