surrogate-1 / FEATURES.md

Ashira Pitchayapakayakul

docs: 100 must-have + 100 nice-to-have feature roadmap

a49e89a 10 days ago

Surrogate-1 Feature Roadmap

Updated: 2026-04-28

Status legend: ✅ shipped │ 🚧 in progress │ ⏳ planned │ 💡 idea

🟢 Already Shipped (Foundation)

Pipeline (parallel orchestrate)

✅ 6-stage chain: SA → [Architect ∥ QA-TDD] → DEV → [QA-Verify ∥ OPS] → Reviewer
✅ Direct LLM call (skip broken tool-loop)
✅ Marker-extraction → real code blocks → real files in cwd
✅ Auto-commit + git push on APPROVE
✅ 12-rung LLM ladder (Cerebras / Groq / Gemini × 2 / Samba / GH Models / Chutes / OR × 2 / HF Router × 4)
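
The stage ordering above, with its two parallel pairs, can be sketched with asyncio. This is a minimal illustration, not the shipped orchestrator: `run_stage`, `run_parallel`, and `run_chain` are hypothetical names, and each stage is a placeholder that only records that it ran instead of calling an LLM.

```python
import asyncio


async def run_stage(name: str, context: dict) -> dict:
    # Placeholder for a real LLM call (assumption): just record the stage name.
    await asyncio.sleep(0)
    return {**context, "trace": context["trace"] + [name]}


async def run_parallel(names: list[str], context: dict) -> dict:
    # Run sibling stages concurrently; gather() returns results in input order.
    results = await asyncio.gather(*(run_stage(n, context) for n in names))
    merged = list(context["trace"])
    for result in results:
        merged.append(result["trace"][-1])
    return {**context, "trace": merged}


async def run_chain(task: str) -> list[str]:
    ctx = {"task": task, "trace": []}
    ctx = await run_stage("SA", ctx)
    ctx = await run_parallel(["Architect", "QA-TDD"], ctx)
    ctx = await run_stage("DEV", ctx)
    ctx = await run_parallel(["QA-Verify", "OPS"], ctx)
    ctx = await run_stage("Reviewer", ctx)
    return ctx["trace"]


if __name__ == "__main__":
    print(asyncio.run(run_chain("add a health endpoint")))
```

Each parallel pair receives the same input context, so neither sibling can depend on the other's output; that constraint is what makes the bracketed stages safe to run side by side.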

Data + Knowledge

✅ 26 public datasets covering all SDLC domains
✅ Training-pair feedback loop (every stage → ~/.surrogate/training-pairs.jsonl → HF dataset every 3 min)
✅ Web research preamble (DDG search → context for PRD/orchestrate)
✅ Agentic crawler (URL frontier + visited stamps + BFS link discovery, 6 workers)
✅ Skill synthesis daemon (3-min cycles → ~/.surrogate/skills/{cat}/SKILL.md)
✅ Continuous scrape (8 workers, 5-30s cool-down)
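
The crawler's URL frontier and BFS link discovery can be illustrated in a few lines. This is a single-threaded sketch under stated assumptions: `bfs_crawl` and the injected `fetch_links` callable are hypothetical, a plain set stands in for the visited stamps, and the 6-worker concurrency is left out.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],
    max_pages: int = 100,
) -> list[str]:
    """Visit URLs breadth-first and return them in visit order."""
    frontier = deque(seeds)      # URLs waiting to be fetched
    visited: set[str] = set()    # stand-in for the visited stamps
    order: list[str] = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return order
```

Passing `fetch_links` in as a parameter keeps the traversal testable without network access, for example with a dictionary that maps each URL to its outgoing links.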

Models (Ollama on HF)

✅ qwen3-coder:30b-a3b (primary, 16GB MoE)
✅ devstral:24b (Mistral SWE-agent, 53.6% SWE-bench)
✅ qwen2.5-coder:14b (fallback)
✅ yi-coder:9b (128k context)
✅ nomic-embed-text (RAG embeddings)

Agent Roster (19 SDLC experts)

✅ solution-architect, tech-architect (design)
✅ dev-frontend, dev-backend, dev-mobile, dev-fullstack, dev-database (impl)
✅ qa-engineer, qa-perf, qa-security (test)
✅ devops, sre, cloud-architect (infra)
✅ devsecops, cloud-security (security)
✅ data-engineer, ml-engineer (data/ML)
✅ tech-writer, reviewer (docs/gate)

Infrastructure

✅ HF Space (CPU 16GB free) running 24/7
✅ /data persistent volume (state + logs + memory + skills + sessions + training-pairs)
✅ Backward-compat symlinks (~/.claude/* → ~/.surrogate/*)
✅ Mac CLI clean (20 essential files only, 118 daemons archived)
✅ Status server: /, /health, /logs/{name}, /logs-list

🔴 Must-Have (next 30 days)

Reliability + Observability

⏳ Heartbeat alarm → Discord webhook if HF Space down >5 min
⏳ Auto-retry on transient errors (provider 429/503 → wait + retry next rung)
⏳ Cost meter per stage (tokens × $/1M, alert >$1/day)
⏳ Regression test suite (run nightly: orchestrate test fixtures, expect APPROVE)
⏳ Dataset upload deduplication (md5 of slice → skip if same as last)
⏳ Token-pool health check (rotate to next when 429)
⏳ Disk usage alert (>80% /data → cleanup oldest scrape state)
⏳ Memory leak watchdog (kill daemon RSS >1.5GB, restart)
⏳ Crash recovery (auto-resume cron loop on SIGCHLD)
⏳ Snapshot scrape ledger to HF dataset weekly
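
One way the planned auto-retry could behave: on a 429 or 503, wait briefly and fall through to the next rung of the ladder; on any other error, fail loudly. The sketch below is an assumption about that design. `ProviderError`, `call_ladder`, and the `(name, callable)` rung shape are hypothetical and do not reflect the real provider clients.

```python
import time
from typing import Callable

TRANSIENT = {429, 503}  # statuses treated as "try the next rung"


class ProviderError(Exception):
    """Hypothetical error type carrying the provider's HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


def call_ladder(
    rungs: list[tuple[str, Callable[[str], str]]],
    prompt: str,
    wait_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Return (rung_name, response) from the first rung that succeeds."""
    last_error = None
    for name, call in rungs:
        try:
            return name, call(prompt)
        except ProviderError as err:
            if err.status not in TRANSIENT:
                raise  # non-transient errors should not be masked
            last_error = err
            sleep(wait_s)  # brief wait before moving down the ladder
    raise RuntimeError("all rungs exhausted") from last_error
```

The `sleep` parameter is injectable so that tests can skip the real delay.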

PRD + Project bootstrap

⏳ Claude Projects-style PRD wizard (single description input → auto-extract → 1-3 follow-ups → PRD)
⏳ PRD template library (web app / API / CLI / mobile / data pipeline / ML)
⏳ Auto-detect existing repo → reverse-engineer surrogate.md
⏳ PRD versioning (v1, v2 with diff)
⏳ "Spec mode" — refine PRD interactively before any code

Pipeline quality

⏳ Self-critique loop (after dev: model A reviews model B output → re-dev if NEEDS-WORK)
⏳ Regression test on touched files (re-run existing tests)
⏳ Lint + type-check + security scan in pipeline (ruff, mypy, semgrep)
⏳ Diff approval UI (show changes before commit, esp. yolo mode)
⏳ Search-replace block edits (Aider-style, less risky than full rewrite)
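
The search-replace idea can be shown in its simplest form: an edit applies only when the search text matches exactly once, so an ambiguous or stale block is rejected rather than silently rewriting the wrong place. This is a simplified sketch, not Aider's actual block format or matching logic, and `apply_search_replace` is a hypothetical helper.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one search/replace edit, refusing ambiguous or missing matches."""
    if not search:
        raise ValueError("search text must not be empty")
    count = source.count(search)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return source.replace(search, replace, 1)
```

Compared with a full-file rewrite, a failed match leaves the file untouched, which is the main reason this style of edit is lower risk.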

Domain expert routing

⏳ Auto-route DEV stage to specialist (frontend/backend/mobile/iac) based on task keywords
⏳ Multi-specialist parallel work (e.g., backend API + frontend UI in same task → spawn both)
⏳ Specialist-specific eval (frontend agent → check WCAG; backend → check N+1)
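
Keyword-based routing could start as small as the sketch below. The agent names come from the roster above, but the keyword sets, the `ROUTES` table, and the `route` function are illustrative assumptions. Returning every matching specialist also covers the multi-specialist case, where one task spawns both a backend and a frontend agent.

```python
import re

# Hypothetical keyword table; a real one would be tuned from past tasks.
ROUTES = {
    "dev-frontend": {"react", "css", "ui", "component"},
    "dev-backend": {"api", "endpoint", "database", "queue"},
    "dev-mobile": {"ios", "android", "mobile"},
}
FALLBACK = "dev-fullstack"


def route(task: str) -> list[str]:
    """Return every specialist whose keywords appear in the task text."""
    words = set(re.findall(r"[a-z0-9]+", task.lower()))
    matched = [agent for agent, keywords in ROUTES.items() if words & keywords]
    return matched or [FALLBACK]
```

For example, `route("Add a REST API endpoint and a React component")` selects both `dev-frontend` and `dev-backend`, while a task with no known keywords falls back to `dev-fullstack`.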

Memory + Context

⏳ Episodic memory (last 50 sessions retrieval for similar tasks)
⏳ Procedural memory (how-to library auto-generated from successful runs)
⏳ Project context cache (surrogate.md + repo-map persisted across sessions)
⏳ Cross-project pattern share (skill from project A → applicable to project B)
⏳ Long-term retention (key decisions → ADR auto-generation)

Self-improvement loop

⏳ Reflexion lessons → injected into next-similar-task prompt
⏳ Failed orchestrate → root-cause analysis → improvement queue
⏳ Weekly LoRA fine-tune trigger (on accumulated training pairs, autotrain)
⏳ A/B test prompts (variant A vs B, pick winner by APPROVE rate)
⏳ Voyager-style skill crystallization (pattern repeated 3+ times → permanent skill)
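
The "repeated 3+ times" rule for skill crystallization reduces to a counter with a threshold. The sketch below assumes patterns are already normalized to comparable strings, which is the hard part and is not shown; `SkillTracker` is a hypothetical name.

```python
from collections import Counter


class SkillTracker:
    """Promote a pattern to a permanent skill once it has been seen enough."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.counts: Counter = Counter()
        self.permanent: set[str] = set()

    def observe(self, pattern: str) -> bool:
        """Record one occurrence; return True only when the pattern is promoted."""
        self.counts[pattern] += 1
        if self.counts[pattern] >= self.threshold and pattern not in self.permanent:
            self.permanent.add(pattern)
            return True
        return False
```

Returning `True` only on the promoting call gives the caller a single, unambiguous moment to write the skill file.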

Datasets + Training

⏳ SRE postmortem corpus (scrape danluu/post-mortems → ~600 incidents → instruction pairs)
⏳ AWS Well-Architected synthetic Q/A (PDFs → distilabel pipeline → 5k pairs)
⏳ Internal axentx code → instruction pairs (commit messages + diffs)
⏳ Training pair quality scoring (filter low-quality before HF upload)
⏳ DPO preference pairs from reviewer (chosen/rejected from REWORK cycles)
⏳ Synthetic ADR generation (real OSS examples → expand via distilabel)
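
Turning REWORK cycles into DPO preference pairs might look like this: the final approved output becomes `chosen`, and each earlier attempt that drew a REWORK verdict becomes a `rejected` counterpart. The record layout (`prompt` / `chosen` / `rejected`) is a common convention for preference data and is an assumption here, as is the `(output, verdict)` attempt shape.

```python
def dpo_pairs(prompt: str, attempts: list[tuple[str, str]]) -> list[dict]:
    """Build preference pairs from (output, verdict) attempts on one task."""
    approved = [output for output, verdict in attempts if verdict == "APPROVE"]
    if not approved:
        return []  # no accepted answer means nothing to prefer
    chosen = approved[-1]
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": output}
        for output, verdict in attempts
        if verdict == "REWORK"
    ]
```

Tasks that never reach APPROVE yield no pairs, since there is no trustworthy positive example to contrast against.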

Tools + Integrations

⏳ MCP client support (Claude Desktop schema — connect external tools)
⏳ ToolSearch lazy-load (don't blow context on full tool list)
⏳ Constitutional Critic from ~/.surrogate/agents/roster.json (auto-load)
⏳ Repo-map context (tree-sitter symbol graph → smarter file selection)
⏳ Tool-call traces saved as training data (every tool use → pair)

Security + Safety

⏳ Secret-scan pre-commit hook (gitleaks integration)
⏳ Rate limit per-IP (HF Space /chat endpoint)
⏳ Allowlist/denylist for git push (don't push to main without flag)
⏳ PII scrubber for training pairs (remove emails, IPs, names before upload)
⏳ Sandbox tool execution (no rm -rf, no curl |sh, no destructive ops)
⏳ Audit log for every orchestrate run (who/what/when/result)
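
A first pass at the PII scrubber could handle emails and IPv4 addresses with regular expressions, as sketched below. The patterns and placeholder tokens are assumptions, and personal names are deliberately not covered because they need more than a regex (for example a named-entity model).

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub(text: str) -> str:
    """Replace email addresses and IPv4-looking strings with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    return IPV4.sub("<IP>", text)
```

The IPv4 pattern also matches four-part version strings such as `1.2.3.4`, so it errs toward over-scrubbing, which is usually the safer direction before a public upload.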

Multi-modal + I/O

⏳ Voice input (Whisper transcribe → surrogate)
⏳ Image input (architectural diagrams → analysis)
⏳ Screen recording → video → tutorial agent
⏳ Discord voice channel (TTS responses)

CLI UX

⏳ /resume (continue past session)
⏳ /diff (show pending changes before commit)
⏳ /undo (rollback last orchestrate via git stash)
⏳ /share (publish session as gist for review)
⏳ Tab autocomplete for slash commands
⏳ Cost-meter live in statusline (running $ this session)
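
The cost-meter arithmetic (tokens × $/1M, with an alert above $1/day) is small enough to sketch directly. `CostMeter` and its fields are hypothetical, and the prices passed in are the caller's assumption, not real provider rates.

```python
from dataclasses import dataclass, field


@dataclass
class CostMeter:
    daily_alert_usd: float = 1.0
    spent_by_stage: dict = field(default_factory=dict)

    def record(self, stage: str, tokens: int, usd_per_million: float) -> float:
        """Add one call's cost to its stage and return that cost in USD."""
        cost = tokens * usd_per_million / 1_000_000
        self.spent_by_stage[stage] = self.spent_by_stage.get(stage, 0.0) + cost
        return cost

    @property
    def total(self) -> float:
        return sum(self.spent_by_stage.values())

    def over_budget(self) -> bool:
        """True once total spend strictly exceeds the alert threshold."""
        return self.total > self.daily_alert_usd
```

A statusline could render `total` after every call, while the daily alert only needs `over_budget()`.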

Cloud / multi-region

⏳ Mirror to Cloudflare Workers AI (free tier backup)
⏳ Egress whitelist for Discord on HF Pro tier
⏳ HF Space upgrade auto-scale (when load > 80%)
⏳ Backup strategy: weekly snapshot of /data → HF dataset

Codebase intelligence

⏳ Symbol search (tree-sitter index, not just text grep)
⏳ Cross-file refactor (rename across project safely)
⏳ Type-aware code completion (LSP integration)
⏳ Dead code detection (vulture, ts-prune)
⏳ Dependency graph viz (per-project)

Training data flywheel

⏳ Trace storage on HF (axentx/surrogate-1-traces dataset)
⏳ Auto-tag training pairs by domain (frontend/backend/etc)
⏳ Quality gate before training pair upload (≥ N tokens, well-formed)
⏳ Weekly eval on SWE-bench-Lite (track improvement)
⏳ DPO data generation (REWORK cycles → preference pairs)
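
The quality gate (≥ N tokens, well-formed) could begin as a per-line check on the JSONL file. The sketch assumes each line is a JSON object with string `prompt` and `response` fields and counts whitespace-separated words as a rough token proxy; the field names and the default threshold are assumptions.

```python
import json


def passes_gate(line: str, min_tokens: int = 20) -> bool:
    """Return True if a JSONL line is well-formed and its response is long enough."""
    try:
        pair = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(pair, dict):
        return False
    prompt, response = pair.get("prompt"), pair.get("response")
    if not (isinstance(prompt, str) and isinstance(response, str)):
        return False
    # Whitespace split is only an approximation of a tokenizer count.
    return len(response.split()) >= min_tokens
```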

Discord + notifications

⏳ Discord webhook for every commit (axentx repo notifications)
⏳ Daily digest webhook (commits + pairs + scrape stats)
⏳ Failure alerts (orchestrate fail → ping)
⏳ Slash commands /orchestrate "task" from Discord

HF integrations

⏳ TEI server (text-embeddings-inference) for RAG
⏳ TGI server (text-generation-inference) for self-hosted LLM
⏳ autotrain weekly LoRA on training pairs
⏳ HF Inference Providers as primary (paid bypass)
⏳ HF Spaces gradio UI (visualize chain status)

Agent quality

⏳ Specialist eval per agent (e.g., dev-backend on RealWorld benchmark)
⏳ Multi-model consensus on critical decisions (architecture, security)
⏳ Constitutional rules (no hard-coded secrets, validate input)
⏳ Tool use tracking per agent (which tools each agent calls)
⏳ Persona consistency check (review for tone/style mid-thread)

Project management

⏳ Burndown chart per surrogate.md plan
⏳ Story-point estimation from PRD
⏳ Auto-create GitHub issues from - [ ] plan items
⏳ PR description auto-write from commit list
⏳ Sprint retrospective auto-summary
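
Before any GitHub issue can be created, the unchecked `- [ ]` items have to be pulled out of the plan. A minimal extraction sketch follows; `open_items` is a hypothetical helper and the issue-creation call itself is not shown.

```python
import re

# Matches "- [ ] text" or "* [ ] text", optionally indented; checked items are skipped.
UNCHECKED = re.compile(r"^[ \t]*[-*][ \t]+\[ \][ \t]+(.+?)[ \t]*$", re.MULTILINE)


def open_items(markdown: str) -> list[str]:
    """Return the text of every unchecked task-list item, in document order."""
    return UNCHECKED.findall(markdown)
```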

Performance

⏳ Profile + optimize orchestrate cycle time (target < 90s p50)
⏳ Streaming responses (LLM tokens flow live, don't wait for full)
⏳ Local cache for repeated identical prompts
⏳ Parallel model calls (race fastest-first, kill rest)
⏳ Edge inference (qwen3-coder on Cerebras WaferScale via API)
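
Racing parallel model calls and cancelling the losers maps onto `asyncio.wait` with `FIRST_COMPLETED`. In this sketch `race` is a hypothetical helper; if the first call to finish raised an exception, that exception propagates instead of waiting for a slower success, which a production version would probably want to handle.

```python
import asyncio
from typing import Awaitable, Iterable


async def race(coros: Iterable[Awaitable[str]]) -> str:
    """Return the first completed result and cancel the remaining calls."""
    tasks = [asyncio.ensure_future(c) for c in coros]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Let cancelled tasks finish unwinding so nothing is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()
```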

Compliance + Governance

⏳ License audit per file generated (OSS license compatibility)
⏳ Commit signing (gpg/sigstore)

💡 Nice-to-Have (future)

Multi-agent collaboration

💡 MoA (Mixture of Agents) — 3 LLMs propose, judge picks best
💡 Debate mode (2 agents argue, third synthesizes)
💡 Tournament-style code review (3 reviewers, majority verdict)
💡 Hierarchical agents (manager → workers → reporter)
💡 Autonomous research squad (3 agents split topics, merge findings)

UI / UX

💡 Web dashboard (real-time pipeline status, training pair count, model health)
💡 VSCode extension (surrogate /auto from editor)
💡 IntelliJ plugin
💡 Mobile app (iOS/Android) for on-the-go orchestrate
💡 Apple Watch glance (current task status)

Voice + Audio

💡 Whisper realtime transcription
💡 ElevenLabs TTS for status reports
💡 Daily audio briefing podcast
💡 Voice clone of user for replies

Visual

💡 Architecture diagram auto-generation (mermaid → SVG)
💡 Dependency graph live render
💡 Heat map of code changes per file
💡 3D codebase visualization (gource-style)

Integrations

💡 Linear / Jira sync (pull tickets, update status)
💡 Slack bot
💡 Microsoft Teams bot
💡 Notion sync (PRD ↔ Notion page)
💡 Figma plugin (design → code via DEV agent)
💡 Storybook integration (component dev)
💡 Sentry integration (errors → fix queue)
💡 PagerDuty integration (incident → SRE agent)
💡 GitHub Copilot bridge (delegate to Surrogate for complex tasks)
💡 Cursor IDE integration

ML / Self-improvement

💡 RLHF from APPROVE/REWORK signals
💡 RLAIF (AI feedback on agent outputs)
💡 Continual pre-training on axentx code corpus
💡 Distillation (qwen-coder-30B → 7B for edge)
💡 Quantization-aware fine-tuning
💡 Speculative decoding for faster inference
💡 Mixture-of-experts custom training

Datasets

💡 Real-time scrape of GitHub trending (every 1h)
💡 Scrape Hacker News top stories daily
💡 Scrape Reddit r/programming weekly
💡 Scrape Twitter dev threads (X API tier 1 = $100/month, skip)
💡 Curated YouTube transcripts (developer talks, RustConf, KubeCon)
💡 Scrape arxiv-sanity for AI papers
💡 Crawl AWS/GCP/Azure docs nightly
💡 PR diff archive (axentx own PRs as training)
💡 Stack Overflow accepted answers (dump filter)
💡 GitHub issue resolutions (closed issue → PR linkage)

Cloud / Deployment

💡 Multi-region HF Spaces (ap-southeast + us-east + eu-west)
💡 K8s deployment manifests (move beyond HF when scale demands)
💡 Kubernetes operator for axentx orchestration
💡 Lambda@Edge for global low-latency inference
💡 IPFS publish of PRDs (decentralized)

Privacy + Security

💡 E2E encryption for Discord chat
💡 Air-gapped mode (Mac-only, no cloud)
💡 Federated learning (multiple users contribute, no central data)
💡 Zero-knowledge proofs for code provenance
💡 Confidential computing (Intel SGX) for sensitive code
💡 GDPR compliance toolkit (PII scrub, right-to-delete)
💡 SOC 2 Type II readiness checklist
💡 ISO 27001 audit prep

Specialty agents

💡 Compiler engineer (LLVM, optimization passes)
💡 Embedded systems (microcontroller code, real-time)
💡 Game dev (Unity, Unreal, Godot)
💡 Blockchain (Solidity, smart contracts, security)
💡 Quantum computing (Qiskit, circuits)
💡 Robotics (ROS, motion planning)
💡 Bioinformatics (BLAST, sequence analysis)
💡 Quantitative finance (backtesting, risk)
💡 Climate modeling
💡 Legal tech (contract review)

Education

💡 Teach mode (explain decisions step-by-step for learners)
💡 Pair programming mode (turn-taking with user)
💡 Code review school (annotated learning examples)
💡 Daily challenge generator (LeetCode-style, personalized)
💡 Concept explainer (DDD, hexagonal, CAP theorem on demand)

Productivity

💡 Calendar integration (block focus time when in flow)
💡 Pomodoro mode
💡 Energy/mood tracker (suggest break when fatigued)
💡 Distraction blocker (no Twitter when Surrogate active)
💡 Focus music generator (lo-fi via Suno API)

Emerging tech

💡 ASI safety guardrails (per Anthropic Constitutional AI)
💡 World model simulation (test ideas in synth environment)
💡 Causal reasoning (vs correlation)
💡 Theorem prover integration (Lean, Coq for verified code)
💡 Differential privacy in training
💡 Explainable AI for code reviews

Localization

💡 Thai-native pipeline (code and comments in Thai)
💡 Japanese, Korean, Chinese support
💡 RTL languages (Arabic, Hebrew)
💡 Local LLM Thai-fluent (typhoon, openthaigpt)
💡 Cultural code review (idioms per locale)

Marketing + community

💡 Public Surrogate-1 demo Space (read-only)
💡 Twitter bot posts daily Surrogate-1 wins
💡 GitHub discussions for community
💡 Discord server for users
💡 Newsletter (weekly improvements)
💡 Blog (axentx engineering)

Speculative

💡 Surrogate-2 (full local inference, no cloud dep)
💡 Custom silicon (qwen-coder optimized FPGA)
💡 BCI integration (Neuralink-style direct intent)
💡 Physical robot (Boston Dynamics + Surrogate brain)
💡 ASI alignment research collaboration

Current Cadence (auto-running on HF)

Task	Frequency	Status
Continuous scrape	8 workers, 5-30s cool-down	✅
Agentic crawler	6 workers, BFS frontier	✅
Skill synthesis	every 3 min	✅
surrogate-dev-loop	every 2 min	✅
work-queue producer	every 5 min	✅
training-pair push to HF	every 3 min	✅
auto-orchestrate-loop	every 20 min	✅
research-apply	every 30 min	✅
keyword tuner	every 60 min	✅
research-loop	every 6h	✅
dataset-enrich	every 12h	✅

Verified working (2026-04-28)

5 commits to HF dataset in 12 min (~4047 pairs uploaded)
Pipeline produces real Python/Go code with DDD patterns
Reviewer issues APPROVE / REWORK / REJECT verdicts
Training feedback loop closing (every stage → HF)