---
title: Cloud + Platform Engineering Deep Research for Surrogate-1 v2
date: 2026-04-29T00:00:00.000Z
purpose: >-
  Train Surrogate-1 (Qwen2.5-Coder-7B + LoRA) into a SOTA Cloud / Platform
  Engineer
scope: AWS + GCP + Azure + Edge + IDP + IaC + K8s + FinOps + Multi-cloud DR
---

Surrogate-1 SOTA Cloud + Platform Engineer Training Plan

This document is the canonical knowledge base used to design the v2 instruction-tuning curriculum for Surrogate-1. The model must, autonomously and end-to-end:

  1. Design + provision multi-cloud infrastructure (AWS, GCP, Azure, Cloudflare, Vercel)
  2. Author production-grade IaC (Terraform, OpenTofu, CDK, Pulumi, Bicep, Crossplane)
  3. Operate Kubernetes platforms (EKS / GKE / AKS) with GitOps + service mesh
  4. Build internal developer platforms (Backstage, Port, Score, Humanitec)
  5. Handle FinOps lifecycle (Inform / Optimize / Operate, 2025 + Scopes)
  6. Execute multi-cloud disaster recovery + global routing
  7. Stand up edge/serverless (Cloudflare Workers, Vercel Edge, Lambda@Edge)

The research is organized into 14 verticals. Each section closes with the training corpus + eval target for the v2 curriculum.


1. AWS Deep Mastery

1.1 Certification Scope (training-data anchors)

| Cert | Code | Topics | Why we mine it |
|---|---|---|---|
| Solutions Architect Associate | SAA-C03 | VPC, EC2, S3, RDS, Lambda, IAM basics | Foundational service catalog |
| Solutions Architect Pro | SAP-C02 | Multi-account, hybrid, migration, DR, cost-resilience | Most question banks for org-complexity |
| DevOps Engineer Pro | DOP-C02 | CI/CD, monitoring, IaC, governance | Pipelines + observability |
| Security Specialty | SCS-C02 | KMS, GuardDuty, Inspector, SCPs, IRSA | Hardening + compliance scenarios |
| Advanced Networking Specialty | ANS-C01 | Transit Gateway, Direct Connect, Cloud WAN | Multi-VPC + hybrid networking |

The SAP-C02 exam validates designing multi-account strategies, hybrid architectures, migration at scale, cost optimization, security, and resilience – exactly the Surrogate-1 scope. The exam has 65 scored + 10 unscored questions, passing score 750/1000.

1.2 Well-Architected Framework – 6 pillars (Sustainability added Dec 2021, refreshed Nov 2024)

1. Operational Excellence   – IaC, runbooks, observability, post-mortems
2. Security                 – IAM, encryption, network, IR
3. Reliability              – RTO/RPO, failover, multi-AZ/multi-region
4. Performance Efficiency   – right-sized compute, modern data services
5. Cost Optimization        – RIs/SPs/Spot/Graviton, lifecycle rules
6. Sustainability           – energy efficiency, region selection, idle cleanup

Lenses Surrogate-1 must recognize: Serverless, SaaS, Migration, Generative AI, IoT, Hybrid Networking, Financial Services, Streaming Media, ML.

1.3 Top 30 services for startup/SaaS workloads

Compute      : EC2, Lambda, Fargate, Batch, ECS, EKS, App Runner
Storage      : S3, EFS, FSx, EBS
DB           : RDS (Postgres/MySQL), Aurora, Aurora DSQL, DynamoDB, ElastiCache (Redis/Valkey), OpenSearch
Network      : VPC, Route53, CloudFront, ALB/NLB/GWLB, Transit Gateway, PrivateLink, API Gateway
Identity     : IAM, IAM Identity Center (SSO), Cognito, Organizations, Verified Permissions
Observability: CloudWatch, X-Ray, Managed Prometheus, Managed Grafana, OpenSearch
Security     : KMS, Secrets Manager, GuardDuty, Inspector, Security Hub, WAF, Shield
Data/AI      : Bedrock, SageMaker, Glue, Athena, Kinesis, MSK, Step Functions
Messaging    : SQS, SNS, EventBridge
DevTools     : CodePipeline, CodeBuild, CodeDeploy, CDK, SAM

1.4 VPC networking patterns

Hub-and-spoke with Transit Gateway – TGW is the managed hub-and-spoke service for VPCs and on-prem; centralizes routing without VPN overlays. Aligns with Well-Architected Reliability pillar REL02-BP04.

Centralized PrivateLink endpoints – Host interface VPC endpoints (e.g., for KMS, STS, Secrets Manager) in a single shared-services VPC. All spoke VPCs reach AWS APIs via TGW → endpoint VPC. Saves cost: one set of ~$7.30/month-per-AZ endpoints instead of one set per VPC.
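Back-of-the-envelope math for the consolidation (an illustrative Python sketch; the ~$7.30/month figure is the fixed per-endpoint-per-AZ cost above, and TGW data-processing and endpoint data-processing charges are deliberately ignored):

```python
# Interface endpoint fixed cost: ~$0.01/hr per endpoint per AZ, approx $7.30/month.
ENDPOINT_MONTHLY_PER_AZ = 7.30

def endpoint_cost(n_endpoints: int, n_azs: int, n_vpcs: int) -> float:
    """Monthly fixed cost when each of n_vpcs hosts its own endpoint set."""
    return ENDPOINT_MONTHLY_PER_AZ * n_endpoints * n_azs * n_vpcs

# 4 endpoints (kms, sts, secretsmanager, + one more) x 3 AZs, 10 spoke VPCs:
decentralized = endpoint_cost(4, 3, n_vpcs=10)  # ~$876/mo (endpoints in every spoke)
centralized   = endpoint_cost(4, 3, n_vpcs=1)   # ~$88/mo (one shared-services VPC)
```

The break-even depends on traffic volume, since TGW bills per GB processed on the path to the shared endpoints.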

Decision tree:

Two VPCs, low traffic, no transitive    → VPC Peering
Service consumed across many VPCs       → PrivateLink (endpoint service)
≥3 VPCs with transitive routing needed  → Transit Gateway (hub-and-spoke)
Multi-region + on-prem at scale         → Cloud WAN
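The same tree as a tiny Python chooser (a sketch only; real designs also weigh TGW data-processing cost, attachment quotas, and CIDR overlap):

```python
def vpc_connectivity(n_vpcs: int, transitive: bool = False,
                     multi_region_onprem: bool = False,
                     shared_service: bool = False) -> str:
    """Map topology facts to the connectivity pattern from the decision tree."""
    if multi_region_onprem:
        return "Cloud WAN"
    if shared_service:
        return "PrivateLink"          # one producer service, many consumer VPCs
    if n_vpcs >= 3 and transitive:
        return "Transit Gateway"      # hub-and-spoke
    return "VPC Peering"              # two VPCs, low traffic, no transitive hops

assert vpc_connectivity(2) == "VPC Peering"
assert vpc_connectivity(12, transitive=True) == "Transit Gateway"
assert vpc_connectivity(5, shared_service=True) == "PrivateLink"
assert vpc_connectivity(40, multi_region_onprem=True) == "Cloud WAN"
```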

1.5 IAM advanced

SCP = guardrail at OU/account level; deny-by-default, no permissions granted, only constrains. SCPs evaluated AND'd with IAM policies + permission boundaries – action allowed only when ALL allow it.

Permission boundary = max permissions a role/user CAN have, regardless of attached policies. Used for delegated admin (developer can create roles, but only ones bounded).

ABAC = attribute-based access control via tags (e.g., aws:PrincipalTag/team must equal aws:ResourceTag/team). Reduces role count drastically. SCPs can lock the tagging itself so principals can't escalate by re-tagging.

Example SCP – deny creating EC2/RDS resources without an approved Environment tag (a request with no tag at all also matches StringNotEquals, so untagged launches are denied too):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnapprovedEnvTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestTag/Environment": ["prod", "staging", "dev"]
        }
      }
    }
  ]
}
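One subtlety worth encoding: with StringNotEquals, a request that omits the tag key entirely still matches the condition, so the Deny fires. A Python sketch of that evaluation logic (simplified single-valued semantics, not a full IAM policy engine):

```python
def scp_denies(action: str, request_tags: dict) -> bool:
    """Sketch of the Deny statement above. StringNotEquals matches when the
    tag value differs OR the key is absent entirely -- standard IAM condition
    semantics -- so untagged requests are denied as well."""
    if action not in ("ec2:RunInstances", "rds:CreateDBInstance"):
        return False                      # statement does not cover this action
    allowed = {"prod", "staging", "dev"}
    env = request_tags.get("Environment")
    return env not in allowed             # None (missing tag) also fails the check

assert scp_denies("ec2:RunInstances", {})                           # untagged -> denied
assert scp_denies("ec2:RunInstances", {"Environment": "qa"})        # wrong value -> denied
assert not scp_denies("ec2:RunInstances", {"Environment": "prod"})  # approved -> allowed
assert not scp_denies("s3:PutObject", {})                           # action not covered
```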

1.6 Cost optimization (FinOps lever in §11)

Compute discount tiers (max savings vs on-demand):

| Mechanism | Max discount | Flexibility |
|---|---|---|
| Standard RI | 75% | Locked region+family+OS, 1 or 3 yr |
| Convertible RI | 54% | Can change family within OS |
| EC2 Instance SP | 72% | Locked family, any size, any AZ |
| Compute SP | 66% | EC2 + Fargate + Lambda + SageMaker |
| Spot | 90% | Variable interruption (2-min notice) |
| Graviton | +40% perf/$ | ARM64 (must support arch) |
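To make the tiers concrete, a quick Python comparison at an assumed on-demand rate (the hourly price is a hypothetical placeholder, not a quoted AWS price):

```python
ON_DEMAND_HOURLY = 0.0416   # assumed rate for illustration only
HOURS_PER_MONTH = 730

def monthly_cost(discount: float, hourly: float = ON_DEMAND_HOURLY) -> float:
    """Monthly cost after a fractional discount vs on-demand."""
    return hourly * (1 - discount) * HOURS_PER_MONTH

for name, disc in [("on-demand", 0.00), ("Compute SP", 0.66),
                   ("Standard RI", 0.75), ("Spot", 0.90)]:
    print(f"{name:12s} ${monthly_cost(disc):6.2f}/mo")
# on-demand lands around $30/mo at this rate; Spot around $3/mo at the
# advertised maximum discount -- real Spot savings vary by pool.
```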

June 2025 change – RIs and SPs are restricted to single-end-customer; MSPs can no longer share commitments across accounts.

Surrogate-1 must teach Compute Optimizer recommendations + apply them.

1.7 AWS-specific tools & CLI surface

aws cli      → primary
aws cdk      → preferred IaC (TS/Python). CDK Refactor (Sept 2025) safely renames constructs without replacement
aws sam      → serverless/Lambda focus
aws copilot  → ECS/Fargate (END OF SUPPORT June 12 2026 – migrate to ECS Express or CDK L3)
aws amplify  → frontend + serverless backend, Git-driven CI/CD

1.8 Training corpus for AWS

- AWS Well-Architected docs (all 6 pillar PDFs + 9 lenses)
- AWS official examples: aws-samples/* (8000+ repos)
- terraform-aws-modules/* (vpc 126M downloads, eks 96.3M downloads)
- AWS CDK guide v2 + cdk-patterns/serverless
- SAP-C02 question banks (ExamTopics, Tutorials Dojo)
- AWS Architecture Center reference architectures (multi-account, DR, hybrid)
- Service Control Policy examples: aws-samples/service-control-policy-examples

Eval target: 75% on a custom AWS-design eval (multi-account VPC + hub-spoke + IAM bootstrap + EKS cluster) with cfn-lint + cfn-guard passing.


2. GCP Deep Mastery

2.1 Certifications

| Cert | Released | Scope |
|---|---|---|
| Cloud Digital Leader | – | Business/strategy |
| Associate Cloud Engineer | – | gcloud + GCE/GKE/GCS basics |
| Professional Cloud Architect (PCA) | refreshed Oct 30 2025 | Design; ~30% net-new content (Vertex AI, Gemini, AI Hypercomputer) |
| Professional Cloud Network Engineer (PCNE) | – | VPC, hybrid, Cloud Interconnect |
| Professional Cloud DevOps Engineer | – | SLO, CI/CD, observability |
| Professional Cloud Security Engineer | – | Org policies, VPC-SC, BeyondCorp |
| Professional Cloud Database Engineer | – | Cloud SQL, AlloyDB, Spanner |

PCA exam covers Compute Engine, Cloud Storage, App Engine, GKE, with the Oct 2025 refresh adding ~30% new content focused on Vertex AI, Gemini integration, and AI Hypercomputer.

2.2 GKE advanced

GKE Autopilot – Google manages provisioning, scaling, security, add-ons. Bills per-pod resource request (not nodes). Best when the team doesn't want to tune nodepools.

GKE Standard – Customer-managed nodepools; required for DaemonSets that need privileged hostPath, custom CNI, or niche GPU/TPU shapes.

GKE version ladder – GKE adopts new K8s versions fastest (~2 weeks). Autopilot gets 30 months extended support; AKS LTS 24 months; EKS Extended Support +12 months.

Anthos / GKE Enterprise – Multi-cluster across on-prem + AWS + Azure. Provides Config Sync (GitOps), Service Mesh, Policy Controller. Now folded into the GKE Enterprise SKU.

2.3 BigQuery + Vertex AI integration (2025)

  • AI.GENERATE, AI.GENERATE_TABLE, AI.EMBED, AI.SIMILARITY are now GA in BigQuery.
  • BQML supports Gemini 3.0 for generative SQL functions.
  • Vertex AI End User Credentials (2025) lets Vertex models authenticate via the calling user's IAM β€” no service-account proxy.

This is core for any data-platform engineering Surrogate-1 builds.

2.4 Cloud Run + Cloud Functions

  • Cloud Run gen2 = container-as-a-service, scales to zero, max 60-min timeout, supports websockets/streaming.
  • Cloud Functions gen2 = built ON Cloud Run; choose Functions for trigger-driven, Run for HTTP/services.
  • Cloud Run jobs = batch workloads (cron via Cloud Scheduler).

2.5 GCP-specific tools

gcloud           → primary CLI
Terraform google → official provider, fastest day-1 support for new services
Config Connector → GCP-native Crossplane equivalent (KCC); manage GCP resources via K8s CRDs
Cloud Deploy     → managed continuous delivery for GKE / Cloud Run
Cloud Build      → CI (yaml + buildpacks)

2.6 Training corpus for GCP

- GCP architecture center (cloud.google.com/architecture)
- terraform-google-modules/* (network, kubernetes-engine, cloud-foundation-fabric)
- Cloud Foundation Fabric (Google's reference org setup)
- gcp-pca-study-guide repos
- Anthos config-management examples
- BQML + Vertex AI codelabs

3. Azure Deep Mastery

3.1 Certifications

| Cert | Code | Scope |
|---|---|---|
| Administrator Associate | AZ-104 | RBAC, IAM, networking, storage, Bicep basics |
| Solutions Architect Expert | AZ-305 | Design: governance, identity, infra, app, integration |
| Security Engineer | AZ-500 | Defender, Sentinel, Conditional Access |
| DevOps Engineer Expert | AZ-400 | Pipelines, IaC, monitoring |

AZ-305 (refreshed April 17 2026) covers: Identity/governance/monitoring, data storage, infrastructure & availability, application architecture, network solutions, data integration, business continuity. Prereq: AZ-104.

3.2 Azure compute deep cuts

AKS                               → managed K8s; "AKS LTS" = 24-mo extended support per minor
App Service                       → PaaS web hosting (Plans = Basic/Standard/Premium/Isolated)
Functions                         → consumption / premium / dedicated
Container Apps                    → CaaS on KEDA (scale-to-zero from events)
Container Instances (ACI)         → single-pod throwaway
Virtual Machine Scale Sets (VMSS) → IaaS auto-scaling
Azure Spring Apps                 → managed Spring Boot

3.3 Azure DevOps + GitHub Enterprise (Microsoft owns both)

  • Azure DevOps = Boards + Repos + Pipelines + Artifacts. Mature for .NET-heavy orgs.
  • GitHub Enterprise + Actions = where new investment is going (Microsoft's strategic direction).
  • 2025 trend: most new Azure customers go GitHub-first; Azure DevOps is in maintenance mode.

3.4 Azure tooling

az cli   → primary
Bicep    → DSL that transpiles to ARM; JSON ARM templates are legacy, avoid for new work
Pulumi   → first-class Azure Native provider
Terraform azurerm + azuread → mature, official

Bicep simplifies ARM but is Azure-only – for multi-cloud orgs, Terraform remains primary.

3.5 Training corpus for Azure

- Cloud Adoption Framework (Microsoft's enterprise reference)
- Azure-Samples/* GitHub org
- Azure Verified Modules (AVM) – Microsoft's curated Bicep + Terraform modules
- AZ-305 study guides + Microsoft Learn content
- Azure Architecture Center patterns

4. Multi-Cloud Strategy

4.1 Workload portability tools

| Tool | Approach | Best fit |
|---|---|---|
| Crossplane | K8s-native control plane → cloud APIs via providers | Platform teams already on K8s |
| Anthos | GCP-managed clusters across clouds + on-prem | GKE-centric orgs wanting unified control |
| Azure Arc | Azure-managed servers/K8s outside Azure | Azure-centric hybrid |
| Terraform | IaC abstraction (provider-per-cloud) | Most common; least lock-in |
| Pulumi | Real code (Python/TS); equivalent provider coverage | Engineering-heavy teams |

4.2 Crossplane v2 (Aug 2025)

Major upgrades:

  • Compositions can include any K8s resource β€” not just Crossplane MRs. Mix RDSInstance + Deployment + CloudNativePG cluster in one XR.
  • Namespace-first β€” XRs and MRs are namespaced by default (was cluster-scoped).
  • Operations β€” function pipelines for cert monitoring, rolling upgrades, scheduled maintenance.
  • Multi-cloud status β€” AWS providers fully migrated; Azure/GCP/Terraform providers still being updated to v2.

4.3 DR / failover patterns

| Pattern | RTO | RPO | Cost (vs single-region) |
|---|---|---|---|
| Backup & restore | hours-days | hours | 1.0x (storage only) |
| Pilot light | 10s of min | minutes | 1.1-1.3x |
| Warm standby | minutes | minutes | 1.5-1.8x |
| Multi-site active/active | seconds | ~0 | 1.8-2.5x |

Multi-cloud active/active typically costs 1.8–2.5x single-cloud due to duplicate infra + ops overhead. Recommendation: active/passive across clouds + active/active across regions WITHIN primary cloud.
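The table reduces to a lookup from the RTO budget (a sketch; a real decision also weighs RPO, cost ceiling, and data-layer constraints):

```python
def dr_pattern(rto_seconds: int) -> str:
    """Cheapest pattern from the DR table that can still meet the RTO target."""
    if rto_seconds > 60 * 60:          # hours or more -> cheapest option suffices
        return "backup & restore"
    if rto_seconds > 15 * 60:          # tens of minutes
        return "pilot light"
    if rto_seconds > 60:               # single-digit minutes
        return "warm standby"
    return "multi-site active/active"  # seconds

assert dr_pattern(24 * 3600) == "backup & restore"
assert dr_pattern(30 * 60) == "pilot light"
assert dr_pattern(5 * 60) == "warm standby"
assert dr_pattern(30) == "multi-site active/active"
```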

4.4 Latency-based routing

Route53 latency policy   → AWS-native, cheapest
Cloud DNS geo-routing    → GCP-native
Azure Traffic Manager    → Azure-native
Cloudflare load balancer → multi-cloud
NS1 / Constellix         → enterprise multi-cloud DNS

Cloudflare LB is the most common cross-cloud answer because it sits OUTSIDE the providers.

4.5 Cost arbitrage

  • GPU cost: GCP < AWS < Azure (TPUs are GCP-only and cheaper per FLOP at scale)
  • Egress: AWS most expensive; Cloudflare R2 has $0 egress (S3-compatible)
  • Object storage: B2 ($6/TB/mo) < R2 ($15) < S3 standard ($23) < GCS standard ($26)
  • Reserved discounts: deepest in AWS (75% std RI), shallower in Azure (65%), GCP CUDs ~57%

4.6 Vendor lock-in mitigation

1. Use OSS data formats (Parquet, Iceberg, Delta) – not proprietary
2. Use OSS DBs (Postgres / Redis-compatible Valkey) – not Aurora-only or Cosmos-only
3. Use OCI containers + K8s – cluster portability via Crossplane/Anthos
4. Use Terraform with multi-provider modules – abstract per-cloud differences
5. Avoid managed-vendor-only auth – use OIDC + Keycloak or Auth0 (cross-cloud)
6. Use multi-cloud DNS (Cloudflare/NS1) so Route53/Cloud DNS isn't a single point of failure

5. IaC Mastery

5.1 Terraform / OpenTofu (post-BSL fork)

  • HashiCorp Terraform OSS under BSL discontinued after July 2025 β†’ OpenTofu is the OSS continuation under Linux Foundation.
  • Most TACOS (Spacelift, Env0, Scalr) support both. Most modules still work in both.
  • For new orgs in 2026 β†’ default OpenTofu.

Best practices (2025):

1. Remote backend (S3+DynamoDB lock, GCS, Azure Blob) – never local state
2. Split state: per-environment (dev/staging/prod) + per-domain (network, data, compute)
   - Terralith state >50MB causes timeouts; >10MB visible perf hit
3. Module versioning: `~> 2.5` (allow patch+minor, block major)
4. Pre-commit: terraform fmt + validate + tflint + tfsec/checkov
5. CI/CD: Atlantis (OSS, self-host) or Spacelift / Env0 / Scalr / Terramate (SaaS)
6. State locking always on
7. Drift detection: `terraform plan -refresh-only` on schedule (Spacelift / Atlantis cron)
8. Workspaces only for environment isolation; NOT for tenant separation

Workspace anti-pattern: using workspaces for cust-1, cust-2, cust-3 – these should be separate state files / directories instead. Workspaces are good for dev, staging, prod of the same module.
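Item 3's pessimistic constraint, sketched in Python for intuition (Terraform's real constraint grammar has more operators, and a three-part `~> 2.5.0` pins the minor version as well):

```python
def pessimistic_ok(version: str, constraint: str = "2.5") -> bool:
    """'~> MAJOR.MINOR' semantics: >= MAJOR.MINOR.0 and < (MAJOR+1).0.0."""
    vmaj, vmin = (int(p) for p in version.split(".")[:2])
    cmaj, cmin = (int(p) for p in constraint.split("."))
    return vmaj == cmaj and vmin >= cmin

assert pessimistic_ok("2.5.0")
assert pessimistic_ok("2.9.3")       # minor + patch may float
assert not pessimistic_ok("3.0.0")   # major bump blocked
assert not pessimistic_ok("2.4.9")   # below the floor
```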

5.2 CloudFormation

- Nested stacks → for >500 resources / cross-stack dependencies
- Custom resources → Lambda-backed for CFN gaps. Use AwsCustomResource (CDK) for single-API-call
- Change sets → preview before apply (mandatory for prod)
- Stack policies → prevent accidental updates to specific resources
- Service Catalog → curated CFN templates exposed to devs
- StackSets → multi-account/multi-region rollout

5.3 AWS CDK best practices

- Constructs L1 (raw CFN) / L2 (curated AWS) / L3 (composite patterns)
- Aspects → enforce policy across all constructs (e.g., "all S3 buckets must encrypt")
  - Aspects run at synth-time → cfn-guard runs post-synth → both = defense in depth
- Don't extend Construct unless interacting with AWS resources directly; helper class is enough
- Custom resources: use AwsCustomResource for single API call; full Lambda-backed for complex
- CDK Refactor (Sept 2025) → safely rename or move resources without replacement
- Pipelines L3 = managed CodePipeline that self-mutates

5.4 Pulumi

  • Real code (TS/Python/Go/.NET/Java) β€” language loops, classes, unit tests with native frameworks.
  • Pulumi onboarding ~30% faster for engineers already knowing TS/Python (vs HCL).
  • Day-1 support for new cloud services because Pulumi wraps SDKs directly.
  • Pulumi ESC = encrypted env+secrets store; Pulumi Deployments = managed runners.

5.5 Crossplane (K8s-native multi-cloud)

# Composition that creates RDS + Deployment + Service in one XR
apiVersion: apiextensions.crossplane.io/v2
kind: Composition
metadata:
  name: web-app-with-db
spec:
  compositeTypeRef:
    apiVersion: example.io/v1alpha1
    kind: WebApp
  pipeline:
  - step: provision-db
    functionRef:
      name: function-patch-and-transform
    input:
      apiVersion: pt.fn.crossplane.io/v1beta1
      kind: Resources
      resources:
      - name: rds
        base:
          apiVersion: rds.aws.upbound.io/v1beta1
          kind: Instance
          spec:
            forProvider:
              instanceClass: db.t3.medium
              engine: postgres
              engineVersion: "16"
              allocatedStorage: 50
      - name: deployment
        base:
          apiVersion: apps/v1
          kind: Deployment
          spec:
            replicas: 3

5.6 IaC TACOS comparison

| Tool | OSS / SaaS | IaC coverage | Best for |
|---|---|---|---|
| Atlantis | OSS, self-host | TF/OpenTofu/Terragrunt | Free, GitHub-PR workflow |
| Spacelift | SaaS + self-hosted | TF/OpenTofu/Terragrunt/Pulumi/CFN/K8s/Ansible | Enterprise multi-IaC |
| Env0 | SaaS only | Multi-IaC + strong FinOps | FinOps-aware deployment |
| Terramate | OSS + SaaS | TF/OpenTofu | Stack orchestration + DAGs |
| Scalr | SaaS + self-hosted | TF/OpenTofu | TFC alternative |
| Terraform Cloud | SaaS | TF only | Default if already HashiCorp |

5.7 Real Terraform module example (multi-cloud DRY)

# environments/prod/main.tf
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "prod-vpc"
  cidr = "10.0.0.0/16"
  azs  = ["us-east-1a", "us-east-1b", "us-east-1c"]

  private_subnets  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets   = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = false  # one per AZ for HA
  enable_vpn_gateway     = false
  enable_dns_hostnames   = true
  enable_flow_log        = true
  flow_log_destination_type = "cloud-watch-logs"

  tags = local.common_tags
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "prod-platform"
  cluster_version = "1.32"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  cluster_endpoint_public_access = false
  cluster_endpoint_private_access = true

  cluster_addons = {
    coredns                = { most_recent = true }
    kube-proxy             = { most_recent = true }
    vpc-cni                = { most_recent = true }
    aws-ebs-csi-driver     = { most_recent = true }
    eks-pod-identity-agent = { most_recent = true }
  }

  eks_managed_node_groups = {
    system = {
      instance_types = ["t3.medium"]
      min_size = 2
      max_size = 4
      desired_size = 2
      labels = { workload = "system" }
      taints = [{ key = "system", value = "true", effect = "NO_SCHEDULE" }]
    }
    karpenter = {
      instance_types = ["m6g.large"]  # Graviton
      capacity_type  = "ON_DEMAND"
      min_size = 1
      max_size = 2
      desired_size = 1
      labels = { workload = "karpenter" }
    }
  }

  enable_irsa = true
  enable_cluster_creator_admin_permissions = true
}

5.8 Training corpus for IaC

- HashiCorp learn.hashicorp.com/terraform tutorials (1000+ lessons)
- terraform-aws-modules / terraform-google-modules / Azure/terraform-azurerm-* (AVM)
- Pulumi pulumi/examples (1500+)
- aws-samples/aws-cdk-examples
- Crossplane upbound/configurations (reference platforms)
- Awesome-terraform / awesome-pulumi GitHub lists
- IaC-Eval benchmark (academic Terraform benchmark)
- TACOS docs: Spacelift, Env0, Atlantis, Terramate

6. Kubernetes Platform Engineering

6.1 Kubernetes 1.32 → 1.35 highlights (2025)

| Version | Released | Key features |
|---|---|---|
| 1.32 | Dec 2024 | KubeletFineGrainedAuthz; Memory Manager GA; anonymous-auth configurable endpoints |
| 1.33 | Apr 2025 | Sidecars GA; supplementalGroupsPolicy beta; in-place pod resize beta |
| 1.34 | Aug 2025 | DRA core GA (Dynamic Resource Allocation for GPUs/TPUs/FPGAs) |
| 1.35 | Dec 2025 | Fine-grained Supplemental Groups GA; TLS 1.3 baseline |

Pod Security Standards are GA since v1.25 (NOT 2025). Three levels: Privileged / Baseline / Restricted, applied via pod-security.kubernetes.io/<mode> namespace labels.

6.2 Helm vs Kustomize vs Carvel

| Tool | Approach | Strength | Weakness |
|---|---|---|---|
| Helm | Templating + values + chart | Package manager (75% adoption); Helm 4 (Nov 2025) adds server-side apply | Templating debug pain |
| Kustomize | Patch-based overlays on bases | No magic; built into kubectl | No release/version concept; needs ArgoCD/Flux for state |
| Carvel | ytt + kapp + kbld + imgpkg | Strong CI bundling; image relocation | Steeper learning curve, smaller community |

Mature pattern: Helm to install upstream charts (Cilium, ArgoCD, Prometheus); Kustomize overlays per environment. Use ArgoCD helm source with valuesObject overrides.

6.3 GitOps – ArgoCD vs FluxCD (2025 reality)

Weaveworks closed in early 2024 – Flux became fully community-driven (CNCF graduated). ArgoCD has the clearer commercial path (Akuity, Codefresh).

| Aspect | ArgoCD | FluxCD |
|---|---|---|
| UI | Strong native dashboard | None native (use Weave GitOps or third-party) |
| RBAC | Built-in + Projects multi-tenancy | Standard K8s RBAC only |
| Architecture | Hub-and-spoke | Decentralized, K8s-idiomatic |
| Multi-cluster | Native (App-of-Apps, ApplicationSets) | Per-cluster Flux + Notification Controller |
| Best for | Most enterprises in 2025 | Air-gapped / minimal-deps / true GitOps purists |

Default 2026 recommendation: ArgoCD for most orgs.

6.4 Service Mesh – Istio vs Linkerd vs Cilium (2025)

| Mesh | Sidecars | Data plane | Best fit |
|---|---|---|---|
| Istio | Sidecar OR Ambient (ztunnel + waypoint) | Envoy | Advanced traffic mgmt, deep telemetry |
| Linkerd | Sidecar only | linkerd2-proxy (lightweight Rust) | Simplicity + lowest overhead |
| Cilium | Sidecarless | eBPF + Envoy (L7) | Network policy + perf at scale |

Memory cost reality: 500 services on Istio sidecar = ~25–50 GB more RAM than same on Linkerd. Translates to real $$.
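A rough way to sanity-check that claim (the per-sidecar figures here are assumptions: Envoy commonly sits around 60-100 MB loaded with config, linkerd2-proxy around 10-20 MB):

```python
def proxy_ram_gb(services: int, replicas_per_service: int, per_sidecar_mb: float) -> float:
    """Fleet-wide sidecar RAM: one proxy per pod."""
    return services * replicas_per_service * per_sidecar_mb / 1024

istio   = proxy_ram_gb(500, 1, 80)   # ~39 GB at one replica per service
linkerd = proxy_ram_gb(500, 1, 15)   # ~7 GB
# Delta of ~32 GB, inside the 25-50 GB range above; multiply by the
# average replica count for a realistic fleet.
```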

Cilium caveat: eBPF alone can't parse HTTP/gRPC or terminate mTLS – Cilium still uses Envoy for L7, so the perf delta vs Istio at L7 is small.

Decision tree:

Tiny team, just want mTLS + observability      → Linkerd
Already on Cilium CNI, want unified            → Cilium Service Mesh
Need full traffic mgmt (canary, mirror, fault) → Istio Ambient

6.5 Ingress + Gateway API (the Ingress era is ending)

Ingress-NGINX official maintenance halt March 2026. Gateway API is the K8s-official successor.

Gateway API provides:

  • Protocol-agnostic (HTTP, TCP, gRPC, TLS passthrough)
  • Role-split: GatewayClass (provider) β†’ Gateway (cluster operator) β†’ HTTPRoute (app dev)
  • Built-in canary/blue-green via weighted routes
  • Both north-south AND east-west

Ingress controllers / Gateway implementations:

| Implementation | Notes |
|---|---|
| Envoy Gateway | Reference implementation; CNCF |
| Istio | Native Gateway API support (replaces Istio VirtualService for new work) |
| NGINX Gateway Fabric | NGINX-backed, replaces ingress-nginx |
| Cilium Gateway | CNI-integrated |
| Traefik | Long-time leader for Ingress; Gateway API supported |

Migration: ingress2gateway 1.0 (2026) translates Ingress + annotations → Gateway API resources.

6.6 Operators

Operator SDK (Red Hat)        → Go/Helm/Ansible scaffolding
Kubebuilder                   → upstream K8s SIG; cleaner Go
KUDO                          → declarative operator definition (project now inactive)
metacontroller                → lightweight webhook-based hooks (controllers in any language)

When to write an operator: state machine that doesn't fit Deployment (e.g., DB clustering, leader election, custom backup).

When NOT: just templating → use Helm/Kustomize.

6.7 Multi-cluster – Karmada vs Cluster API vs OCM

KubeFed is EOL (no commits since 2020).

| Tool | Approach |
|---|---|
| Karmada | CNCF Incubation; multi-cluster scheduling + propagation policy. v1.15 (Oct 2025) adds multi-template workload awareness + structured logging |
| Cluster API (CAPI) | Declarative cluster lifecycle (CAPA AWS, CAPG GCP, CAPZ Azure providers) |
| Open Cluster Management (OCM) | Red Hat-led; ACM commercial product |
| Anthos / GKE Enterprise | GCP-managed; folds in Config Sync + Mesh + Policy |
| Azure Arc | Azure-managed; brings Azure Policy/Monitor to any cluster |

Pattern: CAPI provisions clusters, Karmada propagates workloads, ArgoCD reconciles config.

6.8 Cost – Kubecost vs OpenCost

  • OpenCost (Apache 2.0) β€” free, single-cluster focus, real-time allocation by pod/namespace/controller, multi-cloud (AWS/GCP/Azure). Now ships with built-in MCP server (2025) for AI agent access.
  • Kubecost (IBM-owned post-2024 acquisition) β€” adds budgets, RBAC, multi-cluster aggregation, automated cost policies. Starts $449/mo, enterprise on quote.

6.9 Karpenter + Spot + Graviton

Real customer outcomes:

  • Tinybird: 20% AWS bill reduction with EKS+Karpenter+Spot
  • Series B SaaS (200 microservices): $52k β†’ $23k/mo (56%) with Graviton mix + Karpenter + Spot
  • One reported migration: $50k β†’ $22k/mo Karpenter + Spot + VPA

NodePool best practices:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["arm64", "amd64"]  # Graviton preferred but allow x86 fallback
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["m", "c", "r"]
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values: ["5"]  # generation 6 or newer (m6g+, c6g+, r6g+); Gt is strictly greater-than
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 1000

6.10 Helm chart example for a service

# values.yaml
image:
  repository: ghcr.io/org/api
  tag: ""  # set by ArgoCD Image Updater or via CI
  pullPolicy: IfNotPresent

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    memory: 512Mi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 30
  targetCPUUtilizationPercentage: 70

serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123:role/api-irsa  # IRSA for AWS

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 65532
  fsGroup: 65532

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault

podDisruptionBudget:
  enabled: true
  minAvailable: 2

networkPolicy:
  enabled: true
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: gateway

6.11 Training corpus for K8s

- kubernetes/kubernetes (source + design proposals KEPs)
- kubernetes/website (docs)
- helm/charts (deprecated) + bitnami/charts + community charts
- argo-cd/argo-cd repo + examples
- karmada-io/karmada
- cilium/cilium (eBPF code + e2e tests)
- istio/istio
- linkerd/linkerd2
- aws/karpenter-provider-aws
- backstage/backstage source + plugins
- run-x/awesome-kubernetes
- KubeCon talk transcripts (CNCF YouTube; can transcribe via Whisper)

Eval target: 70% on K8s-Bench (manifest validity + Helm chart that helm template validates + ArgoCD Application that syncs + NetworkPolicy that locks down by default).


7. Internal Developer Platform (IDP)

7.1 IDP landscape (2025)

| Tool | Type | Strength | TTV (time-to-value) |
|---|---|---|---|
| Backstage (Spotify, CNCF) | OSS framework, build-it-yourself portal | Most flexible; 120+ Spotify-internal plugins | 3-6 months |
| Port | Commercial SaaS portal | No-code, fast to deploy | Days |
| Cortex | Commercial: service ownership + scorecards | Best for >50-eng orgs needing governance | Weeks |
| OpsLevel | Commercial: quality scorecards | Strong dashboards | Weeks |
| Humanitec | Platform orchestrator (NOT a portal) | Backend that resolves Score files into infra | Weeks |
| Encore | All-in-one (codegen + infra) | Strong opinionated dev workflow | Days |
| Cloudomation | Workflow automation IDP | Low-code for non-K8s orgs | Days |

Key mental model: Portal (Backstage/Port) ≠ Orchestrator (Humanitec). You often need BOTH – portal as UI, orchestrator as the backend that creates the actual cloud resources.

7.2 Backstage core

Catalog            → entities (Component, System, API, Resource, Group, User)
TechDocs           → MkDocs-based, lives next to code
Software Templates → Cookiecutter-style scaffolds (repo + IaC + pipeline + DB)
Search             → indexes catalog + docs
RBAC               → Spotify's RBAC plugin (commercial)
Soundcheck         → Spotify's tech-standards/scorecard plugin (commercial)
Insights           → adoption analytics (commercial)
Cloud Backstage    → managed hosted (commercial)

Open-source plus commercial Spotify Portal (RBAC, Insights, Soundcheck) = "production-ready Backstage."

catalog-info.yaml example:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: orders-api
  description: Order management service
  annotations:
    github.com/project-slug: org/orders-api
    backstage.io/techdocs-ref: dir:.
    pagerduty.com/integration-key: ${SECRET_PD}
    sonarqube.org/project-key: org_orders-api
    grafana/dashboard-selector: "tags @> 'orders'"
  tags: [java, spring-boot, payments-domain]
spec:
  type: service
  lifecycle: production
  owner: payments-team
  system: payments
  providesApis: [orders-rest-api]
  consumesApis: [users-rest-api]
  dependsOn: [resource:orders-db]

7.3 Score (CNCF, 2024) – workload spec

Score is a platform-agnostic workload spec. The developer writes ONE YAML; the platform team's score-compose / score-helm / score-k8s translates it.

# score.yaml
apiVersion: score.dev/v1b1
metadata:
  name: orders-api
containers:
  api:
    image: ghcr.io/org/orders-api:${TAG}
    variables:
      DATABASE_URL: postgres://${secrets.DB_PASSWORD}@${resources.db.host}/orders
      REDIS_URL: ${resources.cache.url}
    resources:
      requests: { cpu: "100m", memory: "256Mi" }
service:
  ports:
    web: { port: 8080 }
resources:
  db:
    type: postgres
  cache:
    type: redis
  route:
    type: dns
    params: { host: orders.example.com }

The platform team configures resource definitions (e.g., db.postgres → AWS RDS via Crossplane) – devs don't see/care.

7.4 OAM vs Score

OAM is broader (whole-app model with traits + scopes + components); Score is single-workload + simpler. Score is winning in 2025 because of its narrower scope and CNCF backing.

7.5 Humanitec orchestrator pattern

Developer:    score.yaml in repo
GitOps:       commit → CI → calls Humanitec API
Humanitec:    resolves score against Resource Definitions
              → creates EKS Deployment + RDS + Redis + Route53 record
Platform:     defines Resource Definitions (e.g., postgres → AWS RDS via TF/Crossplane)
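The resolution step is essentially a typed lookup table. A toy Python stand-in (the registry, keys, and provisioner names are all illustrative, not the Humanitec API):

```python
# Hypothetical resource-definition registry: (resource type, environment) -> provisioner.
RESOURCE_DEFINITIONS = {
    ("postgres", "prod"): "aws-rds-via-crossplane",
    ("postgres", "dev"):  "cloudnativepg-in-cluster",
    ("redis", "prod"):    "elasticache-via-terraform",
}

def resolve(resource_type: str, env: str) -> str:
    """Pick the provisioner backing a Score resource type in a given environment."""
    try:
        return RESOURCE_DEFINITIONS[(resource_type, env)]
    except KeyError:
        raise LookupError(f"no resource definition for {resource_type}/{env}") from None

assert resolve("postgres", "prod") == "aws-rds-via-crossplane"
assert resolve("postgres", "dev") == "cloudnativepg-in-cluster"
```

This is why the same score.yaml can yield RDS in prod but an in-cluster Postgres in dev: only the registry changes.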

7.6 Training corpus for IDP

- backstage/backstage source + ALL community plugins (roadie/* / spotify/*)
- score-spec/spec + reference implementations (score-compose/score-helm/score-k8s)
- Humanitec docs + Resource Definition examples
- Port templates marketplace
- Cortex YAML scorecard library
- platformengineering.org community articles
- KubeCon Platform Engineering Day talks (transcripts)

8. Edge + Serverless Platforms

8.1 Latency / cold-start reality (2025)

| Platform | P50 latency | Cold start | POPs |
|---|---|---|---|
| Cloudflare Workers | 10–30ms | <1ms (V8 isolates) | 330+ |
| Vercel Edge Functions | <50ms | sub-50ms | 18 (uses Lambda@Edge under the hood in some regions) |
| Lambda@Edge (Node) | 50–80ms | 250–800ms | AWS edge POPs |
| Lambda@Edge (Python) | similar | 400–1200ms | same |
| Fastly Compute@Edge (WASM) | ~5–10ms | <1ms | 80+ |
| Deno Deploy | low | low | global |
| Bun runtime | n/a | fastest cold start of any Node-compatible runtime | self-hosted |

Cloudflare Workers benchmark ~441% faster than Lambda at p95, and the free tier includes unlimited bandwidth.

8.2 Cloudflare ecosystem

Workers       β†’ V8 isolate functions (JS/TS/WASM)
Pages         β†’ static + Workers (serverless full-stack)
R2            β†’ S3-compatible object storage, zero egress
D1            β†’ serverless SQLite (replicated)
KV            β†’ eventually-consistent KV
Durable Objects β†’ strongly-consistent stateful primitives
Queues        β†’ managed message queue
Workers AI    β†’ run LLMs at the edge (Llama, Whisper, Stable Diffusion)
Vectorize     β†’ vector DB (RAG at edge)
Hyperdrive    β†’ connection pooler for Postgres/MySQL behind edge
Stream        β†’ video transcoding + delivery

8.3 Vercel ecosystem

Edge Functions    → V8-based Edge Runtime (Workers-compatible)
Edge Middleware   → runs BEFORE the request reaches a function
Serverless Funcs  → Node / Python / Go / Ruby; AWS Lambda under the hood
Postgres          → managed Postgres (built on Neon)
KV                → built on Upstash Redis
Blob              → object storage

8.4 Multi-region edge strategies

Pattern 1 β€” Edge cache + origin in primary region
  Cloudflare cache β†’ S3/Lambda in us-east-1
  Trade: simple, 100ms+ for cache misses

Pattern 2 β€” Workers + DB-at-edge
  CF Workers β†’ D1/Hyperdrive
  Trade: edge writes; eventual consistency
  Use: read-heavy auth, profile, feature flags

Pattern 3 β€” Multi-region active/active
  CF LB β†’ Workers in EU + US + APAC β†’ regional Aurora DSQL
  Trade: cost 2x; near-zero RTO across regions

Pattern 4 β€” Global table + edge CDN
  CF Cache β†’ Lambda β†’ DynamoDB Global Tables (multi-master)
  Trade: replication lag; eventual consistency

8.5 WASM serverless (2025–2026)

  • WASI 0.2 (Component Model) GA β†’ portable across runtimes (Wasmtime, Spin, wasmCloud, Wasmer).
  • Cold starts: microseconds (vs 100–500ms for containers).
  • Major clouds now offer Wasm-based FaaS as a mainstream option.
  • Wing language shutdown April 2025 β€” OSS code lives on but no commercial backing.

9. Database Platform

9.1 Postgres options

| Service | Multi-region | Best for |
|---|---|---|
| RDS Postgres | Read replicas | Standard managed |
| Aurora Postgres | Cross-region read replicas + Global Database (1 writer) | Standard scale-out |
| Aurora DSQL | Active/active, strong consistency (GA May 2025) | New globally distributed apps |
| AlloyDB (GCP) | HA + read pool nodes | Postgres-compatible OLTP+OLAP on GCP |
| Cloud SQL (GCP) | Single-region HA | Standard managed |
| Azure Database for PostgreSQL Flexible Server | Single-region HA | Standard managed |
| Neon | Branching (Git-like) | Dev velocity |
| Supabase | Postgres + auth + realtime | Full BaaS |
| Crunchy Bridge | Multi-cloud Postgres | Vendor-neutral |
| PlanetScale | Now Postgres + Vitess | Sharded scale-out |

9.2 Aurora DSQL deep cuts (GA May 2025)

  • Disaggregated architecture: query processor + adjudicator + journal + crossbar β€” each scales independently.
  • 99.99% single-region SLA, 99.999% multi-region.
  • Active/active multi-master (peers); third region as log-only witness.
  • Region groupings only β€” US (us-east-1, us-east-2, us-west-2), EU (eu-west-1/2/3), APAC (ap-northeast-1/2/3).
  • No cross-continent yet.
  • PostgreSQL wire-compatible.

9.3 Distributed SQL (NewSQL)

| DB | TPC-C (TPS) | PG compat | Multi-region |
|---|---|---|---|
| CockroachDB | 45k | wire only | Best with geo-partitioning |
| YugabyteDB | 48k | full (reuses the PG query layer) | Strong with row-level geo |
| TiDB | 40k+ (write-heavy lead) | MySQL primary | ✓ |
| Aurora DSQL | benchmarked fastest by AWS | wire | Region-grouped |
| Spanner | 1M+ at scale | GoogleSQL or PG dialect | Global by design |

YugabyteDB wins for PG migration (full compat). CockroachDB wins for geo-partitioning. Spanner remains gold standard at hyperscale.

9.4 Vitess (MySQL sharding)

  • Open-source MySQL sharding system.
  • Powers YouTube, Slack, GitHub, PlanetScale.
  • Functions: query routing, online schema migration (with gh-ost), connection pooling, transparent sharding.
  • Newer alternative: CockroachDB / Aurora DSQL eliminate manual sharding.

9.5 NoSQL

DynamoDB             β†’ AWS, single-digit-ms; on-demand or provisioned
DynamoDB Global Tables β†’ multi-region multi-master (last-writer-wins)
Spanner              β†’ strongly consistent global
Cosmos DB            β†’ multi-model (SQL/MongoDB/Cassandra/Gremlin); 5 consistency levels
Cassandra/Scylla     β†’ wide-column; high write throughput
MongoDB Atlas        β†’ document; managed across all 3 clouds

9.6 Vector DBs (2025 production benchmarks)

| DB | p99 latency | QPS | Notable |
|---|---|---|---|
| Qdrant | 2ms | 12k | Best low latency; cloud from $25/mo |
| Milvus / Zilliz | 5ms | 8k | Billion-scale; built-in BM25 + dense (30x faster than Elasticsearch) |
| Pinecone | 8ms | 5k | Fully managed, 99% recall |
| Weaviate | 10ms | 4k | BlockMax WAND (10x keyword speed); MUVERA multi-vector |
| pgvector | varies | depends | If you already run Postgres |
| OpenSearch k-NN | varies | depends | If you already run OpenSearch |

9.7 Migration tools (2025)

| Tool | Approach | Best for |
|---|---|---|
| Liquibase | Imperative changelogs (XML/YAML/JSON/SQL); FSL license post-v5; AI rollback assist (2025) | Multi-DB enterprise |
| Flyway | Numbered SQL files; Java-ecosystem standard; Teams tier discontinued 2025 | Java teams |
| Atlas (atlasgo.io) | Declarative HCL + computed migration plan | Terraform-style schema-as-code |
| Prisma Migrate | Declarative, ORM-coupled | Node/TS apps |
| goose | Plain SQL/Go migrations | Go services |

Atlas is the modern recommendation β€” same paradigm as Terraform.


10. Networking Deep

10.1 DNS

Route53          β†’ AWS native, latency/geo/failover; alias records to AWS resources
Cloud DNS        β†’ GCP native
Azure DNS        β†’ Azure native
Cloudflare DNS   β†’ fastest authoritative (1.1.1.1 is recursive); free
NS1 / Constellix β†’ enterprise multi-cloud DNS, advanced traffic steering

10.2 CDN performance (Cloudflare 95p TTFB benchmark, Nov 2024–Mar 2025)

  • Cloudflare fastest in ~48% of top 1000 networks.
  • Fastly extremely close in many networks (e.g., +0.2% lead on Comcast).
  • CloudFront strong inside AWS-heavy stacks (free egress to AWS origins).
  • All have edge compute now: Workers / Compute@Edge / Lambda@Edge.

10.3 Load balancers (AWS)

ALB (L7)    → HTTP/HTTPS/gRPC; WAF integration; target-group flexibility
NLB (L4)    → TCP/TLS/UDP; static IPs; millions of RPS
GWLB        → traffic inspection (third-party firewall in the chain)
ELB Classic → legacy, avoid
Global Accelerator → anycast IPs in front of ALB/NLB for global traffic

10.4 Zero Trust Network Access (2025)

| Tool | Architecture | Best fit |
|---|---|---|
| Tailscale | WireGuard mesh + identity overlay | Fastest dev access; great for SSH/RDP/DB |
| Twingate | Layer-4 ZTNA (no mesh); resource-grained | App-name + group-based access |
| Cloudflare Access + WARP | SASE (Access for apps + Gateway for SWG) | When Cloudflare is the wider stack |
| Zscaler | Enterprise SASE | Big-org compliance |
| Pomerium | Self-hosted reverse-proxy ZTNA | OSS option |

Tailscale wins on dev velocity (sign in, get tailnet); Cloudflare Access wins on full SASE; Twingate wins on resource granularity.

10.5 WAF

AWS WAF              β†’ tied to CloudFront/ALB/API Gateway
Cloudflare WAF       β†’ in front of any origin
Azure Front Door WAF β†’ tied to AFD
Akamai App & API Protector β†’ enterprise

10.6 DDoS protection

AWS Shield Advanced  β†’ $3000/mo + transfer; 24/7 SRT
Cloudflare           β†’ unmetered DDoS protection (free tier!)
Google Cloud Armor   β†’ tier-based
Azure DDoS Protection Standard β†’ per-resource

11. FinOps + Cost Engineering

11.1 FinOps Foundation Framework 2025 (Inform / Optimize / Operate)

INFORM   β†’ Visibility, allocation, benchmarking, budgeting, forecasting
OPTIMIZE β†’ Identify and execute waste reduction
OPERATE  β†’ KPI tracking, governance policies aligned with business

11.2 2025 framework changes β€” Scopes

The 2025 Framework adds Scopes as a structural element. Scopes define context: Public Cloud, SaaS (Snowflake, Salesforce), GenAI (LLM API spend), Data Center, Private Cloud. Each capability is now applied per scope.

11.3 Cost allocation tags (mandatory at provision time)

Required tags for every resource:
- Environment  : prod/staging/dev/sandbox
- Owner        : team-name (matches catalog)
- CostCenter   : finance code
- Project      : product/feature
- DataClass    : public/internal/confidential/regulated

Enforce via:

  • AWS: SCP aws:RequestTag/X (deny on creation), Tag Policies
  • GCP: Org Policy required labels
  • Azure: Azure Policy required tags

11.4 Showback / chargeback

Showback   β†’ "your team used $X" (no actual billing)
Chargeback β†’ cross-charge cost center (real finance impact)

Tools: Vantage, CloudHealth, Apptio Cloudability, Kubecost (k8s-specific), Infracost (pre-deploy IaC estimate).
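Showback is just allocation arithmetic over tagged billing line items. A minimal sketch, assuming each line item carries the Owner tag from §11.3 (the line-item shape here is illustrative, not a specific billing-export format):

```python
from collections import defaultdict

# Showback sketch: roll tagged cost line items up to the owning team.
# Untagged spend surfaces as UNALLOCATED -- the number FinOps teams
# drive toward zero via the tag enforcement in section 11.3.
line_items = [
    {"service": "AmazonEC2", "cost": 1200.0, "tags": {"Owner": "payments-team"}},
    {"service": "AmazonRDS", "cost": 800.0,  "tags": {"Owner": "payments-team"}},
    {"service": "AmazonEKS", "cost": 450.0,  "tags": {"Owner": "platform-team"}},
    {"service": "AmazonS3",  "cost": 90.0,   "tags": {}},  # untagged
]

def showback(items) -> dict:
    totals = defaultdict(float)
    for item in items:
        owner = item["tags"].get("Owner", "UNALLOCATED")
        totals[owner] += item["cost"]
    return dict(totals)

print(showback(line_items))
# {'payments-team': 2000.0, 'platform-team': 450.0, 'UNALLOCATED': 90.0}
```

Chargeback is the same aggregation with the totals posted to real cost centers instead of a report.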

11.5 Anomaly detection

AWS Cost Anomaly Detection (free)
Vantage anomalies + alerts (commercial)
CloudZero / Spend.io
ProsperOps                β†’ automated commitment management

11.6 Right-sizing automation

AWS Compute Optimizer    β†’ free; recs for EC2, Lambda, EBS, ASG, ECS
GCP Recommender          β†’ equivalent
Azure Advisor            β†’ equivalent
ScaleOps / StormForge    β†’ K8s VPA recommender for prod

11.7 Spot orchestration (Karpenter, ProsperOps)

Already covered in Β§6.9. Karpenter handles AWS Spot natively; CAST AI / ScaleOps cover cross-cloud.

11.8 Training corpus for FinOps

- FinOps Foundation framework docs (finops.org/framework)
- AWS / GCP / Azure cost optimization whitepapers
- Vantage / CloudZero / Apptio public benchmarks
- KubeCon FinOps track talks (transcripts)
- Real customer cost-cut case studies (already collected: Tinybird, Series B SaaS examples)

12. 2025–2026 Platform Engineering Trends

12.1 Internal LLM gateways (the 2026 must-have)

| Tool | Type | Key strength | Cost |
|---|---|---|---|
| LiteLLM | OSS, self-host | OpenAI-compatible; cheapest at $10k+ MRR; 100+ providers | Free + infra |
| Portkey | SaaS or self-host | SOC2/HIPAA/ISO 27001; observability; 250+ LLMs | $49/mo+ |
| OpenRouter | SaaS | Pay-per-token; consumer-friendly | 5% markup |
| Helicone | OSS observability | Caching + analytics | Free + cloud |
| Truefoundry / Bifrost | SaaS | LLM gateway + ML platform | Quote |

LiteLLM is the default for orgs serious about cost β€” runs as your own proxy, no markup.

12.2 AI agents in platform engineering

  • Resolve.ai β€” AI SRE; auto-investigates alerts, RCA in minutes, MTTR -80%; customers: Coinbase, DoorDash, Toast, Zscaler. $40M Series A extension at $1.5B (2026).
  • Aviator (aviator.co) β€” AI code review + merge queues + deployment.
  • OpenText DevOps Aviator β€” AI for performance engineering scripts.
  • Cursor / Sourcegraph Cody / GitHub Copilot Workspace β€” IDE-side coding agents.
  • Codeium / Tabnine / Continue β€” open-source IDE agents.

12.3 Per-PR ephemeral environments

| Tool | Approach |
|---|---|
| Coherence | PR comment with auto-preview URL; spot-backed for cost |
| Uffizzi | OSS + cloud; vCluster-based isolated environments |
| Render Previews | Built in to Render |
| Vercel Previews | Built in to Vercel |
| Netlify Deploy Previews | Built in to Netlify |
| Argo CD ApplicationSet PR Generator | OSS, K8s-native |
| vCluster + Argo CD | DIY pattern; cheapest at scale |

Best practice: every PR gets a unique URL, smoke tests run against it, and reviewers can click through before merge.

12.4 WASM-based services

  • Production runtimes: Wasmtime (Bytecode Alliance), wasmCloud, Spin (Fermyon), Wasmer.
  • Use cases: edge serverless, plugin systems (Envoy filters, Istio extensions, Postgres extensions), embedded scripting.
  • Platforms moving to Wasm: Fastly Compute@Edge (WASM-only), Cloudflare Workers (V8 + WASM), Spin Hub.

12.5 AI-native databases / observability

LangSmith        β†’ LLM tracing + evals (LangChain)
Helicone         β†’ LLM tracing + caching (OSS option)
Phoenix (Arize)  β†’ OSS LLM observability
Langfuse         β†’ OSS, self-host LLM observability
Weights & Biases Weave β†’ MLOps + LLM

12.6 Autonomous Cloud Engineer (the Surrogate-1 mission)

The path is converging on:

  1. MCP (Model Context Protocol) β€” standardizes how agents pull cloud state (AWS documentation MCP server, OpenCost MCP (2025), terraform-mcp-server).
  2. Multi-agent systems β€” research / planner / executor / critic agents (CrewAI, LangGraph, AutoGen).
  3. Tool-using agents β€” agents that call terraform plan, kubectl apply, aws sts get-caller-identity, gh pr create.

Surrogate-1's training MUST include MCP-call patterns + tool-use traces.
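One way to represent such traces for SFT is a flat list of (tool, args, observation) steps. The schema below is our assumption for illustration, not an MCP wire format; tool outputs are mocked:

```python
import json

# Hypothetical tool-use trace schema for SFT data -- NOT the MCP wire
# format. Each step records the tool invoked, its CLI args, and the
# observation the agent conditioned on next. Observations are mocked.
trace = [
    {"tool": "aws", "args": ["sts", "get-caller-identity"],
     "observation": "arn:aws:iam::123456789012:role/deployer"},
    {"tool": "terraform", "args": ["plan", "-out=tf.plan"],
     "observation": "Plan: 12 to add, 0 to change, 0 to destroy."},
    {"tool": "terraform", "args": ["apply", "tf.plan"],
     "observation": "Apply complete! Resources: 12 added."},
]

def to_training_example(goal: str, steps: list) -> str:
    """Serialize a goal + tool-use trace into one JSONL training line."""
    return json.dumps({"goal": goal, "steps": steps})

line = to_training_example("Provision the 3-AZ VPC stack", trace)
assert json.loads(line)["steps"][1]["tool"] == "terraform"
```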


13. Training Data Sources

13.1 Curated GitHub repos

Cloud
- awesome-aws (donnemartin)
- awesome-gcp (GoogleCloudPlatform/awesome-google-cloud)
- awesome-azure (kristofferandreasen/awesome-azure)
- aws-samples/* (8000+ official AWS samples)
- GoogleCloudPlatform/* (1500+ GCP samples)
- Azure-Samples/*

K8s
- run-x/awesome-kubernetes
- ramitsurana/awesome-kubernetes
- tomhuang12/awesome-k8s-resources
- kubernetes/kubernetes (source + KEPs)
- kubernetes-sigs/* (CAPI, Gateway API, Karpenter)
- helm/charts (deprecated but reference)
- bitnami/charts
- argoproj/argo-cd

IaC
- hashicorp/terraform
- terraform-aws-modules/* (40+ official modules)
- terraform-google-modules/*
- Azure/terraform-azurerm-* (AVM)
- pulumi/examples
- aws/aws-cdk
- crossplane/crossplane + upbound/configurations

Platform
- backstage/backstage + roadie/* + spotify/* community plugins
- score-spec/spec
- humanitec-architecture/*

Eval
- codefuse-ai/codefuse-devops-eval
- IaC-Eval (academic)
- NL2Bash

13.2 Reddit communities (curate top-voted threads, last 2 yrs)

  • r/devops, r/aws, r/AZURE, r/googlecloud, r/kubernetes, r/Terraform, r/sysadmin, r/sre, r/platformengineering

13.3 Conference talks (transcribe via Whisper, MIT-licensed for TLP/CNCF)

  • KubeCon + CloudNativeCon (CNCF YouTube; ~600 talks/year)
  • AWS re:Invent (several thousand sessions; breakouts archived)
  • Google Cloud Next (annual)
  • Microsoft Ignite / Build
  • HashiConf
  • PlatformCon (annual, online)
  • SREcon (USENIX)

13.4 Public datasets on HuggingFace

- CatOwl/Terraform                   (Terraform code corpus)
- nvidia/OpenCodeReasoning            (reasoning over code)
- bigcode/the-stack-v2                (filtered code, has IaC files)
- mhhmm/codealpaca-iac                (instruction tuning for IaC)
- Custom: collect from terraform-aws-modules/eks/aws + variants

13.5 Documentation (for retrieval / SFT context)

  • AWS docs (full), GCP docs, Azure docs (Microsoft Learn), CNCF docs, K8s docs, Helm docs, Terraform/OpenTofu docs.
  • AWS Well-Architected Framework PDFs (one per pillar).
  • Google Cloud Architecture Framework.
  • Azure Cloud Adoption Framework + Well-Architected Framework.

13.6 Synthesized data (recommended approach)

For Surrogate-1 v2:

1. Take each terraform-aws-modules example
2. Mutate: change region, instance type, AZ count, subnet sizes
3. Build instruction format: "Build me a 3-AZ VPC in us-west-2 with public+private+db subnets using terraform-aws-modules/vpc/aws"
4. Output: working main.tf + outputs.tf + variables.tf

5. For each AWS service, generate:
   - "What is X" Q&A from official docs
   - "Compare X vs Y" from official docs
   - "Migrate from X to Y" code examples

6. Multi-step trajectories:
   - "Build me a SaaS platform on AWS" β†’ 30+ step reasoning trace through architecture decisions

Total target: ~100k–250k cloud/platform instruction-tuning examples.
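Steps 1–4 above can be sketched as simple template mutation. The region/AZ lists and the truncated module template are illustrative; a real pipeline would mutate the full terraform-aws-modules examples:

```python
import itertools

# Sketch of steps 1-4: mutate a canonical terraform-aws-modules example
# into many (instruction, code) pairs. Values are illustrative.
REGIONS = ["us-west-2", "eu-west-1", "ap-southeast-1"]
AZ_COUNTS = [2, 3]

TEMPLATE = '''module "vpc" {{
  source = "terraform-aws-modules/vpc/aws"
  name   = "demo"
  cidr   = "10.0.0.0/16"
  azs    = {azs}
}}'''

def synthesize():
    for region, n_az in itertools.product(REGIONS, AZ_COUNTS):
        azs = ", ".join(f'"{region}{s}"' for s in "abc"[:n_az])
        instruction = (f"Build me a {n_az}-AZ VPC in {region} "
                       "using terraform-aws-modules/vpc/aws")
        yield {"instruction": instruction, "output": TEMPLATE.format(azs=f"[{azs}]")}

examples = list(synthesize())
print(len(examples))  # 6 (3 regions x 2 AZ counts)
```

Each emitted `output` would then be gated through `terraform validate` before entering the corpus, so only compiling mutations survive.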


14. Eval Benchmarks

14.1 Existing benchmarks

| Benchmark | What it tests | Surrogate-1 fit |
|---|---|---|
| codefuse-ai/codefuse-devops-eval | DevOps Q&A (multiple choice) | Quick sanity check |
| IaC-Eval (academic) | Terraform generation correctness | Direct fit |
| KubeBench (community) | K8s manifest validity | Direct fit |
| NL2Bash | Bash commands from NL | Tooling sub-skill |
| BIG-Bench (subset) | Various reasoning | General |
| HumanEval / MBPP | General coding | Already passes (Qwen2.5-Coder-7B baseline) |

14.2 Custom Surrogate-1 v2 evals (we author)

Surrogate-1 Cloud Eval v2:
1. Terraform generation (200 prompts, varying complexity)
   - Pass = `terraform validate` + `terraform plan` succeeds
   - Score: % passing Γ— % correct logical structure (judge LLM)

2. Helm chart authoring (50 prompts)
   - Pass = `helm template` produces valid YAML
   - Score: % passing Γ— `kubeval` validation rate

3. CDK/CFN authoring (100 prompts)
   - Pass = `cdk synth` succeeds
   - Score: + `cfn-lint` clean rate, + `cfn-guard` policy pass

4. ArgoCD Application + Kustomize (50 prompts)
   - Pass = ArgoCD CLI dry-run succeeds

5. Multi-cloud DR scenario (30 prompts)
   - Open-ended: "Design active/passive across AWS+GCP for a SaaS, RTO=15min, RPO=1min"
   - Score: judged by GPT-5 / Claude / human reviewer on architecture quality

6. Cost optimization (50 prompts)
   - Given a `terraform plan` output, return cost reductions (Graviton swap, RIs/SPs, Spot)
   - Score: judged on $$ accuracy (vs Infracost ground truth)

7. K8s troubleshooting (50 prompts)
   - Given pod logs + describe output, return root cause + fix
   - Score: % matching ground truth

8. Tool-use traces (100 prompts)
   - Given a goal, agent must call `aws cli` / `kubectl` / `terraform` correctly
   - Score: % achieving goal (sandbox eval)

Total: ~630 prompts. Run with rubric judges (GPT-5/Claude). Surrogate-1 v2 target: 65% overall (above Qwen2.5-Coder-7B baseline of ~38%).
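The "pass rate Γ— judged quality" scoring used above rolls up to an overall number as a prompt-weighted mean across tracks. A sketch with three of the eight tracks; the pass-rate and quality numbers are illustrative, not real Surrogate-1 results:

```python
# Composite eval score sketch: each track contributes
# pass_rate * quality (judge LLM), weighted by its prompt count.
# Numbers below are illustrative, NOT real Surrogate-1 results.
tracks = [
    {"name": "terraform", "prompts": 200, "pass_rate": 0.70, "quality": 0.80},
    {"name": "helm",      "prompts": 50,  "pass_rate": 0.60, "quality": 0.90},
    {"name": "cdk",       "prompts": 100, "pass_rate": 0.55, "quality": 0.85},
]

def overall(tracks) -> float:
    total = sum(t["prompts"] for t in tracks)
    return sum(t["prompts"] * t["pass_rate"] * t["quality"] for t in tracks) / total

print(round(overall(tracks), 3))  # 0.531
```

Extending `tracks` to all eight sub-evals (~630 prompts) yields the single number compared against the 65% ship gate.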

14.3 Capability tiers (target)

| Tier | Capability | v2 Target |
|---|---|---|
| 1 | Recognize + classify cloud services | 95% |
| 2 | Author single-file IaC (Terraform/CDK/Helm) | 75% |
| 3 | Author multi-file project (VPC + EKS + RDS + ArgoCD) | 60% |
| 4 | End-to-end design trace ("build SaaS on AWS") | 50% |
| 5 | Multi-cloud DR design + tool execution | 35% (stretch) |

v2 Curriculum Integration Plan

For the v2 LoRA fine-tune of Qwen2.5-Coder-7B β†’ Surrogate-1:

Data mix (target ~250k instruction examples)

40%  IaC generation (Terraform / OpenTofu / CDK / Pulumi / Bicep / Crossplane)
20%  K8s authoring (Helm / Kustomize / ArgoCD / Karpenter)
15%  Cloud architecture Q&A (mined from cert prep + docs)
10%  Cost optimization scenarios (FinOps mined + synthesized)
10%  IDP / Backstage / Score / Humanitec patterns
5%   Multi-step tool-use traces (terraform plan β†’ fix β†’ apply)
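The mix above can be materialized as per-bucket example counts; a sketch (bucket names are ours, mirroring the list):

```python
# Turn the percentage mix into per-bucket example counts for a
# 250k-example target. Bucket names mirror the data-mix list above.
TARGET = 250_000
MIX = {
    "iac_generation": 0.40,
    "k8s_authoring": 0.20,
    "cloud_arch_qa": 0.15,
    "cost_optimization": 0.10,
    "idp_patterns": 0.10,
    "tool_use_traces": 0.05,
}

counts = {bucket: int(TARGET * share) for bucket, share in MIX.items()}
assert sum(counts.values()) == TARGET  # shares sum to exactly 1.0
print(counts["iac_generation"])  # 100000
```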

Key sources (direct ingestion priorities)

1. terraform-aws-modules/* + terraform-google-modules/* + Azure AVM (canonical IaC)
2. backstage/backstage source + plugin examples
3. AWS Well-Architected docs (all pillars + lenses)
4. GCP Cloud Adoption Framework
5. CNCF KubeCon transcripts (Whisper-extracted)
6. score-spec + humanitec docs
7. OpenCost docs + MCP-pattern examples
8. Real customer post-mortems (Tinybird $-20k, Series-B SaaS $-29k)
9. IaC-Eval benchmark training set
10. CodeFuse DevOps-Eval training set

Eval gates

  • v2 cannot ship until β‰₯65% overall on Surrogate-1 Cloud Eval v2.
  • Tier-3 (multi-file) β‰₯60% is the practical bar for autonomous infra building.
  • Add MCP-tool-use trajectory eval (sandbox terraform/kubectl/aws calls).

Sources Consulted

  • AWS Well-Architected Framework (6 pillars docs, Sustainability pillar Nov 2024 refresh)
  • Terraform / OpenTofu best practices (Terramate, Spacelift, env0, Scalr 2025 articles)
  • Kubernetes 1.32-1.35 release notes; CNCF security blog Dec 2025
  • Backstage docs + Spotify Backstage portal blog (2025)
  • ArgoCD / FluxCD comparison articles (2025-2026 post-Weaveworks closure)
  • Crossplane v2.0 release blog + InfoQ article (Aug 2025)
  • Karpenter cost optimization blogs (Tinybird; Series-B SaaS case studies)
  • Cloudflare Workers / Vercel Edge / Lambda@Edge benchmarks (2025)
  • FinOps Foundation 2025 framework + Scopes update
  • Istio / Linkerd / Cilium 2025 benchmarks (deepness-lab academic paper)
  • Pulumi / Terraform / CDK / Bicep 2025 comparisons
  • CockroachDB / YugabyteDB / Spanner / Aurora DSQL 2025 benchmarks
  • AWS SAP-C02 / GCP PCA (Oct 2025 refresh) / Azure AZ-305 (April 2026 refresh)
  • Backstage / Port / Cortex / Humanitec IDP comparison (2025-2026)
  • Karmada v1.15 + KubeFed EOL + Cluster API
  • Coherence / Uffizzi ephemeral environments (2025)
  • AWS CDK best practices (CDK Refactor Sept 2025)
  • VPC Transit Gateway / PrivateLink hub-spoke patterns
  • Helm / Kustomize / Carvel comparison (Helm 4 Nov 2025)
  • terraform-aws-modules registry top downloads (May 2025 stats)
  • Liquibase / Flyway / Atlas migration tools (2025 license + features)
  • Aurora DSQL GA announcement (May 2025)
  • CDN benchmarks (Cloudflare 95p TTFB 2024-2025)
  • AWS Savings Plans / Reserved Instances June 2025 policy changes
  • IAM SCPs + Permission Boundaries + ABAC patterns
  • GKE / EKS / AKS managed K8s comparison (2025-2026)
  • terraform-aws-modules registry usage (vpc 126M, eks 96.3M downloads)
  • Vertex AI / BigQuery / Gemini integration (2025)
  • Resolve.ai AI SRE + Aviator (2025-2026)
  • LiteLLM / Portkey / OpenRouter LLM gateway comparison (2025)
  • Multi-cloud DR active/active vs active/passive patterns
  • Wing language shutdown (April 2025) + WASM serverless trends
  • Awesome-aws / awesome-kubernetes curated lists
  • Kubecost / OpenCost cost visibility (Kubecost IBM acquisition 2024)
  • Atlantis / Spacelift / Env0 / Terramate IaC platforms
  • Score spec + OAM workload specifications
  • Karpenter NodePool + Spot + Graviton best practices
  • Tailscale / Twingate / Cloudflare Access ZTNA comparison
  • Vector DB benchmarks (Pinecone / Weaviate / Qdrant / Milvus 2025)
  • AWS Copilot end-of-support (June 12 2026) + SAM + Amplify
  • Gateway API + ingress-nginx retirement (March 2026)
  • DevOps eval benchmarks + IaC-Eval academic benchmark