---
title: Cloud + Platform Engineering Deep Research for Surrogate-1 v2
date: 2026-04-29T00:00:00.000Z
purpose: >-
  Train Surrogate-1 (Qwen2.5-Coder-7B + LoRA) into a SOTA Cloud / Platform
  Engineer
scope: AWS + GCP + Azure + Edge + IDP + IaC + K8s + FinOps + Multi-cloud DR
---

Surrogate-1 SOTA Cloud + Platform Engineer Training Plan

This document is the canonical knowledge base used to design the v2 instruction-tuning curriculum for Surrogate-1. The model must, autonomously and end-to-end:

  1. Design + provision multi-cloud infrastructure (AWS, GCP, Azure, Cloudflare, Vercel)
  2. Author production-grade IaC (Terraform, OpenTofu, CDK, Pulumi, Bicep, Crossplane)
  3. Operate Kubernetes platforms (EKS / GKE / AKS) with GitOps + service mesh
  4. Build internal developer platforms (Backstage, Port, Score, Humanitec)
  5. Handle FinOps lifecycle (Inform / Optimize / Operate, 2025 + Scopes)
  6. Execute multi-cloud disaster recovery + global routing
  7. Stand up edge/serverless (Cloudflare Workers, Vercel Edge, Lambda@Edge)

The research is organized into 14 verticals. Each section closes with the training corpus + eval target for the v2 curriculum.


1. AWS Deep Mastery

1.1 Certification Scope (training-data anchors)

| Cert | Code | Topics | Why we mine it |
|---|---|---|---|
| Solutions Architect Associate | SAA-C03 | VPC, EC2, S3, RDS, Lambda, IAM basics | Foundational service catalog |
| Solutions Architect Pro | SAP-C02 | Multi-account, hybrid, migration, DR, cost-resilience | Most question banks for org-complexity |
| DevOps Engineer Pro | DOP-C02 | CI/CD, monitoring, IaC, governance | Pipelines + observability |
| Security Specialty | SCS-C02 | KMS, GuardDuty, Inspector, SCPs, IRSA | Hardening + compliance scenarios |
| Advanced Networking Specialty | ANS-C01 | Transit Gateway, Direct Connect, Cloud WAN | Multi-VPC + hybrid networking |

The SAP-C02 exam validates designing multi-account strategies, hybrid architectures, migration at scale, cost optimization, security, and resilience – exactly the Surrogate-1 scope. The exam has 65 scored + 10 unscored questions, passing score 750/1000.

1.2 Well-Architected Framework – 6 pillars (Sustainability added Dec 2021, refreshed Nov 2024)

1. Operational Excellence   – IaC, runbooks, observability, post-mortems
2. Security                 – IAM, encryption, network, IR
3. Reliability              – RTO/RPO, failover, multi-AZ/multi-region
4. Performance Efficiency   – right-sized compute, modern data services
5. Cost Optimization        – RIs/SPs/Spot/Graviton, lifecycle rules
6. Sustainability           – energy efficiency, region selection, idle cleanup

Lenses Surrogate-1 must recognize: Serverless, SaaS, Migration, Generative AI, IoT, Hybrid Networking, Financial Services, Streaming Media, ML.

1.3 Top 30 services for startup/SaaS workloads

Compute      : EC2, Lambda, Fargate, Batch, ECS, EKS, App Runner
Storage      : S3, EFS, FSx, EBS
DB           : RDS (Postgres/MySQL), Aurora, Aurora DSQL, DynamoDB, ElastiCache (Redis/Valkey), OpenSearch
Network      : VPC, Route53, CloudFront, ALB/NLB/GWLB, Transit Gateway, PrivateLink, API Gateway
Identity     : IAM, IAM Identity Center (SSO), Cognito, Organizations, Verified Permissions
Observability: CloudWatch, X-Ray, Managed Prometheus, Managed Grafana, OpenSearch
Security     : KMS, Secrets Manager, GuardDuty, Inspector, Security Hub, WAF, Shield
Data/AI      : Bedrock, SageMaker, Glue, Athena, Kinesis, MSK, Step Functions
Messaging    : SQS, SNS, EventBridge
DevTools     : CodePipeline, CodeBuild, CodeDeploy, CDK, SAM

1.4 VPC networking patterns

Hub-and-spoke with Transit Gateway – TGW is the managed hub-and-spoke service for VPCs and on-prem; centralizes routing without VPN overlays. Aligns with Well-Architected Reliability pillar REL02-BP04.

Centralized PrivateLink endpoints – Host interface VPC endpoints (e.g., for KMS, STS, Secrets Manager) in a single shared-services VPC. All spoke VPCs reach AWS APIs via TGW → endpoint VPC. Saves cost: one set of ~$7.30/month-per-AZ endpoints instead of one set per VPC.
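Back-of-the-envelope math for the consolidation (an illustrative Python sketch; the ~$7.30/month figure is the fixed per-endpoint-per-AZ cost above, and TGW data-processing and endpoint data-processing charges are deliberately ignored):

```python
# Interface endpoint fixed cost: ~$0.01/hr per endpoint per AZ, approx $7.30/month.
ENDPOINT_MONTHLY_PER_AZ = 7.30

def endpoint_cost(n_endpoints: int, n_azs: int, n_vpcs: int) -> float:
    """Monthly fixed cost when each of n_vpcs hosts its own endpoint set."""
    return ENDPOINT_MONTHLY_PER_AZ * n_endpoints * n_azs * n_vpcs

# 4 endpoints (kms, sts, secretsmanager, + one more) x 3 AZs, 10 spoke VPCs:
decentralized = endpoint_cost(4, 3, n_vpcs=10)  # ~$876/mo (endpoints in every spoke)
centralized   = endpoint_cost(4, 3, n_vpcs=1)   # ~$88/mo (one shared-services VPC)
```

The break-even depends on traffic volume, since TGW bills per GB processed on the path to the shared endpoints.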

Decision tree:

Two VPCs, low traffic, no transitive    → VPC Peering
Service consumed across many VPCs       → PrivateLink (endpoint service)
≥3 VPCs with transitive routing needed  → Transit Gateway (hub-and-spoke)
Multi-region + on-prem at scale         → Cloud WAN
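The same tree as a tiny Python chooser (a sketch only; real designs also weigh TGW data-processing cost, attachment quotas, and CIDR overlap):

```python
def vpc_connectivity(n_vpcs: int, transitive: bool = False,
                     multi_region_onprem: bool = False,
                     shared_service: bool = False) -> str:
    """Map topology facts to the connectivity pattern from the decision tree."""
    if multi_region_onprem:
        return "Cloud WAN"
    if shared_service:
        return "PrivateLink"          # one producer service, many consumer VPCs
    if n_vpcs >= 3 and transitive:
        return "Transit Gateway"      # hub-and-spoke
    return "VPC Peering"              # two VPCs, low traffic, no transitive hops

assert vpc_connectivity(2) == "VPC Peering"
assert vpc_connectivity(12, transitive=True) == "Transit Gateway"
assert vpc_connectivity(5, shared_service=True) == "PrivateLink"
assert vpc_connectivity(40, multi_region_onprem=True) == "Cloud WAN"
```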

1.5 IAM advanced

SCP = guardrail at OU/account level; deny-by-default, no permissions granted, only constrains. SCPs evaluated AND'd with IAM policies + permission boundaries – action allowed only when ALL allow it.

Permission boundary = max permissions a role/user CAN have, regardless of attached policies. Used for delegated admin (developer can create roles, but only ones bounded).

ABAC = attribute-based access control via tags (e.g., aws:PrincipalTag/team must equal aws:ResourceTag/team). Reduces role count drastically. SCPs can lock the tagging itself so principals can't escalate by re-tagging.

Example SCP – deny creating EC2/RDS resources without an approved Environment tag (a request with no tag at all also matches StringNotEquals, so untagged launches are denied too):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnapprovedEnvTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestTag/Environment": ["prod", "staging", "dev"]
        }
      }
    }
  ]
}
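One subtlety worth encoding: with StringNotEquals, a request that omits the tag key entirely still matches the condition, so the Deny fires. A Python sketch of that evaluation logic (simplified single-valued semantics, not a full IAM policy engine):

```python
def scp_denies(action: str, request_tags: dict) -> bool:
    """Sketch of the Deny statement above. StringNotEquals matches when the
    tag value differs OR the key is absent entirely -- standard IAM condition
    semantics -- so untagged requests are denied as well."""
    if action not in ("ec2:RunInstances", "rds:CreateDBInstance"):
        return False                      # statement does not cover this action
    allowed = {"prod", "staging", "dev"}
    env = request_tags.get("Environment")
    return env not in allowed             # None (missing tag) also fails the check

assert scp_denies("ec2:RunInstances", {})                           # untagged -> denied
assert scp_denies("ec2:RunInstances", {"Environment": "qa"})        # wrong value -> denied
assert not scp_denies("ec2:RunInstances", {"Environment": "prod"})  # approved -> allowed
assert not scp_denies("s3:PutObject", {})                           # action not covered
```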

1.6 Cost optimization (FinOps lever in §11)

Compute discount tiers (max savings vs on-demand):

| Mechanism | Max discount | Flexibility |
|---|---|---|
| Standard RI | 75% | Locked region+family+OS, 1 or 3 yr |
| Convertible RI | 54% | Can change family within OS |
| EC2 Instance SP | 72% | Locked family, any size, any AZ |
| Compute SP | 66% | EC2 + Fargate + Lambda + SageMaker |
| Spot | 90% | Variable interruption (2-min notice) |
| Graviton | +40% perf/$ | ARM64 (must support arch) |
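To make the tiers concrete, a quick Python comparison at an assumed on-demand rate (the hourly price is a hypothetical placeholder, not a quoted AWS price):

```python
ON_DEMAND_HOURLY = 0.0416   # assumed rate for illustration only
HOURS_PER_MONTH = 730

def monthly_cost(discount: float, hourly: float = ON_DEMAND_HOURLY) -> float:
    """Monthly cost after a fractional discount vs on-demand."""
    return hourly * (1 - discount) * HOURS_PER_MONTH

for name, disc in [("on-demand", 0.00), ("Compute SP", 0.66),
                   ("Standard RI", 0.75), ("Spot", 0.90)]:
    print(f"{name:12s} ${monthly_cost(disc):6.2f}/mo")
# on-demand lands around $30/mo at this rate; Spot around $3/mo at the
# advertised maximum discount -- real Spot savings vary by pool.
```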

June 2025 change – RIs and SPs are restricted to single-end-customer; MSPs can no longer share commitments across accounts.

Surrogate-1 must teach Compute Optimizer recommendations + apply them.

1.7 AWS-specific tools & CLI surface

aws cli      → primary
aws cdk      → preferred IaC (TS/Python). CDK Refactor (Sept 2025) safely renames constructs without replacement
aws sam      → serverless/Lambda focus
aws copilot  → ECS/Fargate (END OF SUPPORT June 12 2026 – migrate to ECS Express or CDK L3)
aws amplify  → frontend + serverless backend, Git-driven CI/CD

1.8 Training corpus for AWS

- AWS Well-Architected docs (all 6 pillar PDFs + 9 lenses)
- AWS official examples: aws-samples/* (8000+ repos)
- terraform-aws-modules/* (vpc 126M downloads, eks 96.3M downloads)
- AWS CDK guide v2 + cdk-patterns/serverless
- SAP-C02 question banks (ExamTopics, Tutorials Dojo)
- AWS Architecture Center reference architectures (multi-account, DR, hybrid)
- Service Control Policy examples: aws-samples/service-control-policy-examples

Eval target: 75% on a custom AWS-design eval (multi-account VPC + hub-spoke + IAM bootstrap + EKS cluster) with cfn-lint + cfn-guard passing.


2. GCP Deep Mastery

2.1 Certifications

| Cert | Released | Scope |
|---|---|---|
| Cloud Digital Leader | – | Business/strategy |
| Associate Cloud Engineer | – | gcloud + GCE/GKE/GCS basics |
| Professional Cloud Architect (PCA) | refreshed Oct 30 2025 | Design; ~30% net-new content (Vertex AI, Gemini, AI Hypercomputer) |
| Professional Cloud Network Engineer (PCNE) | – | VPC, hybrid, Cloud Interconnect |
| Professional Cloud DevOps Engineer | – | SLO, CI/CD, observability |
| Professional Cloud Security Engineer | – | Org policies, VPC-SC, BeyondCorp |
| Professional Cloud Database Engineer | – | Cloud SQL, AlloyDB, Spanner |

PCA exam covers Compute Engine, Cloud Storage, App Engine, GKE, with the Oct 2025 refresh adding ~30% new content focused on Vertex AI, Gemini integration, and AI Hypercomputer.

2.2 GKE advanced

GKE Autopilot – Google manages provisioning, scaling, security, add-ons. Bills per-pod resource request (not nodes). Best when the team doesn't want to tune nodepools.

GKE Standard – Customer-managed nodepools; required for DaemonSets that need privileged hostPath, custom CNI, or niche GPU/TPU shapes.

GKE version ladder – GKE adopts new K8s versions fastest (~2 weeks). Autopilot gets 30 months extended support; AKS LTS 24 months; EKS Extended Support +12 months.

Anthos / GKE Enterprise – Multi-cluster across on-prem + AWS + Azure. Provides Config Sync (GitOps), Service Mesh, Policy Controller. Now folded into the GKE Enterprise SKU.

2.3 BigQuery + Vertex AI integration (2025)

  • AI.GENERATE, AI.GENERATE_TABLE, AI.EMBED, AI.SIMILARITY are now GA in BigQuery.
  • BQML supports Gemini 3.0 for generative SQL functions.
  • Vertex AI End User Credentials (2025) lets Vertex models authenticate via the calling user's IAM β€” no service-account proxy.

This is core for any data-platform engineering Surrogate-1 builds.

2.4 Cloud Run + Cloud Functions

  • Cloud Run gen2 = container-as-a-service, scales to zero, max 60-min timeout, supports websockets/streaming.
  • Cloud Functions gen2 = built ON Cloud Run; choose Functions for trigger-driven, Run for HTTP/services.
  • Cloud Run jobs = batch workloads (cron via Cloud Scheduler).

2.5 GCP-specific tools

gcloud           → primary CLI
Terraform google → official provider, fastest day-1 support for new services
Config Connector → GCP-native Crossplane equivalent (KCC); manage GCP resources via K8s CRDs
Cloud Deploy     → managed continuous delivery for GKE / Cloud Run
Cloud Build      → CI (yaml + buildpacks)

2.6 Training corpus for GCP

- GCP architecture center (cloud.google.com/architecture)
- terraform-google-modules/* (network, kubernetes-engine, cloud-foundation-fabric)
- Cloud Foundation Fabric (Google's reference org setup)
- gcp-pca-study-guide repos
- Anthos config-management examples
- BQML + Vertex AI codelabs

3. Azure Deep Mastery

3.1 Certifications

| Cert | Code | Scope |
|---|---|---|
| Administrator Associate | AZ-104 | RBAC, IAM, networking, storage, Bicep basics |
| Solutions Architect Expert | AZ-305 | Design: governance, identity, infra, app, integration |
| Security Engineer | AZ-500 | Defender, Sentinel, Conditional Access |
| DevOps Engineer Expert | AZ-400 | Pipelines, IaC, monitoring |

AZ-305 (refreshed April 17 2026) covers: Identity/governance/monitoring, data storage, infrastructure & availability, application architecture, network solutions, data integration, business continuity. Prereq: AZ-104.

3.2 Azure compute deep cuts

AKS                               → managed K8s; "AKS LTS" = 24-mo extended support per minor
App Service                       → PaaS web hosting (Plans = Basic/Standard/Premium/Isolated)
Functions                         → consumption / premium / dedicated
Container Apps                    → CaaS on KEDA (scale-to-zero from events)
Container Instances (ACI)         → single-pod throwaway
Virtual Machine Scale Sets (VMSS) → IaaS auto-scaling
Azure Spring Apps                 → managed Spring Boot

3.3 Azure DevOps + GitHub Enterprise (Microsoft owns both)

  • Azure DevOps = Boards + Repos + Pipelines + Artifacts. Mature for .NET-heavy orgs.
  • GitHub Enterprise + Actions = where new investment is going (Microsoft's strategic direction).
  • 2025 trend: most new Azure customers go GitHub-first; Azure DevOps is in maintenance mode.

3.4 Azure tooling

az cli   → primary
Bicep    → DSL that transpiles to ARM; JSON ARM templates are legacy, avoid for new work
Pulumi   → first-class Azure Native provider
Terraform azurerm + azuread → mature, official

Bicep simplifies ARM but is Azure-only – for multi-cloud orgs, Terraform remains primary.

3.5 Training corpus for Azure

- Cloud Adoption Framework (Microsoft's enterprise reference)
- Azure-Samples/* GitHub org
- Azure Verified Modules (AVM) – Microsoft's curated Bicep + Terraform modules
- AZ-305 study guides + Microsoft Learn content
- Azure Architecture Center patterns

4. Multi-Cloud Strategy

4.1 Workload portability tools

| Tool | Approach | Best fit |
|---|---|---|
| Crossplane | K8s-native control plane → cloud APIs via providers | Platform teams already on K8s |
| Anthos | GCP-managed clusters across clouds + on-prem | GKE-centric orgs wanting unified control |
| Azure Arc | Azure-managed servers/K8s outside Azure | Azure-centric hybrid |
| Terraform | IaC abstraction (provider-per-cloud) | Most common; least lock-in |
| Pulumi | Real code (Python/TS); equivalent provider coverage | Engineering-heavy teams |

4.2 Crossplane v2 (Aug 2025)

Major upgrades:

  • Compositions can include any K8s resource β€” not just Crossplane MRs. Mix RDSInstance + Deployment + CloudNativePG cluster in one XR.
  • Namespace-first β€” XRs and MRs are namespaced by default (was cluster-scoped).
  • Operations β€” function pipelines for cert monitoring, rolling upgrades, scheduled maintenance.
  • Multi-cloud status β€” AWS providers fully migrated; Azure/GCP/Terraform providers still being updated to v2.

4.3 DR / failover patterns

| Pattern | RTO | RPO | Cost (vs single-region) |
|---|---|---|---|
| Backup & restore | hours-days | hours | 1.0x (storage only) |
| Pilot light | 10s of min | minutes | 1.1-1.3x |
| Warm standby | minutes | minutes | 1.5-1.8x |
| Multi-site active/active | seconds | ~0 | 1.8-2.5x |

Multi-cloud active/active typically costs 1.8–2.5x single-cloud due to duplicate infra + ops overhead. Recommendation: active/passive across clouds + active/active across regions WITHIN primary cloud.
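The table reduces to a lookup from the RTO budget (a sketch; a real decision also weighs RPO, cost ceiling, and data-layer constraints):

```python
def dr_pattern(rto_seconds: int) -> str:
    """Cheapest pattern from the DR table that can still meet the RTO target."""
    if rto_seconds > 60 * 60:          # hours or more -> cheapest option suffices
        return "backup & restore"
    if rto_seconds > 15 * 60:          # tens of minutes
        return "pilot light"
    if rto_seconds > 60:               # single-digit minutes
        return "warm standby"
    return "multi-site active/active"  # seconds

assert dr_pattern(24 * 3600) == "backup & restore"
assert dr_pattern(30 * 60) == "pilot light"
assert dr_pattern(5 * 60) == "warm standby"
assert dr_pattern(30) == "multi-site active/active"
```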

4.4 Latency-based routing

Route53 latency policy   → AWS-native, cheapest
Cloud DNS geo-routing    → GCP-native
Azure Traffic Manager    → Azure-native
Cloudflare load balancer → multi-cloud
NS1 / Constellix         → enterprise multi-cloud DNS

Cloudflare LB is the most common cross-cloud answer because it sits OUTSIDE the providers.

4.5 Cost arbitrage

  • GPU cost: GCP < AWS < Azure (TPUs are GCP-only and cheaper per FLOP at scale)
  • Egress: AWS most expensive; Cloudflare R2 has $0 egress (S3-compatible)
  • Object storage: B2 ($6/TB/mo) < R2 ($15) < S3 standard ($23) < GCS standard ($26)
  • Reserved discounts: deepest in AWS (75% std RI), shallower in Azure (65%), GCP CUDs ~57%

4.6 Vendor lock-in mitigation

1. Use OSS data formats (Parquet, Iceberg, Delta) – not proprietary
2. Use OSS DBs (Postgres / Redis-compatible Valkey) – not Aurora-only or Cosmos-only
3. Use OCI containers + K8s – cluster portability via Crossplane/Anthos
4. Use Terraform with multi-provider modules – abstract per-cloud differences
5. Avoid managed-vendor-only auth – use OIDC + Keycloak or Auth0 (cross-cloud)
6. Use multi-cloud DNS (Cloudflare/NS1) so Route53/Cloud DNS isn't a single point of failure

5. IaC Mastery

5.1 Terraform / OpenTofu (post-BSL fork)

  • HashiCorp Terraform OSS under BSL discontinued after July 2025 β†’ OpenTofu is the OSS continuation under Linux Foundation.
  • Most TACOS (Spacelift, Env0, Scalr) support both. Most modules still work in both.
  • For new orgs in 2026 β†’ default OpenTofu.

Best practices (2025):

1. Remote backend (S3+DynamoDB lock, GCS, Azure Blob) – never local state
2. Split state: per-environment (dev/staging/prod) + per-domain (network, data, compute)
   - Terralith state >50MB causes timeouts; >10MB visible perf hit
3. Module versioning: `~> 2.5` (allow patch+minor, block major)
4. Pre-commit: terraform fmt + validate + tflint + tfsec/checkov
5. CI/CD: Atlantis (OSS, self-host) or Spacelift / Env0 / Scalr / Terramate (SaaS)
6. State locking always on
7. Drift detection: `terraform plan -refresh-only` on schedule (Spacelift / Atlantis cron)
8. Workspaces only for environment isolation; NOT for tenant separation

Workspace anti-pattern: using workspaces for cust-1, cust-2, cust-3 – these should be separate state files / directories instead. Workspaces are good for dev, staging, prod of the same module.
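Item 3's pessimistic constraint, sketched in Python for intuition (Terraform's real constraint grammar has more operators, and a three-part `~> 2.5.0` pins the minor version as well):

```python
def pessimistic_ok(version: str, constraint: str = "2.5") -> bool:
    """'~> MAJOR.MINOR' semantics: >= MAJOR.MINOR.0 and < (MAJOR+1).0.0."""
    vmaj, vmin = (int(p) for p in version.split(".")[:2])
    cmaj, cmin = (int(p) for p in constraint.split("."))
    return vmaj == cmaj and vmin >= cmin

assert pessimistic_ok("2.5.0")
assert pessimistic_ok("2.9.3")       # minor + patch may float
assert not pessimistic_ok("3.0.0")   # major bump blocked
assert not pessimistic_ok("2.4.9")   # below the floor
```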

5.2 CloudFormation

- Nested stacks → for >500 resources / cross-stack dependencies
- Custom resources → Lambda-backed for CFN gaps. Use AwsCustomResource (CDK) for single-API-call
- Change sets → preview before apply (mandatory for prod)
- Stack policies → prevent accidental updates to specific resources
- Service Catalog → curated CFN templates exposed to devs
- StackSets → multi-account/multi-region rollout

5.3 AWS CDK best practices

- Constructs L1 (raw CFN) / L2 (curated AWS) / L3 (composite patterns)
- Aspects → enforce policy across all constructs (e.g., "all S3 buckets must encrypt")
  - Aspects run at synth-time → cfn-guard runs post-synth → both = defense in depth
- Don't extend Construct unless interacting with AWS resources directly; helper class is enough
- Custom resources: use AwsCustomResource for single API call; full Lambda-backed for complex
- CDK Refactor (Sept 2025) → safely rename or move resources without replacement
- Pipelines L3 = managed CodePipeline that self-mutates

5.4 Pulumi

  • Real code (TS/Python/Go/.NET/Java) β€” language loops, classes, unit tests with native frameworks.
  • Pulumi onboarding ~30% faster for engineers already knowing TS/Python (vs HCL).
  • Day-1 support for new cloud services because Pulumi wraps SDKs directly.
  • Pulumi ESC = encrypted env+secrets store; Pulumi Deployments = managed runners.

5.5 Crossplane (K8s-native multi-cloud)

# Composition that creates RDS + Deployment + Service in one XR
apiVersion: apiextensions.crossplane.io/v2
kind: Composition
metadata:
  name: web-app-with-db
spec:
  compositeTypeRef:
    apiVersion: example.io/v1alpha1
    kind: WebApp
  pipeline:
  - step: provision-db
    functionRef:
      name: function-patch-and-transform
    input:
      apiVersion: pt.fn.crossplane.io/v1beta1
      kind: Resources
      resources:
      - name: rds
        base:
          apiVersion: rds.aws.upbound.io/v1beta1
          kind: Instance
          spec:
            forProvider:
              instanceClass: db.t3.medium
              engine: postgres
              engineVersion: "16"
              allocatedStorage: 50
      - name: deployment
        base:
          apiVersion: apps/v1
          kind: Deployment
          spec:
            replicas: 3

5.6 IaC TACOS comparison

| Tool | OSS / SaaS | IaC coverage | Best for |
|---|---|---|---|
| Atlantis | OSS, self-host | TF/OpenTofu/Terragrunt | Free, GitHub-PR workflow |
| Spacelift | SaaS + self-hosted | TF/OpenTofu/Terragrunt/Pulumi/CFN/K8s/Ansible | Enterprise multi-IaC |
| Env0 | SaaS only | Multi-IaC + strong FinOps | FinOps-aware deployment |
| Terramate | OSS + SaaS | TF/OpenTofu | Stack orchestration + DAGs |
| Scalr | SaaS + self-hosted | TF/OpenTofu | TFC alternative |
| Terraform Cloud | SaaS | TF only | Default if already HashiCorp |

5.7 Real Terraform module example (multi-cloud DRY)

# environments/prod/main.tf
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "prod-vpc"
  cidr = "10.0.0.0/16"
  azs  = ["us-east-1a", "us-east-1b", "us-east-1c"]

  private_subnets  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets   = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = false  # one per AZ for HA
  enable_vpn_gateway     = false
  enable_dns_hostnames   = true
  enable_flow_log        = true
  flow_log_destination_type = "cloud-watch-logs"

  tags = local.common_tags
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "prod-platform"
  cluster_version = "1.32"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  cluster_endpoint_public_access = false
  cluster_endpoint_private_access = true

  cluster_addons = {
    coredns                = { most_recent = true }
    kube-proxy             = { most_recent = true }
    vpc-cni                = { most_recent = true }
    aws-ebs-csi-driver     = { most_recent = true }
    eks-pod-identity-agent = { most_recent = true }
  }

  eks_managed_node_groups = {
    system = {
      instance_types = ["t3.medium"]
      min_size = 2
      max_size = 4
      desired_size = 2
      labels = { workload = "system" }
      taints = [{ key = "system", value = "true", effect = "NO_SCHEDULE" }]
    }
    karpenter = {
      instance_types = ["m6g.large"]  # Graviton
      capacity_type  = "ON_DEMAND"
      min_size = 1
      max_size = 2
      desired_size = 1
      labels = { workload = "karpenter" }
    }
  }

  enable_irsa = true
  enable_cluster_creator_admin_permissions = true
}

5.8 Training corpus for IaC

- HashiCorp learn.hashicorp.com/terraform tutorials (1000+ lessons)
- terraform-aws-modules / terraform-google-modules / Azure/terraform-azurerm-* (AVM)
- Pulumi pulumi/examples (1500+)
- aws-samples/aws-cdk-examples
- Crossplane upbound/configurations (reference platforms)
- Awesome-terraform / awesome-pulumi GitHub lists
- IaC-Eval benchmark (academic Terraform benchmark)
- TACOS docs: Spacelift, Env0, Atlantis, Terramate

6. Kubernetes Platform Engineering

6.1 Kubernetes 1.32 → 1.35 highlights (2025)

| Version | Released | Key features |
|---|---|---|
| 1.32 | Dec 2024 | KubeletFineGrainedAuthz; Memory Manager GA; anonymous-auth configurable endpoints |
| 1.33 | Apr 2025 | Sidecars GA; supplementalGroupsPolicy beta; in-place pod resize beta |
| 1.34 | Aug 2025 | DRA core GA (Dynamic Resource Allocation for GPUs/TPUs/FPGAs) |
| 1.35 | Dec 2025 | Fine-grained Supplemental Groups GA; TLS 1.3 baseline |

Pod Security Standards are GA since v1.25 (NOT 2025). Three levels: Privileged / Baseline / Restricted, applied via pod-security.kubernetes.io/<mode> namespace labels.

6.2 Helm vs Kustomize vs Carvel

| Tool | Approach | Strength | Weakness |
|---|---|---|---|
| Helm | Templating + values + chart | Package manager (75% adoption); Helm 4 (Nov 2025) adds server-side apply | Templating debug pain |
| Kustomize | Patch-based overlays on bases | No magic; built into kubectl | No release/version concept; needs ArgoCD/Flux for state |
| Carvel | ytt + kapp + kbld + imgpkg | Strong CI bundling; image relocation | Steeper learning curve, smaller community |

Mature pattern: Helm to install upstream charts (Cilium, ArgoCD, Prometheus); Kustomize overlays per environment. Use ArgoCD helm source with valuesObject overrides.

6.3 GitOps – ArgoCD vs FluxCD (2025 reality)

Weaveworks closed in early 2024 – Flux became fully community-driven (CNCF graduated). ArgoCD has the clearer commercial path (Akuity, Codefresh).

| Aspect | ArgoCD | FluxCD |
|---|---|---|
| UI | Strong native dashboard | None native (use Weave GitOps or third-party) |
| RBAC | Built-in + Projects multi-tenancy | Standard K8s RBAC only |
| Architecture | Hub-and-spoke | Decentralized, K8s-idiomatic |
| Multi-cluster | Native (App-of-Apps, ApplicationSets) | Per-cluster Flux + Notification Controller |
| Best for | Most enterprises in 2025 | Air-gapped / minimal-deps / true GitOps purists |

Default 2026 recommendation: ArgoCD for most orgs.

6.4 Service Mesh – Istio vs Linkerd vs Cilium (2025)

| Mesh | Sidecars | Data plane | Best fit |
|---|---|---|---|
| Istio | Sidecar OR Ambient (ztunnel + waypoint) | Envoy | Advanced traffic mgmt, deep telemetry |
| Linkerd | Sidecar only | linkerd2-proxy (lightweight Rust) | Simplicity + lowest overhead |
| Cilium | Sidecarless | eBPF + Envoy (L7) | Network policy + perf at scale |

Memory cost reality: 500 services on Istio sidecar = ~25–50 GB more RAM than same on Linkerd. Translates to real $$.
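A rough way to sanity-check that claim (the per-sidecar figures here are assumptions: Envoy commonly sits around 60-100 MB loaded with config, linkerd2-proxy around 10-20 MB):

```python
def proxy_ram_gb(services: int, replicas_per_service: int, per_sidecar_mb: float) -> float:
    """Fleet-wide sidecar RAM: one proxy per pod."""
    return services * replicas_per_service * per_sidecar_mb / 1024

istio   = proxy_ram_gb(500, 1, 80)   # ~39 GB at one replica per service
linkerd = proxy_ram_gb(500, 1, 15)   # ~7 GB
# Delta of ~32 GB, inside the 25-50 GB range above; multiply by the
# average replica count for a realistic fleet.
```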

Cilium caveat: eBPF alone can't parse HTTP/gRPC or terminate mTLS – Cilium still uses Envoy for L7, so the perf delta vs Istio at L7 is small.

Decision tree:

Tiny team, just want mTLS + observability      → Linkerd
Already on Cilium CNI, want unified            → Cilium Service Mesh
Need full traffic mgmt (canary, mirror, fault) → Istio Ambient

6.5 Ingress + Gateway API (the Ingress era is ending)

Ingress-NGINX official maintenance halt March 2026. Gateway API is the K8s-official successor.

Gateway API provides:

  • Protocol-agnostic (HTTP, TCP, gRPC, TLS passthrough)
  • Role-split: GatewayClass (provider) β†’ Gateway (cluster operator) β†’ HTTPRoute (app dev)
  • Built-in canary/blue-green via weighted routes
  • Both north-south AND east-west

Ingress controllers / Gateway implementations:

| Implementation | Notes |
|---|---|
| Envoy Gateway | Reference implementation; CNCF |
| Istio | Native Gateway API support (replaces Istio VirtualService for new work) |
| NGINX Gateway Fabric | NGINX-backed, replaces ingress-nginx |
| Cilium Gateway | CNI-integrated |
| Traefik | Long-time leader for Ingress; Gateway API supported |

Migration: ingress2gateway 1.0 (2026) translates Ingress + annotations → Gateway API resources.

6.6 Operators

Operator SDK (Red Hat)        → Go/Helm/Ansible scaffolding
Kubebuilder                   → upstream K8s SIG; cleaner Go
KUDO                          → declarative operator definition (project now inactive)
metacontroller                → lightweight webhook-based hooks (controllers in any language)

When to write an operator: state machine that doesn't fit Deployment (e.g., DB clustering, leader election, custom backup).

When NOT: just templating → use Helm/Kustomize.

6.7 Multi-cluster – Karmada vs Cluster API vs OCM

KubeFed is EOL (no commits since 2020).

| Tool | Approach |
|---|---|
| Karmada | CNCF Incubation; multi-cluster scheduling + propagation policy. v1.15 (Oct 2025) adds multi-template workload awareness + structured logging |
| Cluster API (CAPI) | Declarative cluster lifecycle (CAPA AWS, CAPG GCP, CAPZ Azure providers) |
| Open Cluster Management (OCM) | Red Hat-led; ACM commercial product |
| Anthos / GKE Enterprise | GCP-managed; folds in Config Sync + Mesh + Policy |
| Azure Arc | Azure-managed; brings Azure Policy/Monitor to any cluster |

Pattern: CAPI provisions clusters, Karmada propagates workloads, ArgoCD reconciles config.

6.8 Cost – Kubecost vs OpenCost

  • OpenCost (Apache 2.0) β€” free, single-cluster focus, real-time allocation by pod/namespace/controller, multi-cloud (AWS/GCP/Azure). Now ships with built-in MCP server (2025) for AI agent access.
  • Kubecost (IBM-owned post-2024 acquisition) β€” adds budgets, RBAC, multi-cluster aggregation, automated cost policies. Starts $449/mo, enterprise on quote.

6.9 Karpenter + Spot + Graviton

Real customer outcomes:

  • Tinybird: 20% AWS bill reduction with EKS+Karpenter+Spot
  • Series B SaaS (200 microservices): $52k β†’ $23k/mo (56%) with Graviton mix + Karpenter + Spot
  • One reported migration: $50k β†’ $22k/mo Karpenter + Spot + VPA

NodePool best practices:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["arm64", "amd64"]  # Graviton preferred but allow x86 fallback
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["m", "c", "r"]
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values: ["5"]  # generation 6 or newer (m6g+, c6g+, r6g+); Gt is strictly greater-than
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 1000

6.10 Helm chart example for a service

# values.yaml
image:
  repository: ghcr.io/org/api
  tag: ""  # set by ArgoCD Image Updater or via CI
  pullPolicy: IfNotPresent

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    memory: 512Mi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 30
  targetCPUUtilizationPercentage: 70

serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123:role/api-irsa  # IRSA for AWS

podSecurityContext:
  runAsNonRoot: true
  runAsUser: 65532
  fsGroup: 65532

securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault

podDisruptionBudget:
  enabled: true
  minAvailable: 2

networkPolicy:
  enabled: true
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: gateway

6.11 Training corpus for K8s

- kubernetes/kubernetes (source + design proposals KEPs)
- kubernetes/website (docs)
- helm/charts (deprecated) + bitnami/charts + community charts
- argo-cd/argo-cd repo + examples
- karmada-io/karmada
- cilium/cilium (eBPF code + e2e tests)
- istio/istio
- linkerd/linkerd2
- aws/karpenter-provider-aws
- backstage/backstage source + plugins
- run-x/awesome-kubernetes
- KubeCon talk transcripts (CNCF YouTube; can transcribe via Whisper)

Eval target: 70% on K8s-Bench (manifest validity + Helm chart that helm template validates + ArgoCD Application that syncs + NetworkPolicy that locks down by default).


7. Internal Developer Platform (IDP)

7.1 IDP landscape (2025)

| Tool | Type | Strength | TTV (time-to-value) |
|---|---|---|---|
| Backstage (Spotify, CNCF) | OSS framework, build-it-yourself portal | Most flexible; 120+ Spotify-internal plugins | 3-6 months |
| Port | Commercial SaaS portal | No-code, fast to deploy | Days |
| Cortex | Commercial: service ownership + scorecards | Best for >50-eng orgs needing governance | Weeks |
| OpsLevel | Commercial: quality scorecards | Strong dashboards | Weeks |
| Humanitec | Platform orchestrator (NOT a portal) | Backend that resolves Score files into infra | Weeks |
| Encore | All-in-one (codegen + infra) | Strong opinionated dev workflow | Days |
| Cloudomation | Workflow automation IDP | Low-code for non-K8s orgs | Days |

Key mental model: Portal (Backstage/Port) ≠ Orchestrator (Humanitec). You often need BOTH – portal as UI, orchestrator as the backend that creates the actual cloud resources.

7.2 Backstage core

Catalog            → entities (Component, System, API, Resource, Group, User)
TechDocs           → MkDocs-based, lives next to code
Software Templates → Cookiecutter-style scaffolds (repo + IaC + pipeline + DB)
Search             → indexes catalog + docs
RBAC               → Spotify's RBAC plugin (commercial)
Soundcheck         → Spotify's tech-standards/scorecard plugin (commercial)
Insights           → adoption analytics (commercial)
Cloud Backstage    → managed hosted (commercial)

Open-source plus commercial Spotify Portal (RBAC, Insights, Soundcheck) = "production-ready Backstage."

catalog-info.yaml example:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: orders-api
  description: Order management service
  annotations:
    github.com/project-slug: org/orders-api
    backstage.io/techdocs-ref: dir:.
    pagerduty.com/integration-key: ${SECRET_PD}
    sonarqube.org/project-key: org_orders-api
    grafana/dashboard-selector: "tags @> 'orders'"
  tags: [java, spring-boot, payments-domain]
spec:
  type: service
  lifecycle: production
  owner: payments-team
  system: payments
  providesApis: [orders-rest-api]
  consumesApis: [users-rest-api]
  dependsOn: [resource:orders-db]

7.3 Score (CNCF, 2024) – workload spec

Score is a platform-agnostic workload spec. The developer writes ONE YAML; the platform team's score-compose / score-helm / score-k8s translates it.

# score.yaml
apiVersion: score.dev/v1b1
metadata:
  name: orders-api
containers:
  api:
    image: ghcr.io/org/orders-api:${TAG}
    variables:
      DATABASE_URL: postgres://${secrets.DB_PASSWORD}@${resources.db.host}/orders
      REDIS_URL: ${resources.cache.url}
    resources:
      requests: { cpu: "100m", memory: "256Mi" }
service:
  ports:
    web: { port: 8080 }
resources:
  db:
    type: postgres
  cache:
    type: redis
  route:
    type: dns
    params: { host: orders.example.com }

The platform team configures resource definitions (e.g., db.postgres → AWS RDS via Crossplane) – devs don't see/care.

7.4 OAM vs Score

OAM is broader (whole-app model with traits + scopes + components); Score is single-workload + simpler. Score is winning in 2025 because of its narrower scope and CNCF backing.

7.5 Humanitec orchestrator pattern

Developer:    score.yaml in repo
GitOps:       commit → CI → calls Humanitec API
Humanitec:    resolves score against Resource Definitions
              → creates EKS Deployment + RDS + Redis + Route53 record
Platform:     defines Resource Definitions (e.g., postgres → AWS RDS via TF/Crossplane)
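The resolution step is essentially a typed lookup table. A toy Python stand-in (the registry, keys, and provisioner names are all illustrative, not the Humanitec API):

```python
# Hypothetical resource-definition registry: (resource type, environment) -> provisioner.
RESOURCE_DEFINITIONS = {
    ("postgres", "prod"): "aws-rds-via-crossplane",
    ("postgres", "dev"):  "cloudnativepg-in-cluster",
    ("redis", "prod"):    "elasticache-via-terraform",
}

def resolve(resource_type: str, env: str) -> str:
    """Pick the provisioner backing a Score resource type in a given environment."""
    try:
        return RESOURCE_DEFINITIONS[(resource_type, env)]
    except KeyError:
        raise LookupError(f"no resource definition for {resource_type}/{env}") from None

assert resolve("postgres", "prod") == "aws-rds-via-crossplane"
assert resolve("postgres", "dev") == "cloudnativepg-in-cluster"
```

This is why the same score.yaml can yield RDS in prod but an in-cluster Postgres in dev: only the registry changes.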

7.6 Training corpus for IDP

- backstage/backstage source + ALL community plugins (roadie/* / spotify/*)
- score-spec/spec + reference implementations (score-compose/score-helm/score-k8s)
- Humanitec docs + Resource Definition examples
- Port templates marketplace
- Cortex YAML scorecard library
- platformengineering.org community articles
- KubeCon Platform Engineering Day talks (transcripts)

8. Edge + Serverless Platforms

8.1 Latency / cold-start reality (2025)

| Platform | P50 latency | Cold start | POPs |
|---|---|---|---|
| Cloudflare Workers | 10–30ms | <1ms (V8 isolates) | 330+ |
| Vercel Edge Functions | <50ms | sub-50ms | 18 (uses Lambda@Edge under the hood in some regions) |
| Lambda@Edge (Node) | 50–80ms | 250–800ms | AWS edge POPs |
| Lambda@Edge (Python) | similar | 400–1200ms | same |
| Fastly Compute@Edge (WASM) | ~5–10ms | <1ms | 80+ |
| Deno Deploy | low | low | global |
| Bun runtime | n/a | fastest cold start of any Node-compatible runtime | self-hosted |

Cloudflare Workers benchmark ~441% faster than Lambda at p95, and the free tier includes unlimited bandwidth.

8.2 Cloudflare ecosystem

Workers       β†’ V8 isolate functions (JS/TS/WASM)
Pages         β†’ static + Workers (serverless full-stack)
R2            β†’ S3-compatible object storage, zero egress
D1            β†’ serverless SQLite (replicated)
KV            β†’ eventually-consistent KV
Durable Objects β†’ strongly-consistent stateful primitives
Queues        β†’ managed message queue
Workers AI    β†’ run LLMs at the edge (Llama, Whisper, Stable Diffusion)
Vectorize     β†’ vector DB (RAG at edge)
Hyperdrive    β†’ connection pooler for Postgres/MySQL behind edge
Stream        β†’ video transcoding + delivery

8.3 Vercel ecosystem

Edge Functions    → V8-based Edge Runtime (Workers-compatible)
Edge Middleware   → runs BEFORE the request reaches a function
Serverless Funcs  → Node / Python / Go / Ruby; AWS Lambda under the hood
Postgres          → managed Postgres (built on Neon)
KV                → built on Upstash Redis
Blob              → object storage

8.4 Multi-region edge strategies

Pattern 1 β€” Edge cache + origin in primary region
  Cloudflare cache β†’ S3/Lambda in us-east-1
  Trade: simple, 100ms+ for cache misses

Pattern 2 β€” Workers + DB-at-edge
  CF Workers β†’ D1/Hyperdrive
  Trade: edge writes; eventual consistency
  Use: read-heavy auth, profile, feature flags

Pattern 3 β€” Multi-region active/active
  CF LB β†’ Workers in EU + US + APAC β†’ regional Aurora DSQL
  Trade: cost 2x; near-zero RTO across regions

Pattern 4 β€” Global table + edge CDN
  CF Cache β†’ Lambda β†’ DynamoDB Global Tables (multi-master)
  Trade: replication lag; eventual consistency

8.5 WASM serverless (2025–2026)

  • WASI 0.2 (Component Model) GA β†’ portable across runtimes (Wasmtime, Spin, wasmCloud, Wasmer).
  • Cold starts: microseconds (vs 100–500ms for containers).
  • Major clouds now offer Wasm-based FaaS as a mainstream option.
  • Wing language shutdown April 2025 β€” OSS code lives on but no commercial backing.

9. Database Platform

9.1 Postgres options

| Service | Multi-region | Best for |
|---|---|---|
| RDS Postgres | Read replicas | Standard managed |
| Aurora Postgres | Cross-region read replicas + Global Database (1 writer) | Standard scale-out |
| Aurora DSQL | Active/active, strong consistency (GA May 2025) | New globally distributed apps |
| AlloyDB (GCP) | HA + read pool nodes | Postgres-compatible OLTP+OLAP on GCP |
| Cloud SQL (GCP) | Single-region HA | Standard managed |
| Azure Database for PostgreSQL Flexible Server | Single-region HA | Standard managed |
| Neon | Branching (Git-like) | Dev velocity |
| Supabase | Postgres + auth + realtime | Full BaaS |
| Crunchy Bridge | Multi-cloud Postgres | Vendor-neutral |
| PlanetScale | Now Postgres + Vitess | Sharded scale-out |

9.2 Aurora DSQL deep cuts (GA May 2025)

  • Disaggregated architecture: query processor + adjudicator + journal + crossbar β€” each scales independently.
  • 99.99% single-region SLA, 99.999% multi-region.
  • Active/active multi-master (peers); third region as log-only witness.
  • Region groupings only β€” US (us-east-1, us-east-2, us-west-2), EU (eu-west-1/2/3), APAC (ap-northeast-1/2/3).
  • No cross-continent yet.
  • PostgreSQL wire-compatible.

9.3 Distributed SQL (NewSQL)

| DB | TPC-C (TPS) | PG compat | Multi-region |
|---|---|---|---|
| CockroachDB | 45k | wire only | Best with geo-partitioning |
| YugabyteDB | 48k | full (reuses the PG query layer) | Strong with row-level geo |
| TiDB | 40k+ (write-heavy lead) | MySQL primary | ✓ |
| Aurora DSQL | benchmarked fastest by AWS | wire | Region-grouped |
| Spanner | 1M+ at scale | GoogleSQL or PG dialect | Global by design |

YugabyteDB wins for PG migration (full compat). CockroachDB wins for geo-partitioning. Spanner remains gold standard at hyperscale.

9.4 Vitess (MySQL sharding)

  • Open-source MySQL sharding system.
  • Powers YouTube, Slack, GitHub, PlanetScale.
  • Functions: query routing, online schema migration (with gh-ost), connection pooling, transparent sharding.
  • Newer alternative: CockroachDB / Aurora DSQL eliminate manual sharding.

9.5 NoSQL

DynamoDB             β†’ AWS, single-digit-ms; on-demand or provisioned
DynamoDB Global Tables β†’ multi-region multi-master (last-writer-wins)
Spanner              β†’ strongly consistent global
Cosmos DB            β†’ multi-model (SQL/MongoDB/Cassandra/Gremlin); 5 consistency levels
Cassandra/Scylla     β†’ wide-column; high write throughput
MongoDB Atlas        β†’ document; managed across all 3 clouds

9.6 Vector DBs (2025 production benchmarks)

| DB | p99 latency | QPS | Notable |
|---|---|---|---|
| Qdrant | 2ms | 12k | Best low latency; cloud from $25/mo |
| Milvus / Zilliz | 5ms | 8k | Billion-scale; built-in BM25 + dense (30x faster than Elasticsearch) |
| Pinecone | 8ms | 5k | Fully managed, 99% recall |
| Weaviate | 10ms | 4k | BlockMax WAND (10x keyword speed); MUVERA multi-vector |
| pgvector | varies | depends | If you already run Postgres |
| OpenSearch k-NN | varies | depends | If you already run OpenSearch |

9.7 Migration tools (2025)

| Tool | Approach | Best for |
|---|---|---|
| Liquibase | Imperative changelogs (XML/YAML/JSON/SQL); FSL license post-v5; AI rollback assist (2025) | Multi-DB enterprise |
| Flyway | Numbered SQL files; Java-ecosystem standard; Teams tier discontinued 2025 | Java teams |
| Atlas (atlasgo.io) | Declarative HCL + computed migration plan | Terraform-style schema-as-code |
| Prisma Migrate | Declarative, ORM-coupled | Node/TS apps |
| goose | Plain SQL/Go migrations | Go services |

Atlas is the modern recommendation β€” same paradigm as Terraform.


10. Networking Deep

10.1 DNS

Route53          β†’ AWS native, latency/geo/failover; alias records to AWS resources
Cloud DNS        β†’ GCP native
Azure DNS        β†’ Azure native
Cloudflare DNS   β†’ fastest authoritative (1.1.1.1 is recursive); free
NS1 / Constellix β†’ enterprise multi-cloud DNS, advanced traffic steering

10.2 CDN performance (Cloudflare 95p TTFB benchmark, Nov 2024–Mar 2025)

  • Cloudflare fastest in ~48% of top 1000 networks.
  • Fastly extremely close in many networks (e.g., +0.2% lead on Comcast).
  • CloudFront strong inside AWS-heavy stacks (free egress to AWS origins).
  • All have edge compute now: Workers / Compute@Edge / Lambda@Edge.

10.3 Load balancers (AWS)

ALB (L7)    → HTTP/HTTPS/gRPC; WAF integration; target-group flexibility
NLB (L4)    → TCP/TLS/UDP; static IPs; millions of RPS
GWLB        → traffic inspection (third-party firewall in the chain)
ELB Classic → legacy, avoid
Global Accelerator → anycast IPs in front of ALB/NLB for global traffic

10.4 Zero Trust Network Access (2025)

| Tool | Architecture | Best fit |
|---|---|---|
| Tailscale | WireGuard mesh + identity overlay | Fastest dev access; great for SSH/RDP/DB |
| Twingate | Layer-4 ZTNA (no mesh); resource-grained | App-name + group-based access |
| Cloudflare Access + WARP | SASE (Access for apps + Gateway for SWG) | When Cloudflare is the wider stack |
| Zscaler | Enterprise SASE | Big-org compliance |
| Pomerium | Self-hosted reverse-proxy ZTNA | OSS option |

Tailscale wins on dev velocity (sign in, get tailnet); Cloudflare Access wins on full SASE; Twingate wins on resource granularity.

10.5 WAF

AWS WAF              β†’ tied to CloudFront/ALB/API Gateway
Cloudflare WAF       β†’ in front of any origin
Azure Front Door WAF β†’ tied to AFD
Akamai App & API Protector β†’ enterprise

10.6 DDoS protection

AWS Shield Advanced  β†’ $3000/mo + transfer; 24/7 SRT
Cloudflare           β†’ unmetered DDoS protection (free tier!)
Google Cloud Armor   β†’ tier-based
Azure DDoS Protection Standard β†’ per-resource

11. FinOps + Cost Engineering

11.1 FinOps Foundation Framework 2025 (Inform / Optimize / Operate)

INFORM   β†’ Visibility, allocation, benchmarking, budgeting, forecasting
OPTIMIZE β†’ Identify and execute waste reduction
OPERATE  β†’ KPI tracking, governance policies aligned with business

11.2 2025 framework changes β€” Scopes

The 2025 Framework adds Scopes as a structural element. Scopes define context: Public Cloud, SaaS (Snowflake, Salesforce), GenAI (LLM API spend), Data Center, Private Cloud. Each capability is now applied per scope.

11.3 Cost allocation tags (mandatory at provision time)

Required tags for every resource:
- Environment  : prod/staging/dev/sandbox
- Owner        : team-name (matches catalog)
- CostCenter   : finance code
- Project      : product/feature
- DataClass    : public/internal/confidential/regulated

Enforce via:

  • AWS: SCP aws:RequestTag/X (deny on creation), Tag Policies
  • GCP: Org Policy required labels
  • Azure: Azure Policy required tags

11.4 Showback / chargeback

Showback   β†’ "your team used $X" (no actual billing)
Chargeback β†’ cross-charge cost center (real finance impact)

Tools: Vantage, CloudHealth, Apptio Cloudability, Kubecost (k8s-specific), Infracost (pre-deploy IaC estimate).
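Showback is just allocation arithmetic over tagged billing line items. A minimal sketch, assuming each line item carries the Owner tag from §11.3 (the line-item shape here is illustrative, not a specific billing-export format):

```python
from collections import defaultdict

# Showback sketch: roll tagged cost line items up to the owning team.
# Untagged spend surfaces as UNALLOCATED -- the number FinOps teams
# drive toward zero via the tag enforcement in section 11.3.
line_items = [
    {"service": "AmazonEC2", "cost": 1200.0, "tags": {"Owner": "payments-team"}},
    {"service": "AmazonRDS", "cost": 800.0,  "tags": {"Owner": "payments-team"}},
    {"service": "AmazonEKS", "cost": 450.0,  "tags": {"Owner": "platform-team"}},
    {"service": "AmazonS3",  "cost": 90.0,   "tags": {}},  # untagged
]

def showback(items) -> dict:
    totals = defaultdict(float)
    for item in items:
        owner = item["tags"].get("Owner", "UNALLOCATED")
        totals[owner] += item["cost"]
    return dict(totals)

print(showback(line_items))
# {'payments-team': 2000.0, 'platform-team': 450.0, 'UNALLOCATED': 90.0}
```

Chargeback is the same aggregation with the totals posted to real cost centers instead of a report.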

11.5 Anomaly detection

AWS Cost Anomaly Detection (free)
Vantage anomalies + alerts (commercial)
CloudZero / Spend.io
ProsperOps                β†’ automated commitment management

11.6 Right-sizing automation

AWS Compute Optimizer    β†’ free; recs for EC2, Lambda, EBS, ASG, ECS
GCP Recommender          β†’ equivalent
Azure Advisor            β†’ equivalent
ScaleOps / StormForge    β†’ K8s VPA recommender for prod

11.7 Spot orchestration (Karpenter, ProsperOps)

Already covered in Β§6.9. Karpenter handles AWS Spot natively; CAST AI / ScaleOps cover cross-cloud.

11.8 Training corpus for FinOps

- FinOps Foundation framework docs (finops.org/framework)
- AWS / GCP / Azure cost optimization whitepapers
- Vantage / CloudZero / Apptio public benchmarks
- KubeCon FinOps track talks (transcripts)
- Real customer cost-cut case studies (already collected: Tinybird, Series B SaaS examples)

12. 2025–2026 Platform Engineering Trends

12.1 Internal LLM gateways (the 2026 must-have)

| Tool | Type | Key strength | Cost |
|---|---|---|---|
| LiteLLM | OSS, self-host | OpenAI-compatible; cheapest at $10k+ MRR; 100+ providers | Free + infra |
| Portkey | SaaS or self-host | SOC2/HIPAA/ISO 27001; observability; 250+ LLMs | $49/mo+ |
| OpenRouter | SaaS | Pay-per-token; consumer-friendly | 5% markup |
| Helicone | OSS observability | Caching + analytics | Free + cloud |
| Truefoundry / Bifrost | SaaS | LLM gateway + ML platform | Quote |

LiteLLM is the default for orgs serious about cost β€” runs as your own proxy, no markup.

12.2 AI agents in platform engineering

  • Resolve.ai β€” AI SRE; auto-investigates alerts, RCA in minutes, MTTR -80%; customers: Coinbase, DoorDash, Toast, Zscaler. $40M Series A extension at $1.5B (2026).
  • Aviator (aviator.co) β€” AI code review + merge queues + deployment.
  • OpenText DevOps Aviator β€” AI for performance engineering scripts.
  • Cursor / Sourcegraph Cody / GitHub Copilot Workspace β€” IDE-side coding agents.
  • Codeium / Tabnine / Continue β€” open-source IDE agents.

12.3 Per-PR ephemeral environments

| Tool | Approach |
|---|---|
| Coherence | PR comment with auto-preview URL; spot-backed for cost |
| Uffizzi | OSS + cloud; vCluster-based isolated environments |
| Render Previews | Built in to Render |
| Vercel Previews | Built in to Vercel |
| Netlify Deploy Previews | Built in to Netlify |
| Argo CD ApplicationSet PR Generator | OSS, K8s-native |
| vCluster + Argo CD | DIY pattern; cheapest at scale |

Best practice: every PR gets a unique URL, smoke tests run against it, and reviewers can click through before merge.

12.4 WASM-based services

  • Production runtimes: Wasmtime (Bytecode Alliance), wasmCloud, Spin (Fermyon), Wasmer.
  • Use cases: edge serverless, plugin systems (Envoy filters, Istio extensions, Postgres extensions), embedded scripting.
  • Platforms moving to Wasm: Fastly Compute@Edge (WASM-only), Cloudflare Workers (V8 + WASM), Spin Hub.

12.5 AI-native databases / observability

LangSmith        β†’ LLM tracing + evals (LangChain)
Helicone         β†’ LLM tracing + caching (OSS option)
Phoenix (Arize)  β†’ OSS LLM observability
Langfuse         β†’ OSS, self-host LLM observability
Weights & Biases Weave β†’ MLOps + LLM

12.6 Autonomous Cloud Engineer (the Surrogate-1 mission)

The path is converging on:

  1. MCP (Model Context Protocol) β€” standardizes how agents pull cloud state (AWS documentation MCP server, OpenCost MCP (2025), terraform-mcp-server).
  2. Multi-agent systems β€” research / planner / executor / critic agents (CrewAI, LangGraph, AutoGen).
  3. Tool-using agents β€” agents that call terraform plan, kubectl apply, aws sts get-caller-identity, gh pr create.

Surrogate-1's training MUST include MCP-call patterns + tool-use traces.
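One way to represent such traces for SFT is a flat list of (tool, args, observation) steps. The schema below is our assumption for illustration, not an MCP wire format; tool outputs are mocked:

```python
import json

# Hypothetical tool-use trace schema for SFT data -- NOT the MCP wire
# format. Each step records the tool invoked, its CLI args, and the
# observation the agent conditioned on next. Observations are mocked.
trace = [
    {"tool": "aws", "args": ["sts", "get-caller-identity"],
     "observation": "arn:aws:iam::123456789012:role/deployer"},
    {"tool": "terraform", "args": ["plan", "-out=tf.plan"],
     "observation": "Plan: 12 to add, 0 to change, 0 to destroy."},
    {"tool": "terraform", "args": ["apply", "tf.plan"],
     "observation": "Apply complete! Resources: 12 added."},
]

def to_training_example(goal: str, steps: list) -> str:
    """Serialize a goal + tool-use trace into one JSONL training line."""
    return json.dumps({"goal": goal, "steps": steps})

line = to_training_example("Provision the 3-AZ VPC stack", trace)
assert json.loads(line)["steps"][1]["tool"] == "terraform"
```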


13. Training Data Sources

13.1 Curated GitHub repos

Cloud
- awesome-aws (donnemartin)
- awesome-gcp (GoogleCloudPlatform/awesome-google-cloud)
- awesome-azure (kristofferandreasen/awesome-azure)
- aws-samples/* (8000+ official AWS samples)
- GoogleCloudPlatform/* (1500+ GCP samples)
- Azure-Samples/*

K8s
- run-x/awesome-kubernetes
- ramitsurana/awesome-kubernetes
- tomhuang12/awesome-k8s-resources
- kubernetes/kubernetes (source + KEPs)
- kubernetes-sigs/* (CAPI, Gateway API, Karpenter)
- helm/charts (deprecated but reference)
- bitnami/charts
- argoproj/argo-cd

IaC
- hashicorp/terraform
- terraform-aws-modules/* (40+ official modules)
- terraform-google-modules/*
- Azure/terraform-azurerm-* (AVM)
- pulumi/examples
- aws/aws-cdk
- crossplane/crossplane + upbound/configurations

Platform
- backstage/backstage + roadie/* + spotify/* community plugins
- score-spec/spec
- humanitec-architecture/*

Eval
- codefuse-ai/codefuse-devops-eval
- IaC-Eval (academic)
- NL2Bash

13.2 Reddit communities (curate top-voted threads, last 2 yrs)

  • r/devops, r/aws, r/AZURE, r/googlecloud, r/kubernetes, r/Terraform, r/sysadmin, r/sre, r/platformengineering

13.3 Conference talks (transcribe via Whisper, MIT-licensed for TLP/CNCF)

  • KubeCon + CloudNativeCon (CNCF YouTube; ~600 talks/year)
  • AWS re:Invent (several thousand sessions; breakouts archived)
  • Google Cloud Next (annual)
  • Microsoft Ignite / Build
  • HashiConf
  • PlatformCon (annual, online)
  • SREcon (USENIX)

13.4 Public datasets on HuggingFace

- CatOwl/Terraform                   (Terraform code corpus)
- nvidia/OpenCodeReasoning            (reasoning over code)
- bigcode/the-stack-v2                (filtered code, has IaC files)
- mhhmm/codealpaca-iac                (instruction tuning for IaC)
- Custom: collect from terraform-aws-modules/eks/aws + variants

13.5 Documentation (for retrieval / SFT context)

  • AWS docs (full), GCP docs, Azure docs (Microsoft Learn), CNCF docs, K8s docs, Helm docs, Terraform/OpenTofu docs.
  • AWS Well-Architected Framework PDFs (one per pillar).
  • Google Cloud Architecture Framework.
  • Azure Cloud Adoption Framework + Well-Architected Framework.

13.6 Synthesized data (recommended approach)

For Surrogate-1 v2:

1. Take each terraform-aws-modules example
2. Mutate: change region, instance type, AZ count, subnet sizes
3. Build instruction format: "Build me a 3-AZ VPC in us-west-2 with public+private+db subnets using terraform-aws-modules/vpc/aws"
4. Output: working main.tf + outputs.tf + variables.tf

5. For each AWS service, generate:
   - "What is X" Q&A from official docs
   - "Compare X vs Y" from official docs
   - "Migrate from X to Y" code examples

6. Multi-step trajectories:
   - "Build me a SaaS platform on AWS" β†’ 30+ step reasoning trace through architecture decisions

Total target: ~100k–250k cloud/platform instruction-tuning examples.
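Steps 1–4 above can be sketched as simple template mutation. The region/AZ lists and the truncated module template are illustrative; a real pipeline would mutate the full terraform-aws-modules examples:

```python
import itertools

# Sketch of steps 1-4: mutate a canonical terraform-aws-modules example
# into many (instruction, code) pairs. Values are illustrative.
REGIONS = ["us-west-2", "eu-west-1", "ap-southeast-1"]
AZ_COUNTS = [2, 3]

TEMPLATE = '''module "vpc" {{
  source = "terraform-aws-modules/vpc/aws"
  name   = "demo"
  cidr   = "10.0.0.0/16"
  azs    = {azs}
}}'''

def synthesize():
    for region, n_az in itertools.product(REGIONS, AZ_COUNTS):
        azs = ", ".join(f'"{region}{s}"' for s in "abc"[:n_az])
        instruction = (f"Build me a {n_az}-AZ VPC in {region} "
                       "using terraform-aws-modules/vpc/aws")
        yield {"instruction": instruction, "output": TEMPLATE.format(azs=f"[{azs}]")}

examples = list(synthesize())
print(len(examples))  # 6 (3 regions x 2 AZ counts)
```

Each emitted `output` would then be gated through `terraform validate` before entering the corpus, so only compiling mutations survive.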


14. Eval Benchmarks

14.1 Existing benchmarks

| Benchmark | What it tests | Surrogate-1 fit |
|---|---|---|
| codefuse-ai/codefuse-devops-eval | DevOps Q&A (multiple choice) | Quick sanity check |
| IaC-Eval (academic) | Terraform generation correctness | Direct fit |
| KubeBench (community) | K8s manifest validity | Direct fit |
| NL2Bash | Bash commands from NL | Tooling sub-skill |
| BIG-Bench (subset) | Various reasoning | General |
| HumanEval / MBPP | General coding | Already passes (Qwen2.5-Coder-7B baseline) |

14.2 Custom Surrogate-1 v2 evals (we author)

Surrogate-1 Cloud Eval v2:
1. Terraform generation (200 prompts, varying complexity)
   - Pass = `terraform validate` + `terraform plan` succeeds
   - Score: % passing Γ— % correct logical structure (judge LLM)

2. Helm chart authoring (50 prompts)
   - Pass = `helm template` produces valid YAML
   - Score: % passing Γ— `kubeval` validation rate

3. CDK/CFN authoring (100 prompts)
   - Pass = `cdk synth` succeeds
   - Score: + `cfn-lint` clean rate, + `cfn-guard` policy pass

4. ArgoCD Application + Kustomize (50 prompts)
   - Pass = ArgoCD CLI dry-run succeeds

5. Multi-cloud DR scenario (30 prompts)
   - Open-ended: "Design active/passive across AWS+GCP for a SaaS, RTO=15min, RPO=1min"
   - Score: judged by GPT-5 / Claude / human reviewer on architecture quality

6. Cost optimization (50 prompts)
   - Given a `terraform plan` output, return cost reductions (Graviton swap, RIs/SPs, Spot)
   - Score: judged on $$ accuracy (vs Infracost ground truth)

7. K8s troubleshooting (50 prompts)
   - Given pod logs + describe output, return root cause + fix
   - Score: % matching ground truth

8. Tool-use traces (100 prompts)
   - Given a goal, agent must call `aws cli` / `kubectl` / `terraform` correctly
   - Score: % achieving goal (sandbox eval)

Total: ~630 prompts. Run with rubric judges (GPT-5/Claude). Surrogate-1 v2 target: 65% overall (above Qwen2.5-Coder-7B baseline of ~38%).
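The "pass rate Γ— judged quality" scoring used above rolls up to an overall number as a prompt-weighted mean across tracks. A sketch with three of the eight tracks; the pass-rate and quality numbers are illustrative, not real Surrogate-1 results:

```python
# Composite eval score sketch: each track contributes
# pass_rate * quality (judge LLM), weighted by its prompt count.
# Numbers below are illustrative, NOT real Surrogate-1 results.
tracks = [
    {"name": "terraform", "prompts": 200, "pass_rate": 0.70, "quality": 0.80},
    {"name": "helm",      "prompts": 50,  "pass_rate": 0.60, "quality": 0.90},
    {"name": "cdk",       "prompts": 100, "pass_rate": 0.55, "quality": 0.85},
]

def overall(tracks) -> float:
    total = sum(t["prompts"] for t in tracks)
    return sum(t["prompts"] * t["pass_rate"] * t["quality"] for t in tracks) / total

print(round(overall(tracks), 3))  # 0.531
```

Extending `tracks` to all eight sub-evals (~630 prompts) yields the single number compared against the 65% ship gate.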

14.3 Capability tiers (target)

| Tier | Capability | v2 Target |
|---|---|---|
| 1 | Recognize + classify cloud services | 95% |
| 2 | Author single-file IaC (Terraform/CDK/Helm) | 75% |
| 3 | Author multi-file project (VPC + EKS + RDS + ArgoCD) | 60% |
| 4 | End-to-end design trace ("build SaaS on AWS") | 50% |
| 5 | Multi-cloud DR design + tool execution | 35% (stretch) |

v2 Curriculum Integration Plan

For the v2 LoRA fine-tune of Qwen2.5-Coder-7B β†’ Surrogate-1:

Data mix (target ~250k instruction examples)

40%  IaC generation (Terraform / OpenTofu / CDK / Pulumi / Bicep / Crossplane)
20%  K8s authoring (Helm / Kustomize / ArgoCD / Karpenter)
15%  Cloud architecture Q&A (mined from cert prep + docs)
10%  Cost optimization scenarios (FinOps mined + synthesized)
10%  IDP / Backstage / Score / Humanitec patterns
5%   Multi-step tool-use traces (terraform plan β†’ fix β†’ apply)
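The mix above can be materialized as per-bucket example counts; a sketch (bucket names are ours, mirroring the list):

```python
# Turn the percentage mix into per-bucket example counts for a
# 250k-example target. Bucket names mirror the data-mix list above.
TARGET = 250_000
MIX = {
    "iac_generation": 0.40,
    "k8s_authoring": 0.20,
    "cloud_arch_qa": 0.15,
    "cost_optimization": 0.10,
    "idp_patterns": 0.10,
    "tool_use_traces": 0.05,
}

counts = {bucket: int(TARGET * share) for bucket, share in MIX.items()}
assert sum(counts.values()) == TARGET  # shares sum to exactly 1.0
print(counts["iac_generation"])  # 100000
```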

Key sources (direct ingestion priorities)

1. terraform-aws-modules/* + terraform-google-modules/* + Azure AVM (canonical IaC)
2. backstage/backstage source + plugin examples
3. AWS Well-Architected docs (all pillars + lenses)
4. GCP Cloud Adoption Framework
5. CNCF KubeCon transcripts (Whisper-extracted)
6. score-spec + humanitec docs
7. OpenCost docs + MCP-pattern examples
8. Real customer post-mortems (Tinybird $-20k, Series-B SaaS $-29k)
9. IaC-Eval benchmark training set
10. CodeFuse DevOps-Eval training set

Eval gates

  • v2 cannot ship until β‰₯65% overall on Surrogate-1 Cloud Eval v2.
  • Tier-3 (multi-file) β‰₯60% is the practical bar for autonomous infra building.
  • Add MCP-tool-use trajectory eval (sandbox terraform/kubectl/aws calls).

Sources Consulted

  • AWS Well-Architected Framework (6 pillars docs, Sustainability pillar Nov 2024 refresh)
  • Terraform / OpenTofu best practices (Terramate, Spacelift, env0, Scalr 2025 articles)
  • Kubernetes 1.32-1.35 release notes; CNCF security blog Dec 2025
  • Backstage docs + Spotify Backstage portal blog (2025)
  • ArgoCD / FluxCD comparison articles (2025-2026 post-Weaveworks closure)
  • Crossplane v2.0 release blog + InfoQ article (Aug 2025)
  • Karpenter cost optimization blogs (Tinybird; Series-B SaaS case studies)
  • Cloudflare Workers / Vercel Edge / Lambda@Edge benchmarks (2025)
  • FinOps Foundation 2025 framework + Scopes update
  • Istio / Linkerd / Cilium 2025 benchmarks (deepness-lab academic paper)
  • Pulumi / Terraform / CDK / Bicep 2025 comparisons
  • CockroachDB / YugabyteDB / Spanner / Aurora DSQL 2025 benchmarks
  • AWS SAP-C02 / GCP PCA (Oct 2025 refresh) / Azure AZ-305 (April 2026 refresh)
  • Backstage / Port / Cortex / Humanitec IDP comparison (2025-2026)
  • Karmada v1.15 + KubeFed EOL + Cluster API
  • Coherence / Uffizzi ephemeral environments (2025)
  • AWS CDK best practices (CDK Refactor Sept 2025)
  • VPC Transit Gateway / PrivateLink hub-spoke patterns
  • Helm / Kustomize / Carvel comparison (Helm 4 Nov 2025)
  • terraform-aws-modules registry top downloads (May 2025 stats)
  • Liquibase / Flyway / Atlas migration tools (2025 license + features)
  • Aurora DSQL GA announcement (May 2025)
  • CDN benchmarks (Cloudflare 95p TTFB 2024-2025)
  • AWS Savings Plans / Reserved Instances June 2025 policy changes
  • IAM SCPs + Permission Boundaries + ABAC patterns
  • GKE / EKS / AKS managed K8s comparison (2025-2026)
  • terraform-aws-modules registry usage (vpc 126M, eks 96.3M downloads)
  • Vertex AI / BigQuery / Gemini integration (2025)
  • Resolve.ai AI SRE + Aviator (2025-2026)
  • LiteLLM / Portkey / OpenRouter LLM gateway comparison (2025)
  • Multi-cloud DR active/active vs active/passive patterns
  • Wing language shutdown (April 2025) + WASM serverless trends
  • Awesome-aws / awesome-kubernetes curated lists
  • Kubecost / OpenCost cost visibility (Kubecost IBM acquisition 2024)
  • Atlantis / Spacelift / Env0 / Terramate IaC platforms
  • Score spec + OAM workload specifications
  • Karpenter NodePool + Spot + Graviton best practices
  • Tailscale / Twingate / Cloudflare Access ZTNA comparison
  • Vector DB benchmarks (Pinecone / Weaviate / Qdrant / Milvus 2025)
  • AWS Copilot end-of-support (June 12 2026) + SAM + Amplify
  • Gateway API + ingress-nginx retirement (March 2026)
  • DevOps eval benchmarks + IaC-Eval academic benchmark