---
title: Cloud + Platform Engineering Deep Research for Surrogate-1 v2
date: 2026-04-29
purpose: Train Surrogate-1 (Qwen2.5-Coder-7B + LoRA) into a SOTA Cloud / Platform Engineer
scope: AWS + GCP + Azure + Edge + IDP + IaC + K8s + FinOps + Multi-cloud DR
---
# Surrogate-1 SOTA Cloud + Platform Engineer Training Plan
This document is the canonical knowledge base used to design the v2 instruction-tuning curriculum for Surrogate-1. The model must, autonomously and end-to-end:
1. Design + provision multi-cloud infrastructure (AWS, GCP, Azure, Cloudflare, Vercel)
2. Author production-grade IaC (Terraform, OpenTofu, CDK, Pulumi, Bicep, Crossplane)
3. Operate Kubernetes platforms (EKS / GKE / AKS) with GitOps + service mesh
4. Build internal developer platforms (Backstage, Port, Score, Humanitec)
5. Handle FinOps lifecycle (Inform / Optimize / Operate, 2025 + Scopes)
6. Execute multi-cloud disaster recovery + global routing
7. Stand up edge/serverless (Cloudflare Workers, Vercel Edge, Lambda@Edge)
The research is organized into 14 verticals. Most sections close with the **training corpus** + **eval target** for the v2 curriculum.
---
## 1. AWS Deep Mastery
### 1.1 Certification Scope (training-data anchors)
| Cert | Code | Topics | Why we mine it |
|------|------|--------|----------------|
| Solutions Architect Associate | SAA-C03 | VPC, EC2, S3, RDS, Lambda, IAM basics | Foundational service catalog |
| Solutions Architect Pro | SAP-C02 | Multi-account, hybrid, migration, DR, cost-resilience | Most question banks for org-complexity |
| DevOps Engineer Pro | DOP-C02 | CI/CD, monitoring, IaC, governance | Pipelines + observability |
| Security Specialty | SCS-C02 | KMS, GuardDuty, Inspector, SCPs, IRSA | Hardening + compliance scenarios |
| Advanced Networking Specialty | ANS-C01 | Transit Gateway, Direct Connect, Cloud WAN | Multi-VPC + hybrid networking |
The SAP-C02 exam validates designing multi-account strategies, hybrid architectures, migration at scale, cost optimization, security, and resilience - exactly the Surrogate-1 scope. The exam has 65 scored + 10 unscored questions; the passing score is 750/1000.
### 1.2 Well-Architected Framework - 6 pillars (Sustainability added Dec 2021, refreshed Nov 2024)
```
1. Operational Excellence - IaC, runbooks, observability, post-mortems
2. Security - IAM, encryption, network, IR
3. Reliability - RTO/RPO, failover, multi-AZ/multi-region
4. Performance Efficiency - right-sized compute, modern data services
5. Cost Optimization - RIs/SPs/Spot/Graviton, lifecycle rules
6. Sustainability - energy efficiency, region selection, idle cleanup
```
**Lenses** Surrogate-1 must recognize: Serverless, SaaS, Migration, Generative AI, IoT, Hybrid Networking, Financial Services, Streaming Media, ML.
### 1.3 Top 30 services for startup/SaaS workloads
```
Compute : EC2, Lambda, Fargate, Batch, ECS, EKS, App Runner
Storage : S3, EFS, FSx, EBS
DB : RDS (Postgres/MySQL), Aurora, Aurora DSQL, DynamoDB, ElastiCache (Redis/Valkey), OpenSearch
Network : VPC, Route53, CloudFront, ALB/NLB/GWLB, Transit Gateway, PrivateLink, API Gateway
Identity : IAM, IAM Identity Center (SSO), Cognito, Organizations, Verified Permissions
Observability: CloudWatch, X-Ray, Managed Prometheus, Managed Grafana, OpenSearch
Security : KMS, Secrets Manager, GuardDuty, Inspector, Security Hub, WAF, Shield
Data/AI : Bedrock, SageMaker, Glue, Athena, Kinesis, MSK, Step Functions
Messaging : SQS, SNS, EventBridge
DevTools : CodePipeline, CodeBuild, CodeDeploy, CDK, SAM
```
### 1.4 VPC networking patterns
**Hub-and-spoke with Transit Gateway** - TGW is the managed hub-and-spoke service for VPCs and on-prem; it centralizes routing without VPN overlays. Aligns with Well-Architected Reliability pillar `REL02-BP04`.
**Centralized PrivateLink endpoints** - Host interface VPC endpoints (e.g., for `s3.api`, `kms`, `sts`, `secretsmanager`) in a single shared-services VPC. All spoke VPCs reach AWS APIs via TGW → endpoint VPC. Saves cost: one $7.30/month-per-AZ endpoint instead of one per spoke VPC.
**Decision tree**:
```
Two VPCs, low traffic, no transitive   → VPC Peering
Service consumed across many VPCs      → PrivateLink (endpoint service)
≥3 VPCs with transitive routing needed → Transit Gateway (hub-and-spoke)
Multi-region + on-prem at scale        → Cloud WAN
```
### 1.5 IAM advanced
**SCP** = guardrail at the OU/account level; **deny-by-default** - it grants no permissions, it only constrains. SCPs are evaluated AND'd with IAM policies + permission boundaries: an action is allowed only when ALL of them allow it.
**Permission boundary** = the maximum permissions a role/user CAN have, regardless of attached policies. Used for delegated admin (a developer can create roles, but only within the boundary).
**ABAC** = attribute-based access control via tags (e.g., `aws:PrincipalTag/team` must equal `aws:ResourceTag/team`). Reduces role count drastically. SCPs can lock the tagging itself so principals can't escalate by re-tagging.
Example SCP - deny creating resources without an approved `Environment` tag:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUntaggedEnvProd",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestTag/Environment": ["prod", "staging", "dev"]
        }
      }
    }
  ]
}
```
### 1.6 Cost optimization (FinOps lever in §11)
**Compute discount tiers** (max savings vs on-demand):
| Mechanism | Max Discount | Flexibility |
|-----------|-------------|-------------|
| Standard RI | 75% | Locked region+family+OS, 1 or 3 yr |
| Convertible RI | 54% | Can change family within OS |
| EC2 Instance SP | 72% | Locked family, any size, any AZ |
| Compute SP | 66% | EC2 + Fargate + Lambda + SageMaker |
| Spot | 90% | Variable interruption (2-min notice) |
| Graviton | +40% perf/$ | ARM64 (must support arch) |
**June 2025 change** - RIs and SPs are restricted to a single end customer; MSPs can no longer share commitments across accounts.
Surrogate-1 must learn to read `Compute Optimizer` recommendations and apply them.
### 1.7 AWS-specific tools & CLI surface
```
aws cli     → primary
aws cdk     → preferred IaC (TS/Python). CDK Refactor (Sept 2025) safely renames constructs without replacement
aws sam     → serverless/Lambda focus
aws copilot → ECS/Fargate (END OF SUPPORT June 12 2026 - migrate to ECS Express or CDK L3)
aws amplify → frontend + serverless backend, Git-driven CI/CD
```
### 1.8 Training corpus for AWS
```
- AWS Well-Architected docs (all 6 pillar PDFs + 9 lenses)
- AWS official examples: aws-samples/* (8000+ repos)
- terraform-aws-modules/* (vpc 126M downloads, eks 96.3M downloads)
- AWS CDK guide v2 + cdk-patterns/serverless
- SAP-C02 question banks (ExamTopics, Tutorials Dojo)
- AWS Architecture Center reference architectures (multi-account, DR, hybrid)
- Service Control Policy examples: aws-samples/service-control-policy-examples
```
**Eval target**: 75% on a custom AWS-design eval (multi-account VPC + hub-spoke + IAM bootstrap + EKS cluster) with `cfn-lint` + `cfn-guard` passing.
---
## 2. GCP Deep Mastery
### 2.1 Certifications
| Cert | Released | Scope |
|------|----------|-------|
| Cloud Digital Leader | - | Business/strategy |
| Associate Cloud Engineer | - | gcloud + GCE/GKE/GCS basics |
| Professional Cloud Architect (PCA) | refreshed Oct 30 2025 | Design; ~30% net-new content (Vertex AI, Gemini, AI Hypercomputer) |
| Professional Cloud Network Engineer (PCNE) | - | VPC, hybrid, Cloud Interconnect |
| Professional Cloud DevOps Engineer | - | SLO, CI/CD, observability |
| Professional Cloud Security Engineer | - | Org policies, VPC-SC, BeyondCorp |
| Professional Cloud Database Engineer | - | Cloud SQL, AlloyDB, Spanner |
PCA exam covers Compute Engine, Cloud Storage, App Engine, GKE, with the Oct 2025 refresh adding ~30% new content focused on Vertex AI, Gemini integration, and AI Hypercomputer.
### 2.2 GKE advanced
**GKE Autopilot** - Google manages provisioning, scaling, security, add-ons. Bills per-pod resource request (not nodes). Best when the team doesn't want to tune nodepools.
**GKE Standard** - Customer-managed nodepools; required for DaemonSets that need privileged hostPath, custom CNI, niche GPU/TPU shapes.
**GKE version ladder** - GKE adopts new K8s versions fastest (~2 weeks). Autopilot gets 30 months extended support; AKS LTS 24 months; EKS Extended Support +12 months.
**Anthos / GKE Enterprise** - Multi-cluster across on-prem + AWS + Azure. Provides Config Sync (GitOps), Service Mesh, Policy Controller. Now folded into the GKE Enterprise SKU.
### 2.3 BigQuery + Vertex AI integration (2025)
- `AI.GENERATE`, `AI.GENERATE_TABLE`, `AI.EMBED`, `AI.SIMILARITY` are now **GA** in BigQuery.
- BQML supports Gemini 3.0 for generative SQL functions.
- Vertex AI End User Credentials (2025) lets Vertex models authenticate via the calling user's IAM - no service-account proxy.
This is core for any data-platform engineering Surrogate-1 builds.
### 2.4 Cloud Run + Cloud Functions
- Cloud Run gen2 = container-as-a-service, scales to zero, max 60-min timeout, supports websockets/streaming.
- Cloud Functions gen2 = built ON Cloud Run; choose Functions for trigger-driven, Run for HTTP/services.
- Cloud Run jobs = batch workloads (cron via Cloud Scheduler).
### 2.5 GCP-specific tools
```
gcloud           → primary CLI
Terraform google → official provider, fastest day-1 support for new services
Config Connector → GCP-native Crossplane equivalent (KCC). Manage GCP resources via K8s CRDs
Cloud Deploy     → managed GitOps for GKE
Cloud Build      → CI (yaml + buildpacks)
```
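A sketch of the Config Connector pattern (resource name, namespace, and machine tier are illustrative; KCC maps a namespace to a GCP project via the `cnrm.cloud.google.com/project-id` annotation):
```yaml
# Hypothetical KCC manifest - `kubectl apply` reconciles a real Cloud SQL instance
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  name: orders-db
  namespace: platform        # namespace annotated with the target GCP project
spec:
  region: us-central1
  databaseVersion: POSTGRES_16
  settings:
    tier: db-custom-2-7680   # 2 vCPU / 7.5 GB custom shape
```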
### 2.6 Training corpus for GCP
```
- GCP architecture center (cloud.google.com/architecture)
- terraform-google-modules/* (network, kubernetes-engine, cloud-foundation-fabric)
- Cloud Foundation Fabric (Google's reference org setup)
- gcp-pca-study-guide repos
- Anthos config-management examples
- BQML + Vertex AI codelabs
```
---
## 3. Azure Deep Mastery
### 3.1 Certifications
| Cert | Code | Scope |
|------|------|-------|
| Administrator Associate | AZ-104 | RBAC, IAM, networking, storage, Bicep basics |
| Solutions Architect Expert | AZ-305 | Design β€” governance, identity, infra, app, integration |
| Security Engineer | AZ-500 | Defender, Sentinel, Conditional Access |
| DevOps Engineer Expert | AZ-400 | Pipelines, IaC, monitoring |
AZ-305 (refreshed April 17 2026) covers: Identity/governance/monitoring, data storage, infrastructure & availability, application architecture, network solutions, data integration, business continuity. Prereq: AZ-104.
### 3.2 Azure compute deep cuts
```
AKS                               → managed K8s; "AKS LTS" = 24-mo extended support per minor
App Service                       → PaaS web hosting (Plans = Basic/Standard/Premium/Isolated)
Functions                         → consumption / premium / dedicated
Container Apps                    → CaaS on KEDA (scale-to-zero from events)
Container Instances (ACI)         → single-pod throwaway
Virtual Machine Scale Sets (VMSS) → IaaS auto-scaling
Azure Spring Apps                 → managed Spring Boot
```
### 3.3 Azure DevOps + GitHub Enterprise (Microsoft owns both)
- **Azure DevOps** = Boards + Repos + Pipelines + Artifacts. Mature for .NET-heavy orgs.
- **GitHub Enterprise** + Actions = where new investment is going (Microsoft's strategic direction).
- 2025 trend: most new Azure customers go GitHub-first; Azure DevOps is in maintenance mode.
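A minimal sketch of that GitHub-first flow, assuming Entra ID workload identity federation is already set up (resource group and template names are placeholders):
```yaml
# .github/workflows/deploy.yml - OIDC login to Azure, then a Bicep deployment
name: deploy
on:
  push:
    branches: [main]
permissions:
  id-token: write   # required for OIDC federation
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az deployment group create --resource-group rg-prod --template-file main.bicep
```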
### 3.4 Azure tooling
```
az cli                      → primary
Bicep                       → DSL that transpiles to ARM. JSON ARM templates → DEPRECATED for new work
Pulumi                      → first-class Azure native provider
Terraform azurerm + azuread → mature, official
```
Bicep simplifies ARM but is Azure-only - for multi-cloud orgs, Terraform remains primary.
### 3.5 Training corpus for Azure
```
- Cloud Adoption Framework (Microsoft's enterprise reference)
- Azure-Samples/* GitHub org
- Azure Verified Modules (AVM) - Microsoft's curated Bicep + Terraform modules
- AZ-305 study guides + Microsoft Learn content
- Azure Architecture Center patterns
```
---
## 4. Multi-Cloud Strategy
### 4.1 Workload portability tools
| Tool | Approach | Best fit |
|------|----------|----------|
| Crossplane | K8s-native control plane → cloud APIs via providers | Platform teams already on K8s |
| Anthos | GCP-managed clusters across clouds + on-prem | GKE-centric orgs wanting unified control |
| Azure Arc | Azure-managed servers/K8s outside Azure | Azure-centric hybrid |
| Terraform | IaC abstraction (provider-per-cloud) | Most common; least lock-in |
| Pulumi | Real code (Python/TS); equivalent provider coverage | Engineering-heavy teams |
### 4.2 Crossplane v2 (Aug 2025)
Major upgrades:
- **Compositions can include any K8s resource** - not just Crossplane MRs. Mix RDSInstance + Deployment + CloudNativePG cluster in one XR.
- **Namespace-first** - XRs and MRs are namespaced by default (was cluster-scoped).
- **Operations** - function pipelines for cert monitoring, rolling upgrades, scheduled maintenance.
- **Multi-cloud status** - AWS providers fully migrated; Azure/GCP/Terraform providers still being updated to v2.
### 4.3 DR / failover patterns
| Pattern | RTO | RPO | Cost (vs single-region) |
|---------|-----|-----|-------------------------|
| Backup & restore | hours-days | hours | 1.0x (storage only) |
| Pilot light | 10s of min | minutes | 1.1-1.3x |
| Warm standby | minutes | minutes | 1.5-1.8x |
| Multi-site active/active | seconds | ~0 | 1.8-2.5x |
Multi-cloud active/active typically costs 1.8–2.5x single-cloud due to duplicate infra + ops overhead. Recommendation: active/passive across clouds + active/active across regions WITHIN the primary cloud.
### 4.4 Latency-based routing
```
Route53 latency policy   → AWS-native, cheapest
Cloud DNS geo-routing    → GCP-native
Azure Traffic Manager    → Azure-native
Cloudflare load balancer → multi-cloud
NS1 / Constellix         → enterprise multi-cloud DNS
```
Cloudflare LB is the most common cross-cloud answer because it sits OUTSIDE the providers.
### 4.5 Cost arbitrage
- GPU cost: GCP < AWS < Azure (TPUs are GCP-only and cheaper per FLOP at scale)
- Egress: AWS most expensive; Cloudflare R2 has $0 egress (S3-compatible)
- Object storage: B2 ($6/TB/mo) < R2 ($15) < S3 standard ($23) < GCS standard ($26)
- Reserved discounts: deepest in AWS (75% std RI), shallower in Azure (65%), GCP CUDs ~57%
### 4.6 Vendor lock-in mitigation
```
1. Use OSS data formats (Parquet, Iceberg, Delta) - not proprietary
2. Use OSS DBs (Postgres / Redis-compatible Valkey) - not Aurora-only or Cosmos-only
3. Use OCI containers + K8s - cluster portability via Crossplane/Anthos
4. Use Terraform with multi-provider modules - abstract per-cloud differences
5. Avoid managed-vendor-only auth - use OIDC + Keycloak or Auth0 (cross-cloud)
6. Multi-cloud DNS (Cloudflare/NS1) so Route53/Cloud DNS isn't single point
```
---
## 5. IaC Mastery
### 5.1 Terraform / OpenTofu (post-BSL fork)
- HashiCorp **Terraform stopped being OSS with the BSL move (OSS releases discontinued after July 2025)**; OpenTofu is the OSS continuation under the Linux Foundation.
- Most TACOS (Spacelift, Env0, Scalr) support both, and most modules still work in both.
- For new orgs in 2026, **default to OpenTofu**.
**Best practices** (2025):
```
1. Remote backend (S3+DynamoDB lock, GCS, Azure Blob) - never local state
2. Split state: per-environment (dev/staging/prod) + per-domain (network, data, compute)
- Terralith state >50MB causes timeouts; >10MB visible perf hit
3. Module versioning: `~> 2.5` (allow patch+minor, block major)
4. Pre-commit: terraform fmt + validate + tflint + tfsec/checkov
5. CI/CD: Atlantis (OSS, self-host) or Spacelift / Env0 / Scalr / Terramate (SaaS)
6. State locking always on
7. Drift detection: `terraform plan -refresh-only` on schedule (Spacelift / Atlantis cron)
8. Workspaces only for environment isolation; NOT for tenant separation
```
**Workspace anti-pattern**: using workspaces for cust-1, cust-2, cust-3 - these should be separate state files / dirs instead. Workspaces are good for `dev`, `staging`, `prod` of the same module.
### 5.2 CloudFormation
```
- Nested stacks    → for >500 resources / cross-stack dependencies
- Custom resources → Lambda-backed for CFN gaps. Use AwsCustomResource (CDK) for single-API-call (sketch below)
- Change sets      → preview before apply (mandatory for prod)
- Stack policies   → prevent accidental updates to specific resources
- Service Catalog  → curated CFN templates exposed to devs
- StackSets        → multi-account/multi-region rollout
```
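A minimal sketch of the Lambda-backed custom-resource pattern (the `Custom::SlackChannel` type and the notifier function are hypothetical; CFN invokes the `ServiceToken` Lambda on create/update/delete and reads back whatever attributes it returns):
```yaml
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  NotifierFunctionArn:
    Type: String                 # existing Lambda implementing the CFN response protocol
Resources:
  DeployChannel:
    Type: Custom::SlackChannel   # hypothetical custom type
    Properties:
      ServiceToken: !Ref NotifierFunctionArn
      ChannelName: deploys       # arbitrary properties are passed through to the Lambda
Outputs:
  ChannelId:
    Value: !GetAtt DeployChannel.ChannelId   # attribute returned in the Lambda's Data payload
```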
### 5.3 AWS CDK best practices
```
- Constructs L1 (raw CFN) / L2 (curated AWS) / L3 (composite patterns)
- Aspects → enforce policy across all constructs (e.g., "all S3 buckets must encrypt")
- Aspects run at synth-time → cfn-guard runs post-synth → both = defense in depth
- Don't extend Construct unless interacting with AWS resources directly; helper class is enough
- Custom resources: use AwsCustomResource for single API call; full Lambda-backed for complex
- CDK Refactor (Sept 2025) → safely rename or move resources without replacement
- Pipelines L3 = managed CodePipeline that self-mutates
```
### 5.4 Pulumi
- Real code (TS/Python/Go/.NET/Java) - language loops, classes, unit tests with native frameworks.
- Pulumi onboarding ~30% faster for engineers already knowing TS/Python (vs HCL).
- Day-1 support for new cloud services because Pulumi wraps SDKs directly.
- Pulumi ESC = encrypted env+secrets store; Pulumi Deployments = managed runners.
### 5.5 Crossplane (K8s-native multi-cloud)
```yaml
# Composition that creates RDS + Deployment + Service in one XR
apiVersion: apiextensions.crossplane.io/v2
kind: Composition
metadata:
  name: web-app-with-db
spec:
  compositeTypeRef:
    apiVersion: example.io/v1alpha1
    kind: WebApp
  pipeline:
    - step: provision-db
      functionRef:
        name: function-patch-and-transform
      input:
        apiVersion: pt.fn.crossplane.io/v1beta1
        kind: Resources
        resources:
          - name: rds
            base:
              apiVersion: rds.aws.upbound.io/v1beta1
              kind: Instance
              spec:
                forProvider:
                  instanceClass: db.t3.medium
                  engine: postgres
                  engineVersion: "16"
                  allocatedStorage: 50
          - name: deployment
            base:
              apiVersion: apps/v1
              kind: Deployment
              spec:
                replicas: 3
```
### 5.6 IaC TACOS comparison
| Tool | OSS / SaaS | IaC Coverage | Best For |
|------|-----------|--------------|----------|
| Atlantis | OSS, self-host | TF/OpenTofu/Terragrunt | Free, GitHub-PR workflow |
| Spacelift | SaaS + self-hosted | TF/OpenTofu/Terragrunt/Pulumi/CFN/K8s/Ansible | Enterprise multi-IaC |
| Env0 | SaaS only | Multi-IaC + strong FinOps | FinOps-aware deployment |
| Terramate | OSS + SaaS | TF/OpenTofu | Stack orchestration + DAGs |
| Scalr | SaaS + self-hosted | TF/OpenTofu | TFC alternative |
| Terraform Cloud | SaaS | TF only | Default if already HashiCorp |
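As a sketch of the Atlantis GitHub-PR workflow from the table above (directory layout and project name are illustrative), a repo-level `atlantis.yaml`:
```yaml
# atlantis.yaml - plan on PR open, apply only after review gates pass
version: 3
projects:
  - name: network-prod
    dir: environments/prod/network
    autoplan:
      when_modified: ["*.tf", "../../../modules/**/*.tf"]
    apply_requirements: [approved, mergeable]
```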
### 5.7 Real Terraform module example (registry modules, DRY)
```hcl
# environments/prod/main.tf
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 5.0"
name = "prod-vpc"
cidr = "10.0.0.0/16"
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]
enable_nat_gateway = true
single_nat_gateway = false # one per AZ for HA
enable_vpn_gateway = false
enable_dns_hostnames = true
enable_flow_log = true
flow_log_destination_type = "cloud-watch-logs"
tags = local.common_tags
}
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.0"
cluster_name = "prod-platform"
cluster_version = "1.32"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
cluster_endpoint_public_access = false
cluster_endpoint_private_access = true
cluster_addons = {
coredns = { most_recent = true }
kube-proxy = { most_recent = true }
vpc-cni = { most_recent = true }
aws-ebs-csi-driver = { most_recent = true }
eks-pod-identity-agent = { most_recent = true }
}
eks_managed_node_groups = {
system = {
instance_types = ["t3.medium"]
min_size = 2
max_size = 4
desired_size = 2
labels = { workload = "system" }
taints = [{ key = "system", value = "true", effect = "NO_SCHEDULE" }]
}
karpenter = {
instance_types = ["m6g.large"] # Graviton
capacity_type = "ON_DEMAND"
min_size = 1
max_size = 2
desired_size = 1
labels = { workload = "karpenter" }
}
}
enable_irsa = true
enable_cluster_creator_admin_permissions = true
}
```
### 5.8 Training corpus for IaC
```
- HashiCorp learn.hashicorp.com/terraform tutorials (1000+ lessons)
- terraform-aws-modules / terraform-google-modules / Azure/terraform-azurerm-* (AVM)
- Pulumi pulumi/examples (1500+)
- aws-samples/aws-cdk-examples
- Crossplane upbound/configurations (reference platforms)
- Awesome-terraform / awesome-pulumi GitHub lists
- IaC-Eval benchmark (academic Terraform benchmark)
- TACOS docs: Spacelift, Env0, Atlantis, Terramate
```
---
## 6. Kubernetes Platform Engineering
### 6.1 Kubernetes 1.32 β†’ 1.35 highlights (2025)
| Version | Released | Key features |
|---------|----------|-------------|
| 1.32 | Dec 2024 | KubeletFineGrainedAuthz; Memory Manager GA; Anonymous Auth Configurable Endpoints |
| 1.33 | Apr 2025 | Sidecars GA; supplementalGroupsPolicy beta; in-place pod resize beta |
| 1.34 | Aug 2025 | DRA core GA (Dynamic Resource Allocation for GPUs/TPUs/FPGAs) |
| 1.35 | Dec 2025 | Fine-grained Supplemental Groups GA; TLS 1.3 baseline |
**Pod Security Standards** are GA since v1.25 (NOT 2025). Three levels: Privileged / Baseline / Restricted, applied via `pod-security.kubernetes.io/<mode>` namespace labels.
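A minimal sketch of enforcing the Restricted profile on a namespace (the namespace name is illustrative):
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject non-conforming pods
    pod-security.kubernetes.io/audit: restricted     # record violations in the audit log
    pod-security.kubernetes.io/warn: restricted      # warn clients on apply
```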
### 6.2 Helm vs Kustomize vs Carvel
| Tool | Approach | Strength | Weakness |
|------|----------|----------|----------|
| Helm | Templating + values + chart | Package manager (75% adoption); Helm 4 (Nov 2025) adds server-side apply | Templating debug pain |
| Kustomize | Patch-based overlays on bases | No magic; built into kubectl | No release/version concept; needs ArgoCD/Flux for state |
| Carvel | ytt + kapp + kbld + imgpkg | Strong CI bundling; image relocation | Steeper learning curve, smaller community |
**Mature pattern**: Helm to install upstream charts (Cilium, ArgoCD, Prometheus); Kustomize overlays per environment. Use ArgoCD `helm` source with `valuesObject` overrides.
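A sketch of that pattern for one upstream chart (the pinned chart version and the single override shown are placeholders to verify against the real chart):
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cilium
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://helm.cilium.io
    chart: cilium
    targetRevision: 1.16.5        # pin upstream chart versions
    helm:
      valuesObject:               # inline overrides; no separate values file to sync
        kubeProxyReplacement: true
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```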
### 6.3 GitOps - ArgoCD vs FluxCD (2025 reality)
**Weaveworks closed in early 2024** - Flux became fully community-driven (CNCF graduated). ArgoCD has the clearer commercial path (Akuity, Codefresh).
| Aspect | ArgoCD | FluxCD |
|--------|--------|--------|
| UI | Strong native dashboard | None native (use Weave GitOps or third-party) |
| RBAC | Built-in + Projects multi-tenancy | Standard K8s RBAC only |
| Architecture | Hub-and-spoke | Decentralized, K8s-idiomatic |
| Multi-cluster | Native (App-of-Apps, ApplicationSets) | Per-cluster Flux + Notification Controller |
| Best for | Most enterprises in 2025 | Air-gapped / minimal-deps / true GitOps purists |
Default 2026 recommendation: **ArgoCD** for most orgs.
### 6.4 Service Mesh - Istio vs Linkerd vs Cilium (2025)
| Mesh | Sidecars | Data plane | Best fit |
|------|----------|------------|----------|
| Istio | Sidecar OR Ambient (ztunnel + waypoint) | Envoy | Advanced traffic mgmt, deep telemetry |
| Linkerd | Sidecar only | linkerd2-proxy (lightweight Rust) | Simplicity + lowest overhead |
| Cilium | Sidecarless | eBPF + Envoy (L7) | Network policy + perf at scale |
**Memory cost reality**: 500 services on Istio sidecars = ~25–50 GB more RAM than the same on Linkerd. That translates to real $$.
**Cilium caveat**: eBPF can't parse HTTP/gRPC or terminate mTLS - Cilium still uses Envoy for L7, so the perf delta vs Istio at L7 is small.
**Decision tree**:
```
Tiny team, just want mTLS + observability      → Linkerd
Already on Cilium CNI, want unified            → Cilium Service Mesh
Need full traffic mgmt (canary, mirror, fault) → Istio Ambient
```
### 6.5 Ingress + Gateway API (the Ingress era is ending)
Ingress-NGINX officially **halts maintenance in March 2026**; Gateway API is the K8s-official successor.
Gateway API provides:
- Protocol-agnostic (HTTP, TCP, gRPC, TLS passthrough)
- Role-split: GatewayClass (provider) → Gateway (cluster operator) → HTTPRoute (app dev)
- Built-in canary/blue-green via weighted routes
- Both north-south AND east-west
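A minimal sketch of the weighted-route canary mentioned above (gateway and service names are illustrative):
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders-canary
spec:
  parentRefs:
    - name: public-gateway          # Gateway owned by the cluster operator
  hostnames: ["orders.example.com"]
  rules:
    - backendRefs:
        - name: orders-v1
          port: 8080
          weight: 90                # 90/10 canary split
        - name: orders-v2
          port: 8080
          weight: 10
```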
Ingress controllers / Gateway implementations:
| Implementation | Notes |
|----------------|-------|
| Envoy Gateway | Reference implementation; CNCF |
| Istio | Native Gateway API support (replaces Istio VirtualService for new) |
| NGINX Gateway Fabric | NGINX-backed, replaces ingress-nginx |
| Cilium Gateway | CNI-integrated |
| Traefik | Long-time leader for Ingress; Gateway API supported |
Migration: **`ingress2gateway` 1.0** (2026) translates Ingress + annotations → Gateway API resources.
### 6.6 Operators
```
Operator SDK (Red Hat) → Go/Helm/Ansible scaffolding
Kubebuilder            → upstream K8s SIG; cleaner Go
KUDO                   → declarative operator definition
metacontroller         → Lua/JSONNET-style hooks (lightweight)
```
When to write an operator: state machine that doesn't fit `Deployment` (e.g., DB clustering, leader election, custom backup).
When NOT: just templating → use Helm/Kustomize.
### 6.7 Multi-cluster - Karmada vs Cluster API vs OCM
**KubeFed is EOL** (no commits since 2020).
| Tool | Approach |
|------|----------|
| Karmada | CNCF Incubation; multi-cluster scheduling + propagation policy. v1.15 (Oct 2025) adds multi-template workload awareness + structured logging |
| Cluster API (CAPI) | Declarative cluster lifecycle (CAPA AWS, CAPG GCP, CAPZ Azure providers) |
| Open Cluster Management (OCM) | Red Hat-led; ACM commercial product |
| Anthos / GKE Enterprise | GCP-managed; folds in Config Sync + Mesh + Policy |
| Azure Arc | Azure-managed; brings Azure Policy/Monitor to any cluster |
Pattern: **CAPI provisions clusters**, **Karmada propagates workloads**, **ArgoCD reconciles config**.
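A sketch of the propagation half of that pattern (member-cluster names are illustrative):
```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: orders-api
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: orders-api
  placement:
    clusterAffinity:
      clusterNames: [eks-us-east-1, gke-europe-west1]
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided   # split replicas across member clusters
```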
### 6.8 Cost - Kubecost vs OpenCost
- OpenCost (Apache 2.0) - free, single-cluster focus, real-time allocation by pod/namespace/controller, multi-cloud (AWS/GCP/Azure). Now ships with a built-in MCP server (2025) for AI agent access.
- Kubecost (IBM-owned post-2024 acquisition) - adds budgets, RBAC, multi-cluster aggregation, automated cost policies. Starts at $449/mo, enterprise on quote.
### 6.9 Karpenter + Spot + Graviton
Real customer outcomes:
- Tinybird: 20% AWS bill reduction with EKS+Karpenter+Spot
- Series B SaaS (200 microservices): $52k → $23k/mo (56%) with Graviton mix + Karpenter + Spot
- One reported migration: $50k → $22k/mo with Karpenter + Spot + VPA
**NodePool best practices**:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64", "amd64"] # Graviton preferred but allow x86 fallback
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"] # generation > 5, i.e. m6g+, c6g+, r6g+
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 1000
```
### 6.10 Helm chart example for a service
```yaml
# values.yaml
image:
  repository: ghcr.io/org/api
  tag: "" # set by ArgoCD Image Updater or via CI
  pullPolicy: IfNotPresent
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    memory: 512Mi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 30
  targetCPUUtilizationPercentage: 70
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123:role/api-irsa # IRSA for AWS
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 65532
  fsGroup: 65532
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
podDisruptionBudget:
  enabled: true
  minAvailable: 2
networkPolicy:
  enabled: true
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: gateway
```
### 6.11 Training corpus for K8s
```
- kubernetes/kubernetes (source + design proposals KEPs)
- kubernetes/website (docs)
- helm/charts (deprecated) + bitnami/charts + community charts
- argo-cd/argo-cd repo + examples
- karmada-io/karmada
- cilium/cilium (eBPF code + e2e tests)
- istio/istio
- linkerd/linkerd2
- aws/karpenter-provider-aws
- backstage/backstage source + plugins
- run-x/awesome-kubernetes
- KubeCon talk transcripts (CNCF YouTube; can transcribe via Whisper)
```
**Eval target**: 70% on K8s-Bench (manifest validity + Helm chart that `helm template` validates + ArgoCD Application that syncs + NetworkPolicy that locks down by default).
---
## 7. Internal Developer Platform (IDP)
### 7.1 IDP landscape (2025)
| Tool | Type | Strength | TTV (time-to-value) |
|------|------|---------|---------------------|
| **Backstage** (Spotify, CNCF) | OSS framework, build-it-yourself portal | Most flexible; 120+ Spotify-internal plugins; CNCF | 3–6 months |
| **Port** | Commercial SaaS portal | No-code, fast to deploy | Days |
| **Cortex** | Commercial β€” service ownership + scorecards | Best for >50-eng orgs needing governance | Weeks |
| **OpsLevel** | Commercial β€” quality scorecards | Strong dashboards | Weeks |
| **Humanitec** | Platform Orchestrator (NOT a portal) | Backend that resolves Score files into infra | Weeks |
| **Encore** | All-in-one (codegen + infra) | Strong opinionated dev workflow | Days |
| **Cloudomation** | Workflow automation IDP | Low-code for non-K8s orgs | Days |
**Key mental model**: Portal (Backstage/Port) ≠ Orchestrator (Humanitec). You often need BOTH - the portal as UI, the orchestrator as the backend that creates the actual cloud resources.
### 7.2 Backstage core
```
Catalog            → entities (Component, System, API, Resource, Group, User)
TechDocs           → MkDocs-based, lives next to code
Software Templates → Cookiecutter-style scaffolds (repo + IaC + pipeline + DB)
Search             → indexes catalog + docs
RBAC               → Spotify's RBAC plugin (commercial)
Soundcheck         → Spotify's tech-standards/scorecard plugin (commercial)
Insights           → adoption analytics (commercial)
Cloud Backstage    → managed hosted (commercial)
```
Open-source plus commercial Spotify Portal (RBAC, Insights, Soundcheck) = "production-ready Backstage."
**catalog-info.yaml** example:
```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: orders-api
  description: Order management service
  annotations:
    github.com/project-slug: org/orders-api
    backstage.io/techdocs-ref: dir:.
    pagerduty.com/integration-key: ${SECRET_PD}
    sonarqube.org/project-key: org_orders-api
    grafana/dashboard-selector: "tags @> 'orders'"
  tags: [java, spring-boot, payments-domain]
spec:
  type: service
  lifecycle: production
  owner: payments-team
  system: payments
  providesApis: [orders-rest-api]
  consumesApis: [users-rest-api]
  dependsOn: [resource:orders-db]
```
### 7.3 Score (CNCF, 2024) - workload spec
Score is a platform-agnostic workload spec. The developer writes ONE YAML; the platform team's `score-compose` / `score-helm` / `score-k8s` translates it.
```yaml
# score.yaml
apiVersion: score.dev/v1b1
metadata:
  name: orders-api
containers:
  api:
    image: ghcr.io/org/orders-api:${TAG}
    variables:
      DATABASE_URL: postgres://${secrets.DB_PASSWORD}@${resources.db.host}/orders
      REDIS_URL: ${resources.cache.url}
    resources:
      requests: { cpu: "100m", memory: "256Mi" }
service:
  ports:
    web: { port: 8080 }
resources:
  db:
    type: postgres
  cache:
    type: redis
  route:
    type: dns
    params: { host: orders.example.com }
```
The platform team configures resource definitions (e.g., `db.postgres` → AWS RDS via Crossplane) - devs don't see or care.
### 7.4 OAM vs Score
OAM is broader (whole-app model with traits + scopes + components); Score is single-workload + simpler. Score is winning in 2025 because of its narrower scope and CNCF backing.
### 7.5 Humanitec orchestrator pattern
```
Developer: score.yaml in repo
GitOps:    commit → CI → calls Humanitec API
Humanitec: resolves score against Resource Definitions
           → creates EKS Deployment + RDS + Redis + Route53 record
Platform:  defines Resource Definitions (e.g., postgres → AWS RDS via TF/Crossplane)
```
### 7.6 Training corpus for IDP
```
- backstage/backstage source + ALL community plugins (roadie/* / spotify/*)
- score-spec/spec + reference implementations (score-compose/score-helm/score-k8s)
- Humanitec docs + Resource Definition examples
- Port templates marketplace
- Cortex YAML scorecard library
- platformengineering.org community articles
- KubeCon Platform Engineering Day talks (transcripts)
```
---
## 8. Edge + Serverless Platforms
### 8.1 Latency / cold-start reality (2025)
| Platform | P50 latency | Cold start | POPs |
|----------|------------|-----------|------|
| Cloudflare Workers | 10–30ms | <1ms (V8 isolates) | 330+ |
| Vercel Edge Functions | <50ms | sub-50ms | 18 (uses Lambda@Edge under the hood in some regions) |
| Lambda@Edge (Node) | 50–80ms | 250–800ms | AWS edge POPs |
| Lambda@Edge (Python) | similar | 400–1200ms | same |
| Fastly Compute@Edge (WASM) | ~5–10ms | <1ms | 80+ |
| Deno Deploy | low | low | global |
| Bun runtime | fastest cold-start of any Node-compat | n/a | self-hosted |
Cloudflare Workers benchmark ~441% faster than Lambda at p95 and offer unlimited bandwidth on the free tier.
### 8.2 Cloudflare ecosystem
```
Workers         → V8 isolate functions (JS/TS/WASM)
Pages           → static + Workers (serverless full-stack)
R2              → S3-compatible object storage, zero egress
D1              → serverless SQLite (replicated)
KV              → eventually-consistent KV
Durable Objects → strongly-consistent stateful primitives
Queues          → managed message queue
Workers AI      → run LLMs at the edge (Llama, Whisper, Stable Diffusion)
Vectorize       → vector DB (RAG at edge)
Hyperdrive      → connection pooler for Postgres/MySQL behind edge
Stream          → video transcoding + delivery
```
### 8.3 Vercel ecosystem
```
Edge Functions   → V8 isolate runtime (JS/TS/WASM), Workers-compatible
Edge Middleware  → runs BEFORE the request enters serverless
Serverless Funcs → Node/Python/Go/Ruby, AWS Lambda under the hood
Postgres         → managed Postgres (built on Neon)
KV               → built on Upstash Redis
Blob             → object storage
```
### 8.4 Multi-region edge strategies
```
Pattern 1 β€” Edge cache + origin in primary region
Cloudflare cache β†’ S3/Lambda in us-east-1
Trade: simple, 100ms+ for cache misses
Pattern 2 β€” Workers + DB-at-edge
CF Workers β†’ D1/Hyperdrive
Trade: edge writes; eventual consistency
Use: read-heavy auth, profile, feature flags
Pattern 3 β€” Multi-region active/active
CF LB β†’ Workers in EU + US + APAC β†’ regional Aurora DSQL
Trade: cost 2x; near-zero RTO across regions
Pattern 4 β€” Global table + edge CDN
CF Cache β†’ Lambda β†’ DynamoDB Global Tables (multi-master)
Trade: replication lag; eventual consistency
```
### 8.5 WASM serverless (2025–2026)
- WASI 0.2 (Component Model) GA → portable across runtimes (Wasmtime, Spin, wasmCloud, Wasmer).
- Cold starts: microseconds (vs 100–500ms for containers).
- Major clouds now offer Wasm-based FaaS as a mainstream option.
- Wing language **shutdown April 2025** - OSS code lives on but no commercial backing.
---
## 9. Database Platform
### 9.1 Postgres options
| Service | Multi-region | Best for |
|---------|--------------|----------|
| RDS Postgres | Read replicas | Standard managed |
| Aurora Postgres | Cross-region read replicas + Global DB (1 writer) | Standard scale-out |
| **Aurora DSQL** | **Active/active, strong consistency, GA May 2025** | **New globally-distributed apps** |
| AlloyDB (GCP) | HA + read pool nodes | Postgres-compat OLTP+OLAP at GCP |
| Cloud SQL (GCP) | Single-region HA | Standard managed |
| Azure Database for Postgres Flex | Single-region HA | Standard managed |
| Neon | Branching (Git-like) | Dev velocity |
| Supabase | Postgres + auth + realtime | Full BaaS |
| Crunchy Bridge | Multi-cloud Postgres | Vendor-neutral |
| PlanetScale | (Now Postgres + Vitess) | Sharded scale-out |
### 9.2 Aurora DSQL deep cuts (GA May 2025)
- Disaggregated architecture: query processor + adjudicator + journal + crossbar - each scales independently.
- 99.99% single-region SLA, 99.999% multi-region.
- Active/active multi-master (peers); third region as a log-only witness.
- Region groupings only - US (us-east-1, us-east-2, us-west-2), EU (eu-west-1/2/3), APAC (ap-northeast-1/2/3).
- No cross-continent yet.
- PostgreSQL wire-compatible.
### 9.3 Distributed SQL (NewSQL)
| DB | TPC-C (TPS) | PG compat | Multi-region |
|----|-------------|-----------|--------------|
| CockroachDB | 45k | wire only | Best with geo-partitioning |
| YugabyteDB | 48k | full (reuses PG query layer) | Strong with row-level geo |
| TiDB | 40k+ (write-heavy lead) | MySQL primary | ✓ |
| Aurora DSQL | benchmarked fastest by AWS | wire | Region-grouped |
| Spanner | 1M+ at scale | GoogleSQL or PG dialect | Global by design |
YugabyteDB wins for PG migration (full compat). CockroachDB wins for geo-partitioning. Spanner remains gold standard at hyperscale.
### 9.4 Vitess (MySQL sharding)
- Open-source MySQL sharding system.
- Powers YouTube, Slack, GitHub, PlanetScale.
- Functions: query routing, online schema migration (with `gh-ost`), connection pooling, transparent sharding.
- Newer alternative: CockroachDB / Aurora DSQL eliminate manual sharding.
### 9.5 NoSQL
```
DynamoDB               → AWS, single-digit-ms; on-demand or provisioned
DynamoDB Global Tables → multi-region multi-master (last-writer-wins)
Spanner                → strongly consistent global
Cosmos DB              → multi-model (SQL/MongoDB/Cassandra/Gremlin); 5 consistency levels
Cassandra/Scylla       → wide-column; high write throughput
MongoDB Atlas          → document; managed across all 3 clouds
```
### 9.6 Vector DBs (2025 production benchmarks)
| DB | p99 latency | QPS | Notable |
|----|-------------|-----|---------|
| Qdrant | 2ms | 12k | Best low-latency, $25/mo+ cloud |
| Milvus / Zilliz | 5ms | 8k | Billion-scale; built-in BM25 + dense (30x faster than Elasticsearch) |
| Pinecone | 8ms | 5k | Fully managed, 99% recall |
| Weaviate | 10ms | 4k | BlockMax WAND (10x keyword speed); MUVERA multi-vector |
| pgvector | varies | depends | If you already have Postgres |
| OpenSearch k-NN | varies | depends | If you already have OpenSearch |
### 9.7 Migration tools (2025)
| Tool | Approach | Best for |
|------|----------|----------|
| Liquibase | Imperative changelogs (XML/YAML/JSON/SQL); FSL license post-v5; AI rollback assist (2025) | Multi-DB enterprise |
| Flyway | Numbered SQL files; Java ecosystem standard; Teams tier discontinued 2025 | Java teams |
| Atlas (atlasgo.io) | Declarative HCL + computed migration plan | Terraform-style schema-as-code |
| Prisma Migrate | Declarative, ORM-coupled | Node/TS apps |
| goose | Plain SQL/Go migrations | Go services |
Atlas is the modern recommendation - the same paradigm as Terraform.
---
## 10. Networking Deep
### 10.1 DNS
```
Route53          → AWS native, latency/geo/failover; alias records to AWS resources
Cloud DNS        → GCP native
Azure DNS        → Azure native
Cloudflare DNS   → fastest authoritative (1.1.1.1 is recursive); free
NS1 / Constellix → enterprise multi-cloud DNS, advanced traffic steering
```
### 10.2 CDN performance (Cloudflare 95p TTFB benchmark, Nov 2024–Mar 2025)
- Cloudflare fastest in ~48% of top 1000 networks.
- Fastly extremely close in many networks (e.g., +0.2% lead on Comcast).
- CloudFront strong inside AWS-heavy stacks (free egress to AWS origins).
- All have edge compute now: Workers / Compute@Edge / Lambda@Edge.
### 10.3 Load balancers (AWS)
```
ALB (L7)           → HTTP/HTTPS/gRPC; WAF integration; target group flexibility
NLB (L4)           → TCP/TLS/UDP; static IPs; millions of RPS
GWLB               → traffic inspection (third-party firewall in chain)
ELB Classic        → legacy, avoid
Global Accelerator → anycast IPs in front of ALB/NLB for global traffic
```
### 10.4 Zero Trust Network Access (2025)
| Tool | Architecture | Best fit |
|------|-------------|----------|
| Tailscale | WireGuard mesh + identity overlay | Fastest dev access; great for SSH/RDP/DB |
| Twingate | Layer 4 ZTNA (no mesh); resource-grained | App-name + group-based access |
| Cloudflare Access + WARP | SASE - Access for apps + Gateway for SWG | When Cloudflare is the wider stack |
| Zscaler | Enterprise SASE | Big-org compliance |
| Pomerium | Self-hosted reverse-proxy ZTNA | OSS option |
Tailscale wins on dev velocity (sign in, get tailnet); Cloudflare Access wins on full SASE; Twingate wins on resource granularity.
### 10.5 WAF
```
AWS WAF                    → tied to CloudFront/ALB/API Gateway
Cloudflare WAF             → in front of any origin
Azure Front Door WAF       → tied to AFD
Akamai App & API Protector → enterprise
```
### 10.6 DDoS protection
```
AWS Shield Advanced            → $3000/mo + transfer; 24/7 SRT
Cloudflare                     → unmetered DDoS protection (free tier!)
Google Cloud Armor             → tier-based
Azure DDoS Protection Standard → per-resource
```
---
## 11. FinOps + Cost Engineering
### 11.1 FinOps Foundation Framework 2025 (Inform / Optimize / Operate)
```
INFORM   → Visibility, allocation, benchmarking, budgeting, forecasting
OPTIMIZE → Identify and execute waste reduction
OPERATE  → KPI tracking, governance policies aligned with business
```
### 11.2 2025 framework changes β€” **Scopes**
The 2025 Framework adds **Scopes** as a structural element. Scopes define context: Public Cloud, SaaS (Snowflake, Salesforce), GenAI (LLM API spend), Data Center, Private Cloud. Each capability is now applied **per scope**.
### 11.3 Cost allocation tags (mandatory at provision time)
```
Required tags for every resource:
- Environment : prod/staging/dev/sandbox
- Owner : team-name (matches catalog)
- CostCenter : finance code
- Project : product/feature
- DataClass : public/internal/confidential/regulated
```
Enforce via:
- AWS: SCP `aws:RequestTag/X` (deny on creation), Tag Policies
- GCP: Org Policy required labels
- Azure: Azure Policy required tags
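For the GCP line, a sketch using a custom org-policy constraint (org ID, label key, and target resource type are placeholders; the CEL condition must match the resource schema):
```yaml
# gcloud org-policies set-custom-constraint constraint.yaml
name: organizations/123456789/customConstraints/custom.requireEnvLabel
resourceTypes:
  - compute.googleapis.com/Instance
methodTypes: [CREATE]
condition: "'environment' in resource.labels"
actionType: ALLOW        # allow creation only when the condition holds
displayName: Require an environment label on new instances
```
The constraint still needs an enforcing policy afterwards (`gcloud org-policies set-policy`).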
### 11.4 Showback / chargeback
```
Showback   → "your team used $X" (no actual billing)
Chargeback → cross-charge cost center (real finance impact)
```
Tools: Vantage, CloudHealth, Apptio Cloudability, Kubecost (k8s-specific), Infracost (pre-deploy IaC estimate).
### 11.5 Anomaly detection
```
AWS Cost Anomaly Detection (free)
Vantage anomalies + alerts (commercial)
CloudZero / Spend.io
ProsperOps → automated commitment management
```
### 11.6 Right-sizing automation
```
AWS Compute Optimizer → free; recs for EC2, Lambda, EBS, ASG, ECS
GCP Recommender       → equivalent
Azure Advisor         → equivalent
ScaleOps / StormForge → K8s VPA recommender for prod
```
### 11.7 Spot orchestration (Karpenter, ProsperOps)
Already covered in §6.9. Karpenter handles AWS-native Spot; CAST AI / ScaleOps for cross-cloud.
### 11.8 Training corpus for FinOps
```
- FinOps Foundation framework docs (finops.org/framework)
- AWS / GCP / Azure cost optimization whitepapers
- Vantage / CloudZero / Apptio public benchmarks
- KubeCon FinOps track talks (transcripts)
- Real customer cost-cut case studies (already collected: Tinybird, Series B SaaS examples)
```
---
## 12. 2025–2026 Platform Engineering Trends
### 12.1 Internal LLM gateways (the 2026 must-have)
| Tool | Type | Key strength | Cost |
|------|------|-------------|------|
| **LiteLLM** | OSS, self-host | OpenAI-compat; cheapest at $10k+ MRR; 100+ providers | Free + infra |
| **Portkey** | SaaS or self-host | SOC2/HIPAA/ISO27001; observability; 250+ LLMs | $49/mo+ |
| **OpenRouter** | SaaS | Pay-per-token; consumer-friendly | 5% markup |
| **Helicone** | OSS observability | Caching + analytics | Free + cloud |
| **Truefoundry / Bifrost** | SaaS | LLM gateway + ML platform | Quote |
LiteLLM is the **default for orgs serious about cost** - it runs as your own proxy, with no markup.
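A minimal sketch of a LiteLLM proxy `config.yaml` (model choices and env-var names are placeholders):
```yaml
# litellm config.yaml - one OpenAI-compatible endpoint in front of many providers
model_list:
  - model_name: default                # name clients request
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
litellm_settings:
  num_retries: 2
```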
### 12.2 AI agents in platform engineering
- **Resolve.ai** - AI SRE; auto-investigates alerts, RCA in minutes, MTTR -80%; customers: Coinbase, DoorDash, Toast, Zscaler. $40M Series A extension at $1.5B (2026).
- **Aviator (aviator.co)** - AI code review + merge queues + deployment.
- **OpenText DevOps Aviator** - AI for performance engineering scripts.
- **Cursor / Sourcegraph Cody / GitHub Copilot Workspace** - IDE-side coding agents.
- **Codeium / Tabnine / Continue** - open-source IDE agents.
### 12.3 Per-PR ephemeral environments
| Tool | Approach |
|------|----------|
| Coherence | PR comment with auto-preview URL; spot-backed for cost |
| Uffizzi | OSS + cloud; vCluster-based isolated environments |
| Render Preview | Built-in to Render |
| Vercel Preview | Built-in to Vercel |
| Netlify Deploy Previews | Built-in to Netlify |
| Argo CD ApplicationSet PR Generator | OSS K8s-native |
| vCluster + ArgoCD | DIY pattern; cheapest at scale |
Best practice: every PR gets a unique URL, smoke tests run against it, design reviewer can click before merge.
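A sketch of the Argo CD ApplicationSet PR-generator row (repo coordinates, manifest path, and namespace scheme are placeholders):
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: pr-previews
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: org
          repo: api
        requeueAfterSeconds: 300        # poll for opened/closed PRs
  template:
    metadata:
      name: "api-pr-{{number}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/org/api.git
        targetRevision: "{{head_sha}}"
        path: deploy
      destination:
        server: https://kubernetes.default.svc
        namespace: "pr-{{number}}"
      syncPolicy:
        automated: {}
        syncOptions: [CreateNamespace=true]
```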
### 12.4 WASM-based services
- Production runtimes: Wasmtime (Bytecode Alliance), wasmCloud, Spin (Fermyon), Wasmer.
- Use cases: edge serverless, plugin systems (Envoy filters, Istio extensions, Postgres extensions), embedded scripting.
- Platforms moving to Wasm: Fastly Compute@Edge (WASM-only), Cloudflare Workers (V8 + WASM), Spin Hub.
### 12.5 AI-native databases / observability
```
LangSmith              → LLM tracing + evals (LangChain)
Helicone               → LLM tracing + caching (OSS option)
Phoenix (Arize)        → OSS LLM observability
Langfuse               → OSS, self-host LLM observability
Weights & Biases Weave → MLOps + LLM
```
### 12.6 Autonomous Cloud Engineer (the Surrogate-1 mission)
The path is converging on:
1. **MCP** (Model Context Protocol) - standardizes how agents pull cloud state. AWS docs MCP, OpenCost MCP (2025), terraform-mcp-server.
2. **Multi-agent systems** - research / planner / executor / critic agents (CrewAI, LangGraph, AutoGen).
3. **Tool-using agents** - agents that call `terraform plan`, `kubectl apply`, `aws sts get-caller-identity`, `gh pr create`.
Surrogate-1's training MUST include MCP-call patterns + tool-use traces.
---
## 13. Training Data Sources
### 13.1 Curated GitHub repos
```
Cloud
- awesome-aws (donnemartin)
- awesome-gcp (GoogleCloudPlatform/awesome-google-cloud)
- awesome-azure (kristofferandreasen/awesome-azure)
- aws-samples/* (8000+ official AWS samples)
- GoogleCloudPlatform/* (1500+ GCP samples)
- Azure-Samples/*
K8s
- run-x/awesome-kubernetes
- ramitsurana/awesome-kubernetes
- tomhuang12/awesome-k8s-resources
- kubernetes/kubernetes (source + KEPs)
- kubernetes-sigs/* (CAPI, Gateway API, Karpenter)
- helm/charts (deprecated but reference)
- bitnami/charts
- argoproj/argo-cd
IaC
- hashicorp/terraform
- terraform-aws-modules/* (40+ official modules)
- terraform-google-modules/*
- Azure/terraform-azurerm-* (AVM)
- pulumi/examples
- aws/aws-cdk
- crossplane/crossplane + upbound/configurations
Platform
- backstage/backstage + roadie/* + spotify/* community plugins
- score-spec/spec
- humanitec-architecture/*
Eval
- codefuse-ai/codefuse-devops-eval
- IaC-Eval (academic)
- NL2Bash
```
### 13.2 Reddit communities (curate top-voted threads, last 2 yrs)
- r/devops, r/aws, r/AZURE, r/googlecloud, r/kubernetes, r/Terraform, r/sysadmin, r/sre, r/platformengineering
### 13.3 Conference talks (transcribe via Whisper, MIT-licensed for TLP/CNCF)
- KubeCon + CloudNativeCon (CNCF YouTube; ~600 talks/year)
- AWS re:Invent (several thousand sessions; breakouts archived)
- Google Cloud Next (annual)
- Microsoft Ignite / Build
- HashiConf
- PlatformCon (annual, online)
- SREcon (USENIX)
### 13.4 Public datasets on HuggingFace
```
- CatOwl/Terraform (Terraform code corpus)
- nvidia/OpenCodeReasoning (reasoning over code)
- bigcode/the-stack-v2 (filtered code, has IaC files)
- mhhmm/codealpaca-iac (instruction tuning for IaC)
- Custom: collect from terraform-aws-modules/eks/aws + variants
```
### 13.5 Documentation (for retrieval / SFT context)
- AWS docs (full), GCP docs, Azure docs (Microsoft Learn), CNCF docs, K8s docs, Helm docs, Terraform/OpenTofu docs.
- AWS Well-Architected Framework PDFs (one per pillar).
- Google Cloud Architecture Framework.
- Azure Cloud Adoption Framework + Well-Architected Framework.
### 13.6 Synthesized data (recommended approach)
For Surrogate-1 v2:
```
1. Take each terraform-aws-modules example
2. Mutate: change region, instance type, AZ count, subnet sizes
3. Build instruction format: "Build me a 3-AZ VPC in us-west-2 with public+private+db subnets using terraform-aws-modules/vpc/aws"
4. Output: working main.tf + outputs.tf + variables.tf
5. For each AWS service, generate:
- "What is X" Q&A from official docs
- "Compare X vs Y" from official docs
- "Migrate from X to Y" code examples
6. Multi-step trajectories:
- "Build me a SaaS platform on AWS" β†’ 30+ step reasoning trace through architecture decisions
```
Total target: ~100k–250k cloud/platform instruction-tuning examples.
---
## 14. Eval Benchmarks
### 14.1 Existing benchmarks
| Benchmark | What it tests | Surrogate-1 fit |
|-----------|---------------|-----------------|
| codefuse-ai/codefuse-devops-eval | DevOps Q&A multiple-choice | Quick sanity check |
| IaC-Eval (academic) | Terraform generation correctness | Direct fit |
| KubeBench (community) | K8s manifest validity | Direct fit |
| NL2Bash | Bash command from NL | Tooling sub-skill |
| BIG-Bench (subset) | Various reasoning | General |
| HumanEval / MBPP | General coding | Already passes (Qwen2.5-Coder-7B baseline) |
### 14.2 Custom Surrogate-1 v2 evals (we author)
```
Surrogate-1 Cloud Eval v2:
1. Terraform generation (200 prompts, varying complexity)
- Pass = `terraform validate` + `terraform plan` succeeds
- Score: % passing × % correct logical structure (judge LLM)
2. Helm chart authoring (50 prompts)
- Pass = `helm template` produces valid YAML
- Score: % passing × `kubeval` validation rate
3. CDK/CFN authoring (100 prompts)
- Pass = `cdk synth` succeeds
- Score: + `cfn-lint` clean rate, + `cfn-guard` policy pass
4. ArgoCD Application + Kustomize (50 prompts)
- Pass = ArgoCD CLI dry-run succeeds
5. Multi-cloud DR scenario (30 prompts)
- Open-ended: "Design active/passive across AWS+GCP for a SaaS, RTO=15min, RPO=1min"
- Score: judged by GPT-5 / Claude / human reviewer on architecture quality
6. Cost optimization (50 prompts)
- Given a `terraform plan` output, return cost reductions (Graviton swap, RIs/SPs, Spot)
- Score: judged on $$ accuracy (vs Infracost ground truth)
7. K8s troubleshooting (50 prompts)
- Given pod logs + describe output, return root cause + fix
- Score: % matching ground truth
8. Tool-use traces (100 prompts)
- Given a goal, agent must call `aws cli` / `kubectl` / `terraform` correctly
- Score: % achieving goal (sandbox eval)
```
Total: ~630 prompts. Run with rubric judges (GPT-5/Claude). Surrogate-1 v2 target: **65% overall** (above Qwen2.5-Coder-7B baseline of ~38%).
### 14.3 Capability tiers (target)
| Tier | Capability | v2 Target |
|------|-----------|-----------|
| 1 | Recognize + classify cloud services | 95% |
| 2 | Author single-file IaC (Terraform/CDK/Helm) | 75% |
| 3 | Author multi-file project (VPC + EKS + RDS + ArgoCD) | 60% |
| 4 | End-to-end design trace ("build SaaS on AWS") | 50% |
| 5 | Multi-cloud DR design + tool execution | 35% (stretch) |
---
## v2 Curriculum Integration Plan
For the v2 LoRA fine-tune of Qwen2.5-Coder-7B → Surrogate-1:
### Data mix (target ~250k instruction examples)
```
40% IaC generation (Terraform / OpenTofu / CDK / Pulumi / Bicep / Crossplane)
20% K8s authoring (Helm / Kustomize / ArgoCD / Karpenter)
15% Cloud architecture Q&A (mined from cert prep + docs)
10% Cost optimization scenarios (FinOps mined + synthesized)
10% IDP / Backstage / Score / Humanitec patterns
5% Multi-step tool-use traces (terraform plan → fix → apply)
```
### Key sources (direct ingestion priorities)
```
1. terraform-aws-modules/* + terraform-google-modules/* + Azure AVM (canonical IaC)
2. backstage/backstage source + plugin examples
3. AWS Well-Architected docs (all pillars + lenses)
4. GCP Cloud Adoption Framework
5. CNCF KubeCon transcripts (Whisper-extracted)
6. score-spec + humanitec docs
7. OpenCost docs + MCP-pattern examples
8. Real customer post-mortems (Tinybird -20%; Series-B SaaS -$29k/mo)
9. IaC-Eval benchmark training set
10. CodeFuse DevOps-Eval training set
```
### Eval gates
- v2 cannot ship until ≥65% overall on Surrogate-1 Cloud Eval v2.
- Tier-3 (multi-file) ≥60% is the practical bar for autonomous infra building.
- Add MCP-tool-use trajectory eval (sandbox terraform/kubectl/aws calls).
---
## Sources Consulted
- AWS Well-Architected Framework (6 pillars docs, Sustainability pillar Nov 2024 refresh)
- Terraform / OpenTofu best practices (Terramate, Spacelift, env0, Scalr 2025 articles)
- Kubernetes 1.32-1.35 release notes; CNCF security blog Dec 2025
- Backstage docs + Spotify Backstage portal blog (2025)
- ArgoCD / FluxCD comparison articles (2025-2026 post-Weaveworks closure)
- Crossplane v2.0 release blog + InfoQ article (Aug 2025)
- Karpenter cost optimization blogs (Tinybird; Series-B SaaS case studies)
- Cloudflare Workers / Vercel Edge / Lambda@Edge benchmarks (2025)
- FinOps Foundation 2025 framework + Scopes update
- Istio / Linkerd / Cilium 2025 benchmarks (deepness-lab academic paper)
- Pulumi / Terraform / CDK / Bicep 2025 comparisons
- CockroachDB / YugabyteDB / Spanner / Aurora DSQL 2025 benchmarks
- AWS SAP-C02 / GCP PCA (Oct 2025 refresh) / Azure AZ-305 (April 2026 refresh)
- Backstage / Port / Cortex / Humanitec IDP comparison (2025-2026)
- Karmada v1.15 + KubeFed EOL + Cluster API
- Coherence / Uffizzi ephemeral environments (2025)
- AWS CDK best practices (CDK Refactor Sept 2025)
- VPC Transit Gateway / PrivateLink hub-spoke patterns
- Helm / Kustomize / Carvel comparison (Helm 4 Nov 2025)
- terraform-aws-modules registry top downloads (May 2025 stats)
- Liquibase / Flyway / Atlas migration tools (2025 license + features)
- Aurora DSQL GA announcement (May 2025)
- CDN benchmarks (Cloudflare 95p TTFB 2024-2025)
- AWS Savings Plans / Reserved Instances June 2025 policy changes
- IAM SCPs + Permission Boundaries + ABAC patterns
- GKE / EKS / AKS managed K8s comparison (2025-2026)
- terraform-aws-modules registry usage (vpc 126M, eks 96.3M downloads)
- Vertex AI / BigQuery / Gemini integration (2025)
- Resolve.ai AI SRE + Aviator (2025-2026)
- LiteLLM / Portkey / OpenRouter LLM gateway comparison (2025)
- Multi-cloud DR active/active vs active/passive patterns
- Wing language shutdown (April 2025) + WASM serverless trends
- Awesome-aws / awesome-kubernetes curated lists
- Kubecost / OpenCost cost visibility (Kubecost IBM acquisition 2024)
- Atlantis / Spacelift / Env0 / Terramate IaC platforms
- Score spec + OAM workload specifications
- Karpenter NodePool + Spot + Graviton best practices
- Tailscale / Twingate / Cloudflare Access ZTNA comparison
- Vector DB benchmarks (Pinecone / Weaviate / Qdrant / Milvus 2025)
- AWS Copilot end-of-support (June 12 2026) + SAM + Amplify
- Gateway API + ingress-nginx retirement (March 2026)
- DevOps eval benchmarks + IaC-Eval academic benchmark