| title: Cloud + Platform Engineering Deep Research for Surrogate-1 v2 | |
| date: 2026-04-29 | |
| purpose: Train Surrogate-1 (Qwen2.5-Coder-7B + LoRA) into a SOTA Cloud / Platform Engineer | |
| scope: AWS + GCP + Azure + Edge + IDP + IaC + K8s + FinOps + Multi-cloud DR | |
| # Surrogate-1 SOTA Cloud + Platform Engineer Training Plan | |
| This document is the canonical knowledge base used to design the v2 instruction-tuning curriculum for Surrogate-1. The model must, autonomously and end-to-end: | |
| 1. Design + provision multi-cloud infrastructure (AWS, GCP, Azure, Cloudflare, Vercel) | |
| 2. Author production-grade IaC (Terraform, OpenTofu, CDK, Pulumi, Bicep, Crossplane) | |
| 3. Operate Kubernetes platforms (EKS / GKE / AKS) with GitOps + service mesh | |
| 4. Build internal developer platforms (Backstage, Port, Score, Humanitec) | |
| 5. Handle FinOps lifecycle (Inform / Optimize / Operate, 2025 + Scopes) | |
| 6. Execute multi-cloud disaster recovery + global routing | |
| 7. Stand up edge/serverless (Cloudflare Workers, Vercel Edge, Lambda@Edge) | |
| The research is organized into 14 verticals. Each section closes with the **training corpus** + **eval target** for the v2 curriculum. | |
| --- | |
| ## 1. AWS Deep Mastery | |
| ### 1.1 Certification Scope (training-data anchors) | |
| | Cert | Code | Topics | Why we mine it | | |
| |------|------|--------|----------------| | |
| | Solutions Architect Associate | SAA-C03 | VPC, EC2, S3, RDS, Lambda, IAM basics | Foundational service catalog | | |
| | Solutions Architect Pro | SAP-C02 | Multi-account, hybrid, migration, DR, cost-resilience | Most question banks for org-complexity | | |
| | DevOps Engineer Pro | DOP-C02 | CI/CD, monitoring, IaC, governance | Pipelines + observability | | |
| | Security Specialty | SCS-C02 | KMS, GuardDuty, Inspector, SCPs, IRSA | Hardening + compliance scenarios | | |
| | Advanced Networking Specialty | ANS-C01 | Transit Gateway, Direct Connect, Cloud WAN | Multi-VPC + hybrid networking | | |
| The SAP-C02 exam validates designing multi-account strategies, hybrid architectures, migration at scale, cost optimization, security, and resilience, which is exactly the Surrogate-1 scope. The exam has 65 scored + 10 unscored questions; the passing score is 750/1000. | |
| ### 1.2 Well-Architected Framework: 6 pillars (Sustainability added Dec 2021, refreshed Nov 2024) | |
| ``` | |
| 1. Operational Excellence: IaC, runbooks, observability, post-mortems | |
| 2. Security: IAM, encryption, network, IR | |
| 3. Reliability: RTO/RPO, failover, multi-AZ/multi-region | |
| 4. Performance Efficiency: right-sized compute, modern data services | |
| 5. Cost Optimization: RIs/SPs/Spot/Graviton, lifecycle rules | |
| 6. Sustainability: energy efficiency, region selection, idle cleanup | |
| ``` | |
| **Lenses** Surrogate-1 must recognize: Serverless, SaaS, Migration, Generative AI, IoT, Hybrid Networking, Financial Services, Streaming Media, ML. | |
| ### 1.3 Top 30 services for startup/SaaS workloads | |
| ``` | |
| Compute : EC2, Lambda, Fargate, Batch, ECS, EKS, App Runner | |
| Storage : S3, EFS, FSx, EBS | |
| DB : RDS (Postgres/MySQL), Aurora, Aurora DSQL, DynamoDB, ElastiCache (Redis/Valkey), OpenSearch | |
| Network : VPC, Route53, CloudFront, ALB/NLB/GWLB, Transit Gateway, PrivateLink, API Gateway | |
| Identity : IAM, IAM Identity Center (SSO), Cognito, Organizations, Verified Permissions | |
| Observability: CloudWatch, X-Ray, Managed Prometheus, Managed Grafana, OpenSearch | |
| Security : KMS, Secrets Manager, GuardDuty, Inspector, Security Hub, WAF, Shield | |
| Data/AI : Bedrock, SageMaker, Glue, Athena, Kinesis, MSK, Step Functions | |
| Messaging : SQS, SNS, EventBridge | |
| DevTools : CodePipeline, CodeBuild, CodeDeploy, CDK, SAM | |
| ``` | |
| ### 1.4 VPC networking patterns | |
| **Hub-and-spoke with Transit Gateway**: TGW is the managed hub-and-spoke service for VPCs and on-prem; it centralizes routing without VPN overlays. Aligns with Well-Architected Reliability pillar `REL02-BP04`. | |
| **Centralized PrivateLink endpoints**: Host interface VPC endpoints (e.g., for `s3.api`, `kms`, `sts`, `secretsmanager`) in a single shared-services VPC. All spoke VPCs reach AWS APIs via TGW → endpoint VPC. This saves cost: one endpoint at about $7.30/month per AZ instead of N copies, one per spoke VPC. | |
| **Decision tree**: | |
| ``` | |
| Two VPCs, low traffic, no transitive → VPC Peering | |
| Service consumed across many VPCs → PrivateLink (endpoint service) | |
| ≥3 VPCs with transitive routing needed → Transit Gateway (hub-and-spoke) | |
| Multi-region + on-prem at scale → Cloud WAN | |
| ``` | |
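The decision tree above can be sketched as a small function. This is only an illustration: the function name, the parameter names, and the order in which the branches are checked are assumptions, not something the source specifies.

```python
def choose_vpc_connectivity(
    vpc_count: int,
    needs_transitive_routing: bool = False,
    exposes_single_service: bool = False,
    multi_region_with_on_prem: bool = False,
) -> str:
    """Map workload traits to the connectivity option from the decision tree.

    Branch precedence (Cloud WAN first, then PrivateLink, then Transit
    Gateway) is an assumption made for this sketch.
    """
    if multi_region_with_on_prem:
        return "Cloud WAN"
    if exposes_single_service:
        return "PrivateLink"
    if vpc_count >= 3 and needs_transitive_routing:
        return "Transit Gateway"
    # Default case: a couple of VPCs with no transitive routing requirement.
    return "VPC Peering"


if __name__ == "__main__":
    print(choose_vpc_connectivity(2))
    print(choose_vpc_connectivity(5, needs_transitive_routing=True))
```

A real selector would also weigh traffic volume and per-attachment cost, which the tree only hints at with "low traffic".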
| ### 1.5 IAM advanced | |
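| **SCP** = guardrail at OU/account level. An SCP never grants permissions; it only caps what identity policies can allow, and anything it does not allow is implicitly denied. SCPs are effectively AND'd with IAM policies + permission boundaries: an action is allowed only when ALL of them allow it and no explicit deny applies. | |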
| **Permission boundary** = max permissions a role/user CAN have, regardless of attached policies. Used for delegated admin (developer can create roles, but only ones bounded). | |
| **ABAC** = attribute-based access control via tags (e.g., `aws:PrincipalTag/team` must equal `aws:ResourceTag/team`). Reduces role count drastically. SCPs can lock the tagging itself so principals can't escalate by re-tagging. | |
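The interaction of SCPs, permission boundaries, identity policies, and ABAC tag matching can be illustrated with a toy evaluator. This is a deliberately simplified sketch, not AWS's actual evaluation logic: it assumes policies are plain sets of exact action names (no wildcards, no resource policies, no session policies), and all names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass
class Request:
    action: str
    principal_tags: Dict[str, str] = field(default_factory=dict)
    resource_tags: Dict[str, str] = field(default_factory=dict)


def is_allowed(
    request: Request,
    scp_allow: FrozenSet[str],
    boundary_allow: FrozenSet[str],
    identity_allow: FrozenSet[str],
    explicit_deny: FrozenSet[str] = frozenset(),
    abac_tag: Optional[str] = None,
) -> bool:
    """Toy model: every layer must allow the action, and any deny wins."""
    # An explicit deny in any layer overrides every allow.
    if request.action in explicit_deny:
        return False
    # ABAC: the principal's tag value must equal the resource's tag value.
    if abac_tag is not None:
        principal_value = request.principal_tags.get(abac_tag)
        if principal_value is None or principal_value != request.resource_tags.get(abac_tag):
            return False
    # SCP, permission boundary, and identity policy are AND'd together.
    layers = (scp_allow, boundary_allow, identity_allow)
    return all(request.action in layer for layer in layers)


if __name__ == "__main__":
    everything = frozenset({"s3:GetObject", "s3:PutObject"})
    req = Request("s3:GetObject", {"team": "payments"}, {"team": "payments"})
    print(is_allowed(req, everything, everything, everything, abac_tag="team"))
```

The point of the sketch is the shape of the check: removing the action from any one layer, or mismatching the tag, flips the result to a denial.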
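| Example SCP: deny EC2 instance and RDS instance creation unless the request carries an approved `Environment` tag. The statement is scoped to the instance/DB resource types, because `aws:RequestTag` is absent on the other resources that `ec2:RunInstances` touches (subnet, AMI, ...) and a `Resource: "*"` deny would then block every launch: | |
| ```json | |
| { | |
|   "Version": "2012-10-17", | |
|   "Statement": [ | |
|     { | |
|       "Sid": "DenyUntaggedEnvProd", | |
|       "Effect": "Deny", | |
|       "Action": ["ec2:RunInstances", "rds:CreateDBInstance"], | |
|       "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*"], | |
|       "Condition": { | |
|         "StringNotEquals": { | |
|           "aws:RequestTag/Environment": ["prod", "staging", "dev"] | |
|         } | |
|       } | |
|     } | |
|   ] | |
| } | |
| ``` | |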
| ### 1.6 Cost optimization (FinOps lever in §11) | |
| **Compute discount tiers** (max savings vs on-demand): | |
| | Mechanism | Max Discount | Flexibility | | |
| |-----------|-------------|-------------| | |
| | Standard RI | 72% | Locked region+family+OS, 1 or 3 yr | |
| | Convertible RI | 66% | Can be exchanged for a different family, OS, or tenancy | |
| | EC2 Instance SP | 72% | Locked family, any size, any AZ | |
| | Compute SP | 66% | EC2 + Fargate + Lambda (SageMaker has its own Savings Plan) | |
| | Spot | 90% | Variable interruption (2-min notice) | | |
| | Graviton | +40% perf/$ | ARM64 (must support arch) | | |
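To make the table concrete, the arithmetic behind a "max discount" can be sketched as follows. The on-demand price of $0.10/hour and the 730-hour month are illustrative assumptions; the percentages are the table's best-case figures, so the results are cost floors, not typical bills.

```python
HOURS_PER_MONTH = 730  # average hours in a month, a common billing approximation


def discounted_rate(on_demand_hourly: float, max_discount_pct: float) -> float:
    """Best-case hourly rate after applying a maximum discount percentage."""
    if not 0 <= max_discount_pct <= 100:
        raise ValueError("discount must be between 0 and 100")
    return on_demand_hourly * (1 - max_discount_pct / 100)


def monthly_floor(on_demand_hourly: float, max_discount_pct: float) -> float:
    """Lowest possible monthly cost for one always-on instance."""
    return discounted_rate(on_demand_hourly, max_discount_pct) * HOURS_PER_MONTH


if __name__ == "__main__":
    rate = 0.10  # hypothetical on-demand price in USD/hour
    for name, pct in (("Compute SP", 66), ("Spot", 90)):
        print(f"{name}: ${monthly_floor(rate, pct):.2f}/month at best")
```

Real savings depend on term, payment option, region, and instance family, so a model trained on this material should treat these numbers as upper bounds on savings.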
| **June 2025 change**: RIs and SPs are restricted to a single end customer; MSPs can no longer share commitments across accounts. | |
| Surrogate-1 must teach `Compute Optimizer` recommendations + apply them. | |
| ### 1.7 AWS-specific tools & CLI surface | |
| ``` | |
| aws cli: primary | |
| aws cdk: preferred IaC (TS/Python). CDK Refactor (Sept 2025) safely renames constructs without replacement | |
| aws sam: serverless/Lambda focus | |
| aws copilot: ECS/Fargate (END OF SUPPORT June 12 2026; migrate to ECS Express or CDK L3) | |
| aws amplify: frontend + serverless backend, Git-driven CI/CD | |
| ``` | |
| ### 1.8 Training corpus for AWS | |
| ``` | |
| - AWS Well-Architected docs (all 6 pillar PDFs + 9 lenses) | |
| - AWS official examples: aws-samples/* (8000+ repos) | |
| - terraform-aws-modules/* (vpc 126M downloads, eks 96.3M downloads) | |
| - AWS CDK guide v2 + cdk-patterns/serverless | |
| - SAP-C02 question banks (ExamTopics, Tutorials Dojo) | |
| - AWS Architecture Center reference architectures (multi-account, DR, hybrid) | |
| - Service Control Policy examples: aws-samples/service-control-policy-examples | |
| ``` | |
| **Eval target**: 75% on a custom AWS-design eval (multi-account VPC + hub-spoke + IAM bootstrap + EKS cluster) with `cfn-lint` + `cfn-guard` passing. | |
| --- | |
| ## 2. GCP Deep Mastery | |
| ### 2.1 Certifications | |
| | Cert | Released | Scope | | |
| |------|----------|-------| | |
| | Cloud Digital Leader | - | Business/strategy | |
| | Associate Cloud Engineer | - | gcloud + GCE/GKE/GCS basics | |
| | Professional Cloud Architect (PCA) | refreshed Oct 30 2025 | Design; ~30% net-new content (Vertex AI, Gemini, AI Hypercomputer) | |
| | Professional Cloud Network Engineer (PCNE) | - | VPC, hybrid, Cloud Interconnect | |
| | Professional Cloud DevOps Engineer | - | SLO, CI/CD, observability | |
| | Professional Cloud Security Engineer | - | Org policies, VPC-SC, BeyondCorp | |
| | Professional Cloud Database Engineer | - | Cloud SQL, AlloyDB, Spanner | |
| PCA exam covers Compute Engine, Cloud Storage, App Engine, GKE, with the Oct 2025 refresh adding ~30% new content focused on Vertex AI, Gemini integration, and AI Hypercomputer. | |
| ### 2.2 GKE advanced | |
| **GKE Autopilot**: Google manages provisioning, scaling, security, add-ons. Bills per-pod resource request (not nodes). Best when the team doesn't want to tune nodepools. | |
| **GKE Standard**: Customer-managed nodepools; required for DaemonSets that need privileged hostPath, custom CNI, niche GPU/TPU shapes. | |
| **GKE version ladder**: GKE adopts new K8s versions fastest (~2 weeks). Autopilot gets 30 months extended support; AKS LTS 24 months; EKS Extended Support +12 months. | |
| **Anthos / GKE Enterprise**: Multi-cluster across on-prem + AWS + Azure. Provides Config Sync (GitOps), Service Mesh, Policy Controller. Now folded into GKE Enterprise SKU. | |
| ### 2.3 BigQuery + Vertex AI integration (2025) | |
| - `AI.GENERATE`, `AI.GENERATE_TABLE`, `AI.EMBED`, `AI.SIMILARITY` are now **GA** in BigQuery. | |
| - BQML supports Gemini 3.0 for generative SQL functions. | |
| - Vertex AI End User Credentials (2025) lets Vertex models authenticate via the calling user's IAM, with no service-account proxy. | |
| This is core for any data-platform engineering Surrogate-1 builds. | |
| ### 2.4 Cloud Run + Cloud Functions | |
| - Cloud Run gen2 = container-as-a-service, scales to zero, max 60-min timeout, supports websockets/streaming. | |
| - Cloud Functions gen2 = built ON Cloud Run; choose Functions for trigger-driven, Run for HTTP/services. | |
| - Cloud Run jobs = batch workloads (cron via Cloud Scheduler). | |
| ### 2.5 GCP-specific tools | |
| ``` | |
| gcloud: primary CLI | |
| Terraform google: official provider, fastest day-1 support for new services | |
| Config Connector: GCP-native Crossplane equivalent (KCC). Manage GCP resources via K8s CRDs | |
| Cloud Deploy: managed continuous delivery for GKE and Cloud Run | |
| Cloud Build: CI (yaml + buildpacks) | |
| ``` | |
| ### 2.6 Training corpus for GCP | |
| ``` | |
| - GCP architecture center (cloud.google.com/architecture) | |
| - terraform-google-modules/* (network, kubernetes-engine, cloud-foundation-fabric) | |
| - Cloud Foundation Fabric (Google's reference org setup) | |
| - gcp-pca-study-guide repos | |
| - Anthos config-management examples | |
| - BQML + Vertex AI codelabs | |
| ``` | |
| --- | |
| ## 3. Azure Deep Mastery | |
| ### 3.1 Certifications | |
| | Cert | Code | Scope | | |
| |------|------|-------| | |
| | Administrator Associate | AZ-104 | RBAC, IAM, networking, storage, Bicep basics | | |
| | Solutions Architect Expert | AZ-305 | Design β governance, identity, infra, app, integration | | |
| | Security Engineer | AZ-500 | Defender, Sentinel, Conditional Access | | |
| | DevOps Engineer Expert | AZ-400 | Pipelines, IaC, monitoring | | |
| AZ-305 (refreshed April 17 2026) covers: Identity/governance/monitoring, data storage, infrastructure & availability, application architecture, network solutions, data integration, business continuity. Prereq: AZ-104. | |
| ### 3.2 Azure compute deep cuts | |
| ``` | |
| AKS: managed K8s; "AKS LTS" = 24-mo extended support per minor | |
| App Service: PaaS web hosting (Plans = Basic/Standard/Premium/Isolated) | |
| Functions: consumption / premium / dedicated | |
| Container Apps: CaaS on KEDA (scale-to-zero from events) | |
| Container Instances (ACI): single-pod throwaway | |
| Virtual Machine Scale Sets (VMSS): IaaS auto-scaling | |
| Azure Spring Apps: managed Spring Boot | |
| ``` | |
| ### 3.3 Azure DevOps + GitHub Enterprise (Microsoft owns both) | |
| - **Azure DevOps** = Boards + Repos + Pipelines + Artifacts. Mature for .NET-heavy orgs. | |
| - **GitHub Enterprise** + Actions = where new investment is going (Microsoft's strategic direction). | |
| - 2025 trend: most new Azure customers go GitHub-first; Azure DevOps is in maintenance mode. | |
| ### 3.4 Azure tooling | |
| ``` | |
| az cli: primary | |
| Bicep: DSL that transpiles to ARM. JSON ARM templates: not recommended for new work (Bicep is the preferred authoring format) | |
| Pulumi: first-class Azure native provider | |
| Terraform azurerm + azuread: mature, official | |
| ``` | |
| Bicep simplifies ARM but is Azure-only; for multi-cloud orgs, Terraform remains primary. | |
| ### 3.5 Training corpus for Azure | |
| ``` | |
| - Cloud Adoption Framework (Microsoft's enterprise reference) | |
| - Azure-Samples/* GitHub org | |
| - Azure Verified Modules (AVM): Microsoft's curated Bicep + Terraform modules | |
| - AZ-305 study guides + Microsoft Learn content | |
| - Azure Architecture Center patterns | |
| ``` | |
| --- | |
| ## 4. Multi-Cloud Strategy | |
| ### 4.1 Workload portability tools | |
| | Tool | Approach | Best fit | | |
| |------|----------|----------| | |
| | Crossplane | K8s-native control plane → cloud APIs via providers | Platform teams already on K8s | |
| | Anthos | GCP-managed clusters across clouds + on-prem | GKE-centric orgs wanting unified control | | |
| | Azure Arc | Azure-managed servers/K8s outside Azure | Azure-centric hybrid | | |
| | Terraform | IaC abstraction (provider-per-cloud) | Most common; least lock-in | | |
| | Pulumi | Real code (Python/TS); equivalent provider coverage | Engineering-heavy teams | | |
| ### 4.2 Crossplane v2 (Aug 2025) | |
| Major upgrades: | |
| - **Compositions can include any K8s resource**: not just Crossplane MRs. Mix RDSInstance + Deployment + CloudNativePG cluster in one XR. | |
| - **Namespace-first**: XRs and MRs are namespaced by default (was cluster-scoped). | |
| - **Operations**: function pipelines for cert monitoring, rolling upgrades, scheduled maintenance. | |
| - **Multi-cloud status**: AWS providers fully migrated; Azure/GCP/Terraform providers still being updated to v2. | |
| ### 4.3 DR / failover patterns | |
| | Pattern | RTO | RPO | Cost (vs single-region) | | |
| |---------|-----|-----|-------------------------| | |
| | Backup & restore | hours-days | hours | 1.0x (storage only) | | |
| | Pilot light | 10s of min | minutes | 1.1-1.3x | | |
| | Warm standby | minutes | minutes | 1.5-1.8x | | |
| | Multi-site active/active | seconds | ~0 | 1.8-2.5x | | |
| Multi-cloud active/active typically costs 1.8-2.5x single-cloud due to duplicate infra + ops overhead. Recommendation: active/passive across clouds + active/active across regions WITHIN primary cloud. | |
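Choosing a DR pattern from RTO/RPO targets can be sketched as "pick the cheapest row that meets both targets". The numeric bounds below are assumptions made for illustration, since the table only gives qualitative ranges ("hours-days", "10s of min", "minutes"); the cost multipliers are the upper ends of the table's ranges.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrPattern:
    name: str
    worst_rto_min: float   # assumed worst-case recovery time, in minutes
    worst_rpo_min: float   # assumed worst-case data loss window, in minutes
    cost_multiplier: float  # upper end of the table's cost range


# Ordered from cheapest to most expensive; minute values are illustrative.
PATTERNS = (
    DrPattern("Backup & restore", 2 * 24 * 60, 12 * 60, 1.0),
    DrPattern("Pilot light", 60, 15, 1.3),
    DrPattern("Warm standby", 15, 5, 1.8),
    DrPattern("Multi-site active/active", 1, 0, 2.5),
)


def cheapest_pattern(rto_target_min: float, rpo_target_min: float) -> Optional[DrPattern]:
    """Return the cheapest pattern whose assumed RTO and RPO meet the targets."""
    for pattern in PATTERNS:
        if pattern.worst_rto_min <= rto_target_min and pattern.worst_rpo_min <= rpo_target_min:
            return pattern
    return None  # no listed pattern satisfies the targets


if __name__ == "__main__":
    choice = cheapest_pattern(rto_target_min=20, rpo_target_min=10)
    print(choice.name if choice else "no pattern fits")
```

In practice the thresholds should come from measured failover drills rather than fixed constants.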
| ### 4.4 Latency-based routing | |
| ``` | |
| Route53 latency policy: AWS-native, cheapest | |
| Cloud DNS geo-routing: GCP-native | |
| Azure Traffic Manager: Azure-native | |
| Cloudflare load balancer: multi-cloud | |
| NS1 / Constellix: enterprise multi-cloud DNS | |
| ``` | |
| Cloudflare LB is the most common cross-cloud answer because it sits OUTSIDE the providers. | |
| ### 4.5 Cost arbitrage | |
| - GPU cost: GCP < AWS < Azure (TPUs are GCP-only and cheaper per FLOP at scale) | |
| - Egress: AWS most expensive; Cloudflare R2 has $0 egress (S3-compatible) | |
| - Object storage: B2 ($6/TB/mo) < R2 ($15) < S3 standard ($23) < GCS standard ($26) | |
| - Reserved discounts: deepest in AWS (up to 72% std RI), shallower in Azure (65%), GCP CUDs ~57% | |
| ### 4.6 Vendor lock-in mitigation | |
| ``` | |
| 1. Use OSS data formats (Parquet, Iceberg, Delta), not proprietary ones | |
| 2. Use OSS DBs (Postgres / Redis-compatible Valkey), not Aurora-only or Cosmos-only | |
| 3. Use OCI containers + K8s: cluster portability via Crossplane/Anthos | |
| 4. Use Terraform with multi-provider modules to abstract per-cloud differences | |
| 5. Avoid managed-vendor-only auth: use OIDC + Keycloak or Auth0 (cross-cloud) | |
| 6. Multi-cloud DNS (Cloudflare/NS1) so Route53/Cloud DNS isn't a single point of failure | |
| ``` | |
| --- | |
| ## 5. IaC Mastery | |
| ### 5.1 Terraform / OpenTofu (post-BSL fork) | |
| - HashiCorp **relicensed Terraform from MPL 2.0 to the BSL in August 2023**; OpenTofu is the open-source (MPL 2.0) continuation under the Linux Foundation. | |
| - Most TACOS (Spacelift, Env0, Scalr) support both. Most modules still work in both. | |
| - For new orgs in 2026: **default OpenTofu**. | |
| **Best practices** (2025): | |
| ``` | |
| 1. Remote backend (S3+DynamoDB lock, GCS, Azure Blob); never local state | |
| 2. Split state: per-environment (dev/staging/prod) + per-domain (network, data, compute) | |
| - Terralith state >50MB causes timeouts; >10MB visible perf hit | |
| 3. Module versioning: `~> 2.5` (allow patch+minor, block major) | |
| 4. Pre-commit: terraform fmt + validate + tflint + tfsec/checkov | |
| 5. CI/CD: Atlantis (OSS, self-host) or Spacelift / Env0 / Scalr / Terramate (SaaS) | |
| 6. State locking always on | |
| 7. Drift detection: `terraform plan -refresh-only` on schedule (Spacelift / Atlantis cron) | |
| 8. Workspaces only for environment isolation; NOT for tenant separation | |
| ``` | |
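The `~> 2.5` rule in item 3 is the pessimistic constraint operator: only the rightmost component given may increase. A minimal sketch of that semantics follows; it handles plain numeric versions only (no pre-release or build suffixes), and the function names are hypothetical.

```python
from typing import Tuple


def _parse(version: str) -> Tuple[int, ...]:
    """Turn '2.5.1' into (2, 5, 1)."""
    return tuple(int(part) for part in version.strip().split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check a version against a '~> X.Y' or '~> X.Y.Z' constraint.

    '~> 2.5'   allows >= 2.5.0 and < 3.0.0
    '~> 2.5.1' allows >= 2.5.1 and < 2.6.0
    """
    base = _parse(constraint.replace("~>", "", 1))
    if len(base) < 2:
        raise ValueError("constraint needs at least two components, e.g. '~> 2.5'")
    # Upper bound: drop the last component and bump the one before it.
    upper = base[:-2] + (base[-2] + 1,)
    candidate = _parse(version)
    width = max(len(candidate), len(base))
    padded_base = base + (0,) * (width - len(base))
    padded_candidate = candidate + (0,) * (width - len(candidate))
    return padded_base <= padded_candidate and padded_candidate[: len(upper)] < upper


if __name__ == "__main__":
    for v in ("2.4.9", "2.5.0", "2.9.1", "3.0.0"):
        print(v, satisfies_pessimistic(v, "~> 2.5"))
```

This is why `~> 2.5` admits new minor and patch releases while blocking the next major version.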
| **Workspace anti-pattern**: using workspaces for cust-1, cust-2, cust-3; these should be separate state files / dirs instead. Workspaces are good for `dev`, `staging`, `prod` of the same module. | |
| ### 5.2 CloudFormation | |
| ``` | |
| - Nested stacks: for >500 resources / cross-stack dependencies | |
| - Custom resources: Lambda-backed for CFN gaps. Use AwsCustomResource (CDK) for single-API-call | |
| - Change sets: preview before apply (mandatory for prod) | |
| - Stack policies: prevent accidental updates to specific resources | |
| - Service Catalog: curated CFN templates exposed to devs | |
| - StackSets: multi-account/multi-region rollout | |
| ``` | |
| ### 5.3 AWS CDK best practices | |
| ``` | |
| - Constructs L1 (raw CFN) / L2 (curated AWS) / L3 (composite patterns) | |
| - Aspects: enforce policy across all constructs (e.g., "all S3 buckets must encrypt") | |
| - Aspects run at synth-time; cfn-guard runs post-synth; both = defense in depth | |
| - Don't extend Construct unless interacting with AWS resources directly; helper class is enough | |
| - Custom resources: use AwsCustomResource for single API call; full Lambda-backed for complex | |
| - CDK Refactor (Sept 2025): safely rename or move resources without replacement | |
| - Pipelines L3 = managed CodePipeline that self-mutates | |
| ``` | |
| ### 5.4 Pulumi | |
| - Real code (TS/Python/Go/.NET/Java): language loops, classes, unit tests with native frameworks. | |
| - Pulumi onboarding ~30% faster for engineers already knowing TS/Python (vs HCL). | |
| - Day-1 support for new cloud services because Pulumi wraps SDKs directly. | |
| - Pulumi ESC = encrypted env+secrets store; Pulumi Deployments = managed runners. | |
| ### 5.5 Crossplane (K8s-native multi-cloud) | |
| ```yaml | |
| # Composition that creates an RDS instance + Deployment in one XR (bases are abbreviated) | |
| apiVersion: apiextensions.crossplane.io/v1 # Composition stays at v1 in Crossplane v2 | |
| kind: Composition | |
| metadata: | |
|   name: web-app-with-db | |
| spec: | |
|   compositeTypeRef: | |
|     apiVersion: example.io/v1alpha1 | |
|     kind: WebApp | |
|   mode: Pipeline | |
|   pipeline: | |
|     - step: provision-db | |
|       functionRef: | |
|         name: function-patch-and-transform | |
|       input: | |
|         apiVersion: pt.fn.crossplane.io/v1beta1 | |
|         kind: Resources | |
|         resources: | |
|           - name: rds | |
|             base: | |
|               apiVersion: rds.aws.upbound.io/v1beta1 | |
|               kind: Instance | |
|               spec: | |
|                 forProvider: | |
|                   instanceClass: db.t3.medium | |
|                   engine: postgres | |
|                   engineVersion: "16" | |
|                   allocatedStorage: 50 | |
|           - name: deployment | |
|             base: | |
|               apiVersion: apps/v1 | |
|               kind: Deployment | |
|               spec: | |
|                 replicas: 3 | |
| ``` | |
| ### 5.6 IaC TACOS comparison | |
| | Tool | OSS / SaaS | IaC Coverage | Best For | | |
| |------|-----------|--------------|----------| | |
| | Atlantis | OSS, self-host | TF/OpenTofu/Terragrunt | Free, GitHub-PR workflow | | |
| | Spacelift | SaaS + self-hosted | TF/OpenTofu/Terragrunt/Pulumi/CFN/K8s/Ansible | Enterprise multi-IaC | | |
| | Env0 | SaaS only | Multi-IaC + strong FinOps | FinOps-aware deployment | | |
| | Terramate | OSS + SaaS | TF/OpenTofu | Stack orchestration + DAGs | | |
| | Scalr | SaaS + self-hosted | TF/OpenTofu | TFC alternative | | |
| | Terraform Cloud | SaaS | TF only | Default if already HashiCorp | | |
| ### 5.7 Real Terraform module example (multi-cloud DRY) | |
| ```hcl | |
| # environments/prod/main.tf | |
| module "vpc" { | |
| source = "terraform-aws-modules/vpc/aws" | |
| version = "~> 5.0" | |
| name = "prod-vpc" | |
| cidr = "10.0.0.0/16" | |
| azs = ["us-east-1a", "us-east-1b", "us-east-1c"] | |
| private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"] | |
| public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"] | |
| database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"] | |
| enable_nat_gateway = true | |
| single_nat_gateway = false # one per AZ for HA | |
| enable_vpn_gateway = false | |
| enable_dns_hostnames = true | |
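| enable_flow_log = true | |
| flow_log_destination_type = "cloud-watch-logs" | |
| create_flow_log_cloudwatch_log_group = true # let the module create the log group | |
| create_flow_log_cloudwatch_iam_role = true # and the IAM role that publishes flow logs | |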
| tags = local.common_tags | |
| } | |
| module "eks" { | |
| source = "terraform-aws-modules/eks/aws" | |
| version = "~> 20.0" | |
| cluster_name = "prod-platform" | |
| cluster_version = "1.32" | |
| vpc_id = module.vpc.vpc_id | |
| subnet_ids = module.vpc.private_subnets | |
| cluster_endpoint_public_access = false | |
| cluster_endpoint_private_access = true | |
| cluster_addons = { | |
| coredns = { most_recent = true } | |
| kube-proxy = { most_recent = true } | |
| vpc-cni = { most_recent = true } | |
| aws-ebs-csi-driver = { most_recent = true } | |
| eks-pod-identity-agent = { most_recent = true } | |
| } | |
| eks_managed_node_groups = { | |
| system = { | |
| instance_types = ["t3.medium"] | |
| min_size = 2 | |
| max_size = 4 | |
| desired_size = 2 | |
| labels = { workload = "system" } | |
| taints = [{ key = "system", value = "true", effect = "NO_SCHEDULE" }] | |
| } | |
| karpenter = { | |
| ami_type = "AL2023_ARM_64_STANDARD" # ARM64 AMI is required for Graviton instance types | |
| instance_types = ["m6g.large"] # Graviton | |
| capacity_type = "ON_DEMAND" | |
| min_size = 1 | |
| max_size = 2 | |
| desired_size = 1 | |
| labels = { workload = "karpenter" } | |
| } | |
| } | |
| enable_irsa = true | |
| enable_cluster_creator_admin_permissions = true | |
| } | |
| ``` | |
| ### 5.8 Training corpus for IaC | |
| ``` | |
| - HashiCorp learn.hashicorp.com/terraform tutorials (1000+ lessons) | |
| - terraform-aws-modules / terraform-google-modules / Azure/terraform-azurerm-* (AVM) | |
| - Pulumi pulumi/examples (1500+) | |
| - aws-samples/aws-cdk-examples | |
| - Crossplane upbound/configurations (reference platforms) | |
| - Awesome-terraform / awesome-pulumi GitHub lists | |
| - IaC-Eval benchmark (academic Terraform benchmark) | |
| - TACOS docs: Spacelift, Env0, Atlantis, Terramate | |
| ``` | |
| --- | |
| ## 6. Kubernetes Platform Engineering | |
| ### 6.1 Kubernetes 1.32 → 1.35 highlights (2025) | |
| | Version | Released | Key features | | |
| |---------|----------|-------------| | |
| | 1.32 | Dec 2024 | KubeletFineGrainedAuthz; Memory Manager GA; Anonymous Auth Configurable Endpoints | | |
| | 1.33 | Apr 2025 | Sidecars GA; supplementalGroupsPolicy beta; in-place pod resize beta | | |
| | 1.34 | Aug 2025 | DRA core GA (Dynamic Resource Allocation for GPUs/TPUs/FPGAs) | | |
| | 1.35 | Dec 2025 | Fine-grained Supplemental Groups GA; TLS 1.3 baseline | | |
| **Pod Security Standards** are GA since v1.25 (NOT 2025). Three levels: Privileged / Baseline / Restricted, applied via `pod-security.kubernetes.io/<mode>` namespace labels. | |
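As a small illustration of the label scheme, the namespace labels for each mode can be generated programmatically. The helper name and the default levels chosen here (enforce `baseline`, audit and warn on `restricted`) are assumptions for the sketch, not a recommendation from the source.

```python
from typing import Dict

LEVELS = ("privileged", "baseline", "restricted")


def pod_security_labels(
    enforce: str = "baseline",
    audit: str = "restricted",
    warn: str = "restricted",
    version: str = "latest",
) -> Dict[str, str]:
    """Build the namespace labels that Pod Security Admission reads."""
    labels: Dict[str, str] = {}
    for mode, level in (("enforce", enforce), ("audit", audit), ("warn", warn)):
        if level not in LEVELS:
            raise ValueError(f"unknown Pod Security level: {level!r}")
        labels[f"pod-security.kubernetes.io/{mode}"] = level
        # Pin the policy version per mode; 'latest' tracks the cluster version.
        labels[f"pod-security.kubernetes.io/{mode}-version"] = version
    return labels


if __name__ == "__main__":
    for key, value in pod_security_labels().items():
        print(f"{key}: {value}")
```

Running audit and warn one level stricter than enforce is a common way to preview the impact of tightening a namespace before switching enforcement.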
| ### 6.2 Helm vs Kustomize vs Carvel | |
| | Tool | Approach | Strength | Weakness | | |
| |------|----------|----------|----------| | |
| | Helm | Templating + values + chart | Package manager (75% adoption); Helm 4 (Nov 2025) adds server-side apply | Templating debug pain | | |
| | Kustomize | Patch-based overlays on bases | No magic; built into kubectl | No release/version concept; needs ArgoCD/Flux for state | | |
| | Carvel | ytt + kapp + kbld + imgpkg | Strong CI bundling; image relocation | Steeper learning curve, smaller community | | |
| **Mature pattern**: Helm to install upstream charts (Cilium, ArgoCD, Prometheus); Kustomize overlays per environment. Use ArgoCD `helm` source with `valuesObject` overrides. | |
| ### 6.3 GitOps β ArgoCD vs FluxCD (2025 reality) | |
| **Weaveworks closed in early 2024**; Flux became fully community-driven (CNCF graduated). ArgoCD has a clearer commercial path (Akuity, Codefresh). | |
| | Aspect | ArgoCD | FluxCD | | |
| |--------|--------|--------| | |
| | UI | Strong native dashboard | None native (use Weave GitOps or third-party) | | |
| | RBAC | Built-in + Projects multi-tenancy | Standard K8s RBAC only | | |
| | Architecture | Hub-and-spoke | Decentralized, K8s-idiomatic | | |
| | Multi-cluster | Native (App-of-Apps, ApplicationSets) | Per-cluster Flux + Notification Controller | | |
| | Best for | Most enterprises in 2025 | Air-gapped / minimal-deps / true GitOps purists | | |
| Default 2026 recommendation: **ArgoCD** for most orgs. | |
| ### 6.4 Service Mesh β Istio vs Linkerd vs Cilium (2025) | |
| | Mesh | Sidecars | Data plane | Best fit | | |
| |------|----------|------------|----------| | |
| | Istio | Sidecar OR Ambient (ztunnel + waypoint) | Envoy | Advanced traffic mgmt, deep telemetry | | |
| | Linkerd | Sidecar only | linkerd2-proxy (lightweight Rust) | Simplicity + lowest overhead | | |
| | Cilium | Sidecarless | eBPF + Envoy (L7) | Network policy + perf at scale | | |
| **Memory cost reality**: 500 services on Istio sidecar = ~25-50 GB more RAM than same on Linkerd. Translates to real $$. | |
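The 25-50 GB figure is consistent with a per-sidecar memory delta of roughly 50-100 MB. That per-pod delta is an inference from dividing the stated total by 500, and the sketch assumes one pod per service and decimal gigabytes; real fleets with multiple replicas per service scale the number up accordingly.

```python
def extra_sidecar_memory_gb(pod_count: int, per_pod_delta_mb: float) -> float:
    """Extra RAM (decimal GB) when every pod's sidecar uses `per_pod_delta_mb` more."""
    if pod_count < 0 or per_pod_delta_mb < 0:
        raise ValueError("inputs must be non-negative")
    return pod_count * per_pod_delta_mb / 1000


if __name__ == "__main__":
    # Assumed: 500 services, one pod each, 50-100 MB extra per sidecar.
    low = extra_sidecar_memory_gb(500, 50)
    high = extra_sidecar_memory_gb(500, 100)
    print(f"extra RAM: {low:.0f}-{high:.0f} GB")
```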
| **Cilium caveat**: eBPF can't parse HTTP/gRPC or do mTLS termination; Cilium still uses Envoy for L7, so the perf delta vs Istio at L7 is small. | |
| **Decision tree**: | |
| ``` | |
| Tiny team, just want mTLS + observability → Linkerd | |
| Already on Cilium CNI, want unified → Cilium Service Mesh | |
| Need full traffic mgmt (canary, mirror, fault) → Istio Ambient | |
| ``` | |
| ### 6.5 Ingress + Gateway API (the Ingress era is ending) | |
| Ingress-NGINX official **maintenance halt March 2026**. Gateway API is the K8s-official successor. | |
| Gateway API provides: | |
| - Protocol-agnostic (HTTP, TCP, gRPC, TLS passthrough) | |
| - Role-split: GatewayClass (provider) → Gateway (cluster operator) → HTTPRoute (app dev) | |
| - Built-in canary/blue-green via weighted routes | |
| - Both north-south AND east-west | |
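The weighted-route canary mentioned above can be sketched by generating an `HTTPRoute` manifest. The helper name, service names, and gateway name are hypothetical; the manifest fields (`parentRefs`, `rules`, `backendRefs` with `weight`) follow the Gateway API `v1` HTTPRoute shape.

```python
import json
from typing import Any, Dict


def canary_http_route(
    name: str,
    gateway: str,
    stable_service: str,
    canary_service: str,
    canary_weight: int,
    port: int = 80,
) -> Dict[str, Any]:
    """Build an HTTPRoute that splits traffic between stable and canary backends."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name},
        "spec": {
            "parentRefs": [{"name": gateway}],
            "rules": [
                {
                    # Traffic is split in proportion to the weights.
                    "backendRefs": [
                        {"name": stable_service, "port": port, "weight": 100 - canary_weight},
                        {"name": canary_service, "port": port, "weight": canary_weight},
                    ]
                }
            ],
        },
    }


if __name__ == "__main__":
    route = canary_http_route("orders", "public-gw", "orders-v1", "orders-v2", 10)
    print(json.dumps(route, indent=2))
```

Shifting the canary from 10 to 100 in steps, with a health check between steps, gives a basic progressive rollout without any mesh-specific resources.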
| Ingress controllers / Gateway implementations: | |
| | Implementation | Notes | | |
| |----------------|-------| | |
| | Envoy Gateway | Reference implementation; CNCF | | |
| | Istio | Native Gateway API support (replaces Istio VirtualService for new) | | |
| | NGINX Gateway Fabric | NGINX-backed, replaces ingress-nginx | | |
| | Cilium Gateway | CNI-integrated | | |
| | Traefik | Long-time leader for Ingress; Gateway API supported | | |
| Migration: **`ingress2gateway` 1.0** (2026) translates Ingress + annotations → Gateway API resources. | |
| ### 6.6 Operators | |
| ``` | |
| Operator SDK (Red Hat): Go/Helm/Ansible scaffolding | |
| Kubebuilder: upstream K8s SIG; cleaner Go | |
| KUDO: declarative operator definition | |
| metacontroller: lightweight webhook-based hooks (often written in Jsonnet) | |
| ``` | |
| When to write an operator: state machine that doesn't fit `Deployment` (e.g., DB clustering, leader election, custom backup). | |
| When NOT: just templating → use Helm/Kustomize. | |
| ### 6.7 Multi-cluster β Karmada vs Cluster API vs OCM | |
| **KubeFed is EOL** (no commits since 2020). | |
| | Tool | Approach | | |
| |------|----------| | |
| | Karmada | CNCF Incubation; multi-cluster scheduling + propagation policy. v1.15 (Oct 2025) adds multi-template workload awareness + structured logging | | |
| | Cluster API (CAPI) | Declarative cluster lifecycle (CAPA AWS, CAPG GCP, CAPZ Azure providers) | | |
| | Open Cluster Management (OCM) | Red Hat-led; ACM commercial product | | |
| | Anthos / GKE Enterprise | GCP-managed; folds in Config Sync + Mesh + Policy | | |
| | Azure Arc | Azure-managed; brings Azure Policy/Monitor to any cluster | | |
| Pattern: **CAPI provisions clusters**, **Karmada propagates workloads**, **ArgoCD reconciles config**. | |
| ### 6.8 Cost β Kubecost vs OpenCost | |
| - OpenCost (Apache 2.0): free, single-cluster focus, real-time allocation by pod/namespace/controller, multi-cloud (AWS/GCP/Azure). Now ships with a built-in MCP server (2025) for AI agent access. | |
| - Kubecost (IBM-owned post-2024 acquisition): adds budgets, RBAC, multi-cluster aggregation, automated cost policies. Starts at $449/mo, enterprise on quote. | |
| ### 6.9 Karpenter + Spot + Graviton | |
| Real customer outcomes: | |
| - Tinybird: 20% AWS bill reduction with EKS+Karpenter+Spot | |
| - Series B SaaS (200 microservices): $52k → $23k/mo (56%) with Graviton mix + Karpenter + Spot | |
| - One reported migration: $50k → $22k/mo with Karpenter + Spot + VPA | |
| **NodePool best practices**: | |
| ```yaml | |
| apiVersion: karpenter.sh/v1 | |
| kind: NodePool | |
| metadata: | |
|   name: default | |
| spec: | |
|   template: | |
|     spec: | |
|       requirements: | |
|         - key: kubernetes.io/arch | |
|           operator: In | |
|           values: ["arm64", "amd64"] # allow both Graviton and x86 | |
|         - key: karpenter.sh/capacity-type | |
|           operator: In | |
|           values: ["spot", "on-demand"] | |
|         - key: karpenter.k8s.aws/instance-category | |
|           operator: In | |
|           values: ["m", "c", "r"] | |
|         - key: karpenter.k8s.aws/instance-generation | |
|           operator: Gt | |
|           values: ["5"] # Gt is strict, so this admits generation 6+ (m6g+, c6g+, r6g+) | |
|       nodeClassRef: | |
|         group: karpenter.k8s.aws | |
|         kind: EC2NodeClass | |
|         name: default | |
|   disruption: | |
|     consolidationPolicy: WhenEmptyOrUnderutilized | |
|     consolidateAfter: 30s | |
|   limits: | |
|     cpu: 1000 | |
| ``` | |
| ### 6.10 Helm chart example for a service | |
| ```yaml | |
| # values.yaml | |
| image: | |
|   repository: ghcr.io/org/api | |
|   tag: "" # set by ArgoCD Image Updater or via CI | |
|   pullPolicy: IfNotPresent | |
| resources: | |
|   requests: | |
|     cpu: 100m | |
|     memory: 128Mi | |
|   limits: | |
|     memory: 512Mi | |
| autoscaling: | |
|   enabled: true | |
|   minReplicas: 3 | |
|   maxReplicas: 30 | |
|   targetCPUUtilizationPercentage: 70 | |
| serviceAccount: | |
|   create: true | |
|   annotations: | |
|     eks.amazonaws.com/role-arn: arn:aws:iam::123:role/api-irsa # IRSA for AWS | |
| podSecurityContext: | |
|   runAsNonRoot: true | |
|   runAsUser: 65532 | |
|   fsGroup: 65532 | |
| securityContext: | |
|   allowPrivilegeEscalation: false | |
|   readOnlyRootFilesystem: true | |
|   capabilities: | |
|     drop: ["ALL"] | |
|   seccompProfile: | |
|     type: RuntimeDefault | |
| podDisruptionBudget: | |
|   enabled: true | |
|   minAvailable: 2 | |
| networkPolicy: | |
|   enabled: true | |
|   ingress: | |
|     - from: | |
|         - podSelector: | |
|             matchLabels: | |
|               role: gateway | |
| ``` | |
| ### 6.11 Training corpus for K8s | |
| ``` | |
| - kubernetes/kubernetes (source + design proposals KEPs) | |
| - kubernetes/website (docs) | |
| - helm/charts (deprecated) + bitnami/charts + community charts | |
| - argo-cd/argo-cd repo + examples | |
| - karmada-io/karmada | |
| - cilium/cilium (eBPF code + e2e tests) | |
| - istio/istio | |
| - linkerd/linkerd2 | |
| - aws/karpenter-provider-aws | |
| - backstage/backstage source + plugins | |
| - run-x/awesome-kubernetes | |
| - KubeCon talk transcripts (CNCF YouTube; can transcribe via Whisper) | |
| ``` | |
| **Eval target**: 70% on K8s-Bench (manifest validity + Helm chart that `helm template` validates + ArgoCD Application that syncs + NetworkPolicy that locks down by default). | |
| --- | |
| ## 7. Internal Developer Platform (IDP) | |
| ### 7.1 IDP landscape (2025) | |
| | Tool | Type | Strength | TTV (time-to-value) | | |
| |------|------|---------|---------------------| | |
| | **Backstage** (Spotify, CNCF) | OSS framework, build-it-yourself portal | Most flexible; 120+ Spotify-internal plugins; CNCF | 3β6 months | | |
| | **Port** | Commercial SaaS portal | No-code, fast to deploy | Days | | |
| | **Cortex** | Commercial: service ownership + scorecards | Best for >50-eng orgs needing governance | Weeks | |
| | **OpsLevel** | Commercial: quality scorecards | Strong dashboards | Weeks | |
| | **Humanitec** | Platform Orchestrator (NOT a portal) | Backend that resolves Score files into infra | Weeks | | |
| | **Encore** | All-in-one (codegen + infra) | Strong opinionated dev workflow | Days | | |
| | **Cloudomation** | Workflow automation IDP | Low-code for non-K8s orgs | Days | | |
| **Key mental model**: Portal (Backstage/Port) ≠ Orchestrator (Humanitec). You often need BOTH: the portal as the UI, the orchestrator as the backend that creates the actual cloud resources. | |
| ### 7.2 Backstage core | |
| ``` | |
| Catalog: entities (Component, System, API, Resource, Group, User) | |
| TechDocs: MkDocs-based, lives next to code | |
| Software Templates: Cookiecutter-style scaffolds (repo + IaC + pipeline + DB) | |
| Search: indexes catalog + docs | |
| RBAC: Spotify's RBAC plugin (commercial) | |
| Soundcheck: Spotify's tech-standards/scorecard plugin (commercial) | |
| Insights: adoption analytics (commercial) | |
| Cloud Backstage: managed hosted (commercial) | |
| ``` | |
| Open-source plus commercial Spotify Portal (RBAC, Insights, Soundcheck) = "production-ready Backstage." | |
| **catalog-info.yaml** example: | |
| ```yaml | |
| apiVersion: backstage.io/v1alpha1 | |
| kind: Component | |
| metadata: | |
|   name: orders-api | |
|   description: Order management service | |
|   annotations: | |
|     github.com/project-slug: org/orders-api | |
|     backstage.io/techdocs-ref: dir:. | |
|     pagerduty.com/integration-key: ${SECRET_PD} | |
|     sonarqube.org/project-key: org_orders-api | |
|     grafana/dashboard-selector: "tags @> 'orders'" | |
|   tags: [java, spring-boot, payments-domain] | |
| spec: | |
|   type: service | |
|   lifecycle: production | |
|   owner: payments-team | |
|   system: payments | |
|   providesApis: [orders-rest-api] | |
|   consumesApis: [users-rest-api] | |
|   dependsOn: [resource:orders-db] | |
| ``` | |
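A descriptor like the one above can be linted before it enters the training corpus. The following is a minimal sketch, not Backstage's own validator: it assumes the entity has already been parsed into a Python dict, and `catalog_errors` and `orders_api` are hypothetical names introduced for illustration. It only checks the envelope fields plus the three `spec` fields a Component is expected to carry (`type`, `lifecycle`, `owner`).

```python
REQUIRED_COMPONENT_SPEC = ("type", "lifecycle", "owner")


def catalog_errors(entity: dict) -> list:
    """Return a list of problems found in a Backstage Component entity dict."""
    errors = []
    if entity.get("apiVersion") != "backstage.io/v1alpha1":
        errors.append("apiVersion must be backstage.io/v1alpha1")
    if entity.get("kind") != "Component":
        errors.append("kind must be Component")
    if not entity.get("metadata", {}).get("name"):
        errors.append("metadata.name is required")
    spec = entity.get("spec", {})
    for field in REQUIRED_COMPONENT_SPEC:
        if not spec.get(field):
            errors.append(f"spec.{field} is required")
    return errors


# Hypothetical parsed form of the orders-api descriptor.
orders_api = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "orders-api"},
    "spec": {"type": "service", "lifecycle": "production", "owner": "payments-team"},
}
```

An empty list means the descriptor passes this coarse check; it says nothing about whether the referenced owner, system, or APIs exist in the catalog.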
| ### 7.3 Score (CNCF, 2024): workload spec | |
| Score is a platform-agnostic workload spec. The developer writes ONE YAML file; the platform team's `score-compose` / `score-helm` / `score-k8s` implementation translates it for the target runtime. | |
| ```yaml | |
| # score.yaml | |
| apiVersion: score.dev/v1b1 | |
| metadata: | |
|   name: orders-api | |
| containers: | |
|   api: | |
|     image: ghcr.io/org/orders-api:${TAG} | |
|     variables: | |
|       DATABASE_URL: postgres://${resources.db.username}:${resources.db.password}@${resources.db.host}:${resources.db.port}/orders | |
|       REDIS_URL: ${resources.cache.url} | |
|     resources: | |
|       requests: { cpu: "100m", memory: "256Mi" } | |
| service: | |
|   ports: | |
|     web: { port: 8080 } | |
| resources: | |
|   db: | |
|     type: postgres | |
|   cache: | |
|     type: redis | |
|   route: | |
|     type: dns | |
|     params: { host: orders.example.com } | |
| ``` | |
| The platform team configures resource definitions (e.g., `db.postgres → AWS RDS via Crossplane`); developers never see that mapping and do not need to. | |
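To make the translation step concrete, here is a minimal sketch of how an implementation might substitute `${resources.<name>.<output>}` placeholders once the platform has provisioned the resources. It is not the code of any Score implementation; `RESOURCE_OUTPUTS`, its hostnames, and the `resolve` helper are assumptions made for illustration.

```python
import re

# Hypothetical outputs the platform produced for each declared resource.
RESOURCE_OUTPUTS = {
    "db": {"host": "orders-db.internal", "port": "5432"},
    "cache": {"url": "redis://orders-cache.internal:6379"},
}

PLACEHOLDER = re.compile(r"\$\{resources\.([A-Za-z0-9_-]+)\.([A-Za-z0-9_-]+)\}")


def resolve(value: str, outputs: dict) -> str:
    """Replace every ${resources.<name>.<output>} with its provisioned value."""

    def substitute(match):
        name, key = match.group(1), match.group(2)
        try:
            return outputs[name][key]
        except KeyError:
            # Fail loudly instead of shipping an unresolved placeholder.
            raise KeyError(f"unresolved placeholder: {match.group(0)}") from None

    return PLACEHOLDER.sub(substitute, value)
```

Failing on an unknown resource or output, rather than leaving the placeholder in place, keeps a misconfigured workload from starting with a literal `${...}` in its environment.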
| ### 7.4 OAM vs Score | |
| OAM is broader (whole-app model with traits + scopes + components); Score is single-workload + simpler. Score is winning in 2025 because of its narrower scope and CNCF backing. | |
| ### 7.5 Humanitec orchestrator pattern | |
| ``` | |
| Developer: score.yaml in repo | |
| GitOps: commit → CI → calls Humanitec API | |
| Humanitec: resolves score against Resource Definitions | |
| → creates EKS Deployment + RDS + Redis + Route53 record | |
| Platform: defines Resource Definitions (e.g., postgres → AWS RDS via TF/Crossplane) | |
| ``` | |
| ### 7.6 Training corpus for IDP | |
| ``` | |
| - backstage/backstage source + ALL community plugins (roadie/* / spotify/*) | |
| - score-spec/spec + reference implementations (score-compose/score-helm/score-k8s) | |
| - Humanitec docs + Resource Definition examples | |
| - Port templates marketplace | |
| - Cortex YAML scorecard library | |
| - platformengineering.org community articles | |
| - KubeCon Platform Engineering Day talks (transcripts) | |
| ``` | |
| --- | |
| ## 8. Edge + Serverless Platforms | |
| ### 8.1 Latency / cold-start reality (2025) | |
| | Platform | P50 latency | Cold start | POPs | | |
| |----------|------------|-----------|------| | |
| | Cloudflare Workers | 10–30ms | <1ms (V8 isolates) | 330+ | |
| | Vercel Edge Functions | <50ms | sub-50ms | 18 (uses Lambda@Edge under hood in some regions) | | |
| | Lambda@Edge (Node) | 50–80ms | 250–800ms | AWS edge POPs | |
| | Lambda@Edge (Python) | similar | 400–1200ms | same | |
| | Fastly Compute@Edge (WASM) | ~5–10ms | <1ms | 80+ | |
| | Deno Deploy | low | low | global | | |
| | Bun runtime | fastest cold-start of any Node-compat | n/a | self-hosted | | |
| Reported benchmarks put Cloudflare Workers ~441% faster than Lambda at p95, and the Workers free tier carries no bandwidth charge. | |
| ### 8.2 Cloudflare ecosystem | |
| ``` | |
| Workers: V8 isolate functions (JS/TS/WASM) | |
| Pages: static + Workers (serverless full-stack) | |
| R2: S3-compatible object storage, zero egress | |
| D1: serverless SQLite (replicated) | |
| KV: eventually-consistent KV | |
| Durable Objects: strongly-consistent stateful primitives | |
| Queues: managed message queue | |
| Workers AI: run LLMs at the edge (Llama, Whisper, Stable Diffusion) | |
| Vectorize: vector DB (RAG at edge) | |
| Hyperdrive: connection pooler for Postgres/MySQL behind edge | |
| Stream: video transcoding + delivery | |
| ``` | |
| ### 8.3 Vercel ecosystem | |
| ``` | |
| Edge Functions: V8-based Edge Runtime with Web-standard APIs (JS/TS/WASM) | |
| Edge Middleware: runs BEFORE the request enters serverless | |
| Serverless Funcs: AWS Lambda under the hood (Node + Python + Go + Ruby) | |
| Postgres: managed Postgres (built on Neon) | |
| KV: built on Upstash Redis | |
| Blob: object storage | |
| ``` | |
| ### 8.4 Multi-region edge strategies | |
| ``` | |
| Pattern 1: Edge cache + origin in primary region | |
|   Cloudflare cache → S3/Lambda in us-east-1 | |
|   Trade: simple, 100ms+ for cache misses | |
| Pattern 2: Workers + DB-at-edge | |
|   CF Workers → D1/Hyperdrive | |
|   Trade: edge writes; eventual consistency | |
|   Use: read-heavy auth, profile, feature flags | |
| Pattern 3: Multi-region active/active | |
|   CF LB → Workers in EU + US + APAC → regional Aurora DSQL | |
|   Trade: cost 2x; near-zero RTO across regions | |
| Pattern 4: Global table + edge CDN | |
|   CF Cache → Lambda → DynamoDB Global Tables (multi-master) | |
|   Trade: replication lag; eventual consistency | |
| ``` | |
| ### 8.5 WASM serverless (2025–2026) | |
| - WASI 0.2 (Component Model) GA: portable across runtimes (Wasmtime, Spin, wasmCloud, Wasmer). | |
| - Cold starts: microseconds (vs 100–500ms for containers). | |
| - Major clouds now offer Wasm-based FaaS as a mainstream option. | |
| - Wing language: commercial backer **shut down in April 2025**; the OSS code lives on without commercial backing. | |
| --- | |
| ## 9. Database Platform | |
| ### 9.1 Postgres options | |
| | Service | Multi-region | Best for | | |
| |---------|--------------|----------| | |
| | RDS Postgres | Read replicas | Standard managed | | |
| | Aurora Postgres | Cross-region read replicas + Global DB (1 writer) | Standard scale-out | | |
| | **Aurora DSQL** | **Active/active, strong consistency, GA May 2025** | **New globally-distributed apps** | | |
| | AlloyDB (GCP) | HA + read pool nodes | Postgres-compat OLTP+OLAP at GCP | | |
| | Cloud SQL (GCP) | Single-region HA | Standard managed | | |
| | Azure Database for Postgres Flex | Single-region HA | Standard managed | | |
| | Neon | Branching (Git-like) | Dev velocity | | |
| | Supabase | Postgres + auth + realtime | Full BaaS | | |
| | Crunchy Bridge | Multi-cloud Postgres | Vendor-neutral | | |
| | PlanetScale | (Now Postgres + Vitess) | Sharded scale-out | | |
| ### 9.2 Aurora DSQL deep cuts (GA May 2025) | |
| - Disaggregated architecture: query processor + adjudicator + journal + crossbar; each scales independently. | |
| - 99.99% single-region SLA, 99.999% multi-region. | |
| - Active/active multi-master (peers); third region as log-only witness. | |
| - Region groupings only: US (us-east-1, us-east-2, us-west-2), EU (eu-west-1/2/3), APAC (ap-northeast-1/2/3). | |
| - No cross-continent yet. | |
| - PostgreSQL wire-compatible. | |
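As a worked illustration of what the two SLA figures mean in practice, the sketch below converts an availability percentage into the downtime it permits per year. The function name is hypothetical and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Minutes of downtime per year that still satisfy the given SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


single_region = allowed_downtime_minutes(99.99)   # roughly 52.6 minutes/year
multi_region = allowed_downtime_minutes(99.999)   # roughly 5.3 minutes/year
```

The extra nine in the multi-region SLA cuts the yearly downtime budget by a factor of ten, which is the main argument for the multi-region topology when RTO targets are tight.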
| ### 9.3 Distributed SQL (NewSQL) | |
| | DB | TPC-C (TPS) | PG compat | Multi-region | | |
| |----|-------------|-----------|--------------| | |
| | CockroachDB | 45k | wire only | Best with geo-partitioning | | |
| | YugabyteDB | 48k | full (reuses PG query layer) | Strong with row-level geo | | |
| | TiDB | 40k+ (write-heavy lead) | MySQL primary | - | |
| | Aurora DSQL | benchmarked fastest by AWS | wire | Region-grouped | | |
| | Spanner | 1M+ at scale | GoogleSQL or PG dialect | Global by design | | |
| YugabyteDB wins for PG migration (full compat). CockroachDB wins for geo-partitioning. Spanner remains gold standard at hyperscale. | |
| ### 9.4 Vitess (MySQL sharding) | |
| - Open-source MySQL sharding system. | |
| - Powers YouTube, Slack, GitHub, PlanetScale. | |
| - Functions: query routing, online schema migration (with `gh-ost`), connection pooling, transparent sharding. | |
| - Newer alternative: CockroachDB / Aurora DSQL eliminate manual sharding. | |
| ### 9.5 NoSQL | |
| ``` | |
| DynamoDB: AWS, single-digit-ms; on-demand or provisioned | |
| DynamoDB Global Tables: multi-region multi-master (last-writer-wins) | |
| Spanner: strongly consistent global | |
| Cosmos DB: multi-model (SQL/MongoDB/Cassandra/Gremlin); 5 consistency levels | |
| Cassandra/Scylla: wide-column; high write throughput | |
| MongoDB Atlas: document; managed across all 3 clouds | |
| ``` | |
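Last-writer-wins, as mentioned for DynamoDB Global Tables, can be illustrated with a small conflict-resolution sketch. This is a conceptual model only, not DynamoDB's internal implementation: the `Version` type, the millisecond timestamps, and the region-name tie-breaker are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # write time recorded by the region that accepted the write
    region: str


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the later write; break exact-timestamp ties deterministically."""
    # The region name as tie-breaker is an arbitrary illustrative choice.
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


us = Version("shipped", 1_700_000_000_500, "us-east-1")
eu = Version("cancelled", 1_700_000_000_900, "eu-west-1")
winner = last_writer_wins(us, eu)
```

The earlier write is silently discarded, so this model suits data where losing a concurrent update is acceptable; counters and inventory usually need a different strategy.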
| ### 9.6 Vector DBs (2025 production benchmarks) | |
| | DB | p99 latency | QPS | Notable | | |
| |----|-------------|-----|---------| | |
| | Qdrant | 2ms | 12k | Best low-latency, $25/mo+ cloud | | |
| | Milvus / Zilliz | 5ms | 8k | Billion-scale; built-in BM25 + dense (30x faster than Elasticsearch) | | |
| | Pinecone | 8ms | 5k | Fully managed, 99% recall | | |
| | Weaviate | 10ms | 4k | BlockMax WAND (10x keyword speed); MUVERA multi-vector | | |
| | pgvector | varies | depends | If you already have Postgres | | |
| | OpenSearch k-NN | varies | depends | If you already have OpenSearch | | |
| ### 9.7 Migration tools (2025) | |
| | Tool | Approach | Best for | | |
| |------|----------|----------| | |
| | Liquibase | Imperative changelogs (XML/YAML/JSON/SQL); FSL license post-v5; AI rollback assist (2025) | Multi-DB enterprise | | |
| | Flyway | Numbered SQL files; Java ecosystem standard; Teams tier discontinued 2025 | Java teams | | |
| | Atlas (atlasgo.io) | Declarative HCL + computed migration plan | Terraform-style schema-as-code | | |
| | Prisma Migrate | Declarative, ORM-coupled | Node/TS apps | | |
| | goose | Plain SQL/Go migrations | Go services | | |
| Atlas is the modern recommendation: it follows the same declarative paradigm as Terraform. | |
| --- | |
| ## 10. Networking Deep | |
| ### 10.1 DNS | |
| ``` | |
| Route53: AWS native, latency/geo/failover; alias records to AWS resources | |
| Cloud DNS: GCP native | |
| Azure DNS: Azure native | |
| Cloudflare DNS: fastest authoritative (1.1.1.1 is recursive); free | |
| NS1 / Constellix: enterprise multi-cloud DNS, advanced traffic steering | |
| ``` | |
| ### 10.2 CDN performance (Cloudflare 95p TTFB benchmark, Nov 2024–Mar 2025) | |
| - Cloudflare fastest in ~48% of top 1000 networks. | |
| - Fastly extremely close in many networks (e.g., +0.2% lead on Comcast). | |
| - CloudFront strong inside AWS-heavy stacks (no data-transfer charge from AWS origins to CloudFront). | |
| - All have edge compute now: Workers / Compute@Edge / Lambda@Edge. | |
| ### 10.3 Load balancers (AWS) | |
| ``` | |
| ALB (L7): HTTP/HTTPS/gRPC; WAF integration; target group flexibility | |
| NLB (L4): TCP/TLS/UDP; static IPs; millions of requests per second | |
| GWLB: traffic inspection (third-party firewall in chain) | |
| ELB Classic: legacy, avoid | |
| Global Accelerator: anycast IPs in front of ALB/NLB for global traffic | |
| ``` | |
| ### 10.4 Zero Trust Network Access (2025) | |
| | Tool | Architecture | Best fit | | |
| |------|-------------|----------| | |
| | Tailscale | WireGuard mesh + identity overlay | Fastest dev access; great for SSH/RDP/DB | | |
| | Twingate | Layer 4 ZTNA (no mesh); resource-grain | App-name + group-based access | | |
| | Cloudflare Access + WARP | SASE: Access for apps + Gateway for SWG | When Cloudflare is the wider stack | |
| | Zscaler | Enterprise SASE | Big-org compliance | | |
| | Pomerium | Self-hosted reverse-proxy ZTNA | OSS option | | |
| Tailscale wins on dev velocity (sign in, get tailnet); Cloudflare Access wins on full SASE; Twingate wins on resource granularity. | |
| ### 10.5 WAF | |
| ``` | |
| AWS WAF: tied to CloudFront/ALB/API Gateway | |
| Cloudflare WAF: in front of any origin | |
| Azure Front Door WAF: tied to AFD | |
| Akamai App & API Protector: enterprise | |
| ``` | |
| ### 10.6 DDoS protection | |
| ``` | |
| AWS Shield Advanced: $3000/mo + data transfer fees; 24/7 SRT (Shield Response Team) | |
| Cloudflare: unmetered DDoS protection (free tier!) | |
| Google Cloud Armor: tier-based | |
| Azure DDoS Protection Standard: per-resource | |
| ``` | |
| --- | |
| ## 11. FinOps + Cost Engineering | |
| ### 11.1 FinOps Foundation Framework 2025 (Inform / Optimize / Operate) | |
| ``` | |
| INFORM: Visibility, allocation, benchmarking, budgeting, forecasting | |
| OPTIMIZE: Identify and execute waste reduction | |
| OPERATE: KPI tracking, governance policies aligned with business | |
| ``` | |
| ### 11.2 2025 framework changes: **Scopes** | |
| The 2025 Framework adds **Scopes** as a structural element. Scopes define context: Public Cloud, SaaS (Snowflake, Salesforce), GenAI (LLM API spend), Data Center, Private Cloud. Each capability is now applied **per scope**. | |
| ### 11.3 Cost allocation tags (mandatory at provision time) | |
| ``` | |
| Required tags for every resource: | |
| - Environment : prod/staging/dev/sandbox | |
| - Owner : team-name (matches catalog) | |
| - CostCenter : finance code | |
| - Project : product/feature | |
| - DataClass : public/internal/confidential/regulated | |
| ``` | |
| Enforce via: | |
| - AWS: SCP `aws:RequestTag/X` (deny on creation), Tag Policies | |
| - GCP: Org Policy required labels | |
| - Azure: Azure Policy required tags | |
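The cloud-native controls listed above block untagged resources at creation time; the same rule can also be checked earlier, for example against parsed IaC plan output. The following is a minimal sketch under that assumption. `tag_violations` is a hypothetical helper, and the allowed values are taken from the required-tag list in this section.

```python
# None means any non-empty value is accepted for that tag.
REQUIRED_TAGS = {
    "Environment": {"prod", "staging", "dev", "sandbox"},
    "Owner": None,
    "CostCenter": None,
    "Project": None,
    "DataClass": {"public", "internal", "confidential", "regulated"},
}


def tag_violations(tags: dict) -> list:
    """Return human-readable problems for one resource's tag map."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value}")
    return problems
```

Running this in CI gives the author feedback before the provider-side policy rejects the deployment.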
| ### 11.4 Showback / chargeback | |
| ``` | |
| Showback: "your team used $X" (no actual billing) | |
| Chargeback: cross-charge cost center (real finance impact) | |
| ``` | |
| Tools: Vantage, CloudHealth, Apptio Cloudability, Kubecost (k8s-specific), Infracost (pre-deploy IaC estimate). | |
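At its core, a showback report is an aggregation of billing line items by the `Owner` tag. The sketch below assumes a simplified, hypothetical line-item shape (`cost_usd` plus a `tags` map); real billing exports carry many more columns.

```python
from collections import defaultdict


def showback(line_items: list) -> dict:
    """Sum cost per Owner tag; untagged spend lands in 'unallocated'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get("Owner", "unallocated")
        totals[owner] += item["cost_usd"]
    return dict(totals)


# Hypothetical billing rows for illustration.
report = showback([
    {"cost_usd": 120.0, "tags": {"Owner": "payments-team"}},
    {"cost_usd": 30.5, "tags": {"Owner": "payments-team"}},
    {"cost_usd": 9.0, "tags": {}},
])
```

The size of the `unallocated` bucket is a useful signal of how well the mandatory-tag policy is being enforced.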
| ### 11.5 Anomaly detection | |
| ``` | |
| AWS Cost Anomaly Detection (free) | |
| Vantage anomalies + alerts (commercial) | |
| CloudZero / Spend.io | |
| ProsperOps: automated commitment management | |
| ``` | |
| ### 11.6 Right-sizing automation | |
| ``` | |
| AWS Compute Optimizer: free; recs for EC2, Lambda, EBS, ASG, ECS | |
| GCP Recommender: equivalent | |
| Azure Advisor: equivalent | |
| ScaleOps / StormForge: K8s VPA recommender for prod | |
| ``` | |
| ### 11.7 Spot orchestration (Karpenter, ProsperOps) | |
| Already covered in §6.9. Karpenter handles Spot natively on AWS; CAST AI / ScaleOps cover cross-cloud. | |
| ### 11.8 Training corpus for FinOps | |
| ``` | |
| - FinOps Foundation framework docs (finops.org/framework) | |
| - AWS / GCP / Azure cost optimization whitepapers | |
| - Vantage / CloudZero / Apptio public benchmarks | |
| - KubeCon FinOps track talks (transcripts) | |
| - Real customer cost-cut case studies (already collected: Tinybird, Series B SaaS examples) | |
| ``` | |
| --- | |
| ## 12. 2025–2026 Platform Engineering Trends | |
| ### 12.1 Internal LLM gateways (the 2026 must-have) | |
| | Tool | Type | Key strength | Cost | | |
| |------|------|-------------|------| | |
| | **LiteLLM** | OSS, self-host | OpenAI-compat; cheapest at $10k+ MRR; 100+ providers | Free + infra | | |
| | **Portkey** | SaaS or self-host | SOC2/HIPAA/ISO27001; observability; 250+ LLMs | $49/mo+ | | |
| | **OpenRouter** | SaaS | Pay-per-token; consumer-friendly | 5% markup | | |
| | **Helicone** | OSS observability | Caching + analytics | Free + cloud | | |
| | **Truefoundry / Bifrost** | SaaS | LLM gateway + ML platform | Quote | | |
| LiteLLM is the **default for orgs serious about cost**: it runs as your own proxy, with no markup. | |
| ### 12.2 AI agents in platform engineering | |
| - **Resolve.ai**: AI SRE; auto-investigates alerts, RCA in minutes, MTTR -80%; customers: Coinbase, DoorDash, Toast, Zscaler. $40M Series A extension at $1.5B (2026). | |
| - **Aviator (aviator.co)**: AI code review + merge queues + deployment. | |
| - **OpenText DevOps Aviator**: AI for performance engineering scripts. | |
| - **Cursor / Sourcegraph Cody / GitHub Copilot Workspace**: IDE-side coding agents. | |
| - **Codeium / Tabnine / Continue**: IDE coding assistants (of these, only Continue is open-source). | |
| ### 12.3 Per-PR ephemeral environments | |
| | Tool | Approach | | |
| |------|----------| | |
| | Coherence | PR comment with auto-preview URL; spot-backed for cost | | |
| | Uffizzi | OSS + cloud; vCluster-based isolated environments | | |
| | Render Preview | Built-in to Render | | |
| | Vercel Preview | Built-in to Vercel | | |
| | Netlify Deploy Previews | Built-in to Netlify | | |
| | Argo CD ApplicationSet PR Generator | OSS K8s-native | | |
| | vCluster + ArgoCD | DIY pattern; cheapest at scale | | |
| Best practice: every PR gets a unique URL, smoke tests run against it, design reviewer can click before merge. | |
| ### 12.4 WASM-based services | |
| - Production runtimes: Wasmtime (Bytecode Alliance), wasmCloud, Spin (Fermyon), Wasmer. | |
| - Use cases: edge serverless, plugin systems (Envoy filters, Istio extensions, Postgres extensions), embedded scripting. | |
| - Platforms moving to Wasm: Fastly Compute@Edge (WASM-only), Cloudflare Workers (V8 + WASM), Spin Hub. | |
| ### 12.5 AI-native databases / observability | |
| ``` | |
| LangSmith: LLM tracing + evals (LangChain) | |
| Helicone: LLM tracing + caching (OSS option) | |
| Phoenix (Arize): OSS LLM observability | |
| Langfuse: OSS, self-host LLM observability | |
| Weights & Biases Weave: MLOps + LLM | |
| ``` | |
| ### 12.6 Autonomous Cloud Engineer (the Surrogate-1 mission) | |
| The path is converging on: | |
| 1. **MCP** (Model Context Protocol): standardizes how agents pull cloud state. AWS docs MCP, OpenCost MCP (2025), terraform-mcp-server. | |
| 2. **Multi-agent systems**: research / planner / executor / critic agents (CrewAI, LangGraph, AutoGen). | |
| 3. **Tool-using agents**: agents that call `terraform plan`, `kubectl apply`, `aws sts get-caller-identity`, `gh pr create`. | |
| Surrogate-1's training MUST include MCP-call patterns + tool-use traces. | |
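One possible shape for such a tool-use trace is sketched below. The record layout, the `ToolCall` type, and the allow-list are assumptions made for illustration; the document does not prescribe a trace format. The allow-list mirrors the CLIs named in the list above.

```python
import json
import shlex
from dataclasses import asdict, dataclass

ALLOWED_TOOLS = {"terraform", "kubectl", "aws", "gh"}  # CLIs named in this section


@dataclass
class ToolCall:
    command: str
    exit_code: int
    stdout: str


def is_allowed(command: str) -> bool:
    """True when the command's executable is on the allow-list."""
    parts = shlex.split(command)
    return bool(parts) and parts[0] in ALLOWED_TOOLS


def to_jsonl(goal: str, calls: list) -> str:
    """Serialize one trajectory as a single JSON line, rejecting unknown tools."""
    rejected = [c.command for c in calls if not is_allowed(c.command)]
    if rejected:
        raise ValueError(f"disallowed commands in trace: {rejected}")
    return json.dumps({"goal": goal, "steps": [asdict(c) for c in calls]})
```

Filtering at serialization time keeps stray shell commands out of the training set even when the sandbox itself allowed them.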
| --- | |
| ## 13. Training Data Sources | |
| ### 13.1 Curated GitHub repos | |
| ``` | |
| Cloud | |
| - awesome-aws (donnemartin) | |
| - awesome-gcp (GoogleCloudPlatform/awesome-google-cloud) | |
| - awesome-azure (kristofferandreasen/awesome-azure) | |
| - aws-samples/* (8000+ official AWS samples) | |
| - GoogleCloudPlatform/* (1500+ GCP samples) | |
| - Azure-Samples/* | |
| K8s | |
| - run-x/awesome-kubernetes | |
| - ramitsurana/awesome-kubernetes | |
| - tomhuang12/awesome-k8s-resources | |
| - kubernetes/kubernetes (source + KEPs) | |
| - kubernetes-sigs/* (CAPI, Gateway API, Karpenter) | |
| - helm/charts (deprecated but reference) | |
| - bitnami/charts | |
| - argoproj/argo-cd | |
| IaC | |
| - hashicorp/terraform | |
| - terraform-aws-modules/* (40+ community-maintained modules) | |
| - terraform-google-modules/* | |
| - Azure/terraform-azurerm-* (AVM) | |
| - pulumi/examples | |
| - aws/aws-cdk | |
| - crossplane/crossplane + upbound/configurations | |
| Platform | |
| - backstage/backstage + roadie/* + spotify/* community plugins | |
| - score-spec/spec | |
| - humanitec-architecture/* | |
| Eval | |
| - codefuse-ai/codefuse-devops-eval | |
| - IaC-Eval (academic) | |
| - NL2Bash | |
| ``` | |
| ### 13.2 Reddit communities (curate top-voted threads, last 2 yrs) | |
| - r/devops, r/aws, r/AZURE, r/googlecloud, r/kubernetes, r/Terraform, r/sysadmin, r/sre, r/platformengineering | |
| ### 13.3 Conference talks (transcribe via Whisper, MIT-licensed for TLP/CNCF) | |
| - KubeCon + CloudNativeCon (CNCF YouTube; ~600 talks/year) | |
| - AWS re:Invent (multiple thousand sessions, breakouts archived) | |
| - Google Cloud Next (annual) | |
| - Microsoft Ignite / Build | |
| - HashiConf | |
| - PlatformCon (annual, online) | |
| - SREcon (USENIX) | |
| ### 13.4 Public datasets on HuggingFace | |
| ``` | |
| - CatOwl/Terraform (Terraform code corpus) | |
| - nvidia/OpenCodeReasoning (reasoning over code) | |
| - bigcode/the-stack-v2 (filtered code, has IaC files) | |
| - mhhmm/codealpaca-iac (instruction tuning for IaC) | |
| - Custom: collect from terraform-aws-modules/eks/aws + variants | |
| ``` | |
| ### 13.5 Documentation (for retrieval / SFT context) | |
| - AWS docs (full), GCP docs, Azure docs (Microsoft Learn), CNCF docs, K8s docs, Helm docs, Terraform/OpenTofu docs. | |
| - AWS Well-Architected Framework PDFs (one per pillar). | |
| - Google Cloud Architecture Framework. | |
| - Azure Cloud Adoption Framework + Well-Architected Framework. | |
| ### 13.6 Synthesized data (recommended approach) | |
| For Surrogate-1 v2: | |
| ``` | |
| 1. Take each terraform-aws-modules example | |
| 2. Mutate: change region, instance type, AZ count, subnet sizes | |
| 3. Build instruction format: "Build me a 3-AZ VPC in us-west-2 with public+private+db subnets using terraform-aws-modules/vpc/aws" | |
| 4. Output: working main.tf + outputs.tf + variables.tf | |
| 5. For each AWS service, generate: | |
| - "What is X" Q&A from official docs | |
| - "Compare X vs Y" from official docs | |
| - "Migrate from X to Y" code examples | |
| 6. Multi-step trajectories: | |
| - "Build me a SaaS platform on AWS" → 30+ step reasoning trace through architecture decisions | |
| ``` | |
| Total target: ~100k–250k cloud/platform instruction-tuning examples. | |
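Steps 2 and 3 of the recipe above (mutate, then phrase an instruction) can be sketched as a small generator. The mutation axes and the record layout are assumptions; the `output` field is left empty because the completed Terraform files would come from a separate generation and `terraform validate` step that is not shown here.

```python
import itertools

# Hypothetical mutation axes; a real run would add instance types, subnet sizes, etc.
REGIONS = ["us-west-2", "eu-west-1"]
AZ_COUNTS = [2, 3]


def build_prompts(regions: list, az_counts: list) -> list:
    """Create one instruction record per (region, AZ count) combination."""
    prompts = []
    for region, az_count in itertools.product(regions, az_counts):
        prompts.append({
            "instruction": (
                f"Build me a {az_count}-AZ VPC in {region} with "
                "public+private+db subnets using terraform-aws-modules/vpc/aws"
            ),
            "mutation": {"region": region, "az_count": az_count},
            "output": None,  # filled in later with validated main.tf / outputs.tf / variables.tf
        })
    return prompts
```

Keeping the mutation parameters alongside the instruction makes it possible to check later that the generated Terraform actually uses the requested region and AZ count.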
| --- | |
| ## 14. Eval Benchmarks | |
| ### 14.1 Existing benchmarks | |
| | Benchmark | What it tests | Surrogate-1 fit | | |
| |-----------|---------------|-----------------| | |
| | codefuse-ai/codefuse-devops-eval | DevOps Q&A multiple-choice | Quick sanity check | | |
| | IaC-Eval (academic) | Terraform generation correctness | Direct fit | | |
| | KubeBench (community) | K8s manifest validity | Direct fit | | |
| | NL2Bash | Bash command from NL | Tooling sub-skill | | |
| | BIG-Bench (subset) | Various reasoning | General | | |
| | HumanEval / MBPP | General coding | Already passes (Qwen2.5-Coder-7B baseline) | | |
| ### 14.2 Custom Surrogate-1 v2 evals (we author) | |
| ``` | |
| Surrogate-1 Cloud Eval v2: | |
| 1. Terraform generation (200 prompts, varying complexity) | |
| - Pass = `terraform validate` + `terraform plan` succeeds | |
| - Score: % passing × % correct logical structure (judge LLM) | |
| 2. Helm chart authoring (50 prompts) | |
| - Pass = `helm template` produces valid YAML | |
| - Score: % passing × `kubeval` validation rate | |
| 3. CDK/CFN authoring (100 prompts) | |
| - Pass = `cdk synth` succeeds | |
| - Score: + `cfn-lint` clean rate, + `cfn-guard` policy pass | |
| 4. ArgoCD Application + Kustomize (50 prompts) | |
| - Pass = ArgoCD CLI dry-run succeeds | |
| 5. Multi-cloud DR scenario (30 prompts) | |
| - Open-ended: "Design active/passive across AWS+GCP for a SaaS, RTO=15min, RPO=1min" | |
| - Score: judged by GPT-5 / Claude / human reviewer on architecture quality | |
| 6. Cost optimization (50 prompts) | |
| - Given a `terraform plan` output, return cost reductions (Graviton swap, RIs/SPs, Spot) | |
| - Score: judged on $$ accuracy (vs Infracost ground truth) | |
| 7. K8s troubleshooting (50 prompts) | |
| - Given pod logs + describe output, return root cause + fix | |
| - Score: % matching ground truth | |
| 8. Tool-use traces (100 prompts) | |
| - Given a goal, agent must call `aws cli` / `kubectl` / `terraform` correctly | |
| - Score: % achieving goal (sandbox eval) | |
| ``` | |
| Total: ~630 prompts. Run with rubric judges (GPT-5/Claude). Surrogate-1 v2 target: **65% overall** (above Qwen2.5-Coder-7B baseline of ~38%). | |
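The per-task prompt counts and the "% passing × quality rate" scoring can be checked with a short sketch. The document does not state how the eight task scores roll up into the overall figure, so the prompt-weighted mean used here is an assumption, as are the dictionary keys.

```python
# Prompt counts per task, taken from the eval list above.
PROMPT_COUNTS = {
    "terraform": 200,
    "helm": 50,
    "cdk_cfn": 100,
    "argocd_kustomize": 50,
    "multi_cloud_dr": 30,
    "cost_optimization": 50,
    "k8s_troubleshooting": 50,
    "tool_use": 100,
}


def combined_score(pass_rate: float, quality_rate: float) -> float:
    """Task score as pass rate multiplied by a judged quality rate (tasks 1 and 2)."""
    return pass_rate * quality_rate


def overall_score(task_scores: dict, counts: dict = PROMPT_COUNTS) -> float:
    """Prompt-weighted mean across tasks (aggregation rule is an assumption)."""
    total = sum(counts.values())
    return sum(task_scores[name] * n for name, n in counts.items()) / total
```

With this weighting, Terraform generation alone accounts for roughly a third of the overall score, so a weak Terraform result is hard to offset elsewhere.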
| ### 14.3 Capability tiers (target) | |
| | Tier | Capability | v2 Target | | |
| |------|-----------|-----------| | |
| | 1 | Recognize + classify cloud services | 95% | | |
| | 2 | Author single-file IaC (Terraform/CDK/Helm) | 75% | | |
| | 3 | Author multi-file project (VPC + EKS + RDS + ArgoCD) | 60% | | |
| | 4 | End-to-end design trace ("build SaaS on AWS") | 50% | | |
| | 5 | Multi-cloud DR design + tool execution | 35% (stretch) | | |
| --- | |
| ## v2 Curriculum Integration Plan | |
| For the v2 LoRA fine-tune of Qwen2.5-Coder-7B → Surrogate-1: | |
| ### Data mix (target ~250k instruction examples) | |
| ``` | |
| 40% IaC generation (Terraform / OpenTofu / CDK / Pulumi / Bicep / Crossplane) | |
| 20% K8s authoring (Helm / Kustomize / ArgoCD / Karpenter) | |
| 15% Cloud architecture Q&A (mined from cert prep + docs) | |
| 10% Cost optimization scenarios (FinOps mined + synthesized) | |
| 10% IDP / Backstage / Score / Humanitec patterns | |
| 5% Multi-step tool-use traces (terraform plan → fix → apply) | |
| ``` | |
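Applied to the ~250k target, the percentages above translate into per-category example counts as follows. This is a small arithmetic sketch; the category keys are hypothetical labels for the six rows of the mix.

```python
DATA_MIX_PERCENT = {
    "iac_generation": 40,
    "k8s_authoring": 20,
    "cloud_architecture_qa": 15,
    "cost_optimization": 10,
    "idp_patterns": 10,
    "tool_use_traces": 5,
}


def examples_per_category(total: int, mix: dict = DATA_MIX_PERCENT) -> dict:
    """Split a total example budget by integer percentages."""
    if sum(mix.values()) != 100:
        raise ValueError("mix must sum to 100%")
    return {name: total * pct // 100 for name, pct in mix.items()}


budget = examples_per_category(250_000)
```

At 250k total this gives 100,000 IaC examples and 12,500 tool-use traces, which shows how small the tool-use share is relative to its weight in the eval.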
| ### Key sources (direct ingestion priorities) | |
| ``` | |
| 1. terraform-aws-modules/* + terraform-google-modules/* + Azure AVM (canonical IaC) | |
| 2. backstage/backstage source + plugin examples | |
| 3. AWS Well-Architected docs (all pillars + lenses) | |
| 4. GCP Cloud Adoption Framework | |
| 5. CNCF KubeCon transcripts (Whisper-extracted) | |
| 6. score-spec + humanitec docs | |
| 7. OpenCost docs + MCP-pattern examples | |
| 8. Real customer post-mortems (Tinybird $-20k, Series-B SaaS $-29k) | |
| 9. IaC-Eval benchmark training set | |
| 10. CodeFuse DevOps-Eval training set | |
| ``` | |
| ### Eval gates | |
| - v2 cannot ship until ≥65% overall on Surrogate-1 Cloud Eval v2. | |
| - Tier-3 (multi-file) ≥60% is the practical bar for autonomous infra building. | |
| - Add MCP-tool-use trajectory eval (sandbox terraform/kubectl/aws calls). | |
| --- | |
| ## Sources Consulted | |
| - AWS Well-Architected Framework (6 pillars docs, Sustainability pillar Nov 2024 refresh) | |
| - Terraform / OpenTofu best practices (Terramate, Spacelift, env0, Scalr 2025 articles) | |
| - Kubernetes 1.32-1.35 release notes; CNCF security blog Dec 2025 | |
| - Backstage docs + Spotify Backstage portal blog (2025) | |
| - ArgoCD / FluxCD comparison articles (2025-2026 post-Weaveworks closure) | |
| - Crossplane v2.0 release blog + InfoQ article (Aug 2025) | |
| - Karpenter cost optimization blogs (Tinybird; Series-B SaaS case studies) | |
| - Cloudflare Workers / Vercel Edge / Lambda@Edge benchmarks (2025) | |
| - FinOps Foundation 2025 framework + Scopes update | |
| - Istio / Linkerd / Cilium 2025 benchmarks (deepness-lab academic paper) | |
| - Pulumi / Terraform / CDK / Bicep 2025 comparisons | |
| - CockroachDB / YugabyteDB / Spanner / Aurora DSQL 2025 benchmarks | |
| - AWS SAP-C02 / GCP PCA (Oct 2025 refresh) / Azure AZ-305 (April 2026 refresh) | |
| - Backstage / Port / Cortex / Humanitec IDP comparison (2025-2026) | |
| - Karmada v1.15 + KubeFed EOL + Cluster API | |
| - Coherence / Uffizzi ephemeral environments (2025) | |
| - AWS CDK best practices (CDK Refactor Sept 2025) | |
| - VPC Transit Gateway / PrivateLink hub-spoke patterns | |
| - Helm / Kustomize / Carvel comparison (Helm 4 Nov 2025) | |
| - terraform-aws-modules registry top downloads (May 2025 stats) | |
| - Liquibase / Flyway / Atlas migration tools (2025 license + features) | |
| - Aurora DSQL GA announcement (May 2025) | |
| - CDN benchmarks (Cloudflare 95p TTFB 2024-2025) | |
| - AWS Savings Plans / Reserved Instances June 2025 policy changes | |
| - IAM SCPs + Permission Boundaries + ABAC patterns | |
| - GKE / EKS / AKS managed K8s comparison (2025-2026) | |
| - terraform-aws-modules registry usage (vpc 126M, eks 96.3M downloads) | |
| - Vertex AI / BigQuery / Gemini integration (2025) | |
| - Resolve.ai AI SRE + Aviator (2025-2026) | |
| - LiteLLM / Portkey / OpenRouter LLM gateway comparison (2025) | |
| - Multi-cloud DR active/active vs active/passive patterns | |
| - Wing language shutdown (April 2025) + WASM serverless trends | |
| - Awesome-aws / awesome-kubernetes curated lists | |
| - Kubecost / OpenCost cost visibility (Kubecost IBM acquisition 2024) | |
| - Atlantis / Spacelift / Env0 / Terramate IaC platforms | |
| - Score spec + OAM workload specifications | |
| - Karpenter NodePool + Spot + Graviton best practices | |
| - Tailscale / Twingate / Cloudflare Access ZTNA comparison | |
| - Vector DB benchmarks (Pinecone / Weaviate / Qdrant / Milvus 2025) | |
| - AWS Copilot end-of-support (June 12 2026) + SAM + Amplify | |
| - Gateway API + ingress-nginx retirement (March 2026) | |
| - DevOps eval benchmarks + IaC-Eval academic benchmark | |