As organizations scale their container workloads across multiple regions and cloud providers, the complexity of managing Kubernetes infrastructure grows quickly. In this post, I’ll share my battle-tested approach to building a production-grade, multi-region Kubernetes platform using GitOps principles and Infrastructure as Code (IaC).
The Challenge
Recently, I led the development of a global platform that needed to:
- Support applications across North America, Europe, and Asia
- Maintain consistent security and compliance controls
- Enable rapid deployment with minimal human intervention
- Provide disaster recovery with RPO < 15 minutes
- Scale to handle 1000+ microservices
Architecture Overview
Here’s the high-level architecture we implemented:
[Git Repositories]
        │
        ▼
[ArgoCD/Flux]──────────────[Terraform Cloud]
        │                          │
        ▼                          ▼
[Platform Components]       [Infrastructure]
 - Cert Manager              - VPC/Networking
 - External DNS              - EKS Clusters
 - Ingress Controller        - IAM Roles
 - Monitoring Stack          - Security Groups
        │
        ▼
[Regional EKS Clusters]
 ├── us-east-1
 ├── eu-west-1
 └── ap-southeast-1
Infrastructure as Code Foundation
We used Terraform to define our infrastructure, organizing it into reusable modules:
module "eks_cluster" {
  source   = "./modules/eks"
  for_each = local.regions

  region       = each.key
  cluster_name = "${var.environment}-${each.key}"
  node_groups  = local.node_group_config[each.key]
  vpc_id       = module.vpc[each.key].vpc_id
  subnet_ids   = module.vpc[each.key].private_subnet_ids

  tags = {
    Environment = var.environment
    Region      = each.key
    ManagedBy   = "terraform"
  }
}
GitOps Implementation
We chose ArgoCD for GitOps, configuring it to manage both platform components and applications:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-services
  namespace: argocd
spec:
  project: default
  source:
    repoURL: git@github.com:org/platform-services.git
    targetRevision: HEAD
    path: manifests
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
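With three regional clusters, writing one Application per cluster gets repetitive. Here is a minimal sketch of how the same source could be fanned out with an ArgoCD ApplicationSet and its cluster generator. It assumes each regional cluster is already registered with ArgoCD; the `platform-services-regional` name and the `platform` destination namespace are hypothetical placeholders.

```yaml
# Sketch only: renders one Application per cluster registered in ArgoCD.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-services-regional   # hypothetical name
  namespace: argocd
spec:
  generators:
    - clusters: {}                    # matches every registered cluster, including the local one
  template:
    metadata:
      name: 'platform-services-{{name}}'   # {{name}} is the registered cluster name
    spec:
      project: default
      source:
        repoURL: git@github.com:org/platform-services.git
        targetRevision: HEAD
        path: manifests
      destination:
        server: '{{server}}'          # API server URL of the matched cluster
        namespace: platform           # hypothetical namespace
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

A label selector on the `clusters` generator can narrow this to a subset of clusters, which is useful for rolling a change through one region at a time.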
Platform Components
Security and Access Control
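We implemented a zero-trust security model using AWS IAM roles and Kubernetes RBAC. The platform-admin role below grants full cluster access, so it is bound only to the platform-admins group: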
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: platform-admin
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-admin-binding
subjects:
  - kind: Group
    name: platform-admins
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: platform-admin
  apiGroup: rbac.authorization.k8s.io
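A wildcard ClusterRole is appropriate only for a small admin group. For everyone else, a namespaced Role is closer to least privilege. The following is a sketch of what a read-only role for an application team might look like; the `app-viewer` name, the `team-a` namespace, and the `team-a-developers` group are all hypothetical.

```yaml
# Sketch only: read-only access to common workload resources in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-viewer              # hypothetical name
  namespace: team-a             # hypothetical namespace
rules:
  - apiGroups: ["", "apps"]     # "" is the core API group
    resources: ["pods", "pods/log", "services", "deployments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-viewer-binding
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers     # hypothetical group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-viewer
  apiGroup: rbac.authorization.k8s.io
```

Because both objects are namespaced, the binding grants nothing outside `team-a`, unlike a ClusterRoleBinding.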
Monitoring and Observability
We deployed a comprehensive monitoring stack:
- Prometheus for metrics collection
- Grafana for visualization
- Loki for log aggregation
- Tempo for distributed tracing
Example Prometheus configuration:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  replicas: 2
  retention: 15d
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: gp3
        resources:
          requests:
            storage: 100Gi
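Metrics are only useful if something alerts on them. As an illustration, a Prometheus Operator alert rule could be declared as below. This is a sketch, not our production rule set: the rule name, the `role: alert-rules` label, and the five-minute threshold are assumptions, and the Prometheus resource would also need a `ruleSelector` matching that label, which the configuration above does not show.

```yaml
# Sketch only: fires when any scrape target has been unreachable for 5 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-availability   # hypothetical name
  labels:
    role: alert-rules           # assumed to match the Prometheus ruleSelector
spec:
  groups:
    - name: platform.rules
      rules:
        - alert: TargetDown
          expr: up == 0         # "up" is 0 when the last scrape of a target failed
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Target {{ $labels.job }}/{{ $labels.instance }} has been down for 5 minutes"
```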
Performance Optimizations
Some key optimizations we implemented:
- Cluster Autoscaling
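resource "aws_autoscaling_group" "nodes" {
  desired_capacity    = 3
  max_size            = 10
  min_size            = 1
  vpc_zone_identifier = var.private_subnet_ids # assumed variable, not in the original

  mixed_instances_policy {
    instances_distribution {
      # Half of the capacity above the base is On-Demand, the rest is Spot
      on_demand_percentage_above_base_capacity = 50
    }

    launch_template {
      # Required by the provider; assumes a launch template "nodes" defined elsewhere
      launch_template_specification {
        launch_template_id = aws_launch_template.nodes.id
        version            = "$Latest"
      }

      override {
        instance_type = "m6i.2xlarge"
      }
    }
  }
}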
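Because half of the capacity above the base can be Spot in the configuration above, nodes may be reclaimed at short notice, and the autoscaler also drains nodes when scaling in. A PodDisruptionBudget is one way to keep a minimum number of replicas running during those drains. This is a sketch under assumptions: the `checkout` workload, its label, and the `minAvailable` value are hypothetical.

```yaml
# Sketch only: voluntary evictions (for example node drains) must leave
# at least two matching pods running.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb            # hypothetical name
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: checkout             # hypothetical workload label
```

A budget only constrains voluntary evictions; it cannot prevent pod loss when a Spot node is terminated before the drain completes.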
- Network Policy Optimization: a default-deny baseline per namespace, so that only explicitly allowed traffic flows
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
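One consequence of denying all egress is that pods can no longer resolve DNS names until an allow rule is layered on top. Here is a sketch of such a rule, assuming cluster DNS runs in `kube-system` with the `k8s-app: kube-dns` label (the usual CoreDNS labeling on EKS); the policy name is hypothetical, and clusters using a node-local DNS cache would need a different rule.

```yaml
# Sketch only: lets every pod in the namespace reach cluster DNS on port 53.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress        # hypothetical name
spec:
  podSelector: {}               # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        # namespaceSelector and podSelector in one entry are ANDed together
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns   # assumed DNS pod label
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Network policies are additive, so this rule coexists with the default-deny policy rather than replacing it.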
Lessons Learned
- State Management: Keep Terraform state in a centralized location (we used S3 + DynamoDB) and implement proper locking:
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "platform/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
- Disaster Recovery: Regular testing of DR procedures is crucial. We automated this with chaos engineering:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-test
spec:
  action: pod-failure
  mode: one
  duration: "10m"
  selector:
    namespaces:
      - default
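A one-off experiment like the one above has to be applied by hand. To run it on a recurring basis, Chaos Mesh offers a Schedule resource that wraps the same spec. This is a sketch, assuming a weekly run; the resource name and the cron expression are hypothetical.

```yaml
# Sketch only: runs the pod-failure experiment every Monday at 03:00.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-failure      # hypothetical name
spec:
  schedule: "0 3 * * 1"         # hypothetical cron expression
  historyLimit: 5               # number of finished runs to keep
  concurrencyPolicy: Forbid     # skip a run if the previous one is still active
  type: PodChaos
  podChaos:
    action: pod-failure
    mode: one
    duration: "10m"
    selector:
      namespaces:
        - default
```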
- Cost Management: Implement proper tagging and use tools like Kubecost for visibility:
resource "aws_eks_node_group" "main" {
  # cluster_name, node_role_arn, subnet_ids, and scaling_config omitted for brevity

  tags = {
    Environment = var.environment
    Team        = var.team
    CostCenter  = var.cost_center
  }
}
Performance Results
After implementation, we achieved:
- 99.99% platform availability
- 45% reduction in deployment time
- 30% cost savings through optimized resource utilization
- Zero production incidents during regional failovers
Tech Stack Summary
- Infrastructure: AWS (EKS, VPC, Route53)
- IaC: Terraform
- GitOps: ArgoCD
- Monitoring: Prometheus, Grafana, Loki, Tempo
- Security: AWS IAM, Kubernetes RBAC, cert-manager
- CI/CD: GitHub Actions, ArgoCD
- Storage: AWS EBS, S3
- Networking: AWS VPC CNI, Calico, external-dns
This architecture has been running in production for over 6 months, serving millions of requests daily across three continents. The combination of GitOps and IaC has dramatically reduced our operational overhead while improving reliability and security.
Remember, there’s no one-size-fits-all solution. The key is understanding your specific requirements and constraints, then designing a platform that balances complexity with maintainability.
Feel free to reach out if you have questions about implementing similar architectures in your organization!