Building a "Cloud-Gapped" Platform on AWS
This project involved architecting a highly available, production-grade RKE2 cluster within a restricted AWS VPC. To meet strict compliance requirements, the environment was designed with no direct internet egress or ingress. Remote administration is secured via Zero Trust principles, utilizing Cloudflare WARP and Tunnels to grant authorized clients access without exposing public listening ports. The architecture leverages AWS VPC Endpoints (Interface and Gateway) to maintain a fully private “cloud-gapped” posture while retaining access to critical AWS managed services.
| CHALLENGE | DECISION | IMPACT |
|---|---|---|
| Network Isolation | Private Subnet Isolation | Restricted all nodes from direct internet access, achieving compliance for sensitive workloads. |
| Secure Image Supply | ECR + IAM Integration | Secured the supply chain by hosting all runtime images in a private registry with native IAM-based pull authentication. |
| Controlled Updates | S3 Gateway Endpoints | Enabled Amazon Linux (AL2023) package updates via S3 without requiring a NAT Gateway or Proxy. |
| Control Plane HA | 3-Node Stacked Plane | Deployed across 3 Availability Zones (AZs) to ensure API server quorum and resilience against localized AWS failures. |
| Internal Connectivity | Internal NLB + Route53 | Established a stable, internal FQDN for node registration and kubectl operations, ensuring seamless cluster scaling. |
| Secure Remote Ops | Cloudflare Zero Trust | Leveraged Cloudflare Tunnels and WARP for encrypted, identity-aware admin access, eliminating the need for a traditional VPN. |
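For the "Secure Image Supply" row above, a cluster with no internet route needs both ECR Interface Endpoints in addition to the S3 Gateway Endpoint, because ECR stores image layers in S3. A minimal sketch with the AWS CLI, where every ID and the region are placeholders:

```shell
# ECR needs two Interface Endpoints: ecr.api (auth, manifests) and
# ecr.dkr (the Docker registry protocol). All IDs/region are placeholders.
for svc in ecr.api ecr.dkr; do
  aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Interface \
    --service-name "com.amazonaws.us-east-1.${svc}" \
    --subnet-ids subnet-aaa subnet-bbb subnet-ccc \
    --security-group-ids sg-0123456789abcdef0 \
    --private-dns-enabled
done
```

Image layers themselves are fetched from S3, which is why the S3 Gateway Endpoint described below is also part of the image pull path, not just the OS update path.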
Remote access was built on the principle of Zero Inbound Ports, moving away from the vulnerabilities of traditional Bastion hosts or client-to-site VPNs.
- Cloudflared Connector: A `cloudflared` connector runs as a systemd service on an EC2 instance within the VPC. It initiates an outbound-only connection to the Cloudflare edge.
- WARP Client: Authorized users and administrators use the WARP client to route traffic into the private RKE2 subnets based on defined CIDR ranges.
- mTLS Enforcement: Even after tunnel authentication, kubectl communication is protected by RKE2’s native mTLS, providing multiple layers of identity verification.
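The tunnel side of this setup can be sketched with the `cloudflared` CLI; the tunnel name and CIDR below are placeholders:

```shell
# One-time setup (run from a machine with egress):
# authenticate to Cloudflare and create a named tunnel
cloudflared tunnel login
cloudflared tunnel create rke2-admin

# Advertise the private RKE2 subnet to WARP clients (placeholder CIDR)
cloudflared tunnel route ip add 10.0.0.0/16 rke2-admin

# On the in-VPC EC2 connector: install as a systemd service and start it
sudo cloudflared service install
sudo systemctl enable --now cloudflared
```

Because the connector only dials out to the Cloudflare edge, no security group in the VPC needs a single inbound rule for admin access.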
For CLI-level access, I bypassed SSH entirely in favor of SSM Session Manager. By deploying Interface Endpoints for ssm, ssmmessages, and ec2messages, I enabled interactive shell access to nodes that have zero route to the internet. This provides a detailed audit log of all commands executed on the infrastructure.
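The three Interface Endpoints SSM requires can be provisioned as follows; every ID and the region are placeholders:

```shell
# SSM Session Manager needs ssm, ssmmessages, and ec2messages endpoints
# reachable from the private subnets. All IDs/region are placeholders.
for svc in ssm ssmmessages ec2messages; do
  aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Interface \
    --service-name "com.amazonaws.us-east-1.${svc}" \
    --subnet-ids subnet-aaa subnet-bbb subnet-ccc \
    --security-group-ids sg-0123456789abcdef0 \
    --private-dns-enabled
done

# Open an audited interactive shell on a node with no internet route
aws ssm start-session --target i-0123456789abcdef0
```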
A primary challenge of isolated environments is the “circular dependency” of needing updates without having a route to get them. Here, that loop is broken with an S3 Gateway Endpoint: AL2023 package repositories are served from S3, so nodes pull OS updates directly without a NAT Gateway or proxy.
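A sketch of that setup, assuming placeholder VPC and route table IDs; unlike Interface Endpoints, the Gateway Endpoint attaches to route tables rather than subnets:

```shell
# Gateway Endpoint for S3, attached to the private route tables.
# Gateway Endpoints carry no hourly charge. All IDs/region are placeholders.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-aaa rtb-bbb rtb-ccc

# AL2023 repositories resolve to S3-backed origins, so dnf now works
# from a subnet with zero internet routes
sudo dnf -y update
```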
To ensure the platform survives a total AZ failure, the architecture utilizes a three-server RKE2 configuration with a stacked control plane, one server node per Availability Zone.
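A minimal sketch of the server configs, assuming a placeholder NLB hostname and token; RKE2 reads `/etc/rancher/rke2/config.yaml`, and additional servers register through the supervisor port 9345:

```shell
# Server 1 bootstraps the cluster; tls-san must include the NLB FQDN
# so certificates are valid for the load-balanced endpoint.
# Hostname and token are placeholders.
cat <<'EOF' | sudo tee /etc/rancher/rke2/config.yaml
token: <shared-cluster-secret>
tls-san:
  - rke2-api.internal.example.com
EOF
sudo systemctl enable --now rke2-server

# Servers 2 and 3 (in the other two AZs) join through the internal NLB
cat <<'EOF' | sudo tee /etc/rancher/rke2/config.yaml
server: https://rke2-api.internal.example.com:9345
token: <shared-cluster-secret>
tls-san:
  - rke2-api.internal.example.com
EOF
sudo systemctl enable --now rke2-server
```

With three servers, etcd retains quorum (2 of 3) through the loss of any single AZ.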
While a standard NAT Gateway architecture is simpler and initially cheaper, it introduces a centralized point of egress that increases the attack surface.
The Tradeoff: Utilizing a VPC Interface Endpoint for each required service (the SSM trio plus the ECR API and Docker endpoints) adds a fixed hourly charge per endpoint per AZ, significantly increasing the monthly cost per VPC; only the S3 Gateway Endpoint is free.
The Justification: For restricted workloads, the “Cost of Breach” or “Cost of Non-Compliance” outweighs the AWS service premiums. This architecture achieves a Zero-Egress posture that meets the highest security benchmarks.
I chose a Cross-AZ Internal NLB over a simpler NodePort or HostPort configuration.
The Tradeoff: An NLB introduces minimal latency and additional cost, and requires more complex Target Group management for the RKE2 control plane nodes.
The Justification: The NLB provides a stable, immutable entry point. In a production environment, the ability to cycle control plane nodes without updating kubeconfig files or DNS records is essential for platform resilience.
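That entry point can be sketched with the AWS CLI; every ID, ARN fragment, and hostname is a placeholder. Two TCP target groups front the control plane, one for the Kubernetes API (6443) and one for RKE2 node registration (9345):

```shell
# Internal NLB spanning the three private subnets (placeholder IDs)
aws elbv2 create-load-balancer \
  --name rke2-cp --type network --scheme internal \
  --subnets subnet-aaa subnet-bbb subnet-ccc

# TCP target group for the API server; a second group on 9345
# handles RKE2 supervisor/registration traffic
aws elbv2 create-target-group \
  --name rke2-api --protocol TCP --port 6443 \
  --vpc-id vpc-0123456789abcdef0 --target-type instance

# Stable internal FQDN aliasing the NLB, so cycling nodes never
# touches kubeconfig files (placeholder zone ID and DNS names)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0EXAMPLE \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{
    "Name":"rke2-api.internal.example.com","Type":"CNAME","TTL":60,
    "ResourceRecords":[{"Value":"rke2-cp-abc123.elb.us-east-1.amazonaws.com"}]}}]}'
```

Replacing a control plane node then reduces to registering the new instance into the target groups; clients never see the change.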
| ALTERNATIVE | WHY IT WAS REJECTED |
|---|---|
| Self Hosted Harbor | While Harbor would eliminate the need for ECR VPC Endpoints, it introduces a “management tax”—requiring its own backup strategy, storage volumes, and high-availability logic. ECR with VPC Endpoints provides a serverless registry experience with higher 9s of availability. |
| Squid Proxy / NAT | A Squid Proxy requires constant maintenance of allowlist/denylist rules and creates a bottleneck for cluster-wide pulls. The S3 Gateway Endpoint used here provides a direct, wire-speed path to OS updates without the proxy overhead. |
| Traditional Bastion Host | Bastion hosts require managing SSH keys and exposing Port 22 to the internet (even if IP-restricted). SSM Session Manager replaced this, providing identity-based access (IAM) and full session auditing without any open ports. |