Platform resiliency in restricted cloud environments

Building a "Cloud-Gapped" Platform on AWS

AWS, Kubernetes, RKE2, Zero Trust, Cloudflare
December 31, 2025

Executive Summary

This project involved architecting a highly available, production-grade RKE2 cluster within a restricted AWS VPC. To meet strict compliance requirements, the environment was designed with no direct internet egress or ingress. Remote administration is secured via Zero Trust principles, utilizing Cloudflare WARP and Tunnels to grant authorized clients access without exposing public listening ports. The architecture leverages AWS VPC Endpoints (Interface and Gateway) to maintain a fully private “cloud-gapped” posture while retaining access to critical AWS managed services.

Strategic Design Summary

| Challenge | Decision | Impact |
| --- | --- | --- |
| Network Isolation | Private Subnet Isolation | Restricted all nodes from direct internet access, achieving compliance for sensitive workloads. |
| Secure Image Supply | ECR + IAM Integration | Secured the supply chain by hosting all runtime images in a private registry with native IAM-based pull authentication. |
| Controlled Updates | S3 Gateway Endpoints | Enabled Amazon Linux (AL2023) package updates via S3 without requiring a NAT Gateway or proxy. |
| Control Plane HA | 3-Node Stacked Plane | Deployed across 3 Availability Zones (AZs) to ensure API server quorum and resilience against localized AWS failures. |
| Internal Connectivity | Internal NLB + Route53 | Established a stable, internal FQDN for node registration and kubectl operations, ensuring seamless cluster scaling. |
| Secure Remote Ops | Cloudflare Zero Trust | Leveraged Cloudflare Tunnels and WARP for encrypted, identity-aware admin access, eliminating the need for a traditional VPN. |

Infrastructure Topology

```mermaid
graph TB
    subgraph Public_Internet [External Access]
        direction TB
        spacer_external[ ]
        spacer_external ~~~ WARP
        WARP((WARP Client)) --- CF_Network([Cloudflare Network])
    end

    subgraph us-east-1 [ ]
        direction TB
        Title_Region[US-EAST-1]
        spacer_region[ ]
        Title_Region ~~~ spacer_region
        spacer_region ~~~ NLB

        subgraph AZ-A [ ]
            direction TB
            AZ_A_LABEL["us-east-1a"]
            subgraph PubSubA [ ]
                direction TB
                Title_PubSub[Public]
                CF[Cloudflared]
                CF_Network ---|"Tunnel Initiated<br/>Outbound"| Title_PubSub
                Title_PubSub --- CF
            end
            subgraph PrivSubA [ ]
                direction TB
                Title_PrivSubA[Private]
                K8sA[K8s A]
                Title_PrivSubA ~~~ K8sA
            end
        end

        subgraph AZ-B [ ]
            direction TB
            AZ_B_LABEL["us-east-1b"]
            subgraph PrivSubB [ ]
                direction TB
                Title_PrivSubB[Private]
                K8sB[K8s B]
                Title_PrivSubB ~~~ K8sB
            end
        end

        subgraph AZ-C [ ]
            direction TB
            AZ_C_LABEL["us-east-1c"]
            subgraph PrivSubC [ ]
                direction TB
                Title_PrivSubC[Private]
                K8sC[K8s C]
                Title_PrivSubC ~~~ K8sC
            end
        end

        subgraph Internals [ ]
            direction LR
            Internals_LABEL["AWS Services"]
            subgraph SSM [ ]
                direction TB
                Title_SSM[SSM]
            end
            subgraph S3 [ ]
                direction TB
                Title_S3[S3]
            end
            subgraph ECR [ ]
                direction TB
                Title_ECR[ECR]
            end
            Internals_LABEL ~~~ SSM ~~~ ECR ~~~ S3
        end

        NLB{Internal NLB}
        CF --> NLB
        NLB <--> PrivSubA
        NLB <--> PrivSubB
        NLB <--> PrivSubC
        PrivSubA --> Internals
        PrivSubB --> Internals
        PrivSubC --> Internals
    end

    %% --- STYLING ---
    classDef default font-family:arial,font-size:14px;

    %% Functional Classes
    classDef public fill:#FFF4DD,stroke:#D4A017,stroke-width:2px;
    classDef private fill:#E1F5FE,stroke:#01579B,stroke-width:2px;
    classDef internal fill:#F3E5F5,stroke:#7B1FA2,stroke-width:2px;
    classDef database fill:#E8F5E9,stroke:#2E7D32,stroke-width:2px,stroke-dasharray: 5 5;
    classDef cloudflare fill:#F38020,stroke:#333,color:#fff,font-weight:bold;

    %% Assigning Classes
    class WARP,CF_Network,CF cloudflare;
    class PubSubA public;
    class PrivSubA,PrivSubB,PrivSubC private;
    class SSM,ECR,S3 internal;

    %% Hidden/label styling
    classDef spacer fill:none,stroke:none,color:none
    classDef Title fill:none,stroke:none
    classDef AZ_LABELS fill:none,stroke:none,font-weight:bold
    class Title_PubSub,Title_Region,Title_PrivSubA,Title_PrivSubB,Title_PrivSubC,Title_ECR,Title_SSM,Title_S3 Title;
    class AZ_A_LABEL,AZ_B_LABEL,AZ_C_LABEL,Internals_LABEL AZ_LABELS;
    class spacer_region,spacer_external spacer;
    style NLB fill:#bbf,stroke:#333,stroke-width:2px;
```

Zero Trust Access

Remote access was built on the principle of zero inbound ports, moving away from the vulnerabilities of traditional bastion hosts and client-to-site VPNs.

Cloudflare Zero Trust

A cloudflared connector runs as a systemd service on an EC2 instance within the VPC. It initiates an outbound-only connection to the Cloudflare edge.
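A minimal, locally-managed connector configuration might look like the following sketch; the tunnel UUID and file paths are placeholders:

```yaml
# /etc/cloudflared/config.yml -- illustrative values only
tunnel: <TUNNEL-UUID>
credentials-file: /etc/cloudflared/<TUNNEL-UUID>.json

# No public ingress rules are defined: the tunnel exists solely to
# carry WARP client traffic into the VPC's private address space.
warp-routing:
  enabled: true
```

Because the connector only dials out to the Cloudflare edge, its security group needs no inbound rules at all.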

  • WARP Client: Authorized users and administrators use the WARP client to route traffic into the private RKE2 subnets based on defined CIDR ranges.

  • mTLS Enforcement: Even after tunnel authentication, kubectl communication is protected by RKE2’s native mTLS, providing multiple layers of identity verification.
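Advertising the private subnets to enrolled WARP clients can be sketched with the cloudflared CLI; the CIDR and tunnel name below are assumptions:

```shell
# Advertise the VPC's private CIDR through the tunnel so enrolled
# WARP clients can route into the RKE2 subnets (values illustrative).
cloudflared tunnel route ip add 10.0.0.0/16 rke2-tunnel

# List the IP routes currently advertised to the Zero Trust network.
cloudflared tunnel route ip show
```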

Session Manager For EC2

For CLI-level access, I bypassed SSH entirely in favor of SSM Session Manager. By deploying Interface Endpoints for ssm, ssmmessages, and ec2messages, I enabled interactive shell access to nodes that have zero route to the internet. This provides a detailed audit log of all commands executed on the infrastructure.
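Creating the three Interface Endpoints can be sketched with the AWS CLI; the VPC, subnet, and security group IDs are placeholders:

```shell
# Each SSM-related service requires its own Interface Endpoint.
# All IDs below are placeholders for illustration.
for svc in ssm ssmmessages ec2messages; do
  aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Interface \
    --service-name "com.amazonaws.us-east-1.${svc}" \
    --subnet-ids subnet-aaa subnet-bbb subnet-ccc \
    --security-group-ids sg-0123456789abcdef0 \
    --private-dns-enabled
done
```

With Private DNS enabled, the standard SSM service hostnames resolve to the endpoint ENIs inside the VPC, so the SSM Agent on each node needs no configuration changes.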

Offline Supply Chain

A primary challenge of isolated environments is the “circular dependency” of needing updates without having a route to get them.

  • Private Image Registry: I utilized Amazon ECR for all containerized workloads. By configuring the ecr-credential-provider on the RKE2 nodes, I enabled the Kubelet to rotate ECR tokens natively using the node’s IAM Instance Profile. This removed the need for static imagePullSecrets.
  • S3 for Package Management: Amazon Linux mirrors are hosted on S3. By implementing a Gateway Endpoint, the nodes pull OS updates directly from the AWS backbone. This is both more secure and more cost-effective than using a NAT Gateway or Squid Proxy.
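Wiring the Kubelet to the ECR credential provider can be sketched with a standard CredentialProviderConfig; the filename follows RKE2 conventions but is an assumption:

```yaml
# /etc/rancher/rke2/credential-provider.yaml (filename is an assumption)
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
  - name: ecr-credential-provider
    # Match any private ECR registry hostname.
    matchImages:
      - "*.dkr.ecr.*.amazonaws.com"
    # Cache tokens so the provider is not invoked on every pull.
    defaultCacheDuration: "12h"
    apiVersion: credentialprovider.kubelet.k8s.io/v1
```

The Kubelet is then pointed at this file via RKE2's `kubelet-arg` options, e.g. `image-credential-provider-config=/etc/rancher/rke2/credential-provider.yaml` together with `image-credential-provider-bin-dir` for the provider binary's location.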

Multi AZ Resilience

To ensure the platform survives a total AZ failure, the architecture utilizes a multi-master RKE2 configuration across three AZs.

  • Internal NLB: A Network Load Balancer (NLB) serves as the stable entry point for both the Control Plane (TCP 6443) and the Agent Join API (TCP 9345).
  • Service Discovery: A Route53 CNAME points to the NLB. This ensures that as worker nodes scale or control plane nodes are cycled, the endpoint remains constant for the cluster members and the kubeconfig used by administrators.
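The registration settings for additional server nodes can be sketched in RKE2's config file; the FQDN and token are placeholders:

```yaml
# /etc/rancher/rke2/config.yaml on the second and third servers;
# the FQDN below is a placeholder resolving to the internal NLB.
server: https://rke2.internal.example.com:9345
token: <cluster-join-token>
tls-san:
  - rke2.internal.example.com
```

Listing the NLB's FQDN in `tls-san` ensures the API server certificate remains valid when kubectl connects through the load balancer on 6443 rather than to a node directly.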

Tradeoffs & Considerations

Architectural Cost vs. Security Posture

While a standard NAT Gateway architecture is simpler and initially cheaper, it introduces a centralized point of egress that increases the attack surface.

  • The Tradeoff: Utilizing VPC Interface Endpoints for every service (SSM, ECR, S3) significantly increases the fixed monthly cost per VPC.

  • The Justification: For restricted workloads, the “Cost of Breach” or “Cost of Non-Compliance” outweighs the AWS service premiums. This architecture achieves a Zero-Egress posture that meets the highest security benchmarks.

Load Balancer Topologies

I chose a Cross-AZ Internal NLB over a simpler NodePort or HostPort configuration.

  • The Tradeoff: The NLB introduces (minimal) latency and additional cost, and requires more complex Target Group management for the RKE2 control plane nodes.

  • The Justification: The NLB provides a stable, immutable entry point. In a production environment, the ability to cycle nodes without updating kubeconfig files or DNS records is essential for platform resiliency.

Alternatives Considered

| Alternative | Why It Was Rejected |
| --- | --- |
| Self-Hosted Harbor | While Harbor would eliminate the need for ECR VPC Endpoints, it introduces a "management tax"—requiring its own backup strategy, storage volumes, and high-availability logic. ECR with VPC Endpoints provides a serverless registry experience with higher 9s of availability. |
| Squid Proxy / NAT | A Squid Proxy requires constant maintenance of whitelist/blacklist rules and creates a bottleneck for cluster-wide pulls. The S3 Gateway Endpoint used here provides a direct, wire-speed path to OS updates without the proxy overhead. |
| Traditional Bastion Host | Bastion hosts require managing SSH keys and exposing Port 22 to the internet (even if IP-restricted). SSM Session Manager replaced this, providing identity-based access (IAM) and full session auditing without any open ports. |