Building A Zero Trust Distributed System

The Fitness Webapp

Kubernetes, RabbitMQ, MongoDB, NGINX, Traefik, Cloudflare, Python
December 29, 2025

Executive Summary

This production-grade distributed system serves as a blueprint for high-availability application hosting, designed with a Zero Trust security posture and zero single points of failure (SPOF). By architecting a 5-node Kubernetes cluster integrated with an asynchronous RabbitMQ pipeline and a 3-node MongoDB Replica Set, I achieved 100% UI responsiveness during database failover events. The implementation eliminates the public attack surface by requiring no inbound ports, while automating the full SSL/TLS lifecycle to ensure zero-touch operational overhead for security compliance.

Strategic Design Summary

  • Challenge: Secure Remote Access. Decision: Cloudflare Tunnel & WARP. Impact: Achieved Zero Trust connectivity; eliminated all inbound firewall ports and exposed public IPs.

  • Challenge: Traffic Orchestration. Decision: Traefik Gateway API. Impact: Future-proofed the platform (over traditional Ingress Controllers). Gateway API allows for a role-oriented architecture, where infrastructure lifecycle (Gateway) is decoupled from application routing logic (HTTPRoute), facilitating cleaner multi-tenant operations as the platform scales.

  • Challenge: Write Reliability. Decision: RabbitMQ Pipeline. Impact: While the asynchronous messaging tier adds architectural complexity and requires a separate consumer service, it was a necessary trade-off to ensure the UI remains non-blocking during MongoDB primary elections or transient network partitions.

  • Challenge: Control Plane HA. Decision: 3-Node Kubernetes Control Plane. Impact: Engineered a stacked control plane to ensure API server quorum and high availability. This enables zero-downtime cluster maintenance and protects against etcd data loss during localized hardware failures.

  • Challenge: Data Integrity. Decision: MongoDB Replica Set. Impact: Deployed a 3-node Replica Set to prioritize durability. This configuration guarantees automated primary elections and data replication across fault domains, ensuring zero data loss during node outages.

  • Challenge: Secrets & Identity. Decision: Cert-Manager w/ DNS-01. Impact: Standardized on DNS-01 ACME challenges via cert-manager to maintain a strict zero-inbound-port policy. By leveraging Cloudflare API tokens for TXT record verification, I bypassed the need for public-facing HTTP/80 listeners required by standard HTTP-01 challenges, preserving the integrity of the internal network perimeter.

Infrastructure Topology

```mermaid
flowchart TD
    WARP((WARP Client)) --- CF([Cloudflare Network])
    CF --- Tunnel[Cloudflared Tunnel]
    Tunnel --> NGINX{{"NGINX L4 Load Balancers<br/>Shared VIP"}}
    NGINX --> Traefik{{Traefik Gateway API}}
    CertMgr{{Cert-Manager / TLS}} -.-> Traefik
    Traefik --> Webapp([Dash Frontend])
    Webapp -- "Publishes" --> RabbitMQ[(RabbitMQ)]
    RabbitMQ -- "Triggers" --> Consumer([Python Consumer])
    Consumer -- "Inserts" --> MongoDB[(MongoDB Replica Set)]
    Webapp -- "Queries" --> MongoDB
```

Zero Trust Access

Security is enforced through a Zero Trust model that grants access only to authorized Cloudflare WARP clients. Cloudflared runs as a system service on multiple internal servers, establishing encrypted tunnels that give remote clients direct access to the internal network. This allows for remote connectivity without requiring any inbound ports to be reachable from the open internet.
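As a sketch, a cloudflared configuration for private-network routing might look like the following (the tunnel ID and file paths are placeholders). With WARP routing enabled, the tunnel carries traffic from WARP clients into internal address space rather than exposing public hostnames:

```yaml
# /etc/cloudflared/config.yml -- tunnel ID and paths are placeholders
tunnel: <TUNNEL-UUID>
credentials-file: /etc/cloudflared/<TUNNEL-UUID>.json
warp-routing:
  enabled: true   # route authorized WARP client traffic into the internal network
```

The internal CIDRs reachable through the tunnel are then advertised with `cloudflared tunnel route ip add`.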

DNS for our application domain resolves to an internal Virtual IP (VIP) shared across an NGINX Load Balancer Pool via Keepalived. Since the VIP is an internal address, only our authorized WARP clients are able to reach it.

High Availability Ingress

The entry point for all traffic is a robust, load-balanced ingress tier designed for zero downtime.

Shared VIP: Multiple servers utilize Keepalived to maintain a Shared VIP across the Load Balancer Pool.
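A minimal keepalived.conf for one member of the pool illustrates the shared VIP (the interface name, priorities, and address are illustrative; the peer node runs the mirror-image BACKUP configuration):

```
vrrp_instance VI_1 {
    state MASTER                 # the peer load balancer uses BACKUP
    interface eth0               # interface name is host-specific
    virtual_router_id 51
    priority 100                 # the peer uses a lower priority
    advert_int 1
    virtual_ipaddress {
        192.168.10.100/24        # the shared VIP (address illustrative)
    }
}
```

If the MASTER fails its VRRP advertisements, the BACKUP node claims the VIP, so the address clients resolve never changes.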

L4 Load Balancing: An NGINX Stream configuration is synchronized across the Load Balancer Pool. This manages Layer 4 traffic, forwarding requests to the appropriate backend target. This is utilized for client application traffic, as well as administrative access to the Kubernetes control plane.

Note: Access to the Kubernetes control plane is secured via mTLS. Only admins possessing an authorized certificate are provided access.
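The stream configuration follows this general shape (node IPs and the Traefik NodePort are placeholders); one listener fronts client application traffic, the other the Kubernetes API servers, and because this is Layer 4 the TLS/mTLS sessions pass through uninspected:

```nginx
stream {
    # Client application traffic -> Traefik NodePort (IPs and port illustrative)
    upstream traefik_nodeport {
        server 10.0.10.21:30443;
        server 10.0.10.22:30443;
    }
    # Administrative traffic -> Kubernetes API servers (mTLS end-to-end)
    upstream kube_apiserver {
        server 10.0.10.11:6443;
        server 10.0.10.12:6443;
        server 10.0.10.13:6443;
    }
    server { listen 443;  proxy_pass traefik_nodeport; }
    server { listen 6443; proxy_pass kube_apiserver; }
}
```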

Kubernetes Cluster

I bootstrapped a 5-node Kubernetes cluster (3 control plane nodes, 2 worker nodes). Three control plane nodes maintain etcd quorum and ensure the API server remains available during node maintenance or failure. The cluster is managed via Kustomize and leverages the modern Gateway API for traffic management.

Gateway API

I deployed the Gateway API via Helm, choosing Traefik as the controller. Traefik is exposed via a NodePort service, which acts as the target for the upstream NGINX L4 Load Balancers proxying client application traffic.

I created three distinct HTTPRoutes to manage traffic for the fitness app, RabbitMQ management UI, and the Traefik dashboard itself. The entire cluster state, including these routes and gateway configurations, is managed declaratively using Kustomize.
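An HTTPRoute for the fitness app might look like the following sketch (names, namespaces, hostname, and port are illustrative). The parentRef attaches the route to the shared Gateway, which is exactly the role separation the Gateway API is designed for: the Gateway's lifecycle is owned by infrastructure, while each application owns its own routing rules:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: fitness-app              # names, hostname, and port are illustrative
  namespace: fitness
spec:
  parentRefs:
    - name: traefik-gateway      # the shared Gateway, managed separately
      namespace: traefik
  hostnames:
    - "fitness.example.com"
  rules:
    - backendRefs:
        - name: dash-frontend
          port: 8050
```

The RabbitMQ management UI and Traefik dashboard routes follow the same pattern with different hostnames and backends.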

Cert-Manager

I integrated cert-manager to automate TLS certificate issuance and renewal for all services.

Cluster-Wide Issuance: Created a ClusterIssuer that interfaces with Let’s Encrypt for issuing TLS certificates.

DNS-01 Challenge: To secure internal services without opening firewall ports, I configured the issuer to use Cloudflare DNS-01 challenges, leveraging a Cloudflare API token stored as a Kubernetes secret for automated verification.

Dynamic Provisioning: Certificates are automatically requested and renewed based on certificateRefs defined within the Traefik Gateway listeners.
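A representative ClusterIssuer using the Cloudflare DNS-01 solver looks like this (the issuer name, email, and secret names are illustrative; the referenced secret holds the Cloudflare API token):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01        # issuer and secret names are illustrative
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
```

Because validation happens via a DNS TXT record, no listener ever needs to be reachable from the internet.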

Dash Frontend

The Fitness Webapp is a containerized Python application built with the Dash framework.

The app fetches historical workout data from MongoDB to generate personalized workout plans. Users input workout metrics as they complete the personalized plan, then utilize a “Log Workout” interface that asynchronously offloads data processing by publishing a message to a RabbitMQ Exchange.

Reliability: The app ensures the required queues and exchanges exist before publishing, and uses the internal Kubernetes service name (rabbitmq) for service discovery.
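A minimal sketch of that publish path, assuming hypothetical exchange, queue, and routing-key names. The channel argument is duck-typed (in production, a pika BlockingChannel obtained from a connection to the `rabbitmq` service), and the declarations are idempotent, so repeating them before every publish is safe:

```python
import json

# Exchange, queue, and routing key names are hypothetical.
EXCHANGE = "workouts"
QUEUE = "workout-log"
ROUTING_KEY = "workout.logged"

def publish_workout(channel, workout: dict) -> str:
    """Declare the exchange and queue idempotently, bind them, then publish.

    `channel` is duck-typed: in production it is a pika BlockingChannel
    connected to the in-cluster `rabbitmq` service.
    """
    channel.exchange_declare(exchange=EXCHANGE, exchange_type="direct",
                             durable=True)
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.queue_bind(queue=QUEUE, exchange=EXCHANGE,
                       routing_key=ROUTING_KEY)
    body = json.dumps(workout)
    channel.basic_publish(exchange=EXCHANGE, routing_key=ROUTING_KEY,
                          body=body)
    return body
```

Because publishing is a single non-blocking broker call, the "Log Workout" handler returns immediately regardless of the state of MongoDB.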

RabbitMQ

To ensure the web frontend remains responsive, the system uses RabbitMQ for event-driven logging. I utilized Helm to deploy the RabbitMQ Operator, ensuring a repeatable and version-controlled installation process.

Decoupling: When a user logs a workout, the frontend publishes a message to a dedicated RabbitMQ exchange.

Consumer

A specialized Python Consumer handles the data persistence.

Fault Tolerance: Three replicas run indefinitely, monitoring their own connection to RabbitMQ and re-instantiating sessions if interrupted.

Persistence: The consumer binds to the workout queue, processes incoming messages, and writes them into the MongoDB collection.
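The reconnect behavior can be sketched generically. Here `connect()` stands in for establishing a pika session and yielding raw message bodies, and `handle()` stands in for the MongoDB insert, so the retry logic itself can be exercised without a broker (names and the back-off interval are illustrative):

```python
import json
import time

def parse_workout(body: bytes) -> dict:
    """Decode one JSON workout message into a MongoDB-ready document."""
    return json.loads(body)

def consume_forever(connect, handle, *, retry_delay=5.0, max_retries=None):
    """Generic reconnect loop.

    `connect()` returns an iterable of raw message bodies (in production,
    a wrapper around pika's channel.consume); `handle(doc)` persists one
    decoded document (in production, a MongoDB insert). Any exception
    tears the session down, backs off, and rebuilds the connection.
    `max_retries` exists only so tests can terminate the loop.
    """
    retries = 0
    while max_retries is None or retries < max_retries:
        try:
            for body in connect():
                handle(parse_workout(body))
        except Exception:
            retries += 1
            time.sleep(retry_delay)
```

Each of the three replicas runs this loop independently, so a broker restart or network partition only delays persistence rather than losing the session permanently.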

MongoDB Replica Set

State is maintained in a 3-node MongoDB Replica Set.

Automated Failover and High Availability: The 3-node configuration ensures that if the primary node becomes unavailable, an election is automatically triggered to promote a secondary node. This maintains continuous write availability for the Python consumer and read availability for the Dash frontend.

Data Redundancy and Durability: Every workout logged by the consumer is replicated across all three nodes. This protects against data loss in the event of hardware failure on a specific host.

Read Scaling and Performance: The architecture allows for the offloading of read queries to secondary nodes. This is particularly beneficial for the Dash webapp, which frequently fetches historical data to render workout recommendations and analytics without impacting the write performance of the logging consumer.

Operational Consistency via mTLS: To secure the database at the transport layer, all communication, including replication heartbeats and client traffic, is encrypted using internal certificates. This ensures that data remains encrypted in transit, even across the internal network.
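The connection options described above can all be captured in the client URI. A small helper (hostnames, database name, and replica-set name are illustrative) shows where the replica set, read preference, and TLS settings are specified; with `readPreference=secondaryPreferred`, reads fall to secondaries while writes always target the elected primary:

```python
def mongo_uri(hosts, db="fitness", replica_set="rs0",
              read_pref="secondaryPreferred", tls=True):
    """Build the MongoDB connection string shared by the Dash app and
    the consumer.  Hostnames, database, and replica-set name are
    illustrative placeholders."""
    params = [f"replicaSet={replica_set}", f"readPreference={read_pref}"]
    if tls:
        params.append("tls=true")
    return f"mongodb://{','.join(hosts)}/{db}?{'&'.join(params)}"
```

The resulting URI is passed to pymongo's MongoClient; the driver discovers the current primary and handles failover transparently, so neither the frontend nor the consumer needs election-aware code.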

Tradeoffs & Considerations

Architectural Complexity vs. Responsiveness

Decoupling the frontend from the database via RabbitMQ introduces several “moving parts” (the broker, the exchange, and the consumer service).

  • The Tradeoff: This requires managing the lifecycle of the RabbitMQ Operator and ensuring the Python Consumer handles connection retries gracefully. It increases the initial development time and troubleshooting surface area.
  • The Justification: In a production environment, database “hiccups” or primary elections (which take seconds) should not lead to 504 Gateway Timeouts or “frozen” UIs. The asynchronous bridge ensures the user experience is decoupled from the state of the storage tier.

Operational Overhead of DNS-01 Challenges

Using DNS-01 challenges via cert-manager is more complex than the standard HTTP-01 challenge.

  • The Tradeoff: It requires managing Cloudflare API tokens within Kubernetes secrets.
  • The Justification: This is the only way to maintain a Zero-Inbound posture. HTTP-01 requires Port 80 to be open to the internet for Let’s Encrypt validation. By choosing DNS-01, I prioritize the security of the internal network perimeter over the simplicity of the certificate issuance process.

NGINX Proxy vs. Direct Node Exposure

Using a dedicated NGINX/Keepalived pool to proxy traffic into the Kubernetes cluster adds an extra hop.

  • The Tradeoff: Introduces a small amount of latency and requires maintaining the Keepalived/NGINX configuration outside of the Kubernetes manifest ecosystem.
  • The Justification: It provides a stable, hardware-agnostic VIP. This allows the Kubernetes cluster to be completely rebuilt or nodes to be cycled without changing the target IP address in the Cloudflare Tunnel configuration, enhancing overall platform resiliency.

Alternatives Considered

  • Traditional Ingress: While standard NGINX Ingress is widely supported, the Gateway API (implemented via Traefik) provides a more expressive, role-oriented configuration that better handles multi-tenant routing and future-proofs the cluster.

  • HTTP-01 ACME Challenge: Rejected because it requires opening Port 80 to the public internet for validation. To maintain a Zero Trust posture, DNS-01 was the only viable path for automated TLS.

  • Synchronous DB Writes: Writing directly from the Dash app to MongoDB would simplify the code, but any database latency or failover event would cause the web UI to hang or crash, violating the goal of 100% responsiveness.

  • Single-Node MongoDB: A single node is a massive SPOF (Single Point of Failure). The 3-node Replica Set was chosen to ensure that data integrity and availability are maintained during both unplanned outages and routine maintenance.