The Fitness Webapp
This production-grade distributed system serves as a blueprint for high-availability application hosting, designed with a Zero Trust security posture and zero single points of failure (SPOF). By architecting a 5-node Kubernetes cluster integrated with an asynchronous RabbitMQ pipeline and a 3-node MongoDB Replica Set, I achieved 100% UI responsiveness during database failover events. The implementation eliminates the public attack surface by requiring no inbound ports, while automating the full SSL/TLS lifecycle to ensure zero-touch operational overhead for security compliance.
| CHALLENGE | DECISION | IMPACT |
|---|---|---|
| Secure Remote Access | Cloudflare Tunnel & WARP | Achieved Zero Trust connectivity; eliminated all inbound firewall ports and exposed public IPs. |
| Traffic Orchestration | Traefik Gateway API | Future-proofed the platform (over traditional Ingress Controllers). Gateway API allows for a role-oriented architecture, where infrastructure lifecycle (Gateway) is decoupled from application routing logic (HTTPRoute), facilitating cleaner multi-tenant operations as the platform scales. |
| Write Reliability | RabbitMQ Pipeline | While the asynchronous messaging tier adds architectural complexity and requires a separate consumer service, it was a necessary trade-off to ensure the UI remains non-blocking during MongoDB primary elections or transient network partitions. |
| Control Plane HA | 3-Node Kubernetes Control Plane | Engineered a stacked control plane to ensure API server quorum and high availability. This enables zero-downtime cluster maintenance and protects against etcd data loss during localized hardware failures. |
| Data Integrity | MongoDB Replica Set | Deployed a 3-node Replica Set to prioritize durability. This configuration guarantees automated primary elections and data replication across fault domains, ensuring zero data loss during node outages. |
| Secrets & Identity | Cert-Manager w/ DNS-01 | Standardized on DNS-01 ACME challenges via cert-manager to maintain a strict zero-inbound-port policy. By leveraging Cloudflare API tokens for TXT record verification, I bypassed the need for public-facing HTTP/80 listeners required by standard HTTP-01 challenges, preserving the integrity of the internal network perimeter. |
Security is enforced through a Zero Trust model, granting access only to authorized Cloudflare WARP clients. Cloudflared runs as a system service on multiple internal servers, establishing encrypted tunnels that give remote clients direct access to the internal network. This allows for remote connectivity without exposing any inbound ports to the open internet.
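A minimal sketch of the cloudflared side of this setup, assuming a tunnel named `home-tunnel` and an internal subnet of `10.0.0.0/24` (both placeholders, not taken from the actual deployment):

```yaml
# /etc/cloudflared/config.yml — illustrative private-network routing config.
# The tunnel UUID, credentials path, and subnet are placeholders.
tunnel: home-tunnel
credentials-file: /etc/cloudflared/home-tunnel.json

# Enables WARP clients to route traffic into the private network
# advertised by this connector.
warp-routing:
  enabled: true

# Advertise the internal subnet to Cloudflare (run once per route):
#   cloudflared tunnel route ip add 10.0.0.0/24 home-tunnel
```

With `warp-routing` enabled and the subnet advertised, enrolled WARP clients can reach the internal VIP directly while the network exposes no inbound ports.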
DNS for our application domain resolves to an internal Virtual IP (VIP) shared across an NGINX Load Balancer Pool via Keepalived. Since the VIP is an internal address, only our authorized WARP clients are able to reach it.
The entry point for all traffic is a robust, load-balanced ingress tier designed for zero downtime.
Shared VIP: Multiple servers utilize Keepalived to maintain a Shared VIP across the Load Balancer Pool.
L4 Load Balancing: An NGINX Stream configuration is synchronized across the Load Balancer Pool. This manages Layer 4 traffic, forwarding requests to the appropriate backend target. This is utilized for client application traffic, as well as administrative access to the Kubernetes control plane.
Note: Access to the Kubernetes control plane is secured via mTLS. Only admins possessing an authorized certificate are provided access.
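The L4 tier described above can be sketched as an NGINX `stream` configuration. The IPs below are placeholders, and `30443` is an assumed Traefik NodePort rather than the deployment's actual value:

```nginx
# Illustrative stream{} block synchronized across the Load Balancer Pool.
# Keepalived holds the shared VIP; NGINX forwards L4 traffic behind it.
stream {
    # Application traffic -> Traefik NodePort on the worker nodes
    upstream traefik_nodeport {
        server 10.0.0.11:30443;
        server 10.0.0.12:30443;
    }

    # Administrative traffic -> Kubernetes API servers (mTLS end-to-end)
    upstream k8s_apiserver {
        server 10.0.0.1:6443;
        server 10.0.0.2:6443;
        server 10.0.0.3:6443;
    }

    server {
        listen 443;
        proxy_pass traefik_nodeport;
    }

    server {
        listen 6443;
        proxy_pass k8s_apiserver;
    }
}
```

Because NGINX only proxies at Layer 4 here, the mTLS handshake for the control plane terminates at the API servers themselves, not at the load balancer.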
I bootstrapped a 5-node Kubernetes cluster (3 control plane nodes, 2 worker nodes). Three control plane nodes ensure the API server remains available during node maintenance or failure. The cluster is managed via Kustomize and leverages the modern Gateway API for traffic management.
I deployed the Gateway API via Helm, choosing Traefik as the controller. Traefik is exposed via a nodePort which acts as the target for the upstream NGINX L4 Load Balancers proxying client application traffic.
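A sketch of what the Gateway resource could look like. The resource name, namespace, and TLS secret name are hypothetical; only the Traefik gateway class and the HTTPS listener pattern reflect the setup described here:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: traefik-gateway          # hypothetical name
  namespace: default
spec:
  gatewayClassName: traefik      # provided by the Traefik Helm install
  listeners:
    - name: websecure
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: fitness-tls    # populated by cert-manager (see below)
      allowedRoutes:
        namespaces:
          from: All              # HTTPRoutes may attach from any namespace
```

The role separation is visible here: this Gateway is infrastructure-owned, while each application team attaches its own HTTPRoute to it.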
I created three distinct HTTPRoutes to manage traffic for the fitness app, RabbitMQ management UI, and the Traefik dashboard itself. The entire cluster state, including these routes and gateway configurations, is managed declaratively using Kustomize.
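One of the three routes might look like the following sketch. The hostname, backend service name, port, and parent Gateway name are all placeholders:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: fitness-app              # hypothetical name
spec:
  parentRefs:
    - name: traefik-gateway      # attaches to the Traefik-managed Gateway
  hostnames:
    - "fitness.example.com"      # placeholder domain
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: fitness-app      # ClusterIP Service fronting the Dash pods
          port: 8050             # Dash's default port, assumed here
```

Because routing logic lives in the HTTPRoute rather than the Gateway, adding the RabbitMQ management UI or Traefik dashboard route is just another small manifest under Kustomize.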
I integrated cert-manager to automate TLS certificate issuance and renewal for all services.
Cluster-Wide Issuance: Created a ClusterIssuer that interfaces with Let’s Encrypt to issue TLS certificates.
DNS-01 Challenge: To secure internal services without opening firewall ports, I configured the issuer to use Cloudflare DNS-01 challenges, leveraging a Cloudflare API token stored as a Kubernetes secret for automated verification.
Dynamic Provisioning: Certificates are automatically requested and renewed based on certificateRefs defined within the Traefik Gateway listeners.
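The ClusterIssuer follows the standard cert-manager schema for a Cloudflare DNS-01 solver. The issuer name, email, and secret names below are illustrative:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns          # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com     # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token   # Kubernetes Secret holding the token
              key: api-token
```

cert-manager uses the API token to create the `_acme-challenge` TXT record in Cloudflare, so validation never touches an inbound listener on the internal network.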
The Fitness Webapp is a containerized Python application built with the Dash framework.
The app fetches historical workout data from MongoDB to generate personalized workout plans. Users input workout metrics as they complete the personalized plan, then utilize a “Log Workout” interface that asynchronously offloads data processing by publishing a message to a RabbitMQ Exchange.
Reliability: The app ensures the required queues and exchanges exist before publishing, and uses the internal Kubernetes service name (rabbitmq) for service discovery.
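A minimal sketch of the publish path, assuming the `pika` client library. The exchange, queue, and routing-key names are hypothetical, as is the message schema:

```python
import json
from datetime import datetime, timezone

# Hypothetical names — the real manifests define their own.
EXCHANGE = "workouts"
QUEUE = "workout-log"
ROUTING_KEY = "workout.logged"

def build_message(user_id: str, exercise: str, reps: int, weight_kg: float) -> bytes:
    """Serialize one workout entry into the JSON payload sent to RabbitMQ."""
    return json.dumps({
        "user_id": user_id,
        "exercise": exercise,
        "reps": reps,
        "weight_kg": weight_kg,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }).encode()

def publish_workout(body: bytes, host: str = "rabbitmq") -> None:
    """Declare the exchange/queue idempotently, then publish a persistent message."""
    import pika  # imported lazily so the module loads without a broker present
    conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    ch = conn.channel()
    # Declarations are idempotent: they create the topology if missing,
    # satisfying the "ensure queues and exchanges exist" requirement.
    ch.exchange_declare(exchange=EXCHANGE, exchange_type="direct", durable=True)
    ch.queue_declare(queue=QUEUE, durable=True)
    ch.queue_bind(queue=QUEUE, exchange=EXCHANGE, routing_key=ROUTING_KEY)
    ch.basic_publish(
        exchange=EXCHANGE,
        routing_key=ROUTING_KEY,
        body=body,
        properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
    )
    conn.close()
```

Note the host `rabbitmq` is the internal Kubernetes service name, so no broker address is hard-coded into the app image.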
To ensure the web frontend remains responsive, the system uses RabbitMQ for event-driven logging. I utilized Helm to deploy the RabbitMQ Operator, ensuring a repeatable and version-controlled installation process.
Decoupling: When a user logs a workout, the frontend publishes a message to a dedicated RabbitMQ exchange.
A specialized Python Consumer handles the data persistence.
Fault Tolerance: Three replicas run indefinitely, monitoring their own connection to RabbitMQ and re-instantiating sessions if interrupted.
Persistence: The consumer binds to the workout queue, processes incoming messages, and writes them into the MongoDB collection.
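The consumer loop can be sketched as follows, again assuming `pika` plus `pymongo`; the queue name, Mongo URI, and database/collection names are placeholders:

```python
import json
import time

def handle_message(body: bytes) -> dict:
    """Decode a workout message into the document inserted into MongoDB."""
    doc = json.loads(body)
    doc["processed"] = True
    return doc

def run_consumer(
    rabbit_host: str = "rabbitmq",
    mongo_uri: str = "mongodb://mongo-0,mongo-1,mongo-2/?replicaSet=rs0",
) -> None:
    """Consume from the workout queue forever, rebuilding sessions on failure."""
    import pika  # lazy imports keep the module loadable without the services
    from pymongo import MongoClient

    client = MongoClient(mongo_uri)
    workouts = client["fitness"]["workouts"]  # hypothetical db/collection names

    while True:  # the replica monitors its own connection and re-instantiates it
        try:
            conn = pika.BlockingConnection(pika.ConnectionParameters(host=rabbit_host))
            ch = conn.channel()
            ch.queue_declare(queue="workout-log", durable=True)
            for method, _props, body in ch.consume(queue="workout-log"):
                workouts.insert_one(handle_message(body))
                ch.basic_ack(method.delivery_tag)  # ack only after the write lands
        except pika.exceptions.AMQPConnectionError:
            time.sleep(5)  # back off, then rebuild the connection
```

Acknowledging only after `insert_one` succeeds means a mid-write crash leaves the message in the queue for redelivery, which is what makes the asynchronous path durable.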
State is maintained in a 3-node MongoDB Replica Set.
Automated Failover and High Availability: The 3-node configuration ensures that if the primary node becomes unavailable, an election is automatically triggered to promote a secondary node. This maintains continuous write availability for the Python consumer and read availability for the Dash frontend.
Data Redundancy and Durability: Every workout logged by the consumer is replicated across all three nodes. This protects against data loss in the event of hardware failure on a specific host.
Read Scaling and Performance: The architecture allows for the offloading of read queries to secondary nodes. This is particularly beneficial for the Dash webapp, which frequently fetches historical data to render workout recommendations and analytics without impacting the write performance of the logging consumer.
Operational Consistency via mTLS: To secure the database at the transport layer, all communication—including replication heartbeat and client traffic—is encrypted using internal certificates. This ensures that any data remains encrypted in transit, even across the internal network.
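The replica set and transport encryption described above map onto a small `mongod` configuration; the replica set name and certificate paths below are placeholders:

```yaml
# /etc/mongod.conf — illustrative replica set + TLS settings for each node.
replication:
  replSetName: rs0                      # assumed replica set name
net:
  port: 27017
  bindIp: 0.0.0.0
  tls:
    mode: requireTLS                    # reject any unencrypted connection
    certificateKeyFile: /etc/ssl/mongo/server.pem   # node cert + key
    CAFile: /etc/ssl/mongo/ca.pem       # internal CA for peer/client verification

# After all three nodes start, the set is initialized once from mongosh:
#   rs.initiate({_id: "rs0", members: [
#     {_id: 0, host: "mongo-0:27017"},
#     {_id: 1, host: "mongo-1:27017"},
#     {_id: 2, host: "mongo-2:27017"}]})
```

With `requireTLS`, replication heartbeats between members and traffic from the Dash frontend and the consumer are all encrypted against the same internal CA.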
Decoupling the frontend from the database via RabbitMQ introduces several “moving parts” (the broker, the exchange, and the consumer service).
Using DNS-01 challenges via cert-manager is more complex than the standard HTTP-01 challenge.
Using a dedicated NGINX/Keepalived pool to proxy traffic into the Kubernetes cluster adds an extra hop.
| ALTERNATIVE | WHY IT WAS REJECTED |
|---|---|
| Traditional Ingress | While standard NGINX Ingress is widely supported, the Gateway API (implemented via Traefik) provides a more expressive, role-oriented configuration that better handles multi-tenant routing and future-proofs the cluster. |
| HTTP-01 ACME Challenge | Rejected because it requires opening Port 80 to the public internet for validation. To maintain a Zero Trust posture, DNS-01 was the only viable path for automated TLS. |
| Synchronous DB Writes | Writing directly from the Dash app to MongoDB would simplify the code, but any database latency or failover event would cause the web UI to hang or crash, violating the goal of 100% responsiveness. |
| Single-Node MongoDB | A single node is a massive SPOF (Single Point of Failure). The 3-node Replica Set was chosen to ensure that data integrity and availability are maintained during both unplanned outages and routine maintenance. |