From one broken server to a self-operating cloud.
"Café Nimbus came to me with one broken server and no plan for growth. I left them with an infrastructure that scales automatically, recovers from failures without human intervention, and reports on itself every morning — without anyone touching a keyboard."
The Engagement
Café Nimbus is a growing café brand preparing for national expansion. When I came on board, their entire online presence ran on a single EC2 instance — no redundancy, no backups, no plan for what happens when traffic spikes during a promotion. One bad day and the whole operation goes dark.
My mandate: build an AWS infrastructure that could grow with the business, survive failures automatically, and operate with as little manual intervention as possible. What followed was a five-phase architectural engagement. Each phase solved a specific business problem. Each phase made the next one possible.
Architecture Overview
Five Phases
Phase 1: Amazon S3 · Versioning · Lifecycle Policies · Cross-Region Replication
Café Nimbus had no web presence. Customers were finding competitors instead. The site needed to be fast, cheap to run, and impossible to accidentally break during a content update.
Static website on Amazon S3 with public access controlled entirely through bucket policy. S3 versioning enabled from day one. Lifecycle policies transition older versions to S3 Standard-IA after 30 days. Cross-region replication treated as a day-one non-negotiable.
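A minimal sketch of that bucket configuration in boto3, assuming illustrative bucket names and a placeholder replication role ARN (the real values belong to the engagement):

```python
"""Sketch of the Phase 1 bucket setup. Bucket names, the replication role
ARN, and the 30-day window are illustrative assumptions."""
import boto3

s3 = boto3.client("s3")

SITE_BUCKET = "cafe-nimbus-site"          # hypothetical name
REPLICA_BUCKET = "cafe-nimbus-site-dr"    # hypothetical name, second region
REPLICATION_ROLE = "arn:aws:iam::123456789012:role/s3-replication"  # placeholder

# Versioning first: noncurrent-version lifecycle rules and replication require it.
s3.put_bucket_versioning(
    Bucket=SITE_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Move older (noncurrent) versions to Standard-IA after 30 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=SITE_BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "noncurrent-to-ia",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionTransitions": [
                    {"NoncurrentDays": 30, "StorageClass": "STANDARD_IA"}
                ],
            }
        ]
    },
)

# Replicate every new object to the bucket in the second region.
# The replica bucket must already exist with versioning enabled.
s3.put_bucket_replication(
    Bucket=SITE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE,
        "Rules": [
            {
                "ID": "site-to-dr",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": ""},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{REPLICA_BUCKET}"},
            }
        ],
    },
)
```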
Cross-region replication from day one. The counter-argument: Café Nimbus is too small for it. My counter: a regional outage hitting their only S3 bucket takes their entire web presence offline with no recovery option. Replication costs pennies at this scale.
The principle: Replication from day one, not after the first outage. The cost of preventing a disaster is always lower than the cost of recovering from one.
Phase 2: EC2 · LAMP Stack · AMI · Multi-Region Deployment
A static site can display a menu. It cannot take orders, manage inventory, or run a real application. Café Nimbus needed a backend — and one that could be reproduced exactly if it ever had to be rebuilt.
EC2 running a full LAMP stack: Linux, Apache, MySQL, PHP. After validating the application end-to-end (menu, order placement, data persistence), I created a golden AMI before touching anything else. From that AMI, I copied the image to a second region and launched an identical instance there in minutes.
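A hedged sketch of that golden-image step with boto3. The instance ID, regions, and instance type are assumptions; the one detail worth calling out is that AMIs are regional, so the image is copied before anything launches in the second region:

```python
"""Sketch: image the validated LAMP instance, copy it cross-region, relaunch.
Instance ID, region names, and instance type are illustrative assumptions."""
import boto3

PRIMARY_REGION = "us-east-1"     # assumed primary region
SECONDARY_REGION = "us-west-2"   # assumed second region
VALIDATED_INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

ec2_primary = boto3.client("ec2", region_name=PRIMARY_REGION)
ec2_secondary = boto3.client("ec2", region_name=SECONDARY_REGION)

# 1. Create the golden AMI from the validated LAMP instance.
image = ec2_primary.create_image(
    InstanceId=VALIDATED_INSTANCE_ID,
    Name="cafe-nimbus-lamp-golden-v1",
    Description="Validated LAMP stack: menu, ordering, persistence",
)
ec2_primary.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

# 2. AMIs are regional, so copy the image into the second region.
copy = ec2_secondary.copy_image(
    SourceImageId=image["ImageId"],
    SourceRegion=PRIMARY_REGION,
    Name="cafe-nimbus-lamp-golden-v1",
)
ec2_secondary.get_waiter("image_available").wait(ImageIds=[copy["ImageId"]])

# 3. Launch an identical instance in the second region from the copy.
ec2_secondary.run_instances(
    ImageId=copy["ImageId"],
    InstanceType="t3.micro",   # assumed size
    MinCount=1,
    MaxCount=1,
)
```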
AMI before anything else. A manually configured server that only one person knows how to rebuild is a liability. An AMI is an asset. Every production environment should have a golden image before it serves a single real user.
The principle: A manually configured server is a liability. An AMI is an asset. The cost of creating it is an hour. The cost of not having it is a full rebuild under pressure.
Phase 3: Custom VPC · Bastion Host · NAT Gateway · Network ACLs
Café Nimbus was going public with their platform. Their infrastructure had no meaningful network boundaries — everything was reachable from everywhere. Security had to be layered. A single misconfigured security group should not be enough to expose the entire backend.
Custom VPC with /16 CIDR. Public subnet for ALB and bastion only. Private subnet for all application servers and the database — no public IPs, ever. NAT Gateway for outbound-only private traffic. Network ACLs as a stateless second layer of defence.
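As one way to picture the stateless layer, here is a boto3 sketch of a network ACL dedicated to the private subnet. The VPC and subnet IDs are placeholders, and the ports and rule numbers are illustrative assumptions:

```python
"""Sketch of a dedicated network ACL for the private subnet: explicit allows
for expected traffic, implicit DENY for everything else. IDs are placeholders."""
import boto3

ec2 = boto3.client("ec2")

VPC_ID = "vpc-0123456789abcdef0"                 # placeholder
PRIVATE_SUBNET_ID = "subnet-0123456789abcdef0"   # placeholder
VPC_CIDR = "10.0.0.0/16"

nacl_id = ec2.create_network_acl(VpcId=VPC_ID)["NetworkAcl"]["NetworkAclId"]

# Inbound: app and SSH traffic from inside the VPC (ALB, bastion), plus
# ephemeral ports for return traffic on connections that went out through
# the NAT Gateway. Everything else falls through to the implicit DENY.
inbound_rules = [
    (100, VPC_CIDR, 80, 80),          # ALB -> Apache
    (110, VPC_CIDR, 22, 22),          # bastion -> SSH
    (120, "0.0.0.0/0", 1024, 65535),  # return traffic for outbound connections
]
for rule_number, cidr, port_from, port_to in inbound_rules:
    ec2.create_network_acl_entry(
        NetworkAclId=nacl_id, RuleNumber=rule_number, Protocol="6",  # TCP
        RuleAction="allow", Egress=False, CidrBlock=cidr,
        PortRange={"From": port_from, "To": port_to},
    )

# Outbound: TCP anywhere, covering responses to the ALB/bastion and
# outbound-only traffic through the NAT Gateway.
ec2.create_network_acl_entry(
    NetworkAclId=nacl_id, RuleNumber=100, Protocol="6",
    RuleAction="allow", Egress=True, CidrBlock="0.0.0.0/0",
    PortRange={"From": 0, "To": 65535},
)

# Re-point the private subnet's association at the new NACL.
associations = ec2.describe_network_acls(
    Filters=[{"Name": "association.subnet-id", "Values": [PRIVATE_SUBNET_ID]}]
)["NetworkAcls"][0]["Associations"]
current = next(a for a in associations if a["SubnetId"] == PRIVATE_SUBNET_ID)
ec2.replace_network_acl_association(
    AssociationId=current["NetworkAclAssociationId"], NetworkAclId=nacl_id,
)
```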
Security groups + NACLs in combination. Security groups are stateful — they remember connections. NACLs are stateless — they evaluate every packet independently. Having both means a misconfigured security group doesn't automatically become an open door.
The principle: No backend resource ever got a public IP, and no single control stood alone. Without NACLs as a backstop, one misconfigured security group is all it takes for complete exposure.
Phase 4: Application Load Balancer · Auto Scaling Group · Multi-AZ
A single EC2 instance — no matter how well configured — is a single point of failure. When traffic spikes during a promotion, the site goes down. When the instance fails, the business goes dark. Neither is acceptable for a company preparing for national expansion.
Application Load Balancer across two Availability Zones. Auto Scaling Group triggered by CPU utilization — not a schedule. Minimum instances maintained, scale-out on threshold breach, scale-in when load drops. Health checks replace failed instances automatically.
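A sketch of that scaling setup in boto3, assuming a hypothetical launch template name, placeholder subnet IDs and target group ARN, and a 60% CPU target:

```python
"""Sketch: an Auto Scaling group spanning two AZs with a CPU-based
target-tracking policy. Names, IDs, ARNs, and the 60% target are assumptions."""
import boto3

autoscaling = boto3.client("autoscaling")

ASG_NAME = "cafe-nimbus-web-asg"                       # hypothetical name
PRIVATE_SUBNETS = "subnet-aaaa1111,subnet-bbbb2222"    # one per AZ, placeholders
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/cafe-nimbus-web/0123456789abcdef"     # placeholder
)

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName=ASG_NAME,
    LaunchTemplate={"LaunchTemplateName": "cafe-nimbus-lamp", "Version": "$Latest"},
    MinSize=2,                           # minimum maintained: one per AZ
    MaxSize=6,
    DesiredCapacity=2,
    VPCZoneIdentifier=PRIVATE_SUBNETS,   # private subnets in two AZs
    TargetGroupARNs=[TARGET_GROUP_ARN],  # registers instances with the ALB
    HealthCheckType="ELB",               # failed health checks trigger replacement
    HealthCheckGracePeriod=120,
)

# Demand-driven scaling: track average CPU rather than a schedule.
autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG_NAME,
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,   # scale out above the target, scale in below it
    },
)
```

Target tracking is one way to express the CPU trigger; step-scaling policies on a CloudWatch CPU alarm achieve the same scale-out on breach, scale-in on recovery described above.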
CPU utilization trigger, not scheduled scaling. Traffic is demand-driven, not time-predictable. Scheduling assumes you know when customers will come. Utilization-based scaling responds to what's actually happening.
The principle: Multi-AZ is not a luxury. A single-AZ deployment with ten instances is still a single point of failure. Two AZs with two instances each is genuinely resilient.
Phase 5: AWS Lambda · Amazon SNS · Amazon EventBridge
Every morning, the operations team spent 45 minutes manually pulling the previous day's sales data and emailing it to management. Error-prone, time-consuming, and entirely unnecessary. The solution had to be serverless — running a cron job on EC2 means paying compute 24/7 for a 30-second task.
Two Lambda functions: DataExtractor (queries RDS inside VPC) and SalesAnalysisReport (formats and delivers the report). SNS email topic for the operations distribution list. EventBridge rule fires at 8AM daily — no human involved, no idle compute.
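A hedged sketch of the delivery end of that pipeline: a handler in the style of SalesAnalysisReport that formats the figures and publishes them to the SNS topic. The topic ARN environment variable, the event shape, and the field names are assumptions; in the real chain, DataExtractor queries RDS inside the VPC, and an EventBridge schedule rule (for example cron(0 8 * * ? *), evaluated in UTC) starts the daily run.

```python
"""Sketch of a SalesAnalysisReport-style Lambda handler. The topic ARN
environment variable, event shape, and field names are assumptions."""
import json
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["REPORT_TOPIC_ARN"]  # set in the function configuration


def lambda_handler(event, context):
    # Assumed event shape handed over by the extractor step, e.g.
    # {"date": "2024-05-01", "orders": 412, "revenue": 3180.50}
    date = event.get("date", "unknown date")
    orders = event.get("orders", 0)
    revenue = event.get("revenue", 0.0)

    body = (
        f"Café Nimbus daily sales report for {date}\n"
        f"Orders: {orders}\n"
        f"Revenue: ${revenue:,.2f}\n"
    )

    # One publish fans out to every address subscribed to the ops topic.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"Daily sales report - {date}",
        Message=body,
    )
    return {"statusCode": 200, "body": json.dumps({"delivered": True})}
```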
Lambda, not a cron job on EC2. Lambda runs only when triggered — at this workload, monthly cost is effectively zero. An EC2-based cron costs $15–30/month to idle 24/7 for a task that executes once per day. Serverless is not always the right answer. Here, it is the only answer.
The principle: A task that runs on a fixed schedule and needs seconds of compute should never require an idle server or a waiting human.
Architecture Decision Log
Architecture is not a list of what was built. It is a record of what was chosen and what was rejected — and why those tradeoffs were made in this order, for this client, at this stage.
| Decision | Chosen ✓ | Rejected ✗ | Rationale |
|---|---|---|---|
| Static hosting | S3 + bucket policy | EC2-hosted static site | No reason to run compute for files that never change |
| Content protection | S3 versioning day one | No versioning | Accidental overwrites have no recovery path without it |
| Regional resilience | Cross-region replication | Single-region only | One regional outage = total web presence loss |
| Server reproducibility | AMI before second deploy | Manual reconfiguration | A manually built server cannot be rebuilt reliably under pressure |
| Backend access | Bastion host only | Direct SSH + public IP | No backend resource should ever have a direct public route |
| Outbound private traffic | NAT Gateway | Public subnet for EC2s | Private isolation requires outbound-only — not bidirectional |
| Network defence | SGs + NACLs combined | Security groups alone | One misconfigured SG without NACLs = open door |
| Scaling trigger | CPU utilization | Scheduled scaling | Traffic is demand-driven, not time-predictable |
| AZ strategy | Multi-AZ ALB + ASG | Single AZ with more instances | A single AZ is a single point of failure regardless of instance count |
| Reporting automation | Lambda + EventBridge | Cron job on EC2 | Idle compute 24/7 for a 30-second daily task |
Technologies Used
Amazon S3 · Amazon EC2 · AMIs · Amazon VPC · NAT Gateway · Network ACLs · Application Load Balancer · Auto Scaling · AWS Lambda · Amazon SNS · Amazon EventBridge · Amazon RDS
What I'd Add in Production
These are the gaps I would close before calling this production-ready for a real business.
The load balancer is publicly exposed. Without a Web Application Firewall, SQL injection and XSS have no automated defence at the network edge.
In a hardened environment, RDS credentials would be rotated automatically through Secrets Manager, and no application code would contain a hardcoded password (a minimal retrieval sketch follows this list of gaps).
Every API call in the account should be logged. Without CloudTrail, there is no audit trail if something goes wrong or someone does something they shouldn't.
Traffic between EC2 and S3 currently routes through NAT Gateway. VPC Endpoints keep that traffic on the AWS private network and eliminate the NAT cost.
A CloudWatch dashboard putting ALB request count, ASG instance count, Lambda errors, and RDS connections in one place, with alarms publishing to SNS when anything goes out of range.
For a nationally expanding café brand, DDoS protection at the ALB layer moves from a nice-to-have to a business continuity requirement.
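As a sketch of the Secrets Manager gap above, this is roughly what runtime credential retrieval looks like; the secret name and JSON field names are assumptions:

```python
"""Sketch: fetch RDS credentials from Secrets Manager at runtime instead of
shipping a hardcoded password. Secret name and field names are assumptions."""
import json

import boto3

secrets = boto3.client("secretsmanager")


def get_db_credentials(secret_name="cafe-nimbus/rds"):  # hypothetical secret name
    """Return (username, password) from Secrets Manager.

    With rotation enabled, the same call always returns the current
    credentials; the application never needs a redeploy when they change.
    """
    response = secrets.get_secret_value(SecretId=secret_name)
    secret = json.loads(response["SecretString"])
    return secret["username"], secret["password"]
```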
The Result
The infrastructure does not need a human to survive a failure. It does not need a human to handle a traffic spike. And it does not need a human to send the morning sales report. That was the mandate. That is the result.