AWS Cloud Infrastructure · Five-Phase Architecture · Self-Directed Case Study

Café Nimbus

From one broken server to a self-operating cloud.

"Café Nimbus came to me with one broken server and no plan for growth. I left them with an infrastructure that scales automatically, recovers from failures without human intervention, and reports on itself every morning — without anyone touching a keyboard."

01 Get Online · 02 Get Dynamic · 03 Get Secure · 04 Get Scalable · 05 Get Autonomous

The Engagement

A café brand.
One server. No plan.

Café Nimbus is a growing café brand preparing for national expansion. When I came on board, their entire online presence ran on a single EC2 instance — no redundancy, no backups, no plan for what happens when traffic spikes during a promotion. One bad day and the whole operation goes dark.


My mandate: build an AWS infrastructure that could grow with the business, survive failures automatically, and operate with as little manual intervention as possible. What followed was a five-phase architectural engagement. Each phase solved a specific business problem. Each phase made the next one possible.

Architecture Overview

The full AWS stack

AWS Cloud — Café Nimbus (architecture diagram, summarised)
Internet → Application Load Balancer (2 AZs) → Auto Scaling Group (EC2 ×3), all inside a custom VPC (/16).
Public subnet: bastion host, NAT Gateway, Internet Gateway.
Private subnet (no public IPs): EC2 + LAMP, RDS (MySQL), with a golden AMI for reproducibility.
Network defence: security groups (stateful) layered with network ACLs (stateless).
Amazon S3: static site with cross-region replication.
Serverless: two Lambda functions, EventBridge schedule, SNS → email, delivering the 8AM daily report.
Phase mapping: S3 (Phase 1), EC2/AMI and ALB/ASG (Phases 2 + 4), VPC and networking (Phase 3), serverless reporting (Phase 5).

Five Phases

Every decision.
Every tradeoff.

01
Phase One

Get Online

Amazon S3 · Versioning · Lifecycle Policies · Cross-Region Replication

The Problem

Café Nimbus had no web presence. Customers were finding competitors instead. The site needed to be fast, cheap to run, and impossible to accidentally break during a content update.

What I Built

Static website on Amazon S3, with public access controlled entirely through a bucket policy. S3 versioning enabled from day one. Lifecycle policies transition noncurrent versions to S3 Standard-IA after 30 days. Cross-region replication as a day-one non-negotiable. Replication requires versioning on both buckets, so the two decisions reinforce each other.
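
A minimal boto3 sketch of these four controls, in the order the dependencies require. The bucket names, account ID, and replication role ARN are placeholders; the real engagement may well have configured this through the console rather than code.

import boto3, json

s3 = boto3.client("s3")
SITE = "cafe-nimbus-site"                # placeholder bucket names
REPLICA = "arn:aws:s3:::cafe-nimbus-site-replica"

# Versioning first: replication requires it on both buckets.
s3.put_bucket_versioning(Bucket=SITE,
                         VersioningConfiguration={"Status": "Enabled"})

# Lifecycle: noncurrent versions move to Standard-IA after 30 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=SITE,
    LifecycleConfiguration={"Rules": [{
        "ID": "noncurrent-to-ia", "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "NoncurrentVersionTransitions": [
            {"NoncurrentDays": 30, "StorageClass": "STANDARD_IA"}],
    }]})

# A public policy only takes effect once Block Public Access allows it.
s3.put_public_access_block(Bucket=SITE, PublicAccessBlockConfiguration={
    "BlockPublicAcls": True, "IgnorePublicAcls": True,
    "BlockPublicPolicy": False, "RestrictPublicBuckets": False})

# Public read through the bucket policy alone; no object ACLs.
s3.put_bucket_policy(Bucket=SITE, Policy=json.dumps({
    "Version": "2012-10-17",
    "Statement": [{"Sid": "PublicRead", "Effect": "Allow",
                   "Principal": "*", "Action": "s3:GetObject",
                   "Resource": f"arn:aws:s3:::{SITE}/*"}]}))

# Cross-region replication to the replica bucket.
s3.put_bucket_replication(
    Bucket=SITE,
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication",  # placeholder
        "Rules": [{"ID": "site-replica", "Status": "Enabled", "Priority": 1,
                   "Filter": {}, "DeleteMarkerReplication": {"Status": "Disabled"},
                   "Destination": {"Bucket": REPLICA}}]})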

Validated

  • Site live via S3 endpoint
  • 403 confirmed before policy — access correctly blocked
  • File deleted and restored in under 60 seconds
  • Lifecycle rules confirmed active
  • Replication: object in destination bucket within 30 seconds

The Decision

Cross-region replication from day one. The counter-argument: Café Nimbus is too small for it. My counter: a regional outage hitting their only S3 bucket takes their entire web presence offline with no recovery option. Replication costs pennies at this scale.

The principle: Replication from day one, not after the first outage. The cost of preventing a disaster is always lower than the cost of recovering from one.

02
Phase Two

Get Dynamic

EC2 · LAMP Stack · AMI · Multi-Region Deployment

The Problem

A static site can display a menu. It cannot take orders, manage inventory, or run a real application. Café Nimbus needed a backend — and one that could be reproduced exactly if it ever had to be rebuilt.

What I Built

EC2 running a full LAMP stack — Linux, Apache, MySQL, PHP. After validating end-to-end (menu, order placement, data persistence), I created a golden AMI before touching anything else. Because AMIs are regional, I copied that image to a second region and launched an identical instance there in minutes.
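
A sketch of that golden-image workflow in boto3. The instance ID, image names, region pair, and instance type are all assumptions, not values from the engagement.

import boto3

SOURCE, TARGET = "us-east-1", "us-west-2"      # assumed region pair
ec2 = boto3.client("ec2", region_name=SOURCE)

# Capture the golden image from the validated LAMP instance.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",          # hypothetical instance ID
    Name="cafe-nimbus-lamp-golden-v1",
    Description="Validated LAMP stack: menu, ordering, persistence")
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

# AMIs are regional; copy before launching anywhere else.
ec2_target = boto3.client("ec2", region_name=TARGET)
copied = ec2_target.copy_image(SourceRegion=SOURCE,
                               SourceImageId=image["ImageId"],
                               Name="cafe-nimbus-lamp-golden-v1")
ec2_target.get_waiter("image_available").wait(ImageIds=[copied["ImageId"]])

# Identical instance, different region, minutes of work.
ec2_target.run_instances(ImageId=copied["ImageId"],
                         InstanceType="t3.micro",  # instance type assumed
                         MinCount=1, MaxCount=1)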

Validated

  • LAMP stack deployed, Apache accessible
  • Menu items loading correctly
  • Order placed and confirmed persisted
  • AMI created from configured instance
  • Second instance launched from the copied AMI in an alternate region — identical

The Decision

AMI before anything else. A manually configured server that only one person knows how to rebuild is a liability. An AMI is an asset. Every production environment should have a golden image before it serves a single real user.

The principle: A manually configured server is a liability. An AMI is an asset. The cost of creating it is an hour. The cost of not having it is a full rebuild under pressure.

03
Phase Three

Get Secure

Custom VPC · Bastion Host · NAT Gateway · Network ACLs

The Problem

Café Nimbus was opening their platform to the public. Their infrastructure had no meaningful network boundaries — everything was reachable from everywhere. Security had to be layered: a single misconfigured security group should not be enough to expose the entire backend.

What I Built

Custom VPC with /16 CIDR. Public subnet for ALB and bastion only. Private subnet for all application servers and the database — no public IPs, ever. NAT Gateway for outbound-only private traffic. Network ACLs as a stateless second layer of defence.
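
A condensed boto3 sketch of that network skeleton, shown single-AZ with illustrative CIDRs for brevity where the real build spans two AZs.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # region assumed

# Custom VPC with a /16; this CIDR is illustrative.
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
public = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]
private = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.2.0/24")["Subnet"]["SubnetId"]

# Internet Gateway: the public subnet's route to the internet.
igw = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw, VpcId=vpc_id)

# NAT Gateway lives in the PUBLIC subnet and needs an Elastic IP;
# it gives private instances outbound-only access.
eip = ec2.allocate_address(Domain="vpc")["AllocationId"]
nat = ec2.create_nat_gateway(SubnetId=public,
                             AllocationId=eip)["NatGateway"]["NatGatewayId"]
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat])

# Private route table: default route via NAT, never via the IGW.
rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=rt, DestinationCidrBlock="0.0.0.0/0", NatGatewayId=nat)
ec2.associate_route_table(RouteTableId=rt, SubnetId=private)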

Validated

  • Private instances unreachable via direct connection
  • Bastion SSH confirmed as only entry path
  • NAT Gateway routing confirmed for private outbound
  • NACL deny rules blocked traffic as expected
  • Permitted traffic passed normally

The Decision

Security groups + NACLs in combination. Security groups are stateful — they remember connections. NACLs are stateless — they evaluate every packet independently. Having both means a misconfigured security group doesn't automatically become an open door.
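
To make the stateless point concrete, here is one hedged example of a NACL entry; the ACL ID, rule number, and blocked CIDR are illustrative.

import boto3

ec2 = boto3.client("ec2")

# Stateless means this deny is evaluated per packet; unlike a security
# group, return traffic needs its own rule. Nothing is remembered per
# connection, so a deny here holds even if an SG is misconfigured.
ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",   # hypothetical NACL ID
    RuleNumber=90,                          # evaluated before higher-numbered allows
    Protocol="6",                           # TCP
    RuleAction="deny",
    Egress=False,
    CidrBlock="198.51.100.0/24",            # illustrative hostile range
    PortRange={"From": 22, "To": 22})       # block SSH from that range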

The principle: No backend resource ever got a public IP. A single misconfigured security group without NACLs as a backstop is one mistake away from complete exposure.

04
Phase Four

Get Scalable

Application Load Balancer · Auto Scaling Group · Multi-AZ

The Problem

A single EC2 instance — no matter how well configured — is a single point of failure. When traffic spikes during a promotion, the site goes down. When the instance fails, the business goes dark. Neither is acceptable for a company preparing for national expansion.

What I Built

Application Load Balancer across two Availability Zones. Auto Scaling Group triggered by CPU utilization — not a schedule. Minimum instances maintained, scale-out on threshold breach, scale-in when load drops. Health checks replace failed instances automatically.
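
One idiomatic way to express CPU-driven scaling is a target-tracking policy on average CPU; whether the original build used target tracking or step scaling on an alarm, the shape is the same. The ASG name and target value below are assumptions.

import boto3

asg = boto3.client("autoscaling")

# Target tracking on average CPU: the group adds instances when the
# average exceeds the target and removes them when load falls back.
asg.put_scaling_policy(
    AutoScalingGroupName="cafe-nimbus-asg",   # hypothetical ASG name
    PolicyName="cpu-target-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 50.0})                 # threshold assumed; tune per load tests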

Validated

  • ALB distributing traffic across both AZs
  • Load simulation triggered scale-out within 90 seconds
  • Instance manually terminated mid-test — no visible interruption
  • ASG replaced terminated instance automatically
  • Scale-in confirmed when load dropped

The Decision

CPU utilization trigger, not scheduled scaling. Traffic is demand-driven, not time-predictable. Scheduling assumes you know when customers will come. Utilization-based scaling responds to what's actually happening.

The principle: Multi-AZ is not a luxury. A single-AZ deployment with ten instances is still a single point of failure. Two AZs with two instances each is genuinely resilient.

05
Phase Five

Get Autonomous

AWS Lambda · Amazon SNS · Amazon EventBridge

The Problem

Every morning, the operations team spent 45 minutes manually pulling the previous day's sales data and emailing it to management. Error-prone, time-consuming, and entirely unnecessary. The solution had to be serverless — running a cron job on EC2 means paying for compute 24/7 to handle a 30-second task.

What I Built

Two Lambda functions: DataExtractor (queries RDS from inside the VPC) and SalesAnalysisReport (formats and delivers the report). An SNS email topic for the operations distribution list. An EventBridge rule fires at 8AM daily — no human involved, no idle compute.
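
A sketch of the reporting half of the pipeline: a hypothetical SalesAnalysisReport-style handler publishing to SNS, plus the EventBridge schedule. The ARNs, names, event payload shape, and the UTC hour are placeholders.

import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:daily-sales-report"  # placeholder

def lambda_handler(event, context):
    # In the real flow this payload would come from the DataExtractor function.
    rows = event.get("sales", [])
    body = "\n".join(f"{r['item']}: {r['qty']} sold" for r in rows)
    sns.publish(TopicArn=TOPIC_ARN, Subject="Daily Sales Report",
                Message=body or "No sales recorded.")
    return {"status": "sent", "lines": len(rows)}

# EventBridge schedule: 8AM daily (UTC in cron; shift for local time).
events = boto3.client("events")
events.put_rule(Name="daily-sales-8am", ScheduleExpression="cron(0 8 * * ? *)")
events.put_targets(Rule="daily-sales-8am", Targets=[{
    "Id": "report-fn",
    "Arn": "arn:aws:lambda:us-east-1:123456789012:function:SalesAnalysisReport"}])
# (The function also needs a resource policy allowing events.amazonaws.com
#  to invoke it, added via lambda add_permission.)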

Validated

  • DataExtractor confirmed connecting to RDS within VPC
  • Sales data pulled and formatted correctly
  • SNS subscription confirmed active
  • Lambda manually triggered — email within 30 seconds
  • EventBridge scheduled execution confirmed

The Decision

Lambda, not a cron job on EC2. Lambda runs only when triggered — at this workload, monthly cost is effectively zero. An EC2-based cron costs $15–30/month to idle 24/7 for a task that executes once per day. Serverless is not always the right answer. Here, it is the only answer.
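
Back-of-envelope math behind that claim, assuming us-east-1 list prices at the time of writing; the figures are illustrative, not a quote.

# Illustrative monthly cost: one 30-second run per day at 128 MB.
GB_SECOND, PER_REQUEST = 0.0000166667, 0.20 / 1_000_000   # Lambda list prices
lambda_cost = 30 * (30 * 0.128 * GB_SECOND + PER_REQUEST)
print(f"Lambda: ${lambda_cost:.4f}/month")      # ~$0.002, before the free tier

ec2_cost = 0.0208 * 24 * 30                     # t3.small on-demand, idling all month
print(f"EC2 cron host: ${ec2_cost:.2f}/month")  # ~$15, the low end of the range above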

The principle: The infrastructure does not need a human to survive a failure, handle a traffic spike, or send the morning report. That was the mandate. This is the result.

Architecture Decision Log

Every choice. Every reason.

Architecture is not a list of what was built. It is a record of what was chosen and what was rejected — and why those tradeoffs were made in this order, for this client, at this stage.

Decision | Chosen ✓ | Rejected ✗ | Rationale
Static hosting | S3 + bucket policy | EC2-hosted static site | No reason to run compute for files that never change
Content protection | S3 versioning day one | No versioning | Accidental overwrites have no recovery path without it
Regional resilience | Cross-region replication | Single-region only | One regional outage = total web presence loss
Server reproducibility | AMI before second deploy | Manual reconfiguration | A manually built server cannot be rebuilt reliably under pressure
Backend access | Bastion host only | Direct SSH + public IP | No backend resource should ever have a direct public route
Outbound private traffic | NAT Gateway | Public subnet for EC2s | Private isolation requires outbound-only — not bidirectional
Network defence | SGs + NACLs combined | Security groups alone | One misconfigured SG without NACLs = open door
Scaling trigger | CPU utilization | Scheduled scaling | Traffic is demand-driven, not time-predictable
AZ strategy | Multi-AZ ALB + ASG | Single-AZ, more instances | Single AZ is a single point of failure regardless of count
Reporting automation | Lambda + EventBridge | Cron job on EC2 | Idle compute 24/7 for a 30-second daily task

Technologies Used

13 services.
One infrastructure.

Amazon S3
Static hosting, versioning, lifecycle, cross-region replication
Amazon EC2
Application compute, LAMP stack, bastion host
Amazon VPC
Network isolation, public/private subnet segmentation
Internet Gateway
Public subnet internet access
NAT Gateway
Outbound-only internet access for private subnet
Network ACLs
Stateless subnet-level traffic control layer
Application Load Balancer
Traffic distribution across AZs, health checking
Auto Scaling Group
Demand-based compute scaling, auto instance replacement
AWS Lambda
Serverless data extraction and report generation
Amazon RDS
Managed MySQL database for application data
Amazon SNS
Email notification delivery for sales reports
Amazon EventBridge
Scheduled Lambda trigger — 8AM daily
Amazon Machine Image (AMI)
Golden image capture for reproducible deployments

What I'd Add in Production

No architecture is finished.

These are the gaps I would close before calling this production-ready for a real business.

AWS WAF on ALB

The load balancer is publicly exposed. Without a Web Application Firewall, SQL injection and XSS have no automated defence at the network edge.
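
A hedged sketch of what closing this gap could look like with WAFv2 and an AWS managed rule set; the names, scope, and ARNs are assumptions.

import boto3

wafv2 = boto3.client("wafv2")

# Regional web ACL with AWS's common managed rule set (covers classic
# SQLi/XSS patterns), then attached to the ALB.
acl = wafv2.create_web_acl(
    Name="cafe-nimbus-waf",                    # hypothetical name
    Scope="REGIONAL",                          # REGIONAL for an ALB
    DefaultAction={"Allow": {}},
    Rules=[{
        "Name": "common-rules", "Priority": 0,
        "Statement": {"ManagedRuleGroupStatement": {
            "VendorName": "AWS", "Name": "AWSManagedRulesCommonRuleSet"}},
        "OverrideAction": {"None": {}},
        "VisibilityConfig": {"SampledRequestsEnabled": True,
                             "CloudWatchMetricsEnabled": True,
                             "MetricName": "common-rules"}}],
    VisibilityConfig={"SampledRequestsEnabled": True,
                      "CloudWatchMetricsEnabled": True,
                      "MetricName": "cafe-nimbus-waf"})
wafv2.associate_web_acl(WebACLArn=acl["Summary"]["ARN"],
                        ResourceArn="arn:aws:elasticloadbalancing:...")  # ALB ARN placeholder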

RDS + Secrets Manager

In a hardened environment, RDS credentials would be rotated automatically through Secrets Manager — no application code would contain a hardcoded password.
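
The application-side change is small: fetch credentials at runtime instead of baking them in. A sketch, assuming a hypothetical secret name and the JSON shape Secrets Manager uses for RDS-managed secrets.

import boto3, json

# Fetch rotated credentials at runtime; nothing hardcoded in the app.
secrets = boto3.client("secretsmanager")
resp = secrets.get_secret_value(SecretId="cafe-nimbus/rds-mysql")  # hypothetical name
creds = json.loads(resp["SecretString"])
# For RDS-managed secrets the JSON carries username, password, host, and
# port; feed those straight into the MySQL connection, not a config file.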

CloudTrail — All Regions

Every API call in the account should be logged. Without CloudTrail, there is no audit trail if something goes wrong or someone does something they shouldn't.
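
A minimal sketch of turning that on with boto3; the trail and bucket names are placeholders, and the bucket needs the standard CloudTrail bucket policy first.

import boto3

ct = boto3.client("cloudtrail")

# One multi-region trail, with log file validation for tamper evidence.
ct.create_trail(Name="cafe-nimbus-audit",            # placeholder names
                S3BucketName="cafe-nimbus-audit-logs",
                IsMultiRegionTrail=True,
                EnableLogFileValidation=True)
ct.start_logging(Name="cafe-nimbus-audit")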

VPC Endpoints for S3

Traffic between EC2 and S3 currently routes through NAT Gateway. VPC Endpoints keep that traffic on the AWS private network and eliminate the NAT cost.
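
A gateway endpoint is a one-call change once the route table is known; the IDs below are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # region assumed

# Gateway endpoint for S3: adds a route so S3 traffic from the private
# subnet rides the AWS network instead of traversing the NAT Gateway.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",                   # placeholder IDs
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"])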

CloudWatch Dashboard

ALB request count, ASG instance count, Lambda errors, and RDS connections — all visible in one place, with SNS alarms when anything goes out of range.
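
One example alarm from that set, sketched with boto3; the ALB and target group dimension values and the SNS topic ARN are placeholders.

import boto3

cw = boto3.client("cloudwatch")

# Page the ops topic if any ALB target stays unhealthy for two minutes.
cw.put_metric_alarm(
    AlarmName="alb-unhealthy-targets",
    Namespace="AWS/ApplicationELB",
    MetricName="UnHealthyHostCount",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/cafe-nimbus/0123456789abcdef"},
                {"Name": "TargetGroup", "Value": "targetgroup/cafe-nimbus/0123456789abcdef"}],
    Statistic="Maximum", Period=60, EvaluationPeriods=2,
    Threshold=0, ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"])  # placeholder ARN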

WAF + Shield

For a nationally expanding café brand, DDoS protection at the ALB layer moves from a nice-to-have to a business continuity requirement.

The Result

Five layers.
Zero manual steps.

The infrastructure does not need a human to survive a failure. It does not need a human to handle a traffic spike. And it does not need a human to send the morning sales report. That was the mandate. That is the result.