AWS Cloud
Fundamentals
Everything you need before touching a single AWS service: what cloud is, how service models work, virtualization, global infrastructure, shared responsibility, and design principles.
What is Cloud Computing?
Cloud computing is the on-demand delivery of computing resources (servers, storage, databases, networking) over the internet, paid for by usage rather than ownership.
Instead of owning and managing physical hardware, you can access these resources whenever you need them and pay only for what you use. The hardware still exists; it just lives in someone else's data center, abstracted behind APIs.
Before cloud computing, organizations relied on on-premise infrastructure: buying, racking, powering, and operating their own servers in their own buildings.
The Traditional Setup
- Purchase physical servers up-front
- Run private data centers (cooling, power, security)
- Manage networking, storage, OS patches
- Plan capacity months, sometimes years, in advance
Why It Broke
- High up-front capital expense
- Long lead times (weeks to months for new servers)
- Hard to scale: you over-provision or run out
- Most hardware sits idle most of the time
As applications grew and internet usage exploded, this model became inefficient. The cloud emerged as a way to share large pools of hardware across many tenants, billed by the hour (and later, the millisecond).
Infrastructure Overhead
- No physical hardware, cooling, or networking to manage
Scalability
- Scale resources up or down in minutes, not months
Capital Cost
- No up-front investment; pay only for what you actually use
Speed
- Provision a database or server in seconds, not weeks
Cloud computing provides on-demand access to shared computing resources over the network.
Five characteristics, codified by NIST, define a true cloud:
On-Demand Self-Service
- Provision resources via API or console, with no human in the loop
Scalability & Elasticity
- Capacity grows and shrinks with load, automatically
Pay-as-You-Go
- Metered billing: by hour, second, request, or GB
Resource Pooling
- Multi-tenant infrastructure shared securely across customers
High Availability
- Redundancy built in: failures are absorbed, not fatal
Broad Network Access
- Reachable from anywhere over standard internet protocols
Think of cloud computing the way you think of electricity. You don't build a power plant in your basement; you plug in.
| Electricity Grid | Cloud Computing |
|---|---|
| You don't build your own power plant | You don't own racks of servers |
| You consume power on demand | You consume compute & storage on demand |
| You pay a utility bill based on kWh used | You pay a cloud bill based on usage (CPU-hours, GB, requests) |
| The grid handles generation, transmission, redundancy | The provider handles hardware, failover, capacity |
| Outages are rare and absorbed by the grid | Failures are isolated to zones; services remain available |
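The utility analogy extends to the bill itself: metered usage times a unit price, summed across every meter. A minimal sketch in Python, where the rates are illustrative assumptions rather than real AWS prices:

```python
# Toy metered cloud bill: usage quantity times unit price, summed per meter.
# All rates below are illustrative assumptions, not real AWS pricing.

USAGE = {
    "cpu_hours": 720,        # one server running for a month
    "storage_gb_month": 50,  # data kept in object storage
    "requests": 1_200_000,   # API calls served
}

RATES = {                    # assumed price per unit of each meter
    "cpu_hours": 0.05,
    "storage_gb_month": 0.023,
    "requests": 0.0000004,
}

def monthly_bill(usage, rates):
    """Sum usage * unit price across every metered dimension."""
    return sum(qty * rates[meter] for meter, qty in usage.items())

print(f"${monthly_bill(USAGE, RATES):.2f}")
```

Exactly like a kWh-based utility bill, the total is driven entirely by consumption: halve the usage and the bill halves with it.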
Every industry runs on cloud today. A non-exhaustive list:
Web & Mobile Apps
- Hosting backends, APIs, static sites
Data & Analytics
- Storing & querying terabytes of data
Machine Learning
- Training and serving models on GPUs
Streaming & CDN
- Video, audio, content delivery globally
SaaS Platforms
- Multi-tenant business apps (CRM, HR, billing)
Backup & DR
- Off-site backups, cross-region failover
Cloud computing is implemented through providers, and the largest by market share is AWS (Amazon Web Services). Two foundational services map directly to the diagram above:
Amazon EC2
- Virtual servers: pick OS, CPU, RAM, network
- Pay per second of running time
- The "compute" pillar of the cloud
Amazon S3
- Object storage: durable, virtually unlimited
- Pay per GB stored and per request
- The "storage" pillar of the cloud
AWS abstracts the underlying hardware so you can focus on building applications instead of operating infrastructure.
| Model | Owned By | Used By | Typical Fit |
|---|---|---|---|
| Public Cloud | Provider (AWS, Azure, GCP) | Many tenants share | Default for most workloads |
| Private Cloud | Single organization | One tenant only | Regulated / strict data residency |
| Hybrid Cloud | Mix of both | Per workload | Lift-and-shift, gradual migration |
| Multi-Cloud | Multiple providers | Per workload | Avoid lock-in, but more complex |
Speed
- Faster development & deployment cycles
- Idea → production in days, not quarters
Global Reach
- Deploy to 30+ regions around the planet
- Latency-aware routing built in
Reduced Ops Burden
- Provider handles HW, patching, replacement
- Engineers focus on product
Reliability
- Multi-AZ, multi-region high availability
- SLAs measured in 9s
Cost Efficiency
- OpEx instead of CapEx
- Scale-to-zero possible with serverless
Experimentation
- Spin up an experiment for $5 and tear it down
- Innovation cost approaches zero
| Myth | Reality |
|---|---|
| "Cloud means data floats somewhere ethereal." | Data lives in real, physical data centers in specific countries. You can usually pick the region. |
| "Cloud is always cheaper." | It depends on usage and architecture. Idle reserved capacity or chatty workloads can cost more than on-prem. |
| "Cloud removes all responsibility." | Wrong โ see the Shared Responsibility Model. You still own apps, data, IAM, and configuration. |
| "Cloud is automatically secure." | The provider secures the infrastructure; you secure what you put in it (mis-configured S3 buckets are the classic failure). |
| "Cloud is just someone else's computer." | Reductive โ you also get global networking, managed services, autoscaling, and a programmable API surface that's not feasible on-prem. |
- Cloud computing provides on-demand access to computing resources over the network.
- It removes the need to own and operate physical infrastructure.
- It enables scalability, flexibility, and cost efficiency via pay-per-use billing.
- It's the foundation of modern application development: every major SaaS, mobile app, and ML system runs on it.
- AWS is the largest implementation; EC2 and S3 are the canonical compute and storage services.
- The cloud doesn't remove responsibility; it shifts it (hardware to provider, configuration to you).
Cloud computing turns infrastructure into an on-demand utility, just like electricity. You stop owning hardware and start consuming capability.
Cloud Service Models: IaaS · PaaS · SaaS
Cloud service models define how responsibilities are divided between the cloud provider and the user.
Who manages what in the cloud?
Every AWS service sits inside one of these models. Knowing which model you're working in tells you immediately what you're responsible for, and what you can safely ignore.
Before cloud computing, organizations managed everything:
What They Owned
- Hardware (servers, switches, storage arrays)
- Operating systems and patches
- Runtimes, middleware, databases
- Applications and data
The Cost of Full Ownership
- High operational complexity
- Constant maintenance overhead
- Slow development cycles
- Large, specialized ops teams
Cloud providers introduced service models to reduce this burden gradually, letting teams choose exactly how much infrastructure complexity they want to own.
Unclear Ownership
- Without a model, users don't know what they're responsible for, and security gaps emerge
Slow Development
- Developers waste time provisioning infra instead of writing code
Wasted Ops Effort
- Teams hand-hold infrastructure that providers can operate at massive scale for a fraction of the cost
Wrong Tool for the Job
- Picking the wrong model means over-managing simple apps or under-controlling complex ones
Three models (IaaS, PaaS, and SaaS) each offer a different level of abstraction. The higher the model, the less you manage.
IaaS
Infrastructure as a Service
- Raw compute, storage, networking
- You manage OS upward
PaaS
Platform as a Service
- Runtime + OS managed for you
- You manage code & data
SaaS
Software as a Service
- Fully managed application
- You configure & use it
Think of the three models as different housing arrangements:
| Model | Housing Analogy | What You Handle |
|---|---|---|
| IaaS | Empty apartment: four walls, utilities connected | Furniture, appliances, decorating, cleaning: everything inside |
| PaaS | Furnished apartment: furniture and appliances included | Just bring your belongings; don't worry about pipes or wiring |
| SaaS | Hotel room: fully serviced, front desk on call | Unpack your suitcase; use the room; someone else cleans it |
Enterprises → IaaS
- Legacy app migrations
- Full control over OS & security baseline
- Hybrid cloud bridging
Startups → PaaS
- Ship fast, skip infra setup
- Focus 100% on product code
- Auto-managed runtimes & DBs
Everyone → SaaS
- Email, CRM, HR, collaboration
- No IT overhead
- Browser or mobile app access
AWS spans all three models:
| Model | AWS Service | What you manage |
|---|---|---|
| IaaS | Amazon EC2 | OS, AMI, patches, runtime, app, security groups |
| IaaS | Amazon S3 | Bucket policies, data, lifecycle rules |
| PaaS | AWS Elastic Beanstalk | App code and config; AWS manages OS, runtime, LB |
| PaaS | AWS Lambda | Function code only; AWS manages everything else |
| PaaS | Amazon RDS | Schema, queries, data; AWS manages DB engine & OS |
| SaaS | Amazon WorkMail / Chime | User accounts & configuration only |
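The division of labor in the table above can be sketched as a small lookup from service model to the layers you manage versus the layers the provider manages. The six-layer stack here is a simplification for illustration; real boundaries vary per service:

```python
# Which stack layers the customer manages under each service model.
# A simplification for illustration; real boundaries vary per service.

LAYERS = ["hardware", "virtualization", "os", "runtime", "application", "data"]

CUSTOMER_MANAGED = {
    "iaas": {"os", "runtime", "application", "data"},
    "paas": {"application", "data"},
    "saas": {"data"},  # you still own your data and its configuration
}

def split(model):
    """Return (customer_layers, provider_layers) for a service model."""
    mine = CUSTOMER_MANAGED[model]
    return ([l for l in LAYERS if l in mine],
            [l for l in LAYERS if l not in mine])

for model in ("iaas", "paas", "saas"):
    yours, theirs = split(model)
    print(f"{model.upper():4} you manage {yours}; provider manages {theirs}")
```

Reading the output top to bottom shows the pattern the rest of this page describes: moving down the list, the customer's share of the stack shrinks while the provider's grows.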
IaaS provides raw building blocks (virtual machines, storage, networking) with maximum flexibility and maximum responsibility.
You Manage
- Operating system & patches
- Runtime environment
- Middleware & frameworks
- Application code
- Data & backups
Provider Manages
- Physical hardware & data center
- Hypervisor & virtualization
- Network fabric & switches
- Hardware failure & replacement
When to use IaaS: you need full control (custom OS hardening, legacy apps, specific kernel tuning), or you're migrating on-prem workloads with minimal changes.
Deep dive: Amazon EC2 (the canonical IaaS service).

PaaS hands you a ready-to-code platform: the OS, runtime, and scaling are handled. You push code, the platform runs it.
You Manage
- Application code & logic
- Data & schemas
- Environment configuration
Provider Manages
- OS installation & patching
- Runtime & SDK versions
- Load balancing & scaling
- Infrastructure provisioning
When to use PaaS: you want to ship fast and don't need to tune the OS or runtime. Typical for new web apps, APIs, microservices, and event-driven functions.
SaaS delivers a fully managed application over the internet. Open a browser, log in, use it. The provider operates everything underneath.
You Manage
- User accounts & access control
- Application-level configuration
- Your own data (content)
Provider Manages
- Application code & features
- Runtime, OS, hardware
- Uptime, updates, security patches
- Data storage & backups
When to use SaaS: you need a capability (email, CRM, monitoring) and building it in-house isn't core business. Use the service, not the stack.
| Dimension | IaaS | PaaS | SaaS |
|---|---|---|---|
| Control level | High | Medium | Low |
| Your responsibility | OS, runtime, app, data | App & data only | Configuration & usage |
| Time to first deploy | Hours to days (infra setup) | Minutes to hours | Minutes (sign up) |
| Flexibility | Maximum: any OS, any config | Constrained by platform | Vendor's feature set only |
| Security ownership | You own most of the stack | Shared: infra secured by provider | Provider secures infra; you own data classification |
| AWS examples | EC2, S3, VPC | Lambda, Beanstalk, RDS | WorkMail, Chime, Amazon Connect |
| Myth | Reality |
|---|---|
| "PaaS removes all responsibility." | You still own your application code and data. If your code has a SQL injection, PaaS won't save you. |
| "IaaS is better because you have more control." | More control = more work. IaaS is right when you need that control โ not as a default. |
| "SaaS is only for non-technical users." | Teams use SaaS tools (GitHub, Datadog, Snowflake) for critical engineering workflows daily. |
| "These models are mutually exclusive." | Most architectures mix them. A SaaS app might use IaaS for compute, PaaS for its DB, and third-party SaaS for logging. |
- IaaS, PaaS, SaaS define how far up the stack the provider manages for you.
- IaaS (EC2, S3): maximum flexibility, you manage OS and above.
- PaaS (Lambda, Beanstalk, RDS): platform handled, you manage code & data.
- SaaS: fully managed app, you configure and use it.
- Most real architectures mix all three models.
- The right model is the least infrastructure you need to meet your requirements.
The higher the abstraction, the less you manage and the more the cloud provider handles. Pick the model that matches your acceptable responsibility level, not just your comfort zone.
Virtualization & Hypervisors
Virtualization allows a single physical machine to run multiple independent systems simultaneously, turning raw hardware into flexible, multi-tenant infrastructure.
Without virtualization, AWS could not run millions of isolated customer workloads on shared hardware. Every EC2 instance you launch is a virtual machine. Understanding how virtual machines work is understanding the foundation of cloud compute.
Before virtualization, applications ran directly on dedicated physical servers, a model known as bare-metal computing.
Traditional Setup
- One server, one application
- Hardware heavily underutilized (10-20% of capacity typical)
- Scaling meant buying & racking new physical machines
- Deployment cycles measured in weeks
The Growing Problem
- Data centers ballooned in size and cost
- Managing thousands of heterogeneous servers was a nightmare
- Peak load required dedicated hardware sitting idle the rest of the time
- Applications couldn't be easily moved between machines
IBM pioneered virtualization in the 1960s on mainframes. It became mainstream in the 2000s when VMware brought it to commodity x86 hardware, and it became the bedrock of modern cloud infrastructure.
Low Hardware Utilization
- Servers idling at ~15% CPU; with virtualization, that same machine runs 10+ VMs at high utilization
High Infrastructure Cost
- One physical machine per app is expensive. VMs let you pack many workloads onto the same hardware
Scaling Difficulty
- Adding capacity used to mean a hardware procurement cycle. VMs can be spun up in seconds
Lack of Isolation
- Without VMs, one misbehaving app could crash others. VMs give hard process and memory boundaries
A hypervisor sits between physical hardware and virtual machines, dividing resources and isolating each VM from the others.
Physical Host
- Real CPU, RAM, disk, NIC
- The actual hardware in the data center
- "Host" in virtualization vocabulary
Hypervisor
- Software layer managing VMs
- Allocates CPU slices, RAM, disk I/O
- Enforces isolation between VMs
Virtual Machine (VM)
- Full OS + applications inside a software envelope
- Sees virtualized hardware (vCPU, vRAM, vDisk)
- "Guest" in virtualization vocabulary
Think of a physical server as an apartment building:
| Real World | Virtualization |
|---|---|
| The building itself | Physical server (CPU, RAM, disk) |
| Each individual apartment | Virtual machine (isolated OS + apps) |
| Building manager | Hypervisor (allocates space, enforces rules) |
| Tenants sharing the building | Multiple VMs sharing hardware |
| Locked apartment doors | VM isolation: one VM can't see another's memory |
| Utilities (water, power, internet) | Shared hardware resources (CPU cycles, RAM, network) |
Each tenant has their own space and doesn't interfere with neighbours โ even though they share the same building's infrastructure.
There are two classes of hypervisor, differing in where they sit relative to the host OS:
- No host OS between hypervisor and hardware
- Lower overhead, better performance
- Used in production & cloud data centers
- Examples: VMware ESXi, Microsoft Hyper-V, AWS Nitro, Xen
- Host OS layer between hypervisor and hardware
- Easier to install; popular for dev & testing
- Higher overhead than Type 1
- Examples: VirtualBox, VMware Workstation, Parallels
| Aspect | Type 1 (Bare-metal) | Type 2 (Hosted) |
|---|---|---|
| Sits on | Hardware directly | Host operating system |
| Performance | High: minimal overhead | Lower: extra OS layer |
| Security isolation | Strong | Weaker (host OS is attack surface) |
| Primary use | Production clouds, data centers | Developer laptops, testing |
| Cloud relevance | This is what AWS uses | Not used in cloud providers |
Cloud Providers
- AWS, Azure, GCP run billions of VMs
- Multi-tenancy is only possible with hypervisor isolation
Enterprise Data Centers
- Server consolidation: 10 physical servers → 1 host with 10 VMs
- Live migration for zero-downtime maintenance
Developer Environments
- Run different OSes on one laptop (VirtualBox, Parallels)
- Reproducible testing across OS versions
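The consolidation claim above ("10 physical servers → 1 host") is simple arithmetic on utilization. A back-of-envelope sketch, where every figure (core counts, utilization targets) is an illustrative assumption:

```python
# Back-of-envelope server consolidation math.
# All figures (core counts, utilization levels) are illustrative assumptions.

def vms_per_host(host_cores=64, host_target_util=0.80,
                 old_cores=16, old_util=0.15):
    """How many lightly-used bare-metal workloads fit on one virtualized host."""
    usable = host_cores * host_target_util   # effective cores allowed on the host
    per_vm = old_cores * old_util            # cores each old server actually used
    return int(usable // per_vm)

capacity = vms_per_host()
print(f"One host absorbs up to {capacity} such workloads")
```

With these assumed numbers, ten one-app-per-server machines fit on a single host with headroom to spare, which is exactly why consolidation collapsed data-center footprints.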
Every Amazon EC2 instance is a virtual machine. When you click "Launch instance" in the AWS console, the hypervisor on a physical server in an AWS data center carves out a VM for you in seconds.
AWS Nitro Hypervisor
- AWS's custom Type-1 hypervisor (based on KVM)
- Offloads I/O to dedicated Nitro cards (NVMe, networking)
- Near bare-metal performance, with almost no overhead
- Released 2017; now powers all modern EC2 instances
Isolation Guarantee
- Each customer's VMs are isolated from others on the same host
- Memory is scrubbed between customers
- Nitro Controller enforces hardware-level security boundaries
- Basis of AWS's multi-tenant security model
When you launch an EC2 instance, AWS:
- Selects a physical host with spare capacity in the chosen AZ
- Nitro hypervisor allocates the requested vCPUs, RAM, and EBS/NVMe storage
- The instance boots your selected AMI (Amazon Machine Image โ OS snapshot)
- Your VM is fully isolated from every other customer on that same physical host
| Myth | Reality |
|---|---|
| "VMs are fake computers." | VMs behave like real machines. They have full OS control, networking, storage, and can run any software a physical machine can. |
| "Each VM gets its own dedicated hardware." | Resources are shared and scheduled by the hypervisor. CPU time is multiplexed; RAM is allocated but pooled across the host. |
| "Virtualization only exists in the cloud." | Virtualization existed in data centers and developer machines for decades before cloud. Cloud added APIs, billing, and scale on top. |
| "Containers are the same as VMs." | Containers share the host OS kernel; VMs include a full guest OS. VMs are stronger isolation; containers are lighter-weight. |
| "The hypervisor adds no overhead." | Modern hypervisors (especially AWS Nitro) are near-zero overhead โ but there's always a small cost for resource scheduling and isolation enforcement. |
- Virtualization runs multiple independent VMs on a single physical machine.
- The hypervisor manages resource allocation and enforces isolation between VMs.
- Type 1 (bare-metal) hypervisors run directly on hardware; used by all cloud providers.
- Type 2 (hosted) hypervisors run on a host OS; used for developer machines.
- Amazon EC2 instances are VMs powered by AWS's custom Nitro hypervisor.
- Virtualization enables the multi-tenancy, scalability, and isolation that make cloud economically viable.
Virtualization is what turns physical hardware into flexible, scalable cloud infrastructure; every EC2 instance you launch is a VM created by a hypervisor in seconds.
AWS Global Infrastructure: Regions & Availability Zones
Cloud computing is not just about what runs; it's also about where it runs. AWS's global infrastructure lets applications operate across multiple geographic locations, ensuring high availability, low latency, and fault isolation.
Before picking a single service in AWS you answer one question: which region? That choice determines latency for your users, data sovereignty compliance, and what disaster recovery options you have. This page explains the geography under every AWS workload.
Traditional Architecture
- Applications ran in a single data center
- One data center failure = total outage
- Global reach required building & operating DCs in each country
- Disaster recovery was expensive and rarely tested
The Problems
- High operational cost of each additional DC
- Complex network inter-connects between owned facilities
- Users far from the DC experienced high latency
- Regulatory/data-residency compliance was manual
Cloud providers built globally distributed infrastructure to solve these problems at scale โ letting any customer get multinational reach without owning a single building.
Single Point of Failure
- Multi-AZ and multi-region deployments mean no single location failure can take down a properly designed system
Global Latency
- Deploy to the region closest to your users and shave 100+ ms off response times for overseas traffic
Data Sovereignty
- Keep data inside a specific country or continent to comply with GDPR, PDPA, or domestic regulations
Disaster Recovery
- Replicate workloads across regions; if one is unavailable, traffic fails over automatically
AWS global infrastructure has three nested layers: Regions → Availability Zones → Edge Locations. Each layer adds a dimension of resilience and performance.
Region
- A named geographic area (e.g., us-east-1, ap-southeast-1)
- Completely independent: failures don't cross region boundaries
- Contains at least 3 Availability Zones
- Most AWS services are region-scoped
Availability Zone (AZ)
- One or more discrete data centers within a region
- Physically separate (kilometres apart), with different power, cooling, networking
- Connected by low-latency private fiber (<2ms between AZs)
- Named us-east-1a, us-east-1b, etc.
Edge Location
- Points of Presence (PoPs) distributed in 90+ cities globally
- Used by CloudFront CDN, Route 53, AWS Shield
- Caches content and performs DNS resolution close to end users
- Not for running EC2/RDS; for delivery & caching only
Think of AWS infrastructure as a global network of cities:
| Real World | AWS Infrastructure | Example |
|---|---|---|
| Country / Continent | Global AWS infrastructure | All of AWS worldwide |
| City | Region | Singapore (ap-southeast-1) |
| Building district in the city | Availability Zone | ap-southeast-1a, ap-southeast-1b |
| Local post office / delivery hub | Edge Location | CloudFront PoP in Mumbai |
| Fire in one building doesn't spread to others | AZ failure isolation | AZ-a down; AZ-b & AZ-c keep running |
You choose a region (us-east-1, ap-southeast-1, etc.) based on user proximity and compliance.
E-Commerce
- Multi-AZ RDS for zero-downtime DB failover
- Auto Scaling groups span 3 AZs
- CloudFront for product images & static assets
Streaming Platforms
- Origin in one region, CloudFront PoPs globally
- S3 as origin for the CDN: 99.999999999% durability
- Route 53 latency routing for API calls
Financial Systems
- Active-active multi-region for RPO/RTO near zero
- Data replicated synchronously within region, async across
- Data sovereignty enforced by region choice
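Latency-based routing, as mentioned for Route 53 above, boils down to picking the region with the smallest measured round-trip time for each client. A toy sketch; the latency figures are invented placeholders:

```python
# Pick the lowest-latency region for a client, in the spirit of
# latency-based routing. Latency figures are invented placeholders.

MEASURED_LATENCY_MS = {   # client -> region round-trip times (assumed)
    "us-east-1": 182,
    "eu-west-1": 141,
    "ap-southeast-1": 12,  # this example's client is near Singapore
}

def nearest_region(latencies):
    """Return the region with the smallest measured round-trip time."""
    return min(latencies, key=latencies.get)

print(nearest_region(MEASURED_LATENCY_MS))
```

Real DNS-level routing layers health checks on top of this minimum, so an unhealthy nearby region is skipped in favor of the next-closest healthy one.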
Every service you use in AWS has a geographic scope. Knowing the scope tells you what happens during a failure:
| Service | Scope | What this means |
|---|---|---|
| Amazon EC2 | AZ-level | An instance lives in one AZ. Deploy in multiple AZs for HA. |
| Amazon RDS Multi-AZ | Region (spans AZs) | Primary in one AZ, standby in another. Automatic failover <60s. |
| Amazon S3 | Region (stored across ≥ 3 AZs) | Eleven 9s durability; survives any single AZ failure. |
| Elastic Load Balancer | Region (nodes in each AZ) | Distributes traffic across AZs automatically. |
| Amazon CloudFront | Global (Edge Locations) | Caches at 600+ PoPs, as close as possible to the end user. |
| Amazon Route 53 | Global | DNS with health checks; routes around failures automatically. |
| IAM | Global | Not region-specific; one IAM policy applies everywhere. |
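The reason multi-AZ deployment matters is basic probability: independent replicas are all down at once far less often than any single one. A quick sketch, assuming independent failures and an illustrative 99.9% per-AZ availability (both are assumptions, not AWS SLA figures):

```python
# Combined availability of n replicas: the system is down only when every
# replica is down at once. Assumes independent failures, which physically
# separated AZs approximate; the 99.9% per-AZ figure is an assumption.

def combined_availability(per_az=0.999, n_azs=1):
    return 1 - (1 - per_az) ** n_azs

for n in (1, 2, 3):
    print(f"{n} AZ(s): {combined_availability(n_azs=n):.6%} available")
```

Each added AZ multiplies the unavailability by the single-AZ failure rate, which is why going from one AZ to two is such a large reliability jump.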
| Myth | Reality |
|---|---|
| "A Region is a single data center." | A region contains at least 3 physically separate Availability Zones, each of which can be multiple data centers. |
| "One AZ is enough for high availability." | Single-AZ is a Single Point of Failure. AWS's SLAs for multi-AZ services assume you're using multiple AZs. |
| "AZs are just different rooms in one building." | AZs are kilometres apart, on separate power grids with separate networking. A natural disaster or power outage affecting one AZ will not affect another. |
| "Multi-region is always required." | Most applications only need multi-AZ. Multi-region is for disaster recovery and global latency โ it adds real operational complexity and cost. |
| "Edge Locations are the same as AZs." | Edge Locations only run CloudFront, Route 53, and Shield. You cannot deploy EC2 or databases there. They are delivery nodes, not compute regions. |
- AWS infrastructure has three layers: Regions → AZs → Edge Locations.
- A Region is an independent geographic area; data stays there unless you replicate it out.
- An AZ is one or more separate data centers in a region, connected by low-latency fiber.
- Edge Locations are CloudFront PoPs, for delivery and caching, not compute.
- Always deploy across at least 2 AZs. Use 3 for production workloads.
- Multi-region is optional; use it for DR requirements or global latency-sensitive apps.
- Understanding geographic scope (AZ / Region / Global) is required to reason about any AWS service's failure modes.
High availability in the cloud comes from distributing systems across multiple Availability Zones, and optionally across Regions for disaster recovery. Geography is an architectural decision, not an afterthought.
Shared Responsibility Model
The Shared Responsibility Model answers one foundational question:
Who is responsible for what in the cloud?
It's the most important security concept to internalise before you use a single AWS service. Misunderstanding it is the source of the majority of real-world cloud security incidents: not because AWS failed, but because the customer didn't know what they needed to secure.
On-Premise: Full Ownership
- You own every layer: physical rack, OS, network, app, data
- Full control = full accountability
- Security team patches hardware, applies firmware, monitors everything
- Expensive, but the responsibility boundary is clear: it's all yours
Cloud: The New Question
- AWS manages the data center, hardware, and hypervisor
- But where does AWS's job end and yours begin?
- The answer differs by service type (IaaS vs PaaS vs SaaS)
- Without a model, gaps form, and attackers exploit them
Responsibility is not eliminated by moving to the cloud; it is shared and redistributed depending on which services you use.
Security Gaps
- When nobody knows who owns a layer, nobody secures it. The model eliminates ambiguity
Unclear Accountability
- After a breach: "Was it AWS or us?" The model gives a precise answer for any incident
Misconfiguration Risk
- Public S3 buckets, open security groups, unencrypted data: all user-layer problems the model flags as your responsibility
Compliance Clarity
- Auditors ask "who controls what?"; the model gives you the exact answer for your compliance documentation
AWS is responsible for security of the cloud: the physical and virtual infrastructure. You are responsible for security in the cloud: what you deploy and configure on top of it.
- Customer data (encryption at rest & in transit)
- Identity & Access Management (IAM users, roles, policies)
- Operating system on EC2 (patches, hardening)
- Application code & runtime configuration
- Network & firewall rules (Security Groups, NACLs)
- Client-side encryption & data integrity
- Physical data center security (guards, biometrics, CCTV)
- Hardware (servers, storage, networking equipment)
- Host operating system & virtualization layer (Nitro)
- Global network infrastructure (fibre, routers, DDoS)
- Managed service software (RDS DB engine, Lambda runtime)
- Availability Zone & region fault isolation design
- AWS hardware & global infrastructure compliance (SOC 2, ISO 27001)
Think of cloud infrastructure as a secure apartment building:
| Building (AWS) | Apartment (Your workload) |
|---|---|
| Guards at the front entrance | You lock your apartment door |
| Secured lifts and common areas | You close your windows |
| Electricity and utilities managed | You control who has your key |
| Building structure maintained | You keep your own space tidy |
| CCTV on the street outside | You configure your own alarm inside |
The majority of cloud security incidents fall on the customer side of the boundary. Common root causes:
Public S3 Buckets
- AWS provides the bucket; you set the ACL
- Misconfigured public access has exposed millions of records
- Fix: S3 Block Public Access + bucket policies
Exposed IAM Keys
- AWS secures the IAM service; you manage the keys
- Hardcoded credentials in GitHub repos are a user error
- Fix: IAM roles, Secrets Manager, no long-lived keys
Unpatched EC2
- AWS provides the hypervisor and hardware; you patch the OS
- EC2 instances running 6-month-old kernels are your problem
- Fix: Systems Manager Patch Manager, IMDSv2
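All three failure modes above live in configuration the customer controls, so they can be caught by automated checks. A toy audit sketch over a hand-rolled config dict; the schema and field names are invented for illustration, not a real AWS API or data format:

```python
# Toy customer-side misconfiguration checker. The resource schema below is
# invented for illustration; it is not a real AWS API or data format.

def audit(resource):
    """Return a list of findings for one resource description."""
    findings = []
    if resource.get("type") == "s3_bucket" and resource.get("public_access"):
        findings.append("Public S3 bucket (enable Block Public Access)")
    if resource.get("hardcoded_keys"):
        findings.append("Hardcoded IAM keys (use roles / Secrets Manager)")
    if resource.get("type") == "ec2_instance" and resource.get("os_patch_age_days", 0) > 30:
        findings.append("Stale OS patches (use Patch Manager)")
    return findings

resources = [
    {"type": "s3_bucket", "public_access": True},
    {"type": "ec2_instance", "os_patch_age_days": 180, "hardcoded_keys": False},
]

for r in resources:
    for finding in audit(r):
        print(f"[{r['type']}] {finding}")
```

Real-world equivalents of this loop (config scanners, policy-as-code checks) exist precisely because these checks are on the customer's side of the responsibility boundary.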
| Service | AWS Secures | You Secure |
|---|---|---|
| Amazon EC2 | Host OS, hypervisor, hardware, data center | Guest OS patches, Security Groups, IAM instance profile, app code |
| Amazon S3 | Storage infrastructure, 11-9s durability, hardware redundancy | Bucket policies, ACLs, Block Public Access, KMS encryption, versioning |
| Amazon RDS | DB engine installation, OS patching, hardware, Multi-AZ failover | DB users & passwords, security group rules, parameter groups, data encryption |
| AWS Lambda | Runtime, underlying infra, function isolation, scaling | Function code, execution role (IAM), environment variable secrets |
| Amazon VPC | Physical network, transit infrastructure | Subnets, route tables, Security Groups, NACLs, internet gateway configs |
| IAM | IAM service availability | Every policy, role, user, group, permission boundary โ entirely yours |
As you move from IaaS → PaaS → SaaS, your security surface shrinks, but it never disappears:
| Myth | Reality |
|---|---|
| "AWS handles all security." | AWS secures the infrastructure. Your applications, IAM policies, and data configurations are entirely your responsibility. |
| "If data is in the cloud, it's automatically safe." | Data safety depends on your encryption, access controls, and logging config. Misconfigured S3 buckets with sensitive data have caused massive real-world breaches. |
| "Using managed services removes responsibility." | PaaS and SaaS reduce your attack surface โ they don't eliminate it. You still own your data, IAM roles, and application logic. |
| "AWS compliance certifications cover my workload." | AWS's SOC 2, ISO 27001, etc. cover their infrastructure. For your workload to be compliant, you must implement the required controls on your side of the boundary. |
| "The boundary is always the same." | The boundary shifts with the service model. On EC2 (IaaS) you own the OS. On RDS (PaaS) you don't. The model must be evaluated per service. |
- Cloud security is shared: AWS protects the infrastructure, you protect what you build on top.
- AWS is responsible for security of the cloud: hardware, data centers, hypervisor, global network.
- You are responsible for security in the cloud: OS, data, IAM, application code, network configs.
- The boundary shifts by service model: on EC2 you own the OS; on RDS you don't.
- Most real-world cloud security incidents are customer-side failures: public S3 buckets, exposed keys, unpatched OS.
- AWS compliance certs cover AWS's side. Your workload compliance is your job.
- Higher abstraction (PaaS/SaaS) reduces your surface, but never to zero.
Cloud security is a shared effort: AWS secures the foundation, but you are fully responsible for what you build, configure, and deploy on top of it. Never assume the cloud provider handles it all.
Cloud Design Principles & Well-Architected Framework
Building in the cloud isn't just about picking services; it's about designing systems that are reliable, scalable, secure, and cost-efficient under real-world conditions.
Any engineer can launch an EC2 instance. Far fewer can design a system that handles 10× the expected traffic, survives an AZ failure, stays within budget, and keeps operations teams from being paged at 3 a.m. That's what cloud design principles enable.
Traditional System Design
- Tightly coupled monoliths: one failure, total outage
- Scaling meant buying bigger hardware
- Manual operations, slow deployments
- No standard vocabulary for "good design"
Cloud Without Principles
- Teams reinvent the wheel and make the same mistakes
- Architectures grow organically: brittle, expensive, hard to change
- Security bolted on after the fact
- Costs spiral because nobody owns them
In 2015, AWS published the Well-Architected Framework, a structured set of guidance for evaluating and improving cloud architectures across five dimensions. It's now the industry-standard vocabulary for cloud design.
Fragile Systems
- Without reliability principles, a single component failure cascades. Design-for-failure patterns break the cascade
Inefficient Scaling
- Vertical scaling hits walls. Horizontal scaling with decoupled components is the cloud-native approach
Unexpected Costs
- Over-provisioned resources, always-on dev environments, missing auto-scaling: cost optimization principles address all of them
Security Gaps
- Security as an afterthought leaves holes. Security pillar principles bake it into the design from day one
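The design-for-failure idea above (stopping one component's failure from cascading) is often implemented as a circuit breaker. Here is a minimal in-process sketch; the class name, thresholds, and error messages are invented for illustration:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after `max_failures` consecutive
    errors the circuit opens and calls fail fast, breaking the cascade.
    After `reset_after` seconds it half-opens and allows one trial call."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Failing fast while a dependency is down costs one quick exception instead of a pile-up of threads waiting on timeouts, which is exactly how cascades start.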
The AWS Well-Architected Framework evaluates architectures across five pillars: Reliability · Performance Efficiency · Security · Cost Optimization · Operational Excellence. A well-architected system balances all five.
No pillar dominates the others. A system that is perfectly reliable but astronomically expensive is not well-architected. The framework forces you to evaluate trade-offs explicitly rather than optimising one dimension in ignorance of the rest.
Think of cloud architecture like planning a modern city:
| City Planning | Cloud Architecture | Well-Architected Pillar |
|---|---|---|
| 🚦 Traffic management & road redundancy | Multi-AZ load balancing, circuit breakers | Reliability |
| ⚡ Power grid that scales with population | Auto Scaling, serverless compute | Performance Efficiency |
| 🔒 Locks, CCTV, access zones in buildings | IAM least privilege, encryption, VPC isolation | Security |
| 💡 Utilities metered: pay for what you use | Right-sizing, spot instances, savings plans | Cost Optimization |
| 🛠️ City maintenance crews & alert systems | CloudWatch, runbooks, automated remediation | Operational Excellence |
High-Traffic Web Apps
- Multi-AZ ALB + Auto Scaling (Reliability)
- CloudFront for global latency (Performance)
- WAF on the load balancer (Security)
Microservices
- Decoupled via SQS/SNS (Reliability)
- Independent scaling per service (Performance)
- Service-specific IAM roles (Security)
Data Pipelines
- S3 checkpointing for fault tolerance (Reliability)
- Spot instances for batch jobs (Cost)
- VPC endpoints: no public internet (Security)
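The decoupling called out for microservices can be illustrated with an in-process queue standing in for SQS: producer and consumer share only the queue and never call each other directly, so either side can fail, restart, or scale independently. Names and message shapes are invented.

```python
import queue
import threading

# In-process stand-in for SQS. Real code would call send_message /
# receive_message against a queue URL; the decoupling idea is the same.
buffer = queue.Queue(maxsize=100)
processed = []

def producer(n):
    for i in range(n):
        buffer.put({"order_id": i})  # analogous to sqs.send_message

def consumer():
    while True:
        msg = buffer.get()           # analogous to sqs.receive_message
        if msg is None:              # sentinel: shut down cleanly
            break
        processed.append(msg["order_id"])
        buffer.task_done()

worker = threading.Thread(target=consumer)
worker.start()
producer(5)
buffer.put(None)
worker.join()
print(processed)
```

If the consumer crashes mid-run, the messages simply wait in the queue; with SQS they would also survive the producer's host disappearing entirely.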
AWS provides first-party tooling to operationalise these principles:
AWS Well-Architected Tool
- Free from the AWS console
- Structured questionnaire per pillar
- Identifies High / Medium / Low risks
- Generates an improvement plan with AWS guidance links
- Can be run during design and post-deploy
AWS Well-Architected Partner Program
- AWS partners (consultants, SIs) can run formal reviews
- Architecture deep-dives per workload type
- Lenses available: SaaS, IoT, ML, Serverless, Analytics
- Result: signed-off architecture review document
A system's ability to recover from failures and continue to function correctly over time.
- Design with quotas and limits in mind
- Deploy across at least 2 AZs for every stateful component
- Use health checks + automatic failover (ELB, Route 53)
- Test recovery: chaos engineering, game days
- Backup data and test restores regularly
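Alongside health checks and failover, transient faults are usually absorbed with retries. A generic sketch of capped exponential backoff with full jitter, the same idea AWS SDKs apply to throttled or failed API calls (function and parameter names are invented):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, cap=2.0):
    """Retry a flaky call with capped exponential backoff plus jitter.
    Jitter spreads retries out so many clients don't hammer a
    recovering service in lockstep."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the real error
            delay = min(cap, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))  # full jitter
```

Without the cap and jitter, synchronized retries can themselves become the failure cascade the reliability pillar warns about.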
Selecting and using the right resources in the right amounts efficiently as requirements change.
- Use purpose-built compute (GPU for ML, memory-optimised for in-memory DBs)
- Cache aggressively at every layer (ElastiCache, CloudFront, DAX)
- Go serverless where you can: Lambda, Fargate, Aurora Serverless
- Re-evaluate instance types annually as AWS releases new generations
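"Cache aggressively" in miniature: a read-through cache with a time-to-live, the same idea ElastiCache, CloudFront, and DAX apply at scale. An in-process sketch with invented names:

```python
import time

class TTLCache:
    """Tiny read-through cache: serve repeated reads from memory and
    only hit the backend when the entry is missing or expired."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        hit = self.store.get(key)
        now = time.monotonic()
        if hit is not None and hit[1] > now:
            return hit[0]                       # cache hit: skip the backend
        value = loader(key)                     # cache miss: load and remember
        self.store[key] = (value, now + self.ttl)
        return value
```

The TTL is the performance/freshness trade-off in one number: longer TTLs mean fewer backend reads but staler data, which is exactly the dial you turn on CloudFront or ElastiCache.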
Protecting data, systems, and assets through risk assessments and mitigation strategies.
- Apply least privilege to every IAM entity; start with deny-all
- Enable CloudTrail, Config, GuardDuty in every account
- Encrypt everything: KMS for data at rest, TLS for data in transit
- Use VPC endpoints to keep traffic off the public internet
- Rotate credentials; eliminate long-lived access keys
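What least privilege looks like in practice: instead of `s3:*` on `*`, grant only the actions a job needs on the resources it touches. The bucket and prefix below are hypothetical; the sketch emits standard IAM policy JSON:

```python
import json

# Least-privilege sketch for a hypothetical read-only reporting job:
# two actions, one bucket, one prefix -- and nothing else.
job_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReportsReadOnly",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-reports",         # ListBucket target
                "arn:aws:s3:::example-reports/daily/*", # GetObject target
            ],
        }
    ],
}

print(json.dumps(job_policy, indent=2))
```

Everything not explicitly allowed is implicitly denied, so a leaked credential for this role can read one prefix, not ransack the account.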
Running workloads at the lowest price point without sacrificing performance or reliability.
- Right-size instances with Compute Optimizer recommendations
- Use Spot for fault-tolerant batch workloads (up to 90% savings)
- Purchase Savings Plans or Reserved Instances for steady-state workloads
- Shut down non-production environments outside business hours
- Delete unattached EBS volumes, stale snapshots, idle load balancers
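Back-of-envelope arithmetic shows why these levers matter. The hourly rate and the 70% Spot discount below are illustrative assumptions, not current AWS prices:

```python
# Hypothetical always-on instance vs. two of the levers above.
HOURS_PER_MONTH = 730
on_demand_rate = 0.10          # $/hour, made-up rate for illustration

always_on = on_demand_rate * HOURS_PER_MONTH
spot = always_on * (1 - 0.70)  # Spot discounts commonly run 60-90%
business_hours = on_demand_rate * 10 * 22   # 10 h/day, 22 weekdays

print(f"always-on on-demand: ${always_on:.2f}/mo")
print(f"spot (70% off):      ${spot:.2f}/mo")
print(f"office-hours only:   ${business_hours:.2f}/mo")
```

Even with made-up numbers, the shape holds: simply not running a dev box nights and weekends cuts its bill by roughly 70%, before any Spot or Savings Plan discount.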
Running and monitoring systems, and continually improving processes and procedures.
- Define everything as code: infrastructure, pipelines, runbooks
- Make small, reversible changes rather than infrequent, risky big-bang deploys
- Define and measure business KPIs in CloudWatch dashboards
- Run blameless post-mortems; capture learnings as action items
- Anticipate failure modes with game days and chaos experiments
- Deploy critical services across at least 2 Availability Zones
- Use Auto Scaling groups for all stateless compute tiers
- Enable Multi-AZ for all production databases (RDS, ElastiCache)
- Implement automated backups and test restores quarterly
- Apply least-privilege IAM with SCPs at the AWS Organizations level
- Encrypt data at rest (KMS) and in transit (TLS 1.2+) everywhere
- Use managed services (RDS, Lambda, SQS) over self-managed equivalents where possible
- Tag every resource: Owner, Environment, CostCenter, Application
- Set billing alarms and AWS Budgets for every account
- Run infrastructure from code (CloudFormation, CDK, Terraform)
- Enable CloudTrail, AWS Config, and GuardDuty in every region and account
- Conduct a formal Well-Architected Review before every major launch
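The tagging item on the checklist is easy to enforce in code. A sketch of the kind of check an AWS Config custom rule or a CI policy gate might run, using the tag keys from the checklist above:

```python
# Required tag keys from the checklist; extend to suit your organization.
REQUIRED_TAGS = {"Owner", "Environment", "CostCenter", "Application"}

def missing_tags(resource_tags):
    """Return the required tag keys a resource is missing, sorted so the
    report is stable. `resource_tags` is a dict of tag key -> value."""
    return sorted(REQUIRED_TAGS - set(resource_tags))
```

Wired into a deployment pipeline, a non-empty result blocks the rollout, which is how "tag every resource" stops being a wish and becomes a guarantee.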
| Myth | Reality |
|---|---|
| "Cloud automatically makes systems scalable." | Cloud gives you scalable primitives. A monolith deployed on EC2 with no Auto Scaling is not scalable just because it's in AWS. |
| "High performance always means high cost." | Not with the right design. Caching, CDN, right-sizing, and serverless often deliver better performance and lower cost than brute-force compute. |
| "Best practices are fixed rules to follow blindly." | The framework explicitly says: every principle involves trade-offs. A startup's MVP has different reliability requirements than a banking core system. |
| "More services = better architecture." | Complexity is a cost. Every additional service adds operational burden and potential failure points. The simplest architecture that meets requirements is the best architecture. |
| "The Well-Architected Framework only applies to large systems." | Even a personal project benefits from the principles. Cost optimization and security are relevant at any scale. |
- The AWS Well-Architected Framework provides five pillars: Reliability, Performance, Security, Cost Optimization, Operational Excellence.
- Good architecture balances all five; optimising one at the expense of others is an anti-pattern.
- Core design patterns: design for failure, decouple, scale horizontally, automate, observe, security-by-default.
- Use the AWS Well-Architected Tool (free in console) to formally evaluate your workloads.
- Architecture is continuous: review after launch, after incidents, and as AWS releases new services.
- Simplicity beats complexity: the best architecture is the simplest one that meets requirements.
Good cloud architecture isn't about using more services; it's about applying the right principles to design resilient, efficient, and cost-effective systems. The five pillars are your compass.