Introduction to Software Architecture
Software architecture defines the high-level structure of a system — the major components, their relationships, and the principles guiding their design and evolution.
Architecture is the set of significant design decisions about the organisation of a software system:
- Structure: The decomposition of the system into components/modules and their responsibilities.
- Communication: How components interact — synchronous (REST, gRPC) vs asynchronous (events, message queues).
- Trade-offs: Every architectural decision involves trade-offs — consistency vs availability, simplicity vs flexibility, performance vs maintainability.
- Evolution: Good architecture accommodates change. Systems are never "done" — they evolve with requirements.
The Role of the Architect
- Technical Leadership: Make and document key technical decisions. Guide the team on patterns, tools, and trade-offs.
- Communication: Bridge between business stakeholders and engineering teams. Translate requirements into architecture.
- Breadth over Depth: Understand a wide range of technologies at a conceptual level. Know when to go deep.
- Continuous Learning: Stay current with evolving patterns, cloud services, and industry practices.
- Pragmatism: Choose "good enough" over "perfect". Avoid over-engineering and analysis paralysis.
Architecture Fundamentals
Core architectural styles and patterns that form the foundation of system design.
Monolithic Architecture
- Single deployable unit: all components are compiled and deployed together.
- Advantages: Simple to develop, test, deploy, and debug. Low operational overhead. Good for small teams and MVPs.
- Disadvantages: Scaling requires scaling the whole app. Long build/deploy times as it grows. Tight coupling makes changes risky.
- When to use: Early-stage products, small teams, low complexity. Start monolithic, extract services when needed.
Microservices Architecture
The system is decomposed into small, independently deployable services, each owning its own data and business logic.
- Characteristics: Single responsibility per service, independent deployment, decentralised data management, technology heterogeneity.
- Communication: Sync (REST, gRPC) for queries; async (Kafka, RabbitMQ) for events and commands.
- Benefits: Independent scaling, isolated failures, team autonomy, technology flexibility.
- Challenges: Distributed system complexity (network failures, data consistency), operational overhead (monitoring, tracing, deployment), testing across service boundaries.
- Key Patterns: API Gateway, Service Discovery, Circuit Breaker, Saga, Event Sourcing, CQRS.
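Of the key patterns listed above, the Circuit Breaker is the easiest to show in a few lines. The following is a minimal sketch, not a production implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency calls are suspended")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and serve a fallback when `CircuitOpenError` is raised.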
Event-Driven Architecture
- Components communicate by producing and consuming events, which gives loose coupling and high scalability.
- Event Broker: Kafka, RabbitMQ, AWS SNS/SQS — decouples producers from consumers.
- Event Sourcing: Store state as a sequence of events rather than current state. Enables full audit trails and temporal queries.
- CQRS: Separate read and write models for different optimisation strategies — write-optimised command store, read-optimised query store.
- Challenges: Eventual consistency, event ordering, idempotency, debugging event flows.
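Event Sourcing and CQRS from the list above can be sketched together: the write side only appends events, and the read side folds them into a query-friendly view. This is a minimal in-memory sketch. The `Deposited` and `Withdrawn` events, the `EventStore` class, and the `project_balances` function are hypothetical names, and a real system would persist the log and update projections asynchronously.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deposited:
    account_id: str
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    account_id: str
    amount: int


class EventStore:
    """Append-only log: the write side stores events, never current state."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def stream(self, upto=None):
        # upto=None returns the full history; upto=n returns the first n events.
        return list(self._events[:upto])


def project_balances(events):
    """Read-side projection: fold the event stream into balances per account."""
    balances = {}
    for event in events:
        delta = event.amount if isinstance(event, Deposited) else -event.amount
        balances[event.account_id] = balances.get(event.account_id, 0) + delta
    return balances


store = EventStore()
store.append(Deposited("acc-1", 100))
store.append(Withdrawn("acc-1", 30))
store.append(Deposited("acc-2", 50))

current = project_balances(store.stream())
as_of_first_event = project_balances(store.stream(upto=1))  # temporal query
```

Because state is derived from the log, replaying a prefix of the events answers "what did this look like earlier?" without any extra bookkeeping.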
Layered Architecture
- Traditional n-tier approach: Presentation → Business Logic → Data Access → Database.
- Each layer depends only on the layer below — separation of concerns.
- Advantages: Simple, well-understood, easy to organise code for CRUD applications.
- Disadvantages: Tends toward tight coupling between layers. Changes often cascade through all layers.
- Clean Architecture: Invert dependencies — business logic at the centre, frameworks and databases at the edges. Dependency rule: inner layers never depend on outer layers.
Serverless Architecture
- Execute code in response to events without managing servers, using services such as AWS Lambda, Azure Functions, or Google Cloud Functions.
- Benefits: Zero server management, automatic scaling, pay-per-execution.
- Use cases: Event processing, scheduled tasks, webhooks, lightweight APIs, data transformation.
- Limitations: Cold starts (added latency whenever a new execution environment has to be initialised, not only on the very first invocation), execution time limits, statelessness, vendor lock-in.
- Patterns: API Gateway + Lambda, S3 event triggers, SQS + Lambda for async processing, Step Functions for orchestration.
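The "SQS + Lambda" pattern above can be sketched as a Python handler. The `handler(event, context)` signature and the `Records`, `body`, and `messageId` fields follow the shape of SQS events delivered to Lambda. `process_order` is a hypothetical stand-in for business logic. The `batchItemFailures` return value only has an effect if partial batch responses are enabled on the event source mapping, which is assumed here.

```python
import json


def process_order(order):
    """Hypothetical business logic; raises on invalid input."""
    if "order_id" not in order:
        raise ValueError("missing order_id")
    return order["order_id"]


def handler(event, context):
    """Entry point for an SQS-triggered function; `event` carries a batch of records."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_order(json.loads(record["body"]))
        except Exception:
            # Report only the failed message so the rest of the batch is not retried.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

The handler keeps no state between invocations, which matches the statelessness limitation listed above: anything that must survive has to live in an external store.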
Design Principles
Foundational principles that guide good software design across all architectural styles.
SOLID Principles
- S — Single Responsibility: A class should have only one reason to change. Separate concerns into focused classes.
- O — Open/Closed: Open for extension, closed for modification. Add behaviour via new classes/interfaces, not by changing existing code.
- L — Liskov Substitution: Subtypes must be substitutable for their base types without altering correctness.
- I — Interface Segregation: Prefer many specific interfaces over one general-purpose interface. Clients shouldn't depend on methods they don't use.
- D — Dependency Inversion: High-level modules depend on abstractions, not concrete implementations. The foundation of testability and flexibility.
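Dependency Inversion is the principle most directly tied to testability, so a short sketch may help. All names here (`Notifier`, `SmtpNotifier`, `RecordingNotifier`, `OrderService`) are hypothetical. The point is only that the high-level `OrderService` depends on an abstraction and receives its implementation from outside.

```python
from typing import Protocol


class Notifier(Protocol):
    """Abstraction the high-level module depends on."""

    def send(self, recipient: str, message: str) -> None: ...


class SmtpNotifier:
    """Production implementation (placeholder: prints instead of sending mail)."""

    def send(self, recipient: str, message: str) -> None:
        print(f"emailing {recipient}: {message}")


class RecordingNotifier:
    """Test double that records messages instead of sending them."""

    def __init__(self):
        self.sent = []

    def send(self, recipient: str, message: str) -> None:
        self.sent.append((recipient, message))


class OrderService:
    """High-level policy: knows nothing about SMTP or any other transport."""

    def __init__(self, notifier: Notifier):
        self._notifier = notifier

    def place_order(self, customer: str, item: str) -> dict:
        self._notifier.send(customer, f"Order confirmed: {item}")
        return {"customer": customer, "item": item, "status": "placed"}


fake = RecordingNotifier()
order = OrderService(fake).place_order("ada@example.com", "keyboard")
```

Swapping `RecordingNotifier` for `SmtpNotifier` requires no change to `OrderService`, which is also an instance of the Open/Closed principle.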
ACID (Traditional Databases)
- Atomicity: All or nothing — transactions complete fully or roll back entirely.
- Consistency: Database moves from one valid state to another — constraints are always satisfied.
- Isolation: Concurrent transactions don't interfere with each other. How strictly this holds depends on the configured isolation level (e.g., read committed vs serializable).
- Durability: Committed data survives crashes — written to non-volatile storage.
BASE (Distributed Systems)
- Basically Available: System guarantees availability (possibly with stale data).
- Soft State: State may change over time even without new input (due to eventual consistency).
- Eventual Consistency: Given enough time, all replicas converge to the same state.
Use ACID for financial transactions and inventory. Use BASE for social feeds, analytics, and caching.
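To make the ACID atomicity and consistency guarantees concrete, here is a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` function, and the account names are assumptions for illustration. The second transfer violates a `CHECK` constraint halfway through, and the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    # `with conn` commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


transfer(conn, "alice", "bob", 60)          # succeeds: alice 40, bob 60
try:
    transfer(conn, "alice", "bob", 60)      # would overdraw alice: CHECK fails
except sqlite3.IntegrityError:
    pass                                    # the credit to bob was rolled back too
```

After the failed transfer the balances are unchanged from the first one, so no money was created or lost even though the credit statement had already run.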
CAP Theorem
In a distributed system, you can guarantee at most two of three properties simultaneously:
- Consistency: Every read receives the most recent write.
- Availability: Every request receives a response (not necessarily the latest data).
- Partition Tolerance: System continues operating despite network partitions between nodes.
Since network partitions are inevitable in distributed systems, the real choice is between CP (consistency + partition tolerance — e.g., ZooKeeper, HBase) and AP (availability + partition tolerance — e.g., Cassandra, DynamoDB).
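PACELC extension: If there is a Partition, choose between Availability and Consistency; Else, during normal operation, choose between Latency and Consistency. E.g., DynamoDB is commonly classified as PA/EL (available during a partition, low latency otherwise).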
GRASP (General Responsibility Assignment Software Patterns)
A guide for assigning responsibilities to classes.
- Information Expert: Assign responsibility to the class that has the information needed to fulfil it.
- Creator: Assign object creation to the class that contains, aggregates, or closely uses the created object.
- Controller: Assign system event handling to a non-UI class that represents the use case or session.
- Low Coupling: Minimise dependencies between classes — easier to change, test, and reuse.
- High Cohesion: Keep related responsibilities together in one class — focused, understandable modules.
- Polymorphism: Use polymorphism to handle type-based alternatives instead of conditionals.
- Indirection: Assign responsibility to an intermediate object to decouple components.
- Pure Fabrication: Create a class that doesn't represent a domain concept but achieves low coupling and high cohesion (e.g., a Repository class).
Domain-Driven Design (DDD)
An approach to modelling complex business domains through collaboration between developers and domain experts.
- Ubiquitous Language: Shared vocabulary between developers and business — used in code, documentation, and conversations.
- Bounded Context: Explicit boundary within which a particular domain model applies. Different contexts may use the same term differently (e.g., "Order" in Sales vs Shipping).
- Entities: Objects with identity that persists across state changes (e.g., a User).
- Value Objects: Immutable objects defined by their attributes, not identity (e.g., Money, Address).
- Aggregates: Cluster of entities treated as a single unit for data changes. One entity is the Aggregate Root.
- Domain Events: Significant occurrences in the domain (e.g., OrderPlaced, PaymentReceived).
- Repository: Abstraction for accessing aggregates from storage. It hides persistence details from the domain.
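The tactical building blocks above map naturally onto small classes. The sketch below is illustrative only: `Money` is a value object, `Order` is an aggregate root that guards its own invariants, and `OrderPlaced` is a domain event. The field names and the two invariants are assumptions, not rules from the source.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Money:
    """Value object: immutable and compared by attributes, not identity."""
    amount: int      # minor units, e.g. cents
    currency: str

    def __add__(self, other):
        if other.currency != self.currency:
            raise ValueError("currency mismatch")
        return Money(self.amount + other.amount, self.currency)


@dataclass(frozen=True)
class OrderPlaced:
    """Domain event recorded when an order is placed."""
    order_id: str
    total: Money


@dataclass
class Order:
    """Aggregate root: all changes to order lines go through its methods."""
    order_id: str
    currency: str = "EUR"
    lines: list = field(default_factory=list)
    placed: bool = False
    events: list = field(default_factory=list)

    def add_line(self, sku, price):
        if self.placed:
            raise ValueError("cannot modify a placed order")
        self.lines.append((sku, price))

    def total(self):
        result = Money(0, self.currency)
        for _, price in self.lines:
            result = result + price
        return result

    def place(self):
        if not self.lines:
            raise ValueError("cannot place an empty order")
        self.placed = True
        self.events.append(OrderPlaced(self.order_id, self.total()))


order = Order("o-1")
order.add_line("sku-1", Money(1000, "EUR"))
order.add_line("sku-2", Money(500, "EUR"))
order.place()
```

A repository would load and save whole `Order` aggregates and publish the collected `events` after a successful save.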
Designing Systems
Practical patterns and strategies for building scalable, reliable, and maintainable systems.
Scaling Strategies
- Vertical Scaling (Scale Up): More CPU, RAM, SSD on a single machine. Simple but has limits.
- Horizontal Scaling (Scale Out): Add more machines behind a load balancer. Requires stateless design.
- Database Scaling:
  - Read Replicas: Route reads to replicas, writes to primary.
  - Sharding: Partition data across multiple databases by key (e.g., user ID modulo shard count).
  - Caching: Redis/Memcached to offload frequently accessed data from the database.
- Auto-Scaling: Cloud-based — scale instances based on CPU, memory, or custom metrics.
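The "user ID modulo shard count" rule mentioned under sharding can be sketched as follows. The function names are hypothetical. For string keys the sketch uses SHA-256 rather than Python's built-in `hash()`, because string hashing in Python is randomised per process and would route the same key differently after a restart.

```python
import hashlib


def shard_for_user_id(user_id: int, shard_count: int) -> int:
    """Numeric keys: plain modulo, as in the example above."""
    return user_id % shard_count


def shard_for(key: str, shard_count: int) -> int:
    """String keys: derive a stable integer from a hash, then take the modulo."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


keys = [f"user-{i}" for i in range(1000)]

# Count how many keys land on a different shard when growing from 4 to 5 shards.
moved = sum(shard_for(k, 4) != shard_for(k, 5) for k in keys)
```

With modulo placement most keys move when the shard count changes, which is why resharding is expensive and why schemes such as consistent hashing are often used instead.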
Load Balancing
- Load Balancer: Distributes traffic across healthy instances. Algorithms: round-robin, least connections, IP hash.
  - Layer 4 (TCP): Fast, protocol-agnostic. Can't inspect HTTP content.
  - Layer 7 (HTTP): Content-based routing, SSL termination, header manipulation.
- API Gateway: Single entry point for microservices — routing, authentication, rate limiting, request transformation, response aggregation.
- Health Checks: Periodic probes to remove unhealthy instances from the pool.
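Two of the algorithms named above, combined with health checks, fit in a short sketch. The `LoadBalancer` class and its method names are hypothetical, and the health status is set by hand here. A real balancer would update it from periodic probes.

```python
import itertools


class LoadBalancer:
    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self.active = {b: 0 for b in self.backends}   # open connections per backend
        self._rr = itertools.cycle(self.backends)

    def record_health(self, backend, ok):
        """Called with the result of a health probe."""
        if ok:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def pick_round_robin(self):
        # Try each backend at most once per pick, skipping unhealthy ones.
        for _ in range(len(self.backends)):
            backend = next(self._rr)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")

    def pick_least_connections(self):
        candidates = [b for b in self.backends if b in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy backends")
        return min(candidates, key=lambda b: self.active[b])


lb = LoadBalancer(["a", "b", "c"])
lb.record_health("b", False)                       # probe failed: take "b" out
picks = [lb.pick_round_robin() for _ in range(4)]
```

Round-robin ignores how busy each instance is, while least connections adapts to uneven request durations at the cost of tracking connection counts.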
Caching Strategies
- Cache-Aside (Lazy Loading): App checks cache first → on miss, loads from DB and populates cache. Most common pattern.
- Write-Through: App writes to the cache and the DB synchronously as part of the same operation. Keeps the cache consistent with the DB, at the cost of slower writes.
- Write-Behind (Write-Back): App writes to cache only → cache asynchronously flushes to DB. Fast writes but risk of data loss.
- TTL (Time-to-Live): Cache entries expire after a set duration. Balances freshness vs hit rate.
- Eviction Policies: LRU (Least Recently Used), LFU (Least Frequently Used), FIFO.
- CDN: Cache static assets at edge locations — reduces latency for global users.
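Cache-aside, TTL, and LRU eviction from the list above combine into a compact sketch. `TTLCache`, `get_user`, and `load_from_db` are hypothetical names, and the cache is in-process. With Redis or Memcached the same flow would use the client's get and set calls instead.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small in-process cache with TTL expiry and LRU eviction."""

    def __init__(self, max_entries=128, ttl_seconds=60.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._data[key]      # expired: treat as a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self.clock() + self.ttl_seconds, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry


def get_user(user_id, cache, load_from_db):
    """Cache-aside: check the cache, fall back to the database, then populate."""
    user = cache.get(user_id)
    if user is None:  # a miss (this sketch cannot cache a literal None)
        user = load_from_db(user_id)
        cache.put(user_id, user)
    return user
```

A shorter TTL gives fresher data but more database reads, which is the freshness versus hit-rate trade-off noted above.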
Data Consistency
- Strong Consistency: Every read returns the latest write. Simple to reason about but limits availability and performance.
- Eventual Consistency: Reads may return stale data temporarily. Trades consistency for availability and latency.
- Saga Pattern: Sequence of local transactions with compensating actions on failure. Replaces distributed transactions in microservices.
- Two-Phase Commit (2PC): Coordinator ensures all participants commit or abort. Strong guarantees, but participants block while waiting on the coordinator, so it is usually avoided across microservice boundaries.
- Outbox Pattern: Write event to an "outbox" table in the same DB transaction. A separate process publishes events — guarantees atomicity between state change and event publishing.
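The Saga pattern above can be sketched as an orchestrator that runs local steps in order and, on failure, runs the compensations of the completed steps in reverse. `run_saga` and the step functions are hypothetical, and the steps mutate an in-memory dictionary. In a real saga each step would be a local transaction in a different service.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation). Returns (ok, log)."""
    log = []
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"{name}: failed ({exc})")
            # Undo the steps that already succeeded, most recent first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name}: compensated")
            return False, log
        log.append(f"{name}: done")
        completed.append((name, compensate))
    return True, log


state = {"stock": 5, "charged": 0}


def reserve_stock():
    state["stock"] -= 1


def release_stock():
    state["stock"] += 1


def charge_card():
    raise RuntimeError("card declined")


def refund_card():
    state["charged"] = 0


ok, log = run_saga([
    ("reserve_stock", reserve_stock, release_stock),
    ("charge_card", charge_card, refund_card),
])
```

Unlike 2PC there is no global lock: intermediate states are visible to other transactions until the compensations run, so compensating actions need to be safe to retry.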
Quality Attributes
Non-functional requirements that define how well a system performs and adapts — often called the "-ilities".
Scalability
- Ability to handle increased load by adding resources, without redesigning the system.
- Horizontal: Add more instances. Requires stateless services, shared-nothing architecture.
- Vertical: Upgrade hardware. Simpler but has a ceiling.
- Measure: Requests/second, concurrent users, data volume — under target latency.
- Auto-scaling: Cloud-native scaling based on metrics — CPU, queue depth, custom business metrics.
Reliability and Availability
- Reliability: System produces correct results under stated conditions.
- Availability: System is operational when needed. Measured as uptime percentage (99.9% = 8.76 hours downtime/year).
- Redundancy: Eliminate single points of failure — multi-AZ deployments, database replicas, load balancer failover.
- Graceful Degradation: Serve partial functionality when components fail (e.g., show cached data when recommendations service is down).
- Chaos Engineering: Intentionally inject failures to verify resilience (Netflix Chaos Monkey).
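The uptime figure above, and the effect of redundancy, follow from simple arithmetic. The helper names below are hypothetical, and the combined-availability formulas assume that component failures are independent, which real systems only approximate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_hours_per_year(availability_percent):
    """Allowed downtime for a given uptime percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


def serial_availability(*components):
    """All components must be up (e.g., gateway -> service -> database)."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(availability, replicas):
    """At least one of several independent replicas must be up."""
    return 1 - (1 - availability) ** replicas


three_nines = downtime_hours_per_year(99.9)    # about 8.76 hours
chain = serial_availability(0.999, 0.999)      # about 0.998: chaining lowers availability
pair = redundant_availability(0.99, 2)         # about 0.9999: redundancy raises it
```

This is why removing single points of failure matters: two modest replicas can outperform one highly reliable instance, while every extra hop in a request path lowers the total.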
Security
- Defense in Depth: Multiple security layers (network, application, data).
- Authentication: Verify identity using OAuth2/OIDC, JWTs, and MFA.
- Authorisation: Verify permissions — RBAC (role-based), ABAC (attribute-based).
- Encryption: In transit (TLS) and at rest (AES-256). Manage keys with KMS.
- Input Validation: Prevent injection (SQL, XSS, command) — validate and sanitise all external input.
- Least Privilege: Grant minimum necessary permissions to every user, service, and process.
- OWASP Top 10: Stay current with common web application security risks.
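For the injection item above, the standard defence against SQL injection is a parameterised query. The sketch uses Python's built-in `sqlite3` module. The `users` table and both lookup functions are assumptions for illustration, and `find_user_unsafe` is deliberately vulnerable to show the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [("alice", "admin"), ("bob", "user")]
)


def find_user_unsafe(conn, name):
    # Vulnerable: the input is spliced into the SQL text and can change the query.
    query = f"SELECT name, role FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Safe: the driver passes the value separately from the SQL text.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "nobody' OR '1'='1"
leaked = find_user_unsafe(conn, payload)    # the OR clause matches every row
blocked = find_user_safe(conn, payload)     # treated as a literal name: no match
```

Parameterisation addresses SQL injection specifically. XSS and command injection need their own controls, such as output encoding and avoiding shell invocation with user input.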
Observability
- Three Pillars:
  - Metrics: Numeric measurements such as request rate, error rate, and latency (RED method), plus resource usage (USE method).
  - Logs: Structured event records with timestamp, level, message, and context (request ID, user ID).
  - Traces: Request flow across services, captured by distributed tracing with correlation IDs (OpenTelemetry, Jaeger).
- Alerting: Define SLOs (Service Level Objectives) and alert when an SLI (Service Level Indicator) breaches its objective, e.g., error rate > 1% or p99 latency > 500ms.
- Dashboards: Grafana for real-time system health visualisation — combine metrics from Prometheus, CloudWatch, custom sources.
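The alerting thresholds mentioned above (error rate > 1%, p99 latency > 500ms) can be evaluated over a window of requests as sketched below. The function names are hypothetical and the percentile uses the nearest-rank method. Monitoring systems such as Prometheus typically estimate percentiles from histograms instead of raw samples.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def evaluate_slo(latencies_ms, error_count, max_error_rate=0.01, max_p99_ms=500):
    """Return the names of the SLIs that breach their objective in this window."""
    breaches = []
    if error_count / len(latencies_ms) > max_error_rate:
        breaches.append("error_rate")
    if percentile(latencies_ms, 99) > max_p99_ms:
        breaches.append("p99_latency")
    return breaches


healthy_window = [100] * 100
slow_window = [100] * 98 + [900, 950]

healthy = evaluate_slo(healthy_window, error_count=0)
degraded = evaluate_slo(slow_window, error_count=2)
```

In the degraded window the mean latency is still low, but the p99 exposes the slow tail, which is why SLOs are usually stated on percentiles.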
Architecture Case Studies
How major companies solve complex architectural challenges at scale.
Netflix
- Architecture: Microservices on AWS. Hundreds of services communicating via REST and event streams.
- CDN: Open Connect — Netflix's custom CDN with edge servers at ISPs for low-latency video delivery.
- Resilience: Hystrix (circuit breaker), Chaos Monkey (fault injection), Zuul (API gateway).
- Data: Cassandra for availability, EVCache (Memcached) for caching, Kafka for real-time event pipelines.
- Key Lesson: Design for failure. Every component assumes its dependencies will fail and has fallback behaviour.
Google
- MapReduce: Distributed data processing framework that runs map (transform) and reduce (aggregate) steps across thousands of machines.
- Bigtable: Distributed wide-column store — billions of rows, millions of columns, petabytes of data.
- Spanner: Globally distributed SQL database with strong consistency — uses TrueTime (atomic clocks + GPS) for global ordering.
- Borg → Kubernetes: Google's internal container orchestration (Borg) inspired the open-source Kubernetes.
- Key Lesson: Build infrastructure abstractions that scale. Invest heavily in internal platforms.
Twitter
- Fan-Out Problem: When a user with millions of followers tweets, the tweet must be delivered to every follower's timeline efficiently.
- Hybrid Approach: Fan-out on write for most users (pre-compute timelines). Fan-out on read for celebrities (compute at read time to avoid massive writes).
- Technology: Scala/JVM services, Redis for timeline caching, Kafka for event streaming, Manhattan (distributed key-value store).
- Key Lesson: Hybrid strategies often beat pure approaches. Optimise for the common case, handle edge cases differently.
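The hybrid fan-out approach described above can be sketched in memory. This is an illustration of the idea, not Twitter's implementation: `TimelineService`, the `CELEBRITY_THRESHOLD` value, and the data layout are all assumptions, and the sketch ignores accounts that later drop back below the threshold.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cut-off; a real system would use a far larger value


class TimelineService:
    def __init__(self):
        self.followers = defaultdict(set)    # author -> users following them
        self.following = defaultdict(set)    # user -> authors they follow
        self.posts = defaultdict(list)       # author -> [(seq, author, text)]
        self.timelines = defaultdict(list)   # user -> precomputed [(seq, author, text)]
        self._seq = 0

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author, text):
        self._seq += 1
        entry = (self._seq, author, text)
        self.posts[author].append(entry)
        if not self.is_celebrity(author):
            # Fan-out on write: push into every follower's precomputed timeline.
            for follower in self.followers[author]:
                self.timelines[follower].append(entry)

    def timeline(self, user):
        merged = set(self.timelines[user])
        # Fan-out on read: pull celebrity posts at query time (set removes duplicates).
        for author in self.following[user]:
            if self.is_celebrity(author):
                merged.update(self.posts[author])
        return [(author, text) for _, author, text in sorted(merged, reverse=True)]


svc = TimelineService()
for user in ("u1", "u2", "u3"):
    svc.follow(user, "star")
svc.follow("u1", "friend")
svc.post("friend", "hi")       # pushed to u1 at write time
svc.post("star", "hello")      # not pushed: merged in at read time
```

A post by `star` costs one write regardless of follower count, while reads for that author's followers do a small merge, which is the trade the hybrid approach makes.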
Spotify
- Squad Model: Autonomous cross-functional teams, each owning a set of microservices end-to-end.
- Backstage: Internal developer portal (now open-source) for service catalogue, TechDocs, and scaffolding — solving microservices discoverability.
- Data Pipelines: Massive event processing for recommendations — Kafka, Google Cloud Dataflow, BigQuery.
- Key Lesson: Organisational structure mirrors system architecture (Conway's Law). Invest in developer experience and internal tooling.