Software Architecture | Learning Hub

Introduction to Software Architecture

Software architecture defines the high-level structure of a system — the major components, their relationships, and the principles guiding their design and evolution.

Architecture is the set of significant design decisions about the organisation of a software system:

Structure: The decomposition of the system into components/modules and their responsibilities.
Communication: How components interact — synchronous (REST, gRPC) vs asynchronous (events, message queues).
Trade-offs: Every architectural decision involves trade-offs — consistency vs availability, simplicity vs flexibility, performance vs maintainability.
Evolution: Good architecture accommodates change. Systems are never "done" — they evolve with requirements.

The Role of the Architect

Technical Leadership: Make and document key technical decisions. Guide the team on patterns, tools, and trade-offs.
Communication: Bridge between business stakeholders and engineering teams. Translate requirements into architecture.
Breadth over Depth: Understand a wide range of technologies at a conceptual level. Know when to go deep.
Continuous Learning: Stay current with evolving patterns, cloud services, and industry practices.
Pragmatism: Choose "good enough" over "perfect". Avoid over-engineering and analysis paralysis.

Architecture Fundamentals

Core architectural styles and patterns that form the foundation of system design.

Monolithic Architecture

A single deployable unit: all components are compiled and deployed together.
Advantages: Simple to develop, test, deploy, and debug. Low operational overhead. Good for small teams and MVPs.
Disadvantages: Scaling requires scaling the whole app. Long build/deploy times as it grows. Tight coupling makes changes risky.
When to use: Early-stage products, small teams, low complexity. Start monolithic, extract services when needed.

Microservices Architecture

The system is decomposed into small, independently deployable services, each owning its own data and business logic.

Characteristics: Single responsibility per service, independent deployment, decentralised data management, technology heterogeneity.
Communication: Sync (REST, gRPC) for queries; async (Kafka, RabbitMQ) for events and commands.
Benefits: Independent scaling, isolated failures, team autonomy, technology flexibility.
Challenges: Distributed system complexity (network failures, data consistency), operational overhead (monitoring, tracing, deployment), testing across service boundaries.
Key Patterns: API Gateway, Service Discovery, Circuit Breaker, Saga, Event Sourcing, CQRS.
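
Of the patterns listed above, the Circuit Breaker is the easiest to show in a few lines. The following is a minimal, single-threaded sketch rather than the behaviour of any particular library; the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of waiting on a dependency that is probably still down, which is what keeps one failing service from exhausting the resources of its callers.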

Event-Driven Architecture

Components communicate by producing and consuming events, which gives loose coupling and high scalability.
Event Broker: Kafka, RabbitMQ, AWS SNS/SQS — decouples producers from consumers.
Event Sourcing: Store state as a sequence of events rather than current state. Enables full audit trails and temporal queries.
CQRS: Separate read and write models for different optimisation strategies — write-optimised command store, read-optimised query store.
Challenges: Eventual consistency, event ordering, idempotency, debugging event flows.
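
Event sourcing and idempotent handling can be illustrated with a toy account ledger. This is a sketch under stated assumptions: the `Event` fields and the event names `Deposited` and `Withdrawn` are hypothetical, and the log is an in-memory list rather than a broker or event store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    kind: str      # "Deposited" or "Withdrawn" (hypothetical event names)
    amount: int


def replay(events):
    """Rebuild the current balance by folding over the event log."""
    balance = 0
    seen = set()
    for event in events:
        if event.event_id in seen:
            continue  # idempotency: a redelivered event is applied only once
        seen.add(event.event_id)
        if event.kind == "Deposited":
            balance += event.amount
        elif event.kind == "Withdrawn":
            balance -= event.amount
    return balance


log = [
    Event("e1", "Deposited", 100),
    Event("e2", "Withdrawn", 30),
    Event("e2", "Withdrawn", 30),  # duplicate delivery
    Event("e3", "Deposited", 5),
]
print(replay(log))      # 75
print(replay(log[:2]))  # 70, the balance as of the second event
```

Replaying a prefix of the log is what makes temporal queries possible: the state at any earlier point can be recomputed from the events up to that point.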

Layered Architecture

Traditional n-tier approach: Presentation → Business Logic → Data Access → Database.
Each layer depends only on the layer below — separation of concerns.
Advantages: Simple, well-understood, easy to organise code for CRUD applications.
Disadvantages: Tends toward tight coupling between layers. Changes often cascade through all layers.
Clean Architecture: Invert dependencies — business logic at the centre, frameworks and databases at the edges. Dependency rule: inner layers never depend on outer layers.

Serverless Architecture

Execute code in response to events without managing servers, for example on AWS Lambda, Azure Functions, or Google Cloud Functions.
Benefits: Zero server management, automatic scaling, pay-per-execution.
Use cases: Event processing, scheduled tasks, webhooks, lightweight APIs, data transformation.
Limitations: Cold starts (added latency whenever a new execution environment has to be initialised, such as on the first invocation or after scaling out), execution time limits, statelessness, vendor lock-in.
Patterns: API Gateway + Lambda, S3 event triggers, SQS + Lambda for async processing, Step Functions for orchestration.
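
As a rough illustration of the API Gateway + Lambda pattern, a Python function behind an API Gateway proxy integration might look like the sketch below. The function name `handler` and the greeting logic are assumptions; the event is assumed to carry the request body as a JSON string under `body`, and the return value follows the proxy response shape (`statusCode`, `headers`, `body`).

```python
import json


def handler(event, context):
    """Lambda-style entry point for an API Gateway proxy request."""
    # The request body arrives as a string (or None when absent).
    body = json.loads(event.get("body") or "{}")
    name = body.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }


# Local invocation with a hand-built event; no AWS account is needed for this.
response = handler({"body": '{"name": "ada"}'}, None)
print(response["statusCode"])  # 200
```

The function keeps no state between invocations, which is what lets the platform start and stop copies of it freely.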

Design Principles

Foundational principles that guide good software design across all architectural styles.

SOLID Principles

S — Single Responsibility: A class should have only one reason to change. Separate concerns into focused classes.
O — Open/Closed: Open for extension, closed for modification. Add behaviour via new classes/interfaces, not by changing existing code.
L — Liskov Substitution: Subtypes must be substitutable for their base types without altering correctness.
I — Interface Segregation: Prefer many specific interfaces over one general-purpose interface. Clients shouldn't depend on methods they don't use.
D — Dependency Inversion: High-level modules depend on abstractions, not concrete implementations. The foundation of testability and flexibility.
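
Dependency Inversion is the principle that most directly shapes code structure, so a small sketch may help. The names `OrderRepository`, `PlaceOrder`, and `InMemoryOrderRepository` are hypothetical; the point is only that the use case depends on an abstraction and the storage detail is passed in.

```python
from typing import Protocol


class OrderRepository(Protocol):
    """Abstraction owned by the business logic."""

    def save(self, order_id: str, total: int) -> None: ...


class PlaceOrder:
    """High-level use case: depends only on the OrderRepository abstraction."""

    def __init__(self, repository: OrderRepository) -> None:
        self.repository = repository

    def execute(self, order_id: str, prices: list[int]) -> int:
        total = sum(prices)
        self.repository.save(order_id, total)
        return total


class InMemoryOrderRepository:
    """Low-level detail; a SQL-backed class with the same method could replace it."""

    def __init__(self) -> None:
        self.rows: dict[str, int] = {}

    def save(self, order_id: str, total: int) -> None:
        self.rows[order_id] = total


repo = InMemoryOrderRepository()
print(PlaceOrder(repo).execute("o-1", [250, 100]))  # 350
```

Because `PlaceOrder` never imports a database driver, it can be tested with the in-memory class and deployed with a real one without changing the use case.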

ACID (Traditional Databases)

Atomicity: All or nothing — transactions complete fully or roll back entirely.
Consistency: Database moves from one valid state to another — constraints are always satisfied.
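Isolation: Concurrent transactions don't interfere with each other. How strictly this holds depends on the configured isolation level (for example read committed, snapshot, or serializable); under snapshot isolation each transaction sees a consistent snapshot.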
Durability: Committed data survives crashes — written to non-volatile storage.
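
Atomicity and consistency can be observed with Python's built-in `sqlite3` module. This is a minimal sketch: the `accounts` table and the `transfer` helper are made up for the example, and it relies on the connection's context manager committing on success and rolling back when an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK constraint failed; the credit was rolled back too


print(transfer(conn, "alice", "bob", 150))  # False
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 0}
```

The failed transfer leaves both balances untouched even though the credit to `bob` ran first, which is the all-or-nothing behaviour described above.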

BASE (Distributed Systems)

Basically Available: System guarantees availability (possibly with stale data).
Soft State: State may change over time even without new input (due to eventual consistency).
Eventual Consistency: Given enough time, all replicas converge to the same state.

Use ACID for financial transactions and inventory, where correctness matters more than latency. Use BASE for social feeds, analytics, and caching, where briefly stale data is acceptable.

CAP Theorem

A distributed data store can guarantee at most two of the following three properties at the same time:

Consistency: Every read receives the most recent write.
Availability: Every request receives a response (not necessarily the latest data).
Partition Tolerance: System continues operating despite network partitions between nodes.

Since network partitions are inevitable in distributed systems, the real choice is between CP (consistency + partition tolerance — e.g., ZooKeeper, HBase) and AP (availability + partition tolerance — e.g., Cassandra, DynamoDB).

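PACELC extension: If there is a Partition (P), choose between Availability (A) and Consistency (C); Else (E), when the system is running normally, choose between Latency (L) and Consistency (C). For example, DynamoDB with its default eventually consistent reads is usually classified as PA/EL: available during a partition, low latency otherwise.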

GRASP

General Responsibility Assignment Software Patterns (GRASP): a guide for assigning responsibilities to classes.

Information Expert: Assign responsibility to the class that has the information needed to fulfil it.
Creator: Assign object creation to the class that contains, aggregates, or closely uses the created object.
Controller: Assign system event handling to a non-UI class that represents the use case or session.
Low Coupling: Minimise dependencies between classes — easier to change, test, and reuse.
High Cohesion: Keep related responsibilities together in one class — focused, understandable modules.
Polymorphism: Use polymorphism to handle type-based alternatives instead of conditionals.
Indirection: Assign responsibility to an intermediate object to decouple components.
Pure Fabrication: Create a class that doesn't represent a domain concept but achieves low coupling and high cohesion (e.g., a Repository class).
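
The Polymorphism guideline is perhaps the easiest of these to show in code. In the sketch below the shipping classes and their prices are invented for illustration; what matters is that `quote` contains no conditional on a type tag.

```python
from abc import ABC, abstractmethod


class ShippingMethod(ABC):
    @abstractmethod
    def cost(self, weight_kg: float) -> float: ...


class Standard(ShippingMethod):
    def cost(self, weight_kg: float) -> float:
        return 5.0 + 1.0 * weight_kg


class Express(ShippingMethod):
    def cost(self, weight_kg: float) -> float:
        return 12.0 + 1.5 * weight_kg


def quote(method: ShippingMethod, weight_kg: float) -> float:
    # No if/elif on the kind of shipping: each subclass supplies its behaviour.
    return method.cost(weight_kg)


print(quote(Standard(), 2.0))  # 7.0
print(quote(Express(), 2.0))   # 15.0
```

Adding a new shipping method means adding a class, not editing `quote`, which also keeps coupling low.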

Domain-Driven Design (DDD)

An approach to modelling complex business domains through collaboration between developers and domain experts.

Ubiquitous Language: Shared vocabulary between developers and business — used in code, documentation, and conversations.
Bounded Context: Explicit boundary within which a particular domain model applies. Different contexts may use the same term differently (e.g., "Order" in Sales vs Shipping).
Entities: Objects with identity that persists across state changes (e.g., a User).
Value Objects: Immutable objects defined by their attributes, not identity (e.g., Money, Address).
Aggregates: Cluster of entities treated as a single unit for data changes. One entity is the Aggregate Root.
Domain Events: Significant occurrences in the domain (e.g., OrderPlaced, PaymentReceived).
Repository: Abstraction for accessing aggregates from storage — hides persistence details from the domain.
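
Several of these building blocks fit in one short sketch. The `Money` and `Order` classes below are illustrative assumptions rather than a prescribed design: `Money` is a value object, `Order` acts as an aggregate root that guards its own invariants, and placing an order records an `OrderPlaced` domain event.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Money:
    """Value object: immutable and compared by attributes, not identity."""
    amount: int      # minor units such as cents
    currency: str

    def plus(self, other: "Money") -> "Money":
        if other.currency != self.currency:
            raise ValueError("currency mismatch")
        return Money(self.amount + other.amount, self.currency)


@dataclass
class Order:
    """Aggregate root: order lines are only changed through its methods."""
    order_id: str    # identity persists across state changes
    currency: str
    lines: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def add_line(self, sku: str, price: Money) -> None:
        if price.currency != self.currency:
            raise ValueError("currency mismatch")
        self.lines.append((sku, price))

    def total(self) -> Money:
        total = Money(0, self.currency)
        for _sku, price in self.lines:
            total = total.plus(price)
        return total

    def place(self) -> None:
        if not self.lines:
            raise ValueError("cannot place an empty order")  # invariant
        self.events.append(("OrderPlaced", self.order_id))   # domain event


order = Order("o-1", "EUR")
order.add_line("book", Money(1500, "EUR"))
order.add_line("pen", Money(250, "EUR"))
order.place()
print(order.total())   # Money(amount=1750, currency='EUR')
print(order.events)    # [('OrderPlaced', 'o-1')]
```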

Designing Systems

Practical patterns and strategies for building scalable, reliable, and maintainable systems.

Vertical Scaling (Scale Up): More CPU, RAM, SSD on a single machine. Simple but has limits.
Horizontal Scaling (Scale Out): Add more machines behind a load balancer. Requires stateless design.
Database Scaling:
- Read Replicas: Route reads to replicas, writes to primary.
- Sharding: Partition data across multiple databases by key (e.g., user ID modulo shard count).
- Caching: Redis/Memcached to offload frequently accessed data from the database.
Auto-Scaling: Cloud-based — scale instances based on CPU, memory, or custom metrics.
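
The modulo sharding rule mentioned above is a one-liner, and a short experiment shows its main weakness. This sketch assumes numeric user IDs and a hypothetical helper named `shard_for`.

```python
def shard_for(user_id: int, shard_count: int) -> int:
    """Route a numeric key to a shard index using modulo."""
    return user_id % shard_count


print(shard_for(42, 4))  # 2

# How many of 1,000 users land on a different shard when growing from 4 to 5?
user_ids = range(1, 1001)
moved = sum(1 for uid in user_ids if shard_for(uid, 4) != shard_for(uid, 5))
print(moved)  # 800
```

With plain modulo routing, adding a single shard relocates most keys, which is one reason larger systems often use schemes such as consistent hashing or a lookup directory instead.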

Load Balancer: Distributes traffic across healthy instances. Algorithms: round-robin, least connections, IP hash.
Layer 4 (TCP): Fast, protocol-agnostic. Can't inspect HTTP content.
Layer 7 (HTTP): Content-based routing, SSL termination, header manipulation.
API Gateway: Single entry point for microservices — routing, authentication, rate limiting, request transformation, response aggregation.
Health Checks: Periodic probes to remove unhealthy instances from the pool.
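
Round-robin selection combined with health checks can be sketched in a few lines. The class below is a toy, in-process illustration; the name `RoundRobinBalancer` and its methods are assumptions, and a real load balancer would run the probes itself on a timer.

```python
import itertools


class RoundRobinBalancer:
    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark(self, backend, is_healthy):
        """Record the result of a health probe."""
        if is_healthy:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def next_backend(self):
        """Return the next healthy backend in rotation."""
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["a", "b", "c"])
lb.mark("b", False)  # a failed probe removes "b" from rotation
print([lb.next_backend() for _ in range(4)])  # ['a', 'c', 'a', 'c']
```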

Cache-Aside (Lazy Loading): App checks cache first → on miss, loads from DB and populates cache. Most common pattern.
Write-Through: App writes to the cache and the DB synchronously, as part of the same operation. Keeps the cache consistent with the DB, at the cost of slower writes.
Write-Behind (Write-Back): App writes to cache only → cache asynchronously flushes to DB. Fast writes but risk of data loss.
TTL (Time-to-Live): Cache entries expire after a set duration. Balances freshness vs hit rate.
Eviction Policies: LRU (Least Recently Used), LFU (Least Frequently Used), FIFO.
CDN: Cache static assets at edge locations — reduces latency for global users.
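
Cache-aside with a TTL can be sketched as follows. The in-process dictionary stands in for Redis or Memcached, and `load_from_db` is a hypothetical loader function supplied by the caller; the injectable `clock` exists only to make expiry easy to test.

```python
import time


class CacheAside:
    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self.load_from_db = load_from_db  # hypothetical loader, e.g. a SQL query
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}                # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]                    # cache hit
        value = self.load_from_db(key)         # miss or expired: go to the DB
        self._entries[key] = (value, self.clock() + self.ttl_seconds)
        return value

    def invalidate(self, key):
        """Call after a write so the next read reloads fresh data."""
        self._entries.pop(key, None)
```

A short TTL keeps data fresher at the cost of more database reads; a long TTL does the opposite, which is the freshness versus hit-rate balance described above.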

Strong Consistency: Every read returns the latest write. Simple to reason about but limits availability and performance.
Eventual Consistency: Reads may return stale data temporarily. Trades consistency for availability and latency.
Saga Pattern: Sequence of local transactions with compensating actions on failure. Replaces distributed transactions in microservices.
Two-Phase Commit (2PC): Coordinator ensures all participants commit or abort. Strong guarantees, but participants block while waiting for the coordinator, so it is usually avoided across independently owned microservices.
Outbox Pattern: Write event to an "outbox" table in the same DB transaction. A separate process publishes events — guarantees atomicity between state change and event publishing.
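
The Outbox pattern is easier to follow with a concrete, if simplified, example. The sketch below uses an in-memory SQLite database; the table layout, the `place_order` and `relay` functions, and the `publish` callback are all assumptions, and in practice the relay would run as a separate process that publishes to a broker.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total INTEGER)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
)


def place_order(conn, order_id, total):
    """Write the order and its event in one local transaction."""
    event = json.dumps({"type": "OrderPlaced", "order_id": order_id})
    with conn:  # both inserts commit together or roll back together
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))


def relay(conn, publish):
    """Publish pending outbox rows in order, then mark each as published."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # may repeat after a crash: at-least-once
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


place_order(conn, "o-1", 350)
sent = []
print(relay(conn, sent.append))  # 1
print(sent)  # [{'type': 'OrderPlaced', 'order_id': 'o-1'}]
```

Because a crash between publishing and marking a row leads to redelivery, consumers of these events still need to be idempotent.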

Quality Attributes

Non-functional requirements that define how well a system performs and adapts — often called the "-ilities".

Scalability

Ability to handle increased load by adding resources, without redesigning the system.
Horizontal: Add more instances. Requires stateless services, shared-nothing architecture.
Vertical: Upgrade hardware. Simpler but has a ceiling.
Measure: Requests/second, concurrent users, data volume — under target latency.
Auto-scaling: Cloud-native scaling based on metrics — CPU, queue depth, custom business metrics.

Reliability: System produces correct results under stated conditions.
Availability: System is operational when needed. Measured as uptime percentage (99.9% = 8.76 hours downtime/year).
Redundancy: Eliminate single points of failure — multi-AZ deployments, database replicas, load balancer failover.
Graceful Degradation: Serve partial functionality when components fail (e.g., show cached data when recommendations service is down).
Chaos Engineering: Intentionally inject failures to verify resilience (Netflix Chaos Monkey).
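
The arithmetic behind availability targets and redundancy is simple enough to check directly. The helper names below are made up for this sketch, and the redundancy formula assumes replicas fail independently, which real deployments only approximate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Allowed downtime per year for a given uptime percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def serial_availability(*availabilities: float) -> float:
    """All components must be up, so the fractions multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(a: float, replicas: int) -> float:
    """At least one of N independent replicas must be up."""
    return 1 - (1 - a) ** replicas


print(round(downtime_hours_per_year(99.9), 2))       # 8.76
print(round(serial_availability(0.999, 0.999), 6))   # 0.998001
print(round(redundant_availability(0.99, 2), 6))     # 0.9999
```

Chaining dependencies lowers availability, while redundancy raises it, which is why eliminating single points of failure matters more as a request path grows longer.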

Defense in Depth: Multiple security layers — network, application, data.
Authentication: Verify identity — OAuth2/OIDC, JWT tokens, MFA.
Authorisation: Verify permissions — RBAC (role-based), ABAC (attribute-based).
Encryption: In transit (TLS) and at rest (AES-256). Manage keys with KMS.
Input Validation: Prevent injection (SQL, XSS, command) — validate and sanitise all external input.
Least Privilege: Grant minimum necessary permissions to every user, service, and process.
OWASP Top 10: Stay current with common web application security risks.

Observability

Three Pillars:
- Metrics: Numeric measurements — request rate, error rate, latency (RED method), resource usage (USE method).
- Logs: Structured event records — timestamp, level, message, context (request ID, user ID).
- Traces: Request flow across services — distributed tracing with correlation IDs (OpenTelemetry, Jaeger).
Alerting: Define SLOs (Service Level Objectives) and alert when an SLI (Service Level Indicator) breaches its objective, for example error rate > 1% or p99 latency > 500ms.
Dashboards: Grafana for real-time system health visualisation — combine metrics from Prometheus, CloudWatch, custom sources.
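
The alerting thresholds mentioned above (error rate > 1%, p99 latency > 500ms) can be evaluated over a window of requests as in the sketch below. The function names are hypothetical and the percentile uses the simple nearest-rank method; monitoring systems typically compute percentiles from histograms instead.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_slo(latencies_ms, error_count, max_error_rate=0.01, max_p99_ms=500):
    """Return the list of breached indicators for one evaluation window."""
    breaches = []
    if error_count / len(latencies_ms) > max_error_rate:
        breaches.append("error_rate")
    if percentile(latencies_ms, 99) > max_p99_ms:
        breaches.append("p99_latency")
    return breaches


window = [120] * 97 + [450, 700, 900]    # 100 requests, latencies in ms
print(check_slo(window, error_count=0))  # ['p99_latency']
print(check_slo(window, error_count=2))  # ['error_rate', 'p99_latency']
```

In this window the average latency looks healthy while the p99 does not, which is why tail percentiles are the usual basis for latency alerts.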

Architecture Case Studies

How major companies solve complex architectural challenges at scale.

Netflix

Architecture: Microservices on AWS. Hundreds of services communicating via REST and event streams.
CDN: Open Connect — Netflix's custom CDN with edge servers at ISPs for low-latency video delivery.
Resilience: Hystrix (circuit breaker, now in maintenance mode), Chaos Monkey (fault injection), Zuul (API gateway).
Data: Cassandra for availability, EVCache (Memcached) for caching, Kafka for real-time event pipelines.
Key Lesson: Design for failure. Every component assumes its dependencies will fail and has fallback behaviour.

Google

MapReduce: Distributed data processing framework that runs map (transform) and reduce (aggregate) steps across thousands of machines.
Bigtable: Distributed wide-column store — billions of rows, millions of columns, petabytes of data.
Spanner: Globally distributed SQL database with strong consistency — uses TrueTime (atomic clocks + GPS) for global ordering.
Borg → Kubernetes: Google's internal container orchestration (Borg) inspired the open-source Kubernetes.
Key Lesson: Build infrastructure abstractions that scale. Invest heavily in internal platforms.

Twitter

Fan-Out Problem: When a user with millions of followers tweets, the tweet has to be delivered to all followers' timelines efficiently.
Hybrid Approach: Fan-out on write for most users (pre-compute timelines). Fan-out on read for celebrities (compute at read time to avoid massive writes).
Technology: Scala/JVM services, Redis for timeline caching, Kafka for event streaming, Manhattan (distributed key-value store).
Key Lesson: Hybrid strategies often beat pure approaches. Optimise for the common case, handle edge cases differently.
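
The hybrid fan-out idea can be sketched with in-memory dictionaries. Everything here is illustrative rather than a description of the production system: the function names, the tiny `CELEBRITY_THRESHOLD`, and the assumption that post IDs sort chronologically are all chosen to keep the example small.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # illustrative; a real threshold would be far larger

followers = defaultdict(set)         # author -> users who follow them
following = defaultdict(set)         # user -> authors they follow
timelines = defaultdict(list)        # precomputed timelines (fan-out on write)
celebrity_posts = defaultdict(list)  # posts merged in at read time


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def post(author, post_id):
    if len(followers[author]) >= CELEBRITY_THRESHOLD:
        celebrity_posts[author].append(post_id)  # stored once, no fan-out
    else:
        for user in followers[author]:
            timelines[user].append(post_id)      # pushed to each follower


def read_timeline(user):
    merged = list(timelines[user])
    for author in following[user]:
        merged.extend(celebrity_posts.get(author, []))
    return sorted(merged)  # post IDs assumed to sort chronologically


for fan in ("u1", "u2", "u3"):
    follow(fan, "star")
follow("u1", "friend")
post("friend", 1)
post("star", 2)
print(read_timeline("u1"))  # [1, 2]
print(read_timeline("u2"))  # [2]
```

The post from `star` is written once regardless of follower count, while the post from `friend` is copied into each follower's timeline so that reads stay cheap for the common case.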

Spotify

Squad Model: Autonomous cross-functional teams, each owning a set of microservices end-to-end.
Backstage: Internal developer portal (now open-source) for service catalogue, TechDocs, and scaffolding — solving microservices discoverability.
Data Pipelines: Massive event processing for recommendations — Kafka, Google Cloud Dataflow, BigQuery.
Key Lesson: Organisational structure mirrors system architecture (Conway's Law). Invest in developer experience and internal tooling.