This is a summary I wrote of each chapter of the excellent Designing Data-Intensive Applications (affiliate link) book. I use this to prepare for remote job interviews in infrastructure and devops. Please enjoy!

Chapter 1

  • Reliability: Systems must work correctly even when faults occur
  • Scalability: Systems have strategies for maintaining performance, even when load increases
  • Response time percentiles (e.g. p50/p95/p99) are a common way to measure performance (see the sketch after this list)
  • Maintainability: Makes life better for the eng/ops teams who work with the system
  • Good abstractions reduce complexity and make the system easier to modify and adapt
  • Good operability means visibility into system health and having ways to manage it
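
A minimal sketch of computing response time percentiles, assuming a small hypothetical sample of request latencies (real monitoring uses streaming approximations such as histograms rather than sorting every sample):

    # Hypothetical sample of request latencies, in milliseconds.
    response_times_ms = sorted([12, 15, 17, 20, 22, 25, 31, 40, 95, 430])

    def percentile(sorted_values, p):
        # Nearest-rank method: the value below which p% of samples fall.
        rank = max(1, round(p / 100 * len(sorted_values)))
        return sorted_values[rank - 1]

    for p in (50, 95, 99):
        print(f"p{p}: {percentile(response_times_ms, p)} ms")
    # p50: 22 ms, p95: 430 ms, p99: 430 ms - the tail dominates the worst cases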

Chapter 2

  • Document DBs target use cases where data is self-contained and rarely related
  • Graph DBs target use cases where all data is deeply interlinked
  • Schema can be explicit (schema-on-write, enforced by the database) or implicit (schema-on-read, interpreted by the application); a small example follows this list
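
A minimal sketch of the schema-on-read idea, with hypothetical user documents; the application, not the database, decides how to interpret a missing field:

    # Hypothetical user documents: the older one was written before the
    # "speaks" field existed. Nothing was enforced when they were written.
    old_doc = {"name": "Ada"}
    new_doc = {"name": "Grace", "speaks": ["en"]}

    def read_user(doc):
        # The implicit schema is applied here, on read, by the application.
        return {"name": doc["name"], "speaks": doc.get("speaks", [])}

    print(read_user(old_doc))   # {'name': 'Ada', 'speaks': []}
    print(read_user(new_doc))   # {'name': 'Grace', 'speaks': ['en']}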

Chapter 3

OLTP: Optimized for transaction processing

  • User-facing, large volume of requests
  • Small record count per query
  • Key-indexed lookup, dependent on disk seek time

Log-structured:

  • Append-only to files; delete obsolete files; never update a file once written (a toy store is sketched below)
  • Random-access writes are turned into sequential writes on disk, enabling higher write throughput on HDDs and SSDs
  • Examples: LevelDB, Cassandra, HBase, Lucene
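
A toy sketch of the log-structured idea (loosely Bitcask-style: an append-only log plus an in-memory hash index); the file name is a placeholder and compaction of obsolete records is omitted:

    import os

    class AppendOnlyStore:
        def __init__(self, path="data.log"):          # placeholder file name
            self.path = path
            self.index = {}                            # key -> offset of latest record
            open(self.path, "ab").close()              # make sure the log file exists

        def put(self, key, value):
            offset = os.path.getsize(self.path)        # current end of the log
            with open(self.path, "ab") as f:
                f.write(f"{key},{value}\n".encode())   # sequential append, never overwrite
            self.index[key] = offset

        def get(self, key):
            offset = self.index.get(key)
            if offset is None:
                return None
            with open(self.path, "rb") as f:
                f.seek(offset)                         # one seek straight to the record
                _, value = f.readline().decode().rstrip("\n").split(",", 1)
                return value

    store = AppendOnlyStore()
    store.put("user:1", "Ada")
    store.put("user:1", "Grace")   # old record becomes obsolete rather than overwritten
    print(store.get("user:1"))     # Grace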

Update-in-place:

  • Disk is treated as a set of fixed-size pages that can be overwritten
  • B-trees, used in all major relational DBs

OLAP: Optimized for analytics processing

  • Used by business analysts, not end users
  • Lower query volume, but each query may scan millions of records
  • Disk bandwidth is the bottleneck
  • One solution: column-oriented storage (a small sketch follows this list)
  • Data warehouses: when your queries require sequential scans, indexes matter a lot less
  • It’s more important to encode data compactly to minimize the amount of data that a query must read from disk
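
A minimal sketch of the column-oriented idea, using hypothetical sales rows; the analytic query scans only the two columns it needs rather than every full row:

    # Hypothetical sales rows, as a row-oriented store would hold them.
    rows = [
        {"date": "2024-01-01", "product": "widget", "quantity": 3, "price": 9.99},
        {"date": "2024-01-01", "product": "gadget", "quantity": 1, "price": 24.50},
        {"date": "2024-01-02", "product": "widget", "quantity": 7, "price": 9.99},
    ]

    # Column-oriented layout: one list per column, values kept in row order.
    columns = {name: [row[name] for row in rows] for name in rows[0]}

    # "Total revenue" scans just two columns instead of reading every full row.
    revenue = sum(q * p for q, p in zip(columns["quantity"], columns["price"]))
    print(round(revenue, 2))   # 124.4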

Chapter 4

  • Rolling upgrades allow new versions of a service to be released without downtime
  • This promotes frequent small releases over rare big releases
  • This derisks deployments by allowing faulty releases to be rolled back before large user impact
  • This improves evolvability, the ease of making changes to an app
  • During rolling upgrades, different versions of the app are running at the same time
  • Encoding must be backward-compatible (new code reading old data) and forward-compatible (old code reading new data)
  • JSON, XML, CSV have optional schemas and are vague about datatypes (e.g. numbers)
  • Binary schema-driven formats (Thrift, Protobuf, Avro; also used by RPC frameworks like gRPC) provide compact, efficient encoding, with explicit forward- and backward-compatibility semantics, but are not human-readable (see the sketch below)
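
A minimal sketch of both compatibility directions using a JSON-like record; the field names and default value are hypothetical:

    # Hypothetical records: the old app version knows nothing about "locale".
    old_record = {"user_id": 1}                     # written by old code
    new_record = {"user_id": 2, "locale": "en-US"}  # written by new code

    def new_code_read(record):
        # Backward compatibility: new code supplies a default when old data
        # is missing the new field.
        return {"user_id": record["user_id"], "locale": record.get("locale", "en-US")}

    def old_code_read(record):
        # Forward compatibility: old code ignores fields it doesn't know about
        # (and should preserve them if it rewrites the record).
        return {"user_id": record["user_id"]}

    print(new_code_read(old_record))   # {'user_id': 1, 'locale': 'en-US'}
    print(old_code_read(new_record))   # {'user_id': 2}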

Chapter 5

  • High availability: Keep the system running even when one or more machines go down
  • Disconnected operation: App keeps working even if network unavailable
  • Latency: Place data closer to users so they can use it faster
  • Scalability: Handle a higher read volume than a single machine could, by using read replicas

Replication approaches:

  • Single-leader: Clients send all writes to a leader which streams data change events to followers; reads from followers may be stale.
    Easy to understand, no conflict resolution
  • Multi-leader: Clients send a write to any leader node; leaders stream events to each other and any followers
  • Leaderless replication: Clients send each write to several nodes and read from several nodes in parallel to detect stale data (a quorum sketch follows this list).
    Can be more robust to faulty nodes, network outages, and latency, but harder to reason about, with weak consistency guarantees
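
A minimal sketch of a leaderless (Dynamo-style) quorum, assuming n=3, w=2, r=2 and per-key version numbers; real systems add read repair, hinted handoff, and anti-entropy on top of this:

    n, w, r = 3, 2, 2                    # w + r > n, so reads overlap the latest write
    replicas = [dict() for _ in range(n)]

    def write(key, value, version, reachable=range(n)):
        acks = 0
        for i in reachable:              # some replicas may be down or unreachable
            replicas[i][key] = (version, value)
            acks += 1
        assert acks >= w, "write failed: fewer than w replicas acknowledged"

    def read(key, contacted=range(r)):
        # Read from r replicas and keep the highest version number;
        # that is how a stale copy on a lagging replica is detected.
        responses = [replicas[i][key] for i in contacted if key in replicas[i]]
        return max(responses)[1] if responses else None

    write("cart:42", ["milk"], version=1)
    write("cart:42", ["milk", "eggs"], version=2, reachable=[1, 2])  # replica 0 missed it
    print(read("cart:42", contacted=[0, 1]))   # ['milk', 'eggs'] - newer version wins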

Replication lag causes issues:

  • Read-after-write: Users must always see data they themselves submitted (a routing sketch follows this list)
  • Monotonic reads: Users must always see data in chronological order (not see earlier point-in-time data)
  • Consistent prefix reads: Users must always see data in a state that makes causal sense: a question is followed by its reply
  • In multi-leader and leaderless schemes, conflicts may occur and must be resolved
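
One common way to get read-after-write consistency is to route a user's reads to the leader for a short window after their own writes; a minimal sketch, assuming a one-minute window:

    import time

    RECENT_WRITE_WINDOW_S = 60            # assumed window; tune to replication lag
    last_write_at = {}                    # user_id -> time of their last write

    def record_write(user_id):
        last_write_at[user_id] = time.monotonic()

    def replica_for_read(user_id):
        # Users who wrote recently read from the leader (guaranteed fresh);
        # everyone else can read from any follower.
        last = last_write_at.get(user_id)
        if last is not None and time.monotonic() - last < RECENT_WRITE_WINDOW_S:
            return "leader"
        return "follower"

    record_write("alice")
    print(replica_for_read("alice"))   # leader - her write may not have replicated yet
    print(replica_for_read("bob"))     # follower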

Chapter 6

  • Partitioning is necessary when data cannot fit onto a single machine
  • Goal is to spread data and query load evenly among multiple machines, avoiding hot spots
  • Must choose a partition scheme that fits data, and rebalance partitions when nodes are added or removed

Approaches:

  • Key range partitioning: e.g. node owns keys from A through F
  • May cause hot spots if application frequently accesses keys that are close together in the sorted order
  • Hash partitioning: keys are assigned to a node corresponding to their hashed value
  • Distributes load evenly but destroys ordering of keys, so range queries are inefficient
  • Common approach: Create a fixed number of partitions in advance, assign several to each node, and move entire partitions when a node is added or removed (sketched below)
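
A minimal sketch of hash partitioning with a fixed number of partitions; the partition count, node names, and round-robin partition assignment are illustrative assumptions:

    import hashlib

    NUM_PARTITIONS = 16                       # fixed up front, far more than nodes
    nodes = ["node-a", "node-b", "node-c"]    # illustrative node names

    def partition_for(key):
        # Stable hash (Python's built-in hash() is randomized per process),
        # so every node computes the same key -> partition mapping.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % NUM_PARTITIONS

    # Whole partitions are assigned to nodes; adding or removing a node only
    # moves entire partitions, never re-hashes individual keys.
    partition_to_node = {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

    key = "user:12345"
    p = partition_for(key)
    print(f"{key} -> partition {p} on {partition_to_node[p]}")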

Secondary indices must also be partitioned:

  • Document-partitioned (local): secondary indices are stored in same partition as primary K/V
  • Only a single partition is updated on write; a read of the secondary index requires scatter/gather across all partitions (sketched after this list)
  • Term-partitioned (global): secondary indices are partitioned separately using indexed values
  • Entry in secondary index may include records from all partitions of the primary key
  • Write updates several partitions; reads from a single partition
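
A minimal sketch of a document-partitioned (local) secondary index and its scatter/gather read, with hypothetical documents indexed by color:

    # Two partitions, each indexing only its own documents (local index).
    partitions = [
        {"docs": {1: {"color": "red"}, 2: {"color": "blue"}},
         "by_color": {"red": [1], "blue": [2]}},
        {"docs": {3: {"color": "red"}},
         "by_color": {"red": [3]}},
    ]

    def find_by_color(color):
        results = []
        for part in partitions:                              # scatter: ask every partition...
            results.extend(part["by_color"].get(color, []))  # ...gather the matches
        return results

    print(find_by_color("red"))   # [1, 3] - matching documents live on different partitions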

Chapter 7

  • Transactions allow an app to pretend that some concurrency problems and SW/HW faults don’t exist – lots of errors become “transaction aborts”
  • Txns hugely reduce the number of potential error cases you need to worry about
  • Without txns, hardware errors (power outage, disk crash) cause various data inconsistencies – hard to reason about effects of concurrent access

Race conditions include:

  • Dirty reads: Client sees another client’s writes before they are committed. Solvable with read-committed isolation level.
  • Dirty writes: Client overwrites another client’s data that has been written but not committed. Prevented by read-committed isolation (typically via row-level locks).
  • Read skew: Client sees different parts of the DB at different points in time. Solvable with snapshot isolation.
  • Lost updates: Two clients perform a concurrent read-modify-write (e.g. the bank balance problem). Solvable with snapshot isolation or an atomic compare-and-set (see the sketch after this list).
  • Write skew: Txn reads something, makes decision, writes decision – by the time the write is made, the premise is no longer true. Solvable with serializable isolation.
  • Phantom reads: Txn reads objects matching a search condition; someone else writes data that modifies those search results. Write skew issues may require index-range locks.
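
A minimal sketch of avoiding a lost update with a compare-and-set retry loop; an in-memory dict and a lock stand in for the database's atomic check-and-write:

    import threading

    lock = threading.Lock()                        # stands in for the DB's atomicity
    account = {"balance": 100, "version": 0}

    def compare_and_set(expected_version, new_balance):
        # The database would perform this check-and-write atomically.
        with lock:
            if account["version"] != expected_version:
                return False                       # someone else wrote in between
            account["balance"] = new_balance
            account["version"] += 1
            return True

    def withdraw(amount):
        while True:                                # retry the read-modify-write cycle
            snapshot = dict(account)               # read
            new_balance = snapshot["balance"] - amount                 # modify
            if compare_and_set(snapshot["version"], new_balance):      # write
                return

    withdraw(30)
    withdraw(20)
    print(account["balance"])   # 50, even if the two withdrawals had raced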

Approaches to implementing serializable transactions:

  • Literally executing transactions in serial order on a single CPU core
  • Two-phase locking: standard approach; may have poor performance
  • Serializable snapshot isolation: optimistically allow txns to proceed without blocking; commits are aborted if not serializable

Chapter 8

Some kinds of partial failures in distributed systems:

  • Network: Packets may be lost or artificially delayed
  • Time: Node clock may jump forward or backward and be out of sync with other nodes
  • Pauses: GC may pause a process; other nodes declare it dead; it resumes and is unaware it was paused
  • Any software that interacts with other nodes may fail, go slow, or time out

Detecting faults:

  • Most systems can’t reliably tell whether a node has failed
  • Most distributed algorithms rely on timeouts to detect node failure (a heartbeat sketch follows this list)
  • But timeouts might be network issues, not node failures
  • A limping node might cause more issues than a dead one
  • Once a fault is detected, info must flow over unreliable network between nodes – we rely on quorum protocols to make decisions
  • Distributed systems enable scalability, fault tolerance, and low latency
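
A minimal sketch of heartbeat/timeout failure detection; the timeout value and node names are illustrative, and a timed-out node can only be suspected dead, not known dead:

    import time

    HEARTBEAT_TIMEOUT_S = 10.0          # assumed timeout; too short means false alarms
    last_heartbeat = {}                 # node_id -> time the last heartbeat arrived

    def on_heartbeat(node_id):
        last_heartbeat[node_id] = time.monotonic()

    def suspected_down(node_id):
        last_seen = last_heartbeat.get(node_id)
        if last_seen is None:
            return True
        return time.monotonic() - last_seen > HEARTBEAT_TIMEOUT_S

    on_heartbeat("node-b")
    print(suspected_down("node-b"))   # False - heard from it recently
    print(suspected_down("node-c"))   # True  - but it might just be slow or partitioned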

Chapter 9

  • Linearizability makes a database behave like an atomic variable, but is slow, especially across high-latency networks
  • Causality imposes an ordering on events, based on cause and effect
  • Weaker consistency model
  • Some things can be concurrent – branching and merging
  • Less sensitive to network issues than linearizability; lacks the coordination overhead of linearizability

Consensus

  • Needed where causal ordering alone is not enough, e.g. signing up for a username requires that the username is not already taken
  • All nodes must agree on what was decided, irrevocably

Consensus decision problems include:

  • Linearizable compare-and-set registers: atomically set the register only if its current value equals a given parameter (sketched after this list)
  • Atomic transaction commit: commit or abort
  • Total order broadcast: decide on order to deliver messages
  • Locks and leases: only one client can grab a lock
  • Membership/coordination: decide which nodes are alive and dead
  • Uniqueness constraints: when several txns concurrently try to create conflicting records, decide which is allowed and which fail with a constraint violation
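
A minimal sketch of the compare-and-set decision applied to the username example above; a single in-process dict stands in for a consensus-backed register:

    usernames = {}   # username -> owner; each entry acts as a register

    def compare_and_set(username, expected, new_owner):
        # A real system needs consensus so every node agrees, irrevocably,
        # on which write won; a single process decides trivially here.
        if usernames.get(username) != expected:
            return False
        usernames[username] = new_owner
        return True

    print(compare_and_set("ada", expected=None, new_owner="user-1"))   # True - claimed
    print(compare_and_set("ada", expected=None, new_owner="user-2"))   # False - already taken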

Single-leader failure resolutions:

  • Wait for leader to recover
  • Human does manual failover
  • Algorithm chooses new leader
  • Choosing a new leader automatically is itself a consensus problem: all nodes must agree on which node is the leader
  • Leaderless and multi-leader replication systems typically don’t use global consensus, so they must cope with conflicting writes instead