Designing Data-Intensive Applications Cheat Sheet

Thursday, August 4, 2022

This is a summary I wrote of each chapter of the excellent Designing Data-Intensive Applications (affiliate link) book. I use this to prepare for remote job interviews in infrastructure and devops. Please enjoy!

Chapter 1

  • Reliability: Systems must work correctly even when faults occur
  • Scalability: Systems have strategies for maintaining performance, even when load increases
  • Response time percentiles (e.g. p95, p99) can measure performance; see the sketch after this list
  • Maintainability: Makes life better for eng/ops teams who work with the system
  • Good abstractions reduce complexity and make the system easier to modify and adapt
  • Good operability means visibility into system health and having ways to manage it
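
To make the percentiles bullet concrete, here is a minimal sketch of computing p50/p95/p99 from a batch of request latencies (the latency numbers are made up for illustration):

```python
# Minimal sketch: nearest-rank response-time percentiles from a list of
# request latencies (values in ms are invented for illustration).
latencies_ms = sorted([12, 15, 14, 220, 16, 13, 18, 17, 900, 15])

def percentile(sorted_values, p):
    """Return the value below which roughly p% of observations fall (nearest rank)."""
    rank = max(0, int(round(p / 100 * len(sorted_values))) - 1)
    return sorted_values[rank]

print("p50:", percentile(latencies_ms, 50))   # the typical request
print("p95:", percentile(latencies_ms, 95))   # tail latency
print("p99:", percentile(latencies_ms, 99))   # what the slowest requests look like
```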

Chapter 2

  • Document DBs target use cases where data is self-contained and rarely related
  • Graph DBs target use cases where all data is deeply interlinked
  • Schema can be explicit (enforced on write) or implicit (enforced on read)

Chapter 3

OLTP: Optimized for transaction processing

  • User-facing, large volume of requests
  • Small record count per query
  • Key-indexed lookup, dependent on disk seek time

Log-structured:

  • Append-only to files; delete obsolete files; do not update a written file
  • Random-access writes are turned into sequential writes on disk, enabling higher write throughput on HDD/SSD
  • LevelDB, Cassandra, HBase, Lucene

Update-in-place:

  • Disk is treated as a set of fixed-size pages that can be overwritten
  • B-trees, used in all major relational DBs
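
Here is a minimal sketch of the log-structured approach, roughly the hash-indexed append-only log the chapter opens with: every write appends to a log file, and an in-memory index maps each key to the byte offset of its latest value (segments, compaction, and crash recovery are left out):

```python
# Minimal sketch of a log-structured key-value store: every write is an
# append to the log; an in-memory hash index maps key -> byte offset of
# the most recent value. No compaction or crash recovery here.
import os

class AppendOnlyKV:
    def __init__(self, path):
        self.path = path
        self.index = {}                        # key -> byte offset of latest record
        open(path, "ab").close()               # make sure the log file exists

    def put(self, key, value):
        with open(self.path, "ab") as log:
            offset = log.tell()
            log.write(f"{key},{value}\n".encode())   # sequential append, never overwrite
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as log:
            log.seek(offset)                   # one seek to the latest record
            _, value = log.readline().decode().rstrip("\n").split(",", 1)
            return value

db = AppendOnlyKV("data.log")
db.put("user:1", "alice")
db.put("user:1", "alicia")                     # an update is just another append
print(db.get("user:1"))                        # -> alicia
os.remove("data.log")
```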

OLAP: Optimized for analytics processing

  • Used by business analysts, not end users
  • Lower query volume, but each query may scan millions of records
  • Disk bandwidth is the bottleneck
  • One solution: column-oriented storage (sketched below)
  • Data warehouses: when your queries require sequential scans, indexes matter a lot less
  • It’s more important to encode data compactly to minimize the amount of data that a query must read from disk
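
As a toy illustration of column-oriented storage (the table and values are invented): each column is stored as its own array, so an analytic query scans only the columns it needs rather than whole rows:

```python
# Toy column store: the same rows laid out column-by-column. An analytic
# query that needs only a few columns scans just those arrays instead of
# reading every full row off disk.
rows = [
    {"date": "2022-08-01", "product": "apples",  "quantity": 3, "price": 1.0},
    {"date": "2022-08-01", "product": "bananas", "quantity": 5, "price": 0.5},
    {"date": "2022-08-02", "product": "apples",  "quantity": 2, "price": 1.0},
]

# Column-oriented layout: one array per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# "Total revenue for apples" touches only product, quantity and price.
revenue = sum(
    qty * price
    for prod, qty, price in zip(columns["product"], columns["quantity"], columns["price"])
    if prod == "apples"
)
print(revenue)   # -> 5.0
```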

Chapter 4

  • Rolling upgrades allow new versions of a service to be released without downtime
  • This promotes frequent small releases over rare big releases
  • This derisks deployments by allowing faulty releases to be rolled back before large user impact
  • This improves evolvability, the ease of making changes to an app
  • During rolling upgrades, different versions of the app are running at the same time
  • Encoding must be backward-compatible (new code reading old data) and forward-compatible (old code reading new data); see the sketch after this list
  • JSON, XML, and CSV have optional schemas and are vague about datatypes (e.g. numbers and binary strings)
  • Binary schema-driven formats (Thrift, Protocol Buffers, Avro; gRPC builds on Protobuf) provide compact, efficient encoding with explicit forward- and backward-compatibility semantics, but are not human-readable
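
A sketch of what those compatibility rules look like for a JSON-encoded record (the field names are invented for illustration): new code must supply defaults for fields that old writers never produced, and old code must preserve fields it doesn't recognize:

```python
import json

# Record written by an OLD version of the service (no "nickname" field yet).
old_payload = json.dumps({"user_id": 42, "name": "Ada"})

# Backward compatibility: NEW code reading OLD data must tolerate the
# missing field, e.g. by providing a default instead of raising KeyError.
record = json.loads(old_payload)
print(record.get("nickname", "<no nickname yet>"))

# Forward compatibility: OLD code reading NEW data must not drop fields it
# doesn't know about when it rewrites the record.
new_payload = json.dumps({"user_id": 42, "name": "Ada", "nickname": "ada99"})
record = json.loads(new_payload)
record["name"] = "Ada L."                  # old code updates what it knows...
print(json.dumps(record))                  # ...and keeps "nickname" intact
```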

Chapter 5

  • High availability: Keep the system running even if one or more machines go down
  • Disconnected operation: The app keeps working even when the network is unavailable
  • Latency: Place data geographically close to users so they can access it faster
  • Scalability: Handle a higher read volume than a single machine could by serving reads from replicas

Replication approaches:

  • Single-leader: Clients send all writes to a leader which streams data change events to followers; reads from followers may be stale.
    Easy to understand; no conflict resolution needed
  • Multi-leader: Clients send a write to any leader node; leaders stream events to each other and any followers
  • Leaderless replication: Clients send each write to several nodes and read from several nodes in parallel to detect stale data.
    Can be more robust to faulty nodes, network outage, latency, but harder to reason about, weak consistency guarantees
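
A rough sketch of the leaderless approach with quorums, assuming n = 3 replicas and w = r = 2: because w + r > n, every read quorum overlaps the latest write quorum, and version numbers let the reader discard stale copies (failures, read repair, and real networking are omitted):

```python
# Sketch of leaderless quorum reads/writes over n in-memory "replicas".
# With w + r > n, at least one replica in every read quorum has the latest
# write, and its higher version number exposes stale copies.
n, w, r = 3, 2, 2
replicas = [{} for _ in range(n)]            # each replica: key -> (version, value)

def write(key, value, version):
    acks = 0
    for replica in replicas[:w]:             # pretend only w replicas are reachable
        replica[key] = (version, value)
        acks += 1
    return acks >= w                         # success only with a write quorum

def read(key):
    responses = [rep[key] for rep in replicas[-r:] if key in rep]
    return max(responses)[1] if responses else None   # highest version wins

write("config", "v1", version=1)
write("config", "v2", version=2)
print(read("config"))                        # -> v2, even though one replica never saw it
```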

Replication lag causes issues:

  • Read-after-write: Users must always see data they submitted
  • Monotonic reads: Users must always see data in chronological order (not see earlier point-in-time data)
  • Consistent prefix reads: Users must always see data in a state that makes causal sense: a question is followed by its reply
  • In multi-leader and leaderless schemes, conflicts may occur and must be resolved

Chapter 6

  • Partitioning is necessary when data cannot fit onto a single machine
  • Goal is to spread data and query load evenly among multiple machines, avoiding hot spots
  • Must choose a partition scheme that fits data, and rebalance partitions when nodes are added or removed

Approaches:

  • Key range partitioning: e.g. node owns keys from A through F
  • May cause hot spots if application frequently accesses keys that are close together in the sorted order
  • Hash partitioning: keys are assigned to a node corresponding to their hashed value
  • Distributes load evenly but destroys ordering of keys, so range queries are inefficient
  • Common approach: Create a fixed number of partitions in advance, assign several to a node, and move entire partitions when a node is added/removed
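
A minimal sketch of that common approach (the node names and partition count are arbitrary): keys hash to one of a fixed set of partitions, partitions are assigned to nodes, and rebalancing moves whole partitions without ever rehashing a key:

```python
# Fixed number of partitions, many more than nodes. A key always hashes to
# the same partition; rebalancing reassigns whole partitions to nodes
# without changing any key's partition.
import hashlib

NUM_PARTITIONS = 16

def partition_for(key):
    digest = hashlib.md5(key.encode()).hexdigest()   # stable hash (unlike built-in hash())
    return int(digest, 16) % NUM_PARTITIONS

def assign(num_partitions, nodes):
    # Round-robin assignment of partitions to nodes.
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}

assignment = assign(NUM_PARTITIONS, ["node-a", "node-b", "node-c"])
key = "user:12345"
p = partition_for(key)
print(p, assignment[p])          # which partition and node own this key

# Adding a node changes which node owns some partitions,
# not which partition a key maps to.
assignment = assign(NUM_PARTITIONS, ["node-a", "node-b", "node-c", "node-d"])
print(p, assignment[p])
```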

Secondary indices must also be partitioned:

  • Document-partitioned (local): secondary indices are stored in same partition as primary K/V
  • Only update a single partition on write; read of secondary index requires scatter/gather across all partitions
  • Term-partitioned (global): secondary indices are partitioned separately using indexed values
  • A single entry in the secondary index may refer to records stored in any of the primary partitions
  • Write updates several partitions; reads from a single partition
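
A compact sketch contrasting the two layouts for a "find all red items" query (the data is invented): the local index forces a scatter/gather read across every partition, while a global term-partitioned index answers from the single partition that owns the term:

```python
# Two partitions of a primary key-value store, each with a local
# (document-partitioned) secondary index on "color".
partitions = [
    {"data": {1: {"color": "red"}, 2: {"color": "blue"}},
     "color_index": {"red": [1], "blue": [2]}},
    {"data": {3: {"color": "red"}},
     "color_index": {"red": [3]}},
]

# Local index: the query must scatter/gather over ALL partitions.
local_result = [doc_id for p in partitions for doc_id in p["color_index"].get("red", [])]
print(local_result)               # -> [1, 3]

# Term-partitioned (global) index: the term "red" lives on exactly one
# partition, so a read hits a single partition -- but the writes to doc 1
# and doc 3 both had to update it.
global_color_index = {"red": [1, 3], "blue": [2]}
print(global_color_index["red"])  # -> [1, 3]
```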

Chapter 7

  • Transactions allow an app to pretend that some concurrency problems and SW/HW faults don’t exist – lots of errors become “transaction aborts”
  • Txns hugely reduce the number of potential error cases you need to worry about
  • Without txns, hardware errors (power outage, disk crash) cause various data inconsistencies – hard to reason about effects of concurrent access

Race conditions include:

  • Dirty reads: Client sees another client’s writes before they are committed. Solvable with read-committed isolation level.
  • Dirty writes: Client overwrites another client’s data that has been written but not committed. Prevented by read-committed isolation (typically via row-level locks).
  • Read skew: Client sees different parts of the DB at different points in time. Solvable with snapshot isolation.
  • Lost updates: Two clients perform a concurrent read-modify-write (e.g. the bank balance problem). Solvable with atomic write operations, explicit locking, or snapshot isolation implementations that detect lost updates; see the sketch after this list.
  • Write skew: Txn reads something, makes decision, writes decision – by the time the write is made, the premise is no longer true. Solvable with serializable isolation.
  • Phantom reads: Txn reads objects matching a search condition; someone else writes data that modifies those search results. Write skew issues may require index-range locks.
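
To make the lost-update race concrete, here is a sketch of the unprotected read-modify-write that loses updates, and the same update made safe with an atomic compare-and-set retry loop (an in-memory dict stands in for a database row):

```python
import threading

balance = {"value": 100}          # stand-in for one row in a database
lock = threading.Lock()

def unsafe_deposit(amount):
    # Read-modify-write with no concurrency control: two concurrent
    # deposits can both read 100, and one update is silently lost.
    # (Shown for contrast; not exercised below.)
    current = balance["value"]
    balance["value"] = current + amount

def compare_and_set(expected, new):
    # Succeed only if nobody changed the value since we read it.
    with lock:
        if balance["value"] == expected:
            balance["value"] = new
            return True
        return False

def safe_deposit(amount):
    while True:                   # retry until our premise still holds at write time
        current = balance["value"]
        if compare_and_set(current, current + amount):
            return

threads = [threading.Thread(target=safe_deposit, args=(10,)) for _ in range(5)]
for t in threads: t.start()
for t in threads: t.join()
print(balance["value"])           # -> 150: no deposit was lost
```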

Approaches to implementing serializable transactions:

  • Literally executing transactions in serial order on a single CPU core
  • Two-phase locking: standard approach; may have poor performance
  • Serializable snapshot isolation: optimistically allow txns to proceed without blocking; a txn is aborted at commit time if its execution was not serializable

Chapter 8

Some kinds of partial failures in distributed systems:

  • Network: Packets may be lost or arbitrarily delayed
  • Time: Node clock may jump forward or backward and be out of sync with other nodes
  • Pauses: GC may pause a process; other nodes declare it dead; it resumes and is unaware it was paused
  • Any software that interacts with other nodes may fail, go slow, or time out

Detecting faults:

  • Most systems can’t know for certain whether a node has failed
  • Most distributed algorithms rely on timeouts to detect node failure (see the sketch after this list)
  • But timeouts might be network issues, not node failures
  • A limping node might cause more issues than a dead one
  • Once a fault is detected, info must flow over unreliable network between nodes – we rely on quorum protocols to make decisions
  • Distributed systems enable scalability, fault tolerance, and low latency
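
A minimal sketch of timeout-based failure detection from heartbeats (the timeout value and node names are arbitrary); note that it can only ever report "suspected dead", since a missed heartbeat might be a slow network or a GC pause rather than a crash:

```python
import time

HEARTBEAT_TIMEOUT = 5.0                      # seconds; arbitrary for illustration
last_heartbeat = {}                          # node name -> time of last heartbeat

def record_heartbeat(node):
    last_heartbeat[node] = time.monotonic()

def suspected_dead(node):
    # We can never be sure: the node may be alive but slow, partitioned
    # away, or paused by GC. A timeout is a guess, not a proof.
    seen = last_heartbeat.get(node)
    return seen is None or time.monotonic() - seen > HEARTBEAT_TIMEOUT

record_heartbeat("node-a")
print(suspected_dead("node-a"))              # -> False (just heard from it)
print(suspected_dead("node-b"))              # -> True (never heard from it)
```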

Chapter 9

  • Linearizability makes a database behave like an atomic variable, but is slow, especially across high-latency networks
  • Causality imposes an ordering on events, based on cause and effect
  • Weaker consistency model
  • Some events can be concurrent, so history is not a single line – it branches and merges
  • Less sensitive to network issues than linearizability; lacks the coordination overhead of linearizability

Consensus

  • Solves problems causal ordering alone cannot, e.g. signing up for a username requires that the username not already be taken, so all nodes must agree on who claimed it first
  • All nodes must agree on what was decided, irrevocably

Consensus decision problems include:

  • Linearizable compare-and-set registers: atomically set a register only if its current value equals an expected value (see the sketch after this list)
  • Atomic transaction commit: commit or abort
  • Total order broadcast: decide on order to deliver messages
  • Locks and leases: only one client can grab a lock
  • Membership/coordination: decide which nodes are alive and dead
  • Uniqueness constraints: decide which of several conflicting txns is allowed to commit and which fail with a constraint violation
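
As a sketch of how a linearizable compare-and-set register settles the username example above: each username maps to a register that starts unset, and exactly one client's "set it to me, but only if it is still unset" can succeed (a thread lock stands in for the consensus a real distributed register would need):

```python
import threading

# One "register" per username; the lock stands in for the linearizability
# that a real distributed register would need consensus to provide.
registers = {}
lock = threading.Lock()

def compare_and_set(key, expected, new):
    """Atomically set registers[key] to `new` only if it currently equals `expected`."""
    with lock:
        if registers.get(key) == expected:
            registers[key] = new
            return True
        return False

def claim_username(username, user_id):
    # Succeeds for exactly one user: the register must still be unset (None).
    return compare_and_set(f"username:{username}", None, user_id)

print(claim_username("ada", user_id=1))      # -> True, "ada" now belongs to user 1
print(claim_username("ada", user_id=2))      # -> False, already taken
```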

Single-leader failure resolutions:

  • Wait for leader to recover
  • Human does manual failover
  • Algorithm chooses new leader
  • Automatically choosing a new leader is itself a consensus problem: all nodes must agree on a single leader, or we risk split brain
  • Leaderless and multi-leader replication systems don’t use global consensus

I'd love to hear what you think about this post. Email me!