Technical Notes on Hemant Sethi (https://www.sethihemant.com/notes/), generated by Hugo 0.146.0, en-us

High Performance IO For Large Scale Deep Learning
https://www.sethihemant.com/notes/webdataset-aistore-2020/ | Sun, 15 Mar 2026

Paper: High Performance IO For Large Scale Deep Learning (https://arxiv.org/pdf/2001.01858)


High Performance I/O For Large Scale Deep Learning

Ideas Explored (TLDR)

  • WebDataset (large sharded datasets read sequentially, instead of many small random reads)
  • AIStore (S3-compatible object store w/ caching) instead of distributed file systems like GFS/HDFS.

Background

  • Deep learning training needs Petascale datasets.
  • Existing Distributed File systems not suited for access patterns of DL jobs.
  • DL workloads: repeated random access of training datasets (not high-throughput sequential IO).
  • DL datasets are transformed from the original files into shard collections, changing the access pattern from random reads to sequential IO.
  • DL Model Training Steps
  • Traditional Big Data ML Storage solutions
  • Requirements for Large scale Deep Learning Storage Solutions

AI Store

  • Provide infinitely scalable namespace over arbitrary numbers of Disks(SSDs & HDDs).

  • Conventional metadata-server bottlenecks are eliminated: data flows directly b/w compute clients and clustered storage targets.

  • Lightweight, scale-out object store w/ S3 semantics. Also integrates natively w/ S3 storage.

  • High performance via HTTP redirects: the client receives the object over a direct connection to the storage server holding it.

  • Serves as a large-scale caching tier (performance tier b/w cold data & DL jobs) that scales out w/o hard limits.

  • Open source, written in Go, runs on commodity hardware (K8s, Linux nodes or clusters).

  • Other features
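The redirect-based access path can be sketched as follows. This is an illustrative sketch, not AIStore's actual code: the target list is hypothetical, and rendezvous (HRW) hashing is assumed here as the placement function. The point is that any gateway can compute an object's owning target statelessly and answer a GET with an HTTP 307 redirect, so no metadata server sits on the data path.

```python
# Sketch (not AIStore's code): gateways route clients directly to the
# storage target owning an object. Rendezvous (HRW) hashing lets every
# gateway agree on the owner with no shared state.
import hashlib

TARGETS = ["target-1:8081", "target-2:8081", "target-3:8081"]  # hypothetical

def owner(bucket: str, obj: str, targets=TARGETS) -> str:
    """Highest-random-weight hashing: pick the target with the top score."""
    def score(t):
        h = hashlib.sha256(f"{t}/{bucket}/{obj}".encode()).hexdigest()
        return int(h, 16)
    return max(targets, key=score)

def redirect_url(bucket: str, obj: str) -> str:
    # The gateway would answer the client's GET with an HTTP 307 to this
    # URL; the client then streams the object straight from the target.
    return f"http://{owner(bucket, obj)}/{bucket}/{obj}"
```

Because placement is a pure function of (target set, object name), adding gateways adds capacity without adding coordination.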

WebDataSet

  • Defines a storage convention for existing file-based datasets, using existing formats, to ease adoption of sharded sequential storage.
  • Storage Format: WebDataset datasets are represented as standard POSIX tar files in which all the files that comprise a training sample are stored adjacent to each other in the tar archive.
  • Library: the WebDataset library is a drop-in replacement for the Python Dataset API to access record-sequential (sharded key-value) storage w/ minimal client changes.
  • Server: can read from any input stream: local files, web servers, cloud storage servers.
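The storage convention can be demonstrated with nothing but the standard library: all files belonging to one training sample share a basename ("key") and sit adjacent in a POSIX tar shard, so a whole shard streams sequentially. A minimal sketch (the helper names are mine, not the WebDataset library's API):

```python
# Minimal sketch of the WebDataset tar convention using only the stdlib.
import io
import tarfile

def write_shard(path, samples):
    """samples: iterable of (key, {extension: bytes}) pairs.
    Files of one sample are written adjacent to each other."""
    with tarfile.open(path, "w") as tar:
        for key, files in samples:
            for ext, data in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))

def read_shard(path):
    """Yield (key, {extension: bytes}) by grouping adjacent tar entries."""
    current_key, current = None, {}
    with tarfile.open(path, "r") as tar:
        for member in tar:
            key, _, ext = member.name.partition(".")
            if key != current_key and current:
                yield current_key, current
                current = {}
            current_key = key
            current[ext] = tar.extractfile(member).read()
    if current:
        yield current_key, current
```

Reading a shard is one long sequential scan, which is exactly the access pattern HDDs and object stores are good at.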

Small File Problem

  • Large Scale DL Datasets have Billions of small files.
  • Solution 1: Archival tools transform a small-file dataset into a dataset of larger shards.
  • Solution 2: Scalable storage access
  • AIStore gateways (aka AIS proxies)

Shards and Records

  • DL Datasets = Billions/Trillions of small samples leading to small file problems.
  • SSD 4KB random-read throughput (roughly 200 MB/s - 900 MB/s) is only on par with HDD sequential-read performance.
  • Performance at scale for DL workloads therefore requires optimized (sequential) read access patterns, not 4KB random reads.
  • AIS supports automated user-defined pre-sharding and offloading to storage clusters.
  • Best Toolchain, IO Friendly Sharding Format?
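A back-of-envelope sketch of why pre-sharding matters on spinning disks. The numbers (10 ms seek, 150 MB/s sequential) are assumptions for illustration, not taken from the paper: reading many small files pays a seek per file, while a shard is one long sequential stream.

```python
# Assumed numbers: ~10 ms per-file seek overhead, 150 MB/s sequential HDD.
def hours_small_files(n_files, file_kb, seek_ms=10.0, seq_mb_s=150.0):
    """Each small file costs a seek plus its transfer time."""
    per_file_s = seek_ms / 1000 + (file_kb / 1024) / seq_mb_s
    return n_files * per_file_s / 3600

def hours_sharded(n_files, file_kb, seq_mb_s=150.0):
    """A shard reads the same bytes as one sequential stream."""
    total_mb = n_files * file_kb / 1024
    return total_mb / seq_mb_s / 3600

# 10M samples of 100 KB on one HDD: roughly 30 h as small files
# (seek-dominated) vs. under 2 h sharded, a >10x gap.
```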

Core Ideas To support new DL Pipelines

  • Large Sharded Reads.
  • Highly scalable storage access protocol
  • Independently scalable pipeline stages, I/O, Decoding, Augmentation, Deep Learning.
  • Easy to assign any component(Compute or Storage) to K8s Node.

Benchmarking

  • Lack of established/standardized deep learning benchmarks at the time that run DDL (Distributed Deep Learning) while isolating storage IO's contribution to DDL performance.
  • Didn't want to go w/ artificial "DDL-like" synthetic workloads.
  • Decided to benchmark end-to-end: training and inference on a particular DL framework w/ a fixed DL model.
  • Metric of interest: how quickly the training/evaluation loop iterates and consumes data from the DataLoader pipeline.
  • Hardware

Results:

  • The storage-format change (WebDataset), even on HDD, brings performance close to the non-sharded (Dataset) case on SSD.
  • AIStore's performance w/ WebDataset is better than standard HDFS-based WebDataset.
  • AIStore scales better than HDFS, especially for large numbers of small files and large numbers of clients (GPU nodes), since it avoids a centralized NameNode like the one in HDFS.
  • AIStore delivers 18 GB/s aggregate throughput, or 150 MB/s per each of the 120 hard drives, effectively a hardware-imposed limit. HDFS, on the other hand, falls below AIS, with the gap widening as the number of DataLoader workers grows.

Paper Link: https://arxiv.org/pdf/2001.01858


Last updated: March 15, 2026

Questions or discussion? Email me

Characterizing Deep-Learning IO Workloads in TensorFlow
https://www.sethihemant.com/notes/tensorflow-dl-io-workloads-2018/ | Sat, 14 Mar 2026

Paper: Characterizing Deep-Learning IO Workloads in TensorFlow (https://arxiv.org/pdf/1810.03035)


Characterizing Deep-Learning I/O Workloads in TensorFlow(2018)

Paper Link: https://arxiv.org/abs/1810.03035

Ideas Explored (TLDR)

  • Three levers to fix TensorFlow I/O: read-thread parallelism, prefetching, and burst-buffer checkpointing.

Background

  • DL I/O != Traditional HPC I/O
  • HPC: Few large files, collective I/O, same input repeated (iterative solvers), frequent intermediate writes.
  • DL: Many small files, individual I/O, different batches each step, periodic checkpoints only. DL on HPC clusters is growing, so we need to understand I/O behavior before optimizing.

TensorFlow Data Pipeline

  • Dataset API - supports POSIX, HDFS, GCS, S3
  • Producer-consumer behavior: I/O pipeline (CPU) produces batches, training pipeline (GPU) consumes
  • Embarrassingly parallel - each file used by one worker, no collective I/O needed
  • tf.data.map(num_parallel_calls=N) - N threads doing individual file I/O + transforms
  • tf.data.interleave() - expands one entry into many downstream (e.g. TFRecord -> samples)
  • dataset.prefetch(N) - CPU runs ahead, buffers N batches in host memory, refills below threshold
  • prefetch_to_device('/gpu:0') - skip the host <-> device copy, prefetch directly into GPU memory.
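The producer-consumer behavior of dataset.prefetch(N) can be mimicked with the standard library alone (a sketch, not TensorFlow's implementation): a background thread, playing the CPU pipeline, keeps a bounded buffer of batches full while the consumer, playing the GPU, drains it.

```python
# Stdlib sketch of dataset.prefetch(N): a bounded queue decouples the
# producer (I/O + decode) from the consumer (training step).
import queue
import threading

def prefetch(batches, buffer_size=4):
    """Wrap an iterable so it is produced ahead of consumption."""
    buf = queue.Queue(maxsize=buffer_size)
    _END = object()

    def producer():
        for b in batches:
            buf.put(b)          # blocks when the buffer is full
        buf.put(_END)           # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _END:
            return
        yield item
```

As long as producing a batch is no slower than consuming one, the consumer never waits, which is the "I/O cost zeroed out" effect reported in the results below.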

Checkpointing

  • tf.train.Saver() generates 3 files: metadata (graph), index (tensor descriptors), data (weights)
  • No guaranteed flush to disk, no async checkpoint support (at the time)
  • Each snapshot ~600MB - bursty writes stall training
  • Burst Buffer solution: save to fast NVMe -> background copy to slow storage -> training resumes immediately
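The burst-buffer pattern above, as a minimal sketch (paths and function names are illustrative): training blocks only for the fast NVMe write; a background thread drains the checkpoint to slow storage.

```python
# Burst-buffer checkpoint sketch: fast local write, background drain.
import shutil
import threading

def checkpoint(state: bytes, fast_path: str, slow_path: str) -> threading.Thread:
    with open(fast_path, "wb") as f:
        f.write(state)                      # training stalls only for this write
    t = threading.Thread(target=shutil.copy, args=(fast_path, slow_path))
    t.start()                               # slow HDD/PFS copy overlaps training
    return t                                # join() before reusing fast_path
```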

Benchmarks & Setup

  • Blackdog (workstation): Xeon E5-2609v2, Quadro K4000, 72GB RAM, HDD + SSD + 480GB Optane NVMe, TF 1.10
  • Tegner (KTH HPC cluster): Xeon E5-2690v3, K80, 512GB RAM, Lustre FS, TF 1.10
  • Storage IOR baselines: HDD 163 MB/s, SSD 280 MB/s, Optane 1603 MB/s, Lustre 1968 MB/s
  • Micro-benchmark: 16,384 JPEG files, median 112KB (ImageNet subset)
  • AlexNet mini-app: Caltech-101, median 12KB images

Results

Threading

  • 1→2 threads doubles bandwidth
  • HDD hits diminishing returns beyond 4 threads → 2.3x max at 8 threads
  • Lustre scales much better → 7.8x at 8 threads (parallel reads across OSD targets)
  • Raw TF bandwidth well below IOR baseline → overhead in TF pipeline itself

Prefetching

  • Completely overlaps CPU I/O w/ GPU compute → I/O cost effectively zeroed out from wall time
  • With prefetch: runtime same regardless of storage type or thread count
  • Without prefetch: bursty read pattern, GPU idles waiting for data

Checkpointing

  • ~15% of total execution time w/o optimization
  • Lustre fastest, HDD slowest
  • Burst buffer (Optane) → 2.6x faster than direct HDD checkpoint
  • Background copy to HDD completes while training continues

Key Takeaways

  • DL I/O bottleneck is reads not writes - opposite of traditional HPC
  • Threading helps but is limited; Prefetching is the highest-leverage knob.
  • Burst buffer essential for checkpointing at scale - NVMe as staging tier
  • Prefetch hierarchy likely needed at scale: stage training data in burst buffer too.

Paper Link: https://arxiv.org/pdf/1810.03035



Megastore
https://www.sethihemant.com/notes/megastore-2011/ | Sun, 15 Dec 2024

Paper: Megastore (http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf)


Megastore

  • Google 2011

  • NoSQL database with strong consistency* (within 1 partition).

  • Replicas spread across data centers for fault tolerance.

  • Built on top of BigTable (+ GFS); Paxos (distributed consensus) is used for strong consistency.

  • We have seen multiple Paxos implementations before, Chubby, Single Sign On(SSO).

  • Motivations behind building Megastore

Single Leader Paxos

Pros

  • Piggybacking: the Prepare phase of write n+1 can be piggybacked on the commit of write n.
  • Local reads can be served from the master.

Cons

  • Follower replicas are just wasting resources.
  • Master failover takes a while and we need to wait for master lease timeout.
  • Servers that are not close to the master but are close to the end user still have to go through the master.

Megastore proposed Improvements

  • Writes can be proposed by ANY replica.
  • Reads can be initiated by any replica
  • No more need to handle Master Failover.

Entity Groups(Partitions)

  • Paxos maintains a distributed log across computers. We use it to create a database write ahead log.
  • If we use one log for the entire database, every write would compete to be the next spot in the log.
  • Partition the log by Entity Group.

Entity Group Example

Megastore allows you to do a ton of data denormalization because BigTable provides a flexible schema.

  • Helps keep related data co-located (contiguous) on a single computer, so that you don't need to perform any distributed joins (across partitions) or 2PC.

Cross Entity Group Semantics

Megastore allows doing Cross Entity Group(Partitions) writes.

  • Two Phase commit: Provides ACID guarantees but slow and not recommended.
  • Asynchronous via Message Queue: Preferred but not serializable.

Replication Overview

  • Recall: Prepare phase is used to determine which replica gets to write the current log entry.
  • Writes:

Write Leader

  • Since the leader replica for the next log cell is already determined during the previous commit request, when the actual write for log cell n+1 arrives at leader replica B, replicas A and C already know to accept writes only from B for that log cell.
  • This way, we don't need a 2PC to establish, via distributed consensus, which write goes into the next log slot.
  • Writers tend to make many writes at a time, so the next leader replica is chosen to be the one closest to the previous writer.
  • What if the Leader Replica goes down before the write for that phase goes through?

Write Leader Failure

  • What if B goes down before it actually makes a commit?
  • We use the concept of a Proposal number(Generation Number/Epoch Number).
  • Nodes A/C would instantly accept a write if they saw one from Node B for Proposal 0.
  • However, if Node B goes down, Proposal 0 never comes.
  • The coordinator (bottom of the picture) sees that Node B is down (failure detection) and asks A: let's start Proposal 1 for this log cell.
  • A can then reach out to C with Proposal 1.
  • C sees that Proposal 1 is higher than Proposal 0, so it can accept it.
  • Now A needs to do a fresh 2PC to C.
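The proposal-number rule can be sketched as a toy acceptor (a simplification for illustration, not Megastore's code): a replica takes a value only if the proposal number is at least the highest it has seen, so a new leader's higher proposal supersedes the failed leader's fast path.

```python
# Toy acceptor: higher proposal numbers win, stale ones are rejected.
class Acceptor:
    def __init__(self):
        self.promised = -1      # highest proposal number seen so far
        self.accepted = None    # (proposal, value) or None

    def accept(self, proposal: int, value) -> bool:
        if proposal >= self.promised:
            self.promised = proposal
            self.accepted = (proposal, value)
            return True
        return False            # stale proposal from a failed leader
```

In the scenario above, C fast-accepts the expected leader B at Proposal 0; once B fails and A runs Proposal 1, any late Proposal-0 write from B is rejected.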

Invalidating Coordinators

  • If a write at a replica doesn’t go through, that replica cannot serve the local reads.
  • Every replica has a process in its local datacenter, called the coordinator.
  • Coordinator keeps track of all entity groups for which this replica is up to date.
  • If a write fails on a replica, we must alert its coordinator before the write can proceed.
  • For each replica, either the replica itself or its coordinator must respond to unblock the write.

Coordinator Failure

  • What happens if a coordinator and its replica go down (data center failure)?
  • Solution:
  • The coordinator grabs Chubby locks in multiple other data centers.
  • If the coordinator loses its connection (locks) to the majority of data centers, then it knows it is probably partitioned from the rest of the replicas.
  • Writes can then proceed w/o that replica, since the failed coordinator/replica knows it can't serve local reads.
  • Not perfect; edge case.

Reading Data

  • The coordinator's responsibility is to answer: can my local replica serve reads for this particular entity group, or do you have to read from another replica or do a quorum read?
  • If the coordinator says the replica is up to date, read locally from it. Otherwise:
  • Figure out the last known log entry by doing quorum read.
  • Pick a replica(either most up to date, or most responsive).
  • If the selected replica is behind, we are going to read from another replica to catch up its log, and then tell its coordinator that it is valid now(after catch up).
  • Perform local reads on that replica.
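The read path above, condensed into a sketch. The dict/set stand-ins for the coordinator's validity state, the local replica, and the quorum read are mine, purely for illustration:

```python
# Read path sketch: local read when valid, else quorum read + catch-up.
def read(key, coordinator_valid, local_replica, quorum):
    """coordinator_valid: set of entity groups the local replica is
    current for. local_replica / quorum: dict stand-ins for the replica
    state and the result of a quorum read."""
    if key in coordinator_valid:
        return local_replica[key]        # fast local read
    local_replica[key] = quorum[key]     # quorum read + replay missing log
    coordinator_valid.add(key)           # coordinator marks replica valid
    return local_replica[key]
```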

Catching up Stale Replica

Production Experience


Paper Link: http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf



Spanner
https://www.sethihemant.com/notes/spanner-2012/ | Sat, 14 Dec 2024

Paper: Spanner (https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf)


Spanner: Google’s Globally-Distributed Database

Abstract

  • Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database which supports externally consistent (linearizable) distributed transactions.
  • Paper describes how Spanner is Structured, feature set, rationale behind various design decisions, and a Novel Time API that exposes clock certainty.

Introduction

  • Spanner shards data across many sets of Paxos state machines in DCs spread across the world.
  • Replication for global availability and geographic locality, clients automatically failover between replicas.
  • Automatically reshards data across machines as the amount of data or the number of servers changes, and it automatically migrates data across machines (even across datacenters) to balance load and in response to failures.
  • Designed to scale up to millions of machines across hundreds of data centers and trillions of database rows.
  • Applications can use Spanner for high availability, even in the face of wide-area natural disasters, by replicating their data within or even across continents.
  • BigTable problems
  • Megastore supports semi-relational data model and synchronous replication, despite its relatively poor write throughput.
  • Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multi-version database.
  • Globally distributed features:
  • TrueTime API and its implementation(Key enabler of the above properties)

Spanner Implementation

  • Directory abstraction(unit of data movement) to manage replication and locality.

  • Data model. Spanner looks like a relational database instead of a key-value store.

  • Applications can control data locality.

  • A Spanner deployment is called a universe.

  • Spanner is organized as a set of zones.

  • A zone has one zonemaster (assigns data to spanservers) and between one hundred and several thousand spanservers (serve data to clients).

  • Per-zone location proxies are used by clients to locate the spanservers assigned to serve their data.

  • Universe master(Singleton) is primarily a console that displays status information about all the zones for interactive debugging

  • Placement driver(Singleton) handles automated movement of data across zones on the timescale of minutes.

SpanServer Software Stack

  • Spanserver implementation to illustrate how replication and distributed transactions have been layered onto our Bigtable-based implementation.

  • Each spanserver is responsible for between 100 and 1000 instances of a data structure called a tablet.

  • A tablet is similar to Bigtable’s tablet abstraction, in that it implements a bag of the following mappings

  • Unlike Bigtable, Spanner assigns timestamps to data which is why

  • A Spanner tablet’s state is stored in a set of B-tree-like files and a write-ahead log, all on a distributed file system called Colossus (the successor to the Google File System).

  • To support replication, each spanserver implements a single Paxos state machine on top of each tablet.

  • Each state machine stores its metadata and log in its corresponding tablet

  • The Paxos implementation supports long-lived leaders with time-based leases (default: 10 s).

  • The current Spanner implementation logs every Paxos write twice: once in the tablet’s log and once in the Paxos log.

  • Implementation of Paxos is pipelined, so as to improve Spanner’s throughput in the presence of WAN latencies; but writes are applied by Paxos in order.

  • The Paxos state machines are used to implement a consistently replicated bag of mappings.

  • Writes must initiate the Paxos protocol at the leader;

  • Reads access state directly from the underlying tablet (when it is sufficiently up to date).

  • Set of replicas is collectively a Paxos group.

  • At leader replica, each spanserver implements a lock table for concurrency control.

  • Bigtable and Spanner are designed for long-lived transactions (e.g. report generation, which might take on the order of minutes), which perform poorly under optimistic concurrency control in the presence of conflicts. (What?)

  • Operations that require synchronization, such as transactional reads, acquire locks in the lock table; other operations bypass the lock table.

  • Each spanserver(at leader replica) implements a transaction manager to support distributed transactions.

  • If a transaction involves only one Paxos group (as is the case for most transactions), it can bypass the transaction manager, since the lock table and Paxos together provide transactionality.

  • If a transaction involves more than one Paxos group, those groups’ leaders coordinate to perform a two-phase commit.

  • The state of each transaction manager is stored in the underlying Paxos group (and therefore is replicated).

Directories and Placement

  • On top of the bag of key-value mappings, the Spanner implementation supports a bucketing abstraction(Directory), which is a set of contiguous keys that share a common prefix.

  • A directory is the unit of data placement.

  • The fact that a Paxos group may contain multiple directories implies that a Spanner tablet is different from a Bigtable tablet. Former is not necessarily a single lexicographically contiguous partition of the row space.

  • Movedir is the background task used to move directories between Paxos groups.

  • Application specifies a directory’s geographic-replication placement.

  • The design of placement-specification language separates responsibilities for managing replication configurations.

  • An application controls how data is replicated, by tagging each database and/or individual directories with a combination of those options.

  • Spanner will Shard/Partition a directory into multiple fragments if it grows too large.

Data Model

  • Spanner offers a

  • DataModel Use Case:

  • This interleaving of tables to form directories is significant because it allows clients to describe the locality relationships that exist between multiple tables, which is necessary for good performance in a sharded, distributed database. Without it, Spanner would not know the most important locality relationships.

TrueTime

  • TrueTime explicitly represents time as a TTinterval, an interval with bounded time uncertainty (unlike standard time interfaces, which give clients no notion of uncertainty).

  • The endpoints of a TTinterval are of type TTstamp.

  • The time epoch is analogous to UNIX time with leap-second smearing.

  • The underlying time references used by TrueTime are GPS and atomic clocks because they have different failure modes.

  • TrueTime is implemented by a set of time master machines per datacenter and a timeslave daemon per machine.

  • All masters’ time references are regularly compared against each other.

  • Every daemon polls a variety of masters to reduce vulnerability to errors from any one master.

  • Uncertainty range is 1-7 ms (about 4 ms most of the time) at a daemon poll interval of 30 s; the applied drift rate is 200 microseconds/second.
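A sketch of the TrueTime interface as the paper describes it; the Python shape here is an assumption, with epsilon modeling the uncertainty bound. commit_wait shows the mechanism a leader can use to wait out the uncertainty so a commit timestamp is definitely in the past before it releases locks (this is what enforces external consistency):

```python
# Sketch of TT.now()/TT.after() plus commit wait. Illustrative only.
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    def __init__(self, epsilon_s: float = 0.004):  # ~4 ms typical uncertainty
        self.epsilon = epsilon_s

    def now(self) -> TTInterval:
        t = time.time()
        # Real time is guaranteed to lie within [earliest, latest].
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True only once t has definitely passed."""
        return self.now().earliest > t

    def commit_wait(self, s: float, poll_s: float = 0.0005) -> None:
        """Block until commit timestamp s is definitely in the past."""
        while not self.after(s):
            time.sleep(poll_s)
```

With a 4 ms epsilon, commit wait costs roughly 2 * epsilon, which is why keeping clock uncertainty small matters so much to Spanner's write latency.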

Concurrency Control

  • TrueTime API is used to guarantee correctness properties around concurrency control, and how those properties are used to implement features such as externally consistent transactions, lock-free read-only transactions, and non-blocking reads in the past.
  • Important to distinguish writes as seen by Paxos Writes vs Spanner client writes.

Timestamp Management

  • Spanner supports:

Paxos Leader Leases

  • Spanner’s Paxos implementation uses timed leases to make leadership long-lived (10 seconds by default).
  • Leader Election

Assigning Timestamps to RW Transactions

  • Transactional reads and writes use two-phase locking (reads & writes block each other).
  • As a result, they can be assigned timestamps at any time when all locks have been acquired, but before any locks have been released.
  • For a given transaction, Spanner assigns it the timestamp that Paxos assigns to the Paxos write that represents the transaction commit.
  • Spanner depends on the following monotonicity invariant: within each Paxos group, Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders.
  • Spanner also enforces the following external-consistency invariant: if the start of a transaction T2 occurs after commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1.

Serving Reads at a Timestamp

Assigning Timestamps to RO Transactions

Details

Read Write Transactions

  • Like Bigtable, writes that occur in a transaction are buffered at the client until commit.
  • As a result, reads in a transaction do not see the effects of the transaction’s writes. This design works well in Spanner because a read returns the timestamps of any data read, and uncommitted writes have not yet been assigned timestamps.
  • Reads within read-write transactions use wound-wait to avoid deadlocks.
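Wound-wait resolves lock conflicts by transaction age so that waits-for cycles cannot form: an older requester "wounds" (aborts) a younger lock holder, while a younger requester simply waits. A minimal sketch of the decision rule, with illustrative names:

```python
def on_lock_conflict(requester_ts: int, holder_ts: int) -> str:
    """Smaller timestamp = older transaction. Older transactions never
    wait behind younger ones, so no deadlock cycle can form."""
    if requester_ts < holder_ts:
        return "wound"  # requester is older: abort the younger holder
    return "wait"       # requester is younger: wait for the lock
```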

Jordan: Google Spanner (2013)

  • Strongly consistent SQL Database via Paxos.

  • Supports causally consistent non-blocking read-only snapshots over multiple nodes at a time even though they are distributed. This is not something you could do in a traditional database.

Causal Consistency

  • Write B is causally dependent on Write A if

  • Can be achieved by using Lamport Clocks.

  • Spanner is both externally and causally consistent.

  • The order of writes to the database is the order in which the events actually happened.

  • Formally:

Spanner Details

  • Spanner’s design looks similar to Megastore’s in how it ensures strong consistency.

Paper Link: https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf


Last updated: March 15, 2026

Questions or discussion? Email me

DynamoDB
https://www.sethihemant.com/notes/dynamodb-2022/
Wed, 11 Dec 2024

Paper: DynamoDB (https://www.usenix.org/conference/atc22/presentation/elhemali)


DynamoDB

Summary/Abstract

  • Amazon DynamoDB is a NoSQL cloud database service that provides consistent performance at any scale.
  • Fundamental properties: consistent performance, availability, durability, and a fully managed serverless experience.
  • In 2021, during the 66-hour Amazon Prime Day shopping event, DynamoDB handled a peak of 89.2 million requests per second while maintaining high availability and single-digit millisecond performance.
  • Design and implementation of DynamoDB have evolved since the first launch in 2012. The system has successfully dealt with issues related to fairness, traffic imbalance across partitions, monitoring, and automated system operations without impacting availability or performance.

Introduction

  • The goal of the design of DynamoDB is to complete all requests with low single-digit millisecond latencies.
  • DynamoDB uniquely integrates the following six fundamental system properties:
  • DynamoDB is a fully managed cloud service.
  • DynamoDB employs a multi-tenant architecture.
  • DynamoDB achieves boundless scale for tables.
  • DynamoDB provides predictable performance.
  • DynamoDB is highly available.
  • DynamoDB supports flexible use cases.
  • DynamoDB evolved as a distributed database service to meet the needs of its customers without losing its key aspect of providing a single-tenant experience to every customer using a multi-tenant architecture.
  • The paper explains the challenges faced by the system and how the service evolved to handle those challenges while connecting the required changes to a common theme of durability, availability, scalability, and predictable performance.

History

  • The design of DynamoDB was motivated by experiences with its predecessor Dynamo, which was created in response to the need for a highly scalable, available, and durable key-value database for shopping-cart data.
  • Amazon learned that providing applications with direct access to traditional enterprise database instances led to scaling bottlenecks such as connection management, interference between concurrent workloads, and operational problems with tasks such as schema upgrades.
  • Service Oriented Architecture was adopted to encapsulate an application’s data behind service-level APIs that allowed sufficient decoupling to address tasks like reconfiguration without having to disrupt clients.
  • DynamoDB took the principles from Dynamo (which was run as a self-hosted DB and created operational burden for developers) and SimpleDB (a fully managed elastic NoSQL database service whose data model couldn’t scale to the demands of the large tables DDB needed).
  • Dynamo Limitations:
  • SimpleDB limitations:
  • Amazon concluded that a better solution would combine the best parts of the original Dynamo design (incremental scalability and predictable high performance) with the best parts of SimpleDB (ease of administration of a cloud service, consistency, and a table-based data model that is richer than a pure key-value store).

Architecture

  • A DynamoDB table is a collection of items.

  • Each item is a collection of attributes. Uniquely identified by a primary key.

  • Schema of the primary key is specified at the table creation time.

  • The partition key’s value is always used as an input to an internal hash function.

  • The output from the hash function and the sort key value (if present) determines where the item will be stored.

  • Multiple items can have the same partition key value in a table with a composite primary key. However, those items must have different sort key values.
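The placement scheme above can be sketched as follows; the hash function, partition count, and boundary layout are illustrative assumptions, not DynamoDB internals:

```python
import hashlib
from bisect import bisect_right

NUM_PARTITIONS = 4
HASH_SPACE = 2 ** 32
# Assumed even split of the hash space into contiguous per-partition ranges.
BOUNDARIES = [HASH_SPACE * i // NUM_PARTITIONS for i in range(1, NUM_PARTITIONS)]

def partition_for(partition_key: str) -> int:
    """The partition key is hashed; the hash value determines which
    partition's key-range the item falls into."""
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:4], "big")
    return bisect_right(BOUNDARIES, h)
```

Items sharing a partition key always land in the same partition; with a composite primary key, the sort key then orders (and disambiguates) items within it.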

  • Supports secondary indexes to provide enhanced querying capability, which allows querying the data in the table using an alternate key.

  • DynamoDB provides a simple interface to store or retrieve items from a table or an index.

  • DynamoDB supports ACID transactions for multi-item updates w/o affecting scalability/availability/performance.

  • A DynamoDB table is divided into multiple partitions.

  • Each partition of the table hosts a disjoint and contiguous part of the table’s key-range and has multiple replicas(replication Group) distributed across different Availability Zones for high availability and durability.

  • The Replication Group uses Multi-Paxos for leader election and consensus.

  • Any replica can trigger a round of the election.

  • Once elected leader, a replica can maintain leadership as long as it periodically renews its leadership lease.

  • Only the leader replica can serve write and strongly consistent read requests.

  • Leader generates a write-ahead log record and sends it to its peers.

  • Write is acknowledged to the application once a quorum of peers persists the log record to their local write-ahead logs.

  • DynamoDB supports strong (leader read) and eventually consistent (replica read) reads.

  • The leader of the group extends its leadership using a lease mechanism.

  • If the leader of the group is detected as a failure (considered unhealthy or unavailable) by any of its peers, the peer can propose a new round of the election to elect itself as the new leader. The new leader won’t serve any writes or consistent reads until the previous leader’s lease expires.

  • Partitioning/Replication Group

  • Log Replica/Node - Write Ahead Log(replicated) for High Availability and Durability.

  • Multi-Paxos Leader Election and Consensus.

  • Writes and Strongly/Eventually Consistent Reads

  • Microservice architecture

  • Metadata Service

  • Request Router Service

  • Auto-Admin Service(Central Nervous System of DDB)

  • Storage Service

  • Features supported by other Services

Journey from Provisioned to On-Demand

  • DDB was launched with Partitions as an internal abstraction, as a way to dynamically scale both the capacity and performance of tables.

  • Customers explicitly specified the throughput that a table required in terms of read capacity units (RCUs) and write capacity units (WCUs). RCUs and WCUs collectively are called provisioned throughput.

  • As the demands from a table changed (because it grew in size or because the load increased), partitions could be further split and migrated to allow the table to scale elastically. Partition abstraction proved to be really valuable and continues to be central to the design of DynamoDB.

  • [Challenge] This early version tightly coupled the assignment of both capacity and performance to individual partitions, which led to challenges

  • DynamoDB uses admission control to ensure that storage nodes don’t become overloaded, to avoid interference between co-resident table partitions, and to enforce the throughput limits requested by customers.

  • Admission control was the shared responsibility of all storage nodes for a table. Storage nodes independently performed admission control based on the allocations of their locally stored partitions.

  • Allocated throughput of each partition was used to isolate the workloads. DynamoDB enforced a cap on the maximum throughput that could be allocated to a single partition. Total throughput of all the partitions hosted by a storage node is less than or equal to the maximum allowed throughput on the node as determined by the physical characteristics of its storage drives.

  • The throughput allocated to partitions was adjusted when the overall table’s throughput was changed or its partitions were split into child partitions.

  • When a partition was split for size, the allocated throughput of the parent partition was equally divided among the child partitions and was allocated based on the table’s provisioned throughput.

  • E.g., assume that a partition can accommodate a maximum provisioned throughput of 1000 WCUs. When a table is created with 3200 WCUs, DynamoDB created four partitions that would each be allocated 800 WCUs. If the table’s provisioned throughput was increased to 3600 WCUs, then each partition’s capacity would increase to 900 WCUs. If the table’s provisioned throughput was increased to 6000 WCUs, then the partitions would be split to create eight child partitions, and each partition would be allocated 750 WCUs. If the table’s capacity was decreased to 5000 WCUs, then each partition’s capacity would be decreased to 625 WCUs (5000/8).
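The arithmetic of this example can be reproduced with a small sketch; the double-on-split policy (each parent splits into two children) is an assumption that matches the numbers in the example:

```python
MAX_WCU_PER_PARTITION = 1000  # assumed per-partition cap from the example

def allocate(table_wcus: int, partitions: int) -> tuple[int, float]:
    """Static allocation from the early provisioned model: divide table
    throughput evenly; if the even share exceeds the per-partition cap,
    split (double) until it fits. Partitions are never merged."""
    while table_wcus / partitions > MAX_WCU_PER_PARTITION:
        partitions *= 2
    return partitions, table_wcus / partitions

print(allocate(3200, 4))  # (4, 800.0)
print(allocate(3600, 4))  # (4, 900.0)
print(allocate(6000, 4))  # (8, 750.0)  -- split forced by the 1000 WCU cap
print(allocate(5000, 8))  # (8, 625.0)  -- partitions don't merge on decrease
```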

  • The uniform distribution of throughput across partitions is based on the assumptions that an application uniformly accesses keys in a table and that splitting a partition for size equally splits the performance.

  • However, it was discovered that application workloads frequently have non-uniform access patterns both over time and over key ranges.

  • Hot Partition Worsening with Split: When the request rate within a table is non-uniform, splitting a partition and dividing performance allocation proportionately can result in the hot portion of the partition having less available performance than it did before the split.

  • [Single Hot Partition] Since throughput was allocated statically and enforced at a partition level, these non-uniform workloads occasionally resulted in an application’s reads and writes being rejected, called throttling, even though the total provisioned throughput of the table was sufficient to meet its needs. Common Challenges faced by the applications were:

  • Hot Partition

  • Throughput Dilution.

  • Customers would increase the provisioned throughput of the table(even if they were under the limit overall), which caused poor performance. It was difficult to estimate the correct provisioned throughput.

  • Hot partitions and throughput dilution stemmed from tightly coupling a rigid performance allocation to each partition and dividing that allocation as partitions split. Bursting and Adaptive Capacity were introduced to address these concerns.

Improvements to Admission Control:

Key Observations:

  • Partitions had non-uniform access/traffic.
  • Not all partitions hosted by a storage node used their allocated throughput simultaneously.

Bursting

  • The idea behind Bursting was to let applications tap into the unused capacity at a partition level on a best effort basis to absorb short-lived spikes.
  • DynamoDB retained a portion of a partition’s unused capacity for later bursts of throughput usage for up to 300 seconds and utilized it when consumed capacity exceeded the provisioned capacity of the partition.
  • DynamoDB still maintained workload isolation by ensuring that a partition could only burst if there was unused throughput at the node level. The capacity was managed on the storage node using multiple token buckets to provide admission control:
  • [Partition Token + Node Token] When a read or write request arrives on a storage node, if there were tokens in the partition’s allocated token bucket, then the request was admitted and tokens were deducted from the partition and node level buckets.
  • [Burst Token + Node Token] Once a partition had exhausted all the provisioned tokens, requests were allowed to burst only when tokens were available both in the burst token bucket and the node level token bucket.
  • Read requests were accepted based on the local token buckets.
  • [Replica node’s Token Bucket for Write] Write requests using burst capacity require an additional check on the node-level token bucket of other member replicas of the partition.
  • The leader replica of the partition periodically collected information about each of the members node-level capacity.
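The multi-bucket admission check can be sketched as below; the bucket class and refill logic are simplified assumptions, and the additional cross-replica node-bucket check for writes is omitted:

```python
class Bucket:
    def __init__(self, tokens: float):
        self.tokens = tokens

    def take(self, n: float) -> bool:
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

def admit(partition: Bucket, burst: Bucket, node: Bucket, cost: float = 1.0) -> bool:
    """Admit from the partition's provisioned bucket first; burst only if
    both the burst bucket and the node-level bucket still have tokens."""
    if node.tokens < cost:
        return False             # node saturated: no admission at all
    if partition.take(cost):
        node.tokens -= cost      # normal provisioned-capacity path
        return True
    if burst.take(cost):
        node.tokens -= cost      # best-effort burst path
        return True
    return False
```

The node-level bucket is what preserves workload isolation: a partition can only burst into capacity its co-resident neighbors are not using.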

Adaptive Capacity

  • DynamoDB launched adaptive capacity to better absorb long-lived spikes that cannot be absorbed by the burst capacity.
  • Better absorb work-loads that had heavily skewed access patterns across partitions.
  • Adaptive capacity actively monitored the provisioned and consumed capacity of all the tables.
  • If a table experienced throttling and the table level throughput was not exceeded, then it would automatically increase (boost) the allocated throughput of the partitions of the table using a proportional control algorithm.
  • The autoadmin system ensured that partitions receiving a boost were relocated to an appropriate node that had the capacity to serve the increased throughput. Like bursting, adaptive capacity was best-effort, but it eliminated over 99.99% of the throttling due to skewed access patterns.
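The proportional-control idea can be sketched as: boost a throttled partition's allocation in proportion to the observed throttling, capped by the table-level headroom. The gain and the cap policy here are illustrative assumptions, not the paper's actual controller:

```python
def boosted_allocation(allocated: float, throttled: float,
                       table_headroom: float, gain: float = 0.5) -> float:
    """Proportional controller sketch: the 'error' is the throttled request
    rate observed while the table-level throughput is not exceeded; the
    boost is proportional to it, capped by remaining table headroom."""
    boost = min(gain * throttled, table_headroom)
    return allocated + boost
```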

Global Admission Control

  • Even though Bursting and Adaptive Capacity significantly reduced throughput problems for non-uniform access, they had limitations.
  • Takeaway from bursting and adaptive capacity was that we had tightly coupled partition level capacity to admission control.
  • Admission control was distributed and performed at a partition level.
  • DynamoDB realized it would be beneficial to remove admission control from the partition and let the partition always burst while providing workload isolation.
  • DynamoDB replaced adaptive capacity with global admission control (GAC).
  • GAC builds on the same idea of Token Bucket.
  • The GAC service centrally tracks the total consumption of the table capacity in terms of tokens.
  • Each request router maintains a local token bucket to make admission decisions and communicates with GAC to replenish tokens at regular intervals (in the order of a few seconds).
  • [Important Design Consideration] Each GAC server can be stopped and restarted without any impact on the overall operation of the service.
  • Each GAC server can track one or more token buckets configured independently.
  • All the GAC servers are part of an independent hash ring.
  • Request routers manage several time-limited tokens locally. When a request from the application arrives, the request router deducts tokens. Eventually the router runs out of tokens, through consumption or expiry, and requests more from GAC.
  • The GAC instance uses the information provided by the client to estimate the global token consumption and vends tokens available for the next time unit to the client’s share of overall tokens.
  • Thus, it ensures that non-uniform workloads that send traffic to only a subset of items can execute up to the maximum partition capacity.
  • In addition to the global admission control scheme, the partition-level token buckets were retained for defense in-depth. The capacity of these token buckets is then capped to ensure that one application doesn’t consume all or a significant share of the resources on the storage nodes.
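The router/GAC token flow described above can be sketched as follows; the proportional-share vending policy and all class/method names are assumptions for illustration:

```python
class GacServer:
    def __init__(self, table_capacity_per_interval: float):
        self.capacity = table_capacity_per_interval
        self.reported_consumption: dict[str, float] = {}

    def vend(self, router_id: str, consumed_last_interval: float) -> float:
        """Routers report recent consumption; GAC estimates each router's
        share of the next interval's table capacity from those reports.
        (Floor of 1.0 so a fresh router gets a nonzero share.)"""
        self.reported_consumption[router_id] = max(consumed_last_interval, 1.0)
        total = sum(self.reported_consumption.values())
        share = self.reported_consumption[router_id] / total
        return self.capacity * share

class RequestRouter:
    def __init__(self, router_id: str, gac: GacServer):
        self.id, self.gac = router_id, gac
        self.tokens = 0.0
        self.consumed = 0.0

    def admit(self, cost: float = 1.0) -> bool:
        # Serve from the local bucket; replenish from GAC when it runs out.
        if self.tokens < cost:
            self.tokens += self.gac.vend(self.id, self.consumed)
            self.consumed = 0.0
        if self.tokens >= cost:
            self.tokens -= cost
            self.consumed += cost
            return True
        return False
```

Because admission is tracked at table level rather than per partition, a skewed workload hitting a subset of items can run up to the maximum partition capacity without table-level throttling.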

Balancing Consumed Capacity

  • Letting partitions always burst required DynamoDB to manage burst capacity effectively.
  • Colocation was a straightforward problem with provisioned throughput tables because of static partitions.

Splitting for Consumption

  • [Problem] Even with GAC and the ability for partitions to always burst, tables could experience throttling if their traffic was skewed to a specific set of items.
  • [Solution]
  • DynamoDB automatically scales out partitions once the consumed throughput of a partition crosses a certain threshold.
  • The split point in the key range is chosen based on key distribution the partition has observed.
  • The observed key distribution serves as a proxy for the application’s access pattern and is more effective than splitting the key range in the middle.
  • Partition splits usually complete in the order of minutes.
  • [Catch] A class of workloads still exists that cannot benefit from split for consumption, e.g. a partition receiving high traffic to a single item, or a partition whose key range is accessed sequentially. DDB avoids splitting such partitions.
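Choosing the split point from the observed key distribution, rather than the midpoint of the key range, can be sketched as: split where cumulative observed load reaches half, so each child inherits roughly equal traffic. The histogram representation is an assumed simplification:

```python
def split_point(key_load: dict[str, int]) -> str:
    """key_load maps keys (in sortable order) to observed request counts.
    Returns the key after which to split so each child gets ~half the load."""
    total = sum(key_load.values())
    running = 0
    for key in sorted(key_load):
        running += key_load[key]
        if running >= total / 2:
            return key
    return max(key_load)
```

With a hot key at the start of the range (e.g. 90% of traffic on "a"), this splits right after the hot portion instead of at the midpoint "b", which would leave the hot child as overloaded as the parent.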

On Demand Provisioning

  • [Context]
  • Initially, applications migrated to DDB from self-provisioned servers, either on-prem or self-hosted databases.
  • DynamoDB provides a simplified serverless operational model and a new model for provisioning - read and write capacity units.
  • [Problem]
  • The concept of capacity units was new to customers, some found it challenging to forecast the provisioned throughput.
  • Customers either over provisioned(Low utilization) or under provisioned(Throttling).
  • [Solution] To improve the customer experience for spiky workloads, DDB launched On-Demand Tables.
  • DynamoDB provisions the on-demand tables based on the consumed capacity by collecting the signal of reads and writes and instantly accommodates up to double the previous peak traffic on the table.
  • On-demand scales a table by splitting partitions for consumption. The split decision algorithm is based on traffic.
  • GAC allows DynamoDB to monitor and protect the system from one application consuming all the resources.

Durability and Correctness

  • Data loss can occur because of hardware failures, software bugs, or hardware bugs.
  • DynamoDB is designed for high durability by having mechanisms to prevent, detect, and correct any potential data losses.

Hardware Failures

  • Write-ahead logs(WAL) in DynamoDB are central for providing durability and crash recovery. Write ahead logs are stored in all three replicas of a partition.
  • For higher durability, the write-ahead logs are periodically archived to S3, an object store that is designed for 11 nines (99.999999999%) of durability.
  • The unarchived logs are typically a few hundred megabytes in size.
  • When a node fails, all replication groups hosted on the node are down to two copies.
  • The process of healing a storage replica can take several minutes because the repair process involves copying the B-tree and write-ahead logs.
  • [Solution] Upon detecting an unhealthy storage replica, the leader of a replication group adds a log replica to ensure there is no impact on durability.
  • Adding a log replica takes only a few seconds because the system has to copy only the recent write-ahead logs from a healthy replica to the new replica without the B-tree. Quick healing of impacted replication groups using log replicas ensures high durability of most recent writes.

Silent Data Errors

  • [Problem] Some hardware failures can cause incorrect data to be stored. These errors can happen because of the storage media, CPU, or memory.
  • It’s very difficult to detect these and they can happen anywhere in the system.
  • [Solution] DynamoDB makes extensive use of checksums to detect silent errors.
  • By maintaining checksums within every log entry, message, and log file, DynamoDB validates data integrity for every data transfer between two nodes.
  • Checksums serve as guardrails to prevent errors from spreading to the rest of the system.
  • Every log file that is archived to S3 has a manifest that contains information about the log, such as a table, partition and start and end markers for the data stored in the log file.
  • The agent responsible for archiving log files to S3 performs various checks before uploading the data. These include, but are not limited to, verification of every log entry to ensure that it belongs to the correct table and partition, verification of checksums to detect any silent errors, and verification that the log file doesn’t have any holes in the sequence numbers.
  • Once all the checks are passed, the log file and its manifest are archived. Log archival agents run on all three replicas of the replication group. If one of the agents finds that a log file is already archived, the agent downloads the uploaded file to verify the integrity of the data by comparing it with its local write-ahead log.
  • Every log file and manifest file are uploaded to S3 with a content checksum. The content checksum is checked by S3 as part of the put operation, which guards against any errors during data transit to S3.
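The pre-archival checks can be sketched under assumed record shapes; the real log format and checksum algorithm are not specified in the notes, so both are illustrative:

```python
import hashlib
import json

def entry_checksum(entry: dict) -> str:
    """Checksum over all fields except the stored checksum itself."""
    payload = json.dumps({k: v for k, v in entry.items() if k != "checksum"},
                         sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def verify_log(entries: list[dict], table: str, partition: str) -> bool:
    """Checks before archiving: every entry belongs to the right table and
    partition, checksums match (no silent corruption), and sequence
    numbers are contiguous (no holes)."""
    expected_seq = entries[0]["seq"]
    for e in entries:
        if e["table"] != table or e["partition"] != partition:
            return False
        if e["checksum"] != entry_checksum(e):
            return False
        if e["seq"] != expected_seq:
            return False
        expected_seq += 1
    return True
```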

Continuous Verification

  • DynamoDB also continuously verifies data at rest, with the goal of detecting silent data errors or bit rot in the system. An example of such a continuous verification system is the scrub process.
  • The goal of scrub is to detect errors that were not anticipated, such as bit rot.
  • The scrub process runs and verifies two things:
  • The verification is done by computing the checksum of the live replica and matching that with a snapshot of one generated from the log entries archived in S3.
  • Scrub acts as a defense in depth to detect divergences between the live storage replicas with the replicas built using the history of logs from the inception of the table.
  • A similar technique of continuous verification is used to verify replicas of global tables.
  • We have learned that continuous verification of data-at-rest is the most reliable method of protecting against hardware failures, silent data corruption, and even software bugs.

Software Bugs

  • [Problem] DDB is a complex Distributed Key Value store. High complexity increases the probability of human error in design, code, and operations. Errors in the system could cause loss or corruption of data, or violate other interface contracts that our customers depend on.
  • [Solution] DDB uses formal methods extensively to ensure the correctness of our replication protocols. The core replication protocol was specified using TLA+.
  • When new features that affect the replication protocol are added, they are incorporated into the specification and model checked.
  • Model checking has allowed us to catch subtle bugs that could have led to durability and correctness issues before the code went into production. S3 also uses Model Checking.
  • Extensive failure injection testing and stress testing to ensure the correctness of every piece of software deployed.
  • In addition to testing and verification of the replication protocol of the data plane, formal methods have also been used to verify the correctness of our control plane and features such as distributed transactions.

Backups and Restore

  • In addition to guarding against physical media corruption, DynamoDB also supports backup and restore to protect against any logical corruption due to a bug in a customer’s application. Backups or restores don’t affect performance or availability of the table as they are built using the write-ahead logs that are archived in S3.
  • The backups are consistent across multiple partitions up to the nearest second.
  • The backups are full copies of DynamoDB tables and are stored in an Amazon S3 bucket.
  • DynamoDB also supports point-in-time restore where customers can restore the contents of a table that existed at any time in the previous 35 days to a different DynamoDB table in the same region.
  • For tables with point-in-time restore enabled, DynamoDB creates periodic snapshots (based on the amount of write-ahead logs accumulated for the partition) of the partitions that belong to the table and uploads them to S3.
  • Snapshots, in conjunction to write-ahead logs, are used to do point-in-time restore.
  • [Workflow] When a point-in-time restore is requested for a table,

Availability

  • To achieve high availability, DynamoDB tables are distributed and replicated across multiple Availability Zones (AZ) in a Region. DynamoDB regularly tests resilience to node, rack, and AZ failures.
  • To test the availability and durability of the overall service, power-off tests are exercised. Using realistic simulated traffic, random nodes are powered off using a job scheduler. At the end of all the power-off tests, the test tools verify that the data stored in the database is logically valid and not corrupted.

Write and Consistent Read Availability

  • A partition’s write availability depends on its ability to have a healthy leader and a healthy write quorum.
  • A healthy write quorum in the case of DynamoDB consists of two out of the three replicas from different AZs.
  • A partition remains available as long as there are enough healthy replicas for a write quorum and a leader.
  • A partition will become unavailable for writes if the number of replicas needed to achieve the minimum quorum is unavailable.
  • The leader replica serves consistent reads.
  • Introducing log replicas was a big change to the system, and the formally proven implementation of Paxos provided us the confidence to safely tweak and experiment with the system to achieve higher availability
  • Eventually consistent reads can be served by any of the replicas.
  • In case a leader replica fails, other replicas detect its failure and elect a new leader to minimize disruptions to the availability of consistent reads.

Failure Detection

  • [Problem] A newly elected leader will have to wait for the expiry of the old leader’s lease before serving any traffic. While this only takes a couple of seconds, the elected leader cannot accept any new writes or consistent read traffic during that period, thus disrupting availability.
  • Failure detection must be quick and robust to minimize disruptions. False positives in failure detection can lead to more disruptions in availability. Failure detection works well for failure scenarios where every replica of the group loses connection to the leader.
  • However, nodes can experience gray network failures (Gray Failure).
  • Gray network failures can happen because of communication issues between a leader and follower, issues with outbound or inbound communication of a node, or front-end routers facing communication issues with the leader even though the leader and followers can communicate with each other.
  • Gray failures can disrupt availability because there might be a false positive in failure detection or no failure detection
  • For example, a replica that isn’t receiving heartbeats from a leader will try to elect a new leader. This can disrupt availability.
  • [Solution] To solve the availability problem caused by gray failures, a follower that wants to trigger a failover sends a message to other replicas in the replication group asking if they can communicate with the leader. If replicas respond with a healthy leader message, the follower drops its attempt to trigger a leader election. This change in the failure detection algorithm used by DynamoDB significantly minimized the number of false positives in the system, and hence the number of spurious leader elections.
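The modified failure-detection rule reduces to: a follower that misses heartbeats only triggers an election if no peer reports a healthy leader. As a one-function sketch (the reply representation is an assumption):

```python
def should_trigger_election(peers_see_healthy_leader: list[bool]) -> bool:
    """peers_see_healthy_leader: replies from the other replicas in the
    replication group to 'can you communicate with the leader?'. If any
    peer can, the leader is likely fine and this follower's view is a
    gray failure, so it drops its election attempt."""
    return not any(peers_see_healthy_leader)
```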

Measuring Availability

  • DynamoDB is designed for 99.999 percent (five 9s) availability for global tables and 99.99 percent (four 9s) availability for regional tables.
  • To ensure these goals are met, DynamoDB continuously monitors availability at the service and table levels. The tracked availability data is used to analyze customer-perceived availability trends and to trigger alarms if customers see errors above a certain threshold. These alarms, called customer-facing alarms (CFAs), report availability-related problems so they can be mitigated proactively, either automatically or through operator intervention.
  • In addition to real-time monitoring of availability, the system runs daily jobs that aggregate availability metrics per customer.
  • DynamoDB also measures and alarms on availability observed on the client-side. There are two sets of clients used to measure the user-perceived availability.
  • Measuring real application traffic lets the team reason about DynamoDB availability and latencies as seen by customers, and catch gray failures.

Deployments

  • Unlike a traditional relational database, DynamoDB takes care of deployments without the need for maintenance windows and without impacting the performance and availability that customers experience.
  • The rollback procedure is often missed in testing and can lead to customer impact. DynamoDB runs a suite of upgrade and downgrade tests at a component level before every deployment.
  • [Problem] Deployments are not atomic in a distributed system. At any given time, there will be software running the old code on some nodes and new code on other parts of the fleet.
  • New software might introduce a new type of message or change the protocol in a way that old software in the system doesn’t understand.
  • [Solution] DynamoDB handles these kinds of changes with read-write deployments. Read-write deployment is completed as a multi-step process.
  • The first step is to deploy the software to read the new message format or protocol. Once all the nodes can handle the new message, the software is updated to send new messages.
  • Read-write deployments ensure that both types of messages can coexist in the system. Even in the case of rollbacks, the system can understand both old and new messages.
  • [OneBox] Deployments are done on a small set of nodes before pushing them to the entire fleet. This strategy reduces the potential impact of faulty deployments.
  • [AutoRollback AlarmWatcher/ApprovalWorkflow] DynamoDB sets alarm thresholds on availability metrics. If error rates or latency exceed the threshold values during deployments, the system triggers automatic rollbacks.
  • [Problem] Software deployments to storage nodes trigger leader failovers that are designed to avoid any impact to availability.
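The read-write deployment described above can be modeled as a two-phase flip per node. The `Node`/`safe` names and version strings are illustrative assumptions, not DynamoDB's protocol encoding.

```python
# Sketch of a read-write (two-phase) protocol upgrade: a node is first
# deployed so it can READ the v2 format while still WRITING v1; only
# after every node understands v2 is writing flipped to v2. Rollback of
# the write phase is always safe because v1 is still understood.

class Node:
    def __init__(self):
        self.can_read = {"v1"}   # message formats this node can parse
        self.writes = "v1"       # message format this node emits

    def deploy_read_phase(self):
        self.can_read.add("v2")  # phase 1: learn the new format

    def deploy_write_phase(self):
        self.writes = "v2"       # phase 2: start emitting it

    def rollback(self):
        self.writes = "v1"       # old messages remain understood


def safe(sender, receiver):
    """A deployment state is safe iff the receiver can parse whatever
    the sender emits."""
    return sender.writes in receiver.can_read
```

At every intermediate step, old and new messages coexist safely in both directions, which is exactly the property the multi-step deployment is designed to preserve.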

Dependencies on External Services

  • To ensure high availability, all the services that DynamoDB depends on in the request path should be more highly available than DynamoDB.
  • Alternatively, DynamoDB should be able to continue to operate even when the services on which it depends are impaired.
  • Examples of services DynamoDB depends on for the request path include AWS Identity and Access Management Services (IAM), and AWS Key Management Service (AWS KMS) for tables encrypted using customer keys. DynamoDB uses IAM and AWS KMS to authenticate every customer request.
  • While these services are highly available, DynamoDB is designed to operate when these services are unavailable without sacrificing any of the security properties that these systems provide.
  • In the case of IAM and AWS KMS, DynamoDB employs a statically stable design, where the overall system keeps working even when a dependency becomes impaired.
  • Perhaps the system doesn’t see any updated information that its dependency was supposed to have delivered. However, everything before the dependency became impaired continues to work despite the impaired dependency.
  • DynamoDB caches results from IAM and AWS KMS in the request routers that perform the authentication of every request, and periodically refreshes the cached results asynchronously.
  • If AWS IAM or KMS were to become unavailable, the routers will continue to use the cached results for a predetermined extended period.
  • Caches improve response times by removing the need to do an off-box call, which is especially valuable when the system is under high load.
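The statically stable caching pattern above can be sketched as a cache that serves entries within an extended stale window and never evicts on a failed refresh. The class name, `stale_window` parameter, and explicit `now` argument are illustrative assumptions.

```python
# Sketch of a "statically stable" credential cache: results from a
# dependency (e.g. IAM/KMS) are cached and refreshed asynchronously;
# if the dependency is impaired, cached entries keep being served for
# a predetermined extended period. Illustrative, not DynamoDB code.

class StaticallyStableCache:
    def __init__(self, fetch, stale_window=3600.0):
        self.fetch = fetch               # call into the dependency
        self.stale_window = stale_window # extended stale-serving period
        self.entries = {}                # key -> (value, fetched_at)

    def get(self, key, now):
        # Serve cached results while within the stale window; only a
        # true miss falls through to an off-box dependency call.
        if key in self.entries:
            value, fetched_at = self.entries[key]
            if now - fetched_at <= self.stale_window:
                return value
        value = self.fetch(key)
        self.entries[key] = (value, now)
        return value

    def refresh(self, key, now):
        # Periodic asynchronous refresh: a failed refresh leaves the
        # old entry in place instead of evicting it.
        try:
            self.entries[key] = (self.fetch(key), now)
        except ConnectionError:
            pass
```

The key design point is that an impaired dependency degrades freshness, not availability: everything authenticated before the impairment keeps working.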

Metadata Availability

  • One of the most important pieces of metadata the request routers need is the mapping between a table’s primary keys and storage nodes.
  • [Metadata Storage]At launch, DynamoDB stored the metadata in DynamoDB itself.
  • [Routing Schema] This routing information consists of all the partitions for a table, the key range of each partition, and the storage nodes hosting the partition.
  • [Router Metadata Caching] When a router received a request for a table it had not seen before, it downloaded the routing information for the entire table and cached it locally. Since the configuration information about partition replicas rarely changes, the cache hit rate was approximately 99.75 percent.
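The routing lookup above can be sketched as a per-table cache over a sorted list of partition start keys. The data layout (`starts`/`nodes` lists) is an illustrative assumption, not DynamoDB's wire format.

```python
from bisect import bisect_right

# Sketch of the request-router lookup: each table's routing info is a
# sorted list of partition start keys plus the storage nodes hosting
# each partition. A cache miss downloads the whole table's map, as the
# paper describes; the structure here is illustrative.

class Router:
    def __init__(self, fetch_routing_info):
        self.fetch = fetch_routing_info  # table -> (starts, nodes)
        self.cache = {}

    def storage_nodes_for(self, table, key):
        if table not in self.cache:      # first request for this table
            self.cache[table] = self.fetch(table)
        starts, nodes = self.cache[table]
        # Partition i covers keys in [starts[i], starts[i+1]).
        i = bisect_right(starts, key) - 1
        return nodes[i]
```

Because partition placement rarely changes, the cached map stays valid almost always, matching the reported ~99.75 percent hit rate.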

DynamoDB Limits

  • Per-Partition Read and Write Capacity Units - Ref
  • 1 MB limit on the size of data returned by a single Query, Scan, or GetItem operation.
  • A BatchGetItem operation can return up to 16 MB of data - Ref
  • Item Size Limit - Ref
  • Secondary Indexes - Ref
  • Transactions:

MicroBenchmarks

  • To show that scale doesn’t affect the latencies observed by applications, the authors ran YCSB [8] workloads of types A (50 percent reads and 50 percent updates) and B (95 percent reads and 5 percent updates).
  • Both benchmarks used a uniform key distribution and items of size 900 bytes.
  • The workloads were scaled from 100 thousand total operations per second to 1 million total operations per second.
  • The graph shows that DynamoDB read latencies exhibit very little variance and stay essentially flat even as the throughput of the workload is increased.

Paper Link: https://www.usenix.org/conference/atc22/presentation/elhemali


Last updated: March 15, 2026

Questions or discussion? Email me

Raft https://www.sethihemant.com/notes/raft---2014/ Tue, 10 Dec 2024 00:00:00 +0000

Paper: Raft


Raft

Paper -> https://raft.github.io/raft.pdf

Usenix -> https://web.stanford.edu/~ouster/cgi-bin/papers/raft-atc14.pdf

Website -> https://raft.github.io/

Designing for Understandability, Raft 2016 -> Video, Slides

Raft User Study, 2013 -> Video, Slides

Motivation: Replicated State Machines

  • Service that is replicated on multiple machines:

Raft Basics

  • Leader based:
  • Server states:
  • Time divided into terms:
  • Request-response protocol between servers (remote procedure calls, or RPCs). 2 request types:

Leader Election

  • All servers start as followers
  • No heartbeat (AppendEntries)? Start election:
  • Election outcomes:
  • Each server votes for at most one candidate in a given term
  • Election Safety: at most one server can be elected leader in a given term
  • Availability: randomized election timeouts reduce split votes
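The randomized-timeout idea in the last bullet can be sketched directly. The 150-300 ms interval is the example range from the Raft paper; the helper names are illustrative.

```python
import random

# Sketch of Raft's randomized election timeouts: each follower picks a
# timeout uniformly from an interval (the paper suggests e.g. 150-300
# ms), so one server usually times out first, starts its election, and
# wins before the others even become candidates.

def election_timeout(lo_ms=150, hi_ms=300, rng=random):
    return rng.uniform(lo_ms, hi_ms)

def first_candidate(timeouts):
    """The follower with the smallest timeout starts its election
    first; distinct timeouts make split votes unlikely."""
    return min(range(len(timeouts)), key=timeouts.__getitem__)
```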

Log Replication

  • Handled by leader
  • When client request arrives:
  • Log entries: index, term, command
  • Logs can become inconsistent after leader crashes
  • Raft maintains a high level of coherency between logs (Log Matching Property):
  • AppendEntries consistency check preserves above properties.
  • Leader forces other logs to match its own:
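The AppendEntries consistency check and the "force followers to match" rule above can be sketched as a pure function over a log of `(term, command)` entries. Indexing is 1-based as in the paper; the function signature is an illustrative simplification of the RPC.

```python
# Sketch of the AppendEntries consistency check: the leader sends the
# index and term of the entry immediately preceding the new entries;
# the follower rejects unless its own log matches at that point, then
# overwrites any conflicting suffix with the leader's entries.

def append_entries(log, prev_index, prev_term, entries):
    """Returns (success, new_log). prev_index == 0 means 'before the
    first entry'; log positions are 1-based as in the Raft paper."""
    if prev_index > len(log):
        return False, log                      # gap: follower is behind
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False, log                      # conflicting entry
    # Log Matching holds up to prev_index; force the rest to match.
    return True, log[:prev_index] + entries
```

On rejection, the real leader decrements `prev_index` and retries until the logs agree, which is how followers' logs converge to the leader's.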

Safety

  • Must ensure that the leader for new term always holds all of the log entries committed in previous terms (Leader Completeness Property).
  • Step 1: restriction on elections: don’t vote for a candidate unless candidate’s log is at least as up-to-date as yours.
  • Compare indexes and terms from last log entries.
  • Step 2: be very careful about when an entry is considered committed
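The up-to-date comparison in Step 1 can be written down exactly: compare the terms of the last log entries first, then the log lengths.

```python
# Sketch of Raft's voting restriction: a server grants its vote only
# if the candidate's log is at least as up-to-date as its own, where
# "up-to-date" compares the term of the last entry first and falls
# back to log length (last index) on a tie.

def candidate_log_ok(my_last_term, my_last_index,
                     cand_last_term, cand_last_index):
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term  # higher last term wins
    return cand_last_index >= my_last_index   # same term: longer log wins
```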

Persistent Storage

  • Each server stores the following in persistent storage (e.g. disk or flash):
  • These must be recovered from persistent storage after a crash
  • If a server loses its persistent storage, it cannot participate in the cluster anymore

Implementing Raft

Client Interactions

  • Clients interact only with the leader
  • Initially, a client can send a request to any server
  • If leader crashes while executing a client request, the client retries (with a new randomly-chosen server) until the request succeeds
  • This can result in multiple executions of a command: not consistent!
  • Goal: linearizability: System behaves as if each operation is executed exactly once, atomically, sometime between sending of the request and receipt of the response.
  • Solution:
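The standard fix (described in the paper: clients tag every command with a unique serial number, and the state machine replays the saved response for duplicates) can be sketched as follows. The counter-style state machine is an illustrative choice.

```python
# Sketch of exactly-once command execution: clients attach a unique
# (client_id, serial) to every command; the state machine remembers
# the latest serial per client along with its response, and a retried
# command returns the saved response instead of executing again.

class DedupStateMachine:
    def __init__(self):
        self.state = {}      # key -> integer counter (toy state)
        self.sessions = {}   # client_id -> (last_serial, last_response)

    def apply(self, client_id, serial, key, value):
        last = self.sessions.get(client_id)
        if last is not None and last[0] == serial:
            return last[1]   # duplicate: replay saved response
        self.state[key] = self.state.get(key, 0) + value
        response = self.state[key]
        self.sessions[client_id] = (serial, response)
        return response
```

This restores linearizability: a command committed once but retried by the client is still applied exactly once.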

Other Issues

  • Cluster membership
  • Log compaction
  • See paper for details.
  • Paxos Vs Raft by John Kubiatowicz.

Paper Link: https://raft.github.io/raft.pdf


BigTable https://www.sethihemant.com/notes/bigtable-2006/ Mon, 09 Dec 2024 00:00:00 +0000

Paper: BigTable


BigTable/Wide Column Storage System

Goal

Design a distributed and scalable system that can store a huge amount of semi-structured data. The data will be indexed by a row key where each row can have an unbounded number of columns.

What is BigTable

  • BigTable is a distributed and massively scalable wide-column store.
  • Designed to store huge sets of structured data.
  • Provides storage for very big tables (often in the terabyte range)
  • BigTable is a CP system, i.e., it has strongly consistent reads and writes.
  • BigTable can be used as an input source or output destination for MapReduce.

Background

  • Developed at Google in 2005 and used in dozens of Google services.
  • Google couldn’t use external commercial databases because of its large scale services, and costs would have been too high. So they built an in-house solution, custom built for their use case and traffic patterns.
  • BigTable is a highly available (is this compatible with strong consistency? open question) and high-performing database that powers multiple applications across Google, where each application has different needs in terms of the size of data stored and the latency with which results are expected.
  • BigTable inspired various open-source databases, like Cassandra (which borrowed BigTable’s data model), HBase (a distributed non-relational database), and Hypertable.

BigTable UseCases

  • Google built BigTable to store large amounts of data and perform thousands of queries per second on that data.
  • Examples of BigTable data are billions of URLs with many versions per page, petabytes of Google Earth data, and billions of users’ search data.
  • BigTable is suitable to store large datasets that are greater than one TB where each row is less than 10MB.
  • Since BigTable does not provide ACID properties or transaction support (across rows or tables), OLTP applications should not use BigTable.
  • Data should be structured in the form of key-value pairs or rows-columns.
  • Non-structured data like images or movies should not be stored in BigTable.
  • Google examples:
  • BigTable can be used to store the following types of data:

Big Table Data Model

Agenda

  • Rows
  • Column families
  • Columns
  • Timestamps

Details

  • BigTable can be characterized as a sparse, distributed, persistent, multidimensional, sorted map.

  • Traditional DBs have a two-dimensional layout of the data, where each cell value is identified by the ‘Row ID’ and ‘Column Name’.

  • BigTable has a four-dimensional data model. The four dimensions are:

  • The data is indexed (or sorted) by row key, column key, and a timestamp. Therefore, to access a cell’s contents, we need values for all of them.

  • If no timestamp is specified, BigTable retrieves the most recent version.
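The multidimensional sorted map described above can be sketched as a tiny in-memory model keyed by `(row, column)` with per-cell versions. The class and method names are illustrative; real BigTable keeps everything sorted and on disk.

```python
# Sketch of BigTable's data model: a map from (row key, column key,
# timestamp) to value. A read with no timestamp returns the most
# recent version; a read with a timestamp returns the latest version
# at or before it. Illustrative toy model only.

class MiniBigTable:
    def __init__(self):
        self.cells = {}   # (row, column) -> {timestamp: value}

    def put(self, row, column, value, ts):
        self.cells.setdefault((row, column), {})[ts] = value

    def get(self, row, column, ts=None):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None                     # sparse: nothing stored
        if ts is None:
            return versions[max(versions)]  # most recent version
        earlier = [t for t in versions if t <= ts]
        return versions[max(earlier)] if earlier else None
```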

Rows

  • Each row in the table is uniquely identified by an associated row key (internally represented as a string) that is an arbitrary string of up to 64 kilobytes in size (although most keys are significantly smaller).
  • Every read or write of data under a single row is atomic.
  • Atomicity across rows is not guaranteed, e.g., when updating two rows, one might succeed, and the other might fail.
  • Each table’s data is only indexed by row key, column key, and timestamp. There are no secondary indices.
  • A column is a key-value pair where the key is represented as ‘column key’ and the value as ‘column value.’

Column families

  • Column keys are grouped into sets called column families. All data stored in a column family is usually of the same type. This is for compression purposes.
  • The number of distinct column families in a table should be small (in the hundreds at maximum), and families should rarely change during operation.
  • Access control as well as both disk and memory accounting are performed at the column-family level.
  • All rows have the same set of column families.
  • BigTable can retrieve data from the same column family efficiently.
  • Short column-family names are better, as names are included in each data transfer.

Columns

  • Columns are units within a column family.
  • A BigTable may have an unbounded number of columns.
  • New columns can be added on the fly.
  • Short column names are better as names are passed in each data transfer, e.g., ColumnFamily:ColumnName => Work:Dept
  • BigTable is quite suitable for sparse data (empty columns are not stored).

Timestamps

  • Each column cell can contain multiple versions of the content.
  • A 64-bit timestamp identifies each version that either represents real time or a custom value assigned by the client.
  • While reading, if no timestamp is specified, BigTable returns the most recent version.
  • If the client specifies a timestamp, the latest version that is earlier than the specified timestamp is returned.
  • BigTable supports two per-column-family settings to garbage-collect cell versions automatically.

System APIs

BigTable provides APIs for two types of operations:

  • Metadata operations
  • Data operations

Metadata operations

  • APIs for creating and deleting tables and column families.
  • Functions for changing cluster, table, and column family metadata, such as access control rights.

Data operations

  • Clients can insert, modify, or delete values in BigTable.
  • Clients can also lookup values from individual rows or iterate over a subset of the data in a table.
  • BigTable supports single-row transactions (single-row atomic reads/writes), which can be used to perform atomic read-modify-write sequences on data stored under a single row key.
  • Bigtable does not support transactions across row keys, but provides a client interface for batch writing across row keys.
  • BigTable allows cells to be used as integer counters.
  • A set of wrappers allow a BigTable to be used both as an input source and as an output target for MapReduce jobs.
  • Clients can also write scripts in Sawzall(a language developed at Google) to instruct server-side data processing (transform, filter, aggregate) prior to the network fetch.
  • APIs for write operations:
  • A read or scan operation can read arbitrary cells in a BigTable:

Partitioning and High Level Architecture

Table Partitioning

  • A single instance of a BigTable implementation is known as a cluster.
  • Each cluster can store a number of tables where each table is split into multiple Tablets, each around 100–200 MB in size.
  • Tables broken into Tablets(row boundary) which hold a contiguous range of rows.
  • Initially, each table consists of only one Tablet. As the table grows, multiple Tablets are created; by default, a Tablet is split once it reaches around 100 to 200 MB.
  • Tablets are the unit of distribution and load balancing.
  • Since the table is sorted by row, reads of short ranges of rows (within a small number of Tablets) are always efficient. This means selecting a row key with a high degree of locality is very important.
  • Each Tablet is assigned to a Tablet server, which manages all read/write requests of that Tablet.

High Level Architecture

Big Table cluster consists of 3 major components:

  • Client Library: Application talks to BigTable using client library.

  • One master server: For doing metadata operations, managing Tablets and assigning Tablets to Tablet servers.

  • Many Tablet servers: Each Tablet server serves reads and writes of the data for the Tablets it is assigned.

BigTable is built on top of several other pieces of Google infrastructure:

  • GFS: BigTable uses the Google File System to store its data and log files.

  • SSTable: Google’s Sorted String Table file format is used to store BigTable data.

  • Chubby: BigTable uses a highly available and persistent distributed lock service called Chubby to handle synchronization issues and store configuration information.

  • Cluster Scheduling System: Google has a cluster management system that schedules, monitors, and manages the Bigtable’s cluster.

SSTables

How are Tablets stored in GFS?

  • BigTable uses Google File System (GFS), a persistent distributed file storage system to store data as files.

  • The file format used by BigTable to store its files is called SSTable.

  • SSTables are persisted, ordered maps of keys to values, where both keys and values are arbitrary byte strings.

  • Each Tablet is stored in GFS as a sequence of files called SSTables.

  • An SSTable consists of a sequence of data blocks (typically 64KB in size).

  • A block index is used to locate blocks; the index is loaded into memory when the SSTable is opened.

  • An SSTable lookup can be performed with a single disk seek. We first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from the disk.

  • To read data from an SSTable, it can either be copied from disk to memory as a whole or can be done via just the index. The former approach avoids subsequent disk seeks for lookups, while the latter requires a single disk seek for each lookup.

  • SSTables provide two operations:

  • SSTable is immutable once written to GFS. If new data is added, a new SSTable is created. Once an old SSTable is no longer needed, it is set out for garbage collection.

  • SSTable immutability is at the core of BigTable’s data checkpointing and recovery routines.

  • Advantages of SSTable’s immutability:
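The single-seek lookup described above (binary search of the in-memory block index, then one block read) can be sketched as follows. The in-memory list-of-blocks layout is an illustrative stand-in for the on-disk format.

```python
from bisect import bisect_right

# Sketch of an SSTable lookup: a small in-memory index maps the first
# key of each (typically 64KB) data block to that block; a lookup
# binary-searches the index, then reads just the one candidate block,
# i.e. a single disk seek. Layout here is illustrative.

class SSTable:
    def __init__(self, blocks):
        # blocks: list of sorted (key, value) lists, globally sorted.
        self.blocks = blocks
        self.index = [b[0][0] for b in blocks]  # first key per block

    def get(self, key):
        i = bisect_right(self.index, key) - 1   # candidate block
        if i < 0:
            return None                         # key before first block
        block = self.blocks[i]                  # the single "disk read"
        return dict(block).get(key)
```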

Table vs Tablet vs SSTable

  • Multiple Tablets make up a table.

  • SSTables can be shared by multiple Tablets. [Why?]

  • Tablets do not overlap, SSTables can overlap.

  • To improve performance, BigTable uses an in-memory, mutable sorted buffer called MemTable to store recent updates.

  • As more writes are performed, MemTable size increases, and when it reaches a threshold, the MemTable is frozen, a new MemTable is created, and the frozen MemTable is converted to an SSTable and written to GFS.

  • Each data update is also written to a commit-log(Write Ahead Log WAL) which is also stored in GFS. This log contains redo records used for recovery if a Tablet server fails before committing a MemTable to SSTable.

  • While reading, the data can be in MemTables or SSTables. Since both these tables are sorted, it is easy to find the most recent data.
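The write path above (commit log, then MemTable, then flush to an immutable SSTable) can be sketched as a small class. The flush threshold, synchronous flush, and names are illustrative simplifications; the real system flushes by byte size and in the background.

```python
# Sketch of the LSM-style Tablet write path: every update is appended
# to a commit log (WAL) first for recovery, then applied to the
# in-memory MemTable; when the MemTable reaches a threshold it is
# frozen and persisted as an immutable, sorted SSTable.

class TabletWriter:
    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.commit_log = []   # redo records, durable in GFS
        self.memtable = {}     # sorted buffer in the real system
        self.sstables = []     # immutable flushed snapshots

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. WAL for recovery
        self.memtable[key] = value            # 2. apply to MemTable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # (async in real BigTable)
        return "ok"                           # 3. ack the client

    def flush(self):
        # Freeze the MemTable and persist it as a new sorted SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
```

If the Tablet server crashes before a flush, the commit log's redo records are replayed to rebuild the lost MemTable.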

GFS and Chubby

GFS

  • GFS files are broken down into fixed-size blocks called chunks.
  • SSTables are divided into fixed-size blocks and these blocks are stored on the chunk servers. Each Chunk is replicated across multiple chunk servers for reliability.
  • Clients interact with master for metadata, and chunk servers directly for SSTable data files.

Chubby

Chubby Recap:

  • Chubby is a highly available and persistent distributed locking service.

  • Chubby usually runs with five active replicas, one of which is elected as the master to serve requests. To remain alive, a majority of Chubby replicas must be running.

  • BigTable depends on Chubby so much that if Chubby is unavailable for an extended period of time, BigTable will also become unavailable.

  • Chubby uses the Paxos algorithm to keep its replicas consistent in the face of failure.

  • Chubby provides a namespace consisting of files and directories. Each file or directory can be used as a lock. Read and write access to a Chubby file is atomic. In BigTable, Chubby is used to:

  • Allows a multi-thousand node Bigtable cluster to stay coordinated.

  • Ensure there is only one active master. The master maintains a session lease with Chubby and periodically renews it to retain the status of the master.

  • Store the bootstrap location of BigTable data.

  • Discover new Tablet servers as well as the failure of existing ones.

  • Store BigTable schema information (the column family information for each table)

  • Store Access Control Lists (ACLs).

BigTable Components

A BigTable cluster consists of three major components:

  • A library component that is linked into every client.
  • One master server.
  • Many Tablet servers.

BigTable Master Server

There is only one master server in a BigTable cluster, and it is responsible for:

  • Assigning Tablets to Tablet servers and ensuring effective load balancing.
  • Monitoring the status of Tablet servers and managing the joining or failure of Tablet Servers.
  • Garbage collection of the underlying files stored in GFS
  • Handling metadata operations such as table and column family creations.
  • Bigtable master is not involved in the core task of mapping tablets onto the underlying files in GFS (Tablet servers handle this).
  • This means that most BigTable clients never need to communicate with the master at all: Tablet locations come from Chubby and the METADATA table rather than from the master.
  • This design decision significantly reduces the load on the master and the possibility of the master becoming a bottleneck.

Tablet Server

  • Each Tablet server is assigned ownership of a number of Tablets (typically 10-1000 Tablets per server) by the master server.
  • Each Tablet server serves read and write requests of the data of the Tablets it is assigned.
  • The client communicates directly with the Tablet servers for reads/writes.
  • Tablet servers can be added or removed dynamically from a cluster to accommodate changes in the workloads.
  • Tablet creation, deletion, or merging is initiated by the master server, while Tablet splitting (when a Tablet grows too large) is handled by the Tablet servers, which notify the master.

Working with Tablets

Agenda

  • Locating Tablets
  • Assigning Tablets
  • Monitoring Tablet Servers
  • Load-balancing Tablet servers

Locating Tablets

  • Since Tablets move around from server to server (due to load balancing, Tablet server failures, etc.), given a row, how do we find the correct Tablet server?

  • To answer this, we need to find the Tablet whose row range covers the target row.

  • BigTable maintains a 3-level hierarchy, analogous to that of a B+ tree, to store Tablet location information.

  • BigTable creates a special table, called Metadata table, to store Tablet locations.

  • This Metadata table contains one row per Tablet that tells us which Tablet server is serving this Tablet.

  • Each row in the METADATA table stores a Tablet’s location under a row key that is an encoding of the Tablet’s table identifier and its end row.

  • BigTable stores the information about the Metadata table in two parts:

  • A BigTable client seeking the location of a Tablet starts the search by looking up a particular file in Chubby that is known to hold the location of the Meta-0 Tablet.

  • This Meta-0 Tablet contains information about other metadata Tablets, which in turn contain the location of the actual data Tablets.

  • With this scheme, the depth of the tree is limited to three. For efficiency, the client library caches Tablet locations and also prefetches metadata associated with other Tablets whenever it reads the METADATA table.
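The three-level resolution plus client caching can be sketched with toy dictionaries. Everything here (exact-match rows, the `meta0-location` key) is an illustrative simplification; the real METADATA table is keyed by table identifier and end row.

```python
# Sketch of the three-level Tablet location lookup:
#   level 1: a well-known Chubby file names the root (Meta-0) Tablet;
#   level 2: the root Tablet points at the other METADATA Tablets;
#   level 3: a METADATA row names the Tablet server serving the data.
# The client caches results, so most lookups skip the hierarchy.

def locate(chubby, metadata, cache, table, row):
    key = (table, row)
    if key in cache:                     # common case: cache hit
        return cache[key]
    root = chubby["meta0-location"]      # level 1: Chubby file
    meta_rows = metadata[root][table]    # level 2: root METADATA Tablet
    server = meta_rows[row]              # level 3: exact-match toy row
    cache[key] = server
    return server
```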

Assigning Tablets

  • A Tablet is assigned to only one Tablet server at any time.
  • The master keeps track of the set of live Tablet servers and the mapping of Tablets to Tablet servers.
  • The master also keeps track of any unassigned Tablets and assigns them to Tablet servers with sufficient room.
  • When a Tablet server starts, it creates and acquires an exclusive lock on a uniquely named file in Chubby’s “servers” directory. This mechanism is used to tell the master that the Tablet server is alive.
  • During master restarts (or startup), the following happens:

Monitoring Tablet servers (Tablet failures or network partitions)

  • BigTable maintains a ‘Servers’ directory in Chubby, which contains one file for each live Tablet server.
  • Whenever a new Tablet server comes online, it creates a new file in this directory to signal its availability and obtains an exclusive lock on this file. As long as a Tablet server retains the lock on its Chubby file, it is considered alive.
  • BigTable’s master keeps monitoring the ‘Servers’ directory, and whenever it sees a new file in this directory, it knows that a new Tablet server has become available and is ready to be assigned Tablets.
  • The master regularly checks the status of the lock. If the lock is lost, the master assumes that there is a problem either with the Tablet server or with Chubby.
  • In such a case, the master tries to acquire the lock itself; if it succeeds, it concludes that Chubby is working fine and that the Tablet server is having problems.
  • The master then deletes the Tablet server’s Chubby lock file and reassigns the Tablets of the failed Tablet server.
  • The deletion of the file works as a signal for the failing Tablet server to terminate itself and stop serving the Tablets.
  • A Tablet server that loses its lock (e.g., due to a temporary network problem) tries to acquire the lock again; if it succeeds, it treats the loss as transient and starts serving its Tablets again.
  • If the file gets deleted, then the Tablet server terminates itself to start afresh.
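The lock-based liveness protocol above can be sketched with a toy in-memory stand-in for Chubby. The `FakeChubby` class and the `master_check_tablet_server` helper are invented for illustration; they are not the real Chubby API.

```python
class FakeChubby:
    """Minimal in-memory stand-in for Chubby's lock files (illustrative)."""
    def __init__(self):
        self.locks = {}                      # file path -> holder (or None)

    def try_acquire(self, path, holder):
        if self.locks.get(path) is None:
            self.locks[path] = holder
            return True
        return False

    def release(self, path):
        self.locks[path] = None

    def delete(self, path):
        self.locks.pop(path, None)

    def exists(self, path):
        return path in self.locks


def master_check_tablet_server(chubby, server_file):
    """Master-side check: 'alive' while the server holds its lock;
    'reassign' when the master can grab the lock itself (Chubby is fine,
    so the Tablet server must be the problem)."""
    if chubby.locks.get(server_file) is not None:
        return "alive"
    if chubby.try_acquire(server_file, "master"):
        chubby.delete(server_file)   # deletion signals the server to stop
        return "reassign"
    return "unknown"                 # Chubby itself may be unreachable
```
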

Load-balancing Tablet servers

  • Master periodically asks Tablet servers about their current load. All this information gives the master a global view of the cluster and helps assign and load-balance Tablets.

Life of BigTables Read and Write Operations

Write Request

Upon receiving a write request, the Tablet server performs the following steps:

  • Validate that the request is well formed.
  • Check that the sender is authorized to perform the mutation, using ACLs stored in Chubby.
  • If authorized, the mutation is written to a commit log in GFS that stores redo records.
  • Once committed to the commit log, the request contents are stored in an in-memory sorted buffer called the MemTable.
  • After inserting data into the MemTable, a success acknowledgement is sent to the client.
  • Periodically, the MemTable is flushed to an SSTable, and SSTables are merged in the background using compaction.
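The write path above can be sketched as a toy Tablet server: append to a commit log, insert into a sorted in-memory MemTable, and flush the MemTable to an immutable "SSTable" once it crosses a size threshold. All names and the threshold are illustrative, not from the paper.

```python
class ToyTabletServer:
    def __init__(self, flush_threshold=3):
        self.commit_log = []          # stands in for the redo log in GFS
        self.memtable = {}            # in-memory buffer (sorted on flush)
        self.sstables = []            # immutable sorted key->value dicts
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durable redo record
        self.memtable[key] = value             # 2. in-memory buffer
        if len(self.memtable) >= self.flush_threshold:
            self._minor_compaction()
        return "ok"                            # 3. ack to the client

    def _minor_compaction(self):
        # MemTable -> a new immutable SSTable, sorted by key
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
```
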

Read Request

Upon receiving a read request, the Tablet server performs the following steps:

  • Validate request is well formed and sender is authorized.
  • Return rows if they are available in cache.
  • Read MemTable to find the required rows.
  • Read SSTable Indexes that are loaded in memory to find SSTables that will have the required data, then read those rows from SSTables.
  • Merge rows read from MemTable and SSTable to find the required version of data.
  • Since MemTable and SSTables are sorted, merged view can be formed efficiently.
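Because every source is sorted, the merged view can be formed with a heap-based k-way merge; for duplicate keys the newest source wins. A minimal sketch using dict-based tables (illustrative only):

```python
import heapq

def merged_read(memtable, sstables):
    """Merged view over the MemTable (newest) and SSTables (oldest-first)."""
    def stream(prio, table):             # tag each source with its age rank
        for key, val in sorted(table.items()):
            yield (key, prio, val)

    sources = [memtable] + list(reversed(sstables))       # newest first
    streams = [stream(p, t) for p, t in enumerate(sources)]
    view = {}
    for key, _prio, val in heapq.merge(*streams):
        if key not in view:              # first occurrence = newest version
            view[key] = val
    return view
```
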

Fault Tolerance and Compaction

Agenda

  • Fault tolerance and replication
  • Compaction

Fault tolerance and replication

Fault tolerance in Chubby and GFS

  • Both systems employ replication for fault tolerance and high availability: Chubby’s replication minimizes downtime, while GFS’s replication creates multiple copies of the data to avoid data loss.

Fault tolerance for Tablet server

  • BigTable’s master is responsible for monitoring the Tablet servers.
  • The master does this by periodically checking the status of the Chubby lock against each Tablet server.
  • When the master finds out that a Tablet server has gone dead, it reassigns the tablets of the failing Tablet server.

Fault tolerance for the Master

  • The master acquires a lock in a Chubby file and maintains a lease.
  • If, at any time, the master’s lease expires, it kills itself.
  • When Google’s Cluster Management System finds out that there is no active master, it starts one up.
  • The new master has to acquire the lock on the Chubby file before acting as the master.

Compaction

Mutations in BigTable take up extra space until compaction happens. BigTable manages compaction behind the scenes. The types of compaction:

  • Minor Compaction (MemTable written to an SSTable)
  • Merging Compaction (a few SSTables + MemTable compacted into a larger SSTable)
  • Major Compaction (all SSTables → a single SSTable)
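The three compaction types can be illustrated on dict-based "SSTables", with newer sources taking precedence for a key (function names are mine, not the paper’s):

```python
def minor_compaction(memtable, sstables):
    """MemTable flushed to a new SSTable; returns (fresh MemTable, SSTables)."""
    sstables.append(dict(sorted(memtable.items())))
    return {}, sstables

def merging_compaction(tables):
    """Oldest-first list of tables -> one table; newer values win."""
    merged = {}
    for t in tables:                 # later (newer) entries overwrite earlier
        merged.update(t)
    return dict(sorted(merged.items()))

def major_compaction(sstables):
    """All SSTables -> a single SSTable."""
    return [merging_compaction(sstables)]
```
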

BigTable refinements

BigTable implemented certain refinements to achieve high performance, availability, and reliability.

Agenda

  • Locality groups
  • Compression
  • Caching
  • Bloom Filters
  • Unified commit Log
  • Speeding up Tablet recovery

Locality groups

  • BigTable uses column-oriented storage.
  • Clients can club together multiple column families into a locality group.
  • BigTable generates separate SSTables for each locality group.
  • This has a few benefits:

Compression

  • Clients can choose to compress the SSTable for a locality group to save space.
  • BigTable allows its clients to choose compression techniques based on their application requirements.
  • The compression ratio gets even better when multiple versions of the same data are stored.
  • Compression is applied to each SSTable block separately.

Caching

  • To improve read performance, Tablet servers employ two levels of caching: the Scan Cache (a higher-level cache of the key-value pairs returned by the SSTable interface) and the Block Cache (a lower-level cache of SSTable blocks read from GFS).

Bloom Filters

  • Any read operation has to read from all SSTables that make up a Tablet.
  • These SSTables are not in memory, thus the read operation needs to do many disk accesses. To reduce the number of disk accesses BigTable uses Bloom Filters.
  • Bloom Filters are created for SSTables (particularly for the locality groups).
  • They help to reduce the number of disk accesses by predicting if an SSTable does “not” contain data corresponding to a particular (row, column) pair.
  • Bloom filters take a small amount of memory but can improve the read performance drastically.
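A Bloom filter never produces false negatives, so a "not present" answer lets the read skip the SSTable entirely; occasional false positives only cost an extra disk read. A minimal sketch (bit count and hash count are illustrative):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                         # integer used as a bit array

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means "probably present".
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

A read would call `might_contain` on the (row, column) key before touching the SSTable on disk.
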

Unified commit Log

  • Instead of maintaining separate commit log files for each Tablet, BigTable maintains one log file for a Tablet server. This gives better write performance.
  • Since each write has to go to the commit log, writing to a large number of log files would be slow as it could cause a large number of disk seeks.
  • One disadvantage of having a single log file is that it complicates the Tablet recovery process.
  • When a Tablet server dies, the Tablets that it served will be moved to other Tablet servers.
  • To recover the state for a Tablet, the new Tablet server needs to reapply the mutations for that Tablet from the commit log written by the original Tablet server.
  • However, the mutations for these Tablets were co-mingled in the same physical log file. One approach would be for each new Tablet server to read this full commit log file and apply just the entries needed for the Tablets it needs to recover.
  • However, under such a scheme, if 100 machines were each assigned a single Tablet from a failed Tablet server, then the log file would be read 100 times.
  • BigTable avoids duplicating log reads by first sorting the commit log entries in order of the keys <table, row name, log sequence number>.
  • In the sorted output, all mutations for a particular Tablet are contiguous and can therefore be read efficiently.
  • To further improve the performance, each Tablet server maintains two log writing threads — each writing to its own and separate log file.
  • Only one of the threads is active at a time. If one of the threads is performing poorly (say, due to network congestion), the writing switches to the other thread. Log entries carry sequence numbers so that the recovery process can eliminate the duplicate entries that result from this switching.
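The recovery optimization above amounts to a sort plus a contiguous scan. A toy sketch (grouping by table here; real Tablets are row ranges within a table):

```python
from itertools import groupby

def group_log_by_table(log_entries):
    """log_entries: (table, row, seq, mutation) tuples from the shared log.
    After sorting, each table's mutations are contiguous."""
    entries = sorted(log_entries)            # key order: table, row, seq
    return {
        table: [e[1:] for e in group]
        for table, group in groupby(entries, key=lambda e: e[0])
    }
```
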

Speeding up Tablet recovery

  • One of the complicated and time-consuming tasks while loading Tablets is to ensure that the Tablet server loads all entries from the commit log.
  • When the master moves a Tablet from one Tablet server to another, the source Tablet server performs compactions to ensure that the destination Tablet server does not have to read the commit log. This is done in 3 steps:

Tablet Splitting

Concurrency on MemTable

  • Want to avoid read-contention when writes are also happening on the same rows.
  • Use Copy-on-write semantics on a per-row basis.
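Per-row copy-on-write can be sketched in a few lines: a writer installs a modified copy of the row, so a reader holding a reference to the old row is never disturbed (illustrative, not BigTable’s actual implementation):

```python
def cow_update(memtable, row_key, column, value):
    """Replace the row with an updated copy; concurrent readers of the old
    row object keep an unchanged snapshot."""
    old_row = memtable.get(row_key, {})
    new_row = dict(old_row)           # copy; the old row stays immutable
    new_row[column] = value
    memtable[row_key] = new_row       # single reference swap
    return old_row
```
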

Performance Observations

BigTable Characteristics

BigTable performance(and Popularity)

  • Distributed multi-level map: BigTable can run on a large number of machines.
  • Scalable: BigTable can be scaled horizontally by adding more nodes to the cluster without significant performance impact; no manual intervention or rebalancing is required. BigTable achieves near-linear scalability and proven fault tolerance on commodity hardware.
  • Fault-tolerant and reliable: Since data is replicated to multiple nodes, fault tolerance is pretty high.
  • Durable: BigTable stores data permanently.
  • Centralized: BigTable adopts a single-master approach to maintain data consistency and a centralized view of the state of the system.
  • Separation between control and data: BigTable maintains a strict separation between control and data flow. Clients talk to the Master for all metadata operations, whereas all data access happens directly between the Clients and the Tablet servers.

Dynamo vs. BigTable

Datastores developed on the principles of BigTable

Google’s BigTable has inspired many NoSQL systems. Here is a list of a few famous ones:

  • HBase: HBase is an open-source, distributed non-relational database modeled after BigTable. It is built on top of the Hadoop Distributed File System (HDFS).
  • Hypertable: Similar to HBase, Hypertable is an open-source implementation of BigTable written in C++. Unlike BigTable, which uses only one storage layer (i.e., GFS), Hypertable can run on top of any file system (e.g., HDFS, GlusterFS, or CloudStore). To achieve this, the system abstracts the interface to the file system by sending all data requests through a Distributed File System broker process.
  • Cassandra: Cassandra is a distributed, decentralized, and highly available NoSQL database. Its architecture is based on Dynamo and BigTable. Cassandra can be described as a BigTable-like datastore running on a Dynamo-like infrastructure. Cassandra is also a wide-column store and utilizes the storage model of BigTable, i.e., SSTables and MemTables.

Summary

  • BigTable is a Distributed wide column storage system designed to manage large amounts of semi-structured data with High Availability, Low Latency, Scalability, and Fault tolerance.
  • It is a sparse, distributed, persistent, Multi Dimensional sorted map.
  • Map is indexed by a unique key made up of Row Key(up to 64 KB), Column key, and a timestamp(64-bit integer).
  • Columns are grouped into Column families. RowKey and Column key uniquely identifies a Column data cell. Within each cell, data is further indexed by timestamps to store multiple versions of the data.
  • Each read/write to a row is atomic. Atomicity across rows is not guaranteed.
  • A BigTable’s Table could be a multi-TB table. A Table is broken into a smaller range of rows called Tablets.
  • One Master server and multiple Tablet Servers.
  • Master does metadata management, Assigns Tablets to Tablet servers, does Tablet rebalancing etc.
  • Read/Write of data goes directly to the tablet servers.
  • Tablet servers store each Tablet as a set of immutable SSTable files, each of which is further divided into 64 KB data blocks. SSTables are stored as chunks in GFS and replicated to different chunk servers.
  • To enhance read performance, Bloom filters are used to reduce disk seeks when checking each SSTable for the existence of a key.
  • BigTable relies on Chubby for master election (and failover) using locks; the master also checks whether Tablet servers are alive, since each Tablet server holds a lock on a file in Chubby’s servers directory.
  • Writes first go to a commit log (WAL) for failure recovery, then to the in-memory MemTable (kept as a sorted map); when the MemTable crosses a size threshold, it is written out as an SSTable.
  • In the background, compactions merge the MemTable with SSTables and combine SSTables into larger ones.
  • All the read operations are served from a Merged view of MemTable and All SSTables.

Reference

  • BigTable
  • SSTable(LSM Trees)
  • Amazon Dynamo
  • Cassandra
  • HBase
  • Jordan BigTable

Paper Link: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf


Last updated: March 15, 2026


Google File System
https://www.sethihemant.com/notes/gfs-2003/
Sun, 08 Dec 2024 00:00:00 +0000

Paper: Google File System (https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf)


Google File System / Distributed File System

Goal

Design a distributed file system to store huge files (terabyte and larger). The system should be scalable, reliable, and highly available.

  • Developed by Google for its large data-intensive applications.

Background

  • GFS was built for handling batch processing on large data sets and is designed for system-to-system interaction, not user-to-system interaction.
  • It was designed with the following goals in mind:

GFS Use Cases

  • Built for distributed data-intensive applications like Gmail or Youtube.
  • Google’s BigTable uses GFS to store log files and data files.

APIs

  • GFS doesn’t provide a standard POSIX-like API. Instead, user-level APIs are provided.
  • Files organized hierarchically in directories and identified by their path names.
  • Supports usual file system operations:
  • Additional Special Operations

High Level Architecture

Agenda

  • Chunks
  • Chunk Handle
  • Cluster
  • Chunk Server
  • Master
  • Client

A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple clients.

Chunk

  • As files stored in GFS tend to be very large, GFS breaks files into multiple fixed-size chunks where each chunk is 64 megabytes in size.

Chunk Handle

  • Each chunk is identified by an immutable and globally unique 64-bit ID number called the chunk handle. This allows 2^64 unique chunks.
  • Total addressable storage = 2^64 × 64 MB = 2^90 bytes ≈ 10^9 exabytes.
  • Files are split into chunks, so the job of GFS is to provide a mapping from files to chunks, and then to support standard operations on files, mapping them down to operations on individual chunks.
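The file-offset-to-chunk translation implied above is simple arithmetic, sketched here:

```python
CHUNK_SIZE = 64 * 1024 * 1024         # 64 MB, as in the paper

def to_chunk_coords(byte_offset):
    """Map a byte offset in a file to (chunk index, offset within chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE
```
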

Cluster

  • GFS is organized as a network of computers (nodes) called a cluster. A GFS cluster contains 3 types of entities:

Chunk Server

  • Nodes which store chunks on local disks as Linux files.
  • Read or write chunk data specified by chunk handle and byte-range.
  • For reliability, each chunk is replicated to multiple chunk servers.
  • By default, GFS stores three replicas, though different replication factors can be specified on a per-file basis.

Master

  • Coordinator of GFS cluster. Responsible for keeping track of filesystem metadata.
  • Metadata stored at master includes:
  • Master also controls system-wide activities such as:
  • Periodically communicates with each ChunkServer in HeartBeat messages to give it instructions and collect its state.
  • For performance and fast random access, all metadata is stored in the master’s main memory, i.e. entire filesystem namespace as well as all the name-to-chunk mappings.
  • For fault tolerance and to handle a master crash, all metadata changes(every operation to File System) are written to the disk onto an operation log(similar to Journal) which is replicated to remote machines.
  • The benefit of having a single, centralized master is that it has a global view of the file system, and hence, it can make optimum management decisions, for example, related to chunk placement.

Client

  • Application/Entity that makes read/write requests to GFS using GFS Client library.
  • This library communicates with the master for all metadata-related operations like creating or deleting files, looking up files, etc.
  • To read or write data, the client(library) interacts directly with the ChunkServers that hold the data.
  • Neither the client nor the ChunkServer caches file data.
  • ChunkServers rely on the buffer cache in Linux to maintain frequently accessed data in memory.

Single Master and Large Chunk Size

Agenda

  • Single Master
  • Chunk Size

Single Master

  • Having a single master vastly simplifies GFS design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge.
  • GFS minimizes the master’s involvement in reads and writes, so that it does not become a bottleneck.

Chunk Size

  • GFS has chosen 64 MB, which is much larger than typical filesystem block sizes (which are often around 4KB). One of the key design parameters.
  • Advantages of large chunk size:

Lazy space Allocation

  • Each chunk replica is stored as a plain Linux file on a ChunkServer. GFS does not allocate the whole 64 MB of disk space when creating a chunk; instead, as the client appends data, the ChunkServer lazily extends the chunk.
  • One disadvantage of having a large chunk size is the handling of small files.

Metadata

Let’s explore how GFS manages file system metadata.

Agenda

  • Storing Metadata in memory
  • Chunk Location
  • Operation Log

Master stores 3 types of metadata:

  • File and Chunk namespaces (directory hierarchy).
  • Mapping from files to chunks.
  • Location of each chunk’s replicas.

3 aspects of how the master stores this metadata:

  • Keeps all the metadata in memory.
  • File and Chunk namespaces and the file-to-chunk mapping are also persisted on the master’s local disk.
  • Chunk replica locations are not persisted to local disk.

Storing Metadata in Memory

  • Quick operations due to metadata being accessible in-memory.
  • Efficient for the master to periodically scan through its entire state in the background. Periodic scanning is used for three functions: chunk garbage collection, re-replication in the presence of ChunkServer failures, and chunk migration for load and disk-space balancing.
  • The capacity of the whole system (i.e., how many chunks the metadata can describe) is limited by how much memory the master has. Not a problem in practice.
  • If the need to support a larger file system arises, cost of adding extra memory to master is a smaller price to pay for reliability, simplicity, performance, and flexibility by storing metadata in-memory.

Chunk Location

  • The master does not keep a persistent record of which ChunkServers have a replica of a given chunk.
  • By having the ChunkServer as the ultimate source of truth for each chunk’s location, GFS eliminates the problem of keeping the master and ChunkServers in sync.
  • It is not beneficial to maintain a consistent view of chunk locations on the master, because errors on a ChunkServer may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled, or ChunkServer is renamed or failed, etc.)

Operation Log

  • The master maintains an operation log that contains the namespace and file- to-chunk mappings and stores it on the local disk.
  • Specifically, this log stores a historical persistent record of all the metadata changes and serves as a logical timeline that defines the order of concurrent operations.
  • For fault tolerance and reliability, this operation log is synchronously replicated on multiple remote machines, and changes to the metadata are not made visible to clients until they have been persisted on all replicas.(Similar to the High Water Mark concept in Kafka).
  • The master batches several log records together before flushing, thereby reducing the impact of flushing and replicating on overall system throughput.
  • Upon restart, the master can restore its file-system state by replaying the operation log.
  • This log must be kept small to minimize the startup time, which is achieved by periodically checkpointing it (see Checkpointing below).

Checkpointing

  • Master’s state is periodically serialized to disk and then replicated, so that on recovery, a master may load the checkpoint into memory, replay any subsequent operations from the operation log, and be available very quickly.
  • To further speed up the recovery and improve availability, GFS stores the checkpoint in a compact B-tree like format that can be directly mapped into memory and used for namespace lookup without extra parsing.
  • The checkpoint process can take time, therefore, to avoid delaying incoming mutations, the master switches to a new log file and creates the new checkpoint in a separate thread.
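Checkpoint-plus-log recovery can be sketched as: load the snapshot, then replay only log records newer than the checkpoint. The record format here is invented for illustration.

```python
def recover_master_state(checkpoint, op_log, checkpoint_seq):
    """checkpoint: dict snapshot of namespace metadata;
    op_log: list of (seq, op, path, value) records, oldest first."""
    state = dict(checkpoint)
    for seq, op, path, value in op_log:
        if seq <= checkpoint_seq:
            continue                   # already reflected in the checkpoint
        if op == "create":
            state[path] = value
        elif op == "delete":
            state.pop(path, None)
    return state
```
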

Master Operations

Agenda

  • Namespace management and locking
  • Replica placement
  • Replica creation and re-replication
  • Replica rebalancing
  • Stale replica detection

Master is responsible for:

  • Making replica placement decisions
  • Creating new chunks and assigning replicas
  • Making sure that chunks are fully replicated as per the replication factor
  • Balancing the load across chunk servers
  • Reclaiming unused storage.

Namespace management and locking

  • The master acquires locks over a namespace region to ensure proper serialization and to allow multiple operations at the master.
  • GFS does not have an i-node like tree structure for directories and files.
  • Instead, it has a hash-map that maps a filename to its metadata, and reader-writer locks are applied on each node of the hash table for synchronization.

Replica placement

  • To ensure maximum data availability and integrity, the master distributes replicas on different racks(“Rack Aware”), so that clients can still read or write in case of a rack failure.
  • As the in and out bandwidth of a rack may be less than the sum of the bandwidths of individual machines, placing the data in various racks can help clients exploit reads from multiple racks.
  • For ‘write’ operations, multiple racks are actually disadvantageous as data has to travel longer distances. It is an intentional tradeoff that GFS made.
  • Data is lost when all replicas of a chunk are lost.

Replica creation and re-replication

  • The goals of a master are to place replicas on servers with less-than-average disk utilization, and spread replicas across racks.
  • Reduce the number of ‘recent’ creations on each ChunkServer (chunk creation itself is cheap, but it predicts imminent heavy write traffic, which might create additional load).
  • Chunks need to be re-replicated as soon as the number of available replicas falls (due to data corruption on a server or a replica being unavailable) below the user-specified replication factor.
  • Instead of re-replicating all such chunks at once, the master prioritizes them and limits the number of active clone operations, so that cloning does not become a bottleneck for client operations.
  • Restrictions are placed on the bandwidth of each server for re-replication so that client requests are not compromised.
  • How are chunks prioritized for re-replication?

Replica rebalancing

  • Master rebalances replicas regularly to achieve load balancing and better disk space usage.
  • Any new ChunkServer added to the cluster is filled up gradually by the master rather than flooding it with a heavy traffic of write operations.

Stale replica detection

  • Chunk replicas may become stale if a ChunkServer fails and misses mutations to the chunk while it is down.
  • For each chunk, the master maintains a chunk Version Number to distinguish between up-to-date and stale replicas.
  • The master increments the chunk version every time it grants a lease and informs all up-to-date replicas.
  • The master and these replicas all record the new version number in their persistent state.
  • Master removes stale replicas during regular garbage collection.
  • Stale replicas are not given to clients when they ask the master for a chunk location, and they are not involved in mutations either.
  • However, because a client caches a chunk’s location, it may read from a stale replica before the data is resynced.
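Version-number comparison is all the detection takes; a sketch with an invented replica map:

```python
def find_stale_replicas(master_version, replica_versions):
    """replica_versions: chunkserver -> chunk version it recorded.
    Any replica below the master's current version missed a mutation."""
    return sorted(server for server, version in replica_versions.items()
                  if version < master_version)
```
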

Anatomy of a Read Operation

Let’s learn how GFS handles a read operation. A typical interaction with GFS Cluster goes like this:

  • Client translates the filename and byte offset specified by the application into a chunk index within the file.
  • Client sends RPC request with File Name and Chunk Index to the master.
  • Master replies with Chunk Handle and replica locations(holding chunk).
  • Client caches this metadata using FileName and ChunkIndex as the key.
  • Client sends request to one of the closest replicas specifying a chunk handle and a byte range within that chunk.
  • Replica chunk server replies with requested data.
  • Master is involved only at the start and is then completely out of loop, implementing a separation of control and data flows.
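The read path above can be sketched end to end with in-memory stand-ins for the master and ChunkServers; all structures here are illustrative, not real GFS RPCs. The cache key is (filename, chunk index), as described.

```python
CHUNK_SIZE = 64 * 1024 * 1024

class GFSClient:
    def __init__(self, master_metadata, chunkservers):
        self.master = master_metadata     # (file, idx) -> (handle, [servers])
        self.chunkservers = chunkservers  # server -> {handle: bytes}
        self.cache = {}                   # keyed by (filename, chunk index)

    def read(self, filename, offset, length):
        idx = offset // CHUNK_SIZE                   # 1. offset -> chunk index
        key = (filename, idx)
        if key not in self.cache:                    # 2-4. ask the master once
            self.cache[key] = self.master[key]
        handle, replicas = self.cache[key]
        data = self.chunkservers[replicas[0]][handle]  # 5-6. read a replica
        start = offset % CHUNK_SIZE
        return data[start:start + length]
```

After the first read of a chunk, the master is out of the loop: later reads of the same chunk hit the client cache.
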

Anatomy of Write Operation

What is a chunk lease?

  • To safeguard against concurrent writes at two different replicas of a chunk, GFS makes use of chunk lease.
  • When a mutation (i.e., a write, append or delete operation) is requested for a chunk, the master finds the ChunkServers which hold that chunk and grants a chunk lease (for 60 seconds) to one of them.
  • The server with the lease is called the primary and is responsible for providing a serial order for all the currently pending concurrent mutations to that chunk.
  • There is only one lease per chunk at any time, so that if two write requests go to the master, both see the same lease denoting the same primary.
  • A global ordering is provided by the ordering of the chunk leases combined with the order determined by that primary.
  • The primary can request lease extensions if needed
  • When the master grants the lease, it increments the chunk version number and informs all replicas containing that chunk of the new version number.
  • Failure modes??

Data Writing?

Writing of data is split into two phases:

  • Sending

  • Writing

Stepwise breakdown of data transfer:

  • Client asks the master which chunk server holds the current lease for the chunk and the locations of the other replicas.

  • Master replies with the identity and location of primary and secondary replicas.

  • Client pushes data to the closest replica.

  • Once all replicas have acknowledged receiving the data, the client sends the write request to the primary.

  • The primary assigns consecutive serial numbers to all the mutations it receives, providing serialization. It applies mutations in serial number order.

  • Primary forwards the write request to all secondary replicas. They apply mutations in the same serial number order.

  • Secondary replicas reply to primary indicating they have completed operation.

  • Primary replies to the client with success or error messages.

  • The key point to note is that the data flow is different from the control flow.

  • Chunk version numbers are used to detect if any replica has stale data which has not been updated because that ChunkServer was down during some update.
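The primary’s role as the single point of ordering can be sketched as follows (in-memory replicas, illustrative only):

```python
class PrimaryReplica:
    def __init__(self, secondaries):
        self.next_serial = 0
        self.secondaries = secondaries    # replica states (dicts)
        self.state = {}

    def write(self, key, value):
        serial = self.next_serial         # single point of serialization
        self.next_serial += 1
        self.state[key] = value           # apply locally in serial order
        for replica in self.secondaries:  # forward in the same order
            replica[key] = value
        return serial
```

Because every replica applies mutations in the serial order the primary chose, all replicas converge on the same state.
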

Another edge case with the write operation: if two concurrent writes span multiple chunks, and those chunks have two different primary chunk servers (each deciding its own serial order), the concurrent writes can end up interleaved. See the example in the Jordan video at 22:30 onwards. The only way to prevent this is a distributed locking service (distributed consensus), which would be an expensive operation.

Anatomy of Append operation?

  • Record append operation is optimized in a unique way that distinguishes GFS from other distributed file systems.
  • In a normal write, the client specifies the offset at which data is to be written. Concurrent writes to the same region can experience race conditions, and the region may end up containing data fragments from multiple clients.
  • In a record append, however, the client specifies only the data (up to 1/4 of the chunk size, roughly 16 MB). GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS’s choosing and returns that offset to the client.
  • Record append is a mutation, i.e., an operation that changes the contents or metadata of a chunk.
  • [Data Transfer to Replicas] When an application appends data to a file, the client pushes the data to all replicas of the last chunk of the file, just like in the write operation.
  • [Command to serialize the write] When the client forwards the append request, the primary checks whether appending the record to the existing chunk would grow the chunk beyond its limit (maximum size of a chunk is 64 MB).
  • [Pads the existing Chunk] If it would, the primary pads the chunk to the maximum limit, commands the secondaries to do the same, and tells the client to retry the append on the next chunk.
  • [Append to the primary replica’s chunk and notify secondaries] If the record fits within the maximum size, the primary appends the data to its replica, tells the secondaries to write the data at the exact same offset, and finally replies success to the client.
  • [Failure Mode] If an append operation fails at any replica, the client retries it, which can leave duplicate records on replicas where the earlier attempt succeeded.
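
The primary's pad-or-append decision described above can be sketched as follows (the sizes come from the paper; the function and return shape are invented for illustration):

```python
# Toy sketch of the primary's record-append decision: pad the chunk
# if the record would overflow 64 MB, otherwise append at an offset
# of the primary's choosing and return it to the client.

CHUNK_SIZE = 64 * 1024 * 1024
MAX_RECORD = CHUNK_SIZE // 4            # records capped at ~16 MB

def record_append(chunk_used, record_len):
    """Return ('append', offset), or ('padded', None) telling the
    client to retry on the next chunk."""
    assert record_len <= MAX_RECORD
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad this chunk to its limit on all replicas; client retries.
        return ("padded", None)
    # Append succeeds; the chosen offset is returned to the client.
    return ("append", chunk_used)

assert record_append(0, 100) == ("append", 0)
assert record_append(CHUNK_SIZE - 50, 100) == ("padded", None)
```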

Implications for Writes(Jordan Video:27:40)

  • Prefer appends to writes.
  • No interleaving: each appended record is written as one contiguous sequence of bytes.
  • Readers need to be able to handle padding and/or duplicates(can happen due to failed retries or partial failures on some of the replicas).
  • If making multi-chunk writes, writers should take checkpoints as each of those individual write chunks goes through.
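
A reader that tolerates padding and duplicates might look like the sketch below; the record framing (ID + payload + checksum) is an assumed convention the notes hint at, not something the paper specifies in this exact form:

```python
# Toy sketch of a GFS-style reader discarding padding and duplicate
# records left behind by failed/retried appends. Writers are assumed
# to embed a checksum and a unique record ID in each record.
import zlib

def valid(record):
    rec_id, payload, checksum = record
    return checksum == zlib.crc32(payload)   # padding/garbage fails this

def read_records(raw_records):
    seen, out = set(), []
    for rec in raw_records:
        if not valid(rec):
            continue                          # skip padding regions
        rec_id = rec[0]
        if rec_id in seen:
            continue                          # skip duplicates from retries
        seen.add(rec_id)
        out.append(rec[1])
    return out

good = ("r1", b"data", zlib.crc32(b"data"))
pad = ("??", b"\x00\x00", 0)                  # padding: checksum won't match
result = read_records([good, pad, good])      # duplicate from a retried append
assert result == [b"data"]
```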

GFS consistency model and Snapshotting

GFS Consistency model

  • GFS has a relaxed consistency model: after a mutation, a file region may be consistent (all clients see the same bytes) without being defined (those bytes may mix fragments from multiple mutations).
  • Metadata operations (e.g., file creation) are atomic.
  • Namespace locking guarantees atomicity and correctness.
  • Master’s operation log defines a global total order of these operations.
  • In data mutations, there is an important distinction between write and append operations.
  • Write operations specify an offset at which mutations should occur, whereas appends are always applied at the end of the file.
  • This means that for the write operation, the offset in the chunk is predetermined, whereas for append, the system decides.
  • Concurrent writes to the same location are not serializable and may result in corrupted regions of the file.
  • With append operations, GFS guarantees the append will happen at-least-once and atomically (that is, as a contiguous sequence of bytes).
  • The system does not guarantee that all copies of the chunk will be identical (some may have duplicate data).

Snapshotting

  • A snapshot is a copy of some subtree of the global namespace as it exists at a given point in time.
  • GFS clients use snapshotting to efficiently branch two versions of the same data.
  • Snapshots in GFS are initially zero-copy.
  • When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files to snapshot.
  • It waits for leases to be revoked or expired and logs the snapshot operation to the operation log.
  • The snapshot is then made by duplicating the metadata for the source directory tree.
  • When a client makes a request to write to one of these chunks, the master detects that it is a copy-on-write chunk by examining its reference count (which will be more than one).
  • At this point, the master asks each ChunkServer holding the replica to make a copy of the chunk and store it locally.
  • Once the copy is complete, the master issues a lease for the new copy, and the write proceeds.
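
The copy-on-write mechanics can be sketched with reference counts; the `Master` class and its methods here are illustrative only:

```python
# Toy sketch of copy-on-write snapshotting: a snapshot duplicates only
# metadata, and a chunk is physically copied on the first write after
# the snapshot, detected via its reference count.

class Master:
    def __init__(self):
        self.files = {}        # path -> list of chunk handles
        self.refcount = {}     # chunk handle -> reference count
        self.next_handle = 0

    def create(self, path, nchunks):
        handles = list(range(self.next_handle, self.next_handle + nchunks))
        self.next_handle += nchunks
        self.files[path] = handles
        for h in handles:
            self.refcount[h] = 1

    def snapshot(self, src, dst):
        # Zero-copy: share chunk handles, bump reference counts.
        self.files[dst] = list(self.files[src])
        for h in self.files[dst]:
            self.refcount[h] += 1

    def write(self, path, idx):
        h = self.files[path][idx]
        if self.refcount[h] > 1:             # copy-on-write trigger
            self.refcount[h] -= 1
            new_h = self.next_handle
            self.next_handle += 1
            self.refcount[new_h] = 1
            self.files[path][idx] = new_h    # chunkservers would copy data here
        return self.files[path][idx]

m = Master()
m.create("/a", 2)
m.snapshot("/a", "/a.snap")
h = m.write("/a", 0)                          # first write forces a real copy
assert h not in m.files["/a.snap"]            # snapshot keeps the old chunk
assert m.files["/a"][1] == m.files["/a.snap"][1]  # untouched chunk still shared
```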

Fault Tolerance, High Availability, and Data Integrity

Agenda

  • Fault Tolerance
  • High Availability through chunk replication
  • Data Integrity through checksum.

Fault Tolerance

To make the system fault-tolerant and available, GFS uses two strategies:

  • Fast recovery in case of component failures.

  • Replication for high availability.

Let’s see how GFS recovers from master or replica failure:

  • On Master Failure

  • On Primary Replica Failure

  • On Secondary Replica Failure

  • Stale replicas might be exposed to clients; it is up to the application programmer to deal with these stale reads.

High Availability through chunk replication

  • Each chunk is replicated on multiple ChunkServers on different racks.
  • Users can specify different replication levels(Default: 3) for different parts of the file namespace.
  • The master clones the existing replicas to keep each chunk fully replicated as ChunkServers go offline or when the master detects corrupted replicas through checksum verification.
  • A chunk is lost irreversibly only if all its replicas are lost before GFS can react. Even in this case, the data becomes unavailable, not corrupted, which means applications receive clear errors rather than corrupt data.

Data Integrity through checksum

  • Checksumming is used by each ChunkServer to detect the corruption of stored data.

  • The chunk is broken down into 64 KB blocks.

  • Each 64 KB block has a corresponding 32-bit checksum.

  • Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.

  • For Reads: the ChunkServer verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another ChunkServer. ChunkServers will not propagate corruptions to other machines.

  • For Writes:

  • For Appends:

  • During idle periods, ChunkServers can scan and verify the contents of inactive chunks (prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk).

  • Checksumming has little effect on read performance for the following reasons:
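
The per-block checksum scheme can be sketched as follows; CRC-32 stands in for whatever 32-bit checksum GFS actually uses, and only blocks overlapping the read range are verified:

```python
# Sketch of per-block checksumming: a chunk is split into 64 KB
# blocks, each with a 32-bit checksum, verified for any block that
# overlaps a requested read range before data is returned.
import zlib

BLOCK = 64 * 1024

def checksums(chunk: bytes):
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def read(chunk, sums, offset, length):
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):        # verify overlapping blocks only
        if zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) != sums[b]:
            raise IOError(f"corrupt block {b}")
    return chunk[offset:offset + length]

data = bytes(200 * 1024)                    # 200 KB chunk -> 4 blocks
sums = checksums(data)
assert read(data, sums, 70 * 1024, 10) == bytes(10)

corrupt = data[:65 * 1024] + b"\x01" + data[65 * 1024 + 1:]
try:
    read(corrupt, sums, 70 * 1024, 10)      # offset falls in block 1
    assert False, "corruption not detected"
except IOError:
    pass
```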

Garbage Collection

How does GFS implement Garbage Collection?

Agenda

  • Garbage collection through lazy deletion
  • Advantages of lazy deletion
  • Disadvantages of lazy deletion

Garbage collection through lazy deletion

  • When a file is deleted, GFS does not immediately reclaim the physical space used by that file. Instead, it follows a lazy garbage collection strategy.
  • When the client issues a delete file operation, GFS does two things: it logs the deletion immediately, and it renames the file to a hidden name that includes the deletion timestamp.
  • The file can still be read under the new, special name and can also be undeleted by renaming it back to normal.
  • To reclaim the physical storage, the master, while performing regular scans of the file system, removes any such hidden files if they have existed for more than three days (this interval is configurable) and also deletes its in-memory metadata.
  • This lazy deletion scheme provides a window of opportunity to a user who deleted a file by mistake to recover the file.
  • The master, while performing regular scans of the chunk namespace, deletes the metadata of all chunks that are not part of any file.
  • Also, during the exchange of regular HeartBeat messages with the master, each ChunkServer reports a subset of the chunks it has, and the master replies with a list of chunks from that subset that are no longer present in the master’s database; such chunks are then deleted from the ChunkServer.
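
The HeartBeat-driven reclamation exchange above can be sketched as a toy round trip (function names invented for illustration):

```python
# Toy sketch of lazy garbage collection via HeartBeat: the chunkserver
# reports a subset of its chunks, the master replies with those no
# longer in its database, and the chunkserver deletes them locally.

def master_reply(reported_chunks, master_chunk_db):
    """Chunks the master no longer knows about are garbage."""
    return [c for c in reported_chunks if c not in master_chunk_db]

def heartbeat(chunkserver_chunks, master_chunk_db):
    garbage = master_reply(chunkserver_chunks, master_chunk_db)
    for c in garbage:
        chunkserver_chunks.remove(c)        # reclaim local storage
    return garbage

master_db = {"c1", "c2"}                    # "c3" was lazily deleted earlier
cs = {"c1", "c2", "c3"}
assert heartbeat(cs, master_db) == ["c3"]
assert cs == {"c1", "c2"}
```

Note that a lost deletion message needs no retry: the orphaned chunk simply shows up again in a later heartbeat report.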

Advantages of lazy deletion

  • Simple and reliable: If the chunk deletion message is lost, the master does not have to retry. The ChunkServer can perform the garbage collection with the subsequent heartbeat messages.
  • GFS merges storage reclamation into regular background activities of the master, such as the regular scans of the filesystem or the exchange of HeartBeat messages. Thus, it is done in batches, and the cost is amortized.
  • Garbage collection takes place when the master is relatively free.
  • Lazy deletion provides safety against accidental, irreversible deletions.

Disadvantages of lazy deletion

  • As we know, after deletion, storage space does not become available immediately. Applications that frequently create and delete files may not be able to reuse the storage right away. To overcome this, GFS provides the following options:

Criticism on GFS

Problems associated with single master

  • Google has started to see the following problems with the centralized master scheme:

Problems associated with large chunk size

  • Large chunk size (64MB) in GFS has its disadvantages while reading. Since a small file will have one or a few chunks, the ChunkServers storing those chunks can become hotspots if a lot of clients are accessing the same file.
  • As a workaround for this problem, GFS stores extra copies of small files for distributing the load to multiple ChunkServers. Furthermore, GFS adds a random delay in the start times of the applications accessing such files.

Summary

  • Scalable distributed file storage system for large data-intensive applications.
  • Uses commodity hardware to reduce infrastructure costs.
  • Was designed with Fault Tolerance in mind(Software/hardware faults).
  • Reading workload is large streaming reads and small random reads.
  • Writing workload is many large sequential writes that appends data to files.
  • Provides APIs for file operations like create, delete, open, close, read, write, snapshot and record append operations. Record append allows multiple clients to concurrently append data to the same file while guaranteeing atomicity.
  • GFS cluster is single master, multiple chunk servers & access by Multiple clients.
  • Files are broken into 64 MB chunks, identified by Immutable and Globally unique 64-bit Chunk Handle(assigned by master during chunk creation).
  • Chunk servers store chunks on local disks as Linux files. For Reliability, each chunk is replicated to multiple chunk servers.
  • Master is Coordinator for GFS cluster. Responsible for keeping track of all the filesystem metadata. Namespace, authorization, files-chunk mapping, chunk location.
  • Master keeps all metadata in memory for faster operation. For Fault tolerance, and to handle master crash, all metadata changes are written onto disk into Operation Log which is replicated to other machines.
  • Master doesn’t have a persistent record(only in-memory) of which chunk servers have replicas for a given chunk. Master asks each chunk server what chunks it holds at master startup, or whenever the chunk server joins the cluster.
  • For Quick recovery(Master failure), master’s state is periodically serialized to disk(Checkpointed) along with Operation log and is replicated. On recovery, master loads the checkpoint, and replays subsequent operations from Operation Log.
  • Master communicates with each chunk server via HeartBeat to collect state.
  • Applications use GFS Client code, which implements filesystem API, and communicates with the cluster. Clients interact with master for metadata(Control Flow), but all data transfer happens directly(Data Flow) between client and Chunk servers.
  • Data Integrity: Each Chunk server uses Checksumming to detect corruption of stored data.
  • Garbage Collection: Lazy Deletion.
  • Consistency: Master guarantees data consistency by ensuring the order of mutations on all replicas and using chunk version numbers. If a replica has an incorrect version, it is garbage collected.
  • GFS guarantees at-least-once writes. It is the responsibility of readers to deal with duplicate chunks. This is achieved by having Checksums and serial numbers in the chunks, which help readers to filter and discard duplicate data.
  • Cache: Neither clients nor chunk servers cache file data. However, clients do cache metadata.

System Design Patterns

  • Write-Ahead-Log - Operation Log
  • HeartBeat - B/w Master and Chunk servers.
  • CheckSum - Data Integrity
  • Copy-On-Write Snapshotting.
  • Lazy Garbage collection.

References

  • GFS Paper
  • BigTable Paper
  • GFS Evolution on Fast-Forward
  • The Jordan video gives a quick summary of the above.

Paper Link: https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf


Last updated: March 15, 2026


Hadoop Distributed File System
https://www.sethihemant.com/notes/hdfs-2010/
Sun, 08 Dec 2024

Paper: Hadoop Distributed File System (https://pages.cs.wisc.edu/~akella/CS838/F15/838-CloudPapers/hdfs.pdf)


Hadoop Distributed File System

Goal

  • Design a distributed system that can store huge files (terabyte and larger). The system should be scalable, reliable, and highly available.

What is Hadoop Distributed File System

  • HDFS is a distributed file system and was built to store unstructured data. It is designed to store huge files reliably and stream those files at high bandwidth to user applications.
  • HDFS is a variant and a simplified version of the Google File System (GFS). A lot of HDFS architectural decisions are inspired by GFS design. HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.

Background

  • Apache Hadoop is a software framework that provides a distributed file storage system(HDFS) and distributed computing for analyzing and transforming very large data sets using the MapReduce programming model.
  • HDFS is the default file storage system in Hadoop. It is designed to be a distributed, scalable, fault-tolerant file system that primarily caters to the needs of the MapReduce paradigm.
  • Both HDFS and GFS were built to store very large files and scale to store petabytes of storage.
  • Both were built for handling batch processing on huge data sets and were designed for data-intensive applications and not for end-users.
  • Like GFS, HDFS is also not POSIX-compliant and is not a mountable file system on its own. It is typically accessed via HDFS clients or by using application programming interface (API) calls from the Hadoop libraries.
  • Given HDFS design, following applications are not a good fit for HDFS,

API

  • Provides user-level APIs(and not standard POSIX-like APIs).
  • Files are organized hierarchically in directories and identified by their pathnames.
  • Supports the usual file system operations on files and directories. Create, Delete, Rename, Move, and Symbolic Links(unlike GFS) etc.
  • All read and write operations are done in an append-only fashion.

High Level Architecture

HDFS Architecture

  • Files are broken into 128 MB fixed-size blocks (configurable on a per-file basis).

  • File has two parts: the actual file data and the metadata.

  • Metadata

  • HDFS cluster primarily consists of a NameNode(Master:GFS) that manages the file system metadata and DataNodes(Chunk Server:GFS) that store the actual data.

  • All blocks of a file are of the same size except the last one.

  • HDFS uses large block sizes because it is designed to store extremely large files to enable MapReduce jobs to process them efficiently.

  • Each block is identified by a unique 64-bit ID called BlockID(Similar to Chunk in GFS). All read/write operations in HDFS operate at the block level.

  • DataNodes store each block in a separate file on the local file system and provide read/write access.

  • When a DataNode starts up, it scans through its local file system and sends the list of hosted data blocks (called BlockReport) to the NameNode.(Similar to how Master gets state information in GFS from Chunk Servers).

  • The NameNode maintains two on-disk data structures to store the file system’s state: an FsImage file(Oplog Checkpoint:GFS) and an EditLog(Operation Log: GFS).

  • FsImage is a checkpoint of the file system metadata at some point in time, while the EditLog is a log of all of the file system metadata transactions since the image file was last created. These two files help NameNode to recover from failure.

  • User applications interact with HDFS through its client. HDFS Client interacts with NameNode for metadata, but all data transfers happen directly between the client and DataNodes.

  • To achieve high-availability, HDFS creates multiple copies of the data and distributes them on nodes throughout the cluster.

Comparison b/w GFS and HDFS

Deep Dive

Cluster Topology

  • Hadoop clusters typically have about 30 to 40 servers per rack.
  • Each rack has a dedicated gigabit switch that connects all of its servers and an uplink to a core switch or router, whose bandwidth is shared by many racks in the data center.
  • When HDFS is deployed on a cluster, each of its servers is configured and mapped to a particular rack. The network distance between servers is measured in hops, where one hop corresponds to one link in the topology.
  • Hadoop assumes a tree-style topology, and the distance between two servers is the sum of their distances to their closest common ancestor.
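
The tree-style distance rule works out as in this sketch; the `/datacenter/rack/node` path form follows Hadoop's usual network-location convention:

```python
# Sketch of Hadoop's network distance: the distance between two nodes
# is the sum of their distances to the closest common ancestor in the
# topology tree (data center -> rack -> node).

def distance(a: str, b: str) -> int:
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    while common < min(len(pa), len(pb)) and pa[common] == pb[common]:
        common += 1
    return (len(pa) - common) + (len(pb) - common)

assert distance("/d1/r1/n1", "/d1/r1/n1") == 0   # same node
assert distance("/d1/r1/n1", "/d1/r1/n2") == 2   # same rack
assert distance("/d1/r1/n1", "/d1/r2/n3") == 4   # same data center
assert distance("/d1/r1/n1", "/d2/r3/n4") == 6   # different data centers
```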

Rack aware replication

  • HDFS employs a rack-aware replica placement policy to improve data reliability, availability, and network bandwidth utilization.
  • The idea behind HDFS’s replica placement is to be able to tolerate node and rack failures.
  • If the replication factor is three, HDFS attempts to place the first replica on the writer’s node, and the second and third replicas on two different nodes in a remote rack.
  • This rack-aware replication scheme slows the write operation as the data needs to be replicated onto different racks, tradeoff between reliability and performance.
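
A minimal sketch of the default placement for replication factor 3, assuming the first replica goes on the writer's node and the other two on one remote rack; the real policy also weighs node load and free space:

```python
# Toy sketch of rack-aware replica placement: one replica local to the
# writer, two on distinct nodes of a different rack, so a whole-rack
# failure still leaves a live copy.
import random

def place_replicas(writer_node, nodes_by_rack, writer_rack):
    first = writer_node
    remote_racks = [r for r in nodes_by_rack if r != writer_rack]
    rack = random.choice(remote_racks)            # pick one remote rack
    second, third = random.sample(nodes_by_rack[rack], 2)
    return [first, second, third]

nodes = {"rackA": ["a1", "a2"], "rackB": ["b1", "b2", "b3"]}
placement = place_replicas("a1", nodes, "rackA")
assert placement[0] == "a1"
assert placement[1] in nodes["rackB"] and placement[2] in nodes["rackB"]
assert placement[1] != placement[2]
```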

Synchronization Semantics

  • Early versions of HDFS followed strict immutable semantics. Once a file was written, it could never again be re-opened for writes; files could still be deleted.
  • Current versions of HDFS support append.
  • This design choice in HDFS was because most MapReduce workloads follow the write once and read many data-access patterns.
  • MapReduce is a restricted computational model with predefined stages. The reducers in MapReduce write independent files to HDFS as output. HDFS focuses on fast read access for multiple clients at a time.

HDFS Consistency Model

  • HDFS follows a strong consistency model.
  • To ensure strong consistency, a write is declared successful only when all replicas have been written successfully.
  • HDFS does not allow multiple concurrent writers to write to an HDFS file, so implementing strong consistency becomes relatively easy.

Anatomy of a Read Operation

HDFS Read Process

  • (1) When a file is opened for reading, the HDFS client initiates a read request by calling the open() method of the Distributed FileSystem object. The client specifies the file name, start offset, and the read range length.
  • (2) The Distributed FileSystem object calculates what blocks need to be read based on the given offset and range length, and requests the locations of the blocks from the NameNode.
  • (3) NameNode has metadata for all blocks’ locations. It provides the client a list of blocks and the locations of each block replica. As the blocks are replicated, NameNode finds the closest replica to the client when providing a particular block’s location. The closest locality of each block is determined as follows:
  • (4) After getting the block locations, the client calls the read() method of FSDataInputStream, which takes care of all the interactions with the DataNodes.
  • (5) Once the client invokes the read() method, the input stream object establishes a connection with the closest DataNode with the first block of the file.
  • (5b) The data is read in the form of streams and passed to the requesting application. Hence, the block does not have to be transferred in its entirety before the client application starts processing it.
  • (6) Once the FSDataInputStream receives all data of a block, it closes the connection and moves on to connect to the DataNode for the next block. It repeats this process until it finishes reading all the required blocks of the file.
  • (7) Once the client finishes reading all the required blocks, it calls the close() method of the input stream object.
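
The steps above can be condensed into a toy sketch; these classes are illustrative stand-ins, not the real Hadoop client API:

```python
# Toy sketch of the HDFS read path: ask the NameNode for block
# locations once (control flow), then stream each block directly
# from a DataNode (data flow).

class NameNode:
    def __init__(self, block_map):
        self.block_map = block_map   # block_id -> [DataNode replicas]

    def get_block_locations(self, blocks):
        # Real HDFS sorts replicas by network distance to the client.
        return {b: self.block_map[b] for b in blocks}

class DataNode:
    def __init__(self, store):
        self.store = store           # block_id -> bytes

    def read_block(self, block_id):
        return self.store[block_id]

dn1 = DataNode({"blk_1": b"hello "})
dn2 = DataNode({"blk_2": b"world"})
nn = NameNode({"blk_1": [dn1], "blk_2": [dn2]})

def read_file(namenode, blocks):
    locations = namenode.get_block_locations(blocks)    # control flow
    data = b""
    for b in blocks:
        data += locations[b][0].read_block(b)           # data flow
    return data

assert read_file(nn, ["blk_1", "blk_2"]) == b"hello world"
```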

Short Circuit Read

  • If the data and the client are on the same machine, HDFS can directly read the file bypassing the DataNode. This scheme is called short circuit read and is quite efficient as it reduces overhead and other processing resources.

Anatomy of a Write Process

  • HDFS client initiates a write request by calling the create() method of the Distributed FileSystem object.
  • Distributed FileSystem object sends a file creation request to the NameNode.
  • NameNode verifies that the file does not already exist and that the client has permission to create the file. If both these conditions are verified, the NameNode creates a new file record and sends an acknowledgment.
  • Client proceeds to write the file using FSData OutputStream.
  • FSData OutputStream writes data to a local queue called ‘Data Queue.’ The data is kept in the queue until a complete block of data is accumulated.
  • Once the queue has a complete block, another component called DataStreamer is notified to manage data transfer to the DataNode.
  • DataStreamer first asks the NameNode to allocate a new block on DataNodes, thereby picking desirable DataNodes to be used for replication.
  • The NameNode provides a list of blocks and the locations of each block replica.
  • Upon receiving the block locations from the NameNode, the DataStreamer starts transferring the blocks from the internal queue to the nearest DataNode.
  • Each block is written to the first DataNode, which then pipelines the block to other DataNodes in order to write replicas of the block.
  • Once the DataStreamer finishes writing all blocks, it waits for acknowledgments from all the DataNodes.
  • Once all acknowledgments are received, the client calls the close() method of the OutputStream.
  • Finally, the Distributed FileSystem contacts the NameNode to notify that the file write operation is complete. At this point, the NameNode commits the file creation operation, which makes the file available to be read.
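
The replication pipeline in the write path above can be sketched as a forward-then-ack pass (the recursion and names are illustrative):

```python
# Toy sketch of the HDFS write pipeline: the client sends a block to
# the first DataNode, which forwards it down the pipeline; acks
# propagate back up once downstream nodes have stored it.

def pipeline_write(block, datanodes):
    """Each node stores the block and forwards it to the next;
    returns the acks collected on the way back."""
    if not datanodes:
        return []
    head, rest = datanodes[0], datanodes[1:]
    head.append(block)                      # store locally (a list per node)
    downstream_acks = pipeline_write(block, rest)
    return downstream_acks + ["ack"]        # ack flows upstream last

dn_storage = [[], [], []]                   # three DataNodes' local stores
acks = pipeline_write(b"block-0", dn_storage)
assert len(acks) == 3                       # one ack per replica
assert all(store == [b"block-0"] for store in dn_storage)
```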

Data Integrity(Block Scanner) & Caching

Data Integrity

  • Data Integrity refers to ensuring the correctness of the data.
  • When a client retrieves a block from a DataNode, the data may arrive corrupted. This corruption can occur because of faults in the storage device, network, or the software itself.
  • HDFS client uses checksum to verify the file contents.
  • When a client stores a file in HDFS, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace.
  • When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file.
  • If not, then the client can opt to retrieve that block from another replica.

Block Scanner

  • A block scanner process periodically runs on each DataNode to scan blocks stored on that DataNode and verify that the stored checksums match the block data.
  • Additionally, when a client reads a complete block and checksum verification succeeds, it informs the DataNode. The DataNode treats it as a verification of the replica.
  • Whenever a client or a block scanner detects a corrupt block, it notifies the NameNode.
  • The NameNode marks the replica as corrupt and initiates the process to create a new good replica of the block.

Caching

  • Normally, blocks are read from the disk, but for frequently accessed files, blocks may be explicitly cached in the DataNode’s memory, in an off-heap block cache.
  • HDFS offers a Centralized Cache Management scheme to allow its clients to specify to the NameNode file paths which need to be cached.
  • NameNode communicates with the DataNodes that have the desired blocks on disk and instructs them to cache the blocks in off-heap caches.
  • Advantages of Centralized Cache management in HDFS:

Fault Tolerance

Agenda

  • How does HDFS handle DataNode failures?
  • What happens when the NameNode fails?

How does HDFS handle DataNode failures?

Replication

  • As the blocks are replicated to multiple(Default 3) datanodes’ replicas, if one DataNode becomes inaccessible, its data can be read from other replicas.

HeartBeat

  • The NameNode keeps track of DataNodes through a heartbeat mechanism. Each DataNode sends periodic heartbeat messages (every few seconds) to the NameNode.
  • If a DataNode dies, the heartbeats will stop, and the NameNode will detect that the DataNode has died. The NameNode will then mark the DataNode as dead and will no longer forward any read/write request to that DataNode.
  • Because of replication, the blocks stored on that DataNode have additional replicas on other DataNodes.
  • The NameNode performs regular status checks on the file system to discover under-replicated blocks and performs a cluster rebalance process to replicate blocks that have less than the desired number of replicas.
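
Heartbeat-based failure detection and the under-replication check can be sketched as follows (the timeout value and names are illustrative):

```python
# Toy sketch of NameNode bookkeeping: DataNodes whose heartbeats are
# older than a timeout are marked dead, and blocks that fall below
# the replication factor are queued for re-replication.

HEARTBEAT_TIMEOUT = 30.0                    # seconds (illustrative)
REPLICATION = 3

def under_replicated(last_heartbeat, block_locations, now):
    dead = {dn for dn, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT}
    todo = {}
    for block, dns in block_locations.items():
        live = [dn for dn in dns if dn not in dead]
        if len(live) < REPLICATION:
            todo[block] = REPLICATION - len(live)   # replicas to create
    return dead, todo

hb = {"dn1": 100.0, "dn2": 100.0, "dn3": 60.0}      # dn3 went silent
blocks = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn1", "dn2"]}
dead, todo = under_replicated(hb, blocks, now=100.0)
assert dead == {"dn3"}
assert todo == {"blk_1": 1, "blk_2": 1}
```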

What happens when the NameNode fails?

FsImage and EditLog

  • NameNode is a single point of failure (SPOF). Will bring the entire file system down.
  • Internally, the NameNode maintains two on-disk data structures that store the file system’s state: an FsImage file and an EditLog. FsImage is a checkpoint (or the image) of the file system metadata at some point in time, while the EditLog is a log of all of the file system metadata transactions since the image file was last created.
  • All incoming changes to the file system metadata are written to the EditLog.
  • At periodic intervals, the EditLog and FsImage files are merged to create a new image file snapshot, and the edit log is cleared out.
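
The FsImage/EditLog interplay can be sketched as a toy state machine (operations and names simplified for illustration):

```python
# Toy sketch of NameNode durability: mutations go to the EditLog, a
# checkpoint merges them into a new FsImage and clears the log, and
# recovery is "load FsImage, then replay the EditLog".

def apply(metadata, op):
    kind, path = op
    if kind == "create":
        metadata[path] = []                 # path -> block list
    elif kind == "delete":
        metadata.pop(path, None)
    return metadata

def checkpoint(fsimage, editlog):
    for op in editlog:                      # merge log into the image
        apply(fsimage, op)
    return fsimage, []                      # new FsImage, cleared EditLog

def recover(fsimage, editlog):
    state = dict(fsimage)                   # load the checkpoint
    for op in editlog:                      # replay operations since then
        apply(state, op)
    return state

fsimage, editlog = {}, [("create", "/a"), ("create", "/b"), ("delete", "/a")]
fsimage, editlog = checkpoint(fsimage, editlog)
editlog.append(("create", "/c"))            # post-checkpoint mutation
assert recover(fsimage, editlog) == {"/b": [], "/c": []}
```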

Metadata backup

  • On a NameNode failure, the metadata would be unavailable, and a disk failure on the NameNode would be catastrophic because the file metadata would be permanently lost since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes.
  • Thus, it is crucial to make the NameNode resilient to failure, and HDFS provides two mechanisms for this:

HDFS High Availability

Agenda

  • HDFS high availability architecture
  • Failover and fencing

HDFS high availability architecture

Problem

  • Although NameNode’s metadata is copied to multiple file systems to protect against data loss, it still does not provide high availability of the filesystem.
  • If the NameNode fails, no clients will be able to read, write, or list files, because the NameNode is the sole repository of the metadata and the file-to-block mapping.
  • In such an event, the whole Hadoop system would effectively be out of service until a new NameNode is brought online.
  • To recover from a failed NameNode scenario, an administrator will start a new primary NameNode with one of the filesystem metadata replicas and configure DataNodes and clients to use this new NameNode.
  • The new NameNode is not able to serve requests until it has (i) loaded its namespace image (FsImage) into memory, (ii) replayed the EditLog, and (iii) received enough block reports from the DataNodes to leave safe mode.
  • On large clusters with many files and blocks, it can take half an hour or more to perform a cold start of a NameNode.
  • Furthermore, this long recovery time is a problem for routine maintenance.

Solution

  • Hadoop 2.0 added support for High Availability(HA).
  • There are two (or more) NameNodes in an active-standby configuration.
  • The active NameNode is responsible for all client operations in the cluster, while the Standby simply acts as a follower of the active, maintaining enough state to provide a fast failover when required.
  • For the Standby nodes to keep their state synchronized with the active node, HDFS made a few architectural changes:

Quorum Journal Manager(QJM)

  • Provide a highly available EditLog.
  • QJM runs as a group(usually 3 where 1 can fail) of journal nodes, and each edit must be written to a quorum (or majority) of the journal nodes.
  • Similar to the way Zookeeper works except QJM doesn’t use ZooKeeper.
  • HDFS High Availability does use ZooKeeper for electing the active NameNode (Master Election).
  • The QJM process runs on all NameNodes and communicates all EditLog changes to journal nodes using RPC.
  • Since the Standby NameNodes have the latest state of the metadata available in memory (both the latest EditLog and an up-to-date block mapping), any standby can take over very quickly (in a few seconds) if the active NameNode fails.
  • However, the actual failover time will be longer in practice (around a minute) because the system needs to be conservative in deciding that the active NameNode has failed(Failure Detection).
  • In the unlikely event of the Standbys being down when the active fails, the administrator can still do a cold start of a Standby. This is no worse than the non-HA case.
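
The quorum rule can be sketched as follows (journal-node representation invented for illustration):

```python
# Toy sketch of quorum-based EditLog writes: an edit is durable once a
# strict majority of journal nodes acknowledge it, so one of three
# journal nodes may be down without blocking writes.

def quorum_write(edit, journal_nodes):
    acks = 0
    for jn in journal_nodes:
        if jn["up"]:
            jn["log"].append(edit)
            acks += 1
    return acks >= len(journal_nodes) // 2 + 1   # strict majority

jns = [{"up": True, "log": []}, {"up": True, "log": []},
       {"up": False, "log": []}]
assert quorum_write("create /a", jns) is True    # 2 of 3 acks suffice
jns[1]["up"] = False
assert quorum_write("create /b", jns) is False   # 1 of 3 is not a quorum
```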

Zookeeper

  • The ZKFailoverController (ZKFC) is a ZooKeeper client that runs on each NameNode and is responsible for coordinating with ZooKeeper as well as monitoring and managing the state of its NameNode.

Failover and fencing

  • A Failover Controller manages the transition from the active NameNode to the Standby. The default implementation of the failover controller uses ZooKeeper to ensure that only one NameNode is active(Single Leader). Failover Controller runs as a lightweight process on each NameNode and monitors the NameNode for failures (Failure Detection using Heartbeat), and triggers a failover when the active NameNode fails(New Leader Election).
  • Graceful failover: For routine maintenance, an administrator can manually initiate a failover. This is known as a graceful failover, since the failover controller arranges an orderly transition from the active NameNode to the Standby.
  • Ungraceful failover: In the case of an ungraceful failover, however, it is impossible to be sure that the failed NameNode has stopped running. For example, a slow network or a network partition can trigger a failover transition, even though the previously active NameNode is still running and thinks it is still the active NameNode.
  • The HA implementation uses the mechanism of Fencing to prevent this “split-brain” scenario and ensure that the previously active NameNode is prevented from doing any damage and causing corruption.

Fencing

  • Fencing is the idea of putting a fence around the previously active NameNode (old leader) so that it cannot access cluster resources and hence stops serving any read/write requests. Two fencing techniques: revoking the old NameNode’s access to the shared edits storage, and killing its process over SSH (with STONITH, i.e., forcibly power-cycling the host, as a last resort).
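A minimal sketch of the fencing idea (QJM in fact fences writers with epoch numbers; the class below is an invented illustration, not HDFS code):

```python
# Epoch-based fencing sketch: each journal node remembers the highest
# epoch it has promised to, and rejects edits from any NameNode
# presenting an older epoch (the fenced-off old active).

class JournalNode:
    def __init__(self):
        self.promised_epoch = 0
        self.log = []

    def new_epoch(self, epoch: int) -> bool:
        """A newly active NameNode claims a strictly higher epoch."""
        if epoch <= self.promised_epoch:
            return False
        self.promised_epoch = epoch
        return True

    def write(self, epoch: int, edit: str) -> bool:
        """Accept an edit only from the current (highest) epoch."""
        if epoch < self.promised_epoch:
            return False          # stale writer: fenced out
        self.log.append(edit)
        return True


jn = JournalNode()
jn.new_epoch(1)                   # old active NameNode
jn.new_epoch(2)                   # standby takes over after failover
assert jn.write(2, "mkdir /a")    # new active succeeds
assert not jn.write(1, "rm /a")   # old active can no longer do damage
```

Even if the old active NameNode still believes it is the leader (the split-brain case above), the quorum simply ignores its writes.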

HDFS Characteristics

Explore some important aspects of HDFS architecture.

Agenda

  • Security and permission
  • HDFS federation
  • Erasure coding
  • HDFS in practice

Security and permission

  • Permission model for files and directories similar to POSIX.
  • Each file and directory is associated with an owner and a group and has separate permission for Owner, vs Group members, vs Others, similar to POSIX.
  • Three permission types, like POSIX: read (r), write (w), and execute (x).
  • Optional support for POSIX ACLs to augment file permissions with finer-grained rules for named specific users or groups.
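The owner/group/other triplets decode exactly as in POSIX; a small illustrative helper (`describe` is invented here):

```python
# Decode POSIX-style rwx permission bits as HDFS uses them.
# Mode 0o754 -> owner rwx, group r-x, others r--.

def describe(mode: int) -> str:
    out = []
    for shift in (6, 3, 0):            # owner, group, others
        triplet = (mode >> shift) & 0o7
        out.append("".join(ch if triplet & (4 >> i) else "-"
                           for i, ch in enumerate("rwx")))
    return "".join(out)

assert describe(0o754) == "rwxr-xr--"
```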

HDFS federation(NameNode Partitioning)

  • The NameNode keeps the whole metadata in memory. For extremely large clusters, this memory becomes a bottleneck, as does serving all metadata requests from a single node.
  • To solve this problem, HDFS Federation was Introduced in HDFS 2.x.
  • Allows a cluster to scale by adding NameNodes, each of which manages a portion of the filesystem namespace, e.g., /user managed by NN1 and /root by NN2.
  • Under federation:
  • Multiple NN can generate the same 64-bit BlockID for their blocks.
  • To avoid this problem, a namespace uses one or more Block Pools, where a unique ID identifies each block pool in a cluster.
  • A block pool belongs to a single namespace and does not cross the namespace boundary.
  • The extended block ID, which is a tuple of (Block Pool ID, Block ID), is used for block identification in HDFS Federation.
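The block-pool point above can be illustrated directly; the pool IDs below are made up:

```python
# Sketch: a bare 64-bit BlockID can collide across federated NameNodes,
# while the extended ID (Block Pool ID, Block ID) is unique cluster-wide.

from collections import namedtuple

ExtendedBlockID = namedtuple("ExtendedBlockID", ["block_pool_id", "block_id"])

# Two NameNodes independently hand out the same 64-bit block number...
nn1_block = ExtendedBlockID("BP-1-nn1", 0xCAFE)
nn2_block = ExtendedBlockID("BP-2-nn2", 0xCAFE)

assert nn1_block.block_id == nn2_block.block_id   # raw IDs collide
assert nn1_block != nn2_block                     # extended IDs do not
```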

Erasure coding

  • By default, HDFS stores three copies of each block, resulting in a 200% overhead (to store two extra copies) in storage space and other resources (e.g., network bandwidth).
  • Erasure Coding (EC) provides the same level of fault tolerance with much less storage space. In a typical EC setup, the storage overhead is no more than 50%.
  • This fundamentally doubles the storage space capacity by bringing down the replication factor from 3x to 1.5x.
  • Under EC, data is broken down into fragments, expanded, encoded with redundant data pieces, and stored across different DataNodes.
  • If, at some point, data is lost on a DataNode due to corruption, etc., then it can be reconstructed using the other fragments stored on other DataNodes.
  • Although EC is more CPU intensive, it greatly reduces the storage needed for reliably storing a large data set.
  • References:

HDFS in practice

  • Was primarily designed to support Hadoop MapReduce jobs by providing Distributed File System for Map and Reduce Operations.
  • HDFS is now used with Many Big-Data Tools, e.g. in Several Apache Projects built on top of Hadoop, incl, Pig, Hive, HBase, Giraph etc.. Also GraphLab.
  • Advantages of HDFS?
  • Disadvantages of HDFS?

Summary

  • Scalable distributed file system for large distributed data intensive applications.
  • Uses commodity hardware to reduce infrastructure costs.
  • POSIX-like(but not compatible) APIs for file operations.
  • Random writes are not possible. Append-Only.
  • Doesn’t support multiple concurrent writers to append to the same chunk like GFS.
  • Single NameNode and Multiple DataNodes in initial architecture.
  • Files are broken into 128 MB Blocks identified by 64-bit Globally unique block ID.
  • Blocks are replicated to multiple machines(default 3,Configurable) to provide redundancy. 200% overhead on replication. Can be reduced to 50% by using Erasure Coding.
  • DataNodes stores blocks on local disks as Linux Files.
  • NameNode is coordinator for HDFS Cluster. Keeps track of all filesystem metadata.
  • NameNode keeps all metadata in memory(for faster access). For Fault Tolerance(Node Crash), in-memory metadata changes are written to a Write Ahead Log(EditLog). For Disk Crash Tolerance, Edit Log can be replicated to a Remote File System(NFS) or QJM(Quorum Journal Manager) V2, or secondary NameNode(V1).
  • The NameNode doesn’t persist block replica locations. Instead, each DataNode reports which block replicas it holds at NameNode startup or when it joins the cluster, and subsequent heartbeats keep this state up to date.
  • FsImage: The NameNode checkpoints the EditLog into an FsImage, which is serialized to disk and replicated to other nodes, so on failover or NameNode start it can quickly load the checkpoint and replay the subsequent EditLog to rebuild its state.
  • User applications Interact with HDFS using HDFS client, which interacts with NameNode for metadata, and directly talks to DataNode for read/write operations.
  • DataNode and Clients use Checksums to validate data integrity of Blocks. Informs the NameNode to repair the replica if corrupted.
  • Lazy Collection: A deleted file is renamed to a hidden name to be garbage-collected later.
  • HDFS is a strongly Consistent FS. Write is declared successful only if it is replicated to all the replicas.
  • Cache: For Frequently accessed files, user specified file paths/blocks to the NameNode server, can be explicitly cached in DataNode’s memory in an Off-heap block cache.(GFS just uses Linux’s Buffer Cache).
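The checksum mechanism from the summary can be sketched with CRC32 standing in for HDFS's per-chunk checksums (`store_block`/`read_block` are invented helpers, not HDFS APIs):

```python
# Clients verify a block's checksum on read and surface corruption so
# the NameNode can repair the replica from another copy.

import zlib

def store_block(data: bytes) -> dict:
    return {"data": data, "checksum": zlib.crc32(data)}

def read_block(block: dict) -> bytes:
    if zlib.crc32(block["data"]) != block["checksum"]:
        # In HDFS the client would inform the NameNode and read
        # from another replica; here we just signal the corruption.
        raise IOError("corrupt replica")
    return block["data"]

block = store_block(b"hello hdfs")
assert read_block(block) == b"hello hdfs"

block["data"] = b"hello hdfS"        # simulate bit rot on disk
try:
    read_block(block)
    assert False, "corruption should be detected"
except IOError:
    pass
```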

System Design Patterns

  • Write Ahead Log - Fault Tolerance/Reliability.
  • HeartBeat
  • Split Brain
  • CheckSum - Data Integrity.

Reference

  • HDFS Paper
  • HDFS High Availability
  • HDFS Architecture
  • Distributed File Systems:A Survey

Paper Link: https://pages.cs.wisc.edu/~akella/CS838/F15/838-CloudPapers/hdfs.pdf


Last updated: March 15, 2026

Questions or discussion? Email me

Chubby
https://www.sethihemant.com/notes/chubby-2006/
Sat, 07 Dec 2024

Paper: Chubby (https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf)


Chubby / Distributed Locking Service

Goal

  • Design a highly available and consistent service that can store small objects and provide a locking mechanism on those objects.

What is Chubby?

  • Chubby is a service that provides a distributed locking mechanism and also stores small files.
  • Internally, it is implemented as a key/value store that also provides a locking mechanism on each object stored in it.
  • Extensively used in various systems inside Google to provide storage and coordination services for systems like GFS and BigTable.
  • Apache ZooKeeper is the open-source alternative to Chubby.
  • Chubby is a centralized service offering developer-friendly interfaces (to acquire/release locks and create/read/delete small files).
  • It does all this with just a few extra lines of code to any existing application without a lot of modification to application logic.
  • At a high level, Chubby provides a framework for distributed consensus.

Chubby Use Cases

  • Primarily Chubby was developed to provide a reliable locking service. Other use cases evolved like:

Leader Election

  • Any lock service can be seen as a consensus service, as it converts the problem of reaching consensus to handing out locks.
  • A set of distributed applications compete to acquire a lock, and whoever gets the lock first gets the resource.
  • Similarly, an application can have multiple replicas running and wants one of them to be chosen as the leader. Chubby can be used for leader election among a set of replicas.

Naming Service(Like DNS)

  • It is hard to make faster updates to DNS due to its time-based caching nature, which means there is generally a potential delay before the latest DNS mapping is effective.

Storage(Small Objects that rarely change)

  • Chubby provides a Unix-style interface to reliably store small files that do not change frequently (complementing the service offered by GFS).
  • Applications can then use these files for any usage like DNS, configs, etc.

Distributed Locking Mechanism

  • Chubby provides a developer-friendly interface for coarse-grained distributed locks (as opposed to fine-grained locks) to synchronize distributed activities in a distributed environment.
  • Application needs a few lines, and chubby can take care of all lock management so that devs can focus on business logic, and not solve distributed Locking problems in a Distributed system’s setting.
  • We can say that Chubby provides mechanisms like semaphores and mutexes for a distributed environment.

When Not to Use Chubby?

  • Bulk Storage is needed
  • Data update rate is high.
  • Locks are acquired/released frequently.
  • Usage is more like a publish/subscribe model.

Background

  • Chubby is neither really a research effort nor does it claim to introduce any new algorithms.
  • Rather, Chubby describes a certain design and implementation done at Google in order to provide a way for its clients to synchronize their activities and agree(Consensus) on basic information about their environment

Chubby and Paxos

  • Chubby uses Paxos underneath to manage the state of the Chubby system at any point in time.
  • Getting all nodes in a distributed system to agree on anything (e.g., election of primary among peers) is basically a kind of distributed consensus problem.
  • Distributed consensus using Asynchronous Communication is already solved by Paxos protocol.

Chubby Common Terms

Chubby Cell

  • Chubby cell is a Chubby Cluster. Most Chubby Cells are single Data Center(DC) but there can be some configuration where Chubby replicas exist Cross DC as well.
  • Chubby cell has two main components, server and client, that communicate via remote procedure call (RPC).

Chubby Servers

  • A Chubby Cell consists of a small set of servers(typically 5) known as Replicas.
  • Using Paxos, one of the servers is selected as Master which handles all client requests. Fails over to another replica if the master fails.
  • Each replica maintains a small database to store files/directories/locks.
  • The master writes directly to its own local database, which gets synced asynchronously to all the replicas(Reliability).
  • For Fault Tolerance, replicas are placed on different racks.

Chubby Client Library

  • Client applications use a Chubby library to communicate with the replicas in the chubby cell using RPC.

Chubby API

  • Chubby exports a unix-like file system interface similar to POSIX but simpler.
  • It consists of a strict tree of files and directories with name components separated by slashes. E.g. File format: /ls/chubby_cell/directory_name/…/file_name
  • A special name, /ls/local, will be resolved to the most local cell relative to the calling application or service. What is the most local Cell?
  • Chubby can be used for locking or storing a small amount of data or both, i.e., storing small files with locks.
  • API Categories

General

  • Open() : Opens a given named file or directory and returns a handle.
  • Close() : Closes an open handle.
  • Poison() : Allows a client to cancel all Chubby calls made by other threads without fear of deallocating the memory being accessed by them.
  • Delete() : Deletes the file or directory.

File

  • GetContentsAndStat() : Returns (atomically) the whole file contents and metadata associated with the file. This approach of reading the whole file is designed to discourage the creation of large files, as it is not the intended use of Chubby.
  • GetStat() : Returns just the metadata.
  • ReadDir() : Returns the contents of a directory – that is, names and metadata of all children.
  • SetContents() : Writes the whole contents of a file (atomically).
  • SetACL() : Writes new access control list information.

Locking

  • Acquire() : Acquires a lock on a file.
  • TryAcquire() : Tries to acquire a lock on a file; it is a non-blocking variant of Acquire.
  • Release() : Releases a lock.

Sequencer

  • GetSequencer() : Get the sequencer of a lock. A sequencer is a string representation of a lock.
  • SetSequencer() : Associate a sequencer with a handle.
  • CheckSequencer() : Check whether a sequencer is valid.

Chubby does not support operations like append, seek, moving files between directories, or making symbolic or hard links.

Files can only be completely read or completely written/overwritten. This makes it practical only for storing very small files.

Design Rationale

Agenda

  • Why was chubby built as a service?
  • Why coarse-grained locks?
  • Why advisory locks?
  • Why does Chubby need storage?
  • Why does Chubby export a Unix-like file system interface?
  • High Availability and reliability

Why was chubby built as a service rather than a distributed client library doing Paxos?

  • Why build a distributed service instead of a client library that only provides Paxos distributed consensus? A lock service has clear advantages over a client library:

Why coarse-grained locks?

Chubby lock usage is not expected to be fine-grained, in which locks might be held for only a short period (i.e., seconds or less). For example, electing a leader is not a frequent event. Reasons why only coarse-grained locks are supported:

  • Less load on the lock server
  • Survive Lock server failures
  • Fewer lock servers are needed:
  • Implement a fine-grained locking system on top of this coarse grained locking system Chubby.

Why advisory locks?

  • Chubby locks are advisory, which means it is up to the application to honor the lock. Chubby doesn’t make locked objects inaccessible to clients not holding their locks.
  • The Chubby paper gives the following reasons for not having mandatory locks:

Why does Chubby need storage?

  • To provide a Consistent view of the system to various distributed entities in some use cases like:

Why does Chubby export a Unix-like file system interface?

  • It significantly reduces the effort needed to write basic browsing and namespace manipulation tools, and reduces the need to educate casual Chubby users.

High Availability and reliability

  • Chubby compromises on performance in favor of availability and consistency. What?

How Chubby Works?

Agenda

  • Service Initialization
  • Client Initialization
  • Leader Election example using Chubby

Service Initialization

  • A master is chosen among chubby replicas using Paxos.
  • Current master information is persisted in storage and all replicas become aware of the master.

Client Initialization

  • Client contacts DNS to know listed Chubby replicas.
  • Client calls Chubby Server directly via Remote Procedure Call(RPC)
  • If that replica is not the master, it will return the address of the current master.
  • Once the master is located, the client maintains a session with it and sends all requests to it until it indicates that it is not the master anymore or stops responding.

Leader Election example using Chubby

Example of application that uses Chubby to elect a single master from a bunch of instances of the same application.

Sample Pseudocode for leader election from client application.
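Since the pseudocode itself is not reproduced in these notes, here is a hedged sketch of the pattern using the API names from the Chubby API section; `FakeCell` is an invented in-memory stand-in, not the real client library:

```python
# Leader election via a lock service: every replica races to acquire
# the same lock; the winner writes its identity, everyone else reads it.

class FakeCell:
    """Minimal in-memory stand-in for a Chubby lock file (illustration)."""
    def __init__(self):
        self.locked_by = None
        self.contents = b""

    def TryAcquire(self, who) -> bool:       # non-blocking lock attempt
        if self.locked_by is None:
            self.locked_by = who
            return True
        return False

    def SetContents(self, data: bytes):
        self.contents = data

    def GetContentsAndStat(self):
        return self.contents, {"owner": self.locked_by}


def elect(cell, my_address: str) -> str:
    """Each replica runs this; exactly one becomes leader."""
    if cell.TryAcquire(my_address):
        cell.SetContents(my_address.encode())     # advertise the winner
        return "leader"
    leader, _meta = cell.GetContentsAndStat()     # everyone else follows
    return "follower of " + leader.decode()


cell = FakeCell()
assert elect(cell, "10.0.0.1:80") == "leader"
assert elect(cell, "10.0.0.2:80") == "follower of 10.0.0.1:80"
```

A real client would also request a sequencer after acquiring the lock, so downstream servers can reject orders from a stale leader.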

Files, Directories and Handles

Agenda

  • Nodes
  • Metadata
  • Handles

Chubby’s file system interface is a tree of files and directories (which can have sub-directories but not files), each of which is called a node.

Nodes

  • Any node can act as an advisory reader/writer lock.
  • Nodes can be ephemeral or permanent.
  • Ephemeral files are used as temporary files and act as an indicator to others that a client is alive.
  • Ephemeral files are also deleted if no client has them open.
  • Ephemeral directories are also deleted if they are empty.
  • Any node can be explicitly deleted.

Metadata

  • Metadata for each node includes ACL(Access control list), 4 monotonically increasing 64-bit numbers, and a checksum.
  • ACL
  • Monotonically increasing 64-bit numbers: These numbers allow clients to detect changes easily.
  • Checksum : Chubby exposes a 64-bit file-content checksum so clients may tell whether files differ.

Handles

  • Clients open nodes to obtain handles(similar to Unix File Descriptors). Handles include:

Locks Sequencers and Lock-Delays

Agenda

  • Locks
  • Sequencer
  • Lock-Delay

Locks

  • Each chubby node can act as a reader-writer lock in the following two ways:

Sequencer

  • With distributed systems, receiving messages out of order is a problem.
  • Chubby uses sequence numbers to solve this problem.
  • In effect, the application achieves distributed consensus (total-order broadcast) among its own servers by electing a leader among them through a distributed lock service (Chubby), which in turn uses Paxos to reach consensus among its replicas.
  • After acquiring a lock on a file, a client can immediately request a Sequencer, which is an opaque byte string describing the state of the lock.
  • An application’s master server can generate a sequencer and send it with any internal order to other application servers.
  • Application servers that receive orders from a primary can check with Chubby if the sequencer is still good and does not belong to a stale primary (to handle the ‘split-brain’ scenario).
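The stale-primary check can be sketched as follows, simplifying the sequencer to a monotonically increasing integer (`AppServer` is invented for illustration; real sequencers are opaque byte strings validated via CheckSequencer()):

```python
# Fencing with sequencers: an application server accepts an order only
# if its sequencer is not older than the newest one it has seen, so a
# delayed message from a deposed primary is rejected.

class AppServer:
    def __init__(self):
        self.highest_seen = 0
        self.applied = []

    def handle_order(self, sequencer: int, order: str) -> bool:
        if sequencer < self.highest_seen:
            return False                  # stale primary: reject
        self.highest_seen = sequencer
        self.applied.append(order)
        return True


s = AppServer()
assert s.handle_order(1, "write x")       # current primary
assert s.handle_order(2, "write y")       # new primary after failover
assert not s.handle_order(1, "write z")   # delayed message from old primary
assert s.applied == ["write x", "write y"]
```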

Lock-Delay

  • For file servers (or external services) that do not support sequencers (i.e., fencing tokens that protect against delayed packets belonging to an older lock), Chubby provides a lock-delay period to protect against message delays and server restarts.
  • If a client releases a lock in the normal way, it is immediately available for other clients to claim, as one would expect.
  • However, if a lock becomes free because the holder has failed or become inaccessible, the lock server will prevent other clients from claiming the lock for a period called the lock-delay.
  • While imperfect, the lock-delay protects unmodified servers and clients from everyday problems caused by message delays and restarts.

Session and Events

Agenda

  • What is a Chubby Session?
  • Session Protocol
  • What is Keep Alive
  • Session Optimization
  • Failovers

What is a Chubby Session?

  • A relationship b/w Chubby Cell and a Client.
  • It exists for some interval of time and is maintained by periodic handshakes called keepalives.
  • Clients’ handles, locks, and cached data only remain valid provided its session remains valid.

Session Protocol

  • The client requests a new session from the Chubby cell’s master.
  • Session ends if the client explicitly ends it or it has been idle.
  • Each session has an associated lease, which is the time interval during which the master guarantees not to terminate the session unilaterally. End of this interval is called Session Lease Timeout.
  • Master advances session lease timeout in 3 circumstances:

What is Keep Alive

  • Keepalive is a way for a client to maintain a constant session with Chubby Cell.
  • Steps:
  • Google experimentation showed that 93% of RPC requests are KeepAlives.
  • How can we reduce the keepalives?

Session Optimization

  • Piggybacking events(using a different event to transmit some additional detail)
  • Local Lease
  • Jeopardy
  • Grace Period
  • Original(Initial chubby session):
  • Optimization Attempt 1:

Failovers

  • Failover happens when the master fails or otherwise loses membership. Chubby typically takes b/w 5-30 seconds for fail-over.

  • Summary of things that happen in a master failover.

  • Client has lease M1 (& local lease C1) with master and pending KeepAlive request.

  • Master starts lease M2 and replies to the KeepAlive request.

  • Client extends the local lease to C2 and makes a new KeepAlive call. Master dies before replying to the next KeepAlive. So, no new leases can be assigned. Client’s C2 lease expires, and the client library flushes its cache and informs the application that it has entered jeopardy. The grace period starts on the client.

  • Eventually, a new master is elected and initially uses a conservative approximation M3 of the session lease that its predecessor may have had for the client. The client sends a KeepAlive to the new master.

  • The first KeepAlive request from the client to the new master is rejected because it has the wrong master epoch number (described in the next section).

  • Client retries with another KeepAlive request.

  • Re-tried KeepAlive succeeds. Client extends its lease to C3 and optionally informs the application that its session is no longer in jeopardy (session is in the safe mode now).

  • Client makes a new KeepAlive call, and the normal protocol works from this point onwards.

  • Because the grace period was long enough to cover the interval between the end of lease C2 and the beginning of lease C3, the client saw nothing but a delay. If the grace period was less than that interval, the client would have abandoned the session and reported the failure to the application.

Master Election and Chubby Events?

Initializing a newly elected Master

  • A newly elected master proceeds as follows:
  • Picks a new Epoch Number: To differentiate itself from the previous master. Clients are required to present an epoch number on every call. Master rejects calls from clients using older epoch numbers. This ensures that the new master will not respond to a very old packet that was sent to the previous master.
  • Responds to master-location requests: but doesn’t respond to session related operations yet.
  • Build in-memory data structures:
  • Let clients perform keep-alives:
  • Emits a fail-over event to each session:
  • Wait: Master waits until each session acknowledges the fail-over event or lets its session expire.
  • Allow all operations to proceed.
  • Honor older handles by clients:
  • Delete Ephemeral files:
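The epoch-number filtering behind the first step can be sketched like this (hypothetical names; not the real Chubby implementation):

```python
class Master:
    """Sketch of epoch-based request filtering on a newly elected master.
    The epoch is picked at election time and is strictly greater than any
    predecessor's, so stale packets are easy to reject."""

    def __init__(self, epoch):
        self.epoch = epoch

    def handle(self, request_epoch, op):
        if request_epoch < self.epoch:
            # Packet addressed to an older master: reject it so the
            # client library retries with the current epoch.
            return ("rejected", self.epoch)
        return ("ok", op())
```

A client whose first KeepAlive is rejected learns the new epoch from the rejection and retries, which is exactly step (5) of the fail-over walkthrough above.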

Chubby Events

  • Chubby supports a simple event mechanism to let its clients subscribe to events.
  • Events are delivered asynchronously via callbacks from the chubby library.
  • Clients subscribe to a range of events while creating a handle.
  • Example of events from Server to Chubby Client:
  • Additionally Chubby client sends the following session events to the application:

Caching

Chubby Cache

  • Caching is important since Chubby is used for read-heavy rather than write-heavy workloads.
  • Chubby clients cache file contents, node metadata, and information on open handles in a consistent, write-through cache in clients’ memory.
  • Chubby must maintain consistency b/w a file, its replicas, and client caches as well.
  • Clients maintain their cache by a lease mechanism, and flush the cache when the lease expires.

Cache Invalidation

  • Protocol for cache invalidation when file data or metadata is changed:

Question: While the master is waiting for acknowledgments, are other clients allowed to read the file?

  • Answer: While the master waits for acknowledgments from clients, the file is treated as ‘uncachable.’ Clients can still read the file but will not cache it. This ensures reads are always processed without delay, which matters because reads outnumber writes.

Question: Are clients allowed to cache locks? If yes, how is it used?

  • Answer: Chubby allows its clients to cache locks, which means a client can hold a lock longer than strictly necessary, hoping it can be reused by the same client.

Question: Are clients allowed to cache open handles?

  • Answer: Chubby allows its clients to cache open handles. This way, if a client tries to open a file it has opened previously, only the first open() call goes to the master.

Database

Agenda

  • Backup

  • Mirroring

How Chubby uses a database for storage:

  • Initially, Chubby used a replicated version of Berkeley DB to store its data. Later, the Chubby team felt that using Berkeley DB exposes Chubby to more risks, so they decided to write a simplified custom database with the following characteristics:

Backup

  • For recovery in case of failure, all database transactions are stored in a transaction log (a write-ahead log).
  • As this transaction log can become very large over time, every few hours, the master of each Chubby cell writes a snapshot of its database to a GFS server in a different building.
  • The use of a separate building ensures both that the backup will survive building damage and that the backups introduce no cyclic dependencies in the system.
  • Once a snapshot is taken, the previous transaction log is deleted. Therefore, at any time, the complete state of the system is determined by the last snapshot together with the set of transactions from the transaction log.
  • Backup databases are used for disaster recovery and to initialize the database of a newly replaced replica without placing a load on other replicas.
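The recovery rule above (complete state = last snapshot + replay of the transactions logged after it) can be sketched as:

```python
def recover_state(snapshot, log_entries):
    """Rebuild database state from the last snapshot plus an in-order
    replay of the write-ahead log recorded after it. Illustrative only:
    real Chubby snapshots go to a GFS server in a different building,
    and entries here are simplified to (key, value) writes."""
    state = dict(snapshot)           # start from the snapshot copy
    for key, value in log_entries:   # replay transactions in log order
        state[key] = value
    return state
```

Deleting the pre-snapshot portion of the log is safe precisely because this replay only needs entries newer than the snapshot.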

Mirroring

  • Mirroring is a technique that allows a system to automatically maintain multiple copies. Chubby allows a collection of files to be mirrored from one cell to another.
  • Mirroring is fast because the files are small.
  • A special “global” cell maintains a subtree /ls/global/master that is mirrored to the subtree /ls/cell/replica in every other Chubby cell. Mirrored files include:
  • Various files in which Chubby cells and other systems advertise their presence to monitoring services.
  • Pointers to allow clients to locate large data sets such as Bigtable cells, and many configuration files for other systems.

Scaling Chubby

Agenda

  • Proxies
  • Partitioning
  • Learning

Chubby’s clients are individual processes, so Chubby handles more clients than expected. At Google, 90,000+ clients communicate with a single Chubby server.

Techniques used to reduce communication with the master(since read heavy):

  • Minimize request rate by creating more Chubby cells so that clients almost always use a nearby cell (found via DNS) and avoid reliance on remote machines.
  • Minimize KeepAlives Load: KeepAlives are by far the dominant types of request.
  • Caching: Clients cache file data, metadata, handles, locks etc.
  • Simplified protocol conversions:

Proxies

  • A proxy is an additional server that can act on behalf of the actual server.
  • A Chubby proxy can handle KeepAlives and read requests.
  • All writes and first-time reads pass through the proxy’s cache to reach the master.
  • The proxy is responsible for invalidating clients’ caches as well.

Partitioning

  • Need to support 100K clients. How would chubby do that?
  • Chubby’s interface (files & directories) was designed such that namespaces can easily be partitioned between multiple Chubby cells if needed.
  • Chubby can partition nodes within a large directory(with lots of sub-directories).
  • Scenarios in which partitioning does not help scale:

Learning

  • Lack of aggressive caching: Initially, clients were not caching the absence of files or open file handles. An abusive client could write loops that retry indefinitely when a file is not present or poll a file by opening it and closing it repeatedly when one might expect they would open the file just once. Chubby educated its users to make use of aggressive caching for such scenarios.
  • Lack of quotas: Chubby was never intended to be used as a storage system for large amounts of data, so it has no storage quotas. In hindsight, this was naive. To handle this, Chubby later introduced a limit on file size (256kBytes).
  • Publish/subscribe: There have been several attempts to use Chubby’s event mechanism as a publish/subscribe system. Chubby is a strongly consistent system, and the way it maintains a consistent cache makes it a slow and inefficient choice for publish/subscribe. Chubby developers caught and stopped such uses early on.
  • Developers rarely consider availability: Developers generally fail to think about failure probabilities and wrongly assume that Chubby will always be available. Chubby educated its clients to plan for short Chubby outages so that it has little or no effect on their applications.

Chubby as a Name Service?

  • The authors were surprised to find that Chubby was most popular as a name service (a DNS replacement).

  • It is hard to pick a good TTL value: DNS caches entries for the TTL duration and may serve stale values for some time (up to 60 seconds).

  • Chubby, however, provides consistent reads via client-side cache invalidation.

  • E.g. when starting N processes where each process looks up every other process (via DNS), that’s N² DNS lookups.

  • Chubby sees a thundering herd of reads at client startup (nothing is cached yet).

Summary:

  • Distributed Lock Service used inside Google.

  • Provides coarse-grained locking (held for minutes, hours, or days); not recommended for fine-grained locking (seconds or less). Suited to read-heavy rather than write-heavy workloads, although you can build a fine-grained locking system on top of Chubby.

  • A Chubby cell is a Chubby Cluster(usually with 3 or 5 replicas).

  • Using Paxos, one replica in a Cell is chosen as master which handles all read/write requests. If the master fails, a fail-over is performed.

  • Each replica has a local database, for files/directories/locks etc. Master writes directly to its own database, which gets asynchronously replicated for Fault Tolerance.

  • Clients use a Chubby Library to communicate with Servers using RPC.

  • Chubby’s interface is Unix-like and file-system based: a tree of files and directories (directories can contain other sub-directories and files).

  • Locks: Each node(file/directory) can act as an advisory reader(shared)-writer(exclusive) lock.

  • Ephemeral nodes indicate to others that a client is alive.

  • Metadata includes ACLs, four monotonically increasing 64-bit numbers, and a checksum.

  • An events mechanism runs between the Chubby client and server, and between the client and the application, for a variety of events like lock acquired, file modified, jeopardy, safe, etc.

  • Client Caching to reduce read traffic. Need consistency b/w File, Replica, and its client cache. Client cache invalidation using KeepAlive request/responses.

  • Clients maintain Sessions using KeepAlive RPCs.

  • Backup Snapshot of database(Write-Ahead Log) to a GFS file server to different buildings.

  • Mirroring: Collection of files synced from one cell to another.

System Design Patterns:

  • Write-Ahead Log: For fault tolerance, to handle a master crash, all database transactions are stored in a transaction log (on a local drive or on distributed GFS?).

  • Quorum: To ensure strong consistency. Master gets write ack from N replicas before responding back to client about write success.

  • Generation Clock: Newly elected master uses Epoch number(monotonically increasing) to avoid split brain.

  • Lease: The Chubby client maintains a time-bound session lease with the master.

References:

  • Chubby Paper

  • Chubby Architecture video

  • Chubby vs ZooKeeper

  • Hierarchical Chubby

  • BigTable

  • GFS

  • Jordan Deep Dive


Paper Link: https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf


Last updated: March 15, 2026

Questions or discussion? Email me

Kafka
https://www.sethihemant.com/notes/kafka-2011/
Wed, 04 Dec 2024

Paper: Kafka (https://notes.stephenholiday.com/Kafka.pdf)


Kafka/Distributed Messaging System

Goal

Design a distributed messaging system that can reliably transfer a high throughput of messages between different entities.

Background

  • One common challenge in distributed systems is handling a continuous influx of data from multiple sources.
  • E.g. imagine a log aggregation service that receives hundreds of log entries per second from different sources. Its function is to store these logs on disk at a shared server and build an index on top of them so that they can be searched later.
  • Challenges of this service?
  • Distributed Messaging Systems(or Asynchronous processing paradigm) can help.

What is a messaging System?

  • A system responsible for transferring data among various disparate systems like apps, services, processes, servers, etc., w/o introducing additional coupling b/w producers and consumers, and providing an asynchronous way of communicating b/w sender and receiver.
  • Two types of Messaging Systems

Queue

  • A particular message can be consumed by one consumer only.
  • Once a message is consumed, it’s removed from the queue.
  • This limits the system, as the same message can’t be read by multiple consumers.

Publish-Subscribe Messaging System

  • In the Pub-Sub model, messages are written into Partitions/Topics.

  • Producers write the messages to topics that get persisted in the messaging system.

  • Subscribers subscribe to those topics to receive each message that was published.

  • Pub-Sub model allows multiple consumers to read the same message.

  • Messaging system that stores and handles messages is called a Broker.

  • Provides a loose coupling b/w producers and consumers so they don’t need to be synchronized. They can read and write messages at different rates.

  • Also provides fault-tolerance. Messages don’t get lost.

  • A messaging system can be deployed for various reasons:

Kafka

Agenda

  • What is Kafka
  • Background
  • Kafka Use Cases

What is Kafka?

  • Open source pub-sub messaging system
  • Can work as a message-queue as well.
  • Distributed, Fault tolerant, highly scalable by design.
  • Fundamentally, a system that takes streams of messages from producers, stores them reliably on a central cluster (a set of brokers), and delivers those messages to consumers.

Background

  • Created at LinkedIn in 2010 to track Page Views(events), Messages from Messaging Systems, and Logs from Various services.

  • Kafka is also known as a Distributed Commit log or Write Ahead Log or a Transaction Log.

  • Commit Log is an append-only data structure that can persistently store a sequence of records.

  • Records are always appended to the end of the log, and once added, records cannot be deleted or modified. Reading from a commit log always happens from left to right (or old to new).

  • Stores all messages on disk and reads and writes take advantage of sequential disk reads/writes.

Kafka Use Cases

  • Can be used to collect huge amounts of events (Big Data) and do real-time stream processing of those events.
  • Metrics: Can collect and aggregate monitoring data. Different services can write their metrics which can be later pulled from Kafka to produce aggregate statistics.
  • Log Aggregation: Collect logs from various sources, and make them available in standard format to multiple consumers.
  • Stream Processing: Cases where data undergoes transformation after reading. E.g. Raw data consumed from the topic is transformed, enriched, aggregated, and pushed to a new topic for further consumption. Sort of creating a derived view of data from the source of record data.
  • Commit Log: Can be used as an external commit log for distributed systems which can keep track of their states.
  • Website Activity Tracking: One of the original use cases was to build a user-activity tracking pipeline. Events like page clicks and searches are published to separate topics. These topics are made available for later processing, like loading data into Hadoop (for batch processing) or data-warehousing systems for analytics and reporting. They can also feed product-suggestion or recommendation systems that power “similar products you may like” or “people also bought” features.

High Level Architecture

Agenda

  • Kafka Common Terms
  • High-Level Architecture

Kafka Common Terms

  • Brokers

  • Records

  • Topics

  • Producers

  • Consumers

  • In Kafka, producers and consumers are fully decoupled and agnostic of each other, which is a key design element to achieve the high scalability that Kafka is known for. For example, producers never need to wait for consumers.

High Level Architecture

Kafka Cluster

  • Kafka is run as a cluster of one or more servers, where each server is responsible for running one Kafka broker.

ZooKeeper

  • ZooKeeper is a highly read optimized distributed key-value store and is used for coordination and storing configurations.
  • In the original version of Kafka, ZooKeeper was used to coordinate between Kafka brokers; ZooKeeper maintains metadata information about the Kafka cluster.

Kafka Deep Dive

Related Notes: Alex XU II

Agenda

  • Topic Partitions
  • High-Water Mark

Kafka is simply a collection of topics. As topics can get quite big, they are split into smaller partitions for better performance and scalability.

Topic Partitions

  • Kafka Topics are partitioned, and these partitions are placed on separate nodes/brokers.

  • When a new message is published to a topic, it gets appended to one of the topic’s partitions, usually decided using the producer-specified partition key.

  • A partition is an ordered sequence of messages.

  • Kafka Guarantees FIFO Ordering between messages of a single partition. No ordering guarantees across partitions or at a topic level.

  • A Unique Sequence ID called a Partition Offset gets assigned to every message added to a partition. Used to identify a message’s sequential position within a partition.

  • Offset sequences are unique to a single partition. Messages are uniquely located using (Topic, Partition, Offset).

  • Producers can choose to publish messages to any partition. If Ordering within a partition is not needed, a Round-Robin strategy can be used for evenly partitioning data across nodes.

  • Placing partitions on separate brokers allows multiple consumers to read from a topic in parallel, i.e. different consumers can concurrently read different partitions on separate brokers. However, within a consumer group, only one consumer can read from a given partition at any time.

  • Messages once written to a partition are immutable(Append Only Log).

  • Producer specifies a Partition Key, to any message that it publishes so that data is written to the same partition.

  • Each broker can manage a set of partitions from across various topics.

  • Follows the principle of Dumb Broker and Smart Consumer.

  • Kafka doesn’t keep a record of what records are read by the consumer. Consumers poll kafka for new messages and specify which records(specified by partition Offset) they want to read from the topic.

  • Consumers are allowed to increment/decrement the offset to replay and reprocess the messages.

  • Each Topic partition has one leader broker and multiple replica(followers) brokers.
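The partition choice described above (hash the key for per-key ordering, round-robin when no key is given) might look like this sketch. Kafka's default partitioner actually uses murmur2; crc32 here is just for illustration.

```python
import itertools
import zlib

_round_robin = itertools.count()  # producer-local counter for keyless messages

def choose_partition(key, num_partitions):
    """Map a message to a partition (illustrative sketch, not Kafka's
    default partitioner, which hashes with murmur2)."""
    if key is not None:
        # Same key -> same partition, preserving per-key FIFO ordering.
        return zlib.crc32(key.encode()) % num_partitions
    # No key: round-robin to spread load evenly across partitions.
    return next(_round_robin) % num_partitions
```

A message is then uniquely addressed by (topic, partition, offset), with the offset assigned on append.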

Leaders and Followers

  • A leader is the node responsible for all reads and writes for the given partition. Every partition has one Kafka broker acting as a leader.
  • To handle Single Point of Failure and to enable Fault Tolerance, Kafka replicates partitions and distributes them across multiple brokers.
  • Each follower’s responsibility is to replicate the leader’s data to serve as a backup partition.
  • A follower can take over the leadership if the leader of a partition goes down.
  • Kafka stores the location of the leader of each partition in ZooKeeper
  • As all writes/reads happen at/from the leader, producers and consumers directly talk to ZooKeeper to find a partition leader.

In Sync Replicas(ISR)

  • An in-sync replica (ISR) is a broker that has the latest data for a given partition.
  • A follower is an in-sync replica only if it has fully caught up to the partition it is following.
  • Only ISRs are eligible to become partition leaders.
  • Kafka can choose the minimum number of ISRs required before the data becomes available for consumers to read.

High Water Mark

  • To ensure data consistency, the leader broker never returns (or exposes) messages which have not been replicated to a minimum set of ISRs.
  • Broker uses High Water Mark which is the highest offset that all ISRs of a particular partition share.
  • The leader exposes data only up to the high-water mark offset and propagates the high-water mark offset to all followers.
  • This avoids the case of a Non-Repeatable read in case the Leader crashes before Replicas get the latest messages.
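A minimal sketch of the high-water-mark rule, assuming we track each ISR's persisted log-end offset (names are illustrative, not broker internals):

```python
def high_water_mark(isr_offsets):
    """The high-water mark is the smallest log-end offset among the
    in-sync replicas, i.e. the highest offset every ISR has persisted."""
    return min(isr_offsets.values())

def readable_messages(log, isr_offsets):
    # The leader exposes messages only up to the high-water mark;
    # anything beyond it may vanish if the leader crashes now.
    return log[:high_water_mark(isr_offsets)]
```

In this model a consumer never sees a message that could be lost on leader fail-over, which is exactly the non-repeatable-read case the bullet above describes.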

Consumer Groups

Agenda

  • What is a Consumer Group?
  • Distributing Partitions to a consumer within Consumer Groups.

What is a Consumer Group?

  • A consumer group is basically a set of one or more consumers working together in parallel to consume messages from topic partitions.
  • No two consumers within the same Consumer group can attach to the same partition at a time. Thus no two consumers within CG receive the same message.

Distributing Partitions to a consumer within Consumer Groups.

  • Kafka ensures that only a single consumer reads messages from any partition within a consumer group

  • Topic partitions are a unit of parallelism

  • If a consumer stops, Kafka spreads partitions across the remaining consumers in the same consumer group

  • Every time a consumer is added to or removed from a group, the consumption is rebalanced within the group.

  • Parallelizing processing across multiple partitions of a topic, helps support very high Throughput.

  • Kafka stores the current offset per consumer group per topic per partition? What? Initially we said Kafka is DUMB and that Consumer tracks the offset? [Research]

  • Kafka uses any unused consumers as failovers when there are more consumers than partitions. Extra Consumers are idle in the meantime.

  • Rebalancing happens as Consumers are added and removed from the ConsumerGroups.
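A round-robin partition assignment within a consumer group can be sketched as follows. This is illustrative only: Kafka ships several assignment strategies (range, round-robin, sticky), and re-running the assignment with the surviving consumers models a rebalance.

```python
def assign_partitions(partitions, consumers):
    """Assign each partition to exactly one consumer in the group,
    round-robin. Extra consumers end up with no partitions and sit
    idle as failovers (illustrative sketch)."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment
```

If a consumer leaves, calling `assign_partitions` again with the remaining consumers redistributes its partitions, mirroring the rebalancing behavior described above.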

Kafka Workflow

Agenda

  • Kafka Workflow as Pub-Sub messaging
  • Kafka Workflow for Consumer Group

Kafka provides both pub-sub and queue-based messaging in a fast, reliable, persistent, fault-tolerant, zero-downtime manner. In both cases, producers simply send messages to a topic, and consumers choose whichever type of messaging system fits their need.

Kafka Workflow as Pub-Sub Messaging

  • Producer publishes a message to a topic.
  • Broker stores messages in the partitions configured for that topic. If no partition keys were specified, Broker spreads the messages evenly across partitions.
  • Consumer subscribes to a specific Topic. Broker provides the current offset of that Topic back to Consumer and saves that Offset to ZooKeeper.
  • Consumers will request Brokers at regular intervals for new messages and process it once kafka sends those messages.
  • Once the consumer processes the message, it sends an acknowledgement back to the broker. Broker Updates the processed offsets in the ZooKeeper.
  • Consumers can rewind/skip to the desired offset and read subsequent messages.

Role of Zookeeper

Agenda

  • What is ZooKeeper?
  • ZooKeeper as Central Coordinator.

What is ZooKeeper?

  • Distributed configuration and synchronization service.
  • Serves as the coordination interface between the Kafka brokers, producers, and consumers.
  • Kafka stores basic metadata in ZooKeeper, such as information about brokers, topics, partitions, partition leader/followers, consumer offsets.

ZooKeeper as the central coordinator(Might be Stale info)

  • Kafka brokers are stateless; they rely on ZooKeeper to maintain and coordinate brokers, such as notifying consumers and producers of the arrival of a new broker or failure of an existing broker, as well as routing all requests to partition leaders.
  • Stores all sorts of Metadata about the Kafka Cluster

How do producers or consumers find out who the leader of a partition is?

  • In the older versions of Kafka, all clients (i.e., producers and consumers) used to directly talk to ZooKeeper to find the partition leader.

  • Kafka has moved away from this coupling, and in Kafka’s latest releases, clients fetch metadata information from Kafka brokers directly;

  • All the critical information is stored in the ZooKeeper and ZooKeeper replicates this data across its cluster, therefore, failure of Kafka broker (or ZooKeeper itself) does not affect the state of the Kafka cluster.

  • Zookeeper is also responsible for coordinating the partition leader election between the Kafka brokers in case of leader failure.

Controller Broker

Agenda

  • What is a Controller Broker?
  • Split Brain.
  • Generation Clock.

What is a Controller Broker?

  • Within the Kafka cluster, one broker is elected as the Controller.
  • Controller broker is responsible for admin operations, such as creating/deleting a topic, adding partitions, assigning leaders to partitions, monitoring broker failures by doing health checks on other brokers.
  • Communicates the result of the partition leader election to other brokers in the system.

Split Brain

  • When a controller node dies, Kafka elects a new controller. One problem is that we cannot truly know whether the old controller has stopped for good (crash-stop) or has experienced an intermittent failure like a stop-the-world GC pause, a process pause, or a temporary network disruption.
  • Two split-brain controllers would give out conflicting commands in parallel. If this happens in a cluster, it can result in major inconsistencies. How do we handle this?

Generation Clock?

  • Split-brain is commonly solved with a generation clock, which is simply a monotonically increasing number to indicate a server’s generation.
  • In Kafka, the generation clock is implemented through an epoch number, Old leader = epoch 1, and new leader = epoch 2.
  • This epoch is included in every request that is sent from the Controller to other brokers.
  • Brokers can now easily differentiate the real Controller by simply trusting the Controller with the highest number.
  • This epoch number is stored in ZooKeeper.

Kafka Delivery Semantics?

Agenda

  • Producer Delivery Semantics
  • Consumer Delivery Semantics

Producer Delivery Semantics

  • A producer writes only to the leader broker, and the followers asynchronously replicate the data.
  • How can a producer know that the data is successfully stored at the leader or that the followers are keeping up with the leader?
  • Kafka offers three options to denote the number of brokers that must receive the record before the producer considers the write as successful:
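These three options are the producer's `acks` settings (0, 1, and all). Their success criteria can be sketched as follows; this is illustrative semantics, not broker code, and `min_isr` models the `min.insync.replicas` setting.

```python
def write_succeeded(acks, leader_persisted, num_isr_acked, min_isr):
    """Producer-side success criterion for the three `acks` settings:
      acks=0     -> fire-and-forget; success assumed immediately
      acks=1     -> the leader has persisted the record
      acks='all' -> the leader plus enough in-sync replicas (at least
                    min.insync.replicas) have persisted it"""
    if acks == 0:
        return True
    if acks == 1:
        return leader_persisted
    return leader_persisted and num_isr_acked >= min_isr  # acks='all'
```

Higher `acks` trades latency for durability: `acks=0` can silently lose records, while `acks='all'` survives leader failure as long as one acked ISR remains.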

Consumer Delivery Semantics

  • A consumer can read only those messages that have been written to a set of in-sync replicas(High Water Mark).
  • There are three ways of providing consistency to the consumer:

Kafka Characteristics

Agenda

  • Storing messages to disks
  • Record Retention in Kafka
  • Client Quota
  • Kafka Performance

Storing messages to disks

  • Kafka writes its messages to the local disk and does not keep anything in RAM. Disk storage is important for durability so that the messages will not disappear if the system dies and restarts.
  • Even though disk access is generally considered to be slow, there is a huge performance difference b/w Random Block Access and Sequential Access.
  • Random block access is slower because of numerous disk seeks, whereas the sequential nature of writing or reading, enables disk operations to be thousands of times faster than random access.
  • Because all writes and reads happen sequentially, Kafka has a very high throughput.
  • Writing or reading sequentially from disks are heavily optimized by the OS, via read-ahead (prefetch large block multiples) and write-behind (group small logical writes into big physical writes) techniques.
  • Also, modern operating systems cache the disk in free RAM. This is called Pagecache.
  • Since Kafka stores messages in a standardized binary format unmodified throughout the whole flow (producer → broker → consumer), it can make use of the zero-copy optimization.
  • Kafka has a protocol that groups messages together. This allows network requests to group messages together and reduces network overhead.

Record Retention in Kafka

  • By default, Kafka retains records until it runs out of disk space. We can set time-based limits (configurable retention period), size-based limits (configurable based on size), or compaction (keeps the latest version of record using the key).
  • For example, we can set a retention policy of three days, or two weeks, or a month, etc.
  • The records in the topic are available for consumption until discarded by time, size, or compaction.

Client Quota

  • Heavy Hitters(Noisy Neighbours) can exhaust broker resources, or can cause network saturation to multi-tenant kafka clusters, which can deny service to other clients and broker themselves.
  • In Kafka, quotas are byte-rate thresholds defined per client-ID(application).
  • The broker does not return an error when a client exceeds its quota but instead attempts to slow the client down by holding the client’s response for enough time to keep the client under the quota.
  • This also prevents clients from having to implement special back-off and retry behavior.
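The throttling idea can be sketched as computing how long to hold the response so the client's observed rate falls back to its quota (illustrative arithmetic, not the broker's actual accounting):

```python
def throttle_delay(observed_bytes, window_seconds, quota_bytes_per_sec):
    """How long to hold a client's response so that its byte rate,
    measured over the window plus the delay, drops to the quota."""
    allowed = quota_bytes_per_sec * window_seconds
    if observed_bytes <= allowed:
        return 0.0  # under quota: respond immediately
    # Solve observed_bytes / (window + delay) == quota for delay.
    return observed_bytes / quota_bytes_per_sec - window_seconds
```

Because the broker delays rather than errors, well-behaved clients need no special back-off logic, as the bullet above notes.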

Kafka Performance

  • Scalability
  • Fault Tolerance and Reliability
  • Throughput
  • Low Latency?

System Design Pattern:

  • High Water Mark - To deal with Non-Repeatable reads and data consistency.

  • Leader and Follower - Leader serves read/writes. Followers do replication.

  • Split-Brain - Multiple Controller nodes active at a time(due to Zombie Controller). Generational Epoch number to resolve.

  • Segmented Log - Log segmentation to implement storage for its partitions.

References:

  • Confluent Docs

  • NYTimes usecase

  • Kafka Summit 2019

  • Kafka Acks explained(TODO)

  • Kafka as distributed log

  • Minimizing Kafka Latency(TODO)

  • Kafka Internal Storage(TODO)

  • Exactly once semantics(TODO)

  • Split Brain(TODO)

Open Questions:

  • Kafka stores the current offset per consumer group per topic per partition? What? Initially we said Kafka is DUMB and that Consumer tracks the offset? [Research]

  • In At-most-once consumer delivery semantics, Why can’t the consumer read from the previous offset? Why are messages said to be lost?[Research]

  • Exactly once semantics? How would transactions happen across 2 systems(consumer processing + Kafka Offset Commit). How are they suggesting the transaction would be rolled back?

  • Zero Copy Optimization

  • Page Cache optimization Kafka

  • How does replication internal work b/w leader follower?

  • Tombstoning in Kafka.

1️⃣ Zero Copy Optimizations & Page Cache in Kafka and Other Systems

📌 What is Zero Copy?

Zero Copy is a kernel-level optimization that allows data to be transferred between disk and network without passing through user-space memory, reducing CPU overhead and increasing throughput.

🚀 Why is Zero Copy important?

  • Reduces CPU usage (since data isn’t copied multiple times).
  • Minimizes context switches (between user and kernel space).
  • Improves I/O throughput (as memory copying is avoided).

📌 How Kafka Uses Zero Copy (Sendfile Optimization)

Kafka uses Zero Copy via the sendfile system call in Linux.

🔹 Without Zero Copy (Traditional Path)

  1. Kafka reads a log file from disk → (Disk → Kernel Space).

  2. The kernel copies data to Kafka’s user-space buffer → (Kernel Space → User Space).

  3. Kafka writes the buffer to a network socket → (User Space → Kernel Space → Network).

  4. The kernel sends data over the network.

🔹 With Zero Copy (Optimized Path)

  1. Kafka calls sendfile() → Kernel directly transfers a log file to the network socket.

  2. No user-space buffer required → Data goes directly from disk to network.

✔ Avoids unnecessary copies in user space. Greatly improves throughput (Kafka can achieve millions of messages per second).
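On Linux, the same optimization is accessible from Python via os.sendfile. A minimal sketch (not Kafka's own code, which uses Java's FileChannel.transferTo):

```python
import os
import socket

def serve_file_zero_copy(path, conn):
    """Send a file to a connected socket without copying the bytes
    through user space, using the sendfile(2) syscall (Linux;
    illustrative sketch of the optimization Kafka relies on)."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            # The kernel moves bytes from the page cache straight to the socket.
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent
    return offset
```

The data never enters a user-space buffer, which is why the optimized path above has only one kernel-side transfer instead of four copies.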

📌 Zero Copy Optimizations in Other Systems

[Table content - requires manual formatting]

📌 What is Page Cache and How Kafka Optimizes It?

Kafka doesn’t need a traditional database cache. Instead, it relies on the OS page cache for fast reads.

Page Cache: The Linux kernel automatically caches recently read disk pages in memory.

Kafka uses the Page Cache to serve reads directly from memory without hitting disk.

🔹 How Page Cache Works in Kafka:

  1. When a consumer reads a message, Kafka first checks the OS page cache.
  2. If the data is cached, it is served directly from memory (zero disk I/O).
  3. If the data isn’t in cache, Kafka reads it from disk, and the OS automatically caches it.

🚀 Optimizations in Kafka:

  • Uses sendfile() to directly transfer from Page Cache to network.
  • Leverages sequential disk access (append-only logs) for high read efficiency.
  • Minimizes JVM heap memory usage by relying on OS caching.

2️⃣ How Replication Works Between Leaders and Followers in Kafka

Kafka ensures fault tolerance and high availability using replication.

📌 Basics of Kafka Replication

✔ Each Kafka topic is partitioned, and each partition has:

  • One Leader (handles all reads & writes).
  • One or more Followers (replicas of the leader’s data).

📌 Steps in Kafka Replication

1️⃣ Producer writes data to the Leader Partition.

2️⃣ Leader appends data to its local log segment.

3️⃣ Followers fetch new data from the leader.

4️⃣ Followers append data to their own log segment.

5️⃣ Followers send an acknowledgment (ACK) once they persist the data.

6️⃣ Once every replica in the ISR has replicated the message, Kafka considers it committed (the high watermark advances).
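The commit rule above can be sketched as a high-watermark computation, assuming each replica reports its log-end offset (names are illustrative):

```python
def high_watermark(leader_log_end: int, follower_log_ends: dict) -> int:
    """The committed offset (high watermark) can only advance to the
    smallest log-end offset across the leader and its in-sync followers:
    a message counts as committed once every ISR member has it."""
    return min([leader_log_end, *follower_log_ends.values()])
```

For example, with the leader at offset 10 and followers at 8 and 9, only messages up to offset 8 are committed and visible to consumers.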

📌 Leader and Follower Sync Mechanism

Kafka uses a pull-based replication model → Followers poll the leader to fetch new data.

Offset tracking: Followers maintain an offset to track the latest committed message.

ISR (In-Sync Replicas): Only replicas in sync with the leader are part of the ISR.

📌 How a New Leader is Elected?

✔ If the Leader fails, one of the ISR replicas is promoted.

✔ The new Leader starts serving read and write requests.

✔ If no ISR exists, the partition becomes temporarily unavailable until a new Leader is available.

📌 Replication Strategies

[Table content - requires manual formatting]

🚀 Tuning Replication Settings for Performance

min.insync.replicas = 2 → Ensures durability (at least two replicas must ACK).

unclean.leader.election.enable = false → Prevents unsafe leader elections (data loss risk).

replica.lag.time.max.ms = 10000 → Defines when a slow follower is removed from the ISR.

3️⃣ Tombstoning in Kafka

Kafka Tombstoning is used for deleting records in log-compacted topics.

📌 Why is Tombstoning Needed?

✔ Kafka never deletes data immediately.

✔ Instead, Kafka marks the record as deleted (tombstone message).

✔ The actual data is removed later during log compaction.

📌 How Tombstoning Works

  1. Producer sends a null value for a key (marks it as deleted).
  2. Kafka appends this tombstone message to the log.
  3. The consumer sees the tombstone event and removes the record from its own storage.
  4. Kafka’s log compaction eventually purges the tombstone message and the original record.

📌 Example of Tombstone Message

{
  "key": "user_123",
  "value": null,
  "timestamp": 1700000000
}


✔ This soft deletes “user_123”.

Log compaction later removes both the original record and the tombstone.

📌 How Log Compaction Works

  • Log compaction keeps only the latest value for each key.
  • Tombstones stay in the log until Kafka compacts the segment.
  • Kafka guarantees at least one copy of the latest record is retained (even after compaction). 🚀 Tuning Log Compactionlog.cleanup.policy = compact → Enables log compaction.

delete.retention.ms = 86400000 → Keeps tombstones for 24 hours before purging.

log.segment.bytes → Sets the segment size; only closed segments are eligible for compaction.
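The compaction semantics above can be sketched in a few lines: keep only the latest value per key, then purge tombstones once their retention (delete.retention.ms in real Kafka) has elapsed, modeled here by a simple flag:

```python
def compact(log, purge_tombstones=True):
    """Naive log-compaction sketch: scan the log in append order so the
    latest value per key wins; tombstones (None values) survive until
    their retention elapses, modeled by the purge_tombstones flag."""
    latest = {}
    for key, value in log:            # log is a list of (key, value)
        latest[key] = value
    if purge_tombstones:
        latest = {k: v for k, v in latest.items() if v is not None}
    return latest
```

Keeping tombstones for a retention window gives consumers time to observe the delete before it disappears from the log.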

🔹 Summary & Key Takeaways

Zero Copy & Page Cache

Kafka uses sendfile() for Zero Copy, avoiding unnecessary memory copies. The Page Cache stores recent messages, reducing disk I/O.

Replication Between Leaders and Followers

Kafka uses asynchronous, pull-based replication for performance. ISR (In-Sync Replicas) ensures durability. New leaders are elected from the ISR in case of failure.

Tombstoning & Log Compaction

Kafka uses tombstones (null values) for soft deletes. Log compaction removes older records but keeps the latest one per key.


Paper Link: https://notes.stephenholiday.com/Kafka.pdf


Last updated: March 15, 2026

Questions or discussion? Email me

]]>
Cassandra https://www.sethihemant.com/notes/cassandra-2009/ Tue, 03 Dec 2024 00:00:00 +0000

Paper: Cassandra (https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf)


Cassandra / Distributed Wide Column NoSQL Database

Goal

Design a distributed and scalable system that can store a huge amount of semi-structured data, which is indexed by a row key where each row can have an unbounded number of columns.

Background

  • Open source Apache Project developed at FB in 2007 for Inbox Search feature.
  • Designed to provide Scalability, Availability, Reliability to store large amounts of data.
  • Combines distributed nature of Amazon’s Dynamo(K-V store) and DataModel for Google’s BigTable which is a Column based store.
  • Decentralized architecture with no Single Point of Failure(SPOF), Performance can scale linearly with addition of nodes.

What is Cassandra?

  • Cassandra is typically classified as an AP (i.e., Available and Partition Tolerant) system which means that availability and partition tolerance are generally considered more important than the consistency. Eventually Consistent
  • Similar to Dynamo, Cassandra can be tuned with replication-factor and consistency levels to meet strong consistency requirements, but this comes with a performance cost.
  • Uses peer-to-peer architecture where each node communicates to all other nodes.

Cassandra Use Cases

  • Any application where eventual consistency is not a concern can utilize Cassandra.
  • Cassandra is optimized for high throughput writes.
  • Can be used for collecting big data for performing real-time analysis.
  • Storing key-value data with high availability(Reddit/Dig) because of linear scaling w/o downtime.
  • Time Series Data Model
  • Write Heavy Applications
  • NoSQL

High Level Architecture

Agenda

  • Cassandra Common Terms
  • High Level Architecture

Cassandra Common Terms

  • Column: A Key-Value pair. Most basic unit of data structure in Cassandra.

  • Row: Container for columns referenced by the primary key.

  • Table: Container of rows.

  • KeySpace: Container for tables that span over one or more cassandra nodes.

  • Cluster : Container of KeySpaces.

  • Node: Computer system running cassandra instance. Physical Host, or VM, or even Docker container.

Data Partitioning

  • Cassandra uses Consistent Hashing similar to Dynamo.

Cassandra Keys

  • Mechanisms used by Cassandra to uniquely identify the rows.
  • Primary Key uniquely identifies each row of a table.
  • Primary Key = Partition Key + Clustering Key

Clustering Keys

  • Clustering keys define how the data is stored within a node. Can have multiple clustering keys.

Partitioner

  • Component which is responsible for determining how the data is distributed on the consistent hashing ring.

  • When Cassandra inserts data, the partitioner applies a hashing algorithm to the partition key to determine which range (and the corresponding node) the data falls in.

  • Cassandra uses the Murmur3 hashing function (default).

  • In Cassandra’s default configuration, a token is a 64-bit integer, giving a token range of [-2^63, 2^63 - 1]. How does it differ from Dynamo?

  • All nodes learn about token assignment of other nodes through Gossip.
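The partitioner’s token-to-node lookup can be sketched as a binary search over the sorted ring. Small integer tokens stand in for Murmur3 output, and node names are hypothetical:

```python
import bisect

def owner(ring, token):
    """Consistent-hashing sketch: ring is a sorted list of (token, node)
    pairs; the owner of a key's token is the first node at or clockwise
    after it, wrapping around to the start of the ring."""
    tokens = [t for t, _ in ring]
    i = bisect.bisect_left(tokens, token) % len(ring)
    return ring[i][1]
```

A token past the largest ring token wraps around to the first node, which is what makes node addition/removal affect only neighboring ranges.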

Replication

Agenda

  • Replication Factor
  • Replication Strategy.
  • Each node in Cassandra serves as a replica for a different range of data. Replication factor decides how many replicas the system would have, which is the number of nodes that will receive the copy of the same data.
  • The node that owns the range in which hash of the partition key falls is the first replica. All additional replicas are placed on the consecutive nodes in a clockwise manner.
  • Simple Replication Strategy
  • Network Topology Strategy
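The placement rule above (owner first, then consecutive clockwise nodes) is essentially SimpleStrategy; a sketch with illustrative node names:

```python
import bisect

def place_replicas(ring, token, rf):
    """SimpleStrategy sketch: the node owning the token's range is the
    first replica; the remaining rf - 1 replicas go to the next nodes
    clockwise around the ring (rack/DC awareness omitted)."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    return [ring[(start + k) % len(ring)][1] for k in range(min(rf, len(ring)))]
```

NetworkTopologyStrategy refines this by skipping nodes so replicas land in distinct racks/data centers.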

Cassandra Consistency Levels

Agenda

  • Cassandra Consistency Levels
  • Write Consistency Levels
  • Read consistency level
  • Snitch

Cassandra Consistency Levels

  • Minimum number of Cassandra nodes that must fulfill a read or write operation before the operation can be considered successful.
  • Has Tunable Consistency levels for reads and writes.
  • Tradeoff b/w Consistency and performance.

Write Consistency Levels

  • One or Two or Three: Success acknowledgement from the specified number of nodes.
  • Quorum: Data must be written to at least the Majority Quorum of nodes.
  • All: Data is written to all nodes.
  • Local Quorum: Data is written to the Quorum of nodes in the same data center as the coordinator. Don’t wait for responses from other Data Centers.
  • Each Quorum: Data written to the Quorum of nodes in each data center.
  • Any: Data written to at least one node.
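The levels above can be summarized as an ack-count mapping; a sketch assuming a single data center (LOCAL_QUORUM and EACH_QUORUM would use the per-DC replication factor instead):

```python
def required_acks(level: str, rf: int) -> int:
    """Sketch of how many replica acknowledgements each write
    consistency level demands, given replication factor rf."""
    levels = {
        "ONE": 1, "TWO": 2, "THREE": 3,
        "QUORUM": rf // 2 + 1,   # majority of replicas
        "ALL": rf,
        "ANY": 1,                # any node, possibly just a stored hint
    }
    return levels[level]
```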
  • Performing Write Operation? Hinted Handoff?

  • When the node where the data was supposed to be written for Quorum was down comes online again, how should we write data to it? Cassandra accomplishes this through a Hinted handoff.
  • [FAILURE MODE] When a node is down or does not respond to a write request, the coordinator node writes a hint in a text file on the local disk. This hint contains the data itself along with information about which node the data belongs to. When the coordinator node discovers(using Gossip Protocol) that a node for which it holds hints has recovered, it forwards the write requests for each hint to the target. Furthermore, each node every ten minutes checks to see if the failing node, for which it is holding any hints, has recovered.
  • [FAILURE MODE] If a node is offline for some time, the hints can build up considerably on other nodes. Now, when the failed node comes back online, other nodes tend to flood that node with write requests. This can cause issues on the node, as it is already trying to come back after a failure.
  • Cassandra by default stores hints for 3 hours. After 3 hours, older hints are removed , if the failed node comes back up(hinted handoff won’t happen), and the node would contain stale data. Stale data can be fixed by Read-Repair(Read Path)
  • When the cluster cannot meet the client’s consistency level, cassandra fails the write request, and doesn’t store a hint.
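The hinted-handoff lifecycle described above (store a hint per missed write, replay on recovery, expire after the TTL) can be sketched as a small store; class and parameter names are illustrative, and the 3-hour default TTL is from the notes above:

```python
class HintStore:
    """Hinted-handoff sketch: the coordinator records a hint for each
    write a down replica missed and replays hints when the target
    recovers; hints older than the TTL (3 hours by default in
    Cassandra) are dropped, leaving the node stale until read repair."""
    def __init__(self, ttl_seconds=3 * 3600):
        self.ttl = ttl_seconds
        self.hints = []                       # (stored_at, node, write)

    def store(self, node, write, now):
        self.hints.append((now, node, write))

    def replay(self, node, now):
        keep, replayed = [], []
        for ts, n, w in self.hints:
            if now - ts > self.ttl:
                continue                      # expired: silently dropped
            (replayed if n == node else keep).append((ts, n, w))
        self.hints = keep
        return [w for _, _, w in replayed]
```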

Read Consistency Levels

  • Specifies how many replica nodes must respond to a read request before returning the data.
  • Same levels as Write operations except(Each Quorum) because Expensive.
  • R + W > Replication Factor can give Strong consistency levels in Cassandra?[Research]
  • Cassandra uses Snitch, an application that determines the proximity of nodes within the ring and also tells which nodes are faster and cassandra uses this to route read/write requests.
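The R + W > RF question above is the quorum-overlap rule from the Dynamo paper, expressed as a one-liner:

```python
def overlap_guaranteed(r: int, w: int, rf: int) -> bool:
    """Quorum-overlap rule: if R + W > RF, every read quorum intersects
    every write quorum, so a read contacts at least one replica holding
    the latest acknowledged write. (Sloppy quorums / hinted handoff can
    weaken this in practice — one of the open questions in these notes.)"""
    return r + w > rf
```

For example, with RF = 3, R = 2 and W = 2 overlap, while R = 1 and W = 1 do not.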

How does Cassandra perform a Read Operation?

  • Coordinator sends the read request to the fastest node(using Snitch).

  • E.g. Quorum R = 2, sends the request to the fastest node, and digest of the data from the second fastest node.

  • If the digest does not match, it means some replicas do not have the latest version of the data. In this case, the coordinator reads the data from all the replicas to determine the latest data.

  • The coordinator then returns the latest data to the client and initiates a read repair request.

  • The latest write-timestamp is used as a marker for the correct version of data[Research?] in Cassandra? Conflict resolution? Last write wins or Vector Clocks? Data Loss?

  • The read repair operation is performed only in a portion of the total reads to avoid performance degradation.

  • By default, Cassandra tries to read-repair 10% of all requests with DC local read repair.
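The digest-read path above can be sketched for R = 2; the digest function is a stand-in (Cassandra hashes the serialized row), and the actual write-back to stale replicas is omitted:

```python
import hashlib

def digest(value) -> str:
    # Stand-in digest; real Cassandra hashes the serialized row data.
    return hashlib.md5(repr(value).encode()).hexdigest()

def quorum_read(replica_responses):
    """Digest-read sketch for R = 2: full data from the fastest replica,
    a digest from the next; on mismatch, consult all replicas and return
    the value with the latest write timestamp (last-write-wins), which
    would then be written back to stale replicas (read repair, omitted).
    replica_responses is a list of (value, write_timestamp)."""
    (v1, _), (v2, _) = replica_responses[0], replica_responses[1]
    if digest(v1) == digest(v2):
        return v1
    return max(replica_responses, key=lambda vt: vt[1])[0]
```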

Snitch

  • Snitch keeps track of network topology of Cassandra nodes. It determines which data center and racks nodes belong to and uses this info to route requests efficiently.
  • Functions of Snitch in Cassandra?

Gossiper

  • How does Cassandra use the Gossip protocol?
  • Node failure detection?

How does Cassandra use the Gossip Protocol?

  • Allows each node to keep track of state information about other nodes in the cluster.

  • Gossip protocol is a peer-to-peer communication mechanism in which nodes periodically exchange state information about themselves and other nodes they know about.

  • Each node initiates a gossip round every second to exchange state information about themselves (and other nodes) with one to three other random nodes.

  • Each Gossip message has a version associated with it, so that during gossip exchange, older information is overwritten with the most current state for a particular node.

  • Generation Number: Each node tracks a generation number which increments every time a node restarts.

  • Seed Nodes?

Node Failure Detection?

  • Accurately detecting failures is a hard problem to solve. We cannot say with 100% accuracy if a node is actually down or is just very slow to respond due to heavy load, network congestion, GC/process pauses etc.
  • Heart Beating(Boolean Failure detector, Yes or No) uses a fixed timeout, and if there is no heartbeat from a server, the system, after the timeout, assumes that the server has crashed. Here the value of the timeout is critical.
  • Cassandra uses an Adaptive failure detection mechanism, Phi Accrual Failure Detector
  • A generic Accrual Failure Detector, instead of telling if the server is alive or not, outputs the suspicion level about a server; a higher suspicion level means there are higher chances that the server is down.
  • Phi Accrual Failure Detector, if a node does not respond, its suspicion level is increased and could be declared dead later.
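A heavily simplified phi-accrual sketch, assuming exponentially distributed heartbeat inter-arrival times (the real detector fits a distribution to observed intervals):

```python
import math

def phi(time_since_last_heartbeat: float, mean_interval: float) -> float:
    """Simplified phi-accrual sketch: phi = -log10 P(silence this long)
    under an exponential inter-arrival model. Suspicion grows
    continuously with silence instead of flipping a boolean at a
    fixed timeout, as a plain heartbeat detector would."""
    p_longer = math.exp(-time_since_last_heartbeat / mean_interval)
    return -math.log10(p_longer)
```

A node is declared dead once phi crosses a configurable threshold, which trades detection speed against false positives.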

Anatomy of Cassandra’s write operation

Agenda

  • CommitLog
  • MemTable
  • SSTable
  • Cassandra stores data both in-memory and on-disk to provide both high performance and durability. Every write includes a timestamp. The Write-Path involves a lot of components.
  • Cassandra’s Write path Summary:

Commit Log

  • When a node receives a write request, it immediately writes the data to a commit log.
  • The commit log is a write-ahead log and is stored on disk.
  • Used as a Crash-Recovery mechanism for Cassandra’s Durability goals.
  • A write on the node isn’t considered successful until it’s written to the commit log.

MemTable

  • After a Write is persisted to CommitLog, it is then written to the memory-resident data structure which is MemTable.
  • Each Cassandra node has an In-Memory MemTable for each Table; it mirrors the data of the Table it represents.
  • Accrues writes and provides reads for data not yet flushed to disk.
  • Commit Log stores all the writes in sequential Order(append only log) whereas MemTable stores data in sorted order of PartitionKey, and Clustering Columns.
  • After data is written to Commit-Log and MemTable, node sends success acknowledgement to the Coordinator.

SSTable(Sorted String Table)

  • When the number of objects stored in the MemTable reaches a Threshold, the contents of the MemTable are flushed to disk in a file called SSTable.
  • New MemTable is created to serve in-memory requests for subsequent data.
  • Flushing of MemTables is a Non-Blocking operation.
  • Multiple MemTables may exist for a single Table, one current, and others waiting to be flushed.
  • SSTable contains data for a specific Table.
  • When the MemTable is flushed to SStables, corresponding entries in the Commit Log are removed.
  • The Term SSTable first appeared in Google’s Bigtable which is also a storage system. Cassandra borrowed this term even though it does not store data as strings on the disk.
  • Once a MemTable is flushed to disk as an SSTable, it is immutable and cannot be changed by the application.
  • If we are not allowed to update SSTables, how do we delete or update a column?
  • The current data state of a Cassandra table consists of its MemTables in memory and SSTables on the disk.
  • On reads, Cassandra will first read MemTables, and then subsequently SSTables(if MemTables Does Not contain the key) to find data values, as the MemTable may still contain values that have not yet been flushed to the disk.
  • MemTable works as a WriteBack cache that Cassandra looks up by Key.
  • Generation Number: an Index number that is incremented every time a new SSTable is created for a Table. Uniquely identifies an SSTable.
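The CommitLog → MemTable → SSTable write path and the MemTable-first read path above can be sketched as a tiny store; thresholds and names are illustrative:

```python
class WriteStore:
    """Write-path sketch: append to a commit log for durability, then
    insert into an in-memory memtable; flush to an immutable, sorted
    SSTable when the memtable reaches a threshold. Reads check the
    memtable first, then SSTables newest-first."""
    def __init__(self, flush_threshold=3):
        self.commit_log = []       # append-only, in arrival order
        self.memtable = {}         # key -> value, sorted only on flush
        self.sstables = []         # list of immutable sorted tables
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()    # flushed entries no longer needed

    def read(self, key):
        if key in self.memtable:               # memtable acts as cache
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None
```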

Anatomy of Cassandra’s read operation

Agenda

  • Caching
  • Reading from MemTable
  • Reading from SSTable

Caching

  • To boost read performance, Cassandra provides 3 optional forms of caching:

Reading from MemTable

  • Data is sorted by the partition key and the clustering columns.
  • When a read request comes in, the node performs a binary search on the partition key to find the required partition and then returns the row.

Reading from SSTables

Bloom Filters

  • Each SStable has a Bloom filter associated with it, which tells(probabilistic) if a particular key is present in it or not for boosting read performance.
  • Bloom filters are very fast, non-deterministic algorithms for testing whether an element is a member of a set.
  • Bloom filters work by mapping the values in a data set into a bit array and condensing a larger data set into a digest string using a hash function.
  • The filters are stored in memory and are used to improve performance by reducing the need for disk access on key lookups since disk access is much slower.
  • Because false negatives are not possible: if the Bloom filter says a key is absent, Cassandra can skip that SSTable entirely; if it says present, the SSTable still has to be checked, since it may be a false positive.
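A minimal Bloom-filter sketch: k hash functions set k bits per key, so an added key always tests positive (no false negatives), while an absent key tests positive only if all its bits happen to collide:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom-filter sketch: k hash functions each set one bit
    per key. Lookups may return false positives but never false
    negatives, so a 'no' lets a reader skip the SSTable entirely."""
    def __init__(self, size=1024, k=3):
        self.size, self.k, self.bits = size, k, 0

    def _positions(self, key):
        for i in range(self.k):   # derive k hashes by salting one hash
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

With few keys relative to the bit-array size, absent keys almost always return False, which is what saves the disk seek.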

How are SSTables Stored on Disk?

  • Each SSTable Consists of Two Files:

  • Partition Index Summary File

  • If we want to read data for key=12, here are the steps we need to follow (also shown in the figure below):

Reading SSTable through Key Cache

  • As the Key Cache stores a map of recently read partition keys to their SSTable offsets, it is the fastest way to find the required row in the SSTable.

  • Summary of Read Operation:

Compaction

Agenda

  • How does compaction work in Cassandra?
  • Compaction Strategies?
  • Sequential Writes?

How does compaction work in Cassandra?

  • SSTables are immutable(Append Only Log), which helps Cassandra achieve such high write speeds.
  • Flushing of MemTable to SStable is a continuous process. This means we can have a large number of SStables lying on the disk. While reading, it is tedious to scan all these SStables. So, to improve the read performance, we need compaction.
  • Compaction refers to the operation of merging multiple related SSTables into a single new one.
  • During compaction, the data in SSTables is merged: the keys are merged, columns are combined, obsolete values are discarded, and a new index is created.
  • On compaction, the merged data is sorted, a new index is created over the sorted data, and this freshly merged, sorted, and indexed data is written to a single new SSTable.
  • Compaction will reduce the number of SSTables to consult and therefore improve read performance.
  • Compaction will also reclaim space taken by obsolete(Tombstoned or overwritten) data in SSTables.
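The merge step above can be sketched for two SSTables, assuming tombstones are represented as None values and the newer table shadows the older one:

```python
def compact_sstables(older, newer):
    """Compaction sketch merging two sorted SSTables (lists of (key,
    value)): the newer table's value wins for duplicate keys, expired
    tombstones (None) are purged, and the merged result is rewritten
    in sorted order as a single new SSTable."""
    merged = dict(older)
    merged.update(newer)          # newer values shadow older ones
    return sorted((k, v) for k, v in merged.items() if v is not None)
```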

Compaction Strategies

  • Size Tiered(Default, Write Optimized)
  • Levelled(Read Optimized)
  • Time Window(Time Series Optimized)

Sequential Writes

  • Sequential writes are the primary reason that writes perform so well in Cassandra.
  • No reads or seeks of any kind are required for writing a value to Cassandra because writes are append-only operations.
  • Write speed of the disk becomes a performance bottleneck.
  • Compaction amortizes the reorganization of data and uses sequential I/O to do so, which makes it efficient.
  • If Cassandra naively inserted values where they ultimately belonged, writing clients would pay for seeks upfront.

Tombstones

  • An interesting case with Cassandra can be when we delete some data for a node that is down or unreachable, that node could miss a delete. When that node comes back online later and a repair occurs, the node could “resurrect” the data that had been previously deleted by re-sharing it with other nodes.
  • To prevent deleted data from being reintroduced, Cassandra uses a concept called a tombstone Which is similar to a “soft delete” from the Relational databases world.
  • When we delete data, Cassandra does not delete it right away, instead associates a tombstone with it, with a time to expiry.
  • The purpose of this delay is to give a node that is unavailable time to recover.
  • Tombstones are removed as part of compaction. During compaction, any row with an expired tombstone will not be propagated further.

Common Problems associated with Tombstones?

  • Tombstones make Cassandra’s writes efficient because the data is not removed right away when deleted. Instead, it is removed later during compaction.

  • Problems?

  • Slower Reads.

Indexes?

  • Cassandra uses clustering keys to create indexes of data within a partition.

  • These are only local indexes, not global indexes.

  • If you have many clustering keys in order to achieve multiple different sort orders, Cassandra will de-normalize the data such that it keeps two copies of it.

Cassandra Pitfalls?

  • Lack of Strong Consistency even with Quorums(say Sloppy Quorum or hinted handoffs) which can create race conditions amongst concurrent writes.

  • Lack of ability to support data relationships(outside of sorting data within a partition)

  • Lack of Global Secondary Indexes if needed for Read Heavy applications where read cache may not work.

Summary

  • Cassandra is Distributed, Decentralized(Leaderless), Scalable, Highly Available, Eventually Consistent NoSQL datastore.
  • Was designed with Fault-Tolerance in mind(hardware/software failures can and do happen).
  • Peer-to-Peer(Gossip) distributed System, with no Leader/Follower nodes. All nodes are equal except some are tagged seed nodes, for bootstrapping gossip to the nodes added to the cluster.
  • Data is automatically Partitioned across nodes using Consistent Hashing as well as Replicated for Fault Tolerance and redundancy.
  • Combines Distributed Nature of Amazon’s Dynamo(Consistent Hashing, Replication, Partitioning), with DataModel of Google’s BigTable, i.e. SSTable/MemTable.
  • Offers Tunable Consistency(Default AP) but can be made strongly consistent(CP) but with performance implications.
  • Uses Gossip protocol for Inter-Node communication.
  • Supports Geographical Distribution of data across multiple clouds and data centers?

System Design Patterns Used?

  • Consistent Hashing : Data Partitioning
  • Quorum : Data Consistency
  • Write Ahead Log : Durability
  • Segmented Log: Splits its commit log into multiple smaller files instead of single large file for easier operation.
  • Gossip Protocol: Membership or Cluster State information, Failure Detection?
  • Phi Accrual Failure Detector: Adaptive Failure Detection using suspicion levels.
  • Bloom Filters: Check for partition Key presence in SSTable(Read Optimized).
  • Hinted Handoff: Sloppy Quorum?? and High Availability.
  • Read Repair: Fix Stale values on Read?

References:

  • DataStax Docs

  • Cassandra Tombstone issues

  • BigTable

  • Dynamo

  • PhiAccrual Failure Detector(Akka) Open Questions?

  • What is the Murmur3 hashing function? How does it compare to MD5? Why Murmur?

  • Why 64 bit Token range? How does that compare to Dynamo?

  • R + W > Replication Factor can give Strong consistency levels in Cassandra?[Research]

  • What happens if the coordinator node which wrote the Hint on the local disk crashes? How does the hinted handoff process complete? [Research]

  • The latest write-timestamp is used as a marker for the correct version of data[Research?] in Cassandra? Conflict resolution? Last write wins or Vector Clocks? Data Loss?

  • Phi Accrual Failure Detector?

  • Write Ahead Log? Cassandra?

  • KeyCache and Row Cache in Cassandra? How is it used? How is it invalidated or kept in Sync?

  • Bloom Filters details?

  • Why is each compaction Strategy Size-Tiered or Levelled Compaction a good strategy for its corresponding workload?

  • Anti-Entropy in Cassandra?

  • Geographical replication of data?

  • Read up on Various company blogs on Cassandra?

  • Last Write Wins and Conflict Resolution?


Paper Link: https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf


Last updated: March 15, 2026

Questions or discussion? Email me

]]>
Dynamo https://www.sethihemant.com/notes/dynamo-2007/ Mon, 02 Dec 2024 00:00:00 +0000

Paper: Dynamo (https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)


Dynamo / Distributed Key Value Store

Problem: Design a distributed key-value store(or Distributed Hash Table) that is highly available (i.e., reliable), highly scalable, and completely decentralized.

Features

  • Highly available Key-Value Store.
  • Shopping Cart, Bestseller Lists, Sales Rank, Product Catalog, etc which needs only primary-key access to data.
  • Multi-table RDBMS would limit scalability and availability.
  • Can choose desired Level of Availability and Consistency.

Background?

  • Designed for high availability (at a massive scale) and partition tolerance at the expense of strong consistency.
  • Primary Motivation for being optimized for High Availability(Over consistency) was to be always up for serving customer requests to provide better customer experience.
  • Dynamo design inspired various NoSQL Databases, Cassandra, Riak, VoldemortDB, DynamoDB.

Design Goals?

  • Highly Available
  • Reliability
  • Highly Scalable
  • Decentralized
  • Eventually Consistent(EC) - Weaker Consistency model than Strong Consistency(Linearizability)
  • (Notes: ) Latency Requirements?
  • (Notes: ) Geographical Distribution of Data?

Use cases

  • Dynamo can achieve strong consistency, but it comes with a performance impact. If Strong Consistency is a requirement, Dynamo is not the best option.
  • Applications that need tight control over the trade-offs between availability, consistency, cost-effectiveness, and performance.
  • Services that need only Primary Key access to the data.

System APIs:

  • get(key): returns the object (or a list of conflicting versions) along with a context.
  • put(key, context, object): writes the object; the context carries the version metadata returned by a previous get().
  • Dynamo treats both the object and the key as an arbitrary array of bytes (typically less than 1 MB).
  • Uses the MD5 hashing algorithm on the key to generate a 128-bit HashID, which is used to determine the storage nodes responsible for serving the key.
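The key-to-HashID step can be sketched in a few lines of Python; the function name and example key are illustrative, not from the paper:

```python
import hashlib

def key_to_ring_position(key: bytes) -> int:
    """Hash a key to a 128-bit integer position on the ring (MD5, as in Dynamo)."""
    digest = hashlib.md5(key).digest()  # 16 bytes = 128 bits
    return int.from_bytes(digest, byteorder="big")

pos = key_to_ring_position(b"cart:12345")
assert 0 <= pos < 2**128
```

The same key always hashes to the same position, so any node can independently compute which storage nodes serve it.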

High Level Architecture

Agenda

  • Data Distribution (Partitioning)
  • Data Replication and Consistency
  • Handling Temporary Failures (Fault Tolerance)
  • Inter-Node Communication (Unreliable Network) and Failure Detection
  • High Availability
  • Conflict resolution and handling permanent failures.

Data Partitioning

  • Distributing data across a set of nodes is called data partitioning.

  • Challenges with Partitioning?

  • Naive Approach(Modulo Hashing)

  • Better Approach(Consistent Hashing)

  • Consistent hashing represents the data managed by a cluster as a ring. The ring is divided into smaller predefined ranges, and each node in the ring is assigned a range of data. The start of the range is called a token (each node is assigned one token).

  • This works well when a node is added to or removed from the ring, as only the neighboring node is affected in these scenarios.

  • The basic Consistent Hashing algorithm assigns a single token (or a consecutive hash range) to each physical node and does a static division of ranges that requires calculating tokens based on a given number of nodes.

  • Dynamo efficiently handles these scenarios (node addition/removal) through the use of virtual nodes (Vnodes), a new scheme for distributing tokens to physical nodes.

  • Instead of assigning a single token to a node, the hash range is divided into multiple smaller ranges, and each physical node is assigned multiple of these smaller ranges. Each of these subranges is called a Vnode.

  • Vnodes are randomly distributed across the cluster and are generally non-contiguous so that no two neighboring Vnodes are assigned to the same physical node.

  • Nodes also carry replicas of other nodes for fault-tolerance.

  • Since there can be heterogeneous machines in the clusters, some servers might hold more Vnodes than others.

  • Advantages of VNodes:
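A minimal Python sketch of a consistent-hash ring with vnodes (class name, vnode count, and node names are illustrative assumptions, not from the paper):

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Consistent-hash ring with virtual nodes (vnodes).

    Each physical node is hashed at `vnodes` points on the ring, so adding
    or removing a node only moves a small, evenly spread fraction of keys.
    """
    def __init__(self, nodes, vnodes=8):
        # one token per (node, vnode index) pair, kept sorted around the ring
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

    def node_for(self, key: str) -> str:
        """The first vnode clockwise from the key's hash owns the key."""
        i = bisect_right(self.tokens, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["A", "B", "C"])
owner = ring.node_for("cart:12345")
```

Because existing tokens never move when a node joins, any key that changes owner can only have moved to the newly added node — the property vnodes are designed to preserve.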

Data Replication

Agenda

  • Optimistic replication

  • Preference List

  • Sloppy Quorum and Handling of Temporary failures

  • Hinted Handoff

Optimistic replication

  • Replicates each data item on N nodes (N = replication factor, configurable per Dynamo instance).

  • Each key is assigned a coordinator node (the node that falls first in the hash range), which stores the data locally and replicates it asynchronously to the N-1 clockwise successor nodes in the ring (hence eventually consistent). This is called optimistic replication.

  • As Dynamo stores N copies of data spread across different nodes, if one node is down, other replicas can respond to queries for that range of data.

  • If a client cannot contact the coordinator node, it sends the request to a node holding a replica.

Preference List

  • The list of nodes responsible for storing a particular key is called the preference list.

  • Dynamo is designed so that every node in the system can determine which nodes should be in this list for any specific key.

  • This list contains more than N nodes to account for failures, and skips virtual nodes on the ring so that the list only contains distinct physical nodes.

Sloppy Quorum and Handling of Temporary Failures

  • Following traditional/strict quorum approaches, any distributed system becomes unavailable during server failures or network partitions and would have reduced availability even under simple failure conditions. Dynamo uses Sloppy Quorums.

  • With this approach, all read/write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while moving clockwise on the consistent hashing ring.

  • Fault Tolerance with Sloppy Quorum.

  • Hinted Handoff
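The preference-list construction described above — walk clockwise and keep only distinct physical nodes — can be sketched as follows (function names and node labels are illustrative assumptions):

```python
import hashlib
from bisect import bisect_right

def md5_int(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

def build_ring(nodes, vnodes=8):
    """Sorted (token, physical node) pairs, one token per vnode."""
    return sorted((md5_int(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

def preference_list(ring, key, n=3):
    """Walk clockwise from the key's position, collecting the first N
    distinct physical nodes (skipping extra vnodes of nodes already seen)."""
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, md5_int(key))
    prefs, seen = [], set()
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in seen:
            seen.add(node)
            prefs.append(node)
            if len(prefs) == n:
                break
    return prefs

ring = build_ring(["A", "B", "C", "D"])
plist = preference_list(ring, "cart:12345", n=3)
```

Under a sloppy quorum, reads and writes then go to the first N *healthy* nodes of this list, so a temporarily failed node is simply skipped over rather than making the key unavailable.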

Vector Clocks and Conflicting Data(Conflict Resolution)

Agenda:

  • Clock Skew?

  • Vector Clock?

  • Conflict Free Replicated Data Types(CRDTs)

  • Last Write Wins (LWW)

Clock Skew

  • Physical clocks have clock skew, which is okay in single-node systems but can cause incorrect ordering of concurrent updates in distributed systems, due to clock skew across different nodes.

  • Physical clocks are synchronized using NTP, but some skew remains; two different nodes’ physical clocks can never be perfectly synchronized.

  • Using special hardware like GPS clocks and atomic clocks can reduce clock skew, but doesn’t entirely eliminate it.

  • Physical clocks have a problem with causal ordering of events (the happens-before relationship).

Vector Clock?

  • Captures Causal ordering between events.

  • A vector clock is effectively a list of (node, counter) pairs; a single counter would be a Lamport clock, while the per-node list is what captures causality.

  • Vector timestamps are attached to every version of the object stored in Dynamo.

  • One can determine whether two versions of an object are on parallel branches or have a causal ordering by examining their vector clocks.

  • If the counters on the first object’s clock are less-than-or-equal to all of the nodes in the second clock, then the first is an ancestor of the second and can be forgotten. Otherwise, the two changes are considered to be in conflict and require reconciliation. Dynamo resolves these conflicts at read-time.

  • Version branching may happen in the presence of failures combined with concurrent updates, resulting in conflicting versions of an object.

  • Dynamo truncates vector clocks (oldest first) when they grow too large. If Dynamo ends up deleting older vector clocks that are required to reconcile an object’s state, it cannot achieve eventual consistency.

Conflict Free Replicated Data Types?

  • To make use of CRDTs, we need to model our data in such a way that concurrent changes can be applied to the data in any order and will produce the same end result. This way, the system does not need to worry about any ordering guarantees.

  • The idea that any two nodes that have received the same set of updates will see the same end result is called strong eventual consistency.

Last Write Wins

  • Dynamo (and Cassandra) also offers server-side conflict resolution: Last Write Wins (LWW).

  • Uses Physical(Wall Clock/Time-Of-the-Day) Clocks.

  • Can potentially lead to Data-Loss during concurrent writes.
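The ancestor check from the Vector Clock section above — compare counters pairwise, and treat versions as conflicting when neither dominates — can be sketched as follows (function names and server labels are illustrative):

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is equal to or a descendant of `b`:
    every counter in b is <= the matching counter in a."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    """Classify two object versions by their vector clocks."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "descendant"   # b is an ancestor of a and can be forgotten
    if descends(b, a):
        return "ancestor"     # a is an ancestor of b
    return "conflict"         # parallel branches: reconcile at read time

v1 = {"Sx": 1}                # written via coordinator Sx
v2 = {"Sx": 2}                # Sx updated again: v2 descends v1
v3 = {"Sx": 1, "Sy": 1}       # concurrent update via Sy: conflicts with v2
```

Here `compare(v2, v3)` yields a conflict: neither clock dominates, so Dynamo would return both versions to the reader for reconciliation.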

Life of Dynamo’s put() and get() operations.

Agenda:

  • Strategies for Coordinator selection
  • Consistency protocol
  • put() process
  • get() process
  • Request handling through a state machine.

Strategies for choosing coordinator

  • Clients route requests through a generic load balancer, or
  • Clients use a partition-aware client library that routes requests directly to the appropriate coordinator, with lower latency.

Consistency Protocol

  • Uses a consistency protocol similar to quorum systems.

  • R + W > N (R/W = the minimum number of nodes that must participate in a successful read/write).

  • A common configuration (N, R, W) for Dynamo is (3, 2, 2).

  • The latency of get() and put() depends upon the slowest of the replicas.

Put() Process

  • Coordinator generates new data version and vector timestamp.

  • Saves data locally.

  • Sends write requests to N-1 highest ranked healthy nodes from the preference list.

  • put() is considered successful after receiving W-1 confirmations.

Get() process

  • Coordinator requests the data version from N-1 highest ranked healthy nodes from the preference list.

  • Waits until R - 1 replies.

  • Coordinator handles causal data versioning using vector clocks/timestamps.

  • Returns all data versions to the caller.
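Why R + W > N guarantees that a read overlaps the latest write can be checked exhaustively for small clusters — a brute-force sketch (not how Dynamo implements quorums, just a demonstration of the arithmetic):

```python
from itertools import combinations

def quorum_overlap(n: int, r: int, w: int) -> bool:
    """True if every possible R-node read set intersects every possible
    W-node write set — i.e. a read always sees at least one fresh replica."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )

assert quorum_overlap(3, 2, 2)       # R + W > N: reads overlap the last write
assert not quorum_overlap(3, 1, 2)   # R + W <= N: a read can miss the write
```

The pigeonhole argument is the same one the R + W > N rule encodes: R + W nodes drawn from N must share at least R + W - N members.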

Request handling through the state machine

  • Each client request results in creating a state machine on the node that received the client request.
  • The state machine contains all the logic for
  • Each state machine instance handles exactly one client request.
  • A read operation implements the following state machine:
  • Writes:

Anti-Entropy through Merkle Trees

  • Dynamo uses Vector clocks to remove write conflicts(Read Repair) while serving read requests if it receives stale responses from some of the replicas.

  • If a replica falls significantly behind others, it might take a very long time to resolve conflicts using read repair (vector clocks), depending on whether those keys are ever read. Some keys may never be accessed and could remain stale indefinitely.

  • We need a mechanism to automatically reconcile replicas in the background(and do conflict resolution if any).

  • To do this, we need to quickly compare two copies of a range of data residing on different replicas and figure out exactly which parts are different.

  • Naively exchanging checksums of the entire data range is not feasible; too much hash data would have to be transferred between replicas.

  • Dynamo uses Merkle trees to compare replicas of a range.

  • A Merkle tree is a binary tree of hashes, where each internal node is the hash of its two children, and each leaf node is a hash of a portion of the original data.

  • Comparing the ranges of data on two replicas is then equivalent to comparing the two Merkle trees.

  • The principal advantage of using a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the whole data set.

  • Merkle trees minimize the amount of data that needs to be transferred for synchronization and reduce the number of disk reads performed during the anti-entropy process.

  • The disadvantage of using Merkle trees is that many key ranges can change when a node joins or leaves, and as a result, the trees need to be recalculated.
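A minimal Python sketch of the Merkle-tree comparison: build a hash tree over a replica's key range, then diff two trees top-down so that matching subtrees are never descended into (class and function names are illustrative, and SHA-256 is an assumption — the paper does not mandate a hash function):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class Node:
    def __init__(self, left=None, right=None, leaf=None, span=None):
        self.left, self.right, self.span = left, right, span
        # leaf nodes hash the data; internal nodes hash their children's hashes
        self.hash = h(leaf) if leaf is not None else h(left.hash + right.hash)

def build(leaves, lo=0):
    """Build a Merkle tree over `leaves`; each node remembers its leaf span."""
    if len(leaves) == 1:
        return Node(leaf=leaves[0], span=(lo, lo + 1))
    mid = len(leaves) // 2
    left = build(leaves[:mid], lo)
    right = build(leaves[mid:], lo + mid)
    return Node(left=left, right=right, span=(lo, lo + len(leaves)))

def diff(a, b):
    """Leaf ranges where two replicas disagree. Branches whose hashes
    match are skipped entirely — only differing subtrees are descended."""
    if a.hash == b.hash:
        return []
    if a.left is None:          # a differing leaf
        return [a.span]
    return diff(a.left, b.left) + diff(a.right, b.right)

r1 = build([b"k0=v0", b"k1=v1", b"k2=v2", b"k3=v3"])
r2 = build([b"k0=v0", b"k1=XX", b"k2=v2", b"k3=v3"])
stale = diff(r1, r2)   # only leaf 1 differs, so only it must be synced
```

Only the hashes along the differing path are exchanged, which is exactly the bandwidth and disk-read saving the bullets above describe.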

Gossip Protocol

What is a Gossip Protocol?

  • How does Node Failure Detection happen in Dynamo?
  • Since we do not have any central node that keeps track of all nodes to know if a node is down or not, how does a node know every other node’s current state?
  • Naive approach: each node broadcasts a heartbeat message to every other node.
  • Optimized Approach: Gossip Protocol
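A toy simulation of the gossip approach — each round, every node exchanges its membership view (heartbeat counters) with one random peer, and knowledge spreads epidemically. This is a sketch of the general pattern, not Dynamo's actual protocol:

```python
import random

def gossip_step(views, rng):
    """One gossip round: every node exchanges membership state with one
    randomly chosen peer; both sides keep the highest heartbeat seen."""
    nodes = list(views)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merged = {k: max(views[node].get(k, 0), views[peer].get(k, 0))
                  for k in views[node].keys() | views[peer].keys()}
        views[node], views[peer] = dict(merged), dict(merged)

rng = random.Random(0)
views = {n: {n: 1} for n in "ABCDE"}   # each node knows only itself at first
for _ in range(10):
    gossip_step(views, rng)
```

After a handful of rounds each node's view typically converges to the full membership, with per-round message cost O(1) per node instead of the O(n) broadcast of the naive approach.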

External Discovery Through Seed Nodes?

  • Dynamo nodes use gossip protocol to find the current state of the ring. This can result in a logical partition of the cluster in a particular scenario.
  • An administrator joins node A to the ring and then joins node B to the ring. Nodes A and B consider themselves part of the ring, yet neither would be immediately aware of each other. To prevent these logical partitions, Dynamo introduced the concept of seed nodes.
  • Seed nodes are fully functional nodes and can be obtained either from a static configuration or a configuration service. This way, all nodes are aware of seed nodes.
  • Each node communicates with seed nodes through gossip protocol to reconcile membership changes; therefore, logical partitions are highly unlikely.

Characteristics and Criticism of Dynamo

Responsibilities of a Dynamo Node

  • Managing get() and put() requests by acting as a coordinator (or request forwarder).
  • Keeping track of membership (hash ranges in the ring) and detecting failures (gossip).
  • Local persistent storage.

Characteristics of Dynamo

  • Distributed (can run across several machines)
  • Decentralized (no external coordinator; all nodes are identical)
  • Scalable (horizontally scaled on commodity hardware with fault tolerance; no manual intervention/rebalancing required)
  • Highly Available
  • Fault Tolerant and Reliable
  • Tunable Consistency (trade-offs between availability and consistency by adjusting (N, R, W), e.g. (3, 2, 2), (3, 1, 3), or (3, 3, 1))

Criticism on Dynamo Design?

  • Each Dynamo node contains the entire routing table. This could affect the scalability of the system, as the routing table grows larger as more nodes are added.
  • Dynamo strives for symmetry (all nodes have the same set of responsibilities), yet it designates some nodes as seed nodes for external discovery to avoid logical partitions, which may violate that symmetry principle.
  • DHTs can be susceptible to several different types of attack. [Research More?]
  • Dynamo’s design can be described as a leaky abstraction.

DataStores developed on Principles of Dynamo

  • Riak is a distributed NoSQL key-value data store that is highly available, scalable, fault-tolerant, and easy to operate.
  • Cassandra is a distributed, decentralized, scalable, and highly available NoSQL wide-column database.

Summary

Paper reading Video.

References:


Paper Link: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf


Last updated: March 15, 2026

