<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>NoSQL on Hemant Sethi</title>
    <link>https://www.sethihemant.com/tags/-nosql/</link>
    <description>Recent content in NoSQL on Hemant Sethi</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 28 Jan 2026 21:24:47 -0800</lastBuildDate>
    <atom:link href="https://www.sethihemant.com/tags/-nosql/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Spanner</title>
      <link>https://www.sethihemant.com/notes/spanner-2012/</link>
      <pubDate>Sat, 14 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/spanner-2012/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf&#34;&gt;Spanner&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;spanner-googles-globally-distributed-database&#34;&gt;Spanner: Google’s Globally-Distributed Database&lt;/h2&gt;
&lt;h3 id=&#34;abstract&#34;&gt;Abstract&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database that supports &lt;strong&gt;externally consistent&lt;/strong&gt; (&lt;strong&gt;linearizable&lt;/strong&gt;) distributed transactions.&lt;/li&gt;
&lt;li&gt;The paper describes how Spanner is structured, its feature set, the rationale behind various design decisions, and a novel time API that exposes clock uncertainty.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;introduction&#34;&gt;Introduction&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Spanner shards data across many sets of Paxos state machines in datacenters spread across the world.&lt;/li&gt;
&lt;li&gt;Replication is used for global availability and geographic locality; clients automatically fail over between replicas.&lt;/li&gt;
&lt;li&gt;Automatically reshards data across machines as the amount of data or the number of servers changes, and it automatically migrates data across machines (even across datacenters) to balance load and in response to failures.&lt;/li&gt;
&lt;li&gt;Designed to scale up to millions of machines across hundreds of data centers and trillions of database rows.&lt;/li&gt;
&lt;li&gt;Applications can use Spanner for high availability, even in the face of wide-area natural disasters, by replicating their data within or even across continents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bigtable problems&lt;/strong&gt;: difficult to use for applications with complex, evolving schemas, or those that need strong consistency in the presence of wide-area replication.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Megastore&lt;/strong&gt; supports a semi-relational data model and synchronous replication, despite its relatively poor write throughput.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spanner&lt;/strong&gt; has evolved from a Bigtable-like versioned key-value store into a &lt;strong&gt;temporal multi-version database&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Globally distributed features are enabled by the TrueTime API and its implementation (the &lt;strong&gt;key enabler&lt;/strong&gt; of the above properties).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;spanner-implementation&#34;&gt;Spanner Implementation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Directory abstraction&lt;/strong&gt; (the unit of data movement) to manage replication and locality.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf">Spanner</a></p>
<hr>
<h2 id="spanner-googles-globally-distributed-database">Spanner: Google’s Globally-Distributed Database</h2>
<h3 id="abstract">Abstract</h3>
<ul>
<li>Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database that supports <strong>externally consistent</strong> (<strong>linearizable</strong>) distributed transactions.</li>
<li>The paper describes how Spanner is structured, its feature set, the rationale behind various design decisions, and a novel time API that exposes clock uncertainty.</li>
</ul>
<h3 id="introduction">Introduction</h3>
<ul>
<li>Spanner shards data across many sets of Paxos state machines in datacenters spread across the world.</li>
<li>Replication is used for global availability and geographic locality; clients automatically fail over between replicas.</li>
<li>Automatically reshards data across machines as the amount of data or the number of servers changes, and it automatically migrates data across machines (even across datacenters) to balance load and in response to failures.</li>
<li>Designed to scale up to millions of machines across hundreds of data centers and trillions of database rows.</li>
<li>Applications can use Spanner for high availability, even in the face of wide-area natural disasters, by replicating their data within or even across continents.</li>
<li><strong>Bigtable problems</strong>: difficult to use for applications with complex, evolving schemas, or those that need strong consistency in the presence of wide-area replication.</li>
<li><strong>Megastore</strong> supports a semi-relational data model and synchronous replication, despite its relatively poor write throughput.</li>
<li><strong>Spanner</strong> has evolved from a Bigtable-like versioned key-value store into a <strong>temporal multi-version database</strong>.</li>
<li>Globally distributed features are enabled by the TrueTime API and its implementation (the <strong>key enabler</strong> of the above properties).</li>
</ul>
<h3 id="spanner-implementation">Spanner Implementation</h3>
<ul>
<li>
<p><strong>Directory abstraction</strong> (the unit of data movement) to manage replication and locality.</p>
</li>
<li>
<p><strong>Data model</strong>. Spanner looks like a relational database instead of a key-value store.</p>
</li>
<li>
<p>Applications can control <strong>data locality.</strong></p>
</li>
<li>
<p>A Spanner deployment is called a <strong>universe.</strong></p>
</li>
<li>
<p>Spanner is organized as a set of <strong>zones.</strong></p>
</li>
<li>
<p>A <strong>zone</strong> has one <strong>zonemaster</strong> (assigns data to spanservers) and between one hundred and several thousand <strong>spanservers</strong> (serve data to clients).</p>
</li>
<li>
<p>Per-zone <strong>location proxies</strong> are used by clients to locate the spanservers assigned to serve their data.
<img loading="lazy" src="/images/notes/spanner-2012/image-1.png"></p>
</li>
<li>
<p><strong>Universe master (singleton)</strong> is primarily a console that displays status information about all the zones for interactive debugging.</p>
</li>
<li>
<p><strong>Placement driver (singleton)</strong> handles automated movement of data across zones on the timescale of minutes.</p>
</li>
</ul>
<h3 id="spanserver-software-stack">SpanServer Software Stack</h3>
<ul>
<li>
<p>The spanserver implementation illustrates how <strong>replication</strong> and <strong>distributed transactions</strong> have been layered onto the Bigtable-based implementation.</p>
</li>
<li>
<p>Each <strong>spanserver</strong> is responsible for between 100 and 1000 instances of a data structure called a tablet.</p>
</li>
<li>
<p>A tablet is similar to Bigtable’s tablet abstraction, in that it implements a bag of mappings of the form (key:string, timestamp:int64) → string.</p>
</li>
<li>
<p>Unlike <strong>Bigtable</strong>, <strong>Spanner</strong> assigns <strong>timestamps</strong> to data, which makes it more of a multi-version database than a key-value store.</p>
</li>
<li>
<p>A <strong>Spanner tablet’s</strong> state is stored in a set of <strong>B-tree-like</strong> files and a <strong>write-ahead log</strong>, all on a distributed file system called <strong>Colossus</strong> (the successor to the Google File System).</p>
</li>
<li>
<p>To support <strong>replication</strong>, each spanserver implements a <strong>single Paxos state machine</strong> on top of each tablet.</p>
</li>
<li>
<p>Each state machine stores its <strong>metadata</strong> and <strong>log</strong> in its corresponding tablet</p>
</li>
<li>
<p><strong>Paxos</strong> implementation supports <strong>long-lived leaders</strong> with <strong>time-based leases</strong> (default: 10 seconds).</p>
</li>
<li>
<p>The current Spanner implementation <strong>logs every Paxos write twice</strong>: once in the tablet’s log and once in the Paxos log.</p>
</li>
<li>
<p>Implementation of Paxos is <strong>pipelined</strong>, so as to improve Spanner’s throughput in the presence of WAN latencies; but writes are applied by Paxos in order.</p>
</li>
<li>
<p>The <strong>Paxos state machines</strong> are used to implement a consistently <strong>replicated bag of mappings</strong>.</p>
</li>
<li>
<p><strong>Writes</strong> must <strong>initiate</strong> the Paxos protocol at the leader.</p>
</li>
<li>
<p><strong>Reads</strong> access state directly from the underlying tablet at any replica that is <strong>sufficiently up-to-date</strong>.</p>
</li>
<li>
<p>Set of replicas is collectively a <strong>Paxos group</strong>.</p>
</li>
<li>
<p>At the <strong>leader replica</strong>, each spanserver implements a <strong>lock table</strong> for concurrency control.
<img loading="lazy" src="/images/notes/spanner-2012/image-2.png"></p>
</li>
<li>
<p>Bigtable and Spanner are designed for <strong>long-lived transactions</strong> (e.g. report generation, which might take on the order of minutes), which <strong>perform poorly under optimistic concurrency control in the presence of conflicts</strong> (under OCC, a conflict detected at commit time forces the long transaction to abort and retry from scratch).</p>
</li>
<li>
<p><strong>Operations</strong> that require <strong>synchronization</strong>, such as <strong>transactional reads</strong>, acquire locks in the lock table; other operations bypass the lock table.</p>
</li>
<li>
<p>Each spanserver (at the <strong>leader replica</strong>) implements a <strong>transaction manager</strong> to support distributed transactions.</p>
</li>
<li>
<p>If a transaction involves only one Paxos group (as is the case for most transactions), it can bypass the transaction manager, since the lock table and Paxos together provide transactionality.</p>
</li>
<li>
<p>If a transaction involves more than one Paxos group, those groups’ leaders coordinate to perform a <strong>two-phase commit</strong>.</p>
</li>
<li>
<p>The state of each transaction manager is stored in the underlying Paxos group (and therefore is replicated).</p>
</li>
</ul>
<h3 id="directories-and-placement">Directories and Placement</h3>
<ul>
<li>
<p>On top of the <strong>bag of key-value mappings</strong>, the Spanner implementation supports a <strong>bucketing abstraction</strong> (<strong>directory</strong>), which is a <strong>set of contiguous keys that share a common prefix</strong>.</p>
</li>
<li>
<p><strong>A directory is the unit of data placement</strong>.</p>
</li>
<li>
<p>The fact that a Paxos group may contain multiple directories implies that a <strong>Spanner tablet</strong> is different from a <strong>Bigtable tablet</strong>: the former is not necessarily a single lexicographically contiguous partition of the row space.
<img loading="lazy" src="/images/notes/spanner-2012/image-3.png"></p>
</li>
<li>
<p><strong>Movedir</strong> is the background task used to move directories between Paxos groups.</p>
</li>
<li>
<p>An application specifies a <strong>directory’s</strong> geographic-replication placement.</p>
</li>
<li>
<p>The design of <strong>placement-specification language</strong> separates responsibilities for <strong>managing replication configurations</strong>.</p>
</li>
<li>
<p>An application controls how data is replicated, by <strong>tagging each database</strong> and/or <strong>individual directories</strong> with a combination of those options.</p>
</li>
<li>
<p>Spanner will <strong>shard/partition</strong> a directory into multiple <strong>fragments</strong> if it grows too large.</p>
</li>
</ul>
<h3 id="data-model">Data Model</h3>
<ul>
<li>
<p>Spanner offers applications a <strong>semi-relational data model</strong>, a query language, and general-purpose transactions.</p>
</li>
<li>
<p><strong>DataModel Use Case:</strong>
<img loading="lazy" src="/images/notes/spanner-2012/image-4.png"></p>
</li>
<li>
<p>This interleaving of tables to form directories is significant because it allows clients to describe the locality relationships that exist between multiple tables, which is necessary for good performance in a sharded, distributed database. Without it, Spanner would not know the most important locality relationships.</p>
</li>
</ul>
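<p>A minimal sketch (in Python, with a hypothetical key encoding rather than Spanner’s actual format) of why interleaving matters: if a child table’s key is prefixed by its parent’s key, the parent row and all of its children sort into one contiguous key range, i.e. one directory. The Users/Albums names follow the paper’s example.</p>
<pre><code class="language-python"># Child rows carry the parent key as a prefix, so a user row and all of
# that user's album rows land in the same contiguous key range (directory).
def user_key(uid):
    return ("Users", uid)

def album_key(uid, aid):
    return ("Users", uid, "Albums", aid)  # parent-key prefix: co-located

rows = sorted([user_key(2), album_key(1, 8), user_key(1), album_key(1, 9)])
print(rows)
# [('Users', 1), ('Users', 1, 'Albums', 8), ('Users', 1, 'Albums', 9), ('Users', 2)]
</code></pre>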
<h3 id="truetime">TrueTime</h3>
<ul>
<li>
<p><strong>TrueTime</strong> explicitly represents time as a <strong>TTinterval</strong>, an interval with bounded time uncertainty (unlike standard time interfaces, which give clients no notion of uncertainty).
<img loading="lazy" src="/images/notes/spanner-2012/image-5.png"></p>
</li>
<li>
<p>The endpoints of a <strong>TTinterval</strong> are of type <strong>TTstamp.</strong></p>
</li>
<li>
<p>The time epoch is analogous to UNIX time with <strong>leap-second smearing</strong>.</p>
</li>
<li>
<p>The underlying time references used by <strong>TrueTime</strong> are <strong>GPS</strong> and <strong>atomic clocks</strong> because they have different failure modes.</p>
</li>
<li>
<p><strong>TrueTime</strong> is implemented by a <strong>set of time master machines</strong> per datacenter and a <strong>timeslave daemon</strong> per machine.</p>
</li>
<li>
<p>All masters’ time references are regularly compared against each other.</p>
</li>
<li>
<p>Every daemon polls a variety of masters to reduce vulnerability to errors from any one master.</p>
</li>
<li>
<p>With a daemon poll interval of 30 seconds and an applied drift rate of 200 microseconds/second, the uncertainty ranges from about 1 ms to 7 ms (0 to 6 ms of drift accumulated between polls, plus roughly 1 ms of communication delay to the time masters), and is around 4 ms most of the time.</p>
</li>
</ul>
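<p>A minimal sketch of the TTinterval interface described above, assuming a fixed worst-case uncertainty bound EPSILON; the real implementation derives the bound from the GPS/atomic-clock masters and the local clock’s drift since the last poll.</p>
<pre><code class="language-python">import time

EPSILON = 0.007  # illustrative worst-case uncertainty (7 ms)

def tt_now():
    """Return (earliest, latest); the true absolute time lies in this interval."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def tt_after(ts):
    """True only if ts has definitely passed."""
    return ts &lt; tt_now()[0]

def tt_before(ts):
    """True only if ts has definitely not arrived yet."""
    return ts &gt; tt_now()[1]
</code></pre>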
<h3 id="concurrency-control">Concurrency Control</h3>
<ul>
<li>The TrueTime API is used to guarantee <strong>correctness properties</strong> around concurrency control, and those properties are used to implement features such as <strong>externally consistent transactions</strong>, <strong>lock-free read-only transactions</strong>, and <strong>non-blocking reads</strong> in the past.</li>
<li>Important to distinguish <strong>Paxos writes</strong> from <strong>Spanner client writes</strong>.</li>
</ul>
<h3 id="timestamp-management">Timestamp Management</h3>
<ul>
<li>Spanner supports:
<img loading="lazy" src="/images/notes/spanner-2012/image-6.png"></li>
</ul>
<h3 id="paxos-leader-leases">Paxos Leader Leases</h3>
<ul>
<li>Spanner’s Paxos implementation uses <strong>timed leases</strong> to make leadership long-lived (10 seconds by default).</li>
<li><strong>Leader Election</strong></li>
</ul>
<h3 id="assigning-timestamps-to-rw-transactions">Assigning Timestamps to RW Transactions</h3>
<ul>
<li>Transactional reads and writes use <strong>two-phase locking</strong> (reads and writes block each other).</li>
<li>As a result, they can be assigned timestamps at any time when all locks have been acquired, but before any locks have been released.</li>
<li>For a given transaction, Spanner assigns it the timestamp that Paxos assigns to the Paxos write that represents the transaction commit.</li>
<li><strong>Spanner</strong> depends on the following <strong>monotonicity invariant</strong>: within each Paxos group, Spanner assigns <strong>timestamps</strong> to Paxos writes in <strong>monotonically increasing</strong> order, <strong>even across leaders</strong>.</li>
<li><strong>Spanner</strong> also enforces the following <strong>external-consistency invariant</strong>: if the start of a transaction T2 occurs after commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1.</li>
</ul>
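<p>The external-consistency invariant is enforced with <strong>commit wait</strong>. A sketch (repeating the TrueTime helpers from the sketch above; the Paxos replication step is elided):</p>
<pre><code class="language-python">import time

EPSILON = 0.007  # same illustrative bound as the TrueTime sketch

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def tt_after(ts):
    return ts &lt; tt_now()[0]

def commit(apply_write):
    s = tt_now()[1]          # pick commit timestamp s no less than TT.now().latest
    apply_write(s)           # replicate the write via Paxos (elided)
    while not tt_after(s):   # commit wait: about 2 * EPSILON on average
        time.sleep(0.001)
    return s                 # safe to release locks and ack the client

print(commit(lambda ts: None))
</code></pre>
<p>Because the leader waits until TT.after(s) holds before releasing locks, any transaction that starts after this commit is guaranteed a strictly larger timestamp.</p>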
<h3 id="serving-reads-at-a-timestamp">Serving Reads at a Timestamp</h3>
<h3 id="assigning-timestamps-to-ro-transactions">Assigning Timestamps to RO Transactions</h3>
<h3 id="details">Details</h3>
<h3 id="read-write-transactions">Read Write Transactions</h3>
<ul>
<li>Like <strong>Bigtable</strong>, writes that occur in a transaction are buffered at the client until commit.</li>
<li>As a result, reads in a transaction do not see the effects of the transaction’s writes. This design works well in Spanner because a read returns the timestamps of any data read, and uncommitted writes have not yet been assigned timestamps.</li>
<li>Reads within read-write transactions use wound-wait to avoid deadlocks.</li>
</ul>
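<p>A sketch of the wound-wait rule from the last bullet, assuming each transaction is stamped with a monotonically increasing timestamp at start (older = smaller); this is not Spanner’s actual lock-table code.</p>
<pre><code class="language-python">class Txn:
    def __init__(self, ts):
        self.ts = ts          # older transactions have smaller timestamps
        self.aborted = False

class Lock:
    def __init__(self):
        self.holder = None
        self.waiters = []

def request_lock(lock, txn):
    if lock.holder is None:
        lock.holder = txn
    elif txn.ts &lt; lock.holder.ts:
        lock.holder.aborted = True  # older requester "wounds" the younger holder
        lock.holder = txn
    else:
        lock.waiters.append(txn)    # younger requester waits for the older holder

# Waits only ever go from younger to older, so no wait cycle (deadlock) can form.
lock = Lock()
young, old = Txn(2), Txn(1)
request_lock(lock, young)
request_lock(lock, old)
print(lock.holder.ts, young.aborted)  # 1 True
</code></pre>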
<h3 id="jordan-google-spanner2013">Jordan: Google Spanner(2013)</h3>
<ul>
<li>
<p>Strongly consistent SQL Database via Paxos.</p>
</li>
<li>
<p>Supports <strong>causally consistent</strong>, non-blocking read-only snapshots <strong>over multiple nodes</strong> at a time, even though they are <strong>distributed</strong>. This is not something you could do in a traditional database.
Causal consistency:</p>
</li>
<li>
<p>Write B is causally dependent on write A if the process issuing B could have observed the effects of A (i.e., A happens-before B).</p>
</li>
<li>
<p>Can be achieved by using Lamport Clocks.</p>
</li>
<li>
<p>Spanner is both externally and causally consistent.</p>
</li>
<li>
<p>The order of writes to the database is the order in which the events actually happened.</p>
</li>
<li>
<p>Formally:</p>
</li>
</ul>
<h3 id="spanner-details">Spanner Details</h3>
<ul>
<li>Spanner’s design looks similar to Megastore’s in how it ensures strong consistency.</li>
</ul>
<hr>
<p><strong>Paper Link:</strong> <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf">https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
    <item>
      <title>DynamoDB</title>
      <link>https://www.sethihemant.com/notes/dynamodb-2022/</link>
      <pubDate>Wed, 11 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/dynamodb-2022/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://www.usenix.org/conference/atc22/presentation/elhemali&#34;&gt;DynamoDB&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;dynamodb&#34;&gt;DynamoDB&lt;/h2&gt;
&lt;h3 id=&#34;summaryabstract&#34;&gt;Summary/Abstract&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Amazon DynamoDB is a NoSQL cloud database service that provides consistent performance at any scale.&lt;/li&gt;
&lt;li&gt;Fundamental properties: &lt;strong&gt;consistent performance&lt;/strong&gt;, &lt;strong&gt;availability&lt;/strong&gt;, &lt;strong&gt;durability&lt;/strong&gt;, and a &lt;strong&gt;fully managed serverless&lt;/strong&gt; experience.&lt;/li&gt;
&lt;li&gt;In 2021, during the 66-hour Amazon Prime Day shopping event, DynamoDB served a peak of &lt;strong&gt;89.2 million requests per second&lt;/strong&gt; while maintaining high availability with &lt;strong&gt;single-digit millisecond&lt;/strong&gt; performance.&lt;/li&gt;
&lt;li&gt;Design and implementation of &lt;strong&gt;DynamoDB&lt;/strong&gt; have evolved since the first launch in 2012. The system has successfully dealt with issues related to &lt;strong&gt;fairness&lt;/strong&gt;, &lt;strong&gt;traffic imbalance across partitions&lt;/strong&gt;, &lt;strong&gt;monitoring&lt;/strong&gt;, and &lt;strong&gt;automated system operations&lt;/strong&gt; without impacting availability or performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;introduction&#34;&gt;Introduction&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;goal of the design&lt;/strong&gt; of DynamoDB is to complete all requests with &lt;strong&gt;low single-digit millisecond&lt;/strong&gt; latencies.&lt;/li&gt;
&lt;li&gt;DynamoDB uniquely integrates the following six fundamental system properties:&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB is a fully managed cloud service.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB employs a multi-tenant architecture.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB achieves boundless scale for tables.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB provides predictable performance.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB is highly available.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB supports flexible use cases.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;DynamoDB evolved as a distributed database service to meet the needs of its customers without losing its key aspect of providing a single-tenant experience to every customer using a multi-tenant architecture.&lt;/li&gt;
&lt;li&gt;The paper explains the challenges faced by the system and how the service evolved to handle those challenges while &lt;strong&gt;connecting&lt;/strong&gt; the required changes to a common theme of durability, availability, scalability, and predictable performance.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/dynamodb-2022/image-1.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;history&#34;&gt;History&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The design of DynamoDB was motivated by experiences with its predecessor &lt;strong&gt;Dynamo&lt;/strong&gt;. Dynamo was created in response to the need for a highly scalable, available, and durable key-value database for shopping cart data.&lt;/li&gt;
&lt;li&gt;Amazon learned that providing applications with direct access to traditional enterprise database instances led to scaling bottlenecks such as connection management, interference between concurrent workloads, and operational problems with tasks such as schema upgrades.&lt;/li&gt;
&lt;li&gt;Service-oriented architecture was adopted to encapsulate an application’s data behind service-level APIs that allowed sufficient decoupling to address tasks like reconfiguration without having to disrupt clients.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt; took its principles from &lt;strong&gt;Dynamo&lt;/strong&gt; (which was run as a self-hosted database and created an operational burden for developers) and &lt;strong&gt;SimpleDB&lt;/strong&gt; (a fully managed elastic NoSQL database service whose data model could not scale to the large tables DDB needed).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamo Limitations:&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SimpleDB limitations&lt;/strong&gt;:&lt;/li&gt;
&lt;li&gt;Amazon concluded that a better solution would combine the best parts of the original Dynamo design (incremental scalability and predictable high performance) with the best parts of SimpleDB (ease of administration of a cloud service, consistency, and a table-based data model that is richer than a pure key-value store)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A DynamoDB table is a collection of items.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://www.usenix.org/conference/atc22/presentation/elhemali">DynamoDB</a></p>
<hr>
<h2 id="dynamodb">DynamoDB</h2>
<h3 id="summaryabstract">Summary/Abstract</h3>
<ul>
<li>Amazon DynamoDB is a NoSQL cloud database service that provides consistent performance at any scale.</li>
<li>Fundamental properties: <strong>consistent performance</strong>, <strong>availability</strong>, <strong>durability</strong>, and a <strong>fully managed serverless</strong> experience.</li>
<li>In 2021, during the 66-hour Amazon Prime Day shopping event, DynamoDB served a peak of <strong>89.2 million requests per second</strong> while maintaining high availability with <strong>single-digit millisecond</strong> performance.</li>
<li>Design and implementation of <strong>DynamoDB</strong> have evolved since the first launch in 2012. The system has successfully dealt with issues related to <strong>fairness</strong>, <strong>traffic imbalance across partitions</strong>, <strong>monitoring</strong>, and <strong>automated system operations</strong> without impacting availability or performance.</li>
</ul>
<h3 id="introduction">Introduction</h3>
<ul>
<li>The <strong>goal of the design</strong> of DynamoDB is to complete all requests with <strong>low single-digit millisecond</strong> latencies.</li>
<li>DynamoDB uniquely integrates the following six fundamental system properties:</li>
<li><strong>DynamoDB is a fully managed cloud service.</strong></li>
<li><strong>DynamoDB employs a multi-tenant architecture.</strong></li>
<li><strong>DynamoDB achieves boundless scale for tables.</strong></li>
<li><strong>DynamoDB provides predictable performance.</strong></li>
<li><strong>DynamoDB is highly available.</strong></li>
<li><strong>DynamoDB supports flexible use cases.</strong></li>
<li>DynamoDB evolved as a distributed database service to meet the needs of its customers without losing its key aspect of providing a single-tenant experience to every customer using a multi-tenant architecture.</li>
<li>The paper explains the challenges faced by the system and how the service evolved to handle those challenges while <strong>connecting</strong> the required changes to a common theme of durability, availability, scalability, and predictable performance.
<img loading="lazy" src="/images/notes/dynamodb-2022/image-1.png"></li>
</ul>
<h3 id="history">History</h3>
<ul>
<li>The design of DynamoDB was motivated by experiences with its predecessor <strong>Dynamo</strong>. Dynamo was created in response to the need for a highly scalable, available, and durable key-value database for shopping cart data.</li>
<li>Amazon learned that providing applications with direct access to traditional enterprise database instances led to scaling bottlenecks such as connection management, interference between concurrent workloads, and operational problems with tasks such as schema upgrades.</li>
<li>Service-oriented architecture was adopted to encapsulate an application’s data behind service-level APIs that allowed sufficient decoupling to address tasks like reconfiguration without having to disrupt clients.</li>
<li><strong>DynamoDB</strong> took its principles from <strong>Dynamo</strong> (which was run as a self-hosted database and created an operational burden for developers) and <strong>SimpleDB</strong> (a fully managed elastic NoSQL database service whose data model could not scale to the large tables DDB needed).</li>
<li><strong>Dynamo Limitations:</strong></li>
<li><strong>SimpleDB limitations</strong>:</li>
<li>Amazon concluded that a better solution would combine the best parts of the original Dynamo design (incremental scalability and predictable high performance) with the best parts of SimpleDB (ease of administration of a cloud service, consistency, and a table-based data model that is richer than a pure key-value store)</li>
</ul>
<h3 id="architecture">Architecture</h3>
<ul>
<li>
<p>A DynamoDB table is a collection of items.</p>
</li>
<li>
<p>Each item is a collection of attributes and is uniquely identified by its primary key.</p>
</li>
<li>
<p>The schema of the primary key is specified at table creation time.</p>
</li>
<li>
<p>The partition key’s value is always used as input to an internal hash function (see the routing sketch after this list).</p>
</li>
<li>
<p>The output from the hash function and the sort key value (if present) determines where the item will be stored.</p>
</li>
<li>
<p>Multiple items can have the same partition key value in a table with a composite primary key. However, those items must have different sort key values.</p>
</li>
<li>
<p>Supports secondary indexes to provide enhanced querying capability, which allows querying the data in the table using an alternate key.</p>
</li>
<li>
<p>DynamoDB provides a simple interface to store or retrieve items from a table or an index.
<img loading="lazy" src="/images/notes/dynamodb-2022/image-2.png"></p>
</li>
<li>
<p>DynamoDB supports ACID transactions for multi-item updates w/o affecting scalability/availability/performance.</p>
</li>
<li>
<p>A DynamoDB table is divided into multiple partitions.</p>
</li>
<li>
<p>Each partition of the table hosts a disjoint and contiguous part of the table’s key-range and has multiple replicas(<strong>replication Group</strong>) distributed across different Availability Zones for high availability and durability.</p>
</li>
<li>
<p>The Replication Group uses Multi-Paxos for leader election and consensus.</p>
</li>
<li>
<p>Any replica can trigger a round of leader election.</p>
</li>
<li>
<p>Once elected leader, a replica can maintain leadership as long as it periodically renews its leadership lease.</p>
</li>
<li>
<p>Only the leader replica can serve <strong>write</strong> and <strong>strongly consistent read</strong> requests.</p>
</li>
<li>
<p>Leader generates a write-ahead log record and sends it to its peers.</p>
</li>
<li>
<p><strong>Write</strong> is acknowledged to the application once a quorum of peers persists the log record to their local write-ahead logs.</p>
</li>
<li>
<p>DynamoDB supports <strong>strong(Leader Read)</strong> and <strong>eventually consistent(Replica read)</strong> reads.</p>
</li>
<li>
<p>The leader of the group extends its leadership using a lease mechanism.</p>
</li>
<li>
<p>If the leader of the group is detected as failed (considered unhealthy or unavailable) by any of its peers, the peer can propose a new round of election to elect itself as the new leader. The new leader won’t serve any writes or consistent reads until the previous leader’s lease expires.</p>
</li>
<li>
<p>Partitioning/Replication Group
<img loading="lazy" src="/images/notes/dynamodb-2022/image-3.png"></p>
</li>
<li>
<p><strong>Log Replica/Node</strong> - Write-Ahead Log (replicated) for High Availability and Durability.
<img loading="lazy" src="/images/notes/dynamodb-2022/image-4.png"></p>
</li>
<li>
<p>Multi-Paxos Leader Election and Consensus.</p>
</li>
<li>
<p>Writes and Strongly/Eventually Consistent Reads</p>
</li>
<li>
<p>Microservice architecture
<img loading="lazy" src="/images/notes/dynamodb-2022/image-5.png"></p>
</li>
<li>
<p><strong>Metadata Service</strong></p>
</li>
<li>
<p><strong>Request Router Service</strong></p>
</li>
<li>
<p><strong>Auto-Admin Service(Central Nervous System of DDB)</strong></p>
</li>
<li>
<p><strong>Storage Service</strong></p>
</li>
<li>
<p><strong>Features supported by other Services</strong></p>
</li>
</ul>
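<p>A sketch of the partition routing mentioned above: hash the partition key, then find the partition whose hash range owns the result. The hash function and range layout here are illustrative, not DynamoDB’s internal scheme.</p>
<pre><code class="language-python">import bisect
import hashlib

HASH_SPACE = 2 ** 64
# Four partitions, each owning a contiguous slice of the hash space.
PARTITIONS = [(i * HASH_SPACE // 4, "partition-%d" % i) for i in range(4)]
STARTS = [start for start, _ in PARTITIONS]

def key_hash(partition_key):
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def route(partition_key):
    i = bisect.bisect_right(STARTS, key_hash(partition_key)) - 1
    return PARTITIONS[i][1]

print(route("customer#42"))  # the same key always routes to the same partition
</code></pre>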
<h3 id="journey-from-provisioned-to-on-demand">Journey from Provisioned to On-Demand</h3>
<ul>
<li>
<p>DDB was launched with <strong>Partitions</strong> as an internal abstraction, as a way to dynamically scale both the <strong>capacity</strong> and <strong>performance</strong> of tables.</p>
</li>
<li>
<p>Customers explicitly specified the throughput that a table required in terms of read capacity units (RCUs) and write capacity units (WCUs). RCUs and WCUs collectively are called <strong>provisioned</strong> throughput.</p>
</li>
<li>
<p>As the demands from a table changed (because it grew in size or because the load increased), partitions could be further split and migrated to allow the table to scale elastically. <strong>Partition</strong> abstraction proved to be really valuable and continues to be central to the design of DynamoDB.</p>
</li>
<li>
<p><strong>[Challenge] This early version tightly coupled the assignment of both capacity and performance to individual partitions, which led to challenges</strong></p>
</li>
<li>
<p>DynamoDB uses <strong>admission control</strong> to ensure that storage nodes don’t become overloaded, to avoid interference between co-resident table partitions, and to enforce the throughput limits requested by customers.</p>
</li>
<li>
<p><strong>Admission control</strong> was the shared responsibility of all storage nodes for a table. Storage nodes independently performed admission control based on the allocations of their locally stored partitions.</p>
</li>
<li>
<p>Allocated throughput of each partition was used to isolate the workloads. DynamoDB enforced a cap on the maximum throughput that could be allocated to a single partition. Total throughput of all the partitions hosted by a storage node is less than or equal to the maximum allowed throughput on the node as determined by the physical characteristics of its storage drives.</p>
</li>
<li>
<p>The throughput allocated to partitions was adjusted when the overall table’s throughput was changed or its partitions were split into child partitions.</p>
</li>
<li>
<p>When a partition was split for size, the allocated throughput of the parent partition was equally divided among the child partitions and was allocated based on the table’s provisioned throughput.</p>
</li>
<li>
<p>E.g. assume that a partition can accommodate a maximum provisioned throughput of 1000 WCUs. When a table is created with 3200 WCUs, DynamoDB created four partitions that would each be allocated 800 WCUs. If the table’s provisioned throughput was increased to 3600 WCUs, then each partition’s capacity would increase to 900 WCUs. If the table’s provisioned throughput was increased to 6000 WCUs, then the partitions would be split to create eight child partitions, and each partition would be allocated 750 WCUs. If the table’s capacity was decreased to 5000 WCUs, then each partition’s capacity would be decreased to 625 WCUs (see the worked sketch after this list).</p>
</li>
<li>
<p>The uniform distribution of throughput across partitions is based on the assumptions that an application accesses keys in a table uniformly and that splitting a partition for size equally splits the performance.</p>
</li>
<li>
<p>However, it was discovered that <strong>application workloads frequently have non-uniform access patterns both over time and over key ranges</strong>.</p>
</li>
<li>
<p><strong>Hot Partition Worsening with Split</strong>: When the request rate within a table is non-uniform, splitting a partition and dividing performance allocation proportionately can result in the hot portion of the partition having less available performance than it did before the split.</p>
</li>
<li>
<p><strong>[Single Hot Partition]</strong> Since throughput was allocated statically and enforced at a partition level, these non-uniform workloads occasionally resulted in an application’s reads and writes being rejected (called throttling), even though the total provisioned throughput of the table was sufficient to meet its needs.
Common challenges faced by applications were:</p>
</li>
<li>
<p><strong>Hot Partition</strong></p>
</li>
<li>
<p><strong>Throughput Dilution.</strong></p>
</li>
<li>
<p>Customers would increase the provisioned throughput of the table(even if they were under the limit overall), which caused poor performance. It was difficult to estimate the correct provisioned throughput.</p>
</li>
<li>
<p><strong>Hot partitions</strong> and <strong>throughput dilution</strong> stemmed from tightly coupling a rigid performance allocation to each partition, and dividing that allocation as partitions split. <strong>Bursting</strong> and <strong>Adaptive Capacity</strong> were introduced to address these concerns.</p>
</li>
</ul>
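<p>A worked version of the WCU example above, with splits doubling the partition count whenever per-partition demand would exceed the 1000-WCU cap. The split policy is simplified for illustration; the numbers mirror the text.</p>
<pre><code class="language-python">MAX_WCU_PER_PARTITION = 1000

def partitions_needed(table_wcu, current_partitions):
    parts = current_partitions
    while table_wcu / parts &gt; MAX_WCU_PER_PARTITION:
        parts *= 2  # a split halves each partition and its allocation
    return parts

parts = 1
for wcu in (3200, 3600, 6000, 5000):
    parts = partitions_needed(wcu, parts)
    print("%d WCUs: %d partitions at %g WCUs each" % (wcu, parts, wcu / parts))
# 3200: 4 x 800, 3600: 4 x 900, 6000: 8 x 750, 5000: 8 x 625 (splits are not undone)
</code></pre>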
<h3 id="improvements-to-admission-control">Improvements to Admission Control:</h3>
<h3 id="key-observations">Key Observations:</h3>
<ul>
<li>Partitions had non-uniform access/traffic.</li>
<li>Not all partitions hosted by a storage node used their allocated throughput simultaneously.</li>
</ul>
<h3 id="bursting">Bursting</h3>
<ul>
<li>The idea behind <strong>Bursting</strong> was to let applications tap into the unused capacity at a partition level on a best effort basis to absorb short-lived spikes.</li>
<li>DynamoDB retained a portion of a partition’s unused capacity for later bursts of throughput usage for up to 300 seconds and utilized it when consumed capacity exceeded the provisioned capacity of the partition.</li>
<li>DynamoDB still maintained <strong>workload isolation</strong> by ensuring that a partition could only burst if there was <strong>unused throughput at the node level</strong>. The capacity was managed on the storage node using <strong>multiple token buckets</strong> to provide admission control:</li>
<li><strong>[Partition Token + Node Token]</strong> When a read or write request arrives on a storage node, if there were tokens in the partition’s allocated token bucket, then the request was admitted and tokens were deducted from both the partition and node-level buckets.</li>
<li><strong>[Burst Token + Node Token]</strong> Once a partition had exhausted all the provisioned tokens, requests were allowed to burst only when tokens were available both in the burst token bucket and the node level token bucket.</li>
<li>Read requests were accepted based on the local token buckets.</li>
<li><strong>[Replica node’s Token Bucket for Write]</strong> Write requests using burst capacity require an additional check on the node-level token bucket of other member replicas of the partition.</li>
<li>The leader replica of the partition periodically collected information about each member’s node-level capacity.</li>
</ul>
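<p>A sketch of the multi-level token-bucket check described above. Bucket names and refill rates are illustrative; the real system tracks per-partition, burst, and node-level capacity in RCUs/WCUs.</p>
<pre><code class="language-python">import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def try_take(self, n=1):
        now = time.monotonic()
        # Refill proportionally to elapsed time, up to the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens &gt;= n:
            self.tokens -= n
            return True
        return False

def admit(partition_bucket, burst_bucket, node_bucket):
    # Provisioned path: deduct from the partition bucket and the node bucket.
    if partition_bucket.try_take():
        return node_bucket.try_take()
    # Burst path: allowed only if both burst and node buckets have headroom.
    return burst_bucket.try_take() and node_bucket.try_take()
</code></pre>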
<h3 id="adaptive-capacity">Adaptive Capacity</h3>
<ul>
<li>DynamoDB launched adaptive capacity to better <strong>absorb long-lived spikes</strong> that cannot be absorbed by the burst capacity.</li>
<li>Better absorb work-loads that had heavily skewed access patterns across partitions.</li>
<li>Adaptive capacity actively monitored the provisioned and consumed capacity of all the tables.</li>
<li>If a table experienced throttling and the table level throughput was not exceeded, then it would automatically increase (boost) the allocated throughput of the partitions of the table using a <strong>proportional control algorithm.</strong></li>
<li>The <strong>autoadmin system</strong> ensured that partitions receiving a boost were relocated to an appropriate node that had the capacity to serve the increased throughput. Like bursting, adaptive capacity was <strong>best-effort</strong> but <strong>eliminated over 99.99%</strong> of the throttling due to skewed access patterns.</li>
</ul>
<h3 id="global-admission-control">Global Admission Control</h3>
<ul>
<li>Even though Bursting and Adaptive Capacity significantly reduced throughput problems for non-uniform access, they had <strong>limitations</strong>.</li>
<li>Takeaway from bursting and adaptive capacity was that we had <strong>tightly coupled partition level capacity</strong> to <strong>admission control</strong>.</li>
<li>Admission control was <strong>distributed and performed at a partition level.</strong></li>
<li>DynamoDB realized <strong>it would be beneficial to remove admission control from the partition</strong> and <strong>let the partition always burst while providing workload isolation</strong>.</li>
<li>DynamoDB replaced <strong>adaptive capacity</strong> with <strong>global admission control (GAC)</strong>.</li>
<li>GAC builds on the same idea of <strong>Token Bucket</strong>.</li>
<li>The GAC service <strong>centrally tracks the total consumption of the table</strong> capacity in terms of tokens.</li>
<li>Each request router maintains a <strong>local token bucket</strong> to make admission decisions and communicates with <strong>GAC to replenish tokens at regular intervals</strong> (in the order of a few seconds).</li>
<li><strong>[Important Design Consideration] Each GAC server can be stopped and restarted without any impact on the overall operation of the service</strong>.</li>
<li>Each GAC server can track one or more token buckets configured independently.</li>
<li>All the GAC servers are part of an independent hash ring.</li>
<li><strong>Request routers manage several time-limited tokens locally</strong>. When a request from the application arrives, the request router deducts tokens. Eventually, the request router will run out of tokens because of consumption or expiry. <strong>When the request router runs out of tokens, it requests more tokens from GAC.</strong></li>
<li>The GAC instance uses the information provided by the client to <strong>estimate the global token consumption and vends tokens available for the next time unit to the client’s share of overall tokens</strong>.</li>
<li>Thus, it ensures that non-uniform workloads that send traffic to only a subset of items can execute up to the maximum partition capacity.</li>
<li>In addition to the global admission control scheme, the partition-level token buckets were retained for defense in-depth. The capacity of these token buckets is then capped to ensure that one application doesn’t consume all or a significant share of the resources on the storage nodes.</li>
</ul>
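<p>A sketch of the request-router/GAC interaction described above. The token-vending heuristic and interface shape are guesses for illustration, not the actual service API.</p>
<pre><code class="language-python">class GacServer:
    def __init__(self, table_tokens_per_interval):
        self.budget = table_tokens_per_interval
        self.reported = {}  # maps router id to tokens consumed last interval

    def request_tokens(self, router_id, consumed):
        self.reported[router_id] = consumed
        total = sum(self.reported.values()) or 1
        # Vend a share of the table budget proportional to this router's
        # recently reported consumption (its estimated share of traffic).
        return max(1, int(self.budget * consumed / total))

class RequestRouter:
    def __init__(self, router_id, gac):
        self.router_id, self.gac = router_id, gac
        self.tokens, self.consumed = 1, 0

    def admit(self):
        if self.tokens == 0:  # out of local tokens: replenish from GAC
            self.tokens = self.gac.request_tokens(self.router_id, self.consumed)
            self.consumed = 0
        self.tokens -= 1
        self.consumed += 1
        return True
</code></pre>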
<h3 id="balancing-consumed-capacity">Balancing Consumed Capacity</h3>
<ul>
<li>Letting partitions always burst required DynamoDB to manage burst capacity effectively.</li>
<li><strong>Colocation</strong> was a straightforward problem with provisioned throughput tables because of static partitions.</li>
</ul>
<h3 id="splitting-for-consumption">Splitting for Consumption</h3>
<ul>
<li><strong>[Problem]</strong> Even with GAC and the ability for partitions to always burst, tables could experience throttling if their traffic was skewed to a specific set of items.</li>
<li><strong>[Solution]</strong></li>
<li>DynamoDB automatically scales out partitions once the consumed throughput of a partition crosses a certain threshold.</li>
<li>The split point in the key range is chosen based on key distribution the partition has observed.</li>
<li>The observed key distribution serves as a proxy for the application’s access pattern and is more effective than splitting the key range in the middle.</li>
<li>Partition splits usually complete in the order of minutes.</li>
<li><strong>[Catch]</strong> A class of workloads still exists that cannot benefit from splitting for consumption, e.g. a partition receiving high traffic to a single item, or a partition whose key range is accessed sequentially. In such cases, DDB avoids splitting the partition.</li>
</ul>
<h3 id="on-demand-provisioning">On Demand Provisioning</h3>
<ul>
<li><strong>[Context]</strong></li>
<li>Initially, applications that migrated to DDB ran on self-provisioned servers, either on-prem or as self-hosted databases.</li>
<li>DynamoDB provides a simplified serverless operational model and a new model for provisioning - read and write capacity units.</li>
<li><strong>[Problem]</strong></li>
<li>The concept of capacity units was new to customers, some found it challenging to forecast the provisioned throughput.</li>
<li>Customers either over-provisioned (low utilization) or under-provisioned (throttling).</li>
<li><strong>[Solution]</strong> To improve the customer experience for spiky workloads, DDB launched <strong>On-Demand Tables</strong>.</li>
<li>DynamoDB provisions the on-demand tables <strong>based on the consumed capacity</strong> by <strong>collecting the signal of reads and writes</strong> and instantly accommodates <strong>up to double the previous peak traffic</strong> on the table.</li>
<li>On-demand scales a table by splitting partitions for consumption. The split decision algorithm is based on traffic.</li>
<li>GAC allows DynamoDB to monitor and protect the system from one application consuming all the resources.</li>
</ul>
<h3 id="durability-and-correctness">Durability and Correctness</h3>
<ul>
<li>Data loss can occur because of hardware failures, software bugs, or hardware bugs.</li>
<li>DynamoDB is designed for high durability by having mechanisms to prevent, detect, and correct any potential data losses.</li>
</ul>
<h3 id="hardware-failures">Hardware Failures</h3>
<ul>
<li><strong>Write-ahead logs(WAL)</strong> in DynamoDB are central for providing durability and crash recovery. Write ahead logs are stored in all three replicas of a partition.</li>
<li>For higher durability, the write-ahead logs are periodically archived to S3, an object store that is designed for 11 nines (99.999999999%) of durability.</li>
<li>The unarchived logs are typically a few hundred megabytes in size.</li>
<li>When a node fails, all replication groups hosted on the node are down to two copies.</li>
<li>The process of healing a storage replica can take several minutes because the repair process involves copying the <strong>B-tree</strong> and <strong>write-ahead logs</strong>.</li>
<li><strong>[Solution]</strong> Upon detecting an unhealthy storage replica, the leader of a replication group adds a <strong>log replica</strong> to ensure there is no impact on durability.</li>
<li>Adding a log replica takes only a few seconds because the system has to copy only the recent write-ahead logs from a healthy replica to the new replica without the B-tree. Quick healing of impacted replication groups using log replicas ensures high durability of most recent writes.</li>
</ul>
<h3 id="silent-data-errors">Silent Data Errors</h3>
<ul>
<li><strong>[Problem]</strong> Some hardware failures can cause incorrect data to be stored. These errors can happen because of the storage media, CPU, or memory.</li>
<li>It&rsquo;s very difficult to detect these and they can happen anywhere in the system.</li>
<li><strong>[Solution]</strong> DynamoDB makes extensive use of <strong>checksums</strong> to detect silent errors.</li>
<li>By maintaining checksums within every log entry, message, and log file, DynamoDB validates data integrity for every data transfer between two nodes.</li>
<li><strong>Checksums</strong> serve as guardrails to prevent errors from spreading to the rest of the system.</li>
<li>Every log file that is archived to S3 has a <strong>manifest</strong> that contains information about the log, such as a table, partition and start and end markers for the data stored in the log file.</li>
<li>The agent responsible for archiving log files to S3 performs various checks before uploading the data. These include and are not limited to verification of every log entry to ensure that it belongs to the correct table and partition, verification of checksums to detect any silent errors, and verification that the log file doesn’t have any holes in the sequence numbers.</li>
<li>Once all the checks are passed, the log file and its manifest are archived. Log archival agents run on all three replicas of the replication group. If one of the agents finds that a log file is already archived, the agent downloads the uploaded file to verify the integrity of the data by comparing it with its local write-ahead log.</li>
<li>Every log file and manifest file are uploaded to <strong>S3 with a content checksum</strong>. The <strong>content checksum is checked by S3 as part of the put operation</strong>, which guards against any <strong>errors during data transit to S3.</strong></li>
</ul>
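<p>A sketch of the pre-archival checks described above: per-entry checksums, table/partition ownership, and gap-free sequence numbers. The log-entry layout is invented for illustration.</p>
<pre><code class="language-python">import zlib

def verify_log(entries, table, partition):
    prev_seq = None
    for e in entries:
        assert e["table"] == table and e["partition"] == partition, "wrong owner"
        assert zlib.crc32(e["payload"]) == e["crc"], "silent data error"
        assert prev_seq is None or e["seq"] == prev_seq + 1, "hole in sequence"
        prev_seq = e["seq"]
    return True  # safe to upload (with a content checksum) to S3

entry = {"table": "t1", "partition": "p0", "seq": 1,
         "payload": b"put k v", "crc": zlib.crc32(b"put k v")}
print(verify_log([entry], "t1", "p0"))
</code></pre>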
<h3 id="continuous-verification">Continuous Verification</h3>
<ul>
<li>DynamoDB also continuously verifies data at rest. Our goal is to detect any silent data errors or bit rot in the system. An example of such a continuous verification system is the <strong>scrub process</strong>.</li>
<li>The goal of <strong>scrub</strong> is to detect errors that we had not anticipated, such as <strong>bit rot</strong>.</li>
<li>The <strong>scrub process</strong> runs and verifies two things: that all replicas in a replication group have the same data, and that the live replicas’ data matches a copy of a replica built offline from the archived write-ahead log entries.</li>
<li>The verification is done by computing the checksum of the live replica and matching that with a snapshot of one generated from the log entries archived in S3.</li>
<li><strong>Scrub</strong> acts as a defense in depth <strong>to detect divergences</strong> between the <strong>live storage replicas</strong> with the <strong>replicas built using the history of logs</strong> from the inception of the table.</li>
<li>A similar technique of continuous verification is used to verify replicas of <strong>global tables.</strong></li>
<li>We have learned that continuous verification of data-at-rest is the most reliable method of protecting against hardware failures, silent data corruption, and even software bugs.</li>
</ul>
<h3 id="software-bugs">Software Bugs</h3>
<ul>
<li><strong>[Problem]</strong> DDB is a complex Distributed Key Value store. High complexity increases the probability of human error in design, code, and operations. Errors in the system could cause loss or corruption of data, or violate other interface contracts that our customers depend on.</li>
<li><strong>[Solution]</strong> DDB uses formal methods extensively to ensure the correctness of our replication protocols. The core replication protocol was specified using TLA+.</li>
<li>When new features that affect the replication protocol are added, they are incorporated into the specification and model checked.</li>
<li>Model checking has allowed us to catch subtle bugs that could have led to durability and correctness issues before the code went into production. S3 also uses Model Checking.</li>
<li><strong>Extensive failure injection testing</strong> and <strong>stress testing</strong> to ensure the correctness of every piece of software deployed.</li>
<li>In addition to testing and verification of the replication protocol of the data plane, <strong>formal methods have also been used to verify the correctness of our control plane and features such as distributed transactions</strong>.</li>
</ul>
<h3 id="backups-and-restore">Backups and Restore</h3>
<ul>
<li>In addition to guarding against physical media corruption, DynamoDB also supports backup and restore <strong>to protect against any logical corruption</strong> <strong>due to a bug in a customer’s application</strong>. Backups or restores don’t affect performance or availability of the table as they are built using the write-ahead logs that are archived in S3.</li>
<li>The backups are consistent across multiple partitions up to the nearest second.</li>
<li>The backups are full copies of DynamoDB tables and are stored in an Amazon S3 bucket.</li>
<li>DynamoDB also supports <strong>point-in-time restore</strong> where customers can <strong>restore the contents of a table that existed at any time in the previous 35 days to a different DynamoDB table in the same region.</strong></li>
<li>For tables with the point-in-time restore enabled, DynamoDB creates <strong>periodic</strong>(<strong>based on the amount of write-ahead logs accumulated for the partition</strong>) snapshots of the partitions that belong to the table and uploads them to S3.</li>
<li>Snapshots, in conjunction to write-ahead logs, are used to do point-in-time restore.</li>
<li><strong>[Workflow]</strong> When a point-in-time restore is requested for a table,</li>
</ul>
<h3 id="availability">Availability</h3>
<ul>
<li>To achieve high availability, DynamoDB tables are distributed and replicated across multiple Availability Zones (AZ) in a Region. DynamoDB regularly tests resilience to node, rack, and AZ failures.</li>
<li>To test the availability and durability of the overall service, power-off tests are exercised. Using realistic simulated traffic, random nodes are powered off using a job scheduler. At the end of all the power-off tests, the test tools verify that the data stored in the database is logically valid and not corrupted.</li>
</ul>
<h3 id="write-and-consistent-read-availability">Write and Consistent Read Availability</h3>
<ul>
<li>A partition’s <strong>write availability</strong> depends on its ability to have a <strong>healthy leader</strong> and a <strong>healthy write quorum.</strong></li>
<li>A <strong>healthy write quorum</strong> in the case of DynamoDB consists of two out of the three replicas from different AZs.</li>
<li>A partition remains available as long as there are enough healthy replicas for a write quorum and a leader.</li>
<li>A partition will become unavailable for writes if the number of replicas needed to achieve the minimum quorum is unavailable.</li>
<li>The leader replica serves <strong>consistent reads</strong>.</li>
<li>Introducing <strong>log replicas</strong> was a big change to the system, and the formally proven implementation of Paxos provided us the confidence to safely tweak and experiment with the system to achieve higher availability.</li>
<li>Eventually consistent reads can be served by any of the replicas.</li>
<li>In case a leader replica fails, other replicas detect its failure and elect a new leader to minimize disruptions to the availability of consistent reads.</li>
</ul>
<h3 id="failure-detection">Failure Detection</h3>
<ul>
<li><strong>[Problem]</strong> A newly elected leader will have to wait for the expiry of the old leader’s lease before serving any traffic. While this only takes a couple of seconds, the elected leader cannot accept any new writes or consistent read traffic during that period, thus disrupting availability.</li>
<li>Failure detection must be quick and robust to minimize disruptions. False positives in failure detection can lead to more disruptions in availability. Failure detection works well for failure scenarios where every replica of the group loses connection to the leader.</li>
<li>However, nodes can experience gray network failures(Gray Failure).</li>
<li>Gray network failures can happen because of communication issues between a leader and follower, issues with outbound or inbound communication of a node, or front-end routers facing communication issues with the leader even though the leader and followers can communicate with each other.</li>
<li>Gray failures can disrupt availability because there might be a false positive in failure detection or no failure detection</li>
<li>For example, a replica that isn’t receiving heartbeats from a leader will try to elect a new leader. This can disrupt availability.</li>
<li><strong>[Solution]</strong> To solve the availability problem caused by gray failures, a follower that wants to trigger a failover sends a message to other replicas in the replication group asking if they can communicate with the leader. If replicas respond with a healthy leader message, the follower drops its attempt to trigger a leader election. This change in the failure detection algorithm used by DynamoDB significantly minimized the number of false positives in the system, and hence the number of spurious leader elections.</li>
</ul>
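<p>A sketch of the peer-confirmation step from the last bullet. Message shapes are illustrative; the point is that a follower confirms the leader is unreachable from its peers too before disrupting the group.</p>
<pre><code class="language-python">class Peer:
    def __init__(self, reachable_nodes):
        self.reachable = set(reachable_nodes)

    def can_reach(self, node):
        return node in self.reachable

def should_trigger_election(peers, leader):
    """Called by a follower whose leader heartbeats have stopped."""
    for peer in peers:
        if peer.can_reach(leader):
            # A peer still sees a healthy leader: likely a gray failure
            # local to us, so drop the election attempt.
            return False
    return True  # nobody can reach the leader: propose an election

peers = [Peer({"leader"}), Peer(set())]
print(should_trigger_election(peers, "leader"))  # False
</code></pre>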
<h3 id="measuring-availability">Measuring Availability</h3>
<ul>
<li>DynamoDB is designed for <strong>99.999 percent (five 9s)</strong> availability for global tables and <strong>99.99 percent (four 9s)</strong> availability for regional tables.</li>
<li>To ensure these goals are being met, DynamoDB continuously monitors availability at the service and table levels. The tracked availability data is used to analyze customer-perceived availability trends and trigger alarms if customers see errors above a certain threshold. These customer-facing alarms (CFAs) report any availability-related problems so they can be mitigated proactively, either automatically or through operator intervention.</li>
<li>In addition to real time monitoring of availability, the system runs daily jobs that trigger aggregation to calculate aggregate availability metrics per customer.</li>
<li>DynamoDB also measures and alarms on availability observed on the client-side. There are <strong>two sets of clients</strong> used to measure the <strong>user-perceived availability</strong>.</li>
<li>Real application traffic allows us to reason about DynamoDB availability and latencies as seen by our customers and catch gray failures.</li>
</ul>
<h3 id="deployments">Deployments</h3>
<ul>
<li>Unlike a traditional relational database, <strong>DynamoDB takes care of deployments without the need for maintenance windows</strong> and without impacting the performance and availability that customers experience.</li>
<li>The rollback procedure is often missed in testing and can lead to customer impact. DynamoDB runs a suite of <strong>upgrade and downgrade tests</strong> at a component level before every deployment.</li>
<li><strong>[Problem]</strong> Deployments are not atomic in a distributed system. At any given time, there will be software running the old code on some nodes and new code on other parts of the fleet.</li>
<li>New software might introduce a new type of message or change the protocol in a way that old software in the system doesn’t understand.</li>
<li><strong>[Solution]</strong> DynamoDB handles these kinds of changes with <strong>read-write deployments</strong>. Read-write deployment is completed as a multi-step process.</li>
<li>The first step is to deploy the software to read the new message format or protocol. Once all the nodes can handle the new message, the software is updated to send new messages.</li>
<li><strong>Read-write</strong> deployments ensure that both types of messages can coexist in the system. Even in the case of rollbacks, the system can understand both old and new messages.</li>
<li><strong>[OneBox]</strong> Deployments are done on a small set of nodes before pushing them to the entire fleet of nodes. The strategy reduces the potential impact of faulty deployments.</li>
<li>[<strong>AutoRollback AlarmWatcher/ApprovalWorkflow</strong>] DynamoDB sets alarm thresholds on availability metrics. If error rates or latency exceed the threshold values during deployments, the system triggers automatic rollbacks.</li>
<li><strong>[Problem]</strong> Software deployments to storage nodes trigger <strong>leader failovers</strong>, which are designed to avoid any impact to availability.</li>
</ul>
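<p>A sketch of the read-first, write-second rollout described above: every node learns to parse the new message format (step 1) before any node starts sending it (step 2), so old and new messages coexist safely during deploys and rollbacks. Field names are illustrative.</p>
<pre><code class="language-python"># Step 1 ships decode() everywhere; only then is SEND_NEW flipped in step 2,
# so a rollback at any point still leaves every message readable.
SEND_NEW = False

def encode(payload, extras=None):
    if SEND_NEW:
        return {"v": 2, "payload": payload, "extras": extras or {}}
    return {"v": 1, "payload": payload}

def decode(msg):
    if msg.get("v") == 2:
        return msg["payload"], msg.get("extras", {})  # new format
    return msg["payload"], {}                         # old format, still accepted

print(decode(encode("hello")))
</code></pre>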
<h3 id="dependencies-on-external-services">Dependencies on External Services</h3>
<ul>
<li>To ensure high availability, all the services that DynamoDB depends on in the request path should be more highly available than DynamoDB.</li>
<li>Alternatively, DynamoDB should be able to continue to operate even when the services on which it depends are impaired.</li>
<li>Examples of services DynamoDB depends on for the request path include AWS Identity and Access Management Services (IAM), and AWS Key Management Service (AWS KMS) for tables encrypted using customer keys. DynamoDB uses IAM and AWS KMS to authenticate every customer request.</li>
<li>While these services are highly available, DynamoDB is designed to operate when these services are unavailable without sacrificing any of the security properties that these systems provide.</li>
<li>In the case of IAM and AWS KMS, DynamoDB employs a <strong>statically stable design</strong>, where the overall system keeps working even when a dependency becomes impaired.</li>
<li>Perhaps the system doesn’t see any updated information that its dependency was supposed to have delivered. However, everything before the dependency became impaired continues to work despite the impaired dependency.</li>
<li>DynamoDB caches results from IAM and AWS KMS in the request routers that perform the authentication of every request. DynamoDB periodically refreshes the cached results asynchronously.</li>
<li>If AWS IAM or KMS were to become unavailable, the routers would continue to use the cached results for a predetermined extended period.</li>
<li>Caches improve response times by removing the need to do an off-box call, which is especially valuable when the system is under high load (a minimal sketch of this pattern follows this list).</li>
</ul>
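<p>A minimal sketch of the statically stable caching pattern, assuming a hypothetical <code>fetch</code> call standing in for the off-box IAM/KMS request; the refresh period and extended TTL are illustrative parameters, not DynamoDB&rsquo;s real values:</p>
<pre><code class="language-python">import time

class StaticallyStableCache:
    def __init__(self, fetch, refresh_s=300, extended_ttl_s=86400):
        self.fetch = fetch              # off-box call, e.g., to IAM or KMS
        self.refresh_s = refresh_s      # normal refresh period
        self.extended_ttl_s = extended_ttl_s
        self.entries = {}               # key -&gt; (value, fetched_at)

    def get(self, key):
        now = time.time()
        value, fetched_at = self.entries.get(key, (None, 0.0))
        if now - fetched_at &lt; self.refresh_s:
            return value                # fresh enough: no off-box call at all
        try:
            value = self.fetch(key)     # done asynchronously in the real system
            self.entries[key] = (value, now)
        except Exception:
            # Dependency impaired: keep serving the cached result for a
            # predetermined extended period rather than failing requests.
            if value is None or now - fetched_at &gt; self.extended_ttl_s:
                raise
        return value
</code></pre>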
<h3 id="metadata-availability">Metadata Availability</h3>
<ul>
<li>One of the most important pieces of metadata the request routers need is the mapping between a table&rsquo;s primary keys and storage nodes.</li>
<li>[<strong>Metadata Storage</strong>] At launch, DynamoDB stored the metadata in DynamoDB itself.</li>
<li>[<strong>Routing Schema</strong>] This routing information consists of all the partitions for a table, the key range of each partition, and the storage nodes hosting the partition.</li>
<li>[<strong>Router Metadata Caching</strong>] When a router received a request for a table it had not seen before, it downloaded the routing information for the entire table and cached it locally. Since the configuration information about partition replicas rarely changes, the <strong>cache hit rate</strong> was approximately <strong>99.75 percent</strong> (a lookup sketch follows this list).</li>
</ul>
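<p>A hypothetical sketch of that lookup path: the routing information for a table is downloaded whole on first use and cached as sorted key ranges, so later requests resolve a key to a storage node with a local binary search (all names here are illustrative):</p>
<pre><code class="language-python">import bisect

class RoutingCache:
    def __init__(self, fetch_table_routes):
        self.fetch = fetch_table_routes  # returns [(partition_start_key, node), ...]
        self.routes = {}                 # table name to (start_keys, nodes)

    def storage_node(self, table, key):
        if table not in self.routes:     # first request for this table:
            pairs = sorted(self.fetch(table))  # download the whole mapping once
            self.routes[table] = ([s for s, _ in pairs], [n for _, n in pairs])
        starts, nodes = self.routes[table]
        i = bisect.bisect_right(starts, key) - 1  # partition covering this key
        return nodes[i]

cache = RoutingCache(lambda t: [("", "node-1"), ("m", "node-2")])
assert cache.storage_node("orders", "apple") == "node-1"
assert cache.storage_node("orders", "zulu") == "node-2"
</code></pre>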
<h3 id="dynamodb-limits">DynamoDB Limits</h3>
<ul>
<li><strong>Per-Partition Read and Write Capacity Units</strong> - <strong>Ref</strong></li>
<li><strong>1 MB limit</strong> on the size of data returned by a single <strong>Query</strong>, <strong>Scan</strong>, or <strong>GetItem</strong> operation.</li>
<li><strong>BatchGetItem</strong> operation can return up to <strong>16 MB</strong> of data - <strong>Ref</strong></li>
<li><strong>Item Size Limit</strong>: Ref</li>
<li><strong>Secondary Indexes</strong> - <strong>Ref</strong></li>
<li><strong>Transactions:</strong></li>
</ul>
<h3 id="microbenchmarks">MicroBenchmarks</h3>
<ul>
<li>To show that scale doesn&rsquo;t affect the latencies observed by applications, we ran YCSB [8] workloads of types A (50 percent reads and 50 percent updates) and B (95 percent reads and 5 percent updates).</li>
<li>Both benchmarks used a uniform key distribution and items of size 900 bytes.</li>
<li>The workloads were scaled from 100 thousand total operations per second to 1 million total operations per second.</li>
<li>The graph shows that, even at different throughputs, DynamoDB read latencies exhibit very little variance and stay essentially flat as the workload&rsquo;s throughput increases.
<img loading="lazy" src="/images/notes/dynamodb-2022/image-6.png"></li>
</ul>
<hr>
<p><strong>Paper Link:</strong> <a href="https://www.usenix.org/conference/atc22/presentation/elhemali">https://www.usenix.org/conference/atc22/presentation/elhemali</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
    <item>
      <title>BigTable</title>
      <link>https://www.sethihemant.com/notes/bigtable-2006/</link>
      <pubDate>Mon, 09 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/bigtable-2006/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf&#34;&gt;BigTable&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;bigtablewide-column-storage-system&#34;&gt;BigTable/Wide Column Storage System&lt;/h2&gt;
&lt;h3 id=&#34;goal&#34;&gt;Goal&lt;/h3&gt;
&lt;p&gt;Design a &lt;strong&gt;distributed&lt;/strong&gt; and &lt;strong&gt;scalable&lt;/strong&gt; system that can store a &lt;strong&gt;huge amount of semi-structured data&lt;/strong&gt;. The data will be indexed by a &lt;strong&gt;row key&lt;/strong&gt; where each row can have an &lt;strong&gt;unbounded number of columns&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id=&#34;what-is-bigtable&#34;&gt;What is BigTable&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;BigTable is a &lt;strong&gt;distributed&lt;/strong&gt; and massively &lt;strong&gt;scalable&lt;/strong&gt; &lt;strong&gt;wide-column store&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Designed to store huge sets of &lt;strong&gt;structured&lt;/strong&gt; data.&lt;/li&gt;
&lt;li&gt;Provides storage for very big tables (often in the &lt;strong&gt;terabyte&lt;/strong&gt; range)&lt;/li&gt;
&lt;li&gt;BigTable is a &lt;strong&gt;CP system&lt;/strong&gt;, i.e., it has &lt;strong&gt;strongly consistent&lt;/strong&gt; reads and writes.&lt;/li&gt;
&lt;li&gt;BigTable can be used as an input source or output destination for MapReduce.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;background&#34;&gt;Background&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Developed at Google in 2005 and used in dozens of Google services.&lt;/li&gt;
&lt;li&gt;Google couldn’t use external commercial databases because of its large scale services, and costs would have been too high. So they built an in-house solution, custom built for their use case and traffic patterns.&lt;/li&gt;
&lt;li&gt;BigTable is &lt;strong&gt;highly available(?? With consistency??)&lt;/strong&gt; and high-performing database that powers multiple applications across Google — where each application has different needs in terms of the size of data to be stored and latency with which results are expected.&lt;/li&gt;
&lt;li&gt;BigTable inspired various open source databases like Cassandra(borrow BigTable’s DataModel), HBase(Distributed Non-Relational Database) and HyperTable.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bigtable-usecases&#34;&gt;BigTable UseCases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Google built &lt;strong&gt;BigTable&lt;/strong&gt; to store large amounts of data and perform thousands of queries per second on that data.&lt;/li&gt;
&lt;li&gt;Examples of BigTable data are billions of URLs with many versions per page, petabytes of Google Earth data, and billions of users’ search data.&lt;/li&gt;
&lt;li&gt;BigTable is suitable to store &lt;strong&gt;large datasets&lt;/strong&gt; that are &lt;strong&gt;greater than one TB&lt;/strong&gt; where &lt;strong&gt;each row is less than 10MB&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Since BigTable &lt;strong&gt;does not provide ACID&lt;/strong&gt; properties or &lt;strong&gt;transaction support(Across Rows or Tables)&lt;/strong&gt;, OLTP &lt;strong&gt;applications&lt;/strong&gt; should not use &lt;strong&gt;BigTable&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Data should be structured in the form of key-value pairs or rows-columns.&lt;/li&gt;
&lt;li&gt;Non-structured data like images or movies should not be stored in BigTable.&lt;/li&gt;
&lt;li&gt;Google examples:&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BigTable&lt;/strong&gt; can be used to store the following types of data:&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;big-table-data-model&#34;&gt;Big Table Data Model&lt;/h3&gt;
&lt;h3 id=&#34;agenda&#34;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Rows&lt;/li&gt;
&lt;li&gt;Column families&lt;/li&gt;
&lt;li&gt;Columns&lt;/li&gt;
&lt;li&gt;Timestamps&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;details&#34;&gt;Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;BigTable can be characterized as a sparse, distributed, persistent, multidimensional, sorted map.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">BigTable</a></p>
<hr>
<h2 id="bigtablewide-column-storage-system">BigTable/Wide Column Storage System</h2>
<h3 id="goal">Goal</h3>
<p>Design a <strong>distributed</strong> and <strong>scalable</strong> system that can store a <strong>huge amount of semi-structured data</strong>. The data will be indexed by a <strong>row key</strong> where each row can have an <strong>unbounded number of columns</strong>.</p>
<h3 id="what-is-bigtable">What is BigTable</h3>
<ul>
<li>BigTable is a <strong>distributed</strong> and massively <strong>scalable</strong> <strong>wide-column store</strong>.</li>
<li>Designed to store huge sets of <strong>structured</strong> data.</li>
<li>Provides storage for very big tables (often in the <strong>terabyte</strong> range)</li>
<li>BigTable is a <strong>CP system</strong>, i.e., it has <strong>strongly consistent</strong> reads and writes.</li>
<li>BigTable can be used as an input source or output destination for MapReduce.</li>
</ul>
<h3 id="background">Background</h3>
<ul>
<li>Developed at Google in 2005 and used in dozens of Google services.</li>
<li>Google couldn’t use external commercial databases because of its large scale services, and costs would have been too high. So they built an in-house solution, custom built for their use case and traffic patterns.</li>
<li>BigTable is a high-performing and <strong>highly available</strong> database (with the caveat that, as a CP system, it can become unavailable, e.g., if Chubby is down for an extended period) that powers multiple applications across Google, each with different needs in terms of the size of data to be stored and the latency with which results are expected.</li>
<li>BigTable inspired various open-source databases like Cassandra (which borrows BigTable&rsquo;s data model), HBase (a distributed non-relational database), and Hypertable.</li>
</ul>
<h3 id="bigtable-usecases">BigTable UseCases</h3>
<ul>
<li>Google built <strong>BigTable</strong> to store large amounts of data and perform thousands of queries per second on that data.</li>
<li>Examples of BigTable data are billions of URLs with many versions per page, petabytes of Google Earth data, and billions of users’ search data.</li>
<li>BigTable is suitable to store <strong>large datasets</strong> that are <strong>greater than one TB</strong> where <strong>each row is less than 10MB</strong>.</li>
<li>Since BigTable <strong>does not provide full ACID</strong> properties or <strong>transaction support across rows or tables</strong> (only single-row atomicity), OLTP <strong>applications</strong> should not use <strong>BigTable</strong>.</li>
<li>Data should be structured in the form of key-value pairs or rows-columns.</li>
<li>Non-structured data like images or movies should not be stored in BigTable.</li>
<li>Google examples: web indexing, Google Earth, Google Finance, Google Analytics, and Personalized Search.</li>
<li><strong>BigTable</strong> can be used to store the following types of data: URLs and page contents (with many versions per page), per-user data, and geographic/satellite data.</li>
</ul>
<h3 id="big-table-data-model">Big Table Data Model</h3>
<h3 id="agenda">Agenda</h3>
<ul>
<li>Rows</li>
<li>Column families</li>
<li>Columns</li>
<li>Timestamps</li>
</ul>
<h3 id="details">Details</h3>
<ul>
<li>
<p>BigTable can be characterized as a sparse, distributed, persistent, multidimensional, sorted map.</p>
</li>
<li>
<p>Traditional DBs have a two-dimensional layout of the data, where each cell value is identified by the <strong>‘Row ID’</strong> and <strong>‘Column Name’</strong>.
<img loading="lazy" src="/images/notes/bigtable-2006/image-1.png"></p>
</li>
<li>
<p>BigTable has a <strong>four-dimensional data model</strong>. The four dimensions are the row key, column family, column name, and timestamp:
<img loading="lazy" src="/images/notes/bigtable-2006/image-2.png"></p>
</li>
<li>
<p>The data is indexed (or sorted) by <strong>row key</strong>, <strong>column key</strong>, and a <strong>timestamp</strong>. Therefore, to access a cell&rsquo;s contents, we need values for all of them (a toy sketch of this map follows this list).</p>
</li>
<li>
<p>If no timestamp is specified, BigTable retrieves the most recent version.
<img loading="lazy" src="/images/notes/bigtable-2006/image-3.png"></p>
</li>
</ul>
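<p>A toy sketch of that map, assuming nothing beyond what the notes describe: a sparse, sorted map from (row key, column key, timestamp) to value, where reads return the most recent version at or before the requested timestamp:</p>
<pre><code class="language-python">class ToyBigTableMap:
    """Sparse: only written cells are stored. Versioned: newest first."""
    def __init__(self):
        self.cells = {}  # (row, column) to a list of (timestamp, value)

    def write(self, row, column, ts, value):
        versions = self.cells.setdefault((row, column), [])
        versions.append((ts, value))
        versions.sort(reverse=True)          # keep the newest version first

    def read(self, row, column, ts=None):
        for t, v in self.cells.get((row, column), []):
            if ts is None or t &lt;= ts:        # no ts given: most recent wins
                return v
        return None

t = ToyBigTableMap()
t.write("com.cnn.www", "contents:", ts=3, value="html-v3")
t.write("com.cnn.www", "contents:", ts=5, value="html-v5")
assert t.read("com.cnn.www", "contents:") == "html-v5"
assert t.read("com.cnn.www", "contents:", ts=4) == "html-v3"
</code></pre>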
<h3 id="rows">Rows</h3>
<ul>
<li>Each <strong>row</strong> in the table is uniquely identified by an <strong>associated row key (internally represented as a string)</strong> that is an arbitrary string of up to <strong>64 kilobytes</strong> in size (although most keys are significantly smaller).</li>
<li>Every read or write of data under a single row is <strong>atomic</strong>.</li>
<li>Atomicity across rows is not guaranteed, e.g., when updating two rows, one might succeed, and the other might fail.</li>
<li>Each table’s data is only indexed by <strong>row key</strong>, <strong>column key</strong>, and <strong>timestamp</strong>. There are no secondary indices.</li>
<li>A <strong>column</strong> is a <strong>key-value</strong> pair where the key is represented as ‘<strong>column key’</strong> and the value as ‘<strong>column value</strong>.’</li>
</ul>
<h3 id="column-families">Column families</h3>
<ul>
<li>Column keys are grouped into sets called column families. All data stored in a column family is usually of the same type. This is for <strong>compression purposes</strong>.</li>
<li>The number of distinct column families in a table should be small (<strong>in the hundreds at most</strong>), and families <strong>should rarely change</strong> during operation.</li>
<li><strong>Access control</strong> as well as both <strong>disk</strong> and <strong>memory</strong> <strong>accounting</strong> are performed at the column-family level.</li>
<li>All rows have the same set of column families.</li>
<li>BigTable can retrieve data from the same column family efficiently.</li>
<li>Short Column family names are better as names are included in the data transfer.
<img loading="lazy" src="/images/notes/bigtable-2006/image-4.png"></li>
</ul>
<h3 id="columns">Columns</h3>
<ul>
<li>Columns are <strong>units</strong> within a column family.</li>
<li>A BigTable may have an <strong>unbounded number of columns</strong>.</li>
<li>New columns can be added on the fly.</li>
<li><strong>Short column names are better</strong> as names are <strong>passed</strong> in each <strong>data transfer</strong>, e.g., ColumnFamily:ColumnName =&gt; Work:Dept</li>
<li>BigTable is quite suitable for <strong>sparse data</strong> (empty columns are not stored).</li>
</ul>
<h3 id="timestamps">Timestamps</h3>
<ul>
<li>Each column cell can contain multiple versions of the content.</li>
<li>A <strong>64-bit timestamp</strong> identifies each version that either represents real time or a custom value assigned by the client.</li>
<li>While reading, if no timestamp is specified, BigTable returns the most recent version.</li>
<li>If the client specifies a timestamp, the latest version that is earlier than the specified timestamp is returned.</li>
<li>BigTable supports two <strong>per-column-family settings to garbage-collect cell versions automatically</strong>: keep only the last N versions, or keep only versions written recently enough (e.g., within the last seven days). A small sketch of both policies follows this list.</li>
</ul>
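<p>A small sketch of those two garbage-collection policies applied to one cell&rsquo;s version list (parameter names are illustrative):</p>
<pre><code class="language-python">def gc_versions(versions, max_versions=None, min_timestamp=None):
    """versions: list of (timestamp, value), newest first."""
    kept = versions
    if min_timestamp is not None:            # keep only new-enough versions
        kept = [(t, v) for t, v in kept if t &gt;= min_timestamp]
    if max_versions is not None:             # keep only the last N versions
        kept = kept[:max_versions]           # list is newest-first
    return kept

v = [(5, "e"), (4, "d"), (3, "c"), (1, "a")]
assert gc_versions(v, max_versions=2) == [(5, "e"), (4, "d")]
assert gc_versions(v, min_timestamp=3) == [(5, "e"), (4, "d"), (3, "c")]
</code></pre>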
<h3 id="system-apis">System APIs</h3>
<p>BigTable provides APIs for two types of operations:</p>
<ul>
<li>Metadata operations</li>
<li>Data operations</li>
</ul>
<h3 id="metadata-operations">Metadata operations</h3>
<ul>
<li>APIs for creating and deleting tables and column families.</li>
<li>Functions for changing cluster, table, and column family metadata, such as access control rights.</li>
</ul>
<h3 id="data-operations">Data operations</h3>
<ul>
<li>Clients can insert, modify, or delete values in BigTable.</li>
<li>Clients can also lookup values from individual rows or iterate over a subset of the data in a table.</li>
<li>BigTable supports <strong>single-row transactions(Single row atomic read/writes)</strong>, which can be used to perform <strong>atomic read-modify-write</strong> sequences on data stored under a single row key.</li>
<li>Bigtable <strong>does not support transactions across row keys</strong>, but provides a client interface for <strong>batch writing</strong> across row keys (a hypothetical usage sketch follows this list).</li>
<li>BigTable allows <strong>cells</strong> to be used as integer counters.</li>
<li>A set of wrappers allow a <strong>BigTable</strong> to be used both as an <strong>input source</strong> and as an <strong>output target</strong> for MapReduce jobs.</li>
<li>Clients can also write scripts in <strong>Sawzall</strong> (a language developed at Google) to perform server-side data processing (transform, filter, aggregate) prior to the network fetch.</li>
<li>APIs for write operations:</li>
<li>A read or scan operation can read arbitrary cells in a BigTable:</li>
</ul>
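<p>A hedged usage sketch of a single-row atomic read-modify-write; the transaction object below is a hypothetical, dict-backed stand-in for whatever the real client library provides, included only so the example runs:</p>
<pre><code class="language-python">class RowTxn:
    """Toy stand-in for a single-row transaction: buffered, applied atomically."""
    def __init__(self, store, row_key):
        self.store, self.row_key, self.pending = store, row_key, {}

    def get(self, column, default=None):
        return self.store.get(self.row_key, {}).get(column, default)

    def set(self, column, value):
        self.pending[column] = value         # buffered until commit

    def commit(self):
        # One row key, so this can be applied atomically by its Tablet server.
        self.store.setdefault(self.row_key, {}).update(self.pending)

store = {}
txn = RowTxn(store, "user#42")
views = int(txn.get("stats:views", default="0"))
txn.set("stats:views", str(views + 1))       # read-modify-write on one row
txn.commit()
assert store["user#42"]["stats:views"] == "1"
# Cross-row work needs batch writes instead, which are NOT atomic as a group.
</code></pre>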
<h3 id="partitioning-and-high-level-architecture">Partitioning and High Level Architecture</h3>
<h3 id="table-partitioning">Table Partitioning</h3>
<ul>
<li>A single instance of a BigTable implementation is known as a cluster.</li>
<li>Each cluster can store a number of tables where <strong>each table is split into multiple Tablets</strong>, each around <strong>100–200 MB</strong> in size.</li>
<li>Tables are broken into <strong>Tablets (split at row boundaries)</strong>, each holding a <strong>contiguous range</strong> of rows.</li>
<li>Initially, each table consists of only one Tablet. As the table grows, multiple Tablets are created. By default, a table is split at around 100 to 200 MB.</li>
<li><strong>Tablets</strong> are the unit of distribution and load balancing.</li>
<li>Since the table is sorted by row, reads of short ranges of rows (touching only a small number of Tablets) are always efficient. This means selecting a row key with a high degree of locality is very important.</li>
<li>Each Tablet is assigned to a Tablet server, which manages all read/write requests of that Tablet.</li>
</ul>
<h3 id="high-level-architecture">High Level Architecture</h3>
<p>Big Table cluster consists of 3 major components:</p>
<ul>
<li>
<p><strong>Client Library</strong>: Application talks to BigTable using client library.</p>
</li>
<li>
<p><strong>One master server</strong>: For doing metadata operations, managing Tablets and assigning Tablets to Tablet servers.</p>
</li>
<li>
<p><strong>Many Tablet servers</strong>: Each Tablet server serves reads and writes of the data for the Tablets it is assigned.</p>
</li>
</ul>
<p>BigTable is built on top of several <strong>other pieces of Google infrastructure</strong>:</p>
<ul>
<li>
<p><strong>GFS</strong>: BigTable uses the Google File System to store its data and log files.</p>
</li>
<li>
<p><strong>SSTable</strong>: Google’s Sorted String Table file format is used to store BigTable data.</p>
</li>
<li>
<p><strong>Chubby</strong>: BigTable uses a highly available and persistent distributed lock service called Chubby to handle synchronization issues and store configuration information.</p>
</li>
<li>
<p><strong>Cluster Scheduling System</strong>: Google has a cluster management system that schedules, monitors, and manages the BigTable cluster.
<img loading="lazy" src="/images/notes/bigtable-2006/image-5.png"></p>
</li>
</ul>
<h3 id="sstables">SSTables</h3>
<h3 id="how-are-tablets-stored-in-gfs">How are Tablets stored in GFS?</h3>
<ul>
<li>
<p>BigTable uses Google File System (GFS), a persistent distributed file storage system to store data as files.</p>
</li>
<li>
<p>The file format used by BigTable to store its files is called <strong>SSTable</strong>.</p>
</li>
<li>
<p><strong>SSTables</strong> are persisted, <strong>ordered maps of keys to values</strong>, where both keys and values are arbitrary byte strings.</p>
</li>
<li>
<p>Each Tablet is stored in GFS as a sequence of files called SSTables.</p>
</li>
<li>
<p>An SSTable consists of a sequence of <strong>data blocks</strong> (typically 64KB in size).
<img loading="lazy" src="/images/notes/bigtable-2006/image-6.png"></p>
</li>
<li>
<p>A <strong>block index</strong> is used to locate blocks; the index is loaded into memory when the SSTable is opened.
<img loading="lazy" src="/images/notes/bigtable-2006/image-7.png"></p>
</li>
<li>
<p>An SSTable lookup can be performed with a <strong>single disk seek</strong>: we first find the appropriate block by performing a binary search in the in-memory index, and then read that block from disk (see the sketch after this list).</p>
</li>
<li>
<p>To read data from an <strong>SSTable</strong>, it can either be copied from disk to memory as a whole or can be done via just the index. The former approach avoids subsequent disk seeks for lookups, while the latter requires a single disk seek for each lookup.</p>
</li>
<li>
<p><strong>SSTables provide two operations:</strong> look up the value associated with a key, and iterate over the key/value pairs in a specified key range.</p>
</li>
<li>
<p>SSTable is immutable once written to GFS. If new data is added, a new SSTable is created. Once an old SSTable is no longer needed, it is set out for garbage collection.</p>
</li>
<li>
<p>SSTable <strong>immutability</strong> is at the core of BigTable’s <strong>data checkpointing</strong> and <strong>recovery routines</strong>.</p>
</li>
<li>
<p>Advantages of SSTable&rsquo;s immutability: reads need no synchronization with writes, removing deleted data becomes a matter of garbage-collecting obsolete SSTables, and Tablet splits are fast because child Tablets can share the parent&rsquo;s SSTables.</p>
</li>
</ul>
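<p>A minimal sketch of that lookup under the description above: a binary search in the in-memory block index, then a single block read (the one disk seek); <code>read_block</code> is a stand-in for the disk/GFS read:</p>
<pre><code class="language-python">import bisect

class SSTableReader:
    def __init__(self, block_index, read_block):
        # block_index: sorted (first_key_in_block, block_offset) pairs,
        # loaded into memory when the SSTable is opened.
        self.first_keys = [k for k, _ in block_index]
        self.offsets = [o for _, o in block_index]
        self.read_block = read_block         # the single disk seek

    def get(self, key):
        i = bisect.bisect_right(self.first_keys, key) - 1
        if i &lt; 0:
            return None                      # key sorts before every block
        block = self.read_block(self.offsets[i])  # one block, as a dict here
        return block.get(key)

blocks = {0: {"apple": 1, "bat": 2}, 1: {"mango": 3}}
sst = SSTableReader([("apple", 0), ("mango", 1)], read_block=blocks.get)
assert sst.get("bat") == 2 and sst.get("zebra") is None
</code></pre>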
<h3 id="table-vs-tablet-vs-sstable">Table vs Tablet vs SSTable</h3>
<ul>
<li>
<p>Multiple Tablets make up a table.</p>
</li>
<li>
<p>SSTables can be shared by multiple Tablets. <strong>[Why?]</strong> Because SSTables are immutable, a Tablet split can let both child Tablets safely reference the parent&rsquo;s SSTables instead of copying data.</p>
</li>
<li>
<p>Tablets do not overlap; SSTables can overlap.
<img loading="lazy" src="/images/notes/bigtable-2006/image-8.png"></p>
</li>
<li>
<p>To improve performance, BigTable uses an <strong>in-memory</strong>, <strong>mutable sorted buffer</strong> called <strong>MemTable</strong> to store recent updates.</p>
</li>
<li>
<p>As more writes are performed, MemTable size increases, and when it reaches a threshold, the MemTable is frozen, a new MemTable is created, and the frozen MemTable is converted to an SSTable and written to GFS.</p>
</li>
<li>
<p>Each data update is also written to a <strong>commit log (a write-ahead log, WAL)</strong>, which is also stored in GFS. This log contains redo records used for recovery if a Tablet server fails before a MemTable has been committed to an SSTable (a sketch of this write path follows this list).</p>
</li>
<li>
<p>While reading, the data can be in <strong>MemTables</strong> or <strong>SSTables</strong>. Since both these tables are sorted, it is easy to find the most recent data.
<img loading="lazy" src="/images/notes/bigtable-2006/image-9.png"></p>
</li>
</ul>
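<p>A sketch of this write path as described above; the size threshold and the list-backed WAL/SSTable stand-ins are illustrative:</p>
<pre><code class="language-python">class TabletWriter:
    def __init__(self, wal, flush_sstable, threshold=2):
        self.wal = wal                       # commit log, GFS-backed in reality
        self.flush_sstable = flush_sstable   # writes one immutable SSTable
        self.memtable = {}                   # in-memory buffer, sorted on flush
        self.threshold = threshold

    def write(self, key, value):
        self.wal.append((key, value))        # redo record first, for recovery
        self.memtable[key] = value           # then the in-memory MemTable
        if len(self.memtable) &gt;= self.threshold:
            frozen, self.memtable = self.memtable, {}    # freeze; new MemTable
            self.flush_sstable(sorted(frozen.items()))   # write sorted SSTable

wal, sstables = [], []
w = TabletWriter(wal, sstables.append, threshold=2)
w.write("b", 1)
w.write("a", 2)
assert sstables == [[("a", 2), ("b", 1)]] and len(wal) == 2
</code></pre>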
<h3 id="gfs-and-chubby">GFS and Chubby</h3>
<h3 id="gfs">GFS</h3>
<ul>
<li>GFS files are broken down into fixed-size blocks called chunks.</li>
<li>SSTables are divided into fixed-size blocks and these blocks are stored on the chunk servers. Each Chunk is replicated across multiple chunk servers for reliability.</li>
<li>Clients interact with the master for metadata, and with chunk servers directly for SSTable data files.
<img loading="lazy" src="/images/notes/bigtable-2006/image-10.png"></li>
</ul>
<h3 id="chubby">Chubby</h3>
<p><strong>Chubby Recap:</strong></p>
<ul>
<li>
<p>Chubby is a highly available and persistent distributed locking service.</p>
</li>
<li>
<p>Chubby usually runs with five active replicas, one of which is elected as the master to serve requests. To remain alive, a majority of Chubby replicas must be running.</p>
</li>
<li>
<p>BigTable depends on Chubby so much that if Chubby is unavailable for an extended period of time, BigTable will also become unavailable.</p>
</li>
<li>
<p>Chubby uses the Paxos algorithm to keep its replicas consistent in the face of failure.</p>
</li>
<li>
<p>Chubby provides a namespace consisting of files and directories. Each file or directory can be used as a lock. Read and write access to a Chubby file is atomic.</p>
</li>
</ul>
<p><strong>In BigTable, Chubby is used to:</strong></p>
<ul>
<li>
<p>Allow a <strong>multi-thousand-node BigTable cluster</strong> to stay <strong>coordinated</strong>.</p>
</li>
<li>
<p>Ensure there is only one active master. The master maintains a session lease with Chubby and periodically renews it to retain the status of the master.</p>
</li>
<li>
<p>Store the bootstrap location of BigTable data.</p>
</li>
<li>
<p>Discover new Tablet servers as well as the failure of existing ones.</p>
</li>
<li>
<p>Store BigTable schema information (the column family information for each table)</p>
</li>
<li>
<p>Store Access Control Lists (ACLs).
<img loading="lazy" src="/images/notes/bigtable-2006/image-11.png"></p>
</li>
</ul>
<h3 id="bigtable-components">BigTable Components</h3>
<p>A BigTable cluster consists of three major components:</p>
<ul>
<li>A library component that is linked into every client.</li>
<li>One master server.</li>
<li>Many Tablet servers.
<img loading="lazy" src="/images/notes/bigtable-2006/image-12.png"></li>
</ul>
<h3 id="bigtable-master-server">BigTable Master Server</h3>
<p>There is <strong>only one master server</strong> in a BigTable cluster, and it is responsible for:</p>
<ul>
<li>Assigning Tablets to Tablet servers and ensuring effective load balancing.</li>
<li>Monitoring the status of Tablet servers and managing the joining or failure of Tablet Servers.</li>
<li>Garbage collection of the underlying files stored in GFS</li>
<li>Handling metadata operations such as table and column family creations.</li>
<li>Bigtable master is not involved in the core task of mapping tablets onto the underlying files in GFS (Tablet servers handle this).</li>
<li>This means that most BigTable clients never have to communicate with the master at all: they locate Tablets via Chubby and the METADATA table (see &ldquo;Locating Tablets&rdquo; below) and then read and write data directly through the Tablet servers.</li>
<li>This design decision significantly reduces the load on the master and the possibility of the master becoming a bottleneck.</li>
</ul>
<h3 id="tablet-server">Tablet Server</h3>
<ul>
<li>Each Tablet server is assigned ownership of a number of Tablets (typically 10-1000 Tablets per server) by the master server.</li>
<li>Each Tablet server serves read and write requests of the data of the Tablets it is assigned.</li>
<li>The client communicates directly with the Tablet servers for reads/writes.</li>
<li>Tablet servers can be added or removed dynamically from a cluster to accommodate changes in the workloads.</li>
<li>Tablet creation, deletion, or merging is initiated by the master server, while <strong>Tablet splitting (when a Tablet grows too large)</strong> is handled by the Tablet servers, which notify the master.</li>
</ul>
<h3 id="working-with-tablets">Working with Tablets</h3>
<h3 id="agenda-1">Agenda</h3>
<ul>
<li>Locating Tablets</li>
<li>Assigning Tablets</li>
<li>Monitoring Tablet Servers</li>
<li>Load-balancing Tablet servers</li>
</ul>
<h3 id="locating-tablets">Locating Tablets</h3>
<ul>
<li>
<p>Since Tablets move around from server to server (due to load balancing, Tablet server failures, etc.), <strong>given a row, how do we find the correct Tablet server?</strong></p>
</li>
<li>
<p>To answer this, we need to find the <strong>Tablet whose row range covers the target row</strong>.</p>
</li>
<li>
<p>BigTable maintains a <strong>3-level hierarchy</strong>, analogous to that of a <strong>B+ tree</strong>, to store Tablet location information.</p>
</li>
<li>
<p>BigTable creates a special table, called <strong>Metadata table</strong>, to store Tablet locations.</p>
</li>
<li>
<p>This <strong>Metadata table</strong> contains <strong>one row per Tablet</strong> that tells us which Tablet server is serving this Tablet.</p>
</li>
<li>
<p>Each row in the METADATA table stores a Tablet’s location under a row key that is an encoding of the Tablet’s table identifier and its end row.
<img loading="lazy" src="/images/notes/bigtable-2006/image-13.png"></p>
</li>
<li>
<p>BigTable stores the information about the Metadata table in two parts: a single <strong>root (Meta-0) Tablet</strong>, whose location is kept in Chubby, and the remaining <strong>METADATA Tablets</strong>, whose locations the root Tablet holds:
<img loading="lazy" src="/images/notes/bigtable-2006/image-14.png"></p>
</li>
<li>
<p>A BigTable client seeking the location of a Tablet starts the search by looking up a particular file in Chubby that is known to hold the location of the Meta-0 Tablet.</p>
</li>
<li>
<p>This Meta-0 Tablet contains information about other metadata Tablets, which in turn contain the location of the actual data Tablets.</p>
</li>
<li>
<p>With this scheme, the depth of the tree is limited to <strong>three</strong>. For efficiency, the client library caches Tablet locations and also prefetches metadata associated with other Tablets whenever it reads the METADATA table (a lookup sketch follows the figures below).
<img loading="lazy" src="/images/notes/bigtable-2006/image-15.png"></p>
</li>
</ul>
<p><img loading="lazy" src="/images/notes/bigtable-2006/image-16.png"></p>
<h3 id="assigning-tablets">Assigning Tablets</h3>
<ul>
<li>A Tablet is assigned to only one Tablet server at any time.</li>
<li>The master keeps track of the set of live Tablet servers and the mapping of Tablets to Tablet servers.</li>
<li>The master also keeps track of any unassigned Tablets and assigns them to Tablet servers with sufficient room.</li>
<li>When a Tablet server starts, it creates and acquires an exclusive lock on a uniquely named file in Chubby’s “servers” directory. This mechanism is used to tell the master that the Tablet server is alive.</li>
<li>During <strong>master restarts (or startup)</strong>, the following things happen: the new master acquires the unique master lock in Chubby, scans the servers directory to find live Tablet servers, asks each Tablet server which Tablets it is serving, and scans the METADATA table to detect unassigned Tablets.</li>
</ul>
<h3 id="monitoring-tablet-serverstablet-failures-or-network-partitions">Monitoring Tablet servers(Tablet Failures or Network Partitions)</h3>
<ul>
<li>BigTable maintains a ‘Servers’ directory in Chubby, which contains one file for each live Tablet server.</li>
<li>Whenever a new <strong>Tablet server comes online</strong>, it creates a new file in this directory to signal its availability and obtains an exclusive lock on this file. As long as a Tablet server retains the lock on its Chubby file, it is considered alive.</li>
<li>BigTable’s master keeps monitoring the ‘Servers’ directory, and whenever it sees a new file in this directory, it knows that a new Tablet server has become available and is ready to be assigned Tablets.</li>
<li>The master regularly checks the status of the lock. If the lock is lost, the master assumes that there is a problem either with the Tablet server or with Chubby.</li>
<li>In such a case, the master tries to acquire the lock, and if it succeeds, it concludes that Chubby is working fine and the Tablet server is having problems.</li>
<li>The master, in this case, <strong>deletes the Tablet server&rsquo;s Chubby lock file</strong> and <strong>reassigns</strong> the Tablets of the failing Tablet server.</li>
<li>The <strong>deletion of the file works as a signal for the failing Tablet server</strong> to <strong>terminate itself</strong> and <strong>stop serving the Tablets</strong>.</li>
<li>A Tablet server that has merely lost its lock (e.g., due to a temporary network problem) tries to acquire it again; if it succeeds, it starts serving its Tablets again.</li>
<li>If the file has been deleted, the Tablet server terminates itself to start afresh.</li>
</ul>
<h3 id="load-balancing-tablet-servers">Load-balancing Tablet servers</h3>
<ul>
<li>Master periodically asks Tablet servers about their current load. All this information gives the master a global view of the cluster and helps assign and load-balance Tablets.</li>
</ul>
<h3 id="life-of-bigtables-read-and-write-operations">Life of BigTables Read and Write Operations</h3>
<h3 id="write-request">Write Request</h3>
<p>Upon receiving a write request, the Tablet server performs the following steps:</p>
<ul>
<li><strong>Validate the request</strong> to ensure it is well formed.</li>
<li>Check that the sender is <strong>authorized</strong> to perform the mutation, <strong>using ACLs stored in Chubby</strong>.</li>
<li>If authorized, the mutation is written to the <strong>commit log</strong> in GFS, which stores redo records.</li>
<li>Once committed to the commit log, the request contents are stored in memory in a sorted buffer called the <strong>MemTable</strong>.</li>
<li>After inserting the data into the MemTable, a success acknowledgement is sent to the client.</li>
<li>Periodically, MemTables are flushed to SSTables, and SSTables are merged in the background using compaction.
<img loading="lazy" src="/images/notes/bigtable-2006/image-17.png"></li>
</ul>
<h3 id="read-request">Read Request</h3>
<p>Upon receiving a read request, the Tablet server performs the following steps:</p>
<ul>
<li>Validate that the request is well formed and that the sender is authorized.</li>
<li>Return the rows if they are available in the cache.</li>
<li>Read the MemTable to find the required rows.</li>
<li>Read the SSTable indexes (loaded in memory) to find the SSTables that may hold the required data, then read those rows from the SSTables.</li>
<li>Merge the rows read from the MemTable and the SSTables to find the required version of the data.</li>
<li>Since the MemTable and SSTables are sorted, the merged view can be formed efficiently (a merge sketch follows this list).
<img loading="lazy" src="/images/notes/bigtable-2006/image-18.png"></li>
</ul>
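<p>A sketch of forming that merged view, with the MemTable and SSTables simplified to dicts ordered newest-source-first so the freshest version of each row wins:</p>
<pre><code class="language-python">import heapq

def merged_scan(sources):
    """sources: [MemTable, newest SSTable, ..., oldest SSTable].
    Each source is sorted before merging; a lower source index shadows
    the same key in older sources."""
    tagged = (((key, i, val) for key, val in sorted(src.items()))
              for i, src in enumerate(sources))
    seen, out = set(), []
    for key, _, val in heapq.merge(*tagged):   # single pass over sorted runs
        if key not in seen:
            seen.add(key)
            out.append((key, val))
    return out

memtable = {"a": "new"}
sstables = [{"a": "old", "b": "x"}]
assert merged_scan([memtable] + sstables) == [("a", "new"), ("b", "x")]
</code></pre>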
<h3 id="fault-tolerance-and-compaction">Fault Tolerance and Compaction</h3>
<h3 id="agenda-2">Agenda</h3>
<ul>
<li>Fault tolerance and replication</li>
<li>Compaction</li>
</ul>
<h3 id="fault-tolerance-and-replication">Fault tolerance and replication</h3>
<h3 id="fault-tolerance-in-chubby-and-gfs">Fault tolerance in Chubby and GFS</h3>
<ul>
<li>Both systems employ replication for fault tolerance and high availability: Chubby&rsquo;s replication minimizes its downtime, and GFS&rsquo;s replication keeps multiple copies of the data to avoid data loss.</li>
</ul>
<h3 id="fault-tolerance-for-tablet-server">Fault tolerance for Tablet server</h3>
<ul>
<li>BigTable’s master is responsible for monitoring the Tablet servers.</li>
<li>The master does this by periodically checking the status of the Chubby lock against each Tablet server.</li>
<li>When the master finds out that a Tablet server has gone dead, it reassigns the tablets of the failing Tablet server.</li>
</ul>
<h3 id="fault-tolerance-for-the-master">Fault tolerance for the Master</h3>
<ul>
<li>The master acquires a lock in a Chubby file and maintains a lease.</li>
<li>If, at any time, the master’s lease expires, it kills itself.</li>
<li>When Google’s Cluster Management System finds out that there is no active master, it starts one up.</li>
<li>The new master has to acquire the lock on the Chubby file before acting as the master.</li>
</ul>
<h3 id="compaction">Compaction</h3>
<p>Mutations in BigTable take up extra space until compaction happens. BigTable manages compaction behind the scenes. The types of compaction are:</p>
<ul>
<li><strong>Minor compaction</strong> (the MemTable is written out as an SSTable)</li>
<li><strong>Merging compaction</strong> (some SSTables plus the MemTable are compacted into a larger SSTable)</li>
<li><strong>Major compaction</strong> (all SSTables are rewritten into a single SSTable)
<img loading="lazy" src="/images/notes/bigtable-2006/image-19.png"></li>
</ul>
<h3 id="bigtable-refinements">BigTable refinements</h3>
<p>BigTable implemented certain refinements to achieve high performance, availability, and reliability.</p>
<h3 id="agenda-3">Agenda</h3>
<ul>
<li>Locality groups</li>
<li>Compression</li>
<li>Caching</li>
<li>Bloom Filters</li>
<li>Unified commit Log</li>
<li>Speeding up Tablet recovery</li>
</ul>
<h3 id="locality-groups">Locality groups</h3>
<ul>
<li>BigTable uses column-oriented storage.</li>
<li>Clients can club together multiple column families into a locality group.</li>
<li>BigTable generates separate SSTables for each locality group.</li>
<li>This has a few benefits:
<img loading="lazy" src="/images/notes/bigtable-2006/image-20.png"></li>
</ul>
<h3 id="compression">Compression</h3>
<ul>
<li>Clients can choose to compress the SSTable for a locality group to save space.</li>
<li>BigTable allows its clients to choose compression techniques based on their application requirements.</li>
<li>The compression ratio gets even better when multiple versions of the same data are stored.</li>
<li>Compression is applied to each SSTable block separately.</li>
</ul>
<h3 id="caching">Caching</h3>
<ul>
<li>To improve read performance, Tablet servers employ two levels of caching: a <strong>Scan Cache</strong> that caches the key-value pairs returned by the SSTable interface, and a <strong>Block Cache</strong> that caches SSTable blocks read from GFS.</li>
</ul>
<h3 id="bloom-filters">Bloom Filters</h3>
<ul>
<li>Any read operation has to read from all SSTables that make up a Tablet.</li>
<li>These SSTables are not in memory, thus the read operation needs to do many disk accesses. To reduce the number of disk accesses BigTable uses <strong>Bloom Filters</strong>.</li>
<li>Bloom Filters are created for SSTables (particularly for the locality groups).</li>
<li>They help to reduce the number of disk accesses by <strong>predicting if an SSTable does “not” contain data corresponding to a particular (row, column) pair</strong>.</li>
<li>Bloom filters take a small amount of memory but can improve read performance drastically (a minimal sketch follows this list).</li>
</ul>
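<p>A minimal Bloom filter sketch for the &ldquo;definitely not in this SSTable&rdquo; check described above; the bit-array size and hash scheme are illustrative:</p>
<pre><code class="language-python">import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        for i in range(self.k):              # k independent hash positions
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 &lt;&lt; p

    def might_contain(self, key):
        # False: key is definitely absent, so skip the disk access entirely.
        # True: key is maybe present (false positives possible), so go read.
        return all(self.bits &gt;&gt; p &amp; 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("row42/contents:")
assert bf.might_contain("row42/contents:")   # no false negatives, ever
</code></pre>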
<h3 id="unified-commit-log">Unified commit Log</h3>
<ul>
<li>Instead of maintaining separate commit log files for each Tablet, BigTable maintains one log file per Tablet server. This gives better write performance.</li>
<li>Since each write has to go to the commit log, writing to a large number of log files would be slow as it could cause a large number of disk seeks.</li>
<li>One disadvantage of having a single log file is that it complicates the Tablet recovery process.</li>
<li>When a Tablet server dies, the Tablets that it served will be moved to other Tablet servers.</li>
<li>To recover the state for a Tablet, the new Tablet server needs to reapply the mutations for that Tablet from the commit log written by the original Tablet server.</li>
<li>However, the mutations for these Tablets were co-mingled in the same physical log file. One approach would be for each new Tablet server to read this full commit log file and apply just the entries needed for the Tablets it needs to recover.</li>
<li>However, under such a scheme, if 100 machines were each assigned a single Tablet from a failed Tablet server, then the log file would be read 100 times.</li>
<li>BigTable avoids duplicating log reads by first sorting the commit log entries in order of the keys &lt;table, row name, log sequence number&gt;.</li>
<li>In the sorted output, all mutations for a particular Tablet are contiguous and can therefore be read efficiently.</li>
<li>To further improve the performance, each Tablet server maintains two log writing threads — each writing to its own and separate log file.</li>
<li>Only one of the threads is active at a time. If one of the threads is performing poorly (say, due to network congestion), the writing switches to the other thread. Log entries carry sequence numbers so that the recovery process can handle entries duplicated by this log switching.
<img loading="lazy" src="/images/notes/bigtable-2006/image-21.png"></li>
</ul>
<h3 id="speeding-up-tablet-recovery">Speeding up Tablet recovery</h3>
<ul>
<li>One of the complicated and time-consuming tasks while loading Tablets is to ensure that the Tablet server loads all entries from the commit log.</li>
<li>When the master moves a Tablet from one Tablet server to another, the source Tablet server performs compactions to ensure that the destination Tablet server does not have to read the commit log. This is done in 3 steps: a first minor compaction of the Tablet&rsquo;s MemTable; stopping service for the Tablet; then a second, usually very fast, minor compaction covering the log entries that arrived during the first one, after which the Tablet is unloaded.</li>
</ul>
<h3 id="tablet-splitting">Tablet Splitting</h3>
<p><img loading="lazy" src="/images/notes/bigtable-2006/image-22.png"></p>
<h3 id="concurrency-on-memtable">Concurrency on MemTable</h3>
<ul>
<li>Want to avoid read-contention when writes are also happening on the same rows.</li>
<li>Use <strong>Copy-on-write</strong> semantics on a per-row basis.</li>
</ul>
<h3 id="performance-observations">Performance Observations</h3>
<p><img loading="lazy" src="/images/notes/bigtable-2006/image-23.png"></p>
<h3 id="bigtable-characteristics">BigTable Characteristics</h3>
<h3 id="bigtable-performanceand-popularity">BigTable performance(and Popularity)</h3>
<ul>
<li><strong>Distributed multi-level map</strong>: BigTable can run on a large number of machines.</li>
<li><strong>Scalable</strong>: BigTable can be easily scaled horizontally by adding more nodes to the cluster without any performance impact; no manual intervention or rebalancing is required. BigTable achieves linear scalability and proven fault tolerance on commodity hardware.</li>
<li><strong>Fault-tolerant and reliable</strong>: Since data is replicated to multiple nodes, fault tolerance is pretty high.</li>
<li><strong>Durable</strong>: BigTable stores data permanently.</li>
<li><strong>Centralized</strong>: BigTable adopts a single-master approach to maintain data consistency and a centralized view of the state of the system.</li>
<li><strong>Separation between control and data</strong>: BigTable maintains a strict separation between control and data flow. Clients talk to the Master for all metadata operations, whereas all data access happens directly between the Clients and the Tablet servers.</li>
</ul>
<h3 id="dynamo-vs-bigtable">Dynamo vs. BigTable</h3>
<p><img loading="lazy" src="/images/notes/bigtable-2006/image-24.png"></p>
<p><img loading="lazy" src="/images/notes/bigtable-2006/image-25.png"></p>
<h3 id="datastores-developed-on-the-principles-of-bigtable">Datastores developed on the principles of BigTable</h3>
<p>Google’s BigTable has inspired many NoSQL systems. Here is a list of a few famous ones:</p>
<ul>
<li><strong>HBase</strong>: HBase is an open-source, distributed non-relational database modeled after BigTable. It is built on top of the Hadoop Distributed File System (HDFS).</li>
<li><strong>Hypertable</strong>: Similar to HBase, Hypertable is an open-source implementation of BigTable, written in C++. Unlike BigTable, which uses only one storage layer (i.e., GFS), Hypertable can run on top of any file system (e.g., HDFS, GlusterFS, or CloudStore). To achieve this, the system abstracts the file-system interface by sending all data requests through a Distributed File System broker process.</li>
<li><strong>Cassandra</strong>: Cassandra is a distributed, decentralized, and highly available NoSQL database. Its architecture is based on Dynamo and BigTable. Cassandra can be described as a BigTable-like datastore running on a Dynamo-like infrastructure. Cassandra is also a wide-column store and utilizes the storage model of BigTable, i.e., SSTables and MemTables.</li>
</ul>
<h3 id="summary">Summary</h3>
<ul>
<li>BigTable is a <strong>Distributed</strong> <strong>wide column</strong> storage system designed to manage large amounts of <strong>semi-structured data</strong> with High Availability, Low Latency, Scalability, and Fault tolerance.</li>
<li>It is a <strong>sparse</strong>, <strong>distributed</strong>, <strong>persistent</strong>, <strong>Multi Dimensional</strong> <strong>sorted map</strong>.</li>
<li>The map is indexed by a unique key made up of a <strong>row key (up to 64 KB)</strong>, a <strong>column key</strong>, and a <strong>timestamp (64-bit integer)</strong>.</li>
<li>Columns are grouped into Column families. RowKey and Column key uniquely identifies a Column data cell. Within each cell, data is further indexed by timestamps to store multiple versions of the data.</li>
<li>Each read/write to a row is atomic. Atomicity across rows is not guaranteed.</li>
<li>A BigTable table can be multiple terabytes in size; it is broken into smaller ranges of rows called <strong>Tablets</strong>.</li>
<li>One Master server and multiple <strong>Tablet Servers</strong>.</li>
<li>The master handles metadata management, assigns Tablets to Tablet servers, rebalances Tablets, etc.</li>
<li>Read/Write of data goes directly to the tablet servers.</li>
<li>Tablet servers store each Tablet as a set of immutable <strong>SSTable</strong> files, each of which is further divided into <strong>64 KB data blocks</strong>. SSTables are stored as chunks in GFS and replicated to different chunk servers.</li>
<li>To enhance read performance, and in particular to avoid disk seeks when checking whether a key exists in each SSTable, Bloom filters are used.</li>
<li>BigTable relies on Chubby for master election (and failover) using locks; the master also checks whether Tablet servers are alive, since each Tablet server holds a lock on a file in Chubby&rsquo;s servers directory.</li>
<li>Writes first go to a commit log (WAL) for failure recovery, then to the in-memory MemTable (kept as a sorted map); when the MemTable breaches a size threshold, it is written out as an SSTable.</li>
<li>MemTables are flushed to SSTables, and SSTables are compacted into bigger SSTables in the background using compactions.</li>
<li>All the read operations are served from a Merged view of MemTable and All SSTables.</li>
</ul>
<h3 id="reference">Reference</h3>
<ul>
<li>BigTable</li>
<li>SSTable (LSM Trees)</li>
<li>Amazon Dynamo</li>
<li>Cassandra</li>
<li>HBase</li>
<li>Jordan BigTable</li>
</ul>
<hr>
<p><strong>Paper Link:</strong> <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Chubby</title>
      <link>https://www.sethihemant.com/notes/chubby-2006/</link>
      <pubDate>Sat, 07 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/chubby-2006/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf&#34;&gt;Chubby&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;chubby--distributed-locking-service&#34;&gt;Chubby / Distributed Locking Service&lt;/h2&gt;
&lt;h3 id=&#34;goal&#34;&gt;Goal&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Design a &lt;strong&gt;highly available&lt;/strong&gt; and &lt;strong&gt;consistent service&lt;/strong&gt; that can store small objects and provide a &lt;strong&gt;locking mechanism&lt;/strong&gt; on those objects.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;what-is-chubby&#34;&gt;What is Chubby?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chubby is a service that provides a &lt;strong&gt;distributed locking&lt;/strong&gt; mechanism and also &lt;strong&gt;stores small files&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Internally, it is &lt;strong&gt;implemented as a key/value store&lt;/strong&gt; that also provides a &lt;strong&gt;locking mechanism on each object&lt;/strong&gt; stored in it.&lt;/li&gt;
&lt;li&gt;Extensively used in various systems inside Google to &lt;strong&gt;provide&lt;/strong&gt; &lt;strong&gt;storage and coordination services&lt;/strong&gt; for systems like &lt;strong&gt;GFS&lt;/strong&gt; and &lt;strong&gt;BigTable&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Apache ZooKeeper is the open-source alternative to Chubby.&lt;/li&gt;
&lt;li&gt;Chubby is a centralized service offering &lt;strong&gt;developer-friendly interfaces (to acquire/release locks and create/read/delete small files)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;It does all this with just a few extra lines of code to any existing application without a lot of modification to application logic.&lt;/li&gt;
&lt;li&gt;At a high level, Chubby provides a framework for &lt;strong&gt;distributed consensus&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chubby-use-cases&#34;&gt;Chubby Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Primarily Chubby was developed to provide a reliable locking service. Other use cases evolved like:&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;leader-election&#34;&gt;Leader Election&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Any lock service can be seen as a &lt;strong&gt;consensus service&lt;/strong&gt;, as it converts the problem of reaching consensus to handing out locks.&lt;/li&gt;
&lt;li&gt;A set of distributed applications compete to acquire a lock, and whoever gets the lock first gets the resource.&lt;/li&gt;
&lt;li&gt;Similarly, an application can have multiple replicas running and wants one of them to be chosen as the leader. Chubby can be used for leader election among a set of replicas.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/chubby-2006/image-1.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;naming-servicelike-dns&#34;&gt;Naming Service(Like DNS)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;It is hard to make faster updates to DNS due to its time-based caching nature, which means there is generally a potential delay before the latest DNS mapping is effective.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/chubby-2006/image-2.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;storagesmall-objects-that-rarely-change&#34;&gt;Storage(Small Objects that rarely change)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chubby provides a &lt;strong&gt;Unix-style interface&lt;/strong&gt; to reliably &lt;strong&gt;store small files&lt;/strong&gt; that &lt;strong&gt;do not change frequently&lt;/strong&gt; (complementing the service offered by GFS).&lt;/li&gt;
&lt;li&gt;Applications can then use these files for any usage like DNS, configs, etc.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/chubby-2006/image-3.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;distributed-locking-mechanism&#34;&gt;Distributed Locking Mechanism&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chubby provides a &lt;strong&gt;developer-friendly interface&lt;/strong&gt; for &lt;strong&gt;coarse-grained distributed locks&lt;/strong&gt; (as opposed to fine-grained locks) to &lt;strong&gt;synchronize distributed activities&lt;/strong&gt; in a distributed environment.&lt;/li&gt;
&lt;li&gt;Application needs a few lines, and chubby can take care of all lock management so that devs can focus on business logic, and not solve distributed Locking problems in a Distributed system’s setting.&lt;/li&gt;
&lt;li&gt;We can say that Chubby provides mechanisms like &lt;strong&gt;semaphores&lt;/strong&gt; and &lt;strong&gt;mutexes&lt;/strong&gt; for a distributed environment.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/chubby-2006/image-4.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;when-not-to-use-chubby&#34;&gt;When Not to Use Chubby?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Bulk Storage is needed&lt;/li&gt;
&lt;li&gt;Data update rate is high.&lt;/li&gt;
&lt;li&gt;Locks are acquired/released frequently.&lt;/li&gt;
&lt;li&gt;Usage is more like a publish/subscribe model.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;background&#34;&gt;Background&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chubby is neither really a research effort nor does it claim to introduce any new algorithms.&lt;/li&gt;
&lt;li&gt;Rather, Chubby describes a &lt;strong&gt;certain design and implementation&lt;/strong&gt; done at Google in order to &lt;strong&gt;provide a way for its clients to synchronize their activities&lt;/strong&gt; and &lt;strong&gt;agree(Consensus)&lt;/strong&gt; on basic information about their environment&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chubby-and-paxos&#34;&gt;Chubby and Paxos&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chubby uses Paxos underneath to manage the state of the Chubby system at any point in time.&lt;/li&gt;
&lt;li&gt;Getting all nodes in a distributed system to agree on anything (e.g., election of primary among peers) is basically a kind of &lt;strong&gt;distributed consensus&lt;/strong&gt; problem.&lt;/li&gt;
&lt;li&gt;Distributed consensus using Asynchronous Communication is already solved by &lt;strong&gt;Paxos&lt;/strong&gt; protocol.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/chubby-2006/image-5.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chubby-common-terms&#34;&gt;Chubby Common Terms&lt;/h3&gt;
&lt;h3 id=&#34;chubby-cell&#34;&gt;Chubby Cell&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chubby cell is a &lt;strong&gt;Chubby Cluster&lt;/strong&gt;. Most Chubby Cells are &lt;strong&gt;single Data Center&lt;/strong&gt;(DC) but there can be some configuration where Chubby replicas exist Cross DC as well.&lt;/li&gt;
&lt;li&gt;Chubby cell has two main components, &lt;strong&gt;server&lt;/strong&gt; and &lt;strong&gt;client&lt;/strong&gt;, that communicate via remote procedure call (&lt;strong&gt;RPC&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chubby-servers&#34;&gt;Chubby Servers&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A Chubby Cell consists of a small set of servers(typically 5) known as Replicas.&lt;/li&gt;
&lt;li&gt;Using Paxos, one of the servers is selected as Master which handles all client requests. Fails over to another replica if the master fails.&lt;/li&gt;
&lt;li&gt;Each replica maintains a small database to store files/directories/locks.&lt;/li&gt;
&lt;li&gt;The master writes directly to its own local database, which gets synced asynchronously to all the replicas(Reliability).&lt;/li&gt;
&lt;li&gt;For Fault Tolerance, replicas are placed on different racks.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chubby-client-library&#34;&gt;Chubby Client Library&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Client applications use a Chubby library to communicate with the replicas in the chubby cell using RPC.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/chubby-2006/image-6.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chubby-api&#34;&gt;Chubby API&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chubby exports a &lt;strong&gt;unix-like&lt;/strong&gt; file system interface similar to &lt;strong&gt;POSIX&lt;/strong&gt; but simpler.&lt;/li&gt;
&lt;li&gt;It consists of &lt;strong&gt;a strict tree of files and directories&lt;/strong&gt; with name components separated by slashes. E.g. &lt;strong&gt;File format: /ls/chubby_cell/directory_name/&amp;hellip;/file_name&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;A special name, &lt;strong&gt;/ls/local&lt;/strong&gt;, will be resolved to the most local cell relative to the calling application or service. &lt;strong&gt;What is the most local Cell?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Chubby can be used for &lt;strong&gt;locking&lt;/strong&gt; or &lt;strong&gt;storing a small amount of data&lt;/strong&gt; or &lt;strong&gt;both&lt;/strong&gt;, i.e., &lt;strong&gt;storing small files with locks&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API Categories&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;general&#34;&gt;General&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Open()&lt;/strong&gt; : Opens a given named file or directory and returns a handle.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Close()&lt;/strong&gt; : Closes an open handle.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poison()&lt;/strong&gt; : Allows a client to cancel all Chubby calls made by other threads without fear of deallocating the memory being accessed by them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delete()&lt;/strong&gt; : Deletes the file or directory.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;file&#34;&gt;File&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetContentsAndStat()&lt;/strong&gt; : Returns (atomically) the &lt;strong&gt;whole file contents and metadata&lt;/strong&gt; associated with the file. This approach of reading the whole file is &lt;strong&gt;designed to discourage the creation of large files&lt;/strong&gt;, as it is not the intended use of Chubby.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetStat()&lt;/strong&gt; : Returns just the metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ReadDir()&lt;/strong&gt; : Returns the contents of a directory – that is, names and metadata of all children.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SetContents()&lt;/strong&gt; : Writes the whole contents of a file (atomically).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SetACL()&lt;/strong&gt; : Writes new access control list information.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;locking&#34;&gt;Locking&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Acquire()&lt;/strong&gt; : Acquires a lock on a file.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TryAcquire()&lt;/strong&gt; : Tries to acquire a lock on a file; it is a non-blocking variant of Acquire.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Release()&lt;/strong&gt; : Releases a lock.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;sequencer&#34;&gt;Sequencer&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetSequencer()&lt;/strong&gt; : Get the sequencer of a lock. A sequencer is a string representation of a lock.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SetSequencer()&lt;/strong&gt; : Associate a sequencer with a handle.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CheckSequencer()&lt;/strong&gt; : Check whether a sequencer is valid.
Chubby does not support operations like append, seek, move files between directories, or making symbolic or hard links.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Files can only be completely read or completely written/overwritten. This makes it practical only for storing very small files.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf">Chubby</a></p>
<hr>
<h2 id="chubby--distributed-locking-service">Chubby / Distributed Locking Service</h2>
<h3 id="goal">Goal</h3>
<ul>
<li>Design a <strong>highly available</strong> and <strong>consistent service</strong> that can store small objects and provide a <strong>locking mechanism</strong> on those objects.</li>
</ul>
<h3 id="what-is-chubby">What is Chubby?</h3>
<ul>
<li>Chubby is a service that provides a <strong>distributed locking</strong> mechanism and also <strong>stores small files</strong>.</li>
<li>Internally, it is <strong>implemented as a key/value store</strong> that also provides a <strong>locking mechanism on each object</strong> stored in it.</li>
<li>Extensively used in various systems inside Google to <strong>provide</strong> <strong>storage and coordination services</strong> for systems like <strong>GFS</strong> and <strong>BigTable</strong>.</li>
<li>Apache ZooKeeper is the open-source alternative to Chubby.</li>
<li>Chubby is a centralized service offering <strong>developer-friendly interfaces (to acquire/release locks and create/read/delete small files)</strong>.</li>
<li>It does all this with just a few extra lines of code to any existing application without a lot of modification to application logic.</li>
<li>At a high level, Chubby provides a framework for <strong>distributed consensus</strong>.</li>
</ul>
<h3 id="chubby-use-cases">Chubby Use Cases</h3>
<ul>
<li>Primarily Chubby was developed to provide a reliable locking service. Other use cases evolved like:</li>
</ul>
<h3 id="leader-election">Leader Election</h3>
<ul>
<li>Any lock service can be seen as a <strong>consensus service</strong>, as it converts the problem of reaching consensus to handing out locks.</li>
<li>A set of distributed applications compete to acquire a lock, and whoever gets the lock first gets the resource.</li>
<li>Similarly, an application can have multiple replicas running and wants one of them to be chosen as the leader. Chubby can be used for leader election among a set of replicas.
<img loading="lazy" src="/images/notes/chubby-2006/image-1.png"></li>
</ul>
<h3 id="naming-servicelike-dns">Naming Service(Like DNS)</h3>
<ul>
<li>It is hard to make faster updates to DNS due to its time-based caching nature, which means there is generally a potential delay before the latest DNS mapping is effective.
<img loading="lazy" src="/images/notes/chubby-2006/image-2.png"></li>
</ul>
<h3 id="storagesmall-objects-that-rarely-change">Storage(Small Objects that rarely change)</h3>
<ul>
<li>Chubby provides a <strong>Unix-style interface</strong> to reliably <strong>store small files</strong> that <strong>do not change frequently</strong> (complementing the service offered by GFS).</li>
<li>Applications can then use these files for any usage like DNS, configs, etc.
<img loading="lazy" src="/images/notes/chubby-2006/image-3.png"></li>
</ul>
<h3 id="distributed-locking-mechanism">Distributed Locking Mechanism</h3>
<ul>
<li>Chubby provides a <strong>developer-friendly interface</strong> for <strong>coarse-grained distributed locks</strong> (as opposed to fine-grained locks) to <strong>synchronize distributed activities</strong> in a distributed environment.</li>
<li>An application needs only a few lines of code; Chubby takes care of all lock management so that devs can focus on business logic rather than solving distributed-locking problems themselves.</li>
<li>We can say that Chubby provides mechanisms like <strong>semaphores</strong> and <strong>mutexes</strong> for a distributed environment.
<img loading="lazy" src="/images/notes/chubby-2006/image-4.png"></li>
</ul>
<h3 id="when-not-to-use-chubby">When Not to Use Chubby?</h3>
<ul>
<li>Bulk Storage is needed</li>
<li>Data update rate is high.</li>
<li>Locks are acquired/released frequently.</li>
<li>Usage is more like a publish/subscribe model.</li>
</ul>
<h3 id="background">Background</h3>
<ul>
<li>Chubby is neither really a research effort nor does it claim to introduce any new algorithms.</li>
<li>Rather, Chubby describes a <strong>certain design and implementation</strong> done at Google in order to <strong>provide a way for its clients to synchronize their activities</strong> and <strong>agree (consensus)</strong> on basic information about their environment.</li>
</ul>
<h3 id="chubby-and-paxos">Chubby and Paxos</h3>
<ul>
<li>Chubby uses Paxos underneath to manage the state of the Chubby system at any point in time.</li>
<li>Getting all nodes in a distributed system to agree on anything (e.g., election of primary among peers) is basically a kind of <strong>distributed consensus</strong> problem.</li>
<li>Distributed consensus using Asynchronous Communication is already solved by <strong>Paxos</strong> protocol.
<img loading="lazy" src="/images/notes/chubby-2006/image-5.png"></li>
</ul>
<h3 id="chubby-common-terms">Chubby Common Terms</h3>
<h3 id="chubby-cell">Chubby Cell</h3>
<ul>
<li>A Chubby cell is a <strong>Chubby cluster</strong>. Most Chubby cells live in a <strong>single data center</strong> (DC), but some configurations place Chubby replicas across DCs as well.</li>
<li>Chubby cell has two main components, <strong>server</strong> and <strong>client</strong>, that communicate via remote procedure call (<strong>RPC</strong>).</li>
</ul>
<h3 id="chubby-servers">Chubby Servers</h3>
<ul>
<li>A Chubby Cell consists of a small set of servers(typically 5) known as Replicas.</li>
<li>Using Paxos, one of the servers is elected master and handles all client requests; if the master fails, another replica takes over.</li>
<li>Each replica maintains a small database to store files/directories/locks.</li>
<li>The master writes directly to its own local database; writes are propagated to the replicas via the consensus protocol and acknowledged by a majority (reliability).</li>
<li>For Fault Tolerance, replicas are placed on different racks.</li>
</ul>
<h3 id="chubby-client-library">Chubby Client Library</h3>
<ul>
<li>Client applications use a Chubby library to communicate with the replicas in the chubby cell using RPC.
<img loading="lazy" src="/images/notes/chubby-2006/image-6.png"></li>
</ul>
<h3 id="chubby-api">Chubby API</h3>
<ul>
<li>Chubby exports a <strong>unix-like</strong> file system interface similar to <strong>POSIX</strong> but simpler.</li>
<li>It consists of <strong>a strict tree of files and directories</strong> with name components separated by slashes. E.g. <strong>File format: /ls/chubby_cell/directory_name/&hellip;/file_name</strong></li>
<li>A special name, <strong>/ls/local</strong>, will be resolved to the most local cell relative to the calling application or service. <strong>What is the most local Cell?</strong></li>
<li>Chubby can be used for <strong>locking</strong> or <strong>storing a small amount of data</strong> or <strong>both</strong>, i.e., <strong>storing small files with locks</strong>.</li>
<li><strong>API Categories</strong></li>
</ul>
<h3 id="general">General</h3>
<ul>
<li><strong>Open()</strong> : Opens a given named file or directory and returns a handle.</li>
<li><strong>Close()</strong> : Closes an open handle.</li>
<li><strong>Poison()</strong> : Allows a client to cancel all Chubby calls made by other threads without fear of deallocating the memory being accessed by them.</li>
<li><strong>Delete()</strong> : Deletes the file or directory.</li>
</ul>
<h3 id="file">File</h3>
<ul>
<li><strong>GetContentsAndStat()</strong> : Returns (atomically) the <strong>whole file contents and metadata</strong> associated with the file. This approach of reading the whole file is <strong>designed to discourage the creation of large files</strong>, as it is not the intended use of Chubby.</li>
<li><strong>GetStat()</strong> : Returns just the metadata.</li>
<li><strong>ReadDir()</strong> : Returns the contents of a directory – that is, names and metadata of all children.</li>
<li><strong>SetContents()</strong> : Writes the whole contents of a file (atomically).</li>
<li><strong>SetACL()</strong> : Writes new access control list information.</li>
</ul>
<h3 id="locking">Locking</h3>
<ul>
<li><strong>Acquire()</strong> : Acquires a lock on a file.</li>
<li><strong>TryAcquire()</strong> : Tries to acquire a lock on a file; it is a non-blocking variant of Acquire.</li>
<li><strong>Release()</strong> : Releases a lock.</li>
</ul>
<h3 id="sequencer">Sequencer</h3>
<ul>
<li><strong>GetSequencer()</strong> : Get the sequencer of a lock. A sequencer is a string representation of a lock.</li>
<li><strong>SetSequencer()</strong> : Associate a sequencer with a handle.</li>
<li><strong>CheckSequencer()</strong> : Check whether a sequencer is valid.</li>
</ul>
<p>Chubby does not support operations such as append, seek, moving files between directories, or creating symbolic or hard links.</p>
<p>Files can only be completely read or completely written/overwritten. This makes it practical only for storing very small files.</p>
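<p>To make the whole-file semantics above concrete, here is a minimal sketch: a toy in-memory model (not the real client library, which is internal to Google) whose handle methods simply mirror the API names listed earlier.</p>
<pre><code class="language-python"># Toy, in-memory model of Chubby's whole-file semantics: reads return the
# entire contents, and writes replace them atomically (no append, no seek).
class ChubbyHandle:
    def __init__(self, store, path):
        self.store, self.path = store, path
        self.locked = False

    def Acquire(self):                 # advisory writer lock (honor system)
        self.locked = True

    def Release(self):
        self.locked = False

    def GetContentsAndStat(self):      # whole-file read plus metadata
        data = self.store.get(self.path, b"")
        return data, {"size": len(data)}

    def SetContents(self, data):       # whole-file overwrite
        self.store[self.path] = data

store = {}
h = ChubbyHandle(store, "/ls/cell/app/config")
h.Acquire()
h.SetContents(b'{"leader": "host-17"}')
print(h.GetContentsAndStat())
h.Release()
</code></pre>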
<h3 id="design-rationale">Design Rationale</h3>
<h3 id="agenda">Agenda</h3>
<ul>
<li>Why was Chubby built as a service?</li>
<li>Why coarse-grained locks?</li>
<li>Why advisory locks?</li>
<li>Why does Chubby need storage?</li>
<li>Why does Chubby export a Unix-like file system interface?</li>
<li>High Availability and reliability</li>
</ul>
<h3 id="why-was-chubby-built-as-a-service-rather-than-a-distributed-client-library-doing-paxos">Why was chubby built as a service rather than a distributed client library doing Paxos?</h3>
<ul>
<li>Why build a distributed service instead of a client library that only provides Paxos-based consensus? A lock service has clear advantages over a client library:</li>
</ul>
<h3 id="why-coarse-grained-locks">Why coarse-grained locks?</h3>
<p>Chubby locks are not expected to be used in a fine-grained way, where they might be held for only a short period (i.e., seconds or less). <strong>For example,</strong> electing a leader is not a frequent event. Reasons why only coarse-grained locks are supported:</p>
<ul>
<li><strong>Less load on the lock server</strong></li>
<li><strong>Survive Lock server failures</strong></li>
<li><strong>Fewer lock servers are needed:</strong></li>
<li><strong>Implement a fine-grained locking system</strong> on top of Chubby&rsquo;s coarse-grained locks if needed.
<img loading="lazy" src="/images/notes/chubby-2006/image-7.png"></li>
</ul>
<h3 id="why-advisory-locks">Why advisory locks?</h3>
<ul>
<li>Chubby locks are advisory, which means it is up to the application to honor the lock. Chubby doesn’t make locked objects inaccessible to clients not holding their locks.</li>
<li>Chubby gave following reasons for <strong>not having mandatory locks</strong>:</li>
</ul>
<h3 id="why-does-chubby-need-storage">Why does Chubby need storage?</h3>
<ul>
<li>To provide a Consistent view of the system to various distributed entities in some use cases like:</li>
</ul>
<h3 id="why-does-chubby-exports-like-a-unix-like-file-system-interface">Why does Chubby exports like a unix-like file system interface?</h3>
<ul>
<li>It significantly reduces the effort needed to write basic browsing and namespace manipulation tools, and reduces the need to educate casual Chubby users.
<img loading="lazy" src="/images/notes/chubby-2006/image-8.png"></li>
</ul>
<h3 id="high-availability-and-reliability">High Availability and reliability</h3>
<ul>
<li>Chubby compromises on performance in favor of availability and consistency. <strong>What?</strong></li>
</ul>
<h3 id="how-chubby-works">How Chubby Works?</h3>
<h3 id="agenda-1">Agenda</h3>
<ul>
<li>Service Initialization</li>
<li>Client Initialization</li>
<li>Leader Election example using Chubby</li>
</ul>
<h3 id="service-initialization">Service Initialization</h3>
<ul>
<li>A master is chosen among chubby replicas using Paxos.</li>
<li>Current master information is persisted in storage and all replicas become aware of the master.</li>
</ul>
<h3 id="client-initialization">Client Initialization</h3>
<ul>
<li>Client contacts DNS to know listed Chubby replicas.</li>
<li>Client calls Chubby Server directly via Remote Procedure Call(RPC)</li>
<li>If that replica is not the master, it will return the address of the current master.</li>
<li>Once the master is located, the client maintains a session with it and sends all requests to it until it indicates that it is not the master anymore or stops responding.
<img loading="lazy" src="/images/notes/chubby-2006/image-9.png"></li>
</ul>
<h3 id="leader-election-example-using-chubby">Leader Election example using Chubby</h3>
<p>Example of application that uses Chubby to elect a single master from a bunch of instances of the same application.</p>
<p><img loading="lazy" src="/images/notes/chubby-2006/image-10.png"></p>
<h3 id="sample-pseudocode-for-leader-election-from-client-application">Sample Pseudocode for leader election from client application.</h3>
<p><img loading="lazy" src="/images/notes/chubby-2006/image-11.png"></p>
<h3 id="files-directories-and-handles">Files, Directories and Handles</h3>
<h3 id="agenda-2">Agenda</h3>
<ul>
<li>Nodes</li>
<li>Metadata</li>
<li>Handles</li>
</ul>
<p>The Chubby file-system interface is a tree of files and directories (directories can contain files and sub-directories), each of which is called a node.</p>
<p><img loading="lazy" src="/images/notes/chubby-2006/image-12.png"></p>
<h3 id="nodes">Nodes</h3>
<ul>
<li>Any node can act as an advisory reader/writer lock.</li>
<li>Nodes can be ephemeral or permanent.</li>
<li>Ephemeral files are used as temporary files and act as an indicator to others that a client is alive.</li>
<li>Ephemeral files are also deleted if no client has them open.</li>
<li>Ephemeral directories are also deleted if they are empty.</li>
<li>Any node can be explicitly deleted.</li>
</ul>
<h3 id="metadata">Metadata</h3>
<ul>
<li>Metadata for each node includes ACL(Access control list), 4 monotonically increasing 64-bit numbers, and a checksum.</li>
<li><strong>ACL</strong></li>
<li><strong>Monotonically increasing 64-bit numbers</strong>: These numbers allow clients to detect changes easily.</li>
<li><strong>Checksum :</strong> Chubby exposes a 64-bit file-content checksum so clients may tell whether files differ.</li>
</ul>
<h3 id="handles">Handles</h3>
<ul>
<li>Clients open nodes to obtain <strong>handles</strong>(similar to <strong>Unix File Descriptors</strong>). Handles include:</li>
</ul>
<h3 id="locks-sequencers-and-lock-delays">Locks Sequencers and Lock-Delays</h3>
<h3 id="agenda-3">Agenda</h3>
<ul>
<li>Locks</li>
<li>Sequencer</li>
<li>Lock-Delay</li>
</ul>
<h3 id="locks">Locks</h3>
<ul>
<li>Each chubby node can act as a reader-writer lock in the following two ways:</li>
</ul>
<h3 id="sequencer-1">Sequencer</h3>
<ul>
<li>With distributed systems, receiving messages out of order is a problem.</li>
<li>Chubby uses sequence numbers to solve this problem.</li>
<li>In essence, we are achieving distributed consensus (total order broadcast) among a set of application servers (via leader election on those servers) by using a distributed lock service (Chubby), which itself relies on Paxos internally.</li>
<li>After acquiring a lock on a file, a client can immediately request a <strong>Sequencer</strong>, which is an opaque byte string describing the state of the lock.</li>
<li>An application’s master server can generate a <strong>sequencer</strong> and send it with any internal order to other application servers.</li>
<li>Application servers that receive orders from a primary can check with Chubby whether the sequencer is still valid and does not belong to a stale primary (to handle the ‘split brain’ scenario), as sketched below.
<img loading="lazy" src="/images/notes/chubby-2006/image-13.png"></li>
</ul>
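<p>A sketch of the fencing check a downstream server might perform. The paper describes a sequencer as carrying the lock&rsquo;s name, mode, and generation number; the check below is a simplified stand-in for validating a sequencer against Chubby:</p>
<pre><code class="language-python"># Generations the server believes are current, as validated against Chubby.
current_generation = {"/ls/cell/app/leader": 7}

def check_sequencer(seq):
    """Reject requests carrying a sequencer from a stale lock generation,
    which fences off orders issued by a zombie primary."""
    name, mode, generation = seq
    return generation == current_generation.get(name) and mode == "exclusive"

print(check_sequencer(("/ls/cell/app/leader", "exclusive", 7)))  # True
print(check_sequencer(("/ls/cell/app/leader", "exclusive", 6)))  # False: stale
</code></pre>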
<h3 id="lock-delay">Lock-Delay</h3>
<ul>
<li>For file servers (or external services) that do not support sequencers (i.e., <strong>fencing tokens</strong> that protect against delayed packets belonging to an older lock), Chubby provides a lock-delay period to protect against message delays and server restarts.</li>
<li>If a client releases a lock in the normal way, it is immediately available for other clients to claim, as one would expect.</li>
<li>However, if a lock becomes free because the holder has failed or become inaccessible, the lock server will prevent other clients from claiming the lock for a period called the lock-delay.</li>
<li>While imperfect, the lock-delay protects unmodified servers and clients from everyday problems caused by message delays and restarts.</li>
</ul>
<h3 id="session-and-events">Session and Events</h3>
<h3 id="agenda-4">Agenda</h3>
<ul>
<li>What is a Chubby Session?</li>
<li>Session Protocol</li>
<li>What is Keep Alive</li>
<li>Session Optimization</li>
<li>Failovers</li>
</ul>
<h3 id="what-is-a-chubby-session">What is a Chubby Session?</h3>
<ul>
<li>A relationship b/w Chubby Cell and a Client.</li>
<li>It exists for some interval of time and is maintained by periodic handshakes called <strong>keepalives</strong>.</li>
<li>Clients’ handles, locks, and cached data only remain valid provided its session remains valid.</li>
</ul>
<h3 id="session-protocol">Session Protocol</h3>
<ul>
<li>Client requests a new session from the Chubby cell’s master.</li>
<li>Session ends if the client explicitly ends it or it has been idle.</li>
<li>Each session has an associated lease, which is the time interval during which the master guarantees not to terminate the session unilaterally. End of this interval is called <strong>Session Lease Timeout</strong>.</li>
<li>Master advances session lease timeout in 3 circumstances:</li>
</ul>
<h3 id="what-is-keep-alive">What is Keep Alive</h3>
<ul>
<li>Keepalive is a way for a client to maintain a constant session with Chubby Cell.</li>
<li>Steps:</li>
<li>Google experimentation showed that <strong>93% of RPC requests</strong> are KeepAlives.</li>
<li>How can we reduce the keepalives?
<img loading="lazy" src="/images/notes/chubby-2006/image-14.png"></li>
</ul>
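<p>A toy version of the client-side session loop (the timeout values are illustrative, and <code>rpc_keep_alive</code>/<code>notify_application</code> are assumed callbacks rather than the real library API):</p>
<pre><code class="language-python">def session_loop(rpc_keep_alive, notify_application,
                 lease_timeout=12.0, grace_period=45.0):
    """The master holds each KeepAlive until the lease is nearly over, and
    every reply extends the client's local lease."""
    while True:
        try:
            # Blocks at the master and returns the next lease interval; real
            # Chubby also piggybacks events and cache invalidations here.
            lease_timeout = rpc_keep_alive(timeout=lease_timeout)
        except TimeoutError:
            notify_application("jeopardy")      # lease expired; flush cache
            try:
                # Wait out the grace period for a new master to answer.
                lease_timeout = rpc_keep_alive(timeout=grace_period)
                notify_application("safe")      # session survived fail-over
            except TimeoutError:
                notify_application("expired")   # abandon the session
                return
</code></pre>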
<h3 id="session-optimization">Session Optimization</h3>
<ul>
<li><strong>Piggybacking events (using an existing message, such as a KeepAlive reply, to transmit additional detail)</strong></li>
<li><strong>Local Lease</strong></li>
<li><strong>Jeopardy</strong></li>
<li><strong>Grace Period</strong></li>
<li><strong>Original(Initial chubby session):</strong></li>
<li><strong>Optimization Attempt 1:</strong>
<img loading="lazy" src="/images/notes/chubby-2006/image-15.png"></li>
</ul>
<h3 id="failovers">Failovers</h3>
<ul>
<li>
<p>Failover happens when the master fails or otherwise loses membership. Chubby typically takes b/w 5-30 seconds for fail-over.</p>
</li>
<li>
<p>Summary of things that happen in a master failover.
<img loading="lazy" src="/images/notes/chubby-2006/image-16.png"></p>
</li>
<li>
<p>Client has lease M1 (&amp; local lease C1) with master and pending KeepAlive request.</p>
</li>
<li>
<p>Master starts lease M2 and replies to the KeepAlive request.</p>
</li>
<li>
<p>Client extends the local lease to C2 and makes a new KeepAlive call. Master dies before replying to the next KeepAlive. So, no new leases can be assigned. Client’s C2 lease expires, and the client library flushes its cache and informs the application that it has entered jeopardy. The grace period starts on the client.</p>
</li>
<li>
<p>Eventually, a new master is elected and initially uses a conservative approximation M3 of the session lease that its predecessor may have had for the client. Client sends KeepAlive to new master (4).</p>
</li>
<li>
<p>The first KeepAlive request from the client to the new master is rejected (5) because it has the wrong master epoch number (described in the next section).</p>
</li>
<li>
<p>Client retries with another KeepAlive request.</p>
</li>
<li>
<p>Re-tried KeepAlive succeeds. Client extends its lease to C3 and optionally informs the application that its session is no longer in jeopardy (session is in the safe mode now).</p>
</li>
<li>
<p>Client makes a new KeepAlive call, and the normal protocol works from this point onwards.</p>
</li>
<li>
<p>Because the grace period was long enough to cover the interval between the end of lease C2 and the beginning of lease C3, the client saw nothing but a delay. If the grace period was less than that interval, the client would have abandoned the session and reported the failure to the application.</p>
</li>
</ul>
<h3 id="master-election-and-chubby-events">Master Election and Chubby Events?</h3>
<h3 id="initializing-a-newly-elected-master">Initializing a newly elected Master</h3>
<ul>
<li>A newly elected master proceeds as follows:</li>
<li><strong>Picks a new Epoch Number</strong>: To differentiate itself from the previous master. Clients are required to present an epoch number on every call. Master rejects calls from clients using older epoch numbers. This ensures that the new master will not respond to a very old packet that was sent to the previous master.</li>
<li><strong>Responds to master-location requests</strong>: but doesn’t respond to session related operations yet.</li>
<li><strong>Build in-memory data structures</strong>:</li>
<li><strong>Let clients perform keep-alives:</strong></li>
<li><strong>Emits a fail-over event to each session</strong>:</li>
<li><strong>Wait</strong>: Master waits until each session acknowledges the fail-over event or lets its session expire.</li>
<li><strong>Allow all operations to proceed</strong>.</li>
<li><strong>Honor older handles by clients</strong>:</li>
<li><strong>Delete Ephemeral files</strong>:</li>
</ul>
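<p>A toy server-side guard for the epoch-number step (a sketch of the idea, not Chubby&rsquo;s actual RPC layer): clients include the master epoch in every call, and a newly elected master, having picked a higher epoch, rejects calls addressed to its predecessor.</p>
<pre><code class="language-python">class Master:
    def __init__(self, epoch):
        self.epoch = epoch

    def handle(self, request):
        # Reject very old packets that were meant for the previous master.
        if request["epoch"] != self.epoch:
            return {"error": "stale epoch, rediscover the master"}
        return {"ok": True}

new_master = Master(epoch=5)            # predecessor used epoch 4
print(new_master.handle({"epoch": 4}))  # rejected
print(new_master.handle({"epoch": 5}))  # accepted
</code></pre>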
<h3 id="chubby-events">Chubby Events</h3>
<ul>
<li>Chubby supports a <strong>simple event mechanism</strong> to let its clients <strong>subscribe</strong> to events.</li>
<li>Events are <strong>delivered asynchronously</strong> via <strong>callbacks</strong> from the chubby library.</li>
<li>Clients subscribe to a range of events while creating a handle.</li>
<li>Example of events from Server to Chubby Client:</li>
<li>Additionally Chubby client sends the following session events to the application:</li>
</ul>
<h3 id="caching">Caching</h3>
<h3 id="chubby-cache">Chubby Cache</h3>
<ul>
<li>Caching is important since it is used for read heavy purposes rather than write heavy.</li>
<li>Chubby clients cache <strong>file contents</strong>, <strong>node metadata</strong>, and information on <strong>open handles</strong> in a <strong>consistent</strong>, <strong>write-through cache</strong> in clients’ memory.</li>
<li>Chubby must maintain <strong>consistency</strong> b/w <strong>file</strong>, its <strong>replicas</strong>, and <strong>cache</strong> as well.</li>
<li>Clients maintain their cache by a <strong>lease mechanism</strong>, and <strong>flush</strong> the cache when the lease expires.</li>
</ul>
<h3 id="cache-invalidation">Cache Invalidation</h3>
<ul>
<li>Protocol for cache invalidation when file data or metadata is changed:
<img loading="lazy" src="/images/notes/chubby-2006/image-17.png"></li>
</ul>
<p><strong>Question:</strong> While the master is waiting for acknowledgments, are other clients allowed to read the file?</p>
<ul>
<li>
<p><strong>Answer:</strong> During the time the master is waiting for the acknowledgments from clients, the file is treated as ‘uncachable.’ This means that the clients can still read the file but will not cache it. This approach ensures that reads always get processed without any delay. This is useful because reads outnumber writes.
<strong>Question:</strong> Are clients allowed to cache locks? If yes, how is it used?</p>
</li>
<li>
<p><strong>Answer:</strong> Chubby allows its clients to cache locks, which means the client can hold locks longer than necessary, hoping that they can be used again by the same client.
<strong>Question:</strong> Are clients allowed to cache open handles?</p>
</li>
<li>
<p><strong>Answer:</strong> Chubby allows its clients to cache open handles. This way, if a client tries to open a file it has opened previously, only the first open() call goes to the master.</p>
</li>
</ul>
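<p>A toy model of the write path implied by the invalidation protocol and the Q&amp;A above (the class and method names are invented for illustration):</p>
<pre><code class="language-python">class Client:
    """Caching client: an invalidation empties its cache entry."""
    def __init__(self):
        self.cache = {}

    def invalidate(self, path):          # piggybacked on a KeepAlive reply
        self.cache.pop(path, None)       # the ack is implicit in this toy

class Master:
    def __init__(self):
        self.store, self.uncachable = {}, set()

    def write(self, path, data, clients):
        self.uncachable.add(path)        # reads still served, just not cached
        for c in clients:
            c.invalidate(path)           # wait for each ack (or lease expiry)
        self.store[path] = data
        self.uncachable.discard(path)    # caching may resume
</code></pre>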
<h3 id="database">Database</h3>
<h3 id="agenda-5">Agenda</h3>
<ul>
<li>
<p>Backup</p>
</li>
<li>
<p>Mirroring
How chubby uses a database for storage.</p>
</li>
<li>
<p>Initially, Chubby used a replicated version of Berkeley DB to store its data. Later, the Chubby team felt that using Berkeley DB exposes Chubby to more risks, so they decided to write a simplified custom database with the following characteristics:</p>
</li>
</ul>
<h3 id="backup">Backup</h3>
<ul>
<li>For recovery in case of failure, all database transactions are stored in a <strong>transaction log</strong> (<strong>a write-ahead log</strong>).</li>
<li>As this transaction log can become very large over time, every few hours <strong>the master of each Chubby cell writes a snapshot of its database to a GFS server</strong> in a different building.</li>
<li>The use of a separate building ensures both that the backup will survive building damage, and that the backups introduce no cyclic dependencies in the system;</li>
<li><strong>Once a snapshot is taken, the previous transaction log is deleted</strong>. Therefore, at any time, the complete state of the system is determined by the last snapshot together with the set of transactions from the transaction log.</li>
<li>Backup databases are used for disaster recovery and to initialize the database of a newly replaced replica without placing a load on other replicas.</li>
</ul>
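<p>The recovery invariant above fits in a few lines. This sketch treats the database as a dict and the write-ahead log as a list of operations (the entry format is invented for illustration):</p>
<pre><code class="language-python">def recover(snapshot, log):
    """Current state = last snapshot + replay of the log written since it."""
    state = dict(snapshot)                 # last snapshot (e.g. from GFS)
    for op, path, data in log:             # write-ahead log entries after it
        if op == "set":
            state[path] = data
        elif op == "delete":
            state.pop(path, None)
    return state

snap = {"/ls/cell/a": b"1"}
wal = [("set", "/ls/cell/b", b"2"), ("delete", "/ls/cell/a", None)]
print(recover(snap, wal))                  # {'/ls/cell/b': b'2'}
</code></pre>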
<h3 id="mirroring">Mirroring</h3>
<ul>
<li>Mirroring is a technique that allows a system to automatically maintain multiple copies. Chubby allows a collection of files to be mirrored from one cell to another.</li>
<li>Mirroring is fast because the files are small.</li>
<li>A special “global” cell has a subtree /ls/global/master that is mirrored to the subtree /ls/cell/replica in every other Chubby cell.</li>
<li>Various files in which Chubby cells and other systems advertise their presence to monitoring services.</li>
<li>Pointers to allow clients to locate large data sets such as Bigtable cells, and many configuration files for other systems.</li>
</ul>
<h3 id="scaling-chubby">Scaling Chubby</h3>
<h3 id="agenda-6">Agenda</h3>
<ul>
<li>Proxies</li>
<li>Partitioning</li>
<li>Learning</li>
</ul>
<p>Chubby’s clients are individual processes, so Chubby handles more clients than expected: at Google, 90,000+ clients communicate with a single Chubby server.</p>
<p>Techniques used to reduce communication with the master(since read heavy):</p>
<ul>
<li><strong>Minimize request rate</strong> by creating more chubby cells so that clients almost always use a nearby cell(found with DNS) to avoid reliance on remote machines.</li>
<li><strong>Minimize KeepAlives Load</strong>: KeepAlives are by far the dominant types of request.</li>
<li><strong>Caching:</strong> Clients cache file data, metadata, handles, locks etc.</li>
<li><strong>Simplified protocol conversions:</strong></li>
</ul>
<h3 id="proxies">Proxies</h3>
<ul>
<li>A proxy is an additional server that can act on behalf of the actual server.</li>
<li>A Chubby proxy can handle KeepAlives and read requests.</li>
<li>All writes and first-time reads pass through the proxy to reach the master.</li>
<li>The proxy is responsible for invalidating clients’ caches as well.
<img loading="lazy" src="/images/notes/chubby-2006/image-18.png"></li>
</ul>
<h3 id="partitioning">Partitioning</h3>
<ul>
<li>Need to support 100K clients. How would chubby do that?</li>
<li>Chubby’s interface (files &amp; directories) was designed such that namespaces can easily be partitioned between multiple Chubby cells if needed.</li>
<li>Chubby can partition nodes within a large directory(with lots of sub-directories).</li>
<li>Scenarios in which <strong>partitioning does not help scale</strong>:</li>
</ul>
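<p>A sketch of the partitioning scheme the paper describes: a node is assigned to the partition obtained by hashing its parent directory&rsquo;s name, so a directory and its immediate children land on the same partition (the CRC hash and partition count here are arbitrary choices):</p>
<pre><code class="language-python">import zlib

N_PARTITIONS = 3                          # illustrative cell configuration

def partition(path):
    """Node D/C is served by partition hash(D) mod N."""
    parent = path.rsplit("/", 1)[0]
    return zlib.crc32(parent.encode()) % N_PARTITIONS

print(partition("/ls/cell/app/config"))   # same value for all children of app
print(partition("/ls/cell/app/leader"))
</code></pre>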
<h3 id="learning">Learning</h3>
<ul>
<li><strong>Lack of aggressive caching</strong>: Initially, clients were not caching the absence of files or open file handles. An abusive client could write loops that retry indefinitely when a file is not present or poll a file by opening it and closing it repeatedly when one might expect they would open the file just once. Chubby educated its users to make use of aggressive caching for such scenarios.</li>
<li><strong>Lack of quotas</strong>: Chubby was never intended to be used as a storage system for large amounts of data, so it has no storage quotas. In hindsight, this was naive. To handle this, Chubby later introduced a limit on file size (256kBytes).</li>
<li><strong>Publish/subscribe</strong>: There have been several attempts to use Chubby’s event mechanism as a publish/subscribe system. Chubby is a strongly consistent system, and the way it maintains a consistent cache makes it a slow and inefficient choice for publish/subscribe. Chubby developers caught and stopped such uses early on.</li>
<li><strong>Developers rarely consider availability</strong>: Developers generally fail to think about failure probabilities and wrongly assume that Chubby will always be available. Chubby educated its clients to plan for short Chubby outages so that it has little or no effect on their applications.</li>
</ul>
<h3 id="chubby-as-a-name-service">Chubby as a Name Service?</h3>
<ul>
<li>
<p>Authors were surprised to find that Chubby was most popular for DNS.</p>
</li>
<li>
<p>Hard to pick a good value for TTL, since DNS uses TTL and may serve stale values for some time(up-to 60 secs).</p>
</li>
<li>
<p>Chubby however, via Client side Cache invalidation, provides <strong>Consistent Reads</strong>.</p>
</li>
<li>
<p>E.g. if we start n processes and each process looks up every other one (via DNS), that&rsquo;s n^2 DNS lookups.</p>
</li>
<li>
<p>Chubby sees a thundering herd of reads at client startup (not cached).</p>
</li>
</ul>
<p><strong>Summary:</strong></p>
<ul>
<li>
<p>Distributed Lock Service used inside Google.</p>
</li>
<li>
<p>Provides coarse-grained locking(for minutes, hours or days) and not recommended for fine-grained locking(seconds or less). Suited to read-heavy rather than write-heavy. Although you can build a fine-grained locking system on top of Chubby.</p>
</li>
<li>
<p>A Chubby cell is a Chubby Cluster(usually with 3 or 5 replicas).</p>
</li>
<li>
<p>Using Paxos, one replica in a Cell is chosen as master which handles all read/write requests. If the master fails, a fail-over is performed.</p>
</li>
<li>
<p>Each replica has a local database for files/directories/locks etc. The master writes directly to its own database, and writes are replicated to the other replicas for fault tolerance.</p>
</li>
<li>
<p>Clients use a Chubby Library to communicate with Servers using RPC.</p>
</li>
<li>
<p>The Chubby interface is a Unix-like file system: a tree of files and directories (directories can contain files and sub-directories).</p>
</li>
<li>
<p><strong>Locks</strong>: Each node(file/directory) can act as an advisory reader(<strong>shared</strong>)-writer(<strong>exclusive</strong>) lock.</p>
</li>
<li>
<p><strong>Ephemeral Nodes</strong> to indicate to others that a client is alive.</p>
</li>
<li>
<p><strong>Metadata</strong> includes ACL, Monotonically increasing 64-bit numbers, and CheckSum.</p>
</li>
<li>
<p><strong>Events</strong> mechanism between Chubby Client and server and Client and application for a variety of events like, Lock Acquired, file edited, Jeopardy, Safe etc.</p>
</li>
<li>
<p><strong>Client Caching</strong> to reduce read traffic. Need consistency b/w File, Replica, and its client cache. <strong>Client cache invalidation</strong> using KeepAlive request/responses.</p>
</li>
<li>
<p>Clients maintain <strong>Sessions</strong> using KeepAlive RPCs.</p>
</li>
<li>
<p><strong>Backup</strong>: Snapshot of the database (plus <strong>write-ahead log</strong>) to a GFS file server in a different building.</p>
</li>
<li>
<p><strong>Mirroring</strong>: Collection of files synced from one cell to another.</p>
</li>
</ul>
<p><strong>System Design Patterns:</strong></p>
<ul>
<li>
<p><strong>Write-Ahead Log</strong>: For fault tolerance, to handle a master crash, all database transactions are stored in a transaction log <strong>(on a local drive or on distributed GFS?)</strong></p>
</li>
<li>
<p><strong>Quorum:</strong> To ensure strong consistency, the master gets write acks from a quorum (majority) of replicas before confirming the write to the client.</p>
</li>
<li>
<p><strong>Generation Clock:</strong> Newly elected master uses Epoch number(monotonically increasing) to avoid split brain.</p>
</li>
<li>
<p><strong>Lease:</strong> Chubby client maintains a time-bound session lease with the master.</p>
</li>
</ul>
<p><strong>References:</strong></p>
<ul>
<li>
<p>Chubby Paper</p>
</li>
<li>
<p>Chubby Architecture video</p>
</li>
<li>
<p>Chubby vs ZooKeeper</p>
</li>
<li>
<p>Hierarchical Chubby</p>
</li>
<li>
<p>BigTable</p>
</li>
<li>
<p>GFS</p>
</li>
<li>
<p>Jordan Deep Dive</p>
</li>
</ul>
<hr>
<p><strong>Paper Link:</strong> <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf">https://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Kafka</title>
      <link>https://www.sethihemant.com/notes/kafka-2011/</link>
      <pubDate>Wed, 04 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/kafka-2011/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://notes.stephenholiday.com/Kafka.pdf&#34;&gt;Kafka&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;kafkadistributed-messaging-system&#34;&gt;Kafka/Distributed Messaging System&lt;/h2&gt;
&lt;h3 id=&#34;goal&#34;&gt;Goal&lt;/h3&gt;
&lt;p&gt;Design a &lt;strong&gt;distributed&lt;/strong&gt; &lt;strong&gt;messaging system&lt;/strong&gt; that can &lt;strong&gt;reliably&lt;/strong&gt; transfer a &lt;strong&gt;high throughput&lt;/strong&gt; of &lt;strong&gt;messages&lt;/strong&gt; between different entities.&lt;/p&gt;
&lt;h3 id=&#34;background&#34;&gt;Background&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;One common challenge in distributed systems is handling &lt;strong&gt;continuous influx of data from multiple sources&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;E.g. Imagine a &lt;strong&gt;log aggregation service&lt;/strong&gt; that can receive hundreds of log entries per second from different sources. Function of this log aggregation service is to store these logs on a disk at a shared server and build an index on top of these logs so that they can be searched later.&lt;/li&gt;
&lt;li&gt;Challenges of this service?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed Messaging Systems&lt;/strong&gt;(or Asynchronous processing paradigm) can help.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;what-is-a-messaging-system&#34;&gt;What is a messaging System?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;System responsible for &lt;strong&gt;transferring data amongst various disparate systems&lt;/strong&gt; like apps, services, processes, servers etc, w/o &lt;strong&gt;introducing additional coupling&lt;/strong&gt; b/w producers and consumers, and by providing &lt;strong&gt;asynchronous&lt;/strong&gt; way of communicating b/w sender and receiver.&lt;/li&gt;
&lt;li&gt;Two types of Messaging Systems&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;queue&#34;&gt;Queue&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A Particular message can be consumed by one consumer only.&lt;/li&gt;
&lt;li&gt;Once a message is consumed, it&amp;rsquo;s removed from the queue.&lt;/li&gt;
&lt;li&gt;Limits the system, as the same message can’t be read by multiple consumers.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/kafka-2011/image-1.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;publish-subscribe-messaging-system&#34;&gt;Publish-Subscribe Messaging System&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In the Pub-Sub model, messages are written into Partitions/Topics.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://notes.stephenholiday.com/Kafka.pdf">Kafka</a></p>
<hr>
<h2 id="kafkadistributed-messaging-system">Kafka/Distributed Messaging System</h2>
<h3 id="goal">Goal</h3>
<p>Design a <strong>distributed</strong> <strong>messaging system</strong> that can <strong>reliably</strong> transfer a <strong>high throughput</strong> of <strong>messages</strong> between different entities.</p>
<h3 id="background">Background</h3>
<ul>
<li>One common challenge in distributed systems is handling <strong>continuous influx of data from multiple sources</strong>.</li>
<li>E.g. Imagine a <strong>log aggregation service</strong> that can receive hundreds of log entries per second from different sources. Function of this log aggregation service is to store these logs on a disk at a shared server and build an index on top of these logs so that they can be searched later.</li>
<li>Challenges of this service?</li>
<li><strong>Distributed Messaging Systems</strong>(or Asynchronous processing paradigm) can help.</li>
</ul>
<h3 id="what-is-a-messaging-system">What is a messaging System?</h3>
<ul>
<li>System responsible for <strong>transferring data amongst various disparate systems</strong> like apps, services, processes, servers etc, w/o <strong>introducing additional coupling</strong> b/w producers and consumers, and by providing <strong>asynchronous</strong> way of communicating b/w sender and receiver.</li>
<li>Two types of Messaging Systems</li>
</ul>
<h3 id="queue">Queue</h3>
<ul>
<li>A Particular message can be consumed by one consumer only.</li>
<li>Once a message is consumed, it&rsquo;s removed from the queue.</li>
<li>Limits the system, as the same message can’t be read by multiple consumers.
<img loading="lazy" src="/images/notes/kafka-2011/image-1.png"></li>
</ul>
<h3 id="publish-subscribe-messaging-system">Publish-Subscribe Messaging System</h3>
<ul>
<li>
<p>In the Pub-Sub model, messages are written into Partitions/Topics.</p>
</li>
<li>
<p>Producers write the messages to topics that get persisted in the messaging system.</p>
</li>
<li>
<p>Subscribers subscribe to those topics to receive each message that was published.</p>
</li>
<li>
<p>Pub-Sub model allows multiple consumers to read the same message.</p>
</li>
<li>
<p>Messaging system that stores and handles messages is called a Broker.
<img loading="lazy" src="/images/notes/kafka-2011/image-2.png"></p>
</li>
<li>
<p>Provides a loose coupling b/w producers and consumers so they don’t need to be synchronized. They can read and write <strong>messages at different rates</strong>.</p>
</li>
<li>
<p>Also provides <strong>fault-tolerance</strong>. Messages don’t get lost.</p>
</li>
<li>
<p>A messaging system can be deployed for various reasons:</p>
</li>
</ul>
<h3 id="kafka">Kafka</h3>
<h3 id="agenda">Agenda</h3>
<ul>
<li>What is Kafka</li>
<li>Background</li>
<li>Kafka Use Cases</li>
</ul>
<h3 id="what-is-kafka">What is Kafka?</h3>
<ul>
<li>Open source pub-sub messaging system</li>
<li>Can work as a message-queue as well.</li>
<li>Distributed, Fault tolerant, highly scalable by design.</li>
<li>Fundamentally, a system that takes streams of messages from producers, stores them reliably on a central cluster (with a set of brokers), and allows those messages to be delivered to consumers.
<img loading="lazy" src="/images/notes/kafka-2011/image-3.png"></li>
</ul>
<h3 id="background-1">Background</h3>
<ul>
<li>
<p>Created at LinkedIn in 2010 to track <strong>Page Views(events)</strong>, <strong>Messages</strong> from Messaging Systems, and <strong>Logs</strong> from Various services.</p>
</li>
<li>
<p>Kafka is also known as a <strong>Distributed Commit log or Write Ahead Log or a Transaction Log.</strong></p>
</li>
<li>
<p><strong>Commit Log</strong> is an append-only data structure that can persistently store a sequence of records.</p>
</li>
<li>
<p>Records are always appended to the end of the log, and once added, records cannot be deleted or modified. Reading from a commit log always happens from left to right (or old to new).
<img loading="lazy" src="/images/notes/kafka-2011/image-4.png"></p>
</li>
<li>
<p>Stores all messages on disk and reads and writes take advantage of sequential disk reads/writes.</p>
</li>
</ul>
<h3 id="kafka-use-cases">Kafka Use Cases</h3>
<ul>
<li>Can be used for collecting huge amounts of events (Big Data) and doing real-time stream processing of those events.</li>
<li><strong>Metrics:</strong> Can collect and aggregate monitoring data. Different services can write their metrics which can be later pulled from Kafka to produce aggregate statistics.</li>
<li><strong>Log Aggregation:</strong> Collect logs from various sources, and make them available in standard format to multiple consumers.</li>
<li><strong>Stream Processing:</strong> Cases where data undergoes transformation after reading. E.g. Raw data consumed from the topic is transformed, enriched, aggregated, and pushed to a new topic for further consumption. Sort of creating a derived view of data from the source of record data.</li>
<li><strong>Commit Log:</strong> Can be used as an external commit log for distributed systems which can keep track of their states.</li>
<li><strong>Website Activity Tracking:</strong> One of the original use cases was to build a user-activity tracking pipeline. Events like page clicks and searches are published to separate topics. These topics are made available for later processing, such as loading data into Hadoop (for batch processing) or into data-warehousing systems for analytics and reporting. They can also be fed into <strong>product suggestion or recommendation systems</strong>, powering features such as “similar products you may like” or “people also bought”.</li>
</ul>
<h3 id="high-level-architecture">High Level Architecture</h3>
<h3 id="agenda-1">Agenda</h3>
<ul>
<li>Kafka Common Terms</li>
<li>High-Level Architecture</li>
</ul>
<h3 id="kafka-common-terms">Kafka Common Terms</h3>
<ul>
<li>
<p><strong>Brokers</strong></p>
</li>
<li>
<p><strong>Records</strong>
<img loading="lazy" src="/images/notes/kafka-2011/image-5.png"></p>
</li>
<li>
<p><strong>Topics</strong>
<img loading="lazy" src="/images/notes/kafka-2011/image-6.png"></p>
</li>
<li>
<p><strong>Producers</strong></p>
</li>
<li>
<p><strong>Consumers</strong></p>
</li>
<li>
<p>In Kafka, producers and consumers are fully decoupled and agnostic of each other, which is a key design element to achieve the high scalability that Kafka is known for. For example, producers never need to wait for consumers.</p>
</li>
</ul>
<h3 id="high-level-architecture-1">High Level Architecture</h3>
<h3 id="kafka-cluster">Kafka Cluster</h3>
<ul>
<li>Kafka is run as a cluster of one or more servers, where each server is responsible for running one Kafka broker.</li>
</ul>
<h3 id="zookeeper">ZooKeeper</h3>
<ul>
<li><strong>ZooKeeper</strong> is a <strong>highly read optimized</strong> <strong>distributed key-value store</strong> and is used for coordination and storing configurations.</li>
<li>In Original Version of Kafka, Kafka had used Zookeeper to coordinate between Kafka brokers; ZooKeeper maintains metadata information about the Kafka cluster
<img loading="lazy" src="/images/notes/kafka-2011/image-7.png"></li>
</ul>
<h3 id="kafka-deep-dive">Kafka Deep Dive</h3>
<p>Related Notes: Alex XU II</p>
<h3 id="agenda-2">Agenda</h3>
<ul>
<li>Topic Partitions</li>
<li>High-Water Mark</li>
</ul>
<p>Kafka is simply a collection of topics. As topics can get quite big, they are split into smaller partitions for better performance and scalability.</p>
<h3 id="topic-partitions">Topic Partitions</h3>
<ul>
<li>
<p>Kafka Topics are partitioned, and these partitions are placed on separate nodes/brokers.</p>
</li>
<li>
<p>When a new message is published to a topic, it gets appended to one of the topic&rsquo;s partitions, usually chosen using the producer-specified partition key.</p>
</li>
<li>
<p>A <strong>partition</strong> is an ordered sequence of messages.</p>
</li>
<li>
<p>Kafka guarantees <strong>FIFO ordering</strong> among messages within a single partition. There are no ordering guarantees across partitions or at the topic level.
<img loading="lazy" src="/images/notes/kafka-2011/image-8.png"></p>
</li>
<li>
<p>A Unique Sequence ID called a <strong>Partition Offset</strong> gets assigned to every message added to a partition. Used to identify a message’s sequential position within a partition.</p>
</li>
<li>
<p>Offset sequences are unique to a single partition. Messages are uniquely located using <strong>(Topic, Partition, Offset)</strong>.</p>
</li>
<li>
<p>Producers can choose to publish messages to any partition. If Ordering within a partition is not needed, a <strong>Round-Robin strategy</strong> can be used for evenly partitioning data across nodes.</p>
</li>
<li>
<p>Placing partitions on separate brokers allows multiple consumers to read from a topic in parallel, i.e., different consumers can concurrently read different partitions on separate brokers. However, within the same consumer group, only one consumer can read from a given partition at a time.</p>
</li>
<li>
<p>Messages once written to a partition are immutable(Append Only Log).</p>
</li>
<li>
<p>A producer can specify a <strong>partition key</strong> on any message it publishes so that related data is written to the same partition.</p>
</li>
<li>
<p>Each broker can manage a set of partitions from across various topics.
<img loading="lazy" src="/images/notes/kafka-2011/image-9.png"></p>
</li>
<li>
<p>Follows the principle of <strong>Dumb Broker</strong> and <strong>Smart Consumer</strong>.</p>
</li>
<li>
<p>Kafka doesn’t keep a record of what records are read by the consumer. Consumers poll kafka for new messages and specify which records(specified by partition Offset) they want to read from the topic.</p>
</li>
<li>
<p>Consumers are allowed to increment/decrement the offset to replay and reprocess the messages.</p>
</li>
<li>
<p>Each Topic partition has one leader broker and multiple replica(followers) brokers.</p>
</li>
</ul>
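<p>A toy topic to make (Topic, Partition, Offset) addressing and key-based routing concrete (the class, the CRC hash, and the round-robin fallback for key-less messages are illustrative choices, not Kafka&rsquo;s exact partitioner):</p>
<pre><code class="language-python">import zlib

class Topic:
    """Each partition is an append-only list; a message's offset is simply
    its index within that partition."""
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]
        self._rr = 0

    def publish(self, message, key=None):
        if key is None:
            p = self._rr % len(self.partitions)                  # round-robin
            self._rr += 1
        else:
            p = zlib.crc32(key.encode()) % len(self.partitions)  # sticky per key
        self.partitions[p].append(message)                       # append-only
        return (self.name, p, len(self.partitions[p]) - 1)

t = Topic("page-views", num_partitions=3)
print(t.publish("click:/home", key="user-42"))  # ('page-views', p, 0)
print(t.publish("click:/cart", key="user-42"))  # same p, offset 1: FIFO per key
</code></pre>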
<h3 id="leaders-and-followers">Leaders and Followers</h3>
<ul>
<li>A <strong>leader</strong> is the node responsible for all reads and writes for the given partition. Every partition has one Kafka broker acting as a leader.</li>
<li>To handle Single Point of Failure and to enable Fault Tolerance, Kafka replicates partitions and distributes them across multiple brokers.</li>
<li>Each follower’s responsibility is to replicate the leader&rsquo;s data to serve as a backup partition.</li>
<li>A follower can take over the leadership if the leader of a partition goes down.</li>
<li>Kafka stores the location of the leader of each partition in ZooKeeper</li>
<li>As all writes/reads happen at/from the leader, producers and consumers directly talk to ZooKeeper to find a partition leader.
<img loading="lazy" src="/images/notes/kafka-2011/image-10.png"></li>
</ul>
<h3 id="in-sync-replicasisr">In Sync Replicas(ISR)</h3>
<ul>
<li>An in-sync replica (ISR) is a broker that has the latest data for a given partition.</li>
<li>A follower is an in-sync replica only if it has fully caught up to the partition it is following.</li>
<li><strong>Only ISRs are eligible to become partition leaders.</strong></li>
<li>Kafka can choose the minimum number of ISRs required before the data becomes available for consumers to read.</li>
</ul>
<h3 id="high-water-mark">High Water Mark</h3>
<ul>
<li>To ensure data consistency, the leader broker never returns (or exposes) messages which have not been replicated to a minimum set of ISRs.</li>
<li>Broker uses High Water Mark which is the <strong>highest offset that all ISRs of a particular partition share</strong>.</li>
<li>The leader exposes data only up to the high-water mark offset and propagates the high-water mark offset to all followers.</li>
<li>This avoids the case of a <strong>Non-Repeatable read</strong> in case the Leader crashes before Replicas get the latest messages.
<img loading="lazy" src="/images/notes/kafka-2011/image-11.png"></li>
</ul>
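<p>The high-water-mark rule in miniature (offsets here are log-end offsets, i.e. the next offset each replica would write; the helper is a sketch, not broker code):</p>
<pre><code class="language-python">def high_water_mark(isr_log_end_offsets):
    """Highest offset that every in-sync replica already has."""
    return min(isr_log_end_offsets)

leader_log = ["m0", "m1", "m2", "m3", "m4"]
hwm = high_water_mark([5, 4, 3])   # leader at 5; followers caught up to 4 and 3
print(leader_log[:hwm])            # consumers may read only ['m0', 'm1', 'm2']
</code></pre>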
<h3 id="consumer-groups">Consumer Groups</h3>
<h3 id="agenda-3">Agenda</h3>
<ul>
<li>What is a Consumer Group?</li>
<li>Distributing Partitions to a consumer within Consumer Groups.</li>
</ul>
<h3 id="what-is-a-consumer-group">What is a Consumer Group?</h3>
<ul>
<li>A consumer group is basically a set of one or more consumers working together in parallel to consume messages from topic partitions.</li>
<li>No two consumers within the same Consumer group can attach to the same partition at a time. Thus no two consumers within CG receive the same message.</li>
</ul>
<h3 id="distributing-partitions-to-a-consumer-within-consumer-groups">Distributing Partitions to a consumer within Consumer Groups.</h3>
<ul>
<li>
<p>Kafka ensures that only a single consumer reads messages from any partition within a consumer group</p>
</li>
<li>
<p>Topic partitions are a unit of parallelism</p>
</li>
<li>
<p>If a consumer stops, Kafka spreads partitions across the remaining consumers in the same consumer group</p>
</li>
<li>
<p>Every time a consumer is added to or removed from a group, the consumption is rebalanced within the group.
<img loading="lazy" src="/images/notes/kafka-2011/image-12.png"></p>
</li>
<li>
<p>Parallelizing processing across multiple partitions of a topic, helps support very high Throughput.</p>
</li>
<li>
<p>Kafka stores the current offset per consumer group per topic per partition? <strong>What? Initially we said Kafka is DUMB and that Consumer tracks the offset? [Research]</strong></p>
</li>
<li>
<p>Kafka uses any unused consumers as failovers when there are more consumers than partitions. Extra Consumers are idle in the meantime.</p>
</li>
<li>
<p>Rebalancing happens as Consumers are added and removed from the ConsumerGroups.</p>
</li>
</ul>
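<p>A sketch of partition assignment within a consumer group (simple round-robin, standing in for Kafka&rsquo;s actual assignment strategies):</p>
<pre><code class="language-python">def assign(partitions, consumers):
    """Spread partitions over live consumers so each partition has exactly
    one reader within the group; extra consumers stay idle as failovers."""
    mapping = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        mapping[consumers[i % len(consumers)]].append(p)
    return mapping

print(assign(range(4), ["c1", "c2", "c3"]))  # {'c1': [0, 3], 'c2': [1], 'c3': [2]}
print(assign(range(4), ["c1"]))              # after a rebalance, c1 reads everything
</code></pre>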
<h3 id="kafka-workflow">Kafka Workflow</h3>
<h3 id="agenda-4">Agenda</h3>
<ul>
<li>Kafka Workflow as Pub-Sub messaging</li>
<li>Kafka Workflow for Consumer Group</li>
<li>Kafka provides both pub-sub and queue-based messaging in a fast, reliable, persistent, fault-tolerant, zero-downtime manner.</li>
<li>In both cases, producers simply send messages to a topic, and consumers choose either type of messaging depending on their needs.</li>
</ul>
<h3 id="kafka-workflow-as-pub-sub-messaging">Kafka Workflow as Pub-Sub Messaging</h3>
<ul>
<li>Producer publishes a message to a topic.</li>
<li>Broker stores messages in the partitions configured for that topic. If no partition keys were specified, Broker spreads the messages evenly across partitions.</li>
<li>Consumer subscribes to a specific Topic. Broker provides the current offset of that Topic back to Consumer and saves that Offset to ZooKeeper.</li>
<li>Consumers will request Brokers at regular intervals for new messages and process it once kafka sends those messages.</li>
<li>Once the consumer processes the message, it sends an acknowledgement back to the broker. Broker Updates the processed offsets in the ZooKeeper.</li>
<li>Consumers can rewind/skip to the desired offset and read subsequent messages.</li>
</ul>
<h3 id="role-of-zookeeper">Role of Zookeeper</h3>
<h3 id="agenda-5">Agenda</h3>
<ul>
<li>What is ZooKeeper?</li>
<li>ZooKeeper as Central Coordinator.</li>
</ul>
<h3 id="what-is-zookeeper">What is ZooKeeper?</h3>
<ul>
<li>Distributed configuration and synchronization service.</li>
<li>Serves as the coordination interface between the Kafka brokers, producers, and consumers.</li>
<li>Kafka stores basic metadata in ZooKeeper, such as information about brokers, topics, partitions, partition leader/followers, consumer offsets.
<img loading="lazy" src="/images/notes/kafka-2011/image-13.png"></li>
</ul>
<h3 id="zookeeper-as-the-central-coordinatormight-be-stale-info">ZooKeeper as the central coordinator(Might be Stale info)</h3>
<ul>
<li>Kafka brokers are stateless; they rely on ZooKeeper to maintain and coordinate brokers, such as notifying consumers and producers of the arrival of a new broker or failure of an existing broker, as well as routing all requests to partition leaders.</li>
<li>Stores all sorts of Metadata about the Kafka Cluster</li>
</ul>
<h3 id="how-do-producers-or-consumers-find-out-who-the-leader-of-a-partition-is">How do producers or consumers find out who the leader of a partition is?</h3>
<ul>
<li>
<p>In the older versions of Kafka, all clients (i.e., producers and consumers) used to <strong>directly talk to ZooKeeper</strong> to find the partition leader.</p>
</li>
<li>
<p>Kafka has moved away from this coupling, and in Kafka’s latest releases, <strong>clients fetch metadata information from Kafka brokers directly</strong>;
<img loading="lazy" src="/images/notes/kafka-2011/image-14.png"></p>
</li>
<li>
<p>All the critical information is stored in the ZooKeeper and ZooKeeper replicates this data across its cluster, therefore, failure of Kafka broker (or ZooKeeper itself) does not affect the state of the Kafka cluster.</p>
</li>
<li>
<p>Zookeeper is also responsible for <strong>coordinating the partition leader election</strong> between the Kafka brokers in case of leader failure.</p>
</li>
</ul>
<h3 id="controller-broker">Controller Broker</h3>
<h3 id="agenda-6">Agenda</h3>
<ul>
<li>What is a Controller Broker?</li>
<li>Split Brain.</li>
<li>Generation Clock.</li>
</ul>
<h3 id="what-is-a-controller-broker">What is a Controller Broker?</h3>
<ul>
<li>Within the Kafka cluster, one broker is elected as the Controller.</li>
<li>Controller broker is responsible for admin operations, such as creating/deleting a topic, adding partitions, assigning leaders to partitions, monitoring broker failures by doing health checks on other brokers.</li>
<li>Communicates the result of the partition leader election to other brokers in the system.</li>
</ul>
<h3 id="split-brain">Split Brain</h3>
<ul>
<li>When a controller node dies, Kafka elects a new controller. One problem is that we cannot truly know whether the old controller has stopped for good (crash-stop) or is experiencing an intermittent failure such as a stop-the-world GC pause, a process pause, or a temporary network disruption.</li>
<li>Two split-brain controllers would be giving out conflicting commands in parallel. If something like this happens in a cluster, it can result in major inconsistencies. How do we handle this?</li>
</ul>
<h3 id="generation-clock">Generation Clock?</h3>
<ul>
<li>Split-brain is commonly solved with a generation clock, which is simply a monotonically increasing number to indicate a server’s generation.</li>
<li>In Kafka, the generation clock is implemented through an epoch number, Old leader = epoch 1, and new leader = epoch 2.</li>
<li>This epoch is included in every request that is sent from the Controller to other brokers.</li>
<li>Brokers can now easily differentiate the real Controller by simply trusting the Controller with the highest number.</li>
<li>This epoch number is stored in ZooKeeper.</li>
</ul>
<h3 id="kafka-delivery-semantics">Kafka Delivery Semantics?</h3>
<h3 id="agenda-7">Agenda</h3>
<ul>
<li>Producer Delivery Semantics</li>
<li>Consumer Delivery Semantics</li>
</ul>
<h3 id="producer-delivery-semantics">Producer Delivery Semantics</h3>
<ul>
<li>A producer writes only to the leader broker, and the followers asynchronously replicate the data.</li>
<li>How can a producer know that the data is successfully stored at the leader or that the followers are keeping up with the leader?</li>
<li>Kafka offers <strong>three options</strong> to denote the number of brokers that must receive the record before the producer considers the write as successful:</li>
</ul>
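<p>These three options correspond to the producer&rsquo;s <code>acks</code> setting (0, 1, all). A sketch of the decision each setting makes:</p>
<pre><code class="language-python">def write_succeeded(acks, leader_has_it, num_isr_with_it, num_isr):
    if acks == "0":
        return True                        # fire and forget; may be lost
    if acks == "1":
        return leader_has_it               # lost if the leader dies before sync
    if acks == "all":
        return num_isr_with_it == num_isr  # every in-sync replica has it
    raise ValueError(acks)

print(write_succeeded("1", True, 1, 3))    # True
print(write_succeeded("all", True, 2, 3))  # False: still replicating
</code></pre>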
<h3 id="consumer-delivery-semantics">Consumer Delivery Semantics</h3>
<ul>
<li>A consumer can read only those messages that have been written to a set of in-sync replicas(<strong>High Water Mark</strong>).</li>
<li>There are three ways of providing consistency to the consumer:</li>
</ul>
<h3 id="kafka-characteristics">Kafka Characteristics</h3>
<h3 id="agenda-8">Agenda</h3>
<ul>
<li>Storing messages to disks</li>
<li>Record Retention in Kafka</li>
<li>Client Quota</li>
<li>Kafka Performance</li>
</ul>
<h3 id="storing-messages-to-disks">Storing messages to disks</h3>
<ul>
<li>Kafka writes its messages to the local disk and does not keep anything in RAM. Disk storage is important for durability so that the messages will not disappear if the system dies and restarts.</li>
<li>Even though disk access is generally considered to be slow, there is a huge performance difference b/w Random Block Access and Sequential Access.</li>
<li>Random block access is slower because of numerous disk seeks, whereas the sequential nature of writing or reading enables disk operations to be thousands of times faster than random access.</li>
<li>Because all writes and reads happen sequentially, <strong>Kafka has a very high throughput</strong>.</li>
<li>Writing or reading sequentially from disks are <strong>heavily optimized</strong> by the OS, via <strong>read-ahead</strong> (prefetch large block multiples) and <strong>write-behind</strong> (group small logical writes into big physical writes) techniques.</li>
<li>Also, modern operating systems cache the disk in free RAM. This is called Pagecache.</li>
<li>Since Kafka stores messages in a standardized binary format unmodified throughout the whole flow (producer → broker → consumer), it can make use of the zero-copy optimization.</li>
<li>Kafka has a protocol that groups messages together. This allows network requests to group messages together and reduces network overhead.</li>
</ul>
<h3 id="record-retention-in-kafka">Record Retention in Kafka</h3>
<ul>
<li>By default, Kafka retains records until it runs out of disk space. We can instead set time-based limits (a configurable retention period), size-based limits (a configurable maximum partition size), or compaction (keep only the latest version of each record, using the key).</li>
<li>For example, we can set a retention policy of three days, or two weeks, or a month, etc.</li>
<li>The records in the topic are available for consumption until discarded by time, size, or compaction.</li>
</ul>
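<p>As a sketch, the limits above map to per-topic configs, which can be set through kafka-python’s admin client (topic name and values are assumptions):</p>
<pre><code class="language-python">from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Keep records for three days OR until the partition exceeds ~1 GiB, whichever
# comes first; switch cleanup.policy to "compact" to keep the latest value per key.
admin.alter_configs([
    ConfigResource(ConfigResourceType.TOPIC, "orders", configs={
        "retention.ms": str(3 * 24 * 60 * 60 * 1000),
        "retention.bytes": str(1024 ** 3),
        "cleanup.policy": "delete",
    })
])
</code></pre>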
<h3 id="client-quota">Client Quota</h3>
<ul>
<li>Heavy Hitters (Noisy Neighbours) can exhaust broker resources or saturate the network in multi-tenant Kafka clusters, denying service to other clients and to the brokers themselves.</li>
<li>In Kafka, quotas are <strong>byte-rate</strong> thresholds defined per <strong>client-ID(application).</strong></li>
<li>The broker does not return an error when a client exceeds its quota but instead attempts to slow the client down by holding the client’s response for enough time to keep the client under the quota.</li>
<li>This also prevents clients from having to implement special back-off and retry behavior.</li>
</ul>
<h3 id="kafka-performance">Kafka Performance</h3>
<ul>
<li>Scalability</li>
<li>Fault Tolerance and Reliability</li>
<li>Throughput</li>
<li>Low Latency?</li>
</ul>
<h3 id="system-design-pattern">System Design Pattern:</h3>
<ul>
<li>
<p><strong>High Water Mark</strong> - To deal with Non-Repeatable reads and data consistency.</p>
</li>
<li>
<p><strong>Leader and Follower</strong> - Leader serves read/writes. Followers do replication.</p>
</li>
<li>
<p><strong>Split-Brain</strong> - Multiple Controller nodes active at a time(due to Zombie Controller). Generational Epoch number to resolve.</p>
</li>
<li>
<p><strong>Segmented Log</strong> - Log segmentation to implement storage for its partitions.</p>
</li>
</ul>
<h3 id="references">References:</h3>
<ul>
<li>
<p>Confluent Docs</p>
</li>
<li>
<p>NYTimes usecase</p>
</li>
<li>
<p>Kafka Summit 2019</p>
</li>
<li>
<p>Kafka Acks explained(TODO)</p>
</li>
<li>
<p>Kafka as distributed log</p>
</li>
<li>
<p>Minimizing Kafka Latency(TODO)</p>
</li>
<li>
<p>Kafka Internal Storage(TODO)</p>
</li>
<li>
<p>Exactly once semantics(TODO)</p>
</li>
<li>
<p>Split Brain(TODO)</p>
</li>
</ul>
<h3 id="open-questions">Open Questions:</h3>
<ul>
<li>
<p>Kafka stores the current offset per consumer group per topic per partition? <strong>What? Initially we said Kafka is DUMB and that Consumer tracks the offset? [Research]</strong></p>
</li>
<li>
<p>In At-most-once consumer delivery semantics, Why can’t the consumer read from the previous offset? Why are messages said to be lost?[Research]</p>
</li>
<li>
<p>Exactly once semantics? How would transactions happen across 2 systems(consumer processing + Kafka Offset Commit). How are they suggesting the transaction would be rolled back?</p>
</li>
<li>
<p>Zero Copy Optimization</p>
</li>
<li>
<p>Page Cache optimization Kafka</p>
</li>
<li>
<p>How does replication internal work b/w leader follower?</p>
</li>
<li>
<p>Tombstoning in Kafka.</p>
</li>
</ul>
<h2 id="1-zero-copy-optimizations--page-cache-in-kafka-and-other-systems"><strong>1️⃣ Zero Copy Optimizations &amp; Page Cache in Kafka and Other Systems</strong></h2>
<h3 id="-what-is-zero-copy"><strong>📌 What is Zero Copy?</strong></h3>
<p>Zero Copy is a <strong>kernel-level optimization</strong> that allows data to be transferred between disk and network <strong>without passing through user-space memory</strong>, <strong>reducing CPU overhead and increasing throughput</strong>.</p>
<p>🚀 <strong>Why is Zero Copy important?</strong></p>
<p>✔ <strong>Reduces CPU usage</strong> (since data isn’t copied multiple times).</p>
<p>✔ <strong>Minimizes context switches</strong> (between user and kernel space).</p>
<p>✔ <strong>Improves I/O throughput</strong> (as memory copying is avoided).</p>
<h3 id="-how-kafka-uses-zero-copy-sendfile-optimization"><strong>📌 How Kafka Uses Zero Copy (Sendfile Optimization)</strong></h3>
<p>Kafka uses <strong>Zero Copy via the sendfile system call</strong> in Linux.</p>
<p>🔹 <strong>Without Zero Copy (Traditional Path)</strong></p>
<ol>
<li>
<p>Kafka reads a log file from disk → <strong>(Disk → Kernel Space).</strong></p>
</li>
<li>
<p>The kernel copies data to Kafka’s user-space buffer → <strong>(Kernel Space → User Space).</strong></p>
</li>
<li>
<p>Kafka writes the buffer to a network socket → <strong>(User Space → Kernel Space → Network).</strong></p>
</li>
<li>
<p>The kernel sends data over the network.</p>
</li>
</ol>
<p>🔹 <strong>With Zero Copy (Optimized Path)</strong></p>
<ol>
<li>
<p>Kafka calls sendfile() → <strong>Kernel directly transfers a log file to the network socket</strong>.</p>
</li>
<li>
<p><strong>No user-space buffer required</strong> → Data goes <strong>directly from disk to network</strong>.</p>
</li>
</ol>
<p>✔ <strong>Avoids unnecessary copies in user-space.</strong></p>
<p>✔ <strong>Greatly improves throughput</strong> (Kafka can achieve <strong>millions of messages per second</strong>).</p>
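<p>The idea can be seen with the raw system call; a minimal sketch using Python’s <code>os.sendfile</code> on Linux (the socket and file path are placeholders):</p>
<pre><code class="language-python">import os
import socket

# Zero-copy sketch: the kernel moves file bytes straight to the socket buffer;
# the bytes never enter this process's user-space memory.
def serve_file(conn: socket.socket, path: str) -&gt; None:
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset &lt; size:
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            offset += sent
</code></pre>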
<h3 id="-zero-copy-optimizations-in-other-systems"><strong>📌 Zero Copy Optimizations in Other Systems</strong></h3>
<h3 id="-what-is-page-cache-and-how-kafka-optimizes-it"><strong>📌 What is Page Cache and How Kafka Optimizes It?</strong></h3>
<p>Kafka <strong>doesn’t need a traditional database cache</strong>. Instead, it relies on the <strong>OS page cache</strong> for fast reads.</p>
<p>✔ <strong>Page Cache:</strong> The Linux kernel automatically caches recently read disk pages in memory.</p>
<p>✔ <strong>Kafka uses the Page Cache to serve reads directly from memory without hitting disk.</strong></p>
<p>🔹 <strong>How Page Cache Works in Kafka:</strong></p>
<ol>
<li>When a consumer reads a message, Kafka <strong>first checks the OS page cache</strong>.</li>
<li>If the data is cached, it is served <strong>directly from memory</strong> (zero disk I/O).</li>
<li>If the data isn’t in cache, Kafka reads it from disk, and the OS automatically caches it.</li>
</ol>
<p>🚀 <strong>Optimizations in Kafka:</strong></p>
<p>✔ <strong>Uses sendfile() to directly transfer from Page Cache to network.</strong></p>
<p>✔ <strong>Leverages sequential disk access (append-only logs) for high read efficiency.</strong></p>
<p>✔ <strong>Minimizes JVM heap memory usage by relying on OS caching.</strong></p>
<h2 id="2-how-replication-works-between-leaders-and-followers-in-kafka"><strong>2️⃣ How Replication Works Between Leaders and Followers in Kafka</strong></h2>
<p>Kafka ensures <strong>fault tolerance</strong> and <strong>high availability</strong> using <strong>replication</strong>.</p>
<h3 id="-basics-of-kafka-replication"><strong>📌 Basics of Kafka Replication</strong></h3>
<p>✔ Each <strong>Kafka topic is partitioned</strong>, and each partition has:</p>
<ul>
<li><strong>One Leader</strong> (handles all reads &amp; writes).</li>
<li><strong>One or more Followers</strong> (replicas of the leader’s data).</li>
</ul>
<h3 id="-steps-in-kafka-replication"><strong>📌 Steps in Kafka Replication</strong></h3>
<p>1️⃣ <strong>Producer writes data to the Leader Partition</strong>.</p>
<p>2️⃣ <strong>Leader appends data to its local log segment</strong>.</p>
<p>3️⃣ <strong>Followers fetch new data from the leader</strong>.</p>
<p>4️⃣ <strong>Followers append data to their own log segment</strong>.</p>
<p>5️⃣ <strong>Followers send an acknowledgment (ACK) once they persist the data</strong>.</p>
<p>6️⃣ <strong>Once all in-sync replicas have acknowledged, Kafka considers the message committed</strong>.</p>
<h3 id="-leader-and-follower-sync-mechanism"><strong>📌 Leader and Follower Sync Mechanism</strong></h3>
<p>✔ <strong>Kafka uses a pull-based replication model</strong> → Followers <strong>poll</strong> the leader to fetch new data.</p>
<p>✔ <strong>Offset tracking:</strong> Followers maintain an <strong>offset</strong> to track the latest committed message.</p>
<p>✔ <strong>ISR (In-Sync Replicas):</strong> Only replicas in sync with the leader are part of the ISR.</p>
<h3 id="-how-a-new-leader-is-elected"><strong>📌 How a New Leader is Elected?</strong></h3>
<p>✔ If the Leader fails, <strong>one of the ISR replicas is promoted</strong>.</p>
<p>✔ The new Leader <strong>starts serving read and write requests</strong>.</p>
<p>✔ If no ISR exists, the partition becomes <strong>temporarily unavailable</strong> until a new Leader is available.</p>
<h3 id="-replication-strategies"><strong>📌 Replication Strategies</strong></h3>
<p>🚀 <strong>Tuning Replication Settings for Performance</strong> ✔ <strong>min.insync.replicas = 2</strong> → Ensures durability (at least two replicas must ACK).</p>
<p>✔ <strong>unclean.leader.election = false</strong> → Prevents unsafe leader elections (data loss risk).</p>
<p>✔ <strong>replica.lag.time.max.ms = 10,000</strong> → Defines when a slow follower is removed from ISR.</p>
<h2 id="3-tombstoning-in-kafka"><strong>3️⃣ Tombstoning in Kafka</strong></h2>
<p>Kafka <strong>Tombstoning</strong> is used for <strong>deleting records in log-compacted topics</strong>.</p>
<h3 id="-why-is-tombstoning-needed"><strong>📌 Why is Tombstoning Needed?</strong></h3>
<p>✔ Kafka <strong>never deletes data immediately</strong>.</p>
<p>✔ Instead, Kafka <strong>marks the record as deleted (tombstone message)</strong>.</p>
<p>✔ The actual data is <strong>removed later during log compaction</strong>.</p>
<h3 id="-how-tombstoning-works"><strong>📌 How Tombstoning Works</strong></h3>
<ol>
<li>Producer sends a <strong>null value for a key</strong> (marks it as deleted).</li>
<li>Kafka appends this <strong>tombstone message</strong> to the log.</li>
<li>The <strong>consumer sees the tombstone event</strong> and removes the record from its own storage.</li>
<li>Kafka’s <strong>log compaction</strong> eventually purges the tombstone message and the original record.</li>
</ol>
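<p>A sketch of producing a tombstone with kafka-python (topic and key are made up): the tombstone is simply a record whose value is <code>None</code> for an existing key.</p>
<pre><code class="language-python">from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# A null value for an existing key marks "user_123" as deleted in a compacted topic.
producer.send("user-profiles", key=b"user_123", value=None)
producer.flush()
</code></pre>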
<h3 id="-example-of-tombstone-message"><strong>📌 Example of Tombstone Message</strong></h3>
<pre><code class="language-json">{
  "key": "user_123",
  "value": null,
  "timestamp": 1700000000
}
</code></pre>
<hr>
<p>✔ This <strong>soft deletes</strong> &ldquo;user_123&rdquo;.</p>
<p>✔ <strong>Log compaction</strong> later removes both the <strong>original record and the tombstone</strong>.</p>
<h3 id="-how-log-compaction-works"><strong>📌 How Log Compaction Works</strong></h3>
<ul>
<li><strong>Log compaction keeps only the latest value for each key.</strong></li>
<li><strong>Tombstones stay in the log until Kafka compacts the segment.</strong></li>
<li><strong>Kafka guarantees at least one copy of the latest record is retained</strong> (even after compaction).</li>
</ul>
<p>🚀 <strong>Tuning Log Compaction</strong></p>
<p>✔ <strong>log.cleanup.policy = compact</strong> → Enables log compaction.</p>
<p>✔ <strong>delete.retention.ms = 86400000</strong> → Keeps tombstones for 24 hours before purging.</p>
<p>✔ <strong>log.segment.bytes</strong> → Sets the segment size; only closed (inactive) segments are eligible for compaction.</p>
<h2 id="-summary--key-takeaways"><strong>🔹 Summary &amp; Key Takeaways</strong></h2>
<h3 id="zero-copy--page-cache"><strong>Zero Copy &amp; Page Cache</strong></h3>
<p>✔ <strong>Kafka uses sendfile() for Zero Copy, avoiding unnecessary memory copies.</strong></p>
<p>✔ <strong>Page Cache stores recent messages, reducing disk I/O.</strong></p>
<h3 id="replication-between-leaders-and-followers"><strong>Replication Between Leaders and Followers</strong></h3>
<p>✔ <strong>Kafka uses asynchronous, pull-based replication for performance.</strong></p>
<p>✔ <strong>ISR (In-Sync Replicas) ensures durability.</strong></p>
<p>✔ <strong>New leaders are elected from the ISR in case of failure.</strong></p>
<h3 id="tombstoning--log-compaction"><strong>Tombstoning &amp; Log Compaction</strong></h3>
<p>✔ <strong>Kafka uses tombstones (null values) for soft deletes.</strong></p>
<p>✔ <strong>Log compaction removes older records but keeps the latest one per key.</strong></p>
<hr>
<p><strong>Paper Link:</strong> <a href="https://notes.stephenholiday.com/Kafka.pdf">https://notes.stephenholiday.com/Kafka.pdf</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Cassandra</title>
      <link>https://www.sethihemant.com/notes/cassandra-2009/</link>
      <pubDate>Tue, 03 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/cassandra-2009/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf&#34;&gt;Cassandra&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;cassandra--distributed-wide-column-nosql-database&#34;&gt;Cassandra / Distributed Wide Column NoSQL Database&lt;/h2&gt;
&lt;h3 id=&#34;goal&#34;&gt;Goal&lt;/h3&gt;
&lt;p&gt;Design a &lt;strong&gt;distributed&lt;/strong&gt; and &lt;strong&gt;scalable&lt;/strong&gt; system that can store a &lt;strong&gt;huge amount of semi-structured data&lt;/strong&gt;, which is &lt;strong&gt;indexed by a row key&lt;/strong&gt; where each row can have an &lt;strong&gt;unbounded number of columns.&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id=&#34;background&#34;&gt;Background&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Open source Apache Project developed at FB in 2007 for Inbox Search feature.&lt;/li&gt;
&lt;li&gt;Designed to provide &lt;strong&gt;Scalability&lt;/strong&gt;, &lt;strong&gt;Availability&lt;/strong&gt;, &lt;strong&gt;Reliability&lt;/strong&gt; to store &lt;strong&gt;large amounts&lt;/strong&gt; of data.&lt;/li&gt;
&lt;li&gt;Combines &lt;strong&gt;distributed&lt;/strong&gt; nature of Amazon’s Dynamo(K-V store) and &lt;strong&gt;DataModel&lt;/strong&gt; for Google’s BigTable which is a Column based store.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decentralized&lt;/strong&gt; architecture with &lt;strong&gt;no Single Point of Failure&lt;/strong&gt;(SPOF), Performance can &lt;strong&gt;scale linearly&lt;/strong&gt; with addition of nodes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;what-is-cassandra&#34;&gt;What is Cassandra?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Cassandra is typically classified as an &lt;strong&gt;AP&lt;/strong&gt; (i.e., &lt;strong&gt;Available&lt;/strong&gt; and &lt;strong&gt;Partition Tolerant&lt;/strong&gt;) system which means that &lt;strong&gt;availability&lt;/strong&gt; and &lt;strong&gt;partition tolerance&lt;/strong&gt; are generally considered more important than the &lt;strong&gt;consistency&lt;/strong&gt;. &lt;strong&gt;Eventually Consistent&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Similar to Dynamo, Cassandra can be tuned with &lt;strong&gt;replication-factor&lt;/strong&gt; and &lt;strong&gt;consistency levels&lt;/strong&gt; to meet strong consistency requirements, but this comes with a performance cost.&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;peer-to-peer&lt;/strong&gt; architecture where each node communicates to all other nodes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;cassandra-use-cases&#34;&gt;Cassandra Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Any application where &lt;strong&gt;eventual consistency&lt;/strong&gt; is not a concern can utilize Cassandra.&lt;/li&gt;
&lt;li&gt;Cassandra is optimized for &lt;strong&gt;high throughput&lt;/strong&gt; &lt;strong&gt;writes.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Can be used for collecting big data for performing real-time analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storing key-value data with high availability&lt;/strong&gt;(Reddit/Dig) because of linear scaling w/o downtime.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time Series Data Model&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write Heavy Applications&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NoSQL&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;high-level-architecture&#34;&gt;High Level Architecture&lt;/h3&gt;
&lt;h3 id=&#34;agenda&#34;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Cassandra Common Terms&lt;/li&gt;
&lt;li&gt;High Level Architecture&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;cassandra-common-terms&#34;&gt;Cassandra Common Terms&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Column&lt;/strong&gt;: A Key-Value pair. Most basic unit of data structure in Cassandra.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf">Cassandra</a></p>
<hr>
<h2 id="cassandra--distributed-wide-column-nosql-database">Cassandra / Distributed Wide Column NoSQL Database</h2>
<h3 id="goal">Goal</h3>
<p>Design a <strong>distributed</strong> and <strong>scalable</strong> system that can store a <strong>huge amount of semi-structured data</strong>, which is <strong>indexed by a row key</strong> where each row can have an <strong>unbounded number of columns.</strong></p>
<h3 id="background">Background</h3>
<ul>
<li>Open source Apache Project developed at FB in 2007 for Inbox Search feature.</li>
<li>Designed to provide <strong>Scalability</strong>, <strong>Availability</strong>, <strong>Reliability</strong> to store <strong>large amounts</strong> of data.</li>
<li>Combines the <strong>distributed</strong> nature of Amazon’s Dynamo (a K-V store) with the <strong>data model</strong> of Google’s BigTable, a column-based store.</li>
<li><strong>Decentralized</strong> architecture with <strong>no Single Point of Failure</strong>(SPOF), Performance can <strong>scale linearly</strong> with addition of nodes.</li>
</ul>
<h3 id="what-is-cassandra">What is Cassandra?</h3>
<ul>
<li>Cassandra is typically classified as an <strong>AP</strong> (i.e., <strong>Available</strong> and <strong>Partition Tolerant</strong>) system which means that <strong>availability</strong> and <strong>partition tolerance</strong> are generally considered more important than the <strong>consistency</strong>. <strong>Eventually Consistent</strong></li>
<li>Similar to Dynamo, Cassandra can be tuned with <strong>replication-factor</strong> and <strong>consistency levels</strong> to meet strong consistency requirements, but this comes with a performance cost.</li>
<li>Uses <strong>peer-to-peer</strong> architecture where each node communicates to all other nodes.</li>
</ul>
<h3 id="cassandra-use-cases">Cassandra Use Cases</h3>
<ul>
<li>Any application where <strong>eventual consistency</strong> is not a concern can utilize Cassandra.</li>
<li>Cassandra is optimized for <strong>high throughput</strong> <strong>writes.</strong></li>
<li>Can be used for collecting big data for performing real-time analysis.</li>
<li><strong>Storing key-value data with high availability</strong>(Reddit/Dig) because of linear scaling w/o downtime.</li>
<li><strong>Time Series Data Model</strong></li>
<li><strong>Write Heavy Applications</strong></li>
<li><strong>NoSQL</strong></li>
</ul>
<h3 id="high-level-architecture">High Level Architecture</h3>
<h3 id="agenda">Agenda</h3>
<ul>
<li>Cassandra Common Terms</li>
<li>High Level Architecture</li>
</ul>
<h3 id="cassandra-common-terms">Cassandra Common Terms</h3>
<ul>
<li>
<p><strong>Column</strong>: A Key-Value pair. Most basic unit of data structure in Cassandra.</p>
</li>
<li>
<p><strong>Row:</strong> Container for columns referenced by the primary key.
<img loading="lazy" src="/images/notes/cassandra-2009/image-1.png"></p>
</li>
<li>
<p><strong>Table:</strong> Container of rows.</p>
</li>
<li>
<p><strong>KeySpace:</strong> Container for tables that span one or more Cassandra nodes.</p>
</li>
<li>
<p><strong>Cluster</strong> : Container of KeySpaces.</p>
</li>
<li>
<p><strong>Node:</strong> Computer system running a Cassandra instance; a physical host, a VM, or even a Docker container.</p>
</li>
</ul>
<h3 id="data-partitioning">Data Partitioning</h3>
<ul>
<li>Cassandra uses Consistent Hashing similar to Dynamo.</li>
</ul>
<h3 id="cassandra-keys">Cassandra Keys</h3>
<ul>
<li>Mechanisms used by Cassandra to uniquely identify the rows.</li>
<li>Primary Key uniquely identifies each row of a table.</li>
<li><strong>Primary Key = Partition Key + Clustering Key</strong>
<img loading="lazy" src="/images/notes/cassandra-2009/image-2.png"></li>
</ul>
<h3 id="clustering-keys">Clustering Keys</h3>
<ul>
<li><strong>Clustering keys</strong> define how the data is stored within a node. Can have multiple clustering keys.</li>
</ul>
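<p>A sketch in CQL through the Python driver (keyspace, table, and columns are made up): the first component of <code>PRIMARY KEY</code> is the partition key; the remaining components are clustering keys that sort rows within each partition.</p>
<pre><code class="language-python">from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")

# user_id is the partition key; created_at and item_id are clustering keys
# that order rows on disk within each partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_user (
        user_id    uuid,
        created_at timestamp,
        item_id    uuid,
        amount     int,
        PRIMARY KEY ((user_id), created_at, item_id)
    ) WITH CLUSTERING ORDER BY (created_at DESC, item_id ASC)
""")
</code></pre>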
<h3 id="partitioner">Partitioner</h3>
<ul>
<li>
<p>Component which is responsible for determining how the data is distributed on the consistent hashing ring.</p>
</li>
<li>
<p>When Cassandra inserts data, the partitioner applies a hashing algorithm to the partition key to determine which range (and the corresponding node) the data belongs to.
<img loading="lazy" src="/images/notes/cassandra-2009/image-3.png"></p>
</li>
<li>
<p>Cassandra uses the Murmur3 hashing function (default).</p>
</li>
<li>
<p>In Cassandra’s default configuration, a token is a 64-bit integer, giving a possible token range of [-2^63, 2^63 - 1]. How does it differ from Dynamo?</p>
</li>
<li>
<p>All nodes learn about token assignment of other nodes through <strong>Gossip</strong>.
<img loading="lazy" src="/images/notes/cassandra-2009/image-4.png"></p>
</li>
</ul>
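<p>A toy sketch of what the partitioner does (MD5 here purely for illustration; the default is Murmur3): hash the partition key to a token, then walk the ring clockwise to the owning node.</p>
<pre><code class="language-python">import bisect
import hashlib

# Toy token ring: each node owns a few tokens (MD5 used only for illustration).
ring = sorted(
    (int(hashlib.md5(f"{node}-{i}".encode()).hexdigest(), 16), node)
    for node in ("node-a", "node-b", "node-c") for i in range(4)
)
tokens = [token for token, _ in ring]

def owner(partition_key: bytes) -&gt; str:
    token = int(hashlib.md5(partition_key).hexdigest(), 16)
    i = bisect.bisect_right(tokens, token) % len(ring)  # clockwise successor
    return ring[i][1]

print(owner(b"user_123"))
</code></pre>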
<h3 id="replication">Replication</h3>
<h3 id="agenda-1">Agenda</h3>
<ul>
<li>Replication Factor</li>
<li>Replication Strategy.</li>
<li>Each node in Cassandra serves as a replica for a different range of data. Replication factor decides how many replicas the system would have, which is the number of nodes that will receive the copy of the same data.</li>
<li>The node that owns the range in which hash of the partition key falls is the first replica. All additional replicas are placed on the consecutive nodes in a clockwise manner.</li>
<li><strong>Simple Replication Strategy</strong></li>
<li><strong>Network Topology Strategy</strong>
<img loading="lazy" src="/images/notes/cassandra-2009/image-5.png"></li>
</ul>
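<p>A sketch of the two strategies above in CQL (keyspace and data-center names are assumptions):</p>
<pre><code class="language-python">from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# SimpleStrategy: replicas on the next N-1 clockwise nodes (single DC / dev).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS dev_ks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# NetworkTopologyStrategy: per-data-center replica counts (multi-DC production).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS prod_ks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2}
""")
</code></pre>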
<h3 id="cassandra-consistency-levels">Cassandra Consistency Levels</h3>
<h3 id="agenda-2">Agenda</h3>
<ul>
<li>Cassandra Consistency Levels</li>
<li>Write Consistency Levels</li>
<li>Read consistency level</li>
<li>Snitch</li>
</ul>
<h3 id="cassandra-consistency-levels-1">Cassandra Consistency Levels</h3>
<ul>
<li><strong>Minimum</strong> number of Cassandra nodes that must fulfill a read or write operation before the operation can be considered successful.</li>
<li>Has Tunable Consistency levels for reads and writes.</li>
<li>Tradeoff b/w Consistency and performance.</li>
</ul>
<h3 id="write-consistency-levels">Write Consistency Levels</h3>
<ul>
<li><strong>One</strong>, <strong>Two</strong>, or <strong>Three:</strong> Success acknowledgement from the specified number of replica nodes.</li>
<li><strong>Quorum:</strong> Data must be written to at least the Majority Quorum of nodes.</li>
<li><strong>All:</strong> Data is written to all nodes.</li>
<li><strong>Local Quorum</strong>: Data is written to the Quorum of nodes in the same data center as the coordinator. Don&rsquo;t wait for responses from other Data Centers.</li>
<li><strong>Each Quorum:</strong> Data written to the Quorum of nodes in each data center.</li>
<li><strong>Any:</strong> Data written to at least one node.</li>
<li><strong>Performing Write Operation?</strong></li>
</ul>
<h3 id="hinted-handoff"><strong>Hinted Handoff</strong>?</h3>
<p><img loading="lazy" src="/images/notes/cassandra-2009/image-6.png"></p>
<ul>
<li>When the node where the data was supposed to be written for Quorum was down comes online again, how should we write data to it? Cassandra accomplishes this through a <strong>Hinted handoff</strong>.</li>
<li>[FAILURE MODE] When a node is down or does not respond to a write request, the <strong>coordinator node writes a hint in a text file on the local disk</strong>. This <strong>hint contains the data itself along with information about which node the data belongs to</strong>. When the coordinator node discovers (via the <strong>Gossip Protocol</strong>) that a node for which it holds hints has recovered, it forwards the write requests for each hint to the target. In addition, every ten minutes each node checks whether any failed node for which it is holding hints has recovered.</li>
<li>[FAILURE MODE] If a node is offline for some time, the hints can build up considerably on other nodes. Now, when the failed node comes back online, other nodes tend to flood that node with write requests. This can cause issues on the node, as it is already trying to come back after a failure.</li>
<li>Cassandra by <strong>default stores hints for 3 hours</strong>. After 3 hours, older hints are removed; if the failed node comes back up after that, hinted handoff won’t happen and the node will hold stale data. Stale data can be fixed by <strong>Read-Repair</strong> (on the <strong>Read Path</strong>).</li>
<li>When the cluster cannot meet the client’s consistency level, Cassandra fails the write request and doesn’t store a hint.</li>
</ul>
<h3 id="read-consistency-levels">Read Consistency Levels</h3>
<ul>
<li>Specifies how many replica nodes must respond to a read request before returning the data.</li>
<li>Supports the same levels as write operations except <strong>Each Quorum</strong>, which would be too expensive for reads.</li>
<li>R + W &gt; Replication Factor can give Strong consistency levels in Cassandra?[Research]</li>
<li>Cassandra uses <strong>Snitch</strong>, an application that determines the proximity of nodes within the ring and also tells which nodes are faster and cassandra uses this to route read/write requests.
<img loading="lazy" src="/images/notes/cassandra-2009/image-7.png"></li>
</ul>
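<p>A sketch with the Python driver (schema borrowed from the earlier table example): consistency is set per statement, and with a replication factor of 3, QUORUM writes plus QUORUM reads give R + W &gt; RF, so every read quorum overlaps the latest write quorum.</p>
<pre><code class="language-python">import uuid
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")
user_id, item_id = uuid.uuid4(), uuid.uuid4()

# W = QUORUM: two of three replicas must acknowledge the write...
write = SimpleStatement(
    "INSERT INTO orders_by_user (user_id, created_at, item_id, amount) "
    "VALUES (%s, toTimestamp(now()), %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (user_id, item_id, 100))

# ...R = QUORUM: any two replicas must include one that saw the write.
read = SimpleStatement(
    "SELECT * FROM orders_by_user WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(read, (user_id,))
</code></pre>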
<h3 id="how-does-cassandra-perform-a-read-operation">How does Cassandra perform a Read Operation?</h3>
<ul>
<li>
<p>Coordinator sends the read request to the fastest node(using Snitch).</p>
</li>
<li>
<p>E.g., with Quorum R = 2, the coordinator sends the read request to the fastest node and asks for a <strong>digest of the data</strong> from the second-fastest node.</p>
</li>
<li>
<p>If the digest does not match, it means some replicas do not have the latest version of the data. In this case, the coordinator reads the data from all the replicas to determine the latest data.</p>
</li>
<li>
<p>The coordinator then returns the latest data to the client and initiates a read repair request.
<img loading="lazy" src="/images/notes/cassandra-2009/image-8.png"></p>
</li>
<li>
<p>The latest write-timestamp is used as a marker for the correct version of data[Research?] in Cassandra? Conflict resolution? Last write wins or Vector Clocks? Data Loss?</p>
</li>
<li>
<p>The read repair operation is performed only in a portion of the total reads to avoid performance degradation.</p>
</li>
<li>
<p>By default, Cassandra tries to <strong>read-repair</strong> 10% of all requests with DC local read repair.</p>
</li>
</ul>
<h3 id="snitch">Snitch</h3>
<ul>
<li>Snitch keeps track of <strong>network topology</strong> of Cassandra nodes. It determines which data center and racks nodes belong to and uses this info to route requests efficiently.</li>
<li><strong>Functions</strong> of Snitch in Cassandra?</li>
</ul>
<h3 id="gossiper">Gossiper</h3>
<ul>
<li><strong>How does Cassandra use the Gossip protocol?</strong></li>
<li><strong>Node failure detection?</strong></li>
</ul>
<h3 id="how-does-cassandra-use-the-gossip-protocol">How does Cassandra use the Gossip Protocol?</h3>
<ul>
<li>
<p>Allows each node to keep track of state information about other nodes in the cluster.</p>
</li>
<li>
<p>Gossip protocol is a peer-to-peer communication mechanism in which nodes periodically exchange state information about themselves and other nodes they know about.</p>
</li>
<li>
<p>Each node initiates a gossip round every second to exchange state information about themselves (and other nodes) with <strong>one to three other random nodes</strong>.</p>
</li>
<li>
<p>Each <strong>Gossip message has a version</strong> associated with it, so that during gossip exchange, older information is overwritten with the most current state for a particular node.</p>
</li>
<li>
<p><strong>Generation Number:</strong> Each node tracks a generation number which increments every time a node restarts.
<img loading="lazy" src="/images/notes/cassandra-2009/image-9.png"></p>
</li>
<li>
<p><strong>Seed Nodes</strong>?</p>
</li>
</ul>
<h3 id="node-failure-detection">Node Failure Detection?</h3>
<ul>
<li>Accurately detecting failures is a hard problem to solve. We cannot say with 100% accuracy if a node is actually down or is just very slow to respond due to heavy load, network congestion, GC/process pauses etc.</li>
<li><strong>Heart Beating(Boolean Failure detector, Yes or No)</strong> uses a <strong>fixed timeout</strong>, and if there is no heartbeat from a server, the system, after the timeout, assumes that the server has crashed. Here the value of the timeout is critical.</li>
<li>Cassandra uses an Adaptive failure detection mechanism, <strong>Phi Accrual Failure Detector</strong></li>
<li>A generic Accrual Failure Detector, instead of telling if the server is alive or not, outputs the suspicion level about a server; a higher suspicion level means there are higher chances that the server is down.</li>
<li><strong>Phi Accrual Failure Detector</strong>, if a node does not respond, its suspicion level is increased and could be declared dead later.</li>
</ul>
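<p>A toy sketch of the accrual idea (simplified to an exponential inter-arrival model; the real detector fits the observed heartbeat distribution): the suspicion value φ keeps growing the longer a heartbeat is overdue, instead of flipping a boolean at a fixed timeout.</p>
<pre><code class="language-python">import math
import time

class ToyAccrualDetector:
    """Toy accrual failure detector (exponential model, not the full Phi paper)."""
    def __init__(self):
        self.intervals = []   # observed gaps between heartbeats
        self.last = None

    def heartbeat(self):
        now = time.monotonic()
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self):
        if self.last is None or not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = time.monotonic() - self.last
        # phi = -log10(P(heartbeat still pending)) under an exponential model.
        return (elapsed / mean) / math.log(10)

# A caller marks the node as suspect once phi crosses a threshold (e.g. 8),
# rather than declaring it dead at a single fixed timeout.
</code></pre>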
<h3 id="anatomy-of-cassandras-write-operation">Anatomy of Cassandra’s write operation</h3>
<h3 id="agenda-3">Agenda</h3>
<ul>
<li>CommitLog</li>
<li>MemTable</li>
<li>SSTable</li>
<li>Cassandra stores data both in-memory and on-disk to provide both high performance and durability. Every write includes a timestamp. The <strong>Write-Path</strong> involves a lot of components.</li>
<li><strong>Cassandra’s Write path Summary</strong>:</li>
</ul>
<h3 id="commit-log">Commit Log</h3>
<ul>
<li>When a node receives a write request, it immediately writes the data to a commit log.</li>
<li>The commit log is a write-ahead log and is stored on disk.</li>
<li>Used as a Crash-Recovery mechanism for Cassandra’s Durability goals.</li>
<li>A write on the node isn’t considered successful until it’s written to the commit log.
<img loading="lazy" src="/images/notes/cassandra-2009/image-10.png"></li>
</ul>
<h3 id="memtable">MemTable</h3>
<ul>
<li>After a <strong>Write</strong> is persisted to CommitLog, it is then written to the memory-resident data structure which is MemTable.</li>
<li>Each Cassandra node has an in-memory MemTable for each Table; it holds the recent, not-yet-flushed writes for the Table it represents.</li>
<li><strong>Accrues writes and provides reads for data not yet flushed to disk.</strong></li>
<li>Commit Log stores all the writes in sequential Order(append only log) whereas <strong>MemTable stores data in sorted order of PartitionKey, and Clustering Columns</strong>.</li>
<li>After data is written to Commit-Log and MemTable, node sends <strong>success acknowledgement</strong> to the Coordinator.
<img loading="lazy" src="/images/notes/cassandra-2009/image-11.png"></li>
</ul>
<h3 id="sstablesorted-string-table">SSTable(Sorted String Table)</h3>
<ul>
<li>When the number of objects stored in the MemTable reaches a Threshold, the contents of the MemTable are flushed to disk in a file called <strong>SSTable</strong>.</li>
<li>New MemTable is created to serve in-memory requests for subsequent data.</li>
<li>Flushing of MemTables is a Non-Blocking operation.</li>
<li>Multiple MemTables may exist for a single Table, one current, and others waiting to be flushed.</li>
<li>SSTable contains data for a specific Table.</li>
<li>When the MemTable is flushed to SStables, corresponding entries in the Commit Log are removed.</li>
<li>The Term <strong>SSTable</strong> first appeared in Google’s Bigtable which is also a storage system. Cassandra borrowed this term even though it does not store data as strings on the disk.</li>
<li>Once a MemTable is flushed to disk as an SSTable, it is immutable and cannot be changed by the application.</li>
<li>If we are not allowed to update SSTables, <strong>how do we delete or update a column?</strong></li>
<li>The current data state of a Cassandra table consists of its MemTables in memory and SSTables on the disk.</li>
<li>On reads, Cassandra will first read MemTables, and then subsequently SSTables(if MemTables Does Not contain the key) to find data values, as the MemTable may still contain values that have not yet been flushed to the disk.</li>
<li>MemTable works as a WriteBack cache that Cassandra looks up by Key.</li>
<li><strong>Generation Number</strong>: an Index number that is incremented every time a new SSTable is created for a Table. Uniquely identifies an SSTable.
<img loading="lazy" src="/images/notes/cassandra-2009/image-12.png"></li>
</ul>
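<p>A toy sketch of how these three pieces fit together (nothing like the real storage engine; file names and thresholds are made up): append to the commit log first, then update the sorted MemTable, and flush to an immutable sorted file when it grows.</p>
<pre><code class="language-python">import json

class ToyWritePath:
    """Toy commit-log + MemTable + SSTable flow, for illustration only."""
    def __init__(self, flush_threshold=4):
        self.commit_log = open("commitlog.txt", "a")   # append-only, sequential
        self.memtable = {}                             # key: (timestamp, value)
        self.sstables = []                             # immutable flushed tables
        self.generation = 0
        self.flush_threshold = flush_threshold

    def write(self, key, value, ts):
        self.commit_log.write(json.dumps([key, value, ts]) + "\n")  # durability first
        self.commit_log.flush()
        self.memtable[key] = (ts, value)   # then memory; the write can be acked now
        if len(self.memtable) &gt;= self.flush_threshold:
            self.flush()

    def flush(self):
        self.generation += 1                           # unique per SSTable
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}                             # fresh memtable

    def read(self, key):
        if key in self.memtable:                       # memtable first...
            return self.memtable[key][1]
        for sstable in reversed(self.sstables):        # ...then newest SSTable back
            if key in sstable:
                return sstable[key][1]
        return None
</code></pre>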
<h3 id="anatomy-of-cassandras-read-operation">Anatomy of Cassandra’s read operation</h3>
<h3 id="agenda-4">Agenda</h3>
<ul>
<li>Caching</li>
<li>Reading from MemTable</li>
<li>Reading from SSTable</li>
</ul>
<h3 id="caching">Caching</h3>
<ul>
<li>To boost read performance, Cassandra provides 3 optional forms of caching:</li>
</ul>
<h3 id="reading-from-memtable">Reading from MemTable</h3>
<ul>
<li>Data is sorted by the partition key and the clustering columns.</li>
<li>When a read request comes in, the node performs a binary search on the partition key to find the required partition and then returns the row.
<img loading="lazy" src="/images/notes/cassandra-2009/image-13.png"></li>
</ul>
<p><img loading="lazy" src="/images/notes/cassandra-2009/image-14.png"></p>
<h3 id="reading-from-sstables">Reading from SSTables</h3>
<h3 id="bloom-filters">Bloom Filters</h3>
<ul>
<li>Each SStable has a <strong>Bloom filter</strong> associated with it, which tells(probabilistic) if a particular key is present in it or not for <strong>boosting read performance</strong>.</li>
<li>Bloom filters are very fast, probabilistic data structures for testing whether an element is a member of a set.</li>
<li>Bloom filters work by mapping the values in a data set into a <strong>bit array</strong> and condensing a larger data set into a <strong>digest</strong> <strong>string</strong> using a <strong>hash function.</strong></li>
<li>The filters are stored in memory and are used to improve performance by reducing the need for disk access on key lookups since disk access is much slower.</li>
<li>Because <strong>false negatives</strong> are not possible: a negative answer means the key is definitely not in that SSTable, so the disk read can be skipped entirely; a false positive only costs one unnecessary disk read.</li>
</ul>
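<p>A minimal bloom filter sketch (double hashing over MD5 for illustration; real implementations use faster hash functions and carefully sized bit arrays): a negative answer lets the read path skip an SSTable entirely.</p>
<pre><code class="language-python">import hashlib

class BloomFilter:
    """Minimal bloom filter: no false negatives, tunable false-positive rate."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0                          # bit array packed into one int

    def _positions(self, key: bytes):
        digest = int(hashlib.md5(key).hexdigest(), 16)
        h1, h2 = digest &amp; 0xFFFFFFFF, digest &gt;&gt; 32  # k hashes from one digest
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits |= 1 &lt;&lt; p

    def might_contain(self, key: bytes):
        # False means definitely absent (skip this SSTable); True means probably present.
        return all(self.bits &gt;&gt; p &amp; 1 for p in self._positions(key))

bf = BloomFilter()
bf.add(b"user_123")
print(bf.might_contain(b"user_123"))   # True
print(bf.might_contain(b"user_999"))   # almost certainly False
</code></pre>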
<h3 id="how-are-sstables-stored-on-disk">How are SSTables Stored on Disk?</h3>
<ul>
<li>
<p>Each SSTable Consists of Two Files:
<img loading="lazy" src="/images/notes/cassandra-2009/image-15.png"></p>
</li>
<li>
<p><strong>Partition Index Summary File</strong>
<img loading="lazy" src="/images/notes/cassandra-2009/image-16.png"></p>
</li>
<li>
<p>If we want to read data for key=12, here are the steps we need to follow (also shown in the figure below):</p>
</li>
</ul>
<h3 id="reading-sstable-through-key-cache">Reading SSTable through Key Cache</h3>
<ul>
<li>
<p>As the Key Cache stores a map of recently read partition keys to their SSTable offsets, it is the fastest way to find the required row in the SSTable.
<img loading="lazy" src="/images/notes/cassandra-2009/image-17.png"></p>
</li>
<li>
<p><strong>Summary of Read Operation:</strong>
<img loading="lazy" src="/images/notes/cassandra-2009/image-18.png"></p>
</li>
</ul>
<h3 id="compaction">Compaction</h3>
<h3 id="agenda-5">Agenda</h3>
<ul>
<li>How does compaction work in Cassandra?</li>
<li>Compaction Strategies?</li>
<li>Sequential Writes?</li>
</ul>
<h3 id="how-does-compaction-work-in-cassandra">How does compaction work in Cassandra?</h3>
<ul>
<li>SSTables are immutable(Append Only Log), which helps Cassandra achieve such high write speeds.</li>
<li>Flushing of MemTable to SStable is a continuous process. This means we can have a large number of SStables lying on the disk. While reading, it is tedious to scan all these SStables. So, to improve the read performance, we need <strong>compaction.</strong></li>
<li><strong>Compaction</strong> refers to the operation of merging multiple related SSTables into a single new one.</li>
<li>During compaction, the data in SSTables is merged: the keys are merged, columns are combined, obsolete values are discarded, and a new index is created.</li>
<li>On compaction, the merged data is sorted, a new index is created over the sorted data, and this freshly merged, sorted, and indexed data is written to a single new SSTable.</li>
<li>Compaction will reduce the number of SSTables to consult and therefore improve read performance.</li>
<li>Compaction will also reclaim space taken by obsolete(Tombstoned or overwritten) data in SSTables.
<img loading="lazy" src="/images/notes/cassandra-2009/image-19.png"></li>
</ul>
<h3 id="compaction-strategies">Compaction Strategies</h3>
<ul>
<li><strong>Size Tiered(Default, Write Optimized)</strong></li>
<li><strong>Levelled(Read Optimized)</strong></li>
<li><strong>Time Window(Time Series Optimized)</strong></li>
</ul>
<h3 id="sequential-writes">Sequential Writes</h3>
<ul>
<li>Sequential writes are the primary reason that writes perform so well in Cassandra.</li>
<li>No <strong>reads</strong> or <strong>seeks</strong> of any kind are required for <strong>writing</strong> a value to Cassandra because writes are <strong>append-only</strong> operations.</li>
<li>The sequential write speed of the disk, rather than seek time, becomes the performance bottleneck.</li>
<li>Compaction is intended to amortize the reorganization of data, but it uses sequential I/O to do so, which makes it efficient.</li>
<li>If Cassandra naively inserted values where they ultimately belonged, writing clients would pay for seeks upfront.</li>
</ul>
<h3 id="tombstones">Tombstones</h3>
<ul>
<li>An interesting case with Cassandra can be when we delete some data for a node that is down or unreachable, that node could miss a delete. When that node comes back online later and a repair occurs, the node could “resurrect” the data that had been previously deleted by re-sharing it with other nodes.</li>
<li>To prevent deleted data from being reintroduced, Cassandra uses a concept called a <strong>tombstone</strong> Which is similar to a “soft delete” from the Relational databases world.</li>
<li>When we delete data, Cassandra does not delete it right away, instead associates a <strong>tombstone</strong> with it, with a time to <strong>expiry</strong>.</li>
<li>The purpose of this delay is to give a node that is unavailable time to recover.</li>
<li><strong>Tombstones</strong> are removed as part of <strong>compaction</strong>. During <strong>compaction</strong>, any row with an <strong>expired tombstone</strong> will not be propagated further.</li>
</ul>
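<p>In CQL, a plain <code>DELETE</code> writes a tombstone rather than removing the row in place; the table’s <code>gc_grace_seconds</code> (default 864000, i.e. ten days) is the expiry window that gives an unavailable replica time to learn about the deletion. A sketch, reusing the table from the earlier example:</p>
<pre><code class="language-python">import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")
user_id = uuid.uuid4()

# This writes a tombstone; compaction drops the data and the tombstone only
# after gc_grace_seconds, so a replica that missed the delete can catch up.
session.execute("DELETE FROM orders_by_user WHERE user_id = %s", (user_id,))
</code></pre>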
<h3 id="common-problems-associated-with-tombstones">Common Problems associated with Tombstones?</h3>
<ul>
<li>
<p><strong>Tombstones make Cassandra’s writes efficient</strong> because the data is not removed right away when deleted. Instead, it is removed later during compaction.</p>
</li>
<li>
<p><strong>Problems?</strong></p>
</li>
<li>
<p><strong>Slower Reads</strong></p>
</li>
</ul>
<h3 id="indexes"><strong>Indexes?</strong></h3>
<ul>
<li>
<p>Cassandra uses clustering keys to create indexes of data within a partition.</p>
</li>
<li>
<p>These are only local indexes, not global indexes.</p>
</li>
<li>
<p>If you have many clustering keys in order to achieve multiple different sort orders, Cassandra will de-normalize the data such that it keeps two copies of it.</p>
</li>
</ul>
<h3 id="cassandra-pitfalls"><strong>Cassandra Pitfalls?</strong></h3>
<ul>
<li>
<p>Lack of Strong Consistency even with Quorums(say Sloppy Quorum or hinted handoffs) which can create race conditions amongst concurrent writes.</p>
</li>
<li>
<p>Lack of ability to support data relationships(outside of sorting data within a partition)</p>
</li>
<li>
<p>Lack of Global Secondary Indexes if needed for Read Heavy applications where read cache may not work.</p>
</li>
</ul>
<h3 id="summary">Summary</h3>
<ul>
<li>Cassandra is <strong>Distributed</strong>, <strong>Decentralized</strong>(Leaderless), <strong>Scalable</strong>, <strong>Highly Available, Eventually Consistent NoSQL</strong> datastore.</li>
<li>Was designed with <strong>Fault-Tolerance</strong> in mind(hardware/software failures can and do happen).</li>
<li><strong>Peer-to-Peer(Gossip) distributed System</strong>, with no Leader/Follower nodes. All nodes are equal except some are tagged seed nodes, for bootstrapping gossip to the nodes added to the cluster.</li>
<li>Data is automatically <strong>Partitioned</strong> across nodes using <strong>Consistent Hashing</strong> as well as Replicated for <strong>Fault Tolerance</strong> and <strong>redundancy</strong>.</li>
<li>Combines Distributed Nature of Amazon’s Dynamo(Consistent Hashing, Replication, Partitioning), with DataModel of Google’s BigTable, i.e. SSTable/MemTable.</li>
<li>Offers <strong>Tunable Consistency</strong>(Default AP) but can be made strongly consistent(<strong>CP</strong>) but with performance implications.</li>
<li>Uses <strong>Gossip protocol</strong> for Inter-Node communication.</li>
<li>Supports <strong>Geographical Distribution</strong> of data across multiple clouds and data centers?</li>
</ul>
<h3 id="system-design-patterns-used">System Design Patterns Used?</h3>
<ul>
<li><strong>Consistent Hashing</strong> : Data Partitioning</li>
<li><strong>Quorum</strong> : Data Consistency</li>
<li><strong>Write Ahead Log</strong> : Durability</li>
<li><strong>Segmented Log</strong>: Splits its commit log into multiple smaller files instead of single large file for easier operation.</li>
<li><strong>Gossip Protocol</strong>: Membership or Cluster State information, Failure Detection?</li>
<li><strong>Phi Accrual Failure Detector</strong>: Adaptive Failure Detection using suspicion levels.</li>
<li><strong>Bloom Filters</strong>: Check for partition Key presence in SSTable(Read Optimized).</li>
<li><strong>Hinted Handoff:</strong> Sloppy Quorum?? and High Availability.</li>
<li><strong>Read Repair:</strong> Fix Stale values on Read?</li>
</ul>
<h3 id="references">References:</h3>
<ul>
<li>
<p>DataStax Docs</p>
</li>
<li>
<p>Cassandra Tombstone issues</p>
</li>
<li>
<p>BigTable</p>
</li>
<li>
<p>Dynamo</p>
</li>
<li>
<p>PhiAccrual Failure Detector(Akka)</p>
</li>
</ul>
<h3 id="open-questions-1">Open Questions?</h3>
<ul>
<li>
<p>What is a <strong>Murmer3</strong> hashing function? How does it compare to <strong>MD5</strong>? Why <strong>Murmur</strong>?</p>
</li>
<li>
<p>Why 64 bit Token range? How does that compare to Dynamo?</p>
</li>
<li>
<p>R + W &gt; Replication Factor can give Strong consistency levels in Cassandra?[Research]</p>
</li>
<li>
<p>What happens if the coordinator node which wrote the Hint on the local disk crashes? How does the hinted handoff process complete? [Research]</p>
</li>
<li>
<p>The latest write-timestamp is used as a marker for the correct version of data[Research?] in Cassandra? Conflict resolution? Last write wins or Vector Clocks? Data Loss?</p>
</li>
<li>
<p><strong>Phi Accrual Failure Detector?</strong></p>
</li>
<li>
<p>Write Ahead Log? Cassandra?</p>
</li>
<li>
<p>KeyCache and Row Cache in Cassandra? How is it used? How is it invalidated or kept in Sync?</p>
</li>
<li>
<p>Bloom Filters details?</p>
</li>
<li>
<p>Why is each compaction Strategy Size-Tiered or Levelled Compaction a good strategy for its corresponding workload?</p>
</li>
<li>
<p>Anti-Entropy in Cassandra?</p>
</li>
<li>
<p>Geographical replication of data?</p>
</li>
<li>
<p>Read up on Various company blogs on Cassandra?</p>
</li>
<li>
<p><strong>Last Write Wins and Conflict Resolution?</strong></p>
</li>
</ul>
<hr>
<p><strong>Paper Link:</strong> <a href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf">https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Dynamo </title>
      <link>https://www.sethihemant.com/notes/dynamo-2007/</link>
      <pubDate>Mon, 02 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/dynamo-2007/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf&#34;&gt;Dynamo &lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;dynamo--distributed-key-value-store&#34;&gt;Dynamo / Distributed Key Value Store&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Design a &lt;strong&gt;distributed key-value store(or Distributed Hash Table)&lt;/strong&gt; that is highly available (i.e., reliable), highly scalable, and completely decentralized.&lt;/p&gt;
&lt;h3 id=&#34;features&#34;&gt;Features&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Highly available Key-Value Store.&lt;/li&gt;
&lt;li&gt;Shopping Cart, Bestseller Lists, Sales Rank, Product Catalog, etc which &lt;strong&gt;needs only primary-key access to data&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Multi-table RDBMS would limit scalability and availability.&lt;/li&gt;
&lt;li&gt;Can choose desired Level of &lt;strong&gt;Availability&lt;/strong&gt; and &lt;strong&gt;Consistency&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;background&#34;&gt;Background?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Designed for **high availability(**at a massive scale) and &lt;strong&gt;partition tolerance&lt;/strong&gt; at the expense of &lt;strong&gt;strong consistency&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Primary Motivation for being optimized for High Availability(Over consistency) was to be always up for serving customer requests to provide better customer experience.&lt;/li&gt;
&lt;li&gt;Dynamo design inspired various NoSQL Databases, &lt;strong&gt;Cassandra&lt;/strong&gt;, &lt;strong&gt;Riak&lt;/strong&gt;, &lt;strong&gt;VoldemortDB&lt;/strong&gt;, &lt;strong&gt;DynamoDB&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;design-goals&#34;&gt;Design Goals?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Highly Available&lt;/li&gt;
&lt;li&gt;Reliability&lt;/li&gt;
&lt;li&gt;Highly Scalable&lt;/li&gt;
&lt;li&gt;Decentralized&lt;/li&gt;
&lt;li&gt;Eventually Consistent(EC) - Weaker Consistency model than Strong Consistency(Linearizability)&lt;/li&gt;
&lt;li&gt;(&lt;strong&gt;Notes:&lt;/strong&gt; ) Latency Requirements?&lt;/li&gt;
&lt;li&gt;(&lt;strong&gt;Notes:&lt;/strong&gt; ) Geographical Distribution of Data?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;use-cases&#34;&gt;Use cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Dynamo can achieve strong consistency, but it comes with a performance impact. If Strong Consistency is a requirement, Dynamo is not the best option.&lt;/li&gt;
&lt;li&gt;Applications that need tight control over the trade-offs between availability, consistency, cost-effectiveness, and performance.&lt;/li&gt;
&lt;li&gt;Services that need only Primary Key access to the data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;system-apis&#34;&gt;System APIs:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;get(key)&lt;/strong&gt; : T… Object, Context&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;put(key, context, object)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Dynamo treats both the &lt;strong&gt;object&lt;/strong&gt; and &lt;strong&gt;the key&lt;/strong&gt; as an arbitrary &lt;strong&gt;array of bytes&lt;/strong&gt; (typically &lt;strong&gt;less than 1 MB&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;MD5 Hashing algorithm&lt;/strong&gt; on the key to generate &lt;strong&gt;128-bit HashID&lt;/strong&gt;, which is used to determine the storage nodes that are responsible for serving the key.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;high-level-architecture&#34;&gt;High Level Architecture&lt;/h3&gt;
&lt;h3 id=&#34;agenda&#34;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Data Distribution(Partitioning)&lt;/li&gt;
&lt;li&gt;Data Replication and Consistency&lt;/li&gt;
&lt;li&gt;Handing Temporary Failures(Fault Tolerance)&lt;/li&gt;
&lt;li&gt;Inter-Node communication(Unreliable Network) and Failure Detection&lt;/li&gt;
&lt;li&gt;High Availability&lt;/li&gt;
&lt;li&gt;Conflict resolution and handling permanent failures.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;data-partitioning&#34;&gt;Data Partitioning&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Distributing data across a set of nodes is called &lt;strong&gt;data partitioning&lt;/strong&gt;.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">Dynamo </a></p>
<hr>
<h2 id="dynamo--distributed-key-value-store">Dynamo / Distributed Key Value Store</h2>
<p><strong>Problem:</strong> Design a <strong>distributed key-value store(or Distributed Hash Table)</strong> that is highly available (i.e., reliable), highly scalable, and completely decentralized.</p>
<h3 id="features">Features</h3>
<ul>
<li>Highly available Key-Value Store.</li>
<li>Shopping Cart, Bestseller Lists, Sales Rank, Product Catalog, etc which <strong>needs only primary-key access to data</strong>.</li>
<li>Multi-table RDBMS would limit scalability and availability.</li>
<li>Can choose desired Level of <strong>Availability</strong> and <strong>Consistency</strong>.</li>
</ul>
<h3 id="background">Background?</h3>
<ul>
<li>Designed for <strong>high availability</strong> (at a massive scale) and <strong>partition tolerance</strong> at the expense of <strong>strong consistency</strong>.</li>
<li>The primary motivation for optimizing for high availability (over consistency) was to always be up to serve customer requests, providing a better customer experience.</li>
<li>Dynamo design inspired various NoSQL Databases, <strong>Cassandra</strong>, <strong>Riak</strong>, <strong>VoldemortDB</strong>, <strong>DynamoDB</strong>.</li>
</ul>
<h3 id="design-goals">Design Goals?</h3>
<ul>
<li>Highly Available</li>
<li>Reliability</li>
<li>Highly Scalable</li>
<li>Decentralized</li>
<li>Eventually Consistent(EC) - Weaker Consistency model than Strong Consistency(Linearizability)</li>
<li>(<strong>Notes:</strong> ) Latency Requirements?</li>
<li>(<strong>Notes:</strong> ) Geographical Distribution of Data?</li>
</ul>
<h3 id="use-cases">Use cases</h3>
<ul>
<li>Dynamo can achieve strong consistency, but it comes with a performance impact. If Strong Consistency is a requirement, Dynamo is not the best option.</li>
<li>Applications that need tight control over the trade-offs between availability, consistency, cost-effectiveness, and performance.</li>
<li>Services that need only Primary Key access to the data.</li>
</ul>
<h3 id="system-apis">System APIs:</h3>
<ul>
<li><strong>get(key)</strong>: returns the object (or a list of conflicting versions) along with a <strong>context</strong>.</li>
<li><strong>put(key, context, object)</strong></li>
<li>Dynamo treats both the <strong>object</strong> and <strong>the key</strong> as an arbitrary <strong>array of bytes</strong> (typically <strong>less than 1 MB</strong>).</li>
<li>Uses <strong>MD5 Hashing algorithm</strong> on the key to generate <strong>128-bit HashID</strong>, which is used to determine the storage nodes that are responsible for serving the key.</li>
</ul>
<h3 id="high-level-architecture">High Level Architecture</h3>
<h3 id="agenda">Agenda</h3>
<ul>
<li>Data Distribution(Partitioning)</li>
<li>Data Replication and Consistency</li>
<li>Handing Temporary Failures(Fault Tolerance)</li>
<li>Inter-Node communication(Unreliable Network) and Failure Detection</li>
<li>High Availability</li>
<li>Conflict resolution and handling permanent failures.</li>
</ul>
<h3 id="data-partitioning">Data Partitioning</h3>
<ul>
<li>
<p>Distributing data across a set of nodes is called <strong>data partitioning</strong>.</p>
</li>
<li>
<p><strong>Challenges with Partitioning?</strong></p>
</li>
<li>
<p><strong>Naive Approach(Modulo Hashing)</strong></p>
</li>
<li>
<p><strong>Better Approach(Consistent Hashing)</strong>
<img loading="lazy" src="/images/notes/dynamo-2007/image-1.png"></p>
</li>
<li>
<p>Consistent hashing represents the data managed by a cluster as a <strong>ring</strong>. The ring is divided into smaller <strong>predefined ranges</strong>. Each node in the ring is assigned a range of data. The <strong>start of the range</strong> is called a <strong>token</strong>(each node is assigned one token).
<img loading="lazy" src="/images/notes/dynamo-2007/image-2.png"></p>
</li>
<li>
<p>The above works well when a node is added or removed from the ring, as only the neighboring node is affected in these scenarios.</p>
</li>
<li>
<p>The basic Consistent Hashing algorithm assigns a single token (or a consecutive hash range) to each physical node and does a <strong>static division of ranges</strong> that requires calculating tokens based on a given number of nodes.</p>
</li>
<li>
<p>Dynamo efficiently handles these scenarios(node addition/removal) through the use of <strong>Virtual Nodes</strong>(or <strong>Vnodes</strong>). New scheme for distributing Tokens to physical nodes.</p>
</li>
<li>
<p>Instead of assigning a single token to a node, the hash range is divided into multiple smaller ranges, and each physical node is assigned multiple of these smaller ranges. Each of these subranges is called a <strong>Vnode</strong>.
<img loading="lazy" src="/images/notes/dynamo-2007/image-3.png"></p>
</li>
<li>
<p><strong>Vnodes</strong> are randomly distributed across the cluster and are generally non-contiguous so that no two neighboring <strong>Vnodes</strong> are assigned to the same physical node.</p>
</li>
<li>
<p>Nodes also carry replicas of other nodes for <strong>fault-tolerance</strong>.
<img loading="lazy" src="/images/notes/dynamo-2007/image-4.png"></p>
</li>
<li>
<p>Since there can be <strong>heterogeneous</strong> machines in the clusters, some servers might hold more Vnodes than others.</p>
</li>
<li>
<p><strong>Advantages of VNodes:</strong></p>
</li>
</ul>
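<p>A toy sketch of Vnode assignment (illustrative, not Dynamo’s code): each physical node claims many small, randomly placed token ranges, so adding a node steals thin slices from everyone instead of splitting one neighbor’s range.</p>
<pre><code class="language-python">import random

random.seed(7)
RING_SIZE = 2 ** 32
VNODES_PER_NODE = 8

# Each physical node claims several randomly placed tokens (Vnodes) on the ring.
ring = sorted(
    (random.randrange(RING_SIZE), node)
    for node in ("node-a", "node-b", "node-c")
    for _ in range(VNODES_PER_NODE)
)

# Adding a node later takes small, scattered slices of the ring from every
# existing node, spreading the rebalancing load across the whole cluster.
ring += [(random.randrange(RING_SIZE), "node-d") for _ in range(VNODES_PER_NODE)]
ring.sort()
</code></pre>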
<h3 id="data-replication">Data Replication</h3>
<p>Agenda</p>
<ul>
<li>
<p>Optimistic replication</p>
</li>
<li>
<p>Preference List</p>
</li>
<li>
<p>Sloppy Quorum and Handling of Temporary failures</p>
</li>
<li>
<p>Hinted Handoff</p>
</li>
</ul>
<h3 id="optimistic-replication"><strong>Optimistic replication</strong></h3>
<ul>
<li>
<p>Replicates each data item on N nodes(N = Replication Factor, configurable per Dynamo instance).</p>
</li>
<li>
<p>Each key is assigned a <strong>Coordinator node</strong>(node that falls first in the hash range), which stores the data locally and <strong>replicates</strong> <strong>asynchronously(What?? or Synchronously?)</strong> to <strong>N-1 Clockwise successor</strong> nodes in the ring(eventually consistent) called <strong>Optimistic replication</strong>.
<img loading="lazy" src="/images/notes/dynamo-2007/image-5.png"></p>
</li>
<li>
<p>As Dynamo stores N copies of data spread across different nodes, if one node is down, other replicas can respond to queries for that range of data.</p>
</li>
<li>
<p>If a client cannot contact the coordinator node, it sends the request to a node holding a replica.</p>
</li>
</ul>
<h3 id="preference-list"><strong>Preference List</strong></h3>
<ul>
<li>
<p>The list of nodes responsible for storing a particular key is called the preference list.</p>
</li>
<li>
<p>Dynamo is designed so that <strong>every node in the system can determine which nodes should be in this list for any specific key</strong>.</p>
</li>
<li>
<p>This list contains more than N nodes to account for failures, and it skips virtual nodes on the ring so that the list only contains distinct physical nodes.</p>
</li>
</ul>
<h3 id="sloppy-quorum"><strong>Sloppy Quorum and handling of temporary failures</strong></h3>
<ul>
<li>
<p>Following <strong>traditional/strict</strong> quorum approaches, any distributed system becomes unavailable during server failures or network partitions and would have reduced availability even under simple failure conditions. Dynamo uses Sloppy Quorums.</p>
</li>
<li>
<p>With this approach, all read/write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while moving clockwise on the consistent hashing ring.</p>
</li>
<li>
<p><strong>Fault Tolerance with Sloppy Quorum</strong>.
<img loading="lazy" src="/images/notes/dynamo-2007/image-6.png"></p>
</li>
<li>
<p><strong>Hinted Handoff</strong></p>
</li>
</ul>
<h3 id="vector-clocks-and-conflicting-dataconflict-resolution">Vector Clocks and Conflicting Data(Conflict Resolution)</h3>
<p><strong>Agenda:</strong></p>
<ul>
<li>
<p>Clock Skew?</p>
</li>
<li>
<p>Vector Clock?</p>
</li>
<li>
<p>Conflict-Free Replicated Data Types (CRDTs)</p>
</li>
<li>
<p>Last Write Wins (LWW)</p>
</li>
</ul>
<p><strong>Clock Skew</strong></p>
<ul>
<li>
<p>Physical clocks have clock skew, which is tolerable in single-node systems but can mis-order concurrent updates in distributed systems, since clocks on different nodes drift apart.</p>
</li>
<li>
<p>Physical clocks are synchronized using NTP, but some skew always remains; two different nodes&rsquo; physical clocks can never be perfectly synchronized.</p>
</li>
<li>
<p>Special hardware like GPS and atomic clocks can reduce clock skew, but doesn’t eliminate it entirely.</p>
</li>
<li>
<p>Physical clocks also cannot capture the <strong>Causal Ordering</strong> of events (the <strong>happens-before</strong> relationship).</p>
</li>
</ul>
<p><strong>Vector Clock</strong></p>
<ul>
<li>
<p>Captures Causal ordering between events.</p>
</li>
<li>
<p>A vector clock is effectively a list of (node, counter) pairs, one counter per node. (A single global counter would be a Lamport clock; the per-node vector is what lets Dynamo detect concurrency rather than merely order events.)</p>
</li>
<li>
<p>Vector timestamps are attached to every version of the object stored in Dynamo.</p>
</li>
<li>
<p>One can determine whether two versions of an object are on parallel branches or have a causal ordering by examining their vector clocks.</p>
</li>
<li>
<p>If every counter in the first object’s clock is less than or equal to the corresponding counter in the second clock, then the first is an ancestor of the second and can be forgotten. Otherwise, the two changes are considered to be in conflict and require reconciliation. Dynamo resolves these conflicts at read time (see the sketch after this list).</p>
</li>
<li>
<p>Version branching may happen in the presence of failures combined with concurrent updates, resulting in conflicting versions of an object.</p>
</li>
<li>
<p>Dynamo <strong>truncates vector clocks (oldest entries first) when they grow too large</strong>. If Dynamo ends up deleting entries that are required to reconcile an object’s state, <strong>it would not be able to achieve eventual consistency</strong>.</p>
</li>
</ul>
<p><strong>Conflict-Free Replicated Data Types (CRDTs)</strong></p>
<ul>
<li>
<p>To make use of CRDTs, we need to model our data in such a way that concurrent changes can be applied to the data in any order and will produce the same end result. This way, the system does not need to worry about any ordering guarantees.</p>
</li>
<li>
<p>The idea that any two nodes that have received the same set of updates will see the same end result is called <strong>strong eventual consistency</strong>.</p>
</li>
</ul>
<p><strong>Last Write Wins (LWW)</strong></p>
<ul>
<li>
<p>Dynamo (and Cassandra) also offers server-side conflict resolution via <strong>LWW</strong>.</p>
</li>
<li>
<p>Uses <strong>physical</strong> (wall-clock/time-of-day) clocks.</p>
</li>
<li>
<p>Can potentially lead to Data-Loss during concurrent writes.</p>
</li>
</ul>
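<p>Two short sketches of the ideas above. First, the vector-clock comparison rule; representing a clock as a dict of per-node counters is an assumption for illustration:</p>
<pre><code class="language-python"># A minimal sketch of the comparison rule above; representing a clock as a
# dict of node -> counter is an assumption for illustration.
def descends(a: dict, b: dict) -> bool:
    """True if clock `b` has seen everything in clock `a` (a &lt;= b)."""
    return all(counter &lt;= b.get(node, 0) for node, counter in a.items())

v1 = {"Sx": 1}              # first write, coordinated by node Sx
v2 = {"Sx": 2}              # later write through Sx that saw v1
v3 = {"Sx": 1, "Sy": 1}     # write through Sy that saw v1 but not v2

assert descends(v1, v2)     # v1 is an ancestor of v2 and can be forgotten
assert not descends(v2, v3) and not descends(v3, v2)  # concurrent: conflict
</code></pre>
<p>Second, a G-counter, one of the simplest CRDTs, showing why merge order does not matter:</p>
<pre><code class="language-python"># A G-counter: each replica increments only its own slot, and merge takes the
# elementwise max, so updates commute and all replicas converge to the same
# value (strong eventual consistency).
def merge(a: dict, b: dict) -> dict:
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def value(counter: dict) -> int:
    return sum(counter.values())

r1, r2 = {"A": 3, "B": 1}, {"A": 2, "B": 4}               # divergent replicas
assert value(merge(r1, r2)) == value(merge(r2, r1)) == 7  # order-independent
</code></pre>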
<h3 id="life-of-dynamos-put-and-get-operations">Life of Dynamo’s put() and get() operations.</h3>
<p><strong>Agenda:</strong></p>
<ul>
<li>Strategies for Coordinator selection</li>
<li>Consistency protocol</li>
<li>put() process</li>
<li>get() process</li>
<li>Request handling through a state machine.</li>
</ul>
<h3 id="strategies-for-choosing-coordinator">Strategies for choosing coordinator</h3>
<ul>
<li>Clients route requests through a generic load balancer, or</li>
<li>Clients use a <strong>partition-aware client library</strong> that routes requests directly to the appropriate coordinator, giving <strong>lower latency</strong>.
<img loading="lazy" src="/images/notes/dynamo-2007/image-7.png"></li>
</ul>
<h3 id="consistency-protocol">Consistency Protocol</h3>
<ul>
<li>
<p>Uses a consistency protocol similar to quorum systems.</p>
</li>
<li>
<p>R + W &gt; N (R/W = the minimum number of nodes that must participate in a successful read/write)</p>
</li>
<li>
<p>A common (N, R, W) configuration for Dynamo is (3, 2, 2).</p>
</li>
<li>
<p>Latency of get() and put() is dictated by the slowest of the R or W replicas.</p>
</li>
</ul>
<p><strong>put() process</strong></p>
<ul>
<li>
<p>Coordinator generates new data version and vector timestamp.</p>
</li>
<li>
<p>Saves data locally.</p>
</li>
<li>
<p>Sends write requests to N-1 highest ranked healthy nodes from the preference list.</p>
</li>
<li>
<p>put() is considered successful after receiving W-1 confirmations (the coordinator’s own local write provides the Wth).</p>
</li>
</ul>
<p><strong>get() process</strong></p>
<ul>
<li>
<p>Coordinator requests the data version from N-1 highest ranked healthy nodes from the preference list.</p>
</li>
<li>
<p>Waits until R-1 replicas reply (its local copy completes the read quorum of R).</p>
</li>
<li>
<p>Coordinator handles causal data versioning using vector clocks/timestamps.</p>
</li>
<li>
<p>Returns all causally unrelated data versions to the caller (see the sketch after this list).</p>
</li>
</ul>
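<p>An illustrative sketch of the quorum protocol described above; <code>Replica</code> and the sequential loops are simplifying assumptions, since Dynamo contacts replicas in parallel:</p>
<pre><code class="language-python"># An illustrative sketch of the (N, R, W) protocol above; Replica and the
# sequential loops are simplifications -- Dynamo contacts replicas in parallel.
N, R, W = 3, 2, 2          # R + W > N: read and write quorums overlap

class Replica:
    def __init__(self, name):
        self.name, self.data = name, {}
    def write(self, key, value, clock):
        self.data[key] = (value, clock)
        return True
    def read(self, key):
        return self.data.get(key)

def put(key, value, clock, replicas):
    replicas[0].write(key, value, clock)   # coordinator writes locally first
    acks = sum(node.write(key, value, clock)
               for node in replicas[1:N])  # then the N-1 successor replicas
    return acks >= W - 1                   # success after W-1 confirmations

def get(key, replicas):
    versions = [replicas[0].read(key)]     # coordinator's local version
    for node in replicas[1:N]:
        versions.append(node.read(key))
        if len(versions) >= R:             # R-1 replies plus the local copy
            break
    # A real coordinator would now drop causally-older versions (vector
    # clocks) and return every remaining concurrent version to the caller.
    return versions

nodes = [Replica(c) for c in "ABC"]
put("cart:1", {"item": "book"}, {"A": 1}, nodes)
print(get("cart:1", nodes))  # two identical (value, clock) copies
</code></pre>
<p>With (3, 2, 2), a write survives one slow or failed replica, and R + W &gt; N guarantees every read quorum overlaps every write quorum in at least one node.</p>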
<h3 id="request-handling-through-the-state-machine">Request handling through the state machine</h3>
<ul>
<li>Each client request results in creating a state machine on the node that received the client request.</li>
<li>The state machine contains all the logic for identifying the nodes responsible for a key, sending requests, waiting for responses, retrying, processing replies, and packaging the response for the client.</li>
<li>Each state machine instance handles exactly one client request.</li>
<li>A <strong>read operation</strong> implements the following <strong>state machine</strong>: (i) send read requests to the nodes, (ii) wait for the minimum number of required responses, (iii) fail the request if too few replies arrive within a time bound, (iv) otherwise gather all data versions and determine which to return, and (v) reconcile the versions and write the result back to stale replicas (read repair). See the sketch after this list.</li>
<li><strong>Writes:</strong> the coordinator for a write is typically chosen as the node that replied fastest to the preceding read, which helps provide “read-your-writes” behavior.</li>
</ul>
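<p>A hypothetical sketch of that read state machine in Python; the states and transitions paraphrase the steps above and are not actual Dynamo code:</p>
<pre><code class="language-python"># A hypothetical sketch of the read state machine described above; states
# and transitions paraphrase the paper's description, not actual Dynamo code.
from enum import Enum, auto

class State(Enum):
    SEND_REQUESTS = auto()
    AWAIT_QUORUM = auto()
    RECONCILE = auto()
    READ_REPAIR = auto()
    DONE = auto()
    FAILED = auto()

def read_state_machine(replies_within_deadline: int, R: int) -> State:
    state = State.SEND_REQUESTS
    while state not in (State.DONE, State.FAILED):
        if state is State.SEND_REQUESTS:     # (i) fan out to the replicas
            state = State.AWAIT_QUORUM
        elif state is State.AWAIT_QUORUM:    # (ii)/(iii) wait, or time out
            state = (State.RECONCILE if replies_within_deadline >= R
                     else State.FAILED)
        elif state is State.RECONCILE:       # (iv) keep latest/concurrent versions
            state = State.READ_REPAIR
        elif state is State.READ_REPAIR:     # (v) push reconciled data to stale replicas
            state = State.DONE
    return state

assert read_state_machine(replies_within_deadline=2, R=2) is State.DONE
assert read_state_machine(replies_within_deadline=1, R=2) is State.FAILED
</code></pre>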
<h3 id="anti-entropy-through-merkle-trees">Anti-Entropy through Merkle Trees</h3>
<ul>
<li>
<p>Dynamo uses vector clocks to resolve write conflicts while serving read requests: if some replicas return stale responses, the coordinator updates them with the reconciled version (<strong>Read Repair</strong>).</p>
</li>
<li>
<p>If a replica fell significantly behind the others, resolving conflicts through read repair alone could take a very long time, since it depends on the affected keys actually being read. Keys that are never accessed could remain stale indefinitely.</p>
</li>
<li>
<p>We need a mechanism to automatically reconcile replicas in the background(and do conflict resolution if any).</p>
</li>
<li>
<p>To do this, we need to quickly <strong>compare two copies of a range of data residing on different replicas</strong> and figure out exactly which parts are <strong>different</strong>.</p>
</li>
<li>
<p>Naively comparing checksums over the entire data range is not feasible: computing them means reading all the data, and locating the differences would mean shipping entire ranges between replicas over the network.</p>
</li>
<li>
<p>Dynamo uses <strong>Merkle trees</strong> to compare replicas of a range.</p>
</li>
<li>
<p><strong>A Merkle tree</strong> is a <strong>binary tree of hashes</strong>, where each internal node is the hash of its two children, and each leaf node is a hash of a portion of the original data.
<img loading="lazy" src="/images/notes/dynamo-2007/image-8.png"></p>
</li>
<li>
<p>Comparing the ranges of data on two replicas is then equivalent to comparing the two Merkle trees, starting from their roots.</p>
</li>
<li>
<p>The principal advantage of using a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the whole data set.</p>
</li>
<li>
<p>Merkle trees minimize the amount of data that needs to be transferred for synchronization and reduce the number of disk reads performed during the anti-entropy process (see the sketch after this list).</p>
</li>
<li>
<p>The disadvantage of using Merkle trees is that many key ranges can change when a node joins or leaves, and as a result, the trees need to be recalculated.</p>
</li>
</ul>
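<p>A toy Merkle-tree build-and-diff in Python, assuming a power-of-two number of leaves for brevity; it shows why only divergent branches, and ultimately divergent key subranges, ever need to be exchanged:</p>
<pre><code class="language-python"># A toy Merkle tree over a key range, plus a diff that descends only into
# branches whose hashes differ. Assumes a power-of-two number of leaves for
# brevity; this is an illustration, not Dynamo's actual implementation.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(leaves):
    """Bottom-up list of levels: levels[0] = leaf hashes, levels[-1] = root."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff(t1, t2, lvl=None, idx=0):
    """Return the leaf indices (key subranges) that differ between two trees."""
    if lvl is None:
        lvl = len(t1) - 1                  # start the comparison at the roots
    if t1[lvl][idx] == t2[lvl][idx]:
        return []                          # identical branch: skip it entirely
    if lvl == 0:
        return [idx]                       # reached a divergent leaf
    return diff(t1, t2, lvl - 1, 2 * idx) + diff(t1, t2, lvl - 1, 2 * idx + 1)

a = build([b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"])
b = build([b"k1=v1", b"k2=XX", b"k3=v3", b"k4=v4"])
print(diff(a, b))  # [1] -- only the second subrange needs to be synced
</code></pre>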
<h3 id="gossip-protocol">Gossip Protocol</h3>
<h3 id="what-is-a-gossip-protocol">What is a Gossip Protocol?</h3>
<ul>
<li>How does <strong>Node Failure Detection</strong> happen in Dynamo?</li>
<li>Since we do not have any central node that keeps track of all nodes to know if a node is down or not, how does a node know every other node’s current state?</li>
<li><strong>Naive Approach:</strong> Each node broadcasts a heartbeat message to every other node, which scales poorly as the cluster grows.</li>
<li><strong>Optimized Approach:</strong> the <strong>Gossip Protocol</strong>, where each node periodically exchanges membership state with a random peer so that updates spread epidemically (see the sketch below).
<img loading="lazy" src="/images/notes/dynamo-2007/image-9.png"></li>
</ul>
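<p>A toy gossip round in Python, as an illustration of the idea rather than Dynamo&rsquo;s actual membership protocol:</p>
<pre><code class="language-python"># A toy gossip round: each node keeps a heartbeat table and merges it with a
# random peer's, so state spreads epidemically without any broadcast. The
# structure is illustrative, not Dynamo's actual implementation.
import random

class Node:
    def __init__(self, name):
        self.name = name
        self.table = {name: 0}        # node name -> freshest heartbeat seen

    def tick(self):
        self.table[self.name] += 1    # bump own heartbeat each round

    def gossip_with(self, peer):
        # Both sides keep the freshest heartbeat for every known node; a
        # node whose heartbeat stops advancing is eventually suspected dead.
        for table in (self.table, peer.table):
            for node, beat in list(table.items()):
                self.table[node] = max(self.table.get(node, 0), beat)
                peer.table[node] = max(peer.table.get(node, 0), beat)

nodes = [Node(c) for c in "ABCDE"]
for _ in range(5):                    # a few rounds of random pairwise gossip
    for n in nodes:
        n.tick()
        n.gossip_with(random.choice(nodes))
print(sorted(nodes[0].table))         # almost surely ['A','B','C','D','E']
</code></pre>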
<h3 id="external-discovery-through-seed-nodes">External Discovery Through Seed Nodes?</h3>
<ul>
<li>Dynamo nodes use gossip protocol to find the current state of the ring. This can result in a logical partition of the cluster in a particular scenario.</li>
<li>An administrator joins node A to the ring and then joins node B to the ring. Nodes A and B consider themselves part of the ring, yet neither would be immediately aware of each other. To prevent these logical partitions, Dynamo introduced the concept of seed nodes.</li>
<li>Seed nodes are fully functional nodes and can be obtained either from a static configuration or a configuration service. This way, all nodes are aware of seed nodes.</li>
<li>Each node communicates with seed nodes through gossip protocol to reconcile membership changes; therefore, logical partitions are highly unlikely.</li>
</ul>
<h3 id="characteristics-and-criticism-of-dynamo">Characteristics and Criticism of Dynamo</h3>
<p>Responsibilities of a Dynamo Node</p>
<ul>
<li>Serving get() and put() requests by acting as a coordinator (or forwarding the request).</li>
<li>Keeping track of membership (hash ranges on the ring) and detecting failures (gossip)</li>
<li>Local Persistent Storage</li>
</ul>
<h3 id="characteristics-of-dynamo">Characteristics of Dynamo</h3>
<ul>
<li>Distributed(Can run across several machines)</li>
<li>Decentralized(No external coordinator, all nodes identical)</li>
<li>Scalable(Horizontally scaled on commodity hardware with Fault Tolerance. No Manual intervention/rebalancing required)</li>
<li>Highly Available</li>
<li>Fault Tolerant and Reliable</li>
<li>Tunable Consistency (trade-offs between availability and consistency by adjusting (N, R, W), e.g. (3, 2, 2), (3, 1, 3), or (3, 3, 1)).</li>
</ul>
<h3 id="criticism-on-dynamo-design">Criticism on Dynamo Design?</h3>
<ul>
<li>Each Dynamo node contains the entire routing table. This could affect scalability, since the routing table grows as more nodes are added to the system.</li>
<li>Dynamo strives for symmetry (all nodes have the same set of responsibilities), yet it designates some nodes as seed nodes for external discovery to avoid logical partitions, which arguably violates that symmetry principle.</li>
<li>DHTs can be susceptible to several different types of attack? [Research more]</li>
<li>Dynamo’s design can be described as a <strong>Leaky Abstraction</strong>: clients must understand and participate in conflict resolution, so storage internals leak into the application.</li>
</ul>
<h3 id="datastores-developed-on-principles-of-dynamo">DataStores developed on Principles of Dynamo</h3>
<ul>
<li><strong>Riak</strong> is a distributed NoSQL key-value data store that is highly available, scalable, fault-tolerant, and easy to operate.</li>
<li><strong>Cassandra</strong> is a distributed, decentralized, scalable, and highly available NoSQL wide-column database.
<img loading="lazy" src="/images/notes/dynamo-2007/image-10.png"></li>
</ul>
<h3 id="summary">Summary</h3>
<p><img loading="lazy" src="/images/notes/dynamo-2007/image-11.png"></p>
<p>References:</p>
<ul>
<li>
<p><a href="https://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">https://www.allthingsdistributed.com/2007/10/amazons_dynamo.html</a></p>
</li>
<li>
<p><a href="https://docs.riak.com/riak/kv/2.2.0/developing/data-types/">https://docs.riak.com/riak/kv/2.2.0/developing/data-types/</a></p>
</li>
<li>
<p><a href="https://research.google/pubs/bigtable-a-distributed-storage-system-for-structured-data/">https://research.google/pubs/bigtable-a-distributed-storage-system-for-structured-data/</a></p>
</li>
<li>
<p><a href="https://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html">https://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type">https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type</a></p>
</li>
<li>
<p><a href="https://www.allthingsdistributed.com/2017/10/a-decade-of-dynamo.html">https://www.allthingsdistributed.com/2017/10/a-decade-of-dynamo.html</a></p>
</li>
<li>
<p><a href="https://news.ycombinator.com/item?id=915212">https://news.ycombinator.com/item?id=915212</a>
<strong>Open Questions:</strong></p>
</li>
<li>
<p>Anti-Entropy and Merkle Trees</p>
</li>
<li>
<p>DHTs can be susceptible to Several different types of attack?[Research More?]</p>
</li>
<li>
<p>Underlying storage engines for Dynamo (Berkeley DB Transactional Data Store, MySQL, an in-memory buffer with a persistent backing store, and more).</p>
</li>
<li>
<p>Why does it use MD5 hashing? Why not something else?</p>
</li>
<li>
<p>Logical partitioning and seed nodes?</p>
</li>
<li>
<p>New Features and revision in the design?</p>
</li>
</ul>
<hr>
<p><strong>Paper Link:</strong> <a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
