<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>AWS on Hemant Sethi</title>
    <link>https://www.sethihemant.com/tags/-aws/</link>
    <description>Recent content in AWS on Hemant Sethi</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 28 Jan 2026 21:24:47 -0800</lastBuildDate>
    <atom:link href="https://www.sethihemant.com/tags/-aws/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>DynamoDB</title>
      <link>https://www.sethihemant.com/notes/dynamodb-2022/</link>
      <pubDate>Wed, 11 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/dynamodb-2022/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://www.usenix.org/conference/atc22/presentation/elhemali&#34;&gt;DynamoDB&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;dynamodb&#34;&gt;DynamoDB&lt;/h2&gt;
&lt;h3 id=&#34;summaryabstract&#34;&gt;Summary/Abstract&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Amazon DynamoDB is a NoSQL cloud database service that provides consistent performance at any scale.&lt;/li&gt;
&lt;li&gt;Fundamental properties: &lt;strong&gt;consistent performance&lt;/strong&gt;, &lt;strong&gt;availability&lt;/strong&gt;, &lt;strong&gt;durability&lt;/strong&gt;, and a &lt;strong&gt;fully managed serverless&lt;/strong&gt; experience.&lt;/li&gt;
&lt;li&gt;In 2021, during the 66-hour Amazon Prime Day shopping event, DynamoDB handled a peak of &lt;strong&gt;89.2 million requests per second&lt;/strong&gt; while maintaining high availability and &lt;strong&gt;single-digit millisecond&lt;/strong&gt; performance.&lt;/li&gt;
&lt;li&gt;Design and implementation of &lt;strong&gt;DynamoDB&lt;/strong&gt; have evolved since the first launch in 2012. The system has successfully dealt with issues related to &lt;strong&gt;fairness&lt;/strong&gt;, &lt;strong&gt;traffic imbalance across partitions&lt;/strong&gt;, &lt;strong&gt;monitoring&lt;/strong&gt;, and &lt;strong&gt;automated system operations&lt;/strong&gt; without impacting availability or performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;introduction&#34;&gt;Introduction&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;goal of the design&lt;/strong&gt; of DynamoDB is to complete all requests with &lt;strong&gt;low single-digit millisecond&lt;/strong&gt; latencies.&lt;/li&gt;
&lt;li&gt;DynamoDB uniquely integrates the following six fundamental system properties:&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB is a fully managed cloud service.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB employs a multi-tenant architecture.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB achieves boundless scale for tables.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB provides predictable performance.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB is highly available.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB supports flexible use cases.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;DynamoDB evolved as a distributed database service to meet the needs of its customers without losing its key aspect of providing a single-tenant experience to every customer using a multi-tenant architecture.&lt;/li&gt;
&lt;li&gt;The paper explains the challenges faced by the system and how the service evolved to handle those challenges while &lt;strong&gt;connecting&lt;/strong&gt; the required changes to a common theme of durability, availability, scalability, and predictable performance.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/dynamodb-2022/image-1.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;history&#34;&gt;History&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The design of DynamoDB was motivated by experiences with its predecessor, &lt;strong&gt;Dynamo&lt;/strong&gt;. Dynamo was created in response to the need for a highly scalable, available, and durable key-value database for shopping-cart data.&lt;/li&gt;
&lt;li&gt;Amazon learned that providing applications with direct access to traditional enterprise database instances led to scaling bottlenecks such as connection management, interference between concurrent workloads, and operational problems with tasks such as schema upgrades.&lt;/li&gt;
&lt;li&gt;Service Oriented Architecture was adopted to encapsulate an application’s data behind service-level APIs that allowed sufficient decoupling to address tasks like reconfiguration without having to disrupt clients.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt; combined principles from &lt;strong&gt;Dynamo&lt;/strong&gt; (which teams ran as a self-hosted database, creating an operational burden for developers) and &lt;strong&gt;SimpleDB&lt;/strong&gt;, a fully managed elastic NoSQL database service whose data model could not scale to the large tables DynamoDB needed to support.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamo Limitations:&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SimpleDB limitations&lt;/strong&gt;:&lt;/li&gt;
&lt;li&gt;Amazon concluded that a better solution would combine the best parts of the original Dynamo design (incremental scalability and predictable high performance) with the best parts of SimpleDB (ease of administration of a cloud service, consistency, and a table-based data model that is richer than a pure key-value store)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;architecture&#34;&gt;Architecture&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A DynamoDB table is a collection of items.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://www.usenix.org/conference/atc22/presentation/elhemali">DynamoDB</a></p>
<hr>
<h2 id="dynamodb">DynamoDB</h2>
<h3 id="summaryabstract">Summary/Abstract</h3>
<ul>
<li>Amazon DynamoDB is a NoSQL cloud database service that provides consistent performance at any scale.</li>
<li>Fundamental properties: <strong>consistent performance</strong>, <strong>availability</strong>, <strong>durability</strong>, and a <strong>fully managed serverless</strong> experience.</li>
<li>In 2021, during the 66-hour Amazon Prime Day shopping event, DynamoDB handled a peak of <strong>89.2 million requests per second</strong> while maintaining high availability and <strong>single-digit millisecond</strong> performance.</li>
<li>Design and implementation of <strong>DynamoDB</strong> have evolved since the first launch in 2012. The system has successfully dealt with issues related to <strong>fairness</strong>, <strong>traffic imbalance across partitions</strong>, <strong>monitoring</strong>, and <strong>automated system operations</strong> without impacting availability or performance.</li>
</ul>
<h3 id="introduction">Introduction</h3>
<ul>
<li>The <strong>goal of the design</strong> of DynamoDB is to complete all requests with <strong>low single-digit millisecond</strong> latencies.</li>
<li>DynamoDB uniquely integrates the following six fundamental system properties:</li>
<li><strong>DynamoDB is a fully managed cloud service.</strong></li>
<li><strong>DynamoDB employs a multi-tenant architecture.</strong></li>
<li><strong>DynamoDB achieves boundless scale for tables.</strong></li>
<li><strong>DynamoDB provides predictable performance.</strong></li>
<li><strong>DynamoDB is highly available.</strong></li>
<li><strong>DynamoDB supports flexible use cases.</strong></li>
<li>DynamoDB evolved as a distributed database service to meet the needs of its customers without losing its key aspect of providing a single-tenant experience to every customer using a multi-tenant architecture.</li>
<li>The paper explains the challenges faced by the system and how the service evolved to handle those challenges while <strong>connecting</strong> the required changes to a common theme of durability, availability, scalability, and predictable performance.
<img loading="lazy" src="/images/notes/dynamodb-2022/image-1.png"></li>
</ul>
<h3 id="history">History</h3>
<ul>
<li>The design of DynamoDB was motivated by experiences with its predecessor, <strong>Dynamo</strong>. Dynamo was created in response to the need for a highly scalable, available, and durable key-value database for shopping-cart data.</li>
<li>Amazon learned that providing applications with direct access to traditional enterprise database instances led to scaling bottlenecks such as connection management, interference between concurrent workloads, and operational problems with tasks such as schema upgrades.</li>
<li>Service Oriented Architecture was adopted to encapsulate an application’s data behind service-level APIs that allowed sufficient decoupling to address tasks like reconfiguration without having to disrupt clients.</li>
<li><strong>DynamoDB</strong> combined principles from <strong>Dynamo</strong> (which teams ran as a self-hosted database, creating an operational burden for developers) and <strong>SimpleDB</strong>, a fully managed elastic NoSQL database service whose data model could not scale to the large tables DynamoDB needed to support.</li>
<li><strong>Dynamo Limitations:</strong></li>
<li><strong>SimpleDB limitations</strong>:</li>
<li>Amazon concluded that a better solution would combine the best parts of the original Dynamo design (incremental scalability and predictable high performance) with the best parts of SimpleDB (ease of administration of a cloud service, consistency, and a table-based data model that is richer than a pure key-value store)</li>
</ul>
<h3 id="architecture">Architecture</h3>
<ul>
<li>
<p>A DynamoDB table is a collection of items.</p>
</li>
<li>
<p>Each item is a collection of attributes and is uniquely identified by a primary key.</p>
</li>
<li>
<p>The schema of the primary key is specified at table creation time.</p>
</li>
<li>
<p>The partition key’s value is always used as an input to an internal hash function.</p>
</li>
<li>
<p>The output from the hash function and the sort key value (if present) determines where the item will be stored.</p>
</li>
<li>
<p>Multiple items can have the same partition key value in a table with a composite primary key. However, those items must have different sort key values.</p>
</li>
<li>
<p>Supports secondary indexes to provide enhanced querying capability, which allows querying the data in the table using an alternate key.</p>
</li>
<li>
<p>DynamoDB provides a simple interface to store or retrieve items from a table or an index.
<img loading="lazy" src="/images/notes/dynamodb-2022/image-2.png"></p>
</li>
<li>
<p>DynamoDB supports ACID transactions for multi-item updates without affecting scalability, availability, or performance.</p>
</li>
<li>
<p>A DynamoDB table is divided into multiple partitions.</p>
</li>
<li>
<p>Each partition of the table hosts a disjoint and contiguous part of the table’s key-range and has multiple replicas (a <strong>replication group</strong>) distributed across different Availability Zones for high availability and durability.</p>
</li>
<li>
<p>The Replication Group uses Multi-Paxos for leader election and consensus.</p>
</li>
<li>
<p>Any replica can trigger a round of the election.</p>
</li>
<li>
<p>Once elected leader, a replica can maintain leadership as long as it periodically renews its leadership lease.</p>
</li>
<li>
<p>Only the leader replica can serve <strong>write</strong> and <strong>strongly consistent read</strong> requests.</p>
</li>
<li>
<p>Leader generates a write-ahead log record and sends it to its peers.</p>
</li>
<li>
<p><strong>Write</strong> is acknowledged to the application once a quorum of peers persists the log record to their local write-ahead logs.</p>
</li>
<li>
<p>DynamoDB supports <strong>strongly consistent</strong> reads (served by the leader) and <strong>eventually consistent</strong> reads (served by any replica).</p>
</li>
<li>
<p>The leader of the group extends its leadership using a lease mechanism.</p>
</li>
<li>
<p>If any peer considers the leader of the group failed (unhealthy or unavailable), it can propose a new round of election to elect itself as the new leader. The new leader won’t serve any writes or consistent reads until the previous leader’s lease expires.</p>
</li>
<li>
<p>Partitioning/Replication Group
<img loading="lazy" src="/images/notes/dynamodb-2022/image-3.png"></p>
</li>
<li>
<p><strong>Log Replica/Node</strong> - a replicated write-ahead log maintained for high availability and durability.
<img loading="lazy" src="/images/notes/dynamodb-2022/image-4.png"></p>
</li>
<li>
<p>Multi-Paxos Leader Election and Consensus.</p>
</li>
<li>
<p>Writes and Strongly/Eventually Consistent Reads</p>
</li>
<li>
<p>Microservice architecture
<img loading="lazy" src="/images/notes/dynamodb-2022/image-5.png"></p>
</li>
<li>
<p><strong>Metadata Service</strong></p>
</li>
<li>
<p><strong>Request Router Service</strong></p>
</li>
<li>
<p><strong>Auto-Admin Service (the central nervous system of DynamoDB)</strong></p>
</li>
<li>
<p><strong>Storage Service</strong></p>
</li>
<li>
<p><strong>Features supported by other Services</strong></p>
</li>
</ul>
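<p>The key-to-partition routing described above can be sketched as follows. This is an illustrative model only: the hash function (MD5 here), the boundary layout, and the in-memory partition representation are stand-ins, not DynamoDB’s actual internals.</p>

```python
import bisect
import hashlib

def key_hash(partition_key: str) -> int:
    # Stand-in for DynamoDB's internal hash function (MD5, for illustration).
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")

class Table:
    """A table whose items are spread over partitions, each owning a
    contiguous range of the 64-bit hash space."""
    def __init__(self, num_partitions: int):
        space = 2 ** 64
        # boundaries[i] is the exclusive upper bound of partition i's range.
        self.boundaries = [space * (i + 1) // num_partitions
                           for i in range(num_partitions)]
        # Each partition maps partition_key -> {sort_key: item}.
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition_for(self, partition_key: str) -> dict:
        return self.partitions[bisect.bisect_right(self.boundaries,
                                                   key_hash(partition_key))]

    def put(self, partition_key: str, sort_key, item) -> None:
        # Items sharing a partition key land in the same partition and
        # are distinguished by their sort key (composite primary key).
        self._partition_for(partition_key).setdefault(partition_key, {})[sort_key] = item

    def get(self, partition_key: str, sort_key):
        return self._partition_for(partition_key).get(partition_key, {}).get(sort_key)
```

<p>Note how two items with the same partition key but different sort keys land in the same partition, which is what makes range queries over a single partition key cheap.</p>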
<h3 id="journey-from-provisioned-to-on-demand">Journey from Provisioned to On-Demand</h3>
<ul>
<li>
<p>DDB was launched with <strong>Partitions</strong> as an internal abstraction, as a way to dynamically scale both the <strong>capacity</strong> and <strong>performance</strong> of tables.</p>
</li>
<li>
<p>Customers explicitly specified the throughput that a table required in terms of read capacity units (RCUs) and write capacity units (WCUs). RCUs and WCUs collectively are called <strong>provisioned</strong> throughput.</p>
</li>
<li>
<p>As the demands from a table changed (because it grew in size or because the load increased), partitions could be further split and migrated to allow the table to scale elastically. <strong>Partition</strong> abstraction proved to be really valuable and continues to be central to the design of DynamoDB.</p>
</li>
<li>
<p><strong>[Challenge] This early version tightly coupled the assignment of both capacity and performance to individual partitions, which led to challenges</strong></p>
</li>
<li>
<p>DynamoDB uses <strong>admission control</strong> to ensure that storage nodes don’t become overloaded, to avoid interference between co-resident table partitions, and to enforce the throughput limits requested by customers.</p>
</li>
<li>
<p><strong>Admission control</strong> was the shared responsibility of all storage nodes for a table. Storage nodes independently performed admission control based on the allocations of their locally stored partitions.</p>
</li>
<li>
<p>Allocated throughput of each partition was used to isolate the workloads. DynamoDB enforced a cap on the maximum throughput that could be allocated to a single partition. Total throughput of all the partitions hosted by a storage node is less than or equal to the maximum allowed throughput on the node as determined by the physical characteristics of its storage drives.</p>
</li>
<li>
<p>The throughput allocated to partitions was adjusted when the overall table’s throughput was changed or its partitions were split into child partitions.</p>
</li>
<li>
<p>When a partition was split for size, the allocated throughput of the parent partition was equally divided among the child partitions and was allocated based on the table’s provisioned throughput.</p>
</li>
<li>
<p>For example, assume that a partition can accommodate a maximum provisioned throughput of 1000 WCUs. When a table is created with 3200 WCUs, DynamoDB created four partitions, each allocated 800 WCUs. If the table’s provisioned throughput was increased to 3600 WCUs, each partition’s capacity would increase to 900 WCUs. If the provisioned throughput was increased further to 6000 WCUs, the partitions would be split to create eight child partitions, each allocated 750 WCUs. If the table’s capacity was then decreased to 5000 WCUs, each partition’s capacity would be decreased to 625 WCUs.</p>
</li>
<li>
<p>The uniform distribution of throughput across partitions rests on two assumptions: that an application accesses keys in a table uniformly, and that splitting a partition for size also splits its traffic equally.</p>
</li>
<li>
<p>However, it was discovered that <strong>application workloads frequently have non-uniform access patterns both over time and over key ranges</strong>.</p>
</li>
<li>
<p><strong>Hot Partition Worsening with Split</strong>: When the request rate within a table is non-uniform, splitting a partition and dividing performance allocation proportionately can result in the hot portion of the partition having less available performance than it did before the split.</p>
</li>
<li>
<p><strong>[Single Hot Partition]</strong> Since throughput was allocated statically and enforced at the partition level, these non-uniform workloads occasionally resulted in an application’s reads and writes being rejected (called throttling), even though the total provisioned throughput of the table was sufficient to meet its needs.
The common challenges applications faced were:</p>
</li>
<li>
<p><strong>Hot Partition</strong></p>
</li>
<li>
<p><strong>Throughput Dilution.</strong></p>
</li>
<li>
<p>Customers would work around throttling by increasing the provisioned throughput of the table (even when overall consumption was under the table’s limit), since it was difficult to estimate the correct provisioned throughput.</p>
</li>
<li>
<p><strong>Hot partitions</strong> and <strong>throughput dilution</strong> stemmed from tightly coupling a rigid performance allocation to each partition and dividing that allocation as partitions split. DynamoDB introduced <strong>bursting</strong> and <strong>adaptive capacity</strong> to address these concerns.</p>
</li>
</ul>
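<p>The allocation arithmetic in the worked example above can be reproduced with a short sketch. It computes a from-scratch allocation (DynamoDB actually split existing partitions incrementally), using the example’s 1000-WCU per-partition cap:</p>

```python
def allocate(table_wcus: float, max_partition_wcus: float = 1000):
    """Divide a table's provisioned WCUs equally across partitions,
    doubling the partition count (a split) whenever a partition would
    exceed the per-partition cap."""
    partitions = 1
    while table_wcus / partitions > max_partition_wcus:
        partitions *= 2
    return partitions, table_wcus / partitions
```

<p>With the numbers from the example: 3200 WCUs gives four partitions of 800 each; 3600 gives 900 each; 6000 forces a split into eight partitions of 750; and a later decrease to 5000 leaves 625 per partition.</p>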
<h3 id="improvements-to-admission-control">Improvements to Admission Control</h3>
<h3 id="key-observations">Key Observations</h3>
<ul>
<li>Partitions had non-uniform access/traffic.</li>
<li>Not all partitions hosted by a storage node used their allocated throughput simultaneously.</li>
</ul>
<h3 id="bursting">Bursting</h3>
<ul>
<li>The idea behind <strong>Bursting</strong> was to let applications tap into the unused capacity at a partition level on a best effort basis to absorb short-lived spikes.</li>
<li>DynamoDB retained a portion of a partition’s unused capacity for later bursts of throughput usage for up to 300 seconds and utilized it when consumed capacity exceeded the provisioned capacity of the partition.</li>
<li>DynamoDB still maintained <strong>workload isolation</strong> by ensuring that a partition could only burst if there was <strong>unused throughput at the node level</strong>. The capacity was managed on the storage node using <strong>multiple token buckets</strong> to provide admission control:</li>
<li><strong>[Partition Token + Node Token]</strong> When a read or write request arrives on a storage node, if there were tokens in the partition’s allocated token bucket, then the request was admitted and tokens were deducted from both the partition-level and node-level buckets.</li>
<li><strong>[Burst Token + Node Token]</strong> Once a partition had exhausted all the provisioned tokens, requests were allowed to burst only when tokens were available both in the burst token bucket and the node level token bucket.</li>
<li>Read requests were accepted based on the local token buckets.</li>
<li><strong>[Replica node’s Token Bucket for Write]</strong> Write requests using burst capacity require an additional check on the node-level token bucket of other member replicas of the partition.</li>
<li>The leader replica of the partition periodically collected information about each member’s node-level capacity.</li>
</ul>
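<p>The bucket arrangement above can be sketched with a deterministic, per-second model. The rates, the unit cost per request, and the tick-based refill are illustrative simplifications; the real system works in RCUs/WCUs over continuous time and also consults replica nodes for writes:</p>

```python
class Bucket:
    """A token bucket refilled once per simulated second."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity, self.tokens = rate, capacity, rate

    def refill(self) -> None:
        self.tokens = min(self.capacity, self.tokens + self.rate)

class StorageNode:
    def __init__(self, node_rate: float, partition_rate: float):
        self.node = Bucket(node_rate, node_rate)                 # node-level bucket
        self.allocated = Bucket(partition_rate, partition_rate)  # provisioned tokens
        # Unused provisioned capacity is banked for bursts, up to 300 s worth.
        self.burst = Bucket(0, 300 * partition_rate)

    def tick(self) -> None:
        # Bank whatever the partition left unused this second, then refill.
        self.burst.tokens = min(self.burst.capacity,
                                self.burst.tokens + self.allocated.tokens)
        self.allocated.refill()
        self.node.refill()

    def admit(self, cost: float) -> bool:
        if self.node.tokens < cost:
            return False  # node saturated: neither provisioned nor burst traffic
        # Spend provisioned tokens first; burst only once they are exhausted.
        source = self.allocated if self.allocated.tokens >= cost else self.burst
        if source.tokens < cost:
            return False
        source.tokens -= cost
        self.node.tokens -= cost
        return True
```

<p>The key property is visible in the admit path: a partition can only burst past its allocation when the node-level bucket still has tokens, which preserves workload isolation between co-resident partitions.</p>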
<h3 id="adaptive-capacity">Adaptive Capacity</h3>
<ul>
<li>DynamoDB launched adaptive capacity to better <strong>absorb long-lived spikes</strong> that cannot be absorbed by the burst capacity.</li>
<li>Better absorb work-loads that had heavily skewed access patterns across partitions.</li>
<li>Adaptive capacity actively monitored the provisioned and consumed capacity of all the tables.</li>
<li>If a table experienced throttling and the table level throughput was not exceeded, then it would automatically increase (boost) the allocated throughput of the partitions of the table using a <strong>proportional control algorithm.</strong></li>
<li>The <strong>autoadmin system</strong> ensured that partitions receiving a boost were relocated to nodes with the capacity to serve the increased throughput. Like bursting, adaptive capacity was <strong>best-effort</strong>, but it <strong>eliminated over 99.99%</strong> of the throttling due to skewed access patterns.</li>
</ul>
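<p>A minimal sketch of the proportional-control idea: throttled partitions are boosted toward their observed demand as long as the table as a whole is within its provisioned limit. The gain factor and the single-step form are assumptions for illustration; the paper does not give the controller’s parameters:</p>

```python
def boost_allocations(allocations, consumed, table_limit, gain=0.5):
    """One step of a proportional controller (illustrative gain).
    Each partition whose demand exceeds its allocation is moved
    `gain` of the way toward that demand, but only while the table's
    total consumption stays within its provisioned limit."""
    if sum(consumed) > table_limit:
        return list(allocations)  # table over its limit: no boost
    return [alloc + gain * (used - alloc) if used > alloc else alloc
            for alloc, used in zip(allocations, consumed)]
```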
<h3 id="global-admission-control">Global Admission Control</h3>
<ul>
<li>Even though Bursting and Adaptive Capacity significantly reduced throughput problems for non-uniform access, they had <strong>limitations</strong>.</li>
<li>Takeaway from bursting and adaptive capacity was that we had <strong>tightly coupled partition level capacity</strong> to <strong>admission control</strong>.</li>
<li>Admission control was <strong>distributed and performed at a partition level.</strong></li>
<li>DynamoDB realized <strong>it would be beneficial to remove admission control from the partition</strong> and <strong>let the partition always burst while providing workload isolation</strong>.</li>
<li>DynamoDB replaced <strong>adaptive capacity</strong> with <strong>global admission control (GAC)</strong>.</li>
<li>GAC builds on the same idea of <strong>Token Bucket</strong>.</li>
<li>The GAC service <strong>centrally tracks the total consumption of the table</strong> capacity in terms of tokens.</li>
<li>Each request router maintains a <strong>local token bucket</strong> to make admission decisions and communicates with <strong>GAC to replenish tokens at regular intervals</strong> (in the order of a few seconds).</li>
<li><strong>[Important Design Consideration] Each GAC server can be stopped and restarted without any impact on the overall operation of the service</strong>.</li>
<li>Each GAC server can track one or more token buckets configured independently.</li>
<li>All the GAC servers are part of an independent hash ring.</li>
<li><strong>Request routers manage several time-limited tokens locally</strong>. When a request from the application arrives, the request router deducts tokens. Eventually the request router runs out of tokens, through consumption or expiry. <strong>When the request router runs out of tokens, it requests more tokens from GAC.</strong></li>
<li>The GAC instance uses the information provided by the client to <strong>estimate the global token consumption and vend the client its share of the tokens available for the next time unit</strong>.</li>
<li>Thus, it ensures that non-uniform workloads that send traffic to only a subset of items can execute up to the maximum partition capacity.</li>
<li>In addition to the global admission control scheme, the partition-level token buckets were retained for defense in-depth. The capacity of these token buckets is then capped to ensure that one application doesn’t consume all or a significant share of the resources on the storage nodes.</li>
</ul>
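<p>The router/GAC token flow can be sketched as below. The equal-share vending policy is a deliberate simplification: the real GAC estimates each client’s share from the reported global consumption, and its tokens are time-limited:</p>

```python
class GACServer:
    """Central tracker of a table's capacity (simplified sketch)."""
    def __init__(self, table_tokens_per_interval: float, num_routers: int):
        self.capacity = table_tokens_per_interval
        self.num_routers = num_routers
        self.reported = 0.0  # consumption reported by routers

    def vend(self, consumed_since_last: float) -> float:
        # Record the router's report, then hand out an equal share of the
        # next interval's capacity (the real GAC estimates shares globally).
        self.reported += consumed_since_last
        return self.capacity / self.num_routers

class RequestRouter:
    """Admits requests from a local bucket, replenished from GAC."""
    def __init__(self, gac: GACServer):
        self.gac = gac
        self.tokens = 0.0
        self.consumed = 0.0

    def admit(self, cost: float) -> bool:
        if self.tokens < cost:  # ran out: ask GAC for more
            self.tokens += self.gac.vend(self.consumed)
            self.consumed = 0.0
        if self.tokens < cost:
            return False
        self.tokens -= cost
        self.consumed += cost
        return True
```

<p>Because admission happens against the router’s local bucket, a GAC round trip is only needed every replenishment interval, and a GAC server restart merely delays the next replenishment rather than breaking admission.</p>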
<h3 id="balancing-consumed-capacity">Balancing Consumed Capacity</h3>
<ul>
<li>Letting partitions always burst required DynamoDB to manage burst capacity effectively.</li>
<li><strong>Colocation</strong> was a straightforward problem with provisioned throughput tables because of static partitions.</li>
</ul>
<h3 id="splitting-for-consumption">Splitting for Consumption</h3>
<ul>
<li><strong>[Problem]</strong> Even with GAC and the ability for partitions to always burst, tables could experience throttling if their traffic was skewed to a specific set of items.</li>
<li><strong>[Solution]</strong></li>
<li>DynamoDB automatically scales out partitions once the consumed throughput of a partition crosses a certain threshold.</li>
<li>The split point in the key range is chosen based on key distribution the partition has observed.</li>
<li>The observed key distribution serves as a proxy for the application’s access pattern and is more effective than splitting the key range in the middle.</li>
<li>Partition splits usually complete in the order of minutes.</li>
<li><strong>[Catch]</strong> A class of workloads still exists that cannot benefit from splitting for consumption. For example, a partition receiving high traffic to a single item, or a partition whose key range is accessed sequentially, will not benefit from a split; DynamoDB detects such access patterns and avoids splitting the partition.</li>
</ul>
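<p>The split-point choice can be illustrated with a small heuristic over observed per-key traffic. This is an illustrative approximation of “split where the observed load divides evenly”, not DynamoDB’s actual algorithm:</p>

```python
def choose_split_point(key_traffic: dict):
    """Given observed request rates per key, return the last key of the
    left child: the earliest key at which cumulative traffic reaches
    half the total. Returns None when traffic is concentrated on the
    final key, where no split of the key range can divide the load."""
    total = sum(key_traffic.values())
    last_key = max(key_traffic)
    running = 0.0
    for key in sorted(key_traffic):
        running += key_traffic[key]
        if running >= total / 2:
            return None if key == last_key else key
    return None
```

<p>Contrast with a midpoint split: for traffic {"a": 90, "b": 5, "c": 5}, splitting the key range in the middle leaves almost all load in one child, while the observed distribution isolates the hot key "a" into its own partition.</p>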
<h3 id="on-demand-provisioning">On Demand Provisioning</h3>
<ul>
<li><strong>[Context]</strong></li>
<li>Applications that migrated to DynamoDB initially ran on self-provisioned servers, either on-premises or as self-hosted databases.</li>
<li>DynamoDB provides a simplified serverless operational model and a new model for provisioning - read and write capacity units.</li>
<li><strong>[Problem]</strong></li>
<li>The concept of capacity units was new to customers, and some found it challenging to forecast the provisioned throughput.</li>
<li>Customers either over-provisioned (low utilization) or under-provisioned (throttling).</li>
<li><strong>[Solution]</strong> To improve the customer experience for spiky workloads, DDB launched <strong>On-Demand Tables</strong>.</li>
<li>DynamoDB provisions the on-demand tables <strong>based on the consumed capacity</strong> by <strong>collecting the signal of reads and writes</strong> and instantly accommodates <strong>up to double the previous peak traffic</strong> on the table.</li>
<li>On-demand scales a table by splitting partitions for consumption. The split decision algorithm is based on traffic.</li>
<li>GAC allows DynamoDB to monitor and protect the system from one application consuming all the resources.</li>
</ul>
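<p>The “up to double the previous peak” rule can be sketched as a simple peak tracker. The initial floor value is an assumption for illustration; in the real system the signal is the consumed read and write capacity:</p>

```python
class OnDemandTable:
    """Tracks the observed peak request rate and keeps the table able
    to absorb up to double that peak instantly."""
    def __init__(self, initial_peak: float = 100.0):  # assumed floor
        self.peak = initial_peak

    def capacity(self) -> float:
        return 2 * self.peak

    def observe(self, request_rate: float) -> bool:
        """Returns whether the traffic fits in current capacity, and
        raises the recorded peak from the consumed-capacity signal."""
        admitted = request_rate <= self.capacity()
        self.peak = max(self.peak, min(request_rate, self.capacity()))
        return admitted
```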
<h3 id="durability-and-correctness">Durability and Correctness</h3>
<ul>
<li>Data loss can occur because of hardware failures, software bugs, or hardware bugs.</li>
<li>DynamoDB is designed for high durability by having mechanisms to prevent, detect, and correct any potential data losses.</li>
</ul>
<h3 id="hardware-failures">Hardware Failures</h3>
<ul>
<li><strong>Write-ahead logs(WAL)</strong> in DynamoDB are central for providing durability and crash recovery. Write ahead logs are stored in all three replicas of a partition.</li>
<li>For higher durability, the write-ahead logs are periodically archived to S3, an object store that is designed for 11 nines (99.999999999%) of durability.</li>
<li>The unarchived logs are typically a few hundred megabytes in size.</li>
<li>When a node fails, all replication groups hosted on the node are down to two copies.</li>
<li>The process of healing a storage replica can take several minutes because the repair process involves copying the <strong>B-tree</strong> and <strong>write-ahead logs</strong>.</li>
<li><strong>[Solution]</strong> Upon detecting an unhealthy storage replica, the leader of a replication group adds a <strong>log replica</strong> to ensure there is no impact on durability.</li>
<li>Adding a log replica takes only a few seconds because the system has to copy only the recent write-ahead logs from a healthy replica to the new replica without the B-tree. Quick healing of impacted replication groups using log replicas ensures high durability of most recent writes.</li>
</ul>
<h3 id="silent-data-errors">Silent Data Errors</h3>
<ul>
<li><strong>[Problem]</strong> Some hardware failures can cause incorrect data to be stored. These errors can happen because of the storage media, CPU, or memory.</li>
<li>It&rsquo;s very difficult to detect these and they can happen anywhere in the system.</li>
<li><strong>[Solution]</strong> DynamoDB makes extensive use of <strong>checksums</strong> to detect silent errors.</li>
<li>By maintaining checksums within every log entry, message, and log file, DynamoDB validates data integrity for every data transfer between two nodes.</li>
<li><strong>Checksums</strong> serve as guardrails to prevent errors from spreading to the rest of the system.</li>
<li>Every log file that is archived to S3 has a <strong>manifest</strong> that contains information about the log, such as the table, the partition, and start and end markers for the data stored in the log file.</li>
<li>The agent responsible for archiving log files to S3 performs various checks before uploading the data. These include, but are not limited to, verification of every log entry to ensure that it belongs to the correct table and partition, verification of checksums to detect any silent errors, and verification that the log file doesn’t have any holes in the sequence numbers.</li>
<li>Once all the checks are passed, the log file and its manifest are archived. Log archival agents run on all three replicas of the replication group. If one of the agents finds that a log file is already archived, the agent downloads the uploaded file to verify the integrity of the data by comparing it with its local write-ahead log.</li>
<li>Every log file and manifest file are uploaded to <strong>S3 with a content checksum</strong>. The <strong>content checksum is checked by S3 as part of the put operation</strong>, which guards against any <strong>errors during data transit to S3.</strong></li>
</ul>
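<p>The archiver’s pre-upload checks can be sketched as below. The record layout and the use of CRC-32 are illustrative assumptions; the text specifies only that every entry carries a checksum and that table/partition membership and sequence-number continuity are verified:</p>

```python
import zlib

def make_entry(table: str, partition: str, seq: int, payload: bytes) -> dict:
    body = f"{table}|{partition}|{seq}|".encode() + payload
    return {"table": table, "partition": partition, "seq": seq,
            "payload": payload, "crc": zlib.crc32(body)}

def verify_log_file(entries: list, table: str, partition: str) -> bool:
    """Checks performed before archiving a log file to S3: correct
    table/partition for every entry, matching checksums, and no holes
    in the sequence numbers."""
    if not entries:
        return True
    seqs = [e["seq"] for e in entries]
    if seqs != list(range(seqs[0], seqs[0] + len(seqs))):
        return False  # hole (or reordering) in the sequence numbers
    for e in entries:
        if e["table"] != table or e["partition"] != partition:
            return False  # entry belongs to a different table/partition
        body = f"{e['table']}|{e['partition']}|{e['seq']}|".encode() + e["payload"]
        if zlib.crc32(body) != e["crc"]:
            return False  # silent data error detected
    return True
```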
<h3 id="continuous-verification">Continuous Verification</h3>
<ul>
<li>DynamoDB also continuously verifies data at rest. Our goal is to detect any silent data errors or bit rot in the system. An example of such a continuous verification system is the <strong>scrub process</strong>.</li>
<li>The goal of <strong>scrub</strong> is to detect errors that we had not anticipated, such as <strong>bit rot</strong>.</li>
<li>The <strong>scrub process</strong> verifies two things: that all three replicas in a replication group have the same data, and that the live replicas match a reference replica built offline from the archived write-ahead logs.</li>
<li>The verification is done by computing the checksum of the live replica and matching that with a snapshot of one generated from the log entries archived in S3.</li>
<li><strong>Scrub</strong> acts as a defense in depth <strong>to detect divergences</strong> between the <strong>live storage replicas</strong> with the <strong>replicas built using the history of logs</strong> from the inception of the table.</li>
<li>A similar technique of continuous verification is used to verify replicas of <strong>global tables.</strong></li>
<li>We have learned that continuous verification of data-at-rest is the most reliable method of protecting against hardware failures, silent data corruption, and even software bugs.</li>
</ul>
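<p>The scrub comparison can be sketched as: replay the archived log history into a reference replica, checksum both replicas, and compare. The item model and log format are simplified assumptions:</p>

```python
import hashlib

def replica_checksum(items: dict) -> str:
    """Checksum of a replica's contents in a canonical key order."""
    digest = hashlib.sha256()
    for key in sorted(items):
        digest.update(f"{key}={items[key]};".encode())
    return digest.hexdigest()

def rebuild_from_logs(archived_log: list) -> dict:
    """Build a reference replica by replaying the archived write-ahead
    log from the inception of the table."""
    items = {}
    for op, key, value in archived_log:
        if op == "put":
            items[key] = value
        else:  # "delete"
            items.pop(key, None)
    return items

def scrub(live_replica: dict, archived_log: list) -> bool:
    """True when the live replica matches the replica implied by the
    full log history; False signals bit rot or a silent divergence."""
    rebuilt = rebuild_from_logs(archived_log)
    return replica_checksum(live_replica) == replica_checksum(rebuilt)
```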
<h3 id="software-bugs">Software Bugs</h3>
<ul>
<li><strong>[Problem]</strong> DynamoDB is a complex distributed key-value store. High complexity increases the probability of human error in design, code, and operations. Errors in the system could cause loss or corruption of data, or violate other interface contracts that customers depend on.</li>
<li><strong>[Solution]</strong> DDB uses formal methods extensively to ensure the correctness of our replication protocols. The core replication protocol was specified using TLA+.</li>
<li>When new features that affect the replication protocol are added, they are incorporated into the specification and model checked.</li>
<li>Model checking has allowed us to catch subtle bugs that could have led to durability and correctness issues before the code went into production. S3 also uses Model Checking.</li>
<li><strong>Extensive failure injection testing</strong> and <strong>stress testing</strong> to ensure the correctness of every piece of software deployed.</li>
<li>In addition to testing and verification of the replication protocol of the data plane, <strong>formal methods have also been used to verify the correctness of our control plane and features such as distributed transactions</strong>.</li>
</ul>
<h3 id="backups-and-restore">Backups and Restore</h3>
<ul>
<li>In addition to guarding against physical media corruption, DynamoDB also supports backup and restore <strong>to protect against any logical corruption</strong> <strong>due to a bug in a customer’s application</strong>. Backups or restores don’t affect performance or availability of the table as they are built using the write-ahead logs that are archived in S3.</li>
<li>The backups are consistent across multiple partitions up to the nearest second.</li>
<li>The backups are full copies of DynamoDB tables and are stored in an Amazon S3 bucket.</li>
<li>DynamoDB also supports <strong>point-in-time restore</strong> where customers can <strong>restore the contents of a table that existed at any time in the previous 35 days to a different DynamoDB table in the same region.</strong></li>
<li>For tables with point-in-time restore enabled, DynamoDB creates <strong>periodic</strong> snapshots (<strong>based on the amount of write-ahead logs accumulated for the partition</strong>) of the partitions that belong to the table and uploads them to S3.</li>
<li>Snapshots, in conjunction with write-ahead logs, are used to do point-in-time restores.</li>
<li><strong>[Workflow]</strong> When a point-in-time restore is requested for a table, the snapshots closest to the requested time are identified for each partition, and the archived write-ahead logs are replayed on top of them up to that time.</li>
</ul>
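<p>The restore path above (snapshot plus archived write-ahead logs) can be sketched roughly as follows; the <code>Snapshot</code> and <code>LogRecord</code> shapes are hypothetical, invented purely for illustration.</p>

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    taken_at: float   # time at which the partition snapshot was captured
    items: dict       # key -> value at snapshot time


@dataclass
class LogRecord:
    ts: float
    key: str
    value: object     # None models a delete


def point_in_time_restore(snapshots, wal, restore_ts):
    """Restore a partition's state as of restore_ts: pick the most recent
    snapshot taken at or before restore_ts, then replay the archived
    write-ahead-log records between the snapshot and restore_ts."""
    base = max((s for s in snapshots if s.taken_at <= restore_ts),
               key=lambda s: s.taken_at)
    state = dict(base.items)
    for rec in sorted(wal, key=lambda r: r.ts):
        if base.taken_at < rec.ts <= restore_ts:
            if rec.value is None:
                state.pop(rec.key, None)
            else:
                state[rec.key] = rec.value
    return state
```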
<h3 id="availability">Availability</h3>
<ul>
<li>To achieve high availability, DynamoDB tables are distributed and replicated across multiple Availability Zones (AZ) in a Region. DynamoDB regularly tests resilience to node, rack, and AZ failures.</li>
<li>To test the availability and durability of the overall service, power-off tests are exercised. Using realistic simulated traffic, random nodes are powered off using a job scheduler. At the end of all the power-off tests, the test tools verify that the data stored in the database is logically valid and not corrupted.</li>
</ul>
<h3 id="write-and-consistent-read-availability">Write and Consistent Read Availability</h3>
<ul>
<li>A partition’s <strong>write availability</strong> depends on its ability to have a <strong>healthy leader</strong> and a <strong>healthy write quorum.</strong></li>
<li>A <strong>healthy write quorum</strong> in the case of DynamoDB consists of two out of the three replicas from different AZs.</li>
<li>A partition remains available as long as there are enough healthy replicas for a write quorum and a leader.</li>
<li>A partition becomes unavailable for writes if the number of healthy replicas drops below the minimum needed for a write quorum.</li>
<li>The leader replica serves <strong>consistent reads</strong>.</li>
<li>Introducing <strong>log replicas</strong> was a big change to the system, and the formally proven implementation of Paxos provided the confidence to safely tweak and experiment with the system to achieve higher availability.</li>
<li>Eventually consistent reads can be served by any of the replicas.</li>
<li>In case a leader replica fails, other replicas detect its failure and elect a new leader to minimize disruptions to the availability of consistent reads.</li>
</ul>
<h3 id="failure-detection">Failure Detection</h3>
<ul>
<li><strong>[Problem]</strong> A newly elected leader will have to wait for the expiry of the old leader’s lease before serving any traffic. While this only takes a couple of seconds, the elected leader cannot accept any new writes or consistent read traffic during that period, thus disrupting availability.</li>
<li>Failure detection must be quick and robust to minimize disruptions. False positives in failure detection can lead to more disruptions in availability. Failure detection works well for failure scenarios where every replica of the group loses connection to the leader.</li>
<li>However, nodes can experience gray network failures (gray failures).</li>
<li>Gray network failures can happen because of communication issues between a leader and follower, issues with outbound or inbound communication of a node, or front-end routers facing communication issues with the leader even though the leader and followers can communicate with each other.</li>
<li>Gray failures can disrupt availability because failure detection may produce a false positive, or may not detect the failure at all.</li>
<li>For example, a replica that isn’t receiving heartbeats from a leader will try to elect a new leader. This can disrupt availability.</li>
<li><strong>[Solution]</strong> To solve the availability problem caused by gray failures, a follower that wants to trigger a failover sends a message to other replicas in the replication group asking if they can communicate with the leader. If replicas respond with a healthy leader message, the follower drops its attempt to trigger a leader election. This change in the failure detection algorithm used by DynamoDB significantly minimized the number of false positives in the system, and hence the number of spurious leader elections.</li>
</ul>
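<p>The peer-confirmation check described above can be sketched as follows; the node names and the <code>can_reach</code> callback are illustrative, not from the paper.</p>

```python
def should_trigger_election(peers, leader, can_reach):
    """Follower-side check before starting a leader election.

    A follower that has stopped receiving heartbeats first asks the other
    replicas in its group whether *they* can still reach the leader. If any
    peer reports a healthy leader, the fault is likely a gray (partial)
    failure local to this follower, so the election attempt is dropped."""
    for peer in peers:
        if can_reach(peer, leader):   # peer confirms a healthy leader
            return False              # drop the spurious election attempt
    return True                       # nobody can reach the leader: fail over
```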
<h3 id="measuring-availability">Measuring Availability</h3>
<ul>
<li>DynamoDB is designed for <strong>99.999</strong> (5 9s) percent availability for global tables and <strong>99.99</strong> (4 9s) percent availability for regional tables.</li>
<li>To ensure these goals are being met, DynamoDB continuously monitors availability at the service and table levels. The tracked availability data is used to analyze customer-perceived availability trends and to trigger alarms if customers see errors above a certain threshold. These customer-facing alarms (CFAs) report availability-related problems so they can be mitigated proactively, either automatically or through operator intervention.</li>
<li>In addition to real-time monitoring of availability, the system runs daily jobs that calculate aggregate availability metrics per customer.</li>
<li>DynamoDB also measures and alarms on availability observed on the client-side. There are <strong>two sets of clients</strong> used to measure the <strong>user-perceived availability</strong>.</li>
<li>Real application traffic allows us to reason about DynamoDB availability and latencies as seen by our customers and catch gray failures.</li>
</ul>
<h3 id="deployments">Deployments</h3>
<ul>
<li>Unlike a traditional relational database, <strong>DynamoDB takes care of deployments without the need for maintenance windows</strong> and without impacting the performance and availability that customers experience.</li>
<li>The rollback procedure is often missed in testing and can lead to customer impact. DynamoDB runs a suite of <strong>upgrade and downgrade tests</strong> at a component level before every deployment.</li>
<li><strong>[Problem]</strong> Deployments are not atomic in a distributed system. At any given time, there will be software running the old code on some nodes and new code on other parts of the fleet.</li>
<li>New software might introduce a new type of message or change the protocol in a way that old software in the system doesn’t understand.</li>
<li><strong>[Solution]</strong> DynamoDB handles these kinds of changes with <strong>read-write deployments</strong>. Read-write deployment is completed as a multi-step process.</li>
<li>The first step is to deploy the software to read the new message format or protocol. Once all the nodes can handle the new message, the software is updated to send new messages.</li>
<li><strong>Read-write</strong> deployments ensure that both types of messages can coexist in the system. Even in the case of rollbacks, the system can understand both old and new messages.</li>
<li><strong>[OneBox]</strong> Deployments are done on a small set of nodes before pushing them to the entire fleet of nodes. The strategy reduces the potential impact of faulty deployments.</li>
<li>[<strong>AutoRollback AlarmWatcher/ApprovalWorkflow</strong>] DynamoDB sets alarm thresholds on availability metrics. If error rates or latency exceed the threshold values during deployments, the system triggers automatic rollbacks.</li>
<li><strong>[Problem]</strong> Software deployments to storage nodes trigger <strong>leader failovers</strong>, which are designed to avoid any impact to availability.</li>
</ul>
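<p>A minimal sketch of the two-phase read-write deployment idea, using a hypothetical wire format (the <code>version</code> field and key names are invented for illustration):</p>

```python
# Phase 1 deploy: every node learns to READ both message formats.
def decode(msg: dict) -> dict:
    if msg.get("version") == 2:   # new format, understood after phase 1
        return {"op": msg["operation"], "payload": msg["body"]}
    return {"op": msg["op"], "payload": msg["payload"]}   # old format


# Phase 2 deploy (only after phase 1 has reached the whole fleet):
# nodes start WRITING the new format. A rollback at any point is safe
# because every node still understands both formats.
def encode(op: str, payload: str, send_v2: bool) -> dict:
    if send_v2:
        return {"version": 2, "operation": op, "body": payload}
    return {"op": op, "payload": payload}
```

<p>Because decoding ships a full fleet rollout ahead of encoding, old and new messages can coexist safely at every point of the deployment, including during rollbacks.</p>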
<h3 id="dependencies-on-external-services">Dependencies on External Services</h3>
<ul>
<li>To ensure high availability, all the services that DynamoDB depends on in the request path should be more highly available than DynamoDB.</li>
<li>Alternatively, DynamoDB should be able to continue to operate even when the services on which it depends are impaired.</li>
<li>Examples of services DynamoDB depends on for the request path include AWS Identity and Access Management Services (IAM), and AWS Key Management Service (AWS KMS) for tables encrypted using customer keys. DynamoDB uses IAM and AWS KMS to authenticate every customer request.</li>
<li>While these services are highly available, DynamoDB is designed to operate when these services are unavailable without sacrificing any of the security properties that these systems provide.</li>
<li>In the case of IAM and AWS KMS, DynamoDB employs a <strong>statically stable design</strong>, where the overall system keeps working even when a dependency becomes impaired.</li>
<li>The system may not see updated information that its dependency was supposed to deliver, but everything that worked before the dependency became impaired continues to work.</li>
<li>DynamoDB caches results from IAM and AWS KMS in the request routers that perform the authentication of every request. DynamoDB periodically refreshes the cached results asynchronously.</li>
<li>If AWS IAM or KMS were to become unavailable, the routers will continue to use the cached results for a predetermined extended period.</li>
<li>Caches improve response times by removing the need to do an off-box call, which is especially valuable when the system is under high load.</li>
</ul>
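<p>A toy sketch of the statically stable caching pattern: keep serving cached authentication results when the upstream dependency is impaired. The class and parameter names are invented, and real request routers refresh asynchronously rather than on the request path as shown here.</p>

```python
import time


class StaticallyStableCache:
    """Cache of authorization results that falls back to stale entries
    when the upstream dependency (e.g. an auth service) is down.

    `fetch` stands in for the off-box call; it may raise on failure."""

    def __init__(self, fetch, refresh_interval=60.0):
        self.fetch = fetch
        self.refresh_interval = refresh_interval
        self._entries = {}   # key -> (value, fetched_at)

    def get(self, key):
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry and now - entry[1] < self.refresh_interval:
            return entry[0]               # fresh enough: no off-box call
        try:
            value = self.fetch(key)       # refresh (async in practice)
            self._entries[key] = (value, now)
            return value
        except Exception:
            if entry is not None:
                return entry[0]           # dependency impaired: serve cached result
            raise                         # nothing cached to fall back on
```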
<h3 id="metadata-availability">Metadata Availability</h3>
<ul>
<li>One of the most important pieces of metadata the request routers need is the mapping between a table&rsquo;s primary keys and storage nodes.</li>
<li>[<strong>Metadata Storage</strong>] At launch, DynamoDB stored the metadata in DynamoDB itself.</li>
<li>[<strong>Routing Schema</strong>] This routing information consists of all the partitions for a table, the key range of each partition, and the storage nodes hosting the partition.</li>
<li>[<strong>Router Metadata Caching</strong>] When a router received a request for a table it had not seen before, it downloaded the routing information for the entire table and cached it locally. Since the configuration information about partition replicas rarely changes, the <strong>cache hit rate</strong> was approximately <strong>99.75 percent</strong>.</li>
</ul>
<h3 id="dynamodb-limits">DynamoDB Limits</h3>
<ul>
<li><strong>Per-Partition Read and Write Capacity Units</strong> - <strong>Ref</strong></li>
<li><strong>1 MB limit</strong> on the size of data returned by a single <strong>Query</strong>, <strong>Scan</strong>, or <strong>GetItem</strong> operation.</li>
<li><strong>BatchGetItem</strong> operation can return up to <strong>16 MB</strong> of data - <strong>Ref</strong></li>
<li><strong>Item Size Limit</strong>: Ref</li>
<li><strong>Secondary Indexes</strong> - <strong>Ref</strong></li>
<li><strong>Transactions:</strong></li>
</ul>
<h3 id="microbenchmarks">MicroBenchmarks</h3>
<ul>
<li>To show that scale doesn&rsquo;t affect the latencies observed by applications, we ran YCSB [8] workloads of types A (50 percent reads and 50 percent updates) and B (95 percent reads and 5 percent updates).</li>
<li>Both benchmarks used a uniform key distribution and items of size 900 bytes.</li>
<li>The workloads were scaled from 100 thousand total operations per second to 1 million total operations per second.</li>
<li>The purpose of the graph is to show that DynamoDB read latencies exhibit very little variance and remain essentially identical even as the throughput of the workload increases.
<img loading="lazy" src="/images/notes/dynamodb-2022/image-6.png"></li>
</ul>
<hr>
<p><strong>Paper Link:</strong> <a href="https://www.usenix.org/conference/atc22/presentation/elhemali">https://www.usenix.org/conference/atc22/presentation/elhemali</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Dynamo </title>
      <link>https://www.sethihemant.com/notes/dynamo-2007/</link>
      <pubDate>Mon, 02 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/dynamo-2007/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf&#34;&gt;Dynamo &lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;dynamo--distributed-key-value-store&#34;&gt;Dynamo / Distributed Key Value Store&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Design a &lt;strong&gt;distributed key-value store(or Distributed Hash Table)&lt;/strong&gt; that is highly available (i.e., reliable), highly scalable, and completely decentralized.&lt;/p&gt;
&lt;h3 id=&#34;features&#34;&gt;Features&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Highly available Key-Value Store.&lt;/li&gt;
&lt;li&gt;Shopping Cart, Bestseller Lists, Sales Rank, Product Catalog, etc which &lt;strong&gt;needs only primary-key access to data&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Multi-table RDBMS would limit scalability and availability.&lt;/li&gt;
&lt;li&gt;Can choose desired Level of &lt;strong&gt;Availability&lt;/strong&gt; and &lt;strong&gt;Consistency&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;background&#34;&gt;Background?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Designed for &lt;strong&gt;high availability&lt;/strong&gt; (at a massive scale) and &lt;strong&gt;partition tolerance&lt;/strong&gt; at the expense of &lt;strong&gt;strong consistency&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Primary Motivation for being optimized for High Availability(Over consistency) was to be always up for serving customer requests to provide better customer experience.&lt;/li&gt;
&lt;li&gt;Dynamo design inspired various NoSQL Databases, &lt;strong&gt;Cassandra&lt;/strong&gt;, &lt;strong&gt;Riak&lt;/strong&gt;, &lt;strong&gt;VoldemortDB&lt;/strong&gt;, &lt;strong&gt;DynamoDB&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;design-goals&#34;&gt;Design Goals?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Highly Available&lt;/li&gt;
&lt;li&gt;Reliability&lt;/li&gt;
&lt;li&gt;Highly Scalable&lt;/li&gt;
&lt;li&gt;Decentralized&lt;/li&gt;
&lt;li&gt;Eventually Consistent(EC) - Weaker Consistency model than Strong Consistency(Linearizability)&lt;/li&gt;
&lt;li&gt;(&lt;strong&gt;Notes:&lt;/strong&gt; ) Latency Requirements?&lt;/li&gt;
&lt;li&gt;(&lt;strong&gt;Notes:&lt;/strong&gt; ) Geographical Distribution of Data?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;use-cases&#34;&gt;Use cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Dynamo can achieve strong consistency, but it comes with a performance impact. If Strong Consistency is a requirement, Dynamo is not the best option.&lt;/li&gt;
&lt;li&gt;Applications that need tight control over the trade-offs between availability, consistency, cost-effectiveness, and performance.&lt;/li&gt;
&lt;li&gt;Services that need only Primary Key access to the data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;system-apis&#34;&gt;System APIs:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;get(key)&lt;/strong&gt; : T… Object, Context&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;put(key, context, object)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Dynamo treats both the &lt;strong&gt;object&lt;/strong&gt; and &lt;strong&gt;the key&lt;/strong&gt; as an arbitrary &lt;strong&gt;array of bytes&lt;/strong&gt; (typically &lt;strong&gt;less than 1 MB&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;MD5 Hashing algorithm&lt;/strong&gt; on the key to generate &lt;strong&gt;128-bit HashID&lt;/strong&gt;, which is used to determine the storage nodes that are responsible for serving the key.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;high-level-architecture&#34;&gt;High Level Architecture&lt;/h3&gt;
&lt;h3 id=&#34;agenda&#34;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Data Distribution(Partitioning)&lt;/li&gt;
&lt;li&gt;Data Replication and Consistency&lt;/li&gt;
&lt;li&gt;Handling Temporary Failures (Fault Tolerance)&lt;/li&gt;
&lt;li&gt;Inter-Node communication(Unreliable Network) and Failure Detection&lt;/li&gt;
&lt;li&gt;High Availability&lt;/li&gt;
&lt;li&gt;Conflict resolution and handling permanent failures.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;data-partitioning&#34;&gt;Data Partitioning&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Distributing data across a set of nodes is called &lt;strong&gt;data partitioning&lt;/strong&gt;.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">Dynamo </a></p>
<hr>
<h2 id="dynamo--distributed-key-value-store">Dynamo / Distributed Key Value Store</h2>
<p><strong>Problem:</strong> Design a <strong>distributed key-value store(or Distributed Hash Table)</strong> that is highly available (i.e., reliable), highly scalable, and completely decentralized.</p>
<h3 id="features">Features</h3>
<ul>
<li>Highly available Key-Value Store.</li>
<li>Shopping Cart, Bestseller Lists, Sales Rank, Product Catalog, etc which <strong>needs only primary-key access to data</strong>.</li>
<li>Multi-table RDBMS would limit scalability and availability.</li>
<li>Can choose desired Level of <strong>Availability</strong> and <strong>Consistency</strong>.</li>
</ul>
<h3 id="background">Background?</h3>
<ul>
<li>Designed for <strong>high availability</strong> (at a massive scale) and <strong>partition tolerance</strong> at the expense of <strong>strong consistency</strong>.</li>
<li>The primary motivation for optimizing for high availability (over consistency) was to always be up to serve customer requests and provide a better customer experience.</li>
<li>Dynamo&rsquo;s design inspired various NoSQL databases, including <strong>Cassandra</strong>, <strong>Riak</strong>, <strong>Voldemort</strong>, and <strong>DynamoDB</strong>.</li>
</ul>
<h3 id="design-goals">Design Goals?</h3>
<ul>
<li>Highly Available</li>
<li>Reliability</li>
<li>Highly Scalable</li>
<li>Decentralized</li>
<li>Eventually Consistent(EC) - Weaker Consistency model than Strong Consistency(Linearizability)</li>
<li>(<strong>Notes:</strong> ) Latency Requirements?</li>
<li>(<strong>Notes:</strong> ) Geographical Distribution of Data?</li>
</ul>
<h3 id="use-cases">Use cases</h3>
<ul>
<li>Dynamo can achieve strong consistency, but it comes with a performance impact. If Strong Consistency is a requirement, Dynamo is not the best option.</li>
<li>Applications that need tight control over the trade-offs between availability, consistency, cost-effectiveness, and performance.</li>
<li>Services that need only Primary Key access to the data.</li>
</ul>
<h3 id="system-apis">System APIs:</h3>
<ul>
<li><strong>get(key)</strong> : T… Object, Context</li>
<li><strong>put(key, context, object)</strong></li>
<li>Dynamo treats both the <strong>object</strong> and <strong>the key</strong> as an arbitrary <strong>array of bytes</strong> (typically <strong>less than 1 MB</strong>).</li>
<li>Uses the <strong>MD5 hashing algorithm</strong> on the key to generate a <strong>128-bit hash ID</strong>, which is used to determine the storage nodes responsible for serving the key.</li>
</ul>
<h3 id="high-level-architecture">High Level Architecture</h3>
<h3 id="agenda">Agenda</h3>
<ul>
<li>Data Distribution(Partitioning)</li>
<li>Data Replication and Consistency</li>
<li>Handling Temporary Failures (Fault Tolerance)</li>
<li>Inter-Node communication(Unreliable Network) and Failure Detection</li>
<li>High Availability</li>
<li>Conflict resolution and handling permanent failures.</li>
</ul>
<h3 id="data-partitioning">Data Partitioning</h3>
<ul>
<li>
<p>Distributing data across a set of nodes is called <strong>data partitioning</strong>.</p>
</li>
<li>
<p><strong>Challenges with Partitioning?</strong></p>
</li>
<li>
<p><strong>Naive Approach(Modulo Hashing)</strong></p>
</li>
<li>
<p><strong>Better Approach(Consistent Hashing)</strong>
<img loading="lazy" src="/images/notes/dynamo-2007/image-1.png"></p>
</li>
<li>
<p>Consistent hashing represents the data managed by a cluster as a <strong>ring</strong>. The ring is divided into smaller <strong>predefined ranges</strong>. Each node in the ring is assigned a range of data. The <strong>start of the range</strong> is called a <strong>token</strong> (each node is assigned one token).
<img loading="lazy" src="/images/notes/dynamo-2007/image-2.png"></p>
</li>
<li>
<p>This scheme works well when a node is added to or removed from the ring, as only the adjacent node is affected in these scenarios.</p>
</li>
<li>
<p>The basic Consistent Hashing algorithm assigns a single token (or a consecutive hash range) to each physical node and does a <strong>static division of ranges</strong> that requires calculating tokens based on a given number of nodes.</p>
</li>
<li>
<p>Dynamo efficiently handles these scenarios (node addition/removal) through the use of <strong>virtual nodes</strong> (<strong>Vnodes</strong>), a new scheme for distributing tokens to physical nodes.</p>
</li>
<li>
<p>Instead of assigning a single token to a node, the hash range is divided into multiple smaller ranges, and each physical node is assigned multiple of these smaller ranges. Each of these subranges is called a <strong>Vnode</strong>.
<img loading="lazy" src="/images/notes/dynamo-2007/image-3.png"></p>
</li>
<li>
<p><strong>Vnodes</strong> are randomly distributed across the cluster and are generally non-contiguous so that no two neighboring <strong>Vnodes</strong> are assigned to the same physical node.</p>
</li>
<li>
<p>Nodes also carry replicas of other nodes for <strong>fault-tolerance</strong>.
<img loading="lazy" src="/images/notes/dynamo-2007/image-4.png"></p>
</li>
<li>
<p>Since there can be <strong>heterogeneous</strong> machines in the clusters, some servers might hold more Vnodes than others.</p>
</li>
<li>
<p><strong>Advantages of VNodes:</strong></p>
</li>
</ul>
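<p>The ring-with-Vnodes scheme above can be sketched in a few lines. This is a toy illustration: the <code>node#i</code> token-derivation rule and the vnode count are invented, though the 128-bit MD5 placement hash matches what the paper describes.</p>

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Consistent-hash ring with virtual nodes: each physical node owns
    several small, scattered hash ranges instead of one contiguous one."""

    def __init__(self, nodes, vnodes_per_node=8):
        pairs = sorted(
            (self._hash(f"{node}#{i}"), node)   # one token per vnode
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._tokens = [t for t, _ in pairs]    # sorted vnode tokens
        self._owners = [n for _, n in pairs]    # physical node per token

    @staticmethod
    def _hash(key):
        # 128-bit MD5 hash, as Dynamo uses for key placement
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        """The first vnode clockwise from the key's hash owns the key."""
        idx = bisect.bisect_right(self._tokens, self._hash(key))
        return self._owners[idx % len(self._owners)]
```

<p>Adding a physical node only claims the keys adjacent to its new tokens; every other key keeps its current owner, which is the property that makes rebalancing cheap.</p>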
<h3 id="data-replication">Data Replication</h3>
<p>Agenda</p>
<ul>
<li>
<p>Optimistic replication</p>
</li>
<li>
<p>Preference List</p>
</li>
<li>
<p>Sloppy Quorum and Handling of Temporary failures</p>
</li>
<li>
<p>Hinted Handoff</p>
</li>
</ul>
<p><strong>Optimistic replication</strong></p>
<ul>
<li>
<p>Replicates each data item on N nodes(N = Replication Factor, configurable per Dynamo instance).</p>
</li>
<li>
<p>Each key is assigned a <strong>coordinator node</strong> (the node that falls first in the hash range), which stores the data locally and <strong>replicates</strong> it to the <strong>N-1 clockwise successor</strong> nodes in the ring; since a write waits for only part of the group to acknowledge, replication to the remaining nodes is effectively <strong>asynchronous</strong> (eventually consistent). This is called <strong>optimistic replication</strong>.
<img loading="lazy" src="/images/notes/dynamo-2007/image-5.png"></p>
</li>
<li>
<p>As Dynamo stores N copies of data spread across different nodes, if one node is down, other replicas can respond to queries for that range of data.</p>
</li>
<li>
<p>If a client cannot contact the coordinator node, it sends the request to a node holding a replica.</p>
</li>
</ul>
<p><strong>Preference List</strong></p>
<ul>
<li>
<p>The list of nodes responsible for storing a particular key is called the preference list.</p>
</li>
<li>
<p>Dynamo is designed so that <strong>every node in the system can determine which nodes should be in this list for any specific key</strong>.</p>
</li>
<li>
<p>This list contains more than N nodes to account for failures, and it skips virtual nodes on the ring so that the list only contains distinct physical nodes.</p>
</li>
</ul>
<p><strong>Sloppy Quorum and handling of temporary failures</strong></p>
<ul>
<li>
<p>Following <strong>traditional/strict</strong> quorum approaches, any distributed system becomes unavailable during server failures or network partitions and would have reduced availability even under simple failure conditions. Dynamo uses Sloppy Quorums.</p>
</li>
<li>
<p>With this approach, all read/write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while moving clockwise on the consistent hashing ring.</p>
</li>
<li>
<p><strong>Fault Tolerance with Sloppy Quorum</strong>.
<img loading="lazy" src="/images/notes/dynamo-2007/image-6.png"></p>
</li>
<li>
<p><strong>Hinted Handoff</strong></p>
</li>
</ul>
<h3 id="vector-clocks-and-conflicting-dataconflict-resolution">Vector Clocks and Conflicting Data(Conflict Resolution)</h3>
<p><strong>Agenda:</strong></p>
<ul>
<li>
<p>Clock Skew?</p>
</li>
<li>
<p>Vector Clock?</p>
</li>
<li>
<p>Conflict Free Replicated Data Types(CRDTs)</p>
</li>
<li>
<p>Last Write Wins (LWW)</p>
</li>
</ul>
<p><strong>Clock Skew</strong></p>
<ul>
<li>
<p>Physical clocks have clock skew, which is tolerable in single-node systems but can mis-order concurrent updates in distributed systems, since the skew differs across nodes.</p>
</li>
<li>
<p>Physical clocks are synchronized using NTP, but some skew always remains; two different nodes&rsquo; physical clocks can never be perfectly synchronized.</p>
</li>
<li>
<p>Using special hardware like GPS clocks and Atomic Clocks can reduce the clock skews, but doesn’t entirely eliminate it.</p>
</li>
<li>
<p>Physical clocks have a problem with the <strong>causal ordering</strong> of events (the <strong>happens-before</strong> relationship).</p>
</li>
</ul>
<p><strong>Vector Clock?</strong></p>
<ul>
<li>
<p>Captures Causal ordering between events.</p>
</li>
<li>
<p>A vector clock is effectively a <strong>list of (node, counter) pairs</strong>. (A single scalar counter would be a Lamport clock; maintaining one counter per node is what lets a vector clock capture causality.)</p>
</li>
<li>
<p>Vector timestamps are attached to every version of the object stored in Dynamo.</p>
</li>
<li>
<p>One can determine whether two versions of an object are on parallel branches or have a causal ordering by examining their vector clocks.</p>
</li>
<li>
<p>If the counters on the first object’s clock are less-than-or-equal to all of the nodes in the second clock, then the first is an ancestor of the second and can be forgotten. Otherwise, the two changes are considered to be in conflict and require reconciliation. Dynamo resolves these conflicts at read-time.</p>
</li>
<li>
<p>Version branching may happen in the presence of failures combined with concurrent updates, resulting in conflicting versions of an object.</p>
</li>
<li>
<p>Dynamo <strong>truncates vector clocks (oldest entries first) when they grow too large</strong>. If Dynamo ends up deleting older vector clock entries that are required to reconcile an object&rsquo;s state, <strong>Dynamo would not be able to achieve eventual consistency</strong>.</p>
</li>
</ul>
<p><strong>Conflict-Free Replicated Data Types (CRDTs)?</strong></p>
<ul>
<li>
<p>To make use of CRDTs, we need to model our data in such a way that concurrent changes can be applied to the data in any order and will produce the same end result. This way, the system does not need to worry about any ordering guarantees.</p>
</li>
<li>
<p>The idea that any two nodes that have received the same set of updates will see the same end result is called <strong>strong eventual consistency</strong>.</p>
</li>
</ul>
<p><strong>Last Write Wins</strong></p>
<ul>
<li>
<p>Dynamo (and Cassandra) also offers a way to do server-side conflict resolution: <strong>LWW</strong>.</p>
</li>
<li>
<p>Uses <strong>physical</strong> (wall-clock/time-of-day) clocks.</p>
</li>
<li>
<p>Can potentially lead to data loss during concurrent writes.</p>
</li>
</ul>
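<p>The ancestor-versus-conflict check described above can be sketched as a comparison of per-node counters. This is a minimal illustration of the rule, not Dynamo&rsquo;s actual implementation.</p>

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks (mapping node id -> counter).

    Returns 'ancestor' if a happened-before b (a can be forgotten),
    'descendant' if b happened-before a, 'equal' if identical, and
    'conflict' if the versions sit on parallel branches and need
    reconciliation."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "ancestor"
    if b_le_a:
        return "descendant"
    return "conflict"
```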
<h3 id="life-of-dynamos-put-and-get-operations">Life of Dynamo’s put() and get() operations.</h3>
<p><strong>Agenda:</strong></p>
<ul>
<li>Strategies for Coordinator selection</li>
<li>Consistency protocol</li>
<li>put() process</li>
<li>get() process</li>
<li>Request handling through a state machine.</li>
</ul>
<h3 id="strategies-for-choosing-coordinator">Strategies for choosing coordinator</h3>
<ul>
<li>Clients route requests using a generic load balancer.</li>
<li>Clients use a <strong>partition-aware client library</strong> that routes requests to the appropriate coordinator with <strong>lower latency</strong>.
<img loading="lazy" src="/images/notes/dynamo-2007/image-7.png"></li>
</ul>
<h3 id="consistency-protocol">Consistency Protocol</h3>
<ul>
<li>
<p>Uses a consistency protocol similar to quorum systems.</p>
</li>
<li>
<p>R + W &gt; N (R/W = the minimum number of nodes that must participate in a successful read/write)</p>
</li>
<li>
<p>A common configuration (N, R, W) for Dynamo is (3, 2, 2).</p>
</li>
<li>
<p>The latency of get() and put() depends upon the slowest of the replicas.</p>
</li>
</ul>
<p><strong>put() Process</strong></p>
<ul>
<li>
<p>Coordinator generates new data version and vector timestamp.</p>
</li>
<li>
<p>Saves data locally.</p>
</li>
<li>
<p>Sends write requests to N-1 highest ranked healthy nodes from the preference list.</p>
</li>
<li>
<p>put() is considered successful after receiving W-1 confirmations (the coordinator&rsquo;s local write provides the Wth).</p>
</li>
</ul>
<p><strong>get() Process</strong></p>
<ul>
<li>
<p>Coordinator requests the data version from N-1 highest ranked healthy nodes from the preference list.</p>
</li>
<li>
<p>Waits until R - 1 replies.</p>
</li>
<li>
<p>Coordinator handles causal data versioning using vector clocks/timestamps.</p>
</li>
<li>
<p>Returns all data versions to the caller.</p>
</li>
</ul>
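<p>The put()/get() flows above can be condensed into a toy coordinator sketch. The <code>Replica</code> interface (<code>store</code>/<code>load</code>) is invented for illustration, and real Dynamo contacts replicas in parallel rather than sequentially.</p>

```python
def quorum_put(replicas, key, value, clock, w):
    """put(): the coordinator (replicas[0]) stores locally, forwards the
    versioned value to the other replicas, and reports success once w
    replicas have acknowledged the write."""
    acks = 0
    for replica in replicas:
        try:
            replica.store(key, (value, clock))
            acks += 1
        except Exception:
            continue          # a failed replica does not block the quorum
        if acks >= w:
            return True       # w confirmations: the write succeeds
    return acks >= w


def quorum_get(replicas, key, r):
    """get(): collect the versions held by r replicas; the coordinator
    would then use vector clocks to drop causally older versions and
    return any remaining conflicting versions to the caller."""
    return [v for replica in replicas[:r]
            if (v := replica.load(key)) is not None]
```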
<h3 id="request-handling-through-the-state-machine">Request handling through the state machine</h3>
<ul>
<li>Each client request results in creating a state machine on the node that received the client request.</li>
<li>The state machine contains all the logic for</li>
<li>Each state machine instance handles exactly one client request.</li>
<li>A <strong>read operation</strong> implements following <strong>state machine</strong>:</li>
<li><strong>Writes:</strong></li>
</ul>
<h3 id="anti-entropy-through-merkle-trees">Anti-Entropy through Merkle Trees</h3>
<ul>
<li>
<p>Dynamo uses Vector clocks to remove write conflicts(Read Repair) while serving read requests if it receives stale responses from some of the replicas.</p>
</li>
<li>
<p>If a replica fell significantly behind others, it might take a very long time to resolve conflicts using read repair (vector clocks), since repair only happens when keys are read; keys that are never accessed could remain stale indefinitely.</p>
</li>
<li>
<p>We need a mechanism to automatically reconcile replicas in the background(and do conflict resolution if any).</p>
</li>
<li>
<p>To do this, we need to quickly <strong>compare two copies of a range of data residing on different replicas</strong> and figure out exactly which parts are <strong>different</strong>.</p>
</li>
<li>
<p>Naively checksumming the entire data range is not feasible; there is simply too much data to read and transfer between replicas for comparison.</p>
</li>
<li>
<p>Dynamo uses <strong>Merkle trees</strong> to compare replicas of a range.</p>
</li>
<li>
<p><strong>A Merkle tree</strong> is a <strong>binary tree of hashes</strong>, where each internal node is the hash of its two children, and each leaf node is a hash of a portion of the original data.
<img loading="lazy" src="/images/notes/dynamo-2007/image-8.png"></p>
</li>
<li>
<p>Now comparing the ranges of data on two replicas is equivalent to comparing two Merkle Trees</p>
</li>
<li>
<p>The principal advantage of using a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the whole data set.</p>
</li>
<li>
<p>Merkle trees minimize the amount of data that needs to be transferred for synchronization and reduce the number of disk reads performed during the anti-entropy process.</p>
</li>
<li>
<p>The disadvantage of using Merkle trees is that many key ranges can change when a node joins or leaves, and as a result, the trees need to be recalculated.</p>
</li>
</ul>
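<p>A minimal sketch of the Merkle comparison: if two replicas&rsquo; root hashes match, their key ranges are identical and no further data needs to be exchanged; if not, the mismatching subtree is descended. SHA-256 and the odd-level duplication rule are arbitrary illustrative choices.</p>

```python
import hashlib


def merkle_root(leaves):
    """Build a Merkle tree bottom-up and return the root hash.

    Each leaf hashes a portion of the key range; each internal node is
    the hash of its two children, so one 32-byte comparison at the root
    decides whether two replicas' ranges are in sync."""
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```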
<h3 id="gossip-protocol">Gossip Protocol</h3>
<h3 id="what-is-a-gossip-protocol">What is a Gossip Protocol?</h3>
<ul>
<li>How does <strong>Node Failure Detection</strong> happen in Dynamo?</li>
<li>Since we do not have any central node that keeps track of all nodes to know if a node is down or not, how does a node know every other node’s current state?</li>
<li><strong>Naive Approach:</strong> Each node broadcasts a heartbeat message to every other node.</li>
<li>Optimized Approach: <strong>Gossip Protocol</strong>
<img loading="lazy" src="/images/notes/dynamo-2007/image-9.png"></li>
</ul>
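<p>A minimal heartbeat-gossip sketch, with invented names (<code>Node</code>, <code>tick</code>, <code>suspects</code>) and direct method calls standing in for network messages. Each round, a node bumps its own heartbeat counter and exchanges views with one random peer; a node whose heartbeat stops advancing is eventually suspected as failed.</p>

```python
# Toy gossip-based membership/failure detection; not Dynamo's actual code.
import random

class Node:
    def __init__(self, name):
        self.name = name
        self.peers = []              # other Node objects, wired up after creation
        self.clock = 0               # local round counter
        self.view = {name: (0, 0)}   # name -> (heartbeat, local time last advanced)

    def tick(self):
        """One gossip round: bump own heartbeat, then exchange views
        with a single randomly chosen peer."""
        self.clock += 1
        hb, _ = self.view[self.name]
        self.view[self.name] = (hb + 1, self.clock)
        peer = random.choice(self.peers)
        peer.merge(self.view)        # push our view to the peer
        self.merge(peer.view)        # pull the peer's view back

    def merge(self, other_view):
        # Keep whichever heartbeat counter is higher for each known node.
        for name, (hb, _) in other_view.items():
            if name not in self.view or hb > self.view[name][0]:
                self.view[name] = (hb, self.clock)

    def suspects(self, timeout):
        """Nodes whose heartbeat has not advanced for `timeout` local rounds."""
        return [n for n, (_, seen) in self.view.items()
                if n != self.name and self.clock - seen > timeout]

random.seed(42)
a, b, c = Node("a"), Node("b"), Node("c")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
for _ in range(10):
    for n in (a, b, c):
        n.tick()
# Failure is modeled simply as c no longer ticking (in this toy the
# "failed" node still answers gossip; a real detector would also notice
# unresponsive peers directly).
for _ in range(25):
    for n in (a, b):
        n.tick()
print(a.suspects(timeout=10))   # c's heartbeat has gone stale
```

<p>Each exchange touches only one peer, so the per-round message cost stays constant per node while information still reaches everyone in O(log N) rounds with high probability.</p>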
<h3 id="external-discovery-through-seed-nodes">External Discovery Through Seed Nodes?</h3>
<ul>
<li>Dynamo nodes use gossip protocol to find the current state of the ring. This can result in a logical partition of the cluster in a particular scenario.</li>
<li>Suppose an administrator joins node A to the ring and then, separately, joins node B. A and B each consider themselves part of the ring, yet neither is immediately aware of the other, creating a logical partition. To prevent these logical partitions, Dynamo introduced the concept of seed nodes.</li>
<li>Seed nodes are fully functional nodes and can be obtained either from a static configuration or a configuration service. This way, all nodes are aware of seed nodes.</li>
<li>Each node communicates with seed nodes through gossip protocol to reconcile membership changes; therefore, logical partitions are highly unlikely.</li>
</ul>
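<p>The seed-node idea can be sketched as a join routine that merges the ring view from every reachable seed. The seed list, <code>join_ring</code>, and the <code>fetch_view</code> RPC stand-in are illustrative assumptions, not Dynamo’s actual interface.</p>

```python
# Hypothetical bootstrap-from-seeds sketch. SEEDS, join_ring, and
# fetch_view are invented names for illustration.
SEEDS = ["10.0.0.1", "10.0.0.2"]   # from static config (or a config service)

def join_ring(self_addr, fetch_view):
    """Build an initial ring view by merging the view of every reachable
    seed. Because all nodes reconcile with seeds, two freshly joined
    nodes cannot stay unaware of each other for long."""
    view = {self_addr: "alive"}
    for seed in SEEDS:
        try:
            for node, state in fetch_view(seed).items():
                view.setdefault(node, state)
        except ConnectionError:
            continue                # seed unreachable; try the next one
    return view

def fake_fetch(seed):
    # Stand-in for a gossip RPC: every seed already knows these members.
    return {"10.0.0.1": "alive", "10.0.0.2": "alive", "10.0.0.9": "alive"}

view = join_ring("10.0.0.7", fake_fetch)
print(sorted(view))  # the joiner plus everything the seeds knew
```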
<h3 id="characteristics-and-criticism-of-dynamo">Characteristics and Criticism of Dynamo</h3>
<p>Responsibilities of a Dynamo Node</p>
<ul>
<li>Serving get() and put() requests by acting as a coordinator (or request forwarder).</li>
<li>Keeping track of membership (hash ranges in the ring) and detecting failures (gossip).</li>
<li>Local persistent storage.</li>
</ul>
<h3 id="characteristics-of-dynamo">Characteristics of Dynamo</h3>
<ul>
<li>Distributed (can run across several machines).</li>
<li>Decentralized (no external coordinator; all nodes are identical).</li>
<li>Scalable (scales horizontally on commodity hardware with fault tolerance; no manual intervention or rebalancing required).</li>
<li>Highly available.</li>
<li>Fault tolerant and reliable.</li>
<li>Tunable consistency (trade-offs between availability and consistency by adjusting the (N, R, W) quorum values, e.g. (3, 2, 2), (3, 1, 3), or (3, 3, 1)).</li>
</ul>
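<p>The tunable-consistency triples above read naturally as (N, R, W) quorum settings. A small sketch of the overlap condition R + W &gt; N, which is what guarantees a read quorum sees the latest write (the function name is illustrative):</p>

```python
# Sketch of Dynamo-style (N, R, W) quorum tuning; the triples below
# mirror the configurations mentioned above.
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees every read quorum intersects the most recent
    write quorum, so at least one contacted replica has the newest value."""
    return r + w > n

for n, r, w in [(3, 2, 2), (3, 1, 3), (3, 3, 1), (3, 1, 1)]:
    print(f"N={n} R={r} W={w}: overlap={quorums_overlap(n, r, w)}")
# Dropping R + W to N or below (e.g. R=1, W=1) gives up the overlap
# guarantee in exchange for lower latency and higher availability.
```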
<h3 id="criticism-on-dynamo-design">Criticism on Dynamo Design?</h3>
<ul>
<li>Each Dynamo node holds the entire routing table, which could limit scalability: the table grows with every node added to the system.</li>
<li>Dynamo strives for symmetry (all nodes have the same set of responsibilities), yet it designates some nodes as seed nodes for external discovery to avoid logical partitions, which arguably violates the symmetry principle.</li>
<li>DHTs can be susceptible to several different types of attack. [Research more?]</li>
<li>Dynamo’s design can be described as a <strong>leaky abstraction</strong>: details such as conflict resolution leak through to the client application.</li>
</ul>
<h3 id="datastores-developed-on-principles-of-dynamo">DataStores developed on Principles of Dynamo</h3>
<ul>
<li><strong>Riak</strong> is a distributed NoSQL key-value data store that is highly available, scalable, fault-tolerant, and easy to operate.</li>
<li><strong>Cassandra</strong> is a distributed, decentralized, scalable, and highly available NoSQL wide-column database.
<img loading="lazy" src="/images/notes/dynamo-2007/image-10.png"></li>
</ul>
<h3 id="summary">Summary</h3>
<p><img loading="lazy" src="/images/notes/dynamo-2007/image-11.png"></p>
<p>References:</p>
<ul>
<li>
<p><a href="https://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">https://www.allthingsdistributed.com/2007/10/amazons_dynamo.html</a></p>
</li>
<li>
<p><a href="https://docs.riak.com/riak/kv/2.2.0/developing/data-types/">https://docs.riak.com/riak/kv/2.2.0/developing/data-types/</a></p>
</li>
<li>
<p><a href="https://research.google/pubs/bigtable-a-distributed-storage-system-for-structured-data/">https://research.google/pubs/bigtable-a-distributed-storage-system-for-structured-data/</a></p>
</li>
<li>
<p><a href="https://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html">https://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html</a></p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type">https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type</a></p>
</li>
<li>
<p><a href="https://www.allthingsdistributed.com/2017/10/a-decade-of-dynamo.html">https://www.allthingsdistributed.com/2017/10/a-decade-of-dynamo.html</a></p>
</li>
<li>
<p><a href="https://news.ycombinator.com/item?id=915212">https://news.ycombinator.com/item?id=915212</a></p>
</li>
</ul>
<h3 id="open-questions">Open Questions</h3>
<ul>
<li>
<p>Anti-Entropy and Merkle Trees</p>
</li>
<li>
<p>DHTs can be susceptible to several different types of attack? [Research more?]</p>
</li>
<li>
<p>Underlying storage engines for the Dynamo store: Berkeley DB, MySQL, or an in-memory buffer with a persistent backing store.</p>
</li>
<li>
<p>Why does it use MD5 hashing? Why not something else?</p>
</li>
<li>
<p>Logical partitioning and seed nodes?</p>
</li>
<li>
<p>New features and revisions to the design?</p>
</li>
</ul>
<hr>
<p><strong>Paper Link:</strong> <a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
