Spanner

Paper: Spanner Spanner: Google’s Globally-Distributed Database Abstract Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database, which supports externally consistent (linearizable) distributed transactions. The paper describes how Spanner is structured, its feature set, the rationale behind various design decisions, and a novel time API that exposes clock uncertainty. Introduction Spanner shards data across many sets of Paxos state machines in datacenters spread across the world. Replication provides global availability and geographic locality; clients automatically fail over between replicas. Spanner automatically reshards data across machines as the amount of data or the number of servers changes, and it automatically migrates data across machines (even across datacenters) to balance load and in response to failures. It is designed to scale up to millions of machines across hundreds of datacenters and trillions of database rows. Applications can use Spanner for high availability, even in the face of wide-area natural disasters, by replicating their data within or even across continents. Bigtable is difficult to use for applications with complex, evolving schemas, or that need strong consistency across wide-area replication; Megastore supports a semi-relational data model and synchronous replication, despite its relatively poor write throughput. Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multi-version database. Globally distributed features: the TrueTime API and its implementation (the key enabler of the above properties). Spanner implementation: a directory abstraction (the unit of data movement) to manage replication and locality. ...
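The TrueTime idea mentioned above can be sketched in a few lines: instead of a single timestamp, the clock returns an interval guaranteed to contain true time, and a transaction waits out the uncertainty before acknowledging its commit. This is a minimal illustrative sketch, not Spanner's actual implementation; `EPSILON`, `tt_now`, and `commit_wait` are hypothetical names, and real TrueTime derives its bound from GPS and atomic-clock time masters.

```python
import time

# Hypothetical fixed uncertainty bound (seconds); real TrueTime computes
# this dynamically and it is typically under 7 ms.
EPSILON = 0.007

def tt_now():
    """Return an interval (earliest, latest) guaranteed to contain true time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait(commit_timestamp):
    """Block until commit_timestamp is definitely in the past, i.e. until
    TT.now().earliest exceeds it. This 'commit wait' is what makes commit
    timestamps externally consistent."""
    while tt_now()[0] <= commit_timestamp:
        time.sleep(EPSILON / 2)

# A transaction picks s = TT.now().latest as its commit timestamp,
# then waits out the uncertainty before acknowledging the commit.
s = tt_now()[1]
commit_wait(s)
assert tt_now()[0] > s
```

The cost of this scheme is that every read-write transaction pays roughly one uncertainty interval of extra latency, which is why Spanner invests so heavily in keeping the clock uncertainty small.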

December 14, 2024 · Hemant Sethi

DynamoDB

Paper: DynamoDB DynamoDB Summary/Abstract Amazon DynamoDB is a NoSQL cloud database service that provides consistent performance at any scale. Fundamental properties: consistent performance, availability, durability, and a fully managed serverless experience. In 2021, during the 66-hour Amazon Prime Day shopping event, DynamoDB handled a peak of 89.2 million requests per second while maintaining high availability and single-digit millisecond performance. The design and implementation of DynamoDB have evolved since its first launch in 2012. The system has successfully dealt with issues related to fairness, traffic imbalance across partitions, monitoring, and automated system operations without impacting availability or performance. Introduction The goal of DynamoDB's design is to complete all requests with low single-digit millisecond latencies. DynamoDB uniquely integrates the following six fundamental system properties: it is a fully managed cloud service; it employs a multi-tenant architecture; it supports boundless scale for tables; it provides predictable performance; it is highly available; and it supports flexible use cases. DynamoDB evolved as a distributed database service to meet the needs of its customers without losing its key aspect of providing a single-tenant experience to every customer on a multi-tenant architecture. The paper explains the challenges faced by the system and how the service evolved to handle them, connecting the required changes to a common theme of durability, availability, scalability, and predictable performance. History The design of DynamoDB was motivated by Amazon's experiences with its predecessor, Dynamo.
Dynamo was created in response to the need for a highly scalable, available, and durable key-value database for shopping-cart data. Amazon learned that giving applications direct access to traditional enterprise database instances led to scaling bottlenecks such as connection management, interference between concurrent workloads, and operational problems with tasks such as schema upgrades. A service-oriented architecture was adopted to encapsulate an application's data behind service-level APIs, allowing enough decoupling to handle tasks like reconfiguration without disrupting clients. DynamoDB took the principles of Dynamo (which was run as a self-hosted database and created operational burden for developers) and of SimpleDB, a fully managed elastic NoSQL database service whose data model could not scale to the large tables DynamoDB needed to support. Given Dynamo's and SimpleDB's respective limitations, Amazon concluded that a better solution would combine the best parts of the original Dynamo design (incremental scalability and predictable high performance) with the best parts of SimpleDB (ease of administration of a cloud service, consistency, and a table-based data model that is richer than a pure key-value store). Architecture A DynamoDB table is a collection of items. ...

December 11, 2024 · Hemant Sethi

BigTable

Paper: BigTable BigTable/Wide Column Storage System Goal Design a distributed and scalable system that can store a huge amount of semi-structured data. The data is indexed by a row key, where each row can have an unbounded number of columns. What is BigTable BigTable is a distributed and massively scalable wide-column store, designed to store huge sets of structured data. It provides storage for very big tables (often in the terabyte range). BigTable is a CP system, i.e., it has strongly consistent reads and writes. BigTable can be used as an input source or output destination for MapReduce. Background Developed at Google in 2005 and used in dozens of Google services. Google couldn't use external commercial databases because of the large scale of its services, and costs would have been too high, so it built an in-house solution, custom-built for its use cases and traffic patterns. BigTable is a high-performing database that powers multiple applications across Google (as a CP system, it favors consistency over availability during partitions), where each application has different needs in terms of the size of the data to be stored and the latency with which results are expected. BigTable inspired various open-source databases like Cassandra (which borrows BigTable's data model), HBase (a distributed non-relational database), and Hypertable. BigTable Use Cases Google built BigTable to store large amounts of data and perform thousands of queries per second on that data. Examples of BigTable data are billions of URLs with many versions per page, petabytes of Google Earth data, and billions of users' search data. BigTable is suitable for storing large datasets greater than one TB where each row is less than 10 MB. Since BigTable does not provide ACID properties or transaction support across rows or tables, OLTP applications should not use BigTable. Data should be structured as key-value pairs or rows and columns; non-structured data like images or movies should not be stored in BigTable.
Google examples: BigTable can be used to store the types of data listed above. BigTable Data Model Agenda: rows, column families, columns, timestamps. Details BigTable can be characterized as a sparse, distributed, persistent, multidimensional, sorted map. ...
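The "sparse, multidimensional, sorted map" characterization can be made concrete with a toy model: the map is keyed by (row key, column, timestamp), and missing cells simply don't exist. This is an illustrative sketch only; the `Table` class and its method names are hypothetical, not BigTable's real API, though the example data follows the paper's webtable example (reversed URLs as row keys, a versioned `contents:` family).

```python
class Table:
    """Toy model of BigTable's data model:
    (row_key, column_family:qualifier, timestamp) -> value, sparse."""

    def __init__(self):
        self.cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None            # sparse: absent cells cost nothing
        if timestamp is None:
            timestamp = max(versions)   # latest version wins by default
        return versions.get(timestamp)

t = Table()
# Row keys are reversed URLs so pages from the same domain sort together.
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 2, "<html>v2</html>")
t.put("com.cnn.www", "anchor:cnnsi.com", 1, "CNN")
assert t.get("com.cnn.www", "contents:") == "<html>v2</html>"
assert t.get("com.cnn.www", "contents:", 1) == "<html>v1</html>"
```

The timestamp dimension is what makes the map "multidimensional": the same cell can hold many versions, and garbage collection policies (keep last N, or last N days) prune old ones.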

December 9, 2024 · Hemant Sethi

Chubby

Paper: Chubby Chubby / Distributed Locking Service Goal Design a highly available and consistent service that can store small objects and provide a locking mechanism on those objects. What is Chubby? Chubby is a service that provides a distributed locking mechanism and also stores small files. Internally, it is implemented as a key/value store that also provides a locking mechanism on each object stored in it. It is used extensively inside Google to provide storage and coordination services for systems like GFS and BigTable. Apache ZooKeeper is the open-source alternative to Chubby. Chubby is a centralized service offering developer-friendly interfaces (to acquire/release locks and create/read/delete small files). It does all this with just a few extra lines of code added to an existing application, without much modification to application logic. At a high level, Chubby provides a framework for distributed consensus. Chubby Use Cases Chubby was primarily developed to provide a reliable locking service; other use cases evolved from it. Leader Election Any lock service can be seen as a consensus service, as it converts the problem of reaching consensus into handing out locks: a set of distributed applications compete to acquire a lock, and whoever gets the lock first gets the resource. Similarly, an application can have multiple replicas running and want one of them chosen as the leader; Chubby can be used for leader election among a set of replicas. Naming Service (like DNS) It is hard to make fast updates to DNS due to its time-based caching, which means there is generally a potential delay before the latest DNS mapping takes effect. Storage (small objects that rarely change) Chubby provides a Unix-style interface to reliably store small files that do not change frequently (complementing the service offered by GFS). Applications can then use these files for purposes like naming, configs, etc.
Distributed Locking Mechanism Chubby provides a developer-friendly interface for coarse-grained distributed locks (as opposed to fine-grained locks) to synchronize activities in a distributed environment. An application needs only a few lines of code, and Chubby takes care of all lock management, so developers can focus on business logic rather than solving distributed locking themselves. We can say that Chubby provides mechanisms like semaphores and mutexes for a distributed environment. When Not to Use Chubby? When bulk storage is needed, when the data update rate is high, when locks are acquired/released frequently, or when usage is more like a publish/subscribe model. Background Chubby is not really a research effort, nor does it claim to introduce any new algorithms. Rather, it describes a certain design and implementation done at Google in order to provide a way for its clients to synchronize their activities and agree (reach consensus) on basic information about their environment. Chubby and Paxos Chubby uses Paxos underneath to manage the state of the Chubby system at any point in time. Getting all nodes in a distributed system to agree on anything (e.g., election of a primary among peers) is a distributed consensus problem, and distributed consensus over asynchronous communication is solved by the Paxos protocol. Chubby Common Terms Chubby Cell A Chubby cell is a Chubby cluster. Most Chubby cells serve a single datacenter (DC), but some configurations place replicas across DCs. A Chubby cell has two main components, server and client, that communicate via remote procedure call (RPC). Chubby Servers A Chubby cell consists of a small set of servers (typically five) known as replicas. Using Paxos, one of the servers is elected as master, and it handles all client requests; if the master fails, another replica takes over. Each replica maintains a small database to store files, directories, and locks.
The master writes to its own local database, and writes are propagated to the replicas via the consensus protocol, being acknowledged once a majority have accepted them (reliability). For fault tolerance, replicas are placed on different racks. Chubby Client Library Client applications use a Chubby library to communicate with the replicas in the Chubby cell via RPC. Chubby API Chubby exports a Unix-like file-system interface, similar to POSIX but simpler. It consists of a strict tree of files and directories with name components separated by slashes. E.g., file format: /ls/chubby_cell/directory_name/…/file_name. A special name, /ls/local, resolves to the cell most local to the calling application or service. Chubby can be used for locking, for storing a small amount of data, or both, i.e., storing small files with locks. API Categories General Open(): opens a given named file or directory and returns a handle. Close(): closes an open handle. Poison(): allows a client to cancel all Chubby calls made by other threads without fear of deallocating the memory being accessed by them. Delete(): deletes the file or directory. File GetContentsAndStat(): returns (atomically) the whole file contents and the metadata associated with the file; reading the whole file is designed to discourage the creation of large files, which is not the intended use of Chubby. GetStat(): returns just the metadata. ReadDir(): returns the contents of a directory, that is, the names and metadata of all children. SetContents(): writes the whole contents of a file (atomically). SetACL(): writes new access-control-list information. Locking Acquire(): acquires a lock on a file. TryAcquire(): tries to acquire a lock on a file; a non-blocking variant of Acquire. Release(): releases a lock. Sequencer GetSequencer(): gets the sequencer of a lock; a sequencer is a string representation of a lock. SetSequencer(): associates a sequencer with a handle.
CheckSequencer(): checks whether a sequencer is valid. Chubby does not support operations like append, seek, moving files between directories, or making symbolic or hard links. Files can only be read or written/overwritten in their entirety, which makes Chubby practical only for storing very small files. ...
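The leader-election use case can be tied back to the API calls listed above. The sketch below uses an in-memory stub standing in for a Chubby cell so the flow can be exercised; the method names (Open, TryAcquire, SetContents, GetSequencer) mirror the summary, but the `ChubbyCell` and `Handle` classes themselves are hypothetical, not Chubby's real client library.

```python
class Handle:
    """Stand-in for a Chubby file handle over an in-memory 'cell'."""
    def __init__(self, cell, path):
        self.cell, self.path = cell, path

    def TryAcquire(self):
        # Non-blocking lock attempt: first caller on a free path wins.
        if self.cell.locks.get(self.path) is None:
            self.cell.locks[self.path] = self
            return True
        return False

    def Release(self):
        self.cell.locks.pop(self.path, None)

    def SetContents(self, data):
        self.cell.files[self.path] = data   # whole-file atomic write

    def GetSequencer(self):
        # Opaque string naming this particular lock grant.
        return f"{self.path}:{id(self)}"

class ChubbyCell:
    def __init__(self):
        self.files, self.locks = {}, {}
    def Open(self, path):
        return Handle(self, path)

def try_become_leader(cell, my_address):
    handle = cell.Open("/ls/cell/service/leader")
    if handle.TryAcquire():
        handle.SetContents(my_address)   # winner advertises its address
        return handle.GetSequencer()     # guards against deposed leaders
    return None

cell = ChubbyCell()
assert try_become_leader(cell, "replica-1:8080") is not None  # first wins
assert try_become_leader(cell, "replica-2:8080") is None      # second loses
```

The sequencer is the interesting part: the elected leader passes it along with its requests, and downstream servers can call CheckSequencer() to reject a leader that lost the lock but does not yet know it.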

December 7, 2024 · Hemant Sethi

Kafka

Paper: Kafka Kafka/Distributed Messaging System Goal Design a distributed messaging system that can reliably transfer a high throughput of messages between different entities. Background One common challenge in distributed systems is handling a continuous influx of data from multiple sources. E.g., imagine a log-aggregation service that receives hundreds of log entries per second from different sources. The function of this service is to store these logs on disk on a shared server and build an index on top of them so that they can be searched later. What are the challenges of such a service? Distributed messaging systems (or an asynchronous processing paradigm) can help. What is a messaging system? A system responsible for transferring data among disparate systems like apps, services, processes, and servers, without introducing additional coupling between producers and consumers, and providing an asynchronous way of communicating between sender and receiver. Two types of messaging systems: Queue A particular message can be consumed by one consumer only; once a message is consumed, it is removed from the queue. This limits the system, as the same message can't be read by multiple consumers. Publish-Subscribe Messaging System In the pub-sub model, messages are written to topics (which are split into partitions). ...
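The queue-versus-pub-sub distinction above can be sketched in a few lines: model a topic as an append-only log, with each consumer group tracking its own read offset. Every group then sees every message, unlike a queue, where a delivered message is gone for everyone else. The `Topic` class and group names here are illustrative, not Kafka's actual API.

```python
class Topic:
    """Toy pub-sub topic: an append-only log plus per-group read offsets."""

    def __init__(self):
        self.log = []        # messages are never removed on consumption
        self.offsets = {}    # consumer group -> next offset to read

    def publish(self, message):
        self.log.append(message)

    def consume(self, group):
        pos = self.offsets.get(group, 0)
        if pos >= len(self.log):
            return None      # nothing new for this group
        self.offsets[group] = pos + 1
        return self.log[pos]

t = Topic()
t.publish("login event")
t.publish("purchase event")
# Two independent groups (say, an indexer and a metrics pipeline)
# each read the full stream at their own pace.
assert t.consume("indexer") == "login event"
assert t.consume("metrics") == "login event"
assert t.consume("indexer") == "purchase event"
```

Keeping consumption state as a per-group offset (rather than deleting messages) is also what lets consumers rewind and replay the log after a failure.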

December 4, 2024 · Hemant Sethi

Cassandra

Paper: Cassandra Cassandra / Distributed Wide Column NoSQL Database Goal Design a distributed and scalable system that can store a huge amount of semi-structured data, indexed by a row key where each row can have an unbounded number of columns. Background An open-source Apache project developed at Facebook in 2007 for the Inbox Search feature. Designed to provide scalability, availability, and reliability for storing large amounts of data. It combines the distributed architecture of Amazon's Dynamo (a key-value store) with the data model of Google's BigTable (a column-based store). Its decentralized architecture has no single point of failure (SPOF), and performance can scale linearly with the addition of nodes. What is Cassandra? Cassandra is typically classified as an AP (i.e., available and partition-tolerant) system, which means that availability and partition tolerance are generally considered more important than consistency; it is eventually consistent. Similar to Dynamo, Cassandra can be tuned with the replication factor and consistency levels to meet strong-consistency requirements, but this comes with a performance cost. It uses a peer-to-peer architecture in which each node can communicate with any other node. Cassandra Use Cases Any application where eventual consistency is not a concern can use Cassandra. Cassandra is optimized for high-throughput writes. Use cases include collecting big data for real-time analysis; storing key-value data with high availability (Reddit, Digg), thanks to linear scaling without downtime; time-series data; and write-heavy applications. High Level Architecture Agenda Cassandra Common Terms High Level Architecture Cassandra Common Terms Column: a key-value pair; the most basic unit of data structure in Cassandra. ...
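The "tuned with replication factor and consistency levels" remark follows a simple quorum rule from Dynamo-style systems: with replication factor N, read quorum R, and write quorum W, a read is guaranteed to overlap the latest write whenever R + W > N. A minimal sketch of that rule (the function name is mine, not Cassandra's API):

```python
def is_strongly_consistent(n, r, w):
    """Quorum overlap rule for Dynamo-style replication:
    reads see the latest write iff the read and write sets must intersect."""
    return r + w > n

# QUORUM reads + QUORUM writes at replication factor 3 overlap...
assert is_strongly_consistent(n=3, r=2, w=2)
# ...while ONE/ONE maximizes availability but is only eventually consistent.
assert not is_strongly_consistent(n=3, r=1, w=1)
```

This is the performance cost the summary mentions: raising R or W to satisfy the inequality means each request must wait on more replicas.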

December 3, 2024 · Hemant Sethi

Dynamo

Paper: Dynamo Dynamo / Distributed Key Value Store Problem: Design a distributed key-value store (or distributed hash table) that is highly available (i.e., reliable), highly scalable, and completely decentralized. Features A highly available key-value store for workloads like shopping carts, bestseller lists, sales rank, and product catalogs, which need only primary-key access to data; a multi-table RDBMS would limit scalability and availability. Applications can choose their desired level of availability and consistency. Background Designed for high availability (at massive scale) and partition tolerance at the expense of strong consistency. The primary motivation for optimizing for high availability over consistency was to always be up to serve customer requests and provide a better customer experience. Dynamo's design inspired various NoSQL databases: Cassandra, Riak, Voldemort, and DynamoDB. Design Goals Highly available, reliable, highly scalable, decentralized, and eventually consistent (EC), a weaker consistency model than strong consistency (linearizability). What are the latency requirements? How is the data geographically distributed? Use Cases Dynamo can achieve strong consistency, but it comes with a performance impact; if strong consistency is a requirement, Dynamo is not the best option. It suits applications that need tight control over the trade-offs between availability, consistency, cost-effectiveness, and performance, and services that need only primary-key access to the data. System APIs get(key): returns the object (or a list of conflicting versions) along with a context. put(key, context, object): stores the object. Dynamo treats both the object and the key as an arbitrary array of bytes (typically less than 1 MB). It uses the MD5 hashing algorithm on the key to generate a 128-bit ID, which determines the storage nodes responsible for serving the key.
High Level Architecture Agenda Data Distribution (Partitioning), Data Replication and Consistency, Handling Temporary Failures (Fault Tolerance), Inter-Node Communication (Unreliable Network) and Failure Detection, High Availability, Conflict Resolution and Handling Permanent Failures. Data Partitioning Distributing data across a set of nodes is called data partitioning. ...
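The MD5-based key placement described above can be sketched as consistent hashing: the key's 128-bit hash picks a position on a ring, and the first N distinct nodes clockwise from that position form the preference list that serves the key. This is a simplified illustration under assumptions of my own (node names are made up, and real Dynamo places multiple virtual nodes per physical node, which this sketch omits).

```python
import hashlib
from bisect import bisect_right

class Ring:
    """Toy consistent-hash ring with one position per node."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Each node owns the ring position its own name hashes to.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        # MD5 of the key -> 128-bit integer position on the ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def preference_list(self, key):
        positions = [p for p, _ in self.ring]
        start = bisect_right(positions, self._hash(key)) % len(self.ring)
        # Walk clockwise, collecting N nodes responsible for the key.
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replicas)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.preference_list("cart:user-42")
assert len(owners) == 3 and len(set(owners)) == 3
```

Hashing node names onto the same ring as keys is what lets nodes join or leave while moving only the keys adjacent to them, which is the incremental-scalability property the design goals call for.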

December 2, 2024 · Hemant Sethi