Technical Notes

Kafka

Paper: Kafka Kafka/Distributed Messaging System Goal Design a distributed messaging system that can reliably transfer a high throughput of messages between different entities. Background One common challenge in distributed systems is handling continuous influx of data from multiple sources. E.g. Imagine a log aggregation service that can receive hundreds of log entries per second from different sources. Function of this log aggregation service is to store these logs on a disk at a shared server and build an index on top of these logs so that they can be searched later. Challenges of this service? Distributed Messaging Systems(or Asynchronous processing paradigm) can help. What is a messaging System? System responsible for transferring data amongst various disparate systems like apps, services, processes, servers etc, w/o introducing additional coupling b/w producers and consumers, and by providing asynchronous way of communicating b/w sender and receiver. Two types of Messaging Systems Queue A Particular message can be consumed by one consumer only. Once a message is consumed, it’s removed from the queue. Limits the system as the same messages can’t be read by the multiple consumers. Publish-Subscribe Messaging System In the Pub-Sub model, messages are written into Partitions/Topics. ...

Cassandra

Paper: Cassandra Cassandra / Distributed Wide Column NoSQL Database Goal Design a distributed and scalable system that can store a huge amount of semi-structured data, which is indexed by a row key where each row can have an unbounded number of columns. Background Open source Apache Project developed at FB in 2007 for Inbox Search feature. Designed to provide Scalability, Availability, Reliability to store large amounts of data. Combines distributed nature of Amazon’s Dynamo(K-V store) and DataModel for Google’s BigTable which is a Column based store. Decentralized architecture with no Single Point of Failure(SPOF), Performance can scale linearly with addition of nodes. What is Cassandra? Cassandra is typically classified as an AP (i.e., Available and Partition Tolerant) system which means that availability and partition tolerance are generally considered more important than the consistency. Eventually Consistent Similar to Dynamo, Cassandra can be tuned with replication-factor and consistency levels to meet strong consistency requirements, but this comes with a performance cost. Uses peer-to-peer architecture where each node communicates to all other nodes. Cassandra Use Cases Any application where eventual consistency is not a concern can utilize Cassandra. Cassandra is optimized for high throughput writes. Can be used for collecting big data for performing real-time analysis. Storing key-value data with high availability(Reddit/Dig) because of linear scaling w/o downtime. Time Series Data Model Write Heavy Applications NoSQL High Level Architecture Agenda Cassandra Common Terms High Level Architecture Cassandra Common Terms Column: A Key-Value pair. Most basic unit of data structure in Cassandra. ...

Dynamo

Paper: Dynamo Dynamo / Distributed Key Value Store Problem: Design a distributed key-value store(or Distributed Hash Table) that is highly available (i.e., reliable), highly scalable, and completely decentralized. Features Highly available Key-Value Store. Shopping Cart, Bestseller Lists, Sales Rank, Product Catalog, etc which needs only primary-key access to data. Multi-table RDBMS would limit scalability and availability. Can choose desired Level of Availability and Consistency. Background? Designed for **high availability(**at a massive scale) and partition tolerance at the expense of strong consistency. Primary Motivation for being optimized for High Availability(Over consistency) was to be always up for serving customer requests to provide better customer experience. Dynamo design inspired various NoSQL Databases, Cassandra, Riak, VoldemortDB, DynamoDB. Design Goals? Highly Available Reliability Highly Scalable Decentralized Eventually Consistent(EC) - Weaker Consistency model than Strong Consistency(Linearizability) (Notes: ) Latency Requirements? (Notes: ) Geographical Distribution of Data? Use cases Dynamo can achieve strong consistency, but it comes with a performance impact. If Strong Consistency is a requirement, Dynamo is not the best option. Applications that need tight control over the trade-offs between availability, consistency, cost-effectiveness, and performance. Services that need only Primary Key access to the data. System APIs: get(key) : T… Object, Context put(key, context, object) Dynamo treats both the object and the key as an arbitrary array of bytes (typically less than 1 MB). Uses MD5 Hashing algorithm on the key to generate 128-bit HashID, which is used to determine the storage nodes that are responsible for serving the key. High Level Architecture Agenda Data Distribution(Partitioning) Data Replication and Consistency Handing Temporary Failures(Fault Tolerance) Inter-Node communication(Unreliable Network) and Failure Detection High Availability Conflict resolution and handling permanent failures. Data Partitioning Distributing data across a set of nodes is called data partitioning. ...