High Performance IO For Large Scale Deep Learning

Paper: High Performance I/O For Large Scale Deep Learning

Ideas explored (TL;DR):
- WebDataset: large sharded datasets instead of many small random reads.
- AIStore: an S3-compatible object store with caching, instead of distributed file systems like GFS/HDFS.

Background:
- Deep learning training needs petascale datasets, and existing distributed file systems are not suited to the access patterns of DL jobs.
- DL workloads perform repeated random access over training datasets, not high-throughput sequential I/O.
- DL datasets are therefore transformed from the original dataset into shard collections, changing the access pattern from random reads to sequential I/O.

Sections: DL model training steps; traditional big-data ML storage solutions; requirements for large-scale deep learning storage solutions.

AIStore: provides an infinitely scalable namespace over arbitrary numbers of disks (SSDs and HDDs). ...
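The core WebDataset idea, packing many small samples into a few tar shards so reads become sequential, can be sketched in plain Python. This is an illustrative toy (the function names are mine, not the WebDataset library's API):

```python
import io
import tarfile

def write_shards(samples, shard_size=3):
    """Pack (name, payload) samples into in-memory tar shards.

    Sharding turns many small random reads into a few large
    sequential reads, which is the access-pattern change the
    paper describes.
    """
    shards = []
    for start in range(0, len(samples), shard_size):
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for name, payload in samples[start:start + shard_size]:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        shards.append(buf.getvalue())
    return shards

def iter_shard(shard_bytes):
    """Stream samples back out of one shard in a single sequential pass."""
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            yield member.name, tar.extractfile(member).read()

samples = [(f"img{i}.jpg", bytes([i]) * 4) for i in range(7)]
shards = write_shards(samples, shard_size=3)
recovered = [s for shard in shards for s in iter_shard(shard)]
assert recovered == samples   # sequential reads reproduce every sample
```

A training loader would then iterate shards sequentially (and shuffle at the shard level), rather than issuing one random read per sample.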

March 15, 2026 · Hemant Sethi

Characterizing Deep-Learning IO Workloads in TensorFlow

Paper: Characterizing Deep-Learning I/O Workloads in TensorFlow (2018)
Paper Link: https://arxiv.org/abs/1810.03035

Ideas explored (TL;DR): three levers to fix TensorFlow I/O: threading, prefetching, and burst-buffered checkpointing.

Background:
- DL I/O != traditional HPC I/O.
- HPC: few large files, collective I/O, same input repeated (iterative solvers), frequent intermediate writes.
- DL: many small files, individual I/O, different batches each step, periodic checkpoints only.
- DL on HPC clusters is growing: we need to understand I/O behavior before optimizing.

TensorFlow data pipeline:
- Dataset API: supports POSIX, HDFS, GCS, S3.
- Producer-consumer behavior: the I/O pipeline (CPU) produces batches, the training pipeline (GPU) consumes them.
- Embarrassingly parallel: each file is used by one worker, no collective I/O needed.
- tf.data.map(num_parallel_calls=N): N threads doing individual file I/O + transforms.
- tf.data.interleave(): expands one entry into many downstream (e.g. TFRecord -> samples).
- dataset.prefetch(N): CPU runs ahead, buffers N batches in host memory, refills below a threshold.
- prefetch_to_device('/gpu:0'): skip the host <-> device copy, prefetch directly into GPU memory.

Checkpointing:
- tf.train.Saver() generates 3 files: metadata (graph), index (tensor descriptors), data (weights).
- No guaranteed flush to disk, no async checkpoint support (at the time).
- Each snapshot is ~600MB: bursty writes stall training.
- Burst buffer solution: save to fast NVMe -> background copy to slow storage -> training resumes immediately.

Benchmarks & setup:
- Blackdog (workstation): Xeon E5-2609v2, Quadro K4000, 72GB RAM, HDD + SSD + 480GB Optane NVMe, TF 1.10.
- Tegner (KTH HPC cluster): Xeon E5-2690v3, K80, 512GB RAM, Lustre FS, TF 1.10.
- Storage IOR baselines: HDD 163 MB/s, SSD 280 MB/s, Optane 1603 MB/s, Lustre 1968 MB/s.
- Micro-benchmark: 16,384 JPEG files, median 112KB (ImageNet subset).
- AlexNet mini-app: Caltech-101, median 12KB images.

Results:
- Threading: 1 -> 2 threads doubles bandwidth. HDD hits diminishing returns beyond 4 threads (2.3x max at 8 threads); Lustre scales much better (7.8x at 8 threads, with parallel reads across Lustre OSTs). Raw TF bandwidth is well below the IOR baseline, so there is overhead in the TF pipeline itself.
- Prefetching: completely overlaps CPU I/O with GPU compute, effectively zeroing I/O out of wall time. With prefetch, runtime is the same regardless of storage type or thread count; without it, reads are bursty and the GPU idles waiting for data.
- Checkpointing: ~15% of total execution time without optimization. Lustre is fastest, HDD slowest. A burst buffer (Optane) is 2.6x faster than a direct HDD checkpoint, and the background copy to HDD completes while training continues.

Key takeaways:
- The DL I/O bottleneck is reads, not writes: the opposite of traditional HPC.
- Threading helps but is limited; prefetching is the highest-leverage knob.
- A burst buffer is essential for checkpointing at scale: NVMe as a staging tier.
- A prefetch hierarchy is likely needed at scale: stage training data in the burst buffer too.

Paper Link: https://arxiv.org/pdf/1810.03035 ...
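The producer-consumer overlap that makes prefetching "the highest-leverage knob" can be simulated without TensorFlow: a loader thread fills a bounded queue (the prefetch buffer) while the consumer (standing in for the GPU) drains it. All names and timings below are illustrative:

```python
import queue
import threading
import time

def run_training(num_batches, load_time, compute_time, prefetch_depth):
    """Simulate a tf.data-style pipeline: a loader thread produces batches
    into a bounded queue while the consumer runs 'training steps'."""
    buf = queue.Queue(maxsize=prefetch_depth)

    def loader():
        for i in range(num_batches):
            time.sleep(load_time)        # stand-in for file I/O + decode
            buf.put(i)

    t = threading.Thread(target=loader)
    start = time.perf_counter()
    t.start()
    for _ in range(num_batches):
        buf.get()                        # blocks only if the buffer ran dry
        time.sleep(compute_time)         # stand-in for the training step
    t.join()
    return time.perf_counter() - start

# Without overlap, each step pays I/O + compute back to back.
serial = 8 * (0.02 + 0.02)
overlapped = run_training(8, load_time=0.02, compute_time=0.02,
                          prefetch_depth=4)
assert overlapped < serial               # I/O is hidden behind compute
```

With prefetching, wall time approaches max(I/O, compute) per step instead of their sum, which matches the paper's observation that runtime becomes independent of storage speed.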

March 14, 2026 · Hemant Sethi

Megastore

Paper: Megastore (Google, 2011)

NoSQL database with strong consistency (within one partition). Replicas are spread across data centers for fault tolerance. Built on top of BigTable (+ GFS); Paxos (distributed consensus) is used for strong consistency. We have seen multiple Paxos implementations before: Chubby, Single Sign-On (SSO).

Motivations behind building Megastore: single-leader Paxos.
- Pros:
  - Piggybacking: the prepare phase of write n+1 can be piggybacked on the commit of write n.
  - Local reads can be served from the master.
- Cons:
  - Follower replicas are just wasting resources.
  - Master failover takes a while, since we need to wait for the master lease timeout.
  - Servers that are not close to the master but are close to the end user still have to go through the master.

Megastore's proposed improvements:
- Writes can be proposed by ANY replica.
- Reads can be initiated by any replica.
- No more need to handle master failover.

Entity groups (partitions):
- Paxos maintains a distributed log across computers; Megastore uses it to build a database write-ahead log.
- If we used one log for the entire database, every write would compete for the next slot in the log.
- So the log is partitioned by entity group.

Entity group example: Megastore allows a lot of data denormalization because BigTable provides a flexible schema. ...
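The per-entity-group write-ahead log can be sketched in a few lines; `PartitionedLog` and its methods are hypothetical names for illustration, not the paper's API:

```python
from collections import defaultdict

class PartitionedLog:
    """Toy model of Megastore-style entity groups: each group has its own
    write-ahead log, so writes to different groups never compete for the
    same log position (only writes within one group are serialized)."""

    def __init__(self):
        self.logs = defaultdict(list)   # entity group -> ordered log entries

    def append(self, entity_group, op):
        log = self.logs[entity_group]
        log.append(op)
        return len(log) - 1             # position contended only within group

wal = PartitionedLog()
assert wal.append("user:alice", "add_photo") == 0
assert wal.append("user:bob", "add_photo") == 0    # no contention with alice
assert wal.append("user:alice", "add_album") == 1  # ordered within the group
```

This is why Megastore offers strong consistency within a partition but not across partitions: each entity group runs its own Paxos-replicated log.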

December 15, 2024 · Hemant Sethi

Spanner

Paper: Spanner: Google's Globally-Distributed Database

Abstract:
- Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database, supporting externally-consistent (linearizable) distributed transactions.
- The paper describes how Spanner is structured, its feature set, the rationale behind various design decisions, and a novel time API that exposes clock uncertainty.

Introduction:
- Spanner shards data across many sets of Paxos state machines in data centers spread across the world.
- Replication is used for global availability and geographic locality; clients automatically fail over between replicas.
- Spanner automatically reshards data across machines as the amount of data or the number of servers changes, and automatically migrates data across machines (even across datacenters) to balance load and in response to failures.
- Designed to scale up to millions of machines across hundreds of data centers and trillions of database rows.
- Applications can use Spanner for high availability, even in the face of wide-area natural disasters, by replicating their data within or even across continents.
- BigTable problems:
- Megastore supports a semi-relational data model and synchronous replication, despite its relatively poor write throughput.
- Spanner has evolved from a BigTable-like versioned key-value store into a temporal multi-version database.
- Globally distributed features: the TrueTime API and its implementation are the key enabler of the above properties.

Spanner implementation: a directory abstraction (the unit of data movement) is used to manage replication and locality. ...

December 14, 2024 · Hemant Sethi

DynamoDB

Paper: DynamoDB

Summary/Abstract:
- Amazon DynamoDB is a NoSQL cloud database service that provides consistent performance at any scale.
- Fundamental properties: consistent performance, availability, durability, and a fully managed serverless experience.
- In 2021, during the 66-hour Amazon Prime Day shopping event, DynamoDB served a peak of 89.2 million requests per second while maintaining high availability with single-digit-millisecond performance.
- The design and implementation of DynamoDB have evolved since its first launch in 2012. The system has successfully dealt with issues related to fairness, traffic imbalance across partitions, monitoring, and automated system operations without impacting availability or performance.

Introduction:
- The goal of DynamoDB's design is to complete all requests with low single-digit-millisecond latencies.
- DynamoDB uniquely integrates six fundamental system properties: it is a fully managed cloud service; it employs a multi-tenant architecture; it provides boundless scale for tables; it provides predictable performance; it is highly available; it supports flexible use cases.
- DynamoDB evolved as a distributed database service to meet the needs of its customers without losing its key aspect: providing a single-tenant experience to every customer on top of a multi-tenant architecture.
- The paper explains the challenges the system faced and how the service evolved to handle them, connecting the required changes to a common theme of durability, availability, scalability, and predictable performance.

History:
- The design of DynamoDB was motivated by Amazon's experiences with its predecessor, Dynamo.
- Dynamo was created in response to the need for a highly scalable, available, and durable key-value database for shopping cart data.
- Amazon learned that giving applications direct access to traditional enterprise database instances led to scaling bottlenecks such as connection management, interference between concurrent workloads, and operational problems with tasks such as schema upgrades.
- A service-oriented architecture was adopted to encapsulate an application's data behind service-level APIs, providing enough decoupling to handle tasks like reconfiguration without disrupting clients.
- DynamoDB took the principles from Dynamo (which was run as a self-hosted database and created operational burden for developers) and from SimpleDB, a fully managed elastic NoSQL database service whose data model couldn't scale to the large tables DynamoDB needed.
- Dynamo limitations:
- SimpleDB limitations:
- Amazon concluded that a better solution would combine the best parts of the original Dynamo design (incremental scalability and predictable high performance) with the best parts of SimpleDB (ease of administration of a cloud service, consistency, and a table-based data model richer than a pure key-value store).

Architecture: a DynamoDB table is a collection of items. ...
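One way to picture how items spread across partitions (an illustrative sketch, not DynamoDB's actual internals): hash the partition key into a keyspace that is split into contiguous ranges, one per partition.

```python
import hashlib
from bisect import bisect_right

NUM_PARTITIONS = 4
SPACE = 2 ** 32
# Upper boundary of each partition's contiguous hash range.
BOUNDARIES = [SPACE * (i + 1) // NUM_PARTITIONS for i in range(NUM_PARTITIONS)]

def key_hash(partition_key: str) -> int:
    """Map a partition key into [0, 2^32) via a stable hash."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "big")

def partition_for(partition_key: str) -> int:
    """Route a key to the partition whose range covers its hash."""
    return bisect_right(BOUNDARIES, key_hash(partition_key)) % NUM_PARTITIONS

p = partition_for("customer#42")
assert 0 <= p < NUM_PARTITIONS
assert partition_for("customer#42") == p   # routing is deterministic
```

Hot keys hashing into one range are exactly the "traffic imbalance across partitions" problem the paper discusses; repartitioning means changing these ranges and moving data.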

December 11, 2024 · Hemant Sethi

Raft

Paper: Raft

Raft paper -> https://raft.github.io/raft.pdf
Usenix -> https://web.stanford.edu/~ouster/cgi-bin/papers/raft-atc14.pdf
Website -> https://raft.github.io/

Designing for understandability:
- Raft 2016 -> Video, Slides
- Raft user study, 2013 -> Video, Slides

Motivation, replicated state machines: a service that is replicated on multiple machines.

Raft basics:
- Leader-based.
- Server states:
- Time divided into terms:
- Request-response protocol between servers (remote procedure calls, or RPCs). 2 request types:

Leader election:
- All servers start as followers.
- No heartbeat (AppendEntries)? Start an election.
- Election outcomes:
- Each server votes for at most one candidate in a given term.
- Election Safety: at most one server can be elected leader in a given term.
- Availability: randomized election timeouts reduce split votes.

Log replication:
- Handled by the leader. When a client request arrives:
- Log entries: index, term, command.
- Logs can become inconsistent after leader crashes.
- Raft maintains a high level of coherency between logs (Log Matching Property):
- The AppendEntries consistency check preserves the above properties.
- The leader forces other logs to match its own:

Safety:
- Must ensure that the leader for a new term always holds all of the log entries committed in previous terms (Leader Completeness Property).
- Step 1: restriction on elections. Don't vote for a candidate unless the candidate's log is at least as up-to-date as yours; compare the indexes and terms of the last log entries.
- Step 2: be very careful about when an entry is considered committed.

Persistent storage:
- Each server stores the following in persistent storage (e.g. disk or flash):
- These must be recovered from persistent storage after a crash.
- If a server loses its persistent storage, it cannot participate in the cluster anymore.

Implementing Raft, client interactions:
- Clients interact only with the leader. Initially, a client can send a request to any server.
- If the leader crashes while executing a client request, the client retries (with a new randomly-chosen server) until the request succeeds.
- This can result in multiple executions of a command: not consistent!
- Goal: linearizability. The system behaves as if each operation is executed exactly once, atomically, sometime between the sending of the request and the receipt of the response.
- Solution:

Other issues:
- Cluster membership
- Log compaction
- See paper for details.

Paxos vs Raft by John Kubiatowicz.
Paper Link: https://raft.github.io/raft.pdf ...
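The election restriction (Step 1 of the safety argument) is small enough to write out directly: a follower grants its vote only if the candidate's log is at least as up-to-date as its own, comparing last-entry terms first and log length second. A sketch with illustrative names:

```python
def candidate_log_ok(cand_last_term, cand_last_index, my_log):
    """Raft's voting restriction. my_log is a list of (term, command)
    entries; indexes here are 0-based for simplicity."""
    if not my_log:
        return True                     # empty log is never more up-to-date
    my_last_term = my_log[-1][0]
    my_last_index = len(my_log) - 1
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term   # higher last term wins
    return cand_last_index >= my_last_index    # same term: longer log wins

log = [(1, "x=1"), (1, "x=2"), (2, "x=3")]     # last entry: term 2, index 2
assert candidate_log_ok(3, 0, log)             # newer term beats longer log
assert not candidate_log_ok(1, 99, log)        # stale term loses despite length
assert candidate_log_ok(2, 2, log)             # identical tail: grant the vote
```

Combined with majority voting, this guarantees any elected leader's log contains every entry committed in earlier terms (Leader Completeness).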

December 10, 2024 · Hemant Sethi

BigTable

Paper: BigTable

BigTable / wide-column storage system.

Goal: design a distributed and scalable system that can store a huge amount of semi-structured data, indexed by a row key, where each row can have an unbounded number of columns.

What is BigTable:
- BigTable is a distributed and massively scalable wide-column store, designed to store huge sets of structured data.
- Provides storage for very big tables (often in the terabyte range).
- BigTable is a CP system, i.e., it has strongly consistent reads and writes.
- BigTable can be used as an input source or output destination for MapReduce.

Background:
- Developed at Google in 2005 and used in dozens of Google services.
- Google couldn't use external commercial databases because of the large scale of its services, and the costs would have been too high. So they built an in-house solution, custom-built for their use cases and traffic patterns.
- BigTable is a highly available (though what does that mean for consistency?) and high-performing database that powers multiple applications across Google, where each application has different needs in terms of the size of data to be stored and the latency with which results are expected.
- BigTable inspired various open-source databases like Cassandra (which borrows BigTable's data model), HBase (a distributed non-relational database), and Hypertable.

BigTable use cases:
- Google built BigTable to store large amounts of data and perform thousands of queries per second on that data. Examples of BigTable data are billions of URLs with many versions per page, petabytes of Google Earth data, and billions of users' search data.
- BigTable is suitable for large datasets greater than one TB where each row is less than 10MB.
- Since BigTable does not provide ACID properties or transaction support across rows or tables, OLTP applications should not use it.
- Data should be structured as key-value pairs or rows and columns. Unstructured data like images or movies should not be stored in BigTable.
- Google examples: BigTable can be used to store the following types of data:

BigTable data model. Agenda: rows, column families, columns, timestamps.

Details: BigTable can be characterized as a sparse, distributed, persistent, multidimensional, sorted map. ...
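The "sparse, multidimensional, sorted map" characterization maps cells by (row, column, timestamp). A toy in-memory version (class and method names are illustrative, not BigTable's API):

```python
class TinyBigtable:
    """Toy of BigTable's data model: (row, 'family:qualifier', timestamp)
    -> value, with reads returning the newest version and row keys kept
    sorted so prefix scans are cheap."""

    def __init__(self):
        self.cells = {}   # (row, column, timestamp) -> value

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def read(self, row, column):
        """Most recent version wins, mirroring BigTable's
        timestamp-descending cell ordering."""
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

    def scan(self, row_prefix):
        """Rows sorted by key make range/prefix scans natural."""
        return sorted({r for (r, _, _) in self.cells
                       if r.startswith(row_prefix)})

t = TinyBigtable()
t.put("com.cnn.www", "contents:", 1, "<html>v1")
t.put("com.cnn.www", "contents:", 2, "<html>v2")
t.put("com.cnn.www/tech", "anchor:cnnsi.com", 1, "CNN")
assert t.read("com.cnn.www", "contents:") == "<html>v2"   # latest timestamp
assert t.scan("com.cnn") == ["com.cnn.www", "com.cnn.www/tech"]
```

The reversed-domain row key ("com.cnn.www") is the paper's own example of exploiting sorted rows: pages of the same domain cluster together for scans.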

December 9, 2024 · Hemant Sethi

Google File System

Paper: Google File System

Google File System / distributed file system.

Goal: design a distributed file system to store huge files (terabyte and larger). The system should be scalable, reliable, and highly available. Developed by Google for its large data-intensive applications.

Background:
- GFS was built for batch processing on large data sets and is designed for system-to-system interaction, not user-to-system interaction.
- It was designed with the following goals in mind:

GFS use cases:
- Built for distributed data-intensive applications like Gmail or YouTube.
- Google's BigTable uses GFS to store log files and data files.

APIs:
- GFS doesn't provide a standard POSIX-like API; instead, user-level APIs are provided.
- Files are organized hierarchically in directories and identified by their path names.
- Supports the usual file system operations:
- Additional special operations:

High-level architecture. Agenda: chunks, chunk handle, cluster, chunk server, master, client.

A GFS cluster consists of a single master and multiple chunk servers, and is accessed by multiple clients.

Chunk:
- As files stored in GFS tend to be very large, GFS breaks files into multiple fixed-size chunks, each 64 megabytes in size.

Chunk handle:
- Each chunk is identified by an immutable and globally unique 64-bit ID called a chunk handle, allowing 2^64 unique chunks.
- Total addressable storage = 2^64 * 64MB, roughly 10^9 exabytes.
- Files are split into chunks, so the job of GFS is to provide a mapping from files to chunks, and then to support standard operations on files, mapping them down to operations on individual chunks.

Cluster:
- GFS is organized into a network of computers (nodes) called a cluster. A GFS cluster contains 3 types of entities:

Chunk server:
- Nodes that store chunks on local disks as Linux files.
- Reads or writes chunk data specified by a chunk handle and byte range.
- For reliability, each chunk is replicated to multiple chunk servers. By default, GFS stores three replicas, though different replication factors can be specified on a per-file basis.

Master:
- Coordinator of the GFS cluster, responsible for keeping track of filesystem metadata. Metadata stored at the master includes:
- The master also controls system-wide activities such as:
- Periodically communicates with each chunk server via HeartBeat messages to give it instructions and collect its state.
- For performance and fast random access, all metadata is stored in the master's main memory, i.e., the entire filesystem namespace as well as all the name-to-chunk mappings.
- For fault tolerance, and to handle a master crash, all metadata changes (every mutation to the file system) are written to disk in an operation log (similar to a journal), which is replicated to remote machines.
- The benefit of having a single, centralized master is its global view of the file system, which lets it make optimal management decisions, for example about chunk placement.

Client:
- The application/entity that makes read/write requests to GFS using the GFS client library.
- This library communicates with the master for all metadata-related operations, like creating or deleting files, looking up files, etc.
- To read or write data, the client (library) interacts directly with the chunk servers that hold the data.
- Neither the client nor the chunk server caches file data; chunk servers rely on the Linux buffer cache to keep frequently accessed data in memory.

Single master and large chunk size. Agenda: single master, chunk size.

Single master:
- Having a single master vastly simplifies the GFS design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge.
- GFS minimizes the master's involvement in reads and writes so that it does not become a bottleneck.

Chunk size:
- GFS chose 64 MB, much larger than typical filesystem block sizes (often around 4KB). This is one of the key design parameters.
- Advantages of a large chunk size:
- Lazy space allocation: each chunk replica is stored as a plain Linux file on a chunk server. GFS does not allocate the whole 64MB of disk space when creating a chunk; instead, as the client appends data, the chunk server lazily extends the chunk.
- One disadvantage of a large chunk size is the handling of small files.

Metadata: let's explore how GFS manages file system metadata. ...
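The file-to-chunk mapping described above starts with simple arithmetic on the client side: translate a byte offset into a chunk index, then ask the master for that chunk's handle and server locations. A sketch of the offset math:

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB, the GFS chunk size described above

def locate(offset):
    """Translate a byte offset within a file into
    (chunk index, offset within that chunk).

    This is the computation a GFS client performs before asking the
    master which chunk handle and chunk servers hold that chunk index.
    """
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A read starting at byte 200,000,000 lands in the file's third chunk
# (index 2), since 2 * 64 MB = 134,217,728 <= 200,000,000 < 3 * 64 MB.
idx, within = locate(200_000_000)
assert idx == 2
assert within == 200_000_000 - 2 * CHUNK_SIZE
```

Because the chunk index (not the offset) is what the client sends to the master, a large chunk size directly shrinks the number of master lookups per file, one of the advantages the paper cites.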

December 8, 2024 · Hemant Sethi

Hadoop Distributed File System

Paper: Hadoop Distributed File System

Goal: design a distributed system that can store huge files (terabyte and larger). The system should be scalable, reliable, and highly available.

What is the Hadoop Distributed File System:
- HDFS is a distributed file system built to store unstructured data. It is designed to store huge files reliably and stream them at high bandwidth to user applications.
- HDFS is a variant and a simplified version of the Google File System (GFS); many HDFS architectural decisions are inspired by the GFS design.
- HDFS is built around the idea that the most efficient data processing pattern is write-once, read-many-times.

Background:
- Apache Hadoop is a software framework that provides distributed file storage (HDFS) and distributed computing for analyzing and transforming very large data sets using the MapReduce programming model. HDFS is the default file storage system in Hadoop.
- It is designed to be a distributed, scalable, fault-tolerant file system that primarily caters to the needs of the MapReduce paradigm.
- Both HDFS and GFS were built to store very large files and scale to petabytes of storage. Both were built for batch processing on huge data sets and were designed for data-intensive applications, not for end users.
- Like GFS, HDFS is not POSIX-compliant and is not a mountable file system on its own. It is typically accessed via HDFS clients or by using application programming interface (API) calls from the Hadoop libraries.
- Given the HDFS design, the following applications are not a good fit for HDFS:

API:
- Provides user-level APIs (not standard POSIX-like APIs).
- Files are organized hierarchically in directories and identified by their pathnames.
- Supports the usual file system operations on files and directories: create, delete, rename, move, and symbolic links (unlike GFS), etc.
- All read and write operations are done in an append-only fashion.

High-level architecture (figure: HDFS architecture):
- Files are broken into 128 MB fixed-size blocks (configurable on a per-file basis). ...
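The block-splitting rule is mechanical: every block is full size except possibly the last, which holds the remainder. A sketch using the 128 MB default from the text:

```python
BLOCK_SIZE = 128 * 2**20  # HDFS default block size described above

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block index, length) pairs for a file: every block is
    block_size bytes except possibly the last."""
    blocks = []
    remaining = file_size
    i = 0
    while remaining > 0:
        blocks.append((i, min(block_size, remaining)))
        remaining -= block_size
        i += 1
    return blocks

# A 300 MB file needs three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 2**20)
assert len(blocks) == 3
assert blocks[-1] == (2, 44 * 2**20)
```

Each of these blocks is then replicated across DataNodes, which is why a large block size keeps the NameNode's per-file metadata small, the same trade-off GFS makes with 64 MB chunks.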

December 8, 2024 · Hemant Sethi

Chubby

Paper: Chubby

Chubby / distributed locking service.

Goal: design a highly available and consistent service that can store small objects and provide a locking mechanism on those objects.

What is Chubby?
- Chubby is a service that provides a distributed locking mechanism and also stores small files. Internally, it is implemented as a key/value store that provides a locking mechanism on each object stored in it.
- Extensively used inside Google to provide storage and coordination services for systems like GFS and BigTable. Apache ZooKeeper is the open-source alternative to Chubby.
- Chubby is a centralized service offering developer-friendly interfaces (to acquire/release locks and create/read/delete small files). It does all this with just a few extra lines of code added to an existing application, without much modification to the application logic.
- At a high level, Chubby provides a framework for distributed consensus.

Chubby use cases. Chubby was primarily developed to provide a reliable locking service; other use cases evolved:
- Leader election: any lock service can be seen as a consensus service, as it converts the problem of reaching consensus into handing out locks. A set of distributed applications compete to acquire a lock, and whoever gets the lock first gets the resource. Similarly, an application can have multiple replicas running and want one of them chosen as the leader; Chubby can be used for leader election among a set of replicas.
- Naming service (like DNS): it is hard to make fast updates to DNS because of its time-based caching, which means there is generally a delay before the latest DNS mapping takes effect.
- Storage (small objects that rarely change): Chubby provides a Unix-style interface to reliably store small files that do not change frequently (complementing the service offered by GFS). Applications can then use these files for any purpose, like DNS data, configs, etc.
- Distributed locking: Chubby provides a developer-friendly interface for coarse-grained distributed locks (as opposed to fine-grained locks) to synchronize distributed activities in a distributed environment. An application needs only a few lines of code, and Chubby takes care of all lock management, so devs can focus on business logic rather than solving distributed locking themselves. We can say that Chubby provides mechanisms like semaphores and mutexes for a distributed environment.

When not to use Chubby:
- Bulk storage is needed.
- The data update rate is high.
- Locks are acquired/released frequently.
- Usage is more like a publish/subscribe model.

Background:
- Chubby is neither really a research effort, nor does it claim to introduce any new algorithms. Rather, it describes a design and implementation done at Google to provide a way for its clients to synchronize their activities and agree (consensus) on basic information about their environment.

Chubby and Paxos:
- Chubby uses Paxos underneath to manage the state of the Chubby system at any point in time.
- Getting all nodes in a distributed system to agree on anything (e.g., election of a primary among peers) is a distributed consensus problem, and distributed consensus over asynchronous communication is solved by the Paxos protocol.

Chubby common terms:
- Chubby cell: a Chubby cluster. Most Chubby cells serve a single data center (DC), but some configurations place Chubby replicas across DCs. A Chubby cell has two main components, server and client, that communicate via remote procedure call (RPC).
- Chubby servers: a Chubby cell consists of a small set of servers (typically 5) known as replicas. Using Paxos, one of the servers is selected as master, and it handles all client requests; the cell fails over to another replica if the master fails. Each replica maintains a small database to store files/directories/locks. The master writes directly to its own local database, which is synced asynchronously to all the replicas (for reliability). For fault tolerance, replicas are placed on different racks.
- Chubby client library: client applications use a Chubby library to communicate with the replicas in the Chubby cell via RPC.

Chubby API:
- Chubby exports a Unix-like file system interface, similar to POSIX but simpler. It consists of a strict tree of files and directories with name components separated by slashes, e.g. /ls/chubby_cell/directory_name/…/file_name.
- A special name, /ls/local, resolves to the cell most local to the calling application or service. (What is the most local cell?)
- Chubby can be used for locking, for storing a small amount of data, or both, i.e., storing small files with locks.

API categories:
- General:
  - Open(): opens a given named file or directory and returns a handle.
  - Close(): closes an open handle.
  - Poison(): allows a client to cancel all Chubby calls made by other threads without fear of deallocating the memory being accessed by them.
  - Delete(): deletes the file or directory.
- File:
  - GetContentsAndStat(): atomically returns the whole file contents and the metadata associated with the file. Reading the whole file is designed to discourage the creation of large files, which is not the intended use of Chubby.
  - GetStat(): returns just the metadata.
  - ReadDir(): returns the contents of a directory, i.e., the names and metadata of all children.
  - SetContents(): writes the whole contents of a file (atomically).
  - SetACL(): writes new access control list information.
- Locking:
  - Acquire(): acquires a lock on a file.
  - TryAcquire(): tries to acquire a lock on a file; a non-blocking variant of Acquire.
  - Release(): releases a lock.
- Sequencer:
  - GetSequencer(): gets the sequencer of a lock. A sequencer is a string representation of a lock.
  - SetSequencer(): associates a sequencer with a handle.
  - CheckSequencer(): checks whether a sequencer is valid.

Chubby does not support operations like append, seek, moving files between directories, or making symbolic or hard links. Files can only be completely read or completely written/overwritten, which makes Chubby practical only for storing very small files. ...
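The lock-plus-sequencer pattern in the API list above guards against a client that loses its lock (say, after a long pause) still acting on a stale grant: each acquisition mints a new generation, and resource servers reject requests carrying an old sequencer. A toy model with illustrative names, not Chubby's actual implementation:

```python
import itertools

class TinyChubbyLock:
    """Toy lock whose sequencer is a generation-stamped string; a resource
    server calls check_sequencer() before honoring a request."""

    def __init__(self):
        self._gen = itertools.count(1)
        self.holder = None
        self.generation = 0

    def acquire(self, client):
        """Grant the lock if free; return the sequencer string, else None."""
        if self.holder is None:
            self.holder = client
            self.generation = next(self._gen)
            return f"lock#gen={self.generation}"
        return None

    def release(self, client):
        if self.holder == client:
            self.holder = None

    def check_sequencer(self, sequencer):
        """Valid only if it names the current generation of a held lock."""
        return (self.holder is not None
                and sequencer == f"lock#gen={self.generation}")

lock = TinyChubbyLock()
s1 = lock.acquire("client-a")
assert lock.check_sequencer(s1)        # holder's sequencer is honored
lock.release("client-a")
s2 = lock.acquire("client-b")
assert not lock.check_sequencer(s1)    # stale sequencer is now rejected
assert lock.check_sequencer(s2)
```

This is why coarse-grained locks suit Chubby: the sequencer check is cheap, while acquisitions (which go through Paxos in the real system) are not meant to happen at high rates.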

December 7, 2024 · Hemant Sethi