High Performance I/O For Large Scale Deep Learning
Paper: High Performance I/O For Large Scale Deep Learning

## Ideas Explored (TL;DR)

- **WebDataset** — large sharded datasets instead of many small random reads.
- **AIStore** — an S3-compatible object store with caching, used in place of distributed file systems like GFS/HDFS.

## Background

- Deep learning training needs petascale datasets.
- Existing distributed file systems are not suited to the access patterns of DL jobs.
- DL workloads perform repeated random access over training datasets, not the high-throughput sequential I/O these systems were built for.
- DL datasets are therefore repackaged from the original dataset into collections of shards, changing the access pattern from random reads to sequential I/O.

## DL Model Training Steps

## Traditional Big Data ML Storage Solutions

## Requirements for Large Scale Deep Learning Storage Solutions

## AIStore

- Provides an infinitely scalable namespace over arbitrary numbers of disks (SSDs & HDDs).
- ...
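The sharding idea can be sketched with plain tar archives, which is the container format WebDataset uses. This is a simplified illustration, not the actual WebDataset API; the shard path and sample names are made up for the example:

```python
import io
import tarfile

def write_shard(path, samples):
    """Pack many small samples into one large shard (a tar archive),
    so that reading them back is sequential I/O instead of many
    small random reads."""
    with tarfile.open(path, "w") as tar:
        for name, payload in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

def read_shard(path):
    """Stream a shard sequentially; each tar member is one sample."""
    with tarfile.open(path, "r") as tar:
        for member in tar:
            yield member.name, tar.extractfile(member).read()

# Hypothetical shard path and records, purely for illustration.
samples = [(f"sample{i}.txt", f"record {i}".encode()) for i in range(3)]
write_shard("/tmp/shard-000000.tar", samples)
print([name for name, _ in read_shard("/tmp/shard-000000.tar")])
# → ['sample0.txt', 'sample1.txt', 'sample2.txt']
```

Randomness for training is recovered at a coarser granularity: shuffle the order of shards (and optionally a small in-memory buffer of samples), while each individual shard is still read front to back.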