High Performance IO For Large Scale Deep Learning

Paper: High Performance I/O For Large Scale Deep Learning

Ideas Explored (TLDR)
- WebDataset: large sharded datasets instead of many small random reads.
- AIStore: an S3-compatible object store with caching, instead of distributed file systems like GFS/HDFS.

Background
- Deep learning training needs petascale datasets.
- Existing distributed file systems are not suited to the access patterns of DL jobs.
- DL workloads: repeated random access over training datasets (not high-throughput sequential I/O).
- DL datasets are transformed into shard collections so the access pattern changes from random reads to sequential I/O.

DL Model Training Steps
Traditional Big Data ML Storage Solutions
Requirements for Large-Scale Deep Learning Storage Solutions

AIStore
- Provides an infinitely scalable namespace over arbitrary numbers of disks (SSDs & HDDs). ...
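The shard-collection idea above can be sketched with the standard library alone: pack many small samples into one large tar "shard" so reading them back is one sequential stream instead of thousands of random reads. This is a minimal illustration of the concept, not the WebDataset library itself (which adds decoding, shuffling, and sharding across workers); all names and paths here are hypothetical.

```python
import io
import tarfile

def write_shard(path, samples):
    """Pack (name, bytes) samples into a single tar shard (sequential write)."""
    with tarfile.open(path, "w") as tar:
        for name, payload in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

def read_shard(path):
    """Stream samples back in order -- one large sequential read."""
    with tarfile.open(path, "r") as tar:
        for member in tar:
            yield member.name, tar.extractfile(member).read()

if __name__ == "__main__":
    # Hypothetical tiny dataset: 10 fake "images".
    samples = [(f"{i:06d}.jpg", bytes([i % 256]) * 100) for i in range(10)]
    write_shard("/tmp/shard-000000.tar", samples)
    assert list(read_shard("/tmp/shard-000000.tar")) == samples
```

A loader then iterates shards sequentially and shuffles within a buffer, which is how the paper's pipelines keep disk access sequential while still randomizing training order.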

March 15, 2026 · Hemant Sethi

Characterizing Deep-Learning IO Workloads in TensorFlow

Paper: Characterizing Deep-Learning I/O Workloads in TensorFlow (2018)
Paper Link: https://arxiv.org/abs/1810.03035

Ideas Explored (TLDR)
- Three levers to fix TensorFlow I/O: threading, prefetching, and burst-buffer checkpointing.

Background
- DL I/O != traditional HPC I/O.
  - HPC: few large files, collective I/O, the same input repeated (iterative solvers), frequent intermediate writes.
  - DL: many small files, individual I/O, different batches each step, periodic checkpoints only.
- DL on HPC clusters is growing - need to understand I/O behavior before optimizing.

TensorFlow Data Pipeline
- Dataset API - supports POSIX, HDFS, GCS, S3.
- Producer-consumer behavior: the I/O pipeline (CPU) produces batches, the training pipeline (GPU) consumes them.
- Embarrassingly parallel - each file is used by one worker, no collective I/O needed.
- tf.data.map(num_parallel_calls=N) - N threads doing individual file I/O + transforms.
- tf.data.interleave() - expands one entry into many downstream (e.g. TFRecord -> samples).
- dataset.prefetch(N) - CPU runs ahead, buffers N batches in host memory, refills below a threshold.
- prefetch_to_device('/gpu:0') - skips the host <-> device copy, prefetching directly into GPU memory.
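The producer-consumer behavior behind dataset.prefetch(N) can be sketched without TensorFlow: a CPU thread runs ahead of the consumer and fills a bounded buffer of ready batches, so the (simulated) GPU rarely waits on I/O. This is a pure-stdlib sketch of the mechanism; the function names, timings, and batch contents are invented for illustration.

```python
import queue
import threading
import time

def prefetch(load_batch, num_batches, buffer_size):
    """Yield batches while a background thread keeps up to buffer_size ready."""
    buf = queue.Queue(maxsize=buffer_size)  # bounded buffer of prepared batches
    sentinel = object()

    def producer():
        for i in range(num_batches):
            buf.put(load_batch(i))          # blocks when the buffer is full
        buf.put(sentinel)                   # signal end of dataset

    threading.Thread(target=producer, daemon=True).start()
    while (batch := buf.get()) is not sentinel:
        yield batch

if __name__ == "__main__":
    def slow_load(i):
        time.sleep(0.01)                    # pretend disk read + decode
        return [i] * 4

    out = []
    for batch in prefetch(slow_load, num_batches=5, buffer_size=2):
        time.sleep(0.01)                    # pretend GPU compute; overlaps with loading
        out.append(batch)
    assert out == [[i] * 4 for i in range(5)]
```

Because loading and "compute" overlap, total wall time approaches max(I/O, compute) rather than their sum - the same effect the paper measures when prefetching hides the storage tier entirely.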
Checkpointing
- tf.train.Saver() generates 3 files: metadata (graph), index (tensor descriptors), data (weights).
- No guaranteed flush to disk, no async checkpoint support (at the time).
- Each snapshot is ~600MB - bursty writes stall training.
- Burst buffer solution: save to fast NVMe -> background copy to slow storage -> training resumes immediately.

Benchmarks & Setup
- Blackdog (workstation): Xeon E5-2609v2, Quadro K4000, 72GB RAM, HDD + SSD + 480GB Optane NVMe, TF 1.10.
- Tegner (KTH HPC cluster): Xeon E5-2690v3, K80, 512GB RAM, Lustre FS, TF 1.10.
- Storage IOR baselines: HDD 163 MB/s, SSD 280 MB/s, Optane 1603 MB/s, Lustre 1968 MB/s.
- Micro-benchmark: 16,384 JPEG files, median 112KB (ImageNet subset).
- AlexNet mini-app: Caltech-101, median 12KB images.

Results
- Threading
  - Going from 1 to 2 threads doubles bandwidth.
  - HDD hits diminishing returns beyond 4 threads -> 2.3x max at 8 threads.
  - Lustre scales much better -> 7.8x at 8 threads (parallel reads across OSTs).
  - Raw TF bandwidth is well below the IOR baseline -> overhead in the TF pipeline itself.
- Prefetching
  - Completely overlaps CPU I/O with GPU compute -> I/O cost effectively zeroed out of wall time.
  - With prefetch: runtime is the same regardless of storage type or thread count.
  - Without prefetch: bursty read pattern, GPU idles waiting for data.
- Checkpointing
  - ~15% of total execution time without optimization.
  - Lustre fastest, HDD slowest.
  - Burst buffer (Optane) -> 2.6x faster than a direct HDD checkpoint.
  - Background copy to HDD completes while training continues.

Key Takeaways
- The DL I/O bottleneck is reads, not writes - the opposite of traditional HPC.
- Threading helps but is limited; prefetching is the highest-leverage knob.
- A burst buffer is essential for checkpointing at scale - NVMe as a staging tier.
- A prefetch hierarchy is likely needed at scale: stage training data in the burst buffer too.

Paper Link: https://arxiv.org/pdf/1810.03035 ...
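The burst-buffer checkpoint scheme above can be sketched in a few lines: write the snapshot to fast local storage, hand control back to training immediately, and drain the file to slow durable storage on a background thread. A minimal stdlib sketch - the directories stand in for NVMe and HDD, and the checkpoint payload and function names are hypothetical.

```python
import shutil
import tempfile
import threading
from pathlib import Path

def checkpoint(state: bytes, fast_dir: Path, slow_dir: Path) -> threading.Thread:
    """Write state to the fast tier, then copy to the slow tier in the background."""
    fast_path = fast_dir / "ckpt.data"
    fast_path.write_bytes(state)            # bursty write lands on fast NVMe

    def drain():
        shutil.copy2(fast_path, slow_dir / "ckpt.data")  # slow, durable copy

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return t                                # training resumes now; join() only if needed

if __name__ == "__main__":
    fast = Path(tempfile.mkdtemp())         # stand-in for the NVMe burst buffer
    slow = Path(tempfile.mkdtemp())         # stand-in for HDD / parallel FS
    copier = checkpoint(b"\x00" * 1024, fast, slow)
    copier.join()                           # for the demo only; training wouldn't block
    assert (slow / "ckpt.data").read_bytes() == (fast / "ckpt.data").read_bytes()
```

Training only pays the NVMe write latency; the paper's 2.6x speedup over direct-to-HDD checkpoints comes from exactly this overlap of the slow copy with continued training.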

March 14, 2026 · Hemant Sethi