Paper: Characterizing Deep-Learning I/O Workloads in TensorFlow (2018)
Paper Link: https://arxiv.org/abs/1810.03035
Ideas Explored (TL;DR)
- Three levers to fix TensorFlow I/O: threading, prefetching, and burst-buffered checkpointing.

Background
- DL I/O != traditional HPC I/O.
- HPC: few large files, collective I/O, same input repeated (iterative solvers), frequent intermediate writes.
- DL: many small files, individual I/O, different batches each step, periodic checkpoints only.
- DL on HPC clusters is growing - need to understand the I/O behavior before optimizing.

TensorFlow Data Pipeline
- Dataset API - supports POSIX, HDFS, GCS, S3.
- Producer-consumer behavior: the I/O pipeline (CPU) produces batches, the training pipeline (GPU) consumes them.
- Embarrassingly parallel - each file is used by one worker, so no collective I/O is needed.
- tf.data.map(num_parallel_calls=N) - N threads doing individual file I/O + transforms.
- tf.data.interleave() - expands one entry into many downstream (e.g. TFRecord -> samples).
- dataset.prefetch(N) - CPU runs ahead, buffers N batches in host memory, refills when the buffer drops below a threshold.
- prefetch_to_device('/gpu:0') - skips the host <-> device copy, prefetching directly into GPU memory.

Checkpointing
- tf.train.Saver() generates 3 files: metadata (graph), index (tensor descriptors), data (weights).
- No guaranteed flush to disk, no async checkpoint support (at the time).
- Each snapshot ~600 MB - bursty writes stall training.
- Burst-buffer solution: save to fast NVMe -> background copy to slow storage -> training resumes immediately.

Benchmarks & Setup
- Blackdog (workstation): Xeon E5-2609v2, Quadro K4000, 72 GB RAM, HDD + SSD + 480 GB Optane NVMe, TF 1.10.
- Tegner (KTH HPC cluster): Xeon E5-2690v3, K80, 512 GB RAM, Lustre FS, TF 1.10.
- Storage IOR baselines: HDD 163 MB/s, SSD 280 MB/s, Optane 1603 MB/s, Lustre 1968 MB/s.
- Micro-benchmark: 16,384 JPEG files, median 112 KB (ImageNet subset).
- AlexNet mini-app: Caltech-101, median 12 KB images.

Results
Threading
- 1 -> 2 threads doubles bandwidth.
- HDD hits diminishing returns beyond 4 threads -> 2.3x max at 8 threads.
- Lustre scales much better -> 7.8x at 8 threads (parallel reads across OSTs, Lustre's object storage targets).
- Raw TF bandwidth is well below the IOR baseline -> overhead in the TF pipeline itself.

Prefetching
- Completely overlaps CPU I/O with GPU compute -> I/O cost effectively disappears from wall time.
- With prefetch: runtime is the same regardless of storage type or thread count.
- Without prefetch: bursty read pattern, GPU idles waiting for data.

Checkpointing
- ~15% of total execution time without optimization.
- Lustre fastest, HDD slowest.
- Burst buffer (Optane) -> 2.6x faster than checkpointing directly to HDD.
- Background copy to HDD completes while training continues.

Key Takeaways
- The DL I/O bottleneck is reads, not writes - the opposite of traditional HPC.
- Threading helps but is limited; prefetching is the highest-leverage knob.
- A burst buffer is essential for checkpointing at scale - NVMe as a staging tier.
- A prefetch hierarchy is likely needed at scale: stage the training data in the burst buffer too.
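The producer-consumer behavior behind dataset.prefetch(N) can be sketched in plain Python (no TensorFlow dependency): a background thread stands in for the CPU I/O pipeline and a bounded queue for the host-memory buffer. This is a minimal analogue, not the paper's or TF's code; `load_batch` is a hypothetical stand-in for JPEG read + decode + transform.

```python
import queue
import threading

def prefetching_loader(load_batch, num_batches, buffer_size=2):
    """Plain-Python analogue of dataset.prefetch(buffer_size): a background
    producer thread (the CPU I/O pipeline) loads batches ahead of the
    consumer (the GPU training step), buffering up to buffer_size batches."""
    buf = queue.Queue(maxsize=buffer_size)
    SENTINEL = object()

    def producer():
        for i in range(num_batches):
            buf.put(load_batch(i))  # blocks once the buffer is full
        buf.put(SENTINEL)           # signal end of the dataset

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buf.get()
        if batch is SENTINEL:
            return
        yield batch                 # the "training step" consumes here

# Usage: a toy load_batch that just fabricates a batch of 4 samples.
batches = list(prefetching_loader(lambda i: [i] * 4, num_batches=3))
```

While the consumer is busy with one batch, the producer refills the buffer in the background, which is exactly the overlap that made runtime independent of storage type in the paper's results.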
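The threading knob (tf.data.map(num_parallel_calls=N)) amounts to a pool of reader threads each doing individual file I/O with no coordination. A stdlib-only sketch of that pattern, using throwaway temp files in place of the paper's JPEG set (file names and sizes here are made up for the demo):

```python
import concurrent.futures
import os
import tempfile

def parallel_read(paths, num_threads):
    """Read many small files with a pool of reader threads, mirroring
    tf.data.map(num_parallel_calls=N): each file is handled by exactly
    one thread (individual I/O, no collective coordination)."""
    def read_one(path):
        with open(path, "rb") as f:
            return len(f.read())
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as pool:
        return sum(pool.map(read_one, paths))

# Demo on stand-in "JPEG" files (the paper used 16,384 files, median 112 KB).
tmp = tempfile.mkdtemp()
paths = []
for i in range(8):
    p = os.path.join(tmp, f"img_{i}.jpg")
    with open(p, "wb") as f:
        f.write(b"x" * 1024)
    paths.append(p)

total_bytes = parallel_read(paths, num_threads=4)
```

How far adding threads helps depends on the backing store, which is the paper's point: an HDD saturates around 4 threads, while Lustre keeps scaling because reads land on different OSTs.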
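The burst-buffer checkpoint pattern (write fast, drain slow in the background) can also be sketched without TensorFlow. This is an illustrative stand-in for what the paper did with Optane NVMe, not tf.train.Saver itself; the function name and single-file checkpoint format are invented for the demo.

```python
import os
import shutil
import tempfile
import threading

def checkpoint_via_burst_buffer(state, fast_dir, slow_dir, name="model.ckpt"):
    """Burst-buffer checkpointing sketch: write the snapshot to fast NVMe
    (fast_dir) so training can resume immediately, then drain it to slow
    storage (slow_dir) in a background thread."""
    fast_path = os.path.join(fast_dir, name)
    with open(fast_path, "wb") as f:
        f.write(state)           # the short, blocking part of the checkpoint
        f.flush()
        os.fsync(f.fileno())     # flush explicitly (TF 1.x gave no guarantee)
    drain = threading.Thread(
        target=shutil.copy, args=(fast_path, os.path.join(slow_dir, name)))
    drain.start()                # slow copy overlaps with continued training
    return drain

# Usage with temp dirs standing in for NVMe and HDD tiers.
fast, slow = tempfile.mkdtemp(), tempfile.mkdtemp()
drain = checkpoint_via_burst_buffer(b"\x00" * 1024, fast, slow)
drain.join()  # for the demo only; real training would keep running
```

Training only pays for the NVMe write, which is where the paper's 2.6x speedup over checkpointing straight to HDD comes from.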
...