Paper: Characterizing Deep-Learning I/O Workloads in TensorFlow (2018)
Paper Link: https://arxiv.org/abs/1810.03035
Ideas Explored (TLDR)
- Three levers for TensorFlow I/O performance: read parallelism (threads), prefetching to overlap I/O with compute, and a burst buffer for checkpoints
Background
- DL I/O != Traditional HPC I/O
- HPC: Few large files, collective I/O, same input repeated (iterative solvers), frequent intermediate writes.
- DL: Many small files, individual I/O, different batches each step, periodic checkpoints only
- DL on HPC clusters is growing - need to understand I/O behavior before optimizing
TensorFlow Data Pipeline
- Dataset API - supports POSIX, HDFS, GCS, S3
- Producer-consumer behavior: I/O pipeline (CPU) produces batches, training pipeline (GPU) consumes
- Embarrassingly parallel - each file used by one worker, no collective I/O needed
- Dataset.map(fn, num_parallel_calls=N) - N threads doing individual file I/O + transforms
- Dataset.interleave() - expands one entry into many downstream (e.g. TFRecord -> samples)
- dataset.prefetch(N) - CPU runs ahead, buffers N batches in host memory, refills below threshold
- prefetch_to_device('/gpu:0') - skip host <-> device copy, prefetch directly into GPU memory.
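The pipeline knobs above can be sketched as follows. This is a minimal, synthetic example (a range dataset stands in for JPEG/TFRecord files, and the map function is a trivial transform), written in TF 2.x eager style rather than the TF 1.10 API the paper used:

```python
import tensorflow as tf

# Synthetic stand-in for a list of input files.
files = tf.data.Dataset.range(4)

# interleave: expand each "file" into several samples (e.g. TFRecord -> samples),
# reading cycle_length files concurrently.
samples = files.interleave(
    lambda f: tf.data.Dataset.from_tensors(f).repeat(3),
    cycle_length=2,
    num_parallel_calls=2,
)

# map: parallel per-sample I/O + transforms (trivial transform here).
samples = samples.map(lambda x: x * 2, num_parallel_calls=4)

# batch + prefetch: the CPU runs ahead and keeps 2 batches buffered
# in host memory while the consumer works on the current one.
pipeline = samples.batch(4).prefetch(2)

batches = [b.numpy().tolist() for b in pipeline]
print(len(batches))  # 3 batches of 4 samples each
```

Swapping `prefetch(2)` for `tf.data.experimental.prefetch_to_device('/gpu:0')` (applied via `dataset.apply(...)`) buffers directly in GPU memory instead of host memory.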
Checkpointing
- tf.train.Saver() generates 3 files: metadata (graph), index (tensor descriptors), data (weights)
- No guaranteed flush to disk, no async checkpoint support (at the time)
- Each snapshot ~600MB - bursty writes stall training
- Burst Buffer solution: save to fast NVMe -> background copy to slow storage -> training resumes immediately
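The burst-buffer idea reduces to: block only on the fast write, then drain in the background. A minimal sketch in plain Python (the function name, byte-string "state", and directory layout are all hypothetical, not the paper's implementation, which staged real TensorFlow checkpoint files on Optane NVMe):

```python
import shutil
import threading
from pathlib import Path

def checkpoint_with_burst_buffer(state: bytes, nvme_dir: Path,
                                 slow_dir: Path, step: int) -> threading.Thread:
    """Write a snapshot to the fast tier, then copy it to slow storage
    in the background so training can resume immediately."""
    nvme_dir.mkdir(parents=True, exist_ok=True)
    slow_dir.mkdir(parents=True, exist_ok=True)

    fast_path = nvme_dir / f"ckpt-{step}.data"
    fast_path.write_bytes(state)  # fast, blocking write to the burst buffer

    def drain():
        # Slow copy to HDD/Lustre runs concurrently with training.
        shutil.copy2(fast_path, slow_dir / fast_path.name)

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return t  # caller can join() before the next snapshot reuses the buffer
```

Training only pays the NVMe write latency; the caller should `join()` the drain thread (or otherwise confirm the copy finished) before overwriting the staged file at the next checkpoint.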
Benchmarks & Setup
- Blackdog (workstation): Xeon E5-2609v2, Quadro K4000, 72GB RAM, HDD + SSD + 480GB Optane NVMe, TF 1.10
- Tegner (KTH HPC cluster): Xeon E5-2690v3, K80, 512GB RAM, Lustre FS, TF 1.10
- Storage IOR baselines: HDD 163 MB/s, SSD 280 MB/s, Optane 1603 MB/s, Lustre 1968 MB/s
- Micro-benchmark: 16,384 JPEG files, median 112KB (ImageNet subset)
- AlexNet mini-app: Caltech-101, median 12KB images
Results
Threading
- 1→2 threads doubles bandwidth
- HDD hits diminishing returns beyond 4 threads → 2.3x max at 8 threads
- Lustre scales much better → 7.8x at 8 threads (parallel reads across OSTs, Lustre's object storage targets)
- Raw TF bandwidth well below IOR baseline → overhead in TF pipeline itself
Prefetching
- Completely overlaps CPU I/O w/ GPU compute → I/O cost effectively zeroed out from wall time
- With prefetch: runtime same regardless of storage type or thread count
- Without prefetch: bursty read pattern, GPU idles waiting for data
Checkpointing
- ~15% of total execution time w/o optimization
- Lustre fastest, HDD slowest
- Burst buffer (Optane) → 2.6x faster than direct HDD checkpoint
- Background copy to HDD completes while training continues
Key Takeaways
- DL I/O bottleneck is reads not writes - opposite of traditional HPC
- Threading helps but is limited; Prefetching is the highest-leverage knob.
- Burst buffer essential for checkpointing at scale - NVMe as staging tier
- Prefetch hierarchy likely needed at scale: stage training data in burst buffer too.
Last updated: March 15, 2026
Questions or discussion? Email me