Paper: Characterizing Deep-Learning I/O Workloads in TensorFlow (2018)
Paper Link: https://arxiv.org/abs/1810.03035
Ideas Explored (TLDR)
- Three levers for TensorFlow I/O performance: read parallelism (threads), prefetching to overlap I/O with compute, and a burst buffer for checkpoints
Background
- DL I/O != Traditional HPC I/O
- HPC: Few large files, collective I/O, same input repeated (iterative solvers), frequent intermediate writes.
- DL: Many small files, individual I/O, different batches each step, periodic checkpoints only
- DL on HPC clusters is growing - need to understand I/O behavior before optimizing
TensorFlow Data Pipeline
- Dataset API - supports POSIX, HDFS, GCS, S3
- Producer-consumer behavior: I/O pipeline (CPU) produces batches, training pipeline (GPU) consumes
- Embarrassingly parallel - each file used by one worker, no collective I/O needed
- Dataset.map(fn, num_parallel_calls=N) - N threads doing individual file I/O + transforms
- Dataset.interleave() - expands one entry into many downstream (e.g. TFRecord -> samples)
- dataset.prefetch(N) - CPU runs ahead, buffers N batches in host memory, refills below threshold
- prefetch_to_device('/gpu:0') - skip host <-> device copy, prefetch directly into GPU memory.
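The pipeline knobs above can be sketched as follows. This is a minimal, synthetic example (a range dataset stands in for JPEG/TFRecord files, and the map function is a trivial transform), written in TF 2.x eager style rather than the TF 1.10 API the paper used:

```python
import tensorflow as tf

# Synthetic stand-in for a list of input files.
files = tf.data.Dataset.range(4)

# interleave: expand each "file" into several samples (e.g. TFRecord -> samples),
# reading cycle_length files concurrently.
samples = files.interleave(
    lambda f: tf.data.Dataset.from_tensors(f).repeat(3),
    cycle_length=2,
    num_parallel_calls=2,
)

# map: parallel per-sample I/O + transforms (trivial transform here).
samples = samples.map(lambda x: x * 2, num_parallel_calls=4)

# batch + prefetch: the CPU runs ahead and keeps 2 batches buffered
# in host memory while the consumer works on the current one.
pipeline = samples.batch(4).prefetch(2)

batches = [b.numpy().tolist() for b in pipeline]
print(len(batches))  # 3 batches of 4 samples each
```

Swapping `prefetch(2)` for `tf.data.experimental.prefetch_to_device('/gpu:0')` (applied via `dataset.apply(...)`) buffers directly in GPU memory instead of host memory.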
Checkpointing
- tf.train.Saver() generates 3 files: metadata (graph), index (tensor descriptors), data (weights)
- No guaranteed flush to disk, no async checkpoint support (at the time)
- Each snapshot ~600MB - bursty writes stall training
- Burst Buffer solution: save to fast NVMe -> background copy to slow storage -> training resumes immediately
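The burst-buffer idea reduces to: block only on the fast write, then drain in the background. A minimal sketch in plain Python (the function name, byte-string "state", and directory layout are all hypothetical, not the paper's implementation, which staged real TensorFlow checkpoint files on Optane NVMe):

```python
import shutil
import threading
from pathlib import Path

def checkpoint_with_burst_buffer(state: bytes, nvme_dir: Path,
                                 slow_dir: Path, step: int) -> threading.Thread:
    """Write a snapshot to the fast tier, then copy it to slow storage
    in the background so training can resume immediately."""
    nvme_dir.mkdir(parents=True, exist_ok=True)
    slow_dir.mkdir(parents=True, exist_ok=True)

    fast_path = nvme_dir / f"ckpt-{step}.data"
    fast_path.write_bytes(state)  # fast, blocking write to the burst buffer

    def drain():
        # Slow copy to HDD/Lustre runs concurrently with training.
        shutil.copy2(fast_path, slow_dir / fast_path.name)

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return t  # caller can join() before the next snapshot reuses the buffer
```

Training only pays the NVMe write latency; the caller should `join()` the drain thread (or otherwise confirm the copy finished) before overwriting the staged file at the next checkpoint.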
Benchmarks & Setup
- Blackdog (workstation): Xeon E5-2609v2, Quadro K4000, 72GB RAM, HDD + SSD + 480GB Optane NVMe, TF 1.10
- Tegner (KTH HPC cluster): Xeon E5-2690v3, K80, 512GB RAM, Lustre FS, TF 1.10
- Storage IOR baselines: HDD 163 MB/s, SSD 280 MB/s, Optane 1603 MB/s, Lustre 1968 MB/s
- Micro-benchmark: 16,384 JPEG files, median 112KB (ImageNet subset)
- AlexNet mini-app: Caltech-101, median 12KB images
Results
Threading
- 1→2 threads doubles bandwidth
- HDD hits diminishing returns beyond 4 threads → 2.3x max at 8 threads
- Lustre scales much better → 7.8x at 8 threads (parallel reads across OSTs, Lustre's object storage targets)
- Raw TF bandwidth well below IOR baseline → overhead in TF pipeline itself
Prefetching
- Completely overlaps CPU I/O w/ GPU compute → I/O cost effectively zeroed out from wall time
- With prefetch: runtime same regardless of storage type or thread count
- Without prefetch: bursty read pattern, GPU idles waiting for data
Checkpointing
- ~15% of total execution time w/o optimization
- Lustre fastest, HDD slowest
- Burst buffer (Optane) → 2.6x faster than direct HDD checkpoint
- Background copy to HDD completes while training continues
Key Takeaways
- DL I/O bottleneck is reads not writes - opposite of traditional HPC
- Threading helps but is limited; Prefetching is the highest-leverage knob.
- Burst buffer essential for checkpointing at scale - NVMe as staging tier
- Prefetch hierarchy likely needed at scale: stage training data in burst buffer too.
Last updated: March 15, 2026
Questions or discussion? Email me