Characterizing Deep-Learning I/O Workloads in TensorFlow (2018)

Paper Link: https://arxiv.org/abs/1810.03035

Ideas Explored (TLDR)

  • Three levers to fix TensorFlow I/O: parallel reader threads, prefetching to overlap I/O with compute, and burst-buffer staging for checkpoints

Background

  • DL I/O != Traditional HPC I/O
  • HPC: Few large files, collective I/O, same input repeated (iterative solvers), frequent intermediate writes.
  • DL: Many small files, individual I/O, different batches each step, periodic checkpoints only
  • DL on HPC clusters is growing - need to understand I/O behavior before optimizing

TensorFlow Data Pipeline

  • Dataset API - supports POSIX, HDFS, GCS, S3
  • Producer-consumer behavior: I/O pipeline (CPU) produces batches, training pipeline (GPU) consumes
  • Embarrassingly parallel - each file used by one worker, no collective I/O needed
  • dataset.map(fn, num_parallel_calls=N) - N threads doing individual file I/O + transforms
  • dataset.interleave() - expands one entry into many downstream (e.g. TFRecord -> samples)
  • dataset.prefetch(N) - CPU runs ahead, buffers N batches in host memory, refills below threshold
  • prefetch_to_device('/gpu:0') - skips the host <-> device copy, prefetching directly into GPU memory
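The semantics of the three stages above can be sketched in pure Python (this is not TensorFlow code; the generator/thread-pool implementations and the toy "records expand into samples" pipeline are illustrative assumptions):

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def parallel_map(items, fn, num_parallel_calls):
    """Like dataset.map: apply fn to each element with a thread pool."""
    with ThreadPoolExecutor(max_workers=num_parallel_calls) as pool:
        yield from pool.map(fn, items)

def interleave(items, expand_fn):
    """Like dataset.interleave: expand one entry into many downstream."""
    for item in items:
        yield from expand_fn(item)

def prefetch(items, buffer_size):
    """Like dataset.prefetch: a background thread runs ahead and keeps
    up to buffer_size elements buffered in a bounded queue."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking end of the stream
    def producer():
        for item in items:
            q.put(item)
        q.put(done)
    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not done:
        yield item

# Hypothetical pipeline: each "record file" expands into 2 samples,
# samples are "decoded" (here: squared) by 4 threads, 8 buffered ahead.
records = range(5)
samples = interleave(records, lambda r: (r * 2, r * 2 + 1))
decoded = parallel_map(samples, lambda s: s * s, num_parallel_calls=4)
print(list(prefetch(decoded, buffer_size=8)))  # squares of 0..9
```

The bounded queue is what gives prefetch its producer-consumer behavior: the producer blocks once the buffer is full and refills it as the consumer drains batches.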

Checkpointing

  • tf.train.Saver() generates 3 files: metadata (graph), index (tensor descriptors), data (weights)
  • No guaranteed flush to disk, no async checkpoint support (at the time)
  • Each snapshot ~600MB - bursty writes stall training
  • Burst Buffer solution: save to fast NVMe -> background copy to slow storage -> training resumes immediately
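The burst-buffer idea can be sketched like this (a minimal sketch, not the paper's implementation; the function name, byte-string "state", and temp directories standing in for NVMe and HDD are all illustrative assumptions):

```python
import shutil
import tempfile
import threading
from pathlib import Path

def checkpoint_with_burst_buffer(state, fast_dir, slow_dir, step):
    """Write the snapshot to the fast tier, then drain it to slow
    storage in a background thread so training resumes immediately."""
    fast_ckpt = fast_dir / f"model-{step}.ckpt"
    fast_ckpt.write_bytes(state)  # blocking write, but on fast NVMe
    def drain():
        shutil.copy2(fast_ckpt, slow_dir / fast_ckpt.name)
    t = threading.Thread(target=drain, daemon=True)
    t.start()
    return t  # training continues; join only at shutdown

# Demo: temp directories stand in for the NVMe and HDD tiers.
fast, slow = Path(tempfile.mkdtemp()), Path(tempfile.mkdtemp())
copier = checkpoint_with_burst_buffer(b"weights" * 1000, fast, slow, step=100)
# ... training would continue here while the copy drains ...
copier.join()
print((slow / "model-100.ckpt").read_bytes() == b"weights" * 1000)  # True
```

The training loop only ever pays the NVMe write latency; the slow-tier copy happens off the critical path.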

Benchmarks & Setup

  • Blackdog (workstation): Xeon E5-2609v2, Quadro K4000, 72GB RAM, HDD + SSD + 480GB Optane NVMe, TF 1.10
  • Tegner (KTH HPC cluster): Xeon E5-2690v3, K80, 512GB RAM, Lustre FS, TF 1.10
  • Storage IOR baselines: HDD 163 MB/s, SSD 280 MB/s, Optane 1603 MB/s, Lustre 1968 MB/s
  • Micro-benchmark: 16,384 JPEG files, median 112KB (ImageNet subset)
  • AlexNet mini-app: Caltech-101, median 12KB images

Results

Threading

  • 1→2 threads doubles bandwidth
  • HDD hits diminishing returns beyond 4 threads → 2.3x max at 8 threads
  • Lustre scales much better → 7.8x at 8 threads (parallel reads across OSTs)
  • Raw TF bandwidth well below IOR baseline → overhead in TF pipeline itself
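A many-small-files read benchmark in the spirit of the paper's micro-benchmark can be sketched as follows (pure-Python stand-in, not the paper's harness; the file count, 4 KB size, and naming scheme are scaled-down assumptions, vs. the paper's 16,384 JPEGs at a 112 KB median):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter

# Create a small stand-in dataset: 64 files of 4 KB each.
root = tempfile.mkdtemp()
paths = []
for i in range(64):
    p = os.path.join(root, f"img_{i:05d}.bin")
    with open(p, "wb") as f:
        f.write(os.urandom(4096))
    paths.append(p)

def read_file(path):
    """Individual (non-collective) read of one small file."""
    with open(path, "rb") as f:
        return len(f.read())

def measure(num_threads):
    """Read every file with num_threads readers; return (bytes, seconds)."""
    start = perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        total = sum(pool.map(read_file, paths))
    return total, perf_counter() - start

for n in (1, 2, 4, 8):
    total, elapsed = measure(n)
    print(f"{n} threads: {total / 1e6:.2f} MB in {elapsed * 1e3:.1f} ms")
```

On a local page cache the scaling will be flatter than on HDD or Lustre; the point is the shape of the experiment, not the absolute numbers.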

Prefetching

  • Completely overlaps CPU I/O w/ GPU compute → I/O cost effectively zeroed out from wall time
  • With prefetch: runtime same regardless of storage type or thread count
  • Without prefetch: bursty read pattern, GPU idles waiting for data
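The overlap effect can be demonstrated with a toy timing experiment (sleep calls stand in for I/O and GPU compute; the durations and buffer size are arbitrary assumptions):

```python
import queue
import threading
from time import perf_counter, sleep

READ_S, COMPUTE_S, STEPS = 0.02, 0.02, 10

def read_batch(i):
    sleep(READ_S)      # stand-in for file read + decode on the CPU
    return i

def train_step(batch):
    sleep(COMPUTE_S)   # stand-in for the GPU compute

def serial_epoch():
    """No prefetch: each step waits for its read, then computes."""
    for i in range(STEPS):
        train_step(read_batch(i))

def prefetched_epoch(buffer_size=4):
    """Prefetch: a producer thread reads ahead into a bounded queue."""
    q = queue.Queue(maxsize=buffer_size)
    def producer():
        for i in range(STEPS):
            q.put(read_batch(i))
    threading.Thread(target=producer, daemon=True).start()
    for _ in range(STEPS):
        train_step(q.get())

t0 = perf_counter(); serial_epoch(); serial = perf_counter() - t0
t0 = perf_counter(); prefetched_epoch(); overlapped = perf_counter() - t0
print(f"serial {serial:.2f}s vs prefetched {overlapped:.2f}s")
```

Serial time is roughly STEPS * (READ_S + COMPUTE_S); with prefetching the reads hide behind compute and wall time approaches STEPS * COMPUTE_S, which is why runtime becomes insensitive to storage type.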

Checkpointing

  • ~15% of total execution time w/o optimization
  • Lustre fastest, HDD slowest
  • Burst buffer (Optane) → 2.6x faster than direct HDD checkpoint
  • Background copy to HDD completes while training continues

Key Takeaways

  • DL I/O bottleneck is reads not writes - opposite of traditional HPC
  • Threading helps but is limited; prefetching is the highest-leverage knob
  • Burst buffer essential for checkpointing at scale - NVMe as staging tier
  • Prefetch hierarchy likely needed at scale: stage training data in burst buffer too.


Last updated: March 15, 2026
