My work focuses on storage systems for AI/ML workloads, bridging production systems experience with storage research:

Storage Benchmarking & Characterization: Current GPU training benchmarks (MLPerf Storage, DLIO) focus on throughput but miss critical production behaviors: checkpoint I/O burstiness, metadata operation scalability, and S3 API compatibility edge cases. I’m developing evaluation methodologies that capture these real-world dimensions, informed by production deployments at Crusoe.

RDMA Object Storage & Accelerated I/O: Exploring RDMA-based object storage architectures (S3/RDMA) and tier-0 caching systems that reduce checkpoint latency for GPU training.

Storage Evaluation Frameworks: Developing a framework for evaluating object storage systems beyond throughput: S3 compatibility testing, failure mode analysis, operational complexity metrics, and total cost of ownership. Investigating trade-offs between file systems (Lustre, DAOS, 3FS) and emerging disaggregated object storage offerings.


Current Work

I am evaluating object storage vendors for GPU training clusters at Crusoe. I am a member of the MLCommons Storage Working Groups (Object Storage, Accelerated IO) and am looking to contribute to SNIA’s Technical Working Groups (Cloud Object Storage Test Tools Group, Accelerated IO).

I am looking to collaborate on the following:

  • Storage Performance Characterization for AI Training Workloads
  • Framework for Evaluating Object Storage Vendors for AI Workloads

Readings

I am collecting the research papers that have helped me learn. I will also capture my reading notes here.

AI Storage & ML Infrastructure

3FS (Fire-Flyer File System) | DeepSeek’s RDMA-based Distributed File System

Fire-Flyer AI-HPC | Cost-Effective Software-Hardware Co-Design

RDMA-First Object Storage with SmartNIC Offload | Low-latency GPU storage

DAOS | Storage Stack for Storage Class Memory

Benchmarking All-Flash Storage for HPC | Storage benchmarking methodologies

io_uring for High-Performance DBMSs | Modern Linux async I/O


Distributed Storage & Databases

Dynamo | Amazon’s Highly Available Key-Value Store (SOSP 2007) My Notes

DynamoDB | Amazon’s Fully Managed NoSQL Service My Notes

Spanner | Google’s Globally-Distributed Database My Notes

Cassandra | Distributed Wide-Column Store My Notes

Bigtable | Distributed Storage for Structured Data My Notes

Megastore | Scalable, Highly Available Storage My Notes


Distributed File Systems

Google File System | Large-scale distributed file system My Notes

HDFS | Hadoop Distributed File System My Notes


Consensus & Coordination

Raft | Consensus Algorithm My Notes

Chubby | Distributed Lock Service My Notes

Paxos Made Live | Consensus

Time, Clocks, and Ordering | Lamport’s Classic


Streaming & Messaging

Kafka | Distributed Append-Only Log My Notes

MapReduce | Simplified Data Processing

Spark | Resilient Distributed Datasets

Flink | Stream and Batch Processing


Caching & Performance

Memcache at Facebook | Scaling Memcache My Notes

TAO | Facebook’s Distributed Data Store


Books

Designing Data-Intensive Applications by Martin Kleppmann

Database Internals by Alex Petrov

A Philosophy of Software Design by John Ousterhout