My work focuses on storage systems for AI/ML workloads, bridging production systems experience with storage research:
Storage Benchmarking & Characterization: Current storage benchmarks for GPU training (MLPerf Storage, DLIO) focus on throughput but miss critical production behaviors: checkpoint I/O burstiness, metadata operation scalability, and S3 API compatibility edge cases. I’m developing evaluation methodologies that capture these real-world dimensions, informed by production deployments at Crusoe.
RDMA Object Storage & Accelerated I/O: Exploring RDMA-based object storage architectures (S3/RDMA) and tier-0 caching systems that reduce checkpoint latency for GPU training.
Storage Evaluation Frameworks: Developing a framework for evaluating object storage systems beyond throughput: S3 compatibility testing, failure mode analysis, operational complexity metrics, and total cost of ownership. Investigating trade-offs between file systems (Lustre, DAOS, 3FS) and emerging disaggregated object storage vendors.
Current Work
Evaluating object storage vendors for GPU training clusters at Crusoe. Member of MLCommons Storage Working Groups (Object Storage, Accelerated IO), and looking to contribute to SNIA’s Technical Working Groups (Cloud Object Storage Test Tools Group, Accelerated IO).
Looking to collaborate on the following:
- Storage Performance Characterization for AI Training Workloads
- Framework for Evaluating Object Storage Vendors for AI Workloads
Readings
I am collecting the research papers that have helped me learn. I will also be capturing my reading notes here.
AI Storage & ML Infrastructure
3FS (Fire-Flyer File System) | DeepSeek’s RDMA-based Distributed File System
Fire-Flyer AI-HPC | Cost-Effective Software-Hardware Co-Design
RDMA-First Object Storage with SmartNIC Offload | Low-latency GPU storage
DAOS | Storage Stack for Storage Class Memory
Benchmarking All-Flash Storage for HPC | Storage benchmarking methodologies
io_uring for High-Performance DBMSs | Modern Linux async I/O
Distributed Storage & Databases
Dynamo | Amazon’s Highly Available Key-Value Store (SOSP 2007) | My Notes
DynamoDB | Amazon’s Fully Managed NoSQL Service | My Notes
Spanner | Google’s Globally-Distributed Database | My Notes
Cassandra | Distributed Wide-Column Store | My Notes
Bigtable | Distributed Storage for Structured Data | My Notes
Megastore | Scalable, Highly Available Storage | My Notes
Distributed File Systems
Google File System | Large-scale distributed file system | My Notes
HDFS | Hadoop Distributed File System | My Notes
Consensus & Coordination
Raft | Consensus Algorithm | My Notes
Chubby | Distributed Lock Service | My Notes
Paxos Made Live | Consensus
Time, Clocks, and Ordering | Lamport’s Classic
Streaming & Messaging
Kafka | Distributed Append-Only Log | My Notes
MapReduce | Simplified Data Processing
Spark | Resilient Distributed Datasets
Flink | Stream and Batch Processing
Caching & Performance
Memcache at Facebook | Scaling Memcache | My Notes
TAO | Facebook’s Distributed Data Store
Books
Designing Data-Intensive Applications by Martin Kleppmann
Database Internals by Alex Petrov
A Philosophy of Software Design by John Ousterhout