My work spans AI storage optimization, distributed systems, and high-throughput data processing. Below are key areas and projects I have worked on.

Multi-Protocol Storage for ML Workloads

Building and operating storage systems that expose both NFS (for legacy HPC workflows) and S3 (for cloud-native pipelines) over the same dataset. Key challenges include:

  • Protocol semantic differences (POSIX vs object storage)
  • Consistency models for concurrent access patterns
  • Performance optimization for sequential reads (training) vs random access (checkpointing/inference)
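The sequential-vs-random distinction above can be made concrete with a small micro-benchmark. This is an illustrative sketch (stdlib only, Unix `os.pread`); the file layout and `read_pattern` helper are hypothetical stand-ins for a training shard, not code from the projects described here:

```python
import os, time, random, tempfile

def read_pattern(path, block_size, offsets):
    """Read block_size bytes at each offset; return elapsed seconds."""
    fd = os.open(path, os.O_RDONLY)
    start = time.perf_counter()
    for off in offsets:
        os.pread(fd, block_size, off)
    elapsed = time.perf_counter() - start
    os.close(fd)
    return elapsed

# Build a small test file (a stand-in for a training shard).
block = 64 * 1024
n_blocks = 256
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(block * n_blocks))
    path = f.name

sequential = [i * block for i in range(n_blocks)]
shuffled = sequential[:]
random.shuffle(shuffled)

t_seq = read_pattern(path, block, sequential)
t_rand = read_pattern(path, block, shuffled)
os.unlink(path)
print(f"sequential: {t_seq:.4f}s  random: {t_rand:.4f}s")
```

Note that against a warm page cache the two timings converge; the gap only shows up against real backing storage, which is exactly why training (sequential) and checkpoint/inference (random) access patterns need separate optimization.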

Tier-0 Caching Strategies for GPU Clusters

Researching node-local caching architectures to minimize data transfer over the network fabric:

  • Pre-fetching strategies based on training iteration patterns
  • Cache eviction policies
  • RDMA integration for microsecond-scale data-transfer latency
  • Benchmarking with PyTorch and TensorFlow data pipelines
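The first two bullets above can be sketched together: because a training epoch's iteration order is known up front, a node-local cache can prefetch exactly the shards the loader will request next, with LRU eviction bounding local capacity. This is a minimal illustration; `NodeLocalCache`, `fetch_fn`, and the shard schedule are hypothetical names, not part of the actual system:

```python
from collections import OrderedDict

class NodeLocalCache:
    """Illustrative node-local shard cache: LRU eviction plus
    look-ahead prefetch keyed to the known training iteration order."""

    def __init__(self, fetch_fn, capacity=4, lookahead=2):
        self.fetch_fn = fetch_fn      # pulls a shard from remote storage
        self.capacity = capacity
        self.lookahead = lookahead
        self.cache = OrderedDict()    # shard_id -> bytes, LRU order
        self.hits = self.misses = 0

    def _insert(self, shard_id):
        if shard_id not in self.cache:
            self.cache[shard_id] = self.fetch_fn(shard_id)
            while len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used

    def get(self, shard_id, schedule, pos):
        if shard_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(shard_id)  # mark most recently used
        else:
            self.misses += 1
            self._insert(shard_id)
        data = self.cache[shard_id]
        # Prefetch the next shards the iterator is about to ask for.
        for nxt in schedule[pos + 1 : pos + 1 + self.lookahead]:
            self._insert(nxt)
        return data

# Usage: the epoch schedule is known up front, so prefetch is exact.
schedule = [0, 1, 2, 3, 0, 1, 2, 3]
cache = NodeLocalCache(fetch_fn=lambda s: b"shard-%d" % s, capacity=4)
for pos, shard in enumerate(schedule):
    cache.get(shard, schedule, pos)
print(cache.hits, cache.misses)  # → 7 1
```

After the single cold miss, prefetch converts every subsequent access into a hit; real eviction policies also weigh shard size and re-use distance, which a plain LRU ignores.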

Storage Benchmarking for Transformer Training

Profiling storage I/O patterns for large language model training:

  • Characterizing data loading bottlenecks in transformer architectures
  • Tools: FIO, elbencho, custom PyTorch profilers
  • Building load-testing frameworks on top of elbencho to test file and object storage
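The core idea behind the custom profilers mentioned above can be shown framework-agnostically: split each training step's wall time into "waiting for data" vs "compute"; a high wait fraction means the storage/data pipeline, not the model, is the bottleneck. A sketch with hypothetical names (`profile_data_loading`, `slow_loader` simulate the real pipeline):

```python
import time

def profile_data_loading(loader, step_fn, n_steps):
    """Split wall time per step into data-wait vs compute seconds."""
    wait_s = compute_s = 0.0
    it = iter(loader)
    for _ in range(n_steps):
        t0 = time.perf_counter()
        batch = next(it)            # time blocked on the data pipeline
        t1 = time.perf_counter()
        step_fn(batch)              # stand-in for the forward/backward pass
        t2 = time.perf_counter()
        wait_s += t1 - t0
        compute_s += t2 - t1
    return wait_s, compute_s

# Toy pipeline: a slow "loader" and a fast "step" mimic an I/O-bound job.
def slow_loader():
    while True:
        time.sleep(0.01)            # simulated read from storage
        yield b"batch"

wait, comp = profile_data_loading(slow_loader(), lambda b: None, n_steps=5)
print(f"data wait: {wait:.3f}s  compute: {comp:.3f}s")
```

The same measurement wrapped around a PyTorch `DataLoader` and training step is what characterizes data-loading bottlenecks in practice.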

High-Throughput Streaming Data Ingestion and Delivery

Scaling a data ingestion pipeline to billions of events, with delivery across thousands of partitions (S3 prefixes):

  • Dynamic partitioning redesign: an Akka Streams architecture achieving a 400% throughput improvement
  • Backpressure handling and flow control in reactive systems
  • Lease management for distributed coordination
  • SQS/DynamoDB for workload distribution and state management
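The lease-management bullet above typically rests on DynamoDB conditional writes: a worker claims a partition only if no unexpired lease exists, so exactly one worker wins each race. The sketch below uses an in-memory stand-in (`LeaseTable` is hypothetical) to show the acquisition logic without an AWS dependency:

```python
import time, uuid

class LeaseTable:
    """In-memory stand-in for a DynamoDB lease table. Acquisition mimics a
    conditional write: succeed only if the lease is absent, expired, or ours."""

    def __init__(self):
        self.rows = {}  # partition_id -> (owner, expires_at)

    def try_acquire(self, partition_id, owner, ttl_s):
        now = time.monotonic()
        row = self.rows.get(partition_id)
        if row is None or row[1] <= now or row[0] == owner:
            # In DynamoDB this is a PutItem with a ConditionExpression;
            # the conditional update is what makes the race safe.
            self.rows[partition_id] = (owner, now + ttl_s)
            return True
        return False

table = LeaseTable()
w1, w2 = str(uuid.uuid4()), str(uuid.uuid4())
assert table.try_acquire("partition-7", w1, ttl_s=0.05)      # w1 wins
assert not table.try_acquire("partition-7", w2, ttl_s=0.05)  # w2 is blocked
time.sleep(0.06)
assert table.try_acquire("partition-7", w2, ttl_s=0.05)      # lease expired; w2 takes over
```

Workers renew their lease on a heartbeat shorter than the TTL; a crashed worker simply stops renewing and its partitions are reclaimed after expiry.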

Performance Engineering

Approaches to performance optimization:

  • Profiling Java services for memory leaks and GC tuning
  • Native memory analysis: comparing Unix memory allocators (glibc malloc vs jemalloc) and heap-dump investigation
  • Identifying hot paths and optimizing critical sections
  • Monitoring and observability for production systems
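The hot-path bullet above generalizes across runtimes: the Java work relies on JVM tooling (JFR, heap dumps), but the approach is the same everywhere — profile, sort by cumulative time, and the dominant frame is the optimization target. A self-contained illustration with Python's stdlib `cProfile` (the `slow_hot_path` workload is a deliberately quadratic stand-in):

```python
import cProfile, pstats, io

def slow_hot_path(n):
    # Deliberately quadratic: the kind of hot spot profiling surfaces.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

def workload():
    slow_hot_path(200)
    sum(range(10_000))  # cheap path, for contrast

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print("hot path found:", "slow_hot_path" in report)  # → hot path found: True
```

Sorting by cumulative time (rather than call count) is what separates a genuinely hot path from a merely chatty one.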

Research Interests

Current areas of exploration and experimentation:

  • RDMA-enabled storage: Low-latency data access for AI workloads
  • Caching hierarchies: Multi-tier caching strategies (node-local, rack-level, datacenter)
  • Storage cost optimization: Balancing performance and cost for training workloads
  • Benchmarking methodologies: Standardized approaches for storage performance evaluation

Technical Writing & Speaking

As an aspiring technical author, I’m interested in writing about:

  • Storage architecture for AI/ML infrastructure
  • Distributed systems design patterns
  • Performance optimization techniques
  • Real-world case studies from production systems

Interested in collaboration, co-authorship, or speaking opportunities? Reach out.