<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>File-System on Hemant Sethi</title>
    <link>https://www.sethihemant.com/tags/file-system/</link>
    <description>Recent content in File-System on Hemant Sethi</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 08 Dec 2024 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://www.sethihemant.com/tags/file-system/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Google File System</title>
      <link>https://www.sethihemant.com/notes/gfs-2003/</link>
      <pubDate>Sun, 08 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/gfs-2003/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf&#34;&gt;Google File System&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;google-file-system--distributed-file-system&#34;&gt;Google File System / Distributed File System&lt;/h2&gt;
&lt;h3 id=&#34;goal&#34;&gt;Goal&lt;/h3&gt;
&lt;p&gt;Design a &lt;strong&gt;distributed file system&lt;/strong&gt; to store huge files (terabyte and larger). The system should be &lt;strong&gt;scalable&lt;/strong&gt;, &lt;strong&gt;reliable&lt;/strong&gt;, and &lt;strong&gt;highly available&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Developed by Google for its large data-intensive applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;background&#34;&gt;Background&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;GFS was built for handling batch processing on large data sets and is designed for system-to-system interaction, not user-to-system interaction.&lt;/li&gt;
&lt;li&gt;Was designed with following goals in mind:&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;gfs-use-cases&#34;&gt;GFS Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Built for distributed data-intensive applications like &lt;strong&gt;Gmail&lt;/strong&gt; or &lt;strong&gt;Youtube&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Google’s BigTable uses GFS to store &lt;strong&gt;log files&lt;/strong&gt; and &lt;strong&gt;data files&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;apis&#34;&gt;APIs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;GFS doesn’t provide a standard POSIX-like API. Instead, user-level APIs are provided.&lt;/li&gt;
&lt;li&gt;Files organized hierarchically in directories and identified by their path names.&lt;/li&gt;
&lt;li&gt;Supports usual file system operations:&lt;/li&gt;
&lt;li&gt;Additional Special Operations&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;high-level-architecture&#34;&gt;High Level Architecture&lt;/h3&gt;
&lt;h3 id=&#34;agenda&#34;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chunks&lt;/li&gt;
&lt;li&gt;Chunk Handle&lt;/li&gt;
&lt;li&gt;Cluster&lt;/li&gt;
&lt;li&gt;Chunk Server&lt;/li&gt;
&lt;li&gt;Master&lt;/li&gt;
&lt;li&gt;Client
A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple clients.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chunk&#34;&gt;Chunk&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;As files stored in GFS tend to be very large, GFS breaks files into multiple fixed-size chunks where each chunk is 64 megabytes in size.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chunk-handle&#34;&gt;Chunk Handle&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each chunk is identified by an immutable and globally unique 64-bit ID number called the chunk handle, allowing 2^64 unique chunks.&lt;/li&gt;
&lt;li&gt;Total addressable storage = 2^64 * 64 MB = 2^90 bytes, on the order of 10^9 exabytes&lt;/li&gt;
&lt;li&gt;Files are split into chunks, so the job of GFS is to provide a mapping from files to chunks and then to support standard operations on files, mapping them down to operations on individual chunks.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;cluster&#34;&gt;Cluster&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;GFS is organized into a network of computers(nodes) called a cluster. A GFS cluster contains 3 types of entities:
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/gfs-2003/image-1.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chunk-server&#34;&gt;Chunk Server&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Nodes that store chunks on local disks as Linux files.&lt;/li&gt;
&lt;li&gt;They read or write chunk data specified by a chunk handle and byte range.&lt;/li&gt;
&lt;li&gt;For reliability, each chunk is replicated to multiple chunk servers.&lt;/li&gt;
&lt;li&gt;By default, GFS stores &lt;strong&gt;three replicas&lt;/strong&gt;, though different replication factors can be specified on a &lt;strong&gt;per-file&lt;/strong&gt; basis.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/gfs-2003/image-2.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;master&#34;&gt;Master&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Coordinator&lt;/strong&gt; of GFS cluster. Responsible for &lt;strong&gt;keeping track of filesystem metadata&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Metadata stored at master includes:&lt;/li&gt;
&lt;li&gt;Master also &lt;strong&gt;controls system-wide activities&lt;/strong&gt; such as:&lt;/li&gt;
&lt;li&gt;Periodically &lt;strong&gt;communicates with each ChunkServer&lt;/strong&gt; in &lt;strong&gt;HeartBeat messages&lt;/strong&gt; to give it instructions and collect its state.&lt;/li&gt;
&lt;li&gt;For performance and fast random access, &lt;strong&gt;all metadata is stored in the master’s main memory&lt;/strong&gt;, i.e. entire filesystem namespace as well as all the name-to-chunk mappings.&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;fault tolerance&lt;/strong&gt; and &lt;strong&gt;to handle a master crash&lt;/strong&gt;, all metadata changes(every operation to File System) are written to the disk onto an &lt;strong&gt;operation log(similar to Journal)&lt;/strong&gt; which is &lt;strong&gt;replicated&lt;/strong&gt; to remote machines.&lt;/li&gt;
&lt;li&gt;The benefit of having a single, centralized master is that it has a &lt;strong&gt;global view of the file system&lt;/strong&gt;, and hence, it can make optimum management decisions, for example, related to chunk placement.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;client&#34;&gt;Client&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Application/Entity that makes read/write requests to GFS using GFS Client library.&lt;/li&gt;
&lt;li&gt;This library communicates with the master for all metadata-related operations like creating or deleting files, looking up files, etc.&lt;/li&gt;
&lt;li&gt;To read or write data, the client(library) interacts directly with the ChunkServers that hold the data.&lt;/li&gt;
&lt;li&gt;Neither the client nor the ChunkServer caches file data.&lt;/li&gt;
&lt;li&gt;ChunkServers rely on the &lt;strong&gt;buffer cache&lt;/strong&gt; in Linux to maintain frequently accessed data in memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;single-master-and-large-chunk-size&#34;&gt;Single Master and Large Chunk Size&lt;/h3&gt;
&lt;h3 id=&#34;agenda-1&#34;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Single Master&lt;/li&gt;
&lt;li&gt;Chunk Size&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;single-master&#34;&gt;Single Master&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Having a single master vastly &lt;strong&gt;simplifies GFS design&lt;/strong&gt; and enables the master to make &lt;strong&gt;sophisticated chunk placement&lt;/strong&gt; and &lt;strong&gt;replication decisions&lt;/strong&gt; using &lt;strong&gt;global knowledge&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;GFS minimizes the master’s involvement in reads and writes, so that it does not become a bottleneck.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/gfs-2003/image-3.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chunk-size&#34;&gt;Chunk Size&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;GFS has chosen 64 MB, which is much larger than typical filesystem block sizes (which are often around 4KB). One of the key design parameters.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Advantages of large chunk size&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;lazy-space-allocation&#34;&gt;Lazy space Allocation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each chunk replica is stored as a plain Linux file on a ChunkServer. GFS does not allocate the whole 64 MB of disk space when creating a chunk. Instead, as the client appends data, the ChunkServer lazily extends the chunk.&lt;/li&gt;
&lt;li&gt;One disadvantage of having a large chunk size is the handling of small files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;metadata&#34;&gt;Metadata&lt;/h3&gt;
&lt;p&gt;Let&amp;rsquo;s explore how GFS manages file system metadata.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">Google File System</a></p>
<hr>
<h2 id="google-file-system--distributed-file-system">Google File System / Distributed File System</h2>
<h3 id="goal">Goal</h3>
<p>Design a <strong>distributed file system</strong> to store huge files (terabyte and larger). The system should be <strong>scalable</strong>, <strong>reliable</strong>, and <strong>highly available</strong>.</p>
<ul>
<li>Developed by Google for its large data-intensive applications.</li>
</ul>
<h3 id="background">Background</h3>
<ul>
<li>GFS was built for handling batch processing on large data sets and is designed for system-to-system interaction, not user-to-system interaction.</li>
<li>It was designed with Google’s application workloads and technological environment in mind.</li>
</ul>
<h3 id="gfs-use-cases">GFS Use Cases</h3>
<ul>
<li>Built for distributed data-intensive applications like <strong>Gmail</strong> or <strong>Youtube</strong>.</li>
<li>Google’s BigTable uses GFS to store <strong>log files</strong> and <strong>data files</strong>.</li>
</ul>
<h3 id="apis">APIs</h3>
<ul>
<li>GFS doesn’t provide a standard POSIX-like API. Instead, user-level APIs are provided.</li>
<li>Files are organized hierarchically in directories and identified by their path names.</li>
<li>Supports the usual file system operations: create, delete, open, close, read, and write.</li>
<li>Additional special operations: snapshot and record append.</li>
</ul>
<h3 id="high-level-architecture">High Level Architecture</h3>
<h3 id="agenda">Agenda</h3>
<ul>
<li>Chunks</li>
<li>Chunk Handle</li>
<li>Cluster</li>
<li>Chunk Server</li>
<li>Master</li>
<li>Client</li>
</ul>
<p>A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple clients.</p>
<h3 id="chunk">Chunk</h3>
<ul>
<li>As files stored in GFS tend to be very large, GFS breaks files into multiple fixed-size chunks where each chunk is 64 megabytes in size.</li>
</ul>
<h3 id="chunk-handle">Chunk Handle</h3>
<ul>
<li>Each chunk is identified by an immutable and globally unique 64-bit ID number called the chunk handle, allowing 2^64 unique chunks.</li>
<li>Total addressable storage = 2^64 * 64 MB = 2^90 bytes, on the order of 10^9 exabytes.</li>
<li>Files are split into chunks, so the job of GFS is to provide a mapping from files to chunks and then to support standard operations on files, mapping them down to operations on individual chunks.</li>
</ul>
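<p>A quick sanity check of that capacity figure (a back-of-the-envelope sketch, not part of GFS itself):</p>

```python
# 64-bit chunk handles with 64 MB chunks give 2^90 addressable bytes,
# which is on the order of a billion (decimal) exabytes.
CHUNK_SIZE = 64 * 2**20        # 64 MB in bytes
NUM_HANDLES = 2**64            # unique 64-bit chunk handles

total_bytes = NUM_HANDLES * CHUNK_SIZE   # 2^90 bytes
total_exabytes = total_bytes / 10**18    # decimal exabytes

print(f"{total_exabytes:.2e} EB")
```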
<h3 id="cluster">Cluster</h3>
<ul>
<li>GFS is organized into a network of computers (nodes) called a cluster. A GFS cluster contains 3 types of entities:
<img loading="lazy" src="/images/notes/gfs-2003/image-1.png"></li>
</ul>
<h3 id="chunk-server">Chunk Server</h3>
<ul>
<li>Nodes that store chunks on local disks as Linux files.</li>
<li>They read or write chunk data specified by a chunk handle and byte range.</li>
<li>For reliability, each chunk is replicated to multiple chunk servers.</li>
<li>By default, GFS stores <strong>three replicas</strong>, though different replication factors can be specified on a <strong>per-file</strong> basis.
<img loading="lazy" src="/images/notes/gfs-2003/image-2.png"></li>
</ul>
<h3 id="master">Master</h3>
<ul>
<li><strong>Coordinator</strong> of GFS cluster. Responsible for <strong>keeping track of filesystem metadata</strong>.</li>
<li>Metadata stored at the master includes the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas.</li>
<li>The master also <strong>controls system-wide activities</strong> such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between ChunkServers.</li>
<li>Periodically <strong>communicates with each ChunkServer</strong> in <strong>HeartBeat messages</strong> to give it instructions and collect its state.</li>
<li>For performance and fast random access, <strong>all metadata is stored in the master’s main memory</strong>, i.e. entire filesystem namespace as well as all the name-to-chunk mappings.</li>
<li>For <strong>fault tolerance</strong> and <strong>to handle a master crash</strong>, all metadata changes(every operation to File System) are written to the disk onto an <strong>operation log(similar to Journal)</strong> which is <strong>replicated</strong> to remote machines.</li>
<li>The benefit of having a single, centralized master is that it has a <strong>global view of the file system</strong>, and hence, it can make optimum management decisions, for example, related to chunk placement.</li>
</ul>
<h3 id="client">Client</h3>
<ul>
<li>Application/Entity that makes read/write requests to GFS using GFS Client library.</li>
<li>This library communicates with the master for all metadata-related operations like creating or deleting files, looking up files, etc.</li>
<li>To read or write data, the client(library) interacts directly with the ChunkServers that hold the data.</li>
<li>Neither the client nor the ChunkServer caches file data explicitly.</li>
<li>ChunkServers instead rely on Linux’s <strong>buffer cache</strong> to keep frequently accessed data in memory.</li>
</ul>
<h3 id="single-master-and-large-chunk-size">Single Master and Large Chunk Size</h3>
<h3 id="agenda-1">Agenda</h3>
<ul>
<li>Single Master</li>
<li>Chunk Size</li>
</ul>
<h3 id="single-master">Single Master</h3>
<ul>
<li>Having a single master vastly <strong>simplifies GFS design</strong> and enables the master to make <strong>sophisticated chunk placement</strong> and <strong>replication decisions</strong> using <strong>global knowledge</strong>.</li>
<li>GFS minimizes the master’s involvement in reads and writes, so that it does not become a bottleneck.
<img loading="lazy" src="/images/notes/gfs-2003/image-3.png"></li>
</ul>
<h3 id="chunk-size">Chunk Size</h3>
<ul>
<li>GFS has chosen 64 MB, which is much larger than typical file system block sizes (often around 4 KB). This is one of the key design parameters.</li>
<li><strong>Advantages of large chunk size</strong>: fewer client&ndash;master interactions, clients can keep persistent TCP connections to a ChunkServer for many operations on the same chunk, and less metadata at the master.</li>
</ul>
<h3 id="lazy-space-allocation">Lazy space Allocation</h3>
<ul>
<li>Each chunk replica is stored as a plain Linux file on a ChunkServer. GFS does not allocate the whole 64 MB of disk space when creating a chunk. Instead, as the client appends data, the ChunkServer lazily extends the chunk.</li>
<li>One disadvantage of having a large chunk size is the handling of small files.</li>
</ul>
<h3 id="metadata">Metadata</h3>
<p>Let&rsquo;s explore how GFS manages file system metadata.</p>
<h3 id="agenda-2">Agenda</h3>
<ul>
<li>Storing Metadata in memory</li>
<li>Chunk Location</li>
<li>Operation Log</li>
</ul>
<p>The master stores 3 types of metadata:</p>
<ul>
<li>File and chunk namespaces (directory hierarchy).</li>
<li>Mapping from files to chunks.</li>
<li>Location of each chunk’s replicas.</li>
</ul>
<p>3 aspects of how the master stores this metadata:</p>
<ul>
<li>It keeps all the metadata in memory.</li>
<li>File and chunk namespaces and the file-to-chunk mapping are also persisted on the master’s local disk.</li>
<li>Chunk replica locations are not persisted to local disk.</li>
</ul>
<h3 id="storing-metadata-in-memory">Storing Metadata in Memory</h3>
<ul>
<li>Quick operations due to metadata being accessible in-memory.</li>
<li>It is efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used for chunk garbage collection, re-replication in the presence of ChunkServer failures, and chunk migration to balance load and disk space usage.</li>
<li>The capacity of the whole system (i.e., how many chunks the metadata can describe) is limited by how much memory the master has. This is not a problem in practice.</li>
<li>If the need to support a larger file system arises, the cost of adding extra memory to the master is a small price to pay for the reliability, simplicity, performance, and flexibility gained by storing metadata in memory.</li>
</ul>
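<p>To make the memory argument concrete, here is a rough estimate (an illustrative sketch; the ~64-bytes-per-chunk figure is the upper bound the paper reports):</p>

```python
# Estimate the master's metadata memory for a given amount of stored data,
# assuming under ~64 bytes of metadata per 64 MB chunk (paper's figure).
CHUNK_SIZE = 64 * 2**20          # 64 MB
METADATA_PER_CHUNK = 64          # bytes, assumed upper bound

def master_memory_bytes(stored_bytes: int) -> int:
    chunks = -(-stored_bytes // CHUNK_SIZE)   # ceiling division
    return chunks * METADATA_PER_CHUNK

# One petabyte of file data needs only about a gigabyte of master memory.
print(master_memory_bytes(2**50) / 2**20, "MB")
```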
<h3 id="chunk-location">Chunk Location</h3>
<ul>
<li>The master does not keep a persistent record of which ChunkServers have a replica of a given chunk.</li>
<li>By having the ChunkServer as the ultimate source of truth for each chunk’s location, GFS eliminates the problem of keeping the master and ChunkServers in sync.</li>
<li>It is not beneficial to maintain a consistent view of chunk locations on the master, because errors on a ChunkServer may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled, or a ChunkServer may be renamed or fail).</li>
</ul>
<h3 id="operation-log">Operation Log</h3>
<ul>
<li>The master maintains an operation log that contains the namespace and file-to-chunk mappings and stores it on the local disk.</li>
<li>Specifically, this log stores a historical persistent record of all the metadata changes and serves as a logical timeline that defines the order of concurrent operations.</li>
<li>For <strong>fault tolerance</strong> and <strong>reliability</strong>, this operation log is <strong>synchronously replicated</strong> on multiple remote machines, and changes to the metadata are not made visible to clients until they have been persisted on all replicas (similar to the High Water Mark concept in Kafka).</li>
<li>The master batches several log records together before flushing, thereby reducing the impact of flushing and replicating on overall system throughput.</li>
<li>Upon restart, the master can restore its file-system state by replaying the operation log.</li>
<li>This log must be kept small to minimize startup time, which is achieved by periodically checkpointing it.</li>
</ul>
<h3 id="checkpointing">Checkpointing</h3>
<ul>
<li>Master’s state is periodically serialized to disk and then replicated, so that on recovery, a master may load the checkpoint into memory, replay any subsequent operations from the operation log, and be available very quickly.</li>
<li>To further speed up the recovery and improve availability, <strong>GFS stores the checkpoint in a compact B-tree like format</strong> that can be directly mapped into memory and used for namespace lookup without extra parsing.</li>
<li>The checkpoint process can take time, therefore, to avoid delaying incoming mutations, the master switches to a new log file and creates the new checkpoint in a separate thread.</li>
</ul>
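<p>The recovery path described above (load the latest checkpoint, then replay the log suffix) can be sketched as follows; the operation format and helper names are invented for illustration, and a plain dict stands in for the real B-tree-like checkpoint:</p>

```python
import json

# Minimal sketch of master recovery: load the last checkpoint, then replay
# operation-log records written after it, in log order.
def recover(checkpoint: str, log_lines: list[str]) -> dict:
    state = json.loads(checkpoint)        # namespace -> chunk handles
    for line in log_lines:                # replay in log order
        op = json.loads(line)
        if op["type"] == "create":
            state[op["path"]] = op["chunks"]
        elif op["type"] == "delete":
            state.pop(op["path"], None)
    return state

ckpt = json.dumps({"/a": ["h1"]})
log = [json.dumps({"type": "create", "path": "/b", "chunks": ["h2"]})]
print(recover(ckpt, log))
```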
<h3 id="master-operations">Master Operations</h3>
<h3 id="agenda-3">Agenda</h3>
<ul>
<li>Namespace management and locking</li>
<li>Replica placement</li>
<li>Replica creation and re-replication</li>
<li>Replica rebalancing</li>
<li>Stale replica detection</li>
</ul>
<p>The master is responsible for:</p>
<ul>
<li>Making replica placement decisions</li>
<li>Creating new chunks and assigning replicas</li>
<li>Making sure that chunks are fully replicated as per the replication factor</li>
<li>Balancing the load across ChunkServers</li>
<li>Reclaiming unused storage</li>
</ul>
<h3 id="namespace-management-and-locking">Namespace management and locking</h3>
<ul>
<li>The master acquires locks over regions of the namespace to ensure proper serialization while still allowing multiple operations to run concurrently at the master.</li>
<li>GFS does not have an <strong>i-node</strong> like tree structure for directories and files.</li>
<li>Instead, it has a <strong>hash-map</strong> that maps a filename to its metadata, and reader-writer locks are applied on each node of the hash table for synchronization.</li>
</ul>
<h3 id="replica-placement">Replica placement</h3>
<ul>
<li>To ensure <strong>maximum data availability</strong> and <strong>integrity</strong>, the master distributes replicas on different racks (“<strong>rack aware</strong>”), so that clients can still read or write in case of a rack failure.</li>
<li>As the in and out bandwidth of a rack may be less than the sum of the bandwidths of its individual machines, placing the data on various racks lets clients exploit reads from multiple racks.</li>
<li>For write operations, multiple racks are actually disadvantageous, as data has to travel longer distances. This is an intentional tradeoff that GFS made.</li>
<li>Data is lost when all replicas of a chunk are lost.</li>
</ul>
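<p>A toy version of rack-aware placement can illustrate the policy; server and rack names are invented, and the real master also weighs disk utilization and recent creations:</p>

```python
# Spread a chunk's replicas round-robin across racks so that a whole-rack
# failure leaves at least one copy readable. Illustrative sketch only.
def place_replicas(servers_by_rack: dict[str, list[str]], replicas: int = 3) -> list[str]:
    iters = {rack: iter(servers) for rack, servers in servers_by_rack.items()}
    placed: list[str] = []
    while len(placed) < replicas and iters:
        # one pass over the remaining racks per round (round-robin)
        for rack in list(iters):
            try:
                placed.append(next(iters[rack]))
            except StopIteration:
                del iters[rack]            # rack exhausted
            if len(placed) == replicas:
                break
    return placed

print(place_replicas({"rack1": ["cs1", "cs2"], "rack2": ["cs3", "cs4"]}))
```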
<h3 id="replica-creation-and-re-replication">Replica creation and re-replication</h3>
<ul>
<li>The goals of a master are to place replicas on servers with less-than-average disk utilization, and spread replicas across racks.</li>
<li>It reduces the number of ‘recent’ creations on each ChunkServer, because even though creation itself is cheap, it is typically followed by heavy write traffic, which might create additional load.</li>
<li>Chunks need to be re-replicated as soon as the number of available replicas falls (due to data corruption on a server or a replica being unavailable) below the user-specified replication factor.</li>
<li>Instead of re-replicating all such chunks at once, the master prioritizes them and spreads the cloning work out over time, so that these clone operations do not become a bottleneck for client operations.</li>
<li>Restrictions are placed on the bandwidth of each server for re-replication so that client requests are not compromised.</li>
<li>Chunks are prioritized for re-replication by how far they are from their replication goal, whether they belong to live (not deleted) files, and whether they are blocking client progress.</li>
</ul>
<h3 id="replica-rebalancing">Replica rebalancing</h3>
<ul>
<li>Master rebalances replicas regularly to achieve load balancing and better disk space usage.</li>
<li>Any new ChunkServer added to the cluster is filled up gradually by the master rather than flooding it with a heavy traffic of write operations.</li>
</ul>
<h3 id="stale-replica-detection">Stale replica detection</h3>
<ul>
<li>Chunk replicas may become stale if a ChunkServer fails and misses mutations to the chunk while it is down.</li>
<li>For each chunk, the master maintains a chunk Version Number to distinguish between up-to-date and stale replicas.</li>
<li>The master increments the chunk version every time it grants a lease and informs all up-to-date replicas.</li>
<li>The master and these replicas all record the new version number in their persistent state.</li>
<li>Master removes stale replicas during regular garbage collection.</li>
<li>Stale replicas are not given to clients when they ask the master for a chunk location, and they are not involved in mutations either.</li>
<li>However, because a client caches a chunk’s location, it may read from a stale replica before the data is resynced.</li>
</ul>
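<p>The version-number check can be sketched in a few lines (names are illustrative only):</p>

```python
# A replica is stale when its recorded chunk version is older than the
# master's: it missed the version bump done when a lease was granted.
def stale_replicas(master_version: int, replica_versions: dict) -> list:
    return [server for server, version in replica_versions.items()
            if version < master_version]

# cs3 was down while a lease was granted, so it never saw version 7.
versions = {"cs1": 7, "cs2": 7, "cs3": 6}
print(stale_replicas(7, versions))
```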
<h3 id="anatomy-of-a-read-operation">Anatomy of a Read Operation</h3>
<p>Let’s learn how GFS handles a read operation. A typical interaction with GFS Cluster goes like this:</p>
<p><img loading="lazy" src="/images/notes/gfs-2003/image-4.png"></p>
<ul>
<li>Client translates the filename and byte offset specified by the application into a chunk index within the file.</li>
<li>Client sends RPC request with File Name and Chunk Index to the master.</li>
<li>The master replies with the chunk handle and the locations of the replicas holding that chunk.</li>
<li>Client caches this metadata using FileName and ChunkIndex as the key.</li>
<li>Client sends request to one of the closest replicas specifying a chunk handle and a byte range within that chunk.</li>
<li>Replica chunk server replies with requested data.</li>
<li>The master is involved only at the start and is then completely out of the loop, implementing a separation of control and data flows.</li>
</ul>
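<p>The first step, translating the application’s (filename, byte offset) into a chunk index, is simple fixed-size arithmetic. A sketch (the function name is illustrative, not the real GFS client API):</p>

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB chunks

# Translate an application-level byte offset into the (filename, chunk
# index) key used for the master request and the client metadata cache.
def to_chunk_request(filename: str, offset: int) -> tuple[str, int, int]:
    chunk_index = offset // CHUNK_SIZE        # which chunk holds the byte
    offset_in_chunk = offset % CHUNK_SIZE     # start of byte range inside it
    return filename, chunk_index, offset_in_chunk

# A read at byte 200 MB of a file falls in chunk 3 (0-based), 8 MB in:
print(to_chunk_request("/logs/crawl.log", 200 * 2**20))
```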
<h3 id="anatomy-of-write-operation">Anatomy of Write Operation</h3>
<h3 id="what-is-a-chunk-lease">What is a chunk lease?</h3>
<ul>
<li>To safeguard against concurrent writes at two different replicas of a chunk, GFS makes use of chunk lease.</li>
<li>When a mutation (i.e., a write, append or delete operation) is requested for a chunk, the master finds the ChunkServers which hold that chunk and grants a chunk lease (for 60 seconds) to one of them.</li>
<li>The server with the lease is called the primary and is responsible for <strong>providing a serial order</strong> for all the currently pending concurrent mutations to that chunk.</li>
<li>There is only one lease per chunk at any time, so that if two write requests go to the master, both see the same lease denoting the same primary.</li>
<li>A global ordering is provided by the ordering of the chunk leases combined with the order determined by that primary.</li>
<li>The primary can request lease extensions if needed.</li>
<li>When the master grants the lease, it increments the chunk version number and informs all replicas containing that chunk of the new version number.</li>
<li><strong>Failure modes?</strong></li>
</ul>
<h3 id="data-writing">Data Writing</h3>
<p>Writing of data is split into two phases:</p>
<ul>
<li><strong>Sending</strong></li>
<li><strong>Writing</strong></li>
</ul>
<p><strong>Stepwise breakdown of data transfer:</strong></p>
<ul>
<li>The client asks the master which ChunkServer holds the current lease on the chunk and the locations of the other replicas.</li>
<li>The master replies with the identity and location of the primary and the secondary replicas.</li>
<li>The client pushes data to the closest replica, which forwards it along a chain of ChunkServers to the remaining replicas.</li>
<li>Once all replicas have acknowledged receiving the data, the client sends the write request to the primary.</li>
<li>The primary assigns consecutive serial numbers to all the mutations it receives, providing serialization, and applies the mutations in serial-number order.</li>
<li>The primary forwards the write request to all secondary replicas, which apply mutations in the same serial-number order.</li>
<li>The secondary replicas reply to the primary, indicating that they have completed the operation.</li>
<li>The primary replies to the client with a success or error message.</li>
</ul>
<p>The key point to note is that the data flow is different from the control flow. Chunk <strong>version numbers</strong> are used to detect whether any replica has stale data that was not updated because its ChunkServer was down during an update.
<img loading="lazy" src="/images/notes/gfs-2003/image-5.png"></p>
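<p>The primary’s serialization step is the heart of this protocol: whatever order concurrent mutations arrive in, the primary fixes one order that every replica applies. A toy sketch, with invented names:</p>

```python
import itertools

# Toy primary for one chunk: assign consecutive serial numbers to incoming
# mutations and record them in that order; secondaries would replay the
# same (serial, data) sequence so all replicas apply one total order.
class Primary:
    def __init__(self):
        self._serial = itertools.count(1)
        self.applied = []                  # stands in for the chunk's state

    def mutate(self, data: bytes) -> int:
        serial = next(self._serial)        # the single total order per chunk
        self.applied.append((serial, data))
        return serial

p = Primary()
print(p.mutate(b"alpha"), p.mutate(b"beta"))   # consecutive serials
```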
<p>Another edge case with the write operation: if there are two concurrent write operations spanning multiple chunks, and those chunks have two different primary ChunkServers (each deciding its own serial order), <strong>the concurrent writes may end up interleaved</strong>. See the example below, and the Jordan video here at time 22:30 onwards. The only solution for that is a <strong>Distributed Locking Service</strong> (distributed consensus), which is an expensive operation.</p>
<p><img loading="lazy" src="/images/notes/gfs-2003/image-6.png"></p>
<h3 id="anatomy-of-append-operation">Anatomy of Append operation?</h3>
<ul>
<li><strong>Record append operation</strong> is optimized in a unique way that distinguishes GFS from other distributed file systems.</li>
<li>In a normal write, the client specifies the offset at which data is to be written. Concurrent writes to the same region can experience race conditions, and the region may end up containing data fragments from multiple clients.</li>
<li>In a <strong>record append</strong>, however, the client specifies only the data (<strong>up to 1/4 of the chunk size</strong>, i.e., 16 MB). GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS’s choosing and returns that offset to the client.</li>
<li><strong>Record append</strong> is a kind of mutation: an operation that changes the contents or metadata of a chunk.</li>
<li><strong>[Data Transfer to Replicas]</strong> When an application tries to append data on a chunk by sending a request to the client, the client pushes the data to all replicas of the last chunk of the file just like the write operation.</li>
<li><strong>[Command to serialize the write]</strong> When the client forwards the append request to the primary, the primary checks whether appending the record to the existing chunk would increase the chunk’s size beyond its limit (the maximum size of a chunk is 64 MB).</li>
<li><strong>[Pads the existing chunk]</strong> If so, it pads the chunk to the maximum size, commands the secondaries to do the same, and tells the client to retry the append on the next chunk.</li>
<li><strong>[Append to the primary replica’s chunk and notify secondaries]</strong> If the record fits within the maximum size, the primary appends the data to its replica, tells the secondaries to write the data at the exact same offset, and finally replies success to the client.</li>
<li><strong>[Failure Mode]</strong> If an append operation fails at any replica, the client retries the operation.</li>
</ul>
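<p>The primary’s fit-or-pad decision for record append can be sketched as follows (sizes are from the notes; the function and its return values are illustrative):</p>

```python
CHUNK_SIZE = 64 * 2**20       # maximum chunk size
MAX_RECORD = CHUNK_SIZE // 4  # records limited to 1/4 chunk (16 MB)

def record_append(chunk_used: int, record_len: int):
    """Primary's decision: pad and retry on the next chunk, or append."""
    assert record_len <= MAX_RECORD
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad the rest of this chunk; the client retries on the next chunk.
        return ("pad_and_retry", CHUNK_SIZE - chunk_used)
    # Append at an offset of the primary's choosing (here: end of data).
    return ("appended_at", chunk_used)

print(record_append(60 * 2**20, 8 * 2**20))   # doesn't fit: pad 4 MB
print(record_append(10 * 2**20, 8 * 2**20))   # fits: offset 10 MB
```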
<h3 id="implications-for-writesjordan-video2740">Implications for Writes(Jordan Video:27:40)</h3>
<ul>
<li>Prefer appends to writes.</li>
<li>No Interleaving.</li>
<li>Readers need to be able to handle padding and/or duplicates (which can happen due to failed retries or partial failures on some of the replicas).</li>
<li>If making <strong>multi-chunk writes</strong>, writers should take <strong>checkpoints</strong> as each of those individual write chunks goes through.</li>
</ul>
<h3 id="gfs-consistency-model-and-snapshotting">GFS consistency model and Snapshotting</h3>
<h3 id="gfs-consistency-model">GFS Consistency model</h3>
<ul>
<li>GFS has a relaxed consistency model: it does not guarantee that all replicas are byte-wise identical or that concurrent mutations appear as if executed one at a time; instead it makes the weaker guarantees described below.</li>
<li>Metadata operations (e.g., file creation) are atomic.</li>
<li>Namespace locking guarantees atomicity and correctness.</li>
<li><strong>Master’s operation log</strong> defines a <strong>global total order</strong> of these operations.</li>
<li>In data mutations, there is an important distinction between <strong>write</strong> and <strong>append</strong> operations.</li>
<li><strong>Write</strong> operations specify an offset at which mutations should occur, whereas appends are always applied at the end of the file.</li>
<li>This means that for the write operation, the offset in the chunk is predetermined, whereas for append, the system decides.</li>
<li>Concurrent writes to the same location are not serializable and may result in corrupted regions of the file.</li>
<li>With append operations, GFS guarantees the append will happen at-least-once and atomically (that is, as a contiguous sequence of bytes).</li>
<li>The system does not guarantee that all copies of the chunk will be identical (some may have duplicate data).</li>
</ul>
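<p>Because appends are at-least-once, readers must cope with padding and duplicates themselves. One common application-level approach (the paper suggests embedding unique record identifiers) looks like this sketch:</p>

```python
# Filter an append-only stream on read: skip duplicates left by retried
# appends. Assumes each record carries a writer-assigned unique ID.
def dedup(records):
    seen = set()
    for rec_id, payload in records:
        if rec_id in seen:
            continue                 # duplicate from a retried append
        seen.add(rec_id)
        yield payload

stream = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]   # record 2 was retried
print(list(dedup(stream)))
```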
<h3 id="snapshotting">Snapshotting</h3>
<ul>
<li>A snapshot is a <strong>copy of some subtree of the global namespace</strong> as it exists at a given point in time.</li>
<li>GFS clients use snapshotting to efficiently <strong>branch two versions of the same data</strong>.</li>
<li>Snapshots in GFS are initially <strong>zero-copy</strong>.</li>
<li>When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files to snapshot.</li>
<li>It waits for leases to be revoked or expired and logs the snapshot operation to the operation log.</li>
<li>The snapshot is then made by duplicating the metadata for the source directory tree.</li>
<li>When a client makes a request to write to one of these chunks, the master detects that it is a copy-on-write chunk by examining its reference count (which will be more than one).</li>
<li>At this point, the master asks each ChunkServer holding the replica to make a copy of the chunk and store it locally.</li>
<li>Once the copy is complete, the master issues a lease for the new copy, and the write proceeds.</li>
</ul>
<h3 id="fault-tolerance-high-availability-and-data-integrity">Fault Tolerance, High Availability, and Data Integrity</h3>
<h3 id="agenda-4">Agenda</h3>
<ul>
<li>Fault Tolerance</li>
<li>High Availability through chunk replication</li>
<li>Data Integrity through checksum.</li>
</ul>
<h3 id="fault-tolerance">Fault Tolerance</h3>
<p>To make the system fault tolerant, and available, GFS uses two strategies:</p>
<ul>
<li>
<p><strong>Fast recovery</strong> in case of component failures.</p>
</li>
<li>
<p><strong>Replication</strong> for high availability.
<strong>Let&rsquo;s see how GFS recovers from a Master or Replica failure:</strong></p>
</li>
<li>
<p><strong>On Master Failure</strong></p>
</li>
<li>
<p><strong>On Primary Replica Failure</strong></p>
</li>
<li>
<p><strong>On Secondary Replica Failure</strong></p>
</li>
<li>
<p>Stale replicas might be exposed to clients. It is up to the application programmer to deal with these stale reads.</p>
</li>
</ul>
<h3 id="high-availability-through-chunk-replication">High Availability through chunk replication</h3>
<ul>
<li>Each chunk is replicated on multiple ChunkServers on different racks.</li>
<li>Users can specify different replication levels (<strong>default: 3</strong>) for different parts of the file namespace.</li>
<li>The master clones the existing replicas to keep each chunk fully replicated as ChunkServers go offline or when the master detects corrupted replicas through checksum verification.</li>
<li>A chunk is lost irreversibly only if all its replicas are lost before GFS can react. Even in this case, the data becomes unavailable, not corrupted, which means applications receive clear errors rather than corrupt data.</li>
</ul>
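<p>The master&rsquo;s re-replication behavior can be illustrated with a short sketch (hypothetical names; the real master also prioritizes chunks by how far below target they are):</p>

```python
def rereplicate(chunk_locations, live_servers, target=3):
    """For each chunk, drop dead replicas and clone from a surviving
    replica onto new servers until the target count is restored."""
    for chunk, servers in chunk_locations.items():
        alive = [s for s in servers if s in live_servers]
        if not alive:
            continue  # all replicas lost: data unavailable, not corrupted
        candidates = [s for s in live_servers if s not in alive]
        while len(alive) < target and candidates:
            alive.append(candidates.pop(0))  # clone from an existing replica
        chunk_locations[chunk] = alive
    return chunk_locations
```
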
<h3 id="data-integrity-through-checksum">Data Integrity through checksum</h3>
<ul>
<li>
<p>Checksumming is used by each ChunkServer to detect the corruption of stored data.</p>
</li>
<li>
<p>The chunk is broken down into 64 KB blocks.</p>
</li>
<li>
<p>Each 64 KB block has a corresponding 32-bit checksum.</p>
</li>
<li>
<p>Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.</p>
</li>
<li>
<p><strong>For Reads</strong>: the ChunkServer verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another ChunkServer. <strong>ChunkServers will not propagate corruptions to other machines</strong>.</p>
</li>
<li>
<p><strong>For Writes</strong>:
<img loading="lazy" src="/images/notes/gfs-2003/image-7.png"></p>
</li>
<li>
<p><strong>For Appends</strong>:</p>
</li>
<li>
<p>During idle periods, ChunkServers can scan and verify the contents of inactive chunks (prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk).</p>
</li>
<li>
<p>Checksumming has little effect on read performance for the following reasons:</p>
</li>
</ul>
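<p>The block-level checksum scheme can be sketched as follows (CRC32 stands in for whatever 32-bit checksum GFS actually uses; names are illustrative). Note that verification happens before any byte is returned, which is why corruption never propagates:</p>

```python
import zlib

BLOCK = 64 * 1024  # each chunk is checksummed in 64 KB blocks

def checksum_chunk(data):
    """One 32-bit checksum per 64 KB block, as kept by each ChunkServer."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verified_read(data, sums, offset, length):
    """Verify every block overlapping [offset, offset+length) before
    returning any bytes to the requester."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(data[b * BLOCK:(b + 1) * BLOCK]) != sums[b]:
            raise IOError(f"checksum mismatch in block {b}")
    return data[offset:offset + length]
```

<p>A single flipped byte anywhere in an overlapping block fails the whole read, even if the corrupted byte itself is outside the requested range.</p>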
<h3 id="garbage-collection">Garbage Collection</h3>
<p><strong>How does GFS implement Garbage Collection?</strong></p>
<h3 id="agenda-5">Agenda</h3>
<ul>
<li>Garbage collection through lazy deletion</li>
<li>Advantages of lazy deletion</li>
<li>Disadvantages of lazy deletion</li>
</ul>
<h3 id="garbage-collection-through-lazy-deletion">Garbage collection through lazy deletion</h3>
<ul>
<li>When a file is deleted, GFS does not immediately reclaim the physical space used by that file. Instead, it follows a <strong>lazy garbage collection</strong> strategy.</li>
<li>When the client issues a delete file operation, GFS does two things:</li>
<li>The file can still be read under the new, special name and can also be undeleted by renaming it back to normal.</li>
<li>To reclaim the physical storage, the master, while performing regular scans of the file system, removes any such hidden files if they have existed for more than <strong>three days</strong> (this interval is <strong>configurable</strong>) and also deletes its in-memory metadata.</li>
<li>This lazy deletion scheme provides a window of opportunity to a user who deleted a file by mistake to recover the file.</li>
<li>The master, while performing regular scans of the chunk namespace, deletes the metadata of all chunks that are not part of any file.</li>
<li>Also, during the exchange of regular HeartBeat messages with the master, each ChunkServer reports a subset of the chunks it has, and the master replies with a list of chunks from that subset that are no longer present in the master’s database; such chunks are then deleted from the ChunkServer.</li>
</ul>
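<p>The rename-then-scan scheme above can be sketched like this (a toy model with invented names; the hidden-name format and grace period are illustrative, though the three-day default comes from the paper):</p>

```python
GRACE = 3 * 24 * 3600  # three days (configurable in GFS)

def delete(namespace, path, now):
    """'Delete' = rename to a hidden name stamped with the deletion time."""
    namespace[f".deleted.{path}"] = (namespace.pop(path), now)

def undelete(namespace, path):
    """Recover a mistakenly deleted file by renaming it back."""
    data, _ = namespace.pop(f".deleted.{path}")
    namespace[path] = data

def gc_scan(namespace, now):
    """The master's regular scan reclaims hidden files older than the grace period."""
    for name in [n for n in namespace if n.startswith(".deleted.")]:
        _, deleted_at = namespace[name]
        if now - deleted_at > GRACE:
            del namespace[name]
```
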
<h3 id="advantages-of-lazy-deletion">Advantages of lazy deletion</h3>
<ul>
<li><strong>Simple and reliable</strong>: If the chunk deletion message is lost, the master does not have to retry. The ChunkServer can perform the garbage collection with the subsequent heartbeat messages.</li>
<li>GFS merges storage reclamation into regular background activities of the master, such as the regular scans of the filesystem or the exchange of HeartBeat messages. Thus, it is done in batches, and the <strong>cost is amortized</strong>.</li>
<li>Garbage collection takes place when the master is relatively free.</li>
<li>Lazy deletion provides safety against accidental, irreversible deletions.</li>
</ul>
<h3 id="disadvantages-of-lazy-deletion">Disadvantages of lazy deletion</h3>
<ul>
<li>After deletion, storage space does not become available immediately. Applications that frequently create and delete files may not be able to reuse the storage right away. To overcome this, GFS provides the following options:</li>
</ul>
<h3 id="criticism-on-gfs">Criticism on GFS</h3>
<h3 id="problems-associated-with-single-master">Problems associated with single master</h3>
<ul>
<li>Google has started to see the following problems with the centralized master scheme:</li>
</ul>
<h3 id="problems-associated-with-large-chunk-size">Problems associated with large chunk size</h3>
<ul>
<li>The large chunk size (64 MB) in GFS has disadvantages for reads. Since a small file will have one or a few chunks, the ChunkServers storing those chunks can become hotspots if many clients access the same file.</li>
<li>As a workaround for this problem, GFS stores extra copies of small files for distributing the load to multiple ChunkServers. Furthermore, GFS adds a random delay in the start times of the applications accessing such files.</li>
</ul>
<h3 id="summary">Summary</h3>
<ul>
<li><strong>Scalable</strong> <strong>distributed</strong> file storage system for large <strong>data-intensive applications</strong>.</li>
<li>Uses <strong>commodity hardware</strong> to reduce infrastructure costs.</li>
<li>Was designed with <strong>Fault Tolerance</strong> in mind (software/hardware faults).</li>
<li>Reading workload is <strong>large streaming reads</strong> and small random reads.</li>
<li>Writing workload is <strong>many large sequential writes</strong> that append data to files.</li>
<li>Provides APIs for file operations like <strong>create</strong>, <strong>delete</strong>, <strong>open</strong>, <strong>close</strong>, <strong>read</strong>, <strong>write</strong>, <strong>snapshot</strong> and <strong>record append</strong> operations. <strong>Record append</strong> allows multiple clients to concurrently append data to the same file while guaranteeing atomicity.</li>
<li>A GFS <strong>cluster</strong> consists of a <strong>single master</strong> and <strong>multiple chunk servers</strong>, and is accessed by multiple clients.</li>
<li>Files are broken into <strong>64 MB chunks</strong>, identified by <strong>Immutable</strong> and <strong>Globally unique</strong> <strong>64-bit</strong> <strong>Chunk Handle</strong>(assigned by master during chunk creation).</li>
<li><strong>Chunk servers</strong> store chunks on local disks as Linux files. For <strong>Reliability</strong>, each chunk is replicated to multiple chunk servers.</li>
<li><strong>Master</strong> is <strong>Coordinator</strong> for GFS cluster. Responsible for keeping track of <strong>all the filesystem metadata</strong>. Namespace, authorization, files-chunk mapping, chunk location.</li>
<li><strong>Master</strong> keeps <strong>all metadata in memory</strong> for faster operation. For <strong>Fault tolerance</strong>, and to <strong>handle master crash</strong>, all metadata changes are written onto disk into <strong>Operation Log</strong> which is replicated to other machines.</li>
<li><strong>Master</strong> doesn’t have a <strong>persistent record (only in-memory)</strong> of <strong>which chunk servers have replicas for a given chunk</strong>. Master asks each chunk server what chunks it holds at master startup, or whenever a chunk server joins the cluster.</li>
<li>For <strong>Quick recovery</strong>(Master failure), master’s state is periodically serialized to disk(Checkpointed) along with <strong>Operation log</strong> and is replicated. On recovery, master loads the checkpoint, and replays subsequent operations from <strong>Operation Log</strong>.</li>
<li>Master communicates with each chunk server via <strong>HeartBeat</strong> to collect state.</li>
<li>Applications use <strong>GFS Client code</strong>, which implements filesystem API, and communicates with the cluster. Clients interact with master for metadata(<strong>Control Flow</strong>), but all data transfer happens directly(<strong>Data Flow</strong>) between client and Chunk servers.</li>
<li><strong>Data Integrity:</strong> Each Chunk server uses Checksumming to detect corruption of stored data.</li>
<li><strong>Garbage Collection</strong>: Lazy Deletion.</li>
<li><strong>Consistency</strong>: Master guarantees data consistency by ensuring the order of mutations on all replicas and using <strong>chunk version numbers</strong>. If a replica has an incorrect version, it is garbage collected.</li>
<li>GFS guarantees <strong>at-least-once writes.</strong> It is the responsibility of readers to deal with duplicate chunks. This is achieved by having <strong>Checksums</strong> and <strong>serial numbers</strong> in the chunks, which help readers to filter and discard duplicate data.</li>
<li><strong>Cache</strong>: Neither clients nor chunk servers cache data. However, clients do cache metadata.</li>
</ul>
<h3 id="system-design-patterns">System Design Patterns</h3>
<ul>
<li><strong>Write-Ahead-Log</strong> - Operation Log</li>
<li><strong>HeartBeat</strong> - B/w Master and Chunk servers.</li>
<li><strong>CheckSum</strong> - Data Integrity</li>
<li>Copy-On-Write Snapshotting.</li>
<li>Lazy Garbage collection.</li>
</ul>
<h3 id="references">References</h3>
<ul>
<li>GFS Paper</li>
<li>BigTable Paper</li>
<li>GFS Evolution on Fast-Forward</li>
<li>Jordan&rsquo;s video gives a quick summary of the above.</li>
</ul>
<hr>
<p><strong>Paper Link:</strong> <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Hadoop Distributed File System</title>
      <link>https://www.sethihemant.com/notes/hdfs-2010/</link>
      <pubDate>Sun, 08 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/hdfs-2010/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://pages.cs.wisc.edu/~akella/CS838/F15/838-CloudPapers/hdfs.pdf&#34;&gt;Hadoop Distributed File System&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;hadoop-distributed-file-system&#34;&gt;Hadoop Distributed File System&lt;/h2&gt;
&lt;h3 id=&#34;goal&#34;&gt;Goal&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Design a &lt;strong&gt;distributed&lt;/strong&gt; system that can store &lt;strong&gt;huge files (terabyte and larger)&lt;/strong&gt;. The system should be &lt;strong&gt;scalable&lt;/strong&gt;, &lt;strong&gt;reliable&lt;/strong&gt;, and &lt;strong&gt;highly available.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;what-is-hadoop-distributed-file-system&#34;&gt;What is Hadoop Distributed File System&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HDFS&lt;/strong&gt; is a distributed file system and was built to store &lt;strong&gt;unstructured data&lt;/strong&gt;. It is designed to store huge files &lt;strong&gt;reliably&lt;/strong&gt; and &lt;strong&gt;stream&lt;/strong&gt; those files at high bandwidth to user applications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HDFS&lt;/strong&gt; is a variant and a simplified version of the Google File System (&lt;strong&gt;GFS&lt;/strong&gt;). A lot of HDFS architectural decisions are inspired by GFS design. HDFS is built around the idea that the most efficient &lt;strong&gt;data processing pattern&lt;/strong&gt; is a &lt;strong&gt;write-once, read-many-times&lt;/strong&gt; pattern.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;background&#34;&gt;Background&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Apache Hadoop is a software framework that provides a distributed file storage system(HDFS) and distributed computing for analyzing and transforming very large data sets using the MapReduce programming model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HDFS&lt;/strong&gt; is the &lt;strong&gt;default&lt;/strong&gt; file storage system in &lt;strong&gt;Hadoop&lt;/strong&gt;. It is designed to be a &lt;strong&gt;distributed&lt;/strong&gt;, &lt;strong&gt;scalable&lt;/strong&gt;, &lt;strong&gt;fault-tolerant&lt;/strong&gt; file system that primarily caters to the needs of the MapReduce paradigm.&lt;/li&gt;
&lt;li&gt;Both HDFS and GFS were built to store very large files and scale to store petabytes of storage.&lt;/li&gt;
&lt;li&gt;Both were built for handling batch processing on huge data sets and were designed for data-intensive applications and not for end-users.&lt;/li&gt;
&lt;li&gt;Like GFS, &lt;strong&gt;HDFS is also not POSIX-compliant&lt;/strong&gt; and is not a mountable file system on its own. It is typically accessed via &lt;strong&gt;HDFS clients&lt;/strong&gt; or by using application programming interface (&lt;strong&gt;API&lt;/strong&gt;) calls from the Hadoop libraries.&lt;/li&gt;
&lt;li&gt;Given the HDFS design, the following applications are not a good fit for HDFS:&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;api&#34;&gt;API&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Provides &lt;strong&gt;user-level&lt;/strong&gt; APIs (and not standard POSIX-like APIs).&lt;/li&gt;
&lt;li&gt;Files are organized &lt;strong&gt;hierarchically&lt;/strong&gt; in directories and identified by their &lt;strong&gt;pathnames&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Supports the usual file system operations on files and directories. &lt;strong&gt;Create&lt;/strong&gt;, &lt;strong&gt;Delete&lt;/strong&gt;, &lt;strong&gt;Rename&lt;/strong&gt;, &lt;strong&gt;Move&lt;/strong&gt;, and &lt;strong&gt;Symbolic Links(unlike GFS)&lt;/strong&gt; etc.&lt;/li&gt;
&lt;li&gt;All read and write operations are done in an &lt;strong&gt;append-only&lt;/strong&gt; fashion.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;high-level-architecture&#34;&gt;High Level Architecture&lt;/h3&gt;
&lt;h3 id=&#34;hdfs-architecture&#34;&gt;HDFS Architecture&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Files are broken into &lt;strong&gt;128 MB&lt;/strong&gt; fixed-size blocks (configurable on a per-file basis).&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://pages.cs.wisc.edu/~akella/CS838/F15/838-CloudPapers/hdfs.pdf">Hadoop Distributed File System</a></p>
<hr>
<h2 id="hadoop-distributed-file-system">Hadoop Distributed File System</h2>
<h3 id="goal">Goal</h3>
<ul>
<li>Design a <strong>distributed</strong> system that can store <strong>huge files (terabyte and larger)</strong>. The system should be <strong>scalable</strong>, <strong>reliable</strong>, and <strong>highly available.</strong></li>
</ul>
<h3 id="what-is-hadoop-distributed-file-system">What is Hadoop Distributed File System</h3>
<ul>
<li><strong>HDFS</strong> is a distributed file system and was built to store <strong>unstructured data</strong>. It is designed to store huge files <strong>reliably</strong> and <strong>stream</strong> those files at high bandwidth to user applications.</li>
<li><strong>HDFS</strong> is a variant and a simplified version of the Google File System (<strong>GFS</strong>). A lot of HDFS architectural decisions are inspired by GFS design. HDFS is built around the idea that the most efficient <strong>data processing pattern</strong> is a <strong>write-once, read-many-times</strong> pattern.</li>
</ul>
<h3 id="background">Background</h3>
<ul>
<li>Apache Hadoop is a software framework that provides a distributed file storage system(HDFS) and distributed computing for analyzing and transforming very large data sets using the MapReduce programming model.</li>
<li><strong>HDFS</strong> is the <strong>default</strong> file storage system in <strong>Hadoop</strong>. It is designed to be a <strong>distributed</strong>, <strong>scalable</strong>, <strong>fault-tolerant</strong> file system that primarily caters to the needs of the MapReduce paradigm.</li>
<li>Both HDFS and GFS were built to store very large files and scale to store petabytes of storage.</li>
<li>Both were built for handling batch processing on huge data sets and were designed for data-intensive applications and not for end-users.</li>
<li>Like GFS, <strong>HDFS is also not POSIX-compliant</strong> and is not a mountable file system on its own. It is typically accessed via <strong>HDFS clients</strong> or by using application programming interface (<strong>API</strong>) calls from the Hadoop libraries.</li>
<li>Given the HDFS design, the following applications are not a good fit for HDFS:</li>
</ul>
<h3 id="api">API</h3>
<ul>
<li>Provides <strong>user-level</strong> APIs (and not standard POSIX-like APIs).</li>
<li>Files are organized <strong>hierarchically</strong> in directories and identified by their <strong>pathnames</strong>.</li>
<li>Supports the usual file system operations on files and directories. <strong>Create</strong>, <strong>Delete</strong>, <strong>Rename</strong>, <strong>Move</strong>, and <strong>Symbolic Links(unlike GFS)</strong> etc.</li>
<li>All read and write operations are done in an <strong>append-only</strong> fashion.</li>
</ul>
<h3 id="high-level-architecture">High Level Architecture</h3>
<h3 id="hdfs-architecture">HDFS Architecture</h3>
<ul>
<li>
<p>Files are broken into <strong>128 MB</strong> fixed-size blocks (configurable on a per-file basis).</p>
</li>
<li>
<p>File has two parts: the <strong>actual file data</strong> and the <strong>metadata</strong>.</p>
</li>
<li>
<p><strong>Metadata</strong></p>
</li>
<li>
<p>An HDFS cluster primarily consists of a <strong>NameNode</strong> (the Master in GFS) that <strong>manages the file system metadata</strong> and <strong>DataNodes</strong> (the Chunk Servers in GFS) that <strong>store the actual data</strong>.</p>
</li>
<li>
<p>All blocks of a file are of the same size except the last one.</p>
</li>
<li>
<p>HDFS uses large block sizes because it is designed to store extremely large files to enable MapReduce jobs to process them efficiently.</p>
</li>
<li>
<p>Each block is identified by a unique <strong>64-bit ID</strong> called a <strong>BlockID</strong> (similar to a chunk in GFS). All read/write operations in HDFS operate at the block level.</p>
</li>
<li>
<p>DataNodes store each block in a separate file on the local file system and provide read/write access.</p>
</li>
<li>
<p>When a DataNode starts up, it scans through its local file system and sends the list of hosted data blocks (called a BlockReport) to the NameNode (similar to how the Master gets state information from Chunk Servers in GFS).
<img loading="lazy" src="/images/notes/hdfs-2010/image-1.png"></p>
</li>
<li>
<p>The <strong>NameNode</strong> maintains two <strong>on-disk data structures</strong> to store the file system’s state: an <strong>FsImage</strong> file (the Operation Log checkpoint in GFS) and an <strong>EditLog</strong> (the Operation Log in GFS).</p>
</li>
<li>
<p><strong>FsImage</strong> is a <strong>checkpoint</strong> of the <strong>file system metadata</strong> at some point in time, while the <strong>EditLog</strong> is a log of all of the file system metadata transactions <strong>since the image file</strong> was last created. These two files help the NameNode to recover from failure.</p>
</li>
<li>
<p>User applications interact with HDFS through its client. HDFS Client interacts with NameNode for metadata, but all data transfers happen directly between the client and DataNodes.</p>
</li>
<li>
<p>To achieve high-availability, HDFS creates multiple copies of the data and distributes them on nodes throughout the cluster.</p>
</li>
</ul>
<h3 id="comparison-bw-gfs-and-hdfs">Comparison b/w GFS and HDFS</h3>
<p><img loading="lazy" src="/images/notes/hdfs-2010/image-2.png"></p>
<p><img loading="lazy" src="/images/notes/hdfs-2010/image-3.png"></p>
<p><img loading="lazy" src="/images/notes/hdfs-2010/image-4.png"></p>
<h3 id="deep-dive">Deep Dive</h3>
<h3 id="cluster-topology">Cluster Topology</h3>
<ul>
<li>Hadoop clusters typically have about 30 to 40 servers per rack.</li>
<li>Each rack has a dedicated <strong>gigabit switch</strong> that connects all of its servers and an <strong>uplink</strong> to a <strong>core switch or router</strong>, whose bandwidth is shared by many racks in the data center.</li>
<li>When HDFS is deployed on a cluster, each of its servers is configured and mapped to a particular rack. The network distance between servers is measured in hops, where one hop corresponds to one link in the topology.</li>
<li>Hadoop assumes a <strong>tree-style topology</strong>, and the distance between two servers is the sum of their distances to their closest common ancestor.
<img loading="lazy" src="/images/notes/hdfs-2010/image-5.png"></li>
</ul>
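<p>The tree-distance rule above can be written as a small helper (a sketch assuming node paths of the form <code>/datacenter/rack/node</code>, which is how Hadoop&rsquo;s topology scripts typically name locations):</p>

```python
def distance(a, b):
    """Network distance in a tree topology: the sum of each node's hops
    up to their closest common ancestor. Paths look like '/d1/r1/n1'."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    while common < min(len(pa), len(pb)) and pa[common] == pb[common]:
        common += 1
    return (len(pa) - common) + (len(pb) - common)
```

<p>Same node gives 0, same rack gives 2, and crossing racks in one data center gives 4.</p>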
<h3 id="rack-aware-replication">Rack aware replication</h3>
<ul>
<li>HDFS employs a <strong>rack-aware replica placement policy</strong> to improve data <strong>reliability</strong>, <strong>availability</strong>, and network <strong>bandwidth utilization</strong>.</li>
<li>The idea behind HDFS’s replica placement is to be able to <strong>tolerate node</strong> and <strong>rack failures.</strong></li>
<li>If the replication factor is three, HDFS attempts to place the first replica on the writer&rsquo;s node and the remaining two on two different nodes in a single remote rack.</li>
<li>This rack-aware replication scheme <strong>slows the write operation</strong>, as data needs to be replicated onto different racks; this is a tradeoff between reliability and performance.
<img loading="lazy" src="/images/notes/hdfs-2010/image-6.png"></li>
</ul>
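<p>The default placement policy can be sketched as follows (a simplified illustration: it assumes the first remote rack has at least two nodes and ignores load and free-space considerations that real HDFS also weighs):</p>

```python
def place_replicas(writer, racks):
    """Replication factor 3: replica 1 on the writer's node, replicas 2
    and 3 on two different nodes of a single remote rack.
    `racks` maps rack name -> list of node names."""
    writer_rack = next(r for r, nodes in racks.items() if writer in nodes)
    remote_rack = next(r for r in racks if r != writer_rack)
    return [writer] + racks[remote_rack][:2]
```

<p>Losing the writer&rsquo;s whole rack still leaves two replicas; losing the remote rack still leaves one.</p>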
<h3 id="synchronization-semantics">Synchronization Semantics</h3>
<ul>
<li>Early versions of HDFS followed <strong>strict immutable semantics</strong>. Once a file was written, it could never again be re-opened for writes; files could still be deleted.</li>
<li>Current versions of HDFS support <strong>append</strong>.</li>
<li>This design choice in HDFS was because most MapReduce workloads follow the <strong>write once and read many data-access</strong> patterns.</li>
<li>MapReduce is a restricted computational model with predefined stages. The reducers in MapReduce write independent files to HDFS as output. HDFS focuses on fast read access for multiple clients at a time.</li>
</ul>
<h3 id="hdfs-consistency-model">HDFS Consistency Model</h3>
<ul>
<li>HDFS follows a <strong>strong consistency</strong> model.</li>
<li>To ensure <strong>strong consistency</strong>, a write is declared <strong>successful</strong> only when <strong>all replicas have been written successfully</strong>.</li>
<li>HDFS does not allow multiple concurrent writers to write to an HDFS file, so implementing strong consistency becomes relatively easy.</li>
</ul>
<h3 id="anatomy-of-a-read-operation">Anatomy of a Read Operation</h3>
<h3 id="hdfs-read-process">HDFS Read Process</h3>
<ul>
<li><strong>(1)</strong> When a file is opened for reading, <strong>the HDFS client</strong> initiates a <strong>read</strong> request by calling the <strong>open()</strong> method of the <em><strong>Distributed FileSystem</strong></em> object. The client specifies the <strong>file name</strong>, <strong>start offset</strong>, and the <strong>read range</strong> length.</li>
<li><strong>(2)</strong> The <strong>Distributed FileSystem</strong> object calculates what blocks need to be read based on the given <strong>offset</strong> and <strong>range length</strong>, and requests the <strong>locations</strong> of the blocks from the <strong>NameNode</strong>.</li>
<li><strong>(3)</strong> <strong>NameNode</strong> has <strong>metadata</strong> for all blocks’ <strong>locations</strong>. It provides the client a <strong>list of blocks</strong> and the <strong>locations of each block replica</strong>. As the blocks are replicated, NameNode finds the <strong>closest replica to the client</strong> when providing a particular block’s location. The <strong>closest locality</strong> of each block is determined as follows:</li>
<li><strong>(4)</strong> After getting the block locations, the client calls the <strong>read()</strong> method of <em><strong>FSData InputStream</strong></em>,which takes care of all the interactions with the <strong>DataNodes</strong>.</li>
<li><strong>(5)</strong> Once the client invokes the <strong>read()</strong> method, the input stream object <strong>establishes a connection with the closest DataNode</strong> holding the first block of the file.</li>
<li><strong>(5b)</strong> The data is read in the form of <strong>streams</strong> and passed to the requesting application. Hence, the block <strong>does not have to be transferred in its entirety</strong> before the client application starts processing it.</li>
<li><strong>(6)</strong> Once the <em><strong>FSData InputStream</strong></em> receives all data of a block, it <strong>closes the connection</strong> and moves on to connect to the DataNode holding the <strong>next block</strong>. It repeats this process until it finishes reading all the required blocks of the file.</li>
<li><strong>(7)</strong> Once the client finishes reading all the required blocks, it calls the <strong>close()</strong> method of the input stream object.
<img loading="lazy" src="/images/notes/hdfs-2010/image-7.png"></li>
</ul>
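<p>Steps (1)&ndash;(7) reduce to a short loop when stripped of networking (a toy sketch: <code>namenode</code> and <code>datanodes</code> are plain dicts standing in for the real RPC endpoints, and replica selection is omitted):</p>

```python
def hdfs_read(namenode, datanodes, path, offset, length, block_size):
    """Toy read path: ask the NameNode which blocks cover the requested
    range, then stream each block directly from a DataNode."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    data = b""
    for block_id in namenode[path][first:last + 1]:
        # The NameNode would return replica locations sorted by proximity;
        # here every block lives in one flat `datanodes` store.
        data += datanodes[block_id]
    start = offset - first * block_size
    return data[start:start + length]
```

<p>Note that only the blocks overlapping the requested range are ever touched.</p>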
<h3 id="short-circuit-read">Short Circuit Read</h3>
<ul>
<li>If the data and the client are on the same machine, HDFS can directly read the file bypassing the DataNode. This scheme is called <strong>short circuit read</strong> and is quite efficient as it reduces overhead and other processing resources.</li>
</ul>
<h3 id="anatomy-of-a-write-process">Anatomy of a Write Process</h3>
<ul>
<li>HDFS client initiates a write request by calling the <strong>create()</strong> method of the <em><strong>Distributed FileSystem</strong></em> object.</li>
<li><em><strong>Distributed FileSystem</strong></em> object sends a file creation request to the NameNode.</li>
<li>NameNode verifies that the file does not already exist and that the client has permission to create the file. If both these conditions are verified, the NameNode creates a new file record and sends an acknowledgment.</li>
<li>Client proceeds to write the file using <em><strong>FSData OutputStream.</strong></em></li>
<li><em><strong>FSData OutputStream</strong></em> writes data to a local queue called <strong>‘Data Queue.’</strong> The data is kept in the queue until a complete block of data is accumulated.</li>
<li>Once the queue has a complete block, another component called <strong>DataStreamer</strong> is notified to manage data transfer to the DataNode.</li>
<li><strong>DataStreamer</strong> first asks the <strong>NameNode</strong> to allocate a new block on DataNodes, thereby picking desirable DataNodes to be used for replication.</li>
<li>The <strong>NameNode</strong> provides a <strong>list of blocks</strong> and the <strong>locations of each block replica</strong>.</li>
<li>Upon receiving the block locations from the NameNode, the <strong>DataStreamer</strong> starts transferring the <strong>blocks from the internal queue</strong> to the <strong>nearest DataNode</strong>.</li>
<li>Each block is written to the <strong>first DataNode</strong>, which then <strong>pipelines the block to other DataNodes</strong> in order <strong>to write replicas</strong> of the block.</li>
<li>Once the <strong>DataStreamer</strong> finishes writing all blocks, it <strong>waits for</strong> <strong>acknowledgments</strong> from all the DataNodes.</li>
<li>Once all acknowledgments are received, the client calls the <strong>close()</strong> method of the <em><strong>OutputStream</strong></em>.</li>
<li>Finally, the <em><strong>Distributed FileSystem</strong></em> contacts the NameNode to notify that the file write operation is complete. At this point, the NameNode commits the file creation operation, which makes the file available to be read.
<img loading="lazy" src="/images/notes/hdfs-2010/image-8.png"></li>
</ul>
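<p>The replication pipeline at the heart of the write path can be sketched recursively (illustrative names; the real pipeline streams packets rather than whole blocks and overlaps transfer with forwarding):</p>

```python
def pipeline_write(block, pipeline, stores):
    """The client hands the block to the first DataNode; each node stores
    its local replica and forwards downstream; the ack propagates back
    once every node in the pipeline has written its copy."""
    if not pipeline:
        return True                          # end of pipeline: ack flows back
    node, rest = pipeline[0], pipeline[1:]
    stores[node] = block                     # write the local replica
    return pipeline_write(block, rest, stores)
```
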
<h3 id="data-integrityblock-scanner--caching">Data Integrity(Block Scanner) &amp; Caching</h3>
<h3 id="data-integrity">Data Integrity</h3>
<ul>
<li>Data Integrity refers to ensuring the correctness of the data.</li>
<li>When a client retrieves a block from a DataNode, the data may arrive corrupted. This corruption can occur because of faults in the storage device, network, or the software itself.</li>
<li>HDFS client <strong>uses checksum to verify the file contents</strong>.</li>
<li>When a <strong>client stores a file</strong> in HDFS, it <strong>computes a checksum of each block</strong> of the file and <strong>stores these checksums in a separate hidden file</strong> in the same HDFS namespace.</li>
<li>When a <strong>client retrieves file contents</strong>, it <strong>verifies</strong> that the <strong>data</strong> it <strong>received</strong> from each DataNode <strong>matches</strong> the <strong>checksum stored in the associated checksum</strong> file.</li>
<li>If not, then the client can opt to retrieve that block from another replica.</li>
</ul>
<h3 id="block-scanner">Block Scanner</h3>
<ul>
<li>A block scanner process periodically runs on each DataNode to scan blocks stored on that DataNode and verify that the stored checksums match the block data.</li>
<li>Additionally, when a client reads a complete block and checksum verification succeeds, it informs the DataNode. The DataNode treats it as a verification of the replica.</li>
<li>Whenever a client or a block scanner detects a corrupt block, it notifies the NameNode.</li>
<li>The NameNode marks the replica as corrupt and initiates the process to create a new good replica of the block.</li>
</ul>
<h3 id="caching">Caching</h3>
<ul>
<li>Normally, blocks are read from the disk, but for frequently accessed files, blocks may be explicitly cached in the DataNode’s memory, in an <strong>off-heap block cache</strong>.</li>
<li>HDFS offers a <strong>Centralized Cache Management</strong> scheme to allow its <strong>clients to specify to the NameNode file paths which need to be cached</strong>.</li>
<li>NameNode communicates with the DataNodes that have the desired blocks on disk and instructs them to cache the blocks in <strong>off-heap caches.</strong></li>
<li><strong>Advantages</strong> of <strong>Centralized Cache</strong> management in HDFS:</li>
</ul>
<h3 id="fault-tolerance">Fault Tolerance</h3>
<h3 id="agenda">Agenda</h3>
<ul>
<li>How does HDFS handle DataNode failures?</li>
<li>What happens when the NameNode fails?</li>
</ul>
<h3 id="how-does-hdfs-handle-datanode-failures">How does HDFS handle DataNode failures?</h3>
<h3 id="replication">Replication</h3>
<ul>
<li>As blocks are replicated to multiple DataNodes (default: 3 replicas), if one DataNode becomes inaccessible, its data can be read from the other replicas.</li>
</ul>
<h3 id="heartbeat">HeartBeat</h3>
<ul>
<li>The NameNode keeps track of DataNodes through a heartbeat mechanism. Each DataNode sends periodic heartbeat messages (every few seconds) to the NameNode.</li>
<li>If a DataNode dies, the heartbeats will stop, and the NameNode will detect that the DataNode has died. The NameNode will then mark the DataNode as dead and will no longer forward any read/write request to that DataNode.</li>
<li>Because of replication, the blocks stored on that DataNode have additional replicas on other DataNodes.</li>
<li>The NameNode performs regular status checks on the file system to discover under-replicated blocks and performs a <strong>cluster rebalance</strong> process to replicate blocks that have less than the desired number of replicas.</li>
</ul>
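<p>The heartbeat-based failure detection and under-replication check can be sketched as follows. This is a toy model, not HDFS code, and the timeout value is illustrative (HDFS by default waits on the order of ten minutes of missed heartbeats before declaring a DataNode dead).</p>

```python
HEARTBEAT_TIMEOUT = 10.0  # illustrative; real HDFS uses ~10 minutes by default

class HeartbeatMonitor:
    """Toy NameNode-side view: mark DataNodes dead when heartbeats stop,
    then find blocks with fewer live replicas than desired."""

    def __init__(self, replication=3):
        self.replication = replication
        self.last_seen = {}   # DataNode name -> last heartbeat timestamp
        self.replicas = {}    # block ID -> set of DataNodes holding it

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def live_nodes(self, now):
        return {dn for dn, t in self.last_seen.items()
                if now - t <= HEARTBEAT_TIMEOUT}

    def under_replicated(self, now):
        # Blocks whose live replica count fell below the target are
        # candidates for the cluster-rebalance / re-replication process.
        live = self.live_nodes(now)
        return {blk for blk, dns in self.replicas.items()
                if len(dns & live) < self.replication}

mon = HeartbeatMonitor(replication=2)
mon.replicas = {"blk_1": {"dn1", "dn2"}, "blk_2": {"dn2", "dn3"}}
for dn in ("dn1", "dn2", "dn3"):
    mon.heartbeat(dn, now=0)

mon.heartbeat("dn2", now=60)  # only dn2 keeps heartbeating
assert mon.live_nodes(now=60) == {"dn2"}
assert mon.under_replicated(now=60) == {"blk_1", "blk_2"}
```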
<h3 id="what-happens-when-the-namenode-fails">What happens when the NameNode fails?</h3>
<h3 id="fsimage-and-editlog">FsImage and EditLog</h3>
<ul>
<li>The NameNode is a single point of failure (SPOF); if it fails, the entire file system becomes unavailable.</li>
<li>Internally, the NameNode maintains two <strong>on-disk data structures</strong> that store the file system’s state: an <strong>FsImage file</strong> and an <strong>EditLog</strong>. <strong>FsImage</strong> is a <strong>checkpoint</strong> (or the image) of the file system metadata at some point in time, while the <strong>EditLog</strong> is a log of all of the file system metadata transactions since the image file was last created.</li>
<li>All incoming changes to the file system metadata are written to the <strong>EditLog</strong>.</li>
<li>At periodic intervals, the <strong>EditLog</strong> and <strong>FsImage</strong> files are merged to create a new image file snapshot, and the edit log is cleared out.</li>
</ul>
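<p>The FsImage/EditLog interplay can be sketched as a tiny write-ahead-log model. All names are illustrative, not HDFS code: mutations are appended to the log before being applied in memory, a checkpoint serializes the current state and clears the log, and recovery loads the image and replays the remaining edits.</p>

```python
import json

class MiniNameNode:
    """Toy sketch of the FsImage/EditLog split."""

    def __init__(self):
        self.metadata = {}   # in-memory filesystem state (path -> size)
        self.edit_log = []   # would live on disk in real HDFS

    def apply(self, op):
        path, size = op
        self.metadata[path] = size

    def mutate(self, path, size):
        self.edit_log.append((path, size))   # log first (write-ahead)
        self.apply((path, size))

    def checkpoint(self):
        fsimage = json.dumps(self.metadata)  # snapshot of current state
        self.edit_log = []                   # log cleared after the merge
        return fsimage

    @classmethod
    def recover(cls, fsimage, edit_log):
        nn = cls()
        nn.metadata = json.loads(fsimage)
        for op in edit_log:                  # replay edits since checkpoint
            nn.apply(op)
        return nn

nn = MiniNameNode()
nn.mutate("/a", 1)
nn.mutate("/b", 2)
image = nn.checkpoint()
nn.mutate("/c", 3)                           # lands in the fresh edit log

restored = MiniNameNode.recover(image, nn.edit_log)
assert restored.metadata == {"/a": 1, "/b": 2, "/c": 3}
```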
<h3 id="metadata-backup">Metadata backup</h3>
<ul>
<li>On a <strong>NameNode failure</strong>, <strong>the metadata would be unavailable</strong>, and <strong>a disk failure</strong> on the NameNode would be catastrophic: the <strong>file metadata would be permanently lost</strong>, since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes.</li>
<li>Thus, it is crucial to make the NameNode resilient to failure, and HDFS provides <strong>two</strong> mechanisms for this:
<img loading="lazy" src="/images/notes/hdfs-2010/image-9.png"></li>
</ul>
<h3 id="hdfs-high-availability">HDFS High Availability</h3>
<h3 id="agenda-1">Agenda</h3>
<ul>
<li>HDFS high availability architecture</li>
<li>Failover and fencing</li>
</ul>
<h3 id="hdfs-high-availability-architecture">HDFS high availability architecture</h3>
<h3 id="problem">Problem</h3>
<ul>
<li>Although NameNode’s metadata is copied to multiple file systems to protect against <strong>data loss</strong>, it still does not provide high availability of the filesystem.</li>
<li>If the NameNode fails, no clients will be able to read, write, or list files, because the NameNode is the sole repository of the metadata and the file-to-block mapping.</li>
<li>In such an event, the whole Hadoop system would effectively be out of service until a new NameNode is brought online.</li>
<li>To recover from a failed NameNode, an administrator starts a new primary NameNode with one of the filesystem metadata replicas and configures DataNodes and clients to use this new NameNode.</li>
<li>The new NameNode is not able to serve requests until it has loaded its namespace image into memory, replayed its edit log, and received enough block reports from the DataNodes to leave safe mode.</li>
<li>On large clusters with many files and blocks, it can take half an hour or more to perform such a cold start of a NameNode.</li>
<li>Furthermore, this long recovery time is a problem for routine maintenance.</li>
</ul>
<h3 id="solution">Solution</h3>
<ul>
<li>Hadoop 2.0 added support for High Availability (HA).</li>
<li>There are two (or more) NameNodes in an active-standby configuration.</li>
<li>The active NameNode is responsible for all client operations in the cluster, while the Standby simply acts as a follower of the active, maintaining enough state to provide a fast failover when required.</li>
<li>For the Standby nodes to keep their state synchronized with the active node, HDFS made a few architectural changes:</li>
</ul>
<h3 id="quorum-journal-managerqjm">Quorum Journal Manager(QJM)</h3>
<ul>
<li>Provides a <strong>highly available EditLog</strong>.</li>
<li>QJM runs as a group of journal nodes (usually three, of which one can fail), and each edit must be written to a quorum (majority) of the journal nodes.</li>
<li>Similar to the way ZooKeeper works, except QJM doesn&rsquo;t use ZooKeeper.</li>
<li>HDFS High Availability does use ZooKeeper for electing the active NameNode (<strong>Master Election</strong>).</li>
<li>The QJM process runs on all NameNodes and communicates all EditLog changes to journal nodes using RPC.</li>
<li>Since the Standby NameNodes have the latest state of the metadata available in memory (both the latest EditLog and an up-to-date block mapping), any standby can take over very quickly (in a few seconds) if the active NameNode fails.</li>
<li>However, the actual failover time will be longer in practice (<strong>around a minute</strong>) because the system needs to be conservative in deciding that the active NameNode has failed (<strong>Failure Detection</strong>).</li>
<li>In the unlikely event of the Standbys being down when the active fails, the administrator can still do a cold start of a Standby. This is no worse than the non-HA case.</li>
</ul>
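<p>The quorum rule can be sketched in a few lines. This is a toy model, not the QJM implementation: an edit counts as committed only when a majority of journal nodes have acknowledged it, so the system stays writable as long as fewer than half the journals are down.</p>

```python
class JournalNode:
    def __init__(self, up=True):
        self.up = up
        self.edits = []

    def write(self, edit) -> bool:
        if not self.up:
            return False
        self.edits.append(edit)
        return True

def quorum_write(journals, edit) -> bool:
    """An edit succeeds only when a majority of journal nodes accept it.
    (A real QJM would also resynchronize lagging journals afterwards.)"""
    acks = sum(1 for j in journals if j.write(edit))
    return acks > len(journals) // 2

journals = [JournalNode(), JournalNode(), JournalNode(up=False)]
assert quorum_write(journals, "mkdir /a")        # 2 of 3 acked: committed

journals[1].up = False
assert not quorum_write(journals, "mkdir /b")    # 1 of 3 acked: not committed
```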
<h3 id="zookeeper">Zookeeper</h3>
<ul>
<li>The ZKFailoverController (ZKFC) is a ZooKeeper client that runs on each NameNode and is responsible for coordinating with ZooKeeper and for monitoring and managing the state of its NameNode.
<img loading="lazy" src="/images/notes/hdfs-2010/image-10.png"></li>
</ul>
<h3 id="failover-and-fencing">Failover and fencing</h3>
<ul>
<li>A <strong>Failover Controller</strong> manages the transition from the active NameNode to the Standby. The default implementation of the failover controller uses <strong>ZooKeeper</strong> to ensure that only <strong>one NameNode is active (Single Leader)</strong>. The Failover Controller runs as a lightweight process on each NameNode, monitors the NameNode for failures (<strong>Failure Detection</strong> using heartbeats), and triggers a failover when the active NameNode fails (<strong>New Leader Election</strong>).</li>
<li><strong>Graceful failover:</strong> For routine maintenance, an administrator can manually initiate a failover. This is known as a graceful failover, since the failover controller arranges an orderly transition from the active NameNode to the Standby.</li>
<li><strong>Ungraceful failover:</strong> In the case of an ungraceful failover, however, it is impossible to be sure that the failed NameNode has stopped running. For example, a slow network or a network partition can trigger a failover transition, even though the previously active NameNode is still running and thinks it is still the active NameNode.</li>
<li>The <strong>HA implementation</strong> uses the mechanism of <strong>Fencing</strong> to prevent this <strong>“split-brain”</strong> scenario and ensure that the previously active NameNode is prevented from doing any damage and causing corruption.</li>
</ul>
<h3 id="fencing">Fencing</h3>
<ul>
<li><strong>Fencing</strong> is the idea of putting a fence around a <strong>previously active NameNode (Old Leader)</strong> so that it cannot access cluster resources and stops serving any read/write requests. <strong>Two</strong> fencing techniques are commonly used: killing the old NameNode&rsquo;s process (e.g., over SSH) and revoking its access to the shared storage directory.</li>
</ul>
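<p>One way fencing is realized in practice is with epoch numbers, which is essentially how QJM keeps a deposed NameNode from writing: each newly elected active NameNode obtains a higher epoch, and journal nodes reject writes stamped with an older epoch. A minimal sketch (names are illustrative):</p>

```python
class FencedJournal:
    """Sketch of epoch-based fencing: journals remember the highest epoch
    they have promised to, and reject writes from any older epoch, so a
    previously active NameNode ("old leader") can do no damage."""

    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch) -> bool:
        # A new active NameNode must present a strictly higher epoch.
        if epoch <= self.promised_epoch:
            return False
        self.promised_epoch = epoch
        return True

    def write(self, epoch, edit) -> bool:
        if epoch < self.promised_epoch:
            return False             # stale writer is fenced off
        self.edits.append(edit)
        return True

journal = FencedJournal()
assert journal.new_epoch(1)                      # NameNode A becomes active
assert journal.write(1, "op from A")

assert journal.new_epoch(2)                      # failover: B takes over
assert not journal.write(1, "late op from A")    # A's writes are rejected
assert journal.write(2, "op from B")
```

This prevents the split-brain corruption described above even when the old active NameNode is still running and believes it is the leader.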
<h3 id="hdfs-characteristics">HDFS Characteristics</h3>
<p>Explore some important aspects of HDFS architecture.</p>
<h3 id="agenda-2">Agenda</h3>
<ul>
<li>Security and permission</li>
<li>HDFS federation</li>
<li>Erasure coding</li>
<li>HDFS in practice</li>
</ul>
<h3 id="security-and-permission">Security and permission</h3>
<ul>
<li>Permission model for files and directories similar to POSIX.</li>
<li>Each file and directory is associated with an <strong>owner</strong> and a <strong>group</strong>, with separate permissions for the owner, group members, and others, as in POSIX.</li>
<li>The same three permission types as POSIX: read (r), write (w), and execute (x).</li>
<li><strong>Optional support for POSIX ACLs</strong> to augment file permissions with finer-grained rules for named specific users or groups.</li>
</ul>
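<p>The owner/group/other check can be sketched as follows. This is a toy model of the POSIX-style rule, not the HDFS implementation: exactly one of the three permission classes applies, chosen by whether the requester is the owner, a member of the file&rsquo;s group, or neither.</p>

```python
def check_access(mode, file_owner, file_group, user, user_groups, want):
    """POSIX-style check: pick the owner, group, or 'other' bits for this
    user and test the requested access ('r', 'w', or 'x')."""
    if user == file_owner:
        bits = (mode >> 6) & 0o7       # owner bits
    elif file_group in user_groups:
        bits = (mode >> 3) & 0o7       # group bits
    else:
        bits = mode & 0o7              # other bits
    mask = {"r": 0o4, "w": 0o2, "x": 0o1}[want]
    return bool(bits & mask)

# A file with mode rw-r----- (0o640) owned by alice:analysts
assert check_access(0o640, "alice", "analysts", "alice", [], "w")
assert check_access(0o640, "alice", "analysts", "bob", ["analysts"], "r")
assert not check_access(0o640, "alice", "analysts", "bob", ["analysts"], "w")
assert not check_access(0o640, "alice", "analysts", "carol", [], "r")
```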
<h3 id="hdfs-federationnamenode-partitioning">HDFS federation(NameNode Partitioning)</h3>
<ul>
<li>The NameNode keeps the whole metadata in memory, which becomes a performance bottleneck for extremely large clusters, as does serving all metadata requests from a single node.</li>
<li>To solve this problem, <strong>HDFS Federation</strong> was introduced in HDFS 2.x.</li>
<li>It allows a cluster to scale by adding <strong>NameNodes</strong>, each of which manages a portion of the filesystem namespace, e.g., /user managed by NN1 and /root by NN2.</li>
<li>Under federation, multiple NameNodes can generate the same 64-bit block ID for their blocks.</li>
<li>To avoid this collision, each namespace uses one or more <strong>Block Pools</strong>, where a unique ID identifies each block pool in the cluster.</li>
<li>A block pool belongs to a single namespace and does not cross the namespace boundary.</li>
<li>The extended block ID, which is a tuple of <strong>(Block Pool ID, Block ID)</strong>, is used for block identification in HDFS Federation.</li>
</ul>
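<p>The extended block ID can be sketched as a simple tuple type. The pool IDs below are illustrative, not real HDFS identifiers: two NameNodes may assign the same 64-bit block ID, but the (block pool ID, block ID) pairs remain distinct.</p>

```python
from typing import NamedTuple

class ExtendedBlockID(NamedTuple):
    block_pool_id: str   # unique per block pool across the whole cluster
    block_id: int        # unique only within its own pool

# Two NameNodes happen to assign the same 64-bit block ID...
blk_nn1 = ExtendedBlockID("BP-1-ns-user", 0x1A2B)
blk_nn2 = ExtendedBlockID("BP-2-ns-root", 0x1A2B)

# ...but the extended IDs still don't collide.
assert blk_nn1.block_id == blk_nn2.block_id
assert blk_nn1 != blk_nn2
```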
<h3 id="erasure-coding">Erasure coding</h3>
<ul>
<li>By default, HDFS stores three copies of each block, resulting in a 200% overhead (to store two extra copies) in storage space and other resources (e.g., network bandwidth).</li>
<li><strong>Erasure Coding (EC)</strong> provides the same level of fault tolerance with much less storage space. In a typical EC setup, the storage overhead is no more than 50%.</li>
<li>This effectively <strong>doubles the usable storage capacity</strong> by bringing the storage factor down <strong>from 3x to 1.5x</strong>.</li>
<li>Under EC, data is broken down into fragments, expanded, encoded with redundant data pieces, and stored across different DataNodes.</li>
<li>If, at some point, data is lost on a DataNode due to corruption, etc., then it can be reconstructed using the other fragments stored on other DataNodes.</li>
<li>Although EC is <strong>more CPU intensive</strong>, it greatly reduces the storage needed for reliably storing a large data set.</li>
</ul>
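<p>The fragment/parity idea can be illustrated with the simplest possible code: a single XOR parity fragment, which tolerates one lost fragment. This is only a sketch of the principle; real HDFS uses Reed&ndash;Solomon schemes such as RS(6,3) (6 data + 3 parity fragments = 50% overhead, tolerating any 3 losses).</p>

```python
def xor_parity(fragments):
    """XOR all fragments together to produce one parity fragment."""
    parity = bytes(len(fragments[0]))
    for f in fragments:
        parity = bytes(a ^ b for a, b in zip(parity, f))
    return parity

def reconstruct(surviving, parity):
    """Recover the single missing data fragment: XOR of the survivors
    and the parity yields the lost fragment."""
    return xor_parity(surviving + [parity])

data = [b"aaaa", b"bbbb", b"cccc"]     # fragments spread across DataNodes
parity = xor_parity(data)              # stored on a fourth DataNode

lost = data[1]                         # the DataNode holding it failed
recovered = reconstruct([data[0], data[2]], parity)
assert recovered == lost
```

Storage arithmetic for this toy scheme: 3 data + 1 parity fragment is a 33% overhead, versus 200% for 3x replication, at the cost of CPU work on reconstruction.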
<h3 id="hdfs-in-practice">HDFS in practice</h3>
<ul>
<li>HDFS was primarily designed to support Hadoop MapReduce jobs by providing a distributed file system for Map and Reduce operations.</li>
<li>HDFS is now used with many big-data tools, e.g., several Apache projects built on top of Hadoop, including Pig, Hive, HBase, and Giraph, as well as GraphLab.</li>
<li><strong>Advantages of HDFS?</strong></li>
<li><strong>Disadvantages of HDFS?</strong></li>
</ul>
<h3 id="summary">Summary</h3>
<ul>
<li>Scalable distributed file system for large distributed data intensive applications.</li>
<li>Uses commodity hardware to reduce infrastructure costs.</li>
<li>POSIX-like (but not POSIX-compatible) APIs for file operations.</li>
<li>Random writes are not possible. Append-Only.</li>
<li>Unlike GFS, HDFS doesn&rsquo;t support multiple concurrent writers appending to the same file.</li>
<li>Single NameNode and Multiple DataNodes in initial architecture.</li>
<li>Files are broken into 128 MB Blocks identified by 64-bit Globally unique block ID.</li>
<li>Blocks are replicated to multiple machines (default 3, configurable) to provide redundancy: a 200% storage overhead, which can be reduced to 50% by using <strong>Erasure Coding</strong>.</li>
<li>DataNodes store blocks on local disks as Linux files.</li>
<li>The NameNode is the coordinator for the HDFS cluster and keeps track of all filesystem metadata.</li>
<li>The NameNode keeps all metadata in memory (for faster access). For fault tolerance (node crash), in-memory metadata changes are written to a Write-Ahead Log (<strong>EditLog</strong>). For disk-crash tolerance, the EditLog can be replicated to a remote file system (NFS), to the QJM (Quorum Journal Manager) in V2, or to a Secondary NameNode in V1.</li>
<li>The NameNode doesn&rsquo;t persist block replica locations. Instead, each DataNode reports which block replicas it holds at NameNode startup or when it joins the cluster, and the NameNode tracks DataNode liveness via heartbeats.</li>
<li><strong>FsImage</strong>: The NameNode checkpoints the EditLog into an <strong>FsImage</strong>, which is serialized to disk and replicated to other nodes, so on failover or NameNode restart it can quickly rebuild its state from the checkpoint plus the subsequent EditLog.</li>
<li>User applications interact with HDFS through the HDFS client, which contacts the NameNode for metadata and talks directly to DataNodes for read/write operations.</li>
<li>DataNodes and clients use <strong>checksums</strong> to validate the <strong>data integrity</strong> of blocks and inform the NameNode so that corrupted replicas can be repaired.</li>
<li><strong>Lazy Collection</strong>: A deleted file is renamed to a hidden name to be garbage-collected later.</li>
<li>HDFS is a <strong>strongly consistent</strong> FS: a write is declared successful only once it has been replicated to all the replicas.</li>
<li>Cache: For frequently accessed files, clients can register file paths with the NameNode so that their blocks are explicitly cached in the DataNodes&rsquo; memory in an <strong>off-heap block cache</strong> (GFS just uses Linux&rsquo;s <strong>buffer cache</strong>).</li>
</ul>
<h3 id="system-design-patterns">System Design Patterns</h3>
<ul>
<li><strong>Write Ahead Log</strong> - Fault Tolerance/Reliability.</li>
<li><strong>HeartBeat</strong></li>
<li><strong>Split Brain</strong></li>
<li><strong>CheckSum</strong> - Data Integrity.</li>
</ul>
<h3 id="reference">Reference</h3>
<ul>
<li>HDFS Paper</li>
<li>HDFS High Availability</li>
<li>HDFS Architecture</li>
<li>Distributed File Systems: A Survey</li>
</ul>
<hr>
<p><strong>Paper Link:</strong> <a href="https://pages.cs.wisc.edu/~akella/CS838/F15/838-CloudPapers/hdfs.pdf">https://pages.cs.wisc.edu/~akella/CS838/F15/838-CloudPapers/hdfs.pdf</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
