<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>File-System on Hemant Sethi</title>
    <link>https://www.sethihemant.com/tags/file-system/</link>
    <description>Recent content in File-System on Hemant Sethi</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 08 Dec 2024 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://www.sethihemant.com/tags/file-system/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Google File System</title>
      <link>https://www.sethihemant.com/notes/gfs-2003/</link>
      <pubDate>Sun, 08 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/gfs-2003/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf&#34;&gt;Google File System&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;google-file-system--distributed-file-system&#34;&gt;Google File System / Distributed File System&lt;/h2&gt;
&lt;h3 id=&#34;goal&#34;&gt;Goal&lt;/h3&gt;
&lt;p&gt;Design a &lt;strong&gt;distributed file system&lt;/strong&gt; to store huge files (terabyte and larger). The system should be &lt;strong&gt;scalable&lt;/strong&gt;, &lt;strong&gt;reliable&lt;/strong&gt;, and &lt;strong&gt;highly available&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Developed by Google for its large data-intensive applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;background&#34;&gt;Background&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;GFS was built for handling batch processing on large data sets and is designed for system-to-system interaction, not user-to-system interaction.&lt;/li&gt;
&lt;li&gt;Was designed with following goals in mind:&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;gfs-use-cases&#34;&gt;GFS Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Built for distributed data-intensive applications like &lt;strong&gt;Gmail&lt;/strong&gt; or &lt;strong&gt;Youtube&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Google’s BigTable uses GFS to store &lt;strong&gt;log files&lt;/strong&gt; and &lt;strong&gt;data files&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;apis&#34;&gt;APIs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;GFS doesn’t provide a standard POSIX-like API. Instead, user-level APIs are provided.&lt;/li&gt;
&lt;li&gt;Files organized hierarchically in directories and identified by their path names.&lt;/li&gt;
&lt;li&gt;Supports usual file system operations:&lt;/li&gt;
&lt;li&gt;Additional Special Operations&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;high-level-architecture&#34;&gt;High Level Architecture&lt;/h3&gt;
&lt;h3 id=&#34;agenda&#34;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chunks&lt;/li&gt;
&lt;li&gt;Chunk Handle&lt;/li&gt;
&lt;li&gt;Cluster&lt;/li&gt;
&lt;li&gt;Chunk Server&lt;/li&gt;
&lt;li&gt;Master&lt;/li&gt;
&lt;li&gt;Client
A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple clients.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chunk&#34;&gt;Chunk&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;As files stored in GFS tend to be very large, GFS breaks files into multiple fixed-size chunks where each chunk is 64 megabytes in size.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chunk-handle&#34;&gt;Chunk Handle&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each chunk is identified by an immutable and globally unique 64-bit ID number called the chunk handle, allowing 2^64 unique chunks.&lt;/li&gt;
&lt;li&gt;Total addressable storage = 2^64 * 64 MB = 2^90 bytes, on the order of 10^9 exabytes&lt;/li&gt;
&lt;li&gt;Files are split into chunks, so the job of GFS is to provide a mapping from files to chunks and then to support standard operations on files, mapping them down to operations on individual chunks.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;cluster&#34;&gt;Cluster&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;GFS is organized into a network of computers(nodes) called a cluster. A GFS cluster contains 3 types of entities:
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/gfs-2003/image-1.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chunk-server&#34;&gt;Chunk Server&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Nodes that store chunks on local disks as Linux files.&lt;/li&gt;
&lt;li&gt;They read or write chunk data specified by a chunk handle and byte range.&lt;/li&gt;
&lt;li&gt;For reliability, each chunk is replicated to multiple chunk servers.&lt;/li&gt;
&lt;li&gt;By default, GFS stores &lt;strong&gt;three replicas&lt;/strong&gt;, though different replication factors can be specified on a &lt;strong&gt;per-file&lt;/strong&gt; basis.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/gfs-2003/image-2.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;master&#34;&gt;Master&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Coordinator&lt;/strong&gt; of GFS cluster. Responsible for &lt;strong&gt;keeping track of filesystem metadata&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Metadata stored at master includes:&lt;/li&gt;
&lt;li&gt;Master also &lt;strong&gt;controls system-wide activities&lt;/strong&gt; such as:&lt;/li&gt;
&lt;li&gt;Periodically &lt;strong&gt;communicates with each ChunkServer&lt;/strong&gt; in &lt;strong&gt;HeartBeat messages&lt;/strong&gt; to give it instructions and collect its state.&lt;/li&gt;
&lt;li&gt;For performance and fast random access, &lt;strong&gt;all metadata is stored in the master’s main memory&lt;/strong&gt;, i.e. entire filesystem namespace as well as all the name-to-chunk mappings.&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;fault tolerance&lt;/strong&gt; and &lt;strong&gt;to handle a master crash&lt;/strong&gt;, all metadata changes(every operation to File System) are written to the disk onto an &lt;strong&gt;operation log(similar to Journal)&lt;/strong&gt; which is &lt;strong&gt;replicated&lt;/strong&gt; to remote machines.&lt;/li&gt;
&lt;li&gt;The benefit of having a single, centralized master is that it has a &lt;strong&gt;global view of the file system&lt;/strong&gt;, and hence, it can make optimum management decisions, for example, related to chunk placement.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;client&#34;&gt;Client&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Application/Entity that makes read/write requests to GFS using GFS Client library.&lt;/li&gt;
&lt;li&gt;This library communicates with the master for all metadata-related operations like creating or deleting files, looking up files, etc.&lt;/li&gt;
&lt;li&gt;To read or write data, the client(library) interacts directly with the ChunkServers that hold the data.&lt;/li&gt;
&lt;li&gt;Neither the client nor the ChunkServer caches file data.&lt;/li&gt;
&lt;li&gt;ChunkServers rely on the &lt;strong&gt;buffer cache&lt;/strong&gt; in Linux to maintain frequently accessed data in memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;single-master-and-large-chunk-size&#34;&gt;Single Master and Large Chunk Size&lt;/h3&gt;
&lt;h3 id=&#34;agenda-1&#34;&gt;Agenda&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Single Master&lt;/li&gt;
&lt;li&gt;Chunk Size&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;single-master&#34;&gt;Single Master&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Having a single master vastly &lt;strong&gt;simplifies GFS design&lt;/strong&gt; and enables the master to make &lt;strong&gt;sophisticated chunk placement&lt;/strong&gt; and &lt;strong&gt;replication decisions&lt;/strong&gt; using &lt;strong&gt;global knowledge&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;GFS minimizes the master’s involvement in reads and writes, so that it does not become a bottleneck.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/gfs-2003/image-3.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;chunk-size&#34;&gt;Chunk Size&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;GFS has chosen 64 MB, which is much larger than typical filesystem block sizes (which are often around 4KB). One of the key design parameters.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Advantages of large chunk size&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;lazy-space-allocation&#34;&gt;Lazy space Allocation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Each chunk replica is stored as a plain Linux file on a ChunkServer. GFS does not allocate the whole 64 MB of disk space when creating a chunk. Instead, as the client appends data, the ChunkServer lazily extends the chunk.&lt;/li&gt;
&lt;li&gt;One disadvantage of having a large chunk size is the handling of small files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;metadata&#34;&gt;Metadata&lt;/h3&gt;
&lt;p&gt;Let&amp;rsquo;s explore how GFS manages file system metadata.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">Google File System</a></p>
<hr>
<h2 id="google-file-system--distributed-file-system">Google File System / Distributed File System</h2>
<h3 id="goal">Goal</h3>
<p>Design a <strong>distributed file system</strong> to store huge files (terabyte and larger). The system should be <strong>scalable</strong>, <strong>reliable</strong>, and <strong>highly available</strong>.</p>
<ul>
<li>Developed by Google for its large data-intensive applications.</li>
</ul>
<h3 id="background">Background</h3>
<ul>
<li>GFS was built for handling batch processing on large data sets and is designed for system-to-system interaction, not user-to-system interaction.</li>
<li>It was designed with Google’s application workloads and technological environment in mind.</li>
</ul>
<h3 id="gfs-use-cases">GFS Use Cases</h3>
<ul>
<li>Built for distributed data-intensive applications like <strong>Gmail</strong> or <strong>Youtube</strong>.</li>
<li>Google’s BigTable uses GFS to store <strong>log files</strong> and <strong>data files</strong>.</li>
</ul>
<h3 id="apis">APIs</h3>
<ul>
<li>GFS doesn’t provide a standard POSIX-like API. Instead, user-level APIs are provided.</li>
<li>Files are organized hierarchically in directories and identified by their path names.</li>
<li>Supports the usual file system operations: create, delete, open, close, read, and write.</li>
<li>Additional special operations: snapshot and record append.</li>
</ul>
<h3 id="high-level-architecture">High Level Architecture</h3>
<h3 id="agenda">Agenda</h3>
<ul>
<li>Chunks</li>
<li>Chunk Handle</li>
<li>Cluster</li>
<li>Chunk Server</li>
<li>Master</li>
<li>Client</li>
</ul>
<p>A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple clients.</p>
<h3 id="chunk">Chunk</h3>
<ul>
<li>As files stored in GFS tend to be very large, GFS breaks files into multiple fixed-size chunks where each chunk is 64 megabytes in size.</li>
</ul>
<h3 id="chunk-handle">Chunk Handle</h3>
<ul>
<li>Each chunk is identified by an immutable and globally unique 64-bit ID number called the chunk handle, allowing 2^64 unique chunks.</li>
<li>Total addressable storage = 2^64 * 64 MB = 2^90 bytes, on the order of 10^9 exabytes.</li>
<li>Files are split into chunks, so the job of GFS is to provide a mapping from files to chunks and then to support standard operations on files, mapping them down to operations on individual chunks.</li>
</ul>
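<p>A quick sanity check of that capacity figure (a back-of-the-envelope sketch, not part of GFS itself):</p>

```python
# 64-bit chunk handles with 64 MB chunks give 2^90 addressable bytes,
# which is on the order of a billion (decimal) exabytes.
CHUNK_SIZE = 64 * 2**20        # 64 MB in bytes
NUM_HANDLES = 2**64            # unique 64-bit chunk handles

total_bytes = NUM_HANDLES * CHUNK_SIZE   # 2^90 bytes
total_exabytes = total_bytes / 10**18    # decimal exabytes

print(f"{total_exabytes:.2e} EB")
```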
<h3 id="cluster">Cluster</h3>
<ul>
<li>GFS is organized into a network of computers (nodes) called a cluster. A GFS cluster contains 3 types of entities:
<img loading="lazy" src="/images/notes/gfs-2003/image-1.png"></li>
</ul>
<h3 id="chunk-server">Chunk Server</h3>
<ul>
<li>Nodes that store chunks on local disks as Linux files.</li>
<li>They read or write chunk data specified by a chunk handle and byte range.</li>
<li>For reliability, each chunk is replicated to multiple chunk servers.</li>
<li>By default, GFS stores <strong>three replicas</strong>, though different replication factors can be specified on a <strong>per-file</strong> basis.
<img loading="lazy" src="/images/notes/gfs-2003/image-2.png"></li>
</ul>
<h3 id="master">Master</h3>
<ul>
<li><strong>Coordinator</strong> of GFS cluster. Responsible for <strong>keeping track of filesystem metadata</strong>.</li>
<li>Metadata stored at the master includes the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas.</li>
<li>The master also <strong>controls system-wide activities</strong> such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between ChunkServers.</li>
<li>Periodically <strong>communicates with each ChunkServer</strong> in <strong>HeartBeat messages</strong> to give it instructions and collect its state.</li>
<li>For performance and fast random access, <strong>all metadata is stored in the master’s main memory</strong>, i.e. entire filesystem namespace as well as all the name-to-chunk mappings.</li>
<li>For <strong>fault tolerance</strong> and <strong>to handle a master crash</strong>, all metadata changes(every operation to File System) are written to the disk onto an <strong>operation log(similar to Journal)</strong> which is <strong>replicated</strong> to remote machines.</li>
<li>The benefit of having a single, centralized master is that it has a <strong>global view of the file system</strong>, and hence, it can make optimum management decisions, for example, related to chunk placement.</li>
</ul>
<h3 id="client">Client</h3>
<ul>
<li>Application/Entity that makes read/write requests to GFS using GFS Client library.</li>
<li>This library communicates with the master for all metadata-related operations like creating or deleting files, looking up files, etc.</li>
<li>To read or write data, the client(library) interacts directly with the ChunkServers that hold the data.</li>
<li>Neither the client nor the ChunkServer caches file data explicitly.</li>
<li>ChunkServers instead rely on Linux’s <strong>buffer cache</strong> to keep frequently accessed data in memory.</li>
</ul>
<h3 id="single-master-and-large-chunk-size">Single Master and Large Chunk Size</h3>
<h3 id="agenda-1">Agenda</h3>
<ul>
<li>Single Master</li>
<li>Chunk Size</li>
</ul>
<h3 id="single-master">Single Master</h3>
<ul>
<li>Having a single master vastly <strong>simplifies GFS design</strong> and enables the master to make <strong>sophisticated chunk placement</strong> and <strong>replication decisions</strong> using <strong>global knowledge</strong>.</li>
<li>GFS minimizes the master’s involvement in reads and writes, so that it does not become a bottleneck.
<img loading="lazy" src="/images/notes/gfs-2003/image-3.png"></li>
</ul>
<h3 id="chunk-size">Chunk Size</h3>
<ul>
<li>GFS has chosen 64 MB, which is much larger than typical file system block sizes (often around 4 KB). This is one of the key design parameters.</li>
<li><strong>Advantages of large chunk size</strong>: fewer client&ndash;master interactions, clients can keep persistent TCP connections to a ChunkServer for many operations on the same chunk, and less metadata at the master.</li>
</ul>
<h3 id="lazy-space-allocation">Lazy space Allocation</h3>
<ul>
<li>Each chunk replica is stored as a plain Linux file on a ChunkServer. GFS does not allocate the whole 64 MB of disk space when creating a chunk. Instead, as the client appends data, the ChunkServer lazily extends the chunk.</li>
<li>One disadvantage of having a large chunk size is the handling of small files.</li>
</ul>
<h3 id="metadata">Metadata</h3>
<p>Let&rsquo;s explore how GFS manages file system metadata.</p>
<h3 id="agenda-2">Agenda</h3>
<ul>
<li>Storing Metadata in memory</li>
<li>Chunk Location</li>
<li>Operation Log</li>
</ul>
<p>The master stores 3 types of metadata:</p>
<ul>
<li>File and chunk namespaces (directory hierarchy).</li>
<li>Mapping from files to chunks.</li>
<li>Location of each chunk’s replicas.</li>
</ul>
<p>3 aspects of how the master stores this metadata:</p>
<ul>
<li>It keeps all the metadata in memory.</li>
<li>File and chunk namespaces and the file-to-chunk mapping are also persisted on the master’s local disk.</li>
<li>Chunk replica locations are not persisted to local disk.</li>
</ul>
<h3 id="storing-metadata-in-memory">Storing Metadata in Memory</h3>
<ul>
<li>Quick operations due to metadata being accessible in-memory.</li>
<li>It is efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used for chunk garbage collection, re-replication in the presence of ChunkServer failures, and chunk migration to balance load and disk space usage.</li>
<li>The capacity of the whole system (i.e., how many chunks the metadata can describe) is limited by how much memory the master has. This is not a problem in practice.</li>
<li>If the need to support a larger file system arises, the cost of adding extra memory to the master is a small price to pay for the reliability, simplicity, performance, and flexibility gained by storing metadata in memory.</li>
</ul>
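<p>To make the memory argument concrete, here is a rough estimate (an illustrative sketch; the ~64-bytes-per-chunk figure is the upper bound the paper reports):</p>

```python
# Estimate the master's metadata memory for a given amount of stored data,
# assuming under ~64 bytes of metadata per 64 MB chunk (paper's figure).
CHUNK_SIZE = 64 * 2**20          # 64 MB
METADATA_PER_CHUNK = 64          # bytes, assumed upper bound

def master_memory_bytes(stored_bytes: int) -> int:
    chunks = -(-stored_bytes // CHUNK_SIZE)   # ceiling division
    return chunks * METADATA_PER_CHUNK

# One petabyte of file data needs only about a gigabyte of master memory.
print(master_memory_bytes(2**50) / 2**20, "MB")
```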
<h3 id="chunk-location">Chunk Location</h3>
<ul>
<li>The master does not keep a persistent record of which ChunkServers have a replica of a given chunk.</li>
<li>By having the ChunkServer as the ultimate source of truth for each chunk’s location, GFS eliminates the problem of keeping the master and ChunkServers in sync.</li>
<li>It is not beneficial to maintain a consistent view of chunk locations on the master, because errors on a ChunkServer may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled, or a ChunkServer may be renamed or fail).</li>
</ul>
<h3 id="operation-log">Operation Log</h3>
<ul>
<li>The master maintains an operation log that contains the namespace and file-to-chunk mappings and stores it on the local disk.</li>
<li>Specifically, this log stores a historical persistent record of all the metadata changes and serves as a logical timeline that defines the order of concurrent operations.</li>
<li>For <strong>fault tolerance</strong> and <strong>reliability</strong>, this operation log is <strong>synchronously replicated</strong> on multiple remote machines, and changes to the metadata are not made visible to clients until they have been persisted on all replicas (similar to the High Water Mark concept in Kafka).</li>
<li>The master batches several log records together before flushing, thereby reducing the impact of flushing and replicating on overall system throughput.</li>
<li>Upon restart, the master can restore its file-system state by replaying the operation log.</li>
<li>This log must be kept small to minimize startup time, which is achieved by periodically checkpointing it.</li>
</ul>
<h3 id="checkpointing">Checkpointing</h3>
<ul>
<li>Master’s state is periodically serialized to disk and then replicated, so that on recovery, a master may load the checkpoint into memory, replay any subsequent operations from the operation log, and be available very quickly.</li>
<li>To further speed up the recovery and improve availability, <strong>GFS stores the checkpoint in a compact B-tree like format</strong> that can be directly mapped into memory and used for namespace lookup without extra parsing.</li>
<li>The checkpoint process can take time, therefore, to avoid delaying incoming mutations, the master switches to a new log file and creates the new checkpoint in a separate thread.</li>
</ul>
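<p>The recovery path described above (load the latest checkpoint, then replay the log suffix) can be sketched as follows; the operation format and helper names are invented for illustration, and a plain dict stands in for the real B-tree-like checkpoint:</p>

```python
import json

# Minimal sketch of master recovery: load the last checkpoint, then replay
# operation-log records written after it, in log order.
def recover(checkpoint: str, log_lines: list[str]) -> dict:
    state = json.loads(checkpoint)        # namespace -> chunk handles
    for line in log_lines:                # replay in log order
        op = json.loads(line)
        if op["type"] == "create":
            state[op["path"]] = op["chunks"]
        elif op["type"] == "delete":
            state.pop(op["path"], None)
    return state

ckpt = json.dumps({"/a": ["h1"]})
log = [json.dumps({"type": "create", "path": "/b", "chunks": ["h2"]})]
print(recover(ckpt, log))
```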
<h3 id="master-operations">Master Operations</h3>
<h3 id="agenda-3">Agenda</h3>
<ul>
<li>Namespace management and locking</li>
<li>Replica placement</li>
<li>Replica creation and re-replication</li>
<li>Replica rebalancing</li>
<li>Stale replica detection</li>
</ul>
<p>The master is responsible for:</p>
<ul>
<li>Making replica placement decisions</li>
<li>Creating new chunks and assigning replicas</li>
<li>Making sure that chunks are fully replicated as per the replication factor</li>
<li>Balancing the load across ChunkServers</li>
<li>Reclaiming unused storage</li>
</ul>
<h3 id="namespace-management-and-locking">Namespace management and locking</h3>
<ul>
<li>The master acquires locks over regions of the namespace to ensure proper serialization while still allowing multiple operations to run concurrently at the master.</li>
<li>GFS does not have an <strong>i-node</strong> like tree structure for directories and files.</li>
<li>Instead, it has a <strong>hash-map</strong> that maps a filename to its metadata, and reader-writer locks are applied on each node of the hash table for synchronization.</li>
</ul>
<h3 id="replica-placement">Replica placement</h3>
<ul>
<li>To ensure <strong>maximum data availability</strong> and <strong>integrity</strong>, the master distributes replicas on different racks (“<strong>rack aware</strong>”), so that clients can still read or write in case of a rack failure.</li>
<li>As the in and out bandwidth of a rack may be less than the sum of the bandwidths of its individual machines, placing the data on various racks lets clients exploit reads from multiple racks.</li>
<li>For write operations, multiple racks are actually disadvantageous, as data has to travel longer distances. This is an intentional tradeoff that GFS made.</li>
<li>Data is lost when all replicas of a chunk are lost.</li>
</ul>
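<p>A toy version of rack-aware placement can illustrate the policy; server and rack names are invented, and the real master also weighs disk utilization and recent creations:</p>

```python
# Spread a chunk's replicas round-robin across racks so that a whole-rack
# failure leaves at least one copy readable. Illustrative sketch only.
def place_replicas(servers_by_rack: dict[str, list[str]], replicas: int = 3) -> list[str]:
    iters = {rack: iter(servers) for rack, servers in servers_by_rack.items()}
    placed: list[str] = []
    while len(placed) < replicas and iters:
        # one pass over the remaining racks per round (round-robin)
        for rack in list(iters):
            try:
                placed.append(next(iters[rack]))
            except StopIteration:
                del iters[rack]            # rack exhausted
            if len(placed) == replicas:
                break
    return placed

print(place_replicas({"rack1": ["cs1", "cs2"], "rack2": ["cs3", "cs4"]}))
```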
<h3 id="replica-creation-and-re-replication">Replica creation and re-replication</h3>
<ul>
<li>The goals of a master are to place replicas on servers with less-than-average disk utilization, and spread replicas across racks.</li>
<li>It reduces the number of ‘recent’ creations on each ChunkServer, because even though creation itself is cheap, it is typically followed by heavy write traffic, which might create additional load.</li>
<li>Chunks need to be re-replicated as soon as the number of available replicas falls (due to data corruption on a server or a replica being unavailable) below the user-specified replication factor.</li>
<li>Instead of re-replicating all such chunks at once, the master prioritizes them and spreads the cloning work out over time, so that these clone operations do not become a bottleneck for client operations.</li>
<li>Restrictions are placed on the bandwidth of each server for re-replication so that client requests are not compromised.</li>
<li>Chunks are prioritized for re-replication by how far they are from their replication goal, whether they belong to live (not deleted) files, and whether they are blocking client progress.</li>
</ul>
<h3 id="replica-rebalancing">Replica rebalancing</h3>
<ul>
<li>Master rebalances replicas regularly to achieve load balancing and better disk space usage.</li>
<li>Any new ChunkServer added to the cluster is filled up gradually by the master rather than flooding it with a heavy traffic of write operations.</li>
</ul>
<h3 id="stale-replica-detection">Stale replica detection</h3>
<ul>
<li>Chunk replicas may become stale if a ChunkServer fails and misses mutations to the chunk while it is down.</li>
<li>For each chunk, the master maintains a chunk Version Number to distinguish between up-to-date and stale replicas.</li>
<li>The master increments the chunk version every time it grants a lease and informs all up-to-date replicas.</li>
<li>The master and these replicas all record the new version number in their persistent state.</li>
<li>Master removes stale replicas during regular garbage collection.</li>
<li>Stale replicas are not given to clients when they ask the master for a chunk location, and they are not involved in mutations either.</li>
<li>However, because a client caches a chunk’s location, it may read from a stale replica before the data is resynced.</li>
</ul>
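<p>The version-number check can be sketched in a few lines (names are illustrative only):</p>

```python
# A replica is stale when its recorded chunk version is older than the
# master's: it missed the version bump done when a lease was granted.
def stale_replicas(master_version: int, replica_versions: dict) -> list:
    return [server for server, version in replica_versions.items()
            if version < master_version]

# cs3 was down while a lease was granted, so it never saw version 7.
versions = {"cs1": 7, "cs2": 7, "cs3": 6}
print(stale_replicas(7, versions))
```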
<h3 id="anatomy-of-a-read-operation">Anatomy of a Read Operation</h3>
<p>Let’s learn how GFS handles a read operation. A typical interaction with GFS Cluster goes like this:</p>
<p><img loading="lazy" src="/images/notes/gfs-2003/image-4.png"></p>
<ul>
<li>Client translates the filename and byte offset specified by the application into a chunk index within the file.</li>
<li>Client sends RPC request with File Name and Chunk Index to the master.</li>
<li>The master replies with the chunk handle and the locations of the replicas holding that chunk.</li>
<li>Client caches this metadata using FileName and ChunkIndex as the key.</li>
<li>Client sends request to one of the closest replicas specifying a chunk handle and a byte range within that chunk.</li>
<li>Replica chunk server replies with requested data.</li>
<li>The master is involved only at the start and is then completely out of the loop, implementing a separation of control and data flows.</li>
</ul>
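<p>The first step, translating the application’s (filename, byte offset) into a chunk index, is simple fixed-size arithmetic. A sketch (the function name is illustrative, not the real GFS client API):</p>

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB chunks

# Translate an application-level byte offset into the (filename, chunk
# index) key used for the master request and the client metadata cache.
def to_chunk_request(filename: str, offset: int) -> tuple[str, int, int]:
    chunk_index = offset // CHUNK_SIZE        # which chunk holds the byte
    offset_in_chunk = offset % CHUNK_SIZE     # start of byte range inside it
    return filename, chunk_index, offset_in_chunk

# A read at byte 200 MB of a file falls in chunk 3 (0-based), 8 MB in:
print(to_chunk_request("/logs/crawl.log", 200 * 2**20))
```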
<h3 id="anatomy-of-write-operation">Anatomy of Write Operation</h3>
<h3 id="what-is-a-chunk-lease">What is a chunk lease?</h3>
<ul>
<li>To safeguard against concurrent writes at two different replicas of a chunk, GFS makes use of chunk lease.</li>
<li>When a mutation (i.e., a write, append or delete operation) is requested for a chunk, the master finds the ChunkServers which hold that chunk and grants a chunk lease (for 60 seconds) to one of them.</li>
<li>The server with the lease is called the primary and is responsible for <strong>providing a serial order</strong> for all the currently pending concurrent mutations to that chunk.</li>
<li>There is only one lease per chunk at any time, so that if two write requests go to the master, both see the same lease denoting the same primary.</li>
<li>A global ordering is provided by the ordering of the chunk leases combined with the order determined by that primary.</li>
<li>The primary can request lease extensions if needed.</li>
<li>When the master grants the lease, it increments the chunk version number and informs all replicas containing that chunk of the new version number.</li>
<li><strong>Failure modes?</strong></li>
</ul>
<h3 id="data-writing">Data Writing</h3>
<p>Writing of data is split into two phases:</p>
<ul>
<li><strong>Sending</strong></li>
<li><strong>Writing</strong></li>
</ul>
<p><strong>Stepwise breakdown of data transfer:</strong></p>
<ul>
<li>The client asks the master which ChunkServer holds the current lease on the chunk and the locations of the other replicas.</li>
<li>The master replies with the identity and location of the primary and the secondary replicas.</li>
<li>The client pushes data to the closest replica, which forwards it along a chain of ChunkServers to the remaining replicas.</li>
<li>Once all replicas have acknowledged receiving the data, the client sends the write request to the primary.</li>
<li>The primary assigns consecutive serial numbers to all the mutations it receives, providing serialization, and applies the mutations in serial-number order.</li>
<li>The primary forwards the write request to all secondary replicas, which apply mutations in the same serial-number order.</li>
<li>The secondary replicas reply to the primary, indicating that they have completed the operation.</li>
<li>The primary replies to the client with a success or error message.</li>
</ul>
<p>The key point to note is that the data flow is different from the control flow. Chunk <strong>version numbers</strong> are used to detect whether any replica has stale data that was not updated because its ChunkServer was down during an update.
<img loading="lazy" src="/images/notes/gfs-2003/image-5.png"></p>
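<p>The primary’s serialization step is the heart of this protocol: whatever order concurrent mutations arrive in, the primary fixes one order that every replica applies. A toy sketch, with invented names:</p>

```python
import itertools

# Toy primary for one chunk: assign consecutive serial numbers to incoming
# mutations and record them in that order; secondaries would replay the
# same (serial, data) sequence so all replicas apply one total order.
class Primary:
    def __init__(self):
        self._serial = itertools.count(1)
        self.applied = []                  # stands in for the chunk's state

    def mutate(self, data: bytes) -> int:
        serial = next(self._serial)        # the single total order per chunk
        self.applied.append((serial, data))
        return serial

p = Primary()
print(p.mutate(b"alpha"), p.mutate(b"beta"))   # consecutive serials
```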
<p>Another edge case with the write operation: if there are two concurrent write operations spanning multiple chunks, and those chunks have two different primary ChunkServers (each deciding its own serial order), <strong>the concurrent writes may end up interleaved</strong>. See the example below, and the Jordan video here at time 22:30 onwards. The only solution for that is a <strong>Distributed Locking Service</strong> (distributed consensus), which is an expensive operation.</p>
<p><img loading="lazy" src="/images/notes/gfs-2003/image-6.png"></p>
<h3 id="anatomy-of-append-operation">Anatomy of Append operation?</h3>
<ul>
<li><strong>Record append operation</strong> is optimized in a unique way that distinguishes GFS from other distributed file systems.</li>
<li>In a normal write, the client specifies the offset at which data is to be written. Concurrent writes to the same region can experience race conditions, and the region may end up containing data fragments from multiple clients.</li>
<li>In a <strong>record append</strong>, however, the client specifies only the data (<strong>up to 1/4 of the chunk size</strong>, i.e., 16 MB). GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS’s choosing and returns that offset to the client.</li>
<li><strong>Record append</strong> is a kind of mutation: an operation that changes the contents or metadata of a chunk.</li>
<li><strong>[Data Transfer to Replicas]</strong> When an application tries to append data on a chunk by sending a request to the client, the client pushes the data to all replicas of the last chunk of the file just like the write operation.</li>
<li><strong>[Command to serialize the write]</strong> When the client forwards the append request to the primary, the primary checks whether appending the record to the existing chunk would increase the chunk’s size beyond its limit (the maximum size of a chunk is 64 MB).</li>
<li><strong>[Pads the existing chunk]</strong> If so, it pads the chunk to the maximum size, commands the secondaries to do the same, and tells the client to retry the append on the next chunk.</li>
<li><strong>[Append to the primary replica’s chunk and notify secondaries]</strong> If the record fits within the maximum size, the primary appends the data to its replica, tells the secondaries to write the data at the exact same offset, and finally replies success to the client.</li>
<li><strong>[Failure Mode]</strong> If an append operation fails at any replica, the client retries the operation.</li>
</ul>
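<p>The primary’s fit-or-pad decision for record append can be sketched as follows (sizes are from the notes; the function and its return values are illustrative):</p>

```python
CHUNK_SIZE = 64 * 2**20       # maximum chunk size
MAX_RECORD = CHUNK_SIZE // 4  # records limited to 1/4 chunk (16 MB)

def record_append(chunk_used: int, record_len: int):
    """Primary's decision: pad and retry on the next chunk, or append."""
    assert record_len <= MAX_RECORD
    if chunk_used + record_len > CHUNK_SIZE:
        # Pad the rest of this chunk; the client retries on the next chunk.
        return ("pad_and_retry", CHUNK_SIZE - chunk_used)
    # Append at an offset of the primary's choosing (here: end of data).
    return ("appended_at", chunk_used)

print(record_append(60 * 2**20, 8 * 2**20))   # doesn't fit: pad 4 MB
print(record_append(10 * 2**20, 8 * 2**20))   # fits: offset 10 MB
```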
<h3 id="implications-for-writesjordan-video2740">Implications for Writes(Jordan Video:27:40)</h3>
<ul>
<li>Prefer appends to writes.</li>
<li>No Interleaving.</li>
<li>Readers need to be able to handle padding and/or duplicates (which can happen due to failed retries or partial failures on some of the replicas).</li>
<li>If making <strong>multi-chunk writes</strong>, writers should take <strong>checkpoints</strong> as each of those individual write chunks goes through.</li>
</ul>
<h3 id="gfs-consistency-model-and-snapshotting">GFS consistency model and Snapshotting</h3>
<h3 id="gfs-consistency-model">GFS Consistency model</h3>
<ul>
<li>GFS has a relaxed consistency model: it does not guarantee that all replicas are byte-wise identical or that concurrent mutations appear as if executed one at a time; instead it makes the weaker guarantees described below.</li>
<li>Metadata operations (e.g., file creation) are atomic.</li>
<li>Namespace locking guarantees atomicity and correctness.</li>
<li><strong>Master’s operation log</strong> defines a <strong>global total order</strong> of these operations.</li>
<li>In data mutations, there is an important distinction between <strong>write</strong> and <strong>append</strong> operations.</li>
<li><strong>Write</strong> operations specify an offset at which mutations should occur, whereas appends are always applied at the end of the file.</li>
<li>This means that for the write operation, the offset in the chunk is predetermined, whereas for append, the system decides.</li>
<li>Concurrent writes to the same location are not serializable and may result in corrupted regions of the file.</li>
<li>With append operations, GFS guarantees the append will happen at-least-once and atomically (that is, as a contiguous sequence of bytes).</li>
<li>The system does not guarantee that all copies of the chunk will be identical (some may have duplicate data).</li>
</ul>
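<p>Because appends are at-least-once, readers must cope with padding and duplicates themselves. One common application-level approach (the paper suggests embedding unique record identifiers) looks like this sketch:</p>

```python
# Filter an append-only stream on read: skip duplicates left by retried
# appends. Assumes each record carries a writer-assigned unique ID.
def dedup(records):
    seen = set()
    for rec_id, payload in records:
        if rec_id in seen:
            continue                 # duplicate from a retried append
        seen.add(rec_id)
        yield payload

stream = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]   # record 2 was retried
print(list(dedup(stream)))
```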
<h3 id="snapshotting">Snapshotting</h3>
<ul>
<li>A snapshot is a <strong>copy of some subtree of the global namespace</strong> as it exists at a given point in time.</li>
<li>GFS clients use snapshotting to efficiently <strong>branch two versions of the same data</strong>.</li>
<li>Snapshots in GFS are initially <strong>zero-copy</strong>.</li>
<li>When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files to snapshot.</li>
<li>It waits for leases to be revoked or expired and logs the snapshot operation to the operation log.</li>
<li>The snapshot is then made by duplicating the metadata for the source directory tree.</li>
<li>When a client makes a request to write to one of these chunks, the master detects that it is a copy-on-write chunk by examining its reference count (which will be more than one).</li>
<li>At this point, the master asks each ChunkServer holding the replica to make a copy of the chunk and store it locally.</li>
<li>Once the copy is complete, the master issues a lease for the new copy, and the write proceeds.</li>
</ul>
<h3 id="fault-tolerance-high-availability-and-data-integrity">Fault Tolerance, High Availability, and Data Integrity</h3>
<h3 id="agenda-4">Agenda</h3>
<ul>
<li>Fault Tolerance</li>
<li>High Availability through chunk replication</li>
<li>Data Integrity through checksum.</li>
</ul>
<h3 id="fault-tolerance">Fault Tolerance</h3>
<p>To make the system fault tolerant, and available, GFS uses two strategies:</p>
<ul>
<li>
<p><strong>Fast recovery</strong> in case of component failures.</p>
</li>
<li>
<p><strong>Replication</strong> for high availability.
<strong>Let&rsquo;s see how GFS recovers from a Master or Replica failure:</strong></p>
</li>
<li>
<p><strong>On Master Failure</strong></p>
</li>
<li>
<p><strong>On Primary Replica Failure</strong></p>
</li>
<li>
<p><strong>On Secondary Replica Failure</strong></p>
</li>
<li>
<p>Stale replicas might be exposed to clients. It is up to the application programmer to deal with these stale reads.</p>
</li>
</ul>
<h3 id="high-availability-through-chunk-replication">High Availability through chunk replication</h3>
<ul>
<li>Each chunk is replicated on multiple ChunkServers on different racks.</li>
<li>Users can specify different replication levels (<strong>default: 3</strong>) for different parts of the file namespace.</li>
<li>The master clones the existing replicas to keep each chunk fully replicated as ChunkServers go offline or when the master detects corrupted replicas through checksum verification.</li>
<li>A chunk is lost irreversibly only if all its replicas are lost before GFS can react. Even in this case, the data becomes unavailable, not corrupted, which means applications receive clear errors rather than corrupt data.</li>
</ul>
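<p>The master&rsquo;s re-replication behavior can be illustrated with a short sketch (hypothetical names; the real master also prioritizes chunks by how far below target they are):</p>

```python
def rereplicate(chunk_locations, live_servers, target=3):
    """For each chunk, drop dead replicas and clone from a surviving
    replica onto new servers until the target count is restored."""
    for chunk, servers in chunk_locations.items():
        alive = [s for s in servers if s in live_servers]
        if not alive:
            continue  # all replicas lost: data unavailable, not corrupted
        candidates = [s for s in live_servers if s not in alive]
        while len(alive) < target and candidates:
            alive.append(candidates.pop(0))  # clone from an existing replica
        chunk_locations[chunk] = alive
    return chunk_locations
```
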
<h3 id="data-integrity-through-checksum">Data Integrity through checksum</h3>
<ul>
<li>
<p>Checksumming is used by each ChunkServer to detect the corruption of stored data.</p>
</li>
<li>
<p>The chunk is broken down into 64 KB blocks.</p>
</li>
<li>
<p>Each 64 KB block has a corresponding 32-bit checksum.</p>
</li>
<li>
<p>Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.</p>
</li>
<li>
<p><strong>For Reads</strong>: the ChunkServer verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another ChunkServer. <strong>ChunkServers will not propagate corruptions to other machines</strong>.</p>
</li>
<li>
<p><strong>For Writes</strong>:
<img loading="lazy" src="/images/notes/gfs-2003/image-7.png"></p>
</li>
<li>
<p><strong>For Appends</strong>:</p>
</li>
<li>
<p>During idle periods, ChunkServers can scan and verify the contents of inactive chunks (prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk).</p>
</li>
<li>
<p>Checksumming has little effect on read performance for the following reasons:</p>
</li>
</ul>
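<p>The block-level checksum scheme can be sketched as follows (CRC32 stands in for whatever 32-bit checksum GFS actually uses; names are illustrative). Note that verification happens before any byte is returned, which is why corruption never propagates:</p>

```python
import zlib

BLOCK = 64 * 1024  # each chunk is checksummed in 64 KB blocks

def checksum_chunk(data):
    """One 32-bit checksum per 64 KB block, as kept by each ChunkServer."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verified_read(data, sums, offset, length):
    """Verify every block overlapping [offset, offset+length) before
    returning any bytes to the requester."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(data[b * BLOCK:(b + 1) * BLOCK]) != sums[b]:
            raise IOError(f"checksum mismatch in block {b}")
    return data[offset:offset + length]
```

<p>A single flipped byte anywhere in an overlapping block fails the whole read, even if the corrupted byte itself is outside the requested range.</p>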
<h3 id="garbage-collection">Garbage Collection</h3>
<p><strong>How does GFS implement Garbage Collection?</strong></p>
<h3 id="agenda-5">Agenda</h3>
<ul>
<li>Garbage collection through lazy deletion</li>
<li>Advantages of lazy deletion</li>
<li>Disadvantages of lazy deletion</li>
</ul>
<h3 id="garbage-collection-through-lazy-deletion">Garbage collection through lazy deletion</h3>
<ul>
<li>When a file is deleted, GFS does not immediately reclaim the physical space used by that file. Instead, it follows a <strong>lazy garbage collection</strong> strategy.</li>
<li>When the client issues a delete file operation, GFS does two things:</li>
<li>The file can still be read under the new, special name and can also be undeleted by renaming it back to normal.</li>
<li>To reclaim the physical storage, the master, while performing regular scans of the file system, removes any such hidden files if they have existed for more than <strong>three days</strong> (this interval is <strong>configurable</strong>) and also deletes its in-memory metadata.</li>
<li>This lazy deletion scheme provides a window of opportunity to a user who deleted a file by mistake to recover the file.</li>
<li>The master, while performing regular scans of the chunk namespace, deletes the metadata of all chunks that are not part of any file.</li>
<li>Also, during the exchange of regular HeartBeat messages with the master, each ChunkServer reports a subset of the chunks it has, and the master replies with a list of chunks from that subset that are no longer present in the master’s database; such chunks are then deleted from the ChunkServer.</li>
</ul>
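<p>The rename-then-scan scheme above can be sketched like this (a toy model with invented names; the hidden-name format and grace period are illustrative, though the three-day default comes from the paper):</p>

```python
GRACE = 3 * 24 * 3600  # three days (configurable in GFS)

def delete(namespace, path, now):
    """'Delete' = rename to a hidden name stamped with the deletion time."""
    namespace[f".deleted.{path}"] = (namespace.pop(path), now)

def undelete(namespace, path):
    """Recover a mistakenly deleted file by renaming it back."""
    data, _ = namespace.pop(f".deleted.{path}")
    namespace[path] = data

def gc_scan(namespace, now):
    """The master's regular scan reclaims hidden files older than the grace period."""
    for name in [n for n in namespace if n.startswith(".deleted.")]:
        _, deleted_at = namespace[name]
        if now - deleted_at > GRACE:
            del namespace[name]
```
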
<h3 id="advantages-of-lazy-deletion">Advantages of lazy deletion</h3>
<ul>
<li><strong>Simple and reliable</strong>: If the chunk deletion message is lost, the master does not have to retry. The ChunkServer can perform the garbage collection with the subsequent heartbeat messages.</li>
<li>GFS merges storage reclamation into regular background activities of the master, such as the regular scans of the filesystem or the exchange of HeartBeat messages. Thus, it is done in batches, and the <strong>cost is amortized</strong>.</li>
<li>Garbage collection takes place when the master is relatively free.</li>
<li>Lazy deletion provides safety against accidental, irreversible deletions.</li>
</ul>
<h3 id="disadvantages-of-lazy-deletion">Disadvantages of lazy deletion</h3>
<ul>
<li>After deletion, storage space does not become available immediately. Applications that frequently create and delete files may not be able to reuse the storage right away. To overcome this, GFS provides the following options:</li>
</ul>
<h3 id="criticism-on-gfs">Criticism on GFS</h3>
<h3 id="problems-associated-with-single-master">Problems associated with single master</h3>
<ul>
<li>Google has started to see the following problems with the centralized master scheme:</li>
</ul>
<h3 id="problems-associated-with-large-chunk-size">Problems associated with large chunk size</h3>
<ul>
<li>The large chunk size (64 MB) in GFS has disadvantages for reads. Since a small file will have one or a few chunks, the ChunkServers storing those chunks can become hotspots if many clients access the same file.</li>
<li>As a workaround for this problem, GFS stores extra copies of small files for distributing the load to multiple ChunkServers. Furthermore, GFS adds a random delay in the start times of the applications accessing such files.</li>
</ul>
<h3 id="summary">Summary</h3>
<ul>
<li><strong>Scalable</strong> <strong>distributed</strong> file storage system for large <strong>data-intensive applications</strong>.</li>
<li>Uses <strong>commodity hardware</strong> to reduce infrastructure costs.</li>
<li>Was designed with <strong>Fault Tolerance</strong> in mind (software/hardware faults).</li>
<li>Reading workload is <strong>large streaming reads</strong> and small random reads.</li>
<li>Writing workload is <strong>many large sequential writes</strong> that append data to files.</li>
<li>Provides APIs for file operations like <strong>create</strong>, <strong>delete</strong>, <strong>open</strong>, <strong>close</strong>, <strong>read</strong>, <strong>write</strong>, <strong>snapshot</strong> and <strong>record append</strong> operations. <strong>Record append</strong> allows multiple clients to concurrently append data to the same file while guaranteeing atomicity.</li>
<li>A GFS <strong>cluster</strong> consists of a <strong>single master</strong> and <strong>multiple chunk servers</strong>, and is accessed by multiple clients.</li>
<li>Files are broken into <strong>64 MB chunks</strong>, identified by <strong>Immutable</strong> and <strong>Globally unique</strong> <strong>64-bit</strong> <strong>Chunk Handle</strong>(assigned by master during chunk creation).</li>
<li><strong>Chunk servers</strong> store chunks on local disks as Linux files. For <strong>Reliability</strong>, each chunk is replicated to multiple chunk servers.</li>
<li><strong>Master</strong> is <strong>Coordinator</strong> for GFS cluster. Responsible for keeping track of <strong>all the filesystem metadata</strong>. Namespace, authorization, files-chunk mapping, chunk location.</li>
<li><strong>Master</strong> keeps <strong>all metadata in memory</strong> for faster operation. For <strong>Fault tolerance</strong>, and to <strong>handle master crash</strong>, all metadata changes are written onto disk into <strong>Operation Log</strong> which is replicated to other machines.</li>
<li><strong>Master</strong> doesn’t have a <strong>persistent record (only in-memory)</strong> of <strong>which chunk servers have replicas for a given chunk</strong>. Master asks each chunk server what chunks it holds at master startup, or whenever a chunk server joins the cluster.</li>
<li>For <strong>Quick recovery</strong>(Master failure), master’s state is periodically serialized to disk(Checkpointed) along with <strong>Operation log</strong> and is replicated. On recovery, master loads the checkpoint, and replays subsequent operations from <strong>Operation Log</strong>.</li>
<li>Master communicates with each chunk server via <strong>HeartBeat</strong> to collect state.</li>
<li>Applications use <strong>GFS Client code</strong>, which implements filesystem API, and communicates with the cluster. Clients interact with master for metadata(<strong>Control Flow</strong>), but all data transfer happens directly(<strong>Data Flow</strong>) between client and Chunk servers.</li>
<li><strong>Data Integrity:</strong> Each Chunk server uses Checksumming to detect corruption of stored data.</li>
<li><strong>Garbage Collection</strong>: Lazy Deletion.</li>
<li><strong>Consistency</strong>: Master guarantees data consistency by ensuring the order of mutations on all replicas and using <strong>chunk version numbers</strong>. If a replica has an incorrect version, it is garbage collected.</li>
<li>GFS guarantees <strong>at-least-once writes.</strong> It is the responsibility of readers to deal with duplicate chunks. This is achieved by having <strong>Checksums</strong> and <strong>serial numbers</strong> in the chunks, which help readers to filter and discard duplicate data.</li>
<li><strong>Cache</strong>: Neither clients nor chunk servers cache data. However, clients do cache metadata.</li>
</ul>
<h3 id="system-design-patterns">System Design Patterns</h3>
<ul>
<li><strong>Write-Ahead-Log</strong> - Operation Log</li>
<li><strong>HeartBeat</strong> - B/w Master and Chunk servers.</li>
<li><strong>CheckSum</strong> - Data Integrity</li>
<li>Copy-On-Write Snapshotting.</li>
<li>Lazy Garbage collection.</li>
</ul>
<h3 id="references">References</h3>
<ul>
<li>GFS Paper</li>
<li>BigTable Paper</li>
<li>GFS Evolution on Fast-Forward</li>
<li>Jordan&rsquo;s video gives a quick summary of the above.</li>
</ul>
<hr>
<p><strong>Paper Link:</strong> <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
    <item>
      <title>Hadoop Distributed File System</title>
      <link>https://www.sethihemant.com/notes/hdfs-2010/</link>
      <pubDate>Sun, 08 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/hdfs-2010/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://pages.cs.wisc.edu/~akella/CS838/F15/838-CloudPapers/hdfs.pdf&#34;&gt;Hadoop Distributed File System&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;hadoop-distributed-file-system&#34;&gt;Hadoop Distributed File System&lt;/h2&gt;
&lt;h3 id=&#34;goal&#34;&gt;Goal&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Design a &lt;strong&gt;distributed&lt;/strong&gt; system that can store &lt;strong&gt;huge files (terabyte and larger)&lt;/strong&gt;. The system should be &lt;strong&gt;scalable&lt;/strong&gt;, &lt;strong&gt;reliable&lt;/strong&gt;, and &lt;strong&gt;highly available.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;what-is-hadoop-distributed-file-system&#34;&gt;What is Hadoop Distributed File System&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HDFS&lt;/strong&gt; is a distributed file system and was built to store &lt;strong&gt;unstructured data&lt;/strong&gt;. It is designed to store huge files &lt;strong&gt;reliably&lt;/strong&gt; and &lt;strong&gt;stream&lt;/strong&gt; those files at high bandwidth to user applications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HDFS&lt;/strong&gt; is a variant and a simplified version of the Google File System (&lt;strong&gt;GFS&lt;/strong&gt;). A lot of HDFS architectural decisions are inspired by GFS design. HDFS is built around the idea that the most efficient &lt;strong&gt;data processing pattern&lt;/strong&gt; is a &lt;strong&gt;write-once, read-many-times&lt;/strong&gt; pattern.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;background&#34;&gt;Background&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Apache Hadoop is a software framework that provides a distributed file storage system(HDFS) and distributed computing for analyzing and transforming very large data sets using the MapReduce programming model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HDFS&lt;/strong&gt; is the &lt;strong&gt;default&lt;/strong&gt; file storage system in &lt;strong&gt;Hadoop&lt;/strong&gt;. It is designed to be a &lt;strong&gt;distributed&lt;/strong&gt;, &lt;strong&gt;scalable&lt;/strong&gt;, &lt;strong&gt;fault-tolerant&lt;/strong&gt; file system that primarily caters to the needs of the MapReduce paradigm.&lt;/li&gt;
&lt;li&gt;Both HDFS and GFS were built to store very large files and scale to store petabytes of storage.&lt;/li&gt;
&lt;li&gt;Both were built for handling batch processing on huge data sets and were designed for data-intensive applications and not for end-users.&lt;/li&gt;
&lt;li&gt;Like GFS, &lt;strong&gt;HDFS is also not POSIX-compliant&lt;/strong&gt; and is not a mountable file system on its own. It is typically accessed via &lt;strong&gt;HDFS clients&lt;/strong&gt; or by using application programming interface (&lt;strong&gt;API&lt;/strong&gt;) calls from the Hadoop libraries.&lt;/li&gt;
&lt;li&gt;Given the HDFS design, the following applications are not a good fit for HDFS:&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;api&#34;&gt;API&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Provides &lt;strong&gt;user-level&lt;/strong&gt; APIs (and not standard POSIX-like APIs).&lt;/li&gt;
&lt;li&gt;Files are organized &lt;strong&gt;hierarchically&lt;/strong&gt; in directories and identified by their &lt;strong&gt;pathnames&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Supports the usual file system operations on files and directories. &lt;strong&gt;Create&lt;/strong&gt;, &lt;strong&gt;Delete&lt;/strong&gt;, &lt;strong&gt;Rename&lt;/strong&gt;, &lt;strong&gt;Move&lt;/strong&gt;, and &lt;strong&gt;Symbolic Links(unlike GFS)&lt;/strong&gt; etc.&lt;/li&gt;
&lt;li&gt;All read and write operations are done in an &lt;strong&gt;append-only&lt;/strong&gt; fashion.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;high-level-architecture&#34;&gt;High Level Architecture&lt;/h3&gt;
&lt;h3 id=&#34;hdfs-architecture&#34;&gt;HDFS Architecture&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Files are broken into &lt;strong&gt;128 MB&lt;/strong&gt; fixed-size blocks (configurable on a per-file basis).&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://pages.cs.wisc.edu/~akella/CS838/F15/838-CloudPapers/hdfs.pdf">Hadoop Distributed File System</a></p>
<hr>
<h2 id="hadoop-distributed-file-system">Hadoop Distributed File System</h2>
<h3 id="goal">Goal</h3>
<ul>
<li>Design a <strong>distributed</strong> system that can store <strong>huge files (terabyte and larger)</strong>. The system should be <strong>scalable</strong>, <strong>reliable</strong>, and <strong>highly available.</strong></li>
</ul>
<h3 id="what-is-hadoop-distributed-file-system">What is Hadoop Distributed File System</h3>
<ul>
<li><strong>HDFS</strong> is a distributed file system and was built to store <strong>unstructured data</strong>. It is designed to store huge files <strong>reliably</strong> and <strong>stream</strong> those files at high bandwidth to user applications.</li>
<li><strong>HDFS</strong> is a variant and a simplified version of the Google File System (<strong>GFS</strong>). A lot of HDFS architectural decisions are inspired by GFS design. HDFS is built around the idea that the most efficient <strong>data processing pattern</strong> is a <strong>write-once, read-many-times</strong> pattern.</li>
</ul>
<h3 id="background">Background</h3>
<ul>
<li>Apache Hadoop is a software framework that provides a distributed file storage system(HDFS) and distributed computing for analyzing and transforming very large data sets using the MapReduce programming model.</li>
<li><strong>HDFS</strong> is the <strong>default</strong> file storage system in <strong>Hadoop</strong>. It is designed to be a <strong>distributed</strong>, <strong>scalable</strong>, <strong>fault-tolerant</strong> file system that primarily caters to the needs of the MapReduce paradigm.</li>
<li>Both HDFS and GFS were built to store very large files and scale to store petabytes of storage.</li>
<li>Both were built for handling batch processing on huge data sets and were designed for data-intensive applications and not for end-users.</li>
<li>Like GFS, <strong>HDFS is also not POSIX-compliant</strong> and is not a mountable file system on its own. It is typically accessed via <strong>HDFS clients</strong> or by using application programming interface (<strong>API</strong>) calls from the Hadoop libraries.</li>
<li>Given the HDFS design, the following applications are not a good fit for HDFS:</li>
</ul>
<h3 id="api">API</h3>
<ul>
<li>Provides <strong>user-level</strong> APIs (and not standard POSIX-like APIs).</li>
<li>Files are organized <strong>hierarchically</strong> in directories and identified by their <strong>pathnames</strong>.</li>
<li>Supports the usual file system operations on files and directories. <strong>Create</strong>, <strong>Delete</strong>, <strong>Rename</strong>, <strong>Move</strong>, and <strong>Symbolic Links(unlike GFS)</strong> etc.</li>
<li>All read and write operations are done in an <strong>append-only</strong> fashion.</li>
</ul>
<h3 id="high-level-architecture">High Level Architecture</h3>
<h3 id="hdfs-architecture">HDFS Architecture</h3>
<ul>
<li>
<p>Files are broken into <strong>128 MB</strong> fixed-size blocks (configurable on a per-file basis).</p>
</li>
<li>
<p>File has two parts: the <strong>actual file data</strong> and the <strong>metadata</strong>.</p>
</li>
<li>
<p><strong>Metadata</strong></p>
</li>
<li>
<p>An HDFS cluster primarily consists of a <strong>NameNode</strong> (the Master in GFS) that <strong>manages the file system metadata</strong> and <strong>DataNodes</strong> (the Chunk Servers in GFS) that <strong>store the actual data</strong>.</p>
</li>
<li>
<p>All blocks of a file are of the same size except the last one.</p>
</li>
<li>
<p>HDFS uses large block sizes because it is designed to store extremely large files to enable MapReduce jobs to process them efficiently.</p>
</li>
<li>
<p>Each block is identified by a unique <strong>64-bit ID</strong> called a <strong>BlockID</strong> (similar to a chunk in GFS). All read/write operations in HDFS operate at the block level.</p>
</li>
<li>
<p>DataNodes store each block in a separate file on the local file system and provide read/write access.</p>
</li>
<li>
<p>When a DataNode starts up, it scans through its local file system and sends the list of hosted data blocks (called a BlockReport) to the NameNode (similar to how the Master gets state information from Chunk Servers in GFS).
<img loading="lazy" src="/images/notes/hdfs-2010/image-1.png"></p>
</li>
<li>
<p>The <strong>NameNode</strong> maintains two <strong>on-disk data structures</strong> to store the file system’s state: an <strong>FsImage</strong> file (the Operation Log checkpoint in GFS) and an <strong>EditLog</strong> (the Operation Log in GFS).</p>
</li>
<li>
<p><strong>FsImage</strong> is a <strong>checkpoint</strong> of the <strong>file system metadata</strong> at some point in time, while the <strong>EditLog</strong> is a log of all of the file system metadata transactions <strong>since the image file</strong> was last created. These two files help the NameNode to recover from failure.</p>
</li>
<li>
<p>User applications interact with HDFS through its client. HDFS Client interacts with NameNode for metadata, but all data transfers happen directly between the client and DataNodes.</p>
</li>
<li>
<p>To achieve high-availability, HDFS creates multiple copies of the data and distributes them on nodes throughout the cluster.</p>
</li>
</ul>
<h3 id="comparison-bw-gfs-and-hdfs">Comparison b/w GFS and HDFS</h3>
<p><img loading="lazy" src="/images/notes/hdfs-2010/image-2.png"></p>
<p><img loading="lazy" src="/images/notes/hdfs-2010/image-3.png"></p>
<p><img loading="lazy" src="/images/notes/hdfs-2010/image-4.png"></p>
<h3 id="deep-dive">Deep Dive</h3>
<h3 id="cluster-topology">Cluster Topology</h3>
<ul>
<li>Hadoop clusters typically have about 30 to 40 servers per rack.</li>
<li>Each rack has a dedicated <strong>gigabit switch</strong> that connects all of its servers and an <strong>uplink</strong> to a <strong>core switch or router</strong>, whose bandwidth is shared by many racks in the data center.</li>
<li>When HDFS is deployed on a cluster, each of its servers is configured and mapped to a particular rack. The network distance between servers is measured in hops, where one hop corresponds to one link in the topology.</li>
<li>Hadoop assumes a <strong>tree-style topology</strong>, and the distance between two servers is the sum of their distances to their closest common ancestor.
<img loading="lazy" src="/images/notes/hdfs-2010/image-5.png"></li>
</ul>
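<p>The tree-distance rule above can be written as a small helper (a sketch assuming node paths of the form <code>/datacenter/rack/node</code>, which is how Hadoop&rsquo;s topology scripts typically name locations):</p>

```python
def distance(a, b):
    """Network distance in a tree topology: the sum of each node's hops
    up to their closest common ancestor. Paths look like '/d1/r1/n1'."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    while common < min(len(pa), len(pb)) and pa[common] == pb[common]:
        common += 1
    return (len(pa) - common) + (len(pb) - common)
```

<p>Same node gives 0, same rack gives 2, and crossing racks in one data center gives 4.</p>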
<h3 id="rack-aware-replication">Rack aware replication</h3>
<ul>
<li>HDFS employs a <strong>rack-aware replica placement policy</strong> to improve data <strong>reliability</strong>, <strong>availability</strong>, and network <strong>bandwidth utilization</strong>.</li>
<li>The idea behind HDFS’s replica placement is to be able to <strong>tolerate node</strong> and <strong>rack failures.</strong></li>
<li>If the replication factor is three, HDFS attempts to place the first replica on the writer&rsquo;s node and the remaining two on two different nodes in a single remote rack.</li>
<li>This rack-aware replication scheme <strong>slows the write operation</strong>, as data needs to be replicated onto different racks; this is a tradeoff between reliability and performance.
<img loading="lazy" src="/images/notes/hdfs-2010/image-6.png"></li>
</ul>
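<p>The default placement policy can be sketched as follows (a simplified illustration: it assumes the first remote rack has at least two nodes and ignores load and free-space considerations that real HDFS also weighs):</p>

```python
def place_replicas(writer, racks):
    """Replication factor 3: replica 1 on the writer's node, replicas 2
    and 3 on two different nodes of a single remote rack.
    `racks` maps rack name -> list of node names."""
    writer_rack = next(r for r, nodes in racks.items() if writer in nodes)
    remote_rack = next(r for r in racks if r != writer_rack)
    return [writer] + racks[remote_rack][:2]
```

<p>Losing the writer&rsquo;s whole rack still leaves two replicas; losing the remote rack still leaves one.</p>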
<h3 id="synchronization-semantics">Synchronization Semantics</h3>
<ul>
<li>Early versions of HDFS followed <strong>strict immutable semantics</strong>. Once a file was written, it could never again be re-opened for writes; files could still be deleted.</li>
<li>Current versions of HDFS support <strong>append</strong>.</li>
<li>This design choice in HDFS was because most MapReduce workloads follow the <strong>write once and read many data-access</strong> patterns.</li>
<li>MapReduce is a restricted computational model with predefined stages. The reducers in MapReduce write independent files to HDFS as output. HDFS focuses on fast read access for multiple clients at a time.</li>
</ul>
<h3 id="hdfs-consistency-model">HDFS Consistency Model</h3>
<ul>
<li>HDFS follows a <strong>strong consistency</strong> model.</li>
<li>To ensure <strong>strong consistency</strong>, a write is declared <strong>successful</strong> only when <strong>all replicas have been written successfully</strong>.</li>
<li>HDFS does not allow multiple concurrent writers to write to an HDFS file, so implementing strong consistency becomes relatively easy.</li>
</ul>
<h3 id="anatomy-of-a-read-operation">Anatomy of a Read Operation</h3>
<h3 id="hdfs-read-process">HDFS Read Process</h3>
<ul>
<li><strong>(1)</strong> When a file is opened for reading, <strong>the HDFS client</strong> initiates a <strong>read</strong> request by calling the <strong>open()</strong> method of the <em><strong>Distributed FileSystem</strong></em> object. The client specifies the <strong>file name</strong>, <strong>start offset</strong>, and the <strong>read range</strong> length.</li>
<li><strong>(2)</strong> The <strong>Distributed FileSystem</strong> object calculates what blocks need to be read based on the given <strong>offset</strong> and <strong>range length</strong>, and requests the <strong>locations</strong> of the blocks from the <strong>NameNode</strong>.</li>
<li><strong>(3)</strong> <strong>NameNode</strong> has <strong>metadata</strong> for all blocks’ <strong>locations</strong>. It provides the client a <strong>list of blocks</strong> and the <strong>locations of each block replica</strong>. As the blocks are replicated, NameNode finds the <strong>closest replica to the client</strong> when providing a particular block’s location. The <strong>closest locality</strong> of each block is determined as follows:</li>
<li><strong>(4)</strong> After getting the block locations, the client calls the <strong>read()</strong> method of <em><strong>FSData InputStream</strong></em>,which takes care of all the interactions with the <strong>DataNodes</strong>.</li>
<li><strong>(5)</strong> Once the client invokes the <strong>read()</strong> method, the input stream object <strong>establishes a connection with the closest DataNode</strong> holding the first block of the file.</li>
<li><strong>(5b)</strong> The data is read in the form of <strong>streams</strong> and passed to the requesting application. Hence, the block <strong>does not have to be transferred in its entirety</strong> before the client application starts processing it.</li>
<li><strong>(6)</strong> Once the <em><strong>FSData InputStream</strong></em> receives all data of a block, it <strong>closes the connection</strong> and moves on to connect to the DataNode holding the <strong>next block</strong>. It repeats this process until it finishes reading all the required blocks of the file.</li>
<li><strong>(7)</strong> Once the client finishes reading all the required blocks, it calls the <strong>close()</strong> method of the input stream object.
<img loading="lazy" src="/images/notes/hdfs-2010/image-7.png"></li>
</ul>
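<p>Steps (1)&ndash;(7) reduce to a short loop when stripped of networking (a toy sketch: <code>namenode</code> and <code>datanodes</code> are plain dicts standing in for the real RPC endpoints, and replica selection is omitted):</p>

```python
def hdfs_read(namenode, datanodes, path, offset, length, block_size):
    """Toy read path: ask the NameNode which blocks cover the requested
    range, then stream each block directly from a DataNode."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    data = b""
    for block_id in namenode[path][first:last + 1]:
        # The NameNode would return replica locations sorted by proximity;
        # here every block lives in one flat `datanodes` store.
        data += datanodes[block_id]
    start = offset - first * block_size
    return data[start:start + length]
```

<p>Note that only the blocks overlapping the requested range are ever touched.</p>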
<h3 id="short-circuit-read">Short Circuit Read</h3>
<ul>
<li>If the data and the client are on the same machine, HDFS can directly read the file bypassing the DataNode. This scheme is called <strong>short circuit read</strong> and is quite efficient as it reduces overhead and other processing resources.</li>
</ul>
<h3 id="anatomy-of-a-write-process">Anatomy of a Write Process</h3>
<ul>
<li>HDFS client initiates a write request by calling the <strong>create()</strong> method of the <em><strong>Distributed FileSystem</strong></em> object.</li>
<li><em><strong>Distributed FileSystem</strong></em> object sends a file creation request to the NameNode.</li>
<li>NameNode verifies that the file does not already exist and that the client has permission to create the file. If both these conditions are verified, the NameNode creates a new file record and sends an acknowledgment.</li>
<li>Client proceeds to write the file using <em><strong>FSData OutputStream.</strong></em></li>
<li><em><strong>FSData OutputStream</strong></em> writes data to a local queue called <strong>‘Data Queue.’</strong> The data is kept in the queue until a complete block of data is accumulated.</li>
<li>Once the queue has a complete block, another component called <strong>DataStreamer</strong> is notified to manage data transfer to the DataNode.</li>
<li><strong>DataStreamer</strong> first asks the <strong>NameNode</strong> to allocate a new block on DataNodes, thereby picking desirable DataNodes to be used for replication.</li>
<li>The <strong>NameNode</strong> provides a <strong>list of blocks</strong> and the <strong>locations of each block replica</strong>.</li>
<li>Upon receiving the block locations from the NameNode, the <strong>DataStreamer</strong> starts transferring the <strong>blocks from the internal queue</strong> to the <strong>nearest DataNode</strong>.</li>
<li>Each block is written to the <strong>first DataNode</strong>, which then <strong>pipelines the block to other DataNodes</strong> in order <strong>to write replicas</strong> of the block.</li>
<li>Once the <strong>DataStreamer</strong> finishes writing all blocks, it <strong>waits for</strong> <strong>acknowledgments</strong> from all the DataNodes.</li>
<li>Once all acknowledgments are received, the client calls the <strong>close()</strong> method of the <em><strong>OutputStream</strong></em>.</li>
<li>Finally, the <em><strong>Distributed FileSystem</strong></em> contacts the NameNode to notify that the file write operation is complete. At this point, the NameNode commits the file creation operation, which makes the file available to be read.
<img loading="lazy" src="/images/notes/hdfs-2010/image-8.png"></li>
</ul>
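<p>The replication pipeline at the heart of the write path can be sketched recursively (illustrative names; the real pipeline streams packets rather than whole blocks and overlaps transfer with forwarding):</p>

```python
def pipeline_write(block, pipeline, stores):
    """The client hands the block to the first DataNode; each node stores
    its local replica and forwards downstream; the ack propagates back
    once every node in the pipeline has written its copy."""
    if not pipeline:
        return True                          # end of pipeline: ack flows back
    node, rest = pipeline[0], pipeline[1:]
    stores[node] = block                     # write the local replica
    return pipeline_write(block, rest, stores)
```
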
<h3 id="data-integrityblock-scanner--caching">Data Integrity(Block Scanner) &amp; Caching</h3>
<h3 id="data-integrity">Data Integrity</h3>
<ul>
<li>Data Integrity refers to ensuring the correctness of the data.</li>
<li>When a client retrieves a block from a DataNode, the data may arrive corrupted. This corruption can occur because of faults in the storage device, network, or the software itself.</li>
<li>HDFS client <strong>uses checksum to verify the file contents</strong>.</li>
<li>When a <strong>client stores a file</strong> in HDFS, it <strong>computes a checksum of each block</strong> of the file and <strong>stores these checksums in a separate hidden file</strong> in the same HDFS namespace.</li>
<li>When a <strong>client retrieves file contents</strong>, it <strong>verifies</strong> that the <strong>data</strong> it <strong>received</strong> from each DataNode <strong>matches</strong> the <strong>checksum stored in the associated checksum</strong> file.</li>
<li>If not, then the client can opt to retrieve that block from another replica.</li>
</ul>
<h3 id="block-scanner">Block Scanner</h3>
<ul>
<li>A block scanner process periodically runs on each DataNode to scan blocks stored on that DataNode and verify that the stored checksums match the block data.</li>
<li>Additionally, when a client reads a complete block and checksum verification succeeds, it informs the DataNode. The DataNode treats it as a verification of the replica.</li>
<li>Whenever a client or a block scanner detects a corrupt block, it notifies the NameNode.</li>
<li>The NameNode marks the replica as corrupt and initiates the process to create a new good replica of the block.</li>
</ul>
<h3 id="caching">Caching</h3>
<ul>
<li>Normally, blocks are read from the disk, but for frequently accessed files, blocks may be explicitly cached in the DataNode’s memory, in an <strong>off-heap block cache</strong>.</li>
<li>HDFS offers a <strong>Centralized Cache Management</strong> scheme to allow its <strong>clients to specify to the NameNode file paths which need to be cached</strong>.</li>
<li>NameNode communicates with the DataNodes that have the desired blocks on disk and instructs them to cache the blocks in <strong>off-heap caches.</strong></li>
<li><strong>Advantages</strong> of <strong>Centralized Cache</strong> management in HDFS:</li>
</ul>
<h3 id="fault-tolerance">Fault Tolerance</h3>
<h3 id="agenda">Agenda</h3>
<ul>
<li>How does HDFS handle DataNode failures?</li>
<li>What happens when the NameNode fails?</li>
</ul>
<h3 id="how-does-hdfs-handle-datanode-failures">How does HDFS handle DataNode failures?</h3>
<h3 id="replication">Replication</h3>
<ul>
<li>As blocks are replicated to multiple DataNodes (default: 3 replicas), if one DataNode becomes inaccessible, its data can be read from the other replicas.</li>
</ul>
<h3 id="heartbeat">HeartBeat</h3>
<ul>
<li>The NameNode keeps track of DataNodes through a heartbeat mechanism. Each DataNode sends periodic heartbeat messages (every few seconds) to the NameNode.</li>
<li>If a DataNode dies, the heartbeats will stop, and the NameNode will detect that the DataNode has died. The NameNode will then mark the DataNode as dead and will no longer forward any read/write request to that DataNode.</li>
<li>Because of replication, the blocks stored on that DataNode have additional replicas on other DataNodes.</li>
<li>The NameNode performs regular status checks on the file system to discover under-replicated blocks and performs a <strong>cluster rebalance</strong> process to replicate blocks that have less than the desired number of replicas.</li>
</ul>
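<p>The heartbeat-based failure detection and under-replication check can be sketched as follows. This is a toy model, not HDFS code, and the timeout value is illustrative (HDFS by default waits on the order of ten minutes of missed heartbeats before declaring a DataNode dead).</p>

```python
HEARTBEAT_TIMEOUT = 10.0  # illustrative; real HDFS uses ~10 minutes by default

class HeartbeatMonitor:
    """Toy NameNode-side view: mark DataNodes dead when heartbeats stop,
    then find blocks with fewer live replicas than desired."""

    def __init__(self, replication=3):
        self.replication = replication
        self.last_seen = {}   # DataNode name -> last heartbeat timestamp
        self.replicas = {}    # block ID -> set of DataNodes holding it

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def live_nodes(self, now):
        return {dn for dn, t in self.last_seen.items()
                if now - t <= HEARTBEAT_TIMEOUT}

    def under_replicated(self, now):
        # Blocks whose live replica count fell below the target are
        # candidates for the cluster-rebalance / re-replication process.
        live = self.live_nodes(now)
        return {blk for blk, dns in self.replicas.items()
                if len(dns & live) < self.replication}

mon = HeartbeatMonitor(replication=2)
mon.replicas = {"blk_1": {"dn1", "dn2"}, "blk_2": {"dn2", "dn3"}}
for dn in ("dn1", "dn2", "dn3"):
    mon.heartbeat(dn, now=0)

mon.heartbeat("dn2", now=60)  # only dn2 keeps heartbeating
assert mon.live_nodes(now=60) == {"dn2"}
assert mon.under_replicated(now=60) == {"blk_1", "blk_2"}
```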
<h3 id="what-happens-when-the-namenode-fails">What happens when the NameNode fails?</h3>
<h3 id="fsimage-and-editlog">FsImage and EditLog</h3>
<ul>
<li>The NameNode is a single point of failure (SPOF); if it fails, the entire file system becomes unavailable.</li>
<li>Internally, the NameNode maintains two <strong>on-disk data structures</strong> that store the file system’s state: an <strong>FsImage file</strong> and an <strong>EditLog</strong>. <strong>FsImage</strong> is a <strong>checkpoint</strong> (or the image) of the file system metadata at some point in time, while the <strong>EditLog</strong> is a log of all of the file system metadata transactions since the image file was last created.</li>
<li>All incoming changes to the file system metadata are written to the <strong>EditLog</strong>.</li>
<li>At periodic intervals, the <strong>EditLog</strong> and <strong>FsImage</strong> files are merged to create a new image file snapshot, and the edit log is cleared out.</li>
</ul>
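<p>The FsImage/EditLog interplay can be sketched as a tiny write-ahead-log model. All names are illustrative, not HDFS code: mutations are appended to the log before being applied in memory, a checkpoint serializes the current state and clears the log, and recovery loads the image and replays the remaining edits.</p>

```python
import json

class MiniNameNode:
    """Toy sketch of the FsImage/EditLog split."""

    def __init__(self):
        self.metadata = {}   # in-memory filesystem state (path -> size)
        self.edit_log = []   # would live on disk in real HDFS

    def apply(self, op):
        path, size = op
        self.metadata[path] = size

    def mutate(self, path, size):
        self.edit_log.append((path, size))   # log first (write-ahead)
        self.apply((path, size))

    def checkpoint(self):
        fsimage = json.dumps(self.metadata)  # snapshot of current state
        self.edit_log = []                   # log cleared after the merge
        return fsimage

    @classmethod
    def recover(cls, fsimage, edit_log):
        nn = cls()
        nn.metadata = json.loads(fsimage)
        for op in edit_log:                  # replay edits since checkpoint
            nn.apply(op)
        return nn

nn = MiniNameNode()
nn.mutate("/a", 1)
nn.mutate("/b", 2)
image = nn.checkpoint()
nn.mutate("/c", 3)                           # lands in the fresh edit log

restored = MiniNameNode.recover(image, nn.edit_log)
assert restored.metadata == {"/a": 1, "/b": 2, "/c": 3}
```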
<h3 id="metadata-backup">Metadata backup</h3>
<ul>
<li>On a <strong>NameNode failure</strong>, <strong>the metadata would be unavailable</strong>, and <strong>a disk failure</strong> on the NameNode would be catastrophic: the <strong>file metadata would be permanently lost</strong>, since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes.</li>
<li>Thus, it is crucial to make the NameNode resilient to failure, and HDFS provides <strong>two</strong> mechanisms for this:
<img loading="lazy" src="/images/notes/hdfs-2010/image-9.png"></li>
</ul>
<h3 id="hdfs-high-availability">HDFS High Availability</h3>
<h3 id="agenda-1">Agenda</h3>
<ul>
<li>HDFS high availability architecture</li>
<li>Failover and fencing</li>
</ul>
<h3 id="hdfs-high-availability-architecture">HDFS high availability architecture</h3>
<h3 id="problem">Problem</h3>
<ul>
<li>Although NameNode’s metadata is copied to multiple file systems to protect against <strong>data loss</strong>, it still does not provide high availability of the filesystem.</li>
<li>If the NameNode fails, no clients will be able to read, write, or list files, because the NameNode is the sole repository of the metadata and the file-to-block mapping.</li>
<li>In such an event, the whole Hadoop system would effectively be out of service until a new NameNode is brought online.</li>
<li>To recover from a failed NameNode, an administrator starts a new primary NameNode with one of the filesystem metadata replicas and configures DataNodes and clients to use this new NameNode.</li>
<li>The new NameNode is not able to serve requests until it has loaded its namespace image into memory, replayed its edit log, and received enough block reports from the DataNodes to leave safe mode.</li>
<li>On large clusters with many files and blocks, it can take half an hour or more to perform such a cold start of a NameNode.</li>
<li>Furthermore, this long recovery time is a problem for routine maintenance.</li>
</ul>
<h3 id="solution">Solution</h3>
<ul>
<li>Hadoop 2.0 added support for High Availability (HA).</li>
<li>There are two (or more) NameNodes in an active-standby configuration.</li>
<li>The active NameNode is responsible for all client operations in the cluster, while the Standby simply acts as a follower of the active, maintaining enough state to provide a fast failover when required.</li>
<li>For the Standby nodes to keep their state synchronized with the active node, HDFS made a few architectural changes:</li>
</ul>
<h3 id="quorum-journal-managerqjm">Quorum Journal Manager(QJM)</h3>
<ul>
<li>Provides a <strong>highly available EditLog</strong>.</li>
<li>QJM runs as a group of journal nodes (usually three, of which one can fail), and each edit must be written to a quorum (majority) of the journal nodes.</li>
<li>Similar to the way ZooKeeper works, except QJM doesn&rsquo;t use ZooKeeper.</li>
<li>HDFS High Availability does use ZooKeeper for electing the active NameNode (<strong>Master Election</strong>).</li>
<li>The QJM process runs on all NameNodes and communicates all EditLog changes to journal nodes using RPC.</li>
<li>Since the Standby NameNodes have the latest state of the metadata available in memory (both the latest EditLog and an up-to-date block mapping), any standby can take over very quickly (in a few seconds) if the active NameNode fails.</li>
<li>However, the actual failover time will be longer in practice (<strong>around a minute</strong>) because the system needs to be conservative in deciding that the active NameNode has failed (<strong>Failure Detection</strong>).</li>
<li>In the unlikely event of the Standbys being down when the active fails, the administrator can still do a cold start of a Standby. This is no worse than the non-HA case.</li>
</ul>
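<p>The quorum rule can be sketched in a few lines. This is a toy model, not the QJM implementation: an edit counts as committed only when a majority of journal nodes have acknowledged it, so the system stays writable as long as fewer than half the journals are down.</p>

```python
class JournalNode:
    def __init__(self, up=True):
        self.up = up
        self.edits = []

    def write(self, edit) -> bool:
        if not self.up:
            return False
        self.edits.append(edit)
        return True

def quorum_write(journals, edit) -> bool:
    """An edit succeeds only when a majority of journal nodes accept it.
    (A real QJM would also resynchronize lagging journals afterwards.)"""
    acks = sum(1 for j in journals if j.write(edit))
    return acks > len(journals) // 2

journals = [JournalNode(), JournalNode(), JournalNode(up=False)]
assert quorum_write(journals, "mkdir /a")        # 2 of 3 acked: committed

journals[1].up = False
assert not quorum_write(journals, "mkdir /b")    # 1 of 3 acked: not committed
```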
<h3 id="zookeeper">Zookeeper</h3>
<ul>
<li>The ZKFailoverController (ZKFC) is a ZooKeeper client that runs on each NameNode and is responsible for coordinating with ZooKeeper and for monitoring and managing the state of its NameNode.
<img loading="lazy" src="/images/notes/hdfs-2010/image-10.png"></li>
</ul>
<h3 id="failover-and-fencing">Failover and fencing</h3>
<ul>
<li>A <strong>Failover Controller</strong> manages the transition from the active NameNode to the Standby. The default implementation of the failover controller uses <strong>ZooKeeper</strong> to ensure that only <strong>one NameNode is active (Single Leader)</strong>. The Failover Controller runs as a lightweight process on each NameNode, monitors the NameNode for failures (<strong>Failure Detection</strong> using heartbeats), and triggers a failover when the active NameNode fails (<strong>New Leader Election</strong>).</li>
<li><strong>Graceful failover:</strong> For routine maintenance, an administrator can manually initiate a failover. This is known as a graceful failover, since the failover controller arranges an orderly transition from the active NameNode to the Standby.</li>
<li><strong>Ungraceful failover:</strong> In the case of an ungraceful failover, however, it is impossible to be sure that the failed NameNode has stopped running. For example, a slow network or a network partition can trigger a failover transition, even though the previously active NameNode is still running and thinks it is still the active NameNode.</li>
<li>The <strong>HA implementation</strong> uses the mechanism of <strong>Fencing</strong> to prevent this <strong>“split-brain”</strong> scenario and ensure that the previously active NameNode is prevented from doing any damage and causing corruption.</li>
</ul>
<h3 id="fencing">Fencing</h3>
<ul>
<li><strong>Fencing</strong> is the idea of putting a fence around a <strong>previously active NameNode (Old Leader)</strong> so that it cannot access cluster resources and stops serving any read/write requests. <strong>Two</strong> fencing techniques are commonly used: killing the old NameNode&rsquo;s process (e.g., over SSH) and revoking its access to the shared storage directory.</li>
</ul>
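<p>One way fencing is realized in practice is with epoch numbers, which is essentially how QJM keeps a deposed NameNode from writing: each newly elected active NameNode obtains a higher epoch, and journal nodes reject writes stamped with an older epoch. A minimal sketch (names are illustrative):</p>

```python
class FencedJournal:
    """Sketch of epoch-based fencing: journals remember the highest epoch
    they have promised to, and reject writes from any older epoch, so a
    previously active NameNode ("old leader") can do no damage."""

    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch) -> bool:
        # A new active NameNode must present a strictly higher epoch.
        if epoch <= self.promised_epoch:
            return False
        self.promised_epoch = epoch
        return True

    def write(self, epoch, edit) -> bool:
        if epoch < self.promised_epoch:
            return False             # stale writer is fenced off
        self.edits.append(edit)
        return True

journal = FencedJournal()
assert journal.new_epoch(1)                      # NameNode A becomes active
assert journal.write(1, "op from A")

assert journal.new_epoch(2)                      # failover: B takes over
assert not journal.write(1, "late op from A")    # A's writes are rejected
assert journal.write(2, "op from B")
```

This prevents the split-brain corruption described above even when the old active NameNode is still running and believes it is the leader.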
<h3 id="hdfs-characteristics">HDFS Characteristics</h3>
<p>Explore some important aspects of HDFS architecture.</p>
<h3 id="agenda-2">Agenda</h3>
<ul>
<li>Security and permission</li>
<li>HDFS federation</li>
<li>Erasure coding</li>
<li>HDFS in practice</li>
</ul>
<h3 id="security-and-permission">Security and permission</h3>
<ul>
<li>Permission model for files and directories similar to POSIX.</li>
<li>Each file and directory is associated with an <strong>owner</strong> and a <strong>group</strong>, with separate permissions for the owner, group members, and others, as in POSIX.</li>
<li>The same three permission types as POSIX: read (r), write (w), and execute (x).</li>
<li><strong>Optional support for POSIX ACLs</strong> to augment file permissions with finer-grained rules for named specific users or groups.</li>
</ul>
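<p>The owner/group/other check can be sketched as follows. This is a toy model of the POSIX-style rule, not the HDFS implementation: exactly one of the three permission classes applies, chosen by whether the requester is the owner, a member of the file&rsquo;s group, or neither.</p>

```python
def check_access(mode, file_owner, file_group, user, user_groups, want):
    """POSIX-style check: pick the owner, group, or 'other' bits for this
    user and test the requested access ('r', 'w', or 'x')."""
    if user == file_owner:
        bits = (mode >> 6) & 0o7       # owner bits
    elif file_group in user_groups:
        bits = (mode >> 3) & 0o7       # group bits
    else:
        bits = mode & 0o7              # other bits
    mask = {"r": 0o4, "w": 0o2, "x": 0o1}[want]
    return bool(bits & mask)

# A file with mode rw-r----- (0o640) owned by alice:analysts
assert check_access(0o640, "alice", "analysts", "alice", [], "w")
assert check_access(0o640, "alice", "analysts", "bob", ["analysts"], "r")
assert not check_access(0o640, "alice", "analysts", "bob", ["analysts"], "w")
assert not check_access(0o640, "alice", "analysts", "carol", [], "r")
```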
<h3 id="hdfs-federationnamenode-partitioning">HDFS federation(NameNode Partitioning)</h3>
<ul>
<li>The NameNode keeps the whole metadata in memory, which becomes a performance bottleneck for extremely large clusters, as does serving all metadata requests from a single node.</li>
<li>To solve this problem, <strong>HDFS Federation</strong> was introduced in HDFS 2.x.</li>
<li>It allows a cluster to scale by adding <strong>NameNodes</strong>, each of which manages a portion of the filesystem namespace, e.g., /user managed by NN1 and /root by NN2.</li>
<li>Under federation, multiple NameNodes can generate the same 64-bit block ID for their blocks.</li>
<li>To avoid this collision, each namespace uses one or more <strong>Block Pools</strong>, where a unique ID identifies each block pool in the cluster.</li>
<li>A block pool belongs to a single namespace and does not cross the namespace boundary.</li>
<li>The extended block ID, which is a tuple of <strong>(Block Pool ID, Block ID)</strong>, is used for block identification in HDFS Federation.</li>
</ul>
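<p>The extended block ID can be sketched as a simple tuple type. The pool IDs below are illustrative, not real HDFS identifiers: two NameNodes may assign the same 64-bit block ID, but the (block pool ID, block ID) pairs remain distinct.</p>

```python
from typing import NamedTuple

class ExtendedBlockID(NamedTuple):
    block_pool_id: str   # unique per block pool across the whole cluster
    block_id: int        # unique only within its own pool

# Two NameNodes happen to assign the same 64-bit block ID...
blk_nn1 = ExtendedBlockID("BP-1-ns-user", 0x1A2B)
blk_nn2 = ExtendedBlockID("BP-2-ns-root", 0x1A2B)

# ...but the extended IDs still don't collide.
assert blk_nn1.block_id == blk_nn2.block_id
assert blk_nn1 != blk_nn2
```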
<h3 id="erasure-coding">Erasure coding</h3>
<ul>
<li>By default, HDFS stores three copies of each block, resulting in a 200% overhead (to store two extra copies) in storage space and other resources (e.g., network bandwidth).</li>
<li><strong>Erasure Coding (EC)</strong> provides the same level of fault tolerance with much less storage space. In a typical EC setup, the storage overhead is no more than 50%.</li>
<li>This effectively <strong>doubles the usable storage capacity</strong> by bringing the storage factor down <strong>from 3x to 1.5x</strong>.</li>
<li>Under EC, data is broken down into fragments, expanded, encoded with redundant data pieces, and stored across different DataNodes.</li>
<li>If, at some point, data is lost on a DataNode due to corruption, etc., then it can be reconstructed using the other fragments stored on other DataNodes.</li>
<li>Although EC is <strong>more CPU intensive</strong>, it greatly reduces the storage needed for reliably storing a large data set.</li>
</ul>
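<p>The fragment/parity idea can be illustrated with the simplest possible code: a single XOR parity fragment, which tolerates one lost fragment. This is only a sketch of the principle; real HDFS uses Reed&ndash;Solomon schemes such as RS(6,3) (6 data + 3 parity fragments = 50% overhead, tolerating any 3 losses).</p>

```python
def xor_parity(fragments):
    """XOR all fragments together to produce one parity fragment."""
    parity = bytes(len(fragments[0]))
    for f in fragments:
        parity = bytes(a ^ b for a, b in zip(parity, f))
    return parity

def reconstruct(surviving, parity):
    """Recover the single missing data fragment: XOR of the survivors
    and the parity yields the lost fragment."""
    return xor_parity(surviving + [parity])

data = [b"aaaa", b"bbbb", b"cccc"]     # fragments spread across DataNodes
parity = xor_parity(data)              # stored on a fourth DataNode

lost = data[1]                         # the DataNode holding it failed
recovered = reconstruct([data[0], data[2]], parity)
assert recovered == lost
```

Storage arithmetic for this toy scheme: 3 data + 1 parity fragment is a 33% overhead, versus 200% for 3x replication, at the cost of CPU work on reconstruction.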
<h3 id="hdfs-in-practice">HDFS in practice</h3>
<ul>
<li>HDFS was primarily designed to support Hadoop MapReduce jobs by providing a distributed file system for Map and Reduce operations.</li>
<li>HDFS is now used with many big-data tools, e.g., several Apache projects built on top of Hadoop, including Pig, Hive, HBase, and Giraph, as well as GraphLab.</li>
<li><strong>Advantages of HDFS?</strong></li>
<li><strong>Disadvantages of HDFS?</strong></li>
</ul>
<h3 id="summary">Summary</h3>
<ul>
<li>Scalable distributed file system for large distributed data intensive applications.</li>
<li>Uses commodity hardware to reduce infrastructure costs.</li>
<li>POSIX-like (but not POSIX-compatible) APIs for file operations.</li>
<li>Random writes are not possible. Append-Only.</li>
<li>Unlike GFS, HDFS doesn&rsquo;t support multiple concurrent writers appending to the same file.</li>
<li>Single NameNode and Multiple DataNodes in initial architecture.</li>
<li>Files are broken into 128 MB Blocks identified by 64-bit Globally unique block ID.</li>
<li>Blocks are replicated to multiple machines (default 3, configurable) to provide redundancy: a 200% storage overhead, which can be reduced to 50% by using <strong>Erasure Coding</strong>.</li>
<li>DataNodes store blocks on local disks as Linux files.</li>
<li>The NameNode is the coordinator for the HDFS cluster and keeps track of all filesystem metadata.</li>
<li>The NameNode keeps all metadata in memory (for faster access). For fault tolerance (node crash), in-memory metadata changes are written to a Write-Ahead Log (<strong>EditLog</strong>). For disk-crash tolerance, the EditLog can be replicated to a remote file system (NFS), to the QJM (Quorum Journal Manager) in V2, or to a Secondary NameNode in V1.</li>
<li>The NameNode doesn&rsquo;t persist block replica locations. Instead, each DataNode reports which block replicas it holds at NameNode startup or when it joins the cluster, and the NameNode tracks DataNode liveness via heartbeats.</li>
<li><strong>FsImage</strong>: The NameNode checkpoints the EditLog into an <strong>FsImage</strong>, which is serialized to disk and replicated to other nodes, so on failover or NameNode restart it can quickly rebuild its state from the checkpoint plus the subsequent EditLog.</li>
<li>User applications interact with HDFS through the HDFS client, which contacts the NameNode for metadata and talks directly to DataNodes for read/write operations.</li>
<li>DataNodes and clients use <strong>checksums</strong> to validate the <strong>data integrity</strong> of blocks and inform the NameNode so that corrupted replicas can be repaired.</li>
<li><strong>Lazy Collection</strong>: A deleted file is renamed to a hidden name to be garbage-collected later.</li>
<li>HDFS is a <strong>strongly consistent</strong> FS: a write is declared successful only once it has been replicated to all the replicas.</li>
<li>Cache: For frequently accessed files, clients can register file paths with the NameNode so that their blocks are explicitly cached in the DataNodes&rsquo; memory in an <strong>off-heap block cache</strong> (GFS just uses Linux&rsquo;s <strong>buffer cache</strong>).</li>
</ul>
<h3 id="system-design-patterns">System Design Patterns</h3>
<ul>
<li><strong>Write Ahead Log</strong> - Fault Tolerance/Reliability.</li>
<li><strong>HeartBeat</strong></li>
<li><strong>Split Brain</strong></li>
<li><strong>CheckSum</strong> - Data Integrity.</li>
</ul>
<h3 id="reference">Reference</h3>
<ul>
<li>HDFS Paper</li>
<li>HDFS High Availability</li>
<li>HDFS Architecture</li>
<li>Distributed File Systems: A Survey</li>
</ul>
<hr>
<p><strong>Paper Link:</strong> <a href="https://pages.cs.wisc.edu/~akella/CS838/F15/838-CloudPapers/hdfs.pdf">https://pages.cs.wisc.edu/~akella/CS838/F15/838-CloudPapers/hdfs.pdf</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
