<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Append-Only-Log on Hemant Sethi</title>
    <link>https://www.sethihemant.com/tags/append-only-log/</link>
    <description>Recent content in Append-Only-Log on Hemant Sethi</description>
    <generator>Hugo -- 0.146.0</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 04 Dec 2024 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://www.sethihemant.com/tags/append-only-log/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Kafka</title>
      <link>https://www.sethihemant.com/notes/kafka-2011/</link>
      <pubDate>Wed, 04 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://www.sethihemant.com/notes/kafka-2011/</guid>
      <description>&lt;p&gt;&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href=&#34;https://notes.stephenholiday.com/Kafka.pdf&#34;&gt;Kafka&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;kafkadistributed-messaging-system&#34;&gt;Kafka/Distributed Messaging System&lt;/h2&gt;
&lt;h3 id=&#34;goal&#34;&gt;Goal&lt;/h3&gt;
&lt;p&gt;Design a &lt;strong&gt;distributed&lt;/strong&gt; &lt;strong&gt;messaging system&lt;/strong&gt; that can &lt;strong&gt;reliably&lt;/strong&gt; transfer a &lt;strong&gt;high throughput&lt;/strong&gt; of &lt;strong&gt;messages&lt;/strong&gt; between different entities.&lt;/p&gt;
&lt;h3 id=&#34;background&#34;&gt;Background&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;One common challenge in distributed systems is handling &lt;strong&gt;continuous influx of data from multiple sources&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;E.g. Imagine a &lt;strong&gt;log aggregation service&lt;/strong&gt; that receives hundreds of log entries per second from different sources. Its function is to store these logs on disk on a shared server and build an index over them so they can be searched later.&lt;/li&gt;
&lt;li&gt;What are the challenges of building such a service?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed Messaging Systems&lt;/strong&gt; (or the asynchronous processing paradigm) can help.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;what-is-a-messaging-system&#34;&gt;What is a messaging System?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;System responsible for &lt;strong&gt;transferring data amongst various disparate systems&lt;/strong&gt; like apps, services, processes, servers etc, w/o &lt;strong&gt;introducing additional coupling&lt;/strong&gt; b/w producers and consumers, and by providing &lt;strong&gt;asynchronous&lt;/strong&gt; way of communicating b/w sender and receiver.&lt;/li&gt;
&lt;li&gt;Two types of Messaging Systems&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;queue&#34;&gt;Queue&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A particular message can be consumed by only one consumer.&lt;/li&gt;
&lt;li&gt;Once a message is consumed, it&amp;rsquo;s removed from the queue.&lt;/li&gt;
&lt;li&gt;Limits the system, as the same message can’t be read by multiple consumers.
&lt;img loading=&#34;lazy&#34; src=&#34;https://www.sethihemant.com/images/notes/kafka-2011/image-1.png&#34;&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;publish-subscribe-messaging-system&#34;&gt;Publish-Subscribe Messaging System&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In the Pub-Sub model, messages are written into Partitions/Topics.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<p><strong>Paper:</strong> <a href="https://notes.stephenholiday.com/Kafka.pdf">Kafka</a></p>
<hr>
<h2 id="kafkadistributed-messaging-system">Kafka/Distributed Messaging System</h2>
<h3 id="goal">Goal</h3>
<p>Design a <strong>distributed</strong> <strong>messaging system</strong> that can <strong>reliably</strong> transfer a <strong>high throughput</strong> of <strong>messages</strong> between different entities.</p>
<h3 id="background">Background</h3>
<ul>
<li>One common challenge in distributed systems is handling <strong>continuous influx of data from multiple sources</strong>.</li>
<li>E.g. Imagine a <strong>log aggregation service</strong> that receives hundreds of log entries per second from different sources. Its function is to store these logs on disk on a shared server and build an index over them so they can be searched later.</li>
<li>What are the challenges of building such a service?</li>
<li><strong>Distributed Messaging Systems</strong> (or the asynchronous processing paradigm) can help.</li>
</ul>
<h3 id="what-is-a-messaging-system">What is a messaging System?</h3>
<ul>
<li>A system responsible for <strong>transferring data amongst various disparate systems</strong> like apps, services, processes, servers etc., w/o <strong>introducing additional coupling</strong> b/w producers and consumers, and by providing an <strong>asynchronous</strong> way of communicating b/w sender and receiver.</li>
<li>There are two types of messaging systems:</li>
</ul>
<h3 id="queue">Queue</h3>
<ul>
<li>A particular message can be consumed by only one consumer.</li>
<li>Once a message is consumed, it&rsquo;s removed from the queue.</li>
<li>Limits the system, as the same message can’t be read by multiple consumers.
<img loading="lazy" src="/images/notes/kafka-2011/image-1.png"></li>
</ul>
<h3 id="publish-subscribe-messaging-system">Publish-Subscribe Messaging System</h3>
<ul>
<li>
<p>In the Pub-Sub model, messages are written into Partitions/Topics.</p>
</li>
<li>
<p>Producers write the messages to topics that get persisted in the messaging system.</p>
</li>
<li>
<p>Subscribers subscribe to those topics to receive each message that was published.</p>
</li>
<li>
<p>Pub-Sub model allows multiple consumers to read the same message.</p>
</li>
<li>
<p>The messaging system that stores and handles messages is called a Broker.
<img loading="lazy" src="/images/notes/kafka-2011/image-2.png"></p>
</li>
<li>
<p>Provides a loose coupling b/w producers and consumers so they don’t need to be synchronized. They can read and write <strong>messages at different rates</strong>.</p>
</li>
<li>
<p>Also provides <strong>fault-tolerance</strong>. Messages don’t get lost.</p>
</li>
<li>
<p>A messaging system can be deployed for various reasons, several of which are covered in the use cases below.</p>
</li>
</ul>
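<p>The contrast with a queue can be sketched in a few lines of Python (a hypothetical in-memory topic, not Kafka’s API): because each subscriber tracks its own read position, every subscriber receives every published message.</p>

```python
# Hypothetical in-memory pub-sub topic (illustration only, not Kafka's API).
class Topic:
    def __init__(self):
        self.log = []        # messages are retained, not deleted on read
        self.offsets = {}    # each subscriber's independent read position

    def publish(self, message):
        self.log.append(message)

    def poll(self, subscriber):
        pos = self.offsets.get(subscriber, 0)
        new = self.log[pos:]                  # everything not yet seen
        self.offsets[subscriber] = len(self.log)
        return new

topic = Topic()
topic.publish("page_view")
# Unlike a queue, both subscribers receive the same message:
a = topic.poll("analytics")
b = topic.poll("search-indexer")
```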
<h3 id="kafka">Kafka</h3>
<h3 id="agenda">Agenda</h3>
<ul>
<li>What is Kafka</li>
<li>Background</li>
<li>Kafka Use Cases</li>
</ul>
<h3 id="what-is-kafka">What is Kafka?</h3>
<ul>
<li>Open source pub-sub messaging system</li>
<li>Can work as a message-queue as well.</li>
<li>Distributed, Fault tolerant, highly scalable by design.</li>
<li>Fundamentally, a system that takes streams of messages from producers, stores them reliably on a central cluster (with a set of brokers), and allows those messages to be delivered to consumers.
<img loading="lazy" src="/images/notes/kafka-2011/image-3.png"></li>
</ul>
<h3 id="background-1">Background</h3>
<ul>
<li>
<p>Created at LinkedIn in 2010 to track <strong>Page Views (events)</strong>, <strong>Messages</strong> from messaging systems, and <strong>Logs</strong> from various services.</p>
</li>
<li>
<p>Kafka is also known as a <strong>Distributed Commit log or Write Ahead Log or a Transaction Log.</strong></p>
</li>
<li>
<p><strong>Commit Log</strong> is an append-only data structure that can persistently store a sequence of records.</p>
</li>
<li>
<p>Records are always appended to the end of the log, and once added, records cannot be deleted or modified. Reading from a commit log always happens from left to right (or old to new).
<img loading="lazy" src="/images/notes/kafka-2011/image-4.png"></p>
</li>
<li>
<p>Stores all messages on disk and reads and writes take advantage of sequential disk reads/writes.</p>
</li>
</ul>
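<p>A minimal sketch of such a commit log (toy illustration only, not Kafka’s on-disk format): records are only ever appended, and reads proceed from a chosen offset to the end.</p>

```python
# Toy commit log: records are only ever appended; reads go old -> new.
class CommitLog:
    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        """Append a record and return its offset (position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int):
        """Sequential read from a given offset to the end (left to right)."""
        return self._records[offset:]

log = CommitLog()
log.append({"event": "signup"})
off = log.append({"event": "login"})
```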
<h3 id="kafka-use-cases">Kafka Use Cases</h3>
<ul>
<li>Can be used for collecting huge amounts of events (Big Data) and doing real-time stream processing of those events.</li>
<li><strong>Metrics:</strong> Can collect and aggregate monitoring data. Different services can write their metrics which can be later pulled from Kafka to produce aggregate statistics.</li>
<li><strong>Log Aggregation:</strong> Collect logs from various sources, and make them available in standard format to multiple consumers.</li>
<li><strong>Stream Processing:</strong> Cases where data undergoes transformation after reading. E.g. Raw data consumed from the topic is transformed, enriched, aggregated, and pushed to a new topic for further consumption. Sort of creating a derived view of data from the source of record data.</li>
<li><strong>Commit Log:</strong> Can be used as an external commit log for distributed systems which can keep track of their states.</li>
<li><strong>Website Activity Tracking:</strong> One of the original use cases was to build a user-activity tracking pipeline. Events like page clicks and searches are published to separate topics. These topics are then available for later processing, such as loading data into Hadoop (for batch processing) or into data-warehousing systems for analytics and reporting. They can also feed <strong>product suggestion or recommendation systems</strong> that power features like &ldquo;similar products you may like&rdquo; or &ldquo;people also bought&rdquo;.</li>
</ul>
<h3 id="high-level-architecture">High Level Architecture</h3>
<h3 id="agenda-1">Agenda</h3>
<ul>
<li>Kafka Common Terms</li>
<li>High-Level Architecture</li>
</ul>
<h3 id="kafka-common-terms">Kafka Common Terms</h3>
<ul>
<li>
<p><strong>Brokers</strong></p>
</li>
<li>
<p><strong>Records</strong>
<img loading="lazy" src="/images/notes/kafka-2011/image-5.png"></p>
</li>
<li>
<p><strong>Topics</strong>
<img loading="lazy" src="/images/notes/kafka-2011/image-6.png"></p>
</li>
<li>
<p><strong>Producers</strong></p>
</li>
<li>
<p><strong>Consumers</strong></p>
</li>
<li>
<p>In Kafka, producers and consumers are fully decoupled and agnostic of each other, which is a key design element to achieve the high scalability that Kafka is known for. For example, producers never need to wait for consumers.</p>
</li>
</ul>
<h3 id="high-level-architecture-1">High Level Architecture</h3>
<h3 id="kafka-cluster">Kafka Cluster</h3>
<ul>
<li>Kafka is run as a cluster of one or more servers, where each server is responsible for running one Kafka broker.</li>
</ul>
<h3 id="zookeeper">ZooKeeper</h3>
<ul>
<li><strong>ZooKeeper</strong> is a <strong>highly read optimized</strong> <strong>distributed key-value store</strong> and is used for coordination and storing configurations.</li>
<li>In the original version of Kafka, ZooKeeper was used to coordinate between the Kafka brokers; ZooKeeper maintains metadata about the Kafka cluster.
<img loading="lazy" src="/images/notes/kafka-2011/image-7.png"></li>
</ul>
<h3 id="kafka-deep-dive">Kafka Deep Dive</h3>
<p>Related Notes: Alex XU II</p>
<h3 id="agenda-2">Agenda</h3>
<ul>
<li>Topic Partitions</li>
<li>High-Water Mark</li>
</ul>
<p>Kafka is simply a collection of topics. As topics can get quite big, they are split into smaller partitions for better performance and scalability.</p>
<h3 id="topic-partitions">Topic Partitions</h3>
<ul>
<li>
<p>Kafka Topics are partitioned, and these partitions are placed on separate nodes/brokers.</p>
</li>
<li>
<p>When a new message is published to a topic, it gets appended to one of the topic’s partitions, usually decided using the producer-specified partition key.</p>
</li>
<li>
<p>A <strong>partition</strong> is an ordered sequence of messages.</p>
</li>
<li>
<p>Kafka guarantees <strong>FIFO ordering</strong> between messages within a single partition. There are no ordering guarantees across partitions or at the topic level.
<img loading="lazy" src="/images/notes/kafka-2011/image-8.png"></p>
</li>
<li>
<p>A unique sequence ID called a <strong>partition offset</strong> is assigned to every message added to a partition. It identifies a message’s sequential position within the partition.</p>
</li>
<li>
<p>Offset sequences are unique to a single partition. Messages are uniquely located using <strong>(Topic, Partition, Offset)</strong>.</p>
</li>
<li>
<p>Producers can choose to publish messages to any partition. If ordering within a partition is not needed, a <strong>round-robin strategy</strong> can be used to partition data evenly across nodes.</p>
</li>
<li>
<p>Placing partitions on separate brokers allows multiple consumers to read from a topic in parallel, i.e. different consumers can concurrently read different partitions on separate brokers. However, within a single consumer group, only one consumer can read from a given partition at any time.</p>
</li>
<li>
<p>Messages once written to a partition are immutable(Append Only Log).</p>
</li>
<li>
<p>A producer can specify a <strong>partition key</strong> on the messages it publishes so that related data is written to the same partition.</p>
</li>
<li>
<p>Each broker can manage a set of partitions from across various topics.
<img loading="lazy" src="/images/notes/kafka-2011/image-9.png"></p>
</li>
<li>
<p>Follows the principle of <strong>Dumb Broker</strong> and <strong>Smart Consumer</strong>.</p>
</li>
<li>
<p>Kafka doesn’t track which records have been read by a consumer. Consumers poll Kafka for new messages, specifying (via the partition offset) which records they want to read from the topic.</p>
</li>
<li>
<p>Consumers are allowed to increment/decrement the offset to replay and reprocess the messages.</p>
</li>
<li>
<p>Each topic partition has one leader broker and multiple replica (follower) brokers.</p>
</li>
</ul>
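<p>The key-based placement described above can be sketched as follows (the hash function here is illustrative; Kafka’s default partitioner actually uses murmur2):</p>

```python
# Sketch: map a message key to a partition so that the same key always
# lands on the same partition, preserving per-key ordering.
# (Illustrative: Kafka's Java client uses murmur2, not hashlib.)
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key is always routed to the same partition:
p1 = choose_partition(b"user-42", 8)
p2 = choose_partition(b"user-42", 8)
```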
<h3 id="leaders-and-followers">Leaders and Followers</h3>
<ul>
<li>A <strong>leader</strong> is the node responsible for all reads and writes for the given partition. Every partition has one Kafka broker acting as a leader.</li>
<li>To handle Single Point of Failure and to enable Fault Tolerance, Kafka replicates partitions and distributes them across multiple brokers.</li>
<li>Each follower’s responsibility is to replicate the leader&rsquo;s data to serve as a backup partition.</li>
<li>A follower can take over the leadership if the leader of a partition goes down.</li>
<li>Kafka stores the location of each partition’s leader in ZooKeeper.</li>
<li>As all writes and reads happen at the leader, in the original design producers and consumers talked to ZooKeeper directly to find a partition’s leader.
<img loading="lazy" src="/images/notes/kafka-2011/image-10.png"></li>
</ul>
<h3 id="in-sync-replicasisr">In Sync Replicas(ISR)</h3>
<ul>
<li>An in-sync replica (ISR) is a broker that has the latest data for a given partition.</li>
<li>A follower is an in-sync replica only if it has fully caught up to the partition it is following.</li>
<li><strong>Only ISRs are eligible to become partition leaders.</strong></li>
<li>Kafka can choose the minimum number of ISRs required before the data becomes available for consumers to read.</li>
</ul>
<h3 id="high-water-mark">High Water Mark</h3>
<ul>
<li>To ensure data consistency, the leader broker never returns (or exposes) messages which have not been replicated to a minimum set of ISRs.</li>
<li>Broker uses High Water Mark which is the <strong>highest offset that all ISRs of a particular partition share</strong>.</li>
<li>The leader exposes data only up to the high-water mark offset and propagates the high-water mark offset to all followers.</li>
<li>This avoids the case of a <strong>Non-Repeatable read</strong> in case the Leader crashes before Replicas get the latest messages.
<img loading="lazy" src="/images/notes/kafka-2011/image-11.png"></li>
</ul>
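<p>In other words, the high-water mark is simply the highest offset shared by all ISRs; a minimal sketch (the offsets below are illustrative):</p>

```python
# Sketch: the high-water mark is the highest offset that *every* in-sync
# replica has; the leader only exposes messages up to this offset.
def high_water_mark(isr_log_end_offsets):
    return min(isr_log_end_offsets)

# The leader has written up to offset 9, but one follower is at 7,
# so consumers may only read up to offset 7:
hwm = high_water_mark([9, 8, 7])
```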
<h3 id="consumer-groups">Consumer Groups</h3>
<h3 id="agenda-3">Agenda</h3>
<ul>
<li>What is a Consumer Group?</li>
<li>Distributing Partitions to a consumer within Consumer Groups.</li>
</ul>
<h3 id="what-is-a-consumer-group">What is a Consumer Group?</h3>
<ul>
<li>A consumer group is basically a set of one or more consumers working together in parallel to consume messages from topic partitions.</li>
<li>No two consumers within the same consumer group can attach to the same partition at a time; thus no two consumers in a group receive the same message.</li>
</ul>
<h3 id="distributing-partitions-to-a-consumer-within-consumer-groups">Distributing Partitions to a consumer within Consumer Groups.</h3>
<ul>
<li>
<p>Kafka ensures that only a single consumer reads messages from any partition within a consumer group.</p>
</li>
<li>
<p>Topic partitions are the unit of parallelism.</p>
</li>
<li>
<p>If a consumer stops, Kafka spreads its partitions across the remaining consumers in the same consumer group.</p>
</li>
<li>
<p>Every time a consumer is added to or removed from a group, the consumption is rebalanced within the group.
<img loading="lazy" src="/images/notes/kafka-2011/image-12.png"></p>
</li>
<li>
<p>Parallelizing processing across multiple partitions of a topic helps support very high throughput.</p>
</li>
<li>
<p>Kafka stores the current offset per consumer group per topic per partition? <strong>What? Initially we said Kafka is DUMB and that Consumer tracks the offset? [Research]</strong></p>
</li>
<li>
<p>Kafka uses any unused consumers as failovers when there are more consumers than partitions. Extra Consumers are idle in the meantime.</p>
</li>
<li>
<p>Rebalancing happens as consumers are added to and removed from consumer groups.</p>
</li>
</ul>
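<p>A rough sketch of spreading partitions across a group (illustrative round-robin; Kafka’s real assignor strategies are pluggable and more involved):</p>

```python
# Illustrative round-robin assignment of partitions to group members.
# Members beyond the partition count get nothing and sit idle as
# failovers, as described above.
def assign(partitions, members):
    assignment = {m: [] for m in members}
    for i, p in enumerate(partitions):
        assignment[members[i % len(members)]].append(p)
    return assignment

# 4 partitions, 2 consumers -> 2 partitions each:
a = assign([0, 1, 2, 3], ["c1", "c2"])
# If c2 leaves, a rebalance hands everything to c1:
b = assign([0, 1, 2, 3], ["c1"])
```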
<h3 id="kafka-workflow">Kafka Workflow</h3>
<h3 id="agenda-4">Agenda</h3>
<ul>
<li>Kafka Workflow as Pub-Sub messaging</li>
<li>Kafka Workflow for Consumer Group</li>
<li>Kafka provides both pub-sub and queue-based messaging in a fast, reliable, persistent, fault-tolerant, zero-downtime manner.</li>
<li>In both cases, producers simply send the message to a topic, and consumers can choose either type of messaging semantics depending on their needs.</li>
</ul>
<h3 id="kafka-workflow-as-pub-sub-messaging">Kafka Workflow as Pub-Sub Messaging</h3>
<ul>
<li>Producer publishes a message to a topic.</li>
<li>Broker stores messages in the partitions configured for that topic. If no partition keys were specified, Broker spreads the messages evenly across partitions.</li>
<li>A consumer subscribes to a specific topic. The broker provides the topic’s current offset to the consumer and saves that offset in ZooKeeper.</li>
<li>Consumers poll brokers at regular intervals for new messages and process them once Kafka delivers them.</li>
<li>Once the consumer processes a message, it sends an acknowledgement back to the broker, and the broker updates the processed offset in ZooKeeper.</li>
<li>Consumers can rewind/skip to the desired offset and read subsequent messages.</li>
</ul>
<h3 id="role-of-zookeeper">Role of Zookeeper</h3>
<h3 id="agenda-5">Agenda</h3>
<ul>
<li>What is ZooKeeper?</li>
<li>ZooKeeper as Central Coordinator.</li>
</ul>
<h3 id="what-is-zookeeper">What is ZooKeeper?</h3>
<ul>
<li>Distributed configuration and synchronization service.</li>
<li>Serves as the coordination interface between the Kafka brokers, producers, and consumers.</li>
<li>Kafka stores basic metadata in ZooKeeper, such as information about brokers, topics, partitions, partition leaders/followers, and consumer offsets.
<img loading="lazy" src="/images/notes/kafka-2011/image-13.png"></li>
</ul>
<h3 id="zookeeper-as-the-central-coordinatormight-be-stale-info">ZooKeeper as the central coordinator(Might be Stale info)</h3>
<ul>
<li>Kafka brokers are stateless; they rely on ZooKeeper to maintain and coordinate brokers, such as notifying consumers and producers of the arrival of a new broker or failure of an existing broker, as well as routing all requests to partition leaders.</li>
<li>Stores all sorts of Metadata about the Kafka Cluster</li>
</ul>
<h3 id="how-do-producers-or-consumers-find-out-who-the-leader-of-a-partition-is">How do producers or consumers find out who the leader of a partition is?</h3>
<ul>
<li>
<p>In the older versions of Kafka, all clients (i.e., producers and consumers) used to <strong>directly talk to ZooKeeper</strong> to find the partition leader.</p>
</li>
<li>
<p>Kafka has moved away from this coupling; in Kafka’s latest releases, <strong>clients fetch metadata information from Kafka brokers directly</strong>.
<img loading="lazy" src="/images/notes/kafka-2011/image-14.png"></p>
</li>
<li>
<p>All the critical information is stored in ZooKeeper, which replicates this data across its cluster; therefore, the failure of a Kafka broker (or a ZooKeeper node) does not affect the state of the Kafka cluster.</p>
</li>
<li>
<p>Zookeeper is also responsible for <strong>coordinating the partition leader election</strong> between the Kafka brokers in case of leader failure.</p>
</li>
</ul>
<h3 id="controller-broker">Controller Broker</h3>
<h3 id="agenda-6">Agenda</h3>
<ul>
<li>What is a Controller Broker?</li>
<li>Split Brain.</li>
<li>Generation Clock.</li>
</ul>
<h3 id="what-is-a-controller-broker">What is a Controller Broker?</h3>
<ul>
<li>Within the Kafka cluster, one broker is elected as the Controller.</li>
<li>The controller broker is responsible for admin operations such as creating/deleting topics, adding partitions, assigning partition leaders, and monitoring broker failures via health checks on other brokers.</li>
<li>Communicates the result of the partition leader election to other brokers in the system.</li>
</ul>
<h3 id="split-brain">Split Brain</h3>
<ul>
<li>When a controller node dies, Kafka elects a new controller. One problem is that we cannot truly know whether the old controller has stopped for good (crash-stop) or has experienced an intermittent failure such as a stop-the-world GC pause, a process pause, or a temporary network disruption.</li>
<li>Two split-brain controllers would be giving out conflicting commands in parallel. If something like this happens in a cluster, it can result in major inconsistencies. How do we handle this?</li>
</ul>
<h3 id="generation-clock">Generation Clock?</h3>
<ul>
<li>Split-brain is commonly solved with a generation clock, which is simply a monotonically increasing number to indicate a server’s generation.</li>
<li>In Kafka, the generation clock is implemented through an epoch number: the old controller has epoch 1, the new controller epoch 2.</li>
<li>This epoch is included in every request sent from the Controller to other brokers.</li>
<li>Brokers can then identify the real Controller by trusting the one with the highest epoch number.</li>
<li>This epoch number is stored in ZooKeeper.</li>
</ul>
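<p>The fencing check each broker performs can be sketched as follows (a hypothetical shape for illustration, not Kafka’s actual broker code):</p>

```python
# Sketch: a broker rejects requests from a stale ("zombie") controller by
# comparing the epoch in the request against the highest epoch it has seen.
class Broker:
    def __init__(self):
        self.highest_epoch_seen = 0

    def handle_controller_request(self, epoch: int) -> bool:
        if epoch < self.highest_epoch_seen:
            return False              # stale controller: ignore the command
        self.highest_epoch_seen = epoch
        return True                   # current controller: accept

b = Broker()
b.handle_controller_request(2)          # new controller, epoch 2: accepted
stale = b.handle_controller_request(1)  # old controller, epoch 1: rejected
```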
<h3 id="kafka-delivery-semantics">Kafka Delivery Semantics?</h3>
<h3 id="agenda-7">Agenda</h3>
<ul>
<li>Producer Delivery Semantics</li>
<li>Consumer Delivery Semantics</li>
</ul>
<h3 id="producer-delivery-semantics">Producer Delivery Semantics</h3>
<ul>
<li>A producer writes only to the leader broker, and the followers asynchronously replicate the data.</li>
<li>How can a producer know that the data is successfully stored at the leader or that the followers are keeping up with the leader?</li>
<li>Kafka offers <strong>three options</strong> to denote the number of brokers that must receive the record before the producer considers the write as successful:</li>
</ul>
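<p>These correspond to the producer’s acks setting (acks=0, acks=1, acks=all); a sketch of what each mode waits for (illustration only, not the real client API):</p>

```python
# Sketch of the producer `acks` setting (not a real Kafka client API):
# how many broker acknowledgements a write waits for before the producer
# treats it as successful.
def required_acks(mode: str, num_in_sync_replicas: int) -> int:
    if mode == "0":    # fire-and-forget: don't wait for any broker
        return 0
    if mode == "1":    # wait for the partition leader only
        return 1
    if mode == "all":  # wait for every in-sync replica (strongest durability)
        return num_in_sync_replicas
    raise ValueError(f"unknown acks mode: {mode!r}")

acks_needed = required_acks("all", 3)
```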
<h3 id="consumer-delivery-semantics">Consumer Delivery Semantics</h3>
<ul>
<li>A consumer can read only those messages that have been written to a set of in-sync replicas(<strong>High Water Mark</strong>).</li>
<li>There are three ways of providing consistency to the consumer:</li>
</ul>
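<p>The usual three are at-most-once, at-least-once, and exactly-once (the first two also appear in the open questions below). The difference between the first two comes down to when the offset is committed relative to processing; a sketch (illustration, not a real client):</p>

```python
# Sketch: at-most-once commits the offset *before* processing; at-least-once
# commits it *after*. A crash between the two steps loses the message in the
# first case and re-delivers it in the second.
def replay_after_crash(messages, committed_offset):
    """On restart, a consumer resumes from the last committed offset."""
    return messages[committed_offset:]

msgs = ["m0", "m1", "m2"]
# At-most-once: offset 2 was committed before "m1" was processed, then we
# crashed -> "m1" is skipped on restart:
at_most_once = replay_after_crash(msgs, 2)
# At-least-once: the crash happened before committing "m1" -> it is
# delivered (and processed) again on restart:
at_least_once = replay_after_crash(msgs, 1)
```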
<h3 id="kafka-characteristics">Kafka Characteristics</h3>
<h3 id="agenda-8">Agenda</h3>
<ul>
<li>Storing messages to disks</li>
<li>Record Retention in Kafka</li>
<li>Client Quota</li>
<li>Kafka Performance</li>
</ul>
<h3 id="storing-messages-to-disks">Storing messages to disks</h3>
<ul>
<li>Kafka writes its messages to the local disk and does not keep anything in RAM. Disk storage is important for durability so that the messages will not disappear if the system dies and restarts.</li>
<li>Even though disk access is generally considered to be slow, there is a huge performance difference b/w Random Block Access and Sequential Access.</li>
<li>Random block access is slower because of numerous disk seeks, whereas sequential writes and reads can be thousands of times faster than random access.</li>
<li>Because all writes and reads happen sequentially, <strong>Kafka has a very high throughput</strong>.</li>
<li>Writing or reading sequentially from disks are <strong>heavily optimized</strong> by the OS, via <strong>read-ahead</strong> (prefetch large block multiples) and <strong>write-behind</strong> (group small logical writes into big physical writes) techniques.</li>
<li>Also, modern operating systems cache the disk in free RAM. This is called Pagecache.</li>
<li>Since Kafka stores messages in a standardized binary format unmodified throughout the whole flow (producer → broker → consumer), it can make use of the zero-copy optimization.</li>
<li>Kafka has a protocol that groups messages together. This allows network requests to group messages together and reduces network overhead.</li>
</ul>
<h3 id="record-retention-in-kafka">Record Retention in Kafka</h3>
<ul>
<li>By default, Kafka retains records until it runs out of disk space. We can instead set time-based limits (a configurable retention period), size-based limits, or compaction (which keeps only the latest version of each record, by key).</li>
<li>For example, we can set a retention policy of three days, or two weeks, or a month, etc.</li>
<li>The records in the topic are available for consumption until discarded by time, size, or compaction.</li>
</ul>
<h3 id="client-quota">Client Quota</h3>
<ul>
<li>Heavy hitters (noisy neighbours) can exhaust broker resources or saturate the network in multi-tenant Kafka clusters, denying service to other clients and to the brokers themselves.</li>
<li>In Kafka, quotas are <strong>byte-rate</strong> thresholds defined per <strong>client-ID(application).</strong></li>
<li>The broker does not return an error when a client exceeds its quota but instead attempts to slow the client down by holding the client’s response for enough time to keep the client under the quota.</li>
<li>This also prevents clients from having to implement special back-off and retry behavior.</li>
</ul>
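<p>The hold-back time can be sketched as simple rate math (illustrative; not the broker’s exact algorithm):</p>

```python
# Sketch: how long to hold a response so a client's byte rate drops back
# under its quota.
def throttle_ms(bytes_sent: int, window_ms: int, quota_bytes_per_sec: float) -> float:
    observed = bytes_sent / (window_ms / 1000)   # observed bytes/sec
    if observed <= quota_bytes_per_sec:
        return 0.0                               # under quota: no delay
    # Extend the window just enough that bytes_sent / total_time == quota.
    target_ms = bytes_sent / quota_bytes_per_sec * 1000
    return target_ms - window_ms

# 2 MB sent in a 1 s window against a 1 MB/s quota -> hold for 1000 ms:
delay = throttle_ms(2_000_000, 1000, 1_000_000)
```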
<h3 id="kafka-performance">Kafka Performance</h3>
<ul>
<li>Scalability</li>
<li>Fault Tolerance and Reliability</li>
<li>Throughput</li>
<li>Low Latency?</li>
</ul>
<h3 id="system-design-pattern">System Design Pattern:</h3>
<ul>
<li>
<p><strong>High Water Mark</strong> - To deal with Non-Repeatable reads and data consistency.</p>
</li>
<li>
<p><strong>Leader and Follower</strong> - Leader serves read/writes. Followers do replication.</p>
</li>
<li>
<p><strong>Split-Brain</strong> - Multiple Controller nodes active at a time(due to Zombie Controller). Generational Epoch number to resolve.</p>
</li>
<li>
<p><strong>Segmented Log</strong> - Log segmentation to implement storage for its partitions.</p>
</li>
</ul>
<p><strong>References:</strong></p>
<ul>
<li>
<p>Confluent Docs</p>
</li>
<li>
<p>NYTimes usecase</p>
</li>
<li>
<p>Kafka Summit 2019</p>
</li>
<li>
<p>Kafka Acks explained(TODO)</p>
</li>
<li>
<p>Kafka as distributed log</p>
</li>
<li>
<p>Minimizing Kafka Latency(TODO)</p>
</li>
<li>
<p>Kafka Internal Storage(TODO)</p>
</li>
<li>
<p>Exactly once semantics(TODO)</p>
</li>
<li>
<p>Split Brain(TODO)</p>
</li>
</ul>
<p><strong>Open Questions:</strong></p>
<ul>
<li>
<p>Kafka stores the current offset per consumer group per topic per partition? <strong>What? Initially we said Kafka is DUMB and that Consumer tracks the offset? [Research]</strong></p>
</li>
<li>
<p>In At-most-once consumer delivery semantics, Why can’t the consumer read from the previous offset? Why are messages said to be lost?[Research]</p>
</li>
<li>
<p>Exactly once semantics? How would transactions happen across 2 systems(consumer processing + Kafka Offset Commit). How are they suggesting the transaction would be rolled back?</p>
</li>
<li>
<p>Zero Copy Optimization</p>
</li>
<li>
<p>Page Cache optimization Kafka</p>
</li>
<li>
<p>How does replication internal work b/w leader follower?</p>
</li>
<li>
<p>Tombstoning in Kafka.</p>
</li>
</ul>
<h2 id="1-zero-copy-optimizations--page-cache-in-kafka-and-other-systems"><strong>1️⃣ Zero Copy Optimizations &amp; Page Cache in Kafka and Other Systems</strong></h2>
<h3 id="-what-is-zero-copy"><strong>📌 What is Zero Copy?</strong></h3>
<p>Zero Copy is a <strong>kernel-level optimization</strong> that allows data to be transferred between disk and network <strong>without passing through user-space memory</strong>, <strong>reducing CPU overhead and increasing throughput</strong>.</p>
<p>🚀 <strong>Why is Zero Copy important?</strong></p>
<p>✔ <strong>Reduces CPU usage</strong> (since data isn’t copied multiple times).</p>
<p>✔ <strong>Minimizes context switches</strong> (between user and kernel space).</p>
<p>✔ <strong>Improves I/O throughput</strong> (as memory copying is avoided).</p>
<h3 id="-how-kafka-uses-zero-copy-sendfile-optimization"><strong>📌 How Kafka Uses Zero Copy (Sendfile Optimization)</strong></h3>
<p>Kafka uses <strong>Zero Copy via the sendfile system call</strong> in Linux.</p>
<p>🔹 <strong>Without Zero Copy (Traditional Path)</strong></p>
<ol>
<li>
<p>Kafka reads a log file from disk → <strong>(Disk → Kernel Space).</strong></p>
</li>
<li>
<p>The kernel copies data to Kafka’s user-space buffer → <strong>(Kernel Space → User Space).</strong></p>
</li>
<li>
<p>Kafka writes the buffer to a network socket → <strong>(User Space → Kernel Space → Network).</strong></p>
</li>
<li>
<p>The kernel sends data over the network.</p>
</li>
</ol>
<p>🔹 <strong>With Zero Copy (Optimized Path)</strong></p>
<ol>
<li>
<p>Kafka calls sendfile() → <strong>Kernel directly transfers a log file to the network socket</strong>.</p>
</li>
<li>
<p><strong>No user-space buffer required</strong> → Data goes <strong>directly from disk to network</strong>.</p>
<p>✔ <strong>Avoids unnecessary copies in user-space.</strong></p>
<p>✔ <strong>Greatly improves throughput</strong> (Kafka can achieve <strong>millions of messages per second</strong>).</p>
</li>
</ol>
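<p>On Linux, the same syscall is exposed in Python as os.sendfile; a minimal demonstration (file-to-file here for simplicity, whereas Kafka’s hot path is file-to-socket):</p>

```python
import os
import tempfile

# os.sendfile() asks the kernel to move bytes from one file descriptor to
# another without copying them through user space. Kafka uses the same
# syscall to move log segments from the page cache to a network socket.
src = tempfile.NamedTemporaryFile(delete=False)
src.write(b"kafka log segment bytes")
src.flush()

dst = tempfile.NamedTemporaryFile(delete=False)

with open(src.name, "rb") as fin, open(dst.name, "wb") as fout:
    # Transfer up to 64 KiB starting at offset 0; returns bytes sent.
    sent = os.sendfile(fout.fileno(), fin.fileno(), 0, 1 << 16)

with open(dst.name, "rb") as f:
    copied = f.read()
```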
<h3 id="-zero-copy-optimizations-in-other-systems"><strong>📌 Zero Copy Optimizations in Other Systems</strong></h3>
<p><em>[Table content - requires manual formatting]</em></p>
<h3 id="-what-is-page-cache-and-how-kafka-optimizes-it"><strong>📌 What is Page Cache and How Kafka Optimizes It?</strong></h3>
<p>Kafka <strong>doesn’t need a traditional database cache</strong>. Instead, it relies on the <strong>OS page cache</strong> for fast reads.</p>
<p>✔ <strong>Page Cache:</strong> The Linux kernel automatically caches recently read disk pages in memory.</p>
<p>✔ <strong>Kafka uses the Page Cache to serve reads directly from memory without hitting disk.</strong></p>
<p>🔹 <strong>How Page Cache Works in Kafka:</strong></p>
<ol>
<li>When a consumer reads a message, Kafka <strong>first checks the OS page cache</strong>.</li>
<li>If the data is cached, it is served <strong>directly from memory</strong> (zero disk I/O).</li>
<li>If the data isn’t in cache, Kafka reads it from disk, and the OS automatically caches it.</li>
</ol>
<p>🚀 <strong>Optimizations in Kafka:</strong></p>
<p>✔ <strong>Uses sendfile() to directly transfer from Page Cache to network.</strong></p>
<p>✔ <strong>Leverages sequential disk access (append-only logs) for high read efficiency.</strong></p>
<p>✔ <strong>Minimizes JVM heap memory usage by relying on OS caching.</strong></p>
<h2 id="2-how-replication-works-between-leaders-and-followers-in-kafka"><strong>2️⃣ How Replication Works Between Leaders and Followers in Kafka</strong></h2>
<p>Kafka ensures <strong>fault tolerance</strong> and <strong>high availability</strong> using <strong>replication</strong>.</p>
<h3 id="-basics-of-kafka-replication"><strong>📌 Basics of Kafka Replication</strong></h3>
<p>✔ Each <strong>Kafka topic is partitioned</strong>, and each partition has:</p>
<ul>
<li><strong>One Leader</strong> (handles all reads &amp; writes).</li>
<li><strong>One or more Followers</strong> (replicas of the leader’s data).</li>
</ul>
<h3 id="-steps-in-kafka-replication"><strong>📌 Steps in Kafka Replication</strong></h3>
<p>1️⃣ <strong>Producer writes data to the Leader Partition</strong>.</p>
<p>2️⃣ <strong>Leader appends data to its local log segment</strong>.</p>
<p>3️⃣ <strong>Followers fetch new data from the leader</strong>.</p>
<p>4️⃣ <strong>Followers append data to their own log segment</strong>.</p>
<p>5️⃣ <strong>Followers send an acknowledgment (ACK) once they persist the data</strong>.</p>
<p>6️⃣ <strong>Once all in-sync replicas have acknowledged the record, Kafka considers the message committed</strong>.</p>
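<p>The steps above can be sketched as a small simulation. This is illustrative pseudologic, not Kafka code: real followers <em>pull</em> from the leader, whereas here the leader pushes for brevity, and the commit point is modeled as the leader&rsquo;s high watermark advancing once the replicas acknowledge:</p>

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []

    def append(self, record):
        self.log.append(record)
        return True  # ACK after persisting

class Leader(Replica):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers
        self.high_watermark = 0  # offset up to which records are committed

    def produce(self, record):
        self.append(record)  # steps 1-2: write to the leader's log
        # steps 3-5: followers replicate and ACK (push here; pull in real Kafka)
        acks = [f.append(record) for f in self.followers]
        if all(acks):        # step 6: committed once the replicas have it
            self.high_watermark = len(self.log)
        return self.high_watermark

followers = [Replica("f1"), Replica("f2")]
leader = Leader("leader", followers)
leader.produce("event-1")
leader.produce("event-2")
print(leader.high_watermark)  # 2
print(followers[0].log)       # ['event-1', 'event-2']
```

<p>Consumers only ever see records below the high watermark, which is what makes a committed message safe to read.</p>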
<h3 id="-leader-and-follower-sync-mechanism"><strong>📌 Leader and Follower Sync Mechanism</strong></h3>
<p>✔ <strong>Kafka uses a pull-based replication model</strong> → Followers <strong>poll</strong> the leader to fetch new data.</p>
<p>✔ <strong>Offset tracking:</strong> Followers maintain an <strong>offset</strong> to track the latest committed message.</p>
<p>✔ <strong>ISR (In-Sync Replicas):</strong> Only replicas in sync with the leader are part of the ISR.</p>
<h3 id="-how-a-new-leader-is-elected"><strong>📌 How a New Leader is Elected?</strong></h3>
<p>✔ If the Leader fails, <strong>one of the ISR replicas is promoted</strong>.</p>
<p>✔ The new Leader <strong>starts serving read and write requests</strong>.</p>
<p>✔ If no ISR exists, the partition becomes <strong>temporarily unavailable</strong> until a new Leader is available.</p>
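<p>A minimal sketch of that election rule, assuming a hypothetical broker registry (names like <code>broker-1</code> are made up for illustration): promote the first live ISR member, or take the partition offline if none is available and unclean election is disabled.</p>

```python
def elect_leader(isr, replicas):
    """Promote an in-sync replica when the leader fails.

    Returns the new leader, or None if the partition must go offline
    (no ISR member is alive and unclean election is disabled).
    """
    for candidate in isr:
        if replicas.get(candidate, {}).get("alive"):
            return candidate
    return None

replicas = {
    "broker-1": {"alive": False},  # failed leader
    "broker-2": {"alive": True},
    "broker-3": {"alive": True},
}
isr = ["broker-1", "broker-2", "broker-3"]

# Exclude the failed leader, then pick the first live ISR member.
new_leader = elect_leader([r for r in isr if r != "broker-1"], replicas)
print(new_leader)  # broker-2
```

<p>Allowing an out-of-sync replica to win instead (unclean election) would restore availability sooner but could silently discard committed records.</p>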
<h3 id="-replication-strategies"><strong>📌 Replication Strategies</strong></h3>
<ul>
<li><strong>acks=0</strong> → Producer does not wait for any acknowledgment (fastest, may lose data).</li>
<li><strong>acks=1</strong> → Leader acknowledges after writing to its own log (followers may still lag).</li>
<li><strong>acks=all</strong> → Leader waits for all in-sync replicas to acknowledge (strongest durability).</li>
</ul>
<p>🚀 <strong>Tuning Replication Settings for Performance</strong></p>
<p>✔ <strong>min.insync.replicas = 2</strong> → Ensures durability (at least two replicas must ACK).</p>
<p>✔ <strong>unclean.leader.election.enable = false</strong> → Prevents out-of-sync replicas from becoming leader (avoids data loss).</p>
<p>✔ <strong>replica.lag.time.max.ms = 10000</strong> → Defines how long a follower may lag before it is removed from the ISR.</p>
<h2 id="3-tombstoning-in-kafka"><strong>3️⃣ Tombstoning in Kafka</strong></h2>
<p>Kafka <strong>Tombstoning</strong> is used for <strong>deleting records in log-compacted topics</strong>.</p>
<h3 id="-why-is-tombstoning-needed"><strong>📌 Why is Tombstoning Needed?</strong></h3>
<p>✔ Kafka <strong>never deletes data immediately</strong>.</p>
<p>✔ Instead, Kafka <strong>marks the record as deleted (tombstone message)</strong>.</p>
<p>✔ The actual data is <strong>removed later during log compaction</strong>.</p>
<h3 id="-how-tombstoning-works"><strong>📌 How Tombstoning Works</strong></h3>
<ol>
<li>Producer sends a <strong>null value for a key</strong> (marks it as deleted).</li>
<li>Kafka appends this <strong>tombstone message</strong> to the log.</li>
<li>The <strong>consumer sees the tombstone event</strong> and removes the record from its own storage.</li>
<li>Kafka’s <strong>log compaction</strong> eventually purges the tombstone message and the original record.</li>
</ol>
<h3 id="-example-of-tombstone-message"><strong>📌 Example of Tombstone Message</strong></h3>
<pre><code>{
  "key": "user_123",
  "value": null,
  "timestamp": 1700000000
}
</code></pre>
<hr>
<p>✔ This <strong>soft deletes</strong> &ldquo;user_123&rdquo;.</p>
<p>✔ <strong>Log compaction</strong> later removes both the <strong>original record and the tombstone</strong>.</p>
<h3 id="-how-log-compaction-works"><strong>📌 How Log Compaction Works</strong></h3>
<ul>
<li><strong>Log compaction keeps only the latest value for each key.</strong></li>
<li><strong>Tombstones stay in the log until Kafka compacts the segment.</strong></li>
<li><strong>Kafka guarantees at least one copy of the latest record is retained</strong> (even after compaction).</li>
</ul>
<p>🚀 <strong>Tuning Log Compaction</strong></p>
<p>✔ <strong>log.cleanup.policy = compact</strong> → Enables log compaction.</p>
<p>✔ <strong>delete.retention.ms = 86400000</strong> → Keeps tombstones for 24 hours before purging.</p>
<p>✔ <strong>log.segment.bytes</strong> → Controls segment size; only closed (non-active) segments are eligible for compaction.</p>
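<p>The compaction rules above can be sketched in a few lines. This is an illustrative simulation, not broker code: keep only the newest value per key, with tombstones (null values) surviving until a second pass models the <code>delete.retention.ms</code> window elapsing:</p>

```python
def compact(log, purge_tombstones=False):
    """Keep only the latest record per key.

    Tombstones (value=None) survive compaction until the
    delete.retention.ms window elapses; purge_tombstones=True
    models that second pass.
    """
    latest = {}
    for key, value in log:   # later records overwrite earlier ones
        latest[key] = value
    if purge_tombstones:
        latest = {k: v for k, v in latest.items() if v is not None}
    return latest

log = [
    ("user_123", "alice"),
    ("user_456", "bob"),
    ("user_123", "alice-v2"),
    ("user_123", None),      # tombstone: soft delete of user_123
]

print(compact(log))                         # {'user_123': None, 'user_456': 'bob'}
print(compact(log, purge_tombstones=True))  # {'user_456': 'bob'}
```

<p>The tombstone must stay visible for a while so that any consumer rebuilding its state from the compacted topic still observes the delete before the key disappears entirely.</p>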
<h2 id="-summary--key-takeaways"><strong>🔹 Summary &amp; Key Takeaways</strong></h2>
<h3 id="zero-copy--page-cache"><strong>Zero Copy &amp; Page Cache</strong></h3>
<p>✔ <strong>Kafka uses sendfile() for Zero Copy, avoiding unnecessary memory copies.</strong><br>
✔ <strong>The Page Cache stores recent messages, reducing disk I/O.</strong></p>
<h3 id="replication-between-leaders-and-followers"><strong>Replication Between Leaders and Followers</strong></h3>
<p>✔ <strong>Kafka uses asynchronous, pull-based replication for performance.</strong><br>
✔ <strong>ISR (In-Sync Replicas) ensures durability.</strong><br>
✔ <strong>New leaders are elected from the ISR in case of failure.</strong></p>
<h3 id="tombstoning--log-compaction"><strong>Tombstoning &amp; Log Compaction</strong></h3>
<p>✔ <strong>Kafka uses tombstones (null values) for soft deletes.</strong><br>
✔ <strong>Log compaction removes older records but keeps the latest one per key.</strong></p>
<hr>
<p><strong>Paper Link:</strong> <a href="https://notes.stephenholiday.com/Kafka.pdf">https://notes.stephenholiday.com/Kafka.pdf</a></p>
<hr>
<p><em>Last updated: March 15, 2026</em></p>
<p><em>Questions or discussion? <a href="mailto:sethi.hemant@gmail.com">Email me</a></em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
