Imagine this scenario: you are working on a Kafka cluster and everything is running perfectly, when suddenly it all goes wrong. All the graphs, charts, and dashboards go blank. You are totally in the dark, and so are your customers. Even if the outage lasts only a short while, you still risk the wrath of frustrated customers who have suffered data loss. For your organization, it could prove to be a terrible disaster, costing you valuable clients.
So, it is safe to say that no business wants to face a ‘Kafka Apocalypse’. How do we avoid it? To avoid it, we have to understand it.
In our last article, we learned some tips for tuning Kafka. Let’s explore how to manage a Kafka cluster and what we can do if we face the Kafka Apocalypse.
Here’s a Rundown of How Kafka Works
Kafka’s retention policy governs how it retains messages: you decide how much data, or for how long, data should be retained, and after that Kafka purges messages in order, regardless of whether they have been consumed. To do this, Kafka must regularly find the messages on disk that are due for purging. If each partition were a single very long file, that operation would be slow and error-prone, so each partition is instead split into segments.
Kafka writes to one segment per partition at a time, called the active segment. When the active segment reaches its size limit, it is closed and a new segment is opened, which becomes the new active segment. Segments are named by their base offset: the base offset of a segment is greater than the offsets in previous segments and less than or equal to the offsets in that segment.
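You can see this naming scheme directly on a broker’s disk. Below is a minimal sketch in plain Java (the directory /var/kafka-logs/orders-0 is a made-up example of a <log-dir>/<topic>-<partition> path) that lists a partition’s segment files and parses each base offset out of the file name:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class SegmentOffsets {
    public static void main(String[] args) throws IOException {
        // Hypothetical path: <log-dir>/<topic>-<partition>
        Path partitionDir = Paths.get("/var/kafka-logs/orders-0");

        try (Stream<Path> files = Files.list(partitionDir)) {
            files.filter(p -> p.toString().endsWith(".log"))
                 // Segment log files are named by their base offset,
                 // zero-padded to 20 digits, e.g. 00000000000000368769.log
                 .map(p -> p.getFileName().toString().replace(".log", ""))
                 .map(Long::parseLong)
                 .sorted()
                 // The highest base offset belongs to the active segment.
                 .forEach(base -> System.out.println("segment base offset: " + base));
        }
    }
}
```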
Each replicated partition has exactly one broker acting as its leader, and it is the leader that producers and consumers interact with.
Monitoring Kafka While Maintaining Sanity
There are three especially important things to monitor and alert on for Kafka clusters:
- Retention: how much data can be stored on disk for each topic partition?
- Replication: how many copies of the data are being made?
- Consumer lag: how far behind the producers are our consumer applications?
Here Are Some Tips for Managing a Kafka Cluster
Configure both space and time retention settings
While a cap on space protects you from overwhelming the server, an increase in volume can still mean you hold less data than you think: as traffic grows, the size cap is reached sooner, so messages may be deleted well before the time limit you configured.
To protect against this, you can write code that monitors the data rates of your partitions, their size on disk, and the current topic configurations, and sends that data to Insights. An Insights query for Kafka data retention then shows the ratio of actual bytes on disk to the configured maximum bytes for the topic (retention.bytes), broken down by topic partition. If the ratio climbs above 100%, data is being deleted by size instead of by time, which tells you the size cap, rather than the time limit, is governing retention.
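As a rough illustration of that ratio check, here is a minimal sketch using Kafka’s Java AdminClient. The bootstrap address, topic name, and broker IDs are placeholder assumptions, and it assumes a client recent enough (Kafka 2.6+) to offer describeLogDirs().allDescriptions():

```java
import java.util.*;
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionRatio {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        String topic = "orders"; // hypothetical topic

        try (Admin admin = Admin.create(props)) {
            // Read the configured size cap (retention.bytes) for the topic.
            ConfigResource res = new ConfigResource(ConfigResource.Type.TOPIC, topic);
            Config config = admin.describeConfigs(List.of(res)).all().get().get(res);
            long retentionBytes = Long.parseLong(config.get("retention.bytes").value());
            if (retentionBytes <= 0) { // default -1 means no size cap is set
                System.out.println("retention.bytes not set; nothing to compare");
                return;
            }

            // Read the actual size on disk of each replica, per broker log dir.
            Collection<Integer> brokerIds = List.of(0, 1, 2); // placeholder broker ids
            Map<Integer, Map<String, LogDirDescription>> dirs =
                    admin.describeLogDirs(brokerIds).allDescriptions().get();

            dirs.forEach((broker, logDirs) -> logDirs.forEach((path, desc) ->
                desc.replicaInfos().forEach((tp, info) -> {
                    if (tp.topic().equals(topic)) {
                        double ratio = 100.0 * info.size() / retentionBytes;
                        // Above 100%, the size cap, not the time limit, deletes data.
                        System.out.printf("broker %d %s: %.1f%% of retention.bytes%n",
                                broker, tp, ratio);
                    }
                })));
        }
    }
}
```

In practice you would ship these numbers to your monitoring system on a schedule rather than printing them, but the ratio computation is the same.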
Keep your replicated data in separate failure domains
Kafka’s throughput is limited by how fast you can write to your disks, so it’s important to monitor disk I/O and network traffic, and the two are usually correlated. You need to make sure the number of topic partitions is evenly distributed among the brokers, but since all topics aren’t created equal in terms of traffic, you’ll want to take per-topic traffic into account as well.
Balancing leadership means making sure the number of partitions a broker leads at any one time isn’t lopsided, because leaders do more work than followers: they serve all of the produce and consume traffic for their partitions.
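Here is a minimal sketch of checking leadership balance with the Java AdminClient; the bootstrap address is a placeholder, and allTopicNames() assumes a reasonably recent client (Kafka 3.1+):

```java
import java.util.*;
import org.apache.kafka.clients.admin.*;

public class LeaderBalance {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // List all topics, then count how many partitions each broker leads.
            Set<String> topics = admin.listTopics().names().get();
            Map<Integer, Integer> leadersPerBroker = new TreeMap<>();

            admin.describeTopics(topics).allTopicNames().get().values().forEach(desc ->
                desc.partitions().forEach(p -> {
                    if (p.leader() != null) { // partitions can be leaderless mid-failover
                        leadersPerBroker.merge(p.leader().id(), 1, Integer::sum);
                    }
                }));

            // A heavily lopsided count means some brokers do far more work.
            leadersPerBroker.forEach((broker, count) ->
                System.out.println("broker " + broker + " leads " + count + " partitions"));
        }
    }
}
```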
Make sure your lag monitoring still works, especially if all consumers stop committing
Consumer lag, the gap between the latest offset written to a partition and the latest offset a consumer group has committed, doesn’t come out of the box with Kafka; you have to compute it yourself. Be careful that your lag monitoring doesn’t depend on fresh commits: if every consumer in a group stops committing, a monitor that only watches committed offsets can go quiet exactly when lag is exploding.
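One way to compute lag yourself is to compare a group’s committed offsets against the log end offsets with the Java AdminClient. A minimal sketch, with a placeholder bootstrap address and a hypothetical group ID:

```java
import java.util.*;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        String groupId = "orders-consumer"; // hypothetical group

        try (Admin admin = Admin.create(props)) {
            // Committed offsets for the group, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId)
                     .partitionsToOffsetAndMetadata().get();

            // Log end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latest = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                admin.listOffsets(latest).all().get();

            // Lag = log end offset - committed offset. If the group stops
            // committing, committed offsets freeze while end offsets keep
            // growing, so lag climbs even though no commits arrive.
            committed.forEach((tp, offset) -> {
                long lag = ends.get(tp).offset() - offset.offset();
                System.out.println(tp + " lag: " + lag);
            });
        }
    }
}
```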
How to Avoid Your Own Kafka Apocalypse
Here are some tips to avoid your own Kafka Apocalypse:
- Ensure that you monitor both space and time retention.
- Keep as little transient data around as possible, but have a quick way to increase retention when you need it.
- Balance your cluster by I/O and leadership, accounting for broker failures.
- Use multi-tiered replication alerts to detect problems, and use dashboards and queries to investigate them (see the under-replication sketch after this list).
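As one concrete example of a replication alert, the following sketch flags under-replicated partitions, that is, partitions whose in-sync replica set is smaller than their assigned replica set (the bootstrap address is again a placeholder):

```java
import java.util.*;
import org.apache.kafka.clients.admin.*;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            Set<String> topics = admin.listTopics().names().get();
            admin.describeTopics(topics).allTopicNames().get().forEach((topic, desc) ->
                desc.partitions().forEach(p -> {
                    // A partition is under-replicated when fewer replicas are
                    // in sync than are assigned to it.
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("UNDER-REPLICATED %s-%d: %d/%d in sync%n",
                                topic, p.partition(), p.isr().size(), p.replicas().size());
                    }
                }));
        }
    }
}
```

A multi-tiered version of this alert might warn when one replica drops out of sync and page when the in-sync set shrinks to a single replica.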
So, those are some tips for managing a Kafka cluster and acting on a Kafka Apocalypse. The final part of this series of Kafka improvement tips digs into the details of the role of an offset.
About the Author
Lakshman Dhullipalla has 14 years of information technology experience covering all phases of systems and software implementation. Over the last 13 years, he has worked as a Solution Architect, as a Platform Administrator for Hadoop clusters, Kafka, Informatica and Talend ETL tools, and Oracle databases, and as an ETL consultant in the Business Technology Solutions, Enterprise Application Integration, Data Migration, and Data Integration domains. Apart from this long-standing experience, Lakshman loves reading technical books, listening to music, farming, and watching movies.