5 signs you've outgrown Apache Kafka

Struggling with low throughput, high latencies, and data loss? It might be time for an upgrade

May 3, 2023

Apache Kafka® reshaped the world of event streaming back in the day, but many organizations are now focusing heavily on mobile, AI, and edge applications that can process trillions of events daily. Kafka simply can’t keep up with today’s ultra-intensive data requirements, leading organizations to seek Kafka alternatives for high-performance, high-throughput, low-latency event streaming at a much lower cost.

In this post, we’ll walk through five warning signs that your organization has outgrown Kafka, and introduce you to Redpanda—a streaming data platform for developers built from the ground up with a native Kafka API and engineered to eliminate complexity, maximize performance, and reduce total costs.

With that, let’s get into the five signs that you’ve outgrown Apache Kafka.

#1: Your data is unbalanced in your cluster

When you produce data to Kafka, the producer client chooses which partitions to write to, and every write is served by that partition’s leader. If partition leadership is unevenly spread across the cluster, the brokers that lead the most partitions become overloaded. This skew can degrade consumption performance, put extra load on resources, and cause disk capacity issues that can end in broker failure.

The common approach to redistributing partitions in open-source Kafka is to either use the kafka-reassign-partitions tool or run and manage an additional platform called Cruise Control. Neither option is fun, and each has its own drawbacks.

Using kafka-reassign-partitions is a two-step process:

  • Identify the movement plan and externalize it in a configuration file that can (and should) be analyzed.
  • Execute the partition reassignment. During this process, there can be performance issues, especially if the throughput and skew are significant.
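
As a rough illustration of the first step, the sketch below builds a movement plan in the JSON layout that kafka-reassign-partitions reads from its reassignment file (a version field plus a list of topic, partition, and replicas entries). The topic name, broker IDs, and the balancing heuristic are all made up for this example. It only reorders replicas so that a less-loaded broker becomes the preferred leader; a real plan would usually move replicas between brokers as well, and should still be reviewed before it is executed.

```python
import json
from collections import Counter


def leader_counts(assignment):
    """Count preferred leaders (the first replica in each list) per broker."""
    counts = Counter({b: 0 for replicas in assignment.values() for b in replicas})
    for replicas in assignment.values():
        counts[replicas[0]] += 1
    return counts


def build_plan(assignment):
    """Return a reassignment plan that evens out preferred leadership.

    `assignment` maps (topic, partition) to an ordered replica list.
    Hypothetical heuristic: if the current preferred leader leads at least
    two more partitions than the least-loaded replica, put that replica first.
    """
    counts = leader_counts(assignment)
    moves = []
    for (topic, partition), replicas in sorted(assignment.items()):
        leader = replicas[0]
        best = min(replicas, key=lambda b: (counts[b], b))
        if counts[leader] - counts[best] > 1:
            counts[leader] -= 1
            counts[best] += 1
            reordered = [best] + [b for b in replicas if b != best]
            moves.append(
                {"topic": topic, "partition": partition, "replicas": reordered}
            )
    return {"version": 1, "partitions": moves}


if __name__ == "__main__":
    # Illustrative cluster: broker 1 is the preferred leader for 5 of 6 partitions.
    example = {
        ("orders", 0): [1, 2, 3],
        ("orders", 1): [1, 3, 2],
        ("orders", 2): [1, 2, 3],
        ("orders", 3): [2, 3, 1],
        ("orders", 4): [1, 3, 2],
        ("orders", 5): [1, 2, 3],
    }
    # Write this output to a file, review it, then pass it to the tool.
    print(json.dumps(build_plan(example), indent=2))
```

Keeping the plan in a file is what makes the "analyze before you execute" step possible: the file can be diffed, reviewed, and trimmed before any data or leadership moves.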

Cruise Control is a complex platform that rebalances continuously. Heavy Kafka shops rely on Cruise Control and have a lot of success with it. The downside is that the compendium of Cruise Control configurations can be daunting and usually requires much analysis at the outset, as well as monitoring and tweaking over time to find just the right configs for your workloads.

There are “goals” that make some of these configurations easier, but you still need to determine which goals best suit your needs. Cruise Control also means maintaining an additional platform (adding to the complexity of an already sprawling infrastructure), yet some consider it an absolute requirement because of the issues presented by the more manual approach of kafka-reassign-partitions.

#2: The risk of data loss is dangerously high

In the early days of Redpanda, we conducted a survey to understand the pain points of Kafka, and approximately 5% of respondents named data loss. This raised some eyebrows because Kafka has long been considered resilient and durable.

Many Kafka users don’t know that when they receive an acknowledgment from Kafka, it has only been committed to the page cache of the leader and then replicated to the page cache of the followers in the in-sync replica (ISR) set. This means that if there is a coordinated failure for any reason (regardless of multi-AZ replication with rack awareness), the data that has not been synced to disk can go missing.

Platform architects who know about this issue tend to request an acknowledgment after every message, assuming this will help their durability. However, the improvement is minimal. What determines how often data in the page cache reaches disk is the flush interval on the Kafka brokers. The producer flush method only controls sending the message and synchronously getting an ack for it on the client side.
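
To make the gap between "acknowledged" and "on disk" concrete, here is a toy model in Python. It is not Kafka code: ToyReplica, flush_every, and the crash behavior are simplifications invented for this illustration, with flush_every loosely standing in for a broker-side flush setting. Every message is acknowledged as soon as all replicas hold it in memory, and a coordinated crash then wipes whatever has not been flushed.

```python
class ToyReplica:
    """Hypothetical replica: appends land in memory and reach disk on flush."""

    def __init__(self, flush_every):
        self.flush_every = flush_every
        self.page_cache = []  # acknowledged but not yet durable
        self.disk = []        # survives a crash

    def append(self, message):
        self.page_cache.append(message)
        if len(self.page_cache) >= self.flush_every:
            self.flush()

    def flush(self):
        self.disk.extend(self.page_cache)
        self.page_cache.clear()

    def crash(self):
        # Anything still in memory is gone after a power loss.
        self.page_cache.clear()


def produce(replicas, messages):
    """Send each message synchronously; ack once every replica has it in memory."""
    acked = []
    for message in messages:
        for replica in replicas:
            replica.append(message)
        acked.append(message)
    return acked


def lost_after_coordinated_crash(flush_every, n_messages, n_replicas=3):
    """Return acknowledged messages that no replica has on disk after all crash."""
    replicas = [ToyReplica(flush_every) for _ in range(n_replicas)]
    acked = produce(replicas, range(n_messages))
    for replica in replicas:
        replica.crash()
    survivors = set().union(*(replica.disk for replica in replicas))
    return [m for m in acked if m not in survivors]


if __name__ == "__main__":
    print(lost_after_coordinated_crash(flush_every=4, n_messages=10))
    print(lost_after_coordinated_crash(flush_every=1, n_messages=10))
```

In this model the producer waits for an ack after every single message in both runs. Only the replica-side flush_every value changes how many acknowledged messages disappear, which is the point the paragraph above makes about client-side flushing.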

Although the scenario above is rare (and often perceived as an “edge case”), any risk of data loss can limit the adoption of Kafka among application teams that have a hard requirement for durability, such as those running payment processing, order management, and other operational or transaction-oriented use cases.

#3: Costs and complexity are ballooning

Here at Redpanda, we regularly encounter customers with Kafka implementations that have ballooned to hundreds of nodes and require several full-time employees to keep them running optimally.

For these platform teams, Kafka simply becomes prohibitive in cloud or infrastructure spend and in administration effort, and it grows unmanageable and unsustainable as demand for streaming data skyrockets.

Even in the most basic cluster, in addition to the sheer volume of nodes, you have to manage Apache ZooKeeper™ (or a collection of quorum servers if you’re using KRaft) as well as the Kafka cluster itself. That gives you two failure domains to reason about for high availability (HA), and to diagnose and fix when there are problems.

If you’re using a Schema Registry or an HTTP Proxy, those additional components have their own HA patterns in their own contexts to diagnose and fix on failure. Add Cruise Control and monitoring extensions to the mix, and you have a proper “Kafkaesque” beast!

#4: Latency is too poor for real-time applications

Historically, analytics has been the primary target use case for Kafka, due to its core value proposition of being able to “fan in” from many different sources and “fan out” to many different targets in a single topic/partition. In the analytics space, periodic latency measured in seconds doesn’t cause much concern. But if you’re building application architecture patterns on Kafka, it can become problematic as end-to-end latency accumulates.

Adding brokers, or adding resources to existing brokers, typically doesn’t make a significant impact on latency (which is why we see clusters in the wild consisting of hundreds of brokers). So, you’re left with the option of trading durability for performance, using configurations like acks=1 or similar.

#5: Efforts to scale throughput keep falling short

Some Kafka platform owners would say adoption is a “good problem to have,” but that excitement quickly wanes when peak throughput starts pushing towards hundreds of megabytes per second. The common approach to address scale is to deploy additional clusters, which leads to additional complexity, longer work hours, poor latency, and unreliable clusters. You’re suddenly playing distributed systems whack-a-mole. Hit 1 GB/s, and you’re deploying hundreds of brokers and suffering from a migraine!

Another challenge with high throughput is disk capacity. If you don’t want to configure extremely short retention times for your topics, you need to continue to add more nodes, deploy a cluster with much larger disks, or add more clusters.
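
A back-of-the-envelope calculation shows how quickly throughput turns into disk. The sketch below assumes decimal units (1 MB = 10^6 bytes), a replication factor of 3, no compression, and no free-space headroom. Those assumptions, the function names, and the 8 TB-per-broker figure are illustrative only and do not come from any benchmark.

```python
import math


def required_disk_tb(throughput_mb_s, retention_hours, replication_factor=3):
    """Total disk (in TB) needed to retain the stream across all replicas."""
    total_bytes = (
        throughput_mb_s * 1e6 * retention_hours * 3600 * replication_factor
    )
    return total_bytes / 1e12


def brokers_needed(total_tb, usable_tb_per_broker):
    """Minimum broker count if every broker contributes the same usable disk."""
    return math.ceil(total_tb / usable_tb_per_broker)


if __name__ == "__main__":
    # 1 GB/s sustained with 24-hour retention: roughly 259 TB across the cluster.
    total = required_disk_tb(throughput_mb_s=1000, retention_hours=24)
    print(f"{total:.1f} TB total")
    # With a hypothetical 8 TB of usable disk per broker.
    print(brokers_needed(total, usable_tb_per_broker=8), "brokers for storage alone")
```

Halving retention halves the footprint, which is why short retention times, bigger disks, more nodes, or more clusters are the only levers available when all data has to live on broker-local storage.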

Simplify deployments, reduce costs, and scale easily with Redpanda

If you found yourself nodding at any of the five signs above, then it’s time for an upgrade.

Redpanda has you covered. Its event-driven architecture taps into your hardware’s full potential to give you higher throughputs at lower latencies—without sacrificing your event streaming platform’s reliability or durability. Redpanda also offers a simple but powerful Kafka web console and built-in Prometheus endpoints for visibility into your data streams.

Some call us the “next-generation Kafka,” and for good reason. Here’s a quick rundown of how Redpanda levels up your event streaming experience:

  • Single binary: Redpanda is deployed as a self-contained, single binary. It’s JVM-free, ZooKeeper™-free, deploys in minutes, spins up in seconds, and runs efficiently wherever you develop — whether that’s on containers, laptops, x86 and ARM hardware, edge platforms, or cloud instances.
  • 10x faster tail latency: Written from scratch in C++, with a completely different internal architecture than Kafka, Redpanda is designed to keep latencies consistent and low. In the Redpanda vs. Kafka performance benchmark, Redpanda performs at least 10 times faster than Kafka at tail latencies (p99.99).
  • Higher throughputs: Within the Redpanda Community, it’s common to see hundreds of megabytes per second workloads on just a few nodes. Redpanda’s Tiered Storage capability allows you to set lower retention times on your local volumes while offloading your log segments to object storage.
  • Durability: Redpanda is Jepsen-verified for data safety. It enlists the Raft consensus protocol to manage its replicated log. This gives you sound primitives for configuration and data replication and provides data safety at any scale—without sacrificing performance.
  • Intelligent rebalancing: Redpanda offers continuous data and leadership balancing capabilities to spare you from running onerous processes.

The bottom line is that Redpanda gives you a better offering for far less cost in terms of infrastructure, licensing, and person-hours to support. In fact, our Redpanda vs. Kafka report shows you can reduce your streaming data costs by up to 6x! Redpanda is also fully compatible with the Kafka API ecosystem, so it works natively with your Kafka streaming apps and tools without any code changes.

To see it in action, take Redpanda for a test drive! You can also join the Redpanda Community on Slack to talk directly with our engineers, or contact us to get started on your move to a faster, simpler streaming data platform.
