8 tips to optimize your Amazon MSK spend
How to balance performance, resources, and operational costs more efficiently on MSK — or simply switch to Redpanda BYOC
It’s a running joke in the tech world that optimizing your Amazon Web Services (AWS) spend requires professional help. Add the inherent complexity of Apache Kafka®, and you need to be both an AWS architect and a Kafka expert to tune your Amazon Managed Streaming for Apache Kafka (MSK) deployments for minimum cost and maximum performance.
At Redpanda, we’re obsessively building a drop-in replacement for Kafka to simplify and optimize streaming data for everyone — from enterprise data engineers to the casual developer.
In this post, we’ll walk you through eight practical tips to optimize your MSK deployment and help you take control of your MSK bill. We’ll divide this handy advice into three categories:
- Operational costs – keeping the lights on
- Performance – balancing cost with performance
- Resource utilization – balancing cost with business needs
Optimizing your MSK operational costs
The best place to start is a holistic approach to cost optimization by looking at your team, your relationship with AWS, and your internal IT practices.
- Know the market rates for Kafka expertise
Since MSK doesn’t come with access to Kafka engineers and experts, consider hiring in-house Kafka expertise, or at least contracting it, to help perform upgrades, run performance tests, and interpret operational metrics.
Currently, the average Kafka expert can run you $140k/year. If you need part-time help, you can contact your AWS account manager for an introduction to the various AWS Service Delivery partners. Expect around $200/hr as a starting point for part-time Kafka expertise, so budget accordingly.
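To put those two numbers side by side, here is a quick back-of-the-envelope sketch. It only uses the figures quoted above and treats the $140k as salary alone (no benefits, recruiting, or overhead), which is an assumption on our part; the function name is purely illustrative.

```python
def break_even_hours(annual_salary: float, hourly_rate: float) -> float:
    """Contract hours per year that cost as much as one full-time salary."""
    return annual_salary / hourly_rate


FULL_TIME_SALARY = 140_000  # USD/year, figure quoted above (salary only, an assumption)
CONTRACT_RATE = 200  # USD/hour, starting point quoted above

hours = break_even_hours(FULL_TIME_SALARY, CONTRACT_RATE)
print(f"Break-even: {hours:.0f} contract hours/year (~{hours / 52:.1f} hours/week)")
# prints "Break-even: 700 contract hours/year (~13.5 hours/week)"
```

In other words, if you expect to need more than roughly a day and a half of Kafka help per week, a full-time hire may be the cheaper route.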
Or...the simplest and most cost-efficient approach is to switch to Redpanda BYOC — a fully managed cloud service that includes 24/7 support so you can leave the heavy lifting to the experts. All the benefits of fully managed cloud — with none of the tedious ops and maintenance.
Not convinced yet? Okay, keep reading.
- Negotiate a volume discount for your AWS bill
AWS’s MSK instances are not eligible for the sustained usage discounts that generic EC2 compute gets. Typically, customers save 30-40% when deploying EC2 capacity with one- or three-year commitments (regardless of payment terms). Sadly, this isn’t available for MSK instances.
From speaking with customers, one option we’ve seen is to negotiate account-wide discounts for committed spend. If you’re a large enough organization, your CIO (or equivalent business owner) might be able to push for a generic discount to help reduce MSK costs across the board. Otherwise, expect to pay the list price.
Or...you can deploy Redpanda BYOC in your AWS environment and take advantage of those 30-40% sustained usage discounts as Redpanda deploys generic im4gn instance types — in addition to whatever you might negotiate company-wide! Even if you decide to use Redpanda Dedicated instances, you can still burn down your AWS commit with Redpanda Cloud.
Warming up to BYOC now? There’s more.
- Have a monitoring plan (and use what you monitor)
Assuming you’ve taken the previous advice to bring in Kafka expertise, you’ll want them to create an observability strategy for your MSK logs and send operational metrics to Amazon CloudWatch.
Keep in mind that the dashboards and data you collect and send to CloudWatch warrant their own cost estimation. Avoid collecting metrics you have no plans to alert on or incorporate into your operational practices, as they only add overhead costs.
Heads up: only the DEFAULT level MSK metrics are free. Per-broker and per-topic monitoring will contribute to your CloudWatch bill. Plus, don’t forget you’ll need to monitor Apache ZooKeeper™ nodes.
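As a rough sketch of how you might keep the monitoring level an explicit, reviewed choice, the helper below builds the arguments for the MSK `UpdateMonitoring` API call. The level names are the ones the MSK API accepts; the helper names, the ARN, and the version string are hypothetical placeholders, and the actual `boto3` call is left as a comment so the snippet runs without AWS credentials.

```python
# Enhanced monitoring levels accepted by the MSK API, least to most detailed.
MONITORING_LEVELS = (
    "DEFAULT",
    "PER_BROKER",
    "PER_TOPIC_PER_BROKER",
    "PER_TOPIC_PER_PARTITION",
)


def is_billable(level: str) -> bool:
    """Per the note above, only DEFAULT-level metrics are free."""
    return level != "DEFAULT"


def monitoring_request(cluster_arn: str, current_version: str, level: str = "DEFAULT") -> dict:
    """Build keyword arguments for the MSK UpdateMonitoring call (illustrative helper)."""
    if level not in MONITORING_LEVELS:
        raise ValueError(f"unknown monitoring level: {level!r}")
    return {
        "ClusterArn": cluster_arn,
        "CurrentVersion": current_version,
        "EnhancedMonitoring": level,
    }


# Placeholder ARN and version; substitute your own cluster's values.
request = monitoring_request("arn:aws:kafka:REGION:ACCOUNT:cluster/EXAMPLE", "K1EXAMPLE", "PER_BROKER")
if is_billable(request["EnhancedMonitoring"]):
    print("Heads up: this level adds metrics to your CloudWatch bill")

# With AWS credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("kafka").update_monitoring(**request)
```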
Or...deploy Redpanda BYOC and let our SRE and Support team monitor critical infrastructure metrics, like storage utilization and availability. You can still do capacity planning with our provided Prometheus endpoint using your preferred tools — without worrying about carrying a pager or monitoring ZooKeeper. This is how we can provide a 99.99% uptime SLA.
Note: At the time of writing, Amazon MSK doesn’t support in-broker KRaft and requires deploying ZooKeeper nodes for cluster operations.
Optimizing your MSK performance
Right-sizing and tuning your MSK cluster helps you avoid falling into the common trap of paying for over-provisioned hardware. Here are a few tips to set you up for success.
- Follow best practices for right-sizing a new cluster
One of AWS’s Principal Streaming Architects wrote a comprehensive guide on best practices for sizing MSK. It’s 20 pages long, but it does a great job of objectively showing how to arrive at sizing, given all the factors in an MSK cluster. (A must-read for your previously mentioned Kafka engineer!)
Or...chat with a Redpanda Solutions Engineer. We’re here to help you minimize your total cost of ownership (TCO) while providing a seamless drop-in replacement for Kafka. No code to change, just replace and run as normal.
- Tune your existing deployment
Give yourself more headroom with your existing deployment and avoid the need to scale to keep future costs under control. Chances are there are numerous areas where you can extend the life of your existing footprint. Just make sure you optimize broker threads in your cluster if you’ve changed instance types, and tune things like log recovery to get better startup time in the event of unclean shutdowns. The “MSK best practices” docs page is a must-read.
Or...deploy Redpanda BYOC and take advantage of pre-tuned configurations, managed by the Redpanda SRE team, and don’t lose sleep wondering what you’re “leaving on the table” in terms of performance. Simple.
- Scale with intent
Auto-scaling Kafka-based systems helps address increases in demand, but scaling a cluster back down comes with caveats (assuming you’re not using a Serverless offering). If you let MSK auto-expand storage as demand increases, it can’t scale back down. Specifically, EBS volumes can’t be shrunk once grown, so make sure you grow horizontally first!
And don’t forget to run a Cruise Control instance to help automate partition reassignment when you grow your MSK cluster. Otherwise, you’ll need to remember to manually reassign partitions around the topology, or your new brokers won’t help shoulder the load.
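If you do end up reassigning partitions by hand, the sketch below shows roughly what that involves: it generates a plan in the JSON format that Kafka’s `kafka-reassign-partitions.sh` tool accepts. The round-robin placement is a deliberately naive assumption of ours; it ignores rack placement and where replicas live today, so it can move far more data than a tool like Cruise Control would. The topic name and broker IDs are made up for illustration.

```python
import json


def round_robin_plan(topic: str, num_partitions: int, broker_ids: list[int], replication_factor: int) -> dict:
    """Spread each partition's replicas across all brokers, new ones included."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    partitions = []
    for partition in range(num_partitions):
        # Start each partition one broker further along so load is spread evenly.
        replicas = [
            broker_ids[(partition + offset) % len(broker_ids)]
            for offset in range(replication_factor)
        ]
        partitions.append({"topic": topic, "partition": partition, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Hypothetical example: a 6-partition topic after growing from 3 to 6 brokers.
plan = round_robin_plan("orders", num_partitions=6, broker_ids=[1, 2, 3, 4, 5, 6], replication_factor=3)
print(json.dumps(plan, indent=2))
```

You would save the output to a file and pass it to the reassignment tool with `--reassignment-json-file`, ideally with throttling enabled so the data movement doesn’t starve production traffic.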
Or...let Redpanda Cloud handle everything with no need for Cruise Control! Not only does Redpanda have a Serverless offering comparable to MSK’s, but we can also handle scaling your cluster up and down for you in a predictable manner, relying primarily on cloud-native object storage to keep a handle on TCO.
Optimizing utilization for business requirements
The classic conundrum is bridging the gap between business requirements and IT capabilities. It’s easy for one to drive the other, but it’s not always ideal. With Kafka, you typically need to identify two primary concerns: data retention (storage) and data rates (throughput).
- Work with the business to identify data retention vs. value
It’s common for storage to dominate Kafka deployment costs as either producer throughputs increase or the business asks for longer data retention periods.
MSK adopted a version of Tiered Storage for Kafka, allowing for long-term data retention without relying on EBS volumes. This provides cost savings vs. vertically scaling your broker storage volumes but comes with a caveat: not only does MSK charge a premium for the backing EC2 instances, it also charges a markup for using Amazon S3 as Tiered Storage. In fact, MSK bills at 3x the S3 list price!
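To see why retention drives storage cost, here is a minimal sizing sketch. It assumes steady ingress, no compression, and no headroom, that broker (EBS) storage holds one copy per replica, and that the remote tier holds a single copy of the full retention window. All of these are simplifying assumptions of ours, and the workload numbers are invented for illustration. Multiply the results by your own per-GB rates to get a dollar figure.

```python
def retained_gb(ingress_mb_per_s: float, retention_hours: float, copies: int) -> float:
    """Data held at steady state, in decimal GB (1 GB = 1000 MB)."""
    return ingress_mb_per_s * 3600 * retention_hours * copies / 1000


def storage_split(ingress_mb_per_s: float, total_retention_hours: float,
                  local_retention_hours: float, replication_factor: int = 3) -> dict:
    """Compare broker-only storage with a local + remote (tiered) layout."""
    local_hours = min(local_retention_hours, total_retention_hours)
    return {
        # Everything on broker volumes, replicated.
        "broker_only_gb": retained_gb(ingress_mb_per_s, total_retention_hours, replication_factor),
        # Tiered: a short, replicated hot window on brokers...
        "tiered_local_gb": retained_gb(ingress_mb_per_s, local_hours, replication_factor),
        # ...plus one copy of the full window in object storage (assumption).
        "tiered_remote_gb": retained_gb(ingress_mb_per_s, total_retention_hours, 1),
    }


# Hypothetical workload: 10 MB/s ingress, 7 days total retention, 6 hours kept locally.
sizes = storage_split(10, total_retention_hours=7 * 24, local_retention_hours=6)
for name, gb in sizes.items():
    print(f"{name}: {gb:,.0f} GB")
```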
Or...you can use Redpanda’s Tiered Storage! Not only does it use Amazon S3, but it pays standard S3 billing rates, one-third of what MSK charges. It supports whole cluster restore in the event of a disaster and even lets you elastically scale consumer workloads with Remote Read Replicas.
- Model your end-to-end data throughputs
Kafka’s design decouples producers from consumers, allowing for highly scalable data architectures. While decoupled, it’s important to understand the full data lifecycle end to end, from production to consumption, as modern clouds charge for network traffic. Luckily, MSK does not charge for data replication between brokers, but this does not extend to the producer and consumer interactions with the cluster!
If you deploy MSK with multiple availability zones or multiple regions, you need to understand the implications of cross-zone or cross-region networking on the clients. This can be difficult to model if you don’t control the application design. However, Kafka provides a rack-awareness concept to allow clients to prioritize their “local brokers.” AWS helpfully provides a lengthy guide on using it with MSK.
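As a starting point for that modeling, the sketch below estimates how much client traffic crosses zone boundaries. It assumes clients and partition leaders are spread evenly across zones, that producers always write to the partition leader, and that rack-aware consumers (those with `client.rack` set, on a cluster whose brokers use Kafka’s rack-aware replica selector) read entirely from a replica in their own zone. The throughput numbers are invented; plug in your own and multiply by your per-GB transfer rate.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600


def cross_az_fraction(num_azs: int) -> float:
    """Share of traffic that leaves the client's zone when leaders are evenly spread."""
    if num_azs < 1:
        raise ValueError("need at least one availability zone")
    return (num_azs - 1) / num_azs


def monthly_cross_az_gb(produce_mb_per_s: float, consume_mb_per_s: float,
                        num_azs: int = 3, rack_aware_consumers: bool = False) -> float:
    """Estimated cross-zone client traffic per month, in decimal GB."""
    fraction = cross_az_fraction(num_azs)
    # Producers always write to the leader, wherever it lives.
    produce = produce_mb_per_s * fraction
    # Rack-aware consumers are assumed to fetch from a same-zone replica.
    consume = 0.0 if rack_aware_consumers else consume_mb_per_s * fraction
    return (produce + consume) * SECONDS_PER_MONTH / 1000


# Hypothetical workload: 30 MB/s produced, 60 MB/s consumed, three zones.
baseline = monthly_cross_az_gb(30, 60)
rack_aware = monthly_cross_az_gb(30, 60, rack_aware_consumers=True)
print(f"Without rack awareness: {baseline:,.0f} GB/month across zones")
print(f"With rack-aware consumers: {rack_aware:,.0f} GB/month across zones")
```

Under these assumptions, rack awareness removes the consumer share of cross-zone traffic but leaves the producer share untouched, which is why it matters most for fan-out-heavy workloads.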
Or...you guessed it: just use Redpanda BYOC! Redpanda natively supports follower fetching and, while we can’t offer you free intra-cluster networking, the cost savings across the board dwarf what you might spend in cross-AZ traffic.
If you don’t believe us, read how we saved India’s largest social media company 70% on cloud spend thanks to follower fetching, Tiered Storage, and a stellar support team!
The simplest way to optimize MSK? Switching to Redpanda BYOC
There’s a lot to learn when it comes to optimizing your MSK deployment, so hopefully these tips help you start bringing your MSK spend under control. On the other hand, if you’re ready to start saving on both costs and complexity with a simpler, more affordable streaming data architecture, check out our blog on how to migrate from MSK to Redpanda.
If you feel overwhelmed just at the thought of migrating, you can always offload the heavy lifting to trusted experts. Just sign up for a free trial of Redpanda BYOC and our team will take it from there.