The Kafka cloud—options and best practices
Apache Kafka® enables businesses to speed up their data’s “time-to-value” and develop applications based on real-time events. Though Kafka is powerful, on-premises deployments introduce challenges. Configuring and running Kafka clusters requires provisioning machines, ensuring availability, protecting data, setting up monitoring, and scaling against changes in load.
One alternative is to run Kafka as a managed service in the cloud. Third-party cloud providers take over infrastructure concerns so businesses can develop and run applications without deep Kafka expertise. Instead of managing infrastructure, you get more time to create business value.
Kafka cloud allows companies to expedite data processing, drive down hardware and maintenance costs, and increase the speed of real-time insight. This article explores the advantages, deployment options, and best practices for running Kafka in the cloud.
Comparison table for Kafka cloud options
Before delving into the details, let's summarize the popular options for running Kafka in the cloud.
Kafka on AWS cloud
There are two ways to run Kafka in AWS: Amazon MSK and self-managed Kafka on EC2. For teams that prefer to manage the Kafka infrastructure themselves, deploying Kafka on EC2 instances gives more control but also adds more operational responsibility. For example, you will need to:
- Provision EC2 instances and configure Kafka clusters.
- Manage scaling manually or by setting auto-scaling policies.
- Implement security features like VPC, IAM, and encryption.
- Perform regular maintenance tasks, including patching and upgrading Kafka.
Hence, most companies prefer Amazon MSK, a fully managed service for running Kafka in the cloud. MSK can automatically scale your Kafka clusters to adapt to changes in your workload, guaranteeing high availability and performance. For efficient data processing, you also get tight, out-of-the-box integration across several AWS services—such as Amazon S3, Lambda, and Redshift.
Create an Amazon MSK Cluster
Let’s create an MSK cluster named NewMessagingCluster with three broker nodes located in different subnets for high availability.
aws kafka create-cluster \
--cluster-name "NewMessagingCluster" \
--broker-node-group-info file://new-brokernodegroupinfo.json \
--kafka-version "2.8.0" \
--number-of-broker-nodes 3
The new-brokernodegroupinfo.json file specifies the subnets and the security group for the broker nodes, as shown below.
{
"InstanceType": "kafka.m5.large",
"BrokerAZDistribution": "DEFAULT",
"ClientSubnets": [
"subnet-0123456789444abcd",
"subnet-0123456789555abcd",
"subnet-0123456789666abcd"
],
"SecurityGroups": [
"sg-0123456789abcdef0"
]
}
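MSK distributes brokers evenly across the client subnets you provide, so the number of broker nodes must be a multiple of the number of subnets (three brokers across three subnets works; four does not). As a minimal sketch, the hypothetical helper below checks a broker node group config against that rule before you call create-cluster:

```python
# Sketch: sanity-check an MSK broker node group config before creating a cluster.
# validate_broker_config is an illustrative helper, not part of any AWS SDK.
import json

def validate_broker_config(config: dict, number_of_broker_nodes: int) -> list:
    """Return a list of problems found (an empty list means the config looks sane)."""
    problems = []
    subnets = config.get("ClientSubnets", [])
    if not subnets:
        problems.append("at least one client subnet is required")
    # MSK spreads brokers evenly, so the node count must be a
    # multiple of the number of client subnets.
    elif number_of_broker_nodes % len(subnets) != 0:
        problems.append(
            f"{number_of_broker_nodes} brokers cannot be spread evenly "
            f"across {len(subnets)} subnets"
        )
    if not config.get("SecurityGroups"):
        problems.append("no security groups specified")
    return problems

config = json.loads("""
{
  "InstanceType": "kafka.m5.large",
  "BrokerAZDistribution": "DEFAULT",
  "ClientSubnets": ["subnet-0123456789444abcd",
                    "subnet-0123456789555abcd",
                    "subnet-0123456789666abcd"],
  "SecurityGroups": ["sg-0123456789abcdef0"]
}
""")
print(validate_broker_config(config, 3))  # [] -- three brokers, three subnets
print(validate_broker_config(config, 4))  # reports the evenness problem
```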
You can add specific configurations for the Kafka cluster, such as auto topic creation, ZooKeeper timeout (for older Kafka versions), and log roll settings.
aws kafka create-configuration \
--name "MyCustomConfiguration" \
--description "Custom configuration for MSK cluster." \
--kafka-versions "2.8.0" \
--server-properties file://new-configuration.txt
The above command creates a custom MSK configuration named "MyCustomConfiguration." The new-configuration.txt file defines the server properties for the custom configuration, as shown below.
auto.create.topics.enable = true
zookeeper.connection.timeout.ms = 3000
log.roll.ms = 604800000
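These three properties enable automatic topic creation, set a 3-second ZooKeeper connection timeout, and roll log segments every 7 days (604,800,000 ms). As a minimal sketch, the illustrative helper below parses such a properties file into a dict so values can be checked programmatically before uploading the configuration:

```python
# Sketch: parse Kafka-style server properties into a dict.
# parse_server_properties is an illustrative helper, not a Kafka or AWS API.
def parse_server_properties(text: str) -> dict:
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

properties = """
auto.create.topics.enable = true
zookeeper.connection.timeout.ms = 3000
log.roll.ms = 604800000
"""

props = parse_server_properties(properties)
# 604800000 ms is exactly 7 days
assert int(props["log.roll.ms"]) == 7 * 24 * 60 * 60 * 1000
print(props)
```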
To view details about an existing cluster, use the describe-cluster command.
aws kafka describe-cluster \
--cluster-arn arn:aws:kafka:us-east-1:123456789012:cluster/new-demo-cluster/1234abcd-5678-efgh-ijkl-5678mnopqrst
Configuration and scaling
Setting up automatic scaling for Amazon MSK involves two main steps: registering a scalable target and creating an auto-scaling policy.
The register-scalable-target command specifies which resource should be auto-scaled. For Amazon MSK, this is typically the storage volume size per broker. The command below registers the storage volume size per broker of the specified MSK cluster as a scalable target with Application Auto Scaling.
aws application-autoscaling register-scalable-target \
--service-namespace kafka \
--scalable-dimension kafka:broker-storage:VolumeSize \
--resource-id arn:aws:kafka:us-east-1:123456789012:cluster/demo-cluster/6357e0b2-0e6a-4b86-a0b4-70df934c2e31-5 \
--min-capacity 100 \
--max-capacity 800
The put-scaling-policy command defines how scaling occurs based on specific metrics. Here, it uses target tracking to adjust the storage volume size. Target tracking automatically adjusts resources in response to changes in a predefined target metric.
aws application-autoscaling put-scaling-policy \
--policy-name KafkaStorageScalingPolicy \
--service-namespace kafka \
--scalable-dimension kafka:broker-storage:VolumeSize \
--resource-id arn:aws:kafka:us-east-1:123456789012:cluster/demo-cluster/6357e0b2-0e6a-4b86-a0b4-70df934c2e31-5 \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration file://target-tracking-policy.json
The target-tracking-policy.json file contains a configuration that ensures that when storage utilization reaches 60%, AWS automatically scales the storage volume size up by a predefined amount.
{
"TargetValue": 60.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "KafkaBrokerStorageUtilization"
},
"ScaleOutCooldown": 300,
"ScaleInCooldown": 0
}
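The core idea of target tracking can be sketched in a few lines: act when the metric exceeds the target, but respect the scale-out cooldown so one burst does not trigger repeated scaling. The function below mimics, rather than reimplements, the decision Application Auto Scaling makes; its name and signature are illustrative only.

```python
# Sketch of the target-tracking decision: scale out when measured storage
# utilization exceeds the target, unless we are still inside the cooldown
# window from the previous scale-out.
def should_scale_out(utilization_pct: float, target_pct: float,
                     seconds_since_last_scale: int, cooldown_s: int) -> bool:
    if seconds_since_last_scale < cooldown_s:
        return False  # still cooling down from the previous scale-out
    return utilization_pct > target_pct

# Above the 60% target and past the 300 s cooldown: scale out.
print(should_scale_out(72.0, 60.0, 400, 300))  # True
# Above target but still inside the cooldown window: wait.
print(should_scale_out(72.0, 60.0, 120, 300))  # False
```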
High availability
You can distribute Kafka brokers across multiple availability zones by specifying client subnets that sit in different zones. With BrokerAZDistribution set to "DEFAULT", MSK spreads the broker nodes evenly across the availability zones of those subnets.
{
"InstanceType": "kafka.m5.large",
"BrokerAZDistribution": "DEFAULT",
"ClientSubnets": [
"subnet-0123456789444abcd",
"subnet-0123456789555abcd",
"subnet-0123456789666abcd"
],
"SecurityGroups": [
"sg-0123456789abcdef0"
]
}
Security
With MSK, you can use AWS IAM policies and roles for secure access management. The example policy below grants access to all Kafka actions on all resources; in production, scope the actions and resources down to what each principal actually needs.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "kafka:*",
"Resource": "*"
}
]
}
Kafka on Azure cloud
Like AWS, you can run Kafka on Azure VMs and manage it yourself, which carries the same operational responsibilities as self-managed Kafka on EC2.
The other alternative is HDInsight Kafka, a managed service that runs Kafka on Azure. Azure provides infrastructure management so you can focus on processing data and developing applications. You can integrate with other Azure cloud services like Blob Storage, Azure Functions, and SQL Data Warehouse to store, process, and analyze data. Azure Monitor and Azure Log Analytics help monitor and manage your Kafka clusters directly from the Azure portal.
The diagram below shows an HDInsight Kafka cluster in the Azure resource group MyKafkaResourceGroup. The cluster, MyKafkaCluster, has two head nodes, four worker nodes, and three ZooKeeper nodes, all running Standard_E4_v3 VMs. It uses a storage account, mykafkastorageacct, with the container kafkacontainer to store data. This setup provides high availability, efficient data processing, and ease of management through Azure's managed services.
Next, let's look at how to create it.
Create a cluster
You can create a cluster for HDInsight Kafka as follows:
- Open the Azure portal and create a new HDInsight cluster.
- In the Cluster type, choose "Kafka."
- Then, configure the necessary settings: cluster name, resource group, region, and cluster size.
Optionally, you can also use an Azure Resource Manager (ARM) template to automate the creation of a Kafka cluster. Once created, configure the necessary resources like virtual machines, storage, virtual networks, and subnets. Integrate with Azure Active Directory to ensure secure access and authentication.
Another way is to use Azure CLI.
Log in to Azure and create a resource group. For example, the command below creates a resource group named MyKafkaResourceGroup in the East US region.
az login
az group create --name MyKafkaResourceGroup --location eastus
Next, create a storage account named mykafkastorageacct.
az storage account create \
--name mykafkastorageacct \
--resource-group MyKafkaResourceGroup \
--location eastus \
--sku Standard_LRS
Then, create a storage container named kafkacontainer in the storage account.
az storage container create \
--name kafkacontainer \
--account-name mykafkastorageacct
Extract the primary key for the storage account and store it in a variable.
STORAGE_KEY=$(az storage account keys list \
--resource-group MyKafkaResourceGroup \
--account-name mykafkastorageacct \
--query '[0].value' \
--output tsv)
Finally, create an HDInsight Kafka cluster named MyKafkaCluster with four worker nodes.
az hdinsight create \
--name MyKafkaCluster \
--resource-group MyKafkaResourceGroup \
--type kafka \
--component-version kafka=2.3 \
--http-password MyKafkaPassword1! \
--http-user admin \
--ssh-password MyKafkaSSHPassword1! \
--ssh-user sshuser \
--storage-account mykafkastorageacct \
--storage-account-key $STORAGE_KEY \
--storage-container kafkacontainer \
--location eastus \
--workernode-count 4 \
--headnode-size Standard_E4_v3 \
--workernode-size Standard_E4_v3 \
--zookeepernode-size Standard_E4_v3
Reliability and high availability
You can achieve reliability and high availability with zone redundancy, failover mechanisms, and backup & restore. Zone redundancy ensures that Kafka broker nodes are distributed across multiple availability zones within a region to mitigate risk.
The command below creates a Kafka cluster with zone redundancy. The option --zones 1 2 3 ensures that the nodes are distributed across three availability zones.
az hdinsight create --name my-hdinsight-cluster --resource-group my-resource-group \
--type kafka --location eastus2 --version 4.0 --component-version Kafka=2.1 \
--workernode-count 4 --workernode-data-disks-per-node 2 --workernode-size Standard_D3_v2 \
--headnode-size Standard_D3_v2 --zookeepernode-size Standard_D3_v2 --vnet-name my-vnet \
--subnet-name my-subnet --storage-account my-storage-account --workernode-disk-size 1024 \
--headnode-disk-size 1024 --zookeepernode-disk-size 512 --zones 1 2 3
Failover mechanisms automatically redirect traffic to healthy broker nodes when a node fails, ensuring continuous operation; managed Kubernetes services, for example, reschedule failed broker pods automatically. Kafka-level configurations such as replication factors and in-sync replicas further increase fault tolerance, ensuring that the service remains available and no data is lost during a disruption.
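The replication arithmetic is worth spelling out: with replication factor RF and min.insync.replicas minISR, a topic keeps accepting acks=all writes while up to RF − minISR of a partition's replicas are offline. A minimal sketch of that rule of thumb, using an illustrative helper name:

```python
# Sketch: how many broker failures a topic tolerates for acks=all writes,
# given its replication factor and min.insync.replicas.
def write_failures_tolerated(replication_factor: int, min_insync_replicas: int) -> int:
    if min_insync_replicas > replication_factor:
        raise ValueError("min.insync.replicas cannot exceed the replication factor")
    return replication_factor - min_insync_replicas

# The common production setting, RF=3 with min ISR=2, tolerates one broker loss.
print(write_failures_tolerated(3, 2))  # 1
```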
Backup and restore capabilities ensure data durability and recovery in case of data loss. You can implement this with various options, such as the open source tool Velero for backup and restore of Kubernetes cluster resources or an Azure Data Factory pipeline for backup and restore into Azure Blob storage.
More deployment options in the cloud
We covered the benefits and specific configurations for running Kafka as a managed service on AWS (Amazon MSK) and Azure (HDInsight Kafka). These managed services abstract away many of the underlying complexities of deploying and managing Kafka clusters. However, if you require more control or have specific needs, you will want to explore other deployment options like virtual machines, Kubernetes, or serverless Kafka.
Remember that deploying Kafka on VMs or Kubernetes is similar to running self-managed Kafka on EC2 or Azure VMs. You have full control over the infrastructure but are now completely responsible for managing and maintaining it.
We’ll now walk through these options in detail.
Kafka on VMs
Deploying Kafka on VMs makes it fully configurable in terms of infrastructure, allowing customization in configuration and resource management. However, this method requires substantial manual configuration processes to scale, maintain, or monitor.
Kubernetes (K8s)
Kubernetes simplifies the deployment and management of Kafka clusters through automation and orchestration. Helm, a Kubernetes package manager, simplifies this setup with pre-configured charts. This method offers automatic scaling and resource management but requires expertise in Kubernetes. Here is how you can install and configure Kafka clusters using K8s.
First, ensure Helm is installed on your Kubernetes cluster. If Helm is not installed, you can run:
curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
Add the Helm repository for Kafka.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
Install Kafka using the Helm chart:
helm install my-kafka bitnami/kafka
This command deploys Kafka with default settings. You can customize the deployment by modifying values in the Helm chart. Check the status of your Kafka deployment.
kubectl get pods -l app.kubernetes.io/name=kafka
Serverless Kafka
Serverless Kafka removes all infrastructure management, scales itself, and lightens operational burdens. While this significantly reduces the administrative overhead, some limitations of cold-start and resource constraints could exist.
Summary of differences
Here is a summary of the differences between the three deployment methods:
Best practices for cloud deployment
You can apply the following best practices to improve the efficiency of your Kafka cloud deployment.
Cold start problems
Use scheduled jobs to keep serverless instances warm during critical operation windows. Periodic tasks that interact with the Kafka cluster prevent brokers from going idle when they are needed most. The following example uses a Kubernetes CronJob to periodically produce a dummy message that keeps the Kafka instances warm.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: keep-kafka-warm
spec:
  schedule: "*/5 * * * *" # Run every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: kafka-producer
              image: confluentinc/cp-kafka:latest
              command: ["/bin/sh", "-c", "echo 'test message' | kafka-console-producer --broker-list <BROKER_LIST> --topic keep-warm"]
          restartPolicy: OnFailure
Perform load testing at regular intervals to measure the impact of cold starts and optimize for them.
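One simple way to quantify cold-start impact is to compare a high latency percentile for first-requests-after-idle against steady-state requests. The sketch below uses the nearest-rank percentile method; the sample numbers are made up for illustration.

```python
# Sketch: compare cold-start latency samples against warmed-up ones by
# looking at a high percentile (nearest-rank method).
import math

def percentile(samples: list, pct: float) -> float:
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # nearest-rank percentile
    return ordered[max(0, rank - 1)]

cold = [850, 900, 120, 110, 95, 105, 100, 98]  # ms, first requests after idle
warm = [12, 15, 11, 14, 13, 12, 16, 11]        # ms, steady-state requests

print(f"cold p95: {percentile(cold, 95)} ms")  # 900 ms
print(f"warm p95: {percentile(warm, 95)} ms")  # 16 ms
```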
Resource limitations
Monitor the continuous usage of resources and adjust allocations to avoid CPU and memory constraints. Develop autoscaling policies that allow dynamic resource changes as workload demands change.
Moving from on-prem to Kafka cloud—challenges and considerations
Migrating Kafka from an on-premises environment to the cloud raises questions around secure data migration, dependency management, and tuning configurations for the intended performance and cost efficiency. We summarize the key considerations in the table below.
Conclusion
As you consider which deployment option best suits your use case, keep in mind what exactly you need and what kind of workload you intend to run. Do you want ease of management or more fine-grained control over the environment? How much would cost reduction and scalability mean to your operations?
Most Kafka cloud solutions provide dynamic scalability to ensure your infrastructure grows as much as your business. However, Kafka itself is a decade-old solution that requires complex management and increases cloud costs to run at scale.
Redpanda is a Kafka-compatible streaming data platform designed to be lighter, faster, and simpler to operate. Redpanda Cloud delivers Redpanda as a fully managed service, with automated upgrades and patching, data and partition balancing, built-in connectors, and 24x7 support. It provides cluster options that suit any infrastructure operation and data sovereignty requirements.
You can take Redpanda Cloud for a free spin to see if it suits your needs. Just sign up for a free trial and spin up your first cluster in seconds.