The Kafka cloud—options and best practices


Apache Kafka® enables businesses to speed up their data's "time-to-value" and develop applications based on real-time events. Though Kafka is powerful, on-premises deployments introduce challenges. Configuring and running Kafka clusters requires provisioning machines, ensuring availability, protecting data, setting up monitoring, and scaling against changes in load.

One alternative is to run Kafka as a managed service in the cloud. Third-party cloud providers take over infrastructure concerns so businesses can develop and run applications without deep Kafka expertise. Instead of managing infrastructure, you get more time to create business value.

Kafka cloud allows companies to expedite data processing, drive down hardware and maintenance costs, and increase the speed of real-time insight. This article explores the advantages, deployment options, and best practices for running Kafka in the cloud.

Comparison table for Kafka cloud options

Before delving into the details, let's summarize the popular options for running Kafka in the cloud.

Feature | Amazon MSK | Azure HDInsight Kafka | Redpanda Cloud | Kafka on K8s (private clouds)
--- | --- | --- | --- | ---
Deployment | Managed, auto-scaling | Managed, seamless Azure integration | Managed, serverless, no ZooKeeper necessary | Complex, requires expertise
Operations | Simplified, AWS ecosystem | Simplified, Azure ecosystem | Simplified, reduced complexity | High complexity, customization
Serverless offering | Limited | Limited | Full serverless capabilities | Limited, DIY scaling
API compatibility | Native Kafka API | Native Kafka API | Full Kafka API compatibility | Native Kafka API
Cost efficiency | Pay-as-you-go, auto-scaling | Pay-as-you-go | Lower TCO, optimized scaling | Costly infrastructure management
Performance | High availability, reliable | High availability | Lower latency, high throughput | High performance, custom tuning
Ease of migration | Moderate | Moderate | Smooth transition | Complex, high effort

Kafka on AWS cloud

There are two ways to run Kafka in AWS: Amazon MSK and self-managed Kafka on EC2. For teams that want to manage the Kafka infrastructure themselves, deploying Kafka on EC2 instances gives more control but also adds more operational responsibility. For example, you will need to:

  • Provision EC2 instances and configure Kafka clusters.
  • Manage scaling manually or by setting auto-scaling policies.
  • Implement security features like VPC, IAM, and encryption.
  • Perform regular maintenance tasks, including patching and upgrading Kafka.

Hence, most companies prefer Amazon MSK, a fully managed service for running Kafka in the cloud. MSK can automatically scale your Kafka clusters to adapt to changes in your workload, guaranteeing high availability and performance. For efficient data processing, you also get tight, out-of-the-box integration across several AWS services—such as Amazon S3, Lambda, and Redshift.

Overview of Amazon MSK cluster architecture

Create an Amazon MSK Cluster

Let’s create an MSK cluster named NewMessagingCluster with three broker nodes located in different subnets for high availability.

aws kafka create-cluster \
--cluster-name "NewMessagingCluster" \
--broker-node-group-info file://new-brokernodegroupinfo.json \
--kafka-version "2.8.0" \
--number-of-broker-nodes 3

The file new-brokernodegroupinfo.json specifies the subnets and the security group for the broker nodes, as shown below.

{
    "InstanceType": "kafka.m5.large",
    "BrokerAZDistribution": "DEFAULT",
    "ClientSubnets": [
        "subnet-0123456789444abcd",
        "subnet-0123456789555abcd",
        "subnet-0123456789666abcd"
    ],
    "SecurityGroups": [
        "sg-0123456789abcdef0"
    ]
}

You can add specific configurations for the Kafka cluster, such as auto topic creation, ZooKeeper timeout (for older Kafka versions), and log roll settings.

aws kafka create-configuration \
--name "MyCustomConfiguration" \
--description "Custom configuration for MSK cluster." \
--kafka-versions "2.8.0" \
--server-properties file://new-configuration.txt

The above command creates a custom MSK configuration named "MyCustomConfiguration." The file new-configuration.txt defines server properties for the custom configuration, as shown below.

auto.create.topics.enable = true
# Applies only to older, ZooKeeper-based Kafka versions
zookeeper.connection.timeout.ms = 3000
# Roll log segments every 7 days (604,800,000 ms)
log.roll.ms = 604800000

To view details about an existing cluster, use the describe-cluster command.

aws kafka describe-cluster \
--cluster-arn arn:aws:kafka:us-east-1:123456789012:cluster/new-demo-cluster/1234abcd-5678-efgh-ijkl-5678mnopqrst

Configuration and scaling

Auto-scaling an MSK cluster

Setting up automatic scaling for Amazon MSK involves two main steps: registering a scalable target and creating an auto-scaling policy.

The register-scalable-target command specifies which resource should be auto-scaled. For Amazon MSK, this is typically the storage volume size per broker. The command below registers it as a scalable target with Application Auto Scaling, allowing storage to scale between 100 GiB and 800 GiB per broker.

aws application-autoscaling register-scalable-target \
--service-namespace kafka \
--scalable-dimension kafka:broker-storage:VolumeSize \
--resource-id arn:aws:kafka:us-east-1:123456789012:cluster/demo-cluster/6357e0b2-0e6a-4b86-a0b4-70df934c2e31-5 \
--min-capacity 100 \
--max-capacity 800

The put-scaling-policy command defines how the scaling will occur depending on specific metrics. Here, it uses target tracking to adjust the storage volume size. Target tracking automatically adjusts resources to keep a pre-defined metric at its target value.

aws application-autoscaling put-scaling-policy \
--policy-name KafkaStorageScalingPolicy \
--service-namespace kafka \
--scalable-dimension kafka:broker-storage:VolumeSize \
--resource-id arn:aws:kafka:us-east-1:123456789012:cluster/demo-cluster/6357e0b2-0e6a-4b86-a0b4-70df934c2e31-5 \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration file://target-tracking-policy.json

The file target-tracking-policy.json contains a configuration that automatically scales the storage volume size up when storage utilization reaches 60%.

{
    "TargetValue": 60.0,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "KafkaBrokerStorageUtilization"
    },
    "ScaleOutCooldown": 300,
    "ScaleInCooldown": 0
}
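To build intuition for what this policy does, the proportional rule behind target tracking can be sketched in a few lines. This is an approximation for illustration, not the exact AWS algorithm, and the function name is ours:

```python
import math

def desired_capacity(current_gib, utilization_pct, target_pct, min_cap, max_cap):
    """Approximate the storage size target tracking scales toward:
    grow capacity proportionally so utilization returns to the target,
    clamped to the registered min/max capacity."""
    desired = math.ceil(current_gib * utilization_pct / target_pct)
    return max(min_cap, min(max_cap, desired))

# Brokers with 100 GiB volumes at 75% utilization and a 60% target:
print(desired_capacity(100, 75.0, 60.0, min_cap=100, max_cap=800))  # 125
```

The ScaleOutCooldown of 300 seconds in the policy above then spaces out successive scale-out steps so the metric has time to settle.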

High Availability

You can distribute Kafka brokers across multiple availability zones by listing subnets from different zones in ClientSubnets. With BrokerAZDistribution left at "DEFAULT", MSK spreads the brokers evenly across those zones for redundancy.

{
    "InstanceType": "kafka.m5.large",
    "BrokerAZDistribution": "DEFAULT",
    "ClientSubnets": [
        "subnet-0123456789444abcd",
        "subnet-0123456789555abcd",
        "subnet-0123456789666abcd"
    ],
    "SecurityGroups": [
        "sg-0123456789abcdef0"
    ]
}

Security

With MSK, you can use AWS IAM policies and roles for secure access management. For example, the following policy grants full Kafka permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "kafka:*",
            "Resource": "*"
        }
    ]
}
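A wildcard policy like the one above is convenient for experimentation, but production roles should be scoped down. A narrower policy might look like the following (the cluster ARN is a placeholder matching the earlier examples, and the action list is illustrative; grant only the actions your application actually needs):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kafka:DescribeCluster",
                "kafka:GetBootstrapBrokers"
            ],
            "Resource": "arn:aws:kafka:us-east-1:123456789012:cluster/NewMessagingCluster/*"
        }
    ]
}
```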

Kafka on Azure cloud

Like AWS, Azure lets you run self-managed Kafka on Azure VMs, with all of the same operational responsibilities as self-managed Kafka on EC2.

The other alternative is HDInsight Kafka, a managed service that runs Kafka on Azure. Azure provides infrastructure management so you can focus on processing data and developing applications. You can integrate with other Azure cloud services like Blob Storage, Azure Functions, and SQL Data Warehouse to store, process, and analyze data. Azure Monitor and Azure Log Analytics help monitor and manage your Kafka clusters directly from the Azure portal.

The diagram below shows the HDInsight Kafka cluster in the Azure Resource Group MyKafkaResourceGroup. The cluster, MyKafkaCluster, has two head nodes, four worker nodes, and three ZooKeeper nodes, all running Standard_E4_v3 VMs. It uses a storage account, mykafkastorageacct, with the kafkacontainer to store data. The setup provides high availability, efficient data processing, and ease of management through Azure's managed services.

Overview of Azure Cluster

Next, let's look at how to create it.

Create a cluster

You can create a cluster for HDInsight Kafka as follows:

  1. Open the Azure portal and create a new HDInsight cluster.
  2. In the Cluster type, choose "Kafka."
  3. Then, configure the necessary settings: cluster name, resource group, region, and cluster size.

Optionally, you can also use an Azure Resource Manager (ARM) template to automate the creation of a Kafka cluster. Once created, configure the necessary resources like virtual machines, storage, virtual networks, and subnets. Integrate with Azure Active Directory to ensure secure access and authentication.

Another way is to use Azure CLI.

Log in to Azure and create a resource group. For example, the command below creates a resource group named MyKafkaResourceGroup in the East US region.

az login
az group create --name MyKafkaResourceGroup --location eastus

Next, create a storage account named mykafkastorageacct.

az storage account create \
--name mykafkastorageacct \
--resource-group MyKafkaResourceGroup \
--location eastus \
--sku Standard_LRS

Then, create a storage container named kafkacontainer in the storage account.

az storage container create \
--name kafkacontainer \
--account-name mykafkastorageacct

Extract the primary key for the storage account and store it in a variable.

STORAGE_KEY=$(az storage account keys list \
--resource-group MyKafkaResourceGroup \
--account-name mykafkastorageacct \
--query '[0].value' \
--output tsv)

Finally, create an HDInsight Kafka Cluster named MyKafkaCluster with four worker nodes.

az hdinsight create \
--name MyKafkaCluster \
--resource-group MyKafkaResourceGroup \
--type kafka \
--component-version kafka=2.3 \
--http-password MyKafkaPassword1! \
--http-user admin \
--ssh-password MyKafkaSSHPassword1! \
--ssh-user sshuser \
--storage-account mykafkastorageacct \
--storage-account-key $STORAGE_KEY \
--storage-container kafkacontainer \
--location eastus \
--workernode-count 4 \
--headnode-size Standard_E4_v3 \
--workernode-size Standard_E4_v3 \
--zookeepernode-size Standard_E4_v3

Reliability and high availability

You can achieve reliability and high availability with zone redundancy, failover mechanisms, and backup and restore. Zone redundancy distributes Kafka broker nodes across multiple availability zones within a region to mitigate risk.

The command to create a Kafka cluster with zone redundancy is shown below. The --zones 1 2 3 option distributes the nodes across three availability zones.

az hdinsight create --name my-hdinsight-cluster --resource-group my-resource-group \
--type kafka --location eastus2 --version 4.0 --component-version kafka=2.1 \
--workernode-count 4 --workernode-data-disks-per-node 2 --workernode-size Standard_D3_v2 \
--headnode-size Standard_D3_v2 --zookeepernode-size Standard_D3_v2 --vnet-name my-vnet \
--subnet-name my-subnet --storage-account my-storage-account --workernode-disk-size 1024 \
--headnode-disk-size 1024 --zookeepernode-disk-size 512 --zones 1 2 3

Failover mechanisms automatically redirect traffic to healthy broker nodes if one node fails, ensuring continuous operation. Managed Kubernetes services, for example, can reschedule failed broker pods automatically. Configurations like replication factors and in-sync replicas further increase fault tolerance, ensuring that the Kafka service remains available during a disruption and no data is lost.
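The arithmetic behind that fault tolerance is simple enough to sketch (Python here purely for illustration; the function name is ours):

```python
def write_fault_tolerance(replication_factor, min_insync_replicas):
    """Broker failures a topic can absorb while still accepting
    acks=all writes: the replicas it can lose before falling below
    the in-sync minimum."""
    return max(0, replication_factor - min_insync_replicas)

# The common setting replication.factor=3 with min.insync.replicas=2
# keeps accepting writes after losing one broker:
print(write_fault_tolerance(3, 2))  # 1
```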

Backup and restore capabilities ensure data durability and recovery in case of data loss. You can implement this with various options, such as the open source tool Velero for backup and restore of Kubernetes cluster resources or an Azure Data Factory pipeline for backup and restore into Azure Blob storage.

More deployment options in the cloud

We covered the benefits and specific configurations for running Kafka as a managed service on AWS (Amazon MSK) and Azure (HDInsight Kafka). These managed services abstract away many of the underlying complexities of deploying and managing Kafka clusters. However, if you require more control or have specific needs, you will want to explore other deployment options like virtual machines, Kubernetes, or serverless Kafka.

Remember that deploying Kafka on VMs or Kubernetes is similar to running self-managed Kafka on EC2 or Azure VMs. You have full control over the infrastructure but are now completely responsible for managing and maintaining it.

We’ll now walk through these options in detail.

Kafka on VMs

Deploying Kafka on VMs gives you full control over the infrastructure, allowing fine-grained customization of configuration and resource management. However, scaling, maintenance, and monitoring all require substantial manual work.

Kubernetes (K8s)

Kubernetes simplifies the deployment and management of Kafka clusters through automation and orchestration. Helm, a Kubernetes package manager, simplifies this setup with pre-configured charts. This method offers automatic scaling and resource management but requires expertise in Kubernetes. Here is how you can install and configure Kafka clusters using K8s.

First, ensure Helm is installed on your Kubernetes cluster. If Helm is not installed, you can run:

curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash

Add the Helm repository for Kafka.

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

Install Kafka using the Helm chart:

helm install my-kafka bitnami/kafka

This command deploys Kafka with default settings. You can customize the deployment by modifying values in the Helm chart. Check the status of your Kafka deployment.

kubectl get pods -l app.kubernetes.io/name=kafka
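As a sketch of such customization, a values override could increase the node count. The key names below are an assumption that varies across chart versions, so verify them with `helm show values bitnami/kafka` before relying on this:

```yaml
# values.yaml -- illustrative override; check your chart version's keys
controller:
  replicaCount: 3    # run three Kafka nodes instead of the default
```

Apply it with `helm install my-kafka bitnami/kafka -f values.yaml`.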

Serverless Kafka

Serverless Kafka removes all infrastructure management, scales itself, and lightens operational burdens. While this significantly reduces administrative overhead, it can come with cold-start delays and resource constraints.

Summary of differences

Here are the differences between the three deployment methods:

Feature | VMs | K8s | Serverless
--- | --- | --- | ---
Provisioning | Use infrastructure as code (e.g., Terraform) to automate provisioning. | Deploy Kafka using Helm charts (e.g., Bitnami's Kafka chart) for simplified management, or Kafka operators like Strimzi to manage configurations. | Fully managed, no provisioning required.
Resources like CPU | Self-managed, full control. | Define resource requests and limits to ensure adequate CPU and memory. | Compute capacity may be shared among user accounts, introducing constraints on CPU and memory allocations.
Scaling | Requires manual intervention to scale resources up or down. | Automatic. | Automatic.
Maintenance | Regular maintenance tasks, such as patching and upgrading the OS and Kafka, must be handled by the user. | Requires expertise to manage and maintain the underlying Kubernetes infrastructure. | None.
Cold start problems | None. | None. | Delays in scaling up from zero can impact real-time processing.

Best practices for cloud deployment

You can implement the best practices below for improved efficiency in your Kafka cloud deployment.

Cold start problems

Use scheduled jobs to keep serverless instances warm during critical operation windows. Scheduling periodic tasks that interact with the Kafka cluster prevents brokers from going idle. The following example uses a Kubernetes CronJob to periodically produce a dummy message that keeps Kafka instances warm.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: keep-kafka-warm
spec:
  schedule: "*/5 * * * *"  # Run every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: kafka-producer
            image: confluentinc/cp-kafka:latest
            command: ["/bin/sh", "-c", "echo 'test message' | kafka-console-producer --bootstrap-server <BROKER_LIST> --topic keep-warm"]
          restartPolicy: OnFailure

Perform load testing at regular intervals to measure the impact of cold starts and optimize for them.

Resource limitations

Monitor resource usage continuously and adjust allocations to avoid CPU and memory constraints. Define autoscaling policies that adjust resources dynamically as workload demands change.
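Kafka brokers themselves are stateful and usually scale through their operator or chart, but consumer and producer workloads fit Kubernetes' HorizontalPodAutoscaler well. A sketch, where the kafka-consumer Deployment name is hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-consumer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-consumer    # hypothetical consumer deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU exceeds 70%
```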

Moving from on-prem to Kafka cloud—challenges and considerations

Migrating Kafka from an on-premises environment to the cloud raises issues around secure data transfer, dependency management, and configuration tuning for performance and cost efficiency. We summarize the key considerations below.

Data migration
Challenge: Ensuring data is securely transferred with minimal disruption.
Solutions:
  • Use encryption (e.g., SSL/TLS) for data in transit.
  • Implement a phased migration strategy.
  • Utilize tools like AWS DataSync or Kafka MirrorMaker for efficient data replication.

Configuration and optimization
Challenge: Adapting Kafka configurations to leverage cloud infrastructure.
Solutions:
  • Scale Kafka brokers and partitions dynamically to handle variable loads.
  • Use managed Kafka services to simplify configuration management.
  • Optimize configurations like replication factors and in-sync replicas for cloud storage performance.

Dependency management
Challenge: Managing dependencies and communication between services as they move to the cloud.
Solutions:
  • Use service discovery tools (e.g., Consul, Kubernetes DNS) to manage inter-service communication.
  • Leverage cloud-native integrations (e.g., Azure Event Hubs for event-driven architecture).
  • Implement monitoring and logging to track and resolve communication issues.

Security concerns
Challenge: Maintaining data privacy and meeting regulatory requirements during and after migration.
Solutions:
  • Implement end-to-end encryption (both in transit and at rest).
  • Use cloud-native security tools (e.g., AWS IAM, Azure Active Directory) to manage access controls.
  • Regularly audit security configurations and compliance postures.

Cost concerns
Challenge: Controlling costs while scaling cloud resources.
Solutions:
  • Use cost management tools (e.g., AWS Cost Explorer, Azure Cost Management).
  • Implement auto-scaling policies to optimize resource usage.
  • Monitor and optimize resource allocation based on workload patterns.
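For the Kafka MirrorMaker option mentioned above, a minimal MirrorMaker 2 configuration for one-way on-prem-to-cloud replication might look like this sketch (bootstrap addresses are placeholders):

```properties
# connect-mirror-maker.properties -- illustrative MirrorMaker 2 setup
clusters = onprem, cloud
onprem.bootstrap.servers = onprem-broker-1:9092
cloud.bootstrap.servers = cloud-broker-1:9092

# Replicate all topics one way, from on-prem to cloud
onprem->cloud.enabled = true
onprem->cloud.topics = .*

replication.factor = 3
```

Run it with Kafka's bundled script: `bin/connect-mirror-maker.sh connect-mirror-maker.properties`.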

Conclusion

As you consider which deployment option best suits your use case, keep in mind what exactly you need and what kind of workload you intend to run. Do you want ease of management or more fine-grained control over the environment? How much would cost reduction and scalability mean to your operations?

Most Kafka cloud solutions provide dynamic scalability to ensure your infrastructure grows with your business. However, Kafka itself is a decade-old solution that requires complex management and increases cloud costs to run at scale.

Redpanda is a Kafka-compatible streaming data platform designed to be lighter, faster, and simpler to operate. Redpanda Cloud delivers Redpanda as a fully managed service, with automated upgrades and patching, data and partition balancing, built-in connectors, and 24x7 support. It provides cluster options that suit any infrastructure operation and data sovereignty requirements.

You can take Redpanda Cloud for a free spin to see if it suits your needs. Just sign up for a free trial and spin up your first cluster in seconds.
