Using Raft to centralize cluster configuration in Redpanda

Learn about Redpanda’s new centralized cluster configuration and how it’s stored and replicated using Raft.

May 19, 2022

Introduction

Redpanda has always stored each node’s configuration in a local text file. Any small mistake, like a typo or one node having a different configuration from its peers, can lead to operational issues that are difficult to diagnose. A configuration file may seem like a simple thing, but if not handled carefully, keeping all of the configuration files in sync across an entire cluster can present a surprisingly difficult challenge.

To make this easier for our users, we’ve implemented a new central configuration system in Redpanda. Configuration properties that should be the same across the entire cluster are now stored internally by Redpanda, updated via an API, and automatically replicated across all nodes using Raft. This consensus store of cluster configuration eliminates common sources of operational issues related to configuration and enables a friendlier user interface for managing Redpanda.

Demo: interactive configuration changes with rpk

Before we get into the details of the changes, here’s a demonstration of the centralized configuration experience in Redpanda. The new way to modify a centralized cluster configuration with Redpanda’s rpk CLI is rpk cluster config edit:

[Image: config-demo-enable-sasl]

This looks and feels like editing a file, but behind the scenes, rpk is using the Redpanda Admin API to modify the configuration of an entire running cluster. It downloads the current configuration, presents it in an editable form, and then sends the modified settings back to the cluster in real time. If you’re working with rpk through scripts, there are non-interactive alternative commands, but the edit command is the place to start.
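The download, edit, and send-back cycle can be pictured as a diff between two property maps. The following is a minimal sketch of that idea, not rpk’s actual implementation: the function name `diff_config`, the sample property values, and the `upsert`/`remove` shape of the resulting change set are assumptions made for illustration.

```python
from typing import Any


def diff_config(current: dict[str, Any], edited: dict[str, Any]) -> dict[str, Any]:
    """Compute the changes needed to turn `current` into `edited`.

    Hypothetical helper: the upsert/remove shape is an assumption for
    illustration, not a documented payload format.
    """
    # Properties that are new, or whose value changed, are upserted.
    upsert = {
        key: value
        for key, value in edited.items()
        if key not in current or current[key] != value
    }
    # Properties deleted from the edited copy are listed for removal.
    remove = sorted(key for key in current if key not in edited)
    return {"upsert": upsert, "remove": remove}


# Illustrative values only; they do not come from a real cluster.
current = {"enable_sasl": False, "log_compression_type": "producer", "superusers": ["admin"]}
edited = {"enable_sasl": True, "log_compression_type": "producer"}

print(diff_config(current, edited))
# → {'upsert': {'enable_sasl': True}, 'remove': ['superusers']}
```

Sending only the difference, rather than the whole edited document, keeps the request small and makes it obvious which properties a given edit touched.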

When editing the central configuration using rpk, properties are presented with a descriptive comment and their current value. Some configuration properties require a restart; when this is the case, that requirement is included in the comment. By default, only properties relevant for day-to-day operations are displayed. To see a larger range of low-level properties, use the --all parameter.

Node and cluster properties

Starting with Redpanda 22.1, Redpanda’s configuration is divided into node properties and cluster properties. Node properties are those that are usually set differently on each node, such as node_id and the kafka_api listener IP addresses. Cluster properties cover everything else, which is the vast majority of Redpanda’s configuration, because most settings are uniform across the cluster. You can learn more about both kinds of properties in our documentation.

Node properties are still set in /etc/redpanda/redpanda.yaml, and this does not change in Redpanda 22.1. This file is usually written out by the tool that installed Redpanda and rarely edited afterwards.

With central configuration, cluster properties are stored internally by Redpanda and transparently copied across all nodes in the cluster using the same Raft replication that we use for other cluster metadata. You must specify cluster properties using rpk. After upgrading to Redpanda 22.1.x, if you edit these properties in redpanda.yaml, those edits will have no effect. The only exception is on the first startup after an upgrade, when any existing values in redpanda.yaml are imported. This delivers a simpler experience in the long term: if cluster properties continued to be stored in both redpanda.yaml and the central configuration store, this would result in confusion over which location was authoritative.
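To see why a replicated log gives every node the same configuration, consider this toy model. It is a conceptual sketch only and assumes nothing about Redpanda’s internals: `ConfigDelta` and `NodeConfigStore` are hypothetical names, and the consensus step itself (Raft agreeing on the order of entries) is taken as given. The point is that applying the same committed entries in the same order yields the same values and the same version on every node.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ConfigDelta:
    """One committed entry in a (hypothetical) replicated configuration log."""
    upsert: dict[str, Any] = field(default_factory=dict)
    remove: tuple[str, ...] = ()


@dataclass
class NodeConfigStore:
    """A node's local view, built by applying committed deltas in log order."""
    version: int = 0
    values: dict[str, Any] = field(default_factory=dict)

    def apply(self, delta: ConfigDelta) -> None:
        self.values.update(delta.upsert)
        for key in delta.remove:
            self.values.pop(key, None)
        # Every applied entry advances the node's configuration version.
        self.version += 1


# The log is assumed to be already agreed on by consensus.
committed_log = [
    ConfigDelta(upsert={"enable_sasl": True}),
    ConfigDelta(upsert={"log_compression_type": "gzip"}),
    ConfigDelta(remove=("log_compression_type",)),
]

nodes = [NodeConfigStore() for _ in range(3)]
for node in nodes:
    for delta in committed_log:
        node.apply(delta)

print(nodes[0].version, nodes[0].values)
# → 3 {'enable_sasl': True}
print(all(n.values == nodes[0].values and n.version == nodes[0].version for n in nodes))
# → True
```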

A single source of truth

Other configuration systems are often more complex than Redpanda’s. For example, other systems may:

  • offer per-node overrides of properties to create heterogeneous configurations
  • read configuration from multiple sources (such as text files as well as an internal store), and attempt to reconcile them at runtime

Redpanda intentionally avoids these more complex models, and this decision is grounded in two underlying ideas about how to deliver a robust experience:

  • Single source of truth. To give a simple answer to “What is the value of property foo?” there must be one authoritative store. If different values could exist both in files and in a central store, the UI would either become untrustworthy or have to be complex enough to show every competing value.
  • Cluster configurations should be homogeneous. Allowing per-node configuration variance is a bug, not a feature. For example, configuring a different Raft heartbeat period on different nodes would create a situation where nodes disagreed about whether their heartbeats were on schedule or late, leading to non-deterministic behavior that is difficult to trace back to its origin. The right solution is to prevent this configuration change from happening at all.

An API-driven configuration store also enables strong validation of changes. Instead of having to handle malformed values at runtime, Redpanda prevents them from being set to begin with.
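As a rough illustration of validate-before-accept, the sketch below checks a proposed change against a schema and writes nothing unless every value passes. The schema, bounds, allowed values, and function names are invented for this example and are not Redpanda’s actual validation rules.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class PropertySpec:
    type_: type
    minimum: Optional[int] = None
    allowed: Optional[frozenset] = None


# Hypothetical schema: the bounds and allowed values are assumptions.
SCHEMA = {
    "enable_sasl": PropertySpec(bool),
    "raft_heartbeat_interval_ms": PropertySpec(int, minimum=1),
    "log_cleanup_policy": PropertySpec(str, allowed=frozenset({"compact", "delete"})),
}


def validate(changes: dict[str, Any]) -> dict[str, str]:
    """Return {property: error}; an empty dict means the change is acceptable."""
    errors: dict[str, str] = {}
    for key, value in changes.items():
        spec = SCHEMA.get(key)
        if spec is None:
            errors[key] = "unknown property"
        # bool is a subclass of int in Python, so reject it for int properties.
        elif not isinstance(value, spec.type_) or (
            spec.type_ is int and isinstance(value, bool)
        ):
            errors[key] = f"expected {spec.type_.__name__}"
        elif spec.minimum is not None and value < spec.minimum:
            errors[key] = f"must be >= {spec.minimum}"
        elif spec.allowed is not None and value not in spec.allowed:
            errors[key] = f"must be one of {sorted(spec.allowed)}"
    return errors


def apply_if_valid(store: dict[str, Any], changes: dict[str, Any]) -> None:
    """Apply all changes, or none of them if any value is invalid."""
    errors = validate(changes)
    if errors:
        raise ValueError(errors)  # nothing has been written to the store
    store.update(changes)


print(validate({"raft_heartbeat_interval_ms": "150", "enable_sasl": True}))
# → {'raft_heartbeat_interval_ms': 'expected int'}
```

Rejecting the whole request when any single value is invalid keeps the store from ever holding a half-applied change.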

Seamless upgrades

While newly deployed clusters will use the new cluster configuration system from day one, we also place a high importance on a smooth upgrade experience for our existing customers. No user action is required to migrate to the centralized configuration. After all nodes in a cluster have been updated to Redpanda 22.1.x, the feature will automatically activate, and any cluster properties defined in existing redpanda.yaml files will be imported. This process ensures that all nodes are stable and up to date before the feature is activated.
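The “activate only once every node is upgraded” rule can be sketched as a simple version gate. This is an illustrative model under stated assumptions, not Redpanda’s feature-activation code: the function names are hypothetical, and versions are assumed to be dotted numeric strings such as 22.1.3.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '22.1.3' into (22, 1, 3)."""
    return tuple(int(part) for part in text.split("."))


def central_config_active(node_versions: dict[int, str],
                          required: tuple[int, int] = (22, 1)) -> bool:
    """Hypothetical gate: true only when every node runs `required` or newer."""
    if not node_versions:
        return False
    # Compare on major.minor only, so any 22.1.x release qualifies.
    return all(parse_version(v)[:2] >= required for v in node_versions.values())


# Mid-upgrade: node 2 is still on an older release, so the feature stays off.
print(central_config_active({0: "22.1.3", 1: "22.1.3", 2: "21.11.15"}))
# → False

# Upgrade finished: every node qualifies, so the feature can activate.
print(central_config_active({0: "22.1.3", 1: "22.1.4", 2: "22.1.3"}))
# → True
```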

After the system is fully upgraded, users may optionally run rpk cluster config lint to clean up redpanda.yaml files. This will remove any cluster configuration properties that were imported into the cluster configuration system.

Live configuration changes

Now that cluster configuration changes are made through a running Redpanda process, we have an opportunity to apply them in real time to the system, rather than requiring node restarts. Most newly added cluster properties support live changes, and we’re gradually migrating existing features to support this as well. If a property still requires a restart, this requirement will be included in the description shown in rpk cluster config edit.

If a property that requires a restart has been changed, Redpanda’s centralized configuration now keeps track of whether each node has been restarted since the change. Users can check whether any nodes require a restart using the new rpk cluster config status command:

$ rpk cluster config status
NODE  CONFIG-VERSION  NEEDS-RESTART  INVALID  UNKNOWN
0     5               false          []       []
1     5               true           []       []
2     5               false          []       []
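For scripted rollouts, this output can be parsed to find the nodes that still need a restart. The sketch below assumes the whitespace-separated column layout shown above, which may differ between rpk versions. The helper name is hypothetical, and the sample text is copied from the example rather than captured from a live cluster.

```python
SAMPLE_STATUS = """\
NODE  CONFIG-VERSION  NEEDS-RESTART  INVALID  UNKNOWN
0     5               false          []       []
1     5               true           []       []
2     5               false          []       []
"""


def nodes_needing_restart(status_output: str) -> list[int]:
    """Return the IDs of nodes whose NEEDS-RESTART column reads 'true'."""
    lines = [line for line in status_output.splitlines() if line.strip()]
    header = lines[0].split()
    node_col = header.index("NODE")
    restart_col = header.index("NEEDS-RESTART")
    pending = []
    for line in lines[1:]:
        # Both columns come before INVALID/UNKNOWN, so spaces inside those
        # bracketed lists would not shift the fields read here.
        fields = line.split()
        if fields[restart_col].lower() == "true":
            pending.append(int(fields[node_col]))
    return pending


print(nodes_needing_restart(SAMPLE_STATUS))
# → [1]
```

A rolling-restart script could loop over the returned IDs, restart one node at a time, and re-run the status command until the list is empty.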

Apache Kafka® compatibility

Where a Redpanda configuration property is compatible with a Kafka configuration property, it is available for modification via Redpanda’s Kafka API. This provides compatibility with existing programs that use the Kafka API to set these properties. The properties that are exposed in this way are:

  • log.cleanup.policy
  • log.message.timestamp.type
  • log.compression.type

Existing Kafka-compatible tools, such as kcl admin broker configs alter, can be used to modify these settings. Note that this feature is provided for compatibility only; for routine configuration changes, we recommend rpk cluster config edit.

An easier way to manage your clusters

Redpanda’s centralized configuration system provides a more robust DevOps experience by enabling you to manage your clusters more easily and with reliable outcomes. Whether you’re editing the configuration interactively or deploying it using infrastructure-as-code practices, updates to the Redpanda configuration are validated and applied consistently to the whole cluster.

Check out our documentation for more details, including additional rpk commands to enable automating configuration changes from your scripts. You can also ask questions about our cluster configuration enhancements directly in our Slack community.
