Supercharging Redpanda Streaming with profile-guided optimization

47% lower latencies using 15% less CPU means better performance for intensive workloads

by

Stephan Dollberg

April 2, 2026

For Redpanda Streaming 26.1, we investigated optimizing the Redpanda binary with clang’s compiler-based profile-guided optimization (PGO) and also using LLVM BOLT.

While PGO has been around for a while, it’s still only selectively deployed. BOLT is a newer technology that originated as a 2019 research project at Meta and is now part of LLVM.

PGO and BOLT are similar technologies that further optimize the application binary based on profiling data. Compilers traditionally struggle to determine which code paths are hot and executed frequently, since they rely on heuristics and guesswork. With profiling data, no guessing is needed; optimization decisions can be made based on the profile.

In this post, we look at the operational challenges these technologies pose, the performance gains we achieved with them, and why they can lead to significant speedups.

Profile-guided optimization and BOLT

PGO and BOLT apply some of the same optimizations but differ in how they operate.

With PGO, a two-phase compilation process is employed. In the first step, the binary is compiled with extra instrumentation, and then a representative training workload is run against it to produce profile data. This is then used in a second recompilation to enable better, more targeted optimization.
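As a rough sketch of that two-phase flow, the snippet below assembles the commands for an instrumented clang PGO build without running them. The flags (`-fprofile-generate`, `-fprofile-use`, `llvm-profdata merge`) are standard clang/LLVM options. The file names, the `pgo-profiles` directory, and the single-file build are hypothetical placeholders, not Redpanda's actual build setup.

```python
import shlex


def pgo_build_steps(source: str, binary: str, profile_dir: str = "pgo-profiles"):
    """Return argv lists for an instrumented clang PGO build (nothing is executed)."""
    merged = f"{profile_dir}/merged.profdata"
    return [
        # 1. Compile with instrumentation; running this binary writes
        #    raw profiles (.profraw files) into profile_dir.
        ["clang++", "-O2", f"-fprofile-generate={profile_dir}", source,
         "-o", f"{binary}.instrumented"],
        # 2. Run the representative training workload (placeholder invocation).
        [f"./{binary}.instrumented"],
        # 3. Merge the raw profiles into one indexed profile file.
        ["llvm-profdata", "merge", "-o", merged, profile_dir],
        # 4. Recompile, letting the optimizer consume the merged profile.
        ["clang++", "-O2", f"-fprofile-use={merged}", source, "-o", binary],
    ]


if __name__ == "__main__":
    for argv in pgo_build_steps("main.cpp", "app"):
        print(shlex.join(argv))
```

In a real project the same flags would normally be threaded through the build system rather than passed to a single compiler invocation.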

BOLT, however, is a post-link binary optimizer. It operates directly on the binary produced by the original compilation process, rewriting code sections in the output binary. There is no interaction with the compiler or additional compilation steps.

Both technologies come in two modes:

Sampling mode. The original binary is used unchanged and profiled during the training workload to collect profiling data (commonly done with the Linux perf tool).

Instrumented mode. In this mode, the binary is instrumented to record the code paths taken during execution, which are written out upon program termination and serve as the profile data. Note that even in instrumented mode, BOLT doesn’t require an extra compilation. It creates an instrumented binary by injecting instructions directly into the compiled executable.

Each approach has its advantages and disadvantages. BOLT’s approach to operating on the binary directly avoids an extra compilation step, potentially saving significant build time. This can be especially important for larger projects like Redpanda Streaming. At the same time, its binary-modifying nature is quite brittle, and we ran into a few bugs (like this one).

While the compile-time overhead of PGO is a disadvantage, it can be mitigated by enabling PGO only where it’s really needed. PGO is also a proven and widely deployed technology. With this in mind, and considering some outstanding BOLT bugs, we decided to stick with PGO.

Note that they’re not mutually exclusive. Many projects combine PGO and BOLT for the best performance, and we saw the same benefit in our own tests. (We’ll likely return to adding BOLT on top of PGO at some point.)

Benchmark: lower latencies, less CPU usage

PGO significantly improves CPU-bound performance. The numbers below come from one of our core regression benchmarks that simulate high request rates with small batch sizes. This workload is deliberately CPU-intensive, mirroring real-world patterns where significant processing overhead is applied to relatively small amounts of data.

The result? A massive drop in latency.

Profile-guided optimization (PGO) provided up to 47% lower p999 latencies for Redpanda Streaming

Profile-guided optimization (PGO) provided 15% better CPU reactor utilization for Redpanda Streaming

As you can see, the p999 latency drops by almost 50%, alongside a 15% drop in CPU utilization.

For those interested in how a reduction in CPU utilization can result in a disproportionately large reduction in latency: systems like Redpanda Streaming and Apache Kafka® have inherent batching, both explicitly in the API via Kafka producer settings (like linger.ms) and implicitly in the broker. Batching requests is more efficient and allows the broker to trade higher latency for higher throughput. When each request costs less CPU, the broker has to lean on that trade less, so latency can fall by more than the CPU savings alone would suggest.
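The batching effect is specific to the broker, but the general non-linearity between CPU load and latency can be illustrated with a textbook M/M/1 queue. This is a swapped-in generic model, not a description of how Redpanda Streaming schedules work. The arrival rate and service times below are hypothetical numbers chosen only to show the shape of the curve.

```python
def mm1_mean_latency(service_time: float, arrival_rate: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho), rho = S * lambda."""
    utilization = service_time * arrival_rate
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time / (1.0 - utilization)


# Hypothetical load: baseline service time 1.0 at 70% utilization.
arrival_rate = 0.7
baseline = mm1_mean_latency(1.0, arrival_rate)
# Same arrival rate, but each request now costs 15% less CPU time.
optimized = mm1_mean_latency(0.85, arrival_rate)

latency_drop = 1 - optimized / baseline
print(f"15% less CPU per request -> {latency_drop:.0%} lower mean latency")
```

In this toy model, a 15% cut in per-request cost yields a latency drop of roughly 37%, because the time spent queueing shrinks along with the time spent executing.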

We also investigated BOLT's performance and found it to show improvements similar to PGO. Most of the time, it came in just slightly behind. We also tested combining the two and got another small bump in performance.

Analyzing PGO performance improvements

To dig deeper into how PGO can help, we wanted to narrow down where the performance speedup actually comes from.

In our benchmark, we ran a top-down microarchitecture analysis (TMA). For context, a traditional profiler shows what parts of our application are slow. However, a profiler doesn’t tell us why a bit of code is slow on a CPU level. This is where TMA comes in.

TMA uses hardware performance counters exposed by the CPU to measure exactly where a CPU stalls while executing the measured part of the code. It operates top-down, starting at a very high level and only then drilling down into affected areas and CPU components. This avoids getting lost in individual performance counters.

CPU time is split into four major categories.

Retiring: The ideal state where the CPU is actively executing and "retiring" instructions. A high number here is good.
Bad speculation: The CPU is executing instructions, but they are ultimately discarded because the CPU incorrectly predicted a branch outcome.
Frontend bound: The CPU is stalled waiting for the instruction stream to get decoded, which happens in the CPU frontend. This often occurs in applications that execute a large amount of code but process little data.
Backend bound: The CPU is stalled waiting for the backend to execute the decoded instructions. This category has two major subcategories. The first is core-bound, in which it is stalling due to a lack of available execution resources, such as arithmetic logic units. The second is memory-bound. The CPU is waiting for data to be retrieved from memory or the various cache layers.

Looking at the TMA results for our benchmark using the Linux perf tool, we see the following:

$ sudo perf stat --topdown --td-level 1 -t $(pidof -s redpanda) 

Performance counter stats for thread id '3275312':

% tma_frontend_bound  % tma_bad_speculation  % tma_retiring  % tma_backend_bound
                51.0                   10.3            30.9                  7.8

Redpanda Streaming is very frontend-bound in this benchmark. Being 50% frontend bound is definitely on the higher end, even for database or distributed applications.

Now, let’s compare it against the numbers in the PGO-optimized build:

$ sudo perf stat --topdown --td-level 1 -t $(pidof -s redpanda)

Performance counter stats for thread id '4043061':

% tma_frontend_bound  % tma_bad_speculation  % tma_retiring  % tma_backend_bound 
                37.9                    9.5            36.6                 16.0

Good progress. The CPU is still frontend-bound, but less so than before. Crucially, the retiring percentage has increased, meaning more work is actually being completed. Some frontend stalls have shifted to backend stalls, which is expected: resolving one bottleneck often reveals the next.
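To make the shift between the two runs easier to read, here is a small sketch that diffs the level-1 TMA percentages quoted above. The numbers are copied from the two perf outputs. The dictionary keys and helper name are just illustrative.

```python
# Level-1 TMA percentages from the two perf runs above.
baseline = {"frontend_bound": 51.0, "bad_speculation": 10.3,
            "retiring": 30.9, "backend_bound": 7.8}
pgo = {"frontend_bound": 37.9, "bad_speculation": 9.5,
       "retiring": 36.6, "backend_bound": 16.0}


def tma_delta(before: dict, after: dict) -> dict:
    """Change per category in percentage points (after minus before)."""
    return {name: round(after[name] - before[name], 1) for name in before}


delta = tma_delta(baseline, pgo)
for name, change in delta.items():
    print(f"{name:16s} {baseline[name]:5.1f} -> {pgo[name]:5.1f} ({change:+.1f} pts)")

# Relative increase in the share of pipeline slots doing useful work.
retiring_gain = pgo["retiring"] / baseline["retiring"] - 1
print(f"retiring share up {retiring_gain:.1%} relative to baseline")
```

The frontend-bound share falls by 13.1 points, of which 5.7 points show up as additional retiring and most of the rest as backend-bound stalls.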

To clarify, frontend-bound means the CPU can't load instructions fast enough for the backend to execute. The root cause is code locality: the hot path is scattered across the executable rather than packed tightly together. This fragments the instruction cache, leading to high-latency memory fetches.

PGO addresses this directly. Using profile data, the compiler identifies which functions and branches are hit most often, then reorganizes code accordingly by grouping hot blocks together and splitting functions into hot and cold segments. Inlining decisions are also profile-driven, allowing frequently called functions to be inlined more aggressively.
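A toy model can show why grouping hot code matters. The sketch below lays functions out in a flat address space and counts how many 4 KiB pages the hot functions touch, first in an interleaved order and then sorted hot-first. The function sizes, hit counts, and layout are made up for illustration. Real PGO works on basic blocks and call graphs and is far more sophisticated than a sort by hit count.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # bytes; a common page size, assumed here


@dataclass(frozen=True)
class Function:
    name: str
    size: int  # bytes of machine code
    hits: int  # samples attributed to this function in the profile


def pages_touched(layout, hot_threshold: int = 1) -> int:
    """Count distinct pages containing code of functions with hits >= threshold."""
    pages = set()
    offset = 0
    for fn in layout:
        if fn.hits >= hot_threshold:
            first = offset // PAGE_SIZE
            last = (offset + fn.size - 1) // PAGE_SIZE
            pages.update(range(first, last + 1))
        offset += fn.size
    return len(pages)


def hot_first(layout):
    """Reorder so the most frequently hit functions sit next to each other."""
    return sorted(layout, key=lambda fn: fn.hits, reverse=True)


# Hypothetical binary: 8 small hot functions, each followed by 16 KiB of cold code.
original = []
for i in range(8):
    original.append(Function(f"hot_{i}", 512, 1000))
    original.append(Function(f"cold_{i}", 16384, 0))

pages_before = pages_touched(original)
pages_after = pages_touched(hot_first(original))
print(f"hot pages before: {pages_before}, after: {pages_after}")
```

In this made-up layout the hot code spans eight pages when scattered and a single page once packed, which means fewer instruction cache lines and iTLB entries are needed for the same work.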

We can see these layout optimizations in action by visualizing code block access frequency across our binary. BOLT provides a tool to generate this heatmap from a workload profile, making it straightforward to compare our standard binary against the PGO-optimized version.

Binary code access frequency distribution during the benchmark

Each character in the heatmap represents 12KiB of code in the binary. Here’s the breakdown:

A dot means no access at all during the profile
Lowercase letter means very low access rate
Uppercase letter means increasing access rates
Higher access rate = warmer color, with yellow and red being the hottest.

We see that access is scattered throughout the binary. While there are bands of hotter code, there are many individual hot chunks. Things look much different on the PGO heatmap.

Binary code access frequency after PGO is applied

The heatmap shows a clear improvement. In the PGO-optimized binary, all hot functions are packed tightly at the start of the binary, not because the start is special, but because hot code is now concentrated in one place rather than scattered. Access to the rest of the binary is minimal.

The heatmap legend is key here: yellow is significantly hotter in the PGO case, confirming denser, more concentrated code access despite there being less red.

This is exactly why PGO reduces frontend pressure. Tighter hot path packing improves instruction cache locality and cuts down on iTLB lookups, which means the CPU spends less time fetching code and more time executing it.

Try Redpanda Streaming

The benefits of PGO are a critical part of the 26.1 release and are immediately available to everyone using 26.1. Looking ahead, this optimization will continue to improve Redpanda Streaming’s performance, especially for CPU-intensive workloads.

We also recommend everyone give PGO and BOLT a try to help large, frontend-bound applications. If you’re ready to try Redpanda Streaming yourself, check out our deployment options or just download our Docker image. If you’re already a user, read what else is new. As always, if you have questions you can chat with our engineers directly on Slack.
