Bridge Queries in Redpanda SQL

Have your real-time cake and eat your analytics too

by

June 23, 2026

Last modified on

TL;DR Takeaways:

No items found.

Learn more at Redpanda University

If you've ever built a streaming pipeline that lands data in an Iceberg lakehouse, you've felt this tradeoff in your bones.

You want fresh data in your analytics layer, ideally within seconds of it being produced. So you flush from your Redpanda topic into Iceberg often: every minute, maybe every thirty seconds. But now your storage layer is littered with thousands of tiny Parquet files. Compression ratios are mediocre, every query opens hundreds of files, your Iceberg catalog metadata balloons, and you're running a compaction job around the clock just to keep things tidy.

Or you go the other direction: flush less often, get nicely-sized Parquet files, and accept that your "real-time" dashboard is now showing data from an hour ago.

The newly released Redpanda SQL offers a way out of this tradeoff, and the feature that makes it possible is called a bridge query.

What is Redpanda SQL?

Redpanda SQL is a PostgreSQL-compatible OLAP query engine that turns your Redpanda topics (including their Iceberg history) into queryable SQL tables, without any ETL, connector fleet, or separate analytics warehouse.

A few things worth knowing about it:

PostgreSQL wire protocol. Connect with psql, JDBC, ODBC, DBeaver, DataGrip, or whatever you already use. No new query language to learn.
Columnar and vectorized. Built from scratch in C++ with an MPP architecture, columnar storage, and CPU-cache-aware execution. Purpose-built for analytics, not retrofitted from an OLTP engine.
Decoupled storage and compute. Scale query nodes independently of your data.
Topics and Iceberg, unified. A single SQL query can read live records from a topic and historical records from its Iceberg table—and that's where bridge queries come in.

What is a bridge query?

A bridge query is a single SQL query that transparently reads from both an Iceberg table and the underlying Redpanda topic that feeds it, using a single virtual table. The query engine handles the stitching: historical data comes from Iceberg, the most recent messages come straight from the topic, and the result set has nothing missing or duplicated. And you didn’t write the join.

To you, the query author, it's just a table. You write SQL. The engine figures out where each row lives. You just need to be using a Redpanda Iceberg Topic.

The architectural unlock

This changes the shape of the pipeline in a quietly significant way.

Previously, the topic-to-Iceberg flush interval was driven by your freshness requirement. If users needed data within a minute, you flushed every minute, and you lived with the resulting small files.

With bridge queries, the topic itself covers the freshness gap. Anything not yet written to Iceberg is read directly from the topic at query time. The flush no longer has to be frequent to keep Iceberg fresh, which means you can flush on whatever cadence is optimal for file layout instead.

Flush every two hours or every six. Whatever produces the file sizes your analytics workload actually wants.

Why bigger Parquet files matter

Let’s say your kid has 10,000 building bricks on the floor, and you need to move them to another room. Do you: A) move them one by one, or B) put them all in a big box first and carry that?

That, in essence, is why bigger Parquet files are better: they’re far more efficient. And, those efficiencies show up in lots of ways:

Better compression. Larger row groups give column encodings more data to work with, often yielding noticeably smaller on-disk footprints.
Fewer S3 requests per query. Each file is at least one GET (often more). Querying a thousand 1MB files is dramatically more expensive than querying ten 100MB files, even when the bytes scanned are identical.
Less catalog overhead. Iceberg's metadata grows with the number of files. Fewer files mean faster query planning and lighter snapshot operations.
Effective predicate pushdown. Row group statistics work best when row groups are substantial; tiny files often have row groups too small to prune meaningfully.
Far less compaction. Instead of running a compaction service constantly chasing small files, you can run it rarely or not at all.

Going from 500KB files written every 30 seconds to 32MB files written a few times per hour is the kind of change that shows up directly in your query latency and your S3 bill.

What a bridge query looks like

Your query stays simple. Here's an example that reads the last month of order data — most (but not all) of which is in Iceberg:

SELECT user_id, COUNT(*) AS order_count
FROM default_redpanda_catalog=>orders
WHERE event_time > NOW() - INTERVAL '30 days'
GROUP BY user_id
ORDER BY order_count DESC
LIMIT 100;

There's no UNION, no client awareness of two storage tiers, no special syntax for "the recent stuff," just a single table. The bridge is invisible. You write the query you'd write against any table, and the engine routes the work to the right place.

The real magic here is the stitching process; the Redpanda broker did the Iceberg conversion, so it has perfect clarity on what data is where (and when). This means the bridging process in Redpanda SQL is a simple concatenation of the sources; no deduplication is needed at the SQL layer. All the complications are hidden behind a simple, high-performance table abstraction.

How to configure it

There are two settings that work together to determine how the Redpanda broker will write parquet files to Iceberg:

1. You need to set a long lag target on the topic. This tells Redpanda it has more time before it needs to write topic data to Iceberg:

# Raise the lag target to 1 hour (default is 1 min)
rpk topic alter-config orders --set redpanda.iceberg.target.lag.ms=3600000

2. (Optional) For self-hosted clusters, you also have the option to raise the translator flush threshold for Iceberg Topics. This is the size at which the translator uploads accumulated data as a Parquet file:

# (Optional) Raise the size threshold to 64 MiB (default is 32 MiB)
rpk cluster config set datalake_translator_flush_bytes 67108864

The translator flushes when either threshold is hit, so both need to be sized for your topic's throughput. Rough rule of thumb: throughput × lag ≈ target file size. A topic doing 100 KiB/s with a 1-hour lag and a 64 MiB threshold will hit the size limit and produce nicely-sized files. The same topic with a 1-minute lag will produce small files no matter how high you set the flush threshold.

Note: While Iceberg Topics are available for most Redpanda Streaming deployments, Redpanda SQL will become available to self-managed Redpanda Enterprise deployments in late 2026 / early 2027.

One more thing

Keep partition cardinality low. The default partition spec is (hour(redpanda.timestamp)), which already produces a minimum of one file per partition per hour. If you partition more finely — by minute, or by a high-cardinality field — you'll multiply the file count and undo the benefit of large flushes.

You can change the partition spec using rpk:

# Change the partition spec
rpk topic alter-config orders \
  --set redpanda.iceberg.partition.spec="(day(redpanda.timestamp))"

That's it. With these settings, you'll be producing analytics-friendly Parquet files on an hourly or daily cadence, with bridge queries covering the freshness gap.

Learn more about Redpanda SQL

Bridge queries dissolve a tradeoff that the streaming-to-lakehouse community has lived with for years:

Latency stays in seconds, driven by the topic, not the flush interval
Parquet files are sized for analytics, not for freshness
No bolt-on compaction service chasing small files
No dual-write complexity or hand-rolled "hot + cold" union views
One query language, one logical table, one source of truth

If you've been running a compaction job at 3 AM just to undo what your streaming pipeline did during the day, this is for you.

To learn more about Redpanda SQL and take it for a spin yourself, here are some links to get you started.

No items found.

Join the Redpanda Community on Slack

Chat with our team, ask industry experts, and meet fellow data streaming enthusiasts.

FEATURED RESOURCE

Table of contents

Graphic for Redpanda Streamfest 2025

Related articles

Evgeny Lazin

,

,

&

Jun 18, 2026

Adaptive write request scheduling in Redpanda's Cloud Topics

Solving a Kafka problem to balance batching efficiency against latency and cost

Read more

Marc Millstone

,

Peter Corless

,

&

Jun 16, 2026

Governing AI agents in production: what's new in the Redpanda Agentic Data Plane

Now deployable on Amazon Web Services (AWS)

Read more

Matt Schumpert

,

Mike Broberg

,

&

May 27, 2026

Redpanda SQL is GA: the query engine that skips the pipeline

Your analytics stack can only see data that made it through the pipeline. That's only half the picture.

Read more

PANDA MAIL

Stay in the loop

Subscribe to our VIP (very important panda) mailing list to pounce on the latest blogs, surprise announcements, and community events!
Opt out anytime.