
Exporting data from Redpanda to S3 in batched JSON arrays
Learn how to write messages from Redpanda to its own JSON file in an S3 bucket.

In part one of this series, we built a Redpanda Connect pipeline that streamed CSV data from S3. In part two, we covered how to stream individual Redpanda messages into S3, writing each one as a separate JSON file. While that approach works for simple workflows, it doesn’t scale well.
For most real-world applications, batching is essential. Writing multiple messages into a single file reduces overhead in S3 and improves performance for any downstream tools consuming that data (whether it’s a web application, a microservice, or a document database like MongoDB).
Fortunately, Redpanda Connect makes batching straightforward, taking a simpler and more streamlined approach than Kafka Connect.
In this post, we’ll walk through:
- Reading from a Redpanda topic
- Grouping messages into batches
- Writing those batches to S3 as JSON arrays
This tutorial assumes you’ve deployed a Redpanda Serverless cluster and created a topic called clean-users (if you haven’t, see our post on bringing data from S3 into Redpanda Serverless).
Exporting data from Redpanda to S3 in JSON arrays
If you’re a visual learner, watch this 1-minute video to see how it’s done.
To follow along step-by-step, read on.
Example Redpanda Connect pipeline
Deploy this example Redpanda Connect YAML configuration as a Redpanda Connect pipeline from within your cluster. You'll need to update the input and output sections according to your own settings.
input:
  redpanda:
    seed_brokers:
      - ${REDPANDA_BROKERS}
    topics:
      - clean-users
    consumer_group: s3_consumer_batched
    tls:
      enabled: true
    sasl:
      - mechanism: SCRAM-SHA-256
        username: ${secrets.REDPANDA_CLOUD_SASL_USERNAME}
        password: ${secrets.REDPANDA_CLOUD_SASL_PASSWORD}

output:
  aws_s3:
    bucket: brobe-rpcn-output
    region: us-east-1
    tags:
      rpcn-pipeline: rp-to-s3-json-batched
    credentials:
      id: ${secrets.AWS_ACCESS_KEY_ID}
      secret: ${secrets.AWS_SECRET_ACCESS_KEY}
    path: batch_view/${!counter()}-${!timestamp_unix_nano()}.json
    batching:
      count: 6
      processors:
        - archive:
            format: json_array
The key addition here is the batching section under output.aws_s3:
- count: 6: This tells Redpanda Connect to collect six messages before writing them to S3 as a single file. (Well, kind of. It depends on how quickly messages are produced.)
- processors.archive.format: json_array: Instead of writing each message as a separate JSON object, Redpanda Connect will lovingly bundle them into a single JSON array, which is then written as one file. Much cleaner, way more efficient.
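The path field is also worth a look: it uses Redpanda Connect interpolation functions to give each batch a unique object key. counter() increments once per written file, and timestamp_unix_nano() captures the write time in nanoseconds, so the resulting keys look something like this (values are illustrative):

batch_view/1-1714564800123456789.json
batch_view/2-1714564861987654321.json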
Once deployed, you'll find files in S3 that look something like this, with multiple messages bundled into each file. (Messages appear in the order they were consumed, which won't necessarily match their timestamps.)

[
  {
    "action": "SMASH!",
    "ssn": true,
    "timestamp": "2024-05-01T12:00:00Z",
    "user_id": 123
  },
  {
    "action": "CRASH!!!",
    "ssn": true,
    "timestamp": "2024-05-01T12:02:00Z",
    "user_id": 789
  },
  {
    "action": "BASH!!",
    "ssn": true,
    "timestamp": "2024-05-01T12:01:00Z",
    "user_id": 456
  },
  {
    "action": "SCROLL",
    "ssn": false,
    "timestamp": "2024-05-01T12:01:00Z",
    "user_id": 456
  },
  {
    "action": "HOVER",
    "ssn": false,
    "timestamp": "2024-05-01T12:02:00Z",
    "user_id": 789
  },
  {
    "action": "CLICK",
    "ssn": false,
    "timestamp": "2024-05-01T12:00:00Z",
    "user_id": 123
  }
]
Cleanup and best practices
Since this pipeline continuously listens to a Redpanda topic, it will keep running until you stop it. To avoid unneeded charges, be sure to stop any running Redpanda Connect pipelines and delete your Redpanda topic when you're done with them. If this was a test run, also consider cleaning up any temporary files or buckets.
If you’re running this in production, consider increasing the count value or adding a period field to flush batches on a timer.
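As a sketch, a production-leaning batching policy might look like the following (the count and period values here are illustrative; tune them to your throughput). count caps how many messages go into one file, while period guarantees a flush even during quiet stretches:

batching:
  count: 100
  period: 30s
  processors:
    - archive:
        format: json_array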
What’s next
Batching is a major improvement over writing individual messages, but it’s still just JSON. For large datasets and analytics-focused use cases, formats like Parquet offer much better performance, compression, and schema evolution.
Next, we’ll explore how to optimize your Redpanda-to-S3 pipeline by exporting Parquet files, tailored for downstream analytical workloads.
In the meantime, if you have questions about this tutorial, hop into the Redpanda Community on Slack and ask away.