Exporting data from Redpanda to S3 in batched JSON arrays

Learn how to batch messages from Redpanda and write them to an S3 bucket as JSON arrays.

By Chandler Mayo and Mike Broberg on August 7, 2025

TL;DR Takeaways:
How can I configure Redpanda Connect for batching?

In your Redpanda Connect YAML configuration, you need to update the batching section under output.aws_s3. For instance, setting 'count: 6' instructs Redpanda Connect to collect six messages before writing them to S3 as a single file. The 'processors.archive.format: json_array' setting allows Redpanda Connect to bundle messages into a single JSON array, which is then written as one file.
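
For reference, here's a minimal sketch of just those batching settings (values are illustrative; the full pipeline appears later in this post):

output:
  aws_s3:
    batching:
      count: 6
      processors:
        - archive:
            format: json_array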

How does Redpanda Connect facilitate data batching?

Redpanda Connect makes batching simple to configure: you declare a batch policy in the output section of your pipeline, grouping messages into batches that are written to S3 as JSON arrays.

What considerations should I make when running Redpanda Connect in production?

When running Redpanda Connect in production, consider increasing the count value or adding a period timeout that flushes batches on a timer. Larger batches mean fewer, larger objects in S3, and a period ensures batches still flush when traffic is slow.

What is the advantage of batching in data streaming?

Batching is crucial for most real-world applications. Writing multiple messages into a single file reduces overhead in S3 and enhances performance for downstream tools consuming the data. This could include a web application, a microservice, or a document database like MongoDB.

What is the next step after batching in Redpanda Connect?

After batching, you can further optimize your Redpanda-to-S3 pipeline by exporting Parquet files, which are better suited for large datasets and analytics-focused use cases due to their superior performance, compression, and schema evolution capabilities.

What should I do after running a Redpanda Connect pipeline?

After running a Redpanda Connect pipeline, it's essential to stop any running pipelines and delete your Redpanda streaming topic to avoid unnecessary charges. If it was a test run, consider cleaning up temporary files or buckets.

Learn more at Redpanda University

In part one of this series, we built a Redpanda Connect pipeline that streamed CSV data from S3. In part two, we covered how to stream individual Redpanda messages into S3, writing each one as a separate JSON file. While that approach works for simple workflows, it doesn’t scale well.

For most real-world applications, batching is essential. Writing multiple messages into a single file reduces overhead in S3 and improves performance for any downstream tools consuming that data (whether it’s a web application, a microservice, or a document database like MongoDB).

Fortunately, Redpanda Connect makes batching straightforward. Unlike Kafka Connect, it takes a simpler approach: the batch policy is declared directly in the output section of your pipeline configuration.

In this post, we’ll walk through:

  • Reading from a Redpanda topic
  • Grouping messages into batches
  • Writing those batches to S3 as JSON arrays

This assumes you’ve deployed a Redpanda Serverless cluster and created a topic called clean-users (see our post on bringing data from S3 into Redpanda Serverless, if you haven’t already).
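
If you still need to create the topic, one way to do it (assuming you have rpk installed and configured for your cluster) is:

rpk topic create clean-users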

Exporting data from Redpanda to S3 in JSON arrays

If you’re a visual learner, watch this 1-minute video to see how it’s done.

To follow along step-by-step, read on.

Example Redpanda Connect pipeline

Deploy this example Redpanda Connect YAML configuration as a Redpanda Connect pipeline from within your cluster. You'll need to update the input and output sections according to your own settings.

input:
  redpanda:
    seed_brokers:
      - ${REDPANDA_BROKERS}
    topics:
      - clean-users
    consumer_group: s3_consumer_batched
    tls:
      enabled: true
    sasl:
      - mechanism: SCRAM-SHA-256
        username: ${secrets.REDPANDA_CLOUD_SASL_USERNAME}
        password: ${secrets.REDPANDA_CLOUD_SASL_PASSWORD}

output:
  aws_s3:
    bucket: brobe-rpcn-output
    region: us-east-1
    tags:
      rpcn-pipeline: rp-to-s3-json-batched
    credentials:
      id: ${secrets.AWS_ACCESS_KEY_ID}
      secret: ${secrets.AWS_SECRET_ACCESS_KEY}
    path: batch_view/${!counter()}-${!timestamp_unix_nano()}.json
    batching:
      count: 6
      processors:
        - archive:
            format: json_array

The key addition here is the batching section under output.aws_s3:

  • count: 6: This tells Redpanda Connect to collect six messages before writing them to S3 as a single file. (More precisely, it's an upper bound per batch: with no other flush trigger configured, how quickly each file lands depends on how fast messages are produced.)
  • processors.archive.format: json_array: Instead of writing each message as a separate JSON object, Redpanda Connect will lovingly bundle them into a single JSON array, which is then written as one file. Much cleaner, way more efficient.

Once deployed, you'll find files in S3 that look something like this, containing multiple messages within one file.

[
    {
        "action": "SMASH!",
        "ssn": true,
        "timestamp": "2024-05-01T12:00:00Z",
        "user_id": 123
    },
    {
        "action": "CRASH!!!",
        "ssn": true,
        "timestamp": "2024-05-01T12:02:00Z",
        "user_id": 789
    },
    {
        "action": "BASH!!",
        "ssn": true,
        "timestamp": "2024-05-01T12:01:00Z",
        "user_id": 456
    },
    {
        "action": "SCROLL",
        "ssn": false,
        "timestamp": "2024-05-01T12:01:00Z",
        "user_id": 456
    },
    {
        "action": "HOVER",
        "ssn": false,
        "timestamp": "2024-05-01T12:02:00Z",
        "user_id": 789
    },
    {
        "action": "CLICK",
        "ssn": false,
        "timestamp": "2024-05-01T12:00:00Z",
        "user_id": 123
    }
]

Cleanup and best practices

Since this pipeline continuously listens to a Redpanda topic, it will keep running until you stop it. To avoid unneeded charges, be sure to stop any running Redpanda Connect pipelines and delete your Redpanda topic when you're done with them. If this was a test run, also consider cleaning up any temporary files or buckets.
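
For example, once the pipeline is stopped, you can delete the topic with rpk (assuming it's configured for your cluster):

rpk topic delete clean-users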

If you’re running this in production, consider increasing the count value or adding a period timeout for flushing batches on a timer.
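
Here's a sketch of what that might look like (the numbers are illustrative; tune them to your throughput):

output:
  aws_s3:
    # ...same bucket, credentials, and path settings as above...
    batching:
      count: 100   # flush once 100 messages have accumulated...
      period: 30s  # ...or after 30 seconds, whichever comes first
      processors:
        - archive:
            format: json_array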

What’s next

Batching is a major improvement over writing individual messages, but it’s still just JSON. For large datasets and analytics-focused use cases, formats like Parquet offer much better performance, compression, and schema evolution.

Next, we explore how to optimize your Redpanda-to-S3 pipeline by exporting Parquet files, tailored for downstream analytical workloads. Here's the full series so you know what's ahead: 

  1. Bringing data from Amazon S3 into Redpanda Serverless
  2. Writing data from Redpanda to Amazon S3
  3. Exporting data from Redpanda to S3 in batched JSON arrays (you're here)
  4. Streaming optimized data to S3 for analytics with Parquet
  5. Building event-driven pipelines with SQS and S3

In the meantime, if you have questions about this series, hop into the Redpanda Community on Slack and ask away.

