Exporting data from Redpanda to S3 in batched JSON arrays

Learn how to batch messages from Redpanda and write them to an S3 bucket as JSON arrays.

By Chandler Mayo and Mike Broberg on August 7, 2025

TL;DR Takeaways:
How can I configure Redpanda Connect for batching?

In your Redpanda Connect YAML configuration, you need to update the batching section under output.aws_s3. For instance, setting 'count: 6' instructs Redpanda Connect to collect six messages before writing them to S3 as a single file. The 'processors.archive.format: json_array' setting allows Redpanda Connect to bundle messages into a single JSON array, which is then written as one file.
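
For reference, here's a minimal sketch of just those batching settings (values are illustrative; the full pipeline appears later in this post):

output:
  aws_s3:
    batching:
      count: 6
      processors:
        - archive:
            format: json_array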

How does Redpanda Connect facilitate data batching?

Redpanda Connect makes batching simple to configure: you declare a batch policy in the output section of your pipeline, grouping messages into batches that are written to S3 as JSON arrays.

What considerations should I make when running Redpanda Connect in production?

When running Redpanda Connect in production, consider increasing the count value or adding a period timeout that flushes batches on a timer. Larger batches mean fewer, larger objects in S3, and a period ensures batches still flush when traffic is slow.

What is the advantage of batching in data streaming?

Batching is crucial for most real-world applications. Writing multiple messages into a single file reduces overhead in S3 and enhances performance for downstream tools consuming the data. This could include a web application, a microservice, or a document database like MongoDB.

What is the next step after batching in Redpanda Connect?

After batching, you can further optimize your Redpanda-to-S3 pipeline by exporting Parquet files, which are better suited for large datasets and analytics-focused use cases due to their superior performance, compression, and schema evolution capabilities.

What should I do after running a Redpanda Connect pipeline?

After running a Redpanda Connect pipeline, it's essential to stop any running pipelines and delete your Redpanda streaming topic to avoid unnecessary charges. If it was a test run, consider cleaning up temporary files or buckets.

Learn more at Redpanda University

In part one of this series, we built a Redpanda Connect pipeline that streamed CSV data from S3. In part two, we covered how to stream individual Redpanda messages into S3, writing each one as a separate JSON file. While that approach works for simple workflows, it doesn’t scale well.

For most real-world applications, batching is essential. Writing multiple messages into a single file reduces overhead in S3 and improves performance for any downstream tools consuming that data (whether it’s a web application, a microservice, or a document database like MongoDB).

Fortunately, Redpanda Connect makes batching straightforward. Unlike Kafka Connect, it takes a simpler approach: the batch policy is declared directly in the output section of your pipeline configuration.

In this post, we’ll walk through:

  • Reading from a Redpanda topic
  • Grouping messages into batches
  • Writing those batches to S3 as JSON arrays

This assumes you’ve deployed a Redpanda Serverless cluster and created a topic called clean-users (see our post on bringing data from S3 into Redpanda Serverless, if you haven’t already).
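
If you still need to create the topic, one way to do it (assuming you have rpk installed and configured for your cluster) is:

rpk topic create clean-users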

Exporting data from Redpanda to S3 in JSON arrays

If you’re a visual learner, watch this 1-minute video to see how it’s done.

To follow along step-by-step, read on.

Example Redpanda Connect pipeline

Deploy this example Redpanda Connect YAML configuration as a Redpanda Connect pipeline from within your cluster. You'll need to update the input and output sections according to your own settings.

input:
  redpanda:
    seed_brokers:
      - ${REDPANDA_BROKERS}
    topics:
      - clean-users
    consumer_group: s3_consumer_batched
    tls:
      enabled: true
    sasl:
      - mechanism: SCRAM-SHA-256
        username: ${secrets.REDPANDA_CLOUD_SASL_USERNAME}
        password: ${secrets.REDPANDA_CLOUD_SASL_PASSWORD}

output:
  aws_s3:
    bucket: brobe-rpcn-output
    region: us-east-1
    tags:
      rpcn-pipeline: rp-to-s3-json-batched
    credentials:
      id: ${secrets.AWS_ACCESS_KEY_ID}
      secret: ${secrets.AWS_SECRET_ACCESS_KEY}
    path: batch_view/${!counter()}-${!timestamp_unix_nano()}.json
    batching:
      count: 6
      processors:
        - archive:
            format: json_array

The key addition here is the batching section under output.aws_s3:

  • count: 6: This tells Redpanda Connect to collect six messages before writing them to S3 as a single file. (More precisely, it's an upper bound per batch: with no other flush trigger configured, how quickly each file lands depends on how fast messages are produced.)
  • processors.archive.format: json_array: Instead of writing each message as a separate JSON object, Redpanda Connect will lovingly bundle them into a single JSON array, which is then written as one file. Much cleaner, way more efficient.

Once deployed, you'll find files in S3 that look something like this, containing multiple messages within one file.

[
    {
        "action": "SMASH!",
        "ssn": true,
        "timestamp": "2024-05-01T12:00:00Z",
        "user_id": 123
    },
    {
        "action": "CRASH!!!",
        "ssn": true,
        "timestamp": "2024-05-01T12:02:00Z",
        "user_id": 789
    },
    {
        "action": "BASH!!",
        "ssn": true,
        "timestamp": "2024-05-01T12:01:00Z",
        "user_id": 456
    },
    {
        "action": "SCROLL",
        "ssn": false,
        "timestamp": "2024-05-01T12:01:00Z",
        "user_id": 456
    },
    {
        "action": "HOVER",
        "ssn": false,
        "timestamp": "2024-05-01T12:02:00Z",
        "user_id": 789
    },
    {
        "action": "CLICK",
        "ssn": false,
        "timestamp": "2024-05-01T12:00:00Z",
        "user_id": 123
    }
]

Cleanup and best practices

Since this pipeline continuously listens to a Redpanda topic, it will keep running until you stop it. To avoid unneeded charges, be sure to stop any running Redpanda Connect pipelines and delete your Redpanda topic when you're done with them. If this was a test run, also consider cleaning up any temporary files or buckets.
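
For example, once the pipeline is stopped, you can delete the topic with rpk (assuming it's configured for your cluster):

rpk topic delete clean-users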

If you’re running this in production, consider increasing the count value or adding a period timeout for flushing batches on a timer.
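
Here's a sketch of what that might look like (the numbers are illustrative; tune them to your throughput):

output:
  aws_s3:
    # ...same bucket, credentials, and path settings as above...
    batching:
      count: 100   # flush once 100 messages have accumulated...
      period: 30s  # ...or after 30 seconds, whichever comes first
      processors:
        - archive:
            format: json_array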

What’s next

Batching is a major improvement over writing individual messages, but it’s still just JSON. For large datasets and analytics-focused use cases, formats like Parquet offer much better performance, compression, and schema evolution.

Next, we explore how to optimize your Redpanda-to-S3 pipeline by exporting Parquet files, tailored for downstream analytical workloads. Here's the full series so you know what's ahead: 

  1. Bringing data from Amazon S3 into Redpanda Serverless
  2. Writing data from Redpanda to Amazon S3
  3. Exporting data from Redpanda to S3 in batched JSON arrays (you're here)
  4. Streaming optimized data to S3 for analytics with Parquet
  5. Building event-driven pipelines with SQS and S3

In the meantime, if you have questions about this series, hop into the Redpanda Community on Slack and ask away.

