Exporting data from Redpanda to S3 in batched JSON arrays

Learn how to batch multiple Redpanda messages into a single JSON array file in an S3 bucket.

By Chandler Mayo and Mike Broberg on August 7, 2025

In part one of this series, we built a Redpanda Connect pipeline that streamed CSV data from S3. In part two, we covered how to stream individual Redpanda messages into S3, writing each one as a separate JSON file. While that approach works for simple workflows, it doesn’t scale well.

For most real-world applications, batching is essential. Writing multiple messages into a single file reduces overhead in S3 and improves performance for any downstream tools consuming that data (whether it’s a web application, a microservice, or a document database like MongoDB).

Fortunately, Redpanda Connect makes batching straightforward. Unlike Kafka Connect, it handles batching declaratively: a few lines in your pipeline's output configuration are all it takes.

In this post, we’ll walk through:

  • Reading from a Redpanda topic
  • Grouping messages into batches
  • Writing those batches to S3 as JSON arrays

This assumes you've deployed a Redpanda Serverless cluster and created a topic called clean-users (if you haven't already, see our post on bringing data from S3 into Redpanda Serverless).
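If you still need the topic, one quick way to create it is with rpk (this assumes rpk is already authenticated against your cluster):

rpk topic create clean-users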

Exporting data from Redpanda to S3 in JSON arrays

If you’re a visual learner, watch this 1-minute video to see how it’s done.

To follow along step-by-step, read on.

Example Redpanda Connect pipeline

Deploy this example Redpanda Connect YAML configuration as a Redpanda Connect pipeline from within your cluster. You'll need to update the input and output sections according to your own settings.

input:
  redpanda:
    seed_brokers:
      - ${REDPANDA_BROKERS}
    topics:
      - clean-users
    consumer_group: s3_consumer_batched  # tracks this pipeline's offsets independently
    tls:
      enabled: true
    sasl:
      - mechanism: SCRAM-SHA-256
        username: ${secrets.REDPANDA_CLOUD_SASL_USERNAME}
        password: ${secrets.REDPANDA_CLOUD_SASL_PASSWORD}

output:
  aws_s3:
    bucket: brobe-rpcn-output
    region: us-east-1
    tags:
      rpcn-pipeline: rp-to-s3-json-batched
    credentials:
      id: ${secrets.AWS_ACCESS_KEY_ID}
      secret: ${secrets.AWS_SECRET_ACCESS_KEY}
    # counter() and timestamp_unix_nano() give every batch file a unique name
    path: batch_view/${!counter()}-${!timestamp_unix_nano()}.json
    batching:
      count: 6  # collect six messages per file
      processors:
        - archive:
            format: json_array  # bundle the batch into one JSON array

The key addition here is the batching section under output.aws_s3:

  • count: 6: This tells Redpanda Connect to collect six messages before writing them to S3 as a single file. (With only count set, a partially filled batch waits until the sixth message arrives, so flush timing depends on how quickly messages are produced; see the period tip under best practices below.)
  • processors.archive.format: json_array: Instead of writing each message as a separate JSON object, Redpanda Connect will lovingly bundle them into a single JSON array, which is then written as one file. Much cleaner, way more efficient.

Once deployed, you'll find files in S3 that look something like this, with multiple messages in each file.

[
    {
        "action": "SMASH!",
        "ssn": true,
        "timestamp": "2024-05-01T12:00:00Z",
        "user_id": 123
    },
    {
        "action": "CRASH!!!",
        "ssn": true,
        "timestamp": "2024-05-01T12:02:00Z",
        "user_id": 789
    },
    {
        "action": "BASH!!",
        "ssn": true,
        "timestamp": "2024-05-01T12:01:00Z",
        "user_id": 456
    },
    {
        "action": "SCROLL",
        "ssn": false,
        "timestamp": "2024-05-01T12:01:00Z",
        "user_id": 456
    },
    {
        "action": "HOVER",
        "ssn": false,
        "timestamp": "2024-05-01T12:02:00Z",
        "user_id": 789
    },
    {
        "action": "CLICK",
        "ssn": false,
        "timestamp": "2024-05-01T12:00:00Z",
        "user_id": 123
    }
]
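To spot-check the results, you can list the output prefix with the AWS CLI (the bucket name comes from the example config above; substitute your own):

aws s3 ls s3://brobe-rpcn-output/batch_view/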

Cleanup and best practices

Since this pipeline continuously listens to a Redpanda topic, it will keep running until you stop it. To avoid unneeded charges, stop any running Redpanda Connect pipelines and delete your Redpanda topic when you're done with them. If this was a test run, also consider cleaning up temporary files or buckets.
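If you use rpk and the AWS CLI, cleanup might look like this (again using the example bucket name; yours will differ):

rpk topic delete clean-users
aws s3 rm s3://brobe-rpcn-output/batch_view/ --recursive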

If you're running this in production, consider increasing the count value or adding a period so partial batches also flush on a timer, as sketched below.
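Here's a minimal sketch of that batching policy (the count and period values are illustrative; tune them to your throughput):

batching:
  count: 100    # flush after 100 messages...
  period: 30s   # ...or after 30 seconds, whichever comes first
  processors:
    - archive:
        format: json_array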

What’s next

Batching is a major improvement over writing individual messages, but it’s still just JSON. For large datasets and analytics-focused use cases, formats like Parquet offer much better performance, compression, and schema evolution.

Next, we’ll explore how to optimize your Redpanda-to-S3 pipeline by exporting Parquet files, tailored for downstream analytical workloads.

In the meantime, if you have questions about this tutorial, hop into the Redpanda Community on Slack and ask away.


Related articles

  • Writing data from Redpanda to Amazon S3, by Chandler Mayo and Mike Broberg (July 29, 2025)
  • Bringing data from Amazon S3 into Redpanda Serverless, by Chandler Mayo and Mike Broberg (July 15, 2025)
  • Track real-time ad analytics with Snowflake (the easy way), by Gaurav Thalpati (July 8, 2025)