How to optimize real-time data ingestion in Snowflake and Iceberg

Practical strategies to optimize your streaming infrastructure

January 21, 2026

Burning through your Snowflake credits faster than expected? Have thousands of small files in your Apache Iceberg™ tables causing query delays and increasing storage expenses? Welcome to the club.

Real-time data has become a spiraling, uncontrollable cost for many organizations. The problem isn't just the volume of data you're ingesting. It's how you're doing it. 

Traditional streaming architectures create unnecessary complexity, infrastructure overhead, and hidden costs that compound over time. This guide will show you how to build cost-efficient real-time ingestion pipelines that deliver the performance you need without breaking the bank. 

But first, let's dig into the root causes driving up your streaming costs.

The hidden costs of real-time streaming

Your streaming costs aren't driven only by the volume of data you're dealing with, but also by architectural inefficiencies that have compounding impacts across your entire pipeline.

Infrastructure multiplication effects

Traditional streaming platforms like Apache Kafka® require far more than brokers to operate in production. Legacy Kafka deployments relied on ZooKeeper clusters for coordination (modern Kafka 3.6+ can run in KRaft mode without ZooKeeper), and most production setups add schema registries, Kafka Connect clusters, and monitoring infrastructure on top. The diagram below illustrates the complexity of a typical Kafka deployment:

Traditional Kafka architecture

Each component demands its own compute resources, storage, and high availability setup. For example, when you're streaming into Snowflake and Iceberg, you're typically running message brokers with 3–5 nodes for fault tolerance, and ZooKeeper ensembles for coordination might need another 3–5 nodes. You would need to add Kafka Connect workers for Snowflake and Iceberg ingestion, schema registry clusters for data governance, and comprehensive monitoring infrastructure across all these components.

This architectural complexity means you're paying for compute and storage across multiple systems, even during low-traffic periods. This cost then multiplies exponentially as you scale. Each new data source or destination often requires additional connector resources and coordination overhead.

The small file problem

Streaming data arrives continuously in small increments, which leads to the "small file problem": continuous streams generate an avalanche of small files over time.

Traditional small files problem

In Iceberg tables, this causes:

  • Metadata bloat: Each small file creates metadata entries that you need to track in the table's manifest files. With thousands of small files, your metadata can grow larger than your actual data, slowing down query planning and increasing storage costs.
  • Query performance degradation: Snowflake and other query engines are optimized for scanning larger files (roughly 100–250 MB compressed). When your Iceberg tables contain thousands of 1 MB files, query performance suffers dramatically, leading to longer-running warehouses and higher compute costs.
  • Compaction overhead: You'll need frequent compaction jobs to merge these small files, which consume additional compute resources in your data lake infrastructure.
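To see how quickly metadata bloat compounds, here's a back-of-the-envelope sketch. The 1 TB dataset size and the file sizes are illustrative assumptions, not measurements from any real table:

```python
# Back-of-the-envelope sketch: how data file size drives the number of
# entries Iceberg must track in manifest files. Dataset size is illustrative.

def file_count(dataset_bytes: int, file_bytes: int) -> int:
    """Number of data files needed to hold the dataset (ceiling division)."""
    return -(-dataset_bytes // file_bytes)

MB = 1024**2
TB = 1024**4

small = file_count(1 * TB, 1 * MB)        # 1 MB files from frequent flushes
right_sized = file_count(1 * TB, 128 * MB)  # files in the 100-250 MB sweet spot

print(small)                 # 1048576 files to track in manifests
print(right_sized)           # 8192 files
print(small // right_sized)  # 128x more metadata entries
```

The same 1 TB of data produces 128 times more manifest entries when it lands as 1 MB files, which is exactly the metadata the query planner must read before scanning anything.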

Processing pipeline inefficiencies

Beyond infrastructure complexity, traditional streaming architectures rely on inefficient data processing patterns. Most streaming architectures push raw, unfiltered data through multiple processing layers:

  1. Data ingestion from sources to streaming platform
  2. Stream processing for basic filtering and transformation
  3. Connector processing to write data to Snowflake/Iceberg
  4. Post-ingestion processing for final transformations

Traditional data pipeline

Each hop adds latency, infrastructure costs, and potential failure points. When processing happens late in the pipeline, you're paying to move and store data that may never be used, because it’s ultimately filtered out or heavily transformed.

A strategic framework for cost optimization

To reduce your real-time data ingestion costs, you need to take a systematic approach that addresses the root causes discussed above. Your cost optimization framework should consider the following:

  • Source-side filtering and preprocessing
  • Format and compression optimization
  • Partitioning and file management
  • Monitoring and cost tracking
  • Streaming platform selection

Source-side filtering and preprocessing

The most effective cost optimization happens closest to your data sources. This means implementing intelligent filtering and preprocessing before data enters your streaming pipeline. These three approaches can help you optimize on the source side:

  • Change Data Capture (CDC) optimization: Instead of streaming full database snapshots, configure CDC to capture only the changed records. For high-transaction databases, this can reduce data volume, improve performance, and lower resource consumption. You can configure tools like [Debezium](https://debezium.io/) with table-specific filters to capture only relevant columns and operations.
  • Edge aggregation: For IoT and sensor data, implement aggregation at the edge. Instead of streaming individual sensor readings every second, aggregate to minute-level summaries at the device or gateway level. This approach reduces data volume while preserving the analytical value for most use cases.
  • Schema projection: Configure your producers to send only the fields needed downstream. One common pattern involves maintaining separate schemas for operational and analytical workloads, where the analytical schema includes only the 20–30% of fields actually used in reports and dashboards.
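The edge aggregation strategy above can be sketched in a few lines. The field names (device_id, ts, value) and the min/max/avg summary shape are illustrative assumptions, not a prescribed schema:

```python
# Sketch: collapse per-second sensor readings into per-minute summaries at
# the device or gateway level before they enter the streaming pipeline.
from collections import defaultdict
from statistics import mean

def aggregate_minutely(readings):
    """Group readings by (device, minute) and emit one summary per group."""
    buckets = defaultdict(list)
    for r in readings:
        minute = r["ts"] - (r["ts"] % 60)  # floor epoch seconds to the minute
        buckets[(r["device_id"], minute)].append(r["value"])
    return [
        {"device_id": dev, "minute": m,
         "min": min(vals), "max": max(vals),
         "avg": mean(vals), "count": len(vals)}
        for (dev, m), vals in sorted(buckets.items())
    ]

# Two minutes of per-second readings from one device.
readings = [{"device_id": "s1", "ts": t, "value": float(t % 7)}
            for t in range(120)]
summaries = aggregate_minutely(readings)
print(len(readings), "->", len(summaries))  # 120 -> 2
```

Here 120 raw events become 2 summary records, a 60x volume reduction, while min/max/avg still support most dashboard queries.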

Format and compression optimization

How you format and compress your data has a big impact on storage, transfer, and compute costs throughout your pipeline. Focus on optimizing these three key areas:

  • Streaming format selection: For data in transit, use compact binary formats like Apache Avro or Protocol Buffers with compression. Avro with Snappy compression typically achieves 80–90% size reduction, while Protocol Buffers can reduce size by 40–60% for well-structured data.
  • Storage format optimization: Once data lands in Iceberg tables, you can store it as compressed Parquet files or in other columnar formats like ORC. Parquet with Zstandard compression usually delivers 30–60% reduction (dataset-dependent) and strong query performance in Snowflake.
  • Compression codec tuning: You should also choose compression algorithms that suit your data characteristics. For text-heavy data, LZ4 delivers speed when latency matters, while Zstandard maximizes compression ratios for storage efficiency. Numeric data benefits most from Parquet's built-in encoding techniques, such as delta and dictionary compression. For JSON and nested structures, Avro with Snappy balances ratio and schema evolution flexibility.
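The speed-versus-ratio trade-off above is easy to observe directly. Snappy and Zstandard aren't in Python's standard library, so this sketch uses zlib at its fastest level and lzma as stand-ins for the two ends of the spectrum, on illustrative repetitive JSON-like data:

```python
# Sketch: compare a speed-oriented codec against a ratio-oriented one on
# repetitive text data. zlib level 1 stands in for LZ4/Snappy-style speed;
# lzma stands in for high-ratio codecs like Zstandard at high levels.
import zlib
import lzma

payload = b'{"sensor":"temp-01","value":21.5,"unit":"C"}\n' * 1000

fast = zlib.compress(payload, level=1)  # fast, moderate ratio
tight = lzma.compress(payload)          # slower, much smaller output

print(len(payload), len(fast), len(tight))
```

On data like this, both codecs shrink the payload dramatically, but the ratio-oriented codec wins on size while the speed-oriented one wins on CPU cost, which is the trade you're making when you pick LZ4 for latency or Zstandard for storage efficiency.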

Partitioning and file management

Your partitioning strategy can make a huge difference in query costs and performance for both Snowflake and Iceberg. Here are three ways that smart partitioning and file management can help reduce your compute costs:

  • Time-based partitioning: Partition your Iceberg tables by ingestion time (e.g., hourly or daily, depending on your ingestion frequency) to enable efficient time-range filtering. This allows Snowflake queries to skip entire partitions and reduces compute costs for time-filtered queries.
  • Multi-dimensional partitioning: For complex query patterns, consider partitioning by multiple dimensions. For example, partition by date and region if your queries frequently filter on both. Iceberg's hidden partitioning feature automatically rewrites queries to use the correct partition columns, so users can filter on fields like date or region without knowing the partition structure.
  • File size optimization: The optimal file size is about 100–250 MB per Parquet file (up to ~500 MB for very large scans). Below ~100 MB, you incur metadata bloat, and too far above that will limit parallelism. Configure streaming buffers to flush at 10–50 MB to avoid excessive small files, then trigger compaction when partitions exceed 10–20 files to merge them into the optimal 100–250 MB target size.

Monitoring and cost tracking

Without proper visibility into key metrics, you can't measure performance or identify opportunities for optimization. You need comprehensive monitoring across your streaming pipeline, focusing on the metrics that directly impact costs:

  • Pipeline performance: Monitor throughput and latency at each pipeline stage to identify bottlenecks before they become expensive problems. 
  • Storage health: Track file count and size distribution in your Iceberg tables to catch small file proliferation early. 
  • Compaction: Keep close tabs on compaction frequency and duration to optimize maintenance windows and prevent performance degradation.
  • Query efficiency: Measure query scan ratios in Snowflake to validate that your partition pruning strategies are working effectively.
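The storage-health check above can be reduced to one metric: the fraction of files in a partition below the healthy size. The 100 MB threshold and the 50% alert ratio here are illustrative assumptions you would tune for your tables:

```python
# Sketch: flag small-file proliferation in a partition before it degrades
# query performance. Thresholds are illustrative and should be tuned.

MB = 1024**2
SMALL_FILE_BYTES = 100 * MB  # below ~100 MB counts as "small"
ALERT_RATIO = 0.5            # alert when half the partition's files are small

def small_file_report(file_sizes):
    small = sum(1 for s in file_sizes if s < SMALL_FILE_BYTES)
    ratio = small / len(file_sizes) if file_sizes else 0.0
    return {"files": len(file_sizes), "small": small,
            "ratio": round(ratio, 2), "alert": ratio >= ALERT_RATIO}

# A healthy partition vs. one dominated by 1 MB streaming flushes.
healthy = small_file_report([150 * MB] * 8 + [30 * MB] * 2)
degraded = small_file_report([1 * MB] * 40 + [150 * MB] * 2)
print(healthy["alert"], degraded["alert"])  # False True
```

Feeding this check from your table's file listing on a schedule turns the small file problem from a surprise compute bill into a routine alert.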

Beyond operational metrics, you also need a detailed understanding of spending patterns to optimize resource allocation. You should know exactly where your money is going:

  • Compute costs: You should use granular cost tracking that breaks down Snowflake warehouse utilization by ingestion pipeline, which lets you identify which data sources drive the highest compute costs. Monitor compute costs for transformation and compaction jobs to balance processing efficiency with resource consumption.
  • Storage costs: Attribute storage costs to specific data sources to understand the true cost of different data streams.
  • Network costs: Track network transfer costs for cross-region streaming to make informed data locality decisions.

Streaming platform selection

While the optimization strategies above are effective, they still require managing complex multi-component architectures. Traditional Kafka deployments for streaming into Iceberg often involve schema registries, Kafka Connect, and processing frameworks like Flink or Apache Spark, each adding cost, latency, and operational overhead.

Your streaming platform choice directly impacts costs, complexity, and scalability. Consider using a lightweight, cost-efficient platform, such as Redpanda, that offers low-latency streaming without Kafka's complexity.

Simplify real-time data ingestion in Snowflake and Iceberg with Redpanda

Redpanda tackles the infrastructure multiplication problem by bundling everything into a single binary that replaces multiple Kafka ecosystem components. It handles schema management internally and supports Iceberg Topics for writing data straight to Iceberg tables, without Kafka Connect, Spark, or Flink. This approach simplifies operations, shrinks your infrastructure footprint, and improves performance.

Redpanda can write directly to Iceberg tables, dramatically simplifying the data pipeline, whereas traditional streaming architectures require multiple hops to get data from producers to Iceberg/Snowflake. To put it simply:

Traditional path: Producer → Kafka → Kafka Connect → Processing Engine → Iceberg → Snowflake
Redpanda path: Producer → Redpanda Iceberg Topics → Iceberg → Snowflake

This direct integration eliminates infrastructure costs and complexity while providing several key operational advantages that address the problems identified earlier:

  • Automatic file optimization: Redpanda controls flush size and commits Parquet files about every 60 seconds (iceberg_catalog_commit_interval_ms). To maintain long-term table health, you can configure periodic Iceberg compaction, which can reduce the number of small files by more than 40%, depending on the ingestion pattern.
  • Schema evolution support: Redpanda's Schema Registry integration automatically validates and applies compatible changes to your data schema. This ensures new fields or updates don’t break existing pipelines, preventing ingestion failures and reducing operational overhead.
  • Cloud-native storage: Data is written directly to your cloud storage (Amazon S3, Google Cloud Storage, Azure Blob Storage) in Iceberg format, eliminating intermediate storage costs and reducing latency.

For more complex integration scenarios, Redpanda Connect provides a lightweight alternative to traditional ETL tools. Unlike heavy frameworks that require dedicated cluster management, it runs as standalone processes that transform data in flight, applying filtering, enrichment, and format conversion without separate processing clusters. It can route to multiple destinations simultaneously, sending data to both Snowflake and Iceberg with different transformations as needed. The system also handles backpressure intelligently, automatically adjusting ingestion rates based on downstream capacity to prevent bottlenecks.

Simplify your streaming infrastructure

The complexity of traditional Kafka ecosystems leads to costs that escalate as you scale. To minimize your streaming expenses, you need to do more than optimize individual components; you need a complete reevaluation of your entire system architecture. Combining source-side intelligence with format optimization, compression management, strategic file management, and platform selection leads to significant cost savings. Start with these strategies and consider how a consolidated streaming platform like Redpanda could help you build something more sustainable and cost-effective.

Ready to see how much you can save on your streaming infrastructure? Try out our Price Estimator to calculate your monthly costs.
