Evolving for the enterprise: lessons from running BYOC at scale with ShareChat

Here’s how we supercharged Redpanda Cloud while saving India’s largest social media company 70% on cloud spend

March 12, 2024

ShareChat, India’s largest social media company, has been running its streaming data infrastructure on Redpanda Cloud for almost a year. Before Redpanda, they relied on Google Pub/Sub. However, with ShareChat’s ambitious scale and volume of events, Pub/Sub’s core architectural inefficiencies ultimately led to tremendous cloud costs and unsustainable operational complexity.

Enter Redpanda BYOC, our flexible and fully managed cloud deployment option that greatly simplified and streamlined ShareChat’s architecture, reducing ShareChat’s cloud infrastructure spend by 70%.

You can read a detailed account of how ShareChat switched from Pub/Sub to Redpanda on their blog, but to summarize, their new panda-powered architecture is:

  • Simple to deploy, operate, and maintain
  • Highly resilient to real or injected failure modes
  • Consistently performant and impressively efficient
  • Built to maintain data ownership and sovereignty

However, given ShareChat’s unique requirements, we underwent several iterations to ensure Redpanda ticked every box. And, as we evolved to suit ShareChat’s needs, we also created a more robust Redpanda Cloud for large-scale enterprises overall. In this post, we go behind the scenes and dig deeper into these enterprise-ready enhancements.

Cost savings and ramped up resiliency with Tiered Storage

To set the scene, ShareChat’s workload is characterized predominantly by consumers that read data shortly after it has been written. Even so, the data is still written to Tiered Storage (GCS), which means Redpanda can act as a buffer if a downstream system fails or suffers an outage.

Redpanda Cloud typically holds around six hours of data on local NVMe storage, with the amount of long-term data kept in object storage configurable by the end user. On the rare occasions that we’ve seen ShareChat read from Tiered Storage, the reads were fast enough that the client didn’t even notice.

Furthermore, in Redpanda 23.2 we introduced follower fetching, which allowed ShareChat to configure its consumers to read data from replicas within their own availability zone. This reduced the cross-AZ data transfer charges that ShareChat was incurring on these clusters.
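As a rough illustration of what that consumer-side configuration can look like, here is a minimal sketch. It assumes a Kafka-compatible client that accepts dotted configuration keys (such as librdkafka-based ones) and a cluster with rack awareness enabled and rack IDs that match the zone names. The function name, broker address, group ID, and zone are hypothetical and not taken from ShareChat’s setup; only the standard Kafka `client.rack` property is the real knob here.

```python
def follower_fetch_consumer_config(bootstrap_servers: str,
                                   group_id: str,
                                   availability_zone: str) -> dict:
    """Build consumer settings that advertise the client's zone.

    `client.rack` is the standard Kafka consumer property used for
    follower fetching. Its value is assumed to match the rack ID that
    the brokers in the same availability zone are configured with.
    """
    if not availability_zone:
        raise ValueError("availability_zone is required for follower fetching")
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # Tells the cluster where this consumer runs so that reads can be
        # served by a replica in the same zone instead of a remote leader.
        "client.rack": availability_zone,
    }


# Hypothetical values, for illustration only.
config = follower_fetch_consumer_config(
    bootstrap_servers="redpanda.example.internal:9092",
    group_id="example-consumer-group",
    availability_zone="asia-south1-a",
)
print(config["client.rack"])
```

The resulting dictionary would be passed to the consumer constructor of whichever client library is in use. Producers are unaffected, since writes still go to the partition leader.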

"We have reduced our cloud infrastructure spend on event streaming by about 70%, resulting in savings of millions of USD annually. We are now scaling up new use cases on our event streaming architecture without worrying about spiraling infrastructure costs." - ShareChat

Scale up with no downtime

When switching from Pub/Sub to Redpanda, ShareChat migrated their workloads incrementally, which allowed us to closely monitor throughput, connection counts, and latency to ensure that the cluster remained right-sized for its workload.

Next, it was time to scale up. Under the hood of Redpanda Cloud, there are cluster tiers, where each incremental tier represents additional capacity along with its own configuration and tuning. ShareChat’s main clusters started as Tier 3, but have scaled all the way through to Tier 7 without any downtime or noticeable impact on users.

We also scaled up the clusters in response to significant cultural or societal events within India. For example, during Diwali or Indian Independence Day, ShareChat typically expects upwards of 50% additional traffic through their app, and therefore through the Redpanda clusters, peaking at over 2 GB/s.

Happily enough, these dynamic scale-up and scale-down events occur without any noticeable end-user impact. Plus, they’re entirely managed by Redpanda Cloud, which takes the pressure off ShareChat’s engineers during high-throughput times.

Graph showing cluster throughput over the year

On the rare occasions when instances fail on AWS or GCP, or disks start degrading, Redpanda’s built-in health checks swiftly detect the problem, swap out the affected instances, and let the cluster rebalance itself. Crisis averted.

Optimized tiers - another 36% in cost savings!

In November 2023, we launched our new cloud instance profiles, which have brought savings of up to 50% by using ARM- and AMD-based processors, as well as supporting improved partition densities.

For ShareChat running on GCP, this meant moving their instances from n2-standard instance family machines to n2d-standard instances. We also reduced the overall amount of local storage used on these clusters.

These new profiles allowed us to reduce the overall node count, which we needed to do ahead of ShareChat’s major annual scale event: New Year’s Eve.

New year, new nodes!

ShareChat begins planning for New Year’s Eve weeks in advance, working closely with cloud providers to ensure that there’s sufficient capacity within their regions, and ensuring that quotas and approvals are in place. Nobody wants to sit waiting for quota approvals as the countdown begins!

For Redpanda, we knew we were approaching the IP address space limit that had been set on the original Tier 3 clusters at the beginning of the year — newer clusters don’t have this issue. There was the concerning possibility that we could exhaust the available addresses and have no more capacity available.
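To make the address-space concern concrete, here is a small sketch of how such a limit can be reasoned about. It is a simplified model under stated assumptions, not a description of how Redpanda Cloud sizes its networks: the CIDR ranges, node counts, and the number of provider-reserved addresses are all hypothetical, and it assumes old and new nodes briefly coexist while nodes are being replaced.

```python
import ipaddress


def usable_addresses(cidr: str, reserved: int = 4) -> int:
    """Addresses left in a subnet after provider-reserved ones.

    `reserved` is an assumed per-subnet overhead; check your cloud
    provider's documentation for the real figure.
    """
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - reserved, 0)


def fits_in_subnet(cidr: str,
                   current_nodes: int,
                   replacement_nodes: int,
                   other_addresses_in_use: int = 0,
                   reserved: int = 4) -> bool:
    """Check whether old and new nodes can hold addresses at the same time."""
    needed = current_nodes + replacement_nodes + other_addresses_in_use
    return needed <= usable_addresses(cidr, reserved)


# Hypothetical example: a /28 subnet offers 16 addresses, 12 of them usable.
print(usable_addresses("10.0.0.0/28"))
print(fits_in_subnet("10.0.0.0/28", current_nodes=9, replacement_nodes=3))
print(fits_in_subnet("10.0.0.0/28", current_nodes=9, replacement_nodes=9))
```

In this toy model, replacing a few nodes at a time fits in the range, while standing up a full second set of nodes alongside the old ones does not. A fixed range can limit both how far a cluster grows and how a migration is sequenced.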

So, we decided to move to the new optimized tiers a week before Christmas. This meant moving all of the existing nodes onto the n2d instances over the course of a week, without any downtime. Ordinarily, we’d perform maintenance like this during a customer-defined maintenance window (still with zero downtime, of course). However, there wasn’t enough runway to fit the operation into those windows before the ball dropped in Times Square.

In the end, we migrated all of the nodes onto n2d instances during that week, carefully managing raft_learner_recovery_rate, the setting that throttles how quickly data is copied to replicas that are catching up, so that there was no noticeable impact on performance.

Each old node took around 30 minutes to drain, and adding each new node to the cluster took a similar amount of time, with the aggregate cluster write rate at around 3.5x what we’d see in a typical daily peak.

Cluster metrics observed during the live transition onto new instance types
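The timings above lend themselves to a quick back-of-the-envelope estimate. The sketch below is illustrative only. It treats raft_learner_recovery_rate as a single bytes-per-second ceiling on data movement, which is a simplification, and the per-node data volume, throttle value, and node count are hypothetical numbers rather than ShareChat’s. Only the roughly 30 minutes per drain and per add comes from the migration described here, and the nodes are assumed to be replaced strictly one after another.

```python
GIB = 2**30
MIB = 2**20


def transfer_minutes(bytes_to_move: int, rate_bytes_per_sec: int) -> float:
    """Minutes needed to move data at a fixed, throttled recovery rate."""
    if rate_bytes_per_sec <= 0:
        raise ValueError("rate_bytes_per_sec must be positive")
    return bytes_to_move / rate_bytes_per_sec / 60


def rolling_replacement_hours(node_count: int,
                              drain_minutes: float = 30.0,
                              add_minutes: float = 30.0) -> float:
    """Total hours if every node is drained and replaced sequentially."""
    return node_count * (drain_minutes + add_minutes) / 60


# Hypothetical inputs: 180 GiB of local data per node, a 100 MiB/s throttle,
# and a 12-node cluster.
per_node = transfer_minutes(180 * GIB, 100 * MIB)
total = rolling_replacement_hours(12)
print(f"{per_node:.1f} min to move one node's data, {total:.1f} h for the cluster")
```

Raising the throttle shortens each step but leaves less bandwidth and disk I/O for live traffic, which is why the rate needs careful management during a migration that overlaps with peak load.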

A story of evolution with a happy ending

Customers can bring new and interesting challenges to our team. Sometimes, the solutions can become permanent product enhancements, adding yet another dimension of reliability and flexibility to our offerings. With ShareChat, it was a learning journey that resulted in wins across the board.

Since moving to Redpanda Cloud, ShareChat has achieved significant savings in both cost and operational effort. In the year that we’ve been powering their high-performing streaming data architecture, we’ve:

  • Saved cross-AZ data transfer costs using follower fetching.
  • Scaled up to meet the demand of regional cultural events and festivals.
  • Migrated onto denser and cheaper AMD infrastructure for 36% cost savings.
  • Reduced operational cost and complexity for ShareChat’s team using Redpanda Cloud’s monitoring, alerting, and fleet management capabilities.
  • Maintained >99.95% uptime with rolling upgrades performed at a maintenance window of ShareChat’s choosing.
  • Ensured compliance with data protection laws by continuing to host data in ShareChat’s own cloud environment through Redpanda’s unique BYOC offering.

For more stories about how Redpanda has helped companies tap into simple, powerful, and cost-efficient streaming data, check out our customer stories. To learn more about Redpanda Cloud, sign up for a free trial or get in touch for a demo and our team will lead the way.
