Site Reliability Engineer

Design, build, and operate a world-class real-time streaming cloud platform

We are building Redpanda, a real-time streaming engine for modern applications. Redpanda is used by Fortune 1000 enterprises pushing hundreds of terabytes a day, and by the solo dev prototyping a React application on her laptop. We go beyond the Kafka protocol into the future of streaming, with inline WASM transforms and geo-replicated hierarchical storage. Think of it as a data API platform that scales with you from the smallest projects to petabytes of data distributed across the globe.

We are on a mission to enable every developer to supercharge their real-time applications.

You Will

You will be a part of our cloud team, working with all of engineering on building new services, automating infrastructure lifecycle on Kubernetes, and monitoring our services with the goal of offering a reliable, scalable and high-performance SaaS. One of our primary goals is to run a managed, cloud-based streaming-as-a-service with 99.5% uptime or better, and this role is critical for that goal.

  • Build & design Vectorized’s cloud infrastructure with reliability and performance in mind.

  • Build tools & services to allow automated infrastructure management and self-healing, including deployments and upgrades.

  • Be in charge of end-to-end monitoring of our cloud. Layer observability into our Kubernetes operators. Prioritize what metrics to collect, drive analysis of those metrics, and influence our roadmap based on that analysis.

  • Participate in on-call rotations, working to keep customer workloads running and incident-free.

You’ll be part of a diverse and growing team with members in both the US (New York City, San Francisco, San Diego, Austin, Denver) and international locations, including Colombia, the United Kingdom, Russia, and Poland.

You Have

  • 3+ years of experience in an SRE-like role

  • Comfortable working with a 100% distributed engineering team, collaborating on GitHub, in the open

  • Strong experience with public cloud providers

  • Experience running highly scalable production workloads reliably on Kubernetes

  • Experience with monitoring at scale

  • Experience managing infrastructure predictably through GitOps and IaC

  • Solid programming skills

  • Willingness to participate in an on-call rotation

  • Excellent written communication skills

  • A BS in Computer Science or equivalent experience

Nice to have

  • Strong understanding of Go and Kubernetes

  • Experience operating a SaaS platform

  • Fluency in a couple of programming languages (for example, Go or Python)

  • Experience operating or using streaming platforms, as either a user or a provider

  • Experience with the Prometheus monitoring stack
