Lead Site Reliability Engineer

Overview

Location: London, UK / Hybrid / Remote

A leading TV streaming platform is expanding its engineering team to deliver high-performance, low-latency streaming to millions of viewers worldwide. We’re looking for a Lead Site Reliability Engineer (SRE) to drive reliability, observability, and scalability across our streaming services while mentoring a team of SREs.

Responsibilities

  • Lead end-to-end reliability strategy for video streaming pipelines, playback services, and backend systems.
  • Build and maintain observability frameworks (Prometheus, Grafana, Datadog, OpenTelemetry) to monitor streaming quality, latency, and uptime.
  • Scale cloud-native infrastructure (AWS/GCP/Azure) and orchestrate containerised applications (Kubernetes, Docker) for global distribution.
  • Guide incident management, disaster recovery, and post-mortems across multi-region streaming environments.
  • Mentor junior SREs and collaborate with engineering teams to embed reliability by design into all development efforts.

What we’re looking for

  • Proven experience in high-scale distributed systems, preferably in streaming, media delivery, or content platforms.
  • Deep expertise with observability, monitoring, and incident response at global scale.
  • Strong cloud skills (AWS, GCP, Azure) and hands-on experience with Infrastructure as Code and delivery automation (Terraform, Ansible, CI/CD pipelines).
  • Proficiency in Python, Go, Java, or Bash for automation and tooling.
  • Leadership experience managing or mentoring an SRE or reliability engineering team.

This role offers the opportunity to shape the reliability and performance of a platform watched by millions, balancing real-time user experience with operational excellence.

Great package (Base + Bonus)

Venquis is acting as an Employment Agency in relation to this vacancy.

Location: City of London
Category: IT & Technology