Lead Site Reliability Engineer, City Of London

Overview

Location: London, UK / Hybrid / Remote

A leading TV streaming platform is expanding its engineering team to deliver high-performance, low-latency streaming to millions of viewers worldwide. We’re looking for a Lead Site Reliability Engineer (SRE) to drive reliability, observability, and scalability across our streaming services while mentoring a team of SREs.

Responsibilities

Lead end-to-end reliability strategy for video streaming pipelines, playback services, and backend systems.
Build and maintain observability frameworks (Prometheus, Grafana, Datadog, OpenTelemetry) to monitor streaming quality, latency, and uptime.
Scale cloud-native infrastructure (AWS/GCP/Azure) and orchestrate containerised applications (Kubernetes, Docker) for global distribution.
Guide incident management, disaster recovery, and post-mortems across multi-region streaming environments.
Mentor junior SREs and collaborate with engineering teams to embed reliability by design into all development efforts.

What we’re looking for

Proven experience in high-scale distributed systems, preferably in streaming, media delivery, or content platforms.
Deep expertise with observability, monitoring, and incident response at global scale.
Strong cloud skills (AWS, GCP, Azure), Infrastructure as Code (Terraform, Ansible), and CI/CD pipelines.
Proficiency in Python, Go, Java, or Bash for automation and tooling.
Leadership experience managing or mentoring an SRE or reliability engineering team.

This role offers the opportunity to shape the reliability and performance of a platform watched by millions, balancing real-time user experience with operational excellence.

Great package (Base + Bonus)

Venquis is acting as an Employment Agency in relation to this vacancy.

Location: City of London
Category: IT & Technology
