Principal Site Reliability Engineer

Orgvue is an organisational design and planning platform that empowers your business to transform its workforce by understanding the work people do and the skills they have. Our platform connects strategy to structure, providing clarity of vision, so you can build a more adaptable, better performing organisation that thrives in a constantly changing world of work.

The world’s largest and best-known enterprises and consulting firms use Orgvue to visualise and model current and future states of the organisation and make faster, more informed decisions. The company is headquartered in London, with offices in Philadelphia, The Hague, Toronto, and Sydney.

As a Principal Site Reliability Engineer, you will be a senior technical leader focused on scaling and hardening our AWS- and Kubernetes-based infrastructure. You will work across product, platform, and operations teams to ensure our systems are reliable, observable, and resilient, even at scale.

This role combines hands-on technical capability with strategic vision, helping us build a world-class reliability culture and a robust engineering foundation for growth.
We're looking for someone who has technical expertise, is a great communicator, and enjoys collaborating across multiple teams.

As a Principal Site Reliability Engineer, you will:
- Define and enforce SLOs, SLIs, and error budgets across critical services
- Craft and implement a cloud infrastructure and tooling strategy
- Work across our organization to level up SRE practices
- Help implement robust observability (metrics, logs, and traces) using our observability tools
- Guide the team in building automated, self-healing systems
- Own and evolve our incident response processes, including on-call practices and post-mortem culture
- Mentor engineers across the organization on best practices in reliability, operational readiness, and scalable infrastructure
- Drive Infrastructure as Code (IaC) using Terraform, Kubernetes, CloudFormation, and GitOps practices
- Collaborate closely with security, DevOps, and software teams to ensure compliance, scalability, and operational excellence
- Evaluate and introduce tools, patterns, and practices that improve the performance and reliability of our SaaS platform

Desired Skills & Experience:
- Demonstrable experience leading SRE transformations
- Deep hands-on expertise with Kubernetes (EKS preferred) in production environments
- Strong experience with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.)
- Expertise in Infrastructure as Code using tools such as Terraform, with knowledge of GitOps workflows
- Strong background in observability: metrics, visualization, logging, and tracing
- Understanding of automation, SDLC, CI/CD pipelines, deployment automation, and blue/green or canary releases
- Proven experience with incident management, disaster recovery planning, root cause analysis, and post-incident reviews

Benefits:
- Hybrid working: 1+ days a week in the London office
- Wellbeing initiatives including Sanctus Coaching, virtual fitness sessions, wellbeing webinars, and an Annual Wellbeing Day
- Subsidised Gym Membership
- Private Medical Insurance (including Dental and Vision) and Life Assurance
- 25 days holiday (increasing to 30 days at a rate of 1 extra day per year)
- Summer Fridays (half-day Fridays for July and August)
- Employer pension contribution of 5% of your gross salary, if you contribute a minimum of 3%
- Season Ticket Loan
- Cycle to Work Scheme
- Annual Discretionary Bonus

Here at Orgvue, we promote individualism and a diverse workforce to build on our future success.
Location:
City Of London, England, United Kingdom
Job Type:
Full-time
