Senior Site Reliability Engineer
Overview
Home. There’s no place like it.
And there’s no feeling like helping people create the joy of feeling truly at home. At Dunelm, that’s what we do. We’re the UK's number one choice for homewares because we make home life lovelier for our customers. And we’ve crafted a workplace that feels just as welcoming - where you can bring your ideas, be yourself, and feel right at home.
Remaining first-choice for savvy homeware shoppers also involves making use of advanced technology. We have embraced serverless, event-driven architecture and orchestrated containerised applications, with our monolithic front-end currently being replaced by micro front ends. You’ll be working with a talented and collaborative group of engineers and architects who care about quality and reliability. You can read more about our technology on the Dunelm Engineering blog: https://engineering.dunelm.com.
Site Reliability Engineering
Our SRE team is a high-trust, high-impact group of engineers who bring software engineering principles to operational reliability. We are hands-on developers and systems thinkers who build scalable, observable, and resilient platforms.
We work closely with other Engineering, Data, Platform and Operations teams to help them build reliable, observable, and cost-effective systems. We lead incident response, improve deployment safety, and guide teams toward sustainable service ownership.
We process large volumes of telemetry data every day and are constantly evolving our approach to cost-efficient observability, adaptive sampling, and meaningful tracing. Observability is not a bolt-on - it is a first-class concern that shapes how we build and support systems across the business.
This is a hybrid role, with time split between working from home and our London or Leicester offices. We get together as a team one day a month, but there may be an expectation of other ad-hoc office days where necessary.
Interview Process
- Step 1: Introductory video call (around 45 minutes) with one of our team to get to know you, explain the role, and hear about your experience and goals
- Step 2: A 90-minute technical discussion with several members of the SRE team. You will work through scenario-based questions designed to help you highlight your knowledge and approach
If you have most, but not quite all, of the essential skills and experience, we still encourage you to apply for this position.
We want everyone to be as comfortable as possible, so if you need any adjustments within the interview process, please let us know as soon as possible.
What you'll be doing
We are looking for a Senior Site Reliability Engineer to join the team and play a key role in scaling OpenTelemetry, driving service health, deep observability, and high availability across our entire technology infrastructure.
You will have strong software engineering skills (ideally in TypeScript and Rust) and a deep understanding of modern observability practices. You will be confident working across infrastructure and application layers, and you will lead by example in everything from SLOs and SLIs to post-incident reviews.
What You Will Be Doing:
- Observability and OpenTelemetry: Own and evolve our observability strategy across services. Lead how we collect, process, sample, and surface trace and metrics data using OpenTelemetry. Focus on high-signal telemetry that enables fast diagnosis, cost efficiency, and meaningful visibility across the stack.
- SLOs, SLIs, and Service Ownership: Help teams define and adopt meaningful SLIs and SLOs. Guide product teams in using observability data to make reliability measurable.
- Incident Response and Reliability Engineering: Lead on-call investigations when issues arise. Drive blameless post-incident reviews and recommend both mitigating actions that stem any losses and permanent technical fixes that prevent recurrence.
- Infrastructure and Automation: Use Pulumi, Terraform, CDK etc. to model effective infrastructure in AWS and other PaaS and SaaS providers. Improve CI/CD pipelines and support safe deployment patterns, such as ‘canary’ and ‘blue-green’.
- Engineering and Development: Build automation and reliability tooling using well-structured, testable code. Contribute to shared libraries, observability components and internal platforms.
- Mentoring and Team Growth: Support and coach other engineers. Lead technical discussions and share knowledge through pairing, planning, and documentation.
- Continuous Learning and Innovation: Stay ahead of emerging practices in observability, resilience, and platform engineering. Lead team proof-of-concepts and introduce new patterns or tools that improve our platform.
- Strategic Development: Contribute to prioritisation of the SRE roadmap. Help shape observability tooling, telemetry patterns, and platform-wide approaches to service ownership and reliability.
- Aligning to Business Goals: Use observability insights to support product and platform goals. Ensure SRE priorities align with Dunelm’s wider objectives for quality, performance, and customer experience.
What we'll look for in you
Essential Skills
- Solid experience with TypeScript or similar strongly typed programming language(s).
- Proven ability to write idiomatic, pragmatic, and testable code, with strong, appropriate, automated testing.
- Knowledge and understanding of OpenTelemetry tools, specification, APIs etc.
- Excellent understanding of SRE principles, including embracing risk, service level objectives, eliminating toil, monitoring distributed systems, automation and release engineering
- AWS expertise, including Lambda, ECS/Fargate, EC2, EventBridge, SQS, S3, DynamoDB and general networking principles
- System administration knowledge – able to comfortably use a command line to navigate and troubleshoot a server or container running a Linux OS
- Knowledge and experience configuring and using telemetry back-ends, such as Datadog and the Grafana stack.
- Experience with infrastructure-as-code tools, such as Pulumi and Terraform
- Familiar with Kubernetes and how to deploy and monitor workloads running in k8s
- Skilled in CI/CD pipelines (GitLab or similar) and build/test/deploy automation
- Proven ability to lead incident response and post-incident review processes
- Strong problem-solving mindset and attention to detail
Desirable skills
- Some experience in Rust or similar compiled language e.g. Go
- Experience instrumenting and running OpenTelemetry in production at scale. Knowledge of distributed tracing and trace sampling
- Experience reducing observability or cloud costs through architectural changes
- Exposure to Google Cloud Platform (GCP)
- Experience with Kubernetes observability, metrics exporters, or service mesh
- Familiarity with challenges in the retail sector is a bonus but not expected.
Behaviours and Values
At Dunelm, our shared values of Act Like Owners, Keep Listening & Learning, Long-Term Thinking, and Stronger Together serve as the foundation for our success. These values guide us in continuously improving our practices and ensure we dedicate our time to what truly matters. As a Site Reliability Engineer, you will exemplify these key behaviours:
- Support and build trust with teammates, always assuming positive intent
- Communicate clearly and share knowledge to build shared understanding
- Stay curious, ask why, and always look to improve how things work
- Embrace change, adapt quickly, and take on a variety of challenges
- Drive innovation by looking for better ways forward and pushing for progress
Location: London, England, United Kingdom
Salary: £125,000 - £150,000
Category: Engineering