Senior Site Reliability Engineer - Databases (Remote, UK), United Kingdom

Job Category: Other
EU work permit required: Yes
Job Reference: a95cc6c0d0f0
Posted: 12.08.2025
Expiry Date: 26.09.2025

Job Description:

This is a remote position and we're considering candidates in Spain, Sweden, the UK, and Germany.

About the role:

We are looking for a Senior SRE to support our highest-value Grafana Cloud customers by increasing the reliability of our cloud databases, which are based on Mimir, Loki, Tempo, and Pyroscope. These databases are provided as SaaS on AWS, GCP, and Azure across all regions.

The SRE team is a new team within the Databases department, owning environments for our largest customers and acting as an overlay to existing database teams. As an SRE, you will manage software configurations, participate in feature development, oversee releases, and ensure they meet SLOs without degrading user experience. You will contribute to design documents, code reviews, and other engineering activities to improve reliability, observability, and customer guidance.

This role includes an on-call element, shared with the Mimir team, that focuses on customer experience; you will be supported by another engineer while on call. Our company hires globally (remote-only) to optimize on-call health and to keep on-call shifts aligned with 12 daylight hours per day.

What we seek:

At least 6 years of engineering experience, with 3+ years in SRE roles.
Experience as a reliability/production engineer, infrastructure/systems engineer, or software engineer with an infrastructure focus.
Strong communication skills for technical discussions and cross-organizational collaboration.
Experience with Kubernetes on AWS, GCP, or Azure, and with Helm charts or other IaC tools.
Experience with SRE practices, distributed computing, and related areas.
Proficiency in a programming language such as Go, Python, or Java.
Knowledge of Linux internals, networking, cloud storage, and scaling.
Excellent troubleshooting skills.
Experience in incident response, post-incident reviews, and proactive problem management.
Ability to work autonomously within a team environment.
Values that include curiosity, transparency, a bias for action, and kindness.

Your day-to-day will include:

Conducting regular 1:1s with your manager and colleagues.
Reviewing and setting SLOs, and working on improvements like monitoring, automation, self-healing, and auto-scaling.
Enhancing observability within customer environments.
Designing and implementing solutions for reliability and scalability.
Creating fault-tolerant patterns considering the entire service lifecycle.
Collaborating on product strategy, roadmaps, and technical designs.
Participating in PR reviews and design discussions.
Sharing knowledge about SRE best practices.
Engaging in incident response, investigation, PIRs, and customer communication.

In the UK, the base salary range is £84,841 - £101,809. Compensation varies based on experience and skills. Benefits include equity and bonuses, among others. If you are applying from a different country, the recruiter will discuss the pay and benefits specific to your location.

Location: United Kingdom
Salary: £80,000 - £100,000
Category: Engineering