Senior Site Reliability Engineer - Databases (Remote, UK)
Job Category: Other
EU work permit required: Yes
Job Reference: a95cc6c0d0f0
Job Views: 16
Posted: 12.08.2025
Expiry Date: 26.09.2025
Job Description:
Senior Site Reliability Engineer - Databases
This is a remote position and we're considering candidates in Spain, Sweden, the UK, and Germany.
About the role:
We are looking for a Senior SRE to support our highest-value Grafana Cloud customers by increasing the reliability of our cloud databases based on Mimir, Loki, Tempo, and Pyroscope. These databases are provided as SaaS from AWS, GCP, and Azure across all regions.
The SRE team is a new team within the Databases department, owning environments for our largest customers and acting as an overlay to existing database teams. As an SRE, you will manage software configurations, participate in feature development, oversee releases, and ensure they meet SLOs without degrading user experience. You will contribute to design documents, code reviews, and other engineering activities to improve reliability, observability, and customer guidance.
This role includes an on-call element, shared with the Mimir team, that focuses on customer experience; you will be supported by another engineer while on call. Our company hires globally (remote-only) to optimize on-call health and to align on-call with 12 daylight hours per day.
What we seek:
- At least 6 years of engineering experience, with 3+ years in SRE roles.
- Experience as a reliability/production engineer, infrastructure/systems engineer, or software engineer with an infrastructure focus.
- Strong communication skills for technical discussions and cross-organizational collaboration.
- Experience with Kubernetes on AWS, GCP, or Azure, and with Helm charts or other IaC tools.
- Experience with SRE practices, distributed computing, and related areas.
- Proficiency in programming languages such as Go, Python, Java, etc.
- Knowledge of Linux internals, networking, cloud storage, and scaling.
- Excellent troubleshooting skills.
- Experience in incident response, post-incident reviews, and proactive problem management.
- Ability to work autonomously within a team environment.
- Alignment with our values: curiosity, transparency, a bias for action, and kindness.
Your day-to-day will include:
- Conducting regular 1:1s with your manager and colleagues.
- Reviewing and setting SLOs, and working on improvements like monitoring, automation, self-healing, and auto-scaling.
- Enhancing observability within customer environments.
- Designing and implementing solutions for reliability and scalability.
- Creating fault-tolerant patterns considering the entire service lifecycle.
- Collaborating on product strategy, roadmaps, and technical designs.
- Participating in PR reviews and design discussions.
- Sharing knowledge about SRE best practices.
- Engaging in incident response, investigation, PIRs, and customer communication.
In the UK, the base salary range is £84,841 - £101,809. Compensation varies based on experience and skills. Benefits include equity, bonuses, and others. If applying from a different country, the recruiter will discuss specific pay and benefits.
- Location: United Kingdom
- Salary: £80,000 - £100,000
- Category: Engineering