Staff Linux Systems Engineer

Overview

Within the Infrastructure Operations and Security (IOPS) department, our Data Center Unit manages all infrastructure systems across our remote sites. As a key member of the Research Infrastructure Operations (RIO) team, you will architect and design systems that help us operate our research GPU infrastructure, support the Research department and make fundamental contributions to our AI development. You will be one of the first in Europe to work hands-on with NVIDIA's latest AI system, the GB200 NVL72. Given the scale and complexity of our infrastructure, the job is not just about maintaining our systems; it is about advancing them. You will use your expertise in tooling and automation to improve the efficiency, reliability and performance of our infrastructure. In this role, you will also coordinate with on-site personnel and work closely with various teams within our organization. Joining our team means becoming part of a skilled group of engineers ready to support you and kick-start your journey with us.

Responsibilities

  • Co-own the architecture and roadmap for the model-training infrastructure with the Engineering Manager.
  • Lead cross-team project implementations end to end: align stakeholders, define scope and milestones, manage dependencies, and drive on-time delivery.
  • Provide technical mentorship through design reviews, documentation and hands-on coaching, without managing direct reports.
  • Build and own automation tooling for provisioning, maintenance and troubleshooting of our GPU infrastructure while continuously improving team tooling.
  • Plan and execute fleet upgrades (kernels, NVIDIA drivers, BIOS/NIC/HBA firmware) with minimal disruption; keep sites consistent.
  • Establish observability across the whole GPU cluster including storage and network by extending and optimizing our monitoring systems.
  • Lead cross-team incident response and drive root-cause analysis.
  • Benchmark and optimize cluster performance.
  • Partner with the network team to design and tune the fabric for high-performance workloads.
  • Participate in the team's shared on-call rotation as needed to ensure the reliability and availability of our services.

Qualifications

  • Staff-level individual contributor with a proven track record of setting and implementing technical strategy and leading cross-team technical projects.
  • Extensive experience managing and troubleshooting GPU compute clusters, with the ability to architect solutions that scale.
  • Proficiency in containerization and container orchestration technologies such as Docker and Kubernetes.
  • Software engineering expertise and fluency in at least one programming language, preferably Go.
  • Expertise in patch and OS management at scale.
  • Experience in Linux performance benchmarking, tuning and troubleshooting.
  • Familiarity with distributed storage solutions like Lustre and Ceph.
  • Knowledge of networking technologies and protocols, including Ethernet and ideally InfiniBand.
  • Proactive and solution-oriented mindset.
  • Excellent problem-solving skills.
  • Initiative-driven and able to take ownership.

Benefits and Culture

  • Diverse and internationally distributed team: joining our team means becoming part of a large, global community with people of more than 90 nationalities.
  • Open communication, regular feedback: we value clear, honest communication and collaboration.
  • Hybrid work, flexible hours: hybrid office presence with flexible hours to align with locations and time zones.
  • Regular in-person team events and monthly full-day hacking sessions.
  • 30 days of annual leave (excluding public holidays) with access to mental health resources.
  • Virtual Shares: ownership mindset with a stake in the company’s growth.
  • Competitive benefits: tailored to your location to support you fully.

We are an equal opportunity employer. You are welcome at DeepL for who you are—we appreciate authenticity here. Our product is for everyone, and so is our workplace. The more voices we have represented and amplified in our business, the more we will all succeed, contribute, and think forward. We encourage you to apply and share your potential with us.

Location: Camden Town, England, United Kingdom
Salary: £80,000 - £100,000
Job Type: Full-time
Category: IT & Technology