HPC Platform Management Engineer, London, England, United Kingdom

Distributed Computing Application Engineer

Qube Research & Technologies (QRT) is a global quantitative and systematic investment manager, operating in all liquid asset classes across the world. We are a technology- and data-driven group implementing a scientific approach to investing. Combining data, research, technology, and trading expertise has shaped QRT's collaborative mindset, which enables us to solve the most complex challenges. QRT's culture of innovation continuously drives our ambition to deliver high-quality returns for our investors.

Join QRT as a technologist within our Workload Scheduling (WLS) team. This key role supports both business and technology groups in integrating High Performance Computing (HPC) solutions, enabling scalable and efficient compute capabilities. You will be instrumental in developing, deploying, and maintaining HPC platforms that leverage the YellowDog and Ray schedulers across cloud and on-premises infrastructures.

Your Future Role within QRT:

Develop and support scalable workload scheduling solutions for HPC environments
Collaborate with internal teams to adopt and optimize HPC platforms
Improve the performance, resilience, and observability of compute infrastructure
Contribute to infrastructure automation and continuous improvement initiatives
Share expertise and support team development through coaching and collaboration

Your Present Skillset:

Experience of engineering and supporting at least one HPC scheduler, such as YellowDog, Ray, Slurm or IBM Symphony
Good understanding of both loosely coupled and tightly coupled HPC workloads
Experience of developing and supporting large-scale systems (5000+ nodes) and high levels of concurrency (100k+ tasks)
Experience of monitoring and visualisation of large-scale systems
Performance tuning of compute, network and storage components
Good understanding of the challenges of user authorisation in large-scale distributed environments using AWS IAM and identity providers such as Okta
Good understanding of core AWS services: VPC security and networking; EC2 configuration and scaling; storage services (S3, EFS, EBS and FSx); and CloudWatch, CloudTrail, OpenSearch and Athena
Experience of developing Python applications and tools
Experience with infrastructure-as-code using configuration languages and tools, particularly Terraform and Ansible
Solid Linux administration skills
Good understanding of various storage solutions and their applicability for different use cases
Able to work in a fast-paced environment with multiple conflicting demands and changing priorities
Effective communicator, able to describe complex issues at the appropriate level for a given audience
Happy to coach colleagues and eager to learn from them

QRT is an equal opportunity employer. We welcome diversity as essential to our success. QRT empowers employees to work openly and respectfully to achieve collective success. In addition to professional achievement, we offer initiatives and programs to enable employees to achieve a healthy work-life balance.

Location: London, England, United Kingdom
Salary: £125,000 - £150,000
Category: IT & Technology
