HPC Platform Management Engineer
Qube Research & Technologies (QRT) is a global quantitative and systematic investment manager, operating in all liquid asset classes across the world. We are a technology- and data-driven group implementing a scientific approach to investing. Combining data, research, technology, and trading expertise has shaped QRT's collaborative mindset, which enables us to solve the most complex challenges. QRT's culture of innovation continuously drives our ambition to deliver high-quality returns for our investors.
Join QRT as a technologist within our Workload Scheduling (WLS) team. This key role supports both business and technology groups in integrating High Performance Computing (HPC) solutions, enabling scalable and efficient compute capabilities. You will be instrumental in developing, deploying, and maintaining HPC platforms that leverage YellowDog and Ray schedulers across cloud and on-premises infrastructures.
Your Future Role within QRT:
- Develop and support scalable workload scheduling solutions for HPC environments
- Collaborate with internal teams to adopt and optimize HPC platforms
- Improve the performance, resilience, and observability of compute infrastructure
- Contribute to infrastructure automation and continuous improvement initiatives
- Share expertise and support team development through coaching and collaboration
Your Present Skillset:
- Experience of engineering and supporting at least one HPC scheduler, such as YellowDog, Ray, Slurm or IBM Symphony
- Good understanding of both loosely coupled and tightly coupled HPC workloads
- Experience of developing and supporting large-scale systems (5000+ nodes) and high levels of concurrency (100k+ tasks)
- Experience of monitoring and visualisation of large-scale systems
- Performance tuning of compute, network and storage components
- Good understanding of the challenges of user authorisation in large-scale distributed environments using AWS IAM and identity providers such as Okta
- Good understanding of core AWS services:
  - VPC security and networking
  - EC2 configuration and scaling
  - Storage services: S3, EFS, EBS and FSx
  - CloudWatch / CloudTrail / OpenSearch / Athena
- Experience of developing Python applications and tools
- Experience with infrastructure-as-code using configuration languages and tools, particularly Terraform and Ansible
- Solid Linux administration skills
- Good understanding of various storage solutions and their applicability for different use cases
- Able to work in a fast-paced environment with multiple conflicting demands and changing priorities
- Effective communicator, able to describe complex issues at the appropriate level for a given audience
- Happy to coach colleagues and eager to learn from them
Location: London, England, United Kingdom
Salary: £125,000 - £150,000
Category: IT & Technology