SENIOR MACHINE LEARNING INFRASTRUCTURE ENGINEER

PlusAI
Full-time
Santa Clara, CA
$160,000 - $200,000 a year
Posted 5 months ago

Job Description

PlusAI is seeking a Senior ML Infrastructure Engineer to design scalable architectures that handle large datasets and optimize performance for machine learning training and inference. The role involves building robust data pipelines, managing model versioning, and overseeing large-scale GPU clusters using cloud-native technologies such as Docker and Kubernetes.

Responsibilities

  • Design and develop scalable systems for ML model training, inference, deployment, and monitoring
  • Build and maintain data pipelines, model versioning systems, and experiment tracking frameworks
  • Collaborate with cross-functional teams to improve platform usability
  • Implement distributed systems and storage solutions
  • Drive improvements in CI/CD workflows
  • Ensure high availability and reliability of the ML platform
  • Stay current with industry trends
  • Mentor junior engineers
  • Ensure team compliance with the quality management system (QMS) and drive process improvements

Requirements

  • PhD or MS in Computer Science, Electrical Engineering, or related field
  • Good communication skills
  • New PhD graduate, or MS with 3+ years of software engineering experience in ML infrastructure or distributed systems
  • Proficiency in Python, C++, SQL
  • Deep understanding of containerization, orchestration, distributed ML workloads, and experiment tracking tools
  • Experience deploying and managing resources across cloud platforms
  • Proficiency in deep learning frameworks and data pipeline tools
  • Strong knowledge of distributed systems, databases, and storage solutions
  • Extensive software design and development skills
  • Ability to learn and adapt to new technologies

Benefits

  • No benefits