STAFF SOFTWARE ENGINEER, ML TRAINING PLATFORM

Reddit
Full-time
Remote
$230,000 - $322,000
Posted 6 months ago

Job Description

The Staff Software Engineer, ML Training Platform will be instrumental in architecting, implementing, and maintaining foundational Machine Learning Training infrastructure that powers Feeds Ranking, Content Understanding, Recommendations and much more to fulfill Reddit’s mission of bringing community and belonging to everyone in the world.

Responsibilities

  • Optimize model training on GPUs
  • Lead the building, testing, and maintenance of ML infrastructure at Reddit
  • Propose, design, and implement high-performance ML platform solutions
  • Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows
  • Analyze bottlenecks in distributed systems and optimize for performance and cost-efficiency
  • Work with management on team goal setting, planning, and de-risking project execution
  • Mentor other team members in adopting a rigorous DevOps approach to maintain and improve the health and quality of ML platform components and services

Requirements

  • 8+ years of work experience in a production software development environment or building data systems
  • Experience with XLA for TensorFlow or TorchInductor for PyTorch for kernel fusion during training
  • Experience with optimization of data workloads using Colossal-AI or DeepSpeed
  • Experience with distributed training optimization using DeepSpeed, Horovod, or Colossal-AI
  • Experience with design and architecture of large scale ML Systems
  • Experience with training workflows, hyperparameter tuning, and resource optimization on CPU and GPU
  • Experience with MLOps practices and tools such as Ray and MLflow
  • Hands-on experience with Kubernetes, Docker, or other container orchestration systems

Benefits

  • No benefits