Site Reliability Storage Engineer - Kubernetes Platform

xAI
Full-time
Palo Alto, CA
$180,000 - $440,000
Posted 4 months ago

Job Description

xAI is seeking a Senior Site Reliability Storage Engineer to design, build, and optimize Kubernetes clusters across multiple regions. The role focuses on improving the reliability, performance, and cost-effectiveness of the infrastructure that supports large-scale AI workloads, and it requires expertise in Kubernetes orchestration and distributed systems.

Responsibilities

  • Develop and optimize software for Kubernetes cluster provisioning
  • Enhance Kubernetes infrastructure reliability, performance, and cost-effectiveness
  • Collaborate with engineers to design Kubernetes solutions
  • Implement observability, monitoring, and security practices
  • Manage storage infrastructure using IaC tools
  • Drive system reliability through incident management and SLAs/SLOs
  • Contribute to the Kubernetes stack (CNI, CRI, CSI)

Requirements

  • 5+ years of SRE experience with scalable systems
  • Kubernetes infrastructure management expertise (CAPI, kubeadm)
  • Proficiency in IaC tools (Pulumi, Terraform, Ansible)
  • Deep understanding of the Kubernetes stack (CNI, CRI, CSI)
  • Ability to improve system reliability through incident management

Benefits

  • No benefits listed