Site Reliability Engineer

Xsolla
Full-time
Raleigh
$120,000 - $150,000
Posted 5 months ago

Job Description

Xsolla is seeking a Site Reliability Engineer to ensure high reliability and availability of its systems. The role involves monitoring, incident resolution, and collaboration with development teams to enhance operational stability. The ideal candidate will have experience in a large-scale production environment and proficiency in scripting languages.

Responsibilities

  • Ensure high reliability and availability, meeting SLAs and SLOs as measured by SLIs
  • Monitor the system for issues and respond to incidents
  • Drive incident resolution and process improvements
  • Ensure all key services are measured, monitored, and configured to raise alerts
  • Develop comprehensive monitoring solutions
  • Support services before they go live
  • Engage in service capacity planning and demand forecasting, performance analysis, and system tuning
  • Collaborate with the development teams to enhance the product's operational stability
  • Build and drive the automation systems that maintain system health

Requirements

  • Proven experience as a Site Reliability Engineer in a large-scale production environment
  • Proficiency in scripting languages such as Python and Bash
  • Deep knowledge of monitoring systems such as Datadog, Prometheus, and Grafana
  • Good understanding of continuous integration/continuous delivery processes and platforms
  • Experience with Docker, Kubernetes, or other container orchestration systems
  • Familiarity with infrastructure automation tools like Terraform
  • Experience with automation, system administration, and system hardening
  • Experience with Linux-based infrastructure and Linux/Unix administration
  • Demonstrated problem-solving skills
  • Excellent communication skills

Benefits

  • No benefits