Lead Site Reliability Engineer (SRE)

Canary
Full-time
Remote - USA
Posted 23 days ago

Job Description

Canary is seeking a Lead Site Reliability Engineer to drive incident management, SLO frameworks, and operational excellence across its platform. The role involves collaborating with product and application teams to improve reliability and build self-service capabilities.

Responsibilities

  • Drive incident response best practices and lead postmortems
  • Define SLAs/SLOs across platform services
  • Collaborate on reliability reviews
  • Oversee evolution of the observability stack
  • Build tools to reduce toil and empower engineering teams
  • Advocate for reliability and operational excellence

Requirements

  • 7+ years in SRE, platform, systems, or infrastructure engineering
  • Strong background in AWS and Kubernetes
  • Experience with SLO/SLA frameworks
  • Track record leading incident response
  • Experience with observability ecosystems
  • Programming/scripting skills in Python, Go, or similar
  • Strong cross-functional leadership

Benefits

  • No benefits