SENIOR SITE RELIABILITY ENGINEER

Shield AI
Full-time
San Diego Metro Area
$129,467 - $194,201
Posted 5 months ago

Job Description

As a Senior Site Reliability Engineer at Hivemind, you will ensure the performance, reliability, and scalability of cloud infrastructure. You will build and maintain monitoring and alerting systems, define incident response strategies, and automate operational processes.

Responsibilities

  • Design, implement, and maintain monitoring, logging, and alerting systems
  • Define incident response procedures and participate in on-call rotations
  • Identify and resolve reliability and performance issues across services
  • Develop automation tools to streamline operations
  • Collaborate with engineering teams to ensure new services are production-ready
  • Conduct root cause analyses and implement post-incident improvements
  • Champion a culture of reliability, observability, and operational excellence

Requirements

  • 5+ years of experience in Site Reliability Engineering or related roles
  • Strong experience with AWS services
  • Deep understanding of Kubernetes and containerized deployments
  • Proficiency with monitoring and observability tools
  • Strong scripting or programming skills
  • Experience with infrastructure-as-code
  • Solid understanding of networking, Linux systems, and distributed architectures
  • Experience with service meshes
  • Familiarity with security best practices in cloud environments
  • Exposure to GitOps workflows and tools

Benefits

  • No benefits