Lead Site Reliability Engineer
Location:
US
Company:
CloudSmiths specializes in cloud solutions, offering expertise in Google Cloud Platform for enterprise clients.
Summary:
As a Lead Site Reliability Engineer, you will lead the SRE team to ensure system uptime and performance on Google Cloud Platform. A degree or diploma in IT or Computer Science (or equivalent experience) and prior leadership experience in DevOps or SRE teams are required.
Requirements:
Technology: GCP, Grafana, Prometheus, Stackdriver, Terraform, Ansible, CI/CD pipelines
Hard Skills: Scripting and automation skills using Python, Bash, or Shell; knowledge of configuration management tools like Chef, Puppet, or Ansible
Credentials: Degree or Diploma in Information Technology or Computer Science, or equivalent experience; Google Cloud certifications are highly advantageous
Experience: A minimum of 3 years in a management or leadership capacity within SRE or DevOps teams; strong, hands-on experience working with GCP infrastructure and services; proven experience with Kubernetes, Docker, and container orchestration at scale; expertise in UNIX/Linux administration; hands-on experience with IaC tools such as Terraform, Ansible, or Deployment Manager; familiarity with incident management, post-mortem processes, and production monitoring tools; a background working with CI/CD pipelines and automation tools
Job Description:
Department:
Cloud Practice
Employment Type:
Permanent Employee
Minimum Experience:
Senior Manager/Supervisor
Are you ready to lead from the front and define engineering excellence in the cloud? At CloudSmiths, we're searching for a Lead Site Reliability Engineer to guide our high-performing SRE team.
Your Mission
As our Lead SRE, you will champion a culture of engineering excellence. Your core purpose is to lead a team of dedicated engineers to ensure system uptime, efficiency, and peak performance on GCP by implementing robust monitoring, automation, and DevOps practices. In this crucial role, you will be the technical authority for Google Cloud Platform, driving the reliability, scalability, and performance of our production environments.
Your key responsibilities will include:
- Leading, mentoring, and fostering the growth of a dedicated SRE team.
- Championing and implementing DevOps and SRE best practices with a sharp focus on automation and scalability.
- Driving our monitoring and observability initiatives, leveraging tools like Grafana, Prometheus, and Stackdriver.
- Designing, maintaining, and optimising robust CI/CD pipelines using GCP-native tools and industry standards.
- Applying Infrastructure as Code (IaC) principles with tools such as Terraform or Deployment Manager.
- Steering the troubleshooting of complex production incidents, ensuring thorough root cause analysis and effective long-term fixes.
- Fostering a proactive and blameless incident management culture.
- Collaborating with cross-functional teams to ensure consistent platform performance and manage stakeholder expectations.
Experience & Leadership required:
- A minimum of 3 years in a management or leadership capacity within SRE or DevOps teams.
- Strong, hands-on experience working with GCP infrastructure and services.
- Proven experience with Kubernetes, Docker, and container orchestration at scale.
- Expertise in UNIX/Linux administration.
- Hands-on experience with IaC tools such as Terraform, Ansible, or Deployment Manager.
- Familiarity with incident management, post-mortem processes, and production monitoring tools.
- A background working with CI/CD pipelines and automation tools.
Technical Skills:
- Strong scripting and automation skills using Python, Bash, or Shell.
- Knowledge of configuration management tools like Chef, Puppet, or Ansible.
- Familiarity with security, compliance, and cost optimisation on GCP.
Education & Certifications:
- A Degree or Diploma in Information Technology, Computer Science, or equivalent experience.
- Google Cloud certifications (e.g., Professional Cloud DevOps Engineer, Professional Cloud Architect) are highly advantageous.