$70,000–$80,000/yr

Senior Site Reliability Engineer (AWS, AI/ML, & APM)

Full-time, Remote

Location:

Company:

Granicus is a technology company transforming the Govtech industry by enhancing interactions between governments and constituents.

Summary:

The Senior Site Reliability Engineer will ensure the reliability, scalability, and performance of Granicus' services. Candidates should have over five years of relevant experience and expertise in AWS and AI/ML infrastructure.

Requirements:

Experience: 5+ years in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems.

Job Description:

What your impact will look like:

On-call Production Support: Provide production support in shifts according to the team's on-call roster.
When not on call for production support, work on tickets raised by customers and by internal engineering and implementation teams. For example, a client may ask for a data correction on the database server that cannot be made through the web interface.
Work on the SRE team's backlog items.
Monitor and Maintain Systems: Continuously monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability.
Automate Processes: Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention.
Incident Management: Assist in troubleshooting and resolving incidents, performing root cause analysis, and implementing long-term fixes to prevent recurrence.
System Improvements: Participate in designing and implementing system improvements to enhance reliability, scalability, and performance.
Collaboration: Work closely with software engineers to understand application requirements, provide feedback on design and architecture, and support deployment and release processes.
Documentation: Create and maintain documentation for processes, procedures, and troubleshooting guides to ensure knowledge sharing within the team.
Capacity Planning: Assist in capacity planning activities to anticipate future needs and ensure that our infrastructure can handle growth.
Security: Implement and adhere to security best practices to protect our systems and data.

Experience:

5+ years in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems. Experience supporting AI/ML infrastructure, including model deployment, inference optimization, and integration with services like AWS Bedrock, is highly desirable.

Technical Skills:

Expertise in Linux/Unix systems and cloud platforms (AWS, Azure, or Google Cloud).
Strong proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++).
Familiarity with AI/ML operations, including model lifecycle management, vector databases, and inference performance tuning.

Tools and Technologies:

Experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging, monitoring, and observability.
Experience with configuration management tools (Ansible, Chef, Puppet).
Exposure to AI/ML toolchains, including AWS Bedrock, SageMaker, and LLMOps frameworks.
Certifications: Relevant certifications such as AWS Certified DevOps Engineer, AWS Certified Machine Learning – Specialty, Google Cloud Professional DevOps Engineer, or similar are a plus.