Senior Site Reliability Engineer (AWS, AI/ML, & APM)
Full-time
Remote
6d ago
Location:
US
Company:
Granicus is a technology company transforming the GovTech industry by enhancing interactions between governments and constituents.
Summary:
The Senior Site Reliability Engineer will ensure the reliability, scalability, and performance of Granicus' services. Candidates should have over five years of relevant experience and expertise in AWS and AI/ML infrastructure.
Requirements:
Experience: 5+ years in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems.
Job Description:
What your impact will look like:
- On-call Production Support: Provide production support in shifts according to the team's on-call roster.
- When not on call for production support, work on tickets raised by customers and by the internal engineering and implementation teams. For example, a client may request a data correction on the database server that cannot be made through the web interface.
- Work on the SRE team's backlog items.
- Monitor and Maintain Systems: Continuously monitor the health and performance of our services, systems, and infrastructure. Respond to alerts and incidents promptly to ensure high availability.
- Automate Processes: Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention.
- Incident Management: Assist in troubleshooting and resolving incidents, performing root cause analysis, and implementing long-term fixes to prevent recurrence.
- System Improvements: Participate in designing and implementing system improvements to enhance reliability, scalability, and performance.
- Collaboration: Work closely with software engineers to understand application requirements, provide feedback on design and architecture, and support deployment and release processes.
- Documentation: Create and maintain documentation for processes, procedures, and troubleshooting guides to ensure knowledge sharing within the team.
- Capacity Planning: Assist in capacity planning activities to anticipate future needs and ensure that our infrastructure can handle growth.
- Security: Implement and adhere to security best practices to protect our systems and data.
Experience:
- 5+ years in site reliability engineering, system administration, or a similar role, with a proven track record of managing large-scale, high-availability systems. Experience supporting AI/ML infrastructure, including model deployment, inference optimization, and integration with services like AWS Bedrock, is highly desirable.
Technical Skills:
- Expertise in Linux/Unix systems and cloud platforms (AWS, Azure, or Google Cloud).
- Strong proficiency in scripting languages (Python, Bash, Ruby) and programming languages (Go, Java, C++).
- Familiarity with AI/ML operations, including model lifecycle management, vector databases, and inference performance tuning.
Tools and Technologies:
- Experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging, monitoring, and observability.
- Experience with configuration management tools (Ansible, Chef, Puppet).
- Exposure to AI/ML toolchains, including AWS Bedrock, SageMaker, and LLMOps frameworks.
- Certifications: Relevant certifications such as AWS Certified DevOps Engineer, AWS Certified Machine Learning – Specialty, Google Cloud Professional DevOps Engineer, or similar are a plus.