EyeCare Partners (ECP) is the nation’s leading provider of clinically integrated eye care. Our national network of over 300 ophthalmologists and 700 optometrists provides a lifetime of care to our patients with a mission to enhance vision, advance eye care and improve lives. ECP is based in St. Louis, Missouri, and over 650 ECP-affiliated practice locations provide care in 18 states and 80 markets, with services that span the eye care continuum. For more information, visit www.eyecare-partners.com.
Site Reliability Engineering Lead
We are looking for a Senior Site Reliability Engineer to help lead, design, and build the operational capabilities of our large-scale, cloud-based (AWS) practice management (PM) – electronic health record (EHR) solution.
Role and Responsibilities:
The SRE Lead will have overall responsibility for the reliability (performance, availability, stability, and scalability) of our PM - EHR system, as well as for the administration of the CI/CD platform. The SRE Lead will own the establishment of the engineering discipline, combining software, systems, cloud, and infrastructure-as-code to develop creative engineering solutions to operational problems and to reduce manual work through automation. This role is expected to work closely with various infrastructure and software development teams to increase stability and reliability by enabling telemetry across the vertically integrated technology stack.
- Mitigate application performance issues effectively by taking responsibility for seeing them through to resolution, with the goal of preventing recurrence.
- Evolve our telemetry capabilities and configure monitors/alerts with Service Level Indicators using various telemetry and log aggregation technologies.
- Create business-friendly dashboards to monitor the health of production systems.
- Set up and administer Kubernetes, Docker, and other hosted platform technologies for the application.
- Automate manual operational processes and develop technical documentation.
- Troubleshoot complex reliability and performance issues. Perform incident/disruption management and conduct root-cause analysis (RCA).
- Participate in on-call rotation, troubleshooting production issues and implementing remediation.
- Work successfully within an agile environment, partnering with other functional teams.
Requirements:
- Bachelor’s in Computer Science or other four-year degree in a relevant field is required.
- 5+ years of overall IT experience with at least one implementation of a large-scale solution using a microservices architecture.
- 2+ years of experience in deploying and maintaining microservices applications on the AWS platform, preferably in an SRE role.
- Proficiency in telemetry concepts, including Application Performance Monitoring, Network Performance and Diagnostics Monitoring, Log Event Monitoring, and IT Infrastructure Monitoring.
- Hands-on experience configuring telemetry tools such as Prometheus and Datadog, and AWS telemetry services such as CloudTrail and CloudWatch.
- Hands-on experience in implementing log management using ELK/EFK, and building dashboards using Grafana.
- Detailed knowledge of API Gateway, preferably on AWS.
- Experience in building and administering a CI/CD platform and pipelines (preferably using GitLab and Jenkins).
- Experience with automating systems via Ansible and Terraform.
- Experience with management and orchestration tools such as Istio and Vault on AWS.
- Detailed knowledge of relational SQL databases (preferably PostgreSQL). Must be able to construct queries and configure them with telemetry.
- The top candidate will have experience with, or a thorough understanding of, incident workflows. Must have experience enriching alerts for faster root-cause detection and incident resolution.
- Must have experience configuring monitors for business transactions, service endpoints, etc., as well as setting up health rules for triggering alerts.
- Experience in implementing (tracking, measuring, and reporting) operational metrics (KPIs) such as MTTD, MTTR, MTRS, and MTBI.
- Self-starter with the ability to quickly learn new tools and tool features. Must be able to handle multiple tasks and priorities within a fast-paced work environment.
- Proactive and forward-thinking attitude and creative problem-solving ability.
- Demonstrated values of collaboration, transparency, empowerment, and accountability.
- The ability to travel up to 25% of the time (if remote / as required).
Preferred:
- Experience in languages such as Python, Ruby, Bash, or Perl
- Good understanding of networking protocols and cybersecurity best practices in an AWS cloud environment
- Experience in getting a solution ‘AWS Well-Architected’ certified
- Understanding and awareness of regulations around the use of PII data
- Experience in the healthcare industry
- Knowledge of PM / EHR systems