Site Reliability Engineering Lead

This job is no longer active

EyeCare Partners is the nation’s leading provider of clinically integrated eye care. Our national network of over 300 ophthalmologists and 700 optometrists provides a lifetime of care to our patients with a mission to enhance vision, advance eye care and improve lives. Based in St. Louis, Missouri, over 650 ECP-affiliated practice locations provide care in 18 states and 80 markets, providing services that span the eye care continuum. For more information, visit www.eyecare-partners.com.

Site Reliability Engineering Lead

We are looking for a Senior Site Reliability Engineer to help lead, design, and build the operational capabilities of our large-scale, cloud-based (AWS) practice management (PM) – electronic health record (EHR) solution.

Role and Responsibilities:

The SRE Lead will have overall responsibility for the reliability (Performance, Availability, Stability, and Scalability) of our PM - EHR system, as well as the administration of the CI/CD platform. The SRE Lead will own the establishment of the engineering discipline, combining software, systems, cloud, and infrastructure-as-code to develop creative engineering solutions to operational problems and to reduce manual work through automation. This role is expected to work closely with various infrastructure and software development teams to increase stability and reliability by enabling Telemetry across the vertically integrated technology stack.

  • Mitigate application performance issues effectively by taking responsibility for seeing them through to resolution, with the goal of preventing recurrence.
  • Evolve our telemetry capabilities, and configure monitors/alerts with Service Level Indicators using various Telemetry and log aggregation technologies.
  • Create business-friendly dashboards to monitor the health of production systems.
  • Set up and administer Kubernetes, Docker, and other hosted platform technologies for the application.
  • Automate manual operational processes and develop technical documentation.
  • Troubleshoot complex reliability and performance issues. Perform incident/disruption management and conduct root-cause analysis (RCA).
  • Participate in on-call rotation, troubleshooting production issues and implementing remediation.
  • Work successfully within an agile environment, partnering with other functional teams.

Requirements:

  • Bachelor’s in Computer Science or other four-year degree in a relevant field is required.
  • 5+ years of overall IT experience, with at least one implementation of a large-scale solution using a microservices architecture.
  • 2+ years of experience deploying and maintaining microservices applications on the AWS platform, preferably in an SRE role.
  • Proficiency in Telemetry concepts, including Application Performance Monitoring, Network Performance and Diagnostics Monitoring, Log Event Monitoring, IT Infrastructure Monitoring, etc.
  • Hands-on experience configuring telemetry tools such as Prometheus and Datadog, and AWS telemetry services such as CloudTrail, CloudWatch, etc.
  • Hands-on experience implementing log management using ELK/EFK and building dashboards using Grafana.
  • Detailed knowledge of API Gateway, preferably on AWS.
  • Experience building and administering CI/CD platforms and pipelines (preferably using GitLab and Jenkins).
  • Experience automating systems via Ansible and Terraform.
  • Experience with management and orchestration tools such as Istio, Vault, etc. on AWS.
  • Detailed knowledge of relational SQL databases (preferably PostgreSQL). Must be able to construct queries and configure them with telemetry.
  • The top candidate will have experience with, or a thorough understanding of, incident workflows. Must have experience enriching alerts for faster root-cause detection and incident resolution.
  • Must have experience configuring monitors for business transactions, service endpoints, etc., as well as setting up health rules for triggering alerts.
  • Experience implementing (tracking, measuring, and reporting) operational metrics (KPIs) such as MTTD, MTTR, MTRS, MTBI, etc.
  • Self-starter with the ability to quickly learn new tools and tool features. Must be able to handle multiple tasks and priorities within a fast-paced work environment.
  • Proactive and forward-thinking attitude and creative problem-solving ability.
  • Demonstrated values of collaboration, transparency, empowerment, and accountability.
  • The ability to travel up to 25% (if remote / as required).

Preferred:

  • Experience in languages such as Python, Ruby, Bash, Perl, or other related languages
  • Good understanding of networking protocols and cybersecurity best practices in the AWS cloud environment
  • Experience getting a solution ‘AWS Well-Architected’ certified
  • Understanding and awareness of regulations around the use of PII data
  • Experience in the healthcare industry
  • Knowledge of PM / EHR systems