Job details

Senior Site Reliability Engineer

The Aspen Group (TAG) is one of the largest and most trusted retail healthcare business support organizations in the U.S. and has supported over 20,000 healthcare professionals and team members with close to 1,500 health and wellness offices across 48 states in four distinct categories: dental care, urgent care, medical aesthetics, and animal health. Working in partnership with independent practice owners and clinicians, the team is united by a single purpose: to prove that healthcare can be better and smarter for everyone. TAG provides a comprehensive suite of centralized business support services that power the impact of five consumer-facing businesses: Aspen Dental, ClearChoice Dental Implant Centers, WellNow Urgent Care, Chapter Aesthetic Studio, and Lovet Pet Health Care. Each brand has access to a deep community of experts, tools and resources to grow their practices, and an unwavering commitment to delivering high-quality consumer healthcare experiences at scale.

As a Senior Site Reliability Engineer (SRE) at TAG – The Aspen Group, you will be responsible for ensuring the reliability, performance, and scalability of our core systems. This role involves proactively building and managing monitoring solutions, leading incident response, and continuously optimizing system performance to exceed business objectives. We are actively integrating AI and machine learning into our operational workflows, and you will be on the front lines, leveraging intelligent automation and machine learning to build a proactive, resilient infrastructure. This is an opportunity to go beyond traditional SRE by applying cutting-edge technology to solve complex reliability challenges.

Responsibilities:

Intelligent Site Reliability Engineering:

Design and build highly scalable and resilient systems to support our applications and services, incorporating predictive analytics to anticipate reliability risks.
Develop and manage Service Level Objectives (SLOs) and Service Level Indicators (SLIs) using machine learning anomaly detection to ensure systems meet reliability targets.
Drive improvements in system reliability, availability, and performance through proactive measures, automation, and intelligent failure prediction.

Advanced Observability:

Implement and manage comprehensive monitoring and alerting solutions, integrating with intelligent observability platforms that reduce alert noise and correlate events.
Develop and maintain dashboards and reporting tools that provide data-driven insights for actionable troubleshooting recommendations and performance optimization.
Evaluate and integrate advanced monitoring tools and operational intelligence platforms to enhance observability and root cause identification.

Proactive Incident Management:

Lead and participate in incident response efforts, using intelligent log analysis and automated event correlation to speed up troubleshooting and root cause identification.
Develop and maintain incident management processes incorporating automated decision support systems to improve response times and minimize service disruptions.
Conduct post-incident reviews, using automated pattern recognition and trend analysis to identify systemic issues and implement preventive measures.

Performance and Capacity Optimization:

Analyze performance metrics and logs, supported by advanced observability tools, to detect bottlenecks and inefficiencies.
Collaborate with development teams to implement automated profiling and optimization recommendations for code and infrastructure improvements.
Perform capacity planning using machine learning forecasting models to ensure systems can handle current and future loads.

Automation and Process Improvement:

Develop and implement automation solutions, including intelligent runbook automation, self-healing systems, and automated incident triage.
Identify and drive process improvements by applying machine learning to operational data for continuous optimization.
Maintain documentation that includes automation and machine learning guidelines for monitoring, incident management, and SRE best practices.

Collaboration and Communication:

Work closely with engineering, operations, and product teams to align reliability and monitoring goals, including automation adoption strategies.
Communicate effectively with stakeholders, providing regular updates on system health, incidents, performance improvements, and data-driven insights.
Foster a culture of collaboration, knowledge sharing, and automation best practices within the team and across the organization.

Requirements:

Bachelor's degree in computer science or a related technical field.
At least 5 years of experience in Site Reliability Engineering or a similar role.
Strong proficiency in at least one programming language such as Python, Go, or C#.
Demonstrated experience applying machine learning and automation to operational workflows such as monitoring, alerting and incident response.
Expertise with infrastructure-as-code tools such as Terraform.
Proven experience working with and monitoring container environments such as Cloud Run and Kubernetes.
Hands-on experience working within Azure, AWS, and GCP environments (GCP preferred).
Strong understanding of networking, distributed systems, and cloud infrastructure.
Familiarity with intelligent monitoring platforms and operational analytics tools such as Prometheus, Grafana, OpenSearch, Sentry, and Google Cloud Observability.
Excellent problem-solving skills and the ability to work independently and as part of a team.
Experience with incident management, root cause analysis, and automated operational workflows.

Annual pay range: $129,000-$160,000

A generous benefits package that includes paid time off, health, dental, vision, and a 401(k) savings plan with match.
