Senior Site Reliability Engineer, Colorado Springs

About Onebrief

Onebrief is collaboration and AI-powered workflow software designed specifically for military staffs. By transforming how military staffs work, Onebrief makes the staff as a whole superhuman: faster, smarter, and more efficient.

We take ownership, seek excellence, and play to win with the seriousness and camaraderie of an Olympic team. Onebrief operates as an all-remote company, though many of our employees work alongside our customers at military commands around the world.

Founded in 2019 by a group of experienced planners, Onebrief now has a team that spans veterans from all forces and global organizations, and technologists from leading-edge software companies. We’ve raised $123M+ from top-tier investors, including Battery Ventures, General Catalyst, Insight Partners, and Human Capital, and Onebrief is valued at $1.1B. With this continued growth, Onebrief is able to make an impact where it matters most.

Security Clearance, Location, and Onsite Notice:

This role requires regularly working on-site at customer locations in Colorado Springs, Colorado.

If you are not currently within commuting distance, you must be willing to relocate (note that Onebrief will provide relocation assistance).

Active Top Secret Clearance required; SCI eligibility is a plus.

About The Role

We are hiring a Site Reliability Engineer to join our Infrastructure & Security team. You’ll report to our Director of Infrastructure and work closely with fellow SREs, security, and customer success.

You will be the first line of support for our mission-critical deployments and will be responsible for ensuring best-in-class service quality and issue resolution. You will work in both on-premises DoD environments and AWS cloud environments. Your lessons from the field will shape how our team works, from policy to implementation.

In addition to working at customer sites, you will contribute directly to solutions that increase the stability, performance, and security of our deployments, and improve the overall experience of deploying and managing Onebrief on premises.

About You

You are a force multiplier who views reliability as the most critical feature of any application or platform and believes that "reliability beats novelty." You see infrastructure and operability as a product to be automated, documented, and continuously improved, always leaving systems easier to operate than you found them.

You are equally comfortable leading a post-incident review, designing SLOs in a system design session, or diving into a kubectl shell to triage a complex production issue. You don't just fix problems; you translate constraints and failure modes into clear, automated guardrails and scalable, resilient architecture. For you, robust monitoring, actionable alerting, and insightful runbooks are core parts of the engineering process, not afterthoughts.

You mentor others, fostering a culture of blameless postmortems and proactive reliability. You collaborate naturally with application and platform teams, helping them move quickly but safely by building the tools, processes, and observability that make "fast recovery" a reality.

What You'll Do

You'll own the reliability, scalability, and security of the production application and/or platform. You will do this by:

Building a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana). You won't just track metrics; you'll create the actionable insights and automated alerting that allow teams to identify and resolve issues before they impact users.
Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Objectives (SLOs) and increases trust internally and externally. You will be the organization's expert on what it means for our systems to be reliable and how to measure it.
Leading Incident Response: Act as the incident responder and potentially incident commander during critical incidents. You will lead blameless post-mortems / After Action Reviews (AARs) that identify true root causes and drive automated, long-term solutions to prevent recurrence.
Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code (Terraform, Ansible). You will embed security and compliance controls (RMF, STIGs) directly into this automation.
Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation. You will act as a force multiplier by advising other teams on best practices in air-gapped environments and production readiness.

What We Look For

3 years of experience in Site Reliability Engineering or a related field, with firsthand experience managing mission-critical systems within DoD’s air-gapped environments.
An active Top Secret security clearance. U.S. citizenship required.
Experience automating software delivery and deployment, and providing documentation and self-service tools for engineering teams and customers.
A strong understanding of Linux, containerization and orchestration, and virtual machines.
Experience with centralized logging, metrics, and observability using tools such as Prometheus, Loki, Grafana, ELK stack, or Datadog.
Networking fundamentals: core protocols and secure configurations.
A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement.
Clear, concise writing; strong documentation habits and async communication.
Core skills and technologies: VMware, Kubernetes, Docker, Helm, Ansible, Terraform, Linux, AWS, DoD compliance, and monitoring and observability tools.

Bonus points (nice to have)

Experience with compliance frameworks (RMF, STIGs/SRGs, ICD 503).
Security‑minded design for air-gapped environments.
Active Security+ or another DoD 8570.01-approved security credential, or the ability to obtain a valid credential within 3 months of employment.

Average salary estimate

$165,000 per year (est.); range $140,000 to $190,000

