Company Overview
Procurement Sciences AI (PSci.AI) is at the forefront of generative artificial intelligence for the government contracting sector. We are a fast-growing Series A B2B SaaS company backed by Battery Ventures, a top 1% global technology venture capital firm, and we are dedicated to improving federal, state, and local business development with AI. Our “Win More Bids” platform delivers operational efficiencies for our clients and drives new revenue streams. By tailoring generative AI to the government contracting domain, we give our customers a distinct competitive advantage.
Job Title: Site Reliability Engineer (SRE)
Location: Washington, DC metro area; Salt Lake City, UT; or Remote
Job Description
We are seeking an experienced and driven Site Reliability Engineer (SRE) to help ensure the reliability, performance, and scalability of our cloud-based AI solutions. The ideal candidate has a track record of diagnosing root causes, building automation, optimizing observability, and managing reliability in complex SaaS environments. Experience with Kubernetes, Helm, modern observability platforms, and major public cloud providers (Azure, AWS, Google Cloud Platform) is key. You will play a central role in defining and monitoring key reliability metrics, strengthening operational excellence, and championing DevOps culture across our rapidly growing organization.
Key Responsibilities
Identify and resolve system and application issues through in-depth root cause analysis, working closely with development teams and stakeholders.
Design, develop, and implement comprehensive automated testing to ensure ongoing system reliability and performance.
Build and maintain robust observability and monitoring solutions using Datadog, Prometheus, Grafana, ELK Stack, or similar platforms.
Define and monitor service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs) across services to meet operational commitments and improve reliability.
Collaborate with developers and operations staff to enhance system reliability, deployment agility, and overall developer experience (DevEx).
Develop and continuously improve monitoring and alerting systems to proactively address potential issues.
Lead and implement best practices for incident management, disaster recovery, and business continuity.
Manage high-impact incident response, facilitate post-mortem analyses, and drive remediation to prevent future occurrences.
Plan for capacity upgrades and scaling to support company growth and system performance requirements.
Automate operational tasks and infrastructure management using Infrastructure as Code (IaC) tools and related technologies.
Ensure all systems and processes comply with security, privacy, and regulatory requirements relevant to GovCon customers.
Continually assess and drive improvements in system architecture, operational processes, and documentation for systems and incidents.
Technical Requirements
Proficient in Kubernetes, Helm, and troubleshooting in secure and regulated environments.
Deep experience with observability and monitoring tools such as Prometheus, Grafana, ELK Stack, Datadog, or similar.
Hands-on expertise with major public cloud providers: Azure, Azure Gov, AWS, AWS GovCloud, and Google Cloud Platform (GCP).
Strong grasp of microservices architecture, cloud-native technologies, Postgres, and AI/ML systems.
Expertise in automated testing frameworks and practices (integration, synthetic, load testing, etc.).
Proficiency in tracking and analyzing reliability metrics (SLIs, SLAs, SLOs).
Excellent problem-solving skills and attention to detail, with the ability to operate independently and collaboratively.
Strong programming skills in TypeScript and Python.
Solid scripting abilities in Bash, PowerShell, or similar languages.
Demonstrated experience with Infrastructure as Code (IaC) tools such as Azure Bicep, AWS CDK, or Terraform.
Working knowledge of core networking principles, plus advanced troubleshooting skills.
Effective communicator, able to work with both technical and business personnel.
Preferred Qualifications
Experience in the GovCon sector and/or holding a security clearance.
Familiarity with GitOps principles and tools; experience with FluxCD is a plus.
Proven experience in designing, building, and maintaining CI/CD pipelines.
Experience managing reliability in multi-cloud or hybrid cloud environments.
Knowledge of security and compliance standards applicable to government SaaS and cloud systems.
Previous success operating in dynamic, high-growth SaaS companies.
Demonstrated expertise in operationalizing new development workloads with cross-functional teams.
Compensation and Benefits
Competitive salary and performance-based incentives, including stock options.
Comprehensive health plan for employees and their families.
Flexible remote-first work environment, with options to work from the DC metro area or Salt Lake City, UT.
Wide-ranging opportunities for professional development, technical advancement, and career growth.
To Apply:
Please submit your resume and a brief cover letter describing your experience with cloud-native reliability, Kubernetes, and observability in complex SaaS environments.
Notice: Background Check Required
As part of our employment process, a background check (including, but not limited to, credit history, criminal records, and employment verification) may be required in compliance with applicable law. By applying, you acknowledge and consent to this process.
Procurement Sciences AI is proud to be an equal opportunity employer committed to diversity and inclusion at all levels of the organization.