Senior Site Reliability Engineer (Remote)

Posted 14 days ago | United States | Salary undisclosed

Job Description

GovCIO is a team of transformers: people who are passionate about transforming government I.T. We believe in making a difference by developing digital strategies and delivering technology-related innovation to governmental operations that improves the citizen experience every day.

But we can't do it alone. We welcome and nurture an inclusive and diverse work culture, because different backgrounds, experiences, abilities, and perspectives make us better decision-makers, problem solvers, and creators. We're changing the face of I.T., from our diverse staff to the end products we develop, and we're excited to expand our team. Are you ready to be a transformer?



As a Senior Site Reliability Engineer, you will apply senior-level application and product expertise to help build processes that manage and improve OIT’s response posture to system events impacting end users and Veterans. This includes working with business partners to improve communication and responsiveness to application failures, minimizing performance degradation and loss of availability, and working toward a significant reduction in application downtime and user impact. You will work with a team of junior and senior site reliability engineers, supporting an engineering team lead in completing the required deliverables.


Areas of support include:

  • Triage Major Incident Management (MIM) and Problem Management (PM) incidents by deconstructing application performance, interoperability, instrumentation, and human factors to facilitate resolution and development of resilient solutions.
  • Support coordination and ensure that all High Priority Incidents (HPIs) and Critical Priority Incidents (CPIs) are triaged properly and routed to the appropriate groups for immediate resolution.
  • Perform enterprise root cause analysis (RCA) and identification in coordination with the appropriate OIT organizations.
  • Capture technical information from the relevant stakeholders and synthesize it into useful information in various formats for OIT senior management and other VA components.
  • Support the collection, development, and/or editing of content for white papers and other communication devices; and assess and evaluate the effectiveness of executive communication to effect process improvement.
  • Demonstrate proficiency with DevOps tools, JIRA, ServiceNow, and MS Project, and perform tasks using these tools.
  • Analyze incident record data, research trends, and digest findings into written recommendations and strategies for improving the posture of the VA’s information technology services, reducing both MTTR and incident frequency.
  • Manage cases and follow through after incident resolution on root cause analysis, developing permanent fixes and preventative strategies to reduce MTTR and incident recurrence.
  • Digest findings and write technical recommendations for case management and trend analysis presentations.

Qualifications


  • Bachelor's Degree in Business Administration, Business Management, Computer Science, Information Systems, Information Resource Management, Industrial Engineering, Operations Research, or related fields
  • 12+ years of relevant experience (or commensurate experience)

Required Skills and Experience

  • Technical expertise across multiple technology areas and the ability to diagnose complex issues that span many technologies.
  • Must be able to identify and mitigate risks to the product.
  • Must be able to present analytical findings orally and in writing, using narrative and graphic forms.
  • Must be able to use qualitative and quantitative analytical skills to assess the effectiveness of operations.
  • Ability to identify symptoms that indicate a need for process improvement.
  • Strong communication skills, including the ability to craft content for executive-level presentations.
  • IT background and ability to understand technical content.
  • Experience with packet capture analysis using tools such as Wireshark or Netscout.
  • Experience with monitoring tools such as Splunk, AppDynamics, SolarWinds, or Dynatrace.
  • Understanding of the RCA process and the ability to work across teams to guide implementation of solutions to identified incident root causes.
  • Broad understanding of ITIL.

Preferred Skills and Experience

  • Master's Degree in Business Administration, Business Management, Computer Science, Information Systems, Information Resource Management, Industrial Engineering, Operations Research, or related fields.
  • ServiceNow experience is nice to have.