Site reliability engineering

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.[1] The main goals are to create scalable and highly reliable software systems.[1] Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.[1][2]


The field of site reliability engineering originated at Google with Ben Treynor Sloss,[3][4] who founded a site reliability team after joining the company in 2003.[5] In 2016, Google employed more than 1,000 site reliability engineers.[6] The concept subsequently spread into the broader software development industry, and other companies began to employ site reliability engineers.[7] The position is more common at larger web companies, as small companies often do not operate at a scale that would require dedicated SREs.[7] Companies that have adopted the concept include Dropbox, Airbnb, and Netflix.[6] According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model.[8][9]


Site reliability engineering is the application of software engineering to IT subjects including infrastructure and operations, with the goal of creating and maintaining scalable and reliable systems.[1][4] Site reliability engineers often have a background in software engineering, system engineering, or system administration.[10] Focuses of site reliability engineering include automation, system design, and improvements to system resilience.[10] SRE teams are responsible for system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.[11]
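
As a rough illustration of two of the responsibilities listed above, availability and latency, the following Python sketch computes both from a list of request records. It is a minimal example under stated assumptions, not a description of how any particular SRE team measures its systems: the `Request` record, the function names, and the sample data are all hypothetical, availability is taken to mean the fraction of successful requests, and latency is summarized with a nearest-rank percentile.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request (assumed shape, not from the source)."""
    latency_ms: float
    ok: bool


def availability(requests):
    """Return the fraction of requests that succeeded, between 0.0 and 1.0."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(1 for r in requests if r.ok) / len(requests)


def latency_percentile(requests, pct):
    """Return the nearest-rank percentile of request latency in milliseconds."""
    if not requests:
        raise ValueError("no requests to measure")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(r.latency_ms for r in requests)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# Made-up sample data, for illustration only.
sample = [
    Request(latency_ms=120.0, ok=True),
    Request(latency_ms=95.0, ok=True),
    Request(latency_ms=310.0, ok=False),
    Request(latency_ms=101.0, ok=True),
]

print(availability(sample))            # 0.75
print(latency_percentile(sample, 50))  # 101.0
```

In practice such figures would typically come from a monitoring system rather than an in-memory list; the sketch only shows the arithmetic behind the two measures.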

Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and has also been described as a specific implementation of DevOps.[1][2] Site reliability engineering focuses specifically on building reliable systems, whereas DevOps is more broadly focused on infrastructure.[1] The definition of the role varies somewhat by company, and Stephen Gossett wrote in Built In that some companies have rebranded their operations teams as SRE teams with little meaningful change.[7]


The USENIX organization has held an annual SREcon conference since 2014 for site reliability engineers in industry, and also holds regional conferences with similar themes.[12]

See also

  • Chaos engineering
  • Cloud computing
  • Data center
  • Disaster recovery
  • High availability software
  • Infrastructure as code
  • Operations, administration and management
  • Operations management
  • Reliability engineering
  • System administration

References


  1. Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-5118-7. OCLC 945577030.
  2. Vargo, Seth; Fong-Jones, Liz (March 1, 2018). What's the Difference Between DevOps and SRE? (class SRE implements DevOps) (Video). Google.
  3. Hill, Patrick. "Love DevOps? Wait until you meet SRE". Atlassian. Retrieved June 17, 2021.
  4. "What is SRE?". Red Hat. Retrieved June 17, 2021.
  5. Treynor, Ben (2014). "Keys to SRE". USENIX SREcon14. Retrieved June 17, 2021.
  6. Fischer, Donald (March 2, 2016). "Are site reliability engineers the next data scientists?". TechCrunch. Retrieved June 17, 2021.
  7. Gossett, Stephen (June 1, 2020). "What Is a Site Reliability Engineer? What Does an SRE Do?". Built In. Retrieved June 17, 2021.
  8. Oehrlich, Eveline; Groll, Jayne; Garbani, Jean-Pierre (2021). Upskilling 2021 Enterprise DevOps Skills Report (PDF) (Report). DevOps Institute. Retrieved June 17, 2021.
  9. Oehrlich, Eveline (May 4, 2021). "What it takes to be a site reliability engineer". TechBeacon. Micro Focus. Retrieved June 17, 2021.
  10. Jones, Chris; Underwood, Todd; Nukala, Shylaja (June 2015). "Hiring Site Reliability Engineers" (PDF). ;login:. Vol. 40 no. 3. pp. 35–39. Retrieved June 17, 2021.
  11. Treynor, Ben. "In Conversation" (Interview). Interviewed by Niall Murphy. Google Site Reliability Engineering.
  12. "Usenix SREcon". USENIX. 2021. Retrieved June 17, 2021.

Edited: 2021-06-18 19:18:45