Site reliability engineering

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.^[1] Its main goal is to create scalable and highly reliable software systems.^[1] Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and SRE has also been described as a specific implementation of DevOps.^[1]^[2]

History

The field of site reliability engineering originated at Google with Ben Treynor Sloss,^[3]^[4] who founded a site reliability team after joining the company in 2003.^[5] In 2016, Google employed more than 1,000 site reliability engineers.^[6] The concept later spread into the broader software development industry, and other companies began to employ site reliability engineers.^[7] The position is more common at larger web companies, as small companies often do not operate at a scale that would require dedicated SREs.^[7] Companies that have adopted the concept include Dropbox, Airbnb, and Netflix.^[6] According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model.^[8]^[9]

Definition

Site reliability engineering is the application of software engineering to IT subjects including infrastructure and operations, with the goal of creating and maintaining scalable and reliable systems.^[1]^[4] Site reliability engineers often have backgrounds in software engineering, system engineering, or system administration.^[10] Focuses of site reliability engineering include automation, system design, and improvements to system resilience.^[10] SRE teams are responsible for system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.^[11]

Site reliability engineering is closely related to DevOps, a set of practices that combine software development and IT operations, and has also been described as a specific implementation of DevOps.^[1]^[2] Site reliability engineering focuses specifically on building reliable systems, whereas DevOps is more broadly focused on infrastructure.^[1] The definition varies somewhat by company, and Stephen Gossett wrote in Built In that some companies have rebranded their operations teams to SRE teams with little meaningful change.^[7]

Industry

The USENIX organization has held an annual SREcon conference since 2014 for site reliability engineers in industry, and also holds regional conferences with similar themes.^[12]

References

[1] Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-5118-7. OCLC 945577030.
[2] Vargo, Seth; Fong-Jones, Liz (March 1, 2018). What's the Difference Between DevOps and SRE? (class SRE implements DevOps) (Video). Google.
[3] Hill, Patrick. "Love DevOps? Wait until you meet SRE". Atlassian. Retrieved June 17, 2021.
[4] "What is SRE?". Red Hat. Retrieved June 17, 2021.
[5] Treynor, Ben (2014). "Keys to SRE". USENIX SREcon14. Retrieved June 17, 2021.
[6] Fischer, Donald (March 2, 2016). "Are site reliability engineers the next data scientists?". TechCrunch. Retrieved June 17, 2021.
[7] Gossett, Stephen (June 1, 2020). "What Is a Site Reliability Engineer? What Does an SRE Do?". Built In. Retrieved June 17, 2021.
[8] Oehrlich, Eveline; Groll, Jayne; Garbani, Jean-Pierre (2021). Upskilling 2021 Enterprise DevOps Skills Report (PDF) (Report). DevOps Institute. Retrieved June 17, 2021.
[9] Oehrlich, Eveline (May 4, 2021). "What it takes to be a site reliability engineer". TechBeacon. Micro Focus. Retrieved June 17, 2021.
[10] Jones, Chris; Underwood, Todd; Nukala, Shylaja (June 2015). "Hiring Site Reliability Engineers" (PDF). ;login:. Vol. 40 no. 3. pp. 35–39. Retrieved June 17, 2021.
[11] Treynor, Ben. "In Conversation" (Interview). Interviewed by Niall Murphy. Google Site Reliability Engineering.
[12] "Usenix SREcon". USENIX. 2021. Retrieved June 17, 2021.

External links

Awesome Site Reliability Engineering resources list
How they SRE resources list
