Glossary
Definitions for common SRE terminology
Acknowledge
The action of identifying an event as an actual incident that needs to be worked on, triaged, or debugged. Acknowledging leads to the next natural step: taking ownership of the incident response.
Availability/Uptime
The percentage of time a system is accessible and functioning as intended, usually measured over a month-long time frame.
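As a rough illustration of the percentage described above, the sketch below divides uptime by the total length of the measurement window. The function name `availability_percent` and the 30-day window are assumptions made for the example, not part of any standard.

```python
def availability_percent(total_seconds: float, downtime_seconds: float) -> float:
    """Return the share of the window during which the system was up, as a percentage."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= downtime_seconds <= total_seconds:
        raise ValueError("downtime_seconds must be between 0 and total_seconds")
    return 100.0 * (total_seconds - downtime_seconds) / total_seconds


# Assumed example: a 30-day month with 43 minutes 12 seconds of downtime.
month_seconds = 30 * 24 * 60 * 60
downtime_seconds = 43 * 60 + 12
print(round(availability_percent(month_seconds, downtime_seconds), 3))  # 99.9
```

In this example, "three nines" of availability over a 30-day month leaves roughly 43 minutes of allowable downtime.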
Detect
The work of analyzing and inspecting a system with a variety of tools (APM, monitoring, observability) to determine what caused an incident.
Latency
The response time of a system or the total time a system takes to respond to a request.
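One simple way to observe the response time described above is to time a single request from start to finish. The sketch below is a minimal illustration; `measure_latency_ms` and `slow_handler` are hypothetical names, and a real system would usually record many samples rather than one.

```python
import time


def measure_latency_ms(handler, request):
    """Return how long handler(request) takes to respond, in milliseconds."""
    start = time.perf_counter()
    handler(request)
    return (time.perf_counter() - start) * 1000.0


def slow_handler(request):
    # Hypothetical handler that simulates about 20 ms of work.
    time.sleep(0.02)
    return "ok"


print(f"{measure_latency_ms(slow_handler, 'ping'):.1f} ms")
```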
MTBF (Mean Time Between Failures)
The average time between failures (or incidents).
MTTD (Mean Time to Detect)
The average time interval between when an issue occurs and when an alert for the issue is triggered.
MTTR (Mean Time to Repair)
The average time elapsed between acknowledging an incident and resolving the incident.
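The three averages defined above (MTBF, MTTD, and MTTR) can be sketched from incident timestamps. This is a minimal illustration under stated assumptions: the `Incident` record and its field names are hypothetical, and MTBF is measured here from one incident's resolution to the next incident's start, which is only one of several conventions in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    occurred: datetime      # when the issue began
    alerted: datetime       # when the alert was triggered
    acknowledged: datetime  # when a responder acknowledged it
    resolved: datetime      # when service was restored


def _mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    if not deltas:
        raise ValueError("not enough incidents to compute an average")
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mttd(incidents):
    """Mean time from an issue occurring to its alert being triggered."""
    return _mean_delta([i.alerted - i.occurred for i in incidents])


def mttr(incidents):
    """Mean time from acknowledging an incident to resolving it."""
    return _mean_delta([i.resolved - i.acknowledged for i in incidents])


def mtbf(incidents):
    """Mean gap between one incident's resolution and the next one's start (assumed convention)."""
    ordered = sorted(incidents, key=lambda i: i.occurred)
    return _mean_delta([b.occurred - a.resolved for a, b in zip(ordered, ordered[1:])])


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4),
             datetime(2024, 1, 1, 10, 10), datetime(2024, 1, 1, 11, 10)),
    Incident(datetime(2024, 1, 3, 11, 10), datetime(2024, 1, 3, 11, 16),
             datetime(2024, 1, 3, 11, 20), datetime(2024, 1, 3, 11, 50)),
]
print("MTTD:", mttd(incidents))  # 0:05:00
print("MTTR:", mttr(incidents))  # 0:45:00
print("MTBF:", mtbf(incidents))  # 2 days, 0:00:00
```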
Monitoring
The practice of “watching” important, predetermined metrics in charts and dashboards that tell you how the system is behaving overall.
Network
The connection between computers, servers, and other devices that enables data sharing, allowing users to request data from a server and to run services using the data and code stored on servers.
Observability
The ability to fully understand a system. In control theory, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. In practice, this means being able to ask any question of your system to better understand how it’s behaving, without having to re-instrument it or write new code.
Postmortem / Retrospective
A post-incident record documenting the impact, process steps, and resolution of an incident, which helps teams improve and manage incidents better in the future.
RACI Chart
A RACI chart is a project management tool used for tracking roles and responsibilities. RACI is an acronym that stands for responsible, accountable, consulted, informed.
Reliability
A measure of how likely it is that a system will perform and function properly as it is intended to, which includes assessment of availability, latency, and stability among other performance metrics.
Resolve
The process of fixing the contributing factors that led to an incident in order to restore the service.
Respond
An organized approach to addressing and managing an incident including logging steps, recording actions or tasks by ‘owner’, and communicating across relevant stakeholders.
Runbook
A document compiling the necessary procedures and operations to follow when an incident happens. In other words, the ‘recipe’ for managing an incident end to end.
SRE (Site Reliability Engineering)
A set of practices and principles aimed at improving a service’s reliability.
System
A grouping of interconnected components, including code, infrastructure, and networking, that together make a greater whole, i.e. “the system”.