A Guide to Understanding Observability & Monitoring in SRE Practices
Wondering what the difference is between observability and monitoring? In this post, we explain how they are related, why they matter, and suggest some tools that can help.
What’s the Difference between Observability and Monitoring?
The difference between observability and monitoring is that observability is the ability to understand a system’s internal state from its outputs, often described as dealing with the “unknown unknowns”. Observability lets you ask arbitrary new questions of the system to understand more deeply how your code is behaving. Monitoring, by contrast, is the ability to determine the overall state of the system and is usually focused on the system's infrastructure.
Observability is a technical practice and set of tooling that enables SRE, engineering, and ops teams to debug their systems methodically, exploring patterns and properties that were not defined or anticipated in advance. Because code can behave differently in production than in staging, it’s important to proactively observe what’s occurring in production as it affects users. To achieve true system observability, you need to instrument your code to generate telemetry that lets you ask new questions as they arise.
According to the recent Gartner Hype Cycle report, Hype Cycle for Monitoring, Observability & Cloud Operations, 2021, by Padraig Byrne and Pankaj Prasad:
“Many vendors, including those in the network and security domains, are using the term “observability” to differentiate their products. However, little consensus exists on the definition of observability and the benefits it provides, causing confusion among enterprises purchasing tools.”
By contrast, monitoring is a practice that enables SRE and ops teams to watch and comprehend the state of their systems, typically through predefined metrics and dashboards that are updated in real time. The data feeding those dashboards comes from a predefined set of metrics or logs that are important to you. More on that in a moment.
The same Gartner Hype Cycle report cites an obstacle to monitoring adoption:
“Due to the conservative nature of IT operations -- In many large enterprises, the role of IT operations has been to keep the lights on, despite constant change. This, combined with the longevity of existing monitoring tools, means that new technology is slow to be adopted.”
How are Observability and Monitoring Related to each other?
Observability and monitoring have a symbiotic relationship, but they serve different purposes. Observability makes the data accessible, whereas monitoring is the task of collecting and displaying that data so it can be relied upon for ongoing review, or ‘watching’.
What is SRE Observability?
SRE observability, also known as o11y, is a concept derived from control theory. It addresses this problem by encouraging engineers to write their services so that they emit metrics and logs, which are then used to observe what’s actually occurring with the code in production.
Observability can be further broken down into three key areas: metrics, logs, and traces. Each is discussed below as an essential element of SRE observability.
Metrics
Metrics are the foundation of monitoring: aggregated data about the performance of a service, usually a single number tracked over time. Traditionally, system-level metrics such as CPU, memory, and disk performance were what teams tracked. They include data such as:
- Counter: the number of queries in a particular time frame
- Distribution: the latency associated with service requests or queries
- Gauge: the current CPU load
The challenge is that while these metrics describe the system, they don’t tell you much about the user experience or how to improve your code’s performance. To address this, many modern monitoring services also offer APM (Application Performance Monitoring) features to track application-level metrics such as requests per minute and error rates. Each metric tracks only one variable, which makes it relatively cheap to store and send.
The DevOps, Ops, or SRE team usually determines the best set of metrics to watch for, which can vary depending on the service itself and its overall maturity. Often teams watch metrics dashboards when code changes occur or when a new fix or release is shipped.
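To make these metric types concrete, here is a minimal sketch of how a service might expose a counter, a distribution (histogram), and a gauge. It assumes the Python prometheus_client library; the metric names, port, and simulated workload are illustrative, not a prescribed setup.

```python
# A minimal sketch of counter, distribution (histogram), and gauge metrics,
# assuming the Python prometheus_client library. Names and port are illustrative.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: total number of queries handled (only ever increases).
QUERIES_TOTAL = Counter("app_queries_total", "Total queries served")

# Distribution: latency of each query, tracked as a histogram.
QUERY_LATENCY = Histogram("app_query_latency_seconds", "Query latency in seconds")

# Gauge: a point-in-time value such as current CPU load (set, not incremented).
CPU_LOAD = Gauge("app_cpu_load", "Current CPU load average")

def handle_query():
    QUERIES_TOTAL.inc()                        # count the request
    with QUERY_LATENCY.time():                 # record how long the work took
        time.sleep(random.uniform(0.01, 0.1))  # placeholder for real work

if __name__ == "__main__":
    start_http_server(8000)           # expose /metrics for the monitoring system to scrape
    while True:
        CPU_LOAD.set(random.random()) # in practice, read a real load value (e.g. os.getloadavg())
        handle_query()
```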
Logs
Logs are the output from your code: immutable, time-stamped records, sometimes referred to as events, that can be used to identify patterns in a system. Nearly every process in a system emits logs, which usually include information such as records of individual user queries and debugging information associated with the service.
Logs can be arbitrary strings, but most programming languages and frameworks provide libraries that generate logs from the running code with relevant data at various levels of specificity (e.g., INFO vs. DEBUG). There is no universal standard across programming communities for what belongs at each log level.
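As a small sketch of leveled logging, the snippet below uses Python’s standard logging module; the logger name, messages, and simulated failure are illustrative and not tied to any particular service.

```python
# A minimal sketch of leveled logging using Python's standard logging module.
# Logger name, format, and messages are illustrative.
import logging

logging.basicConfig(
    level=logging.INFO,  # switch to logging.DEBUG for more detailed output
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("checkout-service")

def process_order(order_id: str, user_id: str) -> None:
    # INFO: a record of an individual user action, useful in production.
    logger.info("processing order order_id=%s user_id=%s", order_id, user_id)
    try:
        # DEBUG: internal detail, typically only enabled while troubleshooting.
        logger.debug("loading order %s from cache", order_id)
        raise TimeoutError("payment provider timed out")  # simulate a failure
    except TimeoutError:
        # ERROR: something went wrong; captures context plus a stack trace.
        logger.exception("order %s failed", order_id)

if __name__ == "__main__":
    process_order("ord-123", "user-456")
```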
Traces
In a distributed system, a trace displays the flow of an operation from a parent event to its child events, each of which is timestamped. The individual events that make up a trace are referred to as spans. Each span stores a start time, a duration, and a parent-id; a span without a parent-id is rendered as a root span.
Traces allow individual execution flows to be followed through the system, which helps teams figure out which component or piece of code is causing a potential error. Teams can use dedicated tracing tools to inspect the details of a given request; by looking at spans and waterfall views that show multiple spans across your system, you can run queries to examine timing (latency), errors, and dependency details.
Many observability tools provide tracing capabilities as part of their offerings.
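For illustration, here is a minimal tracing sketch that creates a root span and two child spans. It assumes the OpenTelemetry Python SDK and exports spans to the console; the span and service names are hypothetical.

```python
# A minimal sketch of a parent (root) span with child spans, assuming the
# OpenTelemetry Python SDK (opentelemetry-sdk). Span names are illustrative.
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export finished spans to the console; in production this would point at a tracing backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def handle_request():
    # Root span: it has no parent, so it anchors the trace.
    with tracer.start_as_current_span("handle_request"):
        with tracer.start_as_current_span("query_database"):    # child span
            time.sleep(0.05)  # placeholder for a real database call
        with tracer.start_as_current_span("render_response"):   # sibling child span
            time.sleep(0.01)

if __name__ == "__main__":
    handle_request()  # each finished span is printed with its timestamps and parent id
```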
What is Monitoring for SRE Teams and Practices?
When SRE teams monitor, data about the system and its services is collected, processed, aggregated, and displayed in charts. Monitoring aims to surface performance issues and alert teams when fixes or resolutions are needed, reducing the impact on end users.
According to Google’s SRE Book:
“Your monitoring system should address two questions: what’s broken, and why? The "what’s broken" indicates the symptom; the "why" indicates a (possibly intermediate) cause.”
Here are some examples of symptoms and possible causes from the SRE book: serving HTTP 500s or 404s might be the symptom, with database servers refusing connections as the cause; slow responses might be the symptom, with overloaded CPUs or a partially failed network link as the cause.
Difference between White Box and Black Box Monitoring
Monitoring is usually classified into two broad types: black-box monitoring and white-box monitoring. Depending on the situation, one or both types of monitoring will be used. Usually, heavy use of white-box monitoring is combined with a modest (but critical) use of black-box monitoring.
White-box monitoring relies on metrics exposed by the system’s internals, such as logs, HTTP handlers, and interfaces. This gives team members insight into various parts of the tech stack; examples of white-box metrics include CPU usage and dependency health. Alerts from white-box monitoring can identify either:
- The cause of an issue, e.g., the database takes too long to respond to read operations.
- The symptom of an issue, e.g., none of the users can log into the system.
Black-box monitoring examines externally visible behavior as a user would see it. Its metrics include error responses, latency from the user’s perspective, and so on; the person doing the monitoring typically has no visibility into the system’s internals or knowledge of how it works. In terms of symptoms and causes, black-box monitoring is symptom-oriented: the team member watching these metrics doesn’t predict problems, but simply observes the system’s behavior.
On the other hand, in a multi-layered system one person’s symptom is another person’s cause. For example, if your database is performing poorly, slow database reads are a symptom for the database SRE, but for an SRE monitoring a front end that shows a slow web page, those same slow reads are a cause. Therefore, white-box monitoring can be either symptom-oriented or cause-oriented.
White-box monitoring is also important for collecting the telemetry data needed for deeper debugging. Telemetry data is data generated by the system documenting its own state and statistics, and it is used to determine and improve the health and performance of the overall system.
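To make the black-box idea concrete, below is a small sketch of an external probe that looks only at user-visible behavior (HTTP status and latency) with no knowledge of the system’s internals; the URL and thresholds are hypothetical.

```python
# A minimal sketch of a black-box probe: it only observes user-visible behavior
# (HTTP status and latency), with no knowledge of the system's internals.
# The URL and thresholds are hypothetical.
import time
import urllib.error
import urllib.request

URL = "https://example.com/login"
LATENCY_THRESHOLD_S = 1.0

def probe(url: str) -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # 4xx/5xx responses arrive as HTTPError
    except urllib.error.URLError as exc:
        print(f"ALERT symptom: {url} unreachable ({exc.reason})")
        return
    latency = time.monotonic() - start
    if status >= 500 or latency > LATENCY_THRESHOLD_S:
        print(f"ALERT symptom: status={status} latency={latency:.2f}s")
    else:
        print(f"OK status={status} latency={latency:.2f}s")

if __name__ == "__main__":
    probe(URL)
```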
What’s the Importance of the Golden Signals?
Google defines four golden signals for monitoring:
- Latency: the time the system takes to respond to a particular request. It’s important to distinguish the latency of successful requests from that of failed ones; for example, an HTTP 500 triggered by a lost connection to a backend might be returned very quickly, yet it still represents a failure.
- Traffic: the demand for your service among its users. It’s usually measured in HTTP requests per second.
- Errors: the rate of requests that fail explicitly (e.g., HTTP 500s), implicitly (e.g., a successful response with the wrong content), or by policy (e.g., a request that takes longer than the committed response time).
- Saturation: how “full” your service is, i.e., how much of the system’s capacity is being used at a given time.
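As an illustrative sketch rather than a canonical formula, the snippet below derives the four golden signals from a window of request records; the record fields and the saturation proxy (in-flight work versus capacity) are assumptions.

```python
# A minimal, illustrative sketch that derives the four golden signals from a
# window of request records. Field names and the saturation proxy are assumptions.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    duration_s: float  # how long the request took
    status: int        # HTTP status code returned to the user

def golden_signals(requests: list[Request], window_s: float,
                   in_flight: int, capacity: int) -> dict:
    durations = [r.duration_s for r in requests]
    failed = [r for r in requests if r.status >= 500]
    return {
        # Latency: 99th-percentile response time over the window.
        "latency_p99_s": quantiles(durations, n=100)[98],
        # Traffic: demand on the service, measured in requests per second.
        "traffic_rps": len(requests) / window_s,
        # Errors: fraction of requests that failed explicitly (HTTP 5xx here).
        "error_rate": len(failed) / len(requests),
        # Saturation: how "full" the service is, approximated by in-flight work vs. capacity.
        "saturation": in_flight / capacity,
    }

if __name__ == "__main__":
    sample = [Request(0.12, 200), Request(0.30, 200), Request(0.05, 503), Request(0.22, 200)]
    print(golden_signals(sample, window_s=60.0, in_flight=40, capacity=100))
```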
How to Measure Observability and Monitoring?
Implementing monitoring and observability in your organization is an evolving process that improves over time. To deploy full observability tooling that all engineering teams can learn from, you need to instrument your code so that it emits the right set of logs and metrics for you to monitor, observe, and query.
Essentially, it’s this detailed data about your system and code that you run queries against to understand exactly how the code is behaving in production. Below are some metrics to track through postmortems or monthly surveys.
By implementing both monitoring and observability tools, you can collectively determine what is alert-worthy. Some issues can be handled by individual teams and don’t require full incident resolution. Having SLOs in place is a good way to automate alerts based on acceptable thresholds for uptime, performance, and errors. The team should first establish monitoring and observability practices and playbooks, and then move to implement SLOs team-wide.
- Changes to monitoring configuration: how many changes (pull requests) are made to the monitoring configuration code repository, and how often are they pushed?
- Alert handling: how many alerts were handled outside of working hours, or, for a global team, which time zone handled the most alerts? Untimely or unevenly distributed alerts can hurt team morale and lead to alert fatigue.
- Alert distribution: are alerts evenly distributed among teams located in different regions? This includes how engineers are placed on an on-call rotation (a schedule that rotates the on-call shift so everyone on the team takes part). It also raises the question of whether and when developers are pulled in for a severity-1 issue, and if not, why not.
- False positives: alerts that result in no action because everything is actually working as expected.
- False negatives: the timeliness of alerts, or how often a system failure resulted in no alert at all, or in an alert that came far later than required.
- Alert creation: how many new alerts are created weekly?
- Alert acknowledgement: how many alerts are acknowledged within the agreed-upon timeline?
- Unactionable alerts: what percentage of alerts are considered unactionable because the on-call engineer was unable to take immediate action?
- Silenced alerts: how many alerts are in a silenced or suppressed state each week? What are the average and maximum silence periods? How many alerts are added to or removed from the silenced list, and when do those silences expire?
- MTTD (mean time to detect/discover), MTTR (mean time to resolution), and impact detection.
- Usability of alerts, runbooks, and dashboards:
  - How many graphs are on the dashboard, and can the team understand them?
  - How many lines are on each graph?
  - How are the graphs explained to a new engineer who is just onboarding?
  - How much and how often do people need to scroll and browse to find information?
  - Can engineers navigate effectively from alerts to playbooks to dashboards, and are the playbooks updated regularly?
  - Are alert names descriptive enough to point engineers in the right direction?
Tracking some or all of these metrics helps you understand whether your monitoring and observability systems are working efficiently for your organization. You can further break down the measurements by product, operational team, etc., to gain insight into your system, process, and people!
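Tying this back to the SLO-based alerting mentioned above, here is a hedged sketch of an error-budget check that decides whether an alert should fire; the SLO target, alerting threshold, and request counts are hypothetical.

```python
# A minimal sketch of SLO-based alerting: fire an alert once the error budget
# for the current window is burned beyond an agreed threshold. Numbers are illustrative.

SLO_TARGET = 0.999            # 99.9% of requests should succeed
BUDGET_ALERT_THRESHOLD = 0.8  # alert once 80% of the error budget is consumed

def error_budget_burned(total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget consumed in this window."""
    allowed_failures = total_requests * (1 - SLO_TARGET)
    if allowed_failures == 0:
        return 0.0
    return failed_requests / allowed_failures

def should_alert(total_requests: int, failed_requests: int) -> bool:
    return error_budget_burned(total_requests, failed_requests) >= BUDGET_ALERT_THRESHOLD

if __name__ == "__main__":
    # 1,000,000 requests allow 1,000 failures at a 99.9% SLO; 850 failures
    # burn 85% of the budget, which crosses the 80% alerting threshold.
    print(should_alert(total_requests=1_000_000, failed_requests=850))  # True
```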
Which Tools can Help with Observability and Monitoring?
For any service, toil is risky because repetitive manual tasks can end up consuming most of an SRE’s time. In the SRE book, Google states:
“Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil as a second-order effect.”
The goal is to keep toil under 50% because, left unchecked, it tends to grow and can quickly consume more than half of the team’s working week. The engineering in site reliability engineering (SRE) represents the practice of reducing toil and scaling up services; that’s what enables SRE teams to operate more efficiently than a pure development or pure ops team.
Here are a few tools that can help SREs with observability and system monitoring:
Monitoring tools (some open source)
Observability tools and frameworks
Ultimately, tools help, but on their own they are not enough to achieve your objectives. Observability and monitoring are the shared responsibility of SRE/ops and development teams.
Who is Responsible for Observability and Monitoring?
The responsibility for monitoring and observing a system should not fall solely on an individual or a single dedicated team. Distributing it not only helps you avoid a single point of failure, it also improves your ability to understand and improve the system as an organization. Ensuring that all developers are proficient in monitoring promotes a culture of data-driven decision-making and reduces outages.
In most organizations, only the operations team, NOC, or a similar group can make changes to the monitoring system. That’s not ideal, and it should be replaced by a workflow that follows continuous delivery (CD) patterns, so that all changes are delivered in a safe, fast, and sustainable manner.
How can Blameless Help with the Process?
The ultimate goal of observability and monitoring is to continuously improve the system. DevOps Research and Assessment (DORA) research identifies comprehensive monitoring and observability, alongside other technical practices, as a key contributor to continuous delivery.
If you’re aiming to start the journey toward observability and monitoring, Blameless can help by integrating with your chosen tools to collect the right data and analysis for faster incident resolution and ongoing team learning. Blameless’s Reliability Insights platform helps you explore, analyze, and share your reliability data with various stakeholders. To learn more, request a demo or sign up for our newsletter below.