Machinify gets "tremendous value" from Blameless, responds to incidents confidently with universal insight on service reliability
Machinify is a revolutionary software company with a mission to ensure that patients get the right treatment, at the right time, at the right price. The cloud-based Machinify AI platform delivers products that, bit by bit, are transforming health insurance care and claims administration from a human-powered, error-prone series of workflows fueled by faxes and spreadsheets into a world of transparent, real-time care and payment decisions.
Machinify looks to build up its incident response amidst rapid growth
In 2020, the small DevOps team of three began to grow rapidly, along with the entire organization. It became apparent to Sr. VP Operations, David Levinger, that they needed to define a clear incident management process. Sr. DevOps Engineer, Alex Myer, knew that the first step was to improve incident documentation. In previous roles, he’d built “home grown” solutions by leveraging tools like Confluence to store information in a single place. “I didn’t want to have to do it again, because it’s a lot of work. At the end of the day, it’s worth the effort to build it ourselves if we can’t find a tool in the market.”
To their relief, colleague Gavin Ray, Principal Engineer, recommended Blameless for Machinify, having used the product in his previous two jobs. After just over a year of using Blameless, the three report that it’s been “transformative” not only for their DevOps function, but also for the broader organization.
“We started using Blameless about a year ago, and transformative is the only way to describe it.” - David Levinger
Strengthening the reliability engineering muscle across the team
To help the team build the habit of following the new incident management process, Alex started a weekly SRE meeting where they go over everything that happened in the previous on-call rotation. They focus strongly on finding the true contributing factors of an incident. No issue should be brought up in the meeting without finding a way to address it. “We don’t want to see it happen again,” Alex explains. As a result, a single incident often leads to several problems being solved. Oftentimes, the solutions involve changes to the entire system. To track those changes, the team creates action items in Jira using the integration with Blameless. These could be software code changes or DevOps process changes. It’s a great learning experience, because they end up digging deep into the various causes of what happened.
As the team grows rapidly, David’s priority is to train the entire engineering function effectively. He’s ramping up new hires and is hyper-focused on democratizing knowledge. It starts with hiring the best software engineers, who then get to work rebuilding the runbooks and process documentation. If an alert isn’t tied to a runbook, and the runbook isn’t tied to an explanation that the whole team can understand, it’s the responsibility of the person responding to the incident to write it down (not necessarily during the incident) so that the next person on call has less to figure out on their own. The documentation then evolves as things change over time.
“No one on our team that’s pulled into an incident doesn’t know what to do, where to go, and how to solve it.” - David Levinger
Turning incidents into learning opportunities and systemic improvements
After the Machinify team adopted Blameless, it became easier to identify where their runbooks needed improvement. Previously, their runbooks were written in varying formats, some were linked to alerts and some were not, and some alerts notified the wrong person. They fixed this immediately. Now you’d be hard pressed to find an alert that’s not tied to a runbook. That’s become the exception, not the rule! The runbooks have also been updated so that they’re much easier to understand, and the information is both current and accurate. If they encounter an outdated runbook in the weekly meeting, that’s the time to fix it. The team will make sure it happens.
During the weekly SRE meetings, the whole team holds the on-call leader accountable for getting to the true root of the problem. They all support each other in this way. “As a team, we do not want to repeat the same efforts,” Alex elaborates. They start every SRE meeting looking at the Reliability Insights view in Blameless. Some of their key reports are Incidents Tagged Per Customer and Incidents By Priority and Severity Level. They also review the previous week and try to spot trends and patterns.
“We hire fantastic engineers who are diligent and want to help the system improve. As a leader, I also want to encourage this type of collaboration. Like many DevOps engineers, I really hate doing anything more than once. I hate it! I can’t even articulate how much I hate it. So when there is a problem, I want that problem to never happen again. It can be a new problem; that’s fine. But that problem shouldn’t happen again. We all share that mentality of, I don’t want to keep doing stuff that I did before,” David shares.
The Machinify team believes that an incident doesn’t have to mean your customer-facing site is down. For example, it could mean that delivery of a data file to a customer was delayed. They also believe in promoting a blameless reliability culture. In fact, Machinify strongly supported the Blameless product update that adopted the term “Retrospective” for the post-incident analysis and review, because “Postmortem” carries a negative connotation.
Blameless helps Machinify align on reliability across the organization
The Machinify DevOps team knows not to hide the hard work that happens behind the scenes. People should see that pain even when they’re not the point person. The team started bringing more people into incidents when it’s a code problem. David describes how he involves cross-functional teams during incidents: “We’re inviting the server health team, which is a group of server engineers. If the issue appears to be data science related, we’re bringing them in. We’ve also had people join incident calls when they’re personally interested in the resolution, like customer success and sales. Which is excellent. And they get to ask questions that sometimes we didn’t think of.”
It’s important for an entire organization to acknowledge the impact incidents have on customers and the business overall. David shares this belief, saying, “The last thing you want to say to a customer is that you knew an incident occurred, but you don’t know why, and a week has gone by, and you still haven’t gotten to the bottom of it yet. It’s not a good look for your business.” Of course, teams might spot issues that are not their “fault,” but they should always want to know what they can do to “shrink the blast radius” (as David says) and make the issue easier to resolve.
Using Blameless means everyone in the company can ask what is happening and why. The CEO, the executive team, customer success engineers: everyone is attuned to Machinify’s service reliability. “Blameless offered universal insight into what’s happening in our company,” says David. “Everybody can see what’s going on at any given time and that insight presents tremendous value because we can quantify how smoothly the entire organization is operating.” If they notice repeat events, they know it’s time to address the issue. They don’t worry, because they have a centralized Slack channel where they initiate Blameless incidents. “It’s built a culture of trying to do the best for our customers, whether internal or external. It’s a result of managing the incident process and setting proper expectations,” David shares.
At Machinify, incidents are nothing to fear, but something to be motivated by
“By doing this process regularly, it doesn’t scare you anymore. Incidents are just a way to learn. You can call out a blind spot that you didn’t know you had before,” David explains. The DevOps team wants to inspire the rest of the Machinify organization to feel this way too. The data science teams, customer content delivery teams, and other engineering teams have started using Blameless, and the feedback has been great. David tells them, “Just use it. Even if you mess up, it doesn’t matter. Our team can help you figure it out. Just get onto the platform and start using it.” Once, the content delivery team initiated a Blameless incident for what they thought was a small problem. In the process, they discovered other bugs and resolved them. They credit this to getting the right team involved and probing until they got to the bottom of it.
There are many consumers of Machinify’s product, but one of the larger consumers is actually internal: their own data scientists. During a recent meeting between the department heads, David received feedback that they never worry about the platform anymore. Even when there are problems, they know it’s going to be okay. “That’s great feedback! If we hadn’t started down the Blameless path, we wouldn’t have become this proactive about identifying incidents. We want to see everything so that we really know what’s working. If there’s anything missing, we need to add that visibility so we can spot issues faster in the future,” says David, excited about the progress and the positive reviews.
Making the case for Blameless is about presenting opportunity costs
Gavin Ray shares his advice for DevOps engineers who want to use a tool like Blameless and need to build a case for their managers: “As a practitioner, I bear the challenge of having to justify using an automation tool for incident management. Using a tool like Blameless allows me to build a case with hard data to say, ‘I need to fix this.’ Once that iterative loop gets really tight, and when you start factoring the cost of X amount of incidents and Y amount of engineers working on it, that data then becomes compelling. You have the argument to say, ‘Hey leadership, we need to fix this.’ I’m able to build the case as a solo practitioner. The value of being able to have that backlog is really powerful.” Gavin is referring to the data he collects on his own, in conjunction with the data Blameless captures in the Reliability Insights view.
“Using a tool like Blameless allows me to build a case with hard data to say, ‘I need to fix this.’” - Gavin Ray
There are many ways teams extract value from the Blameless SRE platform. For instance, Machinify also uses Blameless to track routine deployments in case anything goes wrong. Of course, many deployments end up going fine, and the team marks them clean. However, on the slight chance that one does go wrong, they already have a channel open through Blameless. They don’t have to go through the cycle of starting an incident. They can begin scheduling work and hop into a Zoom call to begin their process. “It’s not rocket science. Start using the Blameless tool. Use the process,” David shares as his final words to teams looking to adopt an incident management solution.
“This is a way to focus on getting to the true cause of something.” - David Levinger