Building Blameless right from the beginning

6.1.2018

When Ashar, Lyon and I set out to start Blameless, I was the one in charge of the technical side. As I explored options and principles for building our system, I came to a realization that was new to me. What I realized back in September of last year is that our industry has finally reached the tipping point at which it has become viable to build distributed systems from scratch, at a fast pace of iteration and a low cost of operation, all with a small team to execute!

Now, if you’ve been in the startup scene for long enough, you’ll immediately react to this and think “nonsense!” Everybody says that if you’re building a startup, hacking and experimentation are essential to move fast enough to achieve product-market fit before you run out of cash and die a brave death. Well, I come from that school as well. I’ve read The Lean Startup many, many times. I hang out on Hacker News. I joined a tiny startup as its first employee almost a decade ago and saw it grow into the hundreds through this philosophy.

Here’s where things changed for me. After rising through the ranks to run a massive cloud of physical servers, thousands of VMs and dozens of services, I moved on to a big company to help a team of brilliant people plan a massive migration of infrastructure to centralized CI/CD and orchestration. The approaches and tools we explored during that time radically changed the way I think about architectures and how to enable experimentation.

Fast forward to last September; I set out to build the core of what would later become Blameless. I decided to start from first principles, so I wrote down the pillars that I believed would enable the product that Blameless needed. Here are those principles:

  • Build a system that scales horizontally to support higher workloads and larger numbers of customers
  • Build a system that can be deployed and operated in different configurations (Single Tenant Hosted, Multi-Tenant Hosted, Hybrid cloud, on premises)
  • Maintain an architecture that is deliberate, effective and practical for the product and goals at hand
  • Have an infrastructure that stays out of the way, enabling multiple teams to deliver software without friction through a reliable and mature process
  • Deliver top-tier reliability to our customers, leading by example in the reliability and operations fronts
  • Build an architecture and infrastructure layer that enables fast-paced iteration for our product
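
To make the second principle a little more concrete, here is a minimal sketch of how one codebase might select its behavior from a deployment mode at startup. Everything in it is an assumption made for illustration: the `DEPLOYMENT_MODE` variable, the `Settings` fields and the values assigned to each mode are hypothetical and are not taken from the Blameless codebase.

```python
import os
from enum import Enum
from typing import NamedTuple


class DeploymentMode(Enum):
    """The four deployment configurations named in the list above."""
    SINGLE_TENANT_HOSTED = "single-tenant-hosted"
    MULTI_TENANT_HOSTED = "multi-tenant-hosted"
    HYBRID_CLOUD = "hybrid-cloud"
    ON_PREMISES = "on-premises"


class Settings(NamedTuple):
    """Hypothetical per-mode settings; the fields are illustrative only."""
    mode: DeploymentMode
    multi_tenant: bool            # do several customers share one install?
    customer_managed_infra: bool  # does any part run on customer hardware?


# Assumed values for each mode, chosen for the example rather than
# taken from the post.
SETTINGS_BY_MODE = {
    DeploymentMode.SINGLE_TENANT_HOSTED: Settings(
        DeploymentMode.SINGLE_TENANT_HOSTED, False, False),
    DeploymentMode.MULTI_TENANT_HOSTED: Settings(
        DeploymentMode.MULTI_TENANT_HOSTED, True, False),
    DeploymentMode.HYBRID_CLOUD: Settings(
        DeploymentMode.HYBRID_CLOUD, False, True),
    DeploymentMode.ON_PREMISES: Settings(
        DeploymentMode.ON_PREMISES, False, True),
}


def load_settings(env=None):
    """Pick settings from DEPLOYMENT_MODE (a made-up variable name).

    Falls back to multi-tenant hosted when the variable is unset and
    raises ValueError for a value that is not a known mode.
    """
    env = os.environ if env is None else env
    raw = env.get("DEPLOYMENT_MODE", DeploymentMode.MULTI_TENANT_HOSTED.value)
    return SETTINGS_BY_MODE[DeploymentMode(raw)]
```

The point of a sketch like this is that the mode is read once, in one place, so the rest of the code can ask a `Settings` object what to do instead of branching on the environment everywhere.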

As I said earlier, some of these sound pretty counter-intuitive, don’t they? We’re supposed to move fast, incur technical debt, and cut as many corners as possible to out-compete the big players and pull off a miracle. Do things that don't scale! Well, almost a year into this adventure, I’m happy to report that we've been able to stick to these principles. By choosing the right set of tools, our team can prototype faster than ever and iterate quickly, all while maintaining a manageable level of tech debt and setting ourselves up for a more manageable future as our product matures and our scale grows.

We're going to be writing about these principles, strategies, and tools throughout this blog post series, but to get things going, here’s the list of technologies we’re taking advantage of to accomplish that.

Our Stack:

  • Python 3.6 and Pipenv
  • Nameko and RabbitMQ
  • MongoDB
  • Cookiecutter

Our Infrastructure:

  • GCP and GKE
  • Kubernetes and Helm
  • SOPS and helm-secrets
  • Travis CI
  • Weaveworks
  • Auth0
  • GitHub Releases