What To Do When You Blow Your Service Level Objectives
What do you do when you've had a few too many incidents and blown your error budget? Or had a pile of near-misses that burned the team out even though the user-facing SLO wasn't violated? What if the incidents were triggered by the very infrastructure refactoring that was meant to improve reliability and maintainability?
In this talk, two senior SREs describe the context for two sets of outages that caused our medium-sized startup to pause and re-evaluate our infrastructure plans. In the first, we experienced a significant event before an immovable external deadline, and found a creative way to push the launch-related risk onto a separate shard of our infrastructure and de-risk the rest of the SLOs. In the second, we scaled back the ambitions of a refactor of our Kafka cluster to give the team a break from incident fatigue, even though our SLOs had only partially burned.