If your smoke alarm goes off once a year, you take it seriously.
If it goes off every time you make toast…
you take the batteries out.
That’s the difference between a good alerting system and a broken one.
What a Smoke Alarm Is Actually For
A smoke detector isn’t there to:
- Comment on your cooking
- Warn you about steam
- Share every temperature fluctuation
It’s there for one thing:
Wake you up when something dangerous is happening.
Alerts in production work exactly the same way.
They exist to:
- Signal real user impact
- Trigger human intervention
- Prevent escalation
If your alert doesn’t require action, it isn’t an alert.
It’s noise.
The Burnt Toast Problem
Imagine this scenario:
You make toast.
The alarm screams.
You wave a towel.
It stops.
Next night? Same thing.
After the third time, what happens?
You:
- Ignore it
- Silence it
- Or remove it
In engineering, we call this alert fatigue.
And it’s dangerous.
Because one day…
it won’t be toast.
Alerts Should Be Boring
Here’s the uncomfortable truth:
Good alerts are boring.
They don’t:
- Fire constantly
- Trigger on every spike
- Notify ten channels
They:
- Indicate real user pain
- Have a clear owner
- Include clear next steps
If an alert fires, someone should know:
- What is broken
- Who should look at it
- What “good” looks like
If you need to investigate whether the alert matters…
the alert is wrong.
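One way to make that checklist hard to skip is to lint alert definitions before they are allowed to page anyone. The sketch below is a minimal illustration, not a real tool's schema: `AlertDefinition`, its field names, and the `checkout-error-rate` example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertDefinition:
    """Hypothetical shape for an alert rule; field names are illustrative."""
    name: str
    summary: str        # what is broken, in plain words
    owner: str          # team or rotation that should look at it
    healthy_state: str  # what "good" looks like
    runbook_url: str    # where the next steps live


def missing_fields(alert: AlertDefinition) -> list[str]:
    """Return the names of required fields that are empty or whitespace."""
    required = ("summary", "owner", "healthy_state", "runbook_url")
    return [field for field in required if not getattr(alert, field).strip()]


# Assumed example alert; the runbook has not been written yet.
checkout_errors = AlertDefinition(
    name="checkout-error-rate",
    summary="More than 2% of checkout requests are failing",
    owner="payments-oncall",
    healthy_state="Checkout error rate below 0.5%",
    runbook_url="",
)

print(missing_fields(checkout_errors))
```

An alert that comes back with anything in that list probably isn't ready to wake someone up.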
What Makes a Good Alert?
Think of it like this:
A good smoke alarm:
- Detects real smoke
- Doesn’t scream for steam
- Is loud enough to wake you
- Isn’t triggered by pancakes
Translated to SRE:
A good alert:
- Fires on SLO breaches or real user impact
- Is tied to actionable thresholds
- Has documentation (runbook!)
- Is rare
Rare is important.
If you’re being paged every day, something upstream is broken: either the system or the alert itself.
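To make "fires on SLO breaches" concrete, here is a rough sketch of a burn-rate check: how fast the error budget is being spent compared with the pace the SLO allows. The function names, the request counts, and the 99.9% target are assumptions for illustration. The 14.4 threshold is a commonly cited starting point for a 30-day SLO (it corresponds to spending 2% of the monthly budget in one hour), not a universal constant, so tune it for your own service.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (failed / total) / error_budget


def should_page(short_window: float, long_window: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast, so a brief spike alone stays quiet."""
    return short_window >= threshold and long_window >= threshold


SLO = 0.999  # assumed availability target

# Burnt toast: a short burst of errors, but the longer window looks healthy.
spike = should_page(burn_rate(30, 1000, SLO), burn_rate(40, 60000, SLO))

# Real smoke: errors are elevated in both the short and the long window.
outage = should_page(burn_rate(30, 1000, SLO), burn_rate(1200, 60000, SLO))

print(spike, outage)
```

Requiring both windows to agree is what keeps the alert rare: the toast burns, the short window notices, and nobody gets paged.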
The Difference Between Monitoring and Alerting
Monitoring is like having a thermometer.
Alerting is like having a smoke detector.
You watch metrics all the time.
You only alert when something needs human intervention.
Not every metric deserves a pager.
If it isn’t worth waking you up at 3 a.m., it doesn’t deserve to exist as an alert.
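The thermometer-versus-smoke-detector split can be sketched as a tiny routing rule: everything gets recorded, and only a narrow slice gets to page. The `Route` enum, the two yes/no questions, and the signal names below are all made up for illustration; a real setup would have more nuance.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"            # wake a human up
    DASHBOARD = "dashboard"  # keep recording it, review it in daylight


def route(user_impact: bool, needs_human_now: bool) -> Route:
    """Page only when users are affected and someone has to act right away."""
    if user_impact and needs_human_now:
        return Route.PAGE
    return Route.DASHBOARD


# Hypothetical signals: (users affected?, needs a human right now?)
signals = {
    "checkout_error_rate_above_slo": (True, True),
    "cpu_at_85_percent": (False, False),
    "one_replica_restarted": (False, False),
    "slow_admin_report": (True, False),
}

for name, (impact, urgent) in signals.items():
    print(f"{name}: {route(impact, urgent).value}")
```

Four signals are monitored. Only one of them is a smoke detector.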
What This Means in Real Life
If your on-call feels chaotic, ask:
- Which alerts are actionable?
- Which ones are informational?
- Which ones are burnt toast?
You don’t improve reliability by adding more alarms.
You improve it by:
- Removing bad ones
- Tuning thresholds so they fire only on real problems
- Aligning alerts with user impact
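If you keep even a rough log of past pages, the burnt-toast question can be answered with a few lines of code. This is a sketch under assumptions: the `history` data is invented, "acted on" is whatever your team records after an incident, and the 50% cutoff is an arbitrary starting point.

```python
from collections import Counter

# Invented on-call history: (alert name, did the responder have to do anything?)
history = [
    ("checkout-error-rate", True),
    ("checkout-error-rate", True),
    ("cpu-above-80", False),
    ("cpu-above-80", False),
    ("cpu-above-80", False),
    ("cpu-above-80", True),
    ("cert-expiry-notice", False),
]


def actionable_ratio(events):
    """Fraction of firings per alert that actually required action."""
    fired = Counter(name for name, _ in events)
    acted = Counter(name for name, acted_on in events if acted_on)
    return {name: acted[name] / fired[name] for name in fired}


def burnt_toast(events, min_ratio=0.5):
    """Alerts whose firings mostly needed no action: candidates to fix or remove."""
    ratios = actionable_ratio(events)
    return sorted(name for name, ratio in ratios.items() if ratio < min_ratio)


print(burnt_toast(history))
```

Anything on that list is a candidate for a better threshold, a demotion to a dashboard, or deletion.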
🔔 Reframe to Remember
Alerts are smoke alarms.
If they scream constantly, you’ll ignore them.
And the worst time to discover that is when there’s real fire.

