If your smoke alarm goes off once a year, you take it seriously.

If it goes off every time you make toast…

you take the batteries out.

That’s the difference between a good alerting system and a broken one.


What a Smoke Alarm Is Actually For

A smoke detector isn’t there to:

  • Comment on your cooking
  • Warn you about steam
  • Share every temperature fluctuation

It’s there for one thing:

Wake you up when something dangerous is happening.

Alerts in production work exactly the same way.

They exist to:

  • Signal real user impact
  • Trigger human intervention
  • Prevent escalation

If your alert doesn’t require action, it isn’t an alert.

It’s noise.


The Burnt Toast Problem

Imagine this scenario:

You make toast.

The alarm screams.

You wave a towel.

It stops.

Next night? Same thing.

After the third time, what happens?

You:

  • Ignore it
  • Silence it
  • Or remove it

In engineering, we call this alert fatigue.

And it’s dangerous.

Because one day…

it won’t be toast.


Alerts Should Be Boring

Here’s the uncomfortable truth:

Good alerts are boring.

They don’t:

  • Fire constantly
  • Trigger on every spike
  • Notify ten channels

They:

  • Indicate real user pain
  • Have a clear owner
  • Include clear next steps

If an alert fires, someone should know:

  • What is broken
  • Who should look at it
  • What “good” looks like

If you need to investigate whether the alert matters…

the alert is wrong.
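
One way to hold yourself to that is to make the three questions required fields on every alert. Here is a minimal sketch in Python. The AlertDefinition class, its field names, and the missing_fields helper are hypothetical, made up for illustration, and not part of any real alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertDefinition:
    """Hypothetical alert record: one field per question a responder will ask."""
    name: str
    what_is_broken: str   # the symptom, in plain words
    owner: str            # who should look at it (team or rotation)
    healthy_state: str    # what "good" looks like


def missing_fields(alert: AlertDefinition) -> list[str]:
    """Return the required fields that are empty or only whitespace."""
    required = ("what_is_broken", "owner", "healthy_state")
    return [field for field in required if not getattr(alert, field).strip()]


checkout_errors = AlertDefinition(
    name="checkout_errors",
    what_is_broken="Customers cannot complete checkout",
    owner="",  # nobody owns this one yet
    healthy_state="Checkout error rate under 0.1%",
)

# An alert with gaps is not ready to wake anyone up.
print(missing_fields(checkout_errors))
```

A check like this could run before an alert is allowed to ship, so an alert with no owner never reaches a pager in the first place.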


What Makes a Good Alert?

Think of it like this:

A good smoke alarm:

  • Detects real smoke
  • Doesn’t scream for steam
  • Is loud enough to wake you
  • Isn’t triggered by pancakes

Translated to SRE:

A good alert:

  • Fires on SLO breaches or real user impact
  • Is tied to actionable thresholds
  • Has documentation (runbook!)
  • Is rare

Rare is important.

If you’re being alerted every day, either the system or the alert itself needs fixing.
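
To make "fires on SLO breaches" concrete, here is a rough sketch of a burn-rate check: how fast are you spending your error budget? The function names, the 99.9% target, and the 14.4 threshold are assumptions chosen for illustration. At a burn rate of 14.4 you would spend about 2% of a 30-day budget in one hour, but the right number depends on your own SLO and window.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Error-budget burn rate: 1.0 means spending the budget exactly on pace."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if requests == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target          # allowed fraction of failures
    return (errors / requests) / error_budget


def should_page(errors: int, requests: int,
                slo_target: float = 0.999,     # assumed SLO, not from the article
                threshold: float = 14.4) -> bool:  # assumed paging threshold
    """Page only when the budget is burning far faster than planned."""
    return burn_rate(errors, requests, slo_target) >= threshold


# One failed request in a thousand is on pace for a 99.9% SLO: no page.
print(should_page(errors=1, requests=1000))
# Twenty failed requests in a thousand burns the budget about 20x too fast: page.
print(should_page(errors=20, requests=1000))
```

The point of the sketch is that a single spike doesn't page anyone. Only sustained, budget-eating failure does, which is what keeps the alert rare.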


The Difference Between Monitoring and Alerting

Monitoring is like having a thermometer.

Alerting is like having a smoke detector.

You watch metrics all the time.

You only alert when something needs human intervention.

Not every metric deserves a pager.

If it isn’t worth waking someone up at 3 a.m., it doesn’t deserve to be a page.
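
The thermometer-versus-smoke-detector split can be written down as a tiny routing rule. This is only a sketch: the Route names, the three yes/no questions, and the middle "ticket" tier are my assumptions, not something every team uses.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"            # wake a human now
    TICKET = "ticket"        # a human should look, but during working hours
    DASHBOARD = "dashboard"  # just keep watching


def route_signal(user_impact: bool, needs_human: bool, urgent: bool) -> Route:
    """Decide where a signal goes. Only urgent, user-facing, actionable ones page."""
    if user_impact and needs_human and urgent:
        return Route.PAGE
    if needs_human:
        return Route.TICKET
    return Route.DASHBOARD


# A CPU spike that heals itself is thermometer data.
print(route_signal(user_impact=False, needs_human=False, urgent=False))
# Checkout failing for customers right now is a smoke alarm.
print(route_signal(user_impact=True, needs_human=True, urgent=True))
```

Everything still gets monitored. The rule only decides what is allowed to interrupt a person.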


What This Means in Real Life

If your on-call feels chaotic, ask:

  • Which alerts are actionable?
  • Which ones are informational?
  • Which ones are burnt toast?

You don’t improve reliability by adding more alarms.

You improve it by:

  • Removing bad ones
  • Tuning thresholds so they fire only on real problems
  • Aligning alerts with user impact
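
If you want to find your burnt toast, your alert history will usually tell you. Below is a small sketch that assumes you can export a list of (alert name, was any action taken) pairs from your on-call tool. The find_burnt_toast function, the sample data, and the 50% cutoff are all hypothetical.

```python
from collections import defaultdict


def find_burnt_toast(history, min_action_ratio=0.5):
    """Return alerts where fewer than min_action_ratio of firings led to action."""
    fired = defaultdict(int)
    acted = defaultdict(int)
    for name, action_taken in history:
        fired[name] += 1
        if action_taken:
            acted[name] += 1
    return sorted(
        name for name, count in fired.items()
        if acted[name] / count < min_action_ratio
    )


# Hypothetical export: (alert name, did anyone have to do something?)
history = [
    ("disk_full", True),
    ("cpu_spike", False),
    ("cpu_spike", False),
    ("cpu_spike", True),
    ("disk_full", True),
]

# cpu_spike needed action only 1 time in 3, so it is a candidate to fix or remove.
print(find_burnt_toast(history))
```

The list it prints is a starting point for a conversation, not a verdict. Some quiet alerts still guard against real fire.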

🔔 Reframe to Remember

Alerts are smoke alarms.

If they scream constantly, you’ll ignore them.

And the worst time to discover that is when there’s real fire.

