Notifications: Smart Alerting Strategy

Notifications make monitoring useful.

Without proper alerting, even the best monitoring setup is just a collection of pretty dashboards that nobody looks at until it’s too late.

Passive Monitoring Integration

When looking at it from a passive monitoring perspective with Prometheus, AlertManager is a great service to leverage: Prometheus fires the alerts, and AlertManager deduplicates and routes them (see the sketch after this list) to channels such as:

  • Slack: Instant team notifications in dedicated channels.
  • PagerDuty: Escalation policies and on-call management for critical incidents.
  • SMS: Critical alerts that need immediate attention (e.g., via Twilio).
  • Email: Non-urgent notifications and daily/weekly reports.
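
To make the flow concrete: Prometheus (or any service) pushes alerts into AlertManager, and AlertManager's routing configuration decides which of the channels above receives them. Here is a minimal sketch that fires a test alert through AlertManager's v2 HTTP API; the localhost:9093 address and the label names are assumptions, and the actual Slack/PagerDuty/SMS/email wiring lives in AlertManager's own config.

```python
import datetime
import requests  # third-party: pip install requests

# Assumption: AlertManager runs locally on its default port, and its
# routing config maps severity="critical" to PagerDuty/SMS while the
# rest goes to Slack or email.
ALERTMANAGER_URL = "http://localhost:9093/api/v2/alerts"

def fire_alert(name: str, severity: str, summary: str) -> None:
    """Push one alert into AlertManager; its routing picks the channel."""
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    payload = [{
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": summary},
        "startsAt": now,
    }]
    requests.post(ALERTMANAGER_URL, json=payload, timeout=5).raise_for_status()

fire_alert("ServiceDown", "critical", "api-gateway stopped responding")
```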

Application Performance Monitoring

Other times you may want to know about business-critical events rather than infrastructure metrics (a minimal webhook sketch follows this list):

  • When a client registers on your SaaS product.
  • When a client is attempting to cancel their subscription.
  • When specific business thresholds (revenue, traffic) are met.
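
For events like these, a plain Slack incoming webhook is often all you need, with no AlertManager in the middle. A minimal sketch, assuming you have created an incoming webhook for the target channel (the URL below is a placeholder):

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify_business_event(text: str) -> None:
    """Post a business event straight to a Slack channel."""
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5).raise_for_status()

notify_business_event("💰 New signup: a client registered on the Pro plan")
notify_business_event("⚠️ Churn risk: a client opened the cancellation flow")
```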

Real-World Example: Trading Bot Alerts

In my trading bot implementation, I configured specific triggers (condensed into the sketch after this list):

Trade Execution Alerts:

  • Trade Executed: Whenever the bot buys or sells → Slack message with price/volume.
  • ⚠️ Stop-Loss Triggered: When a position is closed to prevent loss → Slack message (High Priority).
  • 🚨 Error State: If an API fails or logic crashes → Slack message (Critical).
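
All three triggers can share one notifier that tags each message with its priority. This is a condensed sketch of the pattern, not the full implementation; the webhook URL and the trade details are placeholders.

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

PREFIX = {"info": "✅", "high": "⚠️", "critical": "🚨"}

def alert(severity: str, message: str) -> None:
    """Send a severity-tagged message to the bot's Slack channel."""
    text = f"{PREFIX[severity]} [{severity.upper()}] {message}"
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5)

alert("info", "Trade executed: BUY 0.5 BTC @ 64,200 USD")           # routine
alert("high", "Stop-loss triggered: closed ETH position at -2.1%")  # high priority
alert("critical", "Error state: exchange API unreachable")          # critical
```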

Benefits:

  • Remote Management: Restart services from my phone if necessary without needing a laptop.
  • Peace of Mind: Enjoy your day without constantly refreshing dashboards.
  • Immediate Response: Know about issues as they happen, not hours later.
  • Context: Get actionable information (variables, stack traces) directly in the alert, not just “something broke” (illustrated in the sketch below).
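
The last point is worth illustrating: catching the exception at the top level and shipping the inputs and stack trace with the alert makes the notification actionable on its own. A minimal sketch with a stand-in failure and a placeholder webhook URL:

```python
import traceback
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def place_order(symbol: str, qty: float) -> None:
    raise ConnectionError("exchange API timeout")  # stand-in for a real failure

try:
    place_order("BTC/USD", 0.5)
except Exception:
    # Include the variables and the stack trace, not just "something broke".
    text = f"🚨 Order failed: symbol=BTC/USD qty=0.5\n{traceback.format_exc()}"
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5)
```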

Smart Alerting Strategy

What TO Alert On

  • Critical System Issues: Service down, high 5xx error rates, database locking (an example check follows this list).
  • Business Events: Revenue-impacting events, VIP user actions.
  • Performance Degradation: Response time (latency) spikes, disk space exhaustion.
  • Security Events: Failed authentications, unusual IP access patterns.
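
As one concrete instance of the first bullet, here is a 5xx error-rate check against Prometheus' HTTP query API. The metric name, threshold, and server address are assumptions to adapt; in production the equivalent usually lives in a Prometheus alerting rule rather than a script.

```python
import requests

PROMETHEUS_URL = "http://localhost:9090/api/v1/query"  # assumed local server

# Assumed metric name; adapt to your instrumentation.
QUERY = ('sum(rate(http_requests_total{status=~"5.."}[5m]))'
         ' / sum(rate(http_requests_total[5m]))')

def error_rate() -> float:
    """Return the fraction of requests that failed over the last 5 minutes."""
    resp = requests.get(PROMETHEUS_URL, params={"query": QUERY}, timeout=5)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if error_rate() > 0.05:  # more than 5% of requests failing
    print("ALERT: 5xx error rate above 5%")  # hand off to your notifier here
```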

What NOT to Alert On

  • Noise: Temporary CPU spikes that self-resolve (see the persistence sketch after this list).
  • Non-actionable: Warnings you can’t or won’t fix immediately.
  • Over-alerting: Sending so many alerts that the team develops “alert fatigue” and ignores them all.
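
The cheapest defense against the noise category is a persistence check: alert only when the bad state has held for several consecutive samples, so self-resolving spikes never page anyone. A minimal sketch with made-up thresholds; Prometheus expresses the same idea with the `for:` duration on an alerting rule.

```python
import time
import psutil  # third-party: pip install psutil

THRESHOLD = 90.0       # percent CPU considered "bad"
CONSECUTIVE_LIMIT = 5  # samples that must ALL be bad before alerting
INTERVAL_SECONDS = 60  # one sample per minute

bad_samples = 0
while True:
    cpu = psutil.cpu_percent(interval=1)
    bad_samples = bad_samples + 1 if cpu > THRESHOLD else 0
    if bad_samples >= CONSECUTIVE_LIMIT:
        print(f"ALERT: CPU above {THRESHOLD}% for {CONSECUTIVE_LIMIT} minutes")
        bad_samples = 0  # reset so one incident doesn't re-page every minute
    time.sleep(INTERVAL_SECONDS)
```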

Implementation Results

This approach has allowed me to:

  • Avoid SSH sessions while out and about.
  • Skip opening Grafana/Kibana constantly to check health.
  • Respond quickly to actual issues before users report them.
  • Maintain work-life balance without sacrificing system reliability.

Key insight: The goal isn’t to get more notifications—it’s to get the right notifications at the right time so you can take meaningful action.
