Notifications make monitoring useful.
Without proper alerting, even the best monitoring setup is just a collection of pretty dashboards that nobody looks at until it’s too late.
Passive Monitoring Integration#
From a passive-monitoring perspective with Prometheus, AlertManager is a great service to leverage, as it can hook into a range of receivers (a quick way to test that routing is sketched after the list):
- Slack: Instant team notifications in dedicated channels.
- PagerDuty: Escalation policies and on-call management for critical incidents.
- SMS: Critical alerts that need immediate attention (e.g., via Twilio).
- Email: Non-urgent notifications and daily/weekly reports.
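Before trusting any of these channels with a real incident, it's worth confirming the routing actually works. Below is a minimal sketch, assuming a default local Alertmanager on port 9093 and illustrative label names, that pushes a synthetic alert into its v2 API so you can watch it land in Slack, PagerDuty, or your inbox.

```python
# A rough sketch, not a production setup: push a synthetic alert straight into
# Alertmanager's v2 API so you can confirm the Slack/PagerDuty/email routes fire.
# The URL, port, and label names below are assumptions.
from datetime import datetime, timezone

import requests

ALERTMANAGER_URL = "http://localhost:9093/api/v2/alerts"  # default Alertmanager port


def send_test_alert(severity: str = "warning") -> None:
    """POST one synthetic alert; routing then picks a receiver based on its labels."""
    alert = [{
        "labels": {
            "alertname": "RoutingTest",  # hypothetical alert name
            "severity": severity,        # routes typically match on this label
            "service": "demo",
        },
        "annotations": {"summary": "Test alert to verify notification routing"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]
    requests.post(ALERTMANAGER_URL, json=alert, timeout=5).raise_for_status()


if __name__ == "__main__":
    send_test_alert("critical")  # should end up in whichever channel handles 'critical'
```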
Application Performance Monitoring#
At other times you want to know about business-critical events rather than infrastructure failures (one way to surface them is sketched after this list):
- When a client registers on your SaaS product.
- When a client is attempting to cancel their subscription.
- When specific business thresholds (revenue, traffic) are met.
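One lightweight way to surface these is to expose them as Prometheus metrics, so the same AlertManager routing handles business thresholds too. The sketch below assumes the prometheus_client library; the metric names and port are illustrative, not taken from a real product.

```python
# A minimal sketch, assuming the prometheus_client library: expose business events
# as counters so threshold alerts (signups, cancellations, revenue) can be defined
# as Prometheus alerting rules instead of being scattered through application code.
from prometheus_client import Counter, start_http_server

SIGNUPS = Counter("saas_signups_total", "Completed client registrations")
CANCEL_ATTEMPTS = Counter("saas_cancellation_attempts_total", "Clients entering the cancel flow")


def on_client_registered(client_id: str) -> None:
    SIGNUPS.inc()  # an alerting rule can page on spikes, drops, or daily totals


def on_cancellation_attempt(client_id: str) -> None:
    CANCEL_ATTEMPTS.inc()


if __name__ == "__main__":
    start_http_server(8000)  # /metrics endpoint for Prometheus to scrape (port is an assumption)
```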
Real-World Example: Trading Bot Alerts#
In my trading bot implementation, I configured specific triggers (a simplified sketch of the notifier behind them follows the list):
Trade Execution Alerts:
- ✅ Trade Executed: Whenever the bot buys or sells → Slack message with price/volume.
- ⚠️ Stop-Loss Triggered: When a position is closed to prevent loss → Slack message (High Priority).
- 🚨 Error State: If an API fails or logic crashes → Slack message (Critical).
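Here is a simplified sketch of the kind of notify helper that can back triggers like these; the webhook environment variables, channel split, and message format are assumptions rather than the bot's actual code.

```python
# A simplified, assumed sketch: one notify() helper that tags messages by severity
# and posts them to Slack incoming webhooks. Env var names are placeholders.
import os

import requests

WEBHOOKS = {
    "info": os.environ["SLACK_TRADES_WEBHOOK"],      # routine trade executions
    "high": os.environ["SLACK_ALERTS_WEBHOOK"],      # stop-losses and other warnings
    "critical": os.environ["SLACK_ALERTS_WEBHOOK"],  # API failures, crashed logic
}

EMOJI = {"info": "✅", "high": "⚠️", "critical": "🚨"}


def notify(severity: str, title: str, detail: str) -> None:
    """Send a severity-tagged message to the matching Slack channel."""
    text = f"{EMOJI[severity]} *{title}*\n{detail}"
    requests.post(WEBHOOKS[severity], json={"text": text}, timeout=5).raise_for_status()

# Example calls from the bot's trade loop:
# notify("info", "Trade executed", "Bought 0.5 BTC @ 43,210")
# notify("high", "Stop-loss triggered", "Closed ETH position at -2.1%")
# notify("critical", "Error state", "Exchange API timeout (see stack trace in logs)")
```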
Benefits:
- Remote Management: Restart services from my phone when needed, no laptop required.
- Peace of Mind: Enjoy your day without constantly refreshing dashboards.
- Immediate Response: Know about issues as they happen, not hours later.
- Context: Get actionable information (variables, stack traces) directly in the alert, not just “something broke”.
Smart Alerting Strategy#
What TO Alert On#
- Critical System Issues: Service down, high 5xx error rates, database locking.
- Business Events: Revenue-impacting events, VIP user actions.
- Performance Degradation: Response time (latency) spikes, disk space exhaustion.
- Security Events: Failed authentications, unusual IP access patterns.
What NOT to Alert On#
- Noise: Temporary CPU spikes that self-resolve.
- Non-actionable: Warnings you can’t or won’t fix immediately.
- Over-alerting: Sending so many alerts that the team develops “alert fatigue” and ignores them all (a simple cooldown sketch follows this list).
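That last failure mode is the easiest one to stumble into with hand-rolled notifiers, so it helps to build in a little suppression. A minimal sketch follows; the 15-minute window is an arbitrary example, not a recommendation.

```python
# A minimal sketch of per-alert cooldowns: suppress repeats of the same alert key
# within a window so transient blips don't flood the channel and train people to
# ignore it. The window length is an arbitrary example.
import time

_last_sent: dict[str, float] = {}


def should_alert(key: str, cooldown_seconds: int = 900) -> bool:
    """Return True only if this alert key hasn't fired within the cooldown window."""
    now = time.monotonic()
    if now - _last_sent.get(key, float("-inf")) < cooldown_seconds:
        return False
    _last_sent[key] = now
    return True

# Example: page on an exchange API timeout at most once per 15 minutes.
# if should_alert("exchange_api_timeout"):
#     notify("critical", "Error state", "Exchange API timeout")
```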
Implementation Results#
This approach has allowed me to:
- Avoid SSH sessions while out and about.
- Skip opening Grafana/Kibana constantly to check health.
- Respond quickly to actual issues before users report them.
- Maintain work-life balance without sacrificing system reliability.
Key insight: The goal isn’t to get more notifications—it’s to get the right notifications at the right time so you can take meaningful action.
