Alert Configuration Guide

Configure intelligent monitoring alerts to detect and respond to email deliverability issues before they impact your business.

What Are Monitoring Alerts?

Monitoring alerts are automated notifications that immediately inform your team when email infrastructure issues are detected. They act as an early warning system, enabling rapid response to problems before they escalate into major deliverability crises.

Effective alerting ensures that:

Critical issues are detected within minutes, not hours or days
The right team members are notified based on severity and type
Response times are minimized through clear, actionable information
Historical patterns are tracked to identify recurring problems
Business impact is reduced through proactive intervention

Types of Alerts

Configure alerts for different aspects of your email infrastructure:

Blocklist Alerts

Immediate notification when your IP addresses or domains appear on email blocklists. This is critical for maintaining deliverability: even a few hours on a major list can damage your sender reputation.

Uptime Alerts

Detect mail server downtime, SMTP connection failures, or service degradation. Ensures your email infrastructure is always available to send and receive messages.

DNS Change Alerts

Monitor SPF, DKIM, DMARC, and MX records for unexpected modifications. Unauthorized changes can indicate security breaches or misconfigurations that break email delivery.

Authentication Failure Alerts

Track SPF, DKIM, and DMARC validation failures. High failure rates suggest configuration problems or spoofing attempts targeting your domain.

Certificate Expiry Alerts

Warn when SSL/TLS certificates are approaching expiration (30, 14, and 7 days before). Expired certificates cause connection failures and security warnings.
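
The 30/14/7-day reminder schedule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the check is assumed to run once per day so that each exact day count is seen.

```python
from datetime import date

# Reminder points from the text: 30, 14, and 7 days before expiry.
REMINDER_DAYS = (30, 14, 7)


def days_until_expiry(expires_on: date, today: date) -> int:
    """Whole days remaining; negative once the certificate has expired."""
    return (expires_on - today).days


def expiry_reminder_due(expires_on: date, today: date) -> bool:
    """True on the days when a reminder should be sent."""
    return days_until_expiry(expires_on, today) in REMINDER_DAYS
```

Because the match is on exact day counts, a skipped daily run would miss a reminder. A more forgiving variant could alert whenever the remaining days fall at or below the next unsent reminder point.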

Reputation Score Changes

Alert on significant drops in sender reputation scores from major providers. Early detection helps prevent deliverability degradation.

Volume Anomaly Alerts

Detect unusual spikes or drops in email sending volume. Can indicate compromised accounts, system failures, or unauthorized usage.

Alert Channels

Deliver alerts through multiple channels to ensure rapid response:

Email Notifications

Standard for non-critical alerts. Send to dedicated monitoring addresses or distribution lists. Include full details, links to dashboards, and recommended actions. Best for warnings and informational alerts.

SMS Messages

Essential for critical alerts requiring immediate attention. Keep messages concise with issue summary and severity. Use for blocklist detections, complete outages, and security incidents.

Slack Integration

Route alerts to dedicated channels (#email-monitoring, #alerts-critical). Enables team collaboration and quick response coordination. Use threaded replies to track resolution progress.

Webhooks

Send structured alert data to custom endpoints for integration with internal systems, ticketing platforms, or automation workflows. Enables programmatic response and data aggregation.

PagerDuty

Integrate with on-call schedules for 24/7 coverage. Automatic escalation if alerts aren't acknowledged. Includes incident management and post-mortem features. Ideal for enterprise operations.

Microsoft Teams

Post alerts to Teams channels using connectors. Good for organizations using Microsoft 365. Supports rich formatting and actionable cards.

Alert Severity Levels

Classify alerts by severity to prioritize response and route to appropriate channels:

Critical

Issues causing immediate, significant business impact. Require urgent response within 15-30 minutes.

Examples: Mail server completely down, blocklisting on Spamhaus or major providers, all authentication failing, DNS records deleted, critical SSL certificate expired

Channels: SMS, phone calls, PagerDuty, Slack with @channel mention

Warning

Problems that could escalate or cause partial service degradation. Require response within 2-4 hours.

Examples: Blocklisting on secondary lists, elevated authentication failure rates (10-20%), SSL certificate expiring in 7 days, sender reputation score drop, unusual volume patterns

Channels: Email, Slack, webhooks to ticketing systems

Info

Informational notifications about state changes or successful resolutions. No immediate action required.

Examples: Successful delisting from blocklist, DNS records updated as planned, monitoring check passed after previous failure, SSL certificate renewed, configuration changes applied

Channels: Email, Slack (no mentions), logging systems
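
The severity-to-channel mapping above could be held in a small lookup table. The sketch below assumes hypothetical channel identifiers; a real monitoring tool will use its own names.

```python
# Channel identifiers are illustrative placeholders, not a real product's names.
CHANNELS_BY_SEVERITY = {
    "critical": ["sms", "phone", "pagerduty", "slack_with_channel_mention"],
    "warning": ["email", "slack", "ticketing_webhook"],
    "info": ["email", "slack_no_mention", "log"],
}


def channels_for(severity: str) -> list[str]:
    """Return the delivery channels for a severity level (case-insensitive)."""
    try:
        return CHANNELS_BY_SEVERITY[severity.lower()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Rejecting unknown severities, rather than silently falling back to a default, keeps a mistyped level from being routed to the wrong audience.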

Setting Up Alert Rules

Configure precise conditions and thresholds for when alerts trigger:

Threshold-Based Alerts

Trigger alerts when metrics cross defined boundaries:

Authentication failure rate exceeds 15% over 1 hour
Server response time exceeds 5 seconds for 3 consecutive checks
Bounce rate exceeds 5% for any campaign
Spam complaint rate exceeds 0.1% (1 per 1,000 emails)
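
The threshold examples above can be expressed as simple rules. This is a minimal sketch, not any product's rule engine: the metric names, the `ThresholdRule` class, and the choice to express rates as fractions are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    metric: str   # hypothetical metric name
    limit: float  # alert when the observed value is strictly above this

    def breached(self, value: float) -> bool:
        return value > self.limit


# Limits taken from the examples above, expressed as fractions.
RULES = [
    ThresholdRule("auth_failure_rate", 0.15),
    ThresholdRule("bounce_rate", 0.05),
    ThresholdRule("complaint_rate", 0.001),
]


def breached_rules(metrics: dict[str, float]) -> list[str]:
    """Return the names of rules whose limit the current metrics exceed."""
    return [
        r.metric
        for r in RULES
        if r.metric in metrics and r.breached(metrics[r.metric])
    ]


def sustained_breach(samples: list[float], limit: float, consecutive: int = 3) -> bool:
    """True when the most recent `consecutive` samples all exceed the limit."""
    recent = samples[-consecutive:]
    return len(recent) == consecutive and all(s > limit for s in recent)
```

`sustained_breach` covers the "5 seconds for 3 consecutive checks" style of rule: a single slow response does not fire, but three in a row do.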

State Change Alerts

Alert on transitions between known states:

Server status changes from UP to DOWN
New blocklist listing detected (wasn't listed, now is)
DNS record modified (value changed from known configuration)
Authentication record validation status changes from PASS to FAIL
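
State change detection comes down to comparing the latest observation with the previous one. The sketch below assumes each check is stored as a plain string state keyed by a hypothetical check name.

```python
def detect_transitions(previous: dict[str, str], current: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (check, old_state, new_state) for every check whose state changed.

    Checks that appear for the first time are skipped, because there is no
    known prior state to compare against.
    """
    changes = []
    for check, new_state in current.items():
        old_state = previous.get(check)
        if old_state is not None and old_state != new_state:
            changes.append((check, old_state, new_state))
    return changes
```

For example, comparing `{"smtp1": "UP", "spf": "PASS"}` with `{"smtp1": "DOWN", "spf": "PASS"}` yields a single transition for `smtp1`.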

Anomaly Detection

Use statistical baselines or machine learning to detect deviations from normal patterns:

Email volume is 300% higher than 7-day average
Send rate pattern differs significantly from historical baseline
Geographic distribution of sends changes unexpectedly
Authentication failure patterns diverge from normal
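
The first example, volume compared with a 7-day average, needs only a simple statistical baseline rather than a trained model. The sketch below assumes daily send counts are available as a list and uses hypothetical function names.

```python
from statistics import mean


def percent_above_baseline(today: float, history: list[float]) -> float:
    """Percentage by which today's volume exceeds the mean of `history`."""
    baseline = mean(history)
    return (today - baseline) / baseline * 100


def is_volume_spike(today: float, history: list[float], threshold_pct: float = 300.0) -> bool:
    """True when today's volume is at least `threshold_pct` percent above baseline."""
    if not history or mean(history) == 0:
        return False  # no usable baseline yet
    return percent_above_baseline(today, history) >= threshold_pct
```

Note that "300% higher than average" means four times the average, not three: with a baseline of 1,000 messages per day, the rule fires at 4,000.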

Check Frequency Configuration

Set appropriate monitoring intervals based on criticality:

Uptime checks: Every 1-5 minutes
Blocklist checks: Every 15-30 minutes
DNS record checks: Every 1-6 hours
SSL certificate checks: Daily
Reputation scores: Every 6-24 hours
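
These intervals could be kept in one table that a scheduler consults. The sketch below assumes the lower bound of each range above and hypothetical check names.

```python
from datetime import datetime, timedelta

# Lower bound of each recommended range; adjust per check criticality.
CHECK_INTERVALS = {
    "uptime": timedelta(minutes=1),
    "blocklist": timedelta(minutes=15),
    "dns": timedelta(hours=1),
    "ssl_certificate": timedelta(days=1),
    "reputation": timedelta(hours=6),
}


def is_due(check: str, last_run: datetime, now: datetime) -> bool:
    """True when at least one full interval has elapsed since the last run."""
    return now - last_run >= CHECK_INTERVALS[check]
```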

Alert Suppression Windows

Prevent repeated alerts for the same issue within a time window. For example, send a blocklist alert at most once every 4 hours unless the listing changes. Similarly, after the first critical alert, suppress duplicates for 30-60 minutes to allow time for investigation and response.
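
A suppression window can be sketched as a small class that remembers when each issue was last reported. The class name and the idea of a string issue key are assumptions; encoding the issue details in the key means a changed issue produces a new key and is sent immediately.

```python
from datetime import datetime, timedelta


class SuppressionWindow:
    """Allow one alert per issue key within a fixed time window."""

    def __init__(self, window: timedelta) -> None:
        self.window = window
        self._last_sent: dict[str, datetime] = {}

    def should_send(self, issue_key: str, now: datetime) -> bool:
        last = self._last_sent.get(issue_key)
        if last is not None and now - last < self.window:
            return False  # still inside the window: suppress
        self._last_sent[issue_key] = now
        return True
```

A key such as `"blocklist:192.0.2.10:listname"` (a hypothetical format) keeps repeats for the same listing quiet while letting a listing on a different blocklist through.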

Alert Routing and Escalation

Ensure alerts reach the right people at the right time:

Team-Based Routing

Route alerts to specific teams based on issue type. Infrastructure alerts go to DevOps, deliverability issues to email team, security alerts to security team. Use distribution lists to ensure coverage during absences.

On-Call Schedules

Implement rotating on-call schedules for 24/7 coverage. Define primary and secondary on-call contacts. Use PagerDuty, Opsgenie, or similar tools to manage rotations and ensure someone is always available for critical alerts.

Escalation Policies

Define escalation paths for unacknowledged critical alerts. Example policy: Alert primary on-call via SMS. If not acknowledged in 15 minutes, alert secondary on-call. If not acknowledged in 30 minutes total, alert team lead and manager.
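
The example policy can be written as a list of time-based steps. This is a sketch only: the contact identifiers are placeholders, and tools such as PagerDuty manage escalation through their own configuration rather than code like this.

```python
# (minutes without acknowledgment, contacts added at that point)
ESCALATION_STEPS = [
    (0, ["primary_on_call"]),
    (15, ["secondary_on_call"]),
    (30, ["team_lead", "manager"]),
]


def contacts_to_notify(minutes_unacknowledged: int) -> list[str]:
    """Everyone who should have been alerted after this many minutes."""
    notified = []
    for after_minutes, contacts in ESCALATION_STEPS:
        if minutes_unacknowledged >= after_minutes:
            notified.extend(contacts)
    return notified
```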

Business Hours vs After Hours

Adjust alert routing based on time of day. During business hours, send to team channels and email. After hours, route critical alerts to on-call personnel via SMS/phone, and hold lower-severity alerts until the next business day.

Geographic Distribution

For global operations, route alerts to teams in appropriate time zones. Ensure handoff procedures between regions for follow-the-sun coverage. Document which regions handle which alert types during their business hours.

Reducing Alert Fatigue

Too many alerts lead to desensitization and missed critical issues. Implement these strategies to maintain alert effectiveness:

Smart Grouping

Group related alerts into single notifications. If 10 servers fail simultaneously, send one alert about the outage rather than 10 individual alerts. Include affected systems in a summary list within the alert.
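
Grouping can be sketched by bucketing alerts that share a severity and issue, then listing the affected hosts in one notification. The dictionary keys used here (`severity`, `issue`, `host`) are an assumed alert shape.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict]) -> list[dict]:
    """Collapse alerts with the same severity and issue into one summary each."""
    hosts_by_issue: dict[tuple[str, str], list[str]] = defaultdict(list)
    for alert in alerts:
        hosts_by_issue[(alert["severity"], alert["issue"])].append(alert["host"])
    return [
        {"severity": severity, "issue": issue, "count": len(hosts), "hosts": sorted(hosts)}
        for (severity, issue), hosts in hosts_by_issue.items()
    ]
```

With ten servers reporting the same outage, the result is a single summary with `count` equal to 10 and all ten hostnames listed.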

Deduplication

Suppress duplicate alerts for the same issue. If a server is down, don't send repeated alerts every check interval. Send one alert when it goes down, updates if status changes, and a resolution alert when it recovers.
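
The down/recovered pattern can be sketched as a notifier that tracks which hosts are already known to be down. The class name and return values are illustrative assumptions.

```python
from typing import Optional


class DowntimeNotifier:
    """Emit one notification when a host goes down and one when it recovers."""

    def __init__(self) -> None:
        self._down: set[str] = set()

    def observe(self, host: str, is_up: bool) -> Optional[str]:
        if not is_up and host not in self._down:
            self._down.add(host)
            return "down"      # first failed check: alert once
        if is_up and host in self._down:
            self._down.remove(host)
            return "resolved"  # recovery: send the resolution alert
        return None            # no change: stay silent
```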

Quiet Hours

Configure quiet periods for non-critical alerts during nights, weekends, or holidays. Only allow critical alerts during these times. Queue warning and info alerts for delivery during business hours.
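
A quiet-hours rule can be sketched as follows. The 22:00 to 07:00 window is an assumption chosen for illustration, and weekends and holidays are left out to keep the example short.

```python
from datetime import datetime, time

# Assumed quiet window; it wraps past midnight.
QUIET_START = time(22, 0)
QUIET_END = time(7, 0)


def in_quiet_hours(now: datetime) -> bool:
    current = now.time()
    return current >= QUIET_START or current < QUIET_END


def dispatch(severity: str, message: str, now: datetime, queue: list[str]) -> str:
    """Send critical alerts immediately; queue the rest during quiet hours."""
    if severity == "critical" or not in_quiet_hours(now):
        return "send_now"
    queue.append(message)  # delivered later, during business hours
    return "queued"
```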

Maintenance Windows

Suppress alerts during planned maintenance. Schedule maintenance windows in your monitoring system to prevent false alarms. Automatically re-enable monitoring when the maintenance window closes.
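
Maintenance windows reduce to a time-range check before an alert is sent. The `MaintenanceWindow` type below is a hypothetical structure; treating the end time as exclusive means alerting resumes the moment the window closes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    host: str
    start: datetime
    end: datetime  # exclusive: monitoring resumes at this instant


def in_maintenance(host: str, now: datetime, windows: list[MaintenanceWindow]) -> bool:
    return any(w.host == host and w.start <= now < w.end for w in windows)


def should_alert(host: str, now: datetime, windows: list[MaintenanceWindow]) -> bool:
    """Alerts are suppressed only while the host is inside a scheduled window."""
    return not in_maintenance(host, now, windows)
```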

Appropriate Thresholds

Set realistic thresholds that indicate genuine problems, not normal variations. A single authentication failure isn't concerning, but a 20% failure rate is. Tune thresholds over time based on your baseline metrics.

Alert Refinement Process

Regularly review alert effectiveness. Track acknowledgment rates and false positive rates. If alerts are frequently ignored or marked as non-issues, adjust or remove them. Aim for 95%+ of alerts to require action.
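
The 95% target can be measured from recorded alert outcomes. The sketch below assumes each alert is labeled with a string such as "actionable", "false_positive", or "ignored"; these labels are illustrative.

```python
def actionable_rate(outcomes: list[str]) -> float:
    """Fraction of alerts that were marked as requiring action."""
    if not outcomes:
        return 0.0
    return sum(1 for o in outcomes if o == "actionable") / len(outcomes)


def needs_tuning(outcomes: list[str], target: float = 0.95) -> bool:
    """True when there is data and the actionable rate is below the target."""
    return bool(outcomes) and actionable_rate(outcomes) < target
```

Running this per alert rule, rather than across all alerts, shows which specific rules are generating noise.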

Digest Notifications

For low-priority informational alerts, send daily or weekly digest summaries instead of real-time notifications. Include statistics, trends, and aggregated status updates in a single scheduled email.

Best Practices

Follow these proven practices for effective alerting:

Make Alerts Actionable

Every alert should clearly state what's wrong and what action is needed. Include specific error messages, affected systems, and severity. Bad: "Server error detected." Good: "Mail server smtp1.example.com down - no response to health checks. Action: Restart server or investigate system logs."

Provide Context and Links

Include direct links to dashboards, logs, runbooks, and relevant documentation. Add context about when the issue started, how long it's been occurring, and any related metrics. Enable one-click access to investigation tools.

Write Clear Messages

Use plain language, not technical jargon or cryptic codes. Front-load the most important information: severity and issue summary first, details second. Structure messages consistently so they're easy to scan at 3 AM.

Include Runbook Links

Link to step-by-step resolution procedures (runbooks) for common issues. This enables faster response, especially for less experienced team members or during handoffs. Example: "See runbook: https://docs.company.com/runbooks/blocklist-removal"

Indicate Business Impact

Explain how the issue affects users or business operations. "Customer emails not being delivered" is more meaningful than "SMTP connection error." Helps prioritize response and justify resource allocation.

Send Resolution Notifications

Always send a follow-up alert when issues are resolved. Include resolution time, what fixed it, and any follow-up actions needed. This provides closure and helps with post-incident analysis.

Document Alert Procedures

Maintain documentation of all alert types, what triggers them, who responds, and how to resolve them. Include expected response times and escalation paths. Keep this documentation updated as systems and procedures evolve.

Regular Alert Reviews

Review alert effectiveness quarterly. Analyze acknowledgment rates, response times, false positives, and missed incidents. Adjust thresholds, add new alerts for emerging issues, and remove alerts that don't add value.

Testing Alerts

Regularly test your alerting system to ensure it works when you need it:

Test Notification Delivery

Send test alerts through each configured channel monthly. Verify they arrive at the correct destinations with proper formatting. Test during business hours and after hours to confirm routing works correctly.

Verify Escalation Paths

Test that escalation policies work correctly. Simulate an unacknowledged critical alert and verify it reaches secondary contacts at the right time. Ensure on-call schedules are current and contacts are reachable.

Trigger Actual Alerts

In non-production environments, intentionally trigger real alert conditions (take a server offline, modify DNS records in a test zone). Verify the monitoring system detects the issue and sends alerts correctly.

Check Integration Webhooks

Validate that webhook integrations are functioning. Check that alerts create tickets in your issue tracking system, post to correct Slack channels, and trigger automation workflows as expected.

Response Drills

Conduct quarterly incident response drills. Send a simulated critical alert and time how long it takes for on-call personnel to acknowledge and begin investigation. Use results to improve response procedures and training.

Document Test Results

Keep records of all alert tests including date, what was tested, results, and any issues found. Track resolution of identified problems. Maintain a testing schedule to ensure all alert types are tested regularly.

Alert Message Template

Use this template structure for consistent, effective alert messages:

[CRITICAL] Mail Server Down
Server: smtp1.example.com
Issue: Server not responding to health checks
Started: 2024-01-15 14:23 UTC
Impact: All outbound email blocked
Duration: 5 minutes
Action Required:
1. Check server status
2. Review system logs
3. Restart if necessary
Dashboard: https://monitor.example.com/smtp1
Logs: https://logs.example.com/smtp1
Runbook: https://docs.example.com/runbooks/server-restart
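
A message in this shape can be generated from structured data so that every alert follows the same layout. The function below is a minimal sketch with a hypothetical signature; it relies on Python dictionaries preserving insertion order.

```python
def render_alert(
    severity: str,
    title: str,
    fields: dict[str, str],
    actions: list[str],
    links: dict[str, str],
) -> str:
    """Render an alert in the template layout: header, fields, actions, links."""
    lines = [f"[{severity.upper()}] {title}"]
    lines += [f"{name}: {value}" for name, value in fields.items()]
    lines.append("Action Required:")
    lines += [f"{number}. {action}" for number, action in enumerate(actions, start=1)]
    lines += [f"{name}: {url}" for name, url in links.items()]
    return "\n".join(lines)


message = render_alert(
    "critical",
    "Mail Server Down",
    {"Server": "smtp1.example.com", "Issue": "Server not responding to health checks"},
    ["Check server status", "Review system logs", "Restart if necessary"],
    {"Dashboard": "https://monitor.example.com/smtp1"},
)
```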

Next Steps

Start monitoring your email infrastructure with intelligent alerts:

Set Up Uptime Monitoring →
Blocklist Monitoring →