Okay, let's discuss 2.5.c Monitoring and Alerting. Designing a system is one thing; running it reliably in production is another, and that requires visibility into its operation.
Monitoring
- Definition: Monitoring is the process of collecting, processing, aggregating, and displaying quantitative data about a system's performance, health, and resource utilization over time. It provides visibility into the system's behavior and state.
- Purpose:
- Failure Detection: Identify problems, errors, or outages as they occur.
- Performance Analysis: Understand how the system performs under load, identify bottlenecks, and track latency/throughput.
- Troubleshooting & Debugging: Provide data to diagnose the root cause when issues arise.
- Capacity Planning: Track resource usage (CPU, memory, disk, network) to forecast future needs and plan scaling activities.
- Understanding System Behavior: Observe how different parts of the system interact.
- Verifying SLAs: Ensure the system meets its promised Service Level Agreements for uptime and performance.
- Key Areas to Monitor (the Four Golden Signals, often cited in Google SRE practice):
- Latency: The time it takes to serve a request (e.g., API response time, database query time). Averages are commonly tracked, but percentiles (p50, p95, p99) are crucial for understanding the experience of the majority of users versus the outliers (see the sketch after this list).
- Traffic: A measure of how much demand is being placed on the system (e.g., requests per second for a web service, network I/O).
- Errors: The rate of requests that fail (e.g., HTTP 5xx errors, exceptions caught, database connection failures). Track absolute numbers and rates/percentages.
- Saturation: How "full" the service is; a measure of resource utilization (e.g., CPU utilization, memory usage, disk space, queue depth). High saturation often predicts future performance degradation.
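To make the latency point concrete, here is a minimal Python sketch computing p50/p95/p99 over a window of latency samples. The sample data and distribution are made up purely for illustration.

```python
# Minimal sketch: percentiles vs. the mean over a window of latency samples.
import random
import statistics

# Synthetic latency samples (ms); real systems would read these from metrics.
latencies_ms = [random.lognormvariate(3.0, 0.6) for _ in range(10_000)]

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
percentiles = statistics.quantiles(latencies_ms, n=100)
p50, p95, p99 = percentiles[49], percentiles[94], percentiles[98]
mean = statistics.mean(latencies_ms)

print(f"mean={mean:.1f}ms p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
```

A healthy-looking average can hide a long tail; p99 is often several times the mean, which is exactly why percentiles matter for user experience.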
- Other Important Monitoring Aspects:
- Infrastructure Metrics: CPU, RAM, Disk I/O & space, Network bandwidth & latency for underlying servers/VMs/containers.
- Application-Specific Metrics: Cache hit/miss rates, number of active users, specific business transaction rates (e.g., orders processed), queue lengths (see the instrumentation sketch after this list).
- Logging: Collecting and centralizing structured/unstructured logs from applications and infrastructure. Essential for detailed debugging.
- Distributed Tracing: Tracking a single user request as it propagates through multiple services in a microservices architecture. Helps identify latency bottlenecks within complex call chains.
- Health Checks: Basic up/down status checks (often used by load balancers).
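As a concrete illustration of application-specific metrics and a scrape endpoint, here is a minimal sketch using the prometheus_client Python library; the metric names, port, and simulated request handler are illustrative assumptions, not taken from the text above.

```python
# Minimal sketch: instrumenting an application with Prometheus-style metrics.
# Metric names, port, and the simulated handler are hypothetical.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("orders_processed_total", "Total orders processed")
ERRORS = Counter("order_errors_total", "Total failed order requests")
LATENCY = Histogram("order_latency_seconds", "Order request latency in seconds")
QUEUE_DEPTH = Gauge("order_queue_depth", "Orders waiting in the work queue")

def handle_order():
    """Hypothetical request handler instrumented with the metrics above."""
    start = time.monotonic()
    try:
        time.sleep(random.uniform(0.01, 0.2))   # stand-in for real work
        if random.random() < 0.02:              # simulate an occasional failure
            raise RuntimeError("payment service unavailable")
        REQUESTS.inc()
    except RuntimeError:
        ERRORS.inc()
    finally:
        LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a Prometheus server to scrape
    while True:
        QUEUE_DEPTH.set(random.randint(0, 50))  # stand-in for a real queue length
        handle_order()
```

The same counters and histograms then feed the traffic, error, and latency signals described above, and the /metrics endpoint doubles as a basic liveness signal.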
Alerting
- Definition: Alerting is the process of automatically notifying responsible personnel (e.g., on-call engineers, DevOps teams) when the monitoring system detects predefined problematic conditions or threshold breaches. It's about being notified when observation indicates a problem.
- Purpose:
- Proactive Problem Notification: Trigger notifications when issues occur (or are about to occur), often before users are significantly impacted.
- Enable Fast Response: Reduce the Mean Time To Detect (MTTD) and Mean Time To Respond (MTTR) for incidents.
- Automate Vigilance: Avoid the need for humans to constantly watch dashboards.
- Key Aspects of Alerting:
- Actionable Alerts: Alerts should signify a real problem requiring human attention. Overly sensitive or noisy alerts lead to "alert fatigue" and eventually get ignored.
- Thresholds: Defining specific trigger conditions (e.g., p99 latency > 1 second for 5 minutes, error rate > 2%, free disk space < 10%); see the sketch after this list.
- Severity Levels: Categorizing alerts (e.g., P1/Critical, P2/Warning, P3/Info) helps prioritize responses.
- Notification Channels: How alerts are delivered (e.g., PagerDuty, Opsgenie, Slack messages, email, SMS).
- Dashboards: Visual representations of monitoring data used for investigation and understanding trends (e.g., Grafana, Datadog Dashboards, CloudWatch Dashboards).
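To illustrate the "threshold held for a sustained window" idea (e.g., p99 latency > 1 second for 5 minutes), here is a minimal Python sketch; the class, metric name, and thresholds are hypothetical, and a real deployment would use the alerting tools listed below rather than hand-rolled code.

```python
# Minimal sketch: fire an alert only when a condition holds for a sustained
# window of consecutive checks, which helps keep alerts actionable.
from collections import deque

class SustainedThresholdAlert:
    """Fires when a metric exceeds a threshold for `window` consecutive checks."""

    def __init__(self, name, threshold, window):
        self.name = name
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def observe(self, value):
        self.recent.append(value)
        breached = (
            len(self.recent) == self.recent.maxlen
            and all(v > self.threshold for v in self.recent)
        )
        if breached:
            self.notify(value)
        return breached

    def notify(self, value):
        # In a real system this would page via PagerDuty/Opsgenie or post to Slack.
        print(f"ALERT [{self.name}]: {value} exceeded {self.threshold}")

# e.g. p99 latency > 1s, evaluated once per minute, must hold for 5 minutes.
p99_alert = SustainedThresholdAlert("p99_latency_seconds", threshold=1.0, window=5)
for sample in (0.4, 1.2, 1.3, 1.5, 1.8, 2.0):
    p99_alert.observe(sample)
```

Requiring the breach to persist across the whole window filters out transient spikes, which is one practical way to reduce alert fatigue.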
- Common Tools (Categories):
- Metrics: Prometheus, Datadog, InfluxDB, Graphite, CloudWatch Metrics, Google Cloud Monitoring.
- Logging: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Graylog, Loki, CloudWatch Logs, Google Cloud Logging.
- Tracing: Jaeger, Zipkin, Tempo, Datadog APM, Cloud Trace.
- Alerting: Alertmanager, PagerDuty, Opsgenie, VictorOps, monitoring platform built-in alerting.
- Visualization: Grafana, Kibana, Datadog, CloudWatch Dashboards.
- In an Interview: Monitoring and Alerting are crucial for operational excellence.
- State that you would implement comprehensive monitoring.
- Mention key metrics you'd track (latency, error rates, resource utilization - the Golden Signals are good to name).
- Specify the need for centralized logging and potentially distributed tracing (especially for microservices).
- Explain that alerting would be set up based on critical thresholds to notify the on-call team proactively.