Okay, let's discuss 2.5.c Monitoring and Alerting. Designing a system is one thing; running it reliably in production is another, and that requires visibility into its operation.
Monitoring
- Definition: Monitoring is the process of collecting, processing, aggregating, and displaying quantitative data about a system's performance, health, and resource utilization over time. It provides visibility into the system's behavior and state.
- Purpose:
- Failure Detection: Identify problems, errors, or outages as they occur.
- Performance Analysis: Understand how the system performs under load, identify bottlenecks, and track latency/throughput.
- Troubleshooting & Debugging: Provide data to diagnose the root cause when issues arise.
- Capacity Planning: Track resource usage (CPU, memory, disk, network) to forecast future needs and plan scaling activities.
- Understanding System Behavior: Observe how different parts of the system interact.
- Verifying SLAs: Ensure the system meets its promised Service Level Agreements for uptime and performance.
- Key Areas to Monitor (the Four Golden Signals, often cited in Google SRE practice):
- Latency: The time it takes to serve a request (e.g., API response time, database query time). Averages are commonly tracked, but percentiles (p50, p95, p99) are crucial for understanding the experience of the majority of users versus the outliers (see the sketch after this list).
- Traffic: A measure of how much demand is being placed on the system (e.g., requests per second for a web service, network I/O).
- Errors: The rate of requests that fail (e.g., HTTP 5xx errors, exceptions caught, database connection failures). Track absolute numbers and rates/percentages.
- Saturation: How "full" the service is; a measure of resource utilization (e.g., CPU utilization, memory usage, disk space, queue depth). High saturation often predicts future performance degradation.
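To make the latency point concrete, here is a minimal Python sketch computing p50/p95/p99 over a window of latency samples. The sample data and distribution are made up purely for illustration.

```python
# Minimal sketch: percentiles vs. the mean over a window of latency samples.
import random
import statistics

# Synthetic latency samples (ms); real systems would read these from metrics.
latencies_ms = [random.lognormvariate(3.0, 0.6) for _ in range(10_000)]

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
percentiles = statistics.quantiles(latencies_ms, n=100)
p50, p95, p99 = percentiles[49], percentiles[94], percentiles[98]
mean = statistics.mean(latencies_ms)

print(f"mean={mean:.1f}ms p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
```

A healthy-looking average can hide a long tail; p99 is often several times the mean, which is exactly why percentiles matter for user experience.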
- Other Important Monitoring Aspects:
- Infrastructure Metrics: CPU, RAM, Disk I/O & space, Network bandwidth & latency for underlying servers/VMs/containers.
- Application-Specific Metrics: Cache hit/miss rates, number of active users, specific business transaction rates (e.g., orders processed), queue lengths (see the instrumentation sketch after this list).
- Logging: Collecting and centralizing structured/unstructured logs from applications and infrastructure. Essential for detailed debugging.
- Distributed Tracing: Tracking a single user request as it propagates through multiple services in a microservices architecture. Helps identify latency bottlenecks within complex call chains.
- Health Checks: Basic up/down status checks (often used by load balancers).
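As a concrete illustration of application-specific metrics and a scrape endpoint, here is a minimal sketch using the prometheus_client Python library; the metric names, port, and simulated request handler are illustrative assumptions, not taken from the text above.

```python
# Minimal sketch: instrumenting an application with Prometheus-style metrics.
# Metric names, port, and the simulated handler are hypothetical.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("orders_processed_total", "Total orders processed")
ERRORS = Counter("order_errors_total", "Total failed order requests")
LATENCY = Histogram("order_latency_seconds", "Order request latency in seconds")
QUEUE_DEPTH = Gauge("order_queue_depth", "Orders waiting in the work queue")

def handle_order():
    """Hypothetical request handler instrumented with the metrics above."""
    start = time.monotonic()
    try:
        time.sleep(random.uniform(0.01, 0.2))   # stand-in for real work
        if random.random() < 0.02:              # simulate an occasional failure
            raise RuntimeError("payment service unavailable")
        REQUESTS.inc()
    except RuntimeError:
        ERRORS.inc()
    finally:
        LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a Prometheus server to scrape
    while True:
        QUEUE_DEPTH.set(random.randint(0, 50))  # stand-in for a real queue length
        handle_order()
```

The same counters and histograms then feed the traffic, error, and latency signals described above, and the /metrics endpoint doubles as a basic liveness signal.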
Alerting
- Definition: Alerting is the process of automatically notifying responsible personnel (e.g., on-call engineers, DevOps teams) when the monitoring system detects predefined problematic conditions or threshold breaches. It's about being notified when observation indicates a problem.
- Purpose:
- Proactive Problem Notification: Trigger notifications when issues occur (or are about to occur), often before users are significantly impacted.
- Enable Fast Response: Reduce the Mean Time To Detect (MTTD) and Mean Time To Respond (MTTR) for incidents.
- Automate Vigilance: Avoid the need for humans to constantly watch dashboards.
- Key Aspects of Alerting:
- Actionable Alerts: Alerts should signify a real problem requiring human attention. Overly sensitive or noisy alerts lead to "alert fatigue" and eventually get ignored.
- Thresholds: Defining specific trigger conditions (e.g., p99 latency > 1 second for 5 minutes, error rate > 2%, free disk space < 10%); see the sketch after this list.
- Severity Levels: Categorizing alerts (e.g., P1/Critical, P2/Warning, P3/Info) helps prioritize responses.
- Notification Channels: How alerts are delivered (e.g., PagerDuty, Opsgenie, Slack messages, email, SMS).
- Dashboards: Visual representations of monitoring data used for investigation and understanding trends (e.g., Grafana, Datadog Dashboards, CloudWatch Dashboards).
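To illustrate the "threshold held for a sustained window" idea (e.g., p99 latency > 1 second for 5 minutes), here is a minimal Python sketch; the class, metric name, and thresholds are hypothetical, and a real deployment would use the alerting tools listed below rather than hand-rolled code.

```python
# Minimal sketch: fire an alert only when a condition holds for a sustained
# window of consecutive checks, which helps keep alerts actionable.
from collections import deque

class SustainedThresholdAlert:
    """Fires when a metric exceeds a threshold for `window` consecutive checks."""

    def __init__(self, name, threshold, window):
        self.name = name
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def observe(self, value):
        self.recent.append(value)
        breached = (
            len(self.recent) == self.recent.maxlen
            and all(v > self.threshold for v in self.recent)
        )
        if breached:
            self.notify(value)
        return breached

    def notify(self, value):
        # In a real system this would page via PagerDuty/Opsgenie or post to Slack.
        print(f"ALERT [{self.name}]: {value} exceeded {self.threshold}")

# e.g. p99 latency > 1s, evaluated once per minute, must hold for 5 minutes.
p99_alert = SustainedThresholdAlert("p99_latency_seconds", threshold=1.0, window=5)
for sample in (0.4, 1.2, 1.3, 1.5, 1.8, 2.0):
    p99_alert.observe(sample)
```

Requiring the breach to persist across the whole window filters out transient spikes, which is one practical way to reduce alert fatigue.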
- Common Tools (Categories):
- Metrics: Prometheus, Datadog, InfluxDB, Graphite, CloudWatch Metrics, Google Cloud Monitoring.
- Logging: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk, Graylog, Loki, CloudWatch Logs, Google Cloud Logging.
- Tracing: Jaeger, Zipkin, Tempo, Datadog APM, Cloud Trace.
- Alerting: Alertmanager, PagerDuty, Opsgenie, VictorOps, monitoring platform built-in alerting.
- Visualization: Grafana, Kibana, Datadog, CloudWatch Dashboards.
- In an Interview: Monitoring and Alerting are crucial for operational excellence.
- State that you would implement comprehensive monitoring.
- Mention key metrics you'd track (latency, error rates, resource utilization - the Golden Signals are good to name).
- Specify the need for centralized logging and potentially distributed tracing (especially for microservices).
- Explain that alerting would be set up based on critical thresholds to notify the on-call team proactively.