Okay, let's start the final section of Phase 2: 2.5 Availability and Reliability.

While scalability focuses on handling load, availability and reliability focus on ensuring the system is operational when needed and functions correctly despite potential failures.

  • Availability: The percentage of time a system is operational and able to serve requests. Often expressed in "nines" (e.g., 99.9% - "three nines", 99.99% - "four nines"); each additional nine cuts the permitted downtime by a factor of ten (see the quick calculation below).
  • Reliability: The probability that a system will perform its intended function correctly for a specified period under stated conditions. It's about correctness and lack of failures during uptime.
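To make the "nines" concrete, here is a small illustrative Python sketch (not tied to any particular library) that converts an availability target into the maximum downtime it allows per year:

```python
def max_downtime_per_year(availability_pct: float) -> str:
    """Convert an availability percentage into allowed downtime per year."""
    minutes_per_year = 365 * 24 * 60
    downtime_minutes = (1 - availability_pct / 100) * minutes_per_year
    hours, minutes = divmod(downtime_minutes, 60)
    return f"{availability_pct}% -> {int(hours)}h {minutes:.1f}m of downtime/year"

for target in (99.0, 99.9, 99.99, 99.999):
    print(max_downtime_per_year(target))

# Output:
# 99.0% -> 87h 36.0m of downtime/year   ("two nines")
# 99.9% -> 8h 45.6m of downtime/year    ("three nines")
# 99.99% -> 0h 52.6m of downtime/year   ("four nines")
# 99.999% -> 0h 5.3m of downtime/year   ("five nines")
```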

We'll cover these key concepts and techniques:

  • 2.5.a Redundancy
  • 2.5.b Fault Tolerance
  • 2.5.c Monitoring and Alerting
  • 2.5.d Rate Limiting
  • 2.5.e Circuit Breaker Pattern

Let's begin with 2.5.a Redundancy.

  • Definition: Redundancy means having duplicate or backup components within a system that can take over if a primary component fails. It's about eliminating single points of failure (SPOFs).

  • Purpose:

    • Improve Availability: If one component goes down, a redundant component is already running (or can quickly start) to handle the workload, minimizing or eliminating downtime.
    • Improve Reliability/Fault Tolerance: Makes the system more resilient to failures of individual components.
  • Examples of Redundancy:

    • Hardware Redundancy:
      • Servers with multiple power supplies or network interface cards (NICs).
      • RAID configurations for disk drives (data is mirrored or striped across multiple disks).
    • Software/Service Redundancy:
      • Running multiple instances of application servers behind a load balancer. If one instance crashes, the load balancer's health checks detect it and traffic is directed to the healthy ones (see the failover sketch after this list).
      • Database replication (Master-Slave or Master-Master): replicas hold live copies of the data, and a replica can be promoted to take over if the primary fails.
      • Running multiple instances of load balancers, API gateways, or cache servers.
    • Network Redundancy: Multiple network paths or internet service providers.
    • Data Center / Availability Zone Redundancy: Deploying system components across multiple physical data centers or Availability Zones (AZs) within a cloud provider. This protects against failures affecting an entire location (power outages, natural disasters, network issues).
  • Relationship to Other Concepts:

    • Redundancy is often enabled and managed by Load Balancers (directing traffic to redundant instances) and Database Replication (creating redundant data copies).
    • It's a core principle for achieving Fault Tolerance.
  • In an Interview: Redundancy is fundamental. When designing any component (servers, databases, gateways), consider if it's a single point of failure. If so, propose adding redundancy (e.g., "We'll run at least two instances of this service behind a load balancer," "We'll use master-slave replication for the database"). Explain that the goal is to improve availability by ensuring no single component failure brings down the system.
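To show how service redundancy plays out in code, here is a minimal, hypothetical Python sketch of client-side failover across redundant instances: the caller tries each healthy backend in turn and only fails if every replica is down. The instance URLs and the `/health` endpoint are assumptions for illustration; in practice a load balancer usually performs these checks for you. Assuming instances fail independently, each with availability A, running N of them gives roughly 1 - (1 - A)^N combined availability, which is why adding a second instance is usually the single biggest availability win.

```python
import urllib.request
import urllib.error

# Hypothetical redundant instances of the same stateless service.
BACKENDS = [
    "http://app-1.internal:8080",
    "http://app-2.internal:8080",  # redundant copy, ideally in another AZ
    "http://app-3.internal:8080",
]

def is_healthy(base_url: str, timeout: float = 0.5) -> bool:
    """Probe the instance's (assumed) /health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def call_with_failover(path: str) -> bytes:
    """Try each healthy backend in turn; raise only if all replicas are down."""
    for base_url in BACKENDS:
        if not is_healthy(base_url):
            continue  # skip instances that fail the health check
        try:
            with urllib.request.urlopen(f"{base_url}{path}", timeout=2.0) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            continue  # this replica failed mid-request; try the next one
    raise RuntimeError("All redundant instances are unavailable")

# Usage: the caller never depends on any single instance (no SPOF).
# body = call_with_failover("/api/orders/123")
```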
