Okay, let's start the final section of Phase 2: 2.5 Availability and Reliability.

While scalability focuses on handling load, availability and reliability focus on ensuring the system is operational when needed and functions correctly despite potential failures.

  • Availability: The percentage of time a system is operational and able to serve requests. Often expressed in "nines" (e.g., 99.9% - "three nines", 99.99% - "four nines"); each additional nine cuts the permitted downtime by a factor of ten (see the quick calculation below).
  • Reliability: The probability that a system will perform its intended function correctly for a specified period under stated conditions. It's about correctness and lack of failures during uptime.
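To make the "nines" concrete, here is a small illustrative Python sketch (not tied to any particular library) that converts an availability target into the maximum downtime it allows per year:

```python
def max_downtime_per_year(availability_pct: float) -> str:
    """Convert an availability percentage into allowed downtime per year."""
    minutes_per_year = 365 * 24 * 60
    downtime_minutes = (1 - availability_pct / 100) * minutes_per_year
    hours, minutes = divmod(downtime_minutes, 60)
    return f"{availability_pct}% -> {int(hours)}h {minutes:.1f}m of downtime/year"

for target in (99.0, 99.9, 99.99, 99.999):
    print(max_downtime_per_year(target))

# Output:
# 99.0% -> 87h 36.0m of downtime/year   ("two nines")
# 99.9% -> 8h 45.6m of downtime/year    ("three nines")
# 99.99% -> 0h 52.6m of downtime/year   ("four nines")
# 99.999% -> 0h 5.3m of downtime/year   ("five nines")
```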

We'll cover these key concepts and techniques:

  • 2.5.a Redundancy
  • 2.5.b Fault Tolerance
  • 2.5.c Monitoring and Alerting
  • 2.5.d Rate Limiting
  • 2.5.e Circuit Breaker Pattern

Let's begin with 2.5.a Redundancy.

  • Definition: Redundancy means having duplicate or backup components within a system that can take over if a primary component fails. It's about eliminating single points of failure (SPOFs).

  • Purpose:

    • Improve Availability: If one component goes down, a redundant component is already running (or can quickly start) to handle the workload, minimizing or eliminating downtime.
    • Improve Reliability/Fault Tolerance: Makes the system more resilient to failures of individual components.
  • Examples of Redundancy:

    • Hardware Redundancy:
      • Servers with multiple power supplies or network interface cards (NICs).
      • RAID configurations for disk drives (data is mirrored or striped across multiple disks).
    • Software/Service Redundancy:
      • Running multiple instances of application servers behind a load balancer. If one instance crashes, the load balancer's health checks detect it and traffic is directed to the healthy ones (see the failover sketch after this list).
      • Database replication (Master-Slave or Master-Master): replicas hold live copies of the data, and a replica can be promoted to take over if the primary fails.
      • Running multiple instances of load balancers, API gateways, or cache servers.
    • Network Redundancy: Multiple network paths or internet service providers.
    • Data Center / Availability Zone Redundancy: Deploying system components across multiple physical data centers or Availability Zones (AZs) within a cloud provider. This protects against failures affecting an entire location (power outages, natural disasters, network issues).
  • Relationship to Other Concepts:

    • Redundancy is often enabled and managed by Load Balancers (directing traffic to redundant instances) and Database Replication (creating redundant data copies).
    • It's a core principle for achieving Fault Tolerance.
  • In an Interview: Redundancy is fundamental. When designing any component (servers, databases, gateways), consider if it's a single point of failure. If so, propose adding redundancy (e.g., "We'll run at least two instances of this service behind a load balancer," "We'll use master-slave replication for the database"). Explain that the goal is to improve availability by ensuring no single component failure brings down the system.
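To show how service redundancy plays out in code, here is a minimal, hypothetical Python sketch of client-side failover across redundant instances: the caller tries each healthy backend in turn and only fails if every replica is down. The instance URLs and the `/health` endpoint are assumptions for illustration; in practice a load balancer usually performs these checks for you. Assuming instances fail independently, each with availability A, running N of them gives roughly 1 - (1 - A)^N combined availability, which is why adding a second instance is usually the single biggest availability win.

```python
import urllib.request
import urllib.error

# Hypothetical redundant instances of the same stateless service.
BACKENDS = [
    "http://app-1.internal:8080",
    "http://app-2.internal:8080",  # redundant copy, ideally in another AZ
    "http://app-3.internal:8080",
]

def is_healthy(base_url: str, timeout: float = 0.5) -> bool:
    """Probe the instance's (assumed) /health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def call_with_failover(path: str) -> bytes:
    """Try each healthy backend in turn; raise only if all replicas are down."""
    for base_url in BACKENDS:
        if not is_healthy(base_url):
            continue  # skip instances that fail the health check
        try:
            with urllib.request.urlopen(f"{base_url}{path}", timeout=2.0) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            continue  # this replica failed mid-request; try the next one
    raise RuntimeError("All redundant instances are unavailable")

# Usage: the caller never depends on any single instance (no SPOF).
# body = call_with_failover("/api/orders/123")
```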
