The Problem: Locking Across Multiple Machines

In a single application, we use locks (like mutexes) to prevent multiple threads from accessing the same resource at the same time, avoiding race conditions. But what if the "threads" are actually different servers in a distributed system, all trying to use a shared resource (such as a specific file in a storage system) or perform a critical action that only one of them may do at a time? A simple in-memory lock won't work because it is local to a single server. This is where distributed locking is needed.


What is Distributed Locking?

A distributed lock is a mechanism that enforces mutual exclusion for a shared resource among multiple processes running on different machines. It ensures that, at any given moment, only one process across the entire distributed system can enter a critical section or hold the lock for a specific resource.


Challenges

Implementing a distributed lock is much harder than a local lock due to the realities of distributed systems:

  • Network Partitions: A client might acquire a lock, get partitioned from the lock service, and then the service might mistakenly think the client has died and grant the lock to another client. This can leave two clients each believing they hold the same lock.
  • Lock Service Failure: The service managing the locks can fail, potentially losing all lock information or becoming unavailable.
  • Clock Skew: Relying on timestamps across different machines is unreliable because their clocks are never perfectly synchronized.
  • Deadlocks: Just like local deadlocks, but harder to detect and resolve across multiple machines.

Common Implementations

Here are a few ways distributed locks are implemented, from simple to more robust:

1. Using a Database

A simple approach is to use a relational database. You can create a locks table with a unique constraint on a resource_name column.

  • To acquire a lock: A client tries to INSERT a row for the resource name. Due to the UNIQUE constraint, only the first client will succeed (see the sketch after this list).
  • To release a lock: The client DELETEs the row.
  • Pros: Easy to understand and implement if you already have a database. It leverages the database's ACID properties.
  • Cons: The database can become a performance bottleneck and is a single point of failure (if the DB goes down, your locking mechanism is down).
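
To make the INSERT/DELETE flow concrete, here is a minimal sketch in Python. The table, column, and function names (locks, resource_name, owner, acquire_lock, release_lock) are illustrative assumptions, not a standard API, and sqlite3 is used only so the example is self-contained; in a real deployment the connection would point at a shared server database such as PostgreSQL or MySQL.

```python
import sqlite3

# Assumption: a shared relational database reachable by every client.
# sqlite3 keeps this sketch self-contained; swap in your real DB driver.
conn = sqlite3.connect("locks.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS locks ("
    "  resource_name TEXT NOT NULL UNIQUE,"  # the UNIQUE constraint is the lock
    "  owner TEXT NOT NULL,"
    "  acquired_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
)

def acquire_lock(resource: str, owner: str) -> bool:
    """Returns True only for the first client whose INSERT succeeds."""
    try:
        conn.execute(
            "INSERT INTO locks (resource_name, owner) VALUES (?, ?)",
            (resource, owner),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:  # row already exists: someone else holds the lock
        return False

def release_lock(resource: str, owner: str) -> None:
    """Releases the lock by deleting the row, but only if this client owns it."""
    conn.execute(
        "DELETE FROM locks WHERE resource_name = ? AND owner = ?",
        (resource, owner),
    )
    conn.commit()

if acquire_lock("my_resource", "server-1"):
    try:
        ...  # critical section
    finally:
        release_lock("my_resource", "server-1")
```

Note that this sketch has no lease: if a client crashes inside the critical section, its row stays behind and the lock is stuck until someone cleans it up (or an expiry column is added and checked).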

2. Using a Distributed Coordination Service (e.g., ZooKeeper, etcd)

This is the most common and recommended approach for building reliable distributed systems.

  • How it works (ZooKeeper example; a minimal code sketch follows this list):
    1. A client attempts to create an ephemeral znode (a type of file/directory) with a specific name (e.g., /locks/my_resource).
    2. ZooKeeper's consensus protocol guarantees that only one client can successfully create the node at a time. The successful client now holds the lock.
    3. Other clients can "watch" the znode for changes.
    4. To release the lock: The client explicitly deletes the znode.
    5. Fault Tolerance: The key is the "ephemeral" nature. If the client who holds the lock crashes or gets disconnected, its session with ZooKeeper times out, and ZooKeeper automatically deletes the ephemeral znode, releasing the lock. This elegantly solves the problem of crashed clients holding locks indefinitely.
  • Pros: Highly reliable and fault-tolerant because these systems are built on consensus protocols (ZAB for ZooKeeper, Raft for etcd). Handles client failures automatically.
  • Cons: Adds the operational overhead of managing another distributed system (ZooKeeper/etcd).
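
Below is a minimal sketch of the ephemeral-znode flow described in the steps above, using the kazoo Python client. The host address, the /locks path, and the helper names (try_acquire, release) are assumptions for illustration, not a prescribed API.

```python
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError, NoNodeError

zk = KazooClient(hosts="127.0.0.1:2181")  # assumed ZooKeeper address
zk.start()
zk.ensure_path("/locks")

def try_acquire(resource: str) -> bool:
    try:
        # Ephemeral: ZooKeeper deletes the node automatically if our session dies,
        # so a crashed client cannot hold the lock forever.
        zk.create(f"/locks/{resource}", b"owner=server-1", ephemeral=True)
        return True
    except NodeExistsError:  # someone else already holds the lock
        return False

def release(resource: str) -> None:
    try:
        zk.delete(f"/locks/{resource}")  # explicit release
    except NoNodeError:
        pass  # session expired and ZooKeeper already removed the node

if try_acquire("my_resource"):
    try:
        ...  # critical section
    finally:
        release("my_resource")
else:
    # Step 3 above: watch the znode and retry acquisition once it disappears.
    zk.exists("/locks/my_resource", watch=lambda event: print("lock released, retry"))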

3. Using a Distributed Cache (e.g., Redis)

Redis is often used for its high performance.

  • How it works: A client can use a single atomic command like SET resource_name random_value NX PX 30000 (see the code sketch after this list).
    • NX: Only set the key if it does not already exist. This makes the lock acquisition atomic.
    • PX 30000: Set an expiration time of 30,000 milliseconds. This is a lease—it ensures the lock is automatically released if the client crashes.
    • random_value: A unique random value known only to the client. When releasing the lock, the client checks that the key still holds this value before deleting it (typically via a small Lua script so the check and the delete are atomic); otherwise it could accidentally release a lock that has since been acquired by another client.
  • Pros: Very fast; acquiring and releasing a lock are single, cheap Redis commands.
  • Cons: Can be less safe than ZooKeeper. In a standard Redis master-slave setup, a lock might be acquired on the master, but the master could fail before replicating that lock to the slave. If the slave is promoted to master, another client could acquire the same lock. More complex setups like the Redlock algorithm attempt to solve this, but they are controversial and not trivial to implement correctly.
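
Here is a sketch of that acquire/release cycle using the redis-py client. The key name, TTL, and helper names are illustrative assumptions; the release uses a small Lua script so that checking the stored random value and deleting the key happen as one atomic step on the server.

```python
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed Redis address

# Atomic check-and-delete: only delete the key if it still holds our token.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

def acquire(resource: str, ttl_ms: int = 30000):
    token = str(uuid.uuid4())  # random value known only to this client
    # SET resource token NX PX ttl_ms -> create only if absent, with a lease.
    if r.set(resource, token, nx=True, px=ttl_ms):
        return token
    return None  # someone else holds the lock

def release(resource: str, token: str) -> bool:
    # True if we deleted our own lock, False if it expired or was taken over.
    return bool(r.eval(RELEASE_SCRIPT, 1, resource, token))

token = acquire("lock:my_resource")
if token:
    try:
        ...  # critical section (must finish before the 30 s lease expires)
    finally:
        release("lock:my_resource", token)
```

The lease is what protects against crashed clients here: if the holder dies, the key simply expires and the lock becomes available again.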

Summary for an Interview

  • Recognize when a distributed lock is needed: anytime multiple independent processes/servers need to coordinate access to a shared resource.
  • Do not recommend building one from scratch.
  • The best practice is to use a dedicated coordination service like ZooKeeper or etcd. Explain that their consensus-based nature and features like ephemeral nodes make them ideal for reliable locking.
  • Show awareness of other methods (Database, Redis) and be able to articulate their trade-offs (e.g., the database as a single point of failure, or the performance-vs-safety trade-off of using Redis).
  • Mention the concept of a lease (a lock with a timeout) as a crucial mechanism for handling client failures.