* Locking

Why: to regulate access to shared data structures. Example: multiple
threads (or CPU cores) can access the same variable:

    int i = somevalue;

    Thread 1: i++;
    Thread 2: i--;
    Thread 3: printf("%d", i);

Another example: the size of a file as recorded in struct inode. This
is important because if any process writes to or appends at the end of
the file, it needs to see a consistent file size.

One always has to THINK carefully about one's "locking semantics":

1. Are there any shared resources that need to be locked?
2. What kind of lock is needed?

Q1: Do any resources need to be locked?

A1: No, if the resource is "read only" (perhaps initialized once).
    There's no risk that the value will be changed by anyone (or, no
    one is supposed to modify that variable). For example, some inode
    fields get initialized once when the inode is created.

A2: No, if a consistent value is not required. For example, every
    computer system has a clock that ticks and updates some register
    in memory. The clock ticks at some high frequency -- every milli-
    or micro-second, or even every CPU cycle. If you need a
    high-accuracy clock (e.g., down to the micro-second), then you
    need to lock the clock register before reading it. But if all you
    care about is, say, 1-second resolution, then it's OK to read the
    clock-register value w/o locking.

Q2: Now, suppose you do need to lock: what kind of lock should you use?

* Lock categories

1. Exclusive vs. non-exclusive

   An exclusive lock can be taken by only one thread (called the "lock
   owner"). That means only one lock owner can execute the critical
   section (CS) protected by the lock; all others must wait (i.e., be
   blocked at the entry point to the CS).

   A non-exclusive lock is one where more than one lock owner can
   enter the CS at the same time. The number of lock owners may or may
   not be limited. (A reader-writer lock sketch appears below.)

2. Blocking vs. non-blocking

   A blocking lock allows the code inside the CS to "block" (e.g., go
   into the scheduler's WAIT state, as when waiting on I/O).

   A non-blocking lock does not permit the lock owner(s) inside the CS
   to block. IOW, the CS must be small (or run quickly).

* Example

    int i = somevalue; // a shared/global value
    lock_t L;          // declare some lock and init it

    Thread 1: lock(L); i++;             unlock(L); // CS
    Thread 2: lock(L); i--;             unlock(L); // CS
    Thread 3: lock(L); printf("%d", i); unlock(L); // CS

Note: if you identify 'i' as the shared resource, and decide to use
lock 'L' to protect it, then you must use un/lock(L) around any use of
'i', anywhere in your code base. That is, you must protect the shared
resource EVERYWHERE in your code base, and use the same lock.
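To make the example above concrete, here's a minimal runnable sketch
using POSIX threads, with pthread_mutex_t standing in for lock_t (the
thread-function names are illustrative, not part of the original
example):

    /* cc counter.c -lpthread */
    #include <pthread.h>
    #include <stdio.h>

    static int i = 0;  /* the shared resource */
    static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER; /* one lock for 'i' */

    static void *incr(void *arg)
    {
        pthread_mutex_lock(&L);
        i++;                       /* CS */
        pthread_mutex_unlock(&L);
        return NULL;
    }

    static void *decr(void *arg)
    {
        pthread_mutex_lock(&L);
        i--;                       /* CS */
        pthread_mutex_unlock(&L);
        return NULL;
    }

    static void *show(void *arg)
    {
        pthread_mutex_lock(&L);
        printf("%d\n", i);         /* CS: even a read takes the same lock */
        pthread_mutex_unlock(&L);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2, t3;
        pthread_create(&t1, NULL, incr, NULL);
        pthread_create(&t2, NULL, decr, NULL);
        pthread_create(&t3, NULL, show, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        pthread_join(t3, NULL);
        return 0;
    }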
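The sketch above uses an exclusive, blocking lock. For the
non-exclusive category, a familiar example is a reader-writer lock:
many readers may hold it at once, but a writer holds it exclusively.
A minimal sketch using POSIX pthread_rwlock_t (the 'filesize' field
here is a hypothetical stand-in for the inode-size example):

    #include <pthread.h>

    static long filesize;  /* hypothetical: the size guarded in struct inode */
    static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;

    long read_size(void)         /* many threads may read concurrently */
    {
        pthread_rwlock_rdlock(&rw);
        long sz = filesize;
        pthread_rwlock_unlock(&rw);
        return sz;
    }

    void append_bytes(long n)    /* a writer holds the lock exclusively */
    {
        pthread_rwlock_wrlock(&rw);
        filesize += n;
        pthread_rwlock_unlock(&rw);
    }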
* Implementation

For un/lock() to work (in any high-level language), you need a
low-level hardware mechanism that performs some action ATOMICALLY,
even if the same action executes on different CPUs/cores at the same
time. Multiple cores give you the ability to "run more things in
parallel" (which improves performance); but if what runs in parallel
is a lock() instruction, then it has to be coordinated somehow across
all CPUs/cores, so that only one of them will "win" the race and enter
the CS.

Modern architectures include atomic instructions of two types:

1. test-and-set (TAS): test if a register's value is X and, if so, set
   it to Y. For example:

    if (R1 == 0): R1 = 1
   or
    if (R2 != 0): R2 = 0

2. compare-and-swap (CAS): compare a value against an expected one,
   and swap in a new value only if they match. More generally, test
   the values of two registers and swap them according to some
   criterion (e.g., one is larger than the other). Example:

    if (R2 < R1) { // ensure R1 holds the smaller of the two values
        tmp = R2;
        R2 = R1;
        R1 = tmp;
    }

Key: these TAS/CAS instructions are built into the processor(s) and
operate atomically: only one of these instructions will execute even
if several run in parallel on multiple CPUs/cores. The actual
implementation of these atomic instructions is done in hardware (the
above is just sample pseudo-code). A portable C11 sketch of the CAS
idea appears after the Efficiency section below.

When we have multiple cores or CPUs, the system architecture has to
synchronize lock state across all cores and all CPUs. If you do, say,
a test-and-set on R1 in one core, the system has to ensure that no one
else is using R1 in other cores (of the same or other CPUs), esp. if
another core also runs a test-and-set on R1!

* Efficiency

As the number of CPUs and cores grows, consistent locking becomes
harder and harder to achieve. In the simplest case, the h/w has to
stop running anything on all CPUs/cores and sync the value of the lock
register across them. That can slow down a system considerably.
Ironically, it gets worse and worse as we increase the no. of CPUs and
cores in our systems.

Processor vendors know about these possible inefficiencies, so they've
devised all kinds of complicated ways to improve efficiency. One
technique is "speculative" locking. If you expect that at most one
thread might be accessing a lock register, then you can afford to
check ASYNCHRONOUSLY whether other cores may be accessing the same
lock register: most of the time, you'll find that no one else touches
the same register, and so the speculation "won" (you permitted the
test-and-set on the lock register to proceed w/o having to wait to
sync state across all cores). But if you discover that other cores did
indeed try to test-and-set R1, then you have to "undo" your state:
effectively reverting the CPU caches and memory state to before the
first test-and-set on the lock register started, and start again (this
time taking the slower path of carefully checking all cores).

Lesson: merely taking a lock can slow your code down, so you have to
be careful to take the right locks (TBD) and only when needed.
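As promised above, here is the CAS idea in portable C11 form, via
<stdatomic.h> (a sketch assuming a C11 compiler; the compiler emits
the CPU's actual atomic instruction). atomic_compare_exchange_weak()
compares a variable against an expected value and swaps in a new one
only if they still match, which is enough to increment a shared
counter without a lock:

    #include <stdatomic.h>

    static atomic_int i;

    void atomic_increment(void)
    {
        int expected = atomic_load(&i);
        /* If 'i' still holds 'expected', swap in expected+1.
         * Otherwise 'expected' is refreshed to i's current value
         * and we retry. */
        while (!atomic_compare_exchange_weak(&i, &expected, expected + 1))
            ; /* someone else won the race; try again */
    }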
Next, assume I have this code, and only 'i' is a shared variable that
must be protected by 'L':

    lock(L);
    int j = foo();
    printf("%d", i);
    if (j < 0)
        j = foo(j);
    printf("%d", j);
    unlock(L);

Problem with the above code: the lock now also protects 'j'. But
assume 'j' is NOT a resource that must be locked.

Q: Is the above code incorrect wrt 'i'?
A: No (b/c we're still protecting 'i').

The main problem: the entire CS is now longer and bigger, and hence
takes more time to run. This slows down others who are blocked on 'L'
and waiting to enter THEIR critical section. By slowing down others,
you're making overall system performance worse (i.e., application
throughput drops).

Lesson: keep CSs as small as possible, and protect only the resources
you need to. This ensures the maximum throughput possible in the
system, across all apps that may access their CSs. Locking ensures
correctness of data, but slows things down. So use the right locks and
keep the CS as small as possible to maximize performance.

You may have to restructure your code to help locking efficiency: all
the 'j' work can move outside the CS (before or after it):

    int j = foo();
    if (j < 0)
        j = foo(j);
    printf("%d", j); // didn't need to be inside the CS
    lock(L);
    printf("%d", i); // the CS now touches only 'i'
    unlock(L);

* Types of locks in Linux

1. spinlock

   Spinlocks are fast locks, intended for small CSs. The code inside
   the CS is not allowed to block (or take too long).

   The spinlock is implemented literally as an infinite loop that
   checks the value of some variable using an atomic instruction like
   test-and-set: e.g., test in a loop whether the register assigned to
   lock L is set to 1, and if not, set it to 1 and enter the CS (a
   user-space C11 sketch of this loop appears at the end of these
   notes):

    lock_spinlock(L):
        while (test-and-set(L, 1))
            ; // spin until we're the one who flips L from 0 to 1

   Spinlocks are good for updating data in memory (RAM), b/c that's
   relatively fast -- for example, updating the offset field inside
   struct file. Spinlocks are exclusive: only one lock owner. And one
   may not block on I/O (i.e., enter the WAIT state) inside a
   spinlock, b/c everyone waiting for the lock is spinning in an
   infinite loop. Spinlocks are fast and can be implemented with just
   a few assembly instructions.

2. mutex

   A mutually exclusive lock. A mutex is also exclusive (only one
   owner may enter the CS). A mutex allows one to block (or take "a
   long time," relatively speaking) inside the CS. But anyone waiting
   on the mutex is placed into the scheduler's WAIT state. This means
   that those who wait for the mutex are no longer running: they were
   moved from the RUNNING to the WAIT state, and may have to wait a
   relatively long time before they can be woken up.

   If your code tries to grab a mutex and fails to get the lock, then
   this code is now WAITing. So when will this blocked thread wake up?
   A: when the lock owner has exited the CS and issued an unlock on
   the mutex. All those who wait on a mutex WAIT in a queue associated
   with the mutex. Upon unlock(mutex), at least one of them will be
   woken up and allowed to enter their CS.

   The implementation of mutex un/lock code is much longer and more
   complex than a spinlock's. Thus, you use a mutex when the CS is
   "long enough" or when you expect to block on something else that's
   slow (I/O). A mutex is "heavier" than a spinlock because it has to
   maintain a wait queue, manage all those that are blocked, wake
   blocked threads up, etc. (E.g., don't use a mutex for a short CS
   where, say, you just update one integer.)
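To tie the spinlock loop back to the hardware primitives: here is a
teaching sketch of that while(test-and-set) loop in user-space C11,
using atomic_flag. atomic_flag_test_and_set() is a true TAS: it
atomically sets the flag and returns its OLD value. (This is NOT how
the Linux kernel implements spinlock_t; it just illustrates the idea.)

    #include <stdatomic.h>

    static atomic_flag L = ATOMIC_FLAG_INIT;

    void spin_lock(void)
    {
        while (atomic_flag_test_and_set(&L))
            ; /* flag was already 1: someone owns the lock; keep spinning */
    }

    void spin_unlock(void)
    {
        atomic_flag_clear(&L); /* back to 0; one spinner can now win */
    }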
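Finally, a sketch of picking the lock to fit the CS, using the
user-space POSIX analogs pthread_spinlock_t and pthread_mutex_t (the
fields and function names here are hypothetical): a spinlock for a
tiny in-RAM update, a mutex where the CS may block on I/O.

    #include <pthread.h>
    #include <unistd.h>

    /* init once: pthread_spin_init(&off_lock, PTHREAD_PROCESS_PRIVATE) */
    static pthread_spinlock_t off_lock;
    static pthread_mutex_t    log_lock = PTHREAD_MUTEX_INITIALIZER;
    static long file_offset;
    static int  log_fd;

    void bump_offset(long n)   /* tiny, in-RAM CS: a spinlock is cheap */
    {
        pthread_spin_lock(&off_lock);
        file_offset += n;
        pthread_spin_unlock(&off_lock);
    }

    void append_log(const char *buf, size_t len) /* CS may block on I/O */
    {
        pthread_mutex_lock(&log_lock);
        write(log_fd, buf, len); /* may put us in the WAIT state */
        pthread_mutex_unlock(&log_lock);
    }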