* A generalized producer-consumer asynchronous queuing system

A generalization of rwsem.
Producer: an entity who asks for some work to be done.
Consumer: another entity who actually does the work, some time later.

Often useful when the effort to "produce" (ask for) work is much smaller than the effort to "consume" (actually do) the work. Why? If the producer has to wait a long time for the actual work, they just sit WAITing, doing nothing else useful; instead, the producer can "submit" the job to someone else to do, and ask to be told/informed when the work is done. That way, the producer can go do other things while the consumers perform the work itself.

Examples: kernel logs, I/O subsystems (e.g., getting a block from disk). If you issue a read(2) syscall, you are blocked from doing anything else. Sometimes we say that 'reads' are synchronous, b/c you have to wait for them. But if you call aio_read(3), that call returns immediately even if the data is not available yet. You have to pass a callback fxn to aio_read, and the kernel will invoke your callback when the data actually arrives (which could be many milliseconds or even seconds later).

A producer-consumer async system has:
1. One or more producers (e.g., users or applications submitting work to be done).
2. One or more consumer threads that process the submitted jobs in some order.
3. A queue of jobs submitted, in some order. You can use one FCFS queue, or have multiple queues with priorities.
4. A limit on the size of the queue, so it won't grow unbounded.
5. Internal locks (spinlocks? mutexes? others) that protect internal data structures.

Normal operation:
1. Users submit a job, which gets queued for later processing; the users (producers) then return immediately and can go do something else. If producers want to be informed when the work is done, they have to provide a callback function (and possibly a "void *arg" for the consumer to place data into). Note: the telltale sign of an async processing system is when the function you call takes some sort of a callback fxn and possibly a void*.
2. Consumers pick up jobs, one at a time, and process them. Recall that processing a job takes longer than it took to produce it. When consumers are done, they return results back to the producers.

Q: Since producing work is faster than consuming it, how do we prevent too many producers from adding too much work to the queue? A rogue producer could cause the kernel to run out of memory, or fill up the CPU with useless cycles (a form of a denial-of-service attack).

A: We have to limit the size of the queue. Jobs can be added to the queue up to some MAX, and we can return immediately to the producer. But if we reach the MAX queue length, then we have to block new producers from adding more work to the queue. Same as with the rwsem code: we want to put them into a WAIT queue (producers waiting to add work to the queue itself). That's better than returning some error or "try again later", b/c users will just retry in an inf. loop. If there's a WAIT state for producers, then we need to record the list of waiters in a wait_q (just like in the rwsem code). Note that even this wait_q may have a limit, and at some point you may have to take a more drastic action (such as returning ENOMEM).

In a producer-consumer system, blocking new producers from doing any work, and actually putting them to sleep (WAIT state), is called "throttling the heavy writers." Here, the "writers" are those who want to add work to the queue.
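To make this concrete, here is a minimal user-level sketch of such a throttled submit path, using pthreads primitives in place of the kernel's internal locks and wait_q's. All names (submit_job, QMAX, qdepth, etc.) are made up for illustration, not taken from any real API:

    #include <pthread.h>

    #define QMAX 64    /* limit on queue depth so it can't grow unbounded */

    struct job {
        void (*cb)(void *);    /* callback: the work + completion notice */
        void *arg;             /* producer-supplied argument for cb */
    };

    static struct job jobq[QMAX];          /* the queue itself (circular) */
    static int qhead, qtail, qdepth;       /* qdepth = current queue depth */
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t producer_wq = PTHREAD_COND_INITIALIZER; /* waiters: queue full */
    static pthread_cond_t consumer_wq = PTHREAD_COND_INITIALIZER; /* waiters: queue empty */

    /* Producer side: enqueue and return right away; block (throttle the
     * heavy writers) only when the queue is already at QMAX. */
    void submit_job(void (*cb)(void *), void *arg)
    {
        pthread_mutex_lock(&qlock);
        while (qdepth == QMAX)             /* queue full: join the wait_q */
            pthread_cond_wait(&producer_wq, &qlock);
        jobq[qtail] = (struct job){ cb, arg };
        qtail = (qtail + 1) % QMAX;
        qdepth++;
        pthread_cond_signal(&consumer_wq); /* wake at least one consumer */
        pthread_mutex_unlock(&qlock);
    }

Note that submit_job takes a callback fxn plus a void*: the telltale async pattern described above.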
Clearly we need to keep a count of how many jobs are in the queue waiting for a free consumer. That count is called the "queue depth". There are some consumers (workers) that run. When a consumer runs, it picks up an entry from the queue and processes it; when done, it returns results to the producer (user) who submitted the job. Producers P1..Pm submit jobs J1..Ji into the queue; consumers C1..Cn pick them up:

    P1                                         C1
    P2                                         C2
    P3  -->  Queue[J1, J2, J3, ..., Ji]  -->   C3
    ...                                        ...
    Pm                                         Cn

States of the system:
1. The queue depth is one of:
   - 0 (empty, no jobs)
   - MAX (full, and there may be WAITing blocked producers)
   - in between 1 and MAX-1 (there's some work happening, and some room in the queue)
2. Some number of producers blocked waiting.
3. Some number of consumers (workers): some running, and some waiting.

Consumers (the "worker bees"): when the queue depth is 0, consumers should be sleeping! That means we need a wait_q for consumers too. Who will wake up a consumer? A producer will wake up at least one consumer; if there's more work to be done, producers should wake up even more (or even all) consumers. When consumers are done with their work, they can check the queue: if there's work in it, they pick up the next queued job and process it. Producers and consumers have to manage the queue and the queue-depth counter (under some lock), incrementing/decrementing it as needed. When a consumer (worker) is done with its work, checks the queue, and finds no more work to be done, it should put itself to sleep (WAIT state). If a consumer finished processing a job, the queue depth has shrunk below MAX, and there are waiters in the producer (user) wait_q, then it can wake up at least one producer; that producer can then run, add its work to the queue, and return. The consumer loop is sketched below.

Q: How much work is too much work for a consumer? That is, when should we wake up more than one consumer for a single job?
A: Here, the design is "a single job for a single producer/consumer". So you'd have to have designed your work to be broken into many smaller tasks that can be performed by multiple workers (this is the cloud micro-services and function-as-a-service, FaaS, or "serverless" computing future).
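Continuing the same hypothetical sketch from above, a consumer's main loop might look like this; the actual work runs outside the lock, since T(C) is much larger than T(P):

    /* Consumer side (same sketch): loop forever, sleep when the queue is
     * empty, and wake one throttled producer whenever a slot frees up. */
    void *consumer_thread(void *unused)
    {
        (void)unused;
        for (;;) {
            pthread_mutex_lock(&qlock);
            while (qdepth == 0)            /* no jobs: sleep in our wait_q */
                pthread_cond_wait(&consumer_wq, &qlock);
            struct job j = jobq[qhead];
            qhead = (qhead + 1) % QMAX;
            qdepth--;                      /* room below MAX again... */
            pthread_cond_signal(&producer_wq); /* ...so wake a producer */
            pthread_mutex_unlock(&qlock);

            j.cb(j.arg);   /* do the (slow) work outside the lock */
        }
        return NULL;
    }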
* General state of a queuing system

- T(P): the time it takes to produce a job
- T(C): the time it takes to consume (process or work on) a job
- A P-C system is most useful when T(P) << T(C).
- R(P) and R(C): the corresponding rates at which jobs are produced and consumed (roughly 1/T(P) and 1/T(C)).

1. What happens if R(P) >> R(C)? Steady state: the queue is full to the MAX, and everyone else (the vast majority) is blocked. Result: an inefficient system with little throughput.
2. What happens if R(P) << R(C)? Steady state: the queue depth will be 0 on average, with little to no work to be done; the system and its workers/consumers sit idle. Result: an inefficient system wasting resources.
3. What happens if R(P) = R(C) + e, where (e)psilon is some small number? In other words, R(P) is just slightly faster than R(C). Steady state: ALSO the queue fills to the MAX (it just takes a bit longer, depending on the value of MAX), and everyone else (the vast majority) is blocked. Result: ALSO an inefficient system with little throughput.
4. What happens if R(P) + e = R(C)? Or, R(C) is slightly faster than R(P)? Steady state: ALSO, eventually, the queue depth will be 0 on average, with little to no work to be done; the system and its workers/consumers sit idle. Result: an inefficient system wasting resources.

In sum, it's difficult to get a perfect producer-consumer system that is always busy doing something useful. A lot of work has been done and is ongoing in distributed systems and resource management (e.g., in clouds) to get a "balanced" queue where the rate of producers is as close as possible to the rate of consumers. In practice, systems/users try to dynamically update the MAX value, spawn new instances of virtual machines with more workers, spin them down when not needed, etc.

Question: what is the "right number" of consumers (workers) to have? Recall that some may be sleeping, but what is the right maximum no. of consumers? Assume we're on a system with one CPU that has 6 cores. The max no. of consumers is often set to the no. of cores in the system, but it's worth doing a study to see whether fewer or more workers will generate more throughput. Sometimes the work involves interleaving of CPU and I/O, so not all cores are busy all the time, and you can afford to run more workers than cores. Sometimes your OS scheduler is so busy with other tasks that even one worker per core is too much: if you have too many workers, you wind up wasting more time on scheduling and context switching than getting useful work done. Yes, there's a lot of work/research on dynamically spawning and de-spawning workers on a single system. In a Linux system, every process shown in [square brackets] (e.g., in ps output) is actually a kernel thread (i.e., one that doesn't really have a "user" context). There are a lot of them (mostly sleeping) these days on any OS such as Linux, b/c they make the kernel run more efficiently.
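As a starting point, one might spawn one worker per core and measure from there. A sketch, reusing the hypothetical consumer_thread above:

    #include <unistd.h>

    /* Spawn one worker per online core as a first guess; measure before
     * tuning the count up or down. */
    void start_workers(void)
    {
        long i, n = sysconf(_SC_NPROCESSORS_ONLN);  /* no. of online cores */
        for (i = 0; i < n; i++) {
            pthread_t t;
            pthread_create(&t, NULL, consumer_thread, NULL);
            pthread_detach(t);
        }
    }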
* RCU (Read-Copy-Update)

An RCU is a type of lock that permits you to have an arbitrarily long critical section (CS). Normally, when you're in a CS, others can't get in, or the system throughput is lowered -- so the goal is often for lock owners to keep their CS as small/short as possible.

Let's assume that the RCU protects a data structure that's just a list of sorted items/strings: LIST = [A, C, D]. It also includes a version number counting how many modifications the LIST has had; let's say V=1.

1. You read and make a copy of the protected d-s under a quick lock, then release the lock.
2. Now the owner has a copy of the data that they can work on all they want, for however long they want to. The owner of the copy can modify the d-s too.
3. Optional: if the owner wants to sync the "master" or original copy of the data with their modified list, then they have to grab the lock again, this time "merging" changes into the master list, and finally release it.

User A:
1. rcu_lock() -- usually that's a fast EXCLUSIVE spinlock
2. make a memory copy of LIST+V, into LIST' and V' (this is usually a quick action)
3. rcu_unlock()
4. Now user A can inspect and manipulate LIST' all it wants, b/c it has a private copy of the data. LIST' = [A, C, D].
4a. User A may not care if the original LIST is being changed by others, b/c there are some cases where just having a "snapshot" of the data at some point in time is good enough.
4b. User A may not even need to modify LIST' (using it for readonly purposes).
In cases 4a and 4b, when user A is done with LIST', it can just release/free it. You don't even need to go into the (more complex) "update" phase.
4c. User A has modified LIST', but now it wants to sync or flush those changes to the master/orig LIST. Let's say LIST' deleted C, and added B and E: LIST' = [A, B, D, E]; V'=1. So now user A has to go into the "update" phase:
5. rcu_lock()
6. update LIST from LIST': this is the complex phase. User A has to inspect the CURRENT state of LIST+V vs. what LIST'+V' are:
6a. Let's say that LIST = [A, C, D]; V=1 and LIST' = [A, B, D, E]; V'=1. Note that V=V', so the fastest thing would be to:
    - delete/free LIST
    - rename LIST' to LIST
    - increment V to 2
This would have been the simplest case.
6b. Let's say that LIST = [C, D, G, F]; V=4 and LIST' = [A, B, D, E]; V'=1. V>V' means that *others* have managed to make a copy of the orig LIST, make changes to it, AND merge (update) those changes back into LIST! So now user A has a harder job: they have to figure out how to merge the lists and what has changed (for example, that item 'A' was removed, and G+F were added), so user A has to produce a new LIST that looks like LIST = [B, D, E, G, F]; V=5. The work in 6b is much, much more than in 6a.
7. rcu_unlock()

Overall, RCUs are interesting locks:
- For users who don't need the "update" phase, they're very fast (a spinlock held only long enough to make a copy); then you can "sit on" your copy however long you want.
- For users who do need the "update" phase, there's an incentive not to "sit on" modified changes w/o merging them quickly; else you may find yourself having to do a complex 3-way merge.
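Below is a minimal userspace sketch of the copy/update protocol these notes describe (the copy-and-merge scheme above, not the Linux kernel's rcu_read_lock() API). A pthread mutex stands in for the fast exclusive spinlock, and all names (vlist, rcu_copy, rcu_update) are made up for illustration:

    #include <pthread.h>

    struct vlist {
        char items[16];    /* sorted item tags, e.g., "ACD" for [A, C, D] */
        int  n;            /* number of items */
        int  v;            /* version: how many merges the master has had */
    };

    static struct vlist master = { "ACD", 3, 1 };   /* LIST, V=1 */
    static pthread_mutex_t rcu_lk = PTHREAD_MUTEX_INITIALIZER;

    /* Steps 1-3: quick snapshot of LIST+V into LIST'+V' under the lock. */
    struct vlist rcu_copy(void)
    {
        pthread_mutex_lock(&rcu_lk);
        struct vlist copy = master;
        pthread_mutex_unlock(&rcu_lk);
        return copy;       /* caller may "sit on" this however long it wants */
    }

    /* Steps 5-7: try to flush a modified copy back to the master.
     * Returns 0 on the fast path (case 6a: versions match, the copy
     * simply replaces the master); -1 if others merged first (case 6b)
     * and the caller must do the harder 3-way merge before retrying. */
    int rcu_update(const struct vlist *mine)
    {
        int ret = 0;
        pthread_mutex_lock(&rcu_lk);
        if (master.v == mine->v) {    /* case 6a: V == V' */
            master = *mine;
            master.v++;               /* bump the version */
        } else {                      /* case 6b: V > V', caller must merge */
            ret = -1;
        }
        pthread_mutex_unlock(&rcu_lk);
        return ret;
    }

Note how this version check captures the incentive above: the longer you sit on a modified copy, the likelier rcu_update() fails and you pay for the 3-way merge.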