* Packet processing inside the OS

Recall that the NIC will interrupt the CPU, handing it a new packet to process. This packet will be placed into an SKB. The SKB will be added to a queue of "just received packets" to process; that queue lives at the lowest layer of the networking stack (e.g., the device driver). How do we get the packet processed further? Eventually, this packet is part of a larger payload that a user process is waiting for -- e.g., in an accept(2) or read(2) syscall on a socket.

Past systems had the networking ihandler try to process the packet as much as possible, but this made the ihandler run longer (bad: b/c interrupts are blocked). Linux decided to design a whole system of async packet processing, using lots of queues. The design involves creating a bunch of "software based" interrupt handlers, or SOFT-IRQs.

* Software IRQs:

1. One or more queues, where each queue processes packets (or other info) at a certain level or of a certain type:
   SOFTIRQ_NET_TX: queue for processing network packets being transmitted
   SOFTIRQ_NET_RX: queue for processing network packets being received
   There are lots of other queues, for disks/storage, etc. Each SOFTIRQ_* is a bit in a bitmap of all possible SOFTIRQs.
2. When Linux starts, it initializes multiple softirqd kernel threads (by default, one per CPU core). These are the consumers that process softirq-queued work. These kthreads sleep until there's work to be done.
3. When anyone adds work to any queue controlled by a SOFTIRQ, they have to inform the softirq system that new work has been added. This is a trigger, to ensure that at least one softirqd kthread is woken up to process the work.
4. When the kthreads are done with their work, they check whether there's more work to be done in any other queue. If so, they pick up more work; else, they go to sleep.

* Network softirq processing:

At the lowest layer (device drivers), a softirqd kthread wakes up, finds that the bitmap of softirqs has SOFTIRQ_NET_RX set, so it knows that some incoming packet needs to be processed. The kthread finds the queue with pending packets to be processed. It picks up one "object" (e.g., an SKB with a packet fragment) and processes it. At the lowest net layers, packet processing is often related to Ethernet frames: verify the frame, strip headers/trailers, maybe verify a checksum for packet integrity, etc. Another example would be to find two related Ethernet frames (SKBs) and merge them into a larger payload.

Once this kthread has finished processing the "ethernet" SKB, it adds it to another softirq queue for further processing. For example, as packets go up, after Ethernet, the next processing layer is IP. Note: the same kthread that was the consumer of the device-driver-level (Ethernet) SKB is also acting as a producer, adding work to the next logical layer up the chain. The softirq processing doesn't try to do too much work: just what's needed at this layer, then add a new job to be processed at the next layer.

At the next layer, e.g., IP: when another softirqd kthread wakes up, it'll pick up the job that was added before and do IP-level processing. At IP, processing includes packet filtering (firewalling, network address translation, etc.). This may include merging SKBs, or even modifying the content of a packet (e.g., filtering, redirection, NAT). Again, once the kthread processes the job at this (IP) layer, it'll produce a new job for the next layer up (UDP, TCP, etc.).
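A toy user-space version of this per-layer producer/consumer pattern, using pthreads as a stand-in for softirqd kthreads. This is a sketch only: the names (raise_work, softirqd, L_ETH, ...) are invented here for illustration and are not the kernel's actual softirq API.

/*
 * User-space analogy of the softirq pattern described above (hypothetical
 * names; NOT the kernel's real softirq API).  Each layer has a bounded
 * queue; a "pending" bitmap records which layers have work; worker threads
 * consume a job at one layer, then produce a job for the layer above it.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define QDEPTH 16

enum layer { L_ETH, L_IP, L_TCP, N_LAYERS };   /* stand-ins for softirq types */

struct queue {
    void *items[QDEPTH];
    int head, tail, count;
};

static struct queue queues[N_LAYERS];
static unsigned pending_bitmap;                 /* bit i set => layer i has work */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_available = PTHREAD_COND_INITIALIZER;

/* Enqueue a job at a layer and wake at least one worker ("raise the softirq"). */
static void raise_work(int l, void *skb)
{
    pthread_mutex_lock(&lock);
    struct queue *q = &queues[l];
    if (q->count < QDEPTH) {                    /* drop if full (real code would throttle) */
        q->items[q->tail] = skb;
        q->tail = (q->tail + 1) % QDEPTH;
        q->count++;
        pending_bitmap |= 1u << l;
        pthread_cond_signal(&work_available);   /* wake a sleeping "softirqd" */
    }
    pthread_mutex_unlock(&lock);
}

/* One "softirqd" worker: find any layer with pending work, process one job. */
static void *softirqd(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (pending_bitmap == 0)
            pthread_cond_wait(&work_available, &lock);

        int l;
        for (l = 0; l < N_LAYERS; l++)          /* pick some layer with work */
            if (pending_bitmap & (1u << l))
                break;

        struct queue *q = &queues[l];
        void *skb = q->items[q->head];
        q->head = (q->head + 1) % QDEPTH;
        if (--q->count == 0)
            pending_bitmap &= ~(1u << l);
        pthread_mutex_unlock(&lock);

        /* Do only this layer's share of the work ... */
        printf("layer %d processed job %p\n", l, skb);

        /* ... then act as a producer for the next layer up, if any. */
        if (l + 1 < N_LAYERS)
            raise_work(l + 1, skb);
    }
    return NULL;
}

int main(void)
{
    pthread_t workers[2];
    for (int i = 0; i < 2; i++)                 /* "one softirqd per CPU" analogy */
        pthread_create(&workers[i], NULL, softirqd, NULL);

    for (long i = 1; i <= 4; i++)               /* pretend the NIC delivered 4 SKBs */
        raise_work(L_ETH, (void *)i);

    sleep(1);                                   /* let the workers drain the queues */
    return 0;
}

Note how the same worker that consumes a job at one layer becomes the producer for the layer above it, and does only one layer's worth of work per job.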
Again, softirqd will wake up a kthread to process this new packet, and the processing continues. Suppose a user process was waiting all along in a read(2) on a socket, waiting for this data to arrive. This means that the user process would be sleeping (in WAIT state) all this time.

Q: Given a packet, what piece of info do we need to identify the process (PID) that was waiting for this packet?
A: The port number. Each packet carries a port number, and each process was sending requests from a given port.

Eventually, the packet fragments go up the layers of networking and reach the top-most layer (Socket API + VFS). Now we have assembled several fragments into a data payload that needs to be given to a user process. The topmost softirqd processing will find the port number in the packet, then find out which process was waiting on that port, and finally it will copy the SKB data into the __user buf in sys_read(). Then the softirq can also move that process from WAIT state to READY state, so the scheduler will run this process eventually.

* How best to process packets being received

Recall that if the NIC is too busy, it won't take packets (this throttles the senders). What if the OS itself is busy -- that is, the lowest queue in the OS (e.g., the device driver's) is full? Every queue needs a limit on its size/depth, else you risk running out of mem in the kernel! If the OS is busy processing other packets, then interrupts are disabled temporarily, hence the OS can't take the packet b/c the NIC's interrupt will be ignored. The NIC will have to try again later. The OS can choose to ignore interrupts (or just those from certain PCI devices) for a period of time, just so it can drain its own queues. This is how the OS can tell the NIC to throttle back. Each softirq queue, at any layer, has its own limit on size. So at any point in time, any queue of packets to process could be full.

Let's say we have this arrangement:
1. User process (waiting for packet) -- consider this level 0 (L0)
2. kernel queue at level 1 (e.g., VFS/Socket API)
3. kqueue level 2 (next level down)
4. kqueue level 3 (next level down)
5. .....
10. lowest level queue (L10)
11. NIC (consider this L11)

Consider this just a series of queues that each have a limit. It means that processing at any layer can be blocked because the layer above it is full. If we're receiving packets, then processing goes up from L11 to L0. So if L4 is full, then L5 cannot "submit" or add more work to L4. That means that the processing at L5, now acting as a producer of work for L4, should be throttled or stopped -- until L4 has room again. So, if L4 is full and L5 can't add more work to it, then the queue at L5 becomes more and more full. If L5 fills up, then eventually L6, then L7, ..., then the NIC will fill up and throttle, and then the senders on the Internet.

Lesson: if some layer N is full, it'll cause work at layer N+1 below it to fill up and throttle, and this will propagate further down the layers, until eventually all queues are throttled and blocked.

Given this, for SOFTIRQ_NET_RX (packets being received), what is the best strategy for processing work when there are jobs at multiple network-related layers? Start processing from the top-most queues of the system. Once you drain work from those queues, work at the queues just below can proceed, etc., all the way down the chain. Ideally start with the user processes waiting on data, b/c once you copy_to_user that data, you can discard those jobs in the kernel (free up the SKBs), and move on to process the next layer down.
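A minimal sketch of this bounded-queue chain and the drain-from-the-top receive policy. The arrays, limits, and helper names (push_up, pick_rx_layer) are invented for illustration; this is a toy model, not kernel code.

/*
 * Chain of bounded queues: L0 = user side, L11 = NIC, as in the list above.
 * A producer may only hand work to the layer above it if that queue has
 * room (backpressure); the RX scheduler always services the top-most layer
 * with pending work.
 */
#include <stdbool.h>
#include <stdio.h>

#define NLAYERS 12          /* L0 (user) .. L11 (NIC) */
#define QLIMIT  8           /* every queue must be bounded */

static int pending[NLAYERS];   /* number of queued jobs at each layer */

/* Move one job from layer `from` up to layer `from - 1`, but only if the
 * queue above has room; otherwise the producer must hold the job
 * (backpressure), and its own queue starts to fill. */
static bool push_up(int from)
{
    int to = from - 1;
    if (to < 0 || pending[to] >= QLIMIT)
        return false;                /* layer above is full: throttle */
    pending[from]--;
    pending[to]++;
    return true;
}

/* RX policy: service the top-most layer (closest to the user) that has
 * work, so payloads get copied out and their SKBs freed as early as
 * possible, which unblocks every layer below. */
static int pick_rx_layer(void)
{
    for (int l = 0; l < NLAYERS; l++)
        if (pending[l] > 0)
            return l;
    return -1;                       /* nothing to do */
}

int main(void)
{
    pending[NLAYERS - 1] = 20;       /* the NIC (L11) just delivered a burst */

    int layer;
    while ((layer = pick_rx_layer()) != -1) {
        if (layer == 0) {
            pending[0]--;            /* copy_to_user + free the SKB */
            continue;
        }
        if (!push_up(layer))         /* layer above full: would throttle here */
            break;
    }

    /* With drain-from-the-top, each frame is pushed all the way up and
     * freed before much else piles up, so intermediate queues stay nearly
     * empty.  A bottom-first policy would instead let work pile up in the
     * queues below any full layer. */
    for (int l = 0; l < NLAYERS; l++)
        printf("L%d: %d pending\n", l, pending[l]);
    return 0;
}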
* Processing packets being transmitted (SOFTIRQ_NET_TX)

Here, the process is reversed:
1. User processes produce work, which is added to softirq queues at the top layer of the OS.
2. Top layers do async processing, then add work to the next layer down, etc.
3. Eventually, queued data reaches the device driver, and you try to hand the work to the NIC.
4. The NIC takes the work and processes it.

If the NIC can't process a packet (transmit it over the wire), then the NIC has to wait longer and can't free up room. The NIC will then signal the OS to throttle back (XON/XOFF). This in turn may cause the lowest-level softirq queue in the OS to fill up, which could then make the next level up fill up, etc. So for transmission, the throttling proceeds from the bottom layers to the top -- eventually user processes trying to send(2) or write(2) data will be blocked (WAIT state) so they don't produce more work, giving the OS + NIC a chance to drain their own queues. For transmission, it's best to process packets/work from the BOTTOM layers first, so that upper layers can be freed up, and eventually user processes are unblocked (so they can send more data).
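A minimal sketch of this transmit-side blocking, assuming a single bounded queue between the upper layers and the NIC. Pthread condition variables stand in for the kernel's sleep/wakeup; the names (tx_enqueue, nic_thread) are invented for illustration.

/*
 * When the bottom queue is full (the NIC is slow, or has asserted XOFF),
 * the producer blocks -- the user-space analogue of a process sleeping
 * inside send(2)/write(2) until room opens up below.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define TXQ_DEPTH 4

static int txq[TXQ_DEPTH];
static int txq_count, txq_head, txq_tail;
static pthread_mutex_t txq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t txq_not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t txq_not_empty = PTHREAD_COND_INITIALIZER;

/* Upper layer (ultimately a user process in write(2)): blocks when full. */
static void tx_enqueue(int pkt)
{
    pthread_mutex_lock(&txq_lock);
    while (txq_count == TXQ_DEPTH)              /* queue full: sleep (WAIT state) */
        pthread_cond_wait(&txq_not_full, &txq_lock);
    txq[txq_tail] = pkt;
    txq_tail = (txq_tail + 1) % TXQ_DEPTH;
    txq_count++;
    pthread_cond_signal(&txq_not_empty);
    pthread_mutex_unlock(&txq_lock);
}

/* "NIC" thread: drains slowly; every packet it sends frees a slot, which
 * wakes one blocked sender -- throttling is released bottom-up. */
static void *nic_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&txq_lock);
        while (txq_count == 0)
            pthread_cond_wait(&txq_not_empty, &txq_lock);
        int pkt = txq[txq_head];
        txq_head = (txq_head + 1) % TXQ_DEPTH;
        txq_count--;
        pthread_cond_signal(&txq_not_full);     /* room freed: unblock a sender */
        pthread_mutex_unlock(&txq_lock);

        usleep(100000);                          /* pretend the wire is slow */
        printf("NIC transmitted packet %d\n", pkt);
    }
    return NULL;
}

int main(void)
{
    pthread_t nic;
    pthread_create(&nic, NULL, nic_thread, NULL);

    /* The "user process": tries to send 10 packets as fast as it can, but
     * blocks in tx_enqueue() whenever the bottom queue is full. */
    for (int i = 0; i < 10; i++) {
        tx_enqueue(i);
        printf("queued packet %d\n", i);
    }

    sleep(2);                                    /* let the NIC finish draining */
    return 0;
}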
* Linux c. 1999

Initially, Linux started with a simple locking policy: use one Big Kernel Lock (BKL) for all data structures.
Pros:
- simple, one lock controls everything
Cons:
- performance (esp. when you have multiple CPUs/cores protecting entirely different data structures)

With one lock, there's less of a chance of deadlocks, but deadlocks are still possible:
1. self deadlock (the same thread locking the BKL twice)
2. async processing happening with NICs, storage, etc.

A WSJ article reported third-party testing of Web servers running on the same h/w, Linux vs. Windows NT, on 3 platforms:
1. single CPU
2. dual CPU
3. quad CPU
The original study showed that NT performed better than Linux on all 3 platforms. After optimizing the Linux kernel, the third party reran the study and found that:
1. single CPU: Linux was better than Windows NT
2. dual CPU: Linux performed about the same as Windows NT
3. quad CPU: Windows NT still performed better

At the time, Linux was still using the BKL for everything, including networking. Most other OSs also used a single "big" lock: there was no reason to have more locks b/c single-CPU systems were dominant, and there were no multi-core systems back then. Windows NT also used to have a single kernel lock, but Microsoft had just released a major patch (Service Pack 3, SP3). In SP3, Windows replaced the big kernel lock with per-CPU and per-NIC locks for networking. As a result, Linux started a new line of kernels (2.4), rewrote the networking stack from scratch, and started to abandon the BKL in favor of many more locks of different types (b/c different lock types are better suited to different uses).

Generally, the progression is:
1. take the BKL and break it up
2. a lock per subsystem: file systems, networking, mem mgmt, processes
3. then break those locks into smaller ones and push them downwards -- a lock per file system, a lock per NIC, etc.
4. then add locks per data structure (inode, dentry, etc.)
5. then add locks for sub-fields of structures

Overall, you continuously break bigger locks into smaller ones (see the sketch below):
Pros:
- you can get much better performance and throughput, esp. on systems with lots of CPUs
Cons:
- kernel programming becomes much harder
- the chances of races and deadlocks grow a lot
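A user-space sketch of the coarse- vs. fine-grained locking tradeoff, with pthread mutexes standing in for kernel locks. The inode struct and function names here are invented for illustration; this is not actual kernel code.

/*
 * With a single global lock, two threads touching unrelated inodes still
 * serialize; with one lock per inode, they can proceed in parallel.
 */
#include <pthread.h>
#include <stdio.h>

/* Coarse-grained: one "big kernel lock" protecting everything. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fine-grained: each structure carries its own lock. */
struct inode {
    pthread_mutex_t lock;
    long size;
};

static void inode_grow_coarse(struct inode *ino, long n)
{
    pthread_mutex_lock(&big_lock);      /* serializes ALL inodes, everywhere */
    ino->size += n;
    pthread_mutex_unlock(&big_lock);
}

static void inode_grow_fine(struct inode *ino, long n)
{
    pthread_mutex_lock(&ino->lock);     /* serializes only users of THIS inode */
    ino->size += n;
    pthread_mutex_unlock(&ino->lock);
}

int main(void)
{
    static struct inode a = { PTHREAD_MUTEX_INITIALIZER, 0 };
    static struct inode b = { PTHREAD_MUTEX_INITIALIZER, 0 };

    /* Under the big lock, updates to a and b contend even though they are
     * independent; with per-inode locks, threads working on different
     * inodes never wait for each other. */
    inode_grow_coarse(&a, 10);
    inode_grow_fine(&b, 20);

    printf("a.size=%ld b.size=%ld\n", a.size, b.size);
    return 0;
}

The fine-grained version scales better across CPUs, but every code path must now know which of the many locks to take, and in what order -- which is exactly where the extra races and deadlocks come from.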