* Memory Management

Memory is a precious resource that an OS needs to manage carefully (because it is PHYSICAL memory, not virtual). An OS should never run out of memory, and it tries its best not to return ENOMEM to anyone (in kernel or user space). If there isn't enough free memory, the OS will block applications and even kthreads that try to allocate more: it throttles those who try to allocate memory, to ensure that there's enough free memory to give them. Ironically, in order to free existing memory, the OS may sometimes need to *allocate* a small amount of memory, temporarily, just for management purposes.

The OS will pause/throttle anyone who tries to consume new memory, and go off and try to "clean" memory to make new room. There are many types of caches inside any OS: page cache, buffer cache, dentry/inode caches, caches kept by specific file systems and other modules, caches in networking at different layers (e.g., for packets), and more. The cleaning strategy is often to start with the "biggest bang for the buck" caches:

1. Start by going over the largest caches (those that manage the most memory). For example, start with the page cache, b/c it caches pages that are 4KB each (compared to, say, a cached dentry, which is much smaller).

2. Within the various caches, there is data that's "clean" vs. "dirty": clean data is an exact copy of the data in the backing store (the source of the cache); "dirty" data has been modified in the (RAM) cache but not yet flushed to the backing store (e.g., disk). Start with the clean data, b/c it can simply be dropped from the cache right away, which is fast. In the worst case, if you drop a cached entry that's needed later, then a cache miss will result, which starts a (slower) I/O operation to bring the data back from disk into the cache. You can drop clean pages in the page cache, unmodified inodes/dentries, etc. That's why it's useful to have all those list_head's inside struct inode/dentry: you can quickly find out which objects are clean vs. dirty. (A sketch of this clean-first scan appears below.)

After dropping as much clean data as you can, you start to look at dirty data. But one cannot remove dirty data from a cache until it has been flushed to the backing store persistently. This means it takes longer to clean up dirty memory (you have to wait for slower I/O operations).

Cleaning happens when:

1. Any user process or kernel thread asks for more memory, but there isn't enough. The caller is put to WAIT until cleaning has freed up enough memory.
2. Periodically: kernel threads wake up and check the status of the caches; if needed, these kthreads also call a cleaning routine.
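Here is a minimal sketch of the clean-first scan, assuming a hypothetical cache whose entries sit on separate clean/dirty lists (cache_entry, shrink_cache, and write_back are made-up names; real kernels use per-cache shrinker callbacks and LRU lists, but the principle is the same):

    #include <linux/list.h>
    #include <linux/slab.h>

    struct cache_entry {
            struct list_head lru;   /* links the entry into clean_list or dirty_list */
            /* ... cached payload ... */
    };

    static LIST_HEAD(clean_list);   /* entries identical to the backing store */
    static LIST_HEAD(dirty_list);   /* entries modified but not yet flushed */

    /* Try to reclaim up to 'goal' entries; take clean ones first (no I/O needed). */
    static int shrink_cache(int goal)
    {
            struct cache_entry *e, *tmp;
            int freed = 0;

            list_for_each_entry_safe(e, tmp, &clean_list, lru) {
                    if (freed >= goal)
                            return freed;
                    list_del(&e->lru);      /* clean: safe to drop right away */
                    kfree(e);
                    freed++;
            }
            /* Only if that wasn't enough do we touch dirty entries: each one
             * must be written back first, i.e., slow, blocking I/O. */
            list_for_each_entry_safe(e, tmp, &dirty_list, lru) {
                    if (freed >= goal)
                            break;
                    write_back(e);          /* assumed helper: flush to disk */
                    list_del(&e->lru);
                    kfree(e);
                    freed++;
            }
            return freed;
    }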
* Linux memory cleaning (old style, back in 2.6 kernels)

Some historic OSs (e.g., Sun Solaris, older BSDs) used to keep a certain percentage of memory "free" or in reserve, for emergencies and so that the OS itself would have enough memory to do the cleaning. For example, Solaris tried not to use up more than 80% of physical memory: that way, it still had ~20% of room to maneuver as needed. This 80/20 rule was configurable (you could raise/lower it), but users still complained about "wasting" memory by keeping it in reserve. These older OSs also had a default (but configurable) policy of flushing all dirty objects older than 5 seconds, to minimize the risk of lost data. More frequent flushing consumes too much I/O; less frequent flushing risks losing too much data in case of a kernel crash, power loss, or h/w failure.

Linux chose a different policy:

1. Linux would use as much of the memory as possible, up to almost 100%! That means there's less room for emergencies or other needed maintenance.
2. Linux decided to flush data every 30 seconds, but meta-data every 5 seconds (both are configurable). That's a recognition that m-d is more important than data, plus there's a lot less m-d than data.

Flushing dirty data in Linux:

1. Have two thresholds, high and low, designated as the percentage of memory that is "dirty". An example is H=70% and L=30% (both configurable).
2. Wake up periodically (kernel threads called bdflush or pdflush) and check the current percentage (CP) of dirty data in the caches (esp. the page cache, b/c it's big). This cleaning thread gets woken up every N seconds (e.g., N=5), and it also records the state it found in the last several invocations.

We have a race here b/t the producers (user processes whose actions result in more dirty data in the caches) and the consumer(s), the cleaning threads that wake up periodically. That's why we often want to stop the producers from generating more dirty data, by putting them to sleep (read: "throttling the heavy writers").

If CP < L:
- There's no emergency.
- Maybe just apply the normal 5s/30s flushing of some objects, asynchronously (meaning you submit them for flushing, then the cleaning kthread goes back to sleep).

If L <= CP < H:
- A more urgent situation: there's more dirty data now, so we have to do more cleaning.
- Pick a fixed number (e.g., 32) of dirty pages and flush them asynchronously.

If CP >= H:
- Much more urgent: too much dirty data in the caches; we have to bring the percentage back below H, to reduce the risk of losing data in case of system failure.
- Pick a fixed number (e.g., 32) of dirty pages and flush them SYNCHRONOUSLY.
- This means you stop other processes from generating new dirty data, and you submit the flush requests to the underlying software layers, but you WAIT until the lower layers confirm the commitment. (A sketch of this escalating logic appears below, after the ->writepage discussion.)

For example, file systems often implement struct page-based operations from "struct address_space_operations". There are methods such as ->writepage and ->writepages. The page cache calls a file system through these methods, asking it to flush one or more pages. ->writepage(s) takes a "struct page" arg, but also a "struct writeback_control". The writeback_control structure includes info and flags that instruct the called f/s what to do: one flag says "flush async", another says "flush sync". ->writepage(s) can be called when users make fsync(2)/fflush(3) calls, call sync(2), etc. But ->writepage(s) can also be called from the page cache as part of a memory-cleaning procedure. If a f/s is asked to flush a page/file synchronously, it has to do so (or try very, very hard), because when it returns to the caller (the page cache code), the page cache will REMOVE the page from memory, to make room.
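To make the ->writepage contract concrete, here's a skeleton of how a f/s might implement it. (struct writeback_control and its sync_mode field, with values WB_SYNC_ALL and WB_SYNC_NONE, are the real kernel interface; the myfs_* names are made up, and a real implementation must also handle page locking, writeback tagging, and error paths.)

    #include <linux/fs.h>
    #include <linux/writeback.h>

    /* Called by the page cache to flush one dirty page of this file. */
    static int myfs_writepage(struct page *page, struct writeback_control *wbc)
    {
            myfs_submit_io(page);                   /* made-up: queue the write */

            if (wbc->sync_mode == WB_SYNC_ALL) {
                    /* "flush sync": the caller may evict this page as soon as
                     * we return, so wait until the device confirms the write. */
                    return myfs_wait_io(page);      /* made-up: block on I/O */
            }
            /* "flush async" (WB_SYNC_NONE): I/O was submitted; don't wait. */
            return 0;
    }

    static const struct address_space_operations myfs_aops = {
            .writepage = myfs_writepage,
            /* ->writepages, ->readpage, etc. would go here too */
    };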
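And returning to the three CP cases above, a rough sketch of the escalating policy (dirty_pct, flush_old_objects_async, flush_pages, and throttle_dirtiers are made-up names; the real pdflush code was more involved, but it followed this shape):

    #define LOW_THRESH  30          /* L: % dirty below which all is calm */
    #define HIGH_THRESH 70          /* H: % dirty above which we must act hard */
    #define NR_TO_FLUSH 32          /* fixed batch of pages per pass */

    enum flush_mode { FLUSH_ASYNC, FLUSH_SYNC };

    /* Body of the periodic cleaning thread, woken up every N seconds. */
    static void clean_caches(void)
    {
            int cp = dirty_pct();   /* current % of memory that is dirty */

            if (cp < LOW_THRESH) {
                    /* No emergency: just the normal 5s/30s age-based
                     * flushing, submitted asynchronously. */
                    flush_old_objects_async();
            } else if (cp < HIGH_THRESH) {
                    /* More urgent: flush a fixed batch, but don't wait. */
                    flush_pages(NR_TO_FLUSH, FLUSH_ASYNC);
            } else {
                    /* Emergency: throttle the heavy writers, then flush
                     * synchronously and WAIT for the I/O to be confirmed. */
                    throttle_dirtiers();
                    flush_pages(NR_TO_FLUSH, FLUSH_SYNC);
            }
    }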
There's an "extra" emergency policy that Linux (and other OSs) perform: if you wake up repeatedly and still find that memory consumption is above H, or getting dangerously close to 100%, then the kernel invokes the Out Of Memory (OOM) "killer". The OOM killer (OOMK) simply picks the user process whose memory footprint is the largest, and kills it! OOMK may have to kill several processes in this emergency, just to bring the system's memory use under control. OOMK can be turned off on a per-process basis: useful if, say, you have an important server running (e.g., Web, database). Still, if you turn off OOMK for too many processes, you risk the kernel itself crashing, which'd be much worse (i.e., now ALL processes are dead). OOM doesn't happen often: it's more likely on systems with little physical RAM, too many "big" processes running, and perhaps no/little swap space configured.
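On modern Linux, the per-process knob is /proc/<pid>/oom_score_adj (older kernels used /proc/<pid>/oom_adj); writing the minimum value, -1000, exempts the process entirely. A minimal user-space sketch of a server protecting itself:

    #include <stdio.h>

    /* Exempt the calling process from the OOM killer: -1000 is the minimum
     * oom_score_adj. Lowering the value needs CAP_SYS_RESOURCE (or root). */
    int protect_from_oomk(void)
    {
            FILE *f = fopen("/proc/self/oom_score_adj", "w");

            if (!f)
                    return -1;
            fprintf(f, "-1000\n");
            return fclose(f);
    }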
* Modern Linux flushing

More complex than what it was in 2.6. A new system called Backing Device Information (BDI) was created, to handle several situations:

1. Flushing a file on an old hard disk vs. a newer/faster SSD can differ in latency by 10-100x. So the BDI system now accounts for the "rate" of flushing per device (e.g., try to flush to faster devices first).
2. The H/L thresholds were global (for the entire OS). Users wanted different thresholds for different devices (multiple storage devices, with different speeds/capabilities). Now you can have thresholds per device.
3. Users wanted thresholds per meta-device (e.g., a RAID5 array spanning 6 disks). Users also wanted per-file-system thresholds and BDI rate controls.

BDI is more versatile, but some say it is too complicated. Yet the principles still hold, where policies escalate depending on the current status of memory:

- When there's enough memory free, you can afford to take it easy with flushing.
- As memory pressure grows, you need to take more urgent actions.
- If memory pressure is very high, you may need to take an emergency action (like OOMK).

* Memory allocators

1. kmalloc: general-purpose memory allocator; can allocate data of almost any length. Pros: flexible, physically contiguous memory. Cons: fragmentation (after many alloc/free calls, you can have "gaps" in PHYSICAL memory that are too small to be useful). Recall that inode containers solved the problem of allocating a VFS inode + per-f/s inode in one contiguous allocation.

2. vmalloc: like kmalloc, but allocates VIRTUAL memory in the kernel. Pros: you can allocate "big" chunks of memory, which get mapped onto physical pages that need not be contiguous; little to no fragmentation. Cons: the extra mapping adds page-table/TLB overhead, and the allocation itself may block, so you can't use vmalloc anywhere you're not allowed to block (like inside a spinlock).

3. Page allocators. You can allocate one or more pages (4KB units). Useful b/c 4KB is the native physical memory unit in the kernel. There are multi-page allocators designed to allocate a power-of-two number of pages: you pass an "order" O, and the allocator returns 2^O pages (e.g., O=3 yields 2^3=8 pages, i.e., 32KB). Pros: you can get "enough" physical memory (though broken into 4KB pages), with little risk of fragmentation (compared to kmalloc). Cons: you have to use the entire page(s) you get, else memory is wasted; and if the pages you collect are not contiguous, your code has to break up its data into 4KB chunks and store each in its own page. (A sketch contrasting these three allocators appears at the end of this section.)

4. Custom allocators. We want to allocate multiple units of a size that's not 4KB (so a unit can't be a whole page), but we want to avoid the fragmentation that kmalloc suffers from, and we don't want to risk using virtual memory. A custom allocator is a wrapper around the page allocator(s). Let's say you have some struct foo whose sizeof is 117 bytes. Design an API: alloc_foo() and free_foo(ptr). Internally, you allocate a whole 4KB page (or more 4KB pages, using the page allocators). How many struct foo's can you fit inside a 4KB page? 4096/117 = 35 (which is exactly 4095 bytes, leaving only 1 byte unused).

Next, we can treat the 4KB page as an array of 35 of those struct foo's, and alloc_foo() can just pick a free one and return the starting addr of a 117-byte chunk inside the 4KB page.

Problem: we need to track the free list. How do we know which of the 35 are free vs. in use? Hint: use a bitmap!

Q: If I have room for 35 struct foo's, how many bits do I need?
A: 35 bits. But an int is 32 bits (a long is 64). We had only one byte left over (1B = 8 bits). We need at least 35 bits, so round up to the next multiple of sizeof(int): two 32-bit integers, i.e., 8 bytes or 64 bits. Since we only have one free byte, we have no choice but to give up one struct foo per page (34 instead of 35) to make room for the bitmap. 34x117 = 3978 bytes, which with the 8-byte bitmap leaves 4096-8-3978 = 110 bytes unused:

    void *page = alloc_page();      // returns one 4KB unit (simplified API)

    struct foo {
            char data[117];         // stand-in: whatever makes sizeof == 117
    };

    struct foo_allocator {
            int bitmap[2];          // sizeof==8: one bit per pool slot
            struct foo pool[34];    // in practice, '34' would be auto-computed
            char extra[110];        // 8 + 34*117 + 110 == 4096
    };

    struct foo_allocator *fap = (struct foo_allocator *) page;
    // alloc_foo()/free_foo() manipulate fap->pool and fap->bitmap

Pros:
1. The bitmap and pool are co-located close to each other in memory, which helps CPU cache-line efficiency.
2. No real fragmentation b/t objects.
3. Easy to find a free struct foo: just scan the bitmap looking for a "0" bit.
4. Also very quick to find which struct in the pool is free: the first bit in the bitmap corresponds to pool[0], the n-th bit to pool[n].
5. Very fast to "free" an object: just turn the respective bit from 1 to 0 in the bitmap.

Note that the addresses returned by alloc_foo() point at bytes inside the page (which we overlaid with struct foo_allocator).

Cons:
1. May waste a bit of "extra" memory at the end of the allocated page(s).

If you run out of room inside one page, then alloc_foo() can get one or more additional pages and use them the same way (another bitmap and pool of 34 struct foo's per page). The pages themselves may have to be connected together in some way (e.g., a linked list). Implementations can place a limit on the total number of pages allowed for struct foo's, and whole pages that no longer hold any struct foo can even be returned to the page allocator. If you allocate a whole page but use only one struct foo, that could be considered a waste; often, custom allocators ask you to estimate the number of objects you think you may need. Note: we probably need some sort of a lock inside struct foo_allocator. (A sketch of alloc_foo/free_foo follows.)
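Here's a sketch of alloc_foo()/free_foo() over a single foo_allocator page, using the structs above. (Simplified: one page only, no growth to extra pages; and for brevity the lock is a static kernel spinlock next to the allocator rather than a field inside struct foo_allocator, as the note above suggests.)

    #include <linux/spinlock.h>

    #define FOO_PER_PAGE 34

    static struct foo_allocator *fap;       /* set up once from alloc_page() */
    static DEFINE_SPINLOCK(foo_lock);       /* protects fap->bitmap */

    struct foo *alloc_foo(void)
    {
            int i;

            spin_lock(&foo_lock);
            for (i = 0; i < FOO_PER_PAGE; i++) {
                    /* bit i of the bitmap corresponds to pool[i] */
                    if (!(fap->bitmap[i / 32] & (1U << (i % 32)))) {
                            fap->bitmap[i / 32] |= 1U << (i % 32); /* mark used */
                            spin_unlock(&foo_lock);
                            return &fap->pool[i];
                    }
            }
            spin_unlock(&foo_lock);
            return NULL;    /* page full: a real impl would grab another page */
    }

    void free_foo(struct foo *p)
    {
            int i = p - fap->pool;  /* slot index, by pointer arithmetic */

            spin_lock(&foo_lock);
            fap->bitmap[i / 32] &= ~(1U << (i % 32));  /* clear: slot is free */
            spin_unlock(&foo_lock);
    }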
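Finally, circling back to the three general-purpose allocators (items 1-3 above), a short kernel-style sketch contrasting how each is called, assuming a context where blocking is allowed (GFP_KERNEL):

    #include <linux/slab.h>
    #include <linux/vmalloc.h>
    #include <linux/gfp.h>

    void allocator_examples(void)
    {
            /* kmalloc: physically contiguous, (almost) arbitrary size;
             * can fragment physical memory over time */
            char *buf = kmalloc(117, GFP_KERNEL);

            /* vmalloc: virtually contiguous "big" chunk; may block, so
             * never call it while holding a spinlock */
            char *big = vmalloc(16 * 1024 * 1024);          /* 16MB */

            /* page allocator: order O yields 2^O pages; O=3 -> 8 pages (32KB) */
            struct page *pages = alloc_pages(GFP_KERNEL, 3);

            /* ... use the memory (error checks omitted) ... */

            if (pages)
                    __free_pages(pages, 3);
            vfree(big);     /* NULL-safe */
            kfree(buf);     /* NULL-safe */
    }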