* Object management

Many objects get created at different times; many of them cache info that
came from slow I/O devices. For example, the dentry and inode are mainly
caching info from I/O devices. Also, file structs get created when you
open(2), and cleaned up when you close(2).

So when can we reclaim the memory of an object? How do we know if it's in
use or not?

Assume:  X -> Y

Q: Can I release the memory of object Y?
A: No, because that would leave a dangling pointer in X!
Q: Can I release the memory of object X?
A: Yes, b/c no one else depends on it.
Q: After releasing X, can I then release Y?
A: Yes, b/c the pointer b/t X and Y is no longer valid after X is released.

So, how do we manage this with lots of objects pointing to lots of other
objects, possibly in a many-to-one relationship?
A: the kernel uses reference counters (RC). An RC is a count in each object
of "how many other objects point to me?" We'll use the syntax X[RC] to
denote that object 'X' has a reference count value of 'RC'. So the above
example becomes:

	X[0] -> Y[1]

Rule: an object with RC=0 is not used by anyone else, and can be freed.
If an object X points to Y, then Y has to have an RC of at least 1. When X
is released, you have to decrement Y's RC by 1 when you "break the link":

1. X[0] -> Y[1]
2. release X:
   - Y->RC-- so now Y[1] becomes Y[0]
   - delete the link b/t X and Y (e.g., set the C ptr to NULL)
3. Now we just have: Y[0]
4. Next we can release Y

Similarly, any time a NEW object X adds a pointer to another object Y, you
have to increase the RC of Y: Y->RC++.

Examples:

	task -> F[1] -> D[1] -> I[1]

- when you close(fd) in userland, F[1] becomes F[0], and you can reclaim
  the mem for that struct file.
- once you reclaim F, then D->RC-- makes it 0, and it can be reclaimed too
- next, I->RC-- makes it 0 as well, and it can be reclaimed

Next: a hard-linked file looked up by multiple names:

	D2[1]
	     v
	D1[1]->I[2]

Next: the same (or a different) process opening the same file name more
than once:

	F1[1]
	     v
	F2[1]->D[2]->I[1]

Next: sometimes you see the same struct file with RC>1, e.g., RC=2, RC=3,
etc. What could cause this? dup(2) and dup2(2) do exactly that. That'll be
seen as multiple FDs all pointing to the same struct file (thus sharing
the same open-mode, read/write offset, etc.)

There's a convention in linux:
1. The code that makes a link to another object has to be the one to
   increment the target object's RC by 1.
2. The code that removes a link to another object has to be the one to
   decrement the target object's RC by 1.
3. There are utility functions that do that, usually a get/put pair:
   - dget: increments a dentry's RC
   - dput: decrements a dentry's RC
   - iget/iput: same, for inodes
   - sbget/sget/sbput/sput: same, for superblock objects (TBD)
   - etc.
   Some exceptions: filp_open/filp_close, but there's also fput().

Rule: generally if you see "get/put" functions, it's usually an RC
management pair. It also means that for every "get" there has to be a
"put" somewhere, else you "leak a reference count", meaning that object
can never be reclaimed.

Q: When do objects actually get freed?
A: Just b/c an object's RC=0 now, doesn't mean it'll be freed right away.
- why: defer the actual freeing of object mem till later; an async kthread
  can do that.
- also, sometimes a freed object may be re-used, and in that case, it's
  easier to just inc its RC from 0 to 1. Faster than trying to alloc a
  whole new object (e.g., if an inode we used before now gets reused).

In linux, there are many caches (TBD), one being the dcache. Because
dentries also point to inodes, the linux dcache is also an inode cache
(icache). Note: don't confuse the inode cache (icache) with the other
icache (the CPU instruction cache).
Recall a dentry points to its inode: D->I. Sometimes we see a dentry
pointing to a NULL inode, D->NULL, stored in the cache. Under which
conditions can this happen?

Suppose you lookup/stat an entry in a directory, and that entry does NOT
exist. Usually you get back the error ENOENT.

Suppose I have a directory with N entries in it, by default unsorted: what
is the complexity of finding an entry that DOES exist in that directory?
O(N), but in practice N/2 entries checked on average. But what is the
average time it takes to find out that an entry in the directory does NOT
exist? Still O(N), but in practice you have to look at ALL N entries.

Rule: proving a negative is much harder than proving a positive. Meaning,
you spent a lot of effort and I/O to discover that an entry is NOT in a
directory. If another user tries to lookup the same non-existent entry,
they'd have to spend all that effort and I/O again.

How do we speed up the ability to tell that entries do NOT exist in the
underlying file system?
A: we cache these entries, and we call each one a "negative cache" entry.
A dentry with a NULL inode, D->NULL, is called a "negative dentry". Such a
dentry also gets cached in the dcache. So when someone looks up that name,
we find it in the cache, and if the D->inode ptr is NULL, we can
immediately return ENOENT w/o having to issue any I/O.

Negative dentries can also happen when a file is deleted: D->I turns into
D->NULL (meaning the positive dentry is turned into a negative one).

* VFS headers

include/linux/fs.h: main VFS structures (long header)
include/linux/dcache.h: main dcache structures

struct inode:
- Most major structs will have their fields prefixed by a letter or two,
  to identify what it is.
ptr->i_XXX -- the "i_" tells you that ptr is a struct inode pointer.

Reference counter for inode:

	atomic_t		i_count;

A bunch of fields at the start of the inode struct, close together:

	umode_t			i_mode;		// stat(2) permission mode + type
	unsigned short		i_opflags;
	kuid_t			i_uid;		// UID
	kgid_t			i_gid;		// GID
	unsigned int		i_flags;	// flags

Why are the mode+uid+gid close together? Note that the inode struct is
somewhat large.
A: locality. All three are needed to check if a user has permission to
access an inode. It helps if all three fit inside a single CPU cache line!
This makes sure that common actions don't result in too many CPU cache
flushes and reloads from RAM. For the same reason, other sets of fields in
the inode struct are kept together, for example the 3 stat(2) timestamps:

	struct timespec64	i_atime;
	struct timespec64	i_mtime;
	struct timespec64	i_ctime;

A lot of other popular data structures that may be accessed many times per
second, and may have 1000s of instances in kernel memory, are optimized
for CPU cache line locality. In addition, placing the most popular fields
first in a struct ensures that as soon as the struct is loaded, those
fields are going to be available.

The VFS data structures and code are written in C, but there's a lot of OO
principles:
1. Some fields are read-only (aka "public"), and anyone can read them any
   time.
2. Some fields require a lock of sorts before you can access them. There
   are multiple locks inside the inode struct, and you have to "grab" the
   right lock before updating certain fields. Comments usually tell you
   which lock to grab before modifying which fields:

	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */

Some fields are updated with their own lock, for example any atomic_t
(i.e., a refcount). There are generic methods in atomic.h for manipulating
atomic counters. An atomic counter is a struct with an "int" counter and a
lock. Ops: atomic_inc, atomic_dec, zero, add, subtract, compare, etc.
A lot of data structures have a generic void* at or near the end. Inode:

	void			*i_private; /* fs or device private pointer */

This allows any user of an inode who needs to store extra info that is NOT
already in the existing inode fields to store any arbitrary info. A void*
allows you to point to any struct you want. The user of this void* is the
one that has to alloc/free whatever is in that field (the VFS will
preserve this field but won't touch it).

This is the principle of "extensibility": the ability to extend
functionality at runtime w/o having to recompile the code! This is a bit
like OO inheritance: building on top of existing classes.

An inode also has a set of operations that can be invoked on it:

	const struct inode_operations	*i_op;

A vector of all the methods that can be applied to THIS inode.

A file struct has a ptr to its inode:

	struct inode		*f_inode;	/* cached value */

This is a cached value, b/c going from F->D->I may cause a CPU cache
flush if the D ptr is in a different mem region. Also, when accessing an
open file, you don't need the name (D) any more, but you do need the inode
to get to the file size, the bytes to read/write, etc. So this cached
value, while increasing the size of struct file, has been deemed worth the
extra size, to speed up file-to-inode mappings. B/c this is a cached F->I
ptr, we do NOT increment I's RC by one.