* VFS structures cont.

When an object's RC reaches 0, the object is NOT necessarily destroyed and
freed from memory.  It can remain there until the memory is actually needed:
only then is the object freed.

There's another good reason why an object with RC=0 doesn't always get freed
immediately.  A: maybe you'll need the object again.

Example: D -> I (a valid name D pointing to a valid inode I)

Suppose someone deletes the file: we remove I from disk, D's RC drops to 0,
and we remove D.  Next, someone tries to look up the file whose name was in D.
What do we have to do?  We have to go all the way down to the disk and search
the on-disk structures, only to find out that the file does not exist; only
then can we return ENOENT.  Recall from dedup: we use bloom filters to quickly
find out that something does NOT exist.  Otherwise, proving a negative is very
expensive (slow, lots of I/O).

Instead, when a file is deleted, kernels KEEP the dentry with RC=0 in the
dcache, and only purge it when absolutely needed.  This has two benefits.

1. If we find a dentry w/o an inode (i.e., a dentry with RC=0 and
   dentry->d_inode == NULL), then we KNOW that the on-disk file does NOT
   exist, and we can immediately return ENOENT w/o spending any I/O cycles.

   Recall that in typical caches, if you find an item X, it means the item
   exists in the backend from which you cached it.  But you can also cache
   "negative entries": special entries whose existence in the cache indicates
   that the backend item does NOT exist.

   This kind of dentry is called a "negative dentry": one where
   dentry->d_inode == NULL and RC=0.

2. A negative dentry could become a positive dentry in the future.  That is,
   its RC will go from 0 to 1.  That can happen if the previously non-existent
   (or deleted) file is now re-created.
- Start with a dentry with RC=0
- user creates a new file w/ the same name, so a new inode I is created
- we link dentry->d_inode = newly created inode
- and increase the dentry's RC from 0 to 1

This allows us to reuse/recycle a dentry that's already in memory, and relink
it with a new inode.  This turns a "negative dentry" into a "positive dentry".
(This is a form of "undoing" the deletion.)

* Linux VFS source files

In HW2, each of you will clone a linux 5.15 kernel tree.  Study it as needed
(it's big, so focus on the relevant parts).

Documentation (plain text files):

Documentation/ : various docs on the kernel structures and functions
Documentation/filesystems/ : various docs on the VFS and specific file systems
Documentation/filesystems/vfs.rst: plaintext file about the VFS structures.

Main VFS header files (always start with headers before C/C++ files):
include/linux/fs.h and include/linux/dcache.h

struct inode:

In major data structures, each field starts with a prefix.  In struct inode,
it is "i_".  This way, if you see some piece of code accessing ptr->i_xxx, you
know that "ptr" is of type "struct inode *".

The struct starts with i_mode, i_uid, i_gid (all part of the stat(2) info that
is returned to users upon stat(2) of a file).  Why: b/c permission checking
often needs all 3.

Similarly, the timestamps are close to each other: i_atime, i_mtime, i_ctime.
Why: b/c we often look up and update timestamps together.

The reason to place fields that are likely to be accessed together next to
each other is CPU cache lines: fields that sit in the same cache line are
fetched from memory together.  OS developers are very concerned about
performance, and want to get as much perf out of the CPU as possible by
minimizing memory-bound operations.

inode->i_ino: the inode number

inode->i_count: the reference counter for this object.  Uses an atomic_t;
atomic.h defines convenient methods for operating atomically on counters:
increment, decrement, add/sub, read, reset to 0, init, destroy.
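To make the counter operations concrete, here is a minimal user-space sketch of
a reference counter.  It is only an analogue: it uses C11 <stdatomic.h> instead
of the kernel's atomic_t API, and the names (struct toy_obj, toy_get, toy_put,
toy_refs) are made up for illustration; they are not kernel functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object standing in for something like struct inode. */
struct toy_obj {
    atomic_int count;      /* plays the role of i_count */
    unsigned long ino;     /* plays the role of i_ino */
};

/* Initialize the object with no references held. */
static void toy_init(struct toy_obj *o, unsigned long ino)
{
    atomic_init(&o->count, 0);
    o->ino = ino;
}

/* Take a reference (user-space analogue of atomic_inc()). */
static void toy_get(struct toy_obj *o)
{
    atomic_fetch_add(&o->count, 1);
}

/*
 * Drop a reference (user-space analogue of atomic_dec()).
 * Returns true when the count reached 0: the object is now unused,
 * but the caller may still keep it cached instead of freeing it.
 */
static bool toy_put(struct toy_obj *o)
{
    int old = atomic_fetch_sub(&o->count, 1);
    assert(old > 0);       /* dropping a reference we never took is a bug */
    return old == 1;
}

/* Read the current count (user-space analogue of atomic_read()). */
static int toy_refs(struct toy_obj *o)
{
    return atomic_load(&o->count);
}
```

Note that toy_put() only reports that the count hit 0; whether to free the
object right away or keep it around is a separate policy decision, which is
the point made earlier about RC=0 objects staying in memory.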
For refcounts, we primarily use the atomic_inc() and atomic_dec() functions
(and atomic_read()).  If you see what looks like an atomic_t xxx_count, it
should give you a hint that this object is managed using reference counters.

struct inode_operations *i_op; // this is the ptr to the vector of ops that
                               // can operate on "this" inode.

struct inode_operations: contains definitions or prototypes for the functions
that operate on an inode.

Note: it is convention in the linux kernel that when you refer to "->xxx"
(arrow first), you're often implying a field or method called "xxx".

->create: the create method for inodes; the parameters are:

    int (*create) (
        struct user_namespace *, // ignore: this is for advanced "namespaces"
        struct inode *,          // the parent inode of the dir, where we
                                 // want to create the new file
        struct dentry *,         // the name of the file to create
        umode_t,                 // this is the "mode_t mode" from userland
        bool);

Recall the creat(2) syscall has this prototype:

    int creat(const char *pathname, mode_t mode);

So the VFS will translate a creat(2) syscall into a call to the VFS method
inode->i_op->create() and pass it to the actual file system (e.g., ext4) to
create the inode in question.  The file name to create is stored in the
"dentry" passed to the ->create() method.  The dentry we pass starts as
negative: i.e., the inode inside the dentry doesn't exist yet.  If ->create()
succeeds, the method returns 0 (else a negative -ERRNO, such as -EPERM), and
the dentry that was passed now has a valid inode: the dentry switched from
negative to positive.

A lot of inode methods take a directory inode to operate in, and a dentry with
a name on which to perform the action.

int (*unlink) (struct inode *,struct dentry *);
// takes a dir inode and a positive dentry and, if successful, removes the
// dentry->d_inode object from disk, turning the positive dentry into a
// negative one.

int (*readlink) (struct dentry *, char __user *,int);
// used to read the value of a symlink.
Takes a positive dentry whose D->I represents a symlink object, and returns a
string representing the content of the symlink.  Recall that readlink just
returns the string, but does not try to interpret it as a pathname.  The
string is returned into a "char *" buffer, just a character array.  Note that
readlink(2) does not append a terminating null byte: the caller relies on the
returned length.

There's a special annotation "__user", to indicate that the char * pointer is
NOT a kernel address but rather an address in this process's USER virtual
address space.  When the kernel has to copy the bytes of the symlink content
into the user buf, the kernel has to validate and translate that user address
internally (e.g., with helpers such as copy_to_user()).

Key: don't ever try to dereference/access a __user ptr directly in the kernel!
You'd be treating a user virtual address as if it were a kernel address, and
likely corrupt your kernel memory or crash it.
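To tie together the ->create() and ->unlink() descriptions above with the
negative-dentry idea, here is a small user-space toy model.  It is a sketch
under stated assumptions, not kernel code: struct toy_inode, struct toy_dentry,
toy_create(), toy_unlink(), and toy_lookup() are all hypothetical names, the
"dcache" is a plain array, and the "disk" is assumed to be empty, so every
cache miss costs one simulated I/O and ends in -ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins; NOT the kernel's struct inode / struct dentry. */
struct toy_inode {
    unsigned long i_ino;
};

struct toy_dentry {
    const char *d_name;
    struct toy_inode *d_inode;   /* NULL means this is a negative dentry */
    int d_count;                 /* the RC from the notes */
};

/* Like ->create: link a new inode into a negative dentry (RC 0 -> 1). */
static int toy_create(struct toy_dentry *d, struct toy_inode *new_inode)
{
    if (d->d_inode != NULL)
        return -EEXIST;          /* already positive: the name exists */
    d->d_inode = new_inode;
    d->d_count = 1;
    return 0;
}

/* Like ->unlink: drop the inode but KEEP the dentry around as negative. */
static int toy_unlink(struct toy_dentry *d)
{
    if (d->d_inode == NULL)
        return -ENOENT;          /* already negative: nothing to remove */
    d->d_inode = NULL;
    d->d_count = 0;
    return 0;
}

/*
 * Look up a name.  A cached dentry answers immediately, whether it is
 * positive (0) or negative (-ENOENT).  Only a cache miss pays for a
 * simulated trip to disk, counted in *disk_ios.
 */
static int toy_lookup(struct toy_dentry *cache, size_t n,
                      const char *name, int *disk_ios)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(cache[i].d_name, name) != 0)
            continue;
        return cache[i].d_inode ? 0 : -ENOENT;
    }
    (*disk_ios)++;               /* slow path: search on-disk structures */
    return -ENOENT;              /* toy assumption: the disk is empty */
}
```

In this model, looking up a name whose negative dentry is cached returns
-ENOENT without touching *disk_ios, while looking up a name that is not in
the cache at all increments it first: that difference is the I/O saved by
keeping negative dentries.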