* dentry

In <linux/dcache.h> (not fs.h):

dentry->d_inode: the inode of the dentry; a NULL inode means a negative dentry.
dentry_operations (TBD)
dentry->d_fsdata (a void *): any f/s-specific info, for extensibility.

	unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */

dentry->d_iname is an array of chars (a string) holding "small names". The macro DNAME_INLINE_LEN is usually 32-40 bytes, depending on the architecture.

Note: in POSIX, a full pathname (e.g., /a/b/c/d/...) cannot exceed 4096 bytes (PATH_MAX). A single component (delimited by '/') cannot exceed 255 bytes (NAME_MAX). However, most people use short names/paths that are easier to remember.

So if a dentry needs to store the string name of its path component, we can do it in two ways:

1. Store a pointer, like "char *name", in the dentry.
   Good: we can malloc and store exactly the length of the name/string we want, not wasting any extra memory.
   Bad: every time we access a dentry and need the name, we have to cross a pointer to a different memory location, most likely resulting in a CPU cache miss. That leads to memory-bound performance for a very popular data structure (at any time, there can be 100s or even 1000s of dentries cached in RAM).

2. Store the full bytes together with the structure: "char name[256]".
   Good: the name bytes are right next to the rest of the structure, reducing CPU cache misses.
   Bad: wastes a lot of memory, because most names are short.

Solution in Linux: a hybrid approach.

1. Embed short names directly into the dentry: dentry->d_iname[DNAME_INLINE_LEN].
2. Store longer names via a variable-size pointer:

	struct qstr d_name;
	struct qstr {
		// hash stuff
		const unsigned char *name;
	};

struct qstr is essentially just a char * pointer, so it can hold a name of any length (for longer names). Recall that long names need to be malloc'd and free'd when stored via dentry->d_name.name.

Q: Given a dentry, how do you know where to look for the name?
A: You could consult the value of the DNAME_INLINE_LEN macro/const, but that means another "if-then-else" on every access.
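This if-then-else approach can be sketched in userland C. Toy types only: the d_long_name and d_name_len fields are hypothetical simplifications, not the real struct dentry layout.

```c
#include <assert.h>
#include <string.h>

/* Userland sketch of the "consult DNAME_INLINE_LEN" approach.
 * Toy struct; the real struct dentry is far more complex. */
#define DNAME_INLINE_LEN 32

struct toy_dentry {
	unsigned char d_iname[DNAME_INLINE_LEN]; /* short names, embedded */
	const unsigned char *d_long_name;        /* long names, malloc'd (hypothetical field) */
	unsigned int d_name_len;
};

/* every access pays an extra branch just to find where the bytes live */
const unsigned char *toy_dentry_name(const struct toy_dentry *d)
{
	if (d->d_name_len < DNAME_INLINE_LEN)
		return d->d_iname;       /* short: embedded copy */
	else
		return d->d_long_name;   /* long: separate allocation */
}
```

The branch itself is cheap, but it is paid on every name access of a very hot structure, which is why the kernel avoids it (as described next).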
A: We could store a small flag in dentry->d_flags (older kernels didn't; newer ones may do so). That still requires an if-then-else check.
A: A better solution is to always use qstr.name, but point it at the embedded d_iname for short names. So if we construct a dentry with a short name, we copy the bytes into dentry->d_iname, then set:

	dentry->d_name.name = &dentry->d_iname[0];

Thus, for short names, the qstr points just a few bytes later in the same struct, a short enough distance to avoid extra CPU cache misses.

* Dcache

All dentries are organized in a big "index" table for lookups. Recall that lookups can be expensive and happen a lot, so we want a very efficient dcache (which also caches inodes).

* Linux lists and hash tables

Linux designed several basic data structures.

struct list_head, from <linux/list.h>: implements simple linked lists, but without the typical pointer-to-a-separate-next-element design. Typical lists are designed as:

	struct list_element {
		int x;
		float y; // whatever members you need
		struct list_element *next; // ptr to next element
	};

But such ->next elements would live at different memory locations (bad: CPU cache misses). struct list_head instead acts as a "container": the linkage (the references to the "next" and "prev" elements) is embedded inside the containing structure itself. Linux also provides accessor methods to create a list, iterate over a list, count #elements, add/remove elements at the head/tail, etc. The code in list.h is generic, highly efficient C (it can even be reused in userland). Linux then developed doubly-linked lists on top of the basic list_head. Lists can be unordered or ordered, depending on the needs.

Then they developed a hash table (HT) on top of list_head: a HT is an array of "buckets", where each bucket is a list_head (the start of a list). See <linux/hashtable.h>. This is also very generic, efficient C code that can be reused in userland. Recall that HTs also need a "hash function" to find the bucket for a given element.
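A minimal userland sketch of the list_head idea follows. This is a simplified re-implementation for illustration, not the actual <linux/list.h> code, though the names (INIT_LIST_HEAD, list_add, container_of) mirror the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Userland sketch of list_head: the link fields are embedded in the
 * containing structure, so list traversal stays close to the element's
 * own data (better CPU cache behavior). */
struct list_head {
	struct list_head *next, *prev;
};

void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

/* insert 'n' right after 'h' (i.e., at the head of the list) */
void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

/* recover the containing structure from its embedded list_head */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* an element embeds the linkage instead of pointing at a next element */
struct item {
	int x;
	struct list_head node;
};

int list_count(struct list_head *h)
{
	int n = 0;
	for (struct list_head *p = h->next; p != h; p = p->next)
		n++;
	return n;
}
```

A HT built on this is then just an array of such list_head buckets, plus a hash function to pick the bucket.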
* Dcache use of lists and HTs

	struct hlist_bl_node d_hash; /* lookup hash list */

d_hash links the dentry into a hash table used for "lookup" purposes. HTs are more complicated, but efficient for "finding" elements. All lookup methods use d_hash to check: "is there a dentry with name X in the dcache?"

Each dentry belongs to a specific file system (denoted in dentry->d_sb), and is thus "linked" to all other dentries cached in the same SB.

	struct dentry *d_parent; /* parent directory */

We store the dentry of the parent directory directly inside this dentry, so we don't have to look it up. One reason is that we often look up ".." (the parent name). Another reason is that the kernel may need to ascend one level up for locking and other reasons. Note: for the root dentry, d_parent points to itself.

	struct list_head d_lru; /* LRU list */

A linked list of all dentries in this SB, ordered by least-recently used (LRU). Useful when we need to make room in memory (cache eviction).

	struct list_head d_child; /* child of parent list */

"Child of parent list": that is, a sibling list.

	struct list_head d_subdirs; /* our children */

If this dentry is a directory, these are the dentries of the subdirs in it.

d_child and d_subdirs are useful for the many apps/tools that perform ops on all objects in the same dir, or recursively descend into subdirs for some operation:
- ls: wants to know all siblings
- ls -R, find: recursive scan of all files/dirs below a certain point
- same with chmod -R, chown -R, chgrp -R, rm -rf, etc.

Note: these lists/hashes can ONLY hold (in the dcache) items that have already been looked up. For example, d_subdirs only records subdir dentries that were looked up at least once and cached. It will NOT tell you what all the on-disk subdirs are!

* dentry_operations

->d_hash: the VFS offers a default hash fxn (on the dentry's name) to decide which bucket to place it into. But your f/s can use a custom hash fxn if you want something more efficient.
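To illustrate the idea behind a name-hashing fxn, here is a userland sketch that maps a component name to a bucket. The djb2-style hash and the DCACHE_BUCKETS size are illustrative choices, not what the kernel actually uses.

```c
#include <assert.h>
#include <stddef.h>

#define DCACHE_BUCKETS 1024  /* hypothetical table size */

/* djb2-style string hash; the kernel uses its own hash functions.
 * This only illustrates the name -> bucket mapping. */
unsigned long name_hash(const unsigned char *name, size_t len)
{
	unsigned long h = 5381;
	while (len--)
		h = h * 33 + *name++;
	return h;
}

/* which bucket's list_head would we search for this name? */
unsigned int bucket_for(const unsigned char *name, size_t len)
{
	return name_hash(name, len) % DCACHE_BUCKETS;
}
```

A lookup then only walks the one short list in that bucket, rather than all cached dentries.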
->d_compare: the VFS uses strcmp() by default (to compare two dentries' names), but some file systems want their own custom dentry-name comparison. strcmp() is a case-sensitive comparison, and some file systems, like FAT32, do not support case-sensitive names.

->d_init, ->d_delete: constructor/destructor hooks.

->d_release: called right before a dentry with RC=0 is about to be removed from the dcache (helps a f/s that stored private info in dentry->d_fsdata).

->d_iput: called before the refcnt is decreased on the inode connected to a dentry (that is, dentry->d_inode). Useful to know when a dentry might become negative, before it does.

->d_revalidate: by default, cached dentries are assumed to be valid; that is, the backend content behind those dentries hasn't changed (e.g., a local ext3 f/s). But in network file systems (e.g., NFS), the server's files may have changed while a client caches a dentry. This method lets a f/s run a quick, custom "revalidation" function to test whether the cached entry is still valid. In NFS, it may send a quick RPC to the server to find out if the parent dir has changed. Any entry that is not valid is removed from the dcache, and the caller has to reissue a ->lookup to get a fresh, new (valid) entry.

* Wrapfs

Some file systems work on top of a block device (e.g., ext3 accesses disk LBAs). Some file systems are networked: they exchange RPC messages with a server. Some file systems operate on top of a floppy disk or CDROM/DVDROM; some are RAM-based file systems.

Suppose you wanted to implement some new functionality, say transparent encryption/decryption of files. How would you implement this?

1. Start from scratch: design a file system like ext3, with built-in encryption.
   Problem: it's a lot of effort! Many man-years of labor just to get a simple feature.

2. Take an existing f/s, like ext3, and modify its file ->read and ->write methods. Less work than starting from scratch.
   Modify ext3_write to encrypt the user buffer before storing it on disk; modify ext3_read to decrypt the data read back from disk.
   Problem: each time ext3 changes, you have to update your code. And trying to convince the ext3 maintainers to integrate your encryption code into mainline is fairly hard.
   Problem: some users would want encryption added to their favorite f/s, say xfs, or btrfs, etc. If you port your code changes to other file systems, you now have several more f/s to maintain.

3. Instead of changing each f/s, add crypto support directly to the VFS, right after the sys_read and sys_write calls. That way, your file encryption works for all file systems.
   But changing the VFS is a lot of work: it is complex code developed over many years; just linux-5.15/fs/*.c is over 75,000 LoC (not counting header files like fs.h). You would likely introduce bugs that affect all file systems, and the Linux maintainers are unlikely to accept your change. Plus, you may have to support your changes on multiple kernel versions (because different distros and their releases use different kernel versions).

4. A better way to try things out quickly, w/o changing the VFS or any existing f/s, is to overlay or "stack" your code on top of an existing f/s. That is what "wrapfs" does (one of several stackable file systems). Wrapfs mounts not on top of a device, but on top of another directory (a dentry). It then intercepts the operations and, by default, just passes them down to the underlying f/s. But it gives you an easy place to insert your custom code. The wrapfs code base is under 2 KLoC.

* Next time

wrapfs code
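As a preview of the stacking idea, here is a userland analogy of a wrapfs-style pass-through with a custom hook. All names here are hypothetical (real wrapfs operates on VFS objects and methods), and the XOR transform is a placeholder for real crypto.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* stand-in for the underlying ("lower") f/s read operation */
typedef long (*read_op_t)(char *buf, size_t len);

/* example transform: XOR "decryption" placeholder, NOT real crypto */
void xor_transform(char *buf, size_t len, char key)
{
	for (size_t i = 0; i < len; i++)
		buf[i] ^= key;
}

/* the wrapper: delegate to the lower f/s, then run custom code on
 * the result; by default a stackable f/s would just pass through */
long wrap_read(read_op_t lower_read, char *buf, size_t len, char key)
{
	long n = lower_read(buf, len);           /* pass down */
	if (n > 0)
		xor_transform(buf, (size_t)n, key);  /* custom hook */
	return n;
}

/* toy lower f/s: returns fixed "ciphertext" ("HELLO" XOR 0x20) */
long toy_lower_read(char *buf, size_t len)
{
	const char cipher[] = { 'h', 'e', 'l', 'l', 'o' };
	size_t n = sizeof cipher;
	if (len < n)
		n = len;
	memcpy(buf, cipher, n);
	return (long)n;
}
```

The point of the design: the wrapper neither knows nor cares what the lower f/s is (ext3, NFS, ...), which is why one stackable f/s can add a feature on top of any of them.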