* dentry

In <linux/dcache.h> (not fs.h):

dentry->d_inode: the inode of the dentry; a NULL inode means a negative dentry.
dentry_operations (TBD)
dentry->d_fsdata (a void *): any f/s-specific info, for extensibility.

	unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */

dentry->d_iname is an array of chars (a string) holding "small names". The macro DNAME_INLINE_LEN is usually 32-40 bytes, depending on the architecture.

Note: in POSIX, a full pathname (e.g., /a/b/c/d/...) cannot exceed 4096 bytes (PATH_MAX). A single component (delimited by '/') cannot exceed 255 bytes (NAME_MAX). However, most people use short names/paths that are easier to remember.

So if a dentry needs to store the string name of its path component, we can do it in two ways:

1. Store a pointer, like "char *name", in the dentry.
   Good: we can malloc and store exactly the length of the name/string we want, not wasting any extra memory.
   Bad: every time we access a dentry and need the name, we have to cross a pointer to a different memory location, most likely resulting in a CPU cache miss. That leads to memory-bound performance for a very popular data structure (at any time, there can be 100s or even 1000s of dentries cached in RAM).

2. Store the full bytes together with the structure: "char name[256]".
   Good: the name bytes are right next to the rest of the structure, reducing CPU cache misses.
   Bad: wastes a lot of memory, because most names are short.

Solution in Linux: a hybrid approach.

1. Embed short names directly into the dentry: dentry->d_iname[DNAME_INLINE_LEN].
2. Store longer names via a variable-size pointer:

	struct qstr d_name;
	struct qstr {
		// hash stuff
		const unsigned char *name;
	};

struct qstr is essentially just a char * pointer, so it can hold a name of any length (for longer names). Recall that long names need to be malloc'd and free'd when stored via dentry->d_name.name.

Q: Given a dentry, how do you know where to look for the name?
A: You could consult the value of the DNAME_INLINE_LEN macro/const, but that means another "if-then-else" on every access.
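This if-then-else approach can be sketched in userland C. Toy types only: the d_long_name and d_name_len fields are hypothetical simplifications, not the real struct dentry layout.

```c
#include <assert.h>
#include <string.h>

/* Userland sketch of the "consult DNAME_INLINE_LEN" approach.
 * Toy struct; the real struct dentry is far more complex. */
#define DNAME_INLINE_LEN 32

struct toy_dentry {
	unsigned char d_iname[DNAME_INLINE_LEN]; /* short names, embedded */
	const unsigned char *d_long_name;        /* long names, malloc'd (hypothetical field) */
	unsigned int d_name_len;
};

/* every access pays an extra branch just to find where the bytes live */
const unsigned char *toy_dentry_name(const struct toy_dentry *d)
{
	if (d->d_name_len < DNAME_INLINE_LEN)
		return d->d_iname;       /* short: embedded copy */
	else
		return d->d_long_name;   /* long: separate allocation */
}
```

The branch itself is cheap, but it is paid on every name access of a very hot structure, which is why the kernel avoids it (as described next).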
A: We could store a small flag in dentry->d_flags (older kernels didn't; newer ones may do so). That still requires an if-then-else check.
A: A better solution is to always use qstr.name, but point it at the embedded d_iname for short names. So if we construct a dentry with a short name, we copy the bytes into dentry->d_iname, then set:

	dentry->d_name.name = &dentry->d_iname[0];

Thus, for short names, the qstr points just a few bytes later in the same struct, a short enough distance to avoid extra CPU cache misses.

* Dcache

All dentries are organized in a big "index" table for lookups. Recall that lookups can be expensive and happen a lot, so we want a very efficient dcache (which also caches inodes).

* Linux lists and hash tables

Linux designed several basic data structures.

struct list_head, from <linux/list.h>: implements simple linked lists, but without the typical pointer-to-a-separate-next-element design. Typical lists are designed as:

	struct list_element {
		int x;
		float y; // whatever members you need
		struct list_element *next; // ptr to next element
	};

But such ->next elements would live at different memory locations (bad: CPU cache misses). struct list_head instead acts as a "container": the linkage (the references to the "next" and "prev" elements) is embedded inside the containing structure itself. Linux also provides accessor methods to create a list, iterate over a list, count #elements, add/remove elements at the head/tail, etc. The code in list.h is generic, highly efficient C (it can even be reused in userland). Linux then developed doubly-linked lists on top of the basic list_head. Lists can be unordered or ordered, depending on the needs.

Then they developed a hash table (HT) on top of list_head: a HT is an array of "buckets", where each bucket is a list_head (the start of a list). See <linux/hashtable.h>. This is also very generic, efficient C code that can be reused in userland. Recall that HTs also need a "hash function" to find the bucket for a given element.
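A minimal userland sketch of the list_head idea follows. This is a simplified re-implementation for illustration, not the actual <linux/list.h> code, though the names (INIT_LIST_HEAD, list_add, container_of) mirror the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Userland sketch of list_head: the link fields are embedded in the
 * containing structure, so list traversal stays close to the element's
 * own data (better CPU cache behavior). */
struct list_head {
	struct list_head *next, *prev;
};

void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

/* insert 'n' right after 'h' (i.e., at the head of the list) */
void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

/* recover the containing structure from its embedded list_head */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* an element embeds the linkage instead of pointing at a next element */
struct item {
	int x;
	struct list_head node;
};

int list_count(struct list_head *h)
{
	int n = 0;
	for (struct list_head *p = h->next; p != h; p = p->next)
		n++;
	return n;
}
```

A HT built on this is then just an array of such list_head buckets, plus a hash function to pick the bucket.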
* Dcache use of lists and HTs

	struct hlist_bl_node d_hash; /* lookup hash list */

d_hash links the dentry into a hash table used for "lookup" purposes. HTs are more complicated, but efficient for "finding" elements. All lookup methods use d_hash to check: "is there a dentry with name X in the dcache?"

Each dentry belongs to a specific file system (denoted in dentry->d_sb), and is thus "linked" to all other dentries cached in the same SB.

	struct dentry *d_parent; /* parent directory */

We store the dentry of the parent directory directly inside this dentry, so we don't have to look it up. One reason is that we often look up ".." (the parent name). Another reason is that the kernel may need to ascend one level up for locking and other reasons. Note: for the root dentry, d_parent points to itself.

	struct list_head d_lru; /* LRU list */

A linked list of all dentries in this SB, ordered by least-recently used (LRU). Useful when we need to make room in memory (cache eviction).

	struct list_head d_child; /* child of parent list */

"Child of parent list": that is, a sibling list.

	struct list_head d_subdirs; /* our children */

If this dentry is a directory, these are the dentries of the subdirs in it.

d_child and d_subdirs are useful for the many apps/tools that perform ops on all objects in the same dir, or recursively descend into subdirs for some operation:
- ls: wants to know all siblings
- ls -R, find: recursive scan of all files/dirs below a certain point
- same with chmod -R, chown -R, chgrp -R, rm -rf, etc.

Note: these lists/hashes can ONLY hold (in the dcache) items that have already been looked up. For example, d_subdirs only records subdir dentries that were looked up at least once and cached. It will NOT tell you what all the on-disk subdirs are!

* dentry_operations

->d_hash: the VFS offers a default hash fxn (on the dentry's name) to decide which bucket to place it into. But your f/s can use a custom hash fxn if you want something more efficient.
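To illustrate the idea behind a name-hashing fxn, here is a userland sketch that maps a component name to a bucket. The djb2-style hash and the DCACHE_BUCKETS size are illustrative choices, not what the kernel actually uses.

```c
#include <assert.h>
#include <stddef.h>

#define DCACHE_BUCKETS 1024  /* hypothetical table size */

/* djb2-style string hash; the kernel uses its own hash functions.
 * This only illustrates the name -> bucket mapping. */
unsigned long name_hash(const unsigned char *name, size_t len)
{
	unsigned long h = 5381;
	while (len--)
		h = h * 33 + *name++;
	return h;
}

/* which bucket's list_head would we search for this name? */
unsigned int bucket_for(const unsigned char *name, size_t len)
{
	return name_hash(name, len) % DCACHE_BUCKETS;
}
```

A lookup then only walks the one short list in that bucket, rather than all cached dentries.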
->d_compare: the VFS uses strcmp() by default (to compare two dentries' names), but some file systems want their own custom dentry-name comparison. strcmp() is a case-sensitive comparison, and some file systems, like FAT32, do not support case-sensitive names.

->d_init, ->d_delete: constructor/destructor hooks.

->d_release: called right before a dentry with RC=0 is about to be removed from the dcache (helps a f/s that stored private info in dentry->d_fsdata).

->d_iput: called before the refcnt is decreased on the inode connected to a dentry (that is, dentry->d_inode). Useful to know when a dentry might become negative, before it does.

->d_revalidate: by default, cached dentries are assumed to be valid; that is, the backend content behind those dentries hasn't changed (e.g., a local ext3 f/s). But in network file systems (e.g., NFS), the server's files may have changed while a client caches a dentry. This method lets a f/s run a quick, custom "revalidation" function to test whether the cached entry is still valid. In NFS, it may send a quick RPC to the server to find out if the parent dir has changed. Any entry that is not valid is removed from the dcache, and the caller has to reissue a ->lookup to get a fresh, new (valid) entry.

* Wrapfs

Some file systems work on top of a block device (e.g., ext3 accesses disk LBAs). Some file systems are networked: they exchange RPC messages with a server. Some file systems operate on top of a floppy disk or CDROM/DVDROM; some are RAM-based file systems.

Suppose you wanted to implement some new functionality, say transparent encryption/decryption of files. How would you implement this?

1. Start from scratch: design a file system like ext3, with built-in encryption.
   Problem: it's a lot of effort! Many man-years of labor just to get a simple feature.

2. Take an existing f/s, like ext3, and modify its file ->read and ->write methods. Less work than starting from scratch.
   Modify ext3_write to encrypt the user buffer before storing it on disk; modify ext3_read to decrypt the data read back from disk.
   Problem: each time ext3 changes, you have to update your code. And trying to convince the ext3 maintainers to integrate your encryption code into mainline is fairly hard.
   Problem: some users would want encryption added to their favorite f/s, say xfs, or btrfs, etc. If you port your code changes to other file systems, you now have several more f/s to maintain.

3. Instead of changing each f/s, add crypto support directly to the VFS, right after the sys_read and sys_write calls. That way, your file encryption works for all file systems.
   But changing the VFS is a lot of work: it is complex code developed over many years; just linux-5.15/fs/*.c is over 75,000 LoC (not counting header files like fs.h). You would likely introduce bugs that affect all file systems, and the Linux maintainers are unlikely to accept your change. Plus, you may have to support your changes on multiple kernel versions (because different distros and their releases use different kernel versions).

4. A better way to try things out quickly, w/o changing the VFS or any existing f/s, is to overlay or "stack" your code on top of an existing f/s. That is what "wrapfs" does (one of several stackable file systems). Wrapfs mounts not on top of a device, but on top of another directory (a dentry). It then intercepts the operations and, by default, just passes them down to the underlying f/s. But it gives you an easy place to insert your custom code. The wrapfs code base is under 2 KLoC.

* Next time

wrapfs code
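As a preview of the stacking idea, here is a userland analogy of a wrapfs-style pass-through with a custom hook. All names here are hypothetical (real wrapfs operates on VFS objects and methods), and the XOR transform is a placeholder for real crypto.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* stand-in for the underlying ("lower") f/s read operation */
typedef long (*read_op_t)(char *buf, size_t len);

/* example transform: XOR "decryption" placeholder, NOT real crypto */
void xor_transform(char *buf, size_t len, char key)
{
	for (size_t i = 0; i < len; i++)
		buf[i] ^= key;
}

/* the wrapper: delegate to the lower f/s, then run custom code on
 * the result; by default a stackable f/s would just pass through */
long wrap_read(read_op_t lower_read, char *buf, size_t len, char key)
{
	long n = lower_read(buf, len);           /* pass down */
	if (n > 0)
		xor_transform(buf, (size_t)n, key);  /* custom hook */
	return n;
}

/* toy lower f/s: returns fixed "ciphertext" ("HELLO" XOR 0x20) */
long toy_lower_read(char *buf, size_t len)
{
	const char cipher[] = { 'h', 'e', 'l', 'l', 'o' };
	size_t n = sizeof cipher;
	if (len < n)
		n = len;
	memcpy(buf, cipher, n);
	return (long)n;
}
```

The point of the design: the wrapper neither knows nor cares what the lower f/s is (ext3, NFS, ...), which is why one stackable f/s can add a feature on top of any of them.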