* VFS structures cont.

When an object's RC reaches 0, the object is NOT necessarily destroyed and
freed from memory.  It can remain there until the memory is actually needed;
only then is the object freed.

There's another good reason why an object with RC=0 doesn't always get freed
up immediately: you may need the object again.

Example:

D -> I (a valid name D pointing to a valid inode I)

Suppose someone deletes the file: we remove I from disk, D's RC drops to 0,
and we remove D.  Next, someone tries to look up the file whose name was in
D.  What do we have to do?  We have to go all the way down to the disk and
search the on-disk structures, only to find out that the file does not exist
-- then we can return ENOENT.

Recall from dedup, we use bloom filters to quickly find out that something
does NOT exist.  Otherwise, proving a negative is very expensive (slow, lots
of I/O).

Instead, when a file is deleted, kernels KEEP the dentry with RC=0 in the
dcache, and purge it only when absolutely needed.  This has two benefits.

1. If we find a dentry w/o an inode (i.e., a dentry with RC=0 and
dentry->d_inode==NULL), then we KNOW that the on-disk file does NOT exist.
And we can immediately return ENOENT w/o spending any I/O cycles.

Recall in typical caches, if you find an item X, it means the item exists in
the backend from where you cached.  But you can also cache "negative
entries": special entries whose existence in the cache indicates that the
backend item does NOT exist.

This kind of dentry is called a "negative dentry": one where dentry->d_inode
== NULL and RC=0.

2. A negative dentry could become a positive dentry in the future.  That is,
its RC will go from 0 to 1.  That can happen if the previously non-existent
(or deleted) file is now re-created.

- Start with dentry with RC=0
- user creates a new file w/ same name, so a new inode I is created
- we link dentry->d_inode = newly created inode
- and increase dentry RC=0 -> RC=1

This allows us to reuse/recycle a dentry that's already in memory, and relink
it with a new inode.  This turns a "negative dentry" into a "positive
dentry".  (This is a form of "undoing" the deletion.)
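The negative-dentry idea above can be sketched in userland C.  This is only
a toy model, not kernel code: struct toy_dentry, struct toy_inode,
toy_lookup(), toy_create() and the disk_lookups counter are all made-up
names for illustration, and the "dcache" is just a small array.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for the kernel structures; NOT the real definitions. */
struct toy_inode {
	unsigned long i_ino;
};

struct toy_dentry {
	const char *d_name;        /* the cached name */
	struct toy_inode *d_inode; /* NULL means "negative dentry" */
	int d_count;               /* reference count (RC) */
};

static int disk_lookups; /* counts simulated trips to the disk */

/*
 * Look up a name in a tiny array-based "dcache".
 * Positive hit:  returns 0, takes a reference, sets *out.
 * Negative hit:  returns -ENOENT straight from memory (no disk I/O).
 * Complete miss: simulates a disk search, then returns -ENOENT.
 */
static int toy_lookup(struct toy_dentry *cache, size_t n, const char *name,
		      struct toy_dentry **out)
{
	assert(cache != NULL && name != NULL && out != NULL);
	for (size_t i = 0; i < n; i++) {
		if (strcmp(cache[i].d_name, name) != 0)
			continue;
		if (cache[i].d_inode == NULL)
			return -ENOENT; /* negative dentry answers for us */
		cache[i].d_count++;     /* caller now holds a reference */
		*out = &cache[i];
		return 0;
	}
	disk_lookups++; /* nothing cached: would have to search the disk */
	return -ENOENT;
}

/* Re-create a file: relink a negative dentry to a new inode, RC 0 -> 1. */
static int toy_create(struct toy_dentry *d, struct toy_inode *new_inode)
{
	assert(d != NULL && new_inode != NULL);
	if (d->d_inode != NULL)
		return -EEXIST; /* dentry is already positive */
	d->d_inode = new_inode;
	d->d_count = 1;
	return 0;
}
```

With a cache holding one negative dentry named "foo", toy_lookup() of "foo"
returns -ENOENT and leaves disk_lookups untouched, while a lookup of an
uncached name bumps disk_lookups.  After toy_create() on that same dentry,
the next lookup of "foo" succeeds.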


* Linux VFS source files

In HW2, each of you will clone a linux 5.15 kernel tree.  Study it as needed
(it's big, so focus on the relevant parts).

Documentation (plain text files)

folder Documentation/ : various docs on the kernel structures and functions

Documentation/filesystems/ : various docs on the VFS and specific file
systems

Documentation/filesystems/vfs.rst: plaintext file about the VFS structures.

The main VFS header files (always start with the headers before the .c
files): include/linux/fs.h and include/linux/dcache.h

struct inode:

In major data structures, each field starts with a prefix.  In struct inode,
it is "i_".  This way, if you see some piece of code accessing ptr->i_xxx,
you know that "ptr" is of type "struct inode *".

The struct starts with i_mode, i_uid, i_gid (all part of the stat(2) info
that is returned to users upon stat(2) of a file).  Why are they next to
each other: b/c permission checking often needs all 3.

Similarly, the timestamps are close to each other: i_atime, i_mtime,
i_ctime.  Why: b/c we often lookup and update timestamps together.

The reason to place fields that are likely to be accessed together next to
each other is CPU cache-line locality: fields that share a cache line are
fetched from memory together.  OS developers are very concerned about
performance, and want to get as much perf out of the CPU as possible by
minimizing memory-bound operations.
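To make the cache-line point concrete, here is a small userland sketch.  It
is NOT the real struct inode: struct toy_inode_hdr, the 200-byte filler, and
the 64-byte TOY_CACHE_LINE value are assumptions chosen for illustration
(64 bytes is common on x86-64, but not universal).  It also assumes the
struct itself starts on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_LINE 64 /* assumed cache-line size in bytes */

/* Toy layout: permission fields grouped, timestamps grouped. */
struct toy_inode_hdr {
	unsigned short i_mode;
	unsigned int i_uid;
	unsigned int i_gid;
	char other_fields[200];         /* filler for everything in between */
	long i_atime, i_mtime, i_ctime; /* simplified timestamps */
};

/*
 * Returns 1 if the byte range [start, end) of a struct falls entirely
 * within one cache line, else 0.
 */
static int same_cache_line(size_t start, size_t end)
{
	assert(end > start);
	return start / TOY_CACHE_LINE == (end - 1) / TOY_CACHE_LINE;
}
```

For this toy layout, the range from offsetof(struct toy_inode_hdr, i_mode)
to the end of i_gid sits in one line, so a permission check touches a single
cache line.  A range such as [60, 68) straddles two lines and would cost two
fetches.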

inode->i_ino: the inode number

inode->i_count: the reference counter for this object.

Uses an atomic_t, from <linux/atomic.h>.  atomic.h defines convenient
methods for operating atomically on counters: increment, decrement, add/sub,
read, set (e.g., reset to 0), and init.  For refcounts, we primarily use the
atomic_inc() and atomic_dec() functions (and atomic_read).

If you see what looks like an atomic_t xxx_count, it should give you a hint
that this object is managed using reference counters.
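As a rough userland analogy for the refcount pattern, the sketch below uses
C11 <stdatomic.h> instead of the kernel's atomic_t (the kernel API itself is
not available outside the kernel).  struct toy_obj and the toy_get() /
toy_put() helpers are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy refcounted object; o_count plays the role of an atomic_t xxx_count. */
struct toy_obj {
	atomic_int o_count;
};

static void toy_obj_init(struct toy_obj *o)
{
	atomic_init(&o->o_count, 0);
}

/* Take a reference (analogous in spirit to atomic_inc()). */
static void toy_get(struct toy_obj *o)
{
	atomic_fetch_add(&o->o_count, 1);
}

/*
 * Drop a reference (analogous in spirit to atomic_dec()).
 * Returns 1 if this call dropped the count to 0, else 0.
 */
static int toy_put(struct toy_obj *o)
{
	int old = atomic_fetch_sub(&o->o_count, 1);

	assert(old > 0); /* dropping a reference nobody holds is a bug */
	return old == 1;
}

/* Read the current count (analogous in spirit to atomic_read()). */
static int toy_count(struct toy_obj *o)
{
	return atomic_load(&o->o_count);
}
```

Note that toy_put() returning 1 only means the object is now unused.  As
discussed earlier, an object with RC=0 may stay cached and be reclaimed
later, rather than freed on the spot.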

 const struct inode_operations *i_op; // this is the ptr to the vector of
 ops that can operate on "this" inode.

struct inode_operations: contains the prototypes (function pointers) for the
functions that operate on an inode:

Note: it is convention in linux kernel that when you refer to "->xxx" (arrow
first), you're often implying a field or method called "xxx".

->create: the create method for inodes; its parameters are:

int (*create) (
	struct user_namespace *, // ignore: this is for advanced "namespaces"
	struct inode *, // the parent inode of the dir, where we want to
			// create the new file
	struct dentry *, // the name of the file to create
	umode_t, // this is the "mode_t mode" from userland
	bool); // "excl": true if creation must be exclusive (O_EXCL)

Recall creat(2) syscall has this prototype:

       int creat(const char *pathname, mode_t mode);

So the VFS will translate a creat(2) syscall into a call to the VFS method
inode->i_op->create() and pass it to the actual file system (e.g., ext4) to
create the inode in question.  The file name to create is stored in the
"dentry" passed to the ->create() method.  That dentry we pass starts as
negative: i.e., the inode inside the dentry doesn't exist yet.  If
->create() succeeds, the method returns 0 (else a negative -ERRNO, such as
-EPERM).  And the dentry that was passed now has a valid inode: the dentry
switched from negative to positive.
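The dispatch through the ops vector can be sketched in userland C as
follows.  This is a simplified toy, not the kernel's code: all toy_* and
toyfs_* names are hypothetical, the user_namespace and bool parameters are
dropped, the i_writable flag is an invented stand-in for a real permission
check, and the "file system" hands out one statically allocated inode.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_inode;
struct toy_dentry;

/* Toy ops vector: one function pointer per inode method. */
struct toy_inode_operations {
	int (*create)(struct toy_inode *dir, struct toy_dentry *dentry,
		      unsigned int mode);
};

struct toy_inode {
	unsigned long i_ino;
	unsigned int i_mode;
	int i_writable; /* invented flag standing in for permission checks */
	const struct toy_inode_operations *i_op;
};

struct toy_dentry {
	const char *d_name;        /* the name of the file to create */
	struct toy_inode *d_inode; /* NULL while the dentry is negative */
};

static struct toy_inode toyfs_new_inode; /* the only inode toyfs can create */

/* The "file system" side: create the inode and attach it to the dentry. */
static int toyfs_create(struct toy_inode *dir, struct toy_dentry *dentry,
			unsigned int mode)
{
	if (!dir->i_writable)
		return -EPERM;
	toyfs_new_inode.i_ino = dir->i_ino + 1;
	toyfs_new_inode.i_mode = mode;
	toyfs_new_inode.i_writable = 1;
	toyfs_new_inode.i_op = dir->i_op;
	dentry->d_inode = &toyfs_new_inode; /* negative -> positive */
	return 0;
}

static const struct toy_inode_operations toyfs_iops = {
	.create = toyfs_create,
};

/* The "VFS" side: dispatch through the parent directory's ops vector. */
static int toy_vfs_create(struct toy_inode *dir, struct toy_dentry *dentry,
			  unsigned int mode)
{
	assert(dentry->d_inode == NULL); /* we expect a negative dentry */
	if (dir->i_op == NULL || dir->i_op->create == NULL)
		return -EACCES; /* this fs has no ->create method */
	return dir->i_op->create(dir, dentry, mode);
}
```

Calling toy_vfs_create() on a writable directory returns 0 and leaves
dentry->d_inode pointing at the new inode.  On a non-writable directory it
returns -EPERM and the dentry stays negative.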

A lot of inode methods take a directory inode to operate in, and a dentry
with a name to perform the action on.

int (*unlink) (struct inode *,struct dentry *); // takes a dir inode, and a
positive dentry, and if successful, removes the dentry->d_inode object from
disk, turning the positive dentry to a negative one.

int (*readlink) (struct dentry *, char __user *,int); // used to read the
value of a symlink.  Takes a positive dentry, whose D->I represents a
symlink object.  And returns a string representing the content of the
symlink.  Recall that readlink just returns the string, but does not try to
interpret it as a pathname.

The string is returned into a "char *" buffer, just a character array.  Note
that readlink does NOT null-terminate the buffer; it returns the number of
bytes placed in it.  There's a special subtype designation "__user", to
indicate that the char* pointer is NOT an address in the kernel's address
space but rather in this process's USER virtual address space.  When the
kernel has to copy the bytes of the symlink content into the user buf, it
must go through helpers such as copy_to_user(), which validate the user
address and handle faults on it.

Key: don't ever try to dereference/access a __user ptr directly in the
kernel!  The address comes from userland, so it may be unmapped or
deliberately bogus, and touching it directly can corrupt your kernel memory
or crash it.
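The rule can be illustrated with a userland simulation.  Nothing below is
kernel code: the "user address space" is modeled as a separate 64-byte
array, a "user address" is just an offset into it (type toy_uaddr_t), and
toy_copy_to_user() and toy_readlink() are hypothetical helpers.  The one
convention borrowed from the real copy_to_user() is its return value: the
number of bytes that could NOT be copied, so 0 means success.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_USER_SIZE 64
static char toy_user_mem[TOY_USER_SIZE]; /* simulated user address space */

/* A "user address" is an offset into toy_user_mem, NOT a C pointer. */
typedef size_t toy_uaddr_t;

/*
 * Copy n bytes to the simulated user space after validating the range.
 * Returns the number of bytes NOT copied (0 on success).
 */
static size_t toy_copy_to_user(toy_uaddr_t to, const char *from, size_t n)
{
	assert(from != NULL);
	if (to > TOY_USER_SIZE || n > TOY_USER_SIZE - to)
		return n; /* bad user range: copy nothing */
	memcpy(&toy_user_mem[to], from, n);
	return 0;
}

/*
 * readlink-like helper: copies at most buflen bytes of the link target
 * into the user buffer (like readlink(2), it does not append a NUL).
 * Returns the number of bytes copied, or a negative errno value.
 */
static int toy_readlink(const char *target, toy_uaddr_t buf, int buflen)
{
	size_t len = strlen(target);

	if (buflen <= 0)
		return -EINVAL;
	if (len > (size_t)buflen)
		len = (size_t)buflen; /* truncate to the caller's buffer */
	if (toy_copy_to_user(buf, target, len) != 0)
		return -EFAULT; /* never touch the bad address ourselves */
	return (int)len;
}
```

The point of the design is that the "kernel" side never indexes the user
buffer itself.  Every access goes through the one helper that checks the
range first, and a bad address turns into -EFAULT instead of a stray write.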