* The Linux Virtual File System (VFS)

Recall the layers (from top to bottom):

1. User level:
   applications
   middleware
   libraries
--------------------
2. Kernel:
   system call entry points (e.g., sys_read, sys_open, etc.)
   Linux VFS
   file systems (implement read, write, mkdir, etc.): ext4, ntfs, btrfs, fat32...
   virtual block drivers (RAID, device mapper, LVM)
   block drivers (e.g., SCSI)
   I/O schedulers
   device drivers
--------------------
3. Hardware: (e.g., HDD, graphics, GPU)

* History/motivation

Originally, an app would issue an open(2) syscall, the call would go into
kernel mode and then invoke a file-system-specific function such as
nfs_open() or ext4_open() -- developers had to know which f/s was mounted
and where, so they could call the right function.

After some time, OS designers realized that there was a lot of common code
b/t file systems, and that it was cumbersome for app developers to have to
KNOW which f/s they were working with.  So they decided to create another
layer of virtualization (indirection), called the "virtual" file system,
or VFS.  The VFS sits below the syscall entry points and above the
specific file systems.  It has 2 main benefits:

1. It refactors common code, so developing a new f/s is easier: you only
   have to implement the aspects specific to YOUR new f/s.
2. It provides a library of common routines useful for file systems to
   call -- for example, routines to look up paths, search for objects in
   caches, handle common flags of system calls, and more.

* VFS structure

The VFS "layer" defines various data structures that embody common items
for f/s objects such as files, directories, inodes, etc., and it permits
specific file systems to define their own custom items.  The VFS
implements some basic routines for file systems, and defines a vector of
functions that specific file systems can implement.  The VFS data
structures contain some fields that anyone can read, and some fields that
are protected so others aren't allowed to use them.
Indeed, the VFS (and much of the Linux kernel) uses an object-oriented
programming style -- only it does so in C, without the official OOP
support that's available in, say, C++.

* Specific structures

Recall that in userland we deal with paths, files, directories, file
systems, etc.  The VFS tries to mirror that to some extent.

1. struct inode (abbreviated 'I')

Represents an object "on disk" or on any persistent media.  The inode
contains:
- inode number
- type of object: is the object a regular file? a directory? a symlink?
- who owns the object: user, group
- permissions to access the object
- timestamps
- and more...

Struct inode contains all the info that needs to be returned to users by
the stat(2) syscall.  The inode structure also contains:
- locks that protect access to various fields
- function pointers (methods) for operating on THIS inode
- pointers to file-system-specific data (e.g., struct ext4_inode,
  struct nfs_inode)

2. struct dentry, "directory entry" (abbreviated 'D')

Represents the name of an object -- only one component of that name.  So
if you have a path like /home/user/foo.c, each component (delimited by
'/') will have a dentry in the kernel.  Struct dentry contains:
- the string name of THIS object
- pointers to the parent dir name, child dir names, etc.
- locks, methods/function pointers, etc. (same as inode)

Struct dentry provides the main object for caching pathnames of files
that have been seen and/or looked up.  Many dentries are cached in memory
in a cache called the "directory entry cache" or "dcache" (in Linux).
Different OSs call this cache something else, but they all have a cache
of recently seen dentries.  Thus, struct dentry has a lot of
support/pointers to help cache items in the dcache: items can be looked
up in order, unordered, organized as a hash table, and more.  Recall that
the name of an object is stored persistently, so struct dentry represents
names that are "on disk" (e.g., inside directories).
For example, when you try to open(2) a file, each component of the path
is looked up and searched for in the dcache:
- if found, you can return success
- if not found, you can return ENOENT

3. struct superblock (abbreviated 'SB')

Represents the on-disk state of the whole file system.  It contains:
- size of the file system (blocks, sectors, etc.)
- how many files/inodes are used; how many are free
- type of file system (e.g., ext4, ntfs, nfs)
- the usual function methods, locks, and a ptr to the f/s-specific
  superblock

The stuff in struct superblock is returned by statfs(2) (other OSs may
name this syscall differently, e.g., statvfs(2)).

4. struct file (abbreviated 'F')

Represents information about an open file/directory.  Does NOT contain
info about the file on disk (that would be in the struct inode).  Struct
file contains:
- the open(2) flags and mode: e.g., O_RDONLY, O_RDWR, O_APPEND, etc.
- the offset where you read/write inside that file: each time you
  read/write the file, the offset counter is updated, so the next
  read/write starts from the offset you were at.  When you call lseek(2),
  you just go and change the offset field in struct file.

Once you close(2) a file in userland, the struct file can be discarded.
Struct file also contains methods for operating on THIS file.

How would a struct file be able to access the actual data of a file?
That info resides in another structure -- struct inode (which has
pointers to data blocks).  So struct file needs to be associated with the
corresponding inode.  Easiest is to add a pointer from one structure to
the other.

* Connections between VFS structures

Four main structures: inode (I), dentry (D), superblock (SB), and
file (F).

Suppose someone calls stat() on file "abc.txt".  What will be seen in
memory?  Assume the file exists.  We need an inode (I) for that file, and
we need a dentry (D) for the name.
In memory, we'll have a D and I, connected with a pointer:

	D -> I

Meaning struct dentry (with the name "abc.txt" inside) pointing to the
inode that represents that file on disk.

Next, let's say someone opened the file successfully: a struct file is
created in memory and connected to this D+I pair:

	F -> D -> I

The arrows above represent pointers in C.

Suppose I have a hard-linked file: abc.txt linked to def.txt.  Recall a
hardlink is another name for the same inode.  And let's say that someone
has looked up the abc.txt name as well as the def.txt name.  We need two
dentries in the dcache:

	D1 "abc.txt"
	D2 "def.txt"

Both need to point to the SAME inode in memory:

	D1 -> I <- D2

The dcache thus caches both dentries and inodes.  But like any cache, it
has limited space, and items have to be evicted to make room for new
items.  Suppose I had

	D -> I

and I just removed (free'd, deallocated) that 'I' from memory to make
room?!  This could result in a "dangling pointer": a ptr that points to
an invalid structure.  This is no different from the user-land bug called
"use after free".  Bad corruption can happen when that freed memory
starts getting used for something else -- and kernel bugs are much more
severe than user-level bugs.  So we need a way to track which
linked-together objects are in use, how much they're used, and when
they're no longer used.  Indeed, we use reference counters (RC) to track
the "lifetime" of various objects.

* Refcounts

Assume X -> Y.  When an object (e.g., Y) has someone else (e.g., X)
pointing to it, the pointed-to object has to have its refcount INCREASED
by 1.  When the link/pointer b/t the two objects is broken/removed, the
pointed-to object has to have its refcount DECREASED by 1.  Objects can
be freed when their refcount (RC) reaches 0, meaning there are no more
"users" of this object.  The responsibility for maintaining the RCs is on
the object that makes/removes the pointer.
In the above example, it has to be the code where 'X' creates the pointer
to Y.  However, the RC counter must be stored inside the destination
object -- Y.

* RC examples inside the VFS

	D (RC=0) -> I (RC=1)

Meaning: we cannot remove I from memory, but we CAN remove D from memory.
If we remove D from memory, the structure would look like:

	I (RC=0)

And *now* we can remove this 'I' from memory as well.

Hardlink example:

	D1 (RC=0) -> I (RC=2) <- D2 (RC=0)

If I remove one dentry, say D2, then I's RC has to decrease by 1:

	D1 (RC=0) -> I (RC=1)

Using refcounts allows for easier management of collections of objects:
you can defer the actual "garbage collection" of objects with RC=0 till
later.

After opening a file, I'll get:

	F (RC=1) -> D (RC=1) -> I (RC=1)

What's the RC of F?  It may look like a 0, but in fact it is a 1.
Suppose I have this piece of code:

	int fd;

	fd = open("abc.txt", O_RDONLY);
	if (fd < 0) { // failed
		perror("abc.txt"); // will print an error msg with the ERRNO string
		exit(1);
	}
	// if we get here, open() succeeded, so I have a valid fd >= 0

The "int fd" in userland has to be connected to a 'struct file' in the
kernel.  That connection is made in a process/task table.  In Linux:

	struct task {
		// lots of fields...
		struct file *open_files[MAX_FD];
		// an array of all struct file pointers for open files/dirs
		// for THIS process or task.
		// Q: what is in open_files[0], open_files[1], and open_files[2]?
		// A: stdin (fd 0), stdout (fd 1), and stderr (fd 2)
	};

So if, in the above C code, open() returned the number 4, that means slot
#4 (recall we count from 0) in task->open_files[] is pointing to the file
struct in question:

	task->open_files[fd given to user] -> F (RC=1) -> D (RC=1) -> I (RC=1)

Why do you sometimes see a file struct F with RC=2 or more?  What happens
inside the kernel is that the same POINTER to the same struct file is
added into a different slot in the open_files[] array.  What syscall can
do that?  dup(2) or dup2().
Sometimes you'll have F1 -> D and F2 -> D (the same D, whose RC=2).  That
happens if you open(2) the same file twice: you get 2 fds, each w/ its
own separate struct F.  Why have two different struct file's?  Maybe you
want one opened in read-only mode and another in read/write mode.  Maybe
you have two threads that each need to access the file at a different
offset, so each struct file can maintain its own offset this way.