* The Linux Virtual File System (VFS)

Recall the layers (from top to bottom):

1. User level:
   applications
   middleware
   libraries
--------------------
2. Kernel:
   system call entry points (e.g., sys_read, sys_open, etc.)
   Linux VFS
   file systems (implement read, write, mkdir, etc.): ext4, ntfs, btrfs, fat32...
   virtual block drivers (RAID, device mapper, LVM)
   block drivers (e.g., SCSI)
   I/O schedulers
   device drivers
--------------------
3. Hardware: (e.g., HDD, graphics, GPU)

* History/motivation

Originally, an app would issue an open(2) syscall, the call would go into
kernel mode and then invoke a file-system-specific function such as
nfs_open() or ext4_open() -- developers had to know which f/s was mounted
and where, so they could call the right function.

After some time, OS designers realized that there was a lot of common code
b/t file systems, and that it was cumbersome for app developers to have to
KNOW which f/s they were working with.  So they decided to create another
layer of virtualization (indirection), called the "virtual" file system,
or VFS.  The VFS sits below the syscall entry points and above the
specific file systems.  It has 2 main benefits:

1. It refactors common code, so developing a new f/s is easier: you only
   have to implement the aspects specific to YOUR new f/s.
2. It provides a library of common routines useful for file systems to
   call -- for example, routines to look up paths, search for objects in
   caches, handle common flags of system calls, and more.

* VFS structure

The VFS "layer" defines various data structures that embody common items
for f/s objects such as files, directories, inodes, etc., and it permits
specific file systems to define their own custom items.  The VFS
implements some basic routines for file systems, and defines a vector of
functions that specific file systems can implement.  The VFS data
structures contain some fields that anyone can read, and some fields that
are protected so others aren't allowed to use them.
Indeed, the VFS (and much of the Linux kernel) uses an object-oriented
programming style -- only it does so in C, without the official OOP
support that's available in, say, C++.

* Specific structures

Recall that in userland we deal with paths, files, directories, file
systems, etc.  The VFS tries to mirror that to some extent.

1. struct inode (abbreviated 'I')

Represents an object "on disk" or on any persistent media.  The inode
contains:
- inode number
- type of object: is the object a regular file? a directory? a symlink?
- who owns the object: user, group
- permissions to access the object
- timestamps
- and more...

Struct inode contains all the info that needs to be returned to users by
the stat(2) syscall.  The inode structure also contains:
- locks that protect access to various fields
- function pointers (methods) for operating on THIS inode
- pointers to file-system-specific data (e.g., struct ext4_inode,
  struct nfs_inode)

2. struct dentry, "directory entry" (abbreviated 'D')

Represents the name of an object -- only one component of that name.  So
if you have a path like /home/user/foo.c, each component (delimited by
'/') will have a dentry in the kernel.  Struct dentry contains:
- the string name of THIS object
- pointers to the parent dir name, child dir names, etc.
- locks, methods/function pointers, etc. (same as inode)

Struct dentry provides the main object for caching pathnames of files
that have been seen and/or looked up.  Many dentries are cached in memory
in a cache called the "directory entry cache" or "dcache" (in Linux).
Different OSs call this cache something else, but they all have a cache
of recently seen dentries.  Thus, struct dentry has a lot of
support/pointers to help cache items in the dcache: items can be looked
up in order, unordered, organized as a hash table, and more.  Recall that
the name of an object is stored persistently, so struct dentry represents
names that are "on disk" (e.g., inside directories).
For example, when you try to open(2) a file, each component of the path
is looked up and searched for in the dcache:
- if found, you can return success
- if not found, you can return ENOENT

3. struct superblock (abbreviated 'SB')

Represents the on-disk state of the whole file system.  It contains:
- size of the file system (blocks, sectors, etc.)
- how many files/inodes are used; how many are free
- type of file system (e.g., ext4, ntfs, nfs)
- the usual function methods, locks, and a ptr to the f/s-specific
  superblock

The stuff in struct superblock is returned by statfs(2) (other OSs may
name this syscall differently, e.g., statvfs(2)).

4. struct file (abbreviated 'F')

Represents information about an open file/directory.  Does NOT contain
info about the file on disk (that would be in the struct inode).  Struct
file contains:
- the open(2) flags and mode: e.g., O_RDONLY, O_RDWR, O_APPEND, etc.
- the offset where you read/write inside that file: each time you
  read/write the file, the offset counter is updated, so the next
  read/write starts from the offset you were at.  When you call lseek(2),
  you just go and change the offset field in struct file.

Once you close(2) a file in userland, the struct file can be discarded.
Struct file also contains methods for operating on THIS file.

How would a struct file be able to access the actual data of a file?
That info resides in another structure -- struct inode (which has
pointers to data blocks).  So struct file needs to be associated with the
corresponding inode.  Easiest is to add a pointer from one structure to
the other.

* Connections between VFS structures

Four main structures: inode (I), dentry (D), superblock (SB), and
file (F).

Suppose someone calls stat() on file "abc.txt".  What will be seen in
memory?  Assume the file exists.  We need an inode (I) for that file, and
we need a dentry (D) for the name.
In memory, we'll have a D and I, connected with a pointer:

	D -> I

Meaning struct dentry (with the name "abc.txt" inside) pointing to the
inode that represents that file on disk.

Next, let's say someone opened the file successfully: a struct file is
created in memory and connected to this D+I pair:

	F -> D -> I

The arrows above represent pointers in C.

Suppose I have a hard-linked file: abc.txt linked to def.txt.  Recall a
hardlink is another name for the same inode.  And let's say that someone
has looked up the abc.txt name as well as the def.txt name.  We need two
dentries in the dcache:

	D1 "abc.txt"
	D2 "def.txt"

Both need to point to the SAME inode in memory:

	D1 -> I <- D2

The dcache thus caches both dentries and inodes.  But like any cache, it
has limited space, and items have to be evicted to make room for new
items.  Suppose I had

	D -> I

and I just removed (free'd, deallocated) that 'I' from memory to make
room?!  This could result in a "dangling pointer": a ptr that points to
an invalid structure.  This is no different from the user-land bug called
"use after free".  Bad corruption can happen when that freed memory
starts getting used for something else -- and kernel bugs are much more
severe than user-level bugs.  So we need a way to track which
linked-together objects are in use, how much they're used, and when
they're no longer used.  Indeed, we use reference counters (RC) to track
the "lifetime" of various objects.

* Refcounts

Assume X -> Y.  When an object (e.g., Y) has someone else (e.g., X)
pointing to it, the pointed-to object has to have its refcount INCREASED
by 1.  When the link/pointer b/t the two objects is broken/removed, the
pointed-to object has to have its refcount DECREASED by 1.  Objects can
be freed when their refcount (RC) reaches 0, meaning there are no more
"users" of this object.  The responsibility for maintaining the RCs is on
the object that makes/removes the pointer.
In the above example, it has to be the code where 'X' creates the pointer
to Y.  However, the RC counter must be stored inside the destination
object -- Y.

* RC examples inside the VFS

	D (RC=0) -> I (RC=1)

Meaning: we cannot remove I from memory, but we CAN remove D from memory.
If we remove D from memory, the structure would look like:

	I (RC=0)

And *now* we can remove this 'I' from memory as well.

Hardlink example:

	D1 (RC=0) -> I (RC=2) <- D2 (RC=0)

If I remove one dentry, say D2, then I's RC has to decrease by 1:

	D1 (RC=0) -> I (RC=1)

Using refcounts allows for easier management of collections of objects:
you can defer the actual "garbage collection" of objects with RC=0 till
later.

After opening a file, I'll get:

	F (RC=1) -> D (RC=1) -> I (RC=1)

What's the RC of F?  It may look like a 0, but in fact it is a 1.
Suppose I have this piece of code:

	int fd;

	fd = open("abc.txt", O_RDONLY);
	if (fd < 0) { // failed
		perror("abc.txt"); // will print an error msg with the ERRNO string
		exit(1);
	}
	// if we get here, open() succeeded, so I have a valid fd >= 0

The "int fd" in userland has to be connected to a 'struct file' in the
kernel.  That connection is made in a process/task table.  In Linux:

	struct task {
		// lots of fields...
		struct file *open_files[MAX_FD];
		// an array of all struct file pointers for open files/dirs
		// for THIS process or task.
		// Q: what is in open_files[0], open_files[1], and open_files[2]?
		// A: stdin (fd 0), stdout (fd 1), and stderr (fd 2)
	};

So if, in the above C code, open() returned the number 4, that means slot
#4 (recall we count from 0) in task->open_files[] is pointing to the file
struct in question:

	task->open_files[fd given to user] -> F (RC=1) -> D (RC=1) -> I (RC=1)

Why do you sometimes see a file struct F with RC=2 or more?  What happens
inside the kernel is that the same POINTER to the same struct file is
added into a different slot in the open_files[] array.  What syscall can
do that?  dup(2) or dup2().
Sometimes you'll have F1 -> D and F2 -> D (the same D, whose RC=2).  That
happens if you open(2) the same file twice: you get 2 fds, each w/ its
own separate struct F.  Why have two different struct file's?  Maybe you
want one opened in read-only mode and another in read/write mode.  Maybe
you have two threads that each need to access the file at a different
offset, so each struct file can maintain its own offset this way.