* VFS introduction

VFS: Virtual File System.

In the past (1970s-1980s):

- Every time a new file system was added to the kernel, OS designers added
  new syscalls for THAT file system: for example open_msdos, open_nfs,
  open_ufs, etc., and likewise read_XXX, write_XXX, close_XXX.
- That was annoying to users, who had to know which f/s they were using.
- The way code was written for a new f/s was to copy the old code and
  change it.  This resulted in a lot of duplicated effort and cut-n-paste
  bugs, and it made the kernel image bigger and slower.

In the mid-1980s, Sun Microsystems (at the time a very small company,
since acquired by Oracle) wanted to develop new file systems more quickly
for different media (disk, floppy, cdrom, network), but didn't want to
"copy" the code.  Sun also wanted to provide a uniform API to user-level
programmers, so they wouldn't have to know which f/s they were working on.
Sun was developing its own OS, called SunOS (a derivative of BSD).  So Sun
refactored the code for multiple file systems into a generic layer that
sits between the system call entry points and the actual device/medium,
and called it a "virtual" file system (VFS).

This way, each f/s only has to worry about its OWN unique properties and
features.  For example, a cdrom file system would have read-only
operations; a hard-disk based f/s would need to communicate with a SCSI
block driver; and a network f/s would need to know how to create, send,
and receive network messages.

So now the layering looks like this:

1. A user program issues a read(2) syscall.
2. The kernel invokes the entry-level function sys_read(), which calls
3. a kernel-level VFS function such as vfs_read(), which finds out what
   f/s this specific file lives on and issues a function call to that
   file system's "read" method.  For example,
4. if the file is on ext4, it calls something like ext4_read(); if on
   nfs, nfs_read(); etc.
5a. ext4 will then call a SCSI/SATA block driver to read/write individual
    disk blocks.
5b. nfs will create a network message, transmit it to the file server
    over the network, and wait for a reply.

So the VFS is a layer of abstraction between syscalls and actual f/s
code.  It exports a uniform "API" for f/s designers -- they just have to
fill in a bunch of stub methods (see the sketch at the end of this
section).

The VFS also acts as a repository or "library" of helper utility
functions that f/s designers can use.  It helps simplify data-structure
management, interfacing with the (complex) page cache, object un/locking,
object de/allocation, and more.

The Linux VFS is fairly mature and also complex, as it can handle ~100
different file systems (see kernel sources under fs/*/*).  VFS code is
largely in fs/*.[hc].
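To make "fill in a bunch of stub methods" concrete, here is a small
user-space C sketch of the dispatch idea.  This is NOT kernel code; the
names fs_ops, fakeext_read, vfs_read_demo, etc. are made up for this
illustration.  Each "file system" supplies a table of function pointers
with identical signatures, and one generic function calls through
whichever table the file belongs to -- the same shape as the kernel's
real operations vectors (e.g., struct file_operations).

#include <stdio.h>
#include <stddef.h>
#include <sys/types.h>

/* A simplified "operations table": each f/s fills in its own methods. */
struct fs_ops {
    const char *name;
    ssize_t (*read)(const char *path, char *buf, size_t len);
};

/* Two toy "file systems" supplying the same method signature. */
static ssize_t fakeext_read(const char *path, char *buf, size_t len)
{
    return (ssize_t)snprintf(buf, len, "[fakeext] blocks of %s", path);
}

static ssize_t fakenfs_read(const char *path, char *buf, size_t len)
{
    return (ssize_t)snprintf(buf, len, "[fakenfs] fetched %s over the network", path);
}

static const struct fs_ops fakeext_ops = { "fakeext", fakeext_read };
static const struct fs_ops fakenfs_ops = { "fakenfs", fakenfs_read };

/* The "VFS layer": generic code that never knows which f/s it calls. */
static ssize_t vfs_read_demo(const struct fs_ops *ops, const char *path,
                             char *buf, size_t len)
{
    return ops->read(path, buf, len);   /* dispatch to that f/s's method */
}

int main(void)
{
    char buf[128];

    vfs_read_demo(&fakeext_ops, "/a/file", buf, sizeof(buf));
    printf("%s\n", buf);
    vfs_read_demo(&fakenfs_ops, "/b/file", buf, sizeof(buf));
    printf("%s\n", buf);
    return 0;
}

Adding a new "file system" here means writing one read function and one
ops table; the generic caller never changes -- which is exactly the
duplication problem the VFS was created to solve.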
* Linux VFS Data Structures

1. struct inode, abbreviated 'I'

An inode contains information about a file that resides on some storage
medium such as a hard disk.  The inode contains everything that any f/s
will need to know about that file, as well as what users want -- such as
stat(2) info.  The info in struct inode is mostly persistent, as it comes
from the underlying medium.  Examples of this info:

- inode number
- file owner/group
- file permissions
- time stamps, etc. -- the other "stat(2)" items, so sys_stat just copies
  fields from the inode into the __user stat buffer
- there's much more inside an inode, as we'll see later

2. struct dentry, abbreviated 'D'

A dentry contains the NAME of an entity inside a directory (not a full
pathname), as it appears on some disk medium.  It links to the underlying
inode struct that corresponds to the file (note: the file name is NOT in
the inode struct, but in the dentry struct).  A dentry also has pointers
to its parent dentry, child dentries, and other links that together form
a "cache" of dentries, called the "dcache" in Linux.

In any OS, a lot of syscalls take pathnames.  Each pathname has multiple
name components, and each component corresponds to a dentry in the
kernel.  Thus, translating pathnames into component names and dentries
happens a lot inside the OS -- a process called "lookup" (Linux), "namei"
(BSD), or pathname_lookup/lookup_pn/etc. (other OSs).  Because looking up
file names is common in many syscalls, and yet the info lives on slow
media, once found, this info gets stored in a special cache called:

- the directory entry (dentry) cache in Linux -- or dcache
- the Directory Name Lookup Cache (DNLC) in some other OSs

3. struct file, abbreviated 'F'

A struct file contains info about an OPENED, named object/file.  A struct
file maps to some file descriptor of some user process (struct task in
Linux).  A struct file has:

- links to the corresponding dentry/inode
- the open mode of the file: what permissions you asked for when you
  open(2)'d the file, which is different from the "chmod" permission bits
  on the file on disk
- the offset at which to read/write next in the file
- the "process which opened the file" / the file descriptor (part of
  struct task)

* How these three objects relate?

When you look up the name of an object (e.g., in any syscall that takes a
pathname, or a stat call), and the object is found on the file system,
the VFS creates an inode I and a dentry D, and links them as follows:

    D->I

There is a literal pointer from struct dentry to struct inode.  The
dentry is added to the dcache so it can be looked up more quickly later
on.  Once you have a dentry, you can follow the pointer to get to the
inode.

When you open a file successfully, you also create a struct file F and
link it to the dentry (whether or not D->I was already cached):

    F->D->I

Objects can have multiple pointers pointing to them:

    D2
      \
       v
    D1->I

That is, both D1 and D2 point to the same inode I.  This can happen if
the same physical inode has multiple names AND at least 2 of those names
were looked up and cached inside the dcache.  That's what it means to be
a "hardlink" -- the same physical file content (and inode number) has
multiple names.  Symlinks are different: a symlink has a unique name,
inode number, and struct inode.  But the "content" of the symlink, if you
were to read(2) it, is just a string that can be interpreted as yet
another pathname.

    F1->D->I
    F2->D->I    (same D, same I)

That is, two different struct file's point to the same dentry, which
links to the same (only one) inode.  This happens when two different file
descriptors refer to the same file: one or more processes open(2)'d the
same file using the same name.

    F1->D3->I
    F2->D4->I   (same I)

That is, two different struct file's, each pointing to a different name
(dentry), but ultimately accessing the same inode content: two opened
files, each under a different name of the same hard-linked file.

We separate the file's content (I) from its name (D) because a file can
have only one content but multiple names.  Similarly, we separate the
open file structs from D and I because each process that opens a file
needs to keep its own offset for each file descriptor.
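To make the pointer picture concrete, here is a heavily simplified
user-space C sketch of the three objects.  The field names (d_inode,
d_parent, f_pos, ...) are chosen to resemble the real ones, but this is
only an illustration -- the actual definitions in include/linux/fs.h and
include/linux/dcache.h have many more fields, plus locks and reference
counts.

#include <stdio.h>

/* Heavily simplified stand-ins for the real kernel structures. */
struct inode {                       /* 'I': per-file metadata */
    unsigned long  i_ino;            /* inode number */
    unsigned int   i_mode;           /* permission bits + file type */
    unsigned long  i_size;           /* file size in bytes */
};

struct dentry {                      /* 'D': one name component */
    const char    *d_name;           /* name within the parent directory */
    struct dentry *d_parent;         /* parent directory's dentry */
    struct inode  *d_inode;          /* the D->I pointer */
};

struct file {                        /* 'F': one open instance */
    struct dentry *f_dentry;         /* the F->D pointer */
    unsigned int   f_flags;          /* open mode requested at open(2) */
    long long      f_pos;            /* this open's private r/w offset */
};

int main(void)
{
    struct inode  i = { .i_ino = 1234, .i_mode = 0644, .i_size = 42 };
    struct dentry d = { .d_name = "notes.txt", .d_parent = NULL, .d_inode = &i };
    struct file   f = { .f_dentry = &d, .f_flags = 0, .f_pos = 0 };

    /* Walk F->D->I, as VFS code does constantly. */
    printf("'%s': inode %lu, size %lu, offset %lld\n",
           f.f_dentry->d_name,
           f.f_dentry->d_inode->i_ino,
           f.f_dentry->d_inode->i_size,
           f.f_pos);
    return 0;
}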
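And a quick user-space demonstration of the last point (a sketch with
error handling omitted; the file name demo.txt is arbitrary): opening the
same pathname twice produces two struct file's that share one
dentry/inode, and each descriptor advances its own offset independently.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4];

    /* Create a small test file. */
    int fd = open("demo.txt", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    write(fd, "abcdefgh", 8);
    close(fd);

    /* Two separate open(2) calls => two struct file's, one dentry/inode. */
    int fd1 = open("demo.txt", O_RDONLY);
    int fd2 = open("demo.txt", O_RDONLY);

    read(fd1, buf, sizeof(buf));     /* advances ONLY fd1's offset */

    printf("fd1=%d offset=%lld, fd2=%d offset=%lld\n",
           fd1, (long long)lseek(fd1, 0, SEEK_CUR),
           fd2, (long long)lseek(fd2, 0, SEEK_CUR));
    /* typically prints: fd1=3 offset=4, fd2=4 offset=0 */

    close(fd1);
    close(fd2);
    return 0;
}

The printed descriptor numbers (typically 3 and 4) are indices into the
per-process open-file table described next.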
Each process in Linux has a corresponding struct task that maintains lots
of info, including all opened file descriptors (simplified here; in the
real kernel, struct task_struct reaches its table of struct file pointers
through a couple of intermediate structures):

struct task {
    ...
    struct file *open_files[];  // array of this process's opened files
    ...
};

task->open_files[0] is a 'struct file *' that corresponds to stdin
task->open_files[1] is a 'struct file *' that corresponds to stdout
task->open_files[2] is a 'struct file *' that corresponds to stderr

When you open(2) the first file in any program and get a new fd, it is
almost always the number 3.  The index into task->open_files *IS* the FD
number that sys_open returns to the user-land open(2) syscall.

* next time
- object mgmt (RC)
- dcache for ENOENT entries, negative dentry
- why F->I pointer
- cache-line efficiency
- structures: public, private, methods, extensibility
- other VFS data structures