* Linked lists, cont.

Many reasons to create lists of items that share a common feature, e.g.:
- all inodes that have pending dirty writes: useful to know "how many inodes
  cannot be removed from memory b/c they have pending writes"
- all inodes active in the cache: useful to know "how many inodes does this
  f/s have cached?"
- inodes in some order (e.g., LRU): useful for memory reclamation

struct list_head and the helpers in linux/list.h let you link structs of any
type together and find the enclosing struct from the embedded list field.

Traditional linked list:

    struct foo {
        int i;
        // ...
        struct foo *next; // ptr points to start of (next) struct foo.
        // problem: if you use this, then you need a different ptr type for
        // every object that linux wants to turn into a linked list.
    };

In Linux, you do it differently:

    struct foo {
        int i;
        // ...
        struct list_head purpose; // note: the type is always the same,
        // regardless of the "enclosing" struct.  Linux leverages a
        // seldom-used compile-time feature called offsetof(type, member),
        // which gives you the byte offset of a field relative to the start
        // of the struct that contains it.
    };

* Hash tables

struct hlist_node implements a hash table on top of list_head.  It adds an
array of "buckets", where each bucket leads to a linked list, plus a hash
function that you can define to decide which bucket an item falls into.
HTs are good when you want an efficient lookup of some item.  Lookup in a
plain linked list is O(n).  A good HT can reduce lookup time to O(1) (if
each bucket has one item) or roughly O(n/num_buckets) with a good hash
function.

struct inode:
    struct hlist_node i_hash;
Used for finding an inode (usually by its inode number or inode* ptr) in the
collection of all cached inodes for a given f/s.

Rule: if you see an hlist_node inside ANY struct, there are probably many
instances of that struct in memory, and someone wants an efficient lookup
(which also means inserting new items into the HT and removing old ones).

struct dentry:
    struct hlist_bl_node d_hash;    /* lookup hash list */
Used for looking up a dentry, usually by its name and parent dir.

    struct list_head d_child;       /* child of parent list */
    struct list_head d_subdirs;     /* our children */
Lots of Unix commands perform recursive ops: ls -R, find, rm -r,
chmod/chown/chgrp -R.  They all need to descend into child dirs and then
walk back up to sibling dirs.  struct dentry keeps these lists for every
dentry to speed up such recursive ops (see the sketch at the end of this
section).
Note: the lists record only objects that have been looked up and cached in
the dcache.  A directory could have many more subdirs that were never looked
up; only the subset that has been looked up and cached appears in these
lists.

    struct dentry *d_parent;        /* parent directory */
Every dentry points to a parent dentry.  Useful for "cd .." or any time a
pathname contains "..", for example "ls ../../../../usr/local".  This allows
very quick name resolution of ".." relative to any given dir.
Notes, in Linux:
1. d_parent always exists
2. d_parent is not refcounted relative to this dentry.
3. d_parent of the root dentry points to itself.  That way, "cd /../../"
   stays inside "/".
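To tie the list_head machinery to the dentry fields above, here is a minimal
kernel-style sketch (not from the notes) that counts the cached children of a
directory by walking d_subdirs.  The helper name count_cached_children() is
hypothetical, the locking is deliberately simplified, and newer kernels have
renamed these dentry list fields, so treat it as an illustration of
list_for_each_entry(), not production dcache code.

    #include <linux/dcache.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>

    /* hypothetical helper: count the *cached* children of a directory dentry */
    static int count_cached_children(struct dentry *parent)
    {
        struct dentry *child;
        int n = 0;

        spin_lock(&parent->d_lock);
        /*
         * list_for_each_entry() walks parent->d_subdirs; for each embedded
         * d_child list_head it uses container_of() -- built on offsetof() --
         * to recover the enclosing struct dentry.
         */
        list_for_each_entry(child, &parent->d_subdirs, d_child)
            n++;
        spin_unlock(&parent->d_lock);

        return n;
    }

Because only cached children appear on d_subdirs, this counts the dcache's
view of the directory, not everything on disk -- exactly the caveat noted
above.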
* Stackable file systems

Motivation: there are lots of different f/s; each usually implements the VFS
APIs and then stores its data on some medium: a scsi/sata/usb hard disk or
flash drive; a network socket (remote server); a floppy; a readonly cdrom;
even a ramdisk (just a bunch of memory).

Developing a new f/s is hard and often takes many man-years of work.  First,
kernel code is harder to write and debug.  Second, file systems are expected
to never lose or corrupt data.  In networking, there's a principle of "best
effort": try to deliver packets; if you can't, drop 'em (the sender will
retry).  In f/s, there's no such retry: when write(2) returns success, you
expect the data to be preserved permanently.  That means a higher standard
of robustness and reliability for any code that stores/manages
user/application data.

Problem: what if you want to try a new feature in a file system?  For
example, suppose you want to encrypt or compress files automatically, using
an existing f/s like ext4?

Options:

1. Develop a new f/s from scratch -- takes too long, and may not be worth
   the time just to try out 1-2 new features.

2. Modify an existing f/s like ext4.  Now you have to read and understand
   40k+ LoC: lots of effort to understand, and risky, since you may "break"
   an existing (and otherwise stable) f/s.  Even if you manage to add the
   feature to ext4, you'd have to keep maintaining it indefinitely,
   constantly integrating changes from baseline ext4: an ongoing maintenance
   nightmare for your own modified ext4.  Linux developers aren't likely to
   take 3rd-party code from "unknown" users, not without credentials
   established over years.  So you'd have to maintain your own code for
   years, possibly forever.  And if you add a useful feature to ext4, what
   about other file systems -- xfs, btrfs, nfs...?  To support your feature
   for those users too, you'd have to add it to N other file systems, with
   even more development/maintenance work.

3. Develop the code in userland, using a user-level file system API.  In
   Linux, that's FUSE (Filesystem in Userspace).  FUSE is a kernel module
   that packs the f/s op's arguments into a message (much like a network
   packet) and sends it to a user-level server that handles it.  There's a
   libfuse library for developing such file systems in user-land.  Lots of
   FUSE-based file systems have been developed over the years; many are
   small, toy, instructional ones, but a few user-level file servers are
   commercial.

   Pro: very easy to develop a user-level f/s.  FUSE even supports 2
   different APIs -- a VFS-like rich (low-level) API, and a higher-level one
   that looks more like open, read/write, close (a minimal sketch of the
   latter appears at the end of this section).

   Cons:
   1. Performance and latency!  Every op requires a round trip between the
      kernel and the user-level server, plus data copies between kernel and
      user space.
   2. Dependence on a user-level f/s server.  A user-level process can be
      descheduled/rescheduled at any time and competes with other
      processes.  So if your f/s is important, you may find that under
      heavy load the FUSE server doesn't get scheduled enough -- hurting
      performance even more.
   3. Safety and security.  The kernel's job is to protect its h/w
      resources and prevent malicious access as best it can.  User-level
      processes cannot easily protect their own resources, and the kernel
      cannot protect user processes from their own mistakes -- the kernel
      can only protect ITSELF and its h/w from ANY user process.
      Note: it's not impossible to trigger a bug in kernel code.  There
      have been a few such bugs in recent years, but they're usually rare
      and fixed right away.  Conversely, user-level bugs are far more
      commonplace: there's lots more user-level code, and attacks at user
      level are generally easier than at the kernel level.

Q: So how come there are commercial versions of user-level file systems?!
A: They are all network-based or distributed file systems, where
   performance is bound by the network, not by user/kernel crossings.  They
   usually have to implement their own security mechanisms inside the
   distributed f/s.
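To make the FUSE high-level API concrete, here is a minimal read-only
"hello" file system modeled on libfuse's classic example.  It is a sketch
assuming the libfuse 2.x high-level API (libfuse 3 changed some signatures,
e.g., getattr and readdir take extra arguments); names like hello_getattr
are just local choices.

    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    static const char *hello_str  = "Hello from FUSE\n";
    static const char *hello_path = "/hello";

    /* fill in a struct stat for "/" and "/hello" */
    static int hello_getattr(const char *path, struct stat *stbuf)
    {
        memset(stbuf, 0, sizeof(*stbuf));
        if (strcmp(path, "/") == 0) {
            stbuf->st_mode  = S_IFDIR | 0755;
            stbuf->st_nlink = 2;
        } else if (strcmp(path, hello_path) == 0) {
            stbuf->st_mode  = S_IFREG | 0444;
            stbuf->st_nlink = 1;
            stbuf->st_size  = strlen(hello_str);
        } else {
            return -ENOENT;
        }
        return 0;
    }

    /* the root dir lists exactly one file */
    static int hello_readdir(const char *path, void *buf,
                             fuse_fill_dir_t filler, off_t offset,
                             struct fuse_file_info *fi)
    {
        if (strcmp(path, "/") != 0)
            return -ENOENT;
        filler(buf, ".", NULL, 0);
        filler(buf, "..", NULL, 0);
        filler(buf, hello_path + 1, NULL, 0); /* skip leading '/' */
        return 0;
    }

    static int hello_open(const char *path, struct fuse_file_info *fi)
    {
        if (strcmp(path, hello_path) != 0)
            return -ENOENT;
        if ((fi->flags & O_ACCMODE) != O_RDONLY)
            return -EACCES;
        return 0;
    }

    static int hello_read(const char *path, char *buf, size_t size,
                          off_t offset, struct fuse_file_info *fi)
    {
        size_t len = strlen(hello_str);

        if ((size_t)offset >= len)
            return 0;
        if (offset + size > len)
            size = len - offset;
        memcpy(buf, hello_str + offset, size);
        return size;
    }

    static struct fuse_operations hello_ops = {
        .getattr = hello_getattr,
        .readdir = hello_readdir,
        .open    = hello_open,
        .read    = hello_read,
    };

    int main(int argc, char *argv[])
    {
        /* libfuse parses the mountpoint and options from argv */
        return fuse_main(argc, argv, &hello_ops, NULL);
    }

Assuming libfuse 2 and its pkg-config file are installed, something like
"gcc hello.c $(pkg-config fuse --cflags --libs) -o hello && ./hello /mnt/tmp"
should mount it; every read of the hello file then bounces through this
user-level process, which is exactly the kernel/user crossing the cons above
refer to.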
* Stacking and wrapfs

Another proposed solution was to develop a new f/s in the kernel as a layer
that can stack on top of some other layer.  Stackable file systems can be
implemented at both the kernel and the user level.  One example of a
stackable kernel-level f/s is wrapfs (another is ecryptfs, an encrypting
stackable f/s).  A stackable f/s runs in the kernel (better performance and
security), but it doesn't need to be a full f/s implemented from scratch:
it intercepts VFS and f/s ops, modifies only the ones it needs to, and
simply passes the rest up/down.

Usual layers:
1. user apps calling the kernel
2. syscalls
3. VFS gets called
4. a native f/s like ext4
5. ext4 calls a block disk driver like SCSI

With stacking:
1. user apps calling the kernel
2. syscalls
3. VFS gets called
4a. a stackable f/s that just implements feature X (e.g., encryption)
4b. a stackable f/s that just implements feature Y (e.g., compression)
4c. one or more additional stackable f/s
5. some native f/s like ext4
6. ext4 calls a block disk driver like SCSI

You can reorder stackable f/s any way you want: as long as they implement
the same VFS API, you should be able to stack (read: mount) them in any
order.  A stackable f/s doesn't "mount" on top of a hard disk or a network,
but rather on another file system, called the "lower" f/s.  So the thing you
mount on is a directory that already belongs to some mounted lower f/s.

Wrapfs is under 2 KLoC and by default "does nothing": it just passes each op
to the lower layer.  Wrapfs has two roles:
1. To the lower f/s, it has to act as if it were the VFS calling that f/s.
2. To the actual VFS, it has to look like a regular f/s.
Another way to think about it: wrapfs has two "halves" -- an upper half that
looks like a regular f/s to the VFS, and a lower half that treats the lower
f/s the way the actual VFS would.

* wrapfs implementation

In any f/s, when a file is opened, you're going to see

    F -> D -> I     (struct file -> dentry -> inode)

That's also true for wrapfs: it has to maintain its own objects.  But it
also has to associate the lower f/s objects with its own (wrapfs) objects.

    Wrapfs (upper):  F -> D -> I   (these objects belong to wrapfs)
    Ext4 (lower):    F -> D -> I   (these belong to ext4: ops vectors, etc.)

How to associate the lower and upper objects: use the void * private
pointers (or container structs) to "store" the lower f/s info.  Thus, you
need to have pointers:

    F -> D -> I     (upper)
    |    |    |
    v    v    v
    F -> D -> I     (lower)

An upper F points to a lower F, and so on.  That way, every time the VFS
calls a wrapfs method, the method can find the lower objects and pass the op
on to the lower f/s.  The lower objects now have additional pointers
pointing to them, so they must maintain a proper refcount (at least 1 more
than before).  In stacking, the code that associates an upper and a lower
object is called "interposition".

Ops like ->unlink just need to find the lower inode + dentry and pass them
to vfs_unlink().  Ops like ->create get a dir inode and a negative dentry
that looks like this:

    D -> NULL inode     (upper, wrapfs)
    |
    v
    D -> NULL inode     (lower, ext4)

That is, the dentry arg to ->create (not the inode) is guaranteed to be a
negative dentry, so it has no inode.  But after wrapfs_create calls
vfs_create, if the latter succeeded, we get back:

    D -> NULL inode     (upper, wrapfs)
    |
    v
    D -> I              (positive ext4 inode)

So we must maintain the illusion, by creating the upper (wrapfs) inode,
attaching it to our wrapfs dentry, and then connecting the new upper inode
to the lower inode that ext4 gave us.  This is what wrapfs_interpose does:
it takes the above structure and fills it in to produce this one:

    D -> I      (upper)
    |    |
    v    v
    D -> I      (lower)
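As a concrete illustration of interposition, here is a simplified sketch of
how the upper-to-lower pointers can be kept, modeled on the container
structures in the real wrapfs sources.  Field and helper names follow
wrapfs, but the exact layout varies across wrapfs and kernel versions, so
treat this as an assumption-laden outline rather than the definitive code.

    #include <linux/fs.h>
    #include <linux/dcache.h>
    #include <linux/path.h>
    #include <linux/spinlock.h>

    /*
     * Per-object "containers": each upper (wrapfs) object points to its
     * lower counterpart.  Modeled on wrapfs; details vary by version.
     */
    struct wrapfs_inode_info {
        struct inode *lower_inode;  /* I (upper) -> I (lower) */
        struct inode vfs_inode;     /* the upper inode the VFS sees */
    };

    struct wrapfs_dentry_info {
        spinlock_t lock;
        struct path lower_path;     /* D (upper) -> D (lower) + lower mnt */
    };

    struct wrapfs_file_info {
        struct file *lower_file;    /* F (upper) -> F (lower) */
    };

    /* map an upper inode back to its container via container_of()/offsetof() */
    static inline struct wrapfs_inode_info *WRAPFS_I(struct inode *inode)
    {
        return container_of(inode, struct wrapfs_inode_info, vfs_inode);
    }

    static inline struct inode *wrapfs_lower_inode(struct inode *inode)
    {
        return WRAPFS_I(inode)->lower_inode;
    }

    static inline struct dentry *wrapfs_lower_dentry(struct dentry *dentry)
    {
        /* the dentry container hangs off d_fsdata */
        struct wrapfs_dentry_info *info = dentry->d_fsdata;

        return info->lower_path.dentry;
    }

With these in place, a pass-through op like ->unlink can call
wrapfs_lower_inode()/wrapfs_lower_dentry() on its arguments and hand the
lower objects to vfs_unlink() (whose exact signature depends on the kernel
version), taking extra references on the lower objects (dget(), etc.) so
their refcounts stay at least one higher than before interposition.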