* File system specific info, inode containers

In Linux, there are ~100 file systems supported, and more "outside mainline" (not officially part of the kernel source distribution). Every inode has "common" info applicable to most/all file systems. But every f/s also has its own special requirements for additional info that it wants to store. Examples:

1. A network f/s would need to record the hostname/IP and port of the network server.
2. An encrypting file system would need to know the ciphers+keys/etc. used to enc/dec a file.

Every file system has its own "extension" to the inode, often called "struct XXX_inode_info", e.g., ext4_inode_info, ecryptfs_inode_info, nfs_inode_info, etc. But where to store this info so it gets cached and passed around with the inode ptr between the VFS and the actual f/s?

One idea is inode->i_private (a void*). With a void* ptr, a f/s can attach a data structure of any size to it (though, typically, large structs are discouraged). Concurrent access to inode->i_private? Typically locking takes care of it. Note also that i_private belongs to the actual f/s: the f/s puts stuff in i_private and deallocates it. The VFS is not allowed to touch i_private, only to pass the enclosing inode around.

On busy/big systems, there can be 1000s of inodes+dentries cached. Every time you cross a pointer in kernel space, you may have to flush your CPU caches and load a different region of memory into the CPU. When this happens many times a second, your workload becomes memory-bound. To improve CPU cache-line efficiency, certain fields in the inode/etc. structs are clustered together (so if a process accesses one field, the other fields likely to be accessed are already on the CPU cache line).

A long time ago, struct inode ended not with the void *i_private but with:

	union {
		struct minix_inode_info	minix_i;
		struct ext3_inode_info	ext3_i;
		struct iso_inode_info	isofs_i;
		struct nfs_inode_info	nfs_i;
		/* same for all other "supported" file systems */
		void *generic_ip;
	} u;

So ext3, for example, could access inode->u.ext3_i; nfs would access inode->u.nfs_i; etc. These XXX_inode_info structs were EMBEDDED inside struct inode.

Benefit: the per-f/s info is embedded in struct inode, all laid out CONTIGUOUSLY in memory, so much better memory locality inside CPU caches. This was good for performance.

Q: what is sizeof(inode->u) in the above example?
A: the size of the LARGEST member of the union.

Problem: some memory is wasted, because we have to have enough room for the largest member of the union, but most other members are probably smaller. Ironically, when you waste (kernel!) memory, you also increase memory pressure on the OS and have "less" memory for everything else (e.g., other caches and user process pages). When you can cache less, overall performance DROPS. Also, each time you wanted to support another f/s, you needed to modify struct inode and add another member to the union.

What we really want is a way to allocate the VFS inode and the per-f/s inode info in contiguous memory, but allocate only what we need and no more. This was resolved by introducing the concept of a "container", which is another form of an out-of-band data structure. Another change that was needed was to make the actual f/s responsible for allocating the VFS struct inode + the actual f/s's own inode_info (before, with the union, the VFS de/allocated inodes). When the VFS needs to allocate an inode, it calls the file system's "iget" method as part of ->lookup (recall that lookup is the one that can result in a new dentry/inode pair being allocated). A self-contained sketch of the container idea appears below, followed by the steps ext4 would take.
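Below is a minimal, self-contained userspace sketch of the container idea (hypothetical struct and function names such as myfs_inode_info and MYFS_I, and plain malloc instead of the kernel's custom allocators): the per-f/s info and the "generic" inode are carved out of one contiguous allocation, and either pointer can be recovered from the other with simple pointer arithmetic.

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	/* Hypothetical "VFS" inode: only generic fields. */
	struct inode {
		unsigned long i_ino;
		unsigned long i_size;
	};

	/* Hypothetical per-f/s extension (stands in for ext4_inode_info). */
	struct myfs_inode_info {
		unsigned long i_flags;
		char i_label[24];
		/* the VFS inode is placed right after this struct in memory */
	};

	/* Allocate the per-f/s info and the VFS inode in ONE contiguous buffer. */
	static struct inode *myfs_alloc_inode(unsigned long ino)
	{
		size_t len = sizeof(struct myfs_inode_info) + sizeof(struct inode);
		char *ptr = malloc(len);	/* kernel code would use a slab cache */
		struct myfs_inode_info *info;
		struct inode *inode;

		if (!ptr)
			return NULL;
		info  = (struct myfs_inode_info *)ptr;
		inode = (struct inode *)(ptr + sizeof(struct myfs_inode_info));

		info->i_flags = 0;
		strcpy(info->i_label, "hello");
		inode->i_ino  = ino;
		inode->i_size = 0;
		return inode;			/* the "VFS" only ever sees this pointer */
	}

	/* Given only the VFS inode pointer, find our own inode_info bytes. */
	static struct myfs_inode_info *MYFS_I(struct inode *inode)
	{
		return (struct myfs_inode_info *)
			((char *)inode - sizeof(struct myfs_inode_info));
	}

	int main(void)
	{
		struct inode *inode = myfs_alloc_inode(42);

		if (!inode)
			return 1;
		printf("ino=%lu label=%s\n", inode->i_ino, MYFS_I(inode)->i_label);
		free(MYFS_I(inode));	/* free the whole container, not just the inode */
		return 0;
	}

In the real kernel this arithmetic is hidden behind tiny per-f/s helpers (ext4's is called EXT4_I()), and the containers come from dedicated slab caches rather than malloc.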
Concretely, inside the f/s code, ext4 (for example) would:

1. len = sizeof(struct ext4_inode_info) + sizeof(struct inode)
   - Note that "struct inode" is sometimes called the "VFS inode".
2. Allocate a buffer of size len, say at "ptr".
   - You could allocate with kmalloc, but in practice it's done using "custom mem allocators" (TBD).
3. ext4 "stuffs" its own info into ptr[0]..ptr[sizeof(struct ext4_inode_info)-1].
4. struct inode *ip = (struct inode *)&ptr[sizeof(struct ext4_inode_info)];
5. ext4 returns ip to the VFS; the VFS can treat it like any other "struct inode".

Next time the VFS passes an inode to any f/s method in ext4, ext4 can find out where its OWN inode_info bytes are as follows (cast to char* so the subtraction is in bytes, not in units of struct inode):

	void foo(struct inode *inode)
	{
		struct ext4_inode_info *i_ext4 = (struct ext4_inode_info *)
			((char *)inode - sizeof(struct ext4_inode_info));
		...
	}

Benefit: allocate only the bytes needed, and all of them are contiguous in memory.
Con: a bit more complex coding in file systems, but there are a lot of "helper" routines in the VFS, for example iget_locked and iget5_locked; and also all sorts of code to manage these "containers" (allocate a container pool to hold N such objects, find a free entry in the pool, deallocate, and extend the pool if needed).

Technically, the inode_info could go AFTER the struct inode itself in memory, and you'd still not waste memory. But studies showed that, statistically, it's better for the inode_info to be BEFORE the inode struct itself; and you can place the MOST frequently used fields at (1) the start of struct inode and (2) the end of struct XXX_inode_info, so the hot fields of both end up adjacent.

* struct dentry

Main purpose: hold the name of a file/dir/object.

	struct dentry {
		...
		struct qstr d_name;
		unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */
		...
	};

	struct qstr {
		union {
			struct {
				HASH_LEN_DECLARE;	/* u32 hash; u32 len; */
			};
			u64 hash_len;
		};
		const unsigned char *name;
	};

POSIX says the max file name is 255 bytes (buffers are typically 256 to leave room for the \0). That's different from the max pathname (4096 bytes). As with inodes, we want the name to be in the same memory as the struct dentry itself, but we don't want to waste memory. Alas, file names are variable-length. If the dentry had "char name[256];" then we could fit every possible name into the dentry, and it'd be "fast" b/c contiguous in memory, but we'd be wasting memory on any file name shorter than 255 bytes. Most file names are much, much shorter (easier for humans to create and remember).

So how to ensure dentries are fast to access and don't waste memory? Studies showed that most names are short. So a dentry in Linux EMBEDS space for "small" names (the most common case) directly in the dentry:

	unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */

e.g., on most 64-bit architectures, DNAME_INLINE_LEN is 32. This allows fast dentries for the vast majority of file names. But what if the name is longer than that? In that case, we have no choice but to allocate a null-terminated string of len > 32 and store a pointer to it inside struct dentry:

	dentry->d_name.name	// variable-length char*, kmalloc'd
	dentry->d_name.len	// length of the name the dentry stores

d_name.len always holds the strlen() of the dentry name (and d_name.hash a precomputed hash of it; the two are packed together into the 64-bit hash_len), so you don't have to waste cycles recomputing the length with strlen().

	If d_name.len < DNAME_INLINE_LEN: the name lives in dentry->d_iname
	else: it lives in the separately allocated dentry->d_name.name

To further optimize, and make coding easier, the VFS sets, for short names:

	dentry->d_name.name = dentry->d_iname;

IOW, a dentry may (for short names) have a pointer to another part of itself! A minimal sketch of this short-vs-long name scheme follows.
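A minimal, self-contained userspace sketch of that scheme (hypothetical struct layouts and helper names; the real struct qstr also packs the precomputed hash next to the length, which is omitted here):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	#define DNAME_INLINE_LEN 32	/* assumed value, as on most 64-bit kernels */

	/* Simplified stand-ins for struct qstr / struct dentry. */
	struct qstr {
		unsigned int len;		/* name length, so no strlen() needed */
		const unsigned char *name;	/* always valid: inline buf or allocated */
	};

	struct dentry {
		struct qstr d_name;
		unsigned char d_iname[DNAME_INLINE_LEN];	/* small names live here */
	};

	/* Fill in a dentry's name: inline if it fits, external otherwise. */
	static int dentry_set_name(struct dentry *d, const char *name)
	{
		size_t len = strlen(name);

		d->d_name.len = len;
		if (len < DNAME_INLINE_LEN) {
			/* short name: copy into the embedded buffer ... */
			memcpy(d->d_iname, name, len + 1);
			/* ... and point d_name.name at our own d_iname */
			d->d_name.name = d->d_iname;
		} else {
			/* long name: separate allocation (the kernel would kmalloc) */
			unsigned char *ext = malloc(len + 1);
			if (!ext)
				return -1;
			memcpy(ext, name, len + 1);
			d->d_name.name = ext;
		}
		return 0;
	}

	/* Free only names that were allocated externally. */
	static void dentry_free_name(struct dentry *d)
	{
		if (d->d_name.len >= DNAME_INLINE_LEN)
			free((void *)d->d_name.name);
	}

	int main(void)
	{
		struct dentry d;

		dentry_set_name(&d, "short.txt");
		printf("%s (len=%u, inline=%d)\n", d.d_name.name, d.d_name.len,
		       d.d_name.name == d.d_iname);
		dentry_free_name(&d);

		dentry_set_name(&d, "a-rather-long-file-name-that-does-not-fit-inline.txt");
		printf("%s (len=%u, inline=%d)\n", d.d_name.name, d.d_name.len,
		       d.d_name.name == d.d_iname);
		dentry_free_name(&d);
		return 0;
	}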
Convenient, because a programmer doesn't have to worry about short vs. long dentry names: always use dentry->d_name.name and you'll get the dentry name (which just happens to be more efficient when it's a short name). Regardless of the length of the name, the length field is always set, so it's easy for anyone to know the name's length. Also, you MUST know this length, to know whether d_name.name points to a separate allocation that you should kfree() (as opposed to the embedded d_iname, which must NOT be freed).

* Linked Lists in kernel code

OS code prefers simple data structures and algorithms: easier to develop and debug, easier locking semantics (TBD), and a small footprint. The most popular data structures are lists and hashes. There are some tree structures (B-tree, B+, red-black), but they're used in specific modules as needed.

Even linked lists can be hard to get right, and pointer bugs are bad inside a kernel. So Linux decided to offer a "linked list" service; see <linux/list.h>. If you want any struct to become part of a linked list, just add a "struct list_head" member inside it (see the usage sketch at the end of this section). You can create different lists for the same struct. For example, inside struct inode:

	struct list_head i_lru;		/* inode LRU list */

A list of all inodes in LRU order: useful for memory management (freeing up memory by evicting "less used" inodes).

	struct list_head i_wb_list;	/* backing dev writeback list */

A list of all inodes that have some pending "writeback" data (async writing in progress). We can't purge inodes that still have background write activity going on.

	struct list_head i_sb_list;

A list of all inodes in the file system (this superblock, SB), unsorted. You can't use an unsorted list for lookup or for LRU, but it's useful when you have to iterate over all inodes in no specific order, or to know when there are no more inodes active in memory for a SB. If i_sb_list is not empty, umount(2) returns -EBUSY; otherwise we can proceed to unmount the f/s.
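To make the list_head idea concrete, here is a short sketch in the style of kernel-module code (the struct and function names are hypothetical; only the <linux/list.h> calls are the real API):

	#include <linux/kernel.h>
	#include <linux/list.h>

	/* Hypothetical struct: becomes "listable" just by embedding a list_head. */
	struct myfs_inode {
		unsigned long ino;
		struct list_head lru;		/* linkage for an LRU-style list */
	};

	/* The list head itself; in the kernel this would live in a superblock etc. */
	static LIST_HEAD(myfs_lru_list);

	static void myfs_track_inode(struct myfs_inode *mi)
	{
		/* newly used inode goes to the front (most recently used) */
		list_add(&mi->lru, &myfs_lru_list);
	}

	static void myfs_forget_inode(struct myfs_inode *mi)
	{
		list_del(&mi->lru);		/* unlink from whatever list it's on */
	}

	static void myfs_dump_lru(void)
	{
		struct myfs_inode *mi;

		/* iterate over the embedding structs, not the raw list_heads */
		list_for_each_entry(mi, &myfs_lru_list, lru)
			pr_info("cached inode %lu\n", mi->ino);
	}

	static bool myfs_can_unmount(void)
	{
		/* analogous to the i_sb_list check described above */
		return list_empty(&myfs_lru_list);
	}

Note the same embedded-struct trick as with inode containers: list_for_each_entry() uses the offset of the "lru" member to get from each list_head back to the enclosing myfs_inode.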