* VFS reminder

Objects: inode (I), dentry (D), file (F)
- saw locks in objects: some locks control certain fields
- various fields
- pointers to other structs
- "void *" for extensibility
- ops vectors
- atomic_t refcnt fields
- fields clustered together (for CPU cache-line efficiency)

* Assigned reading

Documentation/filesystems/api-summary.rst
Documentation/filesystems/directory-locking.rst
Documentation/filesystems/files.txt
Documentation/filesystems/locking.rst ***
Documentation/filesystems/locks.txt ***
Documentation/filesystems/mandatory-locking.txt
Documentation/filesystems/mount_api.txt
Documentation/filesystems/path-lookup.rst
Documentation/filesystems/path-lookup.txt ***
Documentation/filesystems/porting.rst
Documentation/filesystems/vfs.rst ***
Documentation/filesystems/wrapfs.txt

* ops vectors

From include/linux/fs.h (main VFS header file, long).

struct inode_operations: a list of per-fs ops that can be applied to the
currently allocated inode object. inode->i_op points to the ops vector
that applies to that inode.

In C, the syntax

    int (*foo)(int a, float b, char c);

means: a field named 'foo' that can be assigned the address of a function
(a ptr to a function). The fxn assigned must be one that takes (int,
float, char) and returns "int".

There is a convention in C/linux that when you refer to a "method" (fxn
ptr), you use the prefix "->". For example, ->unlink means the unlink
method of some object.

The inode ops vector is assigned to a struct inode when it is allocated
for the first time (TBD). An actual file system (e.g., ext4, nfs, vfat)
assigns the actual vector to an inode, b/c inodes are allocated/destroyed
inside actual f/s code.

So if an inode happens to belong to ext4, then the inode ops for ->unlink
will be pointing to ext4's own unlink function, often called ext4_unlink.
For nfs, it'll be nfs_unlink, etc.

Now, looking at the inode ops themselves.

int (*unlink) (struct inode *,struct dentry *);

->unlink looks fairly similar to the syscall unlink(2).
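To make the fxn-ptr and ops-vector mechanics above concrete, here is a minimal user-space sketch. It is not kernel code: the structs are cut-down stand-ins for the real ones, and "toyfs", toyfs_init_inode, call_unlink and the nr_unlinks counter are all hypothetical names made up for illustration. It only shows a f/s installing its vector when it sets up an inode, and a caller dispatching through ->unlink.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct inode;                        /* forward declaration */
struct dentry { const char *name; }; /* toy stand-in: just a name */

/* An ops vector: a struct whose fields are pointers to functions. */
struct inode_operations {
    int (*unlink)(struct inode *dir, struct dentry *victim);
};

/* Toy inode: only the ops pointer plus a counter used by this demo. */
struct inode {
    const struct inode_operations *i_op; /* set by the owning f/s */
    int nr_unlinks;                      /* hypothetical demo field */
};

/* Hypothetical per-fs method, named in the spirit of ext4_unlink. */
static int toyfs_unlink(struct inode *dir, struct dentry *victim)
{
    (void)victim;       /* a real f/s would remove the name here */
    dir->nr_unlinks++;
    return 0;           /* 0 means success */
}

/* The one vector shared by every inode that belongs to "toyfs". */
static const struct inode_operations toyfs_iops = {
    .unlink = toyfs_unlink,
};

/* The f/s assigns its vector when it sets up a new inode. */
void toyfs_init_inode(struct inode *inode)
{
    inode->i_op = &toyfs_iops;
    inode->nr_unlinks = 0;
}

/* Caller side: dispatch through ->unlink without knowing which f/s
 * owns the inode. Returning -EPERM for a missing method is a choice
 * made for this sketch. */
int call_unlink(struct inode *dir, struct dentry *victim)
{
    assert(dir != NULL && victim != NULL);
    if (dir->i_op == NULL || dir->i_op->unlink == NULL)
        return -EPERM;
    return dir->i_op->unlink(dir, victim);
}
```

The caller never names toyfs_unlink directly; swapping in a different f/s only means pointing i_op at a different vector.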
The syscall unlink(2) has the prototype:

    int unlink(const char *path);

->unlink takes two args:
1. inode ptr: parent inode of the dentry
2. dentry ptr: the name of the entity (file) to delete/unlink

Suppose you run "rm /home/jdoe/src/hw1/foo.c", which translates to the
syscall unlink("/home/jdoe/src/hw1/foo.c"). In this example:
1. inode ptr: the parent, i.e., "hw1"'s inode, which holds the actual
   "content" of that directory, namely the records of file names, inode
   numbers, etc.
2. dentry ptr: the dentry that points to "foo.c".

Lesson 1: file system ops take place one path component at a time, not on
a whole /full/path/name.

Lesson 2: to ensure consistency of ops that modify the content of a
directory (or any f/s object), we often need to lock that object.
Therefore, the parent inode for ->unlink needs to be locked. To make life
easier for f/s developers, the VFS takes care of un/locking parent inodes
before calling the actual f/s ->unlink method.

Lesson 3: the VFS also avoids calling ->unlink unless it's actually
needed. So the VFS ensures that the dentry (and dentry->inode) actually
exists (a "positive" dentry).

int (*rmdir) (struct inode *,struct dentry *);

Same as ->unlink, but called when someone issues the rmdir(2) syscall.
Again, the inode is the locked parent dir of the dir whose name (in
dentry) you want to delete.

int (*mkdir) (struct inode *,struct dentry *,umode_t);
int (*create) (struct inode *,struct dentry *, umode_t, bool);

Same as mkdir(2) and creat(2). Both take an extra arg, umode_t, which is
the mode (permissions) to create this object with. For ->create, the 4th
arg is "bool want_excl" (do you want O_EXCL file creation or not).

int (*link) (struct dentry *,struct inode *,struct dentry *);

Hard links, via link(2). Has (D1,I,D2): D1 is the "old dentry" (the
existing one), I is the locked parent dir of the new entry (D2) that you
want to create. D1 must exist (positive dentry), else VFS won't call
->link. Similarly, D2 must NOT exist (negative dentry), else VFS won't
call ->link.
Upon successful return from ->link, D2 will become a POSITIVE dentry.

int (*symlink) (struct inode *,struct dentry *,const char *);

To create symbolic/soft links via symlink(2): the last arg (char *) is
what the symlink will point to. Note that the "char *" is the content of
the symlink object, or what it points to, just like a regular file has
content that you can read(2).

int (*readlink) (struct dentry *, char __user *,int);

->readlink operates much like read(2), to read the contents of a "file"
(actually a symlink object). There's no parent inode passed, b/c reading
a file/symlink's content doesn't change the parent dir's content. So
"dentry" is a positive dentry of type symlink. "char __user *" is a
user-passed buffer (in the user's addr space, just like read(2)). The
last arg is the buffer's length.

int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);

mknod(2) syscall, for creating special char/block devices. Same args as
->create, but with an extra "dev_t" arg (to describe the device
major/minor properties).

int (*rename) (struct inode *, struct dentry *,
               struct inode *, struct dentry *, unsigned int);

rename(2) syscall: I1, D1, I2, D2, flags.

(I1, D1) is the old name (positive D1, locked I1). I2 is the locked dir
where the new name should be created, and D2 is (most often) a negative
dentry name in I2. Upon success, D1->inode is moved to D2->inode, and D1
becomes negative.

Note that the VFS will grab the locks on I1 and I2 before ever calling
->rename in your f/s code.

POSIX allows one to rename on top of an existing file (or empty
directory). So D2 *could* be positive, and upon success, D2's original
D2->inode is released. If you want to test this, use rename(2) directly,
not /bin/mv.

Note: most methods return an int
- 0 means success
- >0 for some syscalls such as readlink(2) (no. of bytes read)
- <0: -errno code

* lookups

struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);

->lookup is the method that will "find" a new dentry inside a directory.
It takes:
1. inode I: the parent dir inode of the name you're trying to look up.
   This inode is also locked. It's locked not b/c ->lookup itself may
   modify the dir, but b/c someone else could modify it concurrently.
2. dentry D: the name of the object you're looking for. VFS will call
   ->lookup with a negative dentry. Upon success, the dentry turns
   positive (D->inode is filled). Note that before lookup, you don't know
   what type of object is named by "D"; after lookup, you can check inode
   flags/modes to tell if what you found is another file, directory,
   symlink, etc.
3. flags: not used much, just for some optimizations (TBD)

->lookup will return an encoded PTR that you have to check with IS_ERR
and PTR_ERR. Note that if ->lookup could not find the object, it does not
have to return an encoded -ENOENT; it can instead return a negative
dentry. When ->lookup returns to the VFS, if the returned dentry is
negative, the VFS will store the negative dentry in the dcache, and then
return -ENOENT to the syscall that caused this ->lookup.

Lookups happen ONE component at a time.

Back to unlink("/home/jdoe/src/hw1/foo.c").

->unlink(I for "hw1", D for "foo.c")
    -- but how did we find 'I' in the first place? A: lookup.

Actual procedure will look like this:

->lookup(I for "/", D for "home")
    -- I for "/" is found from the superblock, via SB->s_root: the first
       lookup starts from the "root" of the file system.
->lookup(I for "home", D for "jdoe")
->lookup(I for "jdoe", D for "src")
->lookup(I for "src", D for "hw1")
    1. If lookup didn't find "hw1", return error, don't even continue.
    2. If lookup succeeded, then we have a positive dentry for "hw1", and
       we can pass D->I as 1st arg to ->unlink below.
->unlink(I for "hw1", D for "foo.c")

Q: how do we get the "root" of the file system itself, where lookups can
begin?

A: struct super_block (abbreviated "SB") has fields like any other VFS
struct. SB holds info about a whole mounted f/s -- and is the file system
equivalent of an inode.
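The component-at-a-time walk above, starting from the root that a superblock hands you, can be sketched in user space roughly as follows. This is only an illustration under heavy assumptions: toy_lookup, toy_path_walk, the "disk" table and root_inode are hypothetical names, root_inode stands in for the inode behind SB->s_root, and there is no locking, no dcache, no symlinks and no permission checking. The toy lookup signals "not found" by leaving the dentry negative, which the walk then turns into -ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy inode: an inode number and a directory flag. */
struct inode { int ino; int is_dir; };

/* Toy dentry: a name plus an inode ptr; NULL inode == negative dentry. */
struct dentry { char name[32]; struct inode *d_inode; };

/* Toy directory records: (parent ino, name) -> child inode. */
struct dirent_rec { int parent_ino; const char *name; struct inode child; };

static struct inode root_inode = { 1, 1 }; /* stand-in for SB->s_root */
static struct dirent_rec disk[] = {
    { 1, "home",  { 2, 1 } },
    { 2, "jdoe",  { 3, 1 } },
    { 3, "src",   { 4, 1 } },
    { 4, "hw1",   { 5, 1 } },
    { 5, "foo.c", { 6, 0 } },
};

/* Toy ->lookup: search ONE directory for ONE name. On a hit the dentry
 * turns positive; on a miss it is left negative. */
void toy_lookup(struct inode *dir, struct dentry *d)
{
    size_t i;

    assert(dir->is_dir);    /* caller must pass a directory */
    d->d_inode = NULL;
    for (i = 0; i < sizeof(disk) / sizeof(disk[0]); i++) {
        if (disk[i].parent_ino == dir->ino &&
            strcmp(disk[i].name, d->name) == 0) {
            d->d_inode = &disk[i].child;
            return;
        }
    }
}

/* Walk an absolute path one component at a time. On success, stores the
 * final inode in *out and returns 0; otherwise returns a -errno code. */
int toy_path_walk(const char *path, struct inode **out)
{
    struct inode *cur = &root_inode; /* every walk starts at the root */
    const char *p = path;

    while (*p) {
        struct dentry d;
        size_t len;

        while (*p == '/')            /* skip separators */
            p++;
        if (!*p)
            break;
        len = strcspn(p, "/");       /* length of this component */
        if (len >= sizeof(d.name))
            return -ENAMETOOLONG;
        if (!cur->is_dir)
            return -ENOTDIR;         /* can't look inside a non-dir */
        memcpy(d.name, p, len);
        d.name[len] = '\0';
        toy_lookup(cur, &d);
        if (d.d_inode == NULL)
            return -ENOENT;          /* negative dentry: stop here */
        cur = d.d_inode;             /* becomes the next parent */
        p += len;
    }
    *out = cur;
    return 0;
}
```

For the unlink example, walking "/home/jdoe/src/hw1" yields the inode that would be passed as the first arg to ->unlink, with "foo.c" as the dentry name.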
SB->s_root holds a dentry that is the "root" of that file system. Each
time a new f/s is mounted, the f/s code creates this s_root dentry and
stuffs it inside SB->s_root (with a dentry refcount of at least 1).
s_root won't be released until the entire SB is released, when the file
system is unmounted using umount(2).

What you see as "/" is the root dentry of *A* file system. Every mounted
f/s has a root dentry, from which lookups inside that f/s start.

The root dentry is created during mount(2): this means that you have to
have one f/s in order to mount another. So how do we get the very first
root file system and its own "global" "/" root dentry?

A: that happens at boot time, with special code for some file systems
that can be used as "bootstrap" file systems. The code often kmalloc's
the objects needed (SB and root dentry) "by hand", so that subsequent
lookups can take place.

* permission checks

int (*permission) (struct inode *, int);

->permission is an inode method. It takes:
- I: the inode to check permission for
- int flags: what kind of permissions you're looking for (read, write,
  execute, or a combination)

->permission compares the "struct task_struct *current" uid/gid/perms
against the just-looked-up inode->{i_mode,i_uid,i_gid} (and i_acl, if
used). It mostly follows the POSIX permission model.

Note that the VFS will perform basic/common permission checks, but then
it hands the permission checks to a per-fs ->permission method. If your
f/s doesn't implement ->permission, that means that you'd rather let the
VFS perform the common "POSIX" checks; otherwise, the VFS will call your
->permission method, and you can implement ANY policy you want.

The lookup procedure has to perform ->permission after each successful
->lookup. Only if the permission check passes do you continue the lookup;
otherwise return an EPERM/EACCES error.
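As a rough illustration of the "common POSIX checks" mentioned above, here is a minimal user-space sketch of an owner/group/other mode-bit comparison. It is an assumption-laden toy, not the kernel's implementation: toy_permission, struct toy_cred, struct toy_inode and the TOY_MAY_* constants are hypothetical, and it ignores root overrides, supplementary groups, ACLs and capabilities.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical mask bits, laid out like the r/w/x bits of a mode. */
#define TOY_MAY_EXEC  1
#define TOY_MAY_WRITE 2
#define TOY_MAY_READ  4

/* Toy stand-in for the caller's identity (what "current" would carry). */
struct toy_cred { unsigned uid, gid; };

/* Toy stand-in for the inode fields the check needs. */
struct toy_inode { unsigned i_mode, i_uid, i_gid; };

/* Returns 0 if every permission bit in 'mask' is granted to 'cred',
 * else -EACCES. Picks exactly one of the owner/group/other triplets. */
int toy_permission(const struct toy_inode *inode,
                   const struct toy_cred *cred, int mask)
{
    unsigned bits, want;

    assert(inode != NULL && cred != NULL);
    bits = inode->i_mode & 0777;
    want = (unsigned)mask & 7;

    if (cred->uid == inode->i_uid)
        bits >>= 6;     /* owner triplet */
    else if (cred->gid == inode->i_gid)
        bits >>= 3;     /* group triplet */
    /* else: the "other" triplet is already in the low 3 bits */

    return ((bits & 7 & want) == want) ? 0 : -EACCES;
}
```

During a path walk, a check like this would run after each successful ->lookup, asking for execute (search) permission on the directory just found before descending into it.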