* VFS Inode ops cont.

    int (*link) (struct dentry *, struct inode *, struct dentry *);

Hardlinking a file: the 1st arg (dentry) is the "source" name that you want
to link to (it must exist). The 2nd arg (inode) is the destination directory
in which you want to create a second name (alias) for the source name. The
3rd arg (dentry) is a negative dentry of the new name you want to create.

If ->link is successful, then the 3rd arg (the new-name dentry) becomes
positive, pointing to the first dentry's inode. That inode's refcnt (and its
hard-link count) will increase by 1.

    int (*rename) (struct user_namespace *,
                   struct inode *,   // parent dir inode of "src"
                   struct dentry *,  // the name of the "src" object to rename
                   struct inode *,   // parent dir inode of "dst"
                   struct dentry *,  // the new ("dst") name of the object
                   unsigned int);    // special flags like RENAME_EXCHANGE

rename(2) in userland is used as rename(src,dst). Renaming is hard to
accomplish b/c you have to (1) remove the old name from some dir, and (2) add
the new name to another (possibly the same) dir. The kernel has to accomplish
the add+remove of names ATOMICALLY. In the linux kernel there are complicated
locking mechanisms to support renaming (TBD).

->rename takes flags (see fs.h). One flag is RENAME_EXCHANGE, which allows a
file system that natively supports it to SWAP two names (atomically). These
flags are not used with the regular rename(2) syscall but with a newer
syscall, renameat2(2).

How to swap two names (a.txt and b.txt) in userland. The naive way fails:

    $ mv b.txt a.txt   # succeeds, but the old a.txt is replaced (content lost)
    $ mv a.txt b.txt   # only renames the former b.txt back: nothing was swapped

Must use a 3rd temp name (like swapping two variables):

    $ mv a.txt tmp.txt
    $ mv b.txt a.txt
    $ mv tmp.txt b.txt

Thus, a RENAME_EXCHANGE has to perform the equivalent of THREE ops
atomically.

* Lookup & permission

Most inode ops return an int that translates directly to an errno code
(ENOENT, EPERM, etc.). One big exception is ->lookup(), which returns a
dentry.
    struct dentry * (*lookup) (struct inode *, struct dentry *, unsigned int);

Lookup takes an inode for a directory that exists (1st arg), and a name for
something to look up (in the 2nd arg, dentry). The dentry passed as the 2nd
arg is going to be a negative dentry. Then lookup goes into the f/s code, and
searches for the object in question. Only the actual f/s can know HOW to
search for named objects in its own formatted data (e.g., ext4 on hard disks,
NFS on network servers, etc.).

If successful, ->lookup returns a POSITIVE dentry (and it changes the 2nd arg
passed from negative to positive).

If ->lookup fails to find the name, the result is a NEGATIVE dentry. That
dentry will be cached by the VFS in the dcache. Subsequent attempts to look
up that name will be caught by the VFS (b/c a cached entry exists), and the
VFS will return ENOENT to the caller (syscall) without calling into the f/s.

    int (*permission) (struct user_namespace *, struct inode *, int);

The ->permission method takes an inode for an existing object (dir, file,
etc.), and a 3rd arg of "flags" for what permission you're looking for. Flags
can be READ, WRITE, RDWR (READ|WRITE), EXECUTE, etc. (MAY_READ, MAY_WRITE,
MAY_EXEC in fs.h). Flags are bitmapped, so they can be bitwise OR'd together.

The ->permission method will be implemented by the f/s to check if the
currently running task (process) has permission to access the inode in
question with the permissions asked for in the flags.

Recall that in the kernel there are many "struct task_struct" objects, one
per running process. Each task also knows who started it (the user or uid),
and thus what each task is permitted to access. At any point in time, you can
check a global called "struct task_struct *current" that refers to the task
currently running on the CPU/core in question.

->permission checks the currently running process/task against the
permissions and ownership of the file/inode in question.
If they match (user, group, other -- like chmod), then permission is granted
and ->permission will return 0 (success); else ->permission will return an
error (e.g., EPERM, EACCES).

* Pathname resolution in the kernel

Consider a user issuing the command

    $ rm /home/jdoe/src/project/foo.c

That would translate into the unlink(2) syscall as

    unlink("/home/jdoe/src/project/foo.c");

In the kernel, we will invoke the syscall entry point,
sys_unlink("/home/jdoe/src/project/foo.c"), and now the VFS takes over to
begin issuing a sequence of methods to perform this unlink.

    ->unlink(inode, dentry)
        inode:  the inode for the parent directory named "project". This
                means that there has to be a positive dentry with name
                "project".
        dentry: "foo.c"

But how did the dentry+inode with name "project" come to exist in memory?
A: we looked it up!

Q: where did we look up the "project" dentry?
A: In its parent ("src").

Q: and where did we look that one up?
A: again, in its parent, etc.

Q: Where do I start the lookup?
A: I look up "home" in the dentry for "/"; the latter is called the "root
   dentry" or "root inode".

Q: Can I look up "/"?
A: no, can't look it up. The root dentry of every file system is created at
   the time you mount the file system. That root dentry is allocated and
   filled in (not via lookup, but "manually") by the file system's mount code
   (the file system type's ->mount method). The root dentry for every f/s is
   stored inside "struct super_block" (as sb->s_root). There is a single
   struct super_block for every mounted f/s.

Pathname resolution procedure:

    ->lookup(inode for "project", dentry called "foo.c")
        if lookup fails, return error (e.g., ENOENT) right here.
        if lookup succeeded, continue
    ->permission(inode for "project", permission to modify directory (write to it))
        if no perm, return error, else cont.
    ->permission(inode for "foo.c", permission to remove it)
        if no perm, return error, else cont.
    else, we finally go on to call ->unlink:
    ->unlink(inode for "project", dentry called "foo.c")
        if succeeded, returns 0 (which returns from the syscall)
        else, return error

Simplified:

1. Take a pathname like "/home/jdoe/src/project/foo.c"

2. Break it up on a delimiter ('/' in unix, '\' in windows)

3. Start to look up each component in its parent directory, e.g., look up
   "home" in /, etc.

   3a. If the pathname starts with a "/" (an "absolute" pathname), then begin
       looking up in the "root dentry" that's stored in the SB.

   3b. If the pathname does NOT start with a "/", we call it a "relative"
       pathname. Then, start the lookup from the "current working directory",
       which is stored in the task: think of it as current->cwd, the dentry
       of the current working dir (CWD). In Linux the field is actually
       current->fs->pwd.

       Note: chdir(2) or cd(1) changes the dentry of the CWD. And the refcnt
       of the new cwd dentry has to be +1.

   In sum, we perform pairs of lookup+permission, returning errors as soon as
   they're discovered, else continue, until we reach the final method.

4. For the actual lookup, first check the dcache: if a cached entry is found,
   return it or use it in the next lookup/permission pair. Call the f/s
   ->lookup only if the entry isn't found in the dcache. Once the f/s returns
   a dentry from ->lookup, cache it in the dcache for next time.

5. VFS knows what to expect. Every component of a pathname except the last
   must be a directory. The last "leaf" component (foo.c) can be any type of
   object. So when ->lookup returns successfully on an intermediate component
   (e.g., "src"), the VFS has to check what type it is:

   - if it is of type DIR, good, continue
   - if it is of type FILE, return error (ENOTDIR)
     - same error if the type happens to be a block/char device
   - if the type of object found is a SYMLINK, we now have to resolve the
     symlink. We invoke a "recursive" procedure, to
     (a) issue ->readlink to retrieve the "content" of the symlink
     (b) treat the returned content string as if it replaced the symlink
         component (here, "src") in the pathname.
     NOW we actually begin to resolve the symlink as a pathname: the usual
     stuff as above (parse the pathname on a '/', do another lookup, then
     permission, cache the lookup, etc.)

     If, while resolving this symlink's content, we find another symlink,
     then we invoke another "recursive" procedure to ->readlink, then
     lookup+permission.

     Note: during one lookup routine, I may come across symlinks multiple
     times. Each time results in another "recursive" call. If the no. of
     symlinks I've come across during this one pathname lookup exceeds a
     threshold (MAXSYMLINKS, 40 in Linux), then we abort the entire lookup
     and return ELOOP.

6. If, while looking up, we find a new mount point, then we have to traverse
   into that mount point's superblock root directory (sb->s_root).

7. After each lookup, check ->permission.

Overall: this lookup is quite complex, which is why the kernel spends a lot
of time, effort, and code on it. Traditionally called the namei() routine,
sometimes called lookup_pn() or path_walk().

Q: what about '~' in userland?
A: '~' is replaced by the shell (bash, zsh, etc.) with the contents of a
   user-level variable called $HOME. $HOME is usually set to one's home dir,
   such as "/homes/jdoe". In short, the kernel doesn't see a '~' but an
   actual pathname.
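Recap sketch of the pathname resolution steps above, as a toy USER-SPACE
model in C. This is not kernel code: every toy_* name, the "searchable" flag
(a stand-in for ->permission), and the threshold of 8 symlinks are
assumptions made up for illustration. It covers splitting on '/', absolute
vs. relative start, ENOENT, ENOTDIR, and ELOOP. It does not model the dcache,
mount points, "." or "..", and (unlike unlink(2)) it follows a symlink in the
last component.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CHILDREN 4
#define TOY_MAX_SYMLINKS 8   /* arbitrary threshold for this sketch */

enum toy_type { TOY_DIR, TOY_FILE, TOY_SYMLINK };

struct toy_node {                 /* hypothetical dentry+inode rolled into one */
    const char *name;
    enum toy_type type;
    int searchable;               /* stand-in for a ->permission check */
    const char *target;           /* symlink content, as ->readlink would return */
    struct toy_node *children[TOY_MAX_CHILDREN];   /* NULL-terminated */
};

/* Stand-in for dcache + f/s ->lookup: find a name of length len in dir. */
static struct toy_node *toy_lookup(struct toy_node *dir,
                                   const char *name, size_t len)
{
    for (size_t i = 0; i < TOY_MAX_CHILDREN && dir->children[i]; i++) {
        const char *n = dir->children[i]->name;
        if (strlen(n) == len && memcmp(n, name, len) == 0)
            return dir->children[i];
    }
    return NULL;                  /* "negative" result */
}

/* Walk path from root (absolute) or cwd (relative).
 * Returns 0 and sets *out on success, else a negative errno. */
static int toy_walk(struct toy_node *root, struct toy_node *cwd,
                    const char *path, int *nlinks, struct toy_node **out)
{
    assert(path != NULL);
    struct toy_node *cur = (path[0] == '/') ? root : cwd;

    while (*path) {
        while (*path == '/')      /* skip delimiters */
            path++;
        if (!*path)
            break;
        size_t len = strcspn(path, "/");   /* length of next component */

        if (cur->type != TOY_DIR)
            return -ENOTDIR;      /* intermediate component must be a dir */
        if (!cur->searchable)
            return -EACCES;       /* permission paired with each lookup */

        struct toy_node *next = toy_lookup(cur, path, len);
        if (!next)
            return -ENOENT;
        path += len;

        if (next->type == TOY_SYMLINK) {
            if (++*nlinks > TOY_MAX_SYMLINKS)
                return -ELOOP;
            /* "recursive" walk of the symlink content; a relative target
             * is resolved from the directory that holds the symlink */
            int err = toy_walk(root, cur, next->target, nlinks, &next);
            if (err)
                return err;
        }
        cur = next;
    }
    *out = cur;
    return 0;
}

/* Builds /home/jdoe/{foo.c, link -> foo.c, loop -> loop}, walks path from
 * "/", and reports the name of the object found. */
static int toy_demo(const char *path, const char **leaf)
{
    struct toy_node foo  = { "foo.c", TOY_FILE,    0, NULL,    {0} };
    struct toy_node link = { "link",  TOY_SYMLINK, 0, "foo.c", {0} };
    struct toy_node loop = { "loop",  TOY_SYMLINK, 0, "loop",  {0} };
    struct toy_node jdoe = { "jdoe",  TOY_DIR, 1, NULL, { &foo, &link, &loop } };
    struct toy_node home = { "home",  TOY_DIR, 1, NULL, { &jdoe } };
    struct toy_node root = { "/",     TOY_DIR, 1, NULL, { &home } };
    struct toy_node *found = NULL;
    int nlinks = 0;

    int err = toy_walk(&root, &root, path, &nlinks, &found);
    if (!err)
        *leaf = found->name;      /* string literal, safe to return */
    return err;
}
```

Usage: toy_demo("/home/jdoe/link", &leaf) resolves through the symlink and
reports "foo.c"; "/home/jdoe/foo.c/x" fails with -ENOTDIR b/c foo.c is not a
directory; "/home/jdoe/loop" points at itself and ends with -ELOOP once the
symlink counter passes the threshold.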