* VFS lookup procedure (aka "namei")

syscall is unlink("/home/jdoe/src/hw1/foo.c")

->lookup(I for "/", D for "home") -- the inode for "/" is found from the
  superblock, via SB->s_root.
  if any lookup fails to find the object, return ENOENT and stop.
  if lookup found the object, then check permissions:
->permission(I for "home", permission to search (execute) the directory)
  if permission is not granted, abort and return EACCES (or EPERM);
  else, continue to the next lookup...
->lookup(I for "home", D for "jdoe")
->lookup(I for "jdoe", D for "src")
->lookup(I for "src", D for "hw1")
->unlink(I for "hw1", D for "foo.c")
  finally, get to the actual op corresponding to the syscall, execute the
  op inside fs-specific code, and return success/failure.

A pathname lookup procedure (also historically called "namei" or
"lookup_pn") executes the following:

1. Check whether the pathname starts with "/", to find out where lookups
   should begin.

1a. If it starts with "/", find the root dentry to start lookups from
    (from the SB). This is called an "absolute pathname".

1b. If the path is a "relative path" (doesn't start with "/"), for example
    unlink("hw1/foo.c"), then check struct task *current for the "cwd"
    (current working directory) field, which is just another dentry with
    refcount++. current's cwd field is changed using cd(1) or chdir(2).
    The cwd dentry stays pinned (RC>0) for as long as a task is chdir'd
    into it. POSIX allows rmdir to fail with EBUSY when another
    window/process is chdir'd into the dir; Linux instead lets the rmdir
    of an empty dir succeed and only frees the dentry once its RC drops
    to 0.

    Q: what if the path starts with ~ (relative to home dir)?
    Q: what if the path starts with $VAR (expand $VAR to whatever content
       it has)?
    A: the shell (sh, bash, zsh, etc.) is the one that replaces '~' or
       $VARs with their content as set by the user's profile. IOW, the
       kernel never sees these!

2. Break the pathname into components using a delimiter ("/" in UNIX, "\"
   in Windows), for example using the kernel equivalent of strtok(3).

3. Issue a series of lookup + permission pairs (as seen above), in a loop,
   for each pathname component, till we get to the last component.
   If any error happens, stop and return it; if a dir/file object can't be
   found, return -ENOENT.

   - OPTIONAL: note that ->permission is an optional f/s method in Linux:
     file systems don't have to implement it. If the f/s ->permission
     method is not implemented (i.e., fxn ptr is NULL), then the VFS
     offers a generic_permission method that performs basic POSIX checks.
     Most file systems can use the generic method. More complex systems
     with special security needs (e.g., encryption) or
     distributed/networked file systems implement their own ->permission
     b/c they need to check permission on a network of hosts/servers.

   - CACHING: every dentry/inode (positive or negative) is cached in the
     dcache. So the VFS lookup FIRST checks to see if an entry is in the
     dcache: if so, return it (w/o calling the f/s). If there is NO cached
     entry, then you call the f/s ->lookup method: if ->lookup returned a
     neg/pos dentry, cache it. On an error or a neg dentry, stop (ENOENT
     for a neg dentry); else continue with the VFS lookup procedure.

4a. The lookup routine KNOWS what is expected to be a directory vs. the
    last component (which may be a file, dir, or anything else). IOW,
    anything in a pathname up to the last "/" must be a directory. So for
    the unlink example above, if you find "src" inside "jdoe", and you
    have permission to search "src", then the VFS lookup procedure also
    needs to verify that "src" has an inode of type DIR (must be a
    directory). If the inode is, say, of type FILE (regular file), then
    return error -ENOTDIR.

4b. The inode we found could be something else: a char or blk device
    (again, this'd be an error, return -ENOTDIR).

4c. The inode found (e.g., for "src") could be of type SYMLINK, in the
    middle of a pathname resolution. At this point the VFS lookup has to
    perform a sort of "subroutine" call to process the symlink:

    * Issue a "readlink" call (either the ->readlink method, or the
      vfs_readlink helper) to get the "contents" of the symlink.
    * Recall that when the symlink was created, the user could make the
      symlink's contents point to ANYTHING (even invalid paths). But now,
      in VFS lookup, we need to process this path.
    * Assuming readlink succeeded, the VFS will take the symlink's
      contents and interpret them as another pathname that "stands" for
      the component being resolved, i.e., "src" in this example.
    * A symlink can point to any path, relative or absolute; its contents
      can have multiple pathname components as well.
    * So we have to process the symlink just like any other path:
      lookup+permission pairs, error checking, etc.
    * When "done" with the symlink, the last component of what the symlink
      points to becomes the parent dir/inode in which to look up the next
      component of the original pathname (e.g., the next component to
      look up is "hw1").

    Q: What happens if one symlink points to another, or even one
       component of a symlink points to another? The risk is of creating
       loops inside the kernel's VFS code.
    A: There are lots of wonderful algorithms for detecting cycles in a
       graph. But none are used: too complex, too slow.
    A: Actual procedure. Each time the VFS starts processing a new path on
       behalf of a system call, it starts a counter from 0 that records
       how many times we've seen and 'crossed' a symlink while processing
       THIS path. Each time you get a symlink, increment the counter by 1.
       If at any point in time the counter exceeds N, return -ELOOP (a
       specific errno to say "the chain of symlinks in this pathname is
       too long"). This doesn't prevent or detect cycles, but it does
       prevent getting into an infinite loop while looking up paths. N is
       commonly 20 (Linux hard-codes MAXSYMLINKS as 40); some OSs let you
       set the value dynamically. Note that you can create a chain of
       symlinks (without a cycle) whose length is > N, but you cannot
       look up such long chains.

5. When lookup finds a dentry, you also have to find out if this dentry
   happens to be a mount point for another file system. Each f/s has its
   own SB object, and each SB has a "root dentry" where lookups start
   from.
   Say you have two file systems: ext4 and nfs. At boot time, you mount
   the main root file system like

       # mount -t ext4 /dev/sda1 /

   Now you'll be able to list entries and look up under /. Let's say your
   user homes are in a network file system (NFS), which would be mounted
   as

       # mount -t nfs server:/some/remote/path /users/home

   Meaning: go to hostname 'server', ask it for access to a path on the
   server called '/some/remote/path', and if we get back proper
   credentials, create a struct SB with its own root dentry for looking up
   inside THIS file system. And then "attach" this new SB root dentry to
   "/users/home" on the local system.

   Note that /users/home is a directory inside the ext4 file system we
   already mounted. Before you mount NFS, if you "ls /users/home", you may
   just get an empty dir (or whatever happens to be inside). Mount points
   (usually empty dirs) have to exist for mount(2) to succeed.

   After a successful mount(2), the mnt pt becomes "hidden" from users
   outside the kernel. You can no longer access what was in ext4 inside
   /users/home, b/c the VFS will "transparently" switch from looking up
   inside ext4 to looking up inside nfs (the f/s mounted "on top of"
   /users/home).

   The way this transition happens is using this flag in struct
   dentry->d_flags:

       #define DCACHE_MOUNTED 0x00010000 /* is a mountpoint */

   As part of the loop of lookup+permission, after a successful lookup, we
   get a positive dentry. Once we identify that dentry as a directory, we
   ALSO check if dentry->d_flags has DCACHE_MOUNTED on. That indicates
   that this dentry/inode is a mount point, and you should NOT continue to
   look inside the directory that was mounted over, but transition to the
   mounted file system.

   Now the VFS has to "stop" and perform another subroutine check, to find
   what is mounted there. In Linux, there's a global hash table that maps
   all dentry mount points to the corresponding superblocks of the mounted
   f/s. So the VFS now looks up this HT with the key being the ptr addr of
   the dentry that has DCACHE_MOUNTED on.
   It gets back an SB object. This is the SB for the mounted f/s (e.g.,
   NFS). Then the VFS gets the SB->s_root dentry for that mounted f/s, and
   resumes lookups there.

   1. We did a ->lookup in ext4 and found a dentry that has
      DCACHE_MOUNTED. Here we called ext4_lookup.
   2. We locate the SB of the mounted f/s for this dentry.
   3. Re-issue the lookup, but now to the NFS f/s's own ->lookup (starting
      at NFS's root dentry). Now we call nfs_lookup.

   Think of ext4's mnt pt dentry as the "lower" dentry, and of the NFS
   root dentry that was mounted there as the "upper" dentry that shadows
   the lower one.

   While the VFS processes any pathname, it could cross multiple mount
   points. In fact, you can mount a file system on top of another mounted
   f/s in the same directory.

Overall, the VFS lookup (aka "namei") is long, complex, and highly
optimized.

Recall struct task current's CWD? When looking up "/", you usually
consider the root dentry of the SB you're in. But there's a syscall,
chroot(2), that allows you to restrict a task's view of "/" to another
directory, e.g., /var/apache/www. This restricts a process so that
whenever it looks up "/", it'll actually look it up relative to the
chrooted dir (/var/apache/www). This is useful for internet servers in
case they get hacked, so no one could possibly copy /etc/passwd from a
hacked-into apache server. And even if the broken server tries to
chdir("../../../"), it'll not "escape" its chroot "jail".
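
To tie steps 1-5 together, here is a minimal user-space sketch in C of the lookup loop. It is NOT kernel code: every toy_* name is hypothetical, one struct plays both dentry and inode, a direct mounted_root pointer stands in for the global mount hash table, and MAX_SYMLINKS is assumed to be the N=20 from the notes. It skips ->permission checks, the dcache, "." and "..", and it always follows a symlink in the last component (a real unlink would remove the symlink itself).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMLINKS     20   /* "N" from the notes; assumed constant here */
#define TOY_MOUNTED      0x1  /* stands in for DCACHE_MOUNTED */
#define TOY_MAX_CHILDREN 8

enum toy_type { TOY_DIR, TOY_FILE, TOY_SYMLINK };

/* Hypothetical node: one struct plays both dentry and inode. */
struct toy_node {
    const char *name;
    enum toy_type type;
    unsigned flags;                /* TOY_MOUNTED if an f/s is mounted here */
    struct toy_node *mounted_root; /* root of the mounted f/s (replaces the HT) */
    const char *target;            /* symlink contents */
    struct toy_node *children[TOY_MAX_CHILDREN]; /* NULL-terminated, dirs only */
};

/* Toy ->lookup: find the entry called name[0..len) inside dir, or NULL. */
struct toy_node *toy_lookup(struct toy_node *dir, const char *name, size_t len)
{
    for (size_t i = 0; i < TOY_MAX_CHILDREN && dir->children[i]; i++) {
        const char *n = dir->children[i]->name;
        if (strlen(n) == len && memcmp(n, name, len) == 0)
            return dir->children[i];
    }
    return NULL;
}

/* Resolve path; returns 0 and sets *out, or a negative errno.
 * *nlinks counts symlinks crossed so far for THIS path (start it at 0). */
int toy_walk(struct toy_node *root, struct toy_node *cwd, const char *path,
             int *nlinks, struct toy_node **out)
{
    assert(root && cwd && path && nlinks && out);

    /* Step 1: absolute paths start at the root, relative ones at the cwd. */
    struct toy_node *cur = (*path == '/') ? root : cwd;

    while (*path) {
        while (*path == '/')                 /* step 2: split on "/" */
            path++;
        if (!*path)
            break;
        size_t len = strcspn(path, "/");

        if (cur->type != TOY_DIR)            /* steps 4a/4b */
            return -ENOTDIR;
        struct toy_node *next = toy_lookup(cur, path, len);  /* step 3 */
        if (!next)
            return -ENOENT;
        path += len;

        if (next->type == TOY_SYMLINK) {     /* step 4c: "subroutine" walk */
            if (++*nlinks > MAX_SYMLINKS)
                return -ELOOP;
            int err = toy_walk(root, cur, next->target, nlinks, &next);
            if (err)
                return err;
        }
        while (next->flags & TOY_MOUNTED)    /* step 5: cross (stacked) mounts */
            next = next->mounted_root;
        cur = next;
    }
    *out = cur;
    return 0;
}
```

Typical use: build a small tree of toy_node structs, set *nlinks to 0, and call toy_walk(&root, &cwd, "/home/jdoe/foo.c", &nlinks, &out). Passing a different root for a given task mimics what chroot(2) changes, although this sketch does not handle "..", so it says nothing about escaping a jail.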