* Rename

Inode op:

    int (*rename) (struct inode *, struct dentry *,
                   struct inode *, struct dentry *, unsigned int);

->rename takes (I1, D1, I2, D2, flags): I1+D1 are old (from), and I2+D2 are
new (where to rename to). I1 and I2 are the inodes of the parent directories;
D1 and D2 are the dentries for the old and new names.

Recall that in inode ops, the "I" is locked by the VFS before calling the f/s
inode method. For ->rename, that means the VFS has to lock BOTH I1 and I2
before calling into the f/s.

Assume that vfs_rename() does this:

1. lock(I1)
2. lock(I2)
3. call f/s ->rename() method
4. unlock(I2)
5. unlock(I1)

Looks ok, but what if we issue this rename(2) call via mv(1):

    $ cd ~/src
    $ mv a.c b.c

If the VFS operated as above, it would lock I1 first, then try to lock I2; but
b/c the file is being renamed within the same dir, I1 == I2 (same parent dir).
In that case, trying to lock(I2) will DEADLOCK, because the same inode was
already locked in step 1! This is called a "self deadlock": the same
thread/code tries to lock the same resource more than once.

This issue can be resolved by comparing the two inodes, so now vfs_rename has
to check whether they're the same or not:

1. lock(I1)
2. if (I1 != I2) lock(I2)
3. call f/s ->rename() method
4. upon return from f/s, unlock ONLY that which was locked before.
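The same-directory check above can be sketched in user space. This is only an illustration under stated assumptions: struct dir_inode, lock_parents() and unlock_parents() are hypothetical names, and pthread mutexes stand in for the kernel's inode locks.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a parent directory's inode: just a lock. */
struct dir_inode {
    pthread_mutex_t lock;
};

/*
 * Lock the old parent (i1) and the new parent (i2).
 * Returns the number of locks taken: 1 if both arguments are the
 * same directory, 2 otherwise.
 */
static int lock_parents(struct dir_inode *i1, struct dir_inode *i2)
{
    assert(i1 != NULL && i2 != NULL);

    pthread_mutex_lock(&i1->lock);
    if (i1 == i2)
        return 1;   /* same dir: a second lock() would self-deadlock */

    pthread_mutex_lock(&i2->lock);
    return 2;
}

/* Release only what lock_parents() actually locked. */
static void unlock_parents(struct dir_inode *i1, struct dir_inode *i2)
{
    if (i1 != i2)
        pthread_mutex_unlock(&i2->lock);
    pthread_mutex_unlock(&i1->lock);
}
```

A caller would bracket the f/s method with these two helpers: lock_parents(old_dir, new_dir), call ->rename(), then unlock_parents(old_dir, new_dir). Note that this version still locks in argument order, so it only protects against the self deadlock.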
Next, suppose that a user renames a file across two different directories:

    $ mkdir ~/src ~/obj
    $ mv ~/src/a.txt ~/obj/b.txt

But what if the user performs two concurrent renames at the same time:

    # in shell, "&" at end of line says to run asynchronously in the background
    $ mv ~/src/a.txt ~/obj/b.txt &
    $ mv ~/obj/c.txt ~/src/d.txt &

Problem: the above issues two rename(2) syscalls; let's say that they run
exactly in parallel:

    Time  Syscall 1 (first "mv")   Syscall 2 (second "mv")
    ----  ----------------------   -----------------------
    t0    verify I1 != I2          verify I1 != I2
    t1    lock(I1="src")           lock(I1="obj")
    t2    lock(I2="obj")           lock(I2="src")

Problem: we have a deadlock b/t the two syscalls: each one grabbed one
resource that the other needs, and each one is trying to lock a second
resource that's held by the other. This is a classic deadlock b/t two
concurrent threads of execution.

Solution: use a deadlock avoidance technique, often by forcing an ordering of
the resources being locked. So if you need to lock 2, 3, or more resources,
be sure that every thread that wants to lock them will lock them in the same
precise, deterministic order. For inodes, we can order them by inode number,
or by the actual address of the "struct inode *".

So the actual procedure in Linux does something like this (assuming you
verified that I1 != I2):

    // always lock first the inode whose ptr addr is the smaller number
    if (I1 < I2) {
        lock(I1);
        lock(I2);
    } else {
        lock(I2);
        lock(I1);
    }

With this ordering of the resources to lock, the two rename(2) system calls
will not deadlock: one of them will "win the race" and lock both; the other
will block on its first lock() attempt, until the syscall that "won" releases
both.

Up until now, we've seen how to rename FILES inside the same directory or
across different directories. But rename(2) also allows you to rename WHOLE
directory trees, and possibly move them up/down the entire file system
hierarchy.
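The address-ordered locking used for renames across two different directories can be sketched as follows. This is a user-space illustration, not the kernel's code: struct dir_inode, order_by_address(), lock_two() and unlock_two() are hypothetical names, and pthread mutexes stand in for inode locks. The pointers are cast to uintptr_t before comparing, since comparing pointers to unrelated objects with "<" is not well defined in portable C.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a parent directory's inode: just a lock. */
struct dir_inode {
    pthread_mutex_t lock;
};

/* Decide the lock order: the inode at the lower address goes first. */
static void order_by_address(struct dir_inode *a, struct dir_inode *b,
                             struct dir_inode **first,
                             struct dir_inode **second)
{
    if ((uintptr_t)a < (uintptr_t)b) {
        *first = a;
        *second = b;
    } else {
        *first = b;
        *second = a;
    }
}

/* Lock two DIFFERENT directories in a deterministic order. */
static void lock_two(struct dir_inode *i1, struct dir_inode *i2)
{
    struct dir_inode *first, *second;

    assert(i1 != i2);   /* the same-dir case must be handled by the caller */
    order_by_address(i1, i2, &first, &second);
    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
}

/* Release both locks; unlock order does not affect deadlock avoidance. */
static void unlock_two(struct dir_inode *i1, struct dir_inode *i2)
{
    pthread_mutex_unlock(&i1->lock);
    pthread_mutex_unlock(&i2->lock);
}
```

With this helper, lock_two(src, obj) and lock_two(obj, src) acquire the two locks in the same order, so the interleaving shown in the t0..t2 table cannot leave each syscall holding one lock and waiting for the other.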
The problem with moving a whole subtree around is that, while the move is
happening, other processes could be in the middle of namei(), resolving a
path that crosses the subtree, or they may be looking up ".."
(dentry->d_parent). How do we handle a big change to the namespace of a f/s
while lookups are taking place and processes are running?

Some proposals:

1. lock the entire pathname of both dirs being renamed
2. lock every object below the pathname of both dirs being renamed
3. find the common ancestor dir and lock that

None of these solutions were successful, because it's way too complicated
and cumbersome to lock many objects, in order, for a rename() that's
relatively rare: it's common to rename files in the same dir; much less
common to rename dirs across different parts of the namespace (i.e., into
other dirs).

Another problem: what if two rename(2) calls are trying to change a major
part of the file system tree at the same time? For example:

    1: mv /usr/local/ /some/other/place/ &
    2: mv /home/jdoe/bin/ /usr/local/bin/ &

The problem above is that even if you issue the two syscalls in series, not
concurrently, the outcome depends on which one goes first: each may succeed
or fail, and the f/s tree will look different. So we can't afford to risk a
race condition that produces yet another, different f/s namespace.

Linux decided to solve this problem by forcing the serialization of all
directory renames (that is, a rename of a directory from one place to
another). To serialize them, the VFS grabs this very special lock inside
struct super_block:

    /*
     * The next field is for VFS *only*. No filesystems have any business
     * even looking at it. You had been warned.
     */
    struct mutex s_vfs_rename_mutex;    /* Kludge */

This means that only one syscall at a time can rename(2) a directory from
one place to another: a serialization of an operation that is rare anyway.

So the final vfs_rename procedure performs roughly this:

1. if the objects being renamed are directories, then
   lock(sb->s_vfs_rename_mutex).
2. if I1 == I2: lock I1
3. if I1 != I2: lock both, in order, starting with the inode whose ptr addr
   is smaller.
4. call the f/s ->rename() method, then unlock only what was locked.

Lesson: renaming in a file system is a complex operation that's challenging
to implement.
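Putting the steps of the final procedure together, here is a minimal user-space sketch. It follows the steps as listed in these notes, not the kernel's actual source: struct sb_sketch, struct dir_inode, rename_lock() and rename_unlock() are hypothetical names, the rename_mutex field stands in for s_vfs_rename_mutex, and pthread mutexes stand in for kernel locks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical superblock: holds the per-f/s rename serialization lock. */
struct sb_sketch {
    pthread_mutex_t rename_mutex;   /* stands in for s_vfs_rename_mutex */
};

/* Hypothetical parent-directory inode: a lock plus its superblock. */
struct dir_inode {
    pthread_mutex_t lock;
    struct sb_sketch *sb;
};

/*
 * Take the locks needed for a rename from parent i1 to parent i2.
 * moving_dir says whether the object being renamed is a directory.
 * Returns the total number of locks taken.
 */
static int rename_lock(struct dir_inode *i1, struct dir_inode *i2,
                       bool moving_dir)
{
    int taken = 0;

    if (moving_dir) {               /* step 1: serialize dir renames */
        pthread_mutex_lock(&i1->sb->rename_mutex);
        taken++;
    }
    if (i1 == i2) {                 /* step 2: same parent, lock once */
        pthread_mutex_lock(&i1->lock);
        return taken + 1;
    }
    if ((uintptr_t)i1 < (uintptr_t)i2) {   /* step 3: lower address first */
        pthread_mutex_lock(&i1->lock);
        pthread_mutex_lock(&i2->lock);
    } else {
        pthread_mutex_lock(&i2->lock);
        pthread_mutex_lock(&i1->lock);
    }
    return taken + 2;
}

/* Release exactly what rename_lock() took for the same arguments. */
static void rename_unlock(struct dir_inode *i1, struct dir_inode *i2,
                          bool moving_dir)
{
    pthread_mutex_unlock(&i1->lock);
    if (i1 != i2)
        pthread_mutex_unlock(&i2->lock);
    if (moving_dir)
        pthread_mutex_unlock(&i1->sb->rename_mutex);
}
```

The rename mutex is taken before any inode lock, so every caller acquires locks in the same global order: rename mutex first, then inodes by address. The sketch assumes both parents belong to the same superblock, which holds for rename(2) since it does not work across file systems.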