* superblock

Records start/end (or start+len) of the various segments of the disk: data blocks, inodes, bitmaps, etc. Records the max no. of inodes and data blocks possible, and how many are in use. Useful for the df(1) command, which issues the statfs(2) syscall.

# show current usage of data blocks for '/' f/s
$ df /
# same: show inodes
$ df -i /

The superblock also stores a special "magic" number (a fingerprint), so that the mount command can recognize whether this f/s was formatted for this f/s driver.

The format command also has to initialize certain data segments:
- zero out the bitmaps
- zero out the superblock
- data blocks + inodes don't need to be zeroed out until we alloc them (saves time at format)

* data reliability and availability

Availability: can I access the data that I had before? Different from "integrity": is the data I'm retrieving the same data that I wrote before?

Must assume that failures will happen:
- media failures: bit rot, corruption of parts of the disk
- firmware failures: disk/device firmware bugs
- software failures: OS, application, etc. (bugs)
- power failures

Example: suppose I want to create a new file and write some data to it.
1. get an unused inode
2. get an unused data block
3. mark the allocation bitmaps for inodes and for data blocks
4. update the superblock (e.g., counts of un/used data blocks and inodes)
5. create a named entry in some directory: write the tuple <filename, inode number>

Block devices can only guarantee that they'll read/write a single LBA atomically: devices try very hard not to write partial LBAs or return partially read ones. If you lose power in the middle of writing the 4+ items above, you'll have partial data stored on the disk -- and it'll be inconsistent.
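The five creation steps above can be sketched as a toy in-memory model (Python; the structures and names here are made up for illustration, not real f/s code). Stopping partway through simulates a power failure and leaves the metadata structures disagreeing with each other:

```python
# Toy in-memory "file system" metadata -- hypothetical, for illustration only.
NUM_INODES, NUM_BLOCKS = 8, 16
inode_bitmap = [False] * NUM_INODES    # True = inode in use
block_bitmap = [False] * NUM_BLOCKS    # True = data block in use
superblock = {"free_inodes": NUM_INODES, "free_blocks": NUM_BLOCKS}
root_dir = {}                          # filename -> inode number

def create_file(name, crash_after_step=None):
    """Run the 5 creation steps; stop early to simulate a power failure."""
    ino = inode_bitmap.index(False)          # step 1: find an unused inode
    blk = block_bitmap.index(False)          # step 2: find an unused data block
    if crash_after_step == 2: return
    inode_bitmap[ino] = True                 # step 3: mark both bitmaps
    block_bitmap[blk] = True
    if crash_after_step == 3: return
    superblock["free_inodes"] -= 1           # step 4: update superblock counts
    superblock["free_blocks"] -= 1
    if crash_after_step == 4: return
    root_dir[name] = ino                     # step 5: write <name, inode#> dirent

create_file("a.txt")                         # completes: all structures agree
create_file("b.txt", crash_after_step=3)     # "power lost" before steps 4-5
# Inconsistency: the bitmap says 2 inodes are used, the superblock says only 1
# is used, and no directory entry points to the second inode (an orphan).
assert sum(inode_bitmap) == 2
assert superblock["free_inodes"] == NUM_INODES - 1
assert len(root_dir) == 1
```

The asserts at the end show exactly the "dangling references" situation discussed next: each structure was updated in isolation, so a crash between steps leaves them mutually inconsistent.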
You can wind up with many problems:
- the bitmaps say data is allocated, but the actual data blocks aren't used
- data blocks are used, but the bitmaps aren't updated yet
- a directory entry points to an inode, but the inode wasn't initialized yet
- an inode was created, but it points to data blocks that aren't initialized or used
- i.e., lots of "dangling" pointers and references, aka "orphans"

If this happens, the f/s is said to be "corrupted". Eventually you'd have to reboot and "clean up" the corruption.

Every f/s comes with a user-level tool called a "file system checker" (fsck). Fsck runs on block devices before they're mounted, scans the entire f/s, builds a set of data structures, and looks for inconsistencies. It then starts fixing things, or gives the superuser the option: "do you want to fix this?". E.g., if it finds a used inode w/ data blocks but no directory entry for it, it'll make a "fake" dirent for you and place it in the "lost+found" directory at the top of the f/s mountpoint. Fsck just makes up names like "inodeXXXX", where XXXX is the inode number.

Fsck runs for a long time, in phases, looking at directories, data blocks, inodes, superblocks, etc. Fsck may not know what to do: e.g., for an allocated data block w/ no inode pointing to it. It can take a long time to answer all these questions. Fsck may reach a point where it CANNOT fix the file system fully: too many corruptions, e.g., if the actual superblock was badly corrupted. In that case the data is lost, and you have to recover it from backup (if you have any).

What we'd like is "Atomicity" (cf. the database ACID properties): the ability to write multiple pieces of data at once, or not at all.

* how to mitigate availability failures (or reduce them)

1. Capacitors, batteries, etc.
Enough power left so that even if main power is lost, we can write all cached data consistently, then spin down the disk.
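The orphan-recovery step of fsck can be sketched like this (Python; the structures and the `find_orphans` helper are hypothetical, for illustration): cross-check the inode allocation bitmap against every directory entry, and give each used-but-unreferenced inode a made-up lost+found name built from its inode number:

```python
def find_orphans(inode_bitmap, directories):
    """Return lost+found entries for inodes that are marked used
    but that no directory entry references (hypothetical fsck step).

    inode_bitmap: list of bools, True = inode marked allocated.
    directories:  list of dicts mapping filename -> inode number.
    """
    referenced = {ino for d in directories for ino in d.values()}
    lost_found = {}
    for ino, used in enumerate(inode_bitmap):
        if used and ino not in referenced:
            # fsck invents a name from the inode number, e.g. "inode0005"
            lost_found["inode%04d" % ino] = ino
    return lost_found

# inode 5 is allocated, but no dirent points to it -> it's an orphan
bitmap = [False] * 8
bitmap[2] = bitmap[5] = True
dirs = [{"a.txt": 2}]
print(find_orphans(bitmap, dirs))   # {'inode0005': 5}
```

Note the reverse case (a block allocated in the bitmap with no inode pointing to it) is harder: there's no owner to reattach it to, which is why fsck sometimes has to ask the superuser what to do.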
Better: have a UPS connected to your computer, with a USB/serial cable b/t the UPS and the computer, so the computer gets a signal when the UPS goes on battery. The computer can decide when/if to shut down, e.g., when the battery gets too low.

2. Create replicas of data
Most f/s will keep several copies (~3-4) of the superblock, and even of the alloc bitmaps, spread throughout the disk. So if one copy is bad, the f/s code can use another to recover/restore as much data as possible. The more replicas, the better the availability, but:
- code complexity increases
- more space is consumed
- operations are slower (have to write data in different locations on disk)

3. Journaling
Provides data Durability and Atomicity (the 'A' and 'D' in ACID). Instead of writing the data directly to the f/s, we APPEND it to a "journal". A journal is a special hidden file w/ pre-allocated room on the disk, where you can write records of user activity. At format time, the f/s reserves room for the journal: a fixed no. of LBAs at a known offset.

If a user wants to create a new file, we create a transaction (or journal record) that looks like a tuple listing all the needed changes: R=<alloc inode, alloc data block, update bitmaps, update superblock, add dirent>. The record is appended to the journal with start/end markers: J=<txn_start, R, txn_end>.

If power is lost in the middle of writing that entry J, the mount code can identify that there's an incomplete journal entry (e.g., we see the txn_start but not the txn_end). It means this J was only partially committed to the journal, and that data is lost -- but at least the f/s isn't corrupted.

Normally, the f/s code will go through the journal entries, one at a time, and "apply" the changes recorded there (e.g., create a dir, delete a file, make a symlink). Each valid journal entry is then applied to the rest of the f/s. So if a power failure happens in the middle of applying a txn to the f/s, upon reboot we can "fix" the rest by replaying the journal entry. Only when the txn is completely applied to the f/s can we remove it from the journal.

Journaling f/s: more complex, more code, slower (write to the journal, then to the main f/s), but a lot more reliable.
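The recovery scan at mount time can be sketched like this (Python; the record format, marker names, and "apply" step are made up for illustration -- real journals store block images or logical operations): walk the journal, apply only transactions that have both their txn_start and txn_end markers, and silently discard a trailing partial one:

```python
# Hypothetical journal markers; a real f/s would use on-disk record headers.
TXN_START, TXN_END = "txn_start", "txn_end"

def recover_journal(journal):
    """Scan the journal at mount time, returning only the changes from
    complete transactions (sketch of journal replay, not real f/s code).

    journal is a flat list of markers and change records. A trailing
    transaction with no TXN_END was only partially committed (power died
    mid-append), so it is discarded rather than applied.
    """
    applied, txn, in_txn = [], [], False
    for rec in journal:
        if rec == TXN_START:
            txn, in_txn = [], True
        elif rec == TXN_END and in_txn:
            applied.extend(txn)      # txn is complete: safe to apply to the f/s
            txn, in_txn = [], False
        elif in_txn:
            txn.append(rec)          # buffer changes until we see TXN_END
    return applied                   # any partial trailing txn is dropped

log = [TXN_START, "alloc inode 3", "add dirent a.txt", TXN_END,
       TXN_START, "alloc inode 4"]  # power was lost here: no TXN_END
print(recover_journal(log))  # ['alloc inode 3', 'add dirent a.txt']
```

The key property is the one the notes describe: a half-written journal entry loses that one operation's data, but it never leaves the main f/s structures half-updated.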