* superblock

Records start/end (or start+len) of the various segments of the disk: data blocks, inodes, bitmaps, etc. Records the max no. of inodes and data blocks possible, and how many are in use. Useful for the df(1) command, which issues the statfs(2) syscall.

# show current usage of data blocks for '/' f/s
$ df /
# same: show inodes
$ df -i /

The superblock also stores a special "magic" number (a fingerprint), so that the mount command can recognize whether this f/s was formatted for this f/s driver.

The format command also has to initialize certain data segments:
- zero out the bitmaps
- zero out the superblock
- data blocks + inodes don't need to be zeroed out until we alloc them (saves time at format)

* data reliability and availability

Availability: can I access the data that I had before? Different from "integrity": is the data I'm retrieving the same data that I wrote before?

Must assume that failures will happen:
- media failures: bit rot, corruption of parts of the disk
- firmware failures: disk/device firmware bugs
- software failures: OS, application, etc. (bugs)
- power failures

Example: suppose I want to create a new file and write some data to it.
1. get an unused inode
2. get an unused data block
3. mark the allocation bitmaps for inodes and for data blocks
4. update the superblock (e.g., counts of un/used data blocks and inodes)
5. create a named entry in some directory: write the tuple <filename, inode number>

Block devices can only guarantee that they'll read/write a single LBA atomically: devices try very hard not to write partial LBAs or return partially read ones. If you lose power in the middle of writing the 4+ items above, you'll have partial data stored on the disk -- and it'll be inconsistent.
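The five creation steps above can be sketched as a toy in-memory model (Python; the structures and names here are made up for illustration, not real f/s code). Stopping partway through simulates a power failure and leaves the metadata structures disagreeing with each other:

```python
# Toy in-memory "file system" metadata -- hypothetical, for illustration only.
NUM_INODES, NUM_BLOCKS = 8, 16
inode_bitmap = [False] * NUM_INODES    # True = inode in use
block_bitmap = [False] * NUM_BLOCKS    # True = data block in use
superblock = {"free_inodes": NUM_INODES, "free_blocks": NUM_BLOCKS}
root_dir = {}                          # filename -> inode number

def create_file(name, crash_after_step=None):
    """Run the 5 creation steps; stop early to simulate a power failure."""
    ino = inode_bitmap.index(False)          # step 1: find an unused inode
    blk = block_bitmap.index(False)          # step 2: find an unused data block
    if crash_after_step == 2: return
    inode_bitmap[ino] = True                 # step 3: mark both bitmaps
    block_bitmap[blk] = True
    if crash_after_step == 3: return
    superblock["free_inodes"] -= 1           # step 4: update superblock counts
    superblock["free_blocks"] -= 1
    if crash_after_step == 4: return
    root_dir[name] = ino                     # step 5: write <name, inode#> dirent

create_file("a.txt")                         # completes: all structures agree
create_file("b.txt", crash_after_step=3)     # "power lost" before steps 4-5
# Inconsistency: the bitmap says 2 inodes are used, the superblock says only 1
# is used, and no directory entry points to the second inode (an orphan).
assert sum(inode_bitmap) == 2
assert superblock["free_inodes"] == NUM_INODES - 1
assert len(root_dir) == 1
```

The asserts at the end show exactly the "dangling references" situation discussed next: each structure was updated in isolation, so a crash between steps leaves them mutually inconsistent.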
You can wind up with many problems:
- the bitmaps say data is allocated, but the actual data blocks aren't used
- data blocks are used, but the bitmaps aren't updated yet
- a directory entry points to an inode, but the inode wasn't initialized yet
- an inode was created, but it points to data blocks that aren't initialized or used
- i.e., lots of "dangling" pointers and references, aka "orphans"

If this happens, the f/s is said to be "corrupted". Eventually you'd have to reboot and "clean up" the corruption.

Every f/s comes with a user-level tool called a "file system checker" (fsck). Fsck runs on block devices before they're mounted, scans the entire f/s, builds a set of data structures, and looks for inconsistencies. It then starts fixing things, or gives the superuser the option: "do you want to fix this?". E.g., if it finds a used inode w/ data blocks but no directory entry for it, it'll make a "fake" dirent for you and place it in the "lost+found" directory at the top of the f/s mountpoint. Fsck just makes up names like "inodeXXXX", where XXXX is the inode number.

Fsck runs for a long time, in phases, looking at directories, data blocks, inodes, superblocks, etc. Fsck may not know what to do: e.g., for an allocated data block w/ no inode pointing to it. It can take a long time to answer all these questions. Fsck may reach a point where it CANNOT fix the file system fully: too many corruptions, e.g., if the actual superblock was badly corrupted. In that case the data is lost, and you have to recover it from backup (if you have any).

What we'd like is "Atomicity" (cf. the database ACID properties): the ability to write multiple pieces of data at once, or not at all.

* how to mitigate availability failures (or reduce them)

1. Capacitors, batteries, etc.
Enough power left so that even if main power is lost, we can write all cached data consistently, then spin down the disk.
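The orphan-recovery step of fsck can be sketched like this (Python; the structures and the `find_orphans` helper are hypothetical, for illustration): cross-check the inode allocation bitmap against every directory entry, and give each used-but-unreferenced inode a made-up lost+found name built from its inode number:

```python
def find_orphans(inode_bitmap, directories):
    """Return lost+found entries for inodes that are marked used
    but that no directory entry references (hypothetical fsck step).

    inode_bitmap: list of bools, True = inode marked allocated.
    directories:  list of dicts mapping filename -> inode number.
    """
    referenced = {ino for d in directories for ino in d.values()}
    lost_found = {}
    for ino, used in enumerate(inode_bitmap):
        if used and ino not in referenced:
            # fsck invents a name from the inode number, e.g. "inode0005"
            lost_found["inode%04d" % ino] = ino
    return lost_found

# inode 5 is allocated, but no dirent points to it -> it's an orphan
bitmap = [False] * 8
bitmap[2] = bitmap[5] = True
dirs = [{"a.txt": 2}]
print(find_orphans(bitmap, dirs))   # {'inode0005': 5}
```

Note the reverse case (a block allocated in the bitmap with no inode pointing to it) is harder: there's no owner to reattach it to, which is why fsck sometimes has to ask the superuser what to do.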
Better: have a UPS connected to your computer, with a USB/serial cable b/t the UPS and the computer, so the computer gets a signal when the UPS goes on battery. The computer can decide when/if to shut down, e.g., when the battery gets too low.

2. Create replicas of data
Most f/s will keep several copies (~3-4) of the superblock, and even of the alloc bitmaps, spread throughout the disk. So if one copy is bad, the f/s code can use another to recover/restore as much data as possible. The more replicas, the better the availability, but:
- code complexity increases
- more space is consumed
- operations are slower (have to write data in different locations on disk)

3. Journaling
Provides data Durability and Atomicity (the 'A' and 'D' in ACID). Instead of writing the data directly to the f/s, we APPEND it to a "journal". A journal is a special hidden file w/ pre-allocated room on the disk, where you can write records of user activity. At format time, the f/s reserves room for the journal: a fixed no. of LBAs at a known offset.

If a user wants to create a new file, we create a transaction (or journal record) that looks like a tuple listing all the needed changes: R=<alloc inode, alloc data block, update bitmaps, update superblock, add dirent>. The record is appended to the journal with start/end markers: J=<txn_start, R, txn_end>.

If power is lost in the middle of writing that entry J, the mount code can identify that there's an incomplete journal entry (e.g., we see the txn_start but not the txn_end). It means this J was only partially committed to the journal, and that data is lost -- but at least the f/s isn't corrupted.

Normally, the f/s code will go through the journal entries, one at a time, and "apply" the changes recorded there (e.g., create a dir, delete a file, make a symlink). Each valid journal entry is then applied to the rest of the f/s. So if a power failure happens in the middle of applying a txn to the f/s, upon reboot we can "fix" the rest by replaying the journal entry. Only when the txn is completely applied to the f/s can we remove it from the journal.

Journaling f/s: more complex, more code, slower (write to the journal, then to the main f/s), but a lot more reliable.
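The recovery scan at mount time can be sketched like this (Python; the record format, marker names, and "apply" step are made up for illustration -- real journals store block images or logical operations): walk the journal, apply only transactions that have both their txn_start and txn_end markers, and silently discard a trailing partial one:

```python
# Hypothetical journal markers; a real f/s would use on-disk record headers.
TXN_START, TXN_END = "txn_start", "txn_end"

def recover_journal(journal):
    """Scan the journal at mount time, returning only the changes from
    complete transactions (sketch of journal replay, not real f/s code).

    journal is a flat list of markers and change records. A trailing
    transaction with no TXN_END was only partially committed (power died
    mid-append), so it is discarded rather than applied.
    """
    applied, txn, in_txn = [], [], False
    for rec in journal:
        if rec == TXN_START:
            txn, in_txn = [], True
        elif rec == TXN_END and in_txn:
            applied.extend(txn)      # txn is complete: safe to apply to the f/s
            txn, in_txn = [], False
        elif in_txn:
            txn.append(rec)          # buffer changes until we see TXN_END
    return applied                   # any partial trailing txn is dropped

log = [TXN_START, "alloc inode 3", "add dirent a.txt", TXN_END,
       TXN_START, "alloc inode 4"]  # power was lost here: no TXN_END
print(recover_journal(log))  # ['alloc inode 3', 'add dirent a.txt']
```

The key property is the one the notes describe: a half-written journal entry loses that one operation's data, but it never leaves the main f/s structures half-updated.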