* Journaling cont.
Writing to the journal is fast b/c it's serial. Usually we write concurrently to an in-mem journal, then flush it to the on-disk journal. File systems can journal:
- just meta-data
- just data
- both
Journaling only m-d is faster, but you risk losing file data on power loss. m-d is typically more important than data, and is only 1-4% the size of the data.
The sizes of the in-mem and on-disk journals are configurable; the flushing frequency from in-mem to on-disk, and then to the main f/s, is also configurable. If either journal is full (or above a certain threshold), we block new records from coming in, suspending the processes above: this is called "throttling the heavy writer" (TBD). (See the throttling sketch below.)
Note: a write(2) often just writes data to a dirty buffer in memory, not persistently, so it can be lost. If you want to persist it, issue sync(), fsync(), flush(), etc. calls. You can also mount(2) a f/s in "sync" mode, so every write is flushed immediately (slower, but more reliable). (See the fsync sketch below.)

* F/s optimizations
In the age of HDDs, seeking is slow. So if inodes are far away from their data, a lot of seeking happens. Instead, f/s break the partition into "cylinder groups" (CGs) -- large areas of the disk (but less than the whole disk) -- and place an inode+data area in each CG.
Benefit: an inode and its data can live in the same CG: less seeking, faster.
Optimization: files in the same directory are by default allocated in the same CG, to take advantage of this "locality". You can configure the number and size of CGs at format time. But if you run out of data space in one CG, you have to spill over to another.
If you rename a file, the dirent is moved, but the file's inode+data stay where they are. Over time, with file/dir additions, changes, and deletions, there will be "holes" in the f/s -- gaps resulting in fragmentation. In aged f/s, esp. if they're nearly full, fragmentation is unavoidable. There are commercial tools that can defragment and "optimize" a f/s. Alternatively, you can copy the whole f/s to a new disk, which reduces fragmentation.

* Extents
Old f/s would allocate space one block at a time (512B or 4KB). Large files then also need lots of in/direct pointers, wasting space. Better: allocate a contiguous region of the disk, an "extent". Use the fallocate(2) syscall to reserve extents. The f/s will try to give you a single contiguous extent; if it can't, it may break the request into 2-3 extents. You can ask the f/s to pre-allocate a large extent for a file you're about to write. But what if the data is never written? How long do we wait? A: configure a timeout, after which any unused extent space is reclaimed. You may also limit the size of extents users can ask for. (See the fallocate sketch below.)

* CoW, snapshotting
Copy-on-Write (CoW): make a copy of the data before you overwrite it. This lets you keep a history of all blocks in the f/s, and hence a history of files/directories. Very useful for "browsing" older versions of files, and even recovering them after accidents. Many backup systems keep versions of files (e.g., Windows Backup, macOS Time Machine).
If you CoW too much, you may run out of space. So you need a "retention" policy: how many versions to keep (by time), or how much space at most to allocate for copies. CoW trades off more space (and more complex code) for better availability.
CoW vs. backups:
- Backups perform copies every N hours/days (coarse granularity). If you back up too frequently, the system will be slow.
- CoW captures immediate changes at much finer granularity: useful b/c you can recover every change, but it costs more space and resources (CPU, mem, etc.).
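A toy sketch of "throttling the heavy writer" above, assuming a fixed-capacity in-mem journal and a separate flusher thread (the capacity, threshold, and function names are all made up for illustration):

/* Toy sketch: writers block when the in-mem journal is above a threshold,
 * and are woken once the flusher drains records to the on-disk journal. */
#include <pthread.h>
#include <stdio.h>

#define CAPACITY   64
#define THRESHOLD  48   /* start throttling above this many records */

static int nrecords;    /* records currently in the in-mem journal */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  drained = PTHREAD_COND_INITIALIZER;

void journal_append(void)                 /* called by writing processes */
{
    pthread_mutex_lock(&lock);
    while (nrecords >= THRESHOLD)         /* journal too full: suspend */
        pthread_cond_wait(&drained, &lock);
    nrecords++;                           /* append the record */
    pthread_mutex_unlock(&lock);
}

void journal_flush(int n)                 /* called by the flusher thread */
{
    pthread_mutex_lock(&lock);
    /* ... here a real f/s would write n records to the on-disk journal ... */
    nrecords -= (n < nrecords) ? n : nrecords;
    pthread_cond_broadcast(&drained);     /* wake up throttled writers */
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    for (int i = 0; i < 10; i++)
        journal_append();                 /* 10 < THRESHOLD: no throttling */
    journal_flush(10);                    /* drain to the on-disk journal */
    printf("records left: %d\n", nrecords);
    return 0;
}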
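To make the write(2)-vs-persistence note concrete, a minimal sketch using standard POSIX calls (the file name is made up):

/* write(2) alone may only dirty an in-memory buffer; fsync(2) asks the
 * kernel to flush that file's data to stable storage before returning. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *msg = "important record\n";
    int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); exit(1); }

    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); exit(1); }

    /* Without this, a power loss could drop the record from the page cache. */
    if (fsync(fd) < 0) { perror("fsync"); exit(1); }

    close(fd);
    return 0;
}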
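And a minimal sketch of reserving an extent with fallocate(2), assuming Linux (the file name and the 1 GiB size are arbitrary choices for illustration):

/* Pre-allocate space for a file we're about to write, so the f/s can try
 * to find one contiguous extent for it. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); exit(1); }

    /* Reserve 1 GiB starting at offset 0. The f/s will try to satisfy this
     * with a single contiguous extent; if it can't, it may use several. */
    if (fallocate(fd, 0, 0, 1L << 30) < 0) { perror("fallocate"); exit(1); }

    close(fd);
    return 0;
}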
Q: Is there an intermediate solution b/t backups and CoW?
A: Yes, snapshots. Snapshots first freeze the f/s to prevent incoming writes, then capture the state of the whole f/s at time T, then unfreeze the f/s so new writes can come in. Only changes from time T onward need preserving: perform CoW on any block changed since time T. But if multiple writes to the same block come in, you don't preserve all of them -- only the state of the block at time T. (See the toy sketch at the end.)
Snapshot implementations:
- file systems: ZFS (B-tree), XFS (B-tree), Btrfs (B+ tree)
- block layer: appliances (NetApp, EMC, Dell)
Note: B-tree f/s can support snapshots more easily (just make a copy of a subtree). Directory searches are also faster, O(log n), instead of O(n) in a f/s whose directory is an unsorted list of dirents.
You can configure a policy for snapshot capture and retention: how often to capture (hourly/daily/etc.), and how many snapshots to keep. Previous snapshots are often marked "read only": immutable. But you can "clone" a read-only snapshot temporarily and modify the clone in place. Snapshots are often used in hypervisors to capture the state of VMs, restore them, and test.
Snapshots can be transmitted to another storage node, where they accumulate for disaster recovery, as follows:
1. First, make a full copy of the orig f/s to another node (yes, 2x the space).
2. Then snapshot the primary system and copy the snaps (incrementals) over to the backup system. These are "snapshot mirrors".
3. If the main system is gone, you can bring up the alternative.
Snapshots and mirrors of snapshots are useful to recover from malware infections as well as ransomware attacks.
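A toy sketch (not a real f/s) of the snapshot CoW rule above: only the block state at time T is preserved, no matter how many writes follow. All structures and names here are made up for illustration:

/* After a snapshot at time T, the FIRST write to a block saves the block's
 * state at T; later writes to the same block are NOT preserved. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS   8
#define BLKSIZE  16

static char live[NBLOCKS][BLKSIZE];   /* current f/s blocks */
static char snap[NBLOCKS][BLKSIZE];   /* snapshot copies */
static bool preserved[NBLOCKS];       /* block already copied since T? */

static void take_snapshot(void)
{
    /* "Freeze" is instantaneous here: just reset the preserved map;
     * no blocks are copied until they're actually overwritten. */
    memset(preserved, 0, sizeof(preserved));
}

static void write_block(int b, const char *data)
{
    if (!preserved[b]) {                    /* first write since time T */
        memcpy(snap[b], live[b], BLKSIZE);  /* copy-on-write */
        preserved[b] = true;
    }
    /* second, third, ... writes overwrite in place: only state at T kept */
    strncpy(live[b], data, BLKSIZE - 1);
}

int main(void)
{
    strcpy(live[0], "v1");
    take_snapshot();          /* capture state at time T */
    write_block(0, "v2");     /* triggers CoW: "v1" is saved */
    write_block(0, "v3");     /* no CoW: intermediate "v2" is lost */
    printf("live=%s snapshot=%s\n", live[0], snap[0]);  /* live=v3 snapshot=v1 */
    return 0;
}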