* Journaling cont.
Writing to the journal is fast b/c it's serial. Usually we write concurrently to an in-mem journal, then flush it to the on-disk journal. File systems can journal:
- just meta-data
- just data
- both
Journaling only m-d is faster, but you risk losing file data on power loss. m-d is typically more important than data, and is only 1-4% the size of the data.
The sizes of the in-mem and on-disk journals are configurable; the flushing frequency from in-mem to on-disk, and then to the main f/s, is also configurable. If either journal is full (or above a certain threshold), we block new records from coming in, suspending the processes above: this is called "throttling the heavy writer" (TBD). (See the throttling sketch below.)
Note: a write(2) often just writes data to a dirty buffer in memory, not persistently, so it can be lost. If you want to persist it, issue sync(), fsync(), flush(), etc. calls. You can also mount(2) a f/s in "sync" mode, so every write is flushed immediately (slower, but more reliable). (See the fsync sketch below.)

* F/s optimizations
In the age of HDDs, seeking is slow. So if inodes are far away from their data, a lot of seeking happens. Instead, f/s break the partition into "cylinder groups" (CGs) -- large areas of the disk (but less than the whole disk) -- and place an inode+data area in each CG.
Benefit: an inode and its data can live in the same CG: less seeking, faster.
Optimization: files in the same directory are by default allocated in the same CG, to take advantage of this "locality". You can configure the number and size of CGs at format time. But if you run out of data space in one CG, you have to spill over to another.
If you rename a file, the dirent is moved, but the file's inode+data stay where they are. Over time, with file/dir additions, changes, and deletions, there will be "holes" in the f/s -- gaps resulting in fragmentation. In aged f/s, esp. if they're nearly full, fragmentation is unavoidable. There are commercial tools that can defragment and "optimize" a f/s. Alternatively, you can copy the whole f/s to a new disk, which reduces fragmentation.

* Extents
Old f/s would allocate space one block at a time (512B or 4KB). Large files then also need lots of in/direct pointers, wasting space. Better: allocate a contiguous region of the disk, an "extent". Use the fallocate(2) syscall to reserve extents. The f/s will try to give you a single contiguous extent; if it can't, it may break the request into 2-3 extents. You can ask the f/s to pre-allocate a large extent for a file you're about to write. But what if the data is never written? How long do we wait? A: configure a timeout, after which any unused extent space is reclaimed. You may also limit the size of extents users can ask for. (See the fallocate sketch below.)

* CoW, snapshotting
Copy-on-Write (CoW): make a copy of the data before you overwrite it. This lets you keep a history of all blocks in the f/s, and hence a history of files/directories. Very useful for "browsing" older versions of files, and even recovering them after accidents. Many backup systems keep versions of files (e.g., Windows Backup, macOS Time Machine).
If you CoW too much, you may run out of space. So you need a "retention" policy: how many versions to keep (by time), or how much space at most to allocate for copies. CoW trades off more space (and more complex code) for better availability.
CoW vs. backups:
- Backups perform copies every N hours/days (coarse granularity). If you back up too frequently, the system will be slow.
- CoW captures immediate changes at much finer granularity: useful b/c you can recover every change, but it costs more space and resources (CPU, mem, etc.).
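A toy sketch of "throttling the heavy writer" above, assuming a fixed-capacity in-mem journal and a separate flusher thread (the capacity, threshold, and function names are all made up for illustration):

/* Toy sketch: writers block when the in-mem journal is above a threshold,
 * and are woken once the flusher drains records to the on-disk journal. */
#include <pthread.h>
#include <stdio.h>

#define CAPACITY   64
#define THRESHOLD  48   /* start throttling above this many records */

static int nrecords;    /* records currently in the in-mem journal */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  drained = PTHREAD_COND_INITIALIZER;

void journal_append(void)                 /* called by writing processes */
{
    pthread_mutex_lock(&lock);
    while (nrecords >= THRESHOLD)         /* journal too full: suspend */
        pthread_cond_wait(&drained, &lock);
    nrecords++;                           /* append the record */
    pthread_mutex_unlock(&lock);
}

void journal_flush(int n)                 /* called by the flusher thread */
{
    pthread_mutex_lock(&lock);
    /* ... here a real f/s would write n records to the on-disk journal ... */
    nrecords -= (n < nrecords) ? n : nrecords;
    pthread_cond_broadcast(&drained);     /* wake up throttled writers */
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    for (int i = 0; i < 10; i++)
        journal_append();                 /* 10 < THRESHOLD: no throttling */
    journal_flush(10);                    /* drain to the on-disk journal */
    printf("records left: %d\n", nrecords);
    return 0;
}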
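To make the write(2)-vs-persistence note concrete, a minimal sketch using standard POSIX calls (the file name is made up):

/* write(2) alone may only dirty an in-memory buffer; fsync(2) asks the
 * kernel to flush that file's data to stable storage before returning. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *msg = "important record\n";
    int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); exit(1); }

    if (write(fd, msg, strlen(msg)) < 0) { perror("write"); exit(1); }

    /* Without this, a power loss could drop the record from the page cache. */
    if (fsync(fd) < 0) { perror("fsync"); exit(1); }

    close(fd);
    return 0;
}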
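And a minimal sketch of reserving an extent with fallocate(2), assuming Linux (the file name and the 1 GiB size are arbitrary choices for illustration):

/* Pre-allocate space for a file we're about to write, so the f/s can try
 * to find one contiguous extent for it. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); exit(1); }

    /* Reserve 1 GiB starting at offset 0. The f/s will try to satisfy this
     * with a single contiguous extent; if it can't, it may use several. */
    if (fallocate(fd, 0, 0, 1L << 30) < 0) { perror("fallocate"); exit(1); }

    close(fd);
    return 0;
}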
Q: Is there an intermediate solution b/t backups and CoW?
A: Yes, snapshots. Snapshots first freeze the f/s to prevent incoming writes, then capture the state of the whole f/s at time T, then unfreeze the f/s so new writes can come in. Only changes from time T onward need preserving: perform CoW on any block changed since time T. But if multiple writes to the same block come in, you don't preserve all of them -- only the state of the block at time T. (See the toy sketch at the end.)
Snapshot implementations:
- file systems: ZFS (B-tree), XFS (B-tree), Btrfs (B+ tree)
- block layer: appliances (NetApp, EMC, Dell)
Note: B-tree f/s can support snapshots more easily (just make a copy of a subtree). Directory searches are also faster, O(log n), instead of O(n) in a f/s whose directory is an unsorted list of dirents.
You can configure a policy for snapshot capture and retention: how often to capture (hourly/daily/etc.), and how many snapshots to keep. Previous snapshots are often marked "read only": immutable. But you can "clone" a read-only snapshot temporarily and modify the clone in place. Snapshots are often used in hypervisors to capture the state of VMs, restore them, and test.
Snapshots can be transmitted to another storage node, where they accumulate for disaster recovery, as follows:
1. First, make a full copy of the orig f/s to another node (yes, 2x the space).
2. Then snapshot the primary system and copy the snaps (incrementals) over to the backup system. These are "snapshot mirrors".
3. If the main system is gone, you can bring up the alternative.
Snapshots and mirrors of snapshots are useful to recover from malware infections as well as ransomware attacks.
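A toy sketch (not a real f/s) of the snapshot CoW rule above: only the block state at time T is preserved, no matter how many writes follow. All structures and names here are made up for illustration:

/* After a snapshot at time T, the FIRST write to a block saves the block's
 * state at T; later writes to the same block are NOT preserved. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS   8
#define BLKSIZE  16

static char live[NBLOCKS][BLKSIZE];   /* current f/s blocks */
static char snap[NBLOCKS][BLKSIZE];   /* snapshot copies */
static bool preserved[NBLOCKS];       /* block already copied since T? */

static void take_snapshot(void)
{
    /* "Freeze" is instantaneous here: just reset the preserved map;
     * no blocks are copied until they're actually overwritten. */
    memset(preserved, 0, sizeof(preserved));
}

static void write_block(int b, const char *data)
{
    if (!preserved[b]) {                    /* first write since time T */
        memcpy(snap[b], live[b], BLKSIZE);  /* copy-on-write */
        preserved[b] = true;
    }
    /* second, third, ... writes overwrite in place: only state at T kept */
    strncpy(live[b], data, BLKSIZE - 1);
}

int main(void)
{
    strcpy(live[0], "v1");
    take_snapshot();          /* capture state at time T */
    write_block(0, "v2");     /* triggers CoW: "v1" is saved */
    write_block(0, "v3");     /* no CoW: intermediate "v2" is lost */
    printf("live=%s snapshot=%s\n", live[0], snap[0]);  /* live=v3 snapshot=v1 */
    return 0;
}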