* storage stack and file systems

In simplest terms, the storage stack comprises:

1. lowest: hardware
   - hardware itself may have layers: firmware, caching, physical media
2. middle: OS
3. upper: user applications, incl. networks, clouds
   - low-level libraries (libc)
   - middleware libraries (e.g., libssl)
   - applications (e.g., Web server)
   - networks...
   - servers, clouds, distributed systems, ...

Inside the OS (from lowest to highest):

* 1. device drivers
  - standardize access to devices
  - understand specifics such as how to read/write, spin down, etc.
  - so that upper layers can use a unified API for accessing all devices
    of the same type (e.g., HDDs)

* 2. I/O schedulers
  - decide what to send to a disk and in what order
  - requests to r/w LBAs may come from different apps/sources
  - requests may interleave in any order
  - e.g., assume an HDD has 10 tracks, each track with 100 LBAs:

      track 1:  LBAs 0..99
      track 2:  LBAs 100..199
      ...
      track 10: LBAs 900..999

    Assume the sequence of LBAs to access (read or write) is:
    1, 300, 3, 400, 730, 402

    naive: go to track 1, read LBA 1; then go to track 4, read LBA 300;
    then go back to track 1, read LBA 3; ...
    result: a lot of head movement, high latencies seen by users/apps

    better: sort the incoming requests by LBA# and issue them in order.
    That way, you read as many as you can from one track, then move to an
    adjacent track, read from that track, etc.  This minimizes head
    seeks.  Once the head is at the innermost track, we can reverse-sort
    the next set of requests and read LBAs in descending order.

The I/O scheduler will:

1. Get N requests, waiting up to time T.
   - If N=1: you're essentially sending reqs in FCFS order, which can
     result in random seeks.  This is sometimes called the "noop"
     scheduler.  IOW, a small N will result in more randomness.
   - If N is large: you're waiting too long to submit requests, resulting
     in artificial delays (latency) that could be worse than head seeks.
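The seek savings from sorting can be sketched with a toy cost model.  This is only an illustration using the example numbers above; the `track_of` helper and the "count track-to-track moves" cost metric are assumptions, not how a real scheduler measures seek time:

```python
# Toy model: 10 tracks, 100 LBAs per track, as in the example above.
LBAS_PER_TRACK = 100

def track_of(lba):
    """Track number (1-based) containing this LBA."""
    return lba // LBAS_PER_TRACK + 1

def head_movement(requests):
    """Total track-to-track moves to serve requests in order, head at track 1."""
    pos = 1
    moves = 0
    for lba in requests:
        t = track_of(lba)
        moves += abs(t - pos)
        pos = t
    return moves

reqs = [1, 300, 3, 400, 730, 402]
print(head_movement(reqs))          # FCFS (naive) order -> 16 track moves
print(head_movement(sorted(reqs)))  # one ascending sweep -> 7 track moves
```

The sorted sweep touches each track region once per pass instead of bouncing back and forth, which is exactly the win the notes describe.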
So the "right" value for N may depend on various factors: the load on the
system, the speed of the HDDs, the kinds of apps running, users'
preferences, etc.  That's why I/O schedulers' params can be configured.

T: how long to wait before submitting I/Os?  You don't want to wait too
long before submitting your requests, even if fewer than N are queued.

2. Monitor the position of the HDD head (using ctrl codes).
3. Sort the requests in ascending/descending order.
4. Send them in that order to the HDD.
5. Repeat steps 1-4, alternating the sorting order.

This is called the "elevator algorithm", which sweeps the HDD head back
and forth.

There are many I/O schedulers in existence, with config params.  They
tend to be optimized for different settings:
- specific complex workloads (e.g., DB)
- specific devices or device types
- parallelism in some systems: multi-queue I/O schedulers
- priority queues
- optimizing for access patterns: temporal, frequency based, etc.
- many follow algorithms similar to process/thread schedulers

Note: if the device is busy, the I/O scheduler can sense it using control
commands ("are you free or busy?").  If the disk is busy a lot of the
time, then the queue in the I/O scheduler starts growing... eventually
upper layers will slow down as well, until the application is slowed and
even suspended (throttled).  We'll discuss this more in the module on
async queues.

* file systems

Logically and traditionally, the file system is the next layer above I/O
schedulers.  (Note: there may be more "virtualization" layers in between,
TBD.)

A f/s decides how to layer specific information on top of a storage
device.  Recall the device just gives an abstraction of N x LBAs, whose
size is fixed (512B or 4KB).  The f/s provides the abstraction of
accessing files to upper layers and applications.
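The "N x fixed-size LBAs" abstraction the f/s builds on can be sketched as a tiny in-memory toy device.  The class name, in-memory backing store, and whole-block-only rule are assumptions for illustration; a real driver talks to hardware, not a Python list:

```python
class ToyBlockDevice:
    """A device is just N fixed-size blocks, addressed by LBA (0..N-1)."""

    def __init__(self, num_lbas, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_lbas)]

    def read(self, lba):
        """Return the whole block at this LBA."""
        return self.blocks[lba]

    def write(self, lba, data):
        """Overwrite the whole block; devices transfer whole LBAs only."""
        assert len(data) == self.block_size
        self.blocks[lba] = data

dev = ToyBlockDevice(num_lbas=1000)          # a "disk" of 1000 x 512B blocks
dev.write(42, b"x" * 512)
print(dev.read(42) == b"x" * 512)            # True
```

Everything a file system stores (inodes, dirents, file data) ultimately has to be packed into these fixed-size blocks.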
What info a file system stores:

(a) files:
    - inode: m-d about the file
    - content of the file
    - the name of the file (sometimes considered m-d, sometimes called
      the "namespace")

What is meta-data: data about data.  Typically, m-d is much smaller than
data, usually on the order of 1-5% relative to the actual data.  However,
m-d is *more* important than the file's data.

File meta-data:
- name (sometimes considered m-d)
- dates when the file was created, modified, last read, etc.
- size of the file
- type of the file:
  - regular file
  - directory
  - symbolic link (not hard links)
  - sockets, pipes
  - devices (block and char)
- permissions: who can access
- owner(s) of the file, also groups that can access the file
- links or pointers to where the actual blocks of the file are on media

Why different "types"?
- all objects have m-d (an inode) and possibly data
- different types enforce different semantics inside the OS

For example:

REGULAR FILES: open, read, write, close, rename, delete, append

DIRECTORIES:
- a directory is a specially structured file that looks like a table of
  records, each mapping a name to an inode#
- when we look up a "name" in a directory, the table is searched until we
  find a matching string; then we can return the inode# found; else, we
  return an error ENOENT
- note: the inode# alone does not tell me what type of object that
  inode# refers to
- historically, some file systems add more info to the directory entry
  (called a "dirent"), for example the type of the object represented by
  that inode
- you could copy more and more info into the dirent; that makes some ops
  faster (you don't have to go and retrieve the actual inode).  But the
  more you copy into the dirent, the more you have to worry about syncing
  the info b/t the inode and the dirent.

Lesson: any time you copy data b/t points A and B, you have to ask: is
the data in sync?  Is there a chance that one of them could have changed
w/o you knowing?  In a cache, can the source have changed?  If so, the
cached info (another "copy") is stale.
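The directory-as-table lookup described above can be sketched as a linear search over (name, inode#) records.  The sample names and inode numbers are made up for illustration; only the table shape and the ENOENT-on-miss behavior come from the description:

```python
import errno

# A directory: a table of (name, inode#) records.
dirents = [("file.txt", 17), ("subdir", 23), ("notes", 31)]

def lookup(name):
    """Search the dirent table for a matching string; return its inode#,
    else fail with ENOENT, as in the description above."""
    for entry_name, inode_num in dirents:
        if entry_name == name:
            return inode_num
    raise OSError(errno.ENOENT, "No such file or directory", name)

print(lookup("subdir"))   # 23
```

Note the return value is only an inode#: to learn the object's type you'd still have to read that inode (unless, as some file systems do, the type was also copied into the dirent).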
In a single system, superusers can access the HDD's raw data directly and
r/w blocks.  In a networked system, other users from different
locations/nodes, or the server, could have changed the data.

What ops are valid on a directory:
- don't allow regular-file r/w + seeks (easy to mess up a dirent)
- allow ops on whole dirents:
  - mkdir, rmdir
  - create, unlink
  - mknod, unlink
  - listing the dirents
  - rename
- if you want to delete an entry, just zero out the name, but you also
  have to mark the inode as "free" (and all of its data blocks)
- rename: just change the string name, no change to the inode
- create/mkdir/mknod: allocate a new string entry + a new inode

SYMLINKS:
- an inode and a name
- the content of this "object" is usually capped at 4K
- the content can be interpreted as another path name
  (/usr/local/some/thing)
- provides a level of indirection from one named object to another
- a symlink can point to another symlink, etc.
- ops allowed: create symlink, delete symlink, "follow" symlink, read
  symlink contents (same as reading a file)
- part of the "namei" lookup process (upper OS layer, TBD)
- Q: how do I prevent symlink loops?!

PIPES:
- a named entity in a directory, with an inode
- but reading/writing that object is redirected to a running process
- pipes are useful/used for control commands to services
- e.g., the apache Web server may create a pipe /var/run/apachectl;
  then the apachectl user CLI utility can use that pipe to send commands
  to the running Web server (e.g., reload your config, install a new
  certificate, etc.)
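One common answer to the symlink-loop question above is a follow-count cap: the kernel simply gives up after a fixed number of symlink traversals and returns ELOOP (Linux caps total follows at around 40).  A toy sketch, with a made-up in-memory namespace standing in for the real namei lookup:

```python
import errno

MAX_FOLLOWS = 40   # cap on symlink follows; real kernels use a similar limit

# Toy namespace: name -> ("file", inode#) or ("symlink", target name)
namespace = {
    "data":  ("file", 99),
    "link1": ("symlink", "data"),
    "a":     ("symlink", "b"),
    "b":     ("symlink", "a"),   # a loop: a -> b -> a -> ...
}

def resolve(name):
    """Follow symlinks until a non-symlink object; fail with ELOOP if we
    exceed the follow cap (which is how loops are cut off)."""
    follows = 0
    while True:
        kind, value = namespace[name]
        if kind != "symlink":
            return value          # inode# of the final object
        follows += 1
        if follows > MAX_FOLLOWS:
            raise OSError(errno.ELOOP, "Too many levels of symbolic links")
        name = value

print(resolve("link1"))   # 99
```

The cap means a loop is never detected explicitly; it just eventually runs out of allowed follows, which also bounds the cost of very long (but legitimate) symlink chains.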