* storage stack and file systems

In simplest terms, the storage stack comprises:

1. lowest: hardware
   - hardware itself may have layers: firmware, caching, physical media
2. middle: OS
3. upper: user applications, incl. networks, clouds
   - low-level libraries (libc)
   - middleware libraries (e.g., libssl)
   - applications (e.g., Web server)
   - networks...
   - servers, clouds, distributed systems, ...

Inside the OS (from lowest to highest):

* 1. device drivers
  - standardize access to devices
  - understand specifics such as how to read/write, spin down, etc.
  - so that upper layers can use a unified API for accessing all devices
    of the same type (e.g., HDDs)

* 2. I/O schedulers
  - decide what to send to a disk and in what order
  - requests to r/w LBAs may come from different apps/sources
  - requests may interleave in any order
  - e.g., assume an HDD has 10 tracks, each track with 100 LBAs:

      track 1:  LBAs 0..99
      track 2:  LBAs 100..199
      ...
      track 10: LBAs 900..999

    Assume the sequence of LBAs to access (read or write) is:
    1, 300, 3, 400, 730, 402

    naive: go to track 1, read LBA 1; then go to track 4, read LBA 300;
    then go back to track 1, read LBA 3; ...
    result: a lot of head movement, high latencies seen by users/apps

    better: sort the incoming requests by LBA# and issue them in order.
    That way, you read as many as you can from one track, then move to an
    adjacent track, read from that track, etc.  This minimizes head
    seeks.  Once the head is at the innermost track, we can reverse-sort
    the next set of requests and read LBAs in descending order.

The I/O scheduler will:

1. Get N requests, waiting up to time T.
   - If N=1: you're essentially sending reqs in FCFS order, which can
     result in random seeks.  This is sometimes called the "noop"
     scheduler.  IOW, a small N will result in more randomness.
   - If N is large: you're waiting too long to submit requests, resulting
     in artificial delays (latency) that could be worse than head seeks.
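The seek savings from sorting can be sketched with a toy cost model.  This is only an illustration using the example numbers above; the `track_of` helper and the "count track-to-track moves" cost metric are assumptions, not how a real scheduler measures seek time:

```python
# Toy model: 10 tracks, 100 LBAs per track, as in the example above.
LBAS_PER_TRACK = 100

def track_of(lba):
    """Track number (1-based) containing this LBA."""
    return lba // LBAS_PER_TRACK + 1

def head_movement(requests):
    """Total track-to-track moves to serve requests in order, head at track 1."""
    pos = 1
    moves = 0
    for lba in requests:
        t = track_of(lba)
        moves += abs(t - pos)
        pos = t
    return moves

reqs = [1, 300, 3, 400, 730, 402]
print(head_movement(reqs))          # FCFS (naive) order -> 16 track moves
print(head_movement(sorted(reqs)))  # one ascending sweep -> 7 track moves
```

The sorted sweep touches each track region once per pass instead of bouncing back and forth, which is exactly the win the notes describe.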
So the "right" value for N may depend on various factors: the load on the
system, the speed of the HDDs, the kinds of apps running, users'
preferences, etc.  That's why I/O schedulers' params can be configured.

T: how long to wait before submitting I/Os?  You don't want to wait too
long before submitting your requests, even if fewer than N are queued.

2. Monitor the position of the HDD head (using ctrl codes).
3. Sort the requests in ascending/descending order.
4. Send them in that order to the HDD.
5. Repeat steps 1-4, alternating the sorting order.

This is called the "elevator algorithm", which sweeps the HDD head back
and forth.

There are many I/O schedulers in existence, with config params.  They
tend to be optimized for different settings:
- specific complex workloads (e.g., DB)
- specific devices or device types
- parallelism in some systems: multi-queue I/O schedulers
- priority queues
- optimizing for access patterns: temporal, frequency based, etc.
- many follow algorithms similar to process/thread schedulers

Note: if the device is busy, the I/O scheduler can sense it using control
commands ("are you free or busy?").  If the disk is busy a lot of the
time, then the queue in the I/O scheduler starts growing... eventually
upper layers will slow down as well, until the application is slowed and
even suspended (throttled).  We'll discuss this more in the module on
async queues.

* file systems

Logically and traditionally, the file system is the next layer above I/O
schedulers.  (Note: there may be more "virtualization" layers in between,
TBD.)

A f/s decides how to layer specific information on top of a storage
device.  Recall the device just gives an abstraction of N x LBAs, whose
size is fixed (512B or 4KB).  The f/s provides the abstraction of
accessing files to upper layers and applications.
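The "N x fixed-size LBAs" abstraction the f/s builds on can be sketched as a tiny in-memory toy device.  The class name, in-memory backing store, and whole-block-only rule are assumptions for illustration; a real driver talks to hardware, not a Python list:

```python
class ToyBlockDevice:
    """A device is just N fixed-size blocks, addressed by LBA (0..N-1)."""

    def __init__(self, num_lbas, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_lbas)]

    def read(self, lba):
        """Return the whole block at this LBA."""
        return self.blocks[lba]

    def write(self, lba, data):
        """Overwrite the whole block; devices transfer whole LBAs only."""
        assert len(data) == self.block_size
        self.blocks[lba] = data

dev = ToyBlockDevice(num_lbas=1000)          # a "disk" of 1000 x 512B blocks
dev.write(42, b"x" * 512)
print(dev.read(42) == b"x" * 512)            # True
```

Everything a file system stores (inodes, dirents, file data) ultimately has to be packed into these fixed-size blocks.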
What info a file system stores:

(a) files:
    - inode: m-d about the file
    - content of the file
    - the name of the file (sometimes considered m-d, sometimes called
      the "namespace")

What is meta-data: data about data.  Typically, m-d is much smaller than
data, usually on the order of 1-5% relative to the actual data.  However,
m-d is *more* important than the file's data.

File meta-data:
- name (sometimes considered m-d)
- dates when the file was created, modified, last read, etc.
- size of the file
- type of the file:
  - regular file
  - directory
  - symbolic link (not hard links)
  - sockets, pipes
  - devices (block and char)
- permissions: who can access
- owner(s) of the file, also groups that can access the file
- links or pointers to where the actual blocks of the file are on media

Why different "types"?
- all objects have m-d (an inode) and possibly data
- different types enforce different semantics inside the OS

For example:

REGULAR FILES: open, read, write, close, rename, delete, append

DIRECTORIES:
- a directory is a specially structured file that looks like a table of
  records, each mapping a name to an inode#
- when we look up a "name" in a directory, the table is searched until we
  find a matching string; then we can return the inode# found; else, we
  return an error ENOENT
- note: the inode# alone does not tell me what type of object that
  inode# refers to
- historically, some file systems add more info to the directory entry
  (called a "dirent"), for example the type of the object represented by
  that inode
- you could copy more and more info into the dirent; that makes some ops
  faster (you don't have to go and retrieve the actual inode).  But the
  more you copy into the dirent, the more you have to worry about syncing
  the info b/t the inode and the dirent.

Lesson: any time you copy data b/t points A and B, you have to ask: is
the data in sync?  Is there a chance that one of them could have changed
w/o you knowing?  In a cache, can the source have changed?  If so, the
cached info (another "copy") is stale.
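The directory-as-table lookup described above can be sketched as a linear search over (name, inode#) records.  The sample names and inode numbers are made up for illustration; only the table shape and the ENOENT-on-miss behavior come from the description:

```python
import errno

# A directory: a table of (name, inode#) records.
dirents = [("file.txt", 17), ("subdir", 23), ("notes", 31)]

def lookup(name):
    """Search the dirent table for a matching string; return its inode#,
    else fail with ENOENT, as in the description above."""
    for entry_name, inode_num in dirents:
        if entry_name == name:
            return inode_num
    raise OSError(errno.ENOENT, "No such file or directory", name)

print(lookup("subdir"))   # 23
```

Note the return value is only an inode#: to learn the object's type you'd still have to read that inode (unless, as some file systems do, the type was also copied into the dirent).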
In a single system, superusers can access the HDD's raw data directly and
r/w blocks.  In a networked system, other users from different
locations/nodes, or the server, could have changed the data.

What ops are valid on a directory:
- don't allow regular-file r/w + seeks (easy to mess up a dirent)
- allow ops on whole dirents:
  - mkdir, rmdir
  - create, unlink
  - mknod, unlink
  - listing the dirents
  - rename
- if you want to delete an entry, just zero out the name, but you also
  have to mark the inode as "free" (and all of its data blocks)
- rename: just change the string name, no change to the inode
- create/mkdir/mknod: allocate a new string entry + a new inode

SYMLINKS:
- an inode and a name
- the content of this "object" is usually capped at 4K
- the content can be interpreted as another path name
  (/usr/local/some/thing)
- provides a level of indirection from one named object to another
- a symlink can point to another symlink, etc.
- ops allowed: create symlink, delete symlink, "follow" symlink, read
  symlink contents (same as reading a file)
- part of the "namei" lookup process (upper OS layer, TBD)
- Q: how do I prevent symlink loops?!

PIPES:
- a named entity in a directory, with an inode
- but reading/writing that object is redirected to a running process
- pipes are useful/used for control commands to services
- e.g., the apache Web server may create a pipe /var/run/apachectl;
  then the apachectl user CLI utility can use that pipe to send commands
  to the running Web server (e.g., reload your config, install a new
  certificate, etc.)
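One common answer to the symlink-loop question above is a follow-count cap: the kernel simply gives up after a fixed number of symlink traversals and returns ELOOP (Linux caps total follows at around 40).  A toy sketch, with a made-up in-memory namespace standing in for the real namei lookup:

```python
import errno

MAX_FOLLOWS = 40   # cap on symlink follows; real kernels use a similar limit

# Toy namespace: name -> ("file", inode#) or ("symlink", target name)
namespace = {
    "data":  ("file", 99),
    "link1": ("symlink", "data"),
    "a":     ("symlink", "b"),
    "b":     ("symlink", "a"),   # a loop: a -> b -> a -> ...
}

def resolve(name):
    """Follow symlinks until a non-symlink object; fail with ELOOP if we
    exceed the follow cap (which is how loops are cut off)."""
    follows = 0
    while True:
        kind, value = namespace[name]
        if kind != "symlink":
            return value          # inode# of the final object
        follows += 1
        if follows > MAX_FOLLOWS:
            raise OSError(errno.ELOOP, "Too many levels of symbolic links")
        name = value

print(resolve("link1"))   # 99
```

The cap means a loop is never detected explicitly; it just eventually runs out of allowed follows, which also bounds the cost of very long (but legitimate) symlink chains.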