* RAID combinations

RAID 10 (RAID 1+0): stripe (RAID 0) across 2 or more RAID 1 mirrors.
RAID 50 (RAID 5+0): stripe (RAID 0) across 2 or more RAID 5 arrays.

* replacing drives

While a failed drive is being replaced, the system is in "degraded" mode.
Once a new drive is added, it needs to be initialized to bring the array back to normal working mode (a process called "rebuilding" or "resilvering").

In professional systems, you keep a few standby drives called "hot spares" (HS).
If a drive fails anywhere, the system automatically disconnects the bad drive and replaces it with one of the HS drives.

* integrity

With RAID parity you can detect integrity violations, but you can't tell where they came from.
Still, you "waste" one whole drive (in RAID 5) just for parity, to check integrity; in RAID 6 we spend even more storage space on parity.

Q: How can we detect integrity violations with even less space?
A: hash functions

* Hash functions

D: data of any size, usually large
H: a "hash" of the data D, usually much smaller than D
f(): hash function used to convert D to H

Properties:
1. sizeof(H) << sizeof(D)
2. H "uniquely" represents D

If I take different data D1, D2, ... and run each through f(), I get very different hashes H1, H2, ....
This calls for a specific property called the "avalanche property": a change of 1 bit in the input should, on average, flip sizeof(H)/2 (half) of the hash bits.
The number of hash bits flipped by a single input-bit change follows a binomial distribution, which is approximately Gaussian.

Said differently: if a hash is B bits long, there are 2^B possible hash values, but there are far more possible data items.
What we want is to avoid collisions, where two different data items hash to the same hash value; that is, the hashes should be well distributed across the hash space of 2^B values.
Thus hash functions are said to be "probabilistically unique": you can't eliminate all possibility of collisions, but you can make it very, very low.

Non-invertibility: given a hash H, it's very hard to find data D that matches that hash (i.e., you cannot "invert" a hash, which is why hash functions are called "one-way functions").
That property is very helpful when hashes are used in crypto, digital signatures, and more.
In fact, hashes that are "large enough" (e.g., SHA-256) are called "cryptographically strong hashes".

The chance of collisions grows as the hash size shrinks and as the number of items you hash grows.

Examples:
CRC32: 32-bit "hash", really a checksum
MD4: 128 bits (16B)
MD5: 128 bits (16B)
SHA-1: 160 bits (20B)
SHA-256: 256 bits
and more

There are lots of hash functions, trying to minimize the chance of collisions, run faster (smaller hashes are faster), and use less space.

Uses:
1. Hashes are good for detecting integrity violations.
2. Hashes can be used to detect duplicates (the basis of "deduplication" systems).

* virtual device drivers

Take a disk (or any storage that behaves like one) and carve out a portion of it to store hashes.
On a write, write the data, calculate its hash, and store the hash too.
On a read, read the data, calculate its hash, and compare it to the stored hash: on a mismatch, report an error to the caller (integrity violation).
A small sketch of this idea follows the examples below.

Benefit: you get integrity checking with a lot less space taken.

A lot of OSs support all kinds of virtual block drivers: they work on top of 1 or more other block devices, and they export the view of another block device.
In Linux, the technology is called Device Mapper (DM).  Examples:

1. RAID
2. dm-integrity: detects integrity violations using hashes
3. dm-crypt: transparently encrypts/decrypts data
4. a deduplication driver (next section)
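To make the hash-based integrity idea concrete, here is a minimal sketch in Python of a virtual block device that keeps one SHA-256 hash per block and verifies it on every read.
It only illustrates the technique, not how dm-integrity is actually implemented; the class name, the in-memory dicts used as the backing store, and the 4KB block size are assumptions made for the example.

    import hashlib

    BLOCK_SIZE = 4096   # assumed block size for the example

    class IntegrityDevice:
        """Toy virtual block device: stores one SHA-256 hash per block
        (the "carved out" hash area) and verifies it on every read."""

        def __init__(self):
            self.blocks = {}    # LBA -> block data (stands in for the lower device)
            self.hashes = {}    # LBA -> stored hash of that block

        def write(self, lba, data):
            assert len(data) == BLOCK_SIZE
            self.blocks[lba] = data
            self.hashes[lba] = hashlib.sha256(data).digest()   # 32 bytes per 4KB block

        def read(self, lba):
            data = self.blocks[lba]
            if hashlib.sha256(data).digest() != self.hashes[lba]:
                # integrity violation: report an I/O error to the caller
                raise IOError("integrity violation at LBA %d" % lba)
            return data

    dev = IntegrityDevice()
    dev.write(7, b"A" * BLOCK_SIZE)
    assert dev.read(7) == b"A" * BLOCK_SIZE

    dev.blocks[7] = b"B" * BLOCK_SIZE   # simulate silent corruption on the lower device
    try:
        dev.read(7)
    except IOError as e:
        print(e)                        # integrity violation at LBA 7

At 32 bytes of hash per 4KB block, the space overhead is under 1%, versus dedicating a whole drive's worth of space to parity.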
* dedup

Take the raw storage and break it into data containers, plus a data structure that includes:

- the user's logical LBAs (e.g., 17, 343)
- the actual physical LBA where the data is stored (e.g., 2)
- the hash of the block
- maybe the block size (e.g., 2KB)
- a refcount: how many different logical LBAs use this hash

Just like with hard links, you keep a refcount and release the data when the last reference has been released (e.g., users deleting files).

The mapping of logical to physical LBAs is called an "indirection map".

Dedup driver (sketched at the end of this section):

- on a read of logical LBA X, find the corresponding physical LBA Y, read it, and return that data
- hashes are used to detect duplicates, but can also be used for integrity
- on a write to logical LBA Z, calculate the hash of the data and check whether it has been seen before:
  if seen, map Z to the existing entry and increment its refcount;
  if not seen, write the data to a new physical LBA and store a new entry (refcount = 1)

Dedup systems have been shown to get space reductions of 10-40x at times!

Two forms of dedup:

- inline: performs dedup right in the read/write I/O path, as data comes in.
  Can slow down user activity, but detects duplicates right away.
- offline: a dedup process runs in the background periodically, scans files/disks looking for duplicates, then dedups them.
  Doesn't slow down user I/Os, but takes longer to detect duplicates and save the space.

Hash collisions in dedup systems are bad!  Data can be lost.  Vendors have thus increased hash sizes.

Dedup systems can hold a lot of data (petabytes); the hashes alone can take many terabytes.
Access to the hashes looks like random access, so typical caching algorithms a la LRU/LFU don't work well.
Solution: Bloom filters (see the sketch at the end of this section).

* hash collision probabilities

See https://preshing.com/20110504/hash-collision-probabilities/
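As referenced above, here is a minimal sketch of an inline dedup driver's read/write path: an indirection map (logical LBA -> physical LBA), a hash index, and per-block refcounts.
The names and the in-memory dicts are assumptions for illustration only; real systems keep these structures on disk and deal with crash consistency, variable block sizes, and more.

    import hashlib

    class DedupDevice:
        """Toy inline-dedup device: identical blocks share one physical copy,
        tracked with refcounts, just like the notes above describe."""

        def __init__(self):
            self.logical_to_physical = {}   # indirection map: logical LBA -> physical LBA
            self.hash_to_physical = {}      # dedup index: block hash -> physical LBA
            self.physical = {}              # physical LBA -> (data, refcount)
            self.next_physical = 0

        def write(self, lba, data):
            h = hashlib.sha256(data).digest()
            if h in self.hash_to_physical:              # seen before: just add a reference
                phys = self.hash_to_physical[h]
                blk, refs = self.physical[phys]
                self.physical[phys] = (blk, refs + 1)
            else:                                       # new data: allocate a physical block
                phys = self.next_physical
                self.next_physical += 1
                self.physical[phys] = (data, 1)
                self.hash_to_physical[h] = phys
            self.release(lba)                           # drop any old mapping for this LBA
            self.logical_to_physical[lba] = phys

        def read(self, lba):
            data, _ = self.physical[self.logical_to_physical[lba]]
            return data

        def release(self, lba):
            """Decrement the refcount of the block this LBA used to map to;
            free the block when the last reference goes away."""
            phys = self.logical_to_physical.pop(lba, None)
            if phys is None:
                return
            data, refs = self.physical[phys]
            if refs == 1:
                del self.physical[phys]
                del self.hash_to_physical[hashlib.sha256(data).digest()]
            else:
                self.physical[phys] = (data, refs - 1)

    dev = DedupDevice()
    dev.write(17, b"same block")
    dev.write(343, b"same block")        # duplicate: stored once, refcount becomes 2
    assert len(dev.physical) == 1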
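Finally, a minimal sketch of the Bloom filter idea mentioned above: a bit array plus k hash-derived bit positions, which can answer "definitely not seen before" without touching the huge on-disk hash index.
The parameter choices (m, k) and the way the k positions are derived from a single SHA-256 digest are arbitrary choices for the example.

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: answers "definitely not present" or
        "maybe present"; no false negatives, tunable false positives."""

        def __init__(self, m_bits=1 << 20, k=4):
            self.m = m_bits
            self.k = k
            self.bits = bytearray(m_bits // 8)

        def _positions(self, item):
            # derive k bit positions from one SHA-256 digest (illustrative choice)
            digest = hashlib.sha256(item).digest()
            for i in range(self.k):
                chunk = digest[i * 8:(i + 1) * 8]
                yield int.from_bytes(chunk, "big") % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    bf = BloomFilter()
    bf.add(b"block-hash-1")
    print(bf.might_contain(b"block-hash-1"))   # True
    print(bf.might_contain(b"block-hash-2"))   # almost certainly False

In a dedup system the filter is consulted before the index: a "no" answer skips the random index lookup entirely, while a "maybe" falls through to the real lookup, so false positives cost time but never correctness.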