* hash collision probabilities

See https://preshing.com/20110504/hash-collision-probabilities/

* dedup cont.

Caching algorithms like LRU/LFU depend on access patterns.  Random access
patterns are the hardest to predict: traditional OSs turn off readahead if
they detect what looks like a random access pattern.  Bringing in readahead
data for random access only wastes resources and fills the cache with "junk"
data: we call it "cache pollution."  Sometimes, for random data, it's better
not to cache the data at all, and just access the I/O system directly for
every read/write.  This is also called direct I/O (the O_DIRECT flag to the
open(2) syscall).  O_DIRECT is also useful for maintaining atomicity in
databases.  (A minimal O_DIRECT sketch appears after this section.)

What about hashes?  Hashes are by design effectively random, so there are no
access patterns to exploit and caching them is very hard.  Hence accessing
the hash database/index in a large dedup system is going to be very slow.
Just the hashes could consume many gigs of RAM.  One (expensive) solution is
to configure a dedup system with lots of RAM, so most/all of the hashes can
reside in fast RAM.

* Bloom filters

A class of "Approximate Membership Query" (AMQ) algorithms.

1. Design a bit space of B bits (e.g., 10,000 bits).  Initialize all bits to
   zero (0).
2. Select several hash functions: decide how many functions and what they'll
   be, e.g., h1(), h2(), h3() (3 functions).  Each function takes input data
   and produces an output "offset" number from 1..B (or 0 to B-1).
3. When input data D comes in, compute all the hashes:
     H1 = h1(D)
     H2 = h2(D)
     H3 = h3(D)
4. Then, in B, turn on (set to '1') the bits at offsets H1, H2, and H3
   (normalized to the 10,000-bit space).
5. For every new input, calc the hashes and check the bits in B.  If ANY of
   the bits are 0, the item has not been indexed: it's never been
   seen/indexed before in the filter.
6. If all bits are 1, however, we don't know if they were turned on because
   of this exact input or by a combination of other inputs over time.  The
   item may or may not have been seen/indexed before: this is a "collision"
   (false positive).
7. To reduce the probability of Bloom filter collisions: index fewer items,
   use a larger B space, and use fewer hash functions.

Bloom filters are "odd" b/c they tell you definitively if something does NOT
exist.  Usually we're used to looking up whether something exists.  (See the
C sketch after this section.)

Using Bloom filters with dedup systems:

1. We have the very large data store of actual deduped data blocks.
2. We have a large store of all hashes (can be many 100s of gigs).
3. Now, layer a Bloom filter (BF) on top of the hashes, so every dedup hash
   becomes an input to the Bloom filter's own hash functions.

Note: proving a negative is the most expensive and challenging operation.
E.g., looking up a name in an unsorted directory takes about N/2 comparisons
on average; but looking up an item that does not exist always takes O(N).
Similarly, looking for a hash in a dedup system is most expensive when the
hash hasn't been seen before.

The BF will tell you definitively if a dedup'd hash does NOT exist.  So index
every dedup'd hash into the BF, and query the BF for every new hash.  The BF
can be large enough (millions of bits) and still fit in RAM.  If you search
for a dedup hash in the BF and the BF says "not a member," you've saved a lot
of time searching the large dedup hash database: just go and create a new
entry in the dedup system.  If the BF can't tell definitively that the dedup
hash does NOT exist, then you still have to search the hash index of the
dedup system (the item may or may not exist), but you've already saved a lot
of slow I/O on the definite misses.  (The lookup flow is sketched below.)
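A minimal sketch of the direct I/O idea mentioned above (Linux flavor).  The
4096-byte alignment and the file name "datafile" are illustrative
assumptions; the real alignment requirement depends on the device and file
system.

    /* Direct I/O sketch: open with O_DIRECT and read one aligned block,
       bypassing the page cache.  The 4096-byte alignment is an assumption. */
    #define _GNU_SOURCE             /* O_DIRECT is a GNU/Linux extension */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("datafile", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        /* O_DIRECT requires the buffer, offset, and length to be aligned. */
        if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

        ssize_t n = read(fd, buf, 4096);   /* goes to the device, not the cache */
        if (n < 0) perror("read");
        else printf("read %zd bytes with direct I/O\n", n);

        free(buf);
        close(fd);
        return 0;
    }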
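A small C sketch of the Bloom filter steps above.  B = 10,000 bits and K = 3
come from the example; deriving h1..h3 from one FNV-1a hash with different
seeds (double hashing) is an implementation assumption, not part of the
definition.

    /* Minimal Bloom filter sketch following the steps above. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define B 10000                  /* number of bits in the filter */
    #define K 3                      /* number of hash functions */

    static unsigned char bits[(B + 7) / 8];   /* bit space, starts all zero */

    /* FNV-1a: a simple, well-known hash; any decent hash works here. */
    static uint64_t fnv1a(const void *data, size_t len, uint64_t seed)
    {
        const unsigned char *p = data;
        uint64_t h = 14695981039346656037ULL ^ seed;
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* The i-th "hash function" h_i(D): an offset in 0..B-1, derived by
       double hashing (an assumption for this sketch). */
    static size_t offset(const void *data, size_t len, int i)
    {
        uint64_t h1 = fnv1a(data, len, 0);
        uint64_t h2 = fnv1a(data, len, 0x9e3779b97f4a7c15ULL);
        return (size_t)((h1 + (uint64_t)i * h2) % B);
    }

    static void bf_add(const void *data, size_t len)
    {
        for (int i = 0; i < K; i++) {
            size_t off = offset(data, len, i);
            bits[off / 8] |= 1u << (off % 8);      /* turn on bit H_i */
        }
    }

    /* Returns 0 if definitely NOT present; 1 if "maybe present". */
    static int bf_maybe_contains(const void *data, size_t len)
    {
        for (int i = 0; i < K; i++) {
            size_t off = offset(data, len, i);
            if (!(bits[off / 8] & (1u << (off % 8))))
                return 0;   /* any 0 bit => never indexed */
        }
        return 1;           /* all 1s => maybe indexed (possible collision) */
    }

    int main(void)
    {
        bf_add("block-hash-aaaa", strlen("block-hash-aaaa"));
        printf("%d\n", bf_maybe_contains("block-hash-aaaa", strlen("block-hash-aaaa")));  /* 1: maybe */
        printf("%d\n", bf_maybe_contains("block-hash-bbbb", strlen("block-hash-bbbb")));  /* expect 0: not indexed */
        return 0;
    }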
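And a sketch of the dedup lookup flow described above, reusing bf_add() and
bf_maybe_contains() from the previous sketch.  index_lookup(),
index_insert(), and store_block() are hypothetical stand-ins for the slow
on-disk hash index and the block store; the point is only the control flow
that skips the index search on a definite miss.

    #include <stddef.h>

    /* Hypothetical helpers for the on-disk hash index and block store.
       bf_add()/bf_maybe_contains() come from the Bloom filter sketch above. */
    void dedup_write(const unsigned char *hash, size_t hash_len,
                     const void *block, size_t block_len)
    {
        if (!bf_maybe_contains(hash, hash_len)) {
            /* Definite miss: no need to search the on-disk hash index. */
            store_block(hash, block, block_len);
            index_insert(hash, hash_len);
            bf_add(hash, hash_len);
            return;
        }
        /* "Maybe present": must still consult the full hash index. */
        if (index_lookup(hash, hash_len)) {
            /* Duplicate: just reference the existing block. */
        } else {
            /* Bloom filter false positive ("collision"): new block after all. */
            store_block(hash, block, block_len);
            index_insert(hash, hash_len);
            bf_add(hash, hash_len);
        }
    }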
* partitions

Ability to segment a disk into several units.  Usually store a partition
table in sector 0:

  partition   start   end   type
  0           1       11    FAT32
  1           12      217   Linux
  ...

A full device (e.g., the first SCSI disk) might be /dev/sda.  With
partitions, it can be seen as /dev/sda1 (1st partition), /dev/sda2 (2nd
partition), etc.  The OS can ensure that access to one partition doesn't
overlap others.  Each partition can be formatted and used as a separate file
system or block device.  See https://en.wikipedia.org/wiki/Partition_type
(A small sketch that parses an MBR partition table follows below.)

But partitions are inflexible: each can contain a very different f/s.  If
you run out of room in one, you can't easily reclaim space from another.  So
users had to get new disks, reformat them w/ new partitions, and then copy
data over.  Some companies came up with tools that understand the inner f/s
formats and allow you to readjust partition boundaries, moving space from
one partition to another.  The tools took a long time to run and had to be
careful not to break any f/s structures.  The problem is that resizing file
systems is very hard and format+OS dependent.

Solution: a Logical Volume Manager (LVM).  LVMs let you have a big container
and create volumes or file systems inside it, which can be easily resized.
Linux has LVM; macOS has APFS containers; Solaris has ZFS; every OS now has
one, and professional storage systems have it too.
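A minimal sketch of reading the classic partition table in sector 0,
assuming the legacy MBR layout (4 entries of 16 bytes at offset 446); GPT
disks are laid out differently, and reading /dev/sda directly needs root.

    /* Read sector 0 of a disk and print the four primary MBR partition
       entries (type byte, starting LBA, sector count). */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/sda";
        unsigned char sector[512];

        int fd = open(dev, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        if (read(fd, sector, sizeof(sector)) != (ssize_t)sizeof(sector)) {
            perror("read"); close(fd); return 1;
        }
        close(fd);

        /* Partition table: 4 entries x 16 bytes at offset 446; the sector
           ends with the 0x55AA boot signature. */
        printf("part  type  start(LBA)     sectors\n");
        for (int i = 0; i < 4; i++) {
            unsigned char *e = sector + 446 + 16 * i;
            uint8_t  type  = e[4];                         /* partition type byte */
            uint32_t lba   = e[8]  | e[9]  << 8 | e[10] << 16 | (uint32_t)e[11] << 24;
            uint32_t nsect = e[12] | e[13] << 8 | e[14] << 16 | (uint32_t)e[15] << 24;
            printf("%4d  0x%02x  %10u  %10u\n",
                   i + 1, (unsigned)type, (unsigned)lba, (unsigned)nsect);
        }
        return 0;
    }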
* next time

ECCs