* hash collision probabilities

See https://preshing.com/20110504/hash-collision-probabilities/

* dedup cont.

Caching algorithms like LRU/LFU depend on access patterns.  Random access
patterns are the hardest to predict: traditional OSs turn off readahead if
they detect what looks like a random access pattern.  Bringing in readahead
data for random access only wastes resources and fills the cache with "junk"
data: we call it "cache pollution."  Sometimes, for random data, it's better
not to cache the data at all, and just access the I/O system directly for
every read/write.  This is also called direct I/O (the O_DIRECT flag to the
open(2) syscall).  O_DIRECT is also useful for maintaining atomicity in
databases.  (A minimal O_DIRECT sketch appears after this section.)

What about hashes?  Hashes are by design effectively random, so there are no
access patterns to exploit and caching them is very hard.  Hence accessing
the hash database/index in a large dedup system is going to be very slow.
Just the hashes could consume many gigs of RAM.  One (expensive) solution is
to configure a dedup system with lots of RAM, so most/all of the hashes can
reside in fast RAM.

* Bloom filters

A class of "Approximate Membership Query" (AMQ) algorithms.

1. Design a bit space of B bits (e.g., 10,000 bits).  Initialize all bits to
   zero (0).
2. Select several hash functions: decide how many functions and what they'll
   be, e.g., h1(), h2(), h3() (3 functions).  Each function takes input data
   and produces an output "offset" number from 1..B (or 0 to B-1).
3. When input data D comes in, compute all the hashes:
     H1 = h1(D)
     H2 = h2(D)
     H3 = h3(D)
4. Then, in B, turn on (set to '1') the bits at offsets H1, H2, and H3
   (normalized to the 10,000-bit space).
5. For every new input, calc the hashes and check the bits in B.  If ANY of
   the bits are 0, the item has not been indexed: it's never been
   seen/indexed before in the filter.
6. If all bits are 1, however, we don't know if they were turned on because
   of this exact input or by a combination of other inputs over time.  The
   item may or may not have been seen/indexed before: this is a "collision"
   (false positive).
7. To reduce the probability of Bloom filter collisions: index fewer items,
   use a larger B space, and use fewer hash functions.

Bloom filters are "odd" b/c they tell you definitively if something does NOT
exist.  Usually we're used to looking up whether something exists.  (See the
C sketch after this section.)

Using Bloom filters with dedup systems:

1. We have the very large data store of actual deduped data blocks.
2. We have a large store of all hashes (can be many 100s of gigs).
3. Now, layer a Bloom filter (BF) on top of the hashes, so every dedup hash
   becomes an input to the Bloom filter's own hash functions.

Note: proving a negative is the most expensive and challenging operation.
E.g., looking up a name in an unsorted directory takes about N/2 comparisons
on average; but looking up an item that does not exist always takes O(N).
Similarly, looking for a hash in a dedup system is most expensive when the
hash hasn't been seen before.

The BF will tell you definitively if a dedup'd hash does NOT exist.  So index
every dedup'd hash into the BF, and query the BF for every new hash.  The BF
can be large enough (millions of bits) and still fit in RAM.  If you search
for a dedup hash in the BF and the BF says "not a member," you've saved a lot
of time searching the large dedup hash database: just go and create a new
entry in the dedup system.  If the BF can't tell definitively that the dedup
hash does NOT exist, then you still have to search the hash index of the
dedup system (the item may or may not exist), but you've already saved a lot
of slow I/O on the definite misses.  (The lookup flow is sketched below.)
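A minimal sketch of the direct I/O idea mentioned above (Linux flavor).  The
4096-byte alignment and the file name "datafile" are illustrative
assumptions; the real alignment requirement depends on the device and file
system.

    /* Direct I/O sketch: open with O_DIRECT and read one aligned block,
       bypassing the page cache.  The 4096-byte alignment is an assumption. */
    #define _GNU_SOURCE             /* O_DIRECT is a GNU/Linux extension */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("datafile", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        /* O_DIRECT requires the buffer, offset, and length to be aligned. */
        if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

        ssize_t n = read(fd, buf, 4096);   /* goes to the device, not the cache */
        if (n < 0) perror("read");
        else printf("read %zd bytes with direct I/O\n", n);

        free(buf);
        close(fd);
        return 0;
    }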
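A small C sketch of the Bloom filter steps above.  B = 10,000 bits and K = 3
come from the example; deriving h1..h3 from one FNV-1a hash with different
seeds (double hashing) is an implementation assumption, not part of the
definition.

    /* Minimal Bloom filter sketch following the steps above. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define B 10000                  /* number of bits in the filter */
    #define K 3                      /* number of hash functions */

    static unsigned char bits[(B + 7) / 8];   /* bit space, starts all zero */

    /* FNV-1a: a simple, well-known hash; any decent hash works here. */
    static uint64_t fnv1a(const void *data, size_t len, uint64_t seed)
    {
        const unsigned char *p = data;
        uint64_t h = 14695981039346656037ULL ^ seed;
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* The i-th "hash function" h_i(D): an offset in 0..B-1, derived by
       double hashing (an assumption for this sketch). */
    static size_t offset(const void *data, size_t len, int i)
    {
        uint64_t h1 = fnv1a(data, len, 0);
        uint64_t h2 = fnv1a(data, len, 0x9e3779b97f4a7c15ULL);
        return (size_t)((h1 + (uint64_t)i * h2) % B);
    }

    static void bf_add(const void *data, size_t len)
    {
        for (int i = 0; i < K; i++) {
            size_t off = offset(data, len, i);
            bits[off / 8] |= 1u << (off % 8);      /* turn on bit H_i */
        }
    }

    /* Returns 0 if definitely NOT present; 1 if "maybe present". */
    static int bf_maybe_contains(const void *data, size_t len)
    {
        for (int i = 0; i < K; i++) {
            size_t off = offset(data, len, i);
            if (!(bits[off / 8] & (1u << (off % 8))))
                return 0;   /* any 0 bit => never indexed */
        }
        return 1;           /* all 1s => maybe indexed (possible collision) */
    }

    int main(void)
    {
        bf_add("block-hash-aaaa", strlen("block-hash-aaaa"));
        printf("%d\n", bf_maybe_contains("block-hash-aaaa", strlen("block-hash-aaaa")));  /* 1: maybe */
        printf("%d\n", bf_maybe_contains("block-hash-bbbb", strlen("block-hash-bbbb")));  /* expect 0: not indexed */
        return 0;
    }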
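And a sketch of the dedup lookup flow described above, reusing bf_add() and
bf_maybe_contains() from the previous sketch.  index_lookup(),
index_insert(), and store_block() are hypothetical stand-ins for the slow
on-disk hash index and the block store; the point is only the control flow
that skips the index search on a definite miss.

    #include <stddef.h>

    /* Hypothetical helpers for the on-disk hash index and block store.
       bf_add()/bf_maybe_contains() come from the Bloom filter sketch above. */
    void dedup_write(const unsigned char *hash, size_t hash_len,
                     const void *block, size_t block_len)
    {
        if (!bf_maybe_contains(hash, hash_len)) {
            /* Definite miss: no need to search the on-disk hash index. */
            store_block(hash, block, block_len);
            index_insert(hash, hash_len);
            bf_add(hash, hash_len);
            return;
        }
        /* "Maybe present": must still consult the full hash index. */
        if (index_lookup(hash, hash_len)) {
            /* Duplicate: just reference the existing block. */
        } else {
            /* Bloom filter false positive ("collision"): new block after all. */
            store_block(hash, block, block_len);
            index_insert(hash, hash_len);
            bf_add(hash, hash_len);
        }
    }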
* partitions

Ability to segment a disk into several units.  Usually store a partition
table in sector 0:

  partition   start   end   type
  0           1       11    FAT32
  1           12      217   Linux
  ...

A full device (e.g., the first SCSI disk) might be /dev/sda.  With
partitions, it can be seen as /dev/sda1 (1st partition), /dev/sda2 (2nd
partition), etc.  The OS can ensure that access to one partition doesn't
overlap others.  Each partition can be formatted and used as a separate file
system or block device.  See https://en.wikipedia.org/wiki/Partition_type
(A small sketch that parses an MBR partition table follows below.)

But partitions are inflexible: each can contain a very different f/s.  If
you run out of room in one, you can't easily reclaim space from another.  So
users had to get new disks, reformat them w/ new partitions, and then copy
data over.  Some companies came up with tools that understand the inner f/s
formats and allow you to readjust partition boundaries, moving space from
one partition to another.  The tools took a long time to run and had to be
careful not to break any f/s structures.  The problem is that resizing file
systems is very hard and format+OS dependent.

Solution: a Logical Volume Manager (LVM).  LVMs let you have a big container
and create volumes or file systems inside it, which can be easily resized.
Linux has LVM; macOS has APFS containers; Solaris has ZFS; every OS now has
one, and professional storage systems have it too.
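A minimal sketch of reading the classic partition table in sector 0,
assuming the legacy MBR layout (4 entries of 16 bytes at offset 446); GPT
disks are laid out differently, and reading /dev/sda directly needs root.

    /* Read sector 0 of a disk and print the four primary MBR partition
       entries (type byte, starting LBA, sector count). */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *dev = argc > 1 ? argv[1] : "/dev/sda";
        unsigned char sector[512];

        int fd = open(dev, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        if (read(fd, sector, sizeof(sector)) != (ssize_t)sizeof(sector)) {
            perror("read"); close(fd); return 1;
        }
        close(fd);

        /* Partition table: 4 entries x 16 bytes at offset 446; the sector
           ends with the 0x55AA boot signature. */
        printf("part  type  start(LBA)     sectors\n");
        for (int i = 0; i < 4; i++) {
            unsigned char *e = sector + 446 + 16 * i;
            uint8_t  type  = e[4];                         /* partition type byte */
            uint32_t lba   = e[8]  | e[9]  << 8 | e[10] << 16 | (uint32_t)e[11] << 24;
            uint32_t nsect = e[12] | e[13] << 8 | e[14] << 16 | (uint32_t)e[15] << 24;
            printf("%4d  0x%02x  %10u  %10u\n",
                   i + 1, (unsigned)type, (unsigned)lba, (unsigned)nsect);
        }
        return 0;
    }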
* next time

ECCs