* HW1 -- cleaning up after partial failures

The assignment is about a system to encrypt/decrypt files while copying a
source to a destination. Typical use case:

1. you have a file in cleartext (unencrypted) 'X'
2. you want to protect it
3. encrypt X and produce a "ciphertext" file Y
4. delete X so you don't leave behind the unencrypted data

What if step 3 fails? What if the user does not notice the failure, or the
program didn't alert properly? A user could miss a vital clue and
inadvertently DELETE file X, thinking that file Y was successfully
produced! That would result in permanent data loss.

Solution: warn and alert users properly, but also don't leave partial data
behind. Best to never leave a partial output file on the f/s, and also to
preserve whatever previous data was there (b/c the existing o/p file may be
valuable, for example a previous version of the same source file).

First technique:

1. if there's a file Y, rename Y to Y.tmp
2. encrypt X -> Y
3. if step 2 succeeded:
   - delete Y.tmp
4. if step 2 failed:
   - delete any partially written Y, if any
   - rename Y.tmp back to Y

Second technique:

1. if asked to encrypt X -> Y, instead encrypt X -> Y.tmp
2. if step 1 succeeded:
   - delete old Y, if any
   - rename Y.tmp to Y
3. if step 1 failed:
   - delete Y.tmp

Both techniques are a form of "atomicity" (a la databases). But, absent an
actual transactional storage system, we're just trying our best to emulate
it:

1. What if the temp name already exists? Should we return an error, pick a
   different temp name, or overwrite the temp name (if we think it's an
   old/stale name)? Often, when picking a temp name, we use a unique number
   like the PID of the process creating it; see mkstemp(3).
2. What if deleting any name fails? What if renaming fails? What is the
   chance that this could happen? You have to perform an error analysis and
   estimate the probability of failure.
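A minimal user-level sketch of the second technique (write to a temp name,
rename into place only on success). The names `safe_encrypt` and
`encrypt_fn` are illustrative stand-ins, not part of HW1; a kernel
implementation would use vfs_* helpers instead:

```python
import os
import sys

def safe_encrypt(src, dst, encrypt_fn):
    """Second technique: encrypt src into dst + '.tmp', then rename into
    place only on success, so no partial dst is ever left behind and any
    previous dst survives a failure."""
    tmp = dst + ".tmp"   # naive temp name; see mkstemp(3) for unique names
    try:
        with open(src, "rb") as f:
            data = f.read()
        with open(tmp, "wb") as f:
            f.write(encrypt_fn(data))
    except Exception:
        # failure: remove the partial temp output; the old dst is untouched
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        except OSError as e:
            # unlink should rarely fail (it frees space), but log if it does
            print(f"couldn't delete temp file {tmp}: {e}", file=sys.stderr)
        raise
    os.replace(tmp, dst)  # rename(2): replaces dst in one step
```

Note the design choice: the rename happens only after the temp file was
fully written, and on the failure path the cleanup is best-effort with a
log message, matching the error-severity discussion below.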
When copying a file (with encryption or not), you are consuming storage
space. The likely partial failures on write are ENOSPC or EDQUOT (quota
exceeded). But when deleting a file, using unlink(2), we are actually
FREEING space, so unlink(2) is more likely to succeed than write(2). Also,
when renaming a file, we're not changing any storage consumption (data
blocks or inodes), merely replacing one name with another in a directory
entry. So rename(2) is also more likely to succeed than write(2).

Other errors you could get are:

- EIO (h/w failure): not much to do. Reboot.
- ENOMEM: could be a transient problem, lack of RAM. Maybe wait a bit and
  try again? (Note that EAGAIN usually happens when reading/writing network
  sockets, not regular files.)
- EPERM/EACCES: would suggest that permission was ok when you started the
  syscall but changed mid-way. Can happen, but rare.

Sometimes you may have no choice but to "give up", but at least leave a
console log message that something went wrong, e.g., "couldn't delete temp
file name Y.tmp". That way, a sysadmin could go delete those names by hand
later on.

Lesson: different errors have different severity levels, and what to do
about them depends largely on the use case and application.

Look at vfs_unlink() and vfs_rename() as helper methods. Study them
carefully. vfs_rename() has very special pre/post conditions that have to
be met: special locking of the directory objects being passed, etc.

Note also that UNIX permits you to 'delete' an opened file, but the file's
contents do not actually get deleted until the last close(2) of the file.
Meaning: don't call vfs_unlink() until you have filp_close'd the file you
want to unlink!

* HW1 crypto issues

Why does HW1 do passwd -> hash (as key) -> hash again in kernel (to verify
the "key")? We want to ensure that decryption with the wrong key will
fail. Why? B/c symmetric ciphers are "just math" and can't fail on their
own. It'd be bad if users "decrypt" a file with the wrong key and get the
wrong "cleartext".
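A toy XOR cipher (not HW1's real cipher) makes that last point concrete:
decrypting with the wrong key "succeeds" and silently returns garbage,
which is exactly why HW1 stores a key hash to verify before decrypting:

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    # toy symmetric cipher: XOR each byte with the (repeating) key;
    # the same function both encrypts and decrypts
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

ct = xor_cipher(b"attack at dawn", b"k3y")
ok = xor_cipher(ct, b"k3y")      # right key: original cleartext back
bad = xor_cipher(ct, b"wrong")   # wrong key: garbage, but NO error raised
```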
Naive: store the cipher key in the output file. Obviously that's bad, b/c
you never want to expose the actual cipher key! Instead, store the HASH of
the cipher key (just as login programs do, mentioned below). Then your
syscall can compute the hash of the decryption key provided, compare it to
the hash stored in the ciphertext file, and if they match -- decrypt;
else, return an error.

Encryption keys should be long, but long binary keys are bad passwords,
b/c users can't remember them. How to give users an easy-to-remember
passphrase or password, while using a strong, say 256-bit, encryption key?
A: convert the passphrase itself into a key, using techniques in PKCS or
even a plain hash function.

Let's say your password is P. That's NOT the actual cipher key! The cipher
key is H(P) -> K. That's the key to encrypt with (pass to the syscall) and
that's the key that must be stored securely! You don't store the key K
directly in the file's preamble, but rather a HASH of it, H(K) -> X. We
took P, hashed it to produce cipher key K, and then hashed that again to
produce X, which we can store safely in the preamble, so we can verify
that we're decrypting with the correct original cipher key K.

* Crash course on encryption

Security is very broad. Define a policy first:

1. privacy: e.g., using encryption.
2. authentication: verify your identity (login usernames and passwords,
   PINs, 2FA with smartphones, Duo, etc.)
3. authorization: access to a data/resource at a later time.
4. integrity: verify that data you're accessing hasn't been modified or
   corrupted. Usually done through use of checksums or hashes.
5. non-repudiation: ability to track actions such that they cannot be
   refuted. Ex., collect log files of users' actions, encrypt them, add
   integrity checking, and transmit securely to a trusted 3rd party.

* Hashes, checksums, digital fingerprints, parity, signatures

A class of functions that can take an input data D of any length L, and
produce an output number H of size S.
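The P -> K -> X scheme described above for HW1 can be sketched as follows,
using SHA-256 as the hash (an assumption for illustration; a real system
would prefer a dedicated key-derivation function such as PBKDF2, which is
what the PKCS techniques mentioned above standardize):

```python
import hashlib

def derive_key(passphrase: str) -> bytes:
    # K = H(P): turn a memorable passphrase into a fixed-size 256-bit key
    return hashlib.sha256(passphrase.encode()).digest()

def preamble_hash(key: bytes) -> bytes:
    # X = H(K): safe to store in the ciphertext file's preamble,
    # because it does not reveal K itself
    return hashlib.sha256(key).digest()

def key_matches(candidate_key: bytes, stored_hash: bytes) -> bool:
    # on decryption: hash the supplied key and compare to the preamble
    return preamble_hash(candidate_key) == stored_hash
```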
Usually, when S is large (128 bits or more, even as much as 512), we call
it a "cryptographically strong" hash function (e.g., MD5, SHA-1, SHA-256,
...). BTW, today most Intel processors have built-in instructions to
compute hashes (and even encryption), using "SSE" extensions -- much
faster than computing it on your own.

STRONG HASHES are expected to produce a "unique" hash with very small
probability of collision. A collision is defined as two different inputs X
and Y that produce the same hash H. Collision probabilities are very
small: the larger the hash size S is, and the fewer pieces of data D you
hash, the lower the probability. See
https://preshing.com/20110504/hash-collision-probabilities/

The uniqueness property is easier to accomplish if the hash function
distributes its output hashes nearly uniformly. E.g., MD5 is 128 bits, so
there are 2^128 (about 3.4 x 10^38) possible hashes.

Another important property: changing a single input bit in D should
result, on avg, in 50% of the bits in hash H changing. If you plot a
figure of the no. of changed bits, you'll get a Normal/Gaussian
distribution. This is also called the "avalanche property".

Non-invertibility: given a hash H, it should be VERY hard to find any data
input D that would have produced that hash. Hash functions are thus called
"one-way" hash functions.

Example, in login programs:

1. when you re/set your password P, the computer hashes P and produces a
   hash H.
2. store H in some file; in unix it's /etc/passwd or /etc/shadow
3. next time you login, you type your password P'
4. the computer computes H' from P' using the same hash function
5. compare the previously saved H and H'
6. if they match, login proceeds; else, deny login, re-prompt for the
   password, or even lock out the account.

Other uses of hashes:

- data deduplication: find matching files or chunks of files and save
  having to store duplicate data. Often yields 10-40x dedup ratios.
- verify the integrity of data you store.
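The avalanche property mentioned above is easy to observe directly: flip a
single input bit and roughly half of SHA-256's 256 output bits change
(a demonstration, not a proof; the exact count varies per input):

```python
import hashlib

def hamming(a: bytes, b: bytes) -> int:
    # number of differing bits between two equal-length byte strings
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

d1 = bytearray(b"some input data D")
d2 = bytearray(d1)
d2[0] ^= 0x01                       # flip exactly one input bit

h1 = hashlib.sha256(bytes(d1)).digest()
h2 = hashlib.sha256(bytes(d2)).digest()
changed = hamming(h1, h2)           # ~128 of 256 bits, on average
```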
Each time you store a file or any data, the OS (or storage system) stores
with it a hash of that data. Next time you read the data/file back, the OS
recalculates the hash and compares it to the stored hash, and alerts or
produces an error if they don't match. Note: hashes can tell you that your
data's integrity has been compromised, but cannot tell you how to fix it
(get back your original data). For that, you need offline backups, extra
copies/replicas, or "error correcting codes" (ECCs).

Example of a very simple hash function to calc parity, or a CRC:

1. assume you want an 8-bit hash (CRC)
2. take your input data; it can be long
3. read each byte of the input as a number b/t 0..255
4. sum up all the bytes, truncating to just the lower 8 bits
5. at the end, what you're left with is
   H = (sum of all bytes of data D) % 256

* ENCRYPTION

Encryption preserves the privacy of some data D. The original unencrypted
data is called "cleartext"; the encrypted data is called "ciphertext"; the
encryption alg/software is called the "cipher". Ciphers take an input data
and at least one (often secret) key K, and produce ciphertext.

Ciphers have different properties/classes:

1. symmetric ciphers use the same key K to enc/dec (HW1)
2. asymmetric ciphers use a different K1 to enc, and K2 to dec.

- symmetric ciphers are much faster than asymmetric ones.
- but asymmetric ciphers are considered more secure.

Example of how enc works:

1. ciphers share many properties with hash functions (lots of AND, OR,
   XOR, bit shifts and rotates).
2. XOR (eXclusive OR) is a primary useful function:

   X Y XOR(X,Y)
   0 0    0
   0 1    1
   1 0    1
   1 1    0

If X and Y are the same, the o/p is 0; if they differ, the o/p is 1. For
crypto: if you take any input bit Z and XOR it with a '1', Z's value is
flipped. So a good cipher can use XOR where the key is a random number:
everywhere there's a '0' in the key, the input bits remain; everywhere
there's a '1' in the key, the input bits flip. Result: ciphertext looks
nothing like cleartext.
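The simple parity/CRC hash described earlier (steps 1-5) fits in one line:

```python
def checksum8(data: bytes) -> int:
    # H = (sum of all bytes of data D) % 256, i.e., keep the lower 8 bits
    return sum(data) % 256
```

Note how weak this is compared to a cryptographic hash: reordering bytes,
or two offsetting changes, leaves H unchanged.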
If you XOR a piece of data with a key K twice, you get back the original
data: XOR is a symmetric function. The key size has to be large enough: if
it's too small, an attacker can brute force all keys and try to decrypt
your data until it looks like "text". (They do need to know how long your
key was, and what cipher was used.)

Many ciphers are "block" ciphers: they encrypt in units of a certain size,
often 64 bits. Meaning the cipher breaks the data into units of 64 bits
and encrypts them. In most ciphers, you have to say what input unit YOU
want to encrypt, e.g., 4KB. Internally, the cipher will encrypt each small
64-bit chunk, and then it'll take the previously encrypted 64-bit chunk
and add it to the mix when encrypting the next chunk.

Suppose I broke my input into 64-bit (8B) units and encrypted each unit
separately. Ciphers are deterministic: given an input X, key K, and cipher
C, you'll always get the same output Y. This means that if you have
multiple inputs that are the same, they'd all encrypt to the same output
sequence. That creates a "dictionary problem": when multiple ciphertext
chunks are all the same, it gives the attacker a way to guess your input
(e.g., English text has certain letters/words that are more frequent).

To prevent these dictionary attacks, encrypt each new chunk of data with
some of the material of the just-previously-encrypted data:

1. read D1, of size 64 bits
2. enc D1 with key K, you get C1
3. read the next 64-bit chunk, call it D2
4. enc (D2 XOR Z) with K -> C2, where Z can be all or part of C1 or D1
5. repeat steps 3-4 until there's no more data to encrypt.

This is a form of "chaining" in ciphers. Many ciphers have different modes
of operation that use "chaining" or "feedback" to prevent dictionary
attacks. One disadvantage of these modes is that to decrypt data at the
end of a file, you have to decrypt all the data that came before --
because it depends on it.
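Steps 1-5 above can be sketched with XOR standing in for the real block
cipher and one byte standing in for a 64-bit block (a toy to show the
chaining structure, not a usable cipher):

```python
def chain_enc(data: bytes, key: int) -> bytes:
    # C_i = enc(D_i XOR C_{i-1}); here "enc" is just XOR with the key byte
    out, prev = [], 0
    for d in data:
        c = (d ^ prev) ^ key
        out.append(c)
        prev = c                  # feed the ciphertext into the next block
    return bytes(out)

def chain_dec(data: bytes, key: int) -> bytes:
    out, prev = [], 0
    for c in data:
        out.append((c ^ key) ^ prev)
        prev = c                  # decryption needs all prior ciphertext
    return bytes(out)
```

With chaining, identical plaintext bytes no longer all produce identical
ciphertext bytes, and decrypting any byte requires the ciphertext that
came before it, illustrating both properties from the notes.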
Internally, if you give a cipher a chunk of, say, 4KB, it'll break it into
smaller native units (e.g., 64 bits) and use an internal chaining/feedback
mode. That allows more efficiency, b/c you only have to decrypt the 4KB
(aligned) units containing the data you want -- not the whole file. BUT,
different 4KB ciphertexts are still vulnerable to dictionary attacks
(less so than if you encrypted each small unit independently).

To prevent dictionary attacks at the level of the input unit you give, we
use an "Initialization Vector" (IV). A common way is to use an integer
that increments: e.g., for the first 4KB chunk of the file set IV=0; for
the next 4KB chunk, use IV=1; the next one, IV=2, etc. The important
property to remember is that even if the attacker KNOWS what IV numbers
you've used, it still doesn't help them. And you have to know what IV you
used to enc a chunk, and pass the same IV for that chunk when you try to
decrypt. Also important: if you chose your enc/dec unit to be 4KB (or any
other multiple of the page size), you must use the same unit when
decrypting.

Note: symmetric ciphers are just math. They can't fail. Which means: if
you give the wrong IV or ciphertext or key upon decryption, you won't get
the same orig cleartext! There are ways to combine encryption + integrity
techniques together, such that the cipher can "detect" if it's decrypting
with the wrong key: these are called "authenticated encryption".

I recommended using the AES alg, but there's also a newer cipher called
"ARIA". More variants of ARIA are available in ubuntu 20, esp. "Counter
mode" (CTR). CTR mode ensures that your o/p is the same size as your
input. Other modes may round up your file size to the next multiple of 8B
(the cipher block size); then you'd have to record how long the orig file
was and truncate(2) it after decryption.

* Asymmetric Ciphers

Asymmetric ciphers use a different K1 to enc, and K2 to dec.

- User 1 can enc data D with key K1, produce C, and send C to user 2.
- User 2 can dec C with K2, and they'll get the original data D.
- this is what public key cryptography (PKI) is all about
- K1 is often called the "private" key; K2 is called the "public" key
- user 1 protects K1! No one else should know it.
- user 1 can freely distribute K2 publicly.
- asymmetric ciphers can "fail": they can detect if C was NOT encrypted
  with K1. (Unlike symmetric ciphers, which are just "math ops" and
  always succeed.)
- if anyone (even user 2) tries to decrypt C with anything other than K2,
  it'll fail. User 2 can be assured that the ciphertext could only have
  come from user 1 (who presumably protects their private key K1), b/c
  only something encrypted with K1 can be decrypted with K2.
- the guarantee is that user 2 knows the message C could only have come
  from user 1. Useful, for example, if you're getting a secure email, or
  connecting to your bank's Web site.

Digital Signatures:

- same as before, but now user 1 takes data D, hashes it, and produces H.
- user 1 encrypts D+H with K1, and sends the result (C') to user 2
- user 2 decrypts C' with K2, first verifying that the decryption worked
- user 2 hashes the part of the message that was the original data D
- user 2 compares the two hashes
- the added guarantee is that the data inside the message not only came
  from a known source, but also wasn't corrupted or changed along the way.
  This is how vendors release, e.g., patches to their software/OS.

Private 1-to-1 communications:

1. U1 encs D with U1's private key K1, and then with U2 (the target
   user)'s public key J2, producing C.
2. U1 sends C to U2.
3. U2 decrypts C with U2's private key J1, then with U1's public key K2.

- result: U1 and U2 can communicate privately, bi-directionally, and each
  is guaranteed to know who's on the other end.

The most famous PKI example is the RSA alg. Used heavily in SSL Web site
certificates all over the world. BUT PKI is slow! Generating a key-pair
takes time, and the math is complex too (but quite interesting). Using PKI
is also slow, b/c the math ops are more CPU intensive than the mere XORs
used heavily in symmetric ciphers.
So, PKI is used to establish a trust relationship b/t two entities (e.g.,
your Web browser connecting to a bank's Web server). The two then exchange
a randomly generated cipher key to be used with a much faster symmetric
cipher, and use that symmetric cipher to encrypt all subsequent
communications.