Version 8 (2015-02-09)

Notes:
- Feel free to contact the project sponsor(s) for more details.
- Ideally, every project group would do a different project.

ChangeLog from v7:
- added 4b (larger ext4 extents)
- expanded references for project 1e
- added project 7a (Single-Object Cloud-Like Storage)
ChangeLog from v6:
- added a resource for project 1c
ChangeLog from v5:
- more resources and clarifications for projects 2a and 2b
ChangeLog from v4:
- added project 1e: optimized anti-virus scanner
ChangeLog from v3:
- added project 1d, revised 1a (NFSv4) projects
- added project 6: storage scrubbing
ChangeLog from v2:
- expanded FUSE project 5
ChangeLog from v1:
- added project 2d (cluster dedup)
- updated 2c: offline dedup

******************************************************************************
*** (1) NFSv4 Projects ***

Project sponsor(s): Ming Chen and Arun Olappamanna Vasudevan

Reading and resources for all NFSv4 projects:
- http://www.cs.sunysb.edu/~ezk/cse595-s15/nfs/ch1+6.pdf
- http://nfs.sourceforge.net
- https://github.com/nfs-ganesha

NFSv4 is an extensible network file-access protocol.  The protocol supports
compounds, which pack multiple operations into a single message, and even a
callback protocol.  As such, it is highly flexible.  Ganesha is a user-level
implementation of NFSv4, which makes it easier to try new ideas (most NFS
client/server code lives in the kernel).

(1a) Transactional compounds

Currently, a client packs multiple ops into a compound and sends it to the
server.  The server executes each op in order and stops as soon as an op
fails.  This makes error handling difficult for compounds with many
operations, because each operation inside may fail in many different ways.
Design an extension to NFSv4's compounds so that when multiple ops are packed
into a single compound, they can be executed as a single transaction: either
all of them succeed or none do.  You would have to decide how to mark the
start/end of a transaction, whether to pack multiple operations into a single
compound, and how to ensure atomicity (e.g., if you executed one op and it
failed, how do you undo the effects of the previous ops?).

The project has two parts: one on the NFS client side and one on the server
side.  On the client side, the first task is to design a transactional
file-system (FS) API, which should be powerful and flexible enough to
significantly simplify the development of FS-related applications.  At the
same time, the API should be simple, so that it is intuitive to use and
feasible to implement.  The second task is to implement the API as a C
library.  To make the library efficient, we probably need to add general NFS
compounding support to the Linux kernel.  The in-kernel compounding support
can be implemented by adding hints to the existing in-kernel NFS client,
which already uses compounds.  The last task on the client side is to modify
some existing applications (or develop new ones) using the transactional API,
so that we can measure the benefits of our transactional compounds in terms
of performance and lines-of-code reductions.

On the server side, the first task is to come up with a transactional local
file system as our storage back-end.  FSL has designed and implemented a
couple of transactional file systems before, but the code is old and no
longer in use; it will take time to get it running properly.  The second task
is to integrate the transactional FS into NFS-Ganesha as a back-end.  (More
specifically, an NFS-Ganesha back-end is called an FSAL, short for File
System Abstraction Layer.)
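To make the intended all-or-nothing semantics concrete, below is a small,
purely illustrative Python sketch.  The real deliverable would be a C library
backed by NFSv4 compounds and a transactional server back-end; every name
here (FsTxn, create, rename, commit) is invented for the example and is not
part of any existing NFS or Ganesha interface.  The sketch simply buffers
operations and rolls back the ones that already ran if a later one fails,
which is the behavior a transactional compound would provide on the server.

    import os

    class FsTxn:
        """Toy transaction: buffer (do, undo) pairs; apply all-or-nothing."""
        def __init__(self):
            self.ops = []

        def create(self, path, data=b""):
            def do():
                with open(path, "xb") as f:   # "x": fail if the file exists
                    f.write(data)
            self.ops.append((do, lambda: os.unlink(path)))

        def rename(self, src, dst):
            self.ops.append((lambda: os.rename(src, dst),
                             lambda: os.rename(dst, src)))

        def commit(self):
            done = []
            try:
                for do, undo in self.ops:
                    do()
                    done.append(undo)
            except OSError:
                for undo in reversed(done):   # undo ops that already ran
                    undo()
                raise

    txn = FsTxn()
    txn.create("report.tmp", b"draft data")
    txn.rename("report.tmp", "report.txt")
    txn.commit()   # both ops take effect, or neither does

In the real system, deciding how to mark transaction boundaries inside a
compound and how to undo partially executed ops on the server is exactly the
design problem described above.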
Extra reading for NFS transactional compounds:
- Building Workload-Independent Storage with VT-Trees,
  http://www.fsl.cs.sunysb.edu/docs/fast-kvfs/kvfs.pdf
- Enabling Transactional File Access via Lightweight Kernel Extensions,
  http://www.fsl.cs.sunysb.edu/docs/valor/valor_fast2009.pdf
- A System for Improving Application Performance Through System Call
  Composition, http://www.fsl.cs.sunysb.edu/docs/cosy-perf/cosy.pdf

(1b) Write delegations

Currently, Linux and Ganesha support only read delegations.  A read
delegation allows a client to "lock" a file for a period of time and read it
without anyone else being allowed to modify it.  The delegation is enforced
at the server (in case another client tries to access the file).  Implement
write delegations: a client should be able to "lock" a file for a period of
time, operate on the file during the lease/delegation time, and, when done,
flush all changes at once.  The delegation has to be enforced at the server.
Note that while one client is modifying a file privately, it may be possible
to allow other clients read-only access to the server's version of the file;
those other clients, however, may want to know that the file has been
delegated to someone else for writing.  Possible extensions: you may also
consider adding a callback to recall a write delegation, forcing a client to
flush its data but keep the delegation, etc.

(1c) Asynchronous operations

Currently, when an NFS client sends an op to a server, it has to wait for the
server's response before returning to the user-level application.  But Linux
supports asynchronous I/O syscalls.  Implement a method to mark an NFS
compound as asynchronous and hook it into the Linux async I/O syscalls.  This
means that when the server is done, it has to send a callback to inform the
client that the operations completed (with success/error codes as needed).
The NFS client code can then inform the user application that the async I/O
call is done.  You would have to study the async I/O APIs in Linux.

Resources:
- Oracle Direct NFS Client performs asynchronous I/O:
  http://www.oracle.com/technetwork/articles/directnfsclient-11gr1-twp-129785.pdf

(1d) NFS caching in multi-cloud storage gateways

We are building multi-cloud storage gateways that export cloud storage as an
NFS service to clients.  Cloud storage has high availability and low
maintenance cost, and is thus ideal for archival storage.  However, cloud
storage suffers from the high latency of the Internet.  Moreover, network
bandwidth is limited and can be expensive when traffic volumes are large.
Caching is the key enabler of cloud storage gateways: it overcomes these
drawbacks of cloud storage.  How to achieve high performance (a high cache
hit ratio) at low cost is an interesting problem, especially when gateways
cache data from multiple public clouds with different performance traits and
different cost-charging policies.

The multi-cloud storage gateways we are building keep a version number for
each file block, which allows an optimization of the NFS clients' cache
revalidation algorithm.  NFS provides close-to-open consistency on a per-file
basis using timestamps.  When an NFS client closes a file, it saves the
file's timestamp from the server as it receives the reply to the close
request.  Later, right before the NFS client re-opens the same file, it reads
the file's timestamp from the server again and compares the new timestamp
with the previously saved one.
If the two timestamps are the same, the client-side cache of the file is
considered valid; otherwise, the entire client-side cache of the file is
invalidated.  This whole-file cache revalidation approach can be expensive
for large cached files, especially if only a small portion of the cache has
been updated since the preceding close.  In our storage gateways, because
each file block has a version number, we can perform block-level cache
revalidation by comparing the client- and server-side version numbers.
Therefore, we can invalidate only the blocks that have changed and keep the
blocks that stayed the same.

Extra reading for NFS caching in multi-cloud storage gateways:
- SCFS: A Shared Cloud-backed File System,
  https://www.usenix.org/system/files/conference/atc14/atc14-paper-bessani.pdf
- Hybris: Robust Hybrid Cloud Storage,
  http://doi.acm.org/10.1145/2670979.2670991

(1e) Optimized anti-virus scanner for an NFS proxy

When semi-trusted clients access an untrusted cloud, data security can be
enforced in a trusted proxy.  Privacy (encryption/decryption), integrity
(checksum verification), and malware detection are some of the security
requirements that such a proxy can enforce.  While encryption and integrity
checks deal with well-defined blocks of data, malware detection involves data
that is scattered across a file and cannot be pre-determined.  This makes
malware scanning costly and calls for performance optimization before it can
be deployed effectively on an NFS proxy.

ClamAV is industry-grade, open-source, signature-based anti-virus software.
Once initialized with a signature database, ClamAV takes a file name as input
and returns "clean" or "infected."  Scanning a whole file on every read and
write is a bad idea; scanning each file against the entire signature database
is also inefficient.  The goal of this project is to optimize the ClamAV
scanner in an NFS-Ganesha proxy.  Here are some optimizations that the
project can start with:
- ClamAV should take a buffer pointer as input instead of a file name or file
  descriptor.
- Over 95% of the signatures are for PE files, and ClamAV scans each section
  of a PE file for viruses.  One optimization is to have ClamAV take a PE
  file section as input.
- Scan only newly added signatures: ClamAV can take a signature-database
  version as input in addition to the data, and scan only against signatures
  added after that version.
- Look for more possible optimizations for other file formats: HTML, PDF, etc.
- Bonus: explore the challenges of getting ClamAV to work with VMDK files.

Extra reading for the optimized anti-virus scanner for an NFS proxy:
- Avfs: An On-Access Anti-Virus File System,
  http://www.fsl.cs.sunysb.edu/docs/avfs-security04/avfs.pdf
- ClamAV: http://www.clamav.net/index.html and
  https://github.com/vrtadmin/clamav-devel
- ClamAV Signatures:
  https://github.com/vrtadmin/clamav-devel/blob/master/docs/signatures.pdf
- Ganesha proxies and caches:
  https://www.usenix.org/conference/fast-07/nache-design-and-implementation-caching-proxy-nfsv4

******************************************************************************
*** (2) Deduplication Projects

Project sponsor(s): Sonam Mandal and Jason Sun

Resources:
- Amar's thesis:
  http://www.fsl.cs.sunysb.edu/docs/mudrankit-msthesis/mudrankit-msthesis.pdf
- Muthitacharoen, Athicha, Benjie Chen, and David Mazieres.  "A Low-Bandwidth
  Network File System."  ACM SIGOPS Operating Systems Review, Vol. 35, No. 5.
  ACM, 2001.
- linux/Documentation/device-mapper/*.txt
- Dmdedup: Device Mapper Target for Data Deduplication,
  https://www.kernel.org/doc/mirror/ols2014.pdf
- http://git.fsl.cs.sunysb.edu/?p=linux-dmdedup.git;a=shortlog;h=refs/heads/dm-dedup-devel

The Device Mapper (DM) layer in Linux allows one to create stackable block
devices, or "targets" in Linux terminology.  Targets often need to maintain
persistent data structures describing the current state of a device.  A
special subsystem called "persistent data" was created in Linux so that
different targets can reuse it when they need to store their on-disk data
structures consistently.  We have implemented a deduplicating device-mapper
target called dmdedup; its code is publicly available, and we have also
published a paper on it.

(2a) User-level dmdedup integrity checker and fixer

Dmdedup keeps data and metadata in various on-disk data structures; the code
runs in the kernel.  Implement user-level tools to scan the dmdedup on-disk
structures and verify their integrity.  These tools are analogous to "fsck,"
which can scan and fix a file system such as ext4 (but a dmdedup-fsck would
be much easier to develop).  The tools should first scan and report errors;
in a second stage, they should try to fix any errors that can be fixed.

(2b) dmdedup metadata using other data structures

Dmdedup currently uses copy-on-write (COW) persistent B-trees as the
underlying data structure for its metadata.  Since dmdedup was implemented,
Linux has added other persistent-metadata structures, notably an array-based
one, which could be very useful and faster for certain workloads.  Implement
a new metadata backend that uses the DM array data structure.  Alternatively,
one can explore the device-mapper code for other possible backends and
implement one of those instead of the array-based one.

(2c) Offline Deduplication Support in Dmdedup

Dmdedup is a Linux device-mapper deduplication target.  Currently, Dmdedup
supports only inline deduplication, i.e., every write request is deduplicated
at the time of its arrival in the target.  For certain workloads, this can
raise the write-path latency to levels that are unacceptable to user
applications.  In this project, the students will add offline deduplication
support to Dmdedup.  In offline mode, Dmdedup buffers all writes in
persistent storage without deduplicating them.  When the device has been idle
for long enough, a deduplication worker thread wakes up and attempts to
eliminate duplicates in the buffered data.  The project involves four stages:
1) Understand the existing code base (about 3,000 LoC)
2) Design
3) Implementation
4) Evaluation (experimentally demonstrate the advantages of offline
   deduplication compared to inline deduplication)

(2d) Cluster Deduplication

Resources:
- Content-aware Load Balancing for Distributed Backup,
  https://www.usenix.org/legacy/event/lisa11/tech/full_papers/Douglis.pdf
- Tradeoffs in Scalable Data Routing for Deduplication Clusters,
  https://www.usenix.org/legacy/events/fast11/tech/full_papers/Dong.pdf

A single storage node may not be able to support the capacity, performance,
and availability requirements of large organizations.  Clustered storage is
an attractive option, as new nodes can be added to the cluster to scale with
growing requirements.  Creating a cluster of deduplicated nodes is
interesting because deduplication increases the effective capacity of a
system, but it introduces new challenges.  To achieve high deduplication,
repeated content needs to be stored on the same node for duplicates to be
recognized.
On the other hand, deduplication increases latency and decreases throughput
because of the worse locality of deduplicated storage.  This project has
multiple phases:
1. Create a FUSE file system that mounts more than one dmdedup storage target.
2. Create a mechanism to track where files are stored among the storage
   targets.
3. Direct files to storage targets, in an inline manner, to balance capacity
   and performance while maximizing the deduplication achieved.
4. Reposition files between target nodes in a background thread to achieve
   better overall results.

******************************************************************************
*** (3) eCryptfs Projects

Project sponsor: Erez Zadok

Resources:
- Linux eCryptfs kernel sources
- ecryptfs.org

eCryptfs is a stackable encryption file system in Linux.  It encrypts
individual files.

(3a) eCryptfs PID/UID-Based Authorization

eCryptfs's current authorization method allows any super-user to access
files.  Add support so that only authorized users and/or processes are
allowed to access and transparently encrypt/decrypt files.  In particular,
even a root user (or a hacker who broke in) should not be able to decrypt
sensitive files.  This will require adding some extra data structures to
eCryptfs, a way to specify a list of allowed users/processes, and a check of
each operation against that list.

******************************************************************************
*** (4) JBD and Ext4 Projects

Sponsor(s): Erez Zadok

Ext4 is the latest Linux file system.  It supports journaling and extents,
among other features.  JBD2 is the Journaling Block Device (version 2); it
provides the journaling infrastructure that file systems use to implement
their journals.

(4a) JBD-based multi-op transactions

Currently, JBD and Ext4 guarantee atomicity only for a single operation (both
the data and metadata of that operation).  Modify and use JBD2 to allow
multiple operations to be included in a single transaction: for example,
creating a file, writing data to the file, and closing the file.  Ideally,
JBD would export an API to Ext4, which would in turn export it (e.g., via an
ioctl(2)) to user-land programs, so users can take advantage of this
facility.  (Note: this ability could be very useful for the NFSv4
transactional compounds above.)

(4b) Larger Extents in Ext4

Extents are contiguous spaces on the storage device and hence benefit from
sequential access.  However, Ext4 limits extents to 32,768 x 4KB, or 128MB,
in size; also, 4KB appears to be the largest supported file-system block
size.  We want to modify Ext4 and its user-level utilities (mkfs, etc.) to
support larger extents and larger block sizes.  For example, we want to be
able to write a 10GB or larger file as one large extent.

******************************************************************************
*** (5) FUSE performance

Sponsor(s): Erez Zadok and Vasily Tarasov

Modern UNIX-like operating systems (OSes) rely on monolithic kernels, which
implement the majority of OS services.  Alternatively, one could implement
only a small number of basic services in the kernel and the rest of the
services in user space (the so-called microkernel approach).
Microkernel-based OSes provide a richer development toolbox and higher system
reliability; however, their overhead might be prohibitive.

Resources:
- http://fuse.sourceforge.net/

(5a) FUSE performance evaluation

File system in USEr space (FUSE) is an attempt to apply the microkernel
approach to the storage stack of a monolithic kernel.
Linux FUSE has gained significant popularity in recent years, but there is no
reliable scientific evaluation of the performance hit it causes.  For
example, it is not known whether existing in-kernel file systems could be
moved to user space or whether the performance impact would be too severe.
The answer is probably "it depends," and this project aims to experimentally
draw the boundary between the two possibilities.  In this project, students
will evaluate Linux FUSE performance with a set of micro- and
macro-workloads.  A FUSE pass-through file system will be compared to native
ext3.  The project involves five steps:
1) Understand the Linux FUSE design and implementation
2) Port the FUSE pass-through file system to the latest FUSE version
3) Run experiments for the ext3 and pass-through-ext3 configurations
4) Evaluate the performance differences and their causes
5) Implement write-through for the latest FUSE version (extra points)
Note: we aim to submit the results of this project to a conference with a
deadline in March.  As a result, the majority of the work needs to be
completed in the first half of the semester, so please plan accordingly.

******************************************************************************
*** (6) Storage Scrubbing

(6a) Storage Scrubbing

Resources: [TBS]

It is well known that storage devices suffer from "bit rot," in which stored
data may deteriorate and become unreadable over the long term.  Some file
systems (e.g., ZFS) and most NAS devices offer a "scrubbing" feature in which
every valid block or file is periodically checksummed and compared with a
previously calculated checksum.  If the checksums do not match, the file may
be automatically recovered from other data (e.g., in a RAID system) or simply
reported to the user.  A problem with scrubbing is that it imposes a
significant load on the system; thus, it is necessary to do scrubbing in the
background at a lower priority.

Create a background scrubber that will work for any Unix file system.  The
scrubber must have the following features (a minimal user-level sketch
follows the list):
1. All operations must occur in the background and must not interfere with
   normal operation.
2. There must be a way to create an initial set of checksums.
3. The scrubber must detect files that have legitimately changed and update
   their recorded checksums rather than attempting to recover or report
   them.  It must also ignore non-files (e.g., /dev, /proc).
4. The scrubber must also detect unexpected i-node changes (permissions,
   ownership, etc.) and changes in directories.
5. If a file has Linux extended attributes, those should also be verified.
6. It should be possible to store the database of checksums either on the
   file system being checked or on another device.
7. The scrubber should have simple, user-friendly tuning parameters.  For
   example, the maximum load should be configurable, and it should be simple
   to predict how long a single scrubbing cycle will take.  (In production, a
   cycle of several months is usually considered adequate.)
8. The scrubber should be robust against system crashes; when the system
   reboots, the scrubber should pick up where it left off rather than going
   back to the beginning.
9. For extra credit, the scrubber should detect the amount of load on the
   disk and adjust its activity accordingly, but it should never generate
   more than a configurable amount of load, even on a completely idle system.
10. For extra credit, the scrubber should be able to operate in parallel on
    multiple devices.
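As referenced above, here is a minimal user-level sketch of the scrubbing
idea in Python.  It is deliberately simplistic and rests on assumptions that
are not part of the requirements: checksums are kept in a JSON file, a
legitimately changed file is recognized by a changed mtime, and "background"
is approximated by lowering the process's nice value.  It covers only a few
of the required features (initial checksum creation, detecting legitimate
changes, ignoring non-files); i-node and xattr checks, crash robustness, load
throttling, and parallel devices are all left out.

    import hashlib, json, os, stat, sys

    def sha256_of(path, bufsize=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(bufsize), b""):
                h.update(chunk)
        return h.hexdigest()

    def scrub(root, dbpath):
        os.nice(19)                        # crude "background": lowest CPU priority
        db = {}
        if os.path.exists(dbpath):
            with open(dbpath) as f:
                db = json.load(f)          # previously recorded checksums
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode):
                    continue               # skip devices, symlinks, etc.
                digest = sha256_of(path)
                old = db.get(path)
                if old and old["sha256"] != digest and old["mtime"] == st.st_mtime:
                    # Content changed but mtime did not: report, don't "fix".
                    print("POSSIBLE BIT ROT:", path)
                db[path] = {"sha256": digest, "mtime": st.st_mtime}
        with open(dbpath, "w") as f:
            json.dump(db, f)               # rewrite the checksum database

    if __name__ == "__main__":
        scrub(sys.argv[1], sys.argv[2])    # usage: scrub.py <root> <db.json>

A real scrubber would replace the JSON file with a crash-safe database,
record i-node metadata and extended attributes, and pace its I/O against the
observed disk load rather than relying on nice(2) alone.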
******************************************************************************
*** (7) Cloud-Like Storage

Resources:
- http://docs.openstack.org/developer/swift/api/object_api_v1_overview.html
- http://docs.aws.amazon.com/AmazonS3/latest/API/APIRest.html

(7a) Single-Object Cloud-Like Storage

Implement a single-object storage system in Linux.  It could be similar to
Ext4, etc., but instead of a POSIX interface it would support (in its
entirety, or more likely a portion of) one or both of the two dominant
object-storage APIs: OpenStack Swift and Amazon S3.  One way to do this would
be to put a WSGI Web-server layer in front as the interface and store the
actual objects as whole files in Ext4; another would be to keep the WSGI
Web-server layer but handle data placement in the block layer.  At its
simplest, the system could support just three API calls (GET/PUT/POST).
Objects would be stored and retrieved through a small WSGI Web server.
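To make the "objects as whole files" design concrete, here is a minimal,
assumption-laden WSGI sketch in Python using only the standard library.  It
supports a flat namespace with GET and PUT only (no buckets, authentication,
metadata, or the POST semantics of Swift/S3), stores each object as a file
under a local ./objects directory, and listens on an arbitrary port; none of
these choices come from the project description.

    import os
    from wsgiref.simple_server import make_server

    STORE = "objects"                  # each object is a whole file in here
    os.makedirs(STORE, exist_ok=True)

    def app(environ, start_response):
        # Flat namespace: "/name" maps to objects/name.
        name = os.path.basename(environ.get("PATH_INFO", "").lstrip("/"))
        path = os.path.join(STORE, name)
        method = environ["REQUEST_METHOD"]
        if method == "PUT" and name:
            length = int(environ.get("CONTENT_LENGTH") or 0)
            with open(path, "wb") as f:
                f.write(environ["wsgi.input"].read(length))
            start_response("201 Created", [("Content-Length", "0")])
            return [b""]
        if method == "GET" and name and os.path.isfile(path):
            with open(path, "rb") as f:
                data = f.read()
            start_response("200 OK", [("Content-Length", str(len(data)))])
            return [data]
        start_response("404 Not Found", [("Content-Length", "0")])
        return [b""]

    if __name__ == "__main__":
        make_server("", 8080, app).serve_forever()

For example, "curl -T foo.txt http://localhost:8080/foo.txt" stores an object
and "curl http://localhost:8080/foo.txt" retrieves it; a real implementation
would add the remaining verbs, error codes, and metadata handling of
whichever API (Swift or S3) the project targets.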