Version 8 (2015-02-09)

Notes:
- Feel free to contact the project sponsor(s) for more details.
- Ideally, every project group would do a different project.

ChangeLog from v7:
- added 4b (larger ext4 extents)
- expanded references for project 1e
- added project 7a (Single-Object Cloud-Like Storage)
ChangeLog from v6:
- added a resource for project 1c
ChangeLog from v5:
- more resources and clarifications for projects 2a and 2b
ChangeLog from v4:
- added project 1e: optimized anti-virus scanner
ChangeLog from v3:
- added project 1d, revised 1a (NFSv4) projects
- added project 6: storage scrubbing
ChangeLog from v2:
- expanded FUSE project 5
ChangeLog from v1:
- added project 2d (cluster dedup)
- updated 2c: offline dedup

******************************************************************************
*** (1) NFSv4 Projects ***

Project sponsor(s): Ming Chen and Arun Olappamanna Vasudevan

Reading and resources for all NFSv4 projects:
- http://www.cs.sunysb.edu/~ezk/cse595-s15/nfs/ch1+6.pdf
- http://nfs.sourceforge.net
- https://github.com/nfs-ganesha

NFSv4 is an extensible network file-access protocol.  The protocol supports
compounds, which pack multiple operations into a single message, and even a
callback protocol.  As such, it is highly flexible.  Ganesha is a user-level
implementation of NFSv4, which makes it easier to try new ideas (most NFS
client/server code lives in the kernel).

(1a) Transactional compounds

Currently, a client packs multiple ops into a compound and sends it to the
server.  The server executes each op in order and stops as soon as an op
fails.  This makes error handling difficult for compounds with many
operations, because each operation inside may fail in many different ways.
Design an extension to NFSv4's compounds so that when multiple ops are packed
into a single compound, they can be executed as a single transaction: either
all of them succeed or none do.  You would have to decide how to mark the
start/end of a transaction, whether to pack multiple operations into a single
compound, and how to ensure atomicity (e.g., if you executed one op and it
failed, how do you undo the effects of the previous ops?).

The project has two parts: one on the NFS client side and one on the server
side.  On the client side, the first task is to design a transactional
file-system (FS) API, which should be powerful and flexible enough to
significantly simplify the development of FS-related applications.  At the
same time, the API should be simple, so that it is intuitive to use and
feasible to implement.  The second task is to implement the API as a C
library.  To make the library efficient, we probably need to add general NFS
compounding support to the Linux kernel.  The in-kernel compounding support
can be implemented by adding hints to the existing in-kernel NFS client,
which already uses compounds.  The last task on the client side is to modify
some existing applications (or develop new ones) using the transactional API,
so that we can measure the benefits of our transactional compounds in terms
of performance and lines-of-code reductions.

On the server side, the first task is to come up with a transactional local
file system as our storage back-end.  FSL has designed and implemented a
couple of transactional file systems before, but the code is old and no
longer in use; it will take time to get it running properly.  The second task
is to integrate the transactional FS into NFS-Ganesha as a back-end.  (More
specifically, an NFS-Ganesha back-end is called an FSAL, short for File
System Abstraction Layer.)
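To make the intended all-or-nothing semantics concrete, below is a small,
purely illustrative Python sketch.  The real deliverable would be a C library
backed by NFSv4 compounds and a transactional server back-end; every name
here (FsTxn, create, rename, commit) is invented for the example and is not
part of any existing NFS or Ganesha interface.  The sketch simply buffers
operations and rolls back the ones that already ran if a later one fails,
which is the behavior a transactional compound would provide on the server.

    import os

    class FsTxn:
        """Toy transaction: buffer (do, undo) pairs; apply all-or-nothing."""
        def __init__(self):
            self.ops = []

        def create(self, path, data=b""):
            def do():
                with open(path, "xb") as f:   # "x": fail if the file exists
                    f.write(data)
            self.ops.append((do, lambda: os.unlink(path)))

        def rename(self, src, dst):
            self.ops.append((lambda: os.rename(src, dst),
                             lambda: os.rename(dst, src)))

        def commit(self):
            done = []
            try:
                for do, undo in self.ops:
                    do()
                    done.append(undo)
            except OSError:
                for undo in reversed(done):   # undo ops that already ran
                    undo()
                raise

    txn = FsTxn()
    txn.create("report.tmp", b"draft data")
    txn.rename("report.tmp", "report.txt")
    txn.commit()   # both ops take effect, or neither does

In the real system, deciding how to mark transaction boundaries inside a
compound and how to undo partially executed ops on the server is exactly the
design problem described above.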
Extra reading for NFS transactional compounds:
- Building Workload-Independent Storage with VT-Trees,
  http://www.fsl.cs.sunysb.edu/docs/fast-kvfs/kvfs.pdf
- Enabling Transactional File Access via Lightweight Kernel Extensions,
  http://www.fsl.cs.sunysb.edu/docs/valor/valor_fast2009.pdf
- A System for Improving Application Performance Through System Call
  Composition, http://www.fsl.cs.sunysb.edu/docs/cosy-perf/cosy.pdf

(1b) Write delegations

Currently, Linux and Ganesha support only read delegations.  A read
delegation allows a client to "lock" a file for a period of time and read it
without anyone else being allowed to modify it.  The delegation is enforced
at the server (in case another client tries to access the file).  Implement
write delegations: a client should be able to "lock" a file for a period of
time, operate on the file during the lease/delegation time, and, when done,
flush all changes at once.  The delegation has to be enforced at the server.
Note that while one client is modifying a file privately, it may be possible
to allow other clients read-only access to the server's version of the file;
those other clients, however, may want to know that the file has been
delegated to someone else for writing.  Possible extensions: you may also
consider adding a callback to recall a write delegation, forcing a client to
flush its data but keep the delegation, etc.

(1c) Asynchronous operations

Currently, when an NFS client sends an op to a server, it has to wait for the
server's response before returning to the user-level application.  But Linux
supports asynchronous I/O syscalls.  Implement a method to mark an NFS
compound as asynchronous and hook it into the Linux async I/O syscalls.  This
means that when the server is done, it has to send a callback to inform the
client that the operations completed (with success/error codes as needed).
The NFS client code can then inform the user application that the async I/O
call is done.  You would have to study the async I/O APIs in Linux.

Resources:
- Oracle Direct NFS Client performs asynchronous I/O:
  http://www.oracle.com/technetwork/articles/directnfsclient-11gr1-twp-129785.pdf

(1d) NFS caching in multi-cloud storage gateways

We are building multi-cloud storage gateways that export cloud storage as an
NFS service to clients.  Cloud storage has high availability and low
maintenance cost, and is thus ideal for archival storage.  However, cloud
storage suffers from the high latency of the Internet.  Moreover, network
bandwidth is limited and can be expensive when traffic volumes are large.
Caching is the key enabler of cloud storage gateways: it overcomes these
drawbacks of cloud storage.  How to achieve high performance (a high cache
hit ratio) at low cost is an interesting problem, especially when gateways
cache data from multiple public clouds with different performance traits and
different cost-charging policies.

The multi-cloud storage gateways we are building keep a version number for
each file block, which allows an optimization of the NFS clients' cache
revalidation algorithm.  NFS provides close-to-open consistency on a per-file
basis using timestamps.  When an NFS client closes a file, it saves the
file's timestamp from the server as it receives the reply to the close
request.  Later, right before the NFS client re-opens the same file, it reads
the file's timestamp from the server again and compares the new timestamp
with the previously saved one.
If the two timestamps are the same, the client-side cache of the file is
considered valid; otherwise, the entire client-side cache of the file is
invalidated.  This whole-file cache revalidation approach can be expensive
for large cached files, especially if only a small portion of the cache has
been updated since the preceding close.  In our storage gateways, because
each file block has a version number, we can perform block-level cache
revalidation by comparing the client- and server-side version numbers.
Therefore, we can invalidate only the blocks that have changed and keep the
blocks that stayed the same.

Extra reading for NFS caching in multi-cloud storage gateways:
- SCFS: A Shared Cloud-backed File System,
  https://www.usenix.org/system/files/conference/atc14/atc14-paper-bessani.pdf
- Hybris: Robust Hybrid Cloud Storage,
  http://doi.acm.org/10.1145/2670979.2670991

(1e) Optimized anti-virus scanner for an NFS proxy

When semi-trusted clients access an untrusted cloud, data security can be
enforced in a trusted proxy.  Privacy (encryption/decryption), integrity
(checksum verification), and malware detection are some of the security
requirements that such a proxy can enforce.  While encryption and integrity
checks deal with well-defined blocks of data, malware detection involves data
that is scattered across a file and cannot be pre-determined.  This makes
malware scanning costly and calls for performance optimization before it can
be deployed effectively on an NFS proxy.

ClamAV is industry-grade, open-source, signature-based anti-virus software.
Once initialized with a signature database, ClamAV takes a file name as input
and returns "clean" or "infected."  Scanning a whole file on every read and
write is a bad idea; scanning each file against the entire signature database
is also inefficient.  The goal of this project is to optimize the ClamAV
scanner in an NFS-Ganesha proxy.  Here are some optimizations that the
project can start with:
- ClamAV should take a buffer pointer as input instead of a file name or file
  descriptor.
- Over 95% of the signatures are for PE files, and ClamAV scans each section
  of a PE file for viruses.  One optimization is to have ClamAV take a PE
  file section as input.
- Scan only newly added signatures: ClamAV can take a signature-database
  version as input in addition to the data, and scan only against signatures
  added after that version.
- Look for more possible optimizations for other file formats: HTML, PDF, etc.
- Bonus: explore the challenges of getting ClamAV to work with VMDK files.

Extra reading for the optimized anti-virus scanner for an NFS proxy:
- Avfs: An On-Access Anti-Virus File System,
  http://www.fsl.cs.sunysb.edu/docs/avfs-security04/avfs.pdf
- ClamAV: http://www.clamav.net/index.html and
  https://github.com/vrtadmin/clamav-devel
- ClamAV Signatures:
  https://github.com/vrtadmin/clamav-devel/blob/master/docs/signatures.pdf
- Ganesha proxies and caches:
  https://www.usenix.org/conference/fast-07/nache-design-and-implementation-caching-proxy-nfsv4

******************************************************************************
*** (2) Deduplication Projects

Project sponsor(s): Sonam Mandal and Jason Sun

Resources:
- Amar's thesis:
  http://www.fsl.cs.sunysb.edu/docs/mudrankit-msthesis/mudrankit-msthesis.pdf
- Muthitacharoen, Athicha, Benjie Chen, and David Mazieres.  "A Low-Bandwidth
  Network File System."  ACM SIGOPS Operating Systems Review, Vol. 35, No. 5.
  ACM, 2001.
- linux/Documentation/device-mapper/*.txt
- Dmdedup: Device Mapper Target for Data Deduplication,
  https://www.kernel.org/doc/mirror/ols2014.pdf
- http://git.fsl.cs.sunysb.edu/?p=linux-dmdedup.git;a=shortlog;h=refs/heads/dm-dedup-devel

The Device Mapper (DM) layer in Linux allows one to create stackable block
devices, or "targets" in Linux terminology.  Targets often need to maintain
persistent data structures describing the current state of a device.  A
special subsystem called "persistent data" was created in Linux so that
different targets can reuse it when they need to store their on-disk data
structures consistently.  We have implemented a deduplicating device-mapper
target called dmdedup; its code is publicly available, and we have also
published a paper on it.

(2a) User-level dmdedup integrity checker and fixer

Dmdedup keeps data and metadata in various on-disk data structures; the code
runs in the kernel.  Implement user-level tools to scan the dmdedup on-disk
structures and verify their integrity.  These tools are analogous to "fsck,"
which can scan and fix a file system such as ext4 (but a dmdedup-fsck would
be much easier to develop).  The tools should first scan and report errors;
in a second stage, they should try to fix any errors that can be fixed.

(2b) dmdedup metadata using other data structures

Dmdedup currently uses copy-on-write (COW) persistent B-trees as the
underlying data structure for its metadata.  Since dmdedup was implemented,
Linux has added other persistent-metadata structures, notably an array-based
one, which could be very useful and faster for certain workloads.  Implement
a new metadata backend that uses the DM array data structure.  Alternatively,
one can explore the device-mapper code for other possible backends and
implement one of those instead of the array-based one.

(2c) Offline Deduplication Support in Dmdedup

Dmdedup is a Linux device-mapper deduplication target.  Currently, Dmdedup
supports only inline deduplication, i.e., every write request is deduplicated
at the time of its arrival in the target.  For certain workloads, this can
raise the write-path latency to levels that are unacceptable to user
applications.  In this project, the students will add offline deduplication
support to Dmdedup.  In offline mode, Dmdedup buffers all writes in
persistent storage without deduplicating them.  When the device has been idle
for long enough, a deduplication worker thread wakes up and attempts to
eliminate duplicates in the buffered data.  The project involves four stages:
1) Understand the existing code base (about 3,000 LoC)
2) Design
3) Implementation
4) Evaluation (experimentally demonstrate the advantages of offline
   deduplication compared to inline deduplication)

(2d) Cluster Deduplication

Resources:
- Content-aware Load Balancing for Distributed Backup,
  https://www.usenix.org/legacy/event/lisa11/tech/full_papers/Douglis.pdf
- Tradeoffs in Scalable Data Routing for Deduplication Clusters,
  https://www.usenix.org/legacy/events/fast11/tech/full_papers/Dong.pdf

A single storage node may not be able to support the capacity, performance,
and availability requirements of large organizations.  Clustered storage is
an attractive option, as new nodes can be added to the cluster to scale with
growing requirements.  Creating a cluster of deduplicated nodes is
interesting because deduplication increases the effective capacity of a
system, but it introduces new challenges.  To achieve high deduplication,
repeated content needs to be stored on the same node for duplicates to be
recognized.
On the other hand, deduplication increases latency and decreases throughput
because of the worse locality of deduplicated storage.  This project has
multiple phases:
1. Create a FUSE file system that mounts more than one dmdedup storage target.
2. Create a mechanism to track where files are stored among the storage
   targets.
3. Direct files to storage targets, in an inline manner, to balance capacity
   and performance while maximizing the deduplication achieved.
4. Reposition files between target nodes in a background thread to achieve
   better overall results.

******************************************************************************
*** (3) eCryptfs Projects

Project sponsor: Erez Zadok

Resources:
- Linux eCryptfs kernel sources
- ecryptfs.org

eCryptfs is a stackable encryption file system in Linux.  It encrypts
individual files.

(3a) eCryptfs PID/UID-Based Authorization

eCryptfs's current authorization method allows any super-user to access
files.  Add support so that only authorized users and/or processes are
allowed to access and transparently encrypt/decrypt files.  In particular,
even a root user (or a hacker who broke in) should not be able to decrypt
sensitive files.  This will require adding some extra data structures to
eCryptfs, a way to specify a list of allowed users/processes, and a check of
each operation against that list.

******************************************************************************
*** (4) JBD and Ext4 Projects

Sponsor(s): Erez Zadok

Ext4 is the latest Linux file system.  It supports journaling and extents,
among other features.  JBD2 is the Journaling Block Device (version 2); it
provides the journaling infrastructure that file systems use to implement
their journals.

(4a) JBD-based multi-op transactions

Currently, JBD and Ext4 guarantee atomicity only for a single operation (both
the data and metadata of that operation).  Modify and use JBD2 to allow
multiple operations to be included in a single transaction: for example,
creating a file, writing data to the file, and closing the file.  Ideally,
JBD would export an API to Ext4, which would in turn export it (e.g., via an
ioctl(2)) to user-land programs, so users can take advantage of this
facility.  (Note: this ability could be very useful for the NFSv4
transactional compounds above.)

(4b) Larger Extents in Ext4

Extents are contiguous spaces on the storage device and hence benefit from
sequential access.  However, Ext4 limits extents to 32,768 x 4KB, or 128MB,
in size; also, 4KB appears to be the largest supported file-system block
size.  We want to modify Ext4 and its user-level utilities (mkfs, etc.) to
support larger extents and larger block sizes.  For example, we want to be
able to write a 10GB or larger file as one large extent.

******************************************************************************
*** (5) FUSE performance

Sponsor(s): Erez Zadok and Vasily Tarasov

Modern UNIX-like operating systems (OSes) rely on monolithic kernels, which
implement the majority of OS services.  Alternatively, one could implement
only a small number of basic services in the kernel and the rest of the
services in user space (the so-called microkernel approach).
Microkernel-based OSes provide a richer development toolbox and higher system
reliability; however, their overhead might be prohibitive.

Resources:
- http://fuse.sourceforge.net/

(5a) FUSE performance evaluation

File system in USEr space (FUSE) is an attempt to apply the microkernel
approach to the storage stack of a monolithic kernel.
Linux FUSE has gained significant popularity in recent years, but there is no
reliable scientific evaluation of the performance hit it causes.  For
example, it is not known whether existing in-kernel file systems could be
moved to user space or whether the performance impact would be too severe.
The answer is probably "it depends," and this project aims to experimentally
draw the boundary between the two possibilities.  In this project, students
will evaluate Linux FUSE performance with a set of micro- and
macro-workloads.  A FUSE pass-through file system will be compared to native
ext3.  The project involves five steps:
1) Understand the Linux FUSE design and implementation
2) Port the FUSE pass-through file system to the latest FUSE version
3) Run experiments for the ext3 and pass-through-ext3 configurations
4) Evaluate the performance differences and their causes
5) Implement write-through for the latest FUSE version (extra points)
Note: we aim to submit the results of this project to a conference with a
deadline in March.  As a result, the majority of the work needs to be
completed in the first half of the semester, so please plan accordingly.

******************************************************************************
*** (6) Storage Scrubbing

(6a) Storage Scrubbing

Resources: [TBS]

It is well known that storage devices suffer from "bit rot," in which stored
data may deteriorate and become unreadable over the long term.  Some file
systems (e.g., ZFS) and most NAS devices offer a "scrubbing" feature in which
every valid block or file is periodically checksummed and compared with a
previously calculated checksum.  If the checksums do not match, the file may
be automatically recovered from other data (e.g., in a RAID system) or simply
reported to the user.  A problem with scrubbing is that it imposes a
significant load on the system; thus, it is necessary to do scrubbing in the
background at a lower priority.

Create a background scrubber that will work for any Unix file system.  The
scrubber must have the following features (a minimal user-level sketch
follows the list):
1. All operations must occur in the background and must not interfere with
   normal operation.
2. There must be a way to create an initial set of checksums.
3. The scrubber must detect files that have legitimately changed and update
   their recorded checksums rather than attempting to recover or report
   them.  It must also ignore non-files (e.g., /dev, /proc).
4. The scrubber must also detect unexpected i-node changes (permissions,
   ownership, etc.) and changes in directories.
5. If a file has Linux extended attributes, those should also be verified.
6. It should be possible to store the database of checksums either on the
   file system being checked or on another device.
7. The scrubber should have simple, user-friendly tuning parameters.  For
   example, the maximum load should be configurable, and it should be simple
   to predict how long a single scrubbing cycle will take.  (In production, a
   cycle of several months is usually considered adequate.)
8. The scrubber should be robust against system crashes; when the system
   reboots, the scrubber should pick up where it left off rather than going
   back to the beginning.
9. For extra credit, the scrubber should detect the amount of load on the
   disk and adjust its activity accordingly, but it should never generate
   more than a configurable amount of load, even on a completely idle system.
10. For extra credit, the scrubber should be able to operate in parallel on
    multiple devices.
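As referenced above, here is a minimal user-level sketch of the scrubbing
idea in Python.  It is deliberately simplistic and rests on assumptions that
are not part of the requirements: checksums are kept in a JSON file, a
legitimately changed file is recognized by a changed mtime, and "background"
is approximated by lowering the process's nice value.  It covers only a few
of the required features (initial checksum creation, detecting legitimate
changes, ignoring non-files); i-node and xattr checks, crash robustness, load
throttling, and parallel devices are all left out.

    import hashlib, json, os, stat, sys

    def sha256_of(path, bufsize=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(bufsize), b""):
                h.update(chunk)
        return h.hexdigest()

    def scrub(root, dbpath):
        os.nice(19)                        # crude "background": lowest CPU priority
        db = {}
        if os.path.exists(dbpath):
            with open(dbpath) as f:
                db = json.load(f)          # previously recorded checksums
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                if not stat.S_ISREG(st.st_mode):
                    continue               # skip devices, symlinks, etc.
                digest = sha256_of(path)
                old = db.get(path)
                if old and old["sha256"] != digest and old["mtime"] == st.st_mtime:
                    # Content changed but mtime did not: report, don't "fix".
                    print("POSSIBLE BIT ROT:", path)
                db[path] = {"sha256": digest, "mtime": st.st_mtime}
        with open(dbpath, "w") as f:
            json.dump(db, f)               # rewrite the checksum database

    if __name__ == "__main__":
        scrub(sys.argv[1], sys.argv[2])    # usage: scrub.py <root> <db.json>

A real scrubber would replace the JSON file with a crash-safe database,
record i-node metadata and extended attributes, and pace its I/O against the
observed disk load rather than relying on nice(2) alone.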
******************************************************************************
*** (7) Cloud-Like Storage

Resources:
- http://docs.openstack.org/developer/swift/api/object_api_v1_overview.html
- http://docs.aws.amazon.com/AmazonS3/latest/API/APIRest.html

(7a) Single-Object Cloud-Like Storage

Implement a single-object storage system in Linux.  It could be similar to
Ext4, etc., but instead of a POSIX interface it would support (in its
entirety, or more likely a portion of) one or both of the two dominant
object-storage APIs: OpenStack Swift and Amazon S3.  One way to do this would
be to put a WSGI Web-server layer in front as the interface and store the
actual objects as whole files in Ext4; another would be to keep the WSGI
Web-server layer but handle data placement in the block layer.  At its
simplest, the system could support just three API calls (GET/PUT/POST).
Objects would be stored and retrieved through a small WSGI Web server.
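To make the "objects as whole files" design concrete, here is a minimal,
assumption-laden WSGI sketch in Python using only the standard library.  It
supports a flat namespace with GET and PUT only (no buckets, authentication,
metadata, or the POST semantics of Swift/S3), stores each object as a file
under a local ./objects directory, and listens on an arbitrary port; none of
these choices come from the project description.

    import os
    from wsgiref.simple_server import make_server

    STORE = "objects"                  # each object is a whole file in here
    os.makedirs(STORE, exist_ok=True)

    def app(environ, start_response):
        # Flat namespace: "/name" maps to objects/name.
        name = os.path.basename(environ.get("PATH_INFO", "").lstrip("/"))
        path = os.path.join(STORE, name)
        method = environ["REQUEST_METHOD"]
        if method == "PUT" and name:
            length = int(environ.get("CONTENT_LENGTH") or 0)
            with open(path, "wb") as f:
                f.write(environ["wsgi.input"].read(length))
            start_response("201 Created", [("Content-Length", "0")])
            return [b""]
        if method == "GET" and name and os.path.isfile(path):
            with open(path, "rb") as f:
                data = f.read()
            start_response("200 OK", [("Content-Length", str(len(data)))])
            return [data]
        start_response("404 Not Found", [("Content-Length", "0")])
        return [b""]

    if __name__ == "__main__":
        make_server("", 8080, app).serve_forever()

For example, "curl -T foo.txt http://localhost:8080/foo.txt" stores an object
and "curl http://localhost:8080/foo.txt" retrieves it; a real implementation
would add the remaining verbs, error codes, and metadata handling of
whichever API (Swift or S3) the project targets.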