CSCI 333
Storage Systems
Final Project
Posted: Saturday, 05/09
Due Dates: Proposals 05/13 at 10am; Project 05/22 at 5pm
Overview
Originally, final projects were supposed to be a chance for each of
us to "dive deep" into a topic that we found interesting while
also simulating the experience of doing "real systems research".
Then came COVID-19. There is not enough time left in the
semester to do a substantial implementation project, and even if
there were, our inability to access real hardware in-person
would make many exciting projects impractical. But we can still
explore topics that we find exciting without implementing a new
system from scratch.
There are several ways to produce and share knowledge.
For our final projects, I would like us to choose from among:
measurement/analysis, knowledge synthesis/summary of knowledge,
data collection and presentation, and/or implementation.
- Measurement/analysis. As computer scientists, we
should all practice the scientific method: Ask a question,
perform background research, form a hypothesis, test the
hypothesis, draw conclusions, and communicate the
results. If you wish to do measurement/analysis, your
final project would start with us developing a
hypothesis. Once we can articulate a hypothesis, we can design
an experiment to test that hypothesis or we could model an
algorithm or system's asymptotic performance to verify that
hypothesis. Although a complex set of tests may be too
difficult or time consuming for our remaining time, there are
definitely options: we can make a small change to an existing
system and test its implications or we could evaluate
important performance measures of an existing system under
different parameter settings/inputs. A great outcome doesn't
require fixing a problem; often simply identifying an
opportunity is incredibly valuable.
-
Knowledge synthesis/summary of knowledge. We could
read and learn about a topic, and present what we've learned
in an accessible/creative way. There are traditional forms of
disseminating knowledge: USENIX conferences provide templates for
formatting research papers using LaTeX or MS Word. We have also read
at least one blog post this semester (Ben Stopford's entry on
LSM-trees), and you've sat through many PowerPoint
presentations on different storage topics. If you wish to
do knowledge synthesis, your final project would start with
finding a topic area and identifying a set of sources to read,
selecting a presentation format (paper, web posting, or
short presentation), and outlining a plan for your
document.
- Data collection and presentation. In our unit on file
system aging, we read one paper that documented a long-term data
collection effort,
an analysis of that data, and a system that was built and
verified using that data analysis. We don't have time for a
longitudinal study or time to build a full system, but we can
collect data that tells us something about the state of the
world right now. If you wish to do data collection and
presentation, your final project would start with us
discussing a data collection task, deciding upon a "data
format", and picking a set of systems on which to collect the
data.
-
Implementation. Finally, you could just implement
something. This type of project would be like a traditional
lab. You would not need to write a paper, but your README
would need to clearly describe what you've done and document
your work.
The final project is your chance to use the knowledge, tools, and
techniques that we have developed this semester to explore a topic
that interests you. Your final project should be about the size
and scope of a one-week lab/5-page paper/7-minute presentation.
However, you have the opportunity to define the project parameters,
or you may choose to use or adapt one of the suggested projects
that fits your interests/goals.
Timelines
There is not much time left in the semester, so by Wednesday,
May 13 we should agree upon a project proposal, and the
final deliverables should be completed by
Friday, May 22 (since we will not be having a final
exam, the
latest that written work may be due is 5:00 pm on the
third-to-last day of the exam period).
Proposals
-
You must propose a project by Wednesday, May 13
at 10AM Eastern. Proposals should be submitted
using this
form. When you submit your proposal, please email me to let me
know. We should also discuss your proposal, and in general, the
end of the semester, so in that email, please let me know what
times would be convenient for you to sync up.
-
I will review your proposals on Wednesday and return feedback to
you by the end of the day. If possible, we should brainstorm
together over email, Slack, Zoom, and/or Piazza.
-
Your proposal needs to be detailed enough that it is clear to
all of us what you plan to do. It should include a list of group
members (if any), the topic, the format of your deliverables,
and a "definition of success". This is important so we can agree
upon a project scope.
I do NOT want this to be a burden; I want it to be fun and
low-stress. The format is flexible, the topic is flexible, and it
will be evaluated using the criteria that we collectively
define. If the project doesn't sound fun to you, don't propose it!
We can come up with something together. If nothing sounds fun to
you right now, we should talk about that too, and we can figure
something out.
Final Project
The project will be due on or before Friday, May 22 at 5pm.
Ideally, we would all be able to talk on the phone or over Zoom to
go over your project and give you feedback. So I would like to
schedule individual meetings with you all. We should agree upon a
meeting schedule (time and format) when we discuss your proposal.
Since the final deadline is Friday May 22, my hope is that we can
stagger meetings during "finals week"; earlier meetings would
discuss progress and plans, answer questions, and resolve any
issues. Later meetings could present more completed results.
Guidelines
For final projects:
- You may work in groups of 1-4 students
-
The project scope should scale with the size of the group (I expect a team of 3 students to cover a little bit more ground than 2 students, but it is obviously not a linear relationship)
-
You may define your own project, or you may choose/modify a project idea from the examples below.
Depending on your project, your "deliverables" will vary.
However, I will make a GitHub repository for each group, and
your group will commit your final project resources to that
repository by the due date.
Sample Project Ideas
This is a non-exhaustive list of project ideas.
Some of them have been inspired by topics in this course,
and some of them have been projects from computer science courses
elsewhere.
I suggest that you use them for inspiration,
even if you decide to design your own project: techniques or suggestions
might be relevant.
Measurement/Analysis
- Hardware measurement and benchmark design from an OS course at UNC
- Recreate the evaluation of a published research paper, and
either confirm or dispute its results. Good candidates for
this type of project are papers that have published their
benchmarks and have provided clear descriptions of their
methodology, or papers that measure the state of the world at
some point in time, where you want to see if the world has
changed since then.
- Recreate the "Bandwidth vs. I/O size curve" that I used to motivate seek costs on hard drives (and later SSDs), but this time use a "cloud service" as your "disk". The goal would be to answer the question: "What is the natural transfer size for different cloud storage services?"
- Download the code for a filter implementation (Bloom or quotient filter), and experimentally verify that the parameter settings produce the desired false positive rate.
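As a starting point for the filter experiment above, here is a minimal sketch of how the measurement could be set up. It is not taken from any particular filter library: the BloomFilter class, the double-hashing scheme built on SHA-256, and the helper names expected_fpr and measured_fpr are all assumptions made for illustration. The comparison uses the standard approximation (1 - e^(-kn/m))^k for a filter with m bits, k hash functions, and n inserted keys.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter with m bits and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key):
        # Assumption: derive k positions by double hashing (h1 + i*h2) mod m.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 to be odd
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def expected_fpr(m, k, n):
    """Standard approximation of the false positive rate."""
    return (1.0 - math.exp(-k * n / m)) ** k


def measured_fpr(m, k, n, trials=20000):
    """Insert n keys, then query `trials` keys that were never inserted."""
    bf = BloomFilter(m, k)
    for i in range(n):
        bf.add(f"member-{i}")
    false_positives = sum(1 for i in range(trials) if f"absent-{i}" in bf)
    return false_positives / trials


if __name__ == "__main__":
    m, n = 10000, 1000
    for k in (1, 3, 5, 7, 9):
        print(k, measured_fpr(m, k, n), expected_fpr(m, k, n))
```

For a downloaded filter implementation, only the BloomFilter class would need to be swapped out; the measurement loop and the comparison against the formula stay the same.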
Knowledge Synthesis/Summary of Knowledge
- A survey of the challenges of building systems that are compliant with data privacy regulations. For example:
-
What is the GDPR and what are its implications at big companies like Facebook/Google?
-
What are the challenges of building a HIPAA-compliant storage
system at a hospital, when doctors in
different departments and different hospitals may want to access
a patient's data?
-
What are the challenges for designing a system like
MOSS—one that relies on collecting large user data sets
for plagiarism detection—to ensure that it is
FERPA-compliant?
-
A survey of crash recovery techniques, like write-ahead
logging, transactions, and fsck.
-
A survey of different LSM-tree designs and techniques
-
A presentation/notes for one of the textbook chapters that we didn't cover in class
-
A case study on how git works. The internals
(Ch 10) are quite interesting.
Straight Implementation
- Implement a Bloom filter, and verify its properties (e.g.,
false positive rate w.r.t. k and m)
- Implement a compression algorithm from scratch
- Extend your FUSE file system in one or more interesting ways, such as:
- Add encryption and/or compression (this can be surprisingly simple to do, and it can be scaled up or scaled down depending on your appetite for complexity)
- Add some form of support for data caching
- Implement a new FUSE file system that does something interesting or
fun. I would imagine these would be simpler and more in line with
modifying the fuse_xmp pass-through file system template.
- Musical FUSE: writes to files proceed as normal, but reads use an external Python library to "play" the contents of the files.
- A pseudo filesystem for printing PDF documents. One idea
is to have two directories: one for writing single-sided and
one for double-sided. There could be a single pseudo-file
called "print", and when I use the "write" system call to
write a filename to "print", the computer would print the
contents with the appropriate sided-ness to the default
printer (the printer could be specified as a parameter during
mount). This would be spiritually similar to the hello-FUSE
lab.
-
A FUSE file system that operates as a pass-through file
system, but issues git commits/pushes/pulls to keep your
contents synced with a remote repository. This could be an
interesting way to collaborate on a project. A challenge
here would be handling merge conflicts, but this project
seems both straightforward and fun!
-
Implement file-level compression on a "pass-through" style
FS (i.e., you read and write using an existing file system,
but you compress the file contents)
- Implement hardware conditioning scripts, and quantify the performance differences between fresh and aged devices. Here is a link to the SNIA document with details.
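To give a sense of scale for the "compression algorithm from scratch" idea in the list above, here is a minimal sketch of run-length encoding, one of the simplest possible schemes. The (count, byte) pair format and the function names rle_encode and rle_decode are choices made for this illustration, not a standard format; a real project would likely aim for something more interesting (e.g., a dictionary-based or entropy-based scheme).

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (count, byte) pairs, with each count in 1..255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the next byte matches and the count fits in a byte.
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out.append(run)
        out.append(data[i])
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Invert rle_encode; reject input that is not made of valid pairs."""
    if len(encoded) % 2 != 0:
        raise ValueError("encoded data must consist of (count, byte) pairs")
    out = bytearray()
    for j in range(0, len(encoded), 2):
        count, value = encoded[j], encoded[j + 1]
        if count == 0:
            raise ValueError("run length must be at least 1")
        out.extend(bytes([value]) * count)
    return bytes(out)
```

Note that this scheme doubles the size of data with no repeated bytes, which is exactly the kind of property a project write-up could measure and discuss.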
Data Collection
- Collect summary statistics about modern storage systems. Things like compressibility, directory structure, file sizes, deduplication ratios, etc. are interesting to practitioners.
- Measure the fragmentation on the lab systems using the filefrag
tool, and do some analysis of the data
(average fragmentation levels, fragmentation by file type,
fragmentation by file size, etc.).
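As one possible shape for the summary-statistics idea above, here is a small sketch that walks a directory tree and records file size, directory depth, and a rough compressibility estimate for each file. The record fields, the function names collect_file_stats and summarize, and the choice to compress only the first 64 KiB of each file with zlib are all assumptions for illustration; the actual "data format" is something we would agree upon in your proposal.

```python
import os
import statistics
import zlib


def collect_file_stats(root, sample_bytes=65536):
    """Walk root and return one record (a dict) per regular file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            try:
                size = os.path.getsize(path)
                with open(path, "rb") as f:
                    sample = f.read(sample_bytes)
            except OSError:
                continue  # skip files we cannot read
            # Compressed size / original size for the sampled prefix.
            ratio = len(zlib.compress(sample)) / len(sample) if sample else None
            records.append(
                {"path": path, "size": size, "depth": depth, "compress_ratio": ratio}
            )
    return records


def summarize(records):
    """Reduce per-file records to a few summary statistics."""
    sizes = [r["size"] for r in records]
    ratios = [r["compress_ratio"] for r in records if r["compress_ratio"] is not None]
    return {
        "files": len(records),
        "total_bytes": sum(sizes),
        "median_size": statistics.median(sizes) if sizes else 0,
        "max_depth": max((r["depth"] for r in records), default=0),
        "mean_compress_ratio": statistics.mean(ratios) if ratios else None,
    }
```

Keeping the per-file records separate from the summary makes it easy to slice the data later (by file type, by size, and so on) without re-scanning the file system.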
Williams College :: CSCI 333 :: Spring 2020