CSCI 333
Storage Systems
Lab 1: Unix Utilities, System Calls, and C
Assigned: Sunday, 02/09
Due Date: Thursday, 02/20 at 11:59pm
Objectives
This assignment serves several purposes, and as a result it may pose different challenges to each of us. The goals of this assignment are to:
Overview
In this lab, we will be implementing our own versions of important
Unix utilities.
These are useful utilities that we should all familiarize
ourselves with and incorporate in our everyday programming practice.
They are also utilities that may come in handy if you want to
analyze/benchmark
the performance of real systems: one possibility for a final project
is to extend some of these utilities in interesting/useful ways.
Your current task is to write simplified versions of the
cat
,
grep
,
zip
,
and
unzip
utilities.
We will implement a subset of each tool's functionality,
but you may extend them with additional features if desired.
Logistics
For this first lab, you will work alone. Each student will be given a private repository under the Williams-CS GitHub organization, and only you will have read/write access. Labs will be submitted by committing code to this repository before the due date. Git maintains a log of your commit history, but only your final commit before the deadline will be graded. I highly recommend that you commit code early and often: it will only help because the instructor can view your code and easily answer questions using the GitHub interface.
You will submit four individual .c
files, one for each
operation.
Each file should have a main function that, when run, performs the
behavior described for that task.
You should include brief documentation in your repository's
README.md
file that describes each program,
including instructions on how to compile and run your program, and
what the program's expected behavior(s) are.
Honor Code
You may discuss any and all aspects of this lab with the instructor
and your TA.
However, you are prohibited from sharing or discussing your code with
anyone else.
You may discuss the semantics of the Unix utilities
(i.e., the contents of the lab handout and/or any of the Unix
man
pages for these or other utilities),
and debugging strategies with your classmates (e.g., how to run
gdb
, the meanings of cryptic error messages,
printf
semantics and other C-language function behaviors),
but
you should never look at any full or partial solution that is
not your own! You should never look at any computer screen that has a
classmate's code on it.
Starter Code
There is no "starter code" for this lab, but you will
find empty .c
files that you will fill in with your
lab solutions.
In addition to the four empty .c files,
each repository will contain a README.md file with
additional links and lab details.
You will supplement that README.md with documentation
as described above (e.g., documentation to compile, run, and use each
utility).
You can clone your personal private repository at https://github.com/williams-cs/cs333lab01-{USERNAME}. The directory where you store the repository is up to you, but all of your work should be completed and submitted in your personal repository.
Part I: my-cat.c
Your my-cat
program is a simplified version of the
standard Unix program cat
(at a high level, cat
reads the file specified by
the user and prints that file's exact contents to standard output).
In the following example of my-cat
usage,
in which a user wants to see the contents of the file
main.c
,
the user would pass main.c
as the only command-line
argument,
and the exact contents main.c
would be printed:
$ ./my-cat main.c
#include <...>
#include <...>
... the rest of main.c is also printed to standard output...
The "./"
before the my-cat
above
is a Unix thing; it tells the system which directory to find
my-cat
in
(in this case, in the "."
(dot) directory,
which means the current working directory).
To create the executable my-cat binary,
you'll be editing a single source file, my-cat.c,
and writing C code that implements this simplified version of cat.
To compile the program, you will do the following:
$ gcc -o my-cat my-cat.c -Wall -Werror
This will make a single executable binary called my-cat
,
which you can then run as shown above.
You'll need to learn how to use a few library routines
from the C standard library
(often called libc
)
in order to implement the source code for this program.
All C code is automatically linked with the C library,
which is full of useful functions that you can call
to implement your program.
Learn more about the C library
here.
There is a lot of code at your disposal thanks to the C library,
so at various points you will need to read the manual pages.
One slightly more readable version of C library code is the
musl libc project.
If you ever want to explore implementation details,
or perhaps even modify some libc functionality for your own purposes,
I encourage you to read the musl libc source code.
Helpful Commands
For this lab assignment,
we recommend using the following C library routines for file input
and output:
fopen()
, fgets()
, and fclose()
.
Whenever you use a new function like this,
the first thing you should do is read about it -- how else
will you learn to use it properly?
On UNIX systems,
the best way to read about such functions is to use what are
called the man
pages (short for manual
).
In our HTML/web-driven world,
the man pages feel a bit antiquated,
but they are quick and informative references and
generally quite easy to use.
Plus they are available on every Unix machine,
accessible right from the command line.
To access the man page for fopen()
,
for example, just type the following at your UNIX shell prompt:
$ man fopen
Then, read! Reading man pages effectively takes practice; why not start learning now?
The command `man fopen`
at the command line will open
up the manual page in the same interface as the `less`
command.
Use the arrows to navigate and type `q`
to quit.
In fact, I suggest using the command `man less`
to explore how to navigate man pages!
Your time investment will quickly pay off.
(Also note: many shortcuts in less
are also shortcuts
in emacs
and at the command line,
so being familiar with Unix tools and shortcuts
will make you more efficient in all aspects of systems programming!)
(Another quick aside: The Unix manual is broken into several
sections.
The 2nd section describes system calls, and the 3rd section
contains library functions, including libc.
You can specify a section when searching the manual to
ensure that you get the documentation that you want.
The syntax is
man SECTION_NUMBER FUNCTION_NAME
.)
Although you should definitely explore the man pages,
we will also give a short overview of important commands here.
The fopen()
function "opens" a
file, which is a common way in Unix systems to begin the process
of file access.
The fopen()
command just gives you back a pointer to a
structure of type FILE
,
which can then be passed to other library routines that
read, write, etc.
Here is a typical usage of fopen()
:
FILE *fp = fopen("main.c", "r");
if (fp == NULL) {
printf("cannot open file\n");
exit(1);
}
A couple of points here:
First, note that fopen()
takes two arguments:
the name of the file and the mode.
The mode restricts the ways that the FILE
structure
can be used to interact with the file.
In this case, because we only wish to read the file,
we pass "r"
as the second argument.
Read the man pages to see what other options are available.
In general, you should always grant the minimum privileges necessary
to perform your task;
this eliminates common errors by letting you state your intent,
and giving C the information it needs to enforce your intent.
Second, note the critical step of checking whether the
fopen()
function actually succeeded.
Unlike Java, C will not throw an exception when things go wrong;
rather, C expects that
(in good programs, i.e., the only kind you'd want to write)
you always will check whether the call succeeded.
Reading the manual page tells you the details of what is returned
on success (often a range of values indicate success,
and partial successes may be possible)
and on error
(often -1, with an appropriate errno set to indicate
why the function failed);
in this case, the macOS man page says:
Upon successful completion fopen(), fdopen(), freopen() and fmemopen() return a FILE pointer. Otherwise, NULL is returned and the global variable errno is set to indicate the error.
Thus, as the code above does,
please check that fopen()
does not return
NULL before trying to use the FILE pointer it returns.
Third, note that when the error case occurs, the program prints a message and exits with error status of 1. In Unix systems, it is traditional to return 0 upon success, and return non-zero upon failure. Here, we will use 1 to indicate failure.
Side note: if fopen()
does fail,
there are many possible reasons why.
You can use the functions
perror()
and/or strerror()
to print human-readable messages that describe why
the error occurred;
learn about those functions on your own
(using ... you guessed it ... the man pages!),
and use them in your assignment!
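As a rough illustration of the side note above, the sketch below reports why an fopen() call failed, once with strerror() and once with perror(). The helper name open_or_explain is hypothetical and not part of the lab; the messages your graded programs print must still match the exact text given in each "Details" section.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the lab spec): open `path` for reading
 * and, on failure, report the reason recorded in errno on stderr. */
FILE *open_or_explain(const char *path) {
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        int saved = errno;   /* save errno before later calls can change it */

        /* strerror() maps an errno value to a human-readable string. */
        fprintf(stderr, "%s: %s\n", path, strerror(saved));

        /* perror() prints "<prefix>: <message>" for the current errno,
         * so restore the saved value first. */
        errno = saved;
        perror(path);
    }
    return fp;   /* NULL on failure, exactly as fopen() returned it */
}
```

Both calls print the same kind of message here; in practice you would pick one of them.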
Once a file is opened,
there are many different ways to read from it.
The one we suggest that you use is fgets()
.
fgets()
yields data from files,
one line at a time.
To print out file contents, just use printf()
.
For example, after reading in a line with fgets()
into a variable buffer
,
you can just print out the buffer as follows:
printf("%s", buffer);
Note that you should not add a newline ("\n"
)
character to the printf()
format string,
because then your output would diverge from the file's true contents.
Just print the exact contents of the read-in buffer
(which will likely include newlines).
Finally, when you are done reading and printing,
use fclose()
to close the file,
thus indicating you no longer need to read from it.
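The read-and-print loop described above might look roughly like the following sketch. The helper name copy_lines and the 256-byte buffer size are assumptions made for illustration; the stream parameters are there only so the same loop can be tried on any open FILE pointer.

```c
#include <stdio.h>

/* Hypothetical helper: copy every line of `in` to `out` unchanged.
 * Returns 0 on success, 1 if a read error occurred. */
int copy_lines(FILE *in, FILE *out) {
    char buffer[256];   /* assumed size; longer lines arrive in several pieces */

    /* fgets() stores at most sizeof(buffer) - 1 characters plus a '\0',
     * and returns NULL at end of file or on error. */
    while (fgets(buffer, sizeof(buffer), in) != NULL) {
        fprintf(out, "%s", buffer);   /* no extra "\n": the buffer already holds it */
    }
    return ferror(in) ? 1 : 0;
}
```

Calling copy_lines(fp, stdout) after a successful fopen(), followed by fclose(fp), would print the file as described above.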
Details
my-cat should be invokable with one or more filename arguments on the command line; my-cat should print out each file in succession.
In all non-error cases, my-cat should exit with status code 0, usually by returning 0 from main() or by calling exit(0).
If no files are specified on the command line, my-cat should exit and return 0. Note that this is slightly different from the behavior of normal Unix cat (it would be a good idea to figure out and document the difference in your README.md).
If my-cat tries to fopen() a file and fails, my-cat should print the exact message:
my-cat: cannot open file
(followed by a newline) and exit with status code 1. If multiple files are specified on the command line, the files should be printed out, in order, until all files are successfully printed OR an error opening a file is reached (at which point the error message is printed and my-cat exits).
Part II: my-grep.c
The second utility you will build is called my-grep
,
a variant of the Unix tool grep
.
At a high level, grep
scans through a file,
line by line,
trying to find a user-specified search term in each line.
If a line contains the search term,
that line is printed out;
otherwise the line is skipped.
Then grep
continues searching subsequent lines.
Standard grep is quite customizable, with many
command-line options to tweak its behavior.
Your my-grep will just implement the basic functionality
described above.
Here is how a user would look for the term "foo
"
in the file bar.txt
:
$ ./my-grep foo bar.txt
this line has foo in it
so does this foolish line; do you see where?
even this line, which has barfood in it, will be printed.
This may seem like a silly thing to implement, or even a waste of time.
I assure you that it is not.
I (Bill) have personally used Unix grep
as a system evaluation benchmark
in multiple published papers.
grep
implements a linear scan through a file
(or if passed the -r
flag,
a recursive linear scan through a directory subtree in modified BFS
order).
When evaluating system performance, this is an important workload
(sequential scan)!
In fact, extending your implementation into a "super grep", and using
your "super grep" to evaluate an
existing system would be a final project that would be very exciting.
Please talk to me if you are curious.
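To make the matching rule in the foo/barfood example above concrete, here is a minimal sketch of a substring test. It assumes the C library function strstr() is an acceptable way to search within a line; the handout does not prescribe it, and the helper name line_matches is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if `term` occurs anywhere in `line`, else 0.
 * strstr() compares bytes exactly, so "foo" is found inside "barfood"
 * but not inside "Foo". */
int line_matches(const char *line, const char *term) {
    return strstr(line, term) != NULL;
}
```

Note that strstr() treats an empty search term as matching at the start of every string, so this sketch would match every line in that case.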
Details
my-grep is always passed a search term and zero or more files to grep through (thus, more than one file is possible). For each file, your my-grep should go through each line and check if the search term is in the line; if the search term is present, the entire line containing the search term should be printed, and if not, the line should be skipped.
The matching is case sensitive: when searching for "foo", lines with "Foo" will not match. This will help simplify your implementation, and it matches default grep behavior.
Lines can be arbitrarily long (i.e., you may read many characters before you encounter a newline character, "\n"). Your my-grep should work as expected even with very long lines. For this reason, you might want to look into the getline() library call (instead of fgets()), or create your own strategy for handling long lines.
If my-grep is passed no command-line arguments, it should print the exact message:
my-grep: searchterm [file ...]
(followed by a newline) and exit with status 1.
If my-grep encounters a file that it cannot open, it should print the exact message:
my-grep: cannot open file
(followed by a newline) and exit with status 1.
In all other cases, my-grep should exit with return code 0.
If a search term is specified but no file is, my-grep should work, but instead of reading from a file, my-grep should read from standard input. Doing so is easy, because the file stream stdin is already open; you can use fgets() (or similar routines) to read from it. The man pages should help with this.
If the search term is the empty string, my-grep can either match NO lines or match ALL lines; both are acceptable. Document your chosen behavior in your README.md file.
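For the long-line requirement above, the sketch below shows one way getline() can be called so that the buffer grows to fit each line. It only measures lines rather than searching them, and the helper name longest_line is hypothetical. Because it takes a FILE pointer, the same function works on a stream returned by fopen() or on stdin.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes getline() in <stdio.h> */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: returns the length in bytes (including any '\n')
 * of the longest line in `in`. Works on any open stream, including stdin. */
size_t longest_line(FILE *in) {
    char *line = NULL;   /* getline() allocates and grows this buffer itself */
    size_t cap = 0;      /* current capacity of `line`, maintained by getline() */
    ssize_t len;
    size_t longest = 0;

    /* getline() returns the number of bytes read, or -1 at EOF or on error. */
    while ((len = getline(&line, &cap, in)) != -1) {
        if ((size_t)len > longest) {
            longest = (size_t)len;
        }
    }
    free(line);          /* the caller owns the buffer and must release it */
    return longest;
}
```

The key design point is that one buffer is reused across calls and freed once at the end, rather than allocated per line.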
Part III: my-zip.c
and my-unzip.c
The next tools you will build come in a pair,
because one (my-zip
) is a
file compression tool,
and the other (my-unzip
)
is a file decompression tool.
The type of compression used here is a simple form of compression called
run-length encoding (RLE).
RLE is quite simple: when you encounter n
characters of the same type in a row,
the compression tool (my-zip
) will
turn that into the number n
and a single instance of the character.
Thus, if we had a file with the following contents:
aaaaaaaaaabbbb
the tool would turn it (logically) into:
10a4b
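The logical transformation above can be sketched on an in-memory string as follows. This is only an illustration of how runs are counted: the helper name rle_describe is hypothetical, and the text form it produces is not the on-disk format the lab requires.

```c
#include <stdio.h>

/* Hypothetical helper: write the *logical* run-length encoding of `text`
 * (for example "10a4b") into `out`, which holds `outsz` bytes.
 * Returns 0 on success, -1 if `out` is too small. */
int rle_describe(const char *text, char *out, size_t outsz) {
    size_t used = 0;

    if (outsz == 0) {
        return -1;
    }
    out[0] = '\0';

    for (size_t i = 0; text[i] != '\0'; ) {
        /* Count how many times text[i] repeats consecutively; the loop
         * stops at the terminating '\0' because text[i] is never '\0'. */
        size_t run = 1;
        while (text[i + run] == text[i]) {
            run++;
        }

        int n = snprintf(out + used, outsz - used, "%zu%c", run, text[i]);
        if (n < 0 || (size_t)n >= outsz - used) {
            return -1;   /* output was truncated */
        }
        used += (size_t)n;
        i += run;        /* jump to the first character of the next run */
    }
    return 0;
}
```

A text form like this becomes ambiguous as soon as the input itself contains digits, which is one reason a fixed-width binary count is easier to decode.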
However, the exact format of the compressed file is quite important; here, you will write out a 4-byte integer in binary format followed by the single character in ASCII. Thus, a compressed file will consist of some number of 5-byte entries, each of which is comprised of a 4-byte integer (the run length) and the single character.
To write out an integer in binary format (not ASCII),
you should use fwrite()
.
Read the man page for more details.
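As a hedged sketch of the fwrite() usage suggested above, the helper below writes one 5-byte entry. The name write_run is hypothetical, and two choices are assumptions rather than requirements from the handout: the count is held in a uint32_t so that it is exactly 4 bytes, and it is written in the host's native byte order.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: append one 5-byte entry (4-byte count, 1-byte
 * character) to `out`. Returns 0 on success, -1 on a write error.
 * Assumption: the count is stored in the host's native byte order. */
int write_run(FILE *out, uint32_t count, char c) {
    /* fwrite(ptr, size, nmemb, stream) returns the number of items written. */
    if (fwrite(&count, sizeof(count), 1, out) != 1) {
        return -1;
    }
    if (fwrite(&c, sizeof(c), 1, out) != 1) {
        return -1;
    }
    return 0;
}
```

The two fields are written separately on purpose: a struct holding a 4-byte integer and a char is usually padded to 8 bytes, so writing the struct in one call would not produce 5-byte entries.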
For my-zip
,
all output should be written to standard output
(the stdout
file stream,
which,
as with stdin
,
is already open when the program starts running).
Note that typical usage of the my-zip
tool
would thus use
shell redirection
in order to write the compressed output to a file.
For example,
to compress the file file.txt
into a (hopefully smaller) file.z
,
you would type:
$ ./my-zip file.txt > file.z
The "greater than" sign is a Unix shell redirection;
in this case,
it ensures that the output from my-zip
is written to the file file.z
(instead of being printed to the screen, i.e.,
all contents directed to standard out are instead redirected to file.z).
The my-unzip
tool implements the inverse behavior of the
my-zip
tool:
my-unzip
takes in a RLE compressed file
and writes (to standard output again)
the uncompressed results.
For example, to recover the contents of file.txt
(which were compressed and written to the file file.z
),
you would type:
$ ./my-unzip file.z
Your my-unzip
should read in the compressed file
(likely using fread()
)
and print out the uncompressed output to standard output using
printf()
.
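Reading one entry back with fread() might look like the sketch below. The helper name expand_run is hypothetical, the count is assumed to be a 4-byte value in native byte order, and fputc() is used in place of printf() only so that the output stream can be passed in as a parameter.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read one 5-byte entry from `in` and write the
 * expanded run to `out`. Returns 1 if an entry was expanded, 0 at a clean
 * end of file, and -1 on a read error or if the input ends mid-entry. */
int expand_run(FILE *in, FILE *out) {
    uint32_t count;
    int c;

    /* Ask for individual bytes so a partial count can be told apart
     * from a clean end of file. */
    size_t got = fread(&count, 1, sizeof(count), in);
    if (got == 0) {
        return ferror(in) ? -1 : 0;
    }
    if (got != sizeof(count)) {
        return -1;
    }

    c = fgetc(in);   /* the single character that follows the count */
    if (c == EOF) {
        return -1;
    }

    for (uint32_t i = 0; i < count; i++) {
        fputc(c, out);
    }
    return 1;
}
```

A caller could invoke this in a loop on each input file until it returns 0, treating -1 as a malformed input.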
Details
If no files are specified on the command line, the program should print the exact usage message:
my-zip: file1 [file2 ...]
(followed by a newline) for my-zip OR
my-unzip: file1 [file2 ...]
(followed by a newline) for my-unzip.
If multiple files are passed to my-zip, they are compressed into a single compressed output, and when unzipped, they will turn into a single uncompressed stream of text (thus, the information that multiple files were originally input into my-zip is lost). The same thing holds for my-unzip.
Evaluation
Your implementations of
my-cat.c
,
my-grep.c
,
my-zip.c
, and my-unzip.c
will be evaluated on correctness
(by comparing output against the standard Unix utilities
and/or reference implementations),
code clarity, C error-handling,
and the specifications in the "Details" sections
of this lab.
Please test and verify your implementation.
You may wish to write your own tests:
the diff
tool is helpful for comparing two files.
You may wish to run the standard Unix
cat
and grep
programs,
redirecting their outputs to a file.
You can then do the same with your implementations.
If you compare the two outputs with diff
,
you will quickly see whether your behavior matches exactly,
and if not, you can see how your output differs.
For my-zip
and my-unzip
,
you may wish to use diff
to compare the uncompressed
version of a file against the original;
although that does not test every case that my-zip
and
my-unzip
must handle (e.g., multiple files as input),
well organized code should generalize to the multi-file cases.
Submitting Your Work
When you have completed the lab, submit your code using the appropriate git commands, such as:
$ git status
$ git add ...
$ git commit -m "final submission"
$ git push
Verify that your changes appear on GitHub by navigating to your private repository using the web interface. It should be available at https://github.com/williams-cs/cs333lab01-{USERNAME}. You should see all changes reflected in the various files that you submit. If not, go back and make sure you committed and pushed. I will be retrieving all lab code from GitHub at the due date, so if your changes are not visible to you on GitHub, they will not be visible to me either. I want to make sure everyone receives credit for their work!
We will know that the files are yours because they are in
your private git repository.
Do not include identifying information in your code or documentation
that you submit.
My goal is to grade your programs anonymously to avoid any bias.
However, in your README.md file,
please cite any sources of inspiration or
collaboration (e.g., conversations about documentation with classmates).
We take the honor code very seriously, and so should you.
Please include the statement "I am the
sole author of the work in this repository."
inside your README.md
file.
Advice
Start early on my-cat.c
.
The remaining three programs have a similar foundation,
so completing my-cat.c
as early as possible
will help you to create a budget/schedule for your remaining time.
Stop by my office hours or ask questions as soon as you get stuck. Even if you only have a minute, I can look at your code on GitHub, answer your question, and leave more detailed feedback for you using the web interface.
You should be able to use GDB on your code. If you are unfamiliar with gdb, Google and your instructor can help (or your classmates, provided you are not talking while actively debugging any code for this assignment and you keep your discussions strictly to GDB mechanics). This GDB info page from Harvey Mudd and this GDB quick reference from the Univ of Texas may also be very helpful.
Acknowledgments
The contents of this lab borrow heavily from a project written by Remzi and Andrea Arpaci-Dusseau as part of their OSTEP course materials.