CSCI 333
Storage Systems
Lab 1: Unix Utilities, System Calls, and C
Assigned | Sunday, 02/03 |
---|---|
Due Date | Friday 02/22 at 11pm. |
Objectives
This assignment serves several purposes, and as a result it may pose different challenges to each of us.
Because there are no TAs for this course, all students will work in pairs. Working collaboratively will also prepare us for life after graduation, where team-based software development is the norm.
Overview
In this lab, we will be implementing our own versions of important
Unix utilities.
These are useful utilities that we should all familiarize
ourselves with for our everyday programming lives,
but they are also utilities that we may use later to analyze/benchmark
the performance of real systems (a possibility for a final project).
We will be writing simplified versions of cat, grep, zip, and unzip.
We will implement a subset of each tool's functionality,
but you may extend them with additional features if desired.
Logistics
We will work in pairs. Each pair will be given a private repository under the Williams-CS GitHub organization, and both partners will have read/write access. Labs will be submitted by committing code to this repository before the due date. Git maintains a log of your commit history, but only your final commit before the deadline will be graded. I highly recommend that you commit code early and often: it will only help.
You will submit four individual .c files, one for each operation.
Each file should have a main function that, when run, performs the
behavior described for that task.
You should include brief documentation in your README.md file
for each program, including how to run and test it.
Honor Code
You may discuss any and all aspects of this lab with your partner.
However, you are prohibited from sharing or discussing your code with
anyone else.
You may discuss the semantics of the Unix utilities
(i.e., the contents of the lab handout and/or the Unix man
pages),
and debugging strategies with your classmates,
but you should never look at any full or partial solution that is
not your own.
Starter Code
There is no "starter code" for this lab, but you will need to clone your
group's private GitHub repository in order to commit your code.
Each repository will have a README.md file with additional links and details.
You can clone your group's private repository at https://github.com/williams-cs/cs333lab1-{USERNAMEs}.
Where you store the repository on your machine is up to you, but all of your work should be completed and submitted
in that repository.
Part I: my-cat.c
The program my-cat is a simple program.
At a high level, it reads a file as specified by the user and prints
that file's contents to standard output.
The typical usage is as follows,
in which the user wants to see the contents of main.c, and thus types:
$ ./my-cat main.c
#include ...
#include ...
... rest of main.c ...
As shown, my-cat reads the file main.c and prints out its contents.
The "./" before the my-cat above is a UNIX thing; it just tells the system which directory to find
my-cat in (in this case, in the "." (dot) directory, which means the current working directory).
To create the my-cat binary,
you'll be creating a single source file, my-cat.c,
and writing a little C code to implement this simplified version of cat.
To compile this program, you will do the following:
$ gcc -o my-cat my-cat.c -Wall -Werror
This will make a single executable binary called my-cat which you can then run as above.
You'll need to learn how to use a few library routines from the C standard
library (often called libc) to implement the source code for this program,
which we'll assume is in a file called my-cat.c. All C code is
automatically linked with the C library, which is full of useful functions you
can call to implement your program. Learn more about the C library
here.
There is a lot to learn about the C library, so at some point you will
simply need to read the manual pages.
Helpful Commands
For this project, we recommend using the following routines to do file input
and output: fopen(), fgets(), and fclose(). Whenever you use a new
function like this, the first thing you should do is read about it -- how else
will you learn to use it properly?
On UNIX systems, the best way to read about such functions is to use what are
called the man pages (short for manual). In our HTML/web-driven world,
the man pages feel a bit antiquated, but they are useful and informative and
generally quite easy to use.
To access the man page for fopen(), for example, just type the following
at your UNIX shell prompt:
$ man fopen
Then, read! Reading man pages effectively takes practice; why not start learning now?
The command `man fopen` from the command line will open
up the manual page in the same interface as the `less` command.
Use the arrows to navigate and type `q` to quit.
In fact, I suggest using the command `man less`
to explore how to navigate man pages!
Your time investment will quickly pay off.
We will also give a simple overview here.
The fopen() function "opens" a
file, which is a common way in UNIX systems to begin the process of file
access.
In this case, opening a file just gives you back a pointer to a
structure of type FILE,
which can then be passed to other routines to
read, write, etc.
Here is a typical usage of fopen():
FILE *fp = fopen("main.c", "r");
if (fp == NULL) {
    printf("cannot open file\n");
    exit(1);
}
A couple of points here:
First, note that fopen() takes two arguments:
the name of the file and the mode.
The latter just indicates what we plan to do with the file.
In this case, because we wish to read the file, we pass "r"
as the second argument. Read the man pages to see what other options are
available.
Second, note the critical checking of whether the fopen() actually succeeded.
This is not Java where an exception will be thrown when things go wrong;
rather, it is C, and it is expected
(in good programs, i.e., the only kind you'd want to write)
that you always will check if the call succeeded.
Reading the man page tells you the details of what is returned when
an error is encountered; in this case, the macOS man page says:
Upon successful completion fopen(), fdopen(), freopen() and fmemopen() return a FILE pointer. Otherwise, NULL is returned and the global variable errno is set to indicate the error.
Thus, as the code above does, please check that fopen() does not return
NULL before trying to use the FILE pointer it returns.
Third, note that when the error case occurs, the program prints a message and then exits with error status of 1. In UNIX systems, it is traditional to return 0 upon success, and non-zero upon failure. Here, we will use 1 to indicate failure.
Side note: if fopen() does fail,
there are many possible reasons why.
You can use the functions perror() or strerror()
to print out more about why the error occurred;
learn about those on your own
(using ... you guessed it ... the man pages!).
Once a file is open, there are many different ways to read from it.
The one we're suggesting here to you is fgets(),
which is used to get input from files, one line at a time.
To print out file contents, just use printf().
For example, after reading in a line with fgets() into a variable buffer,
you can just print out the buffer as follows:
printf("%s", buffer);
Note that you should not add a newline ("\n") character to the printf(),
because that would be changing the output of the file to have extra
newlines.
Just print the exact contents of the read-in buffer (which, of
course, may include a newline).
Finally, when you are done reading and printing,
use fclose() to close the
file (thus indicating you no longer need to read from it).
Details
my-cat can be invoked with one or more files on the command line; it should just print out each file in turn.
In all non-error cases, my-cat should exit with status code 0, usually by returning a 0 from main() (or by calling exit(0)).
If no files are specified on the command line, my-cat should just exit and return 0. Note that this is slightly different than the behavior of normal UNIX cat (if you'd like to, figure out the difference).
If the program tries to fopen() a file and fails, it should print the exact message "my-cat: cannot open file" (followed by a newline) and exit with status code 1. If multiple files are specified on the command line, the files should be printed out in order until the end of the file list is reached or an error opening a file is reached (at which point the error message is printed and my-cat exits).
Part II: my-grep.c
The second utility you will build is called my-grep, a variant of the UNIX
tool grep. This tool looks through a file, line by line, trying to find a
user-specified search term in the line. If a line has the word within it, the
line is printed out, otherwise it is not.
Here is how a user would look for the term "foo" in the file bar.txt:
$ ./my-grep foo bar.txt
this line has foo in it
so does this foolish line; do you see where?
even this line, which has barfood in it, will be printed.
This may seem like a silly thing to implement, or even a waste of time.
I assure you that it is not.
I (Bill) have personally used Unix grep in multiple
published papers as an evaluation benchmark because grep
implements a linear scan through a file (or, if passed the
-r flag, a recursive linear scan through a directory subtree).
When evaluating read performance, this is an important workload!
In fact, extending your implementation into a "super grep", and using
your "super grep" to evaluate an
existing system is a final project that would be very exciting!
Please talk to me if you are curious.
Details
my-grep is always passed a search term and zero or more files to grep through (thus, more than one file is possible). Your my-grep should go through each line and see if the search term is in the line; if so, the entire line containing the search term should be printed, and if not, the line should be skipped.
Matching is case sensitive. Thus, if searching for "foo", lines with "Foo" will not match. This should help simplify your implementation.
Lines can be arbitrarily long (that is, you may see many characters before you encounter a newline character, "\n"). Your my-grep should work as expected even with very long lines. For this, you might want to look into the getline() library call (instead of fgets()), or roll your own.
If my-grep is passed no command-line arguments, it should print the exact message "my-grep: searchterm [file ...]" (followed by a newline) and exit with status 1.
If my-grep encounters a file that it cannot open, it should print the exact message "my-grep: cannot open file" (followed by a newline) and exit with status 1.
In all other cases, my-grep should exit with return code 0.
If a search term but no file is specified, my-grep should work, but instead of reading from a file, my-grep should read from standard input. Doing so is easy, because the file stream stdin is already open; you can use fgets() (or similar routines) to read from it.
If passed the empty string as the search term, my-grep can either match NO lines or match ALL lines; both are acceptable.
Part III: my-zip.c and my-unzip.c
The next tools you will build come in a pair,
because one (my-zip) is a file compression tool,
and the other (my-unzip) is a file decompression tool.
The type of compression used here is a simple form of compression called
run-length encoding (RLE).
RLE is quite simple: when you encounter n
characters of the same type in a row,
the compression tool (my-zip) will
turn that into the number n
and a single instance of the character.
Thus, if we had a file with the following contents:
aaaaaaaaaabbbb
the tool would turn it (logically) into:
10a4b
However, the exact format of the compressed file is quite important; here, you will write out a 4-byte integer in binary format followed by the single character in ASCII. Thus, a compressed file will consist of some number of 5-byte entries, each of which is comprised of a 4-byte integer (the run length) and the single character.
To write out an integer in binary format (not ASCII), you should use
fwrite(). Read the man page for more details. For my-zip, all
output should be written to standard output (the stdout file stream,
which, as with stdin, is already open when the program starts running).
Note that typical usage of the my-zip tool would thus use shell
redirection in order to write the compressed output to a file. For example,
to compress the file file.txt into a (hopefully smaller) file.z,
you would type:
$ ./my-zip file.txt > file.z
The "greater than" sign is a UNIX shell redirection;
in this case,
it ensures that the output from my-zip is written to the file file.z
(instead of being printed to the screen).
The my-unzip tool simply does the reverse of the my-zip tool,
taking in a compressed file and writing (to standard output again)
the uncompressed results.
For example, to recover the contents of file.txt
(which were compressed and written to the file file.z),
you would type:
$ ./my-unzip file.z
Your my-unzip should read in the compressed file
(likely using fread())
and print out the uncompressed output to standard output using printf().
Details
If no files are specified on the command line, the program should print "my-zip: file1 [file2 ...]" (followed by a newline) or "my-unzip: file1 [file2 ...]" (followed by a newline) for my-zip and my-unzip respectively, and exit with status 1.
If multiple files are passed to my-zip, they are compressed into a single compressed output, and when unzipped, will turn into a single uncompressed stream of text (thus, the information that multiple files were originally input into my-zip is lost). The same thing holds for my-unzip.
Evaluation
my-cat.c, my-grep.c, my-zip.c, and my-unzip.c
will be evaluated on correctness (by comparing
output against the standard Unix utilities and/or reference implementations),
code clarity, C error-handling, and the specifications in the "Details" sections
of this lab.
Please test and verify your implementation.
You may wish to write your own tests.
Submitting Your Work
When you have completed the lab, submit your code using the appropriate git commands, such as:
$ git status
$ git add ...
$ git commit -m "final submission"
$ git push
Verify that your changes appear on GitHub. Navigate in your web browser to your group's private repository on GitHub. It should be available at https://github.com/williams-cs/cs333lab1-{USERNAMES}. You should see all changes reflected in the various files that you submit. If not, go back and make sure you committed and pushed.
We will know that the files are yours because they are in your git repository. Do not include identifying information in the code that you submit. Our goal is to grade the programs anonymously to avoid any bias. However, in your README.md file, please cite any sources of inspiration or collaboration (e.g., conversations with classmates). We take the honor code very seriously, and so should you. Please include the statement "I am the sole author of the work in this repository." inside your README.md file.
Advice
Start early on my-cat.c.
The remaining three programs have a similar foundation,
so completing my-cat.c as early as possible will help you to
create a budget/schedule for your remaining time.
Stop by my office hours or ask questions as soon as you get stuck.
You should be able to use GDB on your code. If you are unfamiliar with gdb, Google and your instructor can help (or your classmates, provided you are not debugging any code for this assignment and keep your discussions strictly to GDB mechanics). This GDB info page from Harvey Mudd and this GDB quick reference from the Univ of Texas may also be very helpful.
The contents of this lab borrow heavily from a project written by Remzi and Andrea Arpaci-Dusseau as part of their OSTEP course materials.