Last time we looked at code to read from a file. That code looked like this:
// create a stream to access dictionary file
fileReader = new BufferedReader(new FileReader("file.txt"));

// get the first line of the file
nextLine = fileReader.readLine();

// Read lines until we reach the end of the file
while (nextLine != null) {
    // Code to process the line omitted

    // get the next line from the file
    nextLine = fileReader.readLine();
}

// Close the file
fileReader.close();

How different are Web pages from files? There are two obvious differences: first, the text of a Web page is written in HTML; second, the page usually lives on a different machine from the one running our program.
The fact that the text in a Web page is HTML is only an issue if we want to display it nicely, like a Web browser does. HTML is, in fact, just text but with a funny syntax, in the same way that a Java file is just text with a funny syntax.
What we care about is the second point. The Web page is probably on a different machine. Because of this, when we read a Web page in Java, we can't use a FileReader as above. Fortunately, Java makes it easy for us to read Web pages with a URL. We need to construct the stream reader differently, but after we do that the code to read from the stream is identical. Here it is:
// Note the different way of constructing a stream to read a URL
// over the network
pageReader = new BufferedReader(new InputStreamReader(url.openStream()));

// Read the first line
nextLine = pageReader.readLine();

// Loop until all the lines are read
while (nextLine != null) {
    // Code to process next line omitted
    nextLine = pageReader.readLine();
}

// Close the stream
pageReader.close();
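Since the reading loop never looks at where the stream came from, we can factor it into one method that both cases share. What follows is only a minimal sketch of that idea: the class name LineCounter and its method names are made up for illustration and are not part of the lecture code, and the "processing" is simply counting lines.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Hypothetical helper class, not from the lecture code.
class LineCounter {

    // Counts the lines on any stream, using the same loop for every source.
    static int countLines(BufferedReader reader) throws IOException {
        int count = 0;
        String nextLine = reader.readLine();
        while (nextLine != null) {
            count++;                        // stand-in for "process the line"
            nextLine = reader.readLine();
        }
        reader.close();
        return count;
    }

    // A local file: wrap a FileReader.
    static int countLinesInFile(String fileName) throws IOException {
        return countLines(new BufferedReader(new FileReader(fileName)));
    }

    // A URL: wrap the stream that the URL object opens for us.
    static int countLinesAtUrl(URL url) throws IOException {
        return countLines(new BufferedReader(new InputStreamReader(url.openStream())));
    }
}
```

Only the two one-line wrapper methods differ; everything after the reader is constructed is shared.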
Now, let's talk a bit about how networking programs work. When we write a networking program like the HTMLLinkFinder or the SpamFilter from lab, our program is communicating with another program over the network. Our program is a client. The other program is a server. In the case of HTMLLinkFinder, we are communicating with a Web server. In the case of the SpamFilter, we are communicating with a POP (Post Office Protocol) server which is one type of server that can be used to read mail.
Let's talk about how a Web client/server pair works a bit more. First, we'll look at URLs and understand what they mean better. A simple URL may look like:
http://www.williams.edu
A more complicated URL is:
http://www.williams.edu/index.html
What are all these pieces?
To connect to a server, we need to identify the machine and the port to use. Think of the port as something like a telephone extension. One machine may run several servers, so we identify the specific server we want by identifying its port. Common servers, like Web servers, have default port numbers. The default port number for a Web server is 80. If a server is running at its default port, we can omit the port number in the URL.
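To see these pieces from inside a program, we can ask Java's URL class to take a URL apart. The sketch below is illustrative only: the class UrlParts and the method effectivePort are invented names, and the URL is the example from above.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper, used only to illustrate the pieces of a URL.
class UrlParts {

    // Returns the port a client would connect to: the explicit port if the
    // URL names one, otherwise the protocol's default (80 for http).
    static int effectivePort(URL url) {
        int port = url.getPort();           // -1 when the URL omits the port
        return (port == -1) ? url.getDefaultPort() : port;
    }

    public static void main(String[] args) throws MalformedURLException {
        URL url = URI.create("http://www.williams.edu/index.html").toURL();
        System.out.println("protocol: " + url.getProtocol());
        System.out.println("machine:  " + url.getHost());
        System.out.println("port:     " + effectivePort(url));
        System.out.println("file:     " + url.getPath());
    }
}
```

For this URL the program reports the protocol http, the machine www.williams.edu, the port 80 (filled in from the default, since the URL omits it), and the file /index.html.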
Once connected, the protocol defines a set of commands that we can use to talk to the server. The protocol also specifies the arguments that each command requires and the return values that it produces.
The HTTP protocol, in the simple form we use here, needs just one command: GET, followed by the path of the page we want.
For example, if I am connected to cs.williams.edu, I can get the home page for 134 by sending the string "GET /~cs134/index.html" to the Web server.
The easiest way to get a feeling for protocols is to run a program called telnet from a command line (like a DOS shell, or Mac Terminal). With telnet, we can send commands like the GET line above to the server just by typing them on the keyboard and hitting return. The responses made by the server appear on the terminal.
Here is a sample session using telnet to download a page from a Web server:
First we connect to the Web server. Using telnet we must explicitly state what the port number is.
-> telnet cs.williams.edu 80
The Web server responds:
Trying 137.165.8.2...
Connected to cs.williams.edu.
Escape character is '^]'.
Now we request the home page for cs134:
GET /~cs134/index.html
The server responds with the Web page:
<head>
<title> CSCI 134 </title>
</head>
<html>
<body>
<p class="title"> CSCI 134 </p>
<p class="title"> Introduction to Computer Science </p>
<p class="box">
<a href="index.html">Home</a> |
<a href="lectures.html">Lectures</a> |
<a href="handouts.html">Handouts</a> |
<a href="links.html">Resources</a>
</p>
<p class="heading"> Home </p>
<p class="info">
<table>
<tr> <td width="150px">Instructors:</td><td width="40%"><a href="http://www.cs.williams.edu/~andrea">Prof. Andrea Danyluk</a>
<td><a href="http://www.cs.williams.edu/~freund">Prof. Stephen Freund</a>
<tr> <td>Email: </td> <td><a href="mailto:andrea@cs.williams.edu">andrea@cs.williams.edu</a></td>
<td><a href="mailto:freund@cs.williams.edu">freund@cs.williams.edu</a>
<tr> <td>Phone: </td> <td>x2178</td> <td>x4260
<tr> <td>Office:</td> <td> TCL 305</td> <td> TPL 302
<tr> <td>Office Hours: </td><td> Wed 1-2:30</td> <td> Tue 2:30 - 4:30 </td></tr>
<tr><td>TA Hours:</td><td>Mon 7-10, Tue-Thurs 7-11, and Thurs 4-6 in TCL 217 </td></tr>
</table>
<br>
</p>
<p class="heading"> Course Description </p>
<p><p>
<p class="text">
This course introduces fundamental ideas in computer science and builds skills in the design, implementation, and testing of computer programs. Students implement algorithms in the Java programming language with a strong focus on constructing correct, understandable, and efficient programs. Students explore the material through specific application areas. Topics covered include object-oriented programming, control structures, arrays, recursion, and event-driven programming. This course is appropriate for all students who want to create software and have little or no prior computing experience.
<p class="bottom">
</p>
</body>
</html>
The HTTP protocol automatically closes the connection after returning the Web page. If I wanted another Web page, I would need to create another connection to get the second page.
Now, let's consider how to make an HTTP connection in Java. (Because of Java's URL class, we don't actually need to do this, but it helps us understand how the network connection is created and used.)
We first need to create a connection. To do this we create a Socket, passing in the name of the machine and the port number. For example, to connect to the CS Department's Web server we would say:
Socket tcpSocket = new Socket("dept.cs.williams.edu", 80);
A socket is a lot like a phone connection. When a phone connection is established, you can think of it as two connections bundled together. One of these connections allows you to speak into your mouthpiece and be heard via the earpiece on the other end. The other connection allows the other person to speak into their mouthpiece and for you to hear what they are saying in your earpiece.
A socket is similarly composed of two streams. The client writes on one stream which the server reads from. This stream is used to send commands from the client to the server. The server writes on another stream which the client reads from. The server uses this stream to send the results of commands to the client.
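To watch both streams at work without a real server, we can run a toy server inside the same program. Everything here is an assumption made for the demonstration: the class TwoStreamsDemo, the "You said:" reply, and the choice to listen on the local loopback address with whatever free port the system hands out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demonstration class; the "server" is a toy running in the
// same program so that the example needs no real network.
class TwoStreamsDemo {

    // Starts a toy server that reads one line and answers with one line.
    // Returns the port it is listening on.
    static int startToyServer() throws IOException {
        ServerSocket listener =
            new ServerSocket(0, 1, InetAddress.getLoopbackAddress()); // 0 = any free port
        Thread serverThread = new Thread(() -> {
            try (Socket connection = listener.accept();
                 BufferedReader fromClient = new BufferedReader(
                     new InputStreamReader(connection.getInputStream()));
                 PrintWriter toClient =
                     new PrintWriter(connection.getOutputStream(), true)) {
                String command = fromClient.readLine();    // read what the client wrote
                toClient.println("You said: " + command);  // write on the server's own stream
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try { listener.close(); } catch (IOException ignored) { }
            }
        });
        serverThread.setDaemon(true);
        serverThread.start();
        return listener.getLocalPort();
    }

    // Client side: write a command on one stream, read the reply on the other.
    static String sendCommand(int port, String command) throws IOException {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
             PrintWriter toServer = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader fromServer = new BufferedReader(
                 new InputStreamReader(socket.getInputStream()))) {
            toServer.println(command);
            return fromServer.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = startToyServer();
        System.out.println(sendCommand(port, "hello"));
    }
}
```

Each side holds one stream for writing and one for reading; the client's output stream is the server's input stream, and the other way around.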
The next step in writing a client in Java is to get the stream to write to and the stream to read from and create a writer and a reader for them. Notice that we need to make slightly different calls to create the reader and writer, but once created we can use them in the same way as earlier streams:
// Get the stream reader and writer to enable communication
input = new BufferedReader(
    new InputStreamReader(tcpSocket.getInputStream()));
output = new PrintWriter(tcpSocket.getOutputStream(), true);
Now that the connection is established, we can request a Web page by writing the GET command on the output stream:
output.println("GET " + pageName);
To get the cs134 home page, we would set pageName to "/~cs134/index.html".
Now, we need to read the Web page that the server returns over the network connection. This is a loop in a style similar to what we have seen before:
String response = "";
String curline = input.readLine();
while (curline != null) {
    response = response + "\n" + curline;
    curline = input.readLine();
}
return response;
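Putting the steps together gives one complete method. This is a sketch rather than the lab's actual code: the class name SimplePageFetcher and the method name fetchPage are assumptions, it uses a StringBuilder in place of repeated string concatenation, and it ends each line with a newline where the loop above starts each line with one. It also assumes the server still answers a bare GET line, which not every present-day Web server does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Hypothetical class name; the body follows the steps described above.
class SimplePageFetcher {

    // Connects to the server, sends a bare GET command, and returns
    // everything the server sends back before it closes the connection.
    static String fetchPage(String host, int port, String pageName) throws IOException {
        try (Socket tcpSocket = new Socket(host, port)) {
            BufferedReader input = new BufferedReader(
                new InputStreamReader(tcpSocket.getInputStream()));
            PrintWriter output = new PrintWriter(tcpSocket.getOutputStream(), true);

            // Send the request
            output.println("GET " + pageName);

            // Read lines until the server closes its end of the connection
            StringBuilder response = new StringBuilder();
            String curline = input.readLine();
            while (curline != null) {
                response.append(curline).append("\n");
                curline = input.readLine();
            }
            return response.toString();
        }
    }
}
```

A call such as fetchPage("cs.williams.edu", 80, "/~cs134/index.html") mirrors the telnet session shown earlier. The loop ends only because the server closes the connection, which is what makes readLine return null.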
Talking to other types of servers is done in a similar fashion. In all cases, we create a socket and then extract the input and output streams so that we can send commands and read replies. What differs is the protocol, that is, the list of commands that the server will understand. For each command, we need to decide how to process it.
The default port for a POP server is 110.
Here are the commands provided by a POP server:
USER username
This command tells the POP server which user's mail to read. Replace username with the name of the account to log into. This command responds with a single line which starts with the string +OK.
PASS password
This command tells the POP server the password corresponding to the previous user account. Replace password with the user's actual password. This command responds with a single line. If the line begins with +OK, the login worked. If the login failed, it returns a line beginning with -ERR.
The remaining commands will only work after a successful login.
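The login exchange can be sketched in Java as follows. The class PopLogin and its methods are invented for illustration and are not the SpamFilter lab code. The reader and writer are assumed to be attached to a socket connected to a POP server, and each command is ended with an explicit carriage return and line feed, which is how the POP specification terminates commands.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical helper for the login exchange; not the SpamFilter lab code.
class PopLogin {

    // Sends one command line, ended the way POP expects, and flushes it.
    static void send(PrintWriter output, String command) {
        output.print(command + "\r\n");
        output.flush();
    }

    // Sends USER and PASS, and reports whether the server accepted both.
    static boolean login(BufferedReader input, PrintWriter output,
                         String username, String password) throws IOException {
        send(output, "USER " + username);
        String reply = input.readLine();
        if (reply == null || !reply.startsWith("+OK")) {
            return false;                   // server rejected the user name
        }
        send(output, "PASS " + password);
        reply = input.readLine();
        return reply != null && reply.startsWith("+OK");
    }
}
```

Because the method only needs a reader and a writer, it can be tried out against canned replies held in a string before pointing it at a real mail server.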
STAT
This returns some simple statistics about the mailbox. The line it returns has the following form:
+OK 3 496
The first number reports the number of messages in the mailbox. The second reports the total size of all the messages, in bytes.
TOP 1 0
This returns the header of the message followed by some number of lines. The first number identifies the message number to return. The second number indicates how many lines of the message body to return. As shown above, the command would return only the header for the 1st message.
This command first replies with a line that begins either with +OK or -ERR. If the line begins with +OK, the server then sends multiple lines that are the header and the number of body lines requested. You can tell when it is done sending lines because the last line will consist of a single period character (.).
RETR 1
This command is a lot like TOP except that it returns the entire message requested, both header and body. The number you provide is the message number.
This command first replies with a line that begins either with +OK or -ERR. If the line begins with +OK, the server then sends multiple lines that are the header and complete body of the mail message requested. You can tell when it is done sending lines because the last line will consist of a single period character (.).
QUIT
This command ends the connection with the mail server. It always returns with a single line beginning with +OK.
You can try out these commands using telnet to connect to a mail server on port 110. Just type in the commands shown above and see the responses that you get back.
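In a Java client, the replies described above have to be picked apart by the program. The following sketch shows one way to do that for the STAT line and for the multi-line replies of TOP and RETR. The class PopReplies and its method names are hypothetical, and the code assumes the first +OK line of a multi-line reply has already been read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helpers for interpreting POP replies.
class PopReplies {

    // Extracts the message count from a STAT reply such as "+OK 3 496".
    static int messageCount(String statReply) {
        String[] parts = statReply.trim().split("\\s+");
        if (parts.length < 3 || !parts[0].equals("+OK")) {
            throw new IllegalArgumentException("Unexpected STAT reply: " + statReply);
        }
        return Integer.parseInt(parts[1]);
    }

    // Collects the lines of a TOP or RETR reply, stopping at the line that
    // consists of a single period. That final line is not included.
    static List<String> readUntilPeriod(BufferedReader input) throws IOException {
        List<String> lines = new ArrayList<>();
        String line = input.readLine();
        while (line != null && !line.equals(".")) {
            lines.add(line);
            line = input.readLine();
        }
        return lines;
    }
}
```

Unlike the Web page loop, this loop cannot wait for readLine to return null, because the mail server keeps the connection open for the next command; the single period is the only end marker.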