Lab 6 : Social Networks

Objective
  • Test the usability and versatility of your graph ADT design by implementing a second client.
  • Use basic OO design features in Swift: extensions, inheritance, subtyping, etc.

Overview

Social networks capture relationships, for example, among friends on Facebook, cows in a herd, members of the seven families in “Game of Thrones”, etc. You will implement a program to model and answer questions about social networks. Specifically, after developing the machinery to load and manipulate social networks, you will build an app to visualize them and allow the user to explore how members of a social network are related.

You will also demonstrate the scalability of your Graph ADT by modeling a graph capturing the Marvel Comics universe of characters, which contains thousands of nodes and hundreds of thousands of edges. (One interesting reason to use this particular dataset is that researchers at Cornell University who published a research paper showed that its graph is strikingly similar to “real-life” social networks.) At that size, you may discover performance issues that weren’t revealed by your unit tests on smaller graphs. With a well-designed implementation, your program will run in a matter of seconds, but bugs or sub-optimal data structures can increase the runtime to anywhere from several minutes to 30 minutes or more. If this is the case, you may want to go back and revisit your graph implementation. Remember that different graph representations have widely varying time complexities for various operations and this, not a coding bug, may explain the slowness. We’ll discuss performance profiling techniques and remediation strategies at the beginning of lab.

When modeling a social network as a graph, each node represents a member of the network and each edge represents a relationship between two members. For example, in the Marvel Comics universe, an edge from Zeus to Hawk indicates that Zeus appeared in a comic book that Hawk also appeared in. The label of that edge is, in this case, the name of the book. If Zeus and Hawk appeared together in five issues, then Zeus would have five edges to Hawk, and Hawk would have five edges to Zeus. Thus, the graph for Marvel Comics is symmetric (for each edge from Zeus to Hawk, there is an analogous edge from Hawk to Zeus), but that is not the case for all social networks. (Think of the graph capturing the asymmetric “follows” relationship for Twitter’s social network or the mother/father relationship in a family tree.)

Below I’ve defined several clear Problems to tackle, but doing them in strict order may not be the most expedient route. I recommend proceeding as follows:

  1. Problem 0: Project Set Up.
  2. Problem 1: Building the Graph and enough of Problem 3: Testing Your Solution to ensure your graph loading code is correct.
  3. Problem 2: Finding Paths and enough of Problem 3: Testing Your Solution to ensure your shortest paths code is correct.
  4. Problem 4: A Command-Line Tool to provide a simple program to use on the command line.

Problem 0: Project Set Up

We will continue using the same repository as last week. This week, your code will go in the GraphProjects/SocialNetworks directory. Open the GraphProjects.xcworkspace in the top-level directory of your repository so you have access to both that code and the GraphADT you designed last week. The SocialNetworks folder structure is the same as that of the GraphADT directory. You’ll want to switch to using the socialnet scheme with the “Your Mac” target.

Question
  1. Create a new file SocialNetworks/README.md in your project.

    As you complete this assignment, you may need to further modify the implementation and perhaps the public interface of your Graph ADT. Briefly document any changes you made and why in README.md (no more than 1-2 sentences per change). If you made no changes, state that explicitly.

    As above, you don’t need to track and document cosmetic and other minor changes, such as renaming a variable; we are interested in substantial changes to your API or implementation, such as adding a public method or using a different data structure. Describe logical changes rather than precisely how each line of your code was altered. For example, “I switched the data structure for storing all the nodes from a ___ to a ___ because ___” is more helpful than “I changed line 27 from nodes = ___(); to nodes = ____();.”

Problem 1: Building the Graph

Your first step is to implement code to create a graph from a data file. We will again use JSON-encoded data files, with the following specified format.

Data File Format

As in previous labs, a data file contains a list of nodes. It may also contain an optional array of edges, where the label of an edge describes how the source node is related to the destination node. A simple example is below:

{
  "nodes" : ["Abba", "Ava", "Clementine", "Buttercup", ... ],
  "edges" : [
    { "src" : "Abba", "label" : "friend", "dst" : "Buttercup" },
    { "src" : "Abba", "label" : "frenemy", "dst" : "Clementine" },
    { "src" : "Ava", "label" : "mother", "dst" : "Abba" },
    ...
  ],
  "locations" : ...
}
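To make the format concrete, here is a minimal sketch of pulling the nodes and edges out of such a file using Foundation's JSONSerialization. The sample text, the EdgeEntry struct, and the readNodesAndEdges function are hypothetical names chosen for illustration; they are not part of the lab's starter code or of your Graph ADT.

```swift
import Foundation

// Hypothetical sample data in the format above (no "locations" entry).
let sampleJSON = """
{
  "nodes" : ["Abba", "Ava", "Clementine", "Buttercup"],
  "edges" : [
    { "src" : "Abba", "label" : "friend", "dst" : "Buttercup" },
    { "src" : "Ava", "label" : "mother", "dst" : "Abba" }
  ]
}
"""

// A plain value type for one labeled edge; an illustration only.
struct EdgeEntry: Equatable {
    let src: String
    let label: String
    let dst: String
}

// Reads the node names and the optional edge array from a decoded JSON dictionary.
func readNodesAndEdges(from json: [String: Any]) -> (nodes: [String], edges: [EdgeEntry]) {
    let nodes = json["nodes"] as? [String] ?? []
    // "edges" is optional in the file format, so fall back to an empty array.
    let rawEdges = json["edges"] as? [[String: Any]] ?? []
    let edges = rawEdges.compactMap { entry -> EdgeEntry? in
        guard let src = entry["src"] as? String,
              let label = entry["label"] as? String,
              let dst = entry["dst"] as? String else {
            return nil
        }
        return EdgeEntry(src: src, label: label, dst: dst)
    }
    return (nodes, edges)
}

let decoded = try! JSONSerialization.jsonObject(with: Data(sampleJSON.utf8)) as! [String: Any]
let parsed = readNodesAndEdges(from: decoded)
print(parsed.nodes.count, parsed.edges.count)
```

In your own solution, the loop over the decoded entries would call your Graph's methods for adding nodes and edges rather than building an array.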

Ignore the locations part of the files for now – those locations will be used when you implement your app below. While this encoding is sufficient, it is not always convenient or efficient, so we’re going to extend our data representation to include an optional array of properties, where each property entry includes a node name and a property that that node has:

{
  "nodes" : [ ... ],
  "edges" : [ ... ],
  "properties" : [
    ...,
    { "node": "ZEUS", "property": "KZ 1/2" },
    { "node": "HAWK", "property": "KZ 1/2" },
    { "node": "HERCULES [GREEK GOD]", "property": "KZ 1/2" },
    ...
  ],
  "locations" : ...
}

When creating a Graph, every node with a certain property should be connected to every other node with that property. Properties make it easier to organize our data, and they enable us to create much smaller files for large datasets in which many members are related, as in the Marvel Comics data set. Specifically, a property shared among \(n\) nodes can be encoded with \(n\) property entries but would require on the order of \(n^2\) edge entries (\(n(n-1)\), counting one edge in each direction between every pair).
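One way to picture the expansion from properties to edges is the sketch below. It works on plain tuples rather than your Graph ADT, the function name expandProperties is hypothetical, and it assumes each (node, property) pair appears at most once in the input.

```swift
// Hypothetical property entries: (node, property) pairs as they appear in the file.
let propertyEntries: [(node: String, property: String)] = [
    ("ZEUS", "KZ 1/2"),
    ("HAWK", "KZ 1/2"),
    ("HERCULES [GREEK GOD]", "KZ 1/2"),
]

// Groups nodes by property, then emits one labeled edge for every ordered pair
// of distinct nodes that share the property.
func expandProperties(_ entries: [(node: String, property: String)])
    -> [(src: String, label: String, dst: String)] {
    var membersByProperty: [String: [String]] = [:]
    for entry in entries {
        membersByProperty[entry.property, default: []].append(entry.node)
    }
    var edges: [(src: String, label: String, dst: String)] = []
    for (property, members) in membersByProperty {
        for src in members {
            for dst in members where dst != src {
                edges.append((src: src, label: property, dst: dst))
            }
        }
    }
    return edges
}

let expanded = expandProperties(propertyEntries)
print(expanded.count)  // 3 nodes sharing one property give 3 * 2 = 6 directed edges
```

Grouping by property first keeps the work proportional to the number of edges produced, rather than comparing every property entry against every other entry.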

Note that a single file can have both edges and properties in it. Have a look at sample files small-cows.json and cows-properties.json to be sure you understand the representation.

Creating the Graph

Since the JSON file format is specific to this one particular application, code to read it and create a graph should not be included in the graph ADT. Instead, you should place the code to create a graph from JSON data in a new initializer defined in an extension of Graph. It will take the JSON data as a parameter and construct a new graph with the nodes, edges, and properties in the file. Recall that such an extension would be written:

extension Graph {
    public convenience init(jsonForGraphWithProperties json : [String:Any]) {
        self.init()
        ...
    }
}

You may assume the JSON data is well-formed. This extension should be put in a new SocialNetworksGraphExtension.swift file that is part of the “SocialNetworks” target. As with other Graph initializers, you should include a call to checkRep() at the end of it. But what error do you get if you try? In this particular case, changing the checkRep method from private to internal is probably the best choice since Swift doesn’t provide us with a protected access level. However, I’d document that method to indicate that it should not be used outside of Graph and its extensions, so as to at least warn clients not to use it, even if we can’t prevent it.
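The shape of this arrangement can be sketched with a deliberately tiny stand-in class. TinyGraph, its representation, and the init(pairs:) initializer are all hypothetical and are not meant to resemble your Graph ADT; the point is only the documented internal checkRep and the convenience initializer living in an extension.

```swift
// A deliberately tiny stand-in class, NOT the Graph ADT from the previous lab.
public class TinyGraph {
    // Maps each node name to the names it has an edge to.
    private var adjacency: [String: Set<String>] = [:]

    public init() {}

    public var nodeCount: Int { adjacency.count }

    public func addNode(_ name: String) {
        if adjacency[name] == nil { adjacency[name] = [] }
        checkRep()
    }

    public func addEdge(from src: String, to dst: String) {
        addNode(src)
        addNode(dst)
        adjacency[src]!.insert(dst)
        checkRep()
    }

    /// Verifies the representation invariant: every edge ends at a known node.
    /// Intended for use only by TinyGraph and its extensions. It is internal
    /// rather than private so that an extension in another file can call it.
    internal func checkRep() {
        for (_, neighbors) in adjacency {
            for neighbor in neighbors {
                assert(adjacency[neighbor] != nil, "edge to a missing node")
            }
        }
    }
}

extension TinyGraph {
    // Application-specific construction lives in the extension, not the class.
    public convenience init(pairs: [(src: String, dst: String)]) {
        self.init()
        for pair in pairs { addEdge(from: pair.src, to: pair.dst) }
        checkRep()
    }
}

let tiny = TinyGraph(pairs: [("Abba", "Buttercup"), ("Ava", "Abba")])
print(tiny.nodeCount)  // three distinct names were added
```

Within a single file, Swift lets an extension see the private members of the type it extends, so this sketch would also compile with a private checkRep. The distinction matters once the extension lives in its own file, as it does in this lab.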

Testing

Test your new initializer in isolation, and verify that your program builds the graph correctly before continuing. The assignment formally requires this in Problem 3 below. You may wish to complete this aspect of Problem 3 before continuing.

Problem 2: Finding Paths

You will also define a method to compute how two members of a social network are related. In other words, you will write code to find a path between two nodes in a graph. How that path is subsequently used, or the format in which it is displayed, depends on the requirements of the particular application, be it your test driver or iOS app. Your code will find the shortest path via breadth-first search (BFS). You should be familiar with this algorithm, but below is general BFS pseudocode for finding the shortest path between two nodes in a graph G that you can use. For readability, you should of course use more descriptive variable names in your actual code than are needed in the pseudocode:

BFS Algorithm
start = starting node
dest = destination node
Q = queue, or "worklist", of nodes to visit: initially empty
M = map from nodes to paths: initially empty.
    // Each key in M is a visited node.
    // Each value is a path from start to that node.
    // A path is a list; you decide whether it is a list of 
    // nodes, or edges, or node data, or edge data, or nodes 
    // and edges, or something else.

Add start to Q
Add start->[] to M (start mapped to an empty list)
while Q is not empty:
    dequeue next node n
    if n is dest
        return the path associated with n in M
    for each edge e=⟨n,m⟩:
        if m is not in M, i.e. m has not been visited:
            let p be the path n maps to in M
            let p' be the path formed by appending e to p
            add m->p' to M
            add m to Q
        
If the loop terminates, then no path exists from start to dest.
The implementation should indicate this to the client.

Here are some facts about the algorithm:

  • It has a loop invariant that every element of Q is a key in M.
  • If the graph were not a multigraph, the for loop could have been equivalently expressed as for each neighbor m of n:.
  • If a path exists from start to dest, then the algorithm returns a shortest path.

The shortest path between two nodes may not be unique. For testing and grading purposes, your program should return the lexicographically (alphabetically) least path. More precisely, your algorithm should pick the lexicographically first neighbor at each next step in the path, and if the current node and neighbor are related in multiple ways, the algorithm should choose the edge that is lexicographically smaller. The BFS algorithm above can be easily modified to support this ordering: in the for-each loop, visit edges in increasing order of m's name, with edges to the same node visited in increasing order of label. This is not meant to imply that your graph should store data in this order; it is merely a convenience for making sure your test output is stable between runs and always consistent with ours.
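As an illustration of the pseudocode combined with this ordering, here is a minimal sketch that runs on a plain dictionary of outgoing edges instead of a Graph ADT. The names OutEdge, Step, and shortestPath are hypothetical, and returning nil for "no path" is just one possible way to signal that case to the client.

```swift
// One outgoing edge: a label and a destination node name.
typealias OutEdge = (label: String, dst: String)
// One step of a path: source, label, destination.
typealias Step = (src: String, label: String, dst: String)

// Returns a shortest path chosen by the ordering described above, or nil if
// no path exists. `adjacency` maps each node name to its outgoing edges and
// is only a stand-in for a real graph.
func shortestPath(in adjacency: [String: [OutEdge]],
                  from start: String, to dest: String) -> [Step]? {
    guard adjacency[start] != nil, adjacency[dest] != nil else { return nil }
    var worklist = [start]                     // Q; `head` marks the front
    var head = 0
    var paths: [String: [Step]] = [start: []]  // M

    while head < worklist.count {
        let current = worklist[head]
        head += 1
        if current == dest { return paths[current] }
        // Visit edges by destination name, breaking ties by label.
        let ordered = (adjacency[current] ?? []).sorted {
            ($0.dst, $0.label) < ($1.dst, $1.label)
        }
        for edge in ordered where paths[edge.dst] == nil {
            paths[edge.dst] = paths[current]! + [(src: current, label: edge.label, dst: edge.dst)]
            worklist.append(edge.dst)
        }
    }
    return nil
}

// A small hypothetical herd with two equally short routes from Abba to Daisy.
let herd: [String: [OutEdge]] = [
    "Abba": [(label: "frenemy", dst: "Clementine"), (label: "friend", dst: "Buttercup")],
    "Buttercup": [(label: "friend", dst: "Daisy")],
    "Clementine": [(label: "friend", dst: "Daisy")],
    "Daisy": [],
    "Ava": [],
]
let found = shortestPath(in: herd, from: "Abba", to: "Daisy")
print(found?.map { "\($0.src) to \($0.dst) via \($0.label)" } ?? ["no path found"])
```

Both routes have two steps, and the sketch returns the one through Buttercup because that name sorts before Clementine. Advancing an index into the array, rather than removing the first element, keeps each dequeue constant-time.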

Because of this application-specific behavior, you should again implement your BFS algorithm in your Graph extension rather than directly in your graph ADT, as other applications that might need BFS probably would not need this special ordering. Further, other applications using the graph ADT might need to use a different search algorithm, so we don’t want to hard-code a particular search algorithm in the graph ADT. To match the discussion below, we’ll assume that your BFS method is declared as follows:

extension Graph {
    ...

    public func breadthFirstSearch(from src: String, to dst: String) -> ... {
        ...
    }
}

Your methods in the Graph extension conform to the specification of the Graph class itself. As such, documentation for those methods should only refer to the Graph’s abstract state and public API.

If you wrote any additional ADTs for this part, include an abstraction function, representation invariant, and checkRep in each of them. If a class/struct does not represent an ADT, place a comment that explicitly says so where the AF and RI would normally go. Feel free to come chat with us if you feel unsure about what counts as an ADT and what doesn’t.

Testing

Unit test your BFS implementation in isolation, and verify that your program finds the correct paths. The assignment formally requires this in Problem 3 below. You may wish to complete this aspect of Problem 3 before continuing.

Problem 3: Testing Your Solution

You will perform three types of tests on your graph extension:

  1. Specification Tests
  2. Implementation Tests
  3. Performance Tests

Each serves a different purpose and requires different data sets. The Marvel graph is well suited to scalability testing, but its size makes it a poor choice for specification testing. In addition, it is important to be able to test your graph-building and BFS operations in isolation, separately from each other. For these reasons, you will use a test driver to verify the correctness of both your parser and BFS algorithm on small graphs.

Specification Tests

You will write specification tests using test scripts similar to those you wrote for your Graph ADT. The format is defined in the test file format section below. In addition to writing *.test and *.expected files as before, you can also write *.json files in the format above. All of these files will go in the “SocialNetworksTests” directory.

To get started, find the file named SpecificationTests in the “SocialNetworksTests” directory and copy the text of your GraphTests/SpecificationTests.swift source file into it. Then revise its contents to match this assignment’s file format specification.

Your code changes will most likely be confined to adding new cases to the Script class’s execute method and possibly writing a couple of new helper methods.

This process will cause duplicated code between the earlier lab and this one, but it avoids some tricky issues with subclassing and is probably the easiest option. We could have pursued a better approach if we knew ahead of time this sort of reuse would happen. One optional extension is to do just that.

Implementation Tests

Your specification tests will most likely cover the bulk of your testing. You may need to test additional behaviors specific to your implementation, such as handling of edge cases. If so, write these tests in a second unit test class.

Performance Tests

Using the full Marvel dataset, your program must be able to construct the graph and find a path in less than 30 seconds on a lab computer. I suggest not adding performance tests as .test files so that you don’t run them every time you run your specification tests.

Instead, you can use the SocialNetworksTests/Tests/PerformanceTests.swift file as a starting point for your performance tests. You will need to uncomment and slightly edit the tests to match your GraphADT implementation.

import XCTest
import GraphADT
@testable import SocialNetworks

class PerformanceTests: XCTestCase {

  // Load the full Marvel data set.
  // This should take less than 30 seconds to complete.
  func testMarvelLoadTime() {
    let text = DataFiles.loadFile(named: "marvel.json")
    let json = try! JSONSerialization.jsonObject(with: text.data(using: .ascii)!) as! [String:Any]    
    let graph = Graph(jsonForGraphWithProperties: json)
    XCTAssertEqual(graph.nodes.count, 6438)
  }
  
  // Load the full Marvel data set and perform a BFS.
  // This should take less than 30 seconds to complete.
  func testMarvelBFS() {
    let text = DataFiles.loadFile(named: "marvel.json")
    let json = try! JSONSerialization.jsonObject(with: text.data(using: .ascii)!) as! [String:Any]
    let graph = Graph(jsonForGraphWithProperties: json)
    graph.addNode(withName: "NOBODY")
    let path = graph.breadthFirstSearch(from: "BLADE", to: "NOBODY")
    XCTAssertNil(path)
  }

  ...

}

The console output will indicate how long each test took to run. As a point of reference, our solution took around 6 seconds to load the graph and about 12 seconds to load the graph and perform the search.

If your program takes an excessively long time to construct the graph for the full Marvel dataset, first go back and verify that it correctly handles a very small dataset. If it does, proceed with the steps we used to profile performance during our lab exercise.
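Before reaching for a profiler, a coarse wall-clock measurement can help narrow down which phase is slow. The helper below is a minimal sketch; secondsElapsed is a hypothetical name, and the loop is only a stand-in for graph construction or a search.

```swift
import Foundation

// Runs `work` once and returns the wall-clock seconds it took.
func secondsElapsed(during work: () -> Void) -> TimeInterval {
    let started = Date()
    work()
    return Date().timeIntervalSince(started)
}

// Stand-in workload; in practice this would be loading the graph or running BFS.
var total = 0
let seconds = secondsElapsed {
    for i in 0..<100_000 { total += i }
}
print("took \(seconds) seconds")
```

Timing the load and the search separately on a mid-sized file often shows which of the two deserves attention first.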

Question
  1. Document any changes you made for performance in your SocialNetworks/README.md file. Describe each change similar to how you described changes in the lab exercise. If you made no changes, state that explicitly.

Test Script File Format

The format of the test file is similar to the format described in last week’s lab. Several sample test files are provided in the Resources for this lab.

As before, the test driver manages a collection of named graphs and accepts commands for creating and operating on those graphs.

Each input file has one command per line, and each command consists of whitespace-separated words, the first of which is the command name and the remainder of which are arguments. Lines starting with # are considered comment lines and should be echoed to the output when running the test script. Lines that are blank should cause a blank line to be printed to the output.

The behavior of the testing driver on malformed input command files is not defined; you may assume the input files are well-formed.
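The line-level rules above can be sketched as a small classifier. ScriptLine and classify are hypothetical names, and the sketch does not assume anything about how your existing Script class is organized.

```swift
// The three kinds of line a test script can contain.
enum ScriptLine: Equatable {
    case comment(String)   // echoed to the output unchanged
    case blank             // produces a blank line in the output
    case command(name: String, arguments: [String])
}

// Classifies one line of a test script according to the rules above.
func classify(_ line: String) -> ScriptLine {
    if line.hasPrefix("#") { return .comment(line) }
    let words = line.split(whereSeparator: { $0.isWhitespace }).map(String.init)
    guard let name = words.first else { return .blank }
    return .command(name: name, arguments: Array(words.dropFirst()))
}

print(classify("FindPath herd Abba Daisy"))
```

Keeping classification separate from execution makes it easy to unit test the parsing on its own.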

In addition to all the same commands (and the same output) as last time, our driver this week accepts the following new commands:

  • Command: LoadGraph graphName file.json

    Creates a new graph named graphName from file.json, where file.json is a data file of the format defined above. The command's output is

    loaded graph graphName

    You may assume file.json is well-formed; the behavior for malformed input files is not defined.

  • Command: FindPath graphName node_1 node_2

    Find the shortest path from node_1 to node_2 in the graph using your breadth-first search algorithm. For this command only, underscores in node names should be converted to spaces before being passed into any methods external to the test driver. For example, "node_1" would become "node 1". This is to allow the test scripts to work with the full Marvel dataset, where many character names contain whitespace (but none contain underscores). Anywhere a node is printed in the output for this command, the underscores should be replaced by spaces, as shown below.

    Paths should be chosen using the lexicographic ordering described above. If a path is found, the command prints the path in the format:

    path from NODE 1 to NODE N:
    NODE 1 to NODE 2 via LABEL 1
    NODE 2 to NODE 3 via LABEL 2
    ...
    NODE N-1 to NODE N via LABEL N-1

    where NODE 1 is the first node listed in the arguments to FindPath, NODE N is the second node listed in the arguments, and LABEL K is the label of the edge from NODE K to NODE K+1 (in the Marvel data, the title of a book in which both appeared).

    Not all nodes may have a path between them. If the user gives two valid node arguments NODE_1 and NODE_2 that have no path in the specified graph, print:

    path from NODE 1 to NODE N:
    no path found

    If a node name NODE was not in the original dataset, print:

    unknown node NODE

    where, as before, any underscores in the name are replaced by spaces. If neither node is in the original dataset, print the line twice: first for NODE 1, then for NODE N. These should be the only lines your program prints in this case; i.e., do not print the regular "no path found" output too.

    What if the user asks for the path from a character in the dataset to itself? Print:

    path from C to C:

    but nothing else. (Hint: a well-written solution requires no special handling of this case.)

    A request for a path from a character that is not in the dataset to itself should print the usual "unknown node C" output.
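The output rules for FindPath can be sketched as a function that returns the lines to print. Everything here is hypothetical scaffolding: findPathOutput, the knownNodes set, and the findPath closure stand in for whatever your graph and BFS method actually provide.

```swift
import Foundation

typealias PathStep = (src: String, label: String, dst: String)

// Builds the output lines for one FindPath command.
func findPathOutput(from rawSrc: String, to rawDst: String,
                    knownNodes: Set<String>,
                    findPath: (String, String) -> [PathStep]?) -> [String] {
    // Underscores become spaces before names leave the test driver.
    let src = rawSrc.replacingOccurrences(of: "_", with: " ")
    let dst = rawDst.replacingOccurrences(of: "_", with: " ")

    // Unknown nodes are reported first and are the only lines printed.
    var unknown: [String] = []
    if !knownNodes.contains(src) { unknown.append("unknown node \(src)") }
    if !knownNodes.contains(dst) { unknown.append("unknown node \(dst)") }
    if !unknown.isEmpty { return unknown }

    var lines = ["path from \(src) to \(dst):"]
    guard let steps = findPath(src, dst) else {
        lines.append("no path found")
        return lines
    }
    // An empty list of steps (a node to itself) adds nothing after the header.
    for step in steps {
        lines.append("\(step.src) to \(step.dst) via \(step.label)")
    }
    return lines
}

let known: Set<String> = ["NODE 1", "NODE 2"]
let demo = findPathOutput(from: "NODE_1", to: "NODE_2", knownNodes: known) { src, dst in
    [(src: src, label: "LABEL 1", dst: dst)]
}
demo.forEach { print($0) }
```

For an unknown node paired with itself, this sketch prints the "unknown node" line twice, which is one reading of the rules above; check the sample .expected files to confirm the intended behavior.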

Problem 4: A Command-Line Tool

Add code to SocialNetworks/Source/socialnet/main.swift that allows a user to interact with your program from the command line (i.e., your code should read user input from the terminal). Your program should repeatedly prompt the user and read a command until the user quits. When the program starts, it should load the Marvel data set. The minimal set of commands to handle is the following:

  • members <prefix>: Print all members of the social network whose names start with the given prefix. If prefix is empty, all members should be printed.
  • path <name1> <name2>: Print the path from name1 to name2.
  • quit: Quit the program.
  • load <file>: Load the social network in the given file, e.g., load cows.json loads the data from Data/cows.json.

There is no rigid specification here for output formats, but especially creative ones may receive a small amount of extra credit.
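A prompt-and-read loop for these commands might be organized as in the sketch below. ToolCommand, parseCommand, and runLoop are hypothetical names, and the sketch splits on single spaces, so it assumes names without spaces; handling Marvel names that contain whitespace is a design decision left to you.

```swift
// The commands listed above, plus a catch-all for anything else.
enum ToolCommand: Equatable {
    case members(prefix: String)
    case path(from: String, to: String)
    case load(file: String)
    case quit
    case unrecognized(String)
}

// Turns one line of user input into a command.
func parseCommand(_ line: String) -> ToolCommand {
    let words = line.split(separator: " ").map(String.init)
    guard let name = words.first else { return .unrecognized(line) }
    switch (name, words.count) {
    case ("members", 1): return .members(prefix: "")
    case ("members", 2): return .members(prefix: words[1])
    case ("path", 3): return .path(from: words[1], to: words[2])
    case ("load", 2): return .load(file: words[1])
    case ("quit", 1): return .quit
    default: return .unrecognized(line)
    }
}

// Prompts and reads until "quit" or end of input. The input source is a
// parameter so the loop can be exercised without a terminal.
func runLoop(readInput: () -> String? = { readLine() }) {
    while true {
        print("> ", terminator: "")
        guard let line = readInput() else { break }
        let command = parseCommand(line)
        if command == .quit { break }
        print("parsed: \(command)")  // a real tool would dispatch to the graph here
    }
}

// Drive the loop with scripted input instead of the terminal.
var scripted = ["members ZE", "path A B", "quit"].makeIterator()
runLoop(readInput: { scripted.next() })
```

Calling runLoop() with no argument reads from standard input via readLine().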

Extensions

There are many additional features you can add to your program to compute other metrics of the graph. Here are a few:

  • Most Popular Member: Who has the most immediate neighbors?

  • How about the Most k-Popular: Who has the most neighbors no more than \(k\) steps away?

  • Is the network fully connected? That is, is every pair of members related by some chain of relationships?

  • Center of the Universe: Design an algorithm to compute the “center” of a social network. A good starting point is the Oracle of Bacon and its discussion of computing centers in Hollywood. A simple version is to compute how good a center a specific member is. A more sophisticated algorithm is to compute how good a center each member is. This can be done the expensive way (use the simple version on each member), but there are better algorithms. See https://en.wikipedia.org/wiki/Graph_center for a few pointers. Who is the center of the happy herd? Game of Thrones? Marvel Comics?
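Two of the simpler metrics above can be sketched on a plain neighbor map. The function names and the friendships data are hypothetical, and the connectivity check assumes a symmetric network in which every neighbor also appears as a key.

```swift
// Member with the most distinct immediate neighbors; ties go to the
// alphabetically first name. Returns nil for an empty network.
func mostPopular(in neighbors: [String: Set<String>]) -> String? {
    var best: String? = nil
    for name in neighbors.keys.sorted() {
        if let current = best, neighbors[name]!.count <= neighbors[current]!.count {
            continue
        }
        best = name
    }
    return best
}

// True if every member can be reached from every other member.
// Assumes relationships are symmetric, as in the Marvel data.
func isConnected(_ neighbors: [String: Set<String>]) -> Bool {
    guard let start = neighbors.keys.first else { return true }
    var reached: Set<String> = [start]
    var frontier = [start]
    while let node = frontier.popLast() {
        for next in neighbors[node] ?? [] where !reached.contains(next) {
            reached.insert(next)
            frontier.append(next)
        }
    }
    return reached.count == neighbors.count
}

let friendships: [String: Set<String>] = [
    "Abba": ["Buttercup", "Clementine"],
    "Buttercup": ["Abba"],
    "Clementine": ["Abba"],
    "Daisy": [],
]
print(mostPopular(in: friendships) ?? "nobody", isConnected(friendships))
```

For an asymmetric network, reachability in one direction does not imply reachability in the other, so the connectivity question would need a more careful definition.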

What To Turn In

Be sure to submit everything in the following checklist:

Submission Checklist
GraphADT:
  • Code updates to address any performance bottlenecks.
  • Updated GraphADT/README.md with answer to Question 1 in the lab exercise handout.
SocialNetworks:
  • All code and unit tests for your graph extensions and other model classes.
  • Be sure any extensions to Graph are documented according to our usual expectations for ADTs.
  • SocialNetworks/README.md with answers to Questions 1-2 above.
  • A note in the README.md describing any extra features.

Grading Guidelines

This class emphasizes both program correctness and best practices, including programming style, documentation, specification, sound design principles, testing methodology, etc. Grading will reflect both of these emphases. There is, of course, some subjectivity when evaluating design and specification decisions, but your choices should follow the basic philosophies and methodologies we have been exploring. Labs will be graded on the following scale:

A+: An absolutely fantastic submission of the sort that will only come along a few times during the semester. Such a submission reflects substantial effort beyond the basic expectations of the assignment.

A: A submission that exceeds our standard expectation for the assignment. The program must reflect additional work beyond the requirements or get the job done in a particularly elegant way.

A−: A submission that satisfies all the requirements for the assignment — a job well done.

B+: A submission that meets the requirements for the assignment, possibly with a few small problems.

B: A submission that has problems serious enough to fall short of the requirements for the assignment.

C: A submission that has extremely serious problems, but nonetheless shows some effort and understanding.

D: A submission that shows little effort and does not represent passing work.