Data Analysis & Plotting

Today, we will discuss the following:

  • Using the data structures we’ve learned about to perform some exploratory data analysis

  • Using matplotlib to create plots

Premise

You are a talent scout for a big English football (soccer) club. The club you work for has a good defense, but a weak offense. So, you’ve been tasked with identifying a star striker to help score more goals! You decide to identify candidates in a data-driven manner.

Peeking into the data

You’ve managed to procure some CSV files containing stats on every player in the league for the 3 prior seasons: 2018-19, 2019-20, and 2020-21. Let’s take a look at the first few lines of one of these files using the head command, which prints the first 10 lines of a file by default. (The leading ! runs a terminal command from inside a notebook cell.)

!head seasonStats/season2018-19.csv
Name,Goals,Passes,Fouls
Rolando Aarons,0,0,0
Abdul Rahman Baba,0,0,0
Tammy Abraham,0,0,0
Adrián,0,0,0
Adrien Silva,0,79,0
Benik Afobe,0,0,0
Sergio Agüero,21,771,21
Daniel Agyei,0,0,0
Soufyan Ahannach,0,0,0

Reading the file into a dictionary

The file may look messy at first, but the first line (called the header) tells us what value to expect in each column. The 1st column holds the players’ names, the 2nd the number of goals they scored in the season, the 3rd the number of passes they completed, and the 4th the number of fouls they committed. Let’s now write a function that reads in such a file and produces a dictionary mapping player names (keys) to goals (values).

def readFile(filename):
    """
    Function that takes in a file name and
    outputs a dictionary mapping player names
    to the number of goals they scored.
    """
    
    nameGoalDict = {}

    with open(filename) as file:
        
        # the first line of the file is just the header
        # describing what each column is, so we read it
        # (and thereby skip it) before we iterate over the rest of the file
        header = file.readline()
        
        # we now iterate over the rest of the file
        # and add key-value pairs of name: goals
        for line in file:
            
            line = line.strip().split(',')
            nameGoalDict[line[0]] = int(line[1])
        
    return nameGoalDict

season1Dict = readFile("seasonStats/season2018-19.csv")
season2Dict = readFile("seasonStats/season2019-20.csv")
season3Dict = readFile("seasonStats/season2020-21.csv")
# to get a sense of how many players we're trying to pick from
# we'll also print out the length of one of these dictionaries
print(len(season1Dict))
884
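
If a name ever contained a comma, splitting each line on ',' would misplace the columns. A minimal sketch of an alternative, assuming the same Name and Goals header shown above, uses Python's built-in csv module. The function name readGoalsCsv is hypothetical, and the in-memory sample simply reuses two rows from the 2018-19 file.

```python
import csv
import io


def readGoalsCsv(fileObj):
    """
    Hypothetical variant of readFile that takes an open file object
    and uses csv.DictReader to look columns up by header name.
    """
    nameGoalDict = {}
    # DictReader consumes the header line and yields one dict per row
    for row in csv.DictReader(fileObj):
        nameGoalDict[row["Name"]] = int(row["Goals"])
    return nameGoalDict


# two rows copied from the 2018-19 file, held in memory for the demo
sample = io.StringIO(
    "Name,Goals,Passes,Fouls\n"
    "Adrien Silva,0,79,0\n"
    "Sergio Agüero,21,771,21\n"
)
sampleGoals = readGoalsCsv(sample)
print(sampleGoals["Sergio Agüero"])  # prints 21
```

For a file on disk, the same function could be called inside a block such as with open(filename, newline="") as file.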

Sorting players by number of goals scored

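Now that we have all these dictionaries, we’d like to use them to determine who the top goal scorers are in each season. For this, we sort each dictionary’s (name, goals) pairs, obtained from items(), by the number of goals. The result is a list of tuples ordered from most to fewest goals.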

def returnGoals(nameGoalTuple):
    """
    Function that returns the number of goals scored
    from a given tuple.
    """
    return nameGoalTuple[1]

sortedGoals1 = sorted(season1Dict.items(), key=returnGoals, reverse=True)
sortedGoals2 = sorted(season2Dict.items(), key=returnGoals, reverse=True)
sortedGoals3 = sorted(season3Dict.items(), key=returnGoals, reverse=True)
#sortedGoals1
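
A key function such as returnGoals does not have to be written by hand. As a small sketch on placeholder data (the player names and goal counts below are made up, not taken from the files), the standard library's operator.itemgetter or a lambda builds the same kind of key function:

```python
from operator import itemgetter

# placeholder data, not taken from the season files
sampleDict = {"Player A": 3, "Player B": 12, "Player C": 7}

# itemgetter(1) returns a function that picks element 1 (the goals)
# out of each (name, goals) tuple, just like returnGoals does
sortedSample = sorted(sampleDict.items(), key=itemgetter(1), reverse=True)

print(sortedSample)
# [('Player B', 12), ('Player C', 7), ('Player A', 3)]

# a lambda gives the same ordering
sortedWithLambda = sorted(sampleDict.items(), key=lambda pair: pair[1], reverse=True)
print(sortedWithLambda == sortedSample)  # True
```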

Using matplotlib to visualize the data

Now from these sorted lists of tuples, we can start to form a short list by looking at the top 10 goal scorers in each season. So let’s focus on the 2018-19 season for now. We can print the first 10 elements of the sorted list of tuples, but it’s not very visually appealing to view results that way. Instead, let’s try to make a bar chart of the top 10 goal scorers in 2018-19.

# let's first look at some basic matplotlib functionality
import matplotlib.pyplot as plt

# we can create a simple line plot as follows
# this corresponds to points (1,10), (2,14), etc
plt.plot([1, 2, 3, 4], [10, 14, 15, 18]) 
plt.show()
[Figure: the line plot produced by the code above]

Decorating a Plot

We can decorate a plot by setting the figure size, the tick labels, the axis labels, and the title.

# a more advanced example where we customize the line plot
plt.figure(figsize=(4, 4)) # create a 4 by 4 figure
plt.plot([0, 5, 10], [4, 12, 14])
plt.xticks([0, 5, 10],          # x values of axis ticks
           ['x1', 'x2', 'x3'])  # values to show for ticks

# rotate the y tick labels, because they are shown horizontally by default
plt.yticks([4, 12, 14], ['y1', 'y2', 'y3'], rotation=90) 

# axis labels and title
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.title("Custom plot")
plt.show()
[Figure: the customized line plot titled "Custom plot", with x axis and y axis labels]

Creating a bar plot

Let’s return to our initial goal of creating a bar graph that illustrates the top goal scorers.

# Going back to our initial goal of creating a bar plot

# the names of the top ten scorers will form our labels
# for the x axis
topTenNames = [item[0] for item in sortedGoals1[0:10]]

# the x axis values are just 0-9 to provide even spacing for each bar
xValues = list(range(10))

# the y axis values are determined by the number of goals scored
yValues = [item[1] for item in sortedGoals1[0:10]]

# Create a new figure:
plt.figure()
# Create a bar chart
plt.bar(xValues, yValues)

# Set x tick labels from names
# rotate by 90 so labels are vertical and do not overlap
plt.xticks(xValues, topTenNames, rotation=90) 
# Set title and label axes
plt.title("Top scorers of 2018-19")
plt.xlabel("Name")
plt.ylabel("Goals")
plt.ylim([0, 30])

# if you'd like to save the plot as a PDF, do so before calling plt.show(),
# since the figure may be cleared once it has been shown:
# plt.tight_layout()  # ensures the longer labels on the x axis don't get cut off
# plt.savefig('topScorers.pdf')

# Show our chart:
plt.show()
[Figure: bar chart titled "Top scorers of 2018-19", with Name on the x axis and Goals on the y axis]
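
Since the same chart is useful for every season, the steps above can be wrapped in a function. The following is only a sketch: plotTopScorers and its parameters are hypothetical names, the sample list is placeholder data, and it uses matplotlib's Axes methods (plt.subplots, ax.bar, and so on) rather than the plt.* calls used above.

```python
import matplotlib.pyplot as plt


def plotTopScorers(sortedGoals, seasonLabel, n=10):
    """
    Hypothetical helper: draw a bar chart of the first n
    (name, goals) tuples in an already sorted list.
    """
    top = sortedGoals[:n]
    names = [item[0] for item in top]
    goals = [item[1] for item in top]
    # evenly spaced positions, one per bar
    xValues = list(range(len(top)))

    fig, ax = plt.subplots()
    ax.bar(xValues, goals)
    # label each bar with the player's name, rotated to avoid overlap
    ax.set_xticks(xValues)
    ax.set_xticklabels(names, rotation=90)
    ax.set_title("Top scorers of " + seasonLabel)
    ax.set_xlabel("Name")
    ax.set_ylabel("Goals")
    fig.tight_layout()
    return fig, ax


# placeholder data standing in for a list such as sortedGoals1
sampleSorted = [("Player B", 12), ("Player C", 7), ("Player A", 3)]
fig, ax = plotTopScorers(sampleSorted, "a sample season", n=2)
```

Calling plt.show() afterwards displays the chart. With the real data, the call might look like plotTopScorers(sortedGoals2, "2019-20").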

Using sets to find the intersection of players that do well across all seasons

We can make similar visualizations for the other seasons. However, we’d like to recruit a player that’s consistent across all 3 seasons. So let’s try to get a short list of players that appear in the top 10 in all 3 seasons. This is a good place to make use of sets and the & (intersection) operator!

topTen1 = [item[0] for item in sortedGoals1[0:10]]
topTen2 = [item[0] for item in sortedGoals2[0:10]]
topTen3 = [item[0] for item in sortedGoals3[0:10]]
consistentPlayers = set(topTen1) & set(topTen2) & set(topTen3)
consistentPlayers
{'Jamie Vardy', 'Mohamed Salah'}

And that’s our final short list of players who we can try to recruit to score more goals for our team!
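
The same idea extends to any number of seasons. As a sketch on placeholder data (topScorerNames and the sample lists are hypothetical, not taken from the files), set.intersection can combine several sets in a single call:

```python
def topScorerNames(sortedGoals, n=10):
    """Hypothetical helper: set of names among the first n (name, goals) tuples."""
    return {name for name, goals in sortedGoals[:n]}


# placeholder sorted lists standing in for sortedGoals1, sortedGoals2, ...
sampleSeasons = [
    [("Player A", 20), ("Player B", 15), ("Player C", 9)],
    [("Player B", 22), ("Player A", 18), ("Player D", 5)],
    [("Player A", 25), ("Player D", 17), ("Player B", 16)],
]

# one set per season, then intersect them all in a single call
topSets = [topScorerNames(season, n=2) for season in sampleSeasons]
consistentSample = set.intersection(*topSets)
print(consistentSample)  # {'Player A'}
```

Note that slicing the first n entries ignores ties at the cutoff, so players level on goals with the last entry kept are left out.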