Pivots and Joins#

from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline

1. Review groups#

https://media-cldnry.s-nbcnews.com/image/upload/MSNBC/Components/Photo/_new/110118-obama-gift-2p.jpg

“President Barack Obama receives a gift from Saudi King Abdullah at the start of their bilateral meeting in Riyadh, Saudi Arabia, on June 3, 2009. The large gold medallion was among several gifts given that day that were valued at $34,500, the State Department later said” –CBS News

# Read in all gifts, and tidy up table by removing nan values and relabeling columns
all_gifts = Table().read_table('data/obama-gifts.csv')
all_gifts = all_gifts.where('donor_country', are.not_equal_to('nan')) #clean up and remove the nans
all_gifts = all_gifts.select('year_received', 'donor_country', 'value_usd')
all_gifts = all_gifts.relabeled('year_received', 'Year')
all_gifts = all_gifts.relabeled('donor_country', 'Country')
all_gifts = all_gifts.relabeled('value_usd', 'Value')
all_gifts.sample(8)
Year Country Value
2016 Indonesia 380
2009 Netherlands 1480
2014 Haiti 500
2014 Brunei 1205
2014 Malaysia 810
2012 Mongolia 440
2016 Brunei 2734.96
2009 Haiti 4025
# We'll also create a small subset of our Obama Gifts dataset.  
# It contains 3 countries, 3 years, 7 gifts
gifts = all_gifts.where('Year', are.contained_in([2009,2010,2011,2012]))
gifts = gifts.where('Country', are.contained_in(make_array('Denmark', 'Egypt', 'Finland')))
gifts = gifts.sort('Year')
gifts
Year Country Value
2009 Denmark 388
2009 Egypt 630
2009 Denmark 340
2010 Finland 445
2010 Egypt 356
2010 Egypt 380
2011 Denmark 485
gifts.group('Year')
Year count
2009 3
2010 3
2011 1
gifts.group('Country')
Country count
Denmark 3
Egypt 3
Finland 1
gifts.group('Year', sum)
Year Country sum Value sum
2009 1358
2010 1181
2011 485
all_gifts.group('Year', sum)
Year Country sum Value sum
2009 571164
2010 122626
2011 233730
2012 75307.8
2013 198508
2014 1.40138e+06
2015 1.10529e+06
2016 147903
gifts.group('Country', max)
Country Year max Value max
Denmark 2011 485
Egypt 2010 630
Finland 2010 445

The largest gifts given to Obama:

all_gifts.group('Country', max).sort('Value max', descending=True)
Country Year max Value max
Saudi Arabia 2016 570000
Italy 2016 124000
Qatar 2015 110000
Brunei 2016 73200
Gabon 2014 52695
Ghana 2016 48000
Kuwait 2015 42000
Brazil 2015 40000
Myanmar (Burma) 2016 26220
Pakistan 2015 23615

... (113 rows omitted)

gifts.group(make_array('Year', 'Country'))
Year Country count
2009 Denmark 2
2009 Egypt 1
2010 Egypt 2
2010 Finland 1
2011 Denmark 1
gifts.group(make_array('Year', 'Country'), sum)
Year Country Value sum
2009 Denmark 728
2009 Egypt 630
2010 Egypt 736
2010 Finland 445
2011 Denmark 485
all_gifts.group(['Year', 'Country'], max).sort('Value max', descending=True)
Year Country Value max
2014 Saudi Arabia 570000
2015 Saudi Arabia 522972
2009 Saudi Arabia 132000
2009 Italy 124000
2015 Qatar 110000
2015 Brunei 73200
2013 Brunei 71468
2016 Saudi Arabia 56720
2011 Gabon 52695
2009 Ghana 48000

... (336 rows omitted)

2. Pivots#

Summarize data that has been grouped by two variables in a grid.

gifts
Year Country Value
2009 Denmark 388
2009 Egypt 630
2009 Denmark 340
2010 Finland 445
2010 Egypt 356
2010 Egypt 380
2011 Denmark 485

With only two parameters in pivot, the values are counts.

gifts.pivot('Country', 'Year')
Year Denmark Egypt Finland
2009 2 1 0
2010 0 2 1
2011 1 0 0
gifts.pivot('Year', 'Country')
Country 2009 2010 2011
Denmark 2 0 1
Egypt 1 2 0
Finland 0 1 0

With four parameters used in pivot,

  • The first is the horizontal column labels

  • The second is the vertical row labels

  • The third is the column used as values in the grid

  • The fourth is the aggregation function used for the values in the grid

gifts.pivot('Year', 'Country', 'Value', sum)
Country 2009 2010 2011
Denmark 728 0 485
Egypt 630 736 0
Finland 0 445 0
gifts.pivot('Year', 'Country', 'Value', max)
Country 2009 2010 2011
Denmark 388 0 485
Egypt 630 380 0
Finland 0 445 0

A pivot from our past: Temperatures in Greenland#

We spent some time working with our Upernavik climate data.

greenland_climate = Table.read_table('data/climate_upernavik.csv')
greenland_climate
Year Month Air temperature (C) Sea level pressure (mbar) Precipitation (millimeters)
1873 9 0.4 9999.9 15
1873 10 -5.3 9999.9 34
1873 11 -9.4 9999.9 30
1873 12 999.99 9999.9 29
1874 1 -29.6 9999.9 9
1874 2 -19.6 9999.9 22
1874 9 0.1 1010.7 68
1874 10 -5.4 1002.7 24
1874 11 -8 1010.5 15
1874 12 -8.4 1005.1 69

... (1355 rows omitted)

greenland_climate = Table.read_table('data/climate_upernavik.csv').where('Year', are.above(1876))
greenland_climate = greenland_climate.relabeled('Precipitation (millimeters)', "Precip (mm)")
tidy_greenland = greenland_climate.where('Air temperature (C)', are.not_equal_to(999.99))
tidy_greenland = tidy_greenland.where('Sea level pressure (mbar)', are.not_equal_to(9999.9))
tidy_greenland
Year Month Air temperature (C) Sea level pressure (mbar) Precip (mm)
1877 1 -21.1 994.9 24
1877 2 -26.5 1008.2 26
1877 3 -17.8 1013.4 84
1877 4 -12 1023.6 6
1877 5 -1.7 1019.4 9
1877 6 1.4 1011.8 5
1877 7 4.6 1011.5 9
1877 8 5.2 1019.8 29
1877 9 3 1009 51
1877 10 -2.8 1006.2 23

... (1221 rows omitted)

When plotting line graphs, we needed separate columns for each month. A pivot gives us a table with that format.

temps_by_month = tidy_greenland.pivot('Month', 'Year', 'Air temperature (C)', np.mean)
temps_by_month
Year 1 2 3 4 5 6 7 8 9 10 11 12
1877 -21.1 -26.5 -17.8 -12 -1.7 1.4 4.6 5.2 3 -2.8 -10 -19.2
1878 -22.9 -26.9 -19.6 -13 -5.6 1.9 3.2 4.3 -0.9 -4 -2 -3.7
1879 -13.5 -25.4 -21.3 -13 -2.7 0 5.2 5.8 -1.1 -5.5 -8.2 -16.5
1880 -22.6 -28.9 -22.7 -10.7 -3.1 2.2 5.6 2.9 0.8 -0.3 -9.4 -14.6
1881 -13.6 -19.9 -26.4 -12.8 -3.1 1.3 5.3 4.2 0.4 -2.8 -9.4 -17.5
1882 -27.6 -25.6 -24.9 -14.8 -3.2 3.1 3.3 4.1 1 -6.1 -12.1 -17.1
1883 -19.9 -21.5 -12.7 -12.8 -1.9 3 6.2 4.4 2.2 -4.1 -8.9 -21.1
1884 -26.1 -24.9 -22.7 -13.6 -6.4 1.6 5.1 3.2 -1.6 -5.1 -11.8 -20.7
1885 -15.9 -18.3 -23 -12.8 -2 -0.4 5.2 5.2 -0.8 -4.6 -13.2 -18.5
1886 -24.2 -25.2 -23.7 -18.4 -2.8 1.2 5.3 4.2 0.5 -6.3 -11.9 -18.6

... (97 rows omitted)

two_months = temps_by_month.select('Year', '2', '9')
two_months.show(5)
Year 2 9
1877 -26.5 3
1878 -26.9 -0.9
1879 -25.4 -1.1
1880 -28.9 0.8
1881 -19.9 0.4

... (102 rows omitted)

two_months.plot('Year')
../_images/11-pivots-and-joins_32_0.png

3. Joins#

Let’s hand pick a few of our favorite gifts.

best_gifts = gifts.take(0,1,2,3).drop("Year")
best_gifts
Country Value
Denmark 388
Egypt 630
Denmark 340
Finland 445

If you’re curious what they are, here are shortened descriptions for the four we chose.

best_gifts.with_columns("Gift description", 
                        make_array('Book "Restoring the Military Balance"',
                                   'Yellow alabaster bowl',
                                   'Photograph of Her Majesty and His Royal Highness',
                                   'Hand-blown blue glass bird'))
Country Value Gift description
Denmark 388 Book "Restoring the Military Balance"
Egypt 630 Yellow alabaster bowl
Denmark 340 Photograph of Her Majesty and His Royal Highness
Finland 445 Hand-blown blue glass bird

Here is some other info about countries, specifically the GDP from each country during Obama’s first year in office. (GDP is gross domestic product, a measure of the value of the final goods and services produced in a country.)

We obtained this data from the World Bank and IMF sites.

gdp = Table().with_columns(
    'Country Name', make_array('Denmark', 'Egypt', 'Egypt', 'Greece'),
    'GDP (Billion $)', make_array(321, 189, 198, 331),
    'Source', make_array('World Bank', 'World Bank', 'IMF', 'World Bank'))
gdp
Country Name GDP (Billion $) Source
Denmark 321 World Bank
Egypt 189 World Bank
Egypt 198 IMF
Greece 331 World Bank

Can we combine best_gifts with info about the countries like GDP?

Join will let us merge data from two tables by pairing together rows from each that share a common property. Here we’ll merge a table of gifts with a table of GDP information about the countres giving gifts.

best_gifts.join('Country', gdp, 'Country Name')
Country Value GDP (Billion $) Source
Denmark 388 321 World Bank
Denmark 340 321 World Bank
Egypt 630 189 World Bank
Egypt 630 198 IMF

4. Open-ended exploration#

Suppose we had the following question: Is there any association between a country’s GDP and the total value of gifts they gave to President Obama?

The World Bank provides GDP data we can use.

gdp = Table().read_table('data/gdp.csv')
gdp.show(3)
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 Unnamed: 66
Aruba ABW GDP (current US$) NY.GDP.MKTP.CD nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 4.05587e+08 4.87709e+08 5.96648e+08 6.95531e+08 7.64804e+08 8.72067e+08 9.58659e+08 1.08324e+09 1.24581e+09 1.32067e+09 1.37989e+09 1.53184e+09 1.66536e+09 1.72291e+09 1.87318e+09 1.89665e+09 1.96201e+09 2.04413e+09 2.25475e+09 2.35978e+09 2.46983e+09 2.67765e+09 2.84302e+09 2.55363e+09 2.45363e+09 2.63799e+09 2.61508e+09 2.72793e+09 2.79106e+09 2.96313e+09 2.9838e+09 3.09218e+09 3.20223e+09 3.31006e+09 2.49665e+09 nan nan
Africa Eastern and Southern AFE GDP (current US$) NY.GDP.MKTP.CD 2.12906e+10 2.18085e+10 2.3707e+10 2.821e+10 2.61188e+10 2.96822e+10 3.22391e+10 3.35146e+10 3.65215e+10 4.18283e+10 4.48626e+10 4.94789e+10 5.35148e+10 6.96008e+10 8.60578e+10 9.16492e+10 9.11246e+10 1.03416e+11 1.15345e+11 1.34671e+11 1.70654e+11 1.74387e+11 1.67266e+11 1.74918e+11 1.60134e+11 1.36297e+11 1.52518e+11 1.86145e+11 2.0414e+11 2.17539e+11 2.53224e+11 2.73403e+11 2.38255e+11 2.36527e+11 2.4012e+11 2.69637e+11 2.68414e+11 2.82185e+11 2.65814e+11 2.62172e+11 2.83925e+11 2.58819e+11 2.6487e+11 3.52659e+11 4.38834e+11 5.12211e+11 5.75921e+11 6.61179e+11 7.08287e+11 7.19217e+11 8.60478e+11 9.64418e+11 9.73043e+11 9.83937e+11 1.00368e+12 9.24253e+11 8.82355e+11 1.02065e+12 9.91022e+11 9.97534e+11 9.21646e+11 1.0821e+12 nan
Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.37778e+08 5.48889e+08 5.46667e+08 7.51111e+08 8e+08 1.00667e+09 1.4e+09 1.67333e+09 1.37333e+09 1.40889e+09 1.74889e+09 1.83111e+09 1.59556e+09 1.73333e+09 2.15556e+09 2.36667e+09 2.55556e+09 2.95333e+09 3.3e+09 3.69794e+09 3.64172e+09 3.47879e+09 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 4.05518e+09 4.51556e+09 5.22678e+09 6.20914e+09 6.97129e+09 9.74788e+09 1.01093e+10 1.24162e+10 1.58567e+10 1.78051e+10 1.99073e+10 2.01464e+10 2.04971e+10 1.91342e+10 1.81166e+10 1.87535e+10 1.80532e+10 1.87995e+10 2.01161e+10 nan nan

... (263 rows omitted)

Data wrangling: Let’s subset to the columns we actually need and make these values more readable.

gdp_2008 = Table().with_columns(
    'Country Name', gdp.column('Country Name'),
    '2008 GDP Billion USD', gdp.column('2008') / 1e9
)
gdp_2008.show(3)
Country Name 2008 GDP Billion USD
Aruba 2.84302
Africa Eastern and Southern 708.287
Afghanistan 10.1093

... (263 rows omitted)

Let’s use group to sum up the values of gifts given by each country.

all_gifts.show(5)
Year Country Value
2009 Mexico 400
2009 Japan 1495
2009 United Kingdom 16510
2009 Algeria 500
2009 Denmark 388

... (608 rows omitted)

gifts_by_country = all_gifts.drop('Year').group('Country', sum)
gifts_by_country.show(5)
Country Value sum
Afghanistan 9263
Algeria 4312.28
Argentina 3977.98
Armenia 2985
Australia 4365.48

... (118 rows omitted)

Let’s join these two datasets.

joined = gifts_by_country.join('Country', gdp_2008, 'Country Name')
joined
Country Value sum 2008 GDP Billion USD
Afghanistan 9263 10.1093
Algeria 4312.28 171.001
Argentina 3977.98 361.558
Armenia 2985 11.662
Australia 4365.48 1055.13
Austria 1275 432.052
Azerbaijan 11503 48.8525
Bahrain 10494 25.7109
Bangladesh 390 91.6313
Benin 12289.9 9.78774

... (91 rows omitted)

joined.scatter('2008 GDP Billion USD', 'Value sum')
../_images/11-pivots-and-joins_54_0.png

The scatter plot doesn’t show much because most of the data is in a very tiny part of the graph. It’s hard to tell whether the other points are outliers or part of some trend. In this case, one handy tool is to take the log of each x-value and y-value.

Recall that for a number \(n\), \(log(n) = a\) such \(10^a = n\). Here are some values to illustrate how logarithms work:

n

log(n)

1

0

10

1

100

2

1,000

3

10,000

4

100,000

5

1,000,000

6

10,000,000

7

100,000,000

8

1,000,000,000

9

joined.scatter('2008 GDP Billion USD', 'Value sum', xscale='log', yscale='log')
../_images/11-pivots-and-joins_56_0.png

That makes the points more evenly distributed across both axes of the plots, and it makes any correlation jump out. We can do a lot with log-log plots like this, but for us, we will be happy to just use it to make the correlation more apparent in this one example.

What about the relationship between oil production and gift values? You can find oil production for every country or region of the world here.

# Compute the average oil production during the years 2009-2016.
oil = Table().read_table('data/oil_production_by_country.csv')
oil_for_years = oil.where('Year', are.between_or_equal_to(2009, 2016))
oil_by_entity = oil_for_years.drop('Code', 'Year').group('Entity', np.average)
oil_by_entity.sort('Oil production (TWh) average', descending=True)
oil_by_entity.sample(10)
Entity Oil production (TWh) average
Switzerland 0
Senegal 0
Nepal 0
United States Virgin Islands 0
China 2381.96
New Caledonia 0
Croatia 9.88503
Mali 0
Middle East (BP) 15423.3
Nicaragua 0
joined_oil = gifts_by_country.join('Country', oil_by_entity, 'Entity')
joined_oil.scatter('Oil production (TWh) average', 'Value sum')
../_images/11-pivots-and-joins_60_0.png
plot = joined_oil.scatter('Oil production (TWh) average', 'Value sum',
                   xscale='log', yscale='log')
plot.set_xlim(1/1e3, 1e5)
plot.set_ylim(1e2, 1e6)
../_images/11-pivots-and-joins_61_0.png

If you’re curious about the top oil producers in the world…

joined_oil.sort('Oil production (TWh) average', descending=True)
Country Value sum Oil production (TWh) average
Russia 21474 6153.07
Saudi Arabia 2.14279e+06 6151.13
United States 974 5059.72
China 79786.5 2381.96
Canada 6541.36 2190.7
United Arab Emirates 6125 1825.73
Iraq 4755.99 1819.56
Kuwait 48265 1660.07
Mexico 32546.2 1612.68
Brazil 44409.7 1374.56

... (102 rows omitted)

Q: Brainstorm some alternative variables we might want to examine correlations between?

  • Norms of gift giving? Political alliances? Trade deals?

  • Could we track if this relationship changes over time?

  • In this class, we’re going to be moving towards some more of these open-ended data questions.