Pivots and Joins#

from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline

1. Review groups#

https://media-cldnry.s-nbcnews.com/image/upload/MSNBC/Components/Photo/_new/110118-obama-gift-2p.jpg

“President Barack Obama receives a gift from Saudi King Abdullah at the start of their bilateral meeting in Riyadh, Saudi Arabia, on June 3, 2009. The large gold medallion was among several gifts given that day that were valued at $34,500, the State Department later said” –CBS News

# Read in all gifts, and tidy up table by removing nan values and relabeling columns
all_gifts = Table().read_table('data/obama-gifts.csv')
all_gifts = all_gifts.where('donor_country', are.not_equal_to('nan')) #clean up and remove the nans
all_gifts = all_gifts.select('year_received', 'donor_country', 'value_usd')
all_gifts = all_gifts.relabeled('year_received', 'Year')
all_gifts = all_gifts.relabeled('donor_country', 'Country')
all_gifts = all_gifts.relabeled('value_usd', 'Value')
all_gifts.sample(8)

Year	Country	Value
2010	Spain	660
2010	France	4580
2009	India	666
2014	Algeria	885
2012	Thailand	1000
2012	Georgia	750
2011	Qatar	3200
2014	Madagascar	696

# We'll also create a small subset of our Obama Gifts dataset.  
# It contains 3 countries, 3 years, 7 gifts
gifts = all_gifts.where('Year', are.contained_in([2009,2010,2011,2012]))
gifts = gifts.where('Country', are.contained_in(make_array('Denmark', 'Egypt', 'Finland')))
gifts = gifts.sort('Year')
gifts

Year	Country	Value
2009	Denmark	388
2009	Egypt	630
2009	Denmark	340
2010	Finland	445
2010	Egypt	356
2010	Egypt	380
2011	Denmark	485

gifts.group('Year')

Year	count
2009	3
2010	3
2011	1

gifts.group('Country')

Country	count
Denmark	3
Egypt	3
Finland	1

gifts.group('Year', sum)

Year	Country sum	Value sum
2009		1358
2010		1181
2011		485

all_gifts.group('Year', sum)

Year	Country sum	Value sum
2009		571164
2010		122626
2011		233730
2012		75307.8
2013		198508
2014		1.40138e+06
2015		1.10529e+06
2016		147903

gifts.group('Country', max)

Country	Year max	Value max
Denmark	2011	485
Egypt	2010	630
Finland	2010	445

The largest gifts given to Obama:

all_gifts.group('Country', max).sort('Value max', descending=True)

Country	Year max	Value max
Saudi Arabia	2016	570000
Italy	2016	124000
Qatar	2015	110000
Brunei	2016	73200
Gabon	2014	52695
Ghana	2016	48000
Kuwait	2015	42000
Brazil	2015	40000
Myanmar (Burma)	2016	26220
Pakistan	2015	23615

... (113 rows omitted)

gifts.group(make_array('Year', 'Country'))

Year	Country	count
2009	Denmark	2
2009	Egypt	1
2010	Egypt	2
2010	Finland	1
2011	Denmark	1

gifts.group(make_array('Year', 'Country'), sum)

Year	Country	Value sum
2009	Denmark	728
2009	Egypt	630
2010	Egypt	736
2010	Finland	445
2011	Denmark	485

all_gifts.group(['Year', 'Country'], max).sort('Value max', descending=True)

Year	Country	Value max
2014	Saudi Arabia	570000
2015	Saudi Arabia	522972
2009	Saudi Arabia	132000
2009	Italy	124000
2015	Qatar	110000
2015	Brunei	73200
2013	Brunei	71468
2016	Saudi Arabia	56720
2011	Gabon	52695
2009	Ghana	48000

... (336 rows omitted)

2. Pivots#

Summarize data that has been grouped by two variables in a grid.

gifts

Year	Country	Value
2009	Denmark	388
2009	Egypt	630
2009	Denmark	340
2010	Finland	445
2010	Egypt	356
2010	Egypt	380
2011	Denmark	485

With only two parameters in pivot, the values are counts.

gifts.pivot('Country', 'Year')

Year	Denmark	Egypt	Finland
2009	2	1	0
2010	0	2	1
2011	1	0	0

gifts.pivot('Year', 'Country')

Country	2009	2010	2011
Denmark	2	0	1
Egypt	1	2	0
Finland	0	1	0

With four parameters used in pivot,

The first is the horizontal column labels
The second is the vertical row labels
The third is the column used as values in the grid
The fourth is the aggregation function used for the values in the grid

gifts.pivot('Year', 'Country', 'Value', sum)

Country	2009	2010	2011
Denmark	728	0	485
Egypt	630	736	0
Finland	0	445	0

gifts.pivot('Year', 'Country', 'Value', max)

Country	2009	2010	2011
Denmark	388	0	485
Egypt	630	380	0
Finland	0	445	0

A pivot from our past: Temperatures in Greenland#

We spent some time working with our Upernavik climate data.

greenland_climate = Table.read_table('data/climate_upernavik.csv')
greenland_climate

Year	Month	Air temperature (C)	Sea level pressure (mbar)	Precipitation (millimeters)
1873	9	0.4	9999.9	15
1873	10	-5.3	9999.9	34
1873	11	-9.4	9999.9	30
1873	12	999.99	9999.9	29
1874	1	-29.6	9999.9	9
1874	2	-19.6	9999.9	22
1874	9	0.1	1010.7	68
1874	10	-5.4	1002.7	24
1874	11	-8	1010.5	15
1874	12	-8.4	1005.1	69

... (1355 rows omitted)

greenland_climate = Table.read_table('data/climate_upernavik.csv').where('Year', are.above(1876))
greenland_climate = greenland_climate.relabeled('Precipitation (millimeters)', "Precip (mm)")
tidy_greenland = greenland_climate.where('Air temperature (C)', are.not_equal_to(999.99))
tidy_greenland = tidy_greenland.where('Sea level pressure (mbar)', are.not_equal_to(9999.9))
tidy_greenland

Year	Month	Air temperature (C)	Sea level pressure (mbar)	Precip (mm)
1877	1	-21.1	994.9	24
1877	2	-26.5	1008.2	26
1877	3	-17.8	1013.4	84
1877	4	-12	1023.6	6
1877	5	-1.7	1019.4	9
1877	6	1.4	1011.8	5
1877	7	4.6	1011.5	9
1877	8	5.2	1019.8	29
1877	9	3	1009	51
1877	10	-2.8	1006.2	23

... (1221 rows omitted)

When plotting line graphs, we needed separate columns for each month. A pivot gives us a table with that format.

temps_by_month = tidy_greenland.pivot('Month', 'Year', 'Air temperature (C)', np.mean)
temps_by_month

Year	1	2	3	4	5	6	7	8	9	10	11	12
1877	-21.1	-26.5	-17.8	-12	-1.7	1.4	4.6	5.2	3	-2.8	-10	-19.2
1878	-22.9	-26.9	-19.6	-13	-5.6	1.9	3.2	4.3	-0.9	-4	-2	-3.7
1879	-13.5	-25.4	-21.3	-13	-2.7	0	5.2	5.8	-1.1	-5.5	-8.2	-16.5
1880	-22.6	-28.9	-22.7	-10.7	-3.1	2.2	5.6	2.9	0.8	-0.3	-9.4	-14.6
1881	-13.6	-19.9	-26.4	-12.8	-3.1	1.3	5.3	4.2	0.4	-2.8	-9.4	-17.5
1882	-27.6	-25.6	-24.9	-14.8	-3.2	3.1	3.3	4.1	1	-6.1	-12.1	-17.1
1883	-19.9	-21.5	-12.7	-12.8	-1.9	3	6.2	4.4	2.2	-4.1	-8.9	-21.1
1884	-26.1	-24.9	-22.7	-13.6	-6.4	1.6	5.1	3.2	-1.6	-5.1	-11.8	-20.7
1885	-15.9	-18.3	-23	-12.8	-2	-0.4	5.2	5.2	-0.8	-4.6	-13.2	-18.5
1886	-24.2	-25.2	-23.7	-18.4	-2.8	1.2	5.3	4.2	0.5	-6.3	-11.9	-18.6

... (97 rows omitted)

two_months = temps_by_month.select('Year', '2', '9')
two_months.show(5)

Year	2	9
1877	-26.5	3
1878	-26.9	-0.9
1879	-25.4	-1.1
1880	-28.9	0.8
1881	-19.9	0.4

... (102 rows omitted)

two_months.plot('Year')

3. Joins#

Let’s hand pick a few of our favorite gifts.

best_gifts = gifts.take(0,1,2,3).drop("Year")
best_gifts

Country	Value
Denmark	388
Egypt	630
Denmark	340
Finland	445

If you’re curious what they are, here are shortened descriptions for the four we chose.

best_gifts.with_columns("Gift description", 
                        make_array('Book "Restoring the Military Balance"',
                                   'Yellow alabaster bowl',
                                   'Photograph of Her Majesty and His Royal Highness',
                                   'Hand-blown blue glass bird'))

Country	Value	Gift description
Denmark	388	Book "Restoring the Military Balance"
Egypt	630	Yellow alabaster bowl
Denmark	340	Photograph of Her Majesty and His Royal Highness
Finland	445	Hand-blown blue glass bird

Here is some other info about countries, specifically the GDP from each country during Obama’s first year in office. (GDP is gross domestic product, a measure of the value of the final goods and services produced in a country.)

We obtained this data from the World Bank and IMF sites.

gdp = Table().with_columns(
    'Country Name', make_array('Denmark', 'Egypt', 'Egypt', 'Greece'),
    'GDP (Billion $)', make_array(321, 189, 198, 331),
    'Source', make_array('World Bank', 'World Bank', 'IMF', 'World Bank'))
gdp

Country Name	GDP (Billion $)	Source
Denmark	321	World Bank
Egypt	189	World Bank
Egypt	198	IMF
Greece	331	World Bank

Can we combine best_gifts with info about the countries like GDP?

Join will let us merge data from two tables by pairing together rows from each that share a common property. Here we’ll merge a table of gifts with a table of GDP information about the countres giving gifts.

best_gifts.join('Country', gdp, 'Country Name')

Country	Value	GDP (Billion $)	Source
Denmark	388	321	World Bank
Denmark	340	321	World Bank
Egypt	630	189	World Bank
Egypt	630	198	IMF

4. Open-ended exploration#

Suppose we had the following question: Is there any association between a country’s GDP and the total value of gifts they gave to President Obama?

The World Bank provides GDP data we can use.

gdp = Table().read_table('data/gdp.csv')
gdp.show(3)

Country Name	Country Code	Indicator Name	Indicator Code	1960	1961	1962	1963	1964	1965	1966	1967	1968	1969	1970	1971	1972	1973	1974	1975	1976	1977	1978	1979	1980	1981	1982	1983	1984	1985	1986	1987	1988	1989	1990	1991	1992	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	Unnamed: 66
Aruba	ABW	GDP (current US$)	NY.GDP.MKTP.CD	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	4.05587e+08	4.87709e+08	5.96648e+08	6.95531e+08	7.64804e+08	8.72067e+08	9.58659e+08	1.08324e+09	1.24581e+09	1.32067e+09	1.37989e+09	1.53184e+09	1.66536e+09	1.72291e+09	1.87318e+09	1.89665e+09	1.96201e+09	2.04413e+09	2.25475e+09	2.35978e+09	2.46983e+09	2.67765e+09	2.84302e+09	2.55363e+09	2.45363e+09	2.63799e+09	2.61508e+09	2.72793e+09	2.79106e+09	2.96313e+09	2.9838e+09	3.09218e+09	3.20223e+09	3.31006e+09	2.49665e+09	nan	nan
Africa Eastern and Southern	AFE	GDP (current US$)	NY.GDP.MKTP.CD	2.12906e+10	2.18085e+10	2.3707e+10	2.821e+10	2.61188e+10	2.96822e+10	3.22391e+10	3.35146e+10	3.65215e+10	4.18283e+10	4.48626e+10	4.94789e+10	5.35148e+10	6.96008e+10	8.60578e+10	9.16492e+10	9.11246e+10	1.03416e+11	1.15345e+11	1.34671e+11	1.70654e+11	1.74387e+11	1.67266e+11	1.74918e+11	1.60134e+11	1.36297e+11	1.52518e+11	1.86145e+11	2.0414e+11	2.17539e+11	2.53224e+11	2.73403e+11	2.38255e+11	2.36527e+11	2.4012e+11	2.69637e+11	2.68414e+11	2.82185e+11	2.65814e+11	2.62172e+11	2.83925e+11	2.58819e+11	2.6487e+11	3.52659e+11	4.38834e+11	5.12211e+11	5.75921e+11	6.61179e+11	7.08287e+11	7.19217e+11	8.60478e+11	9.64418e+11	9.73043e+11	9.83937e+11	1.00368e+12	9.24253e+11	8.82355e+11	1.02065e+12	9.91022e+11	9.97534e+11	9.21646e+11	1.0821e+12	nan
Afghanistan	AFG	GDP (current US$)	NY.GDP.MKTP.CD	5.37778e+08	5.48889e+08	5.46667e+08	7.51111e+08	8e+08	1.00667e+09	1.4e+09	1.67333e+09	1.37333e+09	1.40889e+09	1.74889e+09	1.83111e+09	1.59556e+09	1.73333e+09	2.15556e+09	2.36667e+09	2.55556e+09	2.95333e+09	3.3e+09	3.69794e+09	3.64172e+09	3.47879e+09	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	nan	4.05518e+09	4.51556e+09	5.22678e+09	6.20914e+09	6.97129e+09	9.74788e+09	1.01093e+10	1.24162e+10	1.58567e+10	1.78051e+10	1.99073e+10	2.01464e+10	2.04971e+10	1.91342e+10	1.81166e+10	1.87535e+10	1.80532e+10	1.87995e+10	2.01161e+10	nan	nan

... (263 rows omitted)

Data wrangling: Let’s subset to the columns we actually need and make these values more readable.

gdp_2008 = Table().with_columns(
    'Country Name', gdp.column('Country Name'),
    '2008 GDP Billion USD', gdp.column('2008') / 1e9
)
gdp_2008.show(3)

Country Name	2008 GDP Billion USD
Aruba	2.84302
Africa Eastern and Southern	708.287
Afghanistan	10.1093

... (263 rows omitted)

Let’s use group to sum up the values of gifts given by each country.

all_gifts.show(5)

Year	Country	Value
2009	Mexico	400
2009	Japan	1495
2009	United Kingdom	16510
2009	Algeria	500
2009	Denmark	388

... (608 rows omitted)

gifts_by_country = all_gifts.drop('Year').group('Country', sum)
gifts_by_country.show(5)

Country	Value sum
Afghanistan	9263
Algeria	4312.28
Argentina	3977.98
Armenia	2985
Australia	4365.48

... (118 rows omitted)

Let’s join these two datasets.

joined = gifts_by_country.join('Country', gdp_2008, 'Country Name')
joined

Country	Value sum	2008 GDP Billion USD
Afghanistan	9263	10.1093
Algeria	4312.28	171.001
Argentina	3977.98	361.558
Armenia	2985	11.662
Australia	4365.48	1055.13
Austria	1275	432.052
Azerbaijan	11503	48.8525
Bahrain	10494	25.7109
Bangladesh	390	91.6313
Benin	12289.9	9.78774

... (91 rows omitted)

joined.scatter('2008 GDP Billion USD', 'Value sum')

The scatter plot doesn’t show much because most of the data is in a very tiny part of the graph. It’s hard to tell whether the other points are outliers or part of some trend. In this case, one handy tool is to take the log of each x-value and y-value.

Recall that for a number $n$, $log(n) = a$ such $10^a = n$. Here are some values to illustrate how logarithms work:

n	log(n)
1	0
10	1
100	2
1,000	3
10,000	4
100,000	5
1,000,000	6
10,000,000	7
100,000,000	8
1,000,000,000	9

joined.scatter('2008 GDP Billion USD', 'Value sum', xscale='log', yscale='log')

That makes the points more evenly distributed across both axes of the plots, and it makes any correlation jump out. We can do a lot with log-log plots like this, but for us, we will be happy to just use it to make the correlation more apparent in this one example.

What about the relationship between oil production and gift values? You can find oil production for every country or region of the world here.

# Compute the average oil production during the years 2009-2016.
oil = Table().read_table('data/oil_production_by_country.csv')
oil_for_years = oil.where('Year', are.between_or_equal_to(2009, 2016))
oil_by_entity = oil_for_years.drop('Code', 'Year').group('Entity', np.average)
oil_by_entity.sort('Oil production (TWh) average', descending=True)
oil_by_entity.sample(10)

Entity	Oil production (TWh) average
South Africa	4.41002
Paraguay	0
Senegal	0
Persian Gulf (Shift)	16080.8
New Zealand	27.4095
Congo	158.067
Persian Gulf (Shift)	16080.8
Saint Helena	0
Guyana	0
Zambia	0

joined_oil = gifts_by_country.join('Country', oil_by_entity, 'Entity')
joined_oil.scatter('Oil production (TWh) average', 'Value sum')

plot = joined_oil.scatter('Oil production (TWh) average', 'Value sum',
                   xscale='log', yscale='log')
plot.set_xlim(1/1e3, 1e5)
plot.set_ylim(1e2, 1e6)

If you’re curious about the top oil producers in the world…

joined_oil.sort('Oil production (TWh) average', descending=True)

Country	Value sum	Oil production (TWh) average
Russia	21474	6153.07
Saudi Arabia	2.14279e+06	6151.13
United States	974	5059.72
China	79786.5	2381.96
Canada	6541.36	2190.7
United Arab Emirates	6125	1825.73
Iraq	4755.99	1819.56
Kuwait	48265	1660.07
Mexico	32546.2	1612.68
Brazil	44409.7	1374.56

... (102 rows omitted)

Q: Brainstorm some alternative variables we might want to examine correlations between?

Norms of gift giving? Political alliances? Trade deals?
Could we track if this relationship changes over time?
In this class, we’re going to be moving towards some more of these open-ended data questions.

CSCI 104: Data Science and Computing for All

Pivots and Joins