Groups¶

from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

import warnings
warnings.simplefilter('ignore', FutureWarning)
warnings.simplefilter('ignore', np.VisibleDeprecationWarning)

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

1. Prediction using functions and apply¶

We’re following the example in Ch. 8.1.3

Q: Can we use the average of a child’s parents’ heights to predict the child’s height?

galton = Table.read_table('data/galton.csv')
heights = galton.select('father', 'mother', 'childHeight').relabeled(2, 'child')
heights.show(10)

father	mother	child
78.5	67	73.2
78.5	67	69.2
78.5	67	69
78.5	67	69
75.5	66.5	73.5
75.5	66.5	72.5
75.5	66.5	65.5
75.5	66.5	65.5
75	64	71
75	64	68

... (924 rows omitted)

Another scatter plot (Note: Usually we create scatter plot by specifying two columns: one for x-values and one for y-values, and use a categorical column to group points by color when creating an overlay. If we have a table where we have two columns for y-values that share the same column for x-values, we can create an overlay plot just by specifying the column containing those shared x-values.)

heights.scatter('child')

As review, we can use .apply() to apply a function to one or multiple columns in a table.

heights.show(6)

father	mother	child
78.5	67	73.2
78.5	67	69.2
78.5	67	69
78.5	67	69
75.5	66.5	73.5
75.5	66.5	72.5

... (928 rows omitted)

def average(x, y):
    """Compute the average of two values"""
    return (x+y)/2

parent_avg = heights.apply(average, 'mother', 'father')
parent_avg.take(np.arange(0, 6))

array([72.75, 72.75, 72.75, 72.75, 71.  , 71.  ])

Add a column with parents’ average height to the height table.

heights = heights.with_columns(
    'parent average', parent_avg
)
heights

father	mother	child	parent average
78.5	67	73.2	72.75
78.5	67	69.2	72.75
78.5	67	69	72.75
78.5	67	69	72.75
75.5	66.5	73.5	71
75.5	66.5	72.5	71
75.5	66.5	65.5	71
75.5	66.5	65.5	71
75	64	71	69.5
75	64	68	69.5

... (924 rows omitted)

heights.scatter('parent average', 'child')
plots.axvline(68, color='gold', linestyle='--', lw=2);

Think-pair-share: Suppose researchers encountered a new couple, similar to those in this dataset, and wondered how tall their child would be once their child grew up. What would be a good way to predict the child’s height, given that the parent average height was, say, 68 inches?

A: One initial approach would be to base the prediction on all observations (child, parent pairs) that are “close to” 68 inches for the parent.

Let’s take “close to” to mean within a half-inch
Let’s draw these with red lines

heights.scatter('parent average', 'child')
plots.plot([67.5, 67.5], [55, 80], color='red', lw=2)
plots.plot([68.5, 68.5], [55, 80], color='red', lw=2)
plots.axvline(68, color='gold', linestyle='--', lw=2);

Let’s now identify all points within that red strip.

close_to_68 = heights.where('parent average', are.between(67.5, 68.5))
close_to_68

father	mother	child	parent average
74	62	74	68
74	62	70	68
74	62	68	68
74	62	67	68
74	62	67	68
74	62	66	68
74	62	63.5	68
74	62	63	68
74	61	65	67.5
73.2	63	62.7	68.1

... (175 rows omitted)

And take the average to make a prediction about the child.

np.average(close_to_68.column('child'))

67.62

Ooo! A function to compute that child mean height for any parent average height

def predict_child(parent_avg_height):
    close_points = heights.where('parent average', are.between(parent_avg_height - 0.5, parent_avg_height + 0.5))
    return close_points.column('child').mean()         

predict_child(68)

67.62

predict_child(65)

65.83829787234043

Apply predict_child to all the parent averages.

predicted = heights.apply(predict_child, 'parent average')
predicted.take(np.arange(0,10))

array([70.1       , 70.1       , 70.1       , 70.1       , 70.41578947,
       70.41578947, 70.41578947, 70.41578947, 68.5025    , 68.5025    ])

#extend our table with these new predictions 
height_pred = heights.with_columns('prediction', predicted)

height_pred.select('child', 'parent average', 'prediction').scatter('parent average')

Preview: Throughout this course we’ll keep moving towards making our predictions better!

Extra: How close is close enough for prediction?¶

The choice of say two heights are “close to” eachother if they are within a half-inch was a somewhat arbitrary choice. We chould have chosen other values instead. What would happen if we changed that constant to be 0.25, 1, 2, or 5?

This visualization demostrates the impact that choice has on our predictions.

from functools import lru_cache as cache

@cache  # saves tables for each delta we compute to avoid recomputing.
def vary_range(delta):
    """Use a window of +/- delta when predicting child heights."""
    def predict_child(parent_avg_height):
        close_points = heights.where('parent average', are.between(parent_avg_height - delta, parent_avg_height + delta))
        return close_points.column('child').mean()         

    predicted = heights.apply(predict_child, 'parent average')
    height_pred = heights.with_columns('prediction', predicted)
    return height_pred.select('child', 'parent average', 'prediction')

def visualize_predictions(delta = 0.5):
    predictions = vary_range(delta)
    predictions.scatter('parent average', s=50) # make dots a little bigger than usual
    
_ = interact(visualize_predictions, delta = make_array(0,0.25, 0.5, 1, 3, 5, 10))

2. Groups¶

We must often divide rows into groups according to some feature, and then compute a basic characteristic for each resulting group.

# table of 98 tiles from Scrabble game (excludes the two blanks)
scrabble_tiles = Table().read_table('data/scrabble_tiles.csv')
scrabble_tiles.sample(10)

Letter	Score	Vowel
E	1	Yes
U	1	Yes
W	4	No
F	4	No
E	1	Yes
I	1	Yes
N	1	No
C	3	No
I	1	Yes
S	1	No

scrabble_tiles.group('Letter')

Letter	count
A	9
B	2
C	2
D	4
E	12
F	2
G	3
H	2
I	9
J	1

... (16 rows omitted)

scrabble_tiles.group('Vowel')

Vowel	count
No	56
Yes	42

scrabble_tiles.group('Vowel', sum)

Vowel	Letter sum	Score sum
No		145
Yes		42

Notes:

When we pass in a function to group that is not the default (e.g. sum), the name of that function is appended to the column name.
Some of the columns are empty because sum can only be applied to numerical (not categorial) variables. Our package is smart about this and leaves the columns empty (e.g. Letter sum).

scrabble_tiles.group('Vowel', max)

Vowel	Letter max	Score max
No	Z	10
Yes	U	1

Applying aggregation functions (e.g. max) to some columns (e.g. Letter) are not meaningful. That’s ok. But we’ll have to use our understanding about the dataset to ignore these aggregations.

Group multiple columns¶

small_scrabble = scrabble_tiles.sample(10)
small_scrabble = small_scrabble.with_columns('Used', make_array('Yes', 'Yes', 'Yes', 'No', 'No', 
                                                            'No', 'No', 'No', 'No', 'No'))
small_scrabble

Letter	Score	Vowel	Used
I	1	Yes	Yes
I	1	Yes	Yes
O	1	Yes	Yes
U	1	Yes	No
W	4	No	No
I	1	Yes	No
C	3	No	No
N	1	No	No
E	1	Yes	No
O	1	Yes	No

Q: How many vowels do I have left that I have not used?

small_scrabble.group(make_array('Vowel', 'Used'))

Vowel	Used	count
No	No	3
Yes	No	4
Yes	Yes	3

Q: What’s the total score of the non-values I have used and not used?

# Notice the different syntax for the array of column names.  You'll see this form in 
# the book and some examples, but it is the same as the make_array('Vowel', 'Used') form.
small_scrabble.group(['Vowel', 'Used'], sum)

Vowel	Used	Score sum
No	No	8
Yes	No	4
Yes	Yes	3

Groups for Galton heights¶

galton = Table().read_table("data/galton.csv")
galton.show(3)

family	father	mother	midparentHeight	children	childNum	gender	childHeight
1	78.5	67	75.43	4	1	male	73.2
1	78.5	67	75.43	4	2	female	69.2
1	78.5	67	75.43	4	3	female	69

... (931 rows omitted)

Q: How many children does each family have?

by_family = galton.group('family')
by_family.show(5)

family	count
1	4
10	1
100	3
101	4
102	6

... (200 rows omitted)

# Relabel based on what we know about this particular dataset
# (each row is a child)
by_family = by_family.relabeled("count", "number of children")

by_family.hist("number of children", bins=15)

Q: Per family, what is the average height of the children?

by_family = galton.select('family', 'childHeight').group('family', np.mean)
by_family.show(5)
by_family.hist('childHeight mean')

family	childHeight mean
1	70.1
10	65.5
100	70.7333
101	72.375
102	66.1667

... (200 rows omitted)

Groups for Obama gifts¶

Let’s examine the dataset of Obama gifts we looked at previously. Now imagine we’re an auditor and want to investigate the relationship between countries and the number and amount of gifts their giving. We’ll use the .group() method for this.

https://media-cldnry.s-nbcnews.com/image/upload/MSNBC/Components/Photo/_new/110118-obama-gift-2p.jpg

“President Barack Obama receives a gift from Saudi King Abdullah at the start of their bilateral meeting in Riyadh, Saudi Arabia, on June 3, 2009. The large gold medallion was among several gifts given that day that were valued at $34,500, the State Department later said” –CBS News

gifts = Table().read_table('data/obama-gifts.csv')
gifts = gifts.where('donor_country', are.not_equal_to('nan')) #clean up and remove the nans
gifts.show(5)

year_received	donor_country	value_usd	gift_description
2009	Mexico	400	Book entitled ``The National Palace of Mexico''; red and ...
2009	Japan	1495	Mikimoto desk clock; black basketball jersey. Rec'd--2/2 ...
2009	United Kingdom	16510	Black and gold pen with a wooden pen holder, made from t ...
2009	Algeria	500	Four boxes of dates and twelve bottles of wine. Rec'd--3 ...
2009	Denmark	388	Book entitled ``Restoring the Balance''; book entitled ` ...

... (608 rows omitted)

Grouping by a single column¶

by_country = gifts.group('donor_country')
by_country.show(5)

donor_country	count
Afghanistan	7
Algeria	6
Argentina	5
Armenia	1
Australia	5

... (118 rows omitted)

Q: What are the top ten countries by number of gifts given? Show in a bar chart.

by_country.sort('count', descending=True).take(np.arange(0, 10)).barh('donor_country', 'count');

Q: What are the top ten countries by total value of all the gifts given by that country? Show in a bar chart.

by_country_sum = gifts.group('donor_country', sum)
by_country_sum.show(5)

donor_country	year_received sum	value_usd sum
Afghanistan	14077	9263
Algeria	12082	4312.28
Argentina	10069	3977.98
Armenia	2010	2985
Australia	10064	4365.48

... (118 rows omitted)

sort_table = by_country_sum.sort('value_usd sum', descending=True)
sort_table.take(np.arange(0, 10)).barh('donor_country', 'value_usd sum');

Grouping by multiple columns¶

Q: How many gifts did countries give in each year?

by_year = gifts.group('year_received')
by_year.barh('year_received')

Q: What was the total value of gifts given in each year?

value_by_year = gifts.group('year_received', sum)
value_by_year.barh('year_received', "value_usd sum")

Q: How many gifts did each country give in each year?

by_year_and_country = gifts.group(make_array('year_received', 'donor_country'))
by_year_and_country

year_received	donor_country	count
2009	Afghanistan	1
2009	Algeria	1
2009	Argentina	1
2009	Brazil	1
2009	Brunei	2
2009	Burkina Faso	1
2009	Chile	1
2009	China	6
2009	Czechia	3
2009	Denmark	2

... (336 rows omitted)

Let’s make a visualization for a particular country.

uk = by_year_and_country.where('donor_country', are.equal_to('United Kingdom'))
uk.barh('year_received', 'count')
plots.title('United Kingdom gifts to Obama');

Think-pair-share: Let’s write a function to make these plots for any country we’d want!

Note: In this function, we will specifically not return anything. We’ll just make code to make a plot appear. This is not uncommon to create functions without return–those that make visualizations and use print() statements.

def plot_country_yearly_gifts(donor_country_name): 
    gifts_from_donor = by_year_and_country.where('donor_country', are.equal_to(donor_country_name))
    gifts_from_donor.barh('year_received', 'count')
    title = donor_country_name + ' gifts to Obama'
    plots.title(title);

Q: Which are the variables that are local to the function? Which are the global variables?

plot_country_yearly_gifts("United Kingdom")

plot_country_yearly_gifts("Saudi Arabia")

plot_country_yearly_gifts("Qatar")

countries = gifts.sort('donor_country',distinct=True).column('donor_country')
_ = interact(plot_country_yearly_gifts, donor_country_name=countries)

A more general function¶

We’ve talked about how computer science is all about abstraction. Let’s write a function that takes as a parameter any aggregation function we can pass into .group(), and the function groups by country, year, and that aggregation function. The function then plots the results for a specific country.

gifts.show(3)

year_received	donor_country	value_usd	gift_description
2009	Mexico	400	Book entitled ``The National Palace of Mexico''; red and ...
2009	Japan	1495	Mikimoto desk clock; black basketball jersey. Rec'd--2/2 ...
2009	United Kingdom	16510	Black and gold pen with a wooden pen holder, made from t ...

... (610 rows omitted)

def country_group_generic(donor_country_name, aggregation_function): 
    group_gifts = gifts.group(make_array('year_received', 'donor_country'), aggregation_function)
    
    gifts_from_donor = group_gifts.where('donor_country', are.equal_to(donor_country_name))

    # The aggregate value_usd column will have a different name, depending on which
    # aggregation function we employ.  So, we just refer to that column by its column
    # index, 2, instead of its name when make the bar chart.
    gifts_from_donor.barh('year_received', 2)
    
    title = donor_country_name + ' gifts to Obama'
    plots.title(title);

country_group_generic("United Kingdom", max)

country_group_generic("United Kingdom", min)

country_group_generic("United Kingdom", np.mean)

country_group_generic("Afghanistan", np.mean)

_ = interact(country_group_generic, donor_country_name=countries, aggregation_function=[('sum',sum),('min',min),('max',max),('mean',np.mean)])

Letter	Score	Vowel	Used
I	1	Yes	Yes
I	1	Yes	Yes
O	1	Yes	Yes
U	1	Yes	No
W	4	No	No
I	1	Yes	No
C	3	No	No
N	1	No	No
E	1	Yes	No
O	1	Yes	No

Letter	Score	Vowel	Used
I	1	Yes	Yes
I	1	Yes	Yes
O	1	Yes	Yes
U	1	Yes	No
W	4	No	No
I	1	Yes	No
C	3	No	No
N	1	No	No
E	1	Yes	No
O	1	Yes	No

CSCI 104: Understanding Data Through Computation