Permutation Tests

from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline

1. Load and explore maternal smoker data

As the first stage of our data science pipeline, let’s explore the data and see whether anything interesting stands out.

You can read more about this data here.

births = Table.read_table('data/baby.csv')

births.show(4)

Birth Weight	Gestational Days	Maternal Age	Maternal Height	Maternal Pregnancy Weight	Maternal Smoker
120	284	27	62	100	False
113	282	33	64	135	False
128	279	28	64	115	True
108	282	23	67	125	True

... (1170 rows omitted)

smoking_and_birthweight = births.select('Maternal Smoker', 'Birth Weight')

smoking_and_birthweight.group('Maternal Smoker')

Maternal Smoker	count
False	715
True	459

smoking_and_birthweight.hist('Birth Weight', group='Maternal Smoker')

[Figure: overlaid histograms of Birth Weight for the two Maternal Smoker groups]

Interesting! Babies born to non-smoking mothers appear to have higher birth weights on average. But could this difference just be due to chance? Let’s use hypothesis testing to find out.

2. Test Statistic

means_table = smoking_and_birthweight.group('Maternal Smoker', np.mean)
means_table

Maternal Smoker	Birth Weight mean
False	123.085
True	113.819

means = means_table.column('Birth Weight mean')
observed_difference = means.item(0) - means.item(1)
observed_difference

9.266142572024918

In keeping with the approach we laid out last lecture, we’ll focus only on the absolute difference…

observed_difference = abs(means.item(0) - means.item(1))
observed_difference

9.266142572024918

def abs_difference_of_means(table, group_label, value_label):   
    # table containing group means
    means_table = table.group(group_label, np.mean)
    
    # array of group means
    means = means_table.column(value_label + ' mean')
    
    return abs(means.item(0) - means.item(1))
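To make the computation inside `abs_difference_of_means` concrete, here is a minimal sketch in plain Python that avoids the `datascience` Table API. It uses only the four rows displayed by `births.show(4)` above. The helper name `abs_diff_of_means_plain` and the list names are ours, chosen for illustration.

```python
from statistics import mean

def abs_diff_of_means_plain(labels, values):
    """Absolute difference between the mean of the values labeled
    False and the mean of the values labeled True."""
    true_vals = [v for lab, v in zip(labels, values) if lab]
    false_vals = [v for lab, v in zip(labels, values) if not lab]
    return abs(mean(false_vals) - mean(true_vals))

# The four rows shown by births.show(4)
smoker = [False, False, True, True]
birth_weight = [120, 113, 128, 108]

result = abs_diff_of_means_plain(smoker, birth_weight)
print(result)  # 1.5
```

In these four rows the smokers’ mean (118) is actually slightly higher than the non-smokers’ mean (116.5). Because we take the absolute value, the order of subtraction does not affect the statistic.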

Our observed difference

observed_difference = abs_difference_of_means(births, 'Maternal Smoker', "Birth Weight")
observed_difference

9.266142572024918

We can use this function on lots of columns!

abs_difference_of_means(births, 'Maternal Smoker', "Maternal Age")

0.8076725017901509

abs_difference_of_means(births, 'Maternal Smoker', "Maternal Height")

0.09058914941267915

3. Simulation Under Null Hypothesis

Creating Permutations of Labels

We’ll use a tiny table to illustrate our approach…

tiny_smoking_and_birthweight = smoking_and_birthweight.take(np.arange(0,6))
tiny_smoking_and_birthweight

Maternal Smoker	Birth Weight
False	120
False	113
True	128
True	108
False	136
False	138

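We’ll use .sample(with_replacement=False) to shuffle the rows of a table. Sampling without replacement, with the default sample size equal to the number of rows, returns every row exactly once in a random order.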

shuffled_labels = tiny_smoking_and_birthweight.sample(with_replacement=False).column('Maternal Smoker')
shuffled_labels

array([ True, False,  True, False, False, False])

original_and_shuffled = tiny_smoking_and_birthweight.with_columns('Shuffled Label', 
                                                                 shuffled_labels)
original_and_shuffled

Maternal Smoker	Birth Weight	Shuffled Label
False	120	True
False	113	False
True	128	True
True	108	False
False	136	False
False	138	False
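Sampling every row without replacement is the same as shuffling: each label is used exactly once, so the number of `True` and `False` labels stays fixed and only their pairing with birth weights changes. Here is a minimal standard-library sketch of that idea, using the six labels from the tiny table. The seed and variable names are illustrative assumptions, and the particular order printed will differ from the notebook output above.

```python
import random
from collections import Counter

# Labels from the tiny table, in their original order
labels = [False, False, True, True, False, False]

rng = random.Random(0)  # fixed seed only so the sketch is reproducible

# Drawing len(labels) items without replacement yields a permutation
shuffled = rng.sample(labels, k=len(labels))

# Same collection of labels: still four False and two True
same_counts = Counter(shuffled) == Counter(labels)
print(shuffled)
print(same_counts)  # True
```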

A function to make a permutation!

def permutation_sample(table, group_label):
    """
    Returns: The table with a new "Shuffled Label" column containing
    the shuffled values of the group_label.
    """
    
    # array of shuffled labels
    shuffled_labels = table.sample(with_replacement=False).column(group_label)
    
    # table of numerical variable and shuffled labels
    shuffled_table = table.with_columns('Shuffled Label', shuffled_labels)
    
    return shuffled_table

original_and_shuffled = permutation_sample(tiny_smoking_and_birthweight, 
                                           "Maternal Smoker")
original_and_shuffled

Maternal Smoker	Birth Weight	Shuffled Label
False	120	False
False	113	True
True	128	False
True	108	False
False	136	True
False	138	False

We’ll calculate the statistic for the shuffled groups.

abs_difference_of_means(original_and_shuffled, "Shuffled Label", "Birth Weight")

1.0
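The value 1.0 can be checked by hand from the shuffled table shown above. This sketch simply redoes that arithmetic in plain Python; the list names are ours.

```python
from statistics import mean

# Birth weights and shuffled labels copied from the table above
birth_weight = [120, 113, 128, 108, 136, 138]
shuffled_label = [False, True, False, False, True, False]

# Mean birth weight within each shuffled group
mean_true = mean(w for w, lab in zip(birth_weight, shuffled_label) if lab)
mean_false = mean(w for w, lab in zip(birth_weight, shuffled_label) if not lab)

statistic = abs(mean_false - mean_true)
print(mean_true, mean_false, statistic)  # 124.5 123.5 1.0
```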

And now the full table…

smoking_and_birthweight

Maternal Smoker	Birth Weight
False	120
False	113
True	128
True	108
False	136
False	138
False	132
False	120
True	143
False	140

... (1164 rows omitted)

original_and_shuffled = permutation_sample(smoking_and_birthweight, 
                                           "Maternal Smoker")
original_and_shuffled

Maternal Smoker	Birth Weight	Shuffled Label
False	120	False
False	113	False
True	128	True
True	108	True
False	136	False
False	138	False
False	132	False
False	120	False
True	143	True
False	140	True

... (1164 rows omitted)

This is the statistic for one sample simulated under the null hypothesis.

abs_difference_of_means(original_and_shuffled, 'Shuffled Label', 'Birth Weight')

0.37097064155888404
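One shuffled statistic is a single draw from the distribution that the null hypothesis predicts. A permutation test repeats the shuffle many times and compares the observed statistic with the resulting collection. What follows is only a sketch of that loop in plain Python on the six-row tiny table. The repetition count, the seed, and the helper names are assumptions made for illustration, and six rows are far too few to support any conclusion.

```python
import random
from statistics import mean

def abs_diff_of_means_plain(labels, values):
    """Absolute difference between the two group means."""
    true_vals = [v for lab, v in zip(labels, values) if lab]
    false_vals = [v for lab, v in zip(labels, values) if not lab]
    return abs(mean(false_vals) - mean(true_vals))

# The six rows of the tiny table
labels = [False, False, True, True, False, False]
weights = [120, 113, 128, 108, 136, 138]

observed = abs_diff_of_means_plain(labels, weights)  # 8.75

rng = random.Random(0)  # fixed seed only for reproducibility
simulated = []
for _ in range(1000):  # repetition count is an arbitrary choice
    # Shuffle the labels, keep the weights in place
    shuffled = rng.sample(labels, k=len(labels))
    simulated.append(abs_diff_of_means_plain(shuffled, weights))

# Proportion of shuffles at least as extreme as the observed statistic
proportion = sum(s >= observed for s in simulated) / len(simulated)
print(observed, proportion)
```

With the full table, the same loop would call `permutation_sample` and `abs_difference_of_means` in place of the plain-Python helpers.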