from datascience import *
from cs104 import *
import numpy as np

%matplotlib inline

Functions

1. Overlaid Histograms

Here’s a dataset of adults and the heights of their parents.

heights_original = Table().read_table('data/galton.csv')
heights_original.show(5)
family father mother midparentHeight children childNum gender childHeight
1 78.5 67 75.43 4 1 male 73.2
1 78.5 67 75.43 4 2 female 69.2
1 78.5 67 75.43 4 3 female 69
1 78.5 67 75.43 4 4 female 69
2 75.5 66.5 73.66 4 1 male 73.5

... (929 rows omitted)

Let’s focus on the female adult children first.

heights = heights_original.where('gender', 'female').select('father', 'mother', 'childHeight')
heights = heights.relabeled('childHeight', 'daughter')
heights
father mother daughter
78.5 67 69.2
78.5 67 69
78.5 67 69
75.5 66.5 65.5
75.5 66.5 65.5
75 64 68
75 64 67
75 64 64.5
75 64 63
75 58.5 66.5

... (443 rows omitted)

plot = heights.hist('daughter')
plot.set_xlabel("Daughter height (inches)")
[Figure: histogram of daughter heights (inches)]

What are some common heights we can see from the histogram above?

A concentration around 63 to 64 inches (about 5’3” to 5’4”).

plot = heights.hist('mother')
plot.set_xlabel("Mother height (inches)")
[Figure: histogram of mother heights (inches)]

Recall that we can use overlaid histograms to compare the distributions of two variables that have the same units. (Note: We’ve seen how to group by a column containing a categorical variable. It’s also possible to call hist with multiple column names to produce an overlaid histogram.)

plot = heights.hist('daughter', 'mother')
plot.set_xlabel('Height (inches)')
[Figure: overlaid histograms of daughter and mother heights (inches)]
plot = heights.hist()
plot.set_xlabel('Height (inches)')
[Figure: overlaid histograms of father, mother, and daughter heights (inches)]

We can specify the bins ourselves to control their width and number.

our_bins = np.arange(55, 81, 1)
print("bins=", our_bins)
plot = heights.hist(bins=our_bins)
plot.set_xlabel('Height (inches)')
bins= [55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
 79 80]
[Figure: overlaid histograms of father, mother, and daughter heights (inches), one-inch bins from 55 to 80]
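
To see why our_bins stops at 80 rather than 81, it can help to count by hand. The sketch below is a plain-Python stand-in, not the datascience or NumPy implementation: range plays the role of np.arange, and count_in_bins and sample_heights are hypothetical names with made-up values.

```python
# Plain-Python stand-in for np.arange(55, 81, 1): the stop value 81 is excluded.
edges = list(range(55, 81, 1))

def count_in_bins(values, edges):
    """Count how many values fall in each bin [edges[i], edges[i+1]).

    The last bin also includes its right edge, as NumPy's histogram does.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            is_last = (i == len(edges) - 2)
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

# Made-up heights, for illustration only.
sample_heights = [55, 63.2, 63.9, 64, 80]
counts = count_in_bins(sample_heights, edges)

print(len(edges), "edges make", len(counts), "bins")
print("bin starting at 63 holds", counts[edges.index(63)], "values")
```

The number of bins is always one less than the number of edges, so changing the step in the bin array changes both the width and the number of bars.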

Let’s now take a different slice of the original data.

heights_original.show(3)
family father mother midparentHeight children childNum gender childHeight
1 78.5 67 75.43 4 1 male 73.2
1 78.5 67 75.43 4 2 female 69.2
1 78.5 67 75.43 4 3 female 69

... (931 rows omitted)

heights = heights_original.select('father', 'mother', 'childHeight').relabeled('childHeight', 'child')
heights
father mother child
78.5 67 73.2
78.5 67 69.2
78.5 67 69
78.5 67 69
75.5 66.5 73.5
75.5 66.5 72.5
75.5 66.5 65.5
75.5 66.5 65.5
75 64 71
75 64 68

... (924 rows omitted)

plot = heights.hist(bins=np.arange(55,80,2))
plot.set_xlabel('Height (inches)')
[Figure: overlaid histograms of father, mother, and child heights (inches), two-inch bins from 55 to 79]

Question: Why is the maximum height of a bar for child smaller than that for mother or father?

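A: The child heights have a larger spread because they include both male and female children. The bar areas in each histogram total 100%, so a distribution spread over more bins has shorter bars.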
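
The answer above depends on the histograms being drawn on a density scale, where each bar's height is the percent of values in the bin divided by the bin width. Here is a rough plain-Python sketch under that assumption; percent_per_unit is a hypothetical helper and both samples are made up.

```python
def percent_per_unit(values, edges):
    """Bar heights on a density scale: percent of values in a bin / bin width."""
    heights = []
    for i in range(len(edges) - 1):
        left, right = edges[i], edges[i + 1]
        is_last = (i == len(edges) - 2)
        in_bin = [v for v in values
                  if left <= v < right or (is_last and v == right)]
        percent = 100 * len(in_bin) / len(values)
        heights.append(percent / (right - left))
    return heights

edges = list(range(60, 76, 2))   # two-inch bins: 60, 62, ..., 74

# Made-up samples: one tightly clustered, one spread out.
clustered = [64, 64.5, 65, 65, 65.5, 66, 66.5, 67]
spread_out = [61, 63, 64.5, 66, 67.5, 69, 71, 73]

tall = percent_per_unit(clustered, edges)
short = percent_per_unit(spread_out, edges)

# Each histogram's total area (height * width) is 100%.
print(sum(h * 2 for h in tall), sum(h * 2 for h in short))
# The spread-out sample has the lower peak.
print(max(tall), max(short))
```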

2. Functions

We use functions all the time. They do computation for us without us describing every single step. That saves us time, since we don’t have to write the code, and lets us perform those operations without even caring how they are implemented. For example, we have an idea of how we’d take the maximum of a list of numbers, but we can just use Python’s max function without explicitly describing how it works.

Can we do the same for other computations? Yes! It’s a core principle of programming: define functions for tasks you do often so you never have to repeat writing the code.

Defining and calling our own functions

def double(x):
    """ Double x """
    return 2*x
double(5)
10
double(double(5))
20

Scoping: a parameter is only “visible” inside the function definition.

x #should throw an error
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 x #should throw an error

NameError: name 'x' is not defined
double(5/4)
2.5
y = 5
double(y/4)
2.5
x # we still can't access the parameter
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 x # we still can't access the parameter

NameError: name 'x' is not defined
x = 1.5
double(x)
3.0
x
1.5
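
The same rule holds even if the function assigns to its parameter: the global x is a separate name. Here is a small sketch, where reset is a hypothetical function written only to illustrate this.

```python
def reset(x):
    """Assign a new value to the parameter x and return it."""
    x = 0          # changes only the function's own x
    return x

x = 1.5
result = reset(x)

print(result)  # value returned by the function
print(x)       # the global x is unchanged
```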

What happens if I double an array?

double(make_array(3,4,5))
array([ 6,  8, 10])

What happens if I double a string?

double("string")
'stringstring'
5*"string"
'stringstringstringstringstring'
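
The function body 2*x never checks what x is; Python decides what * means from the type of the argument. Here is a short sketch with built-in types only (the list call is an addition not shown above).

```python
def double(x):
    """Double x."""
    return 2 * x

print(double(5))          # numbers: arithmetic
print(double("ab"))       # strings: repetition
print(double([3, 4, 5]))  # plain lists: repetition, not element-wise
```

A plain list behaves like a string here, whereas the array built with make_array above doubles each element.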

More functions

Think-pair-share:

  1. What is this function below doing?

  2. How would you rewrite this function (the name of the function, the docstring, the parameters) in order to make it more clear?

def f(s):
    total = sum(s)
    return np.round(s / total * 100, 2)

A: Always use meaningful names.

def percents(counts):
    """Convert the counts to percents out of the total."""
    total = sum(counts)
    return np.round(counts / total * 100, 2)

Note that we have a local variable total in our definition.

f(make_array(2, 4, 8, 6, 10))
array([ 6.67, 13.33, 26.67, 20.  , 33.33])
percents(make_array(2, 4, 8, 6, 10))
array([ 6.67, 13.33, 26.67, 20.  , 33.33])

Remember scoping here too!

total
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[28], line 1
----> 1 total

NameError: name 'total' is not defined

Accessing global variables

Global variables are defined outside any function; we’ve been using them all along. Inside your functions, you can access global variables that you have defined. Always define globals before the functions that use them to avoid confusion and surprising results when you rerun your whole notebook!

heights.show(3)
father mother child
78.5 67 73.2
78.5 67 69.2
78.5 67 69

... (931 rows omitted)

def children_under_height(height):
    """Proportion of children in our data set that are no taller than the given height."""    
    return heights.where("child", are.below_or_equal_to(height)).num_rows / heights.num_rows
children_under_height(65)
0.3747323340471092
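
A function looks up a global variable each time it is called, not when it is defined, which is why the order of cells matters when you rerun a notebook. Here is a minimal sketch with made-up data; sample_heights and proportion_at_most are hypothetical stand-ins for the table and function above.

```python
# Global variable: defined before the function that uses it.
sample_heights = [62, 64, 65, 68, 70]

def proportion_at_most(height):
    """Proportion of sample_heights that are no taller than height."""
    matching = [h for h in sample_heights if h <= height]
    return len(matching) / len(sample_heights)

before = proportion_at_most(65)

# Rebinding the global changes what the function sees on the next call.
sample_heights = [62, 64, 65, 66]
after = proportion_at_most(65)

print(before, after)
```

The two calls pass the same argument but give different results, because the global changed in between.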

Functions with more than one parameter

We can define functions with more than one parameter.

#original function
def percents(counts):
    """Convert the counts to percents out of the total."""
    total = sum(counts)
    return np.round(counts / total * 100, 2)
#function with two parameters
def percents_two_params(counts, decimals_to_round):
    """Convert the counts to percents out of the total, rounded to decimals_to_round places."""
    total = sum(counts)
    return np.round(counts / total * 100, decimals_to_round)
counts = make_array(2, 4, 8, 6, 10)
percents(counts)
array([ 6.67, 13.33, 26.67, 20.  , 33.33])
percents_two_params(counts, 2)
array([ 6.67, 13.33, 26.67, 20.  , 33.33])
percents_two_params(counts, 1)
array([ 6.7, 13.3, 26.7, 20. , 33.3])
percents_two_params(counts, 0)
array([ 7., 13., 27., 20., 33.])
percents_two_params(counts, 3)
array([ 6.667, 13.333, 26.667, 20.   , 33.333])
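
When a function has several parameters, arguments are matched by position, and they can also be passed by name to make a call easier to read. The sketch below uses a plain-list stand-in for percents_two_params; percents_list is a hypothetical name, and Python's built-in round replaces np.round.

```python
def percents_list(counts, decimals_to_round):
    """Convert a list of counts to percents of the total, rounded."""
    total = sum(counts)
    return [round(c / total * 100, decimals_to_round) for c in counts]

counts = [2, 4, 8, 6, 10]

by_position = percents_list(counts, 1)
by_name = percents_list(counts=counts, decimals_to_round=1)

print(by_position)
print(by_position == by_name)
```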

Let’s write a function that, given the unique id of an observation (a row), gives us the value of a particular column.

heights_id = heights.with_columns('id', np.arange(heights.num_rows))
heights_id.show(5)
father mother child id
78.5 67 73.2 0
78.5 67 69.2 1
78.5 67 69 2
78.5 67 69 3
75.5 66.5 73.5 4

... (929 rows omitted)

def find_a_value(table, observation_id, column_name):
    """Value in column_name for the row of table whose id is observation_id."""
    return table.where('id', are.equal_to(observation_id)).column(column_name).item(0)
find_a_value(heights_id, 2, 'mother')
67.0
find_a_value(heights_id, 200, 'mother')
63.0

Great! Now we can keep using a function we wrote throughout this class to speed up our work, in the same way we use functions built into Python, e.g. max, or the datascience package, e.g. .take().

3. Apply

There are times we want to perform mathematical operations on columns of a table but can’t use array broadcasting…

min(make_array(70, 73, 69), 72) #should be an error
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[44], line 1
----> 1 min(make_array(70, 73, 69), 72) #should be an error

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
def cut_off_at_72(x):
    """The smaller of x and 72"""
    return min(x, 72)
cut_off_at_72(62)
62
cut_off_at_72(72)
72
cut_off_at_72(78)
72

The table apply method applies a function to every entry in a column.

heights
father mother child
78.5 67 73.2
78.5 67 69.2
78.5 67 69
78.5 67 69
75.5 66.5 73.5
75.5 66.5 72.5
75.5 66.5 65.5
75.5 66.5 65.5
75 64 71
75 64 68

... (924 rows omitted)

heights.hist('child')
[Figure: histogram of child heights]
cut_off = heights.apply(cut_off_at_72, 'child')
height2 = heights.with_columns('child', cut_off)
height2.hist('child')
[Figure: histogram of child heights capped at 72 inches]
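
Conceptually, apply calls the function once per entry and collects the results. Here is a rough plain-Python equivalent, not the datascience implementation, using the first five child heights shown above.

```python
def cut_off_at_72(x):
    """The smaller of x and 72"""
    return min(x, 72)

child = [73.2, 69.2, 69, 69, 73.5]   # first five 'child' values

# One call per entry, like heights.apply(cut_off_at_72, 'child')
cut_off = [cut_off_at_72(h) for h in child]
print(cut_off)
```

Each call receives a single number, so min compares two numbers and the ambiguity error from passing a whole array never arises.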

As with variables, we can look at functions and their types. In Python, help prints out the docstring of a function.

cut_off_at_72
<function __main__.cut_off_at_72(x)>
type(cut_off_at_72)
function
help(cut_off_at_72)
Help on function cut_off_at_72 in module __main__:

cut_off_at_72(x)
    The smaller of x and 72
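
The text that help displays is stored on the function itself. As a small sketch, the docstring is available as the __doc__ attribute and the function's name as __name__.

```python
def cut_off_at_72(x):
    """The smaller of x and 72"""
    return min(x, 72)

print(cut_off_at_72.__name__)
print(cut_off_at_72.__doc__)
```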

Apply with multiple columns

heights.show(6)
father mother child
78.5 67 73.2
78.5 67 69.2
78.5 67 69
78.5 67 69
75.5 66.5 73.5
75.5 66.5 72.5

... (928 rows omitted)

parent_max = heights.apply(max, 'mother', 'father')
parent_max.take(np.arange(0, 6))
array([78.5, 78.5, 78.5, 78.5, 75.5, 75.5])
def average(x, y):
    """Compute the average of two values"""
    return (x+y)/2
parent_avg = heights.apply(average, 'mother', 'father')
parent_avg.take(np.arange(0, 6))
array([72.75, 72.75, 72.75, 72.75, 71.  , 71.  ])
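
With several columns, apply passes one value from each column, row by row, as separate arguments to the function. Here is a plain-Python sketch of that pairing using zip, with the mother and father values from the six rows shown above.

```python
def average(x, y):
    """Compute the average of two values"""
    return (x + y) / 2

mother = [67, 67, 67, 67, 66.5, 66.5]
father = [78.5, 78.5, 78.5, 78.5, 75.5, 75.5]

# zip pairs up the entries row by row, like heights.apply(f, 'mother', 'father')
parent_max = [max(m, f) for m, f in zip(mother, father)]
parent_avg = [average(m, f) for m, f in zip(mother, father)]

print(parent_max)
print(parent_avg)
```

The order of the column names matters in general, since the first column supplies the first argument; max and average happen to give the same result either way.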