Midterm Review#

from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline

Small Items#

  • You will not need to use the interact function or think about interactive visualizations on the exam.

Loops#

An Example of a loop that iterates over the values of an array#

# stay_positive returns the number of positive values in an array, e.g. stay_positive(make_array(1,-1,-2,3,4,0)) returns 3.
extra_values = make_array(-1,2,3)

def stay_positive(values):
    numbers_positive = 0   # keep a count of how many positive numbers we've seen so far
    
    for value in values:
        # if my value is positive, increase number of positive numbers
        if value > 0: 
            numbers_positive = numbers_positive + 1
            
    return numbers_positive
            
stay_positive(make_array(1,-1,-2,3,4,0))
3
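The same count can also be computed without an explicit loop. The sketch below is only an illustration: it uses np.array in place of make_array so that it runs with NumPy alone, and count_positive is a hypothetical name, not a function from the course library.

```python
import numpy as np

def count_positive(values):
    # values > 0 gives an array of True/False; count_nonzero counts the Trues
    return np.count_nonzero(values > 0)

positive_count = count_positive(np.array([1, -1, -2, 3, 4, 0]))
print(positive_count)
```

The loop version is still worth knowing, since the same pattern works when the condition is too complicated to write as a single array comparison.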

An Example of a loop that builds an array and returns it#

def track_positives(values):
    positive_values = make_array()
    for value in values:
        # if the value is positive, append it to the array of positive values
        if value > 0: 
            positive_values = np.append(positive_values, value)
    return positive_values
track_positives(make_array(1,-1,-2,3,4,0))
array([1., 3., 4.])
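Notice that the result above holds floats (1., 3., 4.) even though the input held integers. A likely explanation, assuming make_array() with no arguments behaves like an empty NumPy array, is that an empty array defaults to floats and np.append keeps that type. The sketch below illustrates this and shows an alternative that keeps the values with a boolean condition instead of a loop. keep_positives is a hypothetical name, and np.array stands in for make_array.

```python
import numpy as np

def keep_positives(values):
    # Boolean indexing: keep only the entries where the condition is True
    return values[values > 0]

kept = keep_positives(np.array([1, -1, -2, 3, 4, 0]))
print(kept)               # the integer type of the input is preserved

# Appending to an empty array, as in the loop version, produces floats
appended = np.append(np.array([]), 1)
print(appended.dtype)
```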

An Example of a loop that iterates a specific number of times#

def run_trials(trials):
    for i in np.arange(0,trials):
        print(2*i)
n_trials = 10     # how many trials to run
run_trials(n_trials)
0
2
4
6
8
10
12
14
16
18
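Printing is fine for a quick look, but a loop that runs a fixed number of times often needs to save its results for later use. A minimal sketch of that pattern follows, combining the counting loop with the array-building pattern. collect_trials is a hypothetical name, and np.array([]) stands in for make_array().

```python
import numpy as np

def collect_trials(trials):
    results = np.array([])              # start with an empty array
    for i in np.arange(trials):         # np.arange(trials) is 0, 1, ..., trials - 1
        results = np.append(results, 2 * i)
    return results

doubled = collect_trials(10)
print(len(doubled), doubled[-1])
```

Here np.arange(trials) is shorthand for np.arange(0, trials), so the loop body runs exactly trials times.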

Group and Pivot#

penguins = Table.read_table('data/penguins.csv')
penguins.sample(5)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Gentoo Biscoe 46.2 14.4 214 4650 UNKNOWN
Adelie Biscoe 39.7 18.9 184 3550 MALE
Gentoo Biscoe 47.2 13.7 214 4925 FEMALE
Gentoo Biscoe 47.4 14.6 212 4725 FEMALE
Adelie Biscoe 39.6 20.7 191 3900 FEMALE

group divides the rows into groups according to the values stored in a categorical variable.

penguins.group('species')
species count
Adelie 151
Chinstrap 68
Gentoo 123

With only one parameter, we get a count of how many rows are in each group. Pass in a second parameter that is an aggregation function to summarize the numerical columns in the table. Columns that cannot be summarized this way, such as island and sex below, are left blank.

penguins.group('species', np.mean)
species island mean bill_length_mm mean bill_depth_mm mean flipper_length_mm mean body_mass_g mean sex mean
Adelie 38.7914 18.3464 189.954 3700.66
Chinstrap 48.8338 18.4206 195.824 3733.09
Gentoo 47.5049 14.9821 217.187 5076.02
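To make the grouping step concrete, here is a rough pure-Python sketch of the idea behind group with an aggregation function: collect the values that share a label, then summarize each collection with one number. The rows are made up, and group_mean is a hypothetical helper. This is not how the datascience library is implemented.

```python
# Made-up rows of (species, body_mass_g); not the real penguins data
rows = [
    ("Adelie", 3500), ("Gentoo", 5000), ("Adelie", 3700),
    ("Gentoo", 5200), ("Chinstrap", 3800),
]

def group_mean(rows):
    groups = {}
    for label, value in rows:
        # collect every value that shares the same label
        groups.setdefault(label, []).append(value)
    # summarize each group's values with a single number
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

means = group_mean(rows)
print(means)
```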

We can also group by more than one column by passing in an array of column labels. Each distinct combination of values in those columns becomes its own group.

penguins.group(make_array('species', 'island'), max)
species island bill_length_mm max bill_depth_mm max flipper_length_mm max body_mass_g max sex max
Adelie Biscoe 45.6 21.1 203 4775 MALE
Adelie Dream 44.1 21.2 208 4650 UNKNOWN
Adelie Torgersen 46 21.5 210 4700 UNKNOWN
Chinstrap Dream 58 20.8 212 4800 MALE
Gentoo Biscoe 59.6 17.3 231 6300 UNKNOWN

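pivot creates a 2-dimensional grid by grouping according to two columns. The values of the first column become the column labels, and the values of the second column become the row labels. If only those two columns are provided, each cell holds the count of rows that fall into that row/column combination.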

penguins.pivot('species', 'island')
island Adelie Chinstrap Gentoo
Biscoe 44 0 123
Dream 56 68 0
Torgersen 51 0 0

Or you can pass a third column and an aggregation function that summarizes that column’s values for each combination. Combinations with no rows are shown as 0:

penguins.pivot('species', 'island', 'bill_length_mm', max)
island Adelie Chinstrap Gentoo
Biscoe 45.6 0 59.6
Dream 44.1 58 0
Torgersen 46 0 0
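The two forms of pivot shown above can be sketched in plain Python as one helper that gathers the values for each row/column combination and then summarizes each cell. Passing len gives counts, and passing max gives the largest value. The rows are made up, and pivot_sketch is a hypothetical name. It only illustrates the idea and is not the datascience implementation.

```python
# Made-up rows of (species, island, bill_length_mm); not the real penguins data
rows = [
    ("Adelie", "Biscoe", 39.7), ("Adelie", "Dream", 41.0),
    ("Gentoo", "Biscoe", 46.2), ("Gentoo", "Biscoe", 47.4),
]

def pivot_sketch(rows, collect):
    cells = {}
    for col_label, row_label, value in rows:
        # gather the values for each (row, column) combination
        cells.setdefault((row_label, col_label), []).append(value)
    row_labels = sorted({r for _, r, _ in rows})
    col_labels = sorted({c for c, _, _ in rows})
    # summarize each cell; combinations with no rows get 0
    return {
        r: {c: collect(cells[(r, c)]) if (r, c) in cells else 0
            for c in col_labels}
        for r in row_labels
    }

counts = pivot_sketch(rows, len)
maxima = pivot_sketch(rows, max)
print(counts)
print(maxima)
```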

We use pivot instead of group to present data in a more accessible way, or to prepare for additional data manipulation or analyses.

Functions#

Check out our lecture slides on Functions for a more complete discussion of how function definitions work, but here’s an example showing the major parts:

  1. Every function definition starts with def, followed by the name of the function and then the list of parameter names in parentheses.

  2. The body of the function is indented and contains the instructions to run when the function is called.

  3. The return statement is used to give a value back to the caller.

def count_odd_values(values):
    num_odd_values = 0
    for value in values:
        if value % 2 != 0:
            num_odd_values = num_odd_values + 1
    return num_odd_values
count_odd_values(extra_values)
2
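One detail that is easy to miss is what happens when the return statement is left out. The small sketch below uses two hypothetical functions, double and double_no_return, to show the difference: without return, the body still runs, but the caller gets None back.

```python
def double(x):
    # body: runs each time the function is called
    return 2 * x          # hands the result back to the caller

def double_no_return(x):
    2 * x                 # computed, then discarded: nothing is returned

print(double(4))
print(double_no_return(4))
```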

Histograms#

Histograms show the distribution of values taken on by a variable. They often convey more information than the min, max, and mean alone. The area of each bar represents the percent of data values falling within that bar's bin. That percentage is computed by multiplying the width of the bar by its height.

penguins.hist('bill_length_mm')
[Figure: histogram of bill_length_mm]
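The relationship between area, width, and height can be checked with a little arithmetic. The sketch below uses made-up values and np.histogram as a stand-in for the binning that hist performs. It assumes the bar height is measured in percent per unit, so that width times height gives the percent of values in the bin.

```python
import numpy as np

values = np.array([31, 36, 37, 41, 42, 43, 44, 47, 52, 58])   # made-up data
edges = np.arange(30, 65, 5)                                  # 30, 35, ..., 60

counts, _ = np.histogram(values, bins=edges)
percents = 100 * counts / len(values)     # percent of values in each bin
widths = np.diff(edges)                   # width of each bin (all 5 here)
heights = percents / widths               # percent per unit: the bar height

print(percents)
print(heights)
print(sum(heights * widths))              # total area of all bars
```

For example, the bin from 40 to 45 holds 4 of the 10 values, so its area is 40 percent and its height is 40 / 5 = 8 percent per unit. The areas of all the bars add up to 100.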

We can adjust the bins to make the plot easier to interpret. One way is to give an array of values that will be used as the cutoffs between the bins:

cutoffs = np.arange(30,65,5)
cutoffs
array([30, 35, 40, 45, 50, 55, 60])
penguins.hist('bill_length_mm', bins=cutoffs)
[Figure: histogram of bill_length_mm with bins from 30 to 60 in steps of 5]
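Two details about the cutoffs are worth checking by hand: np.arange excludes its stop value, and a value that sits exactly on a cutoff goes into the bin that starts there. The sketch below uses made-up bill lengths and np.histogram to illustrate. That hist assigns values to bins the same way is an assumption here, not something stated in these notes.

```python
import numpy as np

cutoffs = np.arange(30, 65, 5)        # stop value 65 is excluded, so the last edge is 60
n_bins = len(cutoffs) - 1             # 7 edges make 6 bins

sample = np.array([34.9, 35.0, 44.9, 45.0])     # made-up bill lengths
counts, _ = np.histogram(sample, bins=cutoffs)

print(n_bins)
print(counts)     # one count per bin, in order from [30, 35) upward
```

Here 35.0 is counted in the bin from 35 to 40 rather than the bin from 30 to 35, and likewise 45.0 is counted in the bin from 45 to 50.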

We can also make an overlaid histogram, with one histogram per group drawn on the same axes. Here the groups are the species:

penguins.hist('bill_length_mm', bins=cutoffs, group='species')
[Figure: overlaid histograms of bill_length_mm, one per species]