# Inference Library Reference


#### Sampling and Simulation

Name Description Parameters Output

sample_proportions(sample_size,
                   model_proportions)

sample_size should be an integer and model_proportions an array of probabilities that sum to 1. The function draws sample_size items from the distribution specified by model_proportions and returns an array of the same length as model_proportions. Each item in the returned array is the proportion of the sample_size draws that landed in the corresponding category.

1. int : sample size

2. array : an array of proportions that should sum to 1

array : each item is the proportion of the sample_size draws in which the corresponding category of model_proportions was sampled; the items sum to 1

Examples
model_proportions = make_array(0.9, 0.1)
sample_proportions(100, model_proportions)

array([0.91, 0.09])

model_proportions = make_array(0.7, 0.2, 0.1)
sample_proportions(100, model_proportions)

array([0.71, 0.2 , 0.09])
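
The sampling step described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the name sample_proportions_sketch and the use of plain lists instead of arrays are assumptions made for the example.

```python
import random
from collections import Counter

def sample_proportions_sketch(sample_size, model_proportions):
    """Hypothetical stand-in: draw sample_size items and return observed proportions."""
    categories = range(len(model_proportions))
    # Draw category indices with replacement, weighted by the model proportions.
    draws = random.choices(categories, weights=model_proportions, k=sample_size)
    counts = Counter(draws)
    # One proportion per category, in the same order as model_proportions.
    return [counts[i] / sample_size for i in categories]

proportions = sample_proportions_sketch(100, [0.7, 0.2, 0.1])
print(proportions)
```

Because the draws are random, the printed proportions change from run to run, but they always have one entry per category and sum to 1.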


simulate(make_one_outcome, num_outcomes)

Simulates the outcome of num_outcomes events. The outcome of each event is computed by calling the make_one_outcome function passed to simulate.

• make_one_outcome: a function that returns the outcome of an event.

• num_outcomes: the number of events to simulate.

An array of the simulated outcomes.

Examples
dice = np.arange(1,7)

def roll_two_dice():
    return np.random.choice(dice) + np.random.choice(dice)

simulate(roll_two_dice, 10)

array([12.,  3.,  4., 11.,  8., 10.,  7., 10.,  8.,  8.])
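
A rough sketch of the idea behind simulate, assuming it simply calls the outcome function repeatedly and collects the results. The name simulate_sketch is hypothetical, and a list is returned here where the library returns an array.

```python
import random

def simulate_sketch(make_one_outcome, num_outcomes):
    """Hypothetical stand-in: call make_one_outcome num_outcomes times."""
    return [make_one_outcome() for _ in range(num_outcomes)]

def roll_two_dice():
    # Sum of two independent fair six-sided dice.
    return random.randint(1, 6) + random.randint(1, 6)

outcomes = simulate_sketch(roll_two_dice, 10)
print(outcomes)
```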


simulate_sample_statistic(make_one_sample,
   sample_size,
   compute_statistic,
   num_trials)

Simulates the process of computing a statistic for random samples.

• make_one_sample: a function that takes an int $$n$$ and returns a sample as an array of $$n$$ elements.

• sample_size: the size of the samples to use in the simulation.

• compute_statistic: a function that takes a sample as an array and returns the statistic for that sample.

• num_trials: the number of simulation steps to perform.

An array of the simulated statistics.

Examples
coin = make_array('heads', 'tails')

def flip_coins(n):
    return np.random.choice(coin, n)

simulate_sample_statistic(flip_coins, 100,

array([53., 53., 49., 53., 55., 60., 52., 46., 52., 54.])

coin = make_array('heads', 'tails')

def flip_coins(n):
    return np.random.choice(coin, n)

return abs(np.count_nonzero(sample == 'heads') - len(sample)/2)

simulate_sample_statistic(flip_coins, 100,

array([5., 2., 5., 5., 6., 1., 5., 2., 3., 2.])
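
The calls in the examples above are cut off after their second argument, so the following is a self-contained sketch of the same pattern rather than a reconstruction of them. It assumes the function draws a fresh sample on every trial and applies the statistic to it. The names simulate_sample_statistic_sketch, count_heads, and heads_distance_from_half are hypothetical.

```python
import random

def simulate_sample_statistic_sketch(make_one_sample, sample_size,
                                     compute_statistic, num_trials):
    """Hypothetical stand-in: repeatedly sample, then compute the statistic."""
    statistics = []
    for _ in range(num_trials):
        sample = make_one_sample(sample_size)
        statistics.append(compute_statistic(sample))
    return statistics

def flip_coins(n):
    # n independent fair coin flips.
    return random.choices(['heads', 'tails'], k=n)

def count_heads(sample):
    # Hypothetical statistic: number of heads in the sample.
    return sample.count('heads')

def heads_distance_from_half(sample):
    # Hypothetical statistic: distance of the head count from half the sample size.
    return abs(sample.count('heads') - len(sample) / 2)

head_counts = simulate_sample_statistic_sketch(flip_coins, 100, count_heads, 10)
distances = simulate_sample_statistic_sketch(flip_coins, 100,
                                             heads_distance_from_half, 10)
print(head_counts)
print(distances)
```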


#### Hypothesis Testing

Name Description Parameters Output

empirical_pvalue(simulated_statistics,
    observed_statistic)

Computes the proportion of values in simulated_statistics that are at least as large as observed_statistic.

• simulated_statistics: an array of ints or floats.

• observed_statistic: an int or float.

A proportion.

Examples
sample_statistics = make_array(2,2,3,4,5,2,2,6)
empirical_pvalue(sample_statistics, 5)

0.25

sample_statistics = make_array(2,2,3,4,5,2,2,6)
empirical_pvalue(sample_statistics, 6)

0.125
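
The computation described above amounts to counting and dividing. A minimal sketch, assuming "at least as large" means greater than or equal to; empirical_pvalue_sketch is a hypothetical name, not the library function.

```python
def empirical_pvalue_sketch(simulated_statistics, observed_statistic):
    """Hypothetical stand-in: proportion of simulated values >= the observed one."""
    at_least_as_large = sum(1 for s in simulated_statistics
                            if s >= observed_statistic)
    return at_least_as_large / len(simulated_statistics)

sample_statistics = [2, 2, 3, 4, 5, 2, 2, 6]
print(empirical_pvalue_sketch(sample_statistics, 5))  # 0.25  (5 and 6 out of 8 values)
print(empirical_pvalue_sketch(sample_statistics, 6))  # 0.125 (only 6 out of 8 values)
```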


#### Permutation Tests

Name Description Parameters Output

permutation_sample(table,
    group_label)

Returns the given table augmented with a new column Shuffled Label that contains a permutation of the values in the column group_label.

• table: a Table.

• group_label: the label of the column to permute.

A new Table.

Examples
trial

| Group | Outcome |
|---|---|
| Control | 0 |
| Control | 0 |
| Control | 1 |
| Control | 0 |
| Treatment | 1 |
| Treatment | 1 |
| Treatment | 0 |
| Treatment | 1 |

permutation_sample(trial, 'Group')

| Group | Outcome | Shuffled Label |
|---|---|---|
| Control | 0 | Treatment |
| Control | 0 | Control |
| Control | 1 | Control |
| Control | 0 | Treatment |
| Treatment | 1 | Treatment |
| Treatment | 1 | Control |
| Treatment | 0 | Control |
| Treatment | 1 | Treatment |
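
A minimal sketch of the shuffling step, assuming the table can be modeled as a list of row dictionaries. The name permutation_sample_sketch and this row representation are assumptions made for illustration; the library works on Table objects.

```python
import random

def permutation_sample_sketch(rows, group_label):
    """Hypothetical stand-in: return copies of rows with a 'Shuffled Label' entry."""
    labels = [row[group_label] for row in rows]
    # random.sample with k=len(labels) yields a random permutation of the labels.
    shuffled = random.sample(labels, k=len(labels))
    return [{**row, 'Shuffled Label': label}
            for row, label in zip(rows, shuffled)]

trial = [{'Group': g, 'Outcome': o}
         for g, o in zip(['Control'] * 4 + ['Treatment'] * 4,
                         [0, 0, 1, 0, 1, 1, 0, 1])]
shuffled_trial = permutation_sample_sketch(trial, 'Group')
for row in shuffled_trial:
    print(row)
```

The original columns are left untouched, and the shuffled column always contains the same number of each label as the original group column.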

abs_difference_of_means(table,
   group_label,
   value_label)

Takes a table, the label of the column used to divide rows into two groups, and the label of the column storing the values for each row. Returns the absolute difference between the means of the two groups.

Note: If the values are all 0 or 1, then the result can be interpreted as the difference in the proportion of 1s for the two groups.

• table: a Table.

• group_label: the label of the column used to divide the rows into two groups.

• value_label: the column holding numerical values.

A float.

Examples
sizes

| Color | Size |
|---|---|
| Blue | 2 |
| Blue | 4 |
| Red | 3 |
| Blue | 6 |
| Red | 2 |
| Red | 1 |

abs_difference_of_means(sizes, 'Color', 'Size')

2.0

trial

| Group | Outcome |
|---|---|
| Control | 0 |
| Control | 0 |
| Control | 1 |
| Control | 0 |
| Treatment | 1 |
| Treatment | 1 |
| Treatment | 0 |
| Treatment | 1 |

abs_difference_of_means(trial, 'Group', 'Outcome')

0.5
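
The statistic can be sketched as follows, assuming the group and value columns are passed as two parallel lists. The name abs_difference_of_means_sketch and this calling convention are hypothetical; the library function takes a Table and column labels.

```python
from statistics import fmean

def abs_difference_of_means_sketch(groups, values):
    """Hypothetical stand-in: |mean of one group - mean of the other group|."""
    by_group = {}
    for group, value in zip(groups, values):
        by_group.setdefault(group, []).append(value)
    if len(by_group) != 2:
        raise ValueError('expected exactly two groups')
    first, second = by_group.values()
    return abs(fmean(first) - fmean(second))

colors = ['Blue', 'Blue', 'Red', 'Blue', 'Red', 'Red']
sizes = [2, 4, 3, 6, 2, 1]
print(abs_difference_of_means_sketch(colors, sizes))    # 2.0 (Blue mean 4, Red mean 2)

groups = ['Control'] * 4 + ['Treatment'] * 4
outcomes = [0, 0, 1, 0, 1, 1, 0, 1]
print(abs_difference_of_means_sketch(groups, outcomes)) # 0.5 (0.25 vs 0.75)
```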


simulate_permutation_statistic(table,
   group_label,
   value_label,
   num_trials)

Simulates num_trials permutation sampling steps and returns an array of abs_difference_of_means statistics for those samples.

• table: a Table.

• group_label: the label of the column used to divide the rows into two groups.

• value_label: the column holding the values of interest.

• num_trials: the number of simulation steps to perform.

An array of the simulated statistics.

Examples
big_trial.sample(10)

| Group | Outcome |
|---|---|
| Control | 0 |
| Control | 0 |
| Treatment | 1 |
| Control | 0 |
| Treatment | 1 |
| Control | 0 |
| Control | 0 |
| Treatment | 1 |
| Treatment | 0 |
| Control | 0 |

simulate_permutation_statistic(big_trial, 'Group',
                               'Outcome', 5)

array([0.006, 0.022, 0.006, 0.034, 0.018])
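
One way to picture the simulation is as a loop that shuffles the group labels and recomputes the statistic each time. The sketch below assumes parallel lists for the two columns and uses a small stand-in data set rather than big_trial, which is not shown in full. All function names are hypothetical.

```python
import random
from statistics import fmean

def abs_difference_of_means(groups, values):
    # Absolute difference between the mean values of the two groups.
    by_group = {}
    for group, value in zip(groups, values):
        by_group.setdefault(group, []).append(value)
    first, second = by_group.values()
    return abs(fmean(first) - fmean(second))

def simulate_permutation_statistic_sketch(groups, values, num_trials):
    """Hypothetical stand-in: shuffle labels, recompute the statistic, repeat."""
    statistics = []
    for _ in range(num_trials):
        shuffled = random.sample(groups, k=len(groups))  # random permutation
        statistics.append(abs_difference_of_means(shuffled, values))
    return statistics

groups = ['Control'] * 4 + ['Treatment'] * 4
outcomes = [0, 0, 1, 0, 1, 1, 0, 1]
simulated = simulate_permutation_statistic_sketch(groups, outcomes, 5)
print(simulated)
```

With four rows per group and four 1s overall, each simulated statistic in this small example can only be 0.0, 0.5, or 1.0.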


#### Bootstrapping and Confidence Intervals

Name Description Parameters Output

bootstrap_statistic(initial_sample,
    compute_statistic,
    num_trials)

Simulates the process of computing a statistic for resamples of one original sample. The original sample is represented as an array, and the compute_statistic function should take in an array.

• initial_sample: an array representing the sample we start with.

• compute_statistic: a function that takes a sample as an array and returns the statistic for that sample.

• num_trials: the number of resampling steps to perform.

An array of the simulated statistics for the resamples.

Examples
observed_sample = make_array(1,2,3,4,5)

bootstrap_statistic(observed_sample, np.mean, 5)

array([4.8, 3.4, 1.8, 3. , 3.4])

statistics = bootstrap_statistic(observed_sample,
                                 np.mean, 2000)

plot = Table().with_columns('bootstrap statistics',
                            statistics).hist()
plot.dot(np.mean(observed_sample))
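
The resampling procedure can be sketched as below, assuming each resample is drawn with replacement and has the same size as the original sample (the usual bootstrap convention). The name bootstrap_statistic_sketch is hypothetical, and lists stand in for arrays.

```python
import random
from statistics import fmean

def bootstrap_statistic_sketch(initial_sample, compute_statistic, num_trials):
    """Hypothetical stand-in: statistic of num_trials resamples of initial_sample."""
    n = len(initial_sample)
    statistics = []
    for _ in range(num_trials):
        # Resample n items with replacement from the original sample.
        resample = random.choices(initial_sample, k=n)
        statistics.append(compute_statistic(resample))
    return statistics

observed_sample = [1, 2, 3, 4, 5]
resampled_means = bootstrap_statistic_sketch(observed_sample, fmean, 5)
print(resampled_means)
```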


confidence_interval(ci_percent,
    statistics)

Returns an array with the lower and upper bound of the ci_percent confidence interval.

• ci_percent: The percent of statistics covered by the confidence interval.

• statistics: An array of statistics.

An array of two elements.

Examples
observed_sample = make_array(1,2,3,4,5)

statistics = bootstrap_statistic(observed_sample,
                                 np.mean, 5000)
left, right = confidence_interval(95,
                                  statistics)
print(left, right)

1.8 4.2

statistics = bootstrap_statistic(observed_sample,
                                 np.mean, 5000)

plot = Table().with_columns('statistics',
                            statistics).hist()

ci = confidence_interval(95, statistics)
plot.interval(ci)
plot.dot(np.mean(observed_sample))
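
A percentile-based interval of this kind can be sketched as follows. The sketch assumes the interval covers the middle ci_percent of the statistics and uses a nearest-rank percentile; whether the library uses exactly this percentile convention is an assumption. The names percentile and confidence_interval_sketch are hypothetical.

```python
import math

def percentile(p, values):
    """Nearest-rank percentile: smallest value at least as large as p% of the values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def confidence_interval_sketch(ci_percent, statistics):
    """Hypothetical stand-in: bounds of the middle ci_percent of the statistics."""
    tail = (100 - ci_percent) / 2
    return [percentile(tail, statistics), percentile(100 - tail, statistics)]

# Stand-in for an array of bootstrap statistics: the integers 1 through 100.
statistics = list(range(1, 101))
left, right = confidence_interval_sketch(95, statistics)
print(left, right)  # 3 98
```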


#### Linear Regression

Name Description Parameters Output

pearson_correlation(table,
   x_label, y_label)

Computes the correlation coefficient capturing the sign and strength of the association between the given columns in the table.

• table: The Table of data.

• x_label, y_label: Labels of the columns for the x-axis and y-axis.

A float between -1 and 1.

Examples
bills.scatter('bill_length_mm','bill_depth_mm')

pearson_correlation(bills, 'bill_length_mm','bill_depth_mm')

0.3914916918358763
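
One common way to compute the correlation coefficient is to average the products of the two variables in standard units. The sketch below follows that definition on plain lists; the name pearson_correlation_sketch is hypothetical, and the library's internals may differ.

```python
from statistics import fmean, pstdev

def pearson_correlation_sketch(x, y):
    """Hypothetical stand-in: mean product of x and y in standard units."""
    mean_x, mean_y = fmean(x), fmean(y)
    sd_x, sd_y = pstdev(x), pstdev(y)  # population standard deviations
    products = [((xi - mean_x) / sd_x) * ((yi - mean_y) / sd_y)
                for xi, yi in zip(x, y)]
    return fmean(products)

x = [1, 2, 3, 4]
r_up = pearson_correlation_sketch(x, [2, 4, 6, 8])    # perfectly increasing line
r_down = pearson_correlation_sketch(x, [8, 6, 4, 2])  # perfectly decreasing line
print(r_up, r_down)
```

For points that lie exactly on a line, the result is 1 or -1 up to floating-point rounding, depending on the sign of the slope.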


line_predictions(a, b, x)

Computes the prediction y_hat = a * x + b where a and b are the slope and intercept and x is the set of x-values.

• a, b: Slope and intercept of line

• x: an array of x values.

an array of predicted y values.

Examples
x = make_array(4,8,12,16)
x

array([ 4,  8, 12, 16])

y_hat = line_predictions(0.25, 3, x)
y_hat

array([4., 5., 6., 7.])

lengths = small_bills.column('bill_length_mm')

lengths

array([39.1, 39.5, 40.3, 36.7, 39.3])

predicted_depths = line_predictions(0.18, 11.4,
                                    lengths)
predicted_depths

array([18.438, 18.51 , 18.654, 18.006, 18.474])
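
The prediction formula y_hat = a * x + b applied element by element looks like this on a plain list. The name line_predictions_sketch is hypothetical; the library version operates on arrays.

```python
def line_predictions_sketch(a, b, x):
    """Hypothetical stand-in: apply y_hat = a * x + b to every x value."""
    return [a * xi + b for xi in x]

y_hat = line_predictions_sketch(0.25, 3, [4, 8, 12, 16])
print(y_hat)  # [4.0, 5.0, 6.0, 7.0]
```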


linear_regression(table,
   x_label, y_label)

Computes the slope and intercept of the line best fitting the table’s data according to the mean squared error loss function.

• table: The Table of data.

• x_label, y_label: Labels of the columns for the x-axis and y-axis.

A two-element array with the slope and intercept.

Examples
optimal = linear_regression(bills, 'bill_length_mm',
                            'bill_depth_mm')
a = optimal.item(0)
b = optimal.item(1)
print('a = ', a, '  b = ', b)

a =  0.17883434929831496   b =  11.409124493057005

# A shortcut to assign a and b in one line.
a,b = linear_regression(bills, 'bill_length_mm',
                        'bill_depth_mm')

print('a = ', a, '  b = ', b)

a =  0.17883434929831496   b =  11.409124493057005
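
For a straight line, the slope and intercept that minimize mean squared error have a closed form, sketched below on plain lists. This illustrates the result, not how the library computes it (it may, for instance, use numerical minimization). The name linear_regression_sketch is hypothetical.

```python
from statistics import fmean

def linear_regression_sketch(x, y):
    """Hypothetical stand-in: least-squares slope and intercept in closed form."""
    mean_x, mean_y = fmean(x), fmean(y)
    # Slope = sum of co-deviations divided by sum of squared x deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a = sxy / sxx
    # The best-fit line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b

x = [4, 8, 12, 16]
y = [4, 5, 6, 7]
a, b = linear_regression_sketch(x, y)
print('a = ', a, '  b = ', b)  # a =  0.25   b =  3.0
```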


r2_score(table,
   x_label, y_label,
   a, b)

Given the values $$x$$ and $$y$$ in the x_label and y_label columns, computes the r-squared score (also called the “coefficient of determination”) for the predictions given $$y=ax+b$$.

• table: The Table of data.

• x_label, y_label: Labels of the columns for the x-axis and y-axis.

• a, b: Slope and intercept of line

A float between 0 and 1.

Examples
a,b = linear_regression(bills,'bill_length_mm','bill_depth_mm')
r2_score(bills,'bill_length_mm','bill_depth_mm', a,b)

0.15326574477651644
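
The coefficient of determination is commonly defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses that definition on plain lists; that the library uses exactly this formula is an assumption, and r2_score_sketch is a hypothetical name.

```python
from statistics import fmean

def r2_score_sketch(x, y, a, b):
    """Hypothetical stand-in: 1 - SS_residual / SS_total for the line y = a*x + b."""
    mean_y = fmean(y)
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

x = [4, 8, 12, 16]
y = [4, 5, 6, 7]
print(r2_score_sketch(x, y, 0.25, 3))  # 1.0: the line passes through every point
print(r2_score_sketch(x, y, 0, 5.5))   # 0.0: no better than predicting the mean of y
```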


plot_scatter_with_line(table,
   x_label, y_label,
   a, b)

Plots a scatter graph for the points in given columns and also a line for the equation $$y = ax+b$$.

• table: The Table of data.

• x_label, y_label: Labels of the columns for the x-axis and y-axis.

• a, b: Slope and intercept of line

A Plot.

Examples
a,b = linear_regression(bills,'bill_length_mm','bill_depth_mm')
plot_scatter_with_line(bills,'bill_length_mm','bill_depth_mm',a,b)


plot_residuals(table,
   x_label, y_label,
   a, b)

Given the values $$x$$ and $$y$$ in the x_label and y_label columns, plots the residuals $$(x, y - \hat{y})$$, where $$\hat{y}$$ are the predictions from the line characterized by $$y = ax+b$$.

• table: The Table of data.

• x_label, y_label: Labels of the columns for the x-axis and y-axis.

• a, b: Slope and intercept of line

A Plot.

Examples
a,b = linear_regression(bills,'bill_length_mm','bill_depth_mm')
with Figure(1,2):
    plot_scatter_with_line(bills,'bill_length_mm','bill_depth_mm',a,b)
    plot_residuals(bills,'bill_length_mm','bill_depth_mm',a,b)
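
The quantities shown in a residual plot can be computed without any plotting. A minimal sketch, using the residual definition y - y_hat given above; residuals_sketch is a hypothetical helper, not part of the library.

```python
def residuals_sketch(x, y, a, b):
    """Hypothetical helper: observed y minus the prediction a * x + b."""
    return [yi - (a * xi + b) for xi, yi in zip(x, y)]

x = [4, 8, 12, 16]
y = [5, 5, 6, 6]
residuals = residuals_sketch(x, y, 0.25, 3)
print(residuals)  # [1.0, 0.0, 0.0, -1.0]
```

Plotting each x value against its residual gives the points (x, y - y_hat) that a residual plot displays.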


plot_regression_and_residuals(table,
   x_label, y_label,
   a, b)

Creates a pair of plots capturing 1) a scatter plot and the line $$y=ax+b$$ and 2) the residuals when that line is used for predictions. Returns the Plot for the scatter plot.

• table: The Table of data.

• x_label, y_label: Labels of the columns for the x-axis and y-axis.

• a, b: Slope and intercept of line

A Plot.

Examples
a,b = linear_regression(bills,'bill_length_mm','bill_depth_mm')
plot_regression_and_residuals(bills,'bill_length_mm','bill_depth_mm',a,b)