1. Installing Python packages

You can install packages from inside a Jupyter notebook on any machine (including your own)! A leading ! tells Jupyter to run the rest of the line as a shell command.

# Three packages we have used throughout the semester
!pip3 install -q datascience
!pip3 install -q numpy 
!pip3 install -q matplotlib

We import packages that have been installed in order to use their features in our code:

# Second step: import packages 
from datascience import * 
import numpy as np 
import matplotlib.pyplot as plots
%matplotlib inline

Now let’s install three different packages we haven’t seen yet…

!pip3 install -q pandas 
!pip3 install -q scikit-learn
!pip3 install -q seaborn
import pandas as pd
from sklearn import linear_model, svm  # the two submodules used later in this notebook
import seaborn as sns
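
Note that the name you give pip and the name you import are not always the same: the scikit-learn package is imported as sklearn. As a minimal sketch, using only the standard library, you can check whether a module is importable before relying on it. The helper name is_installed is hypothetical and not part of any package.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python's import system can find `module_name`."""
    return importlib.util.find_spec(module_name) is not None

# These are import names, which may differ from the pip package names
for name in ["pandas", "sklearn", "seaborn"]:
    status = "installed" if is_installed(name) else "missing (try pip3 install)"
    print(name, "->", status)
```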

2. Pandas

Pandas is a library for manipulating and exploring data. Its DataFrame is similar to the Table from datascience, but with more functionality.

penguins = pd.read_csv('https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/master/palmerpenguins/data/penguins.csv')
penguins = penguins.drop(columns = ['year'])
penguins.head(6)
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  female
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  female
5  Adelie  Torgersen            39.3           20.6              190.0       3650.0    male
# pandas gives us a quick summary of the data
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
penguins = penguins.dropna()
penguins.head(6)
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  female
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  female
5  Adelie  Torgersen            39.3           20.6              190.0       3650.0    male
6  Adelie  Torgersen            38.9           17.8              181.0       3625.0  female
print('num rows (after dropping nulls) = ', len(penguins))
num rows (after dropping nulls) =  333
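
The drop from 344 to 333 rows matches the summary above: dropna() removes every row with a missing value in any column, and the sex column has only 333 non-null entries. Here is a small sketch of that behavior, using two columns from the first six rows printed earlier; the variable names sample and cleaned are just for illustration.

```python
import numpy as np
import pandas as pd

# Two columns from the first six rows of the penguins table shown above
sample = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7, 39.3],
    "sex": ["male", "female", "female", np.nan, "female", "male"],
})

print(sample.isna().sum())   # number of missing values in each column

cleaned = sample.dropna()    # drop rows with a missing value in any column
print(len(sample), "->", len(cleaned), "rows")
print(list(cleaned.index))   # row labels are kept, so label 3 is simply absent
```

If you would rather have the rows renumbered from 0 after cleaning, cleaned.reset_index(drop=True) does that.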
penguins[penguins.species == 'Adelie'].mean(numeric_only=True)
bill_length_mm         38.823973
bill_depth_mm          18.347260
flipper_length_mm     190.102740
body_mass_g          3706.164384
dtype: float64
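
The expression inside the square brackets above is a boolean mask: a column of True/False values that selects which rows to keep. The sketch below shows the mask on its own, and then uses groupby, which computes the same kind of summary for every group in one call. It uses five of the cleaned rows printed earlier and groups by sex, since those rows are all Adelie penguins.

```python
import pandas as pd

# Rows 0, 1, 2, 4 and 5 of the cleaned penguins table shown above
sample = pd.DataFrame({
    "sex": ["male", "female", "female", "female", "male"],
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 39.3],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0, 3650.0],
})

is_female = sample.sex == "female"   # boolean Series, one entry per row
female_means = sample[is_female].mean(numeric_only=True)
print(female_means)

# groupby splits the rows by sex and averages each group separately
by_sex = sample.groupby("sex").mean(numeric_only=True)
print(by_sex)
```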

3. Seaborn

Seaborn is a data visualization library. It interfaces nicely with pandas and makes very pretty plots!

sns.scatterplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species");

A big advantage of seaborn is that you can quickly visualize different subsets of data:

sns.scatterplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="island");
sns.lmplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species");
sns.kdeplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species");
sns.pairplot(penguins, hue="species");
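fig, ax = plots.subplots(1, 2, figsize=(12, 5))
# Left panel: scatter plot with point size scaled by body mass
sns.scatterplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species", 
                size="body_mass_g", sizes=(30, 300), alpha=0.5, 
                ax=ax[0])  # the end of this call was cut off; ax=ax[0] is assumed
# Right panel: body mass distribution per species, split by sex
sns.violinplot(penguins, x="body_mass_g", y="species", hue="sex", 
               ax=ax[1]);  # the end of this call was cut off; ax=ax[1] is assumed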

4. sklearn (Scikit-Learn)

sklearn (pronounced "sci-kit learn") is a library for machine learning (statistical pattern matching).

Linear Regression with sklearn

from sklearn import linear_model

from sklearn.metrics import r2_score as r2_score_sklearn
from sklearn.metrics import mean_squared_error as mse_sklearn
from sklearn.feature_selection import r_regression
# Some data wrangling to get our x and y values, this time with pandas...
chinstrap = penguins[penguins['species'] == 'Chinstrap']
x = chinstrap['bill_length_mm'].to_numpy().reshape(-1, 1)
y = chinstrap['bill_depth_mm'].to_numpy()
model = linear_model.LinearRegression()
model.fit(x, y);
print('slope    ', model.coef_[0])
print('intercept', model.intercept_)
slope     0.2222117240036715
intercept 7.569140119132472
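
For a single feature, the fitted line is the ordinary least-squares line, so the slope and intercept can also be computed by hand. The sketch below checks the closed-form formulas against LinearRegression on a handful of made-up points (not the penguin measurements); the name check_model is just for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares for one feature
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# sklearn expects a 2-D feature array, which is why we reshape(-1, 1)
check_model = LinearRegression().fit(x.reshape(-1, 1), y)

print("by hand:", slope, intercept)
print("sklearn:", check_model.coef_[0], check_model.intercept_)
```

The reshape(-1, 1) step is the same one applied to the bill lengths above: sklearn wants one row per observation and one column per feature.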
y_hat = model.predict(x)
sns.scatterplot(chinstrap, x='bill_length_mm', y='bill_depth_mm')
plots.plot(x, y_hat, color='r', lw=2);

Many of the metrics we might want are already implemented in sklearn.

print('Pearson Correlation:', r_regression(x, y)[0])
print('MSE:                ', mse_sklearn(y, y_hat))
print('R2 Score:           ', r2_score_sklearn(y, y_hat))
Pearson Correlation: 0.6535362081800236
MSE:                 0.7276649994299124
R2 Score:            0.4271095754023476
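
These three numbers are related. For a straight-line fit with an intercept, the R2 score equals the square of the Pearson correlation, and indeed 0.6535 squared is about 0.4271. The sketch below computes MSE and R2 by hand with numpy on made-up data and compares them with the sklearn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.9, 2.7, 4.8, 4.1, 6.3])

fit = LinearRegression().fit(x.reshape(-1, 1), y)
y_hat = fit.predict(x.reshape(-1, 1))

mse = np.mean((y - y_hat) ** 2)          # mean squared error
ss_res = np.sum((y - y_hat) ** 2)        # variation the line does not explain
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y
r2 = 1 - ss_res / ss_tot
r = np.corrcoef(x, y)[0, 1]              # Pearson correlation

print("MSE:", mse, "sklearn:", mean_squared_error(y, y_hat))
print("R2: ", r2, "sklearn:", r2_score(y, y_hat))
print("r squared:", r ** 2)
```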

Non-linear Regression

New: let’s fit a non-linear regression line with sklearn!

model_nonlinear = svm.SVR(kernel='poly')  # support vector regression with a polynomial kernel (degree 3 by default)
model_nonlinear.fit(x, y);
# plot what the polynomial regression does 
x_range = np.arange(42.5, 57.5, 0.1).reshape(-1, 1)

y_hat_linear = model.predict(x_range)
y_hat_nonlinear = model_nonlinear.predict(x_range)

sns.scatterplot(chinstrap, x='bill_length_mm', y='bill_depth_mm')
plots.plot(x_range, y_hat_linear, color='r', label='linear', lw=2);
plots.plot(x_range, y_hat_nonlinear, color='b', label='nonlinear', lw=2)
plots.title("Linear and Nonlinear models")
plots.legend();  # show the 'linear' and 'nonlinear' labels set above
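
SVR with a polynomial kernel is one way to get a curved fit, but it is not the same thing as ordinary polynomial regression. As an alternative sketch (a different technique from the one used above), you can expand x into polynomial features and fit a plain linear regression on them. The data here is made up and noise-free so the result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up, noise-free data following y = x^2 - 2x + 1
x = np.linspace(-3, 3, 13).reshape(-1, 1)
y = (x ** 2 - 2 * x + 1).ravel()

# Step 1 expands x into the columns [1, x, x^2];
# step 2 fits ordinary least squares on those columns
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)

# The true curve gives 9 at x = 4, so the prediction should be very close to that
print(poly_model.predict(np.array([[4.0]])))
```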

Take Machine Learning to learn the process for evaluating which model is better!
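
As a small preview, one common approach (an assumption here, not necessarily the full process that course teaches) is to hold out part of the data, fit each model on the rest, and compare errors on the held-out part. The sketch below does this on made-up curved data with a linear and a quadratic model; all names are for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up curved data: y = x^2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 0.1, size=200)

# Hold out 30% of the rows; the models never see them during fitting
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "quadratic": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

test_mse = {}
for name, candidate in candidates.items():
    candidate.fit(x_train, y_train)
    test_mse[name] = mean_squared_error(y_test, candidate.predict(x_test))
    print(name, "held-out MSE:", test_mse[name])
```

Comparing errors on data the model was not fit to guards against picking a model just because it is flexible enough to memorize the training points.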