Wrap Up¶
1. Installing Python packages¶
You can install packages from inside a Jupyter notebook on any machine (including your own)!
# Three packages we have used throughout the semester
!pip3 install -q datascience
!pip3 install -q numpy
!pip3 install -q matplotlib
We import packages that have been installed in order to use their features in our code:
# Second step: import packages
from datascience import *
import numpy as np
import matplotlib.pyplot as plots
%matplotlib inline
Now let’s install three different packages we haven’t seen yet…
!pip3 install -q pandas
!pip3 install -q scikit-learn
!pip3 install -q seaborn
import pandas as pd
from sklearn import *
import seaborn as sns
2. Pandas¶
Pandas is a library for manipulating and exploring data, similar to the Tables from the datascience package we have used all semester, but with much more functionality.
# Load the Palmer Penguins dataset and drop the year column
penguins = pd.read_csv('https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/master/palmerpenguins/data/penguins.csv')
penguins = penguins.drop(columns=['year'])
penguins.head(6)
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male |
# pandas gives us a quick summary of the data
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 344 non-null object
1 island 344 non-null object
2 bill_length_mm 342 non-null float64
3 bill_depth_mm 342 non-null float64
4 flipper_length_mm 342 non-null float64
5 body_mass_g 342 non-null float64
6 sex 333 non-null object
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
# Drop rows with missing (NaN) values
penguins = penguins.dropna()
penguins.head(6)
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male |
| 6 | Adelie | Torgersen | 38.9 | 17.8 | 181.0 | 3625.0 | female |
print('num rows (after dropping nulls) = ', len(penguins))
num rows (after dropping nulls) = 333
# Mean of each numeric column for just the Adelie penguins
penguins[penguins.species == 'Adelie'].mean(numeric_only=True)
bill_length_mm 38.823973
bill_depth_mm 18.347260
flipper_length_mm 190.102740
body_mass_g 3706.164384
dtype: float64
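Instead of filtering one species at a time, pandas can also group and aggregate in a single expression, much like Table.group. A quick sketch that averages every numeric column for each species at once:
# Group by species and average every numeric column (similar to Table.group with a collect function)
penguins.groupby('species').mean(numeric_only=True)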
3. Seaborn¶
Seaborn is a data visualization library. It interfaces nicely with pandas and makes very pretty plots!
sns.scatterplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species");
[Figure: scatter plot of bill_depth_mm vs. bill_length_mm, colored by species]
A big advantage of seaborn is that you can quickly visualize different subsets of data:
sns.scatterplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="island");
[Figure: scatter plot of bill_depth_mm vs. bill_length_mm, colored by island]
sns.lmplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species");
[Figure: scatter plot with a fitted regression line for each species]
sns.kdeplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species");
[Figure: density contours of bill_depth_mm vs. bill_length_mm for each species]
sns.pairplot(penguins, hue="species");
[Figure: grid of pairwise scatter plots and distributions for all numeric columns, colored by species]
fig, ax = plots.subplots(1,2,figsize=(12,5))
sns.scatterplot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species",
size="body_mass_g", sizes=(30, 300), alpha=0.5,
ax=ax[0])
sns.violinplot(penguins, x="body_mass_g", y="species", hue="sex",
ax=ax[1])
fig.tight_layout();
[Figure: left, scatter plot of bill depth vs. bill length with point size by body_mass_g; right, violin plot of body_mass_g by species and sex]
4. sklearn (Scikit-Learn)¶
sklearn (pronounced "sci-kit learn", short for scikit-learn) is a library for machine learning (statistical pattern matching).
Linear Regression with sklearn¶
from sklearn import linear_model
from sklearn.metrics import r2_score as r2_score_sklearn
from sklearn.metrics import mean_squared_error as mse_sklearn
from sklearn.feature_selection import r_regression
# Some data wrangling to get our x and y values, this time with pandas...
chinstrap = penguins[penguins['species'] == 'Chinstrap']
x = chinstrap['bill_length_mm'].to_numpy().reshape(-1, 1)  # sklearn expects a 2-D array of features
y = chinstrap['bill_depth_mm'].to_numpy()
model = linear_model.LinearRegression()
model.fit(x, y);
print('slope ', model.coef_[0])
print('intercept', model.intercept_)
slope 0.2222117240036715
intercept 7.569140119132472
y_hat = model.predict(x)
sns.scatterplot(chinstrap, x='bill_length_mm', y='bill_depth_mm')
plots.plot(x, y_hat, color='r', lw=2);
[Figure: scatter plot of Chinstrap bill depth vs. bill length with the fitted regression line in red]
Many of the metrics we might want are already implemented in sklearn.
print('Pearson Correlation:', r_regression(x, y)[0])
print('MSE: ', mse_sklearn(y, y_hat))
print('R2 Score: ', r2_score_sklearn(y, y_hat))
Pearson Correlation: 0.6535362081800236
MSE: 0.7276649994299124
R2 Score: 0.4271095754023476
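These values can be double-checked directly with numpy; a quick sanity check:
# The same metrics computed by hand
print('MSE (by hand): ', np.mean((y - y_hat) ** 2))
print('Correlation (by hand):', np.corrcoef(x.flatten(), y)[0, 1])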
Non-linear Regression¶
New: let’s fit a non-linear regression line with sklearn!
from sklearn import svm
model_nonlinear = svm.SVR(kernel='poly')  # does non-linear (polynomial) regression
model_nonlinear.fit(x, y);
# plot what the polynomial regression does
x_range = np.arange(42.5, 57.5, 0.1).reshape(-1, 1)
y_hat_linear = model.predict(x_range)
y_hat_nonlinear = model_nonlinear.predict(x_range)
sns.scatterplot(chinstrap, x='bill_length_mm', y='bill_depth_mm')
plots.plot(x_range, y_hat_linear, color='r', label='linear', lw=2);
plots.plot(x_range, y_hat_nonlinear, color='b', label='nonlinear', lw=2)
plots.title("Linear and Nonlinear models")
plots.legend();
[Figure: Chinstrap scatter plot with the linear fit (red) and nonlinear fit (blue)]
Take Machine Learning to learn the process for evaluating which model is better!
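As a small preview of that process, one common approach is to hold out part of the data, fit each model on the rest, and compare their errors on the held-out points. A minimal sketch using sklearn's train_test_split (the split is random, so your exact numbers will vary):
from sklearn.model_selection import train_test_split
# Hold out 25% of the Chinstrap data for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
# Fit both kinds of model on the training data only
linear = linear_model.LinearRegression().fit(x_train, y_train)
nonlinear = svm.SVR(kernel='poly').fit(x_train, y_train)
# Compare mean squared error on the held-out test data
print('Linear test MSE:   ', mse_sklearn(y_test, linear.predict(x_test)))
print('Nonlinear test MSE:', mse_sklearn(y_test, nonlinear.predict(x_test)))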