Histograms#

from datascience import *
from cs104 import *
import numpy as np

%matplotlib inline

0. Error practice#

majors = Table().read_table("data/majors.csv")
majors.show(5)

Major	Division	2008-2012	2018-2021
American Studies	2	10	9
Anthropology	2	8	4
Arabic Studies	1	4	7
Art	1	55	31
Asian Studies	1	8	6

... (32 rows omitted)

# Select only division 3 majors 
div3 = majors.where("Division", are.equal_to(3)
div3

  Cell In[4], line 3
    div3
    ^
SyntaxError: invalid syntax

# Get the number of majors across both time periods
majors.select("2008-2012") + majors.select("2018-2021")

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 2
      1 # Get the number of majors across both time periods
----> 2 majors.select("2008-2012") + majors.select("2018-2021")

TypeError: unsupported operand type(s) for +: 'Table' and 'Table'

1. Overlaid graphs#

Sometimes we want to see more than one plot on a single graph.

Overlaid bar charts#

div3 = majors.where("Division", are.equal_to(3)).drop("Division")
div3

Major	2008-2012	2018-2021
Astronomy	1	2
Astrophysics	3	3
Biology	58	61
Chemistry	30	34
Computer Science	16	50
Geosciences	7	12
Mathematics	53	61
Physics	12	13
Psychology	62	45
Statistics	0	16

# First graph for 2008-2012
div3.barh("Major", "2008-2012")
# Second graph from 2018-2021
div3.barh("Major", "2018-2021")

An overlaid graph puts the two graphs together, which makes comparison easier.

The package we’re using will automatically make an overlaid graph from the remaining columns if you give it just one parameter.

div3.barh("Major")
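The comparison that the overlaid chart makes visible can also be computed directly. Here is a minimal sketch in plain Python (not the course's Table API), using the Division 3 values from the table above. The names `div3_rows` and `changes` are made up for this illustration.

```python
# Division 3 counts copied from the table above:
# (major, 2008-2012 count, 2018-2021 count)
div3_rows = [
    ("Astronomy", 1, 2),
    ("Astrophysics", 3, 3),
    ("Biology", 58, 61),
    ("Chemistry", 30, 34),
    ("Computer Science", 16, 50),
    ("Geosciences", 7, 12),
    ("Mathematics", 53, 61),
    ("Physics", 12, 13),
    ("Psychology", 62, 45),
    ("Statistics", 0, 16),
]

# Change in the number of majors between the two periods.
changes = {major: later - earlier for major, earlier, later in div3_rows}

# The majors whose bars differ the most in each direction.
largest_gain = max(changes, key=changes.get)
largest_drop = min(changes, key=changes.get)

print(largest_gain, changes[largest_gain])
print(largest_drop, changes[largest_drop])
```

These are the same majors whose paired bars differ most in length on the overlaid chart.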

Overlaid line plots#

temps_by_month = Table().read_table("data/temps_by_month_upernavik.csv")
temps_by_month.show(5)

Year	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct	Nov	Dec
1875	-15.6	-19.7	-25.9	-14.7	-9.6	-0.4	4.7	2.9	-0.1	-4.5	-5.5	-14
1876	-24.5	-21.2	-20.8	-14.9	-6.3	0.9	3.9	2.4	3.2	-6.2	-9.4	-14.4
1877	-21.1	-26.5	-17.8	-12	-1.7	1.4	4.6	5.2	3	-2.8	-10	-19.2
1878	-22.9	-26.9	-19.6	-13	-5.6	1.9	3.2	4.3	-0.9	-4	-2	-3.7
1879	-13.5	-25.4	-21.3	-13	-2.7	0	5.2	5.8	-1.1	-5.5	-8.2	-16.5

... (104 rows omitted)

As with bar charts, if you supply only one parameter, the plot method will plot a line for every other column.

temps_by_month.plot("Year")

Qualitatively, we can see that the plot above has too much information on it, which makes it not very useful for understanding trends.

temps_by_month.select("Year", "Feb", "Aug").plot("Year")

Overlaid scatter plots#

We want to plot points (the values of two numerical variables) from different groups on the same graph.
A new approach: use a categorical variable to break the rows into groups of related points in the plot.

finch_1975 = Table().read_table("data/finch_beaks_1975.csv")
finch_1975.show(6)

species	Beak length, mm	Beak depth, mm
fortis	9.4	8
fortis	9.2	8.3
scandens	13.9	8.4
scandens	14	8.8
scandens	12.9	8.4
fortis	9.5	7.5

... (400 rows omitted)

finch_1975.scatter("Beak length, mm", "Beak depth, mm", group="species")

Takeaway: The overlaid scatter plot above helps us very quickly discern differences between groups. In this case, we can quickly tell that the two finch species have evolved (via natural selection) to have different beak characteristics.

2. Histograms#

A histogram shows us the distribution of a numerical variable.

Midterm scores#

scores = Table().read_table("data/scores_by_section.csv").relabeled("Midterm", "Midterm 1")
scores

Section	Midterm 1
1	22
2	12
2	23
2	14
1	20
3	20
4	19
1	24
1	0
1	13

... (111 rows omitted)

Let’s subset to just section 4 for now.

scores_sec4 = scores.where("Section", 4)
scores_sec4

Section	Midterm 1
4	19
4	14
4	24
4	12
4	0
4	13
4	19
4	22
4	13
4	18

... (20 rows omitted)

A histogram can give us a sense of the data as a whole: What are the common values? What are uncommon? How much variability is there? What are the extremes?

scores_sec4.hist("Midterm 1")

Class survey: Distance from home#

Load Data#

survey = Table().read_table("data/prelab01-survey-fall2024.csv")
survey.show(5)

Year at Williams	Favorite icecream flavor	Favorite planet	Height (in inches)	Distance Home (in miles)	Birthday month	Left or right handed?
Year 4	Purple Cow	Pluto* because liking Pluto is a protest move against Pl ...	66	175	October	Right handed
Year 4	Purple Cow	Neptune	72	213	January	Right handed
Year 2	Chocolate	Venus	66	152	April	Left handed
Year 4	Mint chocolate chip	Earth	71	218	September	Right handed
Year 1	Vanilla	Venus	60	1729	February	Right handed

... (53 rows omitted)

survey.labels

('Year at Williams',
 'Favorite icecream flavor',
 'Favorite planet',
 'Height (in inches)',
 'Distance Home (in miles)',
 'Birthday month',
 'Left or right handed? ')

distance_home = survey.column('Distance Home (in miles)')
distance_home

array([ 175. ,  213. ,  152. ,  218. , 1729. ,  402. , 2570. ,  132.6,
       1275. ,  140. ,  140. , 2849. , 2923. ,   45. ,  185.7,  226.4,
       2850. ,  150. ,  162. , 3000. ,  258. ,   86. , 2485.7,  102. ,
       3002. ,  125. ,  833. ,  150. ,  170. ,  287. , 2869.7, 3539. ,
        140. , 1900. ,  151. ,  166. ,  258. ,   92.2,  103. ,   46.4,
       2600. ,  246. ,  391. ,  330. , 7239. ,  382. ,  160. ,  151. ,
        852. ,  222.3,  960. , 1284. ,  162. ,  170. ,  162.1, 7781. ,
        150. , 3000. ])

Some basic info about the distances:

len(distance_home)

np.mean(distance_home)

1078.346551724138

max(distance_home)

7781.0

Here is a sneak preview of a histogram for those distances.

survey.hist('Distance Home (in miles)')

3. Binning#

survey.show(3)

Year at Williams	Favorite icecream flavor	Favorite planet	Height (in inches)	Distance Home (in miles)	Birthday month	Left or right handed?
Year 4	Purple Cow	Pluto* because liking Pluto is a protest move against Pl ...	66	175	October	Right handed
Year 4	Purple Cow	Neptune	72	213	January	Right handed
Year 2	Chocolate	Venus	66	152	April	Left handed

... (55 rows omitted)

We have a method in our package that can make bins automatically: table.bin.

our_range = np.arange(0,10000,2000)
our_range

array([   0, 2000, 4000, 6000, 8000])

binned_distance_home = survey.bin('Distance Home (in miles)', bins=our_range)
binned_distance_home

bin	Distance Home (in miles) count
0	45
2000	11
4000	0
6000	2
8000	0
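To see what binning does, here is a minimal sketch in plain Python that counts the distances by hand. It is not the package's implementation. The helper `bin_counts` is hypothetical, and it assumes each bin includes its left edge and excludes its right edge, except the last bin, which also includes its right edge. Under that assumption, the final row of the table above (8000, count 0) appears to mark only the right end of the last bin.

```python
# The 58 survey distances, copied from the array shown earlier.
distance_home = [
    175, 213, 152, 218, 1729, 402, 2570, 132.6,
    1275, 140, 140, 2849, 2923, 45, 185.7, 226.4,
    2850, 150, 162, 3000, 258, 86, 2485.7, 102,
    3002, 125, 833, 150, 170, 287, 2869.7, 3539,
    140, 1900, 151, 166, 258, 92.2, 103, 46.4,
    2600, 246, 391, 330, 7239, 382, 160, 151,
    852, 222.3, 960, 1284, 162, 170, 162.1, 7781,
    150, 3000,
]

def bin_counts(values, edges):
    """Count how many values fall between each pair of consecutive edges.

    Assumed convention: left edge included, right edge excluded,
    except that the last bin also includes its right edge.
    """
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = edges[i] <= v < edges[i + 1]
            at_right_end = (i == last and v == edges[-1])
            if in_bin or at_right_end:
                counts[i] += 1
                break
    return counts

edges = list(range(0, 10000, 2000))   # [0, 2000, 4000, 6000, 8000]
counts = bin_counts(distance_home, edges)
print(counts)
```

Five edges define four bins, so the sketch returns four counts, which match the first four rows of the table.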

Let’s add a column that is the percentage in each bin.

percent = binned_distance_home.column('Distance Home (in miles) count') / survey.num_rows * 100
percent_table = binned_distance_home.with_columns('Percent', percent)
percent_table

bin	Distance Home (in miles) count	Percent
0	45	77.5862
2000	11	18.9655
4000	0	0
6000	2	3.44828
8000	0	0

Histogram of distances from home#

survey.hist('Distance Home (in miles)', bins= np.arange(0,10000,2000))

survey.hist('Distance Home (in miles)', bins=4)

survey.hist('Distance Home (in miles)', bins=8)

Think-pair-share: Calculate the area of each bar in the histogram (estimating the height). Then show that the sum of the areas of all the bars equals 100.

#Possible approximations
widths = make_array(2000, 2000, 2000, 2000)
heights = make_array(0.038, 0.007, 0.001, 0.003)
areas = widths*heights
areas

array([76., 14.,  2.,  6.])

sum(areas)

98.0

Let’s check our estimates.

percent_table

bin	Distance Home (in miles) count	Percent
0	45	77.5862
2000	11	18.9655
4000	0	0
6000	2	3.44828
8000	0	0

Cool! We’re pretty close to the actual areas! Great!

Let’s work backwards now and see how our hist() method calculated the y-axis.

Let’s look at the first bar/bin.

bin0 = percent_table.take(0)
bin0

bin	Distance Home (in miles) count	Percent
0	45	77.5862

Recall that the height of a bar equals (percent of entries in the bin) / (width of the bin).

percent_in_bin0 =  bin0.column('Percent').item(0)
percent_in_bin0

77.58620689655173

height0 = percent_in_bin0/2000
height0

0.03879310344827586

Fantastic! That’s what we see on the y-axis on the histogram.
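The same calculation can be repeated for every bin. Here is a minimal sketch in plain Python, using the counts from the binned table above (45, 11, 0, 2 out of 58 responses) and the bin width of 2000 miles. The variable names are illustrative only.

```python
counts = [45, 11, 0, 2]   # entries per bin, from the binned table
total = 58                # number of survey responses
width = 2000              # every bin is 2000 miles wide

# Percent of entries in each bin.
percents = [c / total * 100 for c in counts]

# Height of each bar: percent per mile.
heights = [p / width for p in percents]

# Area of each bar: height times width, which recovers the percent.
areas = [h * width for h in heights]
total_area = sum(areas)

print(heights)
print(total_area)
```

Because each area is just the percent of entries in that bin, the areas add up to 100 (up to floating-point rounding).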

More histogram practice#

survey.show(5)

Year at Williams	Favorite icecream flavor	Favorite planet	Height (in inches)	Distance Home (in miles)	Birthday month	Left or right handed?
Year 4	Purple Cow	Pluto* because liking Pluto is a protest move against Pl ...	66	175	October	Right handed
Year 4	Purple Cow	Neptune	72	213	January	Right handed
Year 2	Chocolate	Venus	66	152	April	Left handed
Year 4	Mint chocolate chip	Earth	71	218	September	Right handed
Year 1	Vanilla	Venus	60	1729	February	Right handed

... (53 rows omitted)

survey.labels

('Year at Williams',
 'Favorite icecream flavor',
 'Favorite planet',
 'Height (in inches)',
 'Distance Home (in miles)',
 'Birthday month',
 'Left or right handed? ')

plot = survey.hist('Height (in inches)', bins=np.arange(58,80,2))
plot.set_ylim(0,0.15)

Think-pair-share: Approximating Histogram of heights#

Approximate the percentage of the class that has height greater than or equal to 70 inches but less than 72 inches.
How many students is this? We know 58 students responded to our survey.

survey.num_rows

answer_q1 = 7.7 * 2
answer_q1

15.4

answer_q2 = answer_q1 / 100 * 58
answer_q2

8.932

We can’t have a fraction of a student, so our approximation was probably slightly off. The real count is 9. Four of those students have a height of exactly 70 inches.

survey.where('Height (in inches)', are.between(70, 72))

Year at Williams	Favorite icecream flavor	Favorite planet	Height (in inches)	Distance Home (in miles)	Birthday month	Left or right handed?
Year 4	Mint chocolate chip	Earth	71	218	September	Right handed
Year 1	Vanilla	Neptune	70	402	December	Right handed
Year 1	Chocolate	Earth	71	226.4	February	Right handed
Year 2	Mint chocolate chip	Earth	70	150	November	Right handed
Year 4	Chocolate	Jupiter	70	125	March	Left handed
Year 4	Mint chocolate chip	Pluto* because liking Pluto is a protest move against Pl ...	71	1900	December	Left handed
Year 2	Purple Cow	Venus	70	2600	October	Right handed
Year 2	Vanilla	Mercury	71	246	July	Right handed
Year 1	Vanilla	Saturn	71	7781	October	Left handed

survey.where('Height (in inches)', are.equal_to(70)).num_rows
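The estimate above reads a histogram backwards: area (percent) = height × width, and count = percent × number of rows / 100. Here is a minimal sketch in plain Python. The function `estimate_count` is a hypothetical helper, and the bar height of 7.7 percent per inch is the value read off the plot in the estimate above.

```python
def estimate_count(bar_height, bin_width, num_rows):
    """Estimate the percent and the number of entries in one histogram bar.

    bar_height is in percent per unit, bin_width is in the same unit.
    """
    percent = bar_height * bin_width
    count = percent / 100 * num_rows
    return percent, count

# Bar for heights in [70, 72): about 7.7 percent per inch, 2 inches wide.
percent, students = estimate_count(7.7, 2, 58)
print(percent, students, round(students))
```

Rounding the estimate to the nearest whole student gives 9, which agrees with the table above.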

Overlaid histograms#

Scores#

Let’s circle back to our midterm data.

scores

Section	Midterm 1
1	22
2	12
2	23
2	14
1	20
3	20
4	19
1	24
1	0
1	13

... (111 rows omitted)

scores_sec4 = scores.where("Section", 4)
scores_sec4.hist("Midterm 1")

As with scatter, we can create overlaid histograms with the group= named argument.

scores.hist("Midterm 1", group="Section", bins=10)
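Conceptually, grouping splits the rows by the categorical column first, and then one histogram is drawn per group on shared axes. Here is a minimal sketch of the splitting step in plain Python, using the ten rows of `scores` shown above. This illustrates the idea and is not how the package implements `group=`.

```python
from collections import defaultdict

# (Section, Midterm 1) pairs copied from the first ten rows shown above.
rows = [(1, 22), (2, 12), (2, 23), (2, 14), (1, 20),
        (3, 20), (4, 19), (1, 24), (1, 0), (1, 13)]

# Collect the scores for each section; each list would feed one histogram.
scores_by_section = defaultdict(list)
for section, score in rows:
    scores_by_section[section].append(score)

for section in sorted(scores_by_section):
    print(section, scores_by_section[section])
```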

Finches#

One more overlay, for the two finch species.

finch_1975.show(10)

species	Beak length, mm	Beak depth, mm
fortis	9.4	8
fortis	9.2	8.3
scandens	13.9	8.4
scandens	14	8.8
scandens	12.9	8.4
fortis	9.5	7.5
fortis	9.5	8
fortis	11.5	9.9
fortis	11.1	8.6
fortis	9.9	8.4

... (396 rows omitted)

finch_1975.hist("Beak length, mm", bins=20)

plot = finch_1975.hist("Beak length, mm", group="species")
plot.set_title("Finches, 1975")

Try different bins to see differences in granularity.

def hist_with_bins(bins):
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=bins,  title=str(bins) + " bins")

interact(hist_with_bins, bins=Slider(1,20))

A few different histograms side-by-side:

with Figure(2,2, figsize=(4,3)):
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=6,  title="6 bins")
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=10, title="10 bins")
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=15, title="15 bins")
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=20, title="20 bins")
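Why does the number of bins change the granularity? More bins over the same range means narrower bins. Here is a minimal sketch in plain Python, assuming that an integer `bins` value splits the range from the smallest to the largest value into equal-width bins (the usual convention in plotting libraries, not confirmed here for this package). The helper `equal_width_edges` is hypothetical, and the beak lengths are the ten values shown above.

```python
def equal_width_edges(values, num_bins):
    """Return num_bins + 1 equally spaced edges from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(num_bins + 1)]

# Beak lengths (mm) from the first ten rows of finch_1975 shown above.
beak_lengths = [9.4, 9.2, 13.9, 14, 12.9, 9.5, 9.5, 11.5, 11.1, 9.9]

for num_bins in (6, 10, 20):
    edges = equal_width_edges(beak_lengths, num_bins)
    print(num_bins, "bins, width", round(edges[1] - edges[0], 2))
```

With few bins, each bar lumps together a wide range of beak lengths. With many bins, each bar covers a narrow range and the plot gets bumpier.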

CSCI 104: Data Science and Computing for All
