Histograms#

from datascience import *
from cs104 import *
import numpy as np

%matplotlib inline

0. Error practice#

majors = Table().read_table("data/majors.csv")
majors.show(5)
Major Division 2008-2012 2018-2021
American Studies 2 10 9
Anthropology 2 8 4
Arabic Studies 1 4 7
Art 1 55 31
Asian Studies 1 8 6

... (32 rows omitted)

# Select only division 3 majors 
div3 = majors.where("Division", are.equal_to(3)
div3
  Cell In[3], line 3
    div3
    ^
SyntaxError: invalid syntax
# Get the number of majors across both time periods
majors.select("2008-2012") + majors.select("2018-2021")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 2
      1 # Get the number of majors across both time periods
----> 2 majors.select("2008-2012") + majors.select("2018-2021")

TypeError: unsupported operand type(s) for +: 'Table' and 'Table'
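
The `TypeError` above says that `+` is not defined between two tables. Arrays of numbers, on the other hand, do add element by element. Here is a minimal sketch with plain NumPy arrays, using the five rows displayed earlier as stand-ins for the real columns. In the notebook, `majors.column(...)` would presumably supply these arrays.

```python
import numpy as np

# Counts copied from the five rows shown by majors.show(5).
# Assumption: these stand in for majors.column("2008-2012")
# and majors.column("2018-2021").
count_2008_2012 = np.array([10, 8, 4, 55, 8])
count_2018_2021 = np.array([9, 4, 7, 31, 6])

# Arrays support elementwise addition, unlike Table objects.
total_majors = count_2008_2012 + count_2018_2021
print(total_majors)
```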

1. Overlaid graphs#

Sometimes we want to see more than one plot on a single graph.

Overlaid bar charts#

div3 = majors.where("Division", are.equal_to(3)).drop("Division")
div3
Major 2008-2012 2018-2021
Astronomy 1 2
Astrophysics 3 3
Biology 58 61
Chemistry 30 34
Computer Science 16 50
Geosciences 7 12
Mathematics 53 61
Physics 12 13
Psychology 62 45
Statistics 0 16
# First graph for 2008-2012
div3.barh("Major", "2008-2012")
# Second graph from 2018-2021
div3.barh("Major", "2018-2021")
../_images/08-histograms_10_0.png ../_images/08-histograms_10_1.png

An overlaid graph puts the two graphs together to make comparison easier.

The package we’re using will automatically overlay all of the remaining columns if you give barh just one argument, the column of labels.

div3.barh("Major")
../_images/08-histograms_12_0.png

Overlaid line plots#

temps_by_month = Table().read_table("data/temps_by_month_upernavik.csv")
temps_by_month.show(5)
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1875 -15.6 -19.7 -25.9 -14.7 -9.6 -0.4 4.7 2.9 -0.1 -4.5 -5.5 -14
1876 -24.5 -21.2 -20.8 -14.9 -6.3 0.9 3.9 2.4 3.2 -6.2 -9.4 -14.4
1877 -21.1 -26.5 -17.8 -12 -1.7 1.4 4.6 5.2 3 -2.8 -10 -19.2
1878 -22.9 -26.9 -19.6 -13 -5.6 1.9 3.2 4.3 -0.9 -4 -2 -3.7
1879 -13.5 -25.4 -21.3 -13 -2.7 0 5.2 5.8 -1.1 -5.5 -8.2 -16.5

... (104 rows omitted)

As with bar charts, if you supply only one argument, the plot method draws a line for each of the remaining columns.

temps_by_month.plot("Year")
../_images/08-histograms_16_0.png

The plot above has too much information on it, which makes it not very useful for understanding trends. Selecting just a couple of months gives a clearer picture.

temps_by_month.select("Year", "Feb", "Aug").plot("Year")
../_images/08-histograms_18_0.png

Overlaid scatter plots#

  • We want to plot points (the values of two numerical variables) from different groups on the same graph.

  • A new approach: use a categorical variable to break the rows into groups of related points in the plot.

finch_1975 = Table().read_table("data/finch_beaks_1975.csv")
finch_1975.show(6)
species Beak length, mm Beak depth, mm
fortis 9.4 8
fortis 9.2 8.3
scandens 13.9 8.4
scandens 14 8.8
scandens 12.9 8.4
fortis 9.5 7.5

... (400 rows omitted)

finch_1975.scatter("Beak length, mm", "Beak depth, mm", group="species")
../_images/08-histograms_21_0.png

Takeaway: The overlaid scatter plot above helps us quickly discern differences between groups. In this case, we can tell at a glance that the two finch species have evolved (via natural selection) to have different beak characteristics.
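
To make the grouping idea concrete, here is a rough sketch of how a categorical column can split points into groups. It uses plain NumPy and only the six rows displayed above, and it is not meant to describe how `scatter` is implemented.

```python
import numpy as np

# The six rows displayed by finch_1975.show(6).
species = np.array(["fortis", "fortis", "scandens",
                    "scandens", "scandens", "fortis"])
beak_length = np.array([9.4, 9.2, 13.9, 14.0, 12.9, 9.5])
beak_depth = np.array([8.0, 8.3, 8.4, 8.8, 8.4, 7.5])

# One boolean mask per category splits the points into groups,
# so each group can be drawn in its own color.
groups = {}
for name in np.unique(species):
    mask = species == name
    groups[str(name)] = (beak_length[mask], beak_depth[mask])

for name, (lengths, depths) in groups.items():
    print(name, "mean length:", round(float(lengths.mean()), 2),
          "mean depth:", round(float(depths.mean()), 2))
```

Even in this tiny sample, the scandens rows have noticeably longer beaks than the fortis rows.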

2. Histograms#

A histogram shows us the distribution of a numerical variable.

Midterm scores#

scores = Table().read_table("data/scores_by_section.csv").relabeled("Midterm", "Midterm 1")
scores
Section Midterm 1
1 22
2 12
2 23
2 14
1 20
3 20
4 19
1 24
1 0
1 13

... (111 rows omitted)

Let’s subset to just section 4 for now.

scores_sec4 = scores.where("Section", 4)
scores_sec4
Section Midterm 1
4 19
4 14
4 24
4 12
4 0
4 13
4 19
4 22
4 13
4 18

... (20 rows omitted)

A histogram can give us a sense of the data as a whole: What are the common values? What are uncommon? How much variability is there? What are the extremes?

scores_sec4.hist("Midterm 1")
../_images/08-histograms_29_0.png
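
As a rough illustration of what a histogram summarizes, the sketch below counts scores per bin and prints a text version of the bars. It uses `np.histogram` on only the ten section 4 scores displayed above. The bin edges (5 points wide) are an arbitrary choice for illustration, not the ones `hist` picked.

```python
import numpy as np

# The ten section 4 scores displayed above (20 more rows are omitted).
shown_scores = np.array([19, 14, 24, 12, 0, 13, 19, 22, 13, 18])

# Bin edges 0, 5, ..., 25: each bin is 5 points wide.
edges = np.arange(0, 30, 5)
counts, _ = np.histogram(shown_scores, bins=edges)

# Print one row of stars per bin, a text stand-in for the bars.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>2} to {right:>2}: {'*' * int(count)}")
```

The common values (10 to 20), the gap, and the extreme score of 0 all show up directly in the counts.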

Class survey: Distance from home#

Load Data#

survey = Table().read_table("data/prelab01-survey-fall2023.csv")
survey.show(5)
Year at Williams Favorite Icecream Flavor Favorite Planet Height (in inches) Distance Home (in miles) Birth Month Left or Right Handed
2 Strawberry Venus 60 175 October Right
1 Coffee Venus 63 6831 March Right
2 Chocolate Earth 74 2600 July Right
2 Vanilla Earth 69 933.9 July Ambidextrous
1 Chocolate Saturn 64 141 July Right

... (53 rows omitted)

survey.labels
('Year at Williams',
 'Favorite Icecream Flavor',
 'Favorite Planet',
 'Height (in inches)',
 'Distance Home (in miles)',
 'Birth Month',
 'Left or Right Handed')
distance_home = survey.column('Distance Home (in miles)')
distance_home
array([ 175.  , 6831.  , 2600.  ,  933.9 ,  141.  ,  132.8 , 2023.  ,
       1685.7 ,  167.5 ,  170.  , 7233.  , 2856.6 , 1685.7 , 2701.  ,
        225.  ,  195.5 , 1250.  ,  130.  , 4920.  ,  264.  ,  170.  ,
        161.  , 2585.  ,  167.9 ,  144.  ,  419.  , 3341.  ,   85.  ,
          0.  ,  176.  ,  155.  ,  167.  ,  106.  ,  930.  , 2856.  ,
        187.  ,  153.  ,  396.  ,  163.9 ,  250.  , 2517.  ,  168.  ,
        163.  ,  200.  ,  231.  ,  160.  ,  117.  ,   88.  ,  132.  ,
       1891.  ,  848.  , 4455.23, 6873.  ,  181.7 , 1046.  ,  235.  ,
       6743.  ,  424.  ])

Some basic info about the distances:

len(distance_home)
58
np.mean(distance_home)
1300.1108620689654
max(distance_home)
7233.0

Here is a sneak preview of a histogram for those distances:

survey.hist('Distance Home (in miles)')
../_images/08-histograms_40_0.png

3. Binning#

survey.show(3)
Year at Williams Favorite Icecream Flavor Favorite Planet Height (in inches) Distance Home (in miles) Birth Month Left or Right Handed
2 Strawberry Venus 60 175 October Right
1 Coffee Venus 63 6831 March Right
2 Chocolate Earth 74 2600 July Right

... (55 rows omitted)

We have a method in our package that can make bins automatically: table.bin.

our_range = np.arange(0,10000,2000)
our_range
array([   0, 2000, 4000, 6000, 8000])
binned_distance_home = survey.bin('Distance Home (in miles)', bins=our_range)
binned_distance_home
bin Distance Home (in miles) count
0 44
2000 8
4000 2
6000 4
8000 0
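
The counts above can be cross-checked with `np.histogram`, used here as a stand-in rather than as a description of how `table.bin` works internally. The array is the one printed earlier for `distance_home`. Under NumPy's rule, each bin includes its left edge and excludes its right edge, except the last bin, which includes both.

```python
import numpy as np

# The 58 distances printed earlier for distance_home.
distance_home = np.array([
    175.0, 6831.0, 2600.0, 933.9, 141.0, 132.8, 2023.0,
    1685.7, 167.5, 170.0, 7233.0, 2856.6, 1685.7, 2701.0,
    225.0, 195.5, 1250.0, 130.0, 4920.0, 264.0, 170.0,
    161.0, 2585.0, 167.9, 144.0, 419.0, 3341.0, 85.0,
    0.0, 176.0, 155.0, 167.0, 106.0, 930.0, 2856.0,
    187.0, 153.0, 396.0, 163.9, 250.0, 2517.0, 168.0,
    163.0, 200.0, 231.0, 160.0, 117.0, 88.0, 132.0,
    1891.0, 848.0, 4455.23, 6873.0, 181.7, 1046.0, 235.0,
    6743.0, 424.0,
])

our_range = np.arange(0, 10000, 2000)  # edges 0, 2000, ..., 8000

# Five edges define four bins, so np.histogram returns four counts.
counts, edges = np.histogram(distance_home, bins=our_range)
print(counts)
```

Unlike the table above, `np.histogram` returns no trailing row for the final edge (8000); it returns one count per bin.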

Let’s add a column that is the percentage in each bin.

percent = binned_distance_home.column('Distance Home (in miles) count') / survey.num_rows * 100
percent_table = binned_distance_home.with_columns('Percent', percent)
percent_table
bin Distance Home (in miles) count Percent
0 44 75.8621
2000 8 13.7931
4000 2 3.44828
6000 4 6.89655
8000 0 0

Histogram of distances from home#

survey.hist('Distance Home (in miles)', bins= np.arange(0,10000,2000))
../_images/08-histograms_49_0.png
survey.hist('Distance Home (in miles)', bins=4)
../_images/08-histograms_50_0.png
survey.hist('Distance Home (in miles)', bins=8)
../_images/08-histograms_51_0.png
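
When `bins` is an integer rather than an array of edges, the bin edges have to be computed from the data. The sketch below assumes the integer is handled the way NumPy handles it, with equal-width bins between the smallest and largest values. Whether `hist` follows exactly this rule is an assumption here.

```python
import numpy as np

# Smallest and largest distances in the survey (0 and 7233 miles).
data_range = np.array([0.0, 7233.0])

# Equal-width edges for 4 bins and for 8 bins over that range.
edges_4 = np.histogram_bin_edges(data_range, bins=4)
edges_8 = np.histogram_bin_edges(data_range, bins=8)

print("4 bins, width", edges_4[1] - edges_4[0], ":", edges_4)
print("8 bins, width", edges_8[1] - edges_8[0], ":", edges_8)
```

Doubling the number of bins halves each bin's width, which is why the second histogram shows finer detail.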

Think-pair-share: Calculate the area of each bar in the histogram with 2000-mile bins (estimating each height from the y-axis). Then show that the areas of all the bars sum to approximately 100.

#Possible approximations
widths = make_array(2000, 2000, 2000, 2000)
heights = make_array(0.038, 0.007, 0.001, 0.003)
areas = widths*heights
areas
array([76., 14.,  2.,  6.])
sum(areas)
98.0

Let’s check our estimates.

percent_table
bin Distance Home (in miles) count Percent
0 44 75.8621
2000 8 13.7931
4000 2 3.44828
6000 4 6.89655
8000 0 0

Cool! We’re pretty close to the actual areas! Great!

Let’s work backwards now and see how our hist() method calculated the y-axis.

  1. Let’s look at the first bar/bin.

bin0 = percent_table.take(0)
bin0
bin Distance Home (in miles) count Percent
0 44 75.8621

Recall that the height of a bar is equal to (percent of entries in the bin) / (width of the bin).

percent_in_bin0 =  bin0.column('Percent').item(0)
percent_in_bin0
75.86206896551724
height0 = percent_in_bin0/2000
height0
0.03793103448275862

Fantastic! That’s what we see on the y-axis on the histogram.
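
The same calculation extends to every bar. This short sketch uses the counts from the binned table and assumes nothing beyond the height formula recalled above. It also confirms that the exact areas sum to 100.

```python
import numpy as np

counts = np.array([44, 8, 2, 4])  # counts from the binned table
width = 2000                      # every bin spans 2000 miles
total = 58                        # survey.num_rows

percents = counts / total * 100   # percent of entries in each bin
heights = percents / width        # percent per mile (the y-axis)
areas = heights * width           # area of each bar, in percent

print(heights)
print(areas.sum())
```

The heights come out close to the values estimated by eye in the think-pair-share (0.038, 0.007, 0.001, 0.003).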

More histogram practice#

survey.show(5)
Year at Williams Favorite Icecream Flavor Favorite Planet Height (in inches) Distance Home (in miles) Birth Month Left or Right Handed
2 Strawberry Venus 60 175 October Right
1 Coffee Venus 63 6831 March Right
2 Chocolate Earth 74 2600 July Right
2 Vanilla Earth 69 933.9 July Ambidextrous
1 Chocolate Saturn 64 141 July Right

... (53 rows omitted)

survey.labels
('Year at Williams',
 'Favorite Icecream Flavor',
 'Favorite Planet',
 'Height (in inches)',
 'Distance Home (in miles)',
 'Birth Month',
 'Left or Right Handed')
plot = survey.hist('Height (in inches)', bins=np.arange(58,78.1,2))
plot.set_ylim(0,0.15)
../_images/08-histograms_68_0.png

Think-pair-share: Approximating Histogram of heights#

  1. Approximate the percentage of the class that has height greater than or equal to 70 inches but less than 72 inches.

  2. How many students is this? We know 58 students responded to our survey.

survey.num_rows
58
answer_q1 = 7.5 * 2
answer_q1
15.0
answer_q2 = answer_q1 / 100 * 58
answer_q2
8.7

We can’t have a fraction of a student, so our approximation was slightly off. The real count is 9. Two of those students have a height of exactly 70 inches.

survey.where('Height (in inches)', are.between(70, 72))
Year at Williams Favorite Icecream Flavor Favorite Planet Height (in inches) Distance Home (in miles) Birth Month Left or Right Handed
4 Vanilla Earth 71 7233 January Right
2 Chocolate Earth 71 225 December Right
3 Mint chocolate chip Earth 71 1250 April Right
4 Vanilla Pluto 71 930 September Left
3 Chocolate Mars 70 200 May Right
1 Vanilla Neptune 70.5 117 November Right
2 Chocolate Earth 71 132 January Right
1 Chocolate Earth 70.08 4455.23 February Right
1 Mint chocolate chip Earth 70 181.7 February Right
survey.where('Height (in inches)', are.equal_to(70)).num_rows
2
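
To see how far off the eyeballed estimate was, the sketch below runs the calculation in the other direction. It starts from the true count of 9 and recovers the exact bar height. The variable names are illustrative only.

```python
# Estimate read off the histogram:
# bar height of about 7.5 percent per inch, bin width of 2 inches.
estimated_percent = 7.5 * 2
estimated_count = estimated_percent / 100 * 58

# Exact values, from the nine rows returned by the where() query.
exact_count = 9
exact_percent = exact_count / 58 * 100  # percent of the class
exact_height = exact_percent / 2        # percent per inch

print("estimated:", estimated_percent, "percent,", estimated_count, "students")
print("exact:", round(exact_percent, 2), "percent, bar height", round(exact_height, 2))
```

The true bar height is a little above 7.5 percent per inch, which accounts for the gap between 8.7 and 9 students.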

Overlaid histograms#

Scores#

Let’s circle back to our midterm data.

scores
Section Midterm 1
1 22
2 12
2 23
2 14
1 20
3 20
4 19
1 24
1 0
1 13

... (111 rows omitted)

scores_sec4 = scores.where("Section", 4)
scores_sec4.hist("Midterm 1")
../_images/08-histograms_79_0.png

As with scatter, we can create overlaid histograms by passing the group= named argument.

scores.hist("Midterm 1", group="Section", bins=10)
../_images/08-histograms_81_0.png
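
Overlaid histograms are only comparable when every group is counted against the same bin edges. The sketch below illustrates that idea with `np.histogram` on the ten rows of `scores` displayed above. The 5-point bins are an arbitrary choice, and this is not a description of what `hist` does internally.

```python
import numpy as np

# The ten rows of `scores` displayed above.
section = np.array([1, 2, 2, 2, 1, 3, 4, 1, 1, 1])
midterm = np.array([22, 12, 23, 14, 20, 20, 19, 24, 0, 13])

# One shared set of edges, so the bars line up across sections.
edges = np.arange(0, 30, 5)

counts_by_section = {}
for s in np.unique(section):
    counts, _ = np.histogram(midterm[section == s], bins=edges)
    counts_by_section[int(s)] = counts.tolist()

for s, counts in counts_by_section.items():
    print("Section", s, counts)
```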

Finches#

One more overlay, for the two finch species.

finch_1975.show(10)
species Beak length, mm Beak depth, mm
fortis 9.4 8
fortis 9.2 8.3
scandens 13.9 8.4
scandens 14 8.8
scandens 12.9 8.4
fortis 9.5 7.5
fortis 9.5 8
fortis 11.5 9.9
fortis 11.1 8.6
fortis 9.9 8.4

... (396 rows omitted)

finch_1975.hist("Beak length, mm", bins=20)
../_images/08-histograms_84_0.png
plot = finch_1975.hist("Beak length, mm", group="species")
plot.set_title("Finches, 1975")
../_images/08-histograms_85_0.png

Try different bins to see differences in granularity.

with Figure(2,2, figsize=(4,3)):
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=6,  title="6 bins")
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=10, title="10 bins")
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=15, title="15 bins")
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=20, title="20 bins")
../_images/08-histograms_87_0.png