Probability#

from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline

1. Random Choice, Loops, Simulation Review#

https://upload.wikimedia.org/wikipedia/commons/7/74/Pompey_by_Nasidius.jpg

In the last lecture, we wrote the following code to simulate tossing a coin 100 times and counting the number of heads.

coin = make_array('heads', 'tails')

np.random.choice(coin)

'heads'

flips = np.random.choice(coin, 100)
flips

array(['heads', 'heads', 'tails', 'tails', 'tails', 'tails', 'tails',
       'heads', 'tails', 'tails', 'heads', 'tails', 'tails', 'tails',
       'tails', 'tails', 'tails', 'heads', 'heads', 'heads', 'heads',
       'heads', 'heads', 'heads', 'tails', 'tails', 'heads', 'tails',
       'heads', 'heads', 'tails', 'heads', 'tails', 'tails', 'heads',
       'tails', 'tails', 'tails', 'tails', 'tails', 'heads', 'heads',
       'tails', 'heads', 'tails', 'tails', 'tails', 'tails', 'tails',
       'heads', 'tails', 'tails', 'heads', 'tails', 'heads', 'heads',
       'heads', 'tails', 'tails', 'tails', 'heads', 'tails', 'tails',
       'tails', 'tails', 'tails', 'heads', 'heads', 'heads', 'heads',
       'tails', 'heads', 'tails', 'heads', 'heads', 'heads', 'heads',
       'tails', 'heads', 'tails', 'heads', 'tails', 'tails', 'heads',
       'tails', 'heads', 'tails', 'heads', 'tails', 'tails', 'tails',
       'heads', 'heads', 'tails', 'tails', 'heads', 'tails', 'heads',
       'tails', 'heads'], dtype='<U5')

np.count_nonzero(flips == 'heads')

def heads_in_100_flips():
    """ Returns the number of heads in 100 flips of
    a fair coin """
    coin = make_array('heads', 'tails')
    flips = np.random.choice(coin, 100)
    return np.count_nonzero(flips == 'heads')

heads_in_100_flips()

We also wrote a simulation look using our algorithm to emperically estimate the probability of getting between 40 and 60 heads in 100 flips.

Generic simulation algorithm:

Repeat N times:
- Simulate one trial
- Record the outcome
Analyze outcomes for all trials

We record the outcomes by appending them to an array.

# Create an empty array
outcomes = make_array()

# Run this a bunch -- each run adds one more element to outcomes
num_heads = heads_in_100_flips()
outcomes = np.append(outcomes, num_heads)
outcomes

array([50.])

Now make a loop!

num_trials = 10000
outcomes = make_array()
for i in np.arange(0, num_trials):
    num_heads = heads_in_100_flips()
    outcomes = np.append(outcomes, num_heads)
    
outcomes

array([53., 55., 55., ..., 41., 47., 52.])

simulated_results = Table().with_columns('Heads in 100 flips', 
                                        outcomes)
plot = simulated_results.hist(bins=np.arange(30, 70, 1))
plot.interval(40,60)

target_range = simulated_results.where("Heads in 100 flips", 
                                       are.between(40,60))
proportion_40_to_60 = target_range.num_rows / simulated_results.num_rows
print('proportion between 40 and 60:', proportion_40_to_60)

proportion between 40 and 60: 0.9539

2. A general simulation function#

Let’s make a reusable version of our simulation. That is, let’s make a function to do the work and produce the outcomes array. We can start with our simulation loop above:

num_trials = 10000
outcomes = make_array()
for i in np.arange(0, num_trials):
    num_heads = heads_in_100_flips()
    outcomes = np.append(outcomes, num_heads)
outcomes

array([51., 54., 46., ..., 47., 54., 59.])

This code depends on two pieces of information specific to the simulation we wish to perform:

the number of trials (num_trials)
the code to compute the outcome of one trial (eg: heads_in_100_flips(). That code would need to change if we simulated the number of tails in 200 flips, the sum of 20 dice rolls, or any other kind outcome.

To enable us to use our general function with different numbers of trials or different functions to make the outcomes, we write the function with those two items as parameters:

def simulate(make_one_outcome, num_trials):
    """
    Return an array of num_trials values, each 
    of which was created by calling make_one_outcome().
    """
    outcomes = make_array()
    for i in np.arange(0, num_trials):
        outcome = make_one_outcome()
        outcomes = np.append(outcomes, outcome)

    return outcomes

We can then call simulate as follows:

simulate(heads_in_100_flips, 10)

array([55., 50., 51., 57., 53., 51., 54., 54., 49., 55.])

Or if we are interested in the sum of 20 dice rolls, we call it as follows:

def sum_twenty_dice():
    dice = np.arange(1,7)
    roll_20_dice = np.random.choice(dice, 20)
    return sum(roll_20_dice)

simulate(sum_twenty_dice, 5)

array([50., 71., 66., 75., 56.])

Notice how we can design new simulations without starting from scratch! We write a function to compute one outcome, and then reuse simulate with the number of trials we wish to perform.

And just for fun…

twenty_dice = simulate(sum_twenty_dice, 100000)
Table().with_columns('Sum of 20 dice', twenty_dice).hist(bins=np.arange(40,100,1))

Does this look like any other histogram we saw today?

3. The Monty Hall Problem#

Three doors hide two goats and a car.
You pick one of three doors.
A different door is opened to reveal a goat.
You must decide which door has the car. You can stick with your original choice, or you can switch to the other unopened door.

Which strategy wins the car with higher probability?#

First, we’ll define functions that give us the result of one trial. Both pick the initial door at random, and then determine which door the player ultimately selects based on our two strategies.

prizes = make_array('goat', 'goat', 'car')

def monty_hall_game_staying_strategy():
    """
    Return what player wins if they stick with their original choice
    """
    return np.random.choice(prizes)

def monty_hall_game_switching_strategy():
    """
    Return what player wins if they switch with their original choice
    """
    contestant_guess = np.random.choice(prizes)
    if contestant_guess == 'car':
        # Revealed door is one goat ...
        # ... and the remaining door is the second goat.
        return 'goat'
    else:
        # Releaved door is the second goat ...
        # ... and the remaining door is the car.
        return 'car'

monty_hall_game_staying_strategy()

'goat'

monty_hall_game_switching_strategy()

'goat'

Ooo! We get to use our simulate function!

outcomes_staying = simulate(monty_hall_game_staying_strategy, 10)
outcomes_staying

array(['car', 'car', 'goat', 'goat', 'car', 'goat', 'goat', 'car', 'goat',
       'car'], dtype='<U32')

num_trials = 10000
outcomes_staying = simulate(monty_hall_game_staying_strategy, num_trials)
stay_and_win = np.count_nonzero(outcomes_staying == "car") / len(outcomes_staying)
print("Probability of winning a car if we always stay:", stay_and_win)

Probability of winning a car if we always stay: 0.3274

outcomes_switching =  simulate(monty_hall_game_switching_strategy, num_trials)
switch_and_win = np.count_nonzero(outcomes_switching == "car")/len(outcomes_switching)
print("Probability of winning a car if we always switch", switch_and_win)

Probability of winning a car if we always switch 0.6697

Would we trust the probabilities approximated by our simulation more or less if we increase the number of trials?

CSCI 104: Data Science and Computing for All

Probability

Contents

Probability#

1. Random Choice, Loops, Simulation Review#

2. A general simulation function#

3. The Monty Hall Problem#

Which strategy wins the car with higher probability?#