Sorting

Let’s quickly review binarySearch before moving on to sorting. Recall that binary search works on a sorted list.

def binarySearch(aList, item):
    """Assume aList is sorted. If item is 
    in aList, return True; else return False."""

    n = len(aList)
    mid = n // 2

    # base case 1
    if n == 0:
        return False

    # base case 2
    elif item == aList[mid]:
        return True

    
    # recurse on left
    elif item < aList[mid]:
        return binarySearch(aList[:mid], item)
    
    # recurse on right
    else:
        return binarySearch(aList[mid + 1:], item)

Although the above approach works, it is not actually O(log n). The problem is that list slicing is an O(n) operation: each recursive call copies up to half of the remaining list. To write a truly logarithmic binary search, we have to pass index values recursively rather than creating list copies by slicing.

def binarySearchBetter(aList, item, indexStart, indexEnd):
    """Assume aList is sorted. If item is in the range
    aList[indexStart] through aList[indexEnd] (inclusive),
    return True; else return False."""

    # base case 1: the range is empty
    if indexStart > indexEnd:
        return False

    mid = indexStart + (indexEnd - indexStart) // 2

    # base case 2
    if item == aList[mid]:
        return True

    # recurse on left (everything before mid)
    elif item < aList[mid]:
        return binarySearchBetter(aList, item, indexStart, mid - 1)

    # recurse on right (everything after mid)
    else:
        return binarySearchBetter(aList, item, mid + 1, indexEnd)
# quick test to make sure it works
myList = ['a', 'e', 'i', 'o', 'u', 'z']
binarySearch(myList, 'z')
True
# quick test to make sure it works
myList = ['a', 'e', 'i', 'o', 'u', 'z']
binarySearchBetter(myList, 'z', 0, len(myList)-1)
True
# let's make a big list
# we'll include each word twice just to make the list bigger
myList = []
with open("prideandprejudice.txt") as f:
    for line in f:
        myList.extend(line.strip().split())
        myList.extend(line.strip().split())
myList.sort()
len(myList)
244178
import time
start_time = time.time()
print(binarySearch(myList, "cat"))
print((time.time() - start_time), "seconds")
False
0.004509925842285156 seconds
import time
start_time = time.time()
print(binarySearchBetter(myList, "cat", 0, len(myList)-1))
print((time.time() - start_time), "seconds")
False
0.000392913818359375 seconds
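
To make the cost of slicing concrete, here is a rough sketch that counts how many elements the slicing version copies when the item is smaller than everything in the list, so every call takes the left slice aList[:mid]. The helper name elementsCopied is made up for this illustration, and it only models the slice sizes; it does not run a search.

```python
def elementsCopied(n):
    """Count the elements copied by the slices aList[:mid] when
    binarySearch always recurses on the left half of a list of length n."""
    copied = 0
    while n > 0:
        n = n // 2      # aList[:mid] has mid = n // 2 elements
        copied += n
    return copied

# same length as the big word list above
print(elementsCopied(244178))
```

The total stays just below n, because n/2 + n/4 + n/8 + ... adds up to less than n. So the slicing version does O(n) copying in total, even though it makes only about log n recursive calls.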

Selection Sort

Binary search is more efficient than linear search, but it also requires that our list be sorted in advance. Sorting is a computationally expensive operation. Today we will explore a few sorting algorithms.

A possible approach to sort:

  • Find the smallest element and move it to the first position

  • Repeat: find the second-smallest element and move it to the second position, and so on.

This algorithm is called selection sort.

def selectionSort(myList):
    """Sorts myList in place in ascending order
    using selection sort. Returns nothing."""
    # find size
    n = len(myList)
    
    # traverse through all elements
    for i in range(n):
        
        # find min element in remaining unsorted list
        minIndex = i
        for j in range(i + 1, n):
            if myList[minIndex] > myList[j]:
                minIndex = j
                
        # swap min element with element at i
        myList[i], myList[minIndex] = myList[minIndex], myList[i]
        
myList = [12, 2, 9, 4, 11, 3, 1, 7, 14, 5, 13]
selectionSort(myList)
myList
[1, 2, 3, 4, 5, 7, 9, 11, 12, 13, 14]
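
How much work does selection sort do? One way to see it is to count how often the inner loop compares two elements. The sketch below copies only the loop structure of selectionSort (the function name selectionSortComparisons is made up for this example) and checks the count against the formula n(n-1)/2.

```python
def selectionSortComparisons(n):
    """Count how many times the inner loop of selectionSort
    compares two elements on a list of length n."""
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            count += 1
    return count

for n in [10, 100, 1000]:
    # loop count and closed-form formula should agree
    print(n, selectionSortComparisons(n), n * (n - 1) // 2)
```

Since n(n-1)/2 grows like n^2, making the list 10 times longer makes selection sort do roughly 100 times more comparisons.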

Extra Slides Material: MergeSort

Merge sort is another way to sort that is more efficient, but also more complicated. To get started, let’s write a helper function, merge, that takes two sorted lists, iteratively merges them into a single sorted list, and returns it.

def merge(a, b):
    """Merges two sorted lists a and b,
    and returns new merged list c"""
    # initialize variables
    i, j = 0, 0
    lenA, lenB = len(a), len(b)
    c = []
    
    # traverse and populate new list
    while i < lenA and j < lenB:
        
        if a[i] <= b[j]:
            c.append(a[i])
            i += 1
        else:
            c.append(b[j])
            j += 1
        
    # handle remaining values
    if i < lenA:
        c.extend(a[i:])
        
    elif j < lenB:
        c.extend(b[j:]) 
    
    return c     
merge([3, 12, 43], [])
[3, 12, 43]
merge([], [0, 2, 12])
[0, 2, 12]
merge(['a', 'd', 'f'], ['b', 'c', 'e'])
['a', 'b', 'c', 'd', 'e', 'f']
evens = [i for i in range(20) if i % 2 == 0]
sqs = [i*i for i in range(1, 8)]
merge(evens, sqs)
[0, 1, 2, 4, 4, 6, 8, 9, 10, 12, 14, 16, 16, 18, 25, 36, 49]
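
As a cross-check, Python's standard library has heapq.merge, which also merges already-sorted inputs. It is not the merge function written above, and it returns an iterator rather than a list, so the sketch below wraps it in list(). Running it on the same two lists should give the same result.

```python
import heapq

evens = [i for i in range(20) if i % 2 == 0]
sqs = [i*i for i in range(1, 8)]

# heapq.merge assumes each input is already sorted, just like merge
merged = list(heapq.merge(evens, sqs))
print(merged)
```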

Using our helper function, we can implement the recursive mergeSort algorithm that uses merge() in the final merge step.

def mergeSort(L):
    """Given a list L, returns
    a new list that is L sorted
    in ascending order."""
    n = len(L)
    
    # base case
    if n == 0 or n == 1:
        return L
    
    else:
        m = n//2 # middle
        
        # recurse on left & right half
        sortLt = mergeSort(L[:m])
        sortRt = mergeSort(L[m:])
        
        # return merged list
        return merge(sortLt, sortRt)
mergeSort([12, 2, 9, 4, 11, 3, 1, 7, 14, 5, 13])
[1, 2, 3, 4, 5, 7, 9, 11, 12, 13, 14]
mergeSort(['hello', 'world', 'aloha', 'earth'])
['aloha', 'earth', 'hello', 'world']
mergeSort(['e', 'p', 'o', 'c', 'h'])
['c', 'e', 'h', 'o', 'p']
mergeSort(list('We hate Covid-19'))
[' ',
 ' ',
 '-',
 '1',
 '9',
 'C',
 'W',
 'a',
 'd',
 'e',
 'e',
 'h',
 'i',
 'o',
 't',
 'v']

Merge Sort vs Selection Sort

Why do we need a better sorting algorithm? As the list we are sorting grows, the Big-O bound matters: selection sort takes O(n^2) time, while merge sort takes O(n log n). Let’s compare the runtime of both sorting algorithms on lists of increasing size.

wordList = []
with open('prideandprejudice.txt') as book:
    for line in book:
        line = line.strip().split()
        wordList.extend(line)
print(len(wordList))
122089
miniList = wordList[:500]
medList = wordList[:7000]
import time

def timedSorting(wordList):
    """Measures runtime for sorting wordList"""
    # selectionSort sorts in place, so give it a copy;
    # otherwise mergeSort would receive an already-sorted list
    listCopy = list(wordList)
    start = time.time()
    selectionSort(listCopy)
    end = time.time()
    print("Selection sort takes {} secs".format(end - start))
    start = time.time()
    mergeSort(wordList)
    end = time.time()
    print("Merge sort takes {} secs".format(end - start))
timedSorting(miniList)
Selection sort takes 0.008807897567749023 secs
Merge sort takes 0.0008759498596191406 secs
# timedSorting(medList)
# timedSorting(wordList)
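
Before running the larger tests, we can estimate how the work grows for our three list sizes. This is a back-of-the-envelope sketch that counts operations, not seconds. Both helper names are made up for this example: selectionComparisons uses the formula n(n-1)/2 for selection sort, and mergeElementsMoved follows the recursion of mergeSort, adding n for each call to merge on n elements in total.

```python
def selectionComparisons(n):
    """Comparisons made by selection sort on n items."""
    return n * (n - 1) // 2

def mergeElementsMoved(n):
    """Total elements placed into new lists by merge()
    across all recursive calls of mergeSort on n items."""
    if n <= 1:
        return 0
    m = n // 2
    # work in the left half, the right half, then one merge of n elements
    return mergeElementsMoved(m) + mergeElementsMoved(n - m) + n

# sizes of miniList, medList and wordList
for n in [500, 7000, 122089]:
    print(n, selectionComparisons(n), mergeElementsMoved(n))
```

The first count grows like n^2 and the second like n log n, so the gap between the two sorts widens quickly as the list grows.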