Sorting
Let’s quickly review binarySearch before moving on to sorting. Recall that binary search works on a sorted list.
def binarySearch(aList, item):
    """Assume aList is sorted. If item is
    in aList, return True; else return False."""
    n = len(aList)
    mid = n // 2
    # base case 1
    if n == 0:
        return False
    # base case 2
    elif item == aList[mid]:
        return True
    # recurse on left
    elif item < aList[mid]:
        return binarySearch(aList[:mid], item)
    # recurse on right
    else:
        return binarySearch(aList[mid + 1:], item)
Although the above approach works, it is not actually O(log n)! The problem is that list slicing is an O(n) operation: each slice such as aList[:mid] copies its elements into a new list. To write a truly logarithmic binary search, we have to pass index values through the recursion rather than creating list copies by slicing.
def binarySearchBetter(aList, item, indexStart, indexEnd):
    """Assume aList is sorted. If item is in aList between
    indexStart and indexEnd (both inclusive), return True;
    else return False."""
    # base case 1: empty range (also covers an empty list)
    if indexStart > indexEnd:
        return False
    mid = (indexStart + indexEnd) // 2
    # base case 2
    if item == aList[mid]:
        return True
    # recurse on left (keep the same start index, drop mid)
    elif item < aList[mid]:
        return binarySearchBetter(aList, item, indexStart, mid - 1)
    # recurse on right
    else:
        return binarySearchBetter(aList, item, mid + 1, indexEnd)
# quick test to make sure it works
myList = ['a', 'e', 'i', 'o', 'u', 'z']
binarySearch(myList, 'z')
True
# quick test to make sure it works
myList = ['a', 'e', 'i', 'o', 'u', 'z']
binarySearchBetter(myList, 'z', 0, len(myList)-1)
True
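The same index-based idea can also be written as a loop instead of a recursion. The following is a minimal sketch and not part of the original notes; the name binarySearchIterative and the variables lo and hi are chosen here purely for illustration.

```python
def binarySearchIterative(aList, item):
    """Assume aList is sorted. Return True if item is in aList, else False.
    Hypothetical loop-based variant; lo and hi are inclusive bounds."""
    lo, hi = 0, len(aList) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if item == aList[mid]:
            return True
        elif item < aList[mid]:
            hi = mid - 1   # discard the right half, including mid
        else:
            lo = mid + 1   # discard the left half, including mid
    return False

vowels = ['a', 'e', 'i', 'o', 'u', 'z']
print(binarySearchIterative(vowels, 'z'))   # True
print(binarySearchIterative(vowels, 'b'))   # False
print(binarySearchIterative([], 'a'))       # False
```

Because only two integers change on each pass, no list copies are made and the caller does not need to supply start and end indices.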
# let's make a big list
# we'll include each word twice just to make the list bigger
myList = []
with open("prideandprejudice.txt") as f:
    for line in f:
        myList.extend(line.strip().split())
        myList.extend(line.strip().split())
myList.sort()
len(myList)
244178
import time
start_time = time.time()
print(binarySearch(myList, "cat"))
print((time.time() - start_time), "seconds")
False
0.004509925842285156 seconds
import time
start_time = time.time()
print(binarySearchBetter(myList, "cat", 0, len(myList)-1))
print((time.time() - start_time), "seconds")
False
0.000392913818359375 seconds
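One way to see where the extra time goes is to count how many elements the slicing version copies. The sketch below is an illustration only: countCopied is a hypothetical helper that simulates the worst case for binarySearch, where the item is larger than everything in the list, so every step takes the slice aList[mid + 1:].

```python
def countCopied(n):
    """Return the total number of list elements copied by slicing when the
    slicing version of binary search always recurses on the right half
    of a sorted list of length n (hypothetical helper for illustration)."""
    copied = 0
    while n > 0:
        mid = n // 2
        n = n - (mid + 1)   # length of the slice aList[mid + 1:]
        copied += n         # slicing copies every element of that slice
    return copied

for size in [1000, 10000, 100000]:
    print(size, countCopied(size))
```

The total stays just under n, since the slices have lengths of about n/2, n/4, n/8, and so on. The index-based version makes the same number of recursive steps but copies nothing.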
Selection Sort
Binary search is more efficient than linear search, but it also requires that our list be sorted in advance. Sorting is a computationally expensive operation. Today we will explore a few sorting algorithms.
A possible approach to sort:
Find the smallest element and move it to the first position
Repeat: find the second-smallest element and move it to the second position, and so on.
This algorithm is called selection sort.
def selectionSort(myList):
    """Selection sort of given list myList,
    mutates list and sorts using selection sort."""
    # find size
    n = len(myList)
    # traverse through all elements
    for i in range(n):
        # find min element in remaining unsorted list
        minIndex = i
        for j in range(i + 1, n):
            if myList[minIndex] > myList[j]:
                minIndex = j
        # swap min element with element at i
        myList[i], myList[minIndex] = myList[minIndex], myList[i]
myList = [12, 2, 9, 4, 11, 3, 1, 7, 14, 5, 13]
selectionSort(myList)
myList
[1, 2, 3, 4, 5, 7, 9, 11, 12, 13, 14]
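To get a feel for how much work selection sort does, we can count its comparisons. This is a sketch under the assumption that comparisons are a fair measure of cost; selectionSortCount is a hypothetical instrumented copy of the function above, not part of the original notes.

```python
def selectionSortCount(myList):
    """Sort myList in place with selection sort and return the number of
    element comparisons made (hypothetical instrumented variant)."""
    n = len(myList)
    comparisons = 0
    for i in range(n):
        minIndex = i
        for j in range(i + 1, n):
            comparisons += 1   # one comparison per inner-loop step
            if myList[minIndex] > myList[j]:
                minIndex = j
        myList[i], myList[minIndex] = myList[minIndex], myList[i]
    return comparisons

nums = [12, 2, 9, 4, 11, 3, 1, 7, 14, 5, 13]
print(selectionSortCount(nums))   # 55, which is 11 * 10 // 2
print(nums)                       # [1, 2, 3, 4, 5, 7, 9, 11, 12, 13, 14]
```

The count is always n(n-1)/2 regardless of the input order, because the inner loop runs in full on every pass. That is why selection sort is O(n^2).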
Extra Slides Material: MergeSort
Mergesort is another way to sort that is more efficient, but also more complicated. To get started, let’s write a helper function, merge, that takes two sorted lists, iteratively merges them into a single sorted list, and returns it.
def merge(a, b):
    """Merges two sorted lists a and b,
    and returns new merged list c"""
    # initialize variables
    i, j = 0, 0
    lenA, lenB = len(a), len(b)
    c = []
    # traverse and populate new list
    while i < lenA and j < lenB:
        if a[i] <= b[j]:
            c.append(a[i])
            i += 1
        else:
            c.append(b[j])
            j += 1
    # handle remaining values (at most one list has any left)
    if i < lenA:
        c.extend(a[i:])
    elif j < lenB:
        c.extend(b[j:])
    return c
merge([3, 12, 43], [])
[3, 12, 43]
merge([], [0, 2, 12])
[0, 2, 12]
merge(['a', 'd', 'f'], ['b', 'c', 'e'])
['a', 'b', 'c', 'd', 'e', 'f']
evens = [i for i in range(20) if i % 2 == 0]
sqs = [i*i for i in range(1, 8)]
merge(evens, sqs)
[0, 1, 2, 4, 4, 6, 8, 9, 10, 12, 14, 16, 16, 18, 25, 36, 49]
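As a cross-check, Python's standard library has heapq.merge, which also merges already-sorted inputs, though it returns a lazy iterator rather than a list. The sketch below simply rebuilds the two lists from the last example and merges them that way; it is offered as a sanity check, not as a replacement for writing merge ourselves.

```python
import heapq

evens = [i for i in range(20) if i % 2 == 0]
sqs = [i * i for i in range(1, 8)]

# heapq.merge yields items one at a time, so wrap it in list() to see them all
merged = list(heapq.merge(evens, sqs))
print(merged)
# [0, 1, 2, 4, 4, 6, 8, 9, 10, 12, 14, 16, 16, 18, 25, 36, 49]
```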
Using our helper function, we can implement the recursive mergeSort algorithm that uses merge() in the final merge step.
def mergeSort(L):
    """Given a list L, returns
    a new list that is L sorted
    in ascending order."""
    n = len(L)
    # base case
    if n == 0 or n == 1:
        return L
    else:
        m = n//2 # middle
        # recurse on left & right half
        sortLt = mergeSort(L[:m])
        sortRt = mergeSort(L[m:])
        # return merged list
        return merge(sortLt, sortRt)
mergeSort([12, 2, 9, 4, 11, 3, 1, 7, 14, 5, 13])
[1, 2, 3, 4, 5, 7, 9, 11, 12, 13, 14]
mergeSort(['hello', 'world', 'aloha', 'earth'])
['aloha', 'earth', 'hello', 'world']
mergeSort(['e', 'p', 'o', 'c', 'h'])
['c', 'e', 'h', 'o', 'p']
mergeSort(list('We hate Covid-19'))
[' ', ' ', '-', '1', '9', 'C', 'W', 'a', 'd', 'e', 'e', 'h', 'i', 'o', 't', 'v']
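To see why merge sort scales better, we can count its comparisons as well. This is a rough sketch, assuming comparisons are a reasonable stand-in for running time; mergeSortCount is a hypothetical instrumented variant that folds the merge step into the function, and the random test data is made up for illustration.

```python
import random

def mergeSortCount(L):
    """Return (sortedList, comparisons) for list L. Hypothetical
    instrumented variant of merge sort that also counts comparisons."""
    n = len(L)
    if n <= 1:
        return L[:], 0
    m = n // 2
    left, cLeft = mergeSortCount(L[:m])
    right, cRight = mergeSortCount(L[m:])
    merged, i, j = [], 0, 0
    comparisons = cLeft + cRight
    # merge step: one comparison per element moved while both halves remain
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons

random.seed(0)  # fixed seed so the run is repeatable
data = [random.randint(0, 10**6) for _ in range(1024)]
result, comparisons = mergeSortCount(data)
print(result == sorted(data))
print(comparisons, "comparisons for n =", len(data))
```

For n = 1024 the count cannot exceed n * log2(n) = 10,240, whereas the nested loops of selectionSort always perform n(n-1)/2 = 523,776 comparisons on a list of that size.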
Merge Sort vs Selection Sort
Why do we need a better sorting algorithm? As the list we are sorting grows large, the Big-O bound matters: selection sort is O(n^2), while merge sort is O(n log n). Let’s compare the runtime of both sorting algorithms on large lists.
wordList = []
with open('prideandprejudice.txt') as book:
    for line in book:
        line = line.strip().split()
        wordList.extend(line)
print(len(wordList))
122089
miniList = wordList[:500]
medList = wordList[:7000]
import time
def timedSorting(wordList):
    """Measures runtime for sorting wordList"""
    # selectionSort sorts in place and returns None, so give it a copy;
    # that way mergeSort below still receives the unsorted input
    listCopy = wordList[:]
    start = time.time()
    selectionSort(listCopy)
    end = time.time()
    print("Selection sort takes {} secs".format(end - start))
    start = time.time()
    sortedWordList = mergeSort(wordList)
    end = time.time()
    print("Merge sort takes {} secs".format(end - start))
timedSorting(miniList)
Selection sort takes 0.008807897567749023 secs
Merge sort takes 0.0008759498596191406 secs
# timedSorting(medList)
# timedSorting(wordList)
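If the book file is not at hand, the growth rates can still be observed on random data. The sketch below is an assumption-laden stand-in for the experiment above: it redefines compact versions of both sorts so it runs on its own, uses random floats instead of words, and introduces a hypothetical helper timeCall that times a sort on a private copy of the input.

```python
import random
import time

def selectionSort(myList):
    """In-place selection sort, same algorithm as above."""
    n = len(myList)
    for i in range(n):
        minIndex = i
        for j in range(i + 1, n):
            if myList[minIndex] > myList[j]:
                minIndex = j
        myList[i], myList[minIndex] = myList[minIndex], myList[i]

def merge(a, b):
    """Merge two sorted lists into a new sorted list."""
    i, j, c = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            c.append(a[i])
            i += 1
        else:
            c.append(b[j])
            j += 1
    c.extend(a[i:])
    c.extend(b[j:])
    return c

def mergeSort(L):
    """Return a sorted version of L using merge sort."""
    if len(L) <= 1:
        return L
    m = len(L) // 2
    return merge(mergeSort(L[:m]), mergeSort(L[m:]))

def timeCall(sortFunction, data):
    """Return the seconds sortFunction takes on a private copy of data
    (hypothetical helper; the caller's list is left untouched)."""
    copy = list(data)
    start = time.perf_counter()
    sortFunction(copy)
    return time.perf_counter() - start

random.seed(1)  # fixed seed so every run sorts the same data
for size in [250, 500, 1000, 2000]:
    data = [random.random() for _ in range(size)]
    print(size, timeCall(selectionSort, data), timeCall(mergeSort, data))
```

Exact timings depend on the machine, but each doubling of the size should roughly quadruple the selection sort time while only slightly more than doubling the merge sort time.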