Filters
Bloom Filters
Bloom filters are tuned based on three parameters:
- m - the size of the bit array
- k - the number of hash functions
- N - the number of elements in the set
A Bloom filter never produces false negatives, but for a given N, m and k can be chosen to minimize the false positive rate.
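As a rough sketch of how m and k trade off for a given N: under the usual assumption that the hash functions behave independently and uniformly, the false positive rate is approximately (1 - e^(-kN/m))^k, and it is minimized near k = (m/N) ln 2. The function names below are illustrative only, and the Python parameter n stands for N.

```python
import math


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate false positive rate of an m-bit filter with k hash
    functions after n insertions (assumes independent, uniform hashes)."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> int:
    """Number of hash functions that approximately minimizes the rate:
    k = (m / n) * ln 2, rounded to the nearest integer and at least 1."""
    return max(1, round((m / n) * math.log(2)))


if __name__ == "__main__":
    m, n = 10_000, 1_000          # 10 bits per element
    k = optimal_k(m, n)
    print(k, false_positive_rate(m, k, n))
```

With 10 bits per element this suggests about 7 hash functions and a false positive rate a little under 1%.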
Operations
To insert into a Bloom filter, you hash the element using each of the k independent hash functions. You then set the corresponding bits in the m-bit array.
To query a Bloom filter, you hash the element using each of the k independent hash functions. You then check whether the corresponding bits in the m-bit array are set. The Bloom filter returns:
- definitely not in the set if any of the bits are 0
- possibly in the set if all of the bits are 1
The answer may be a false positive if some combination of previously inserted elements happened to set all k of the bits that the queried element hashes to.
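The insert and query steps above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the class and method names are hypothetical, and the k hash functions are simulated by salting SHA-256 with the index i, which is an assumption standing in for truly independent hash functions.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter over an m-bit array with k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, rounded up to whole bytes

    def _positions(self, item: bytes) -> Iterator[int]:
        # Assumption: salting SHA-256 with i approximates k independent hashes.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "little") + item).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def insert(self, item: bytes) -> None:
        # Set the bit selected by each of the k hash functions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False: definitely not in the set. True: possibly in the set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    bf = BloomFilter(m=1024, k=3)
    bf.insert(b"apple")
    print(bf.might_contain(b"apple"))   # True: inserted elements are never missed
    print(bf.might_contain(b"banana"))  # usually False, but may be a false positive
```

Note that a query touches k unrelated positions in the bit array, which is worth keeping in mind for the cache-behavior questions below.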
Questions
- What types of applications can benefit from using Bloom filters?
- What types of applications could never use a Bloom filter?
- How good/bad is the cache behavior of a standard Bloom filter?
- Can you think of any ways to improve cache locality?
- For each of the following operations, decide whether a standard Bloom filter could support it, and why or why not (or under what conditions it is possible):
- combining two Bloom filters
- resizing a Bloom filter
- deleting an element
- Can you think of any way to modify the standard Bloom filter to support deletes?
- you may consider storing auxiliary data in the bit array, but keep in mind that you want to minimize the memory footprint
- Which data structure do you think is the easiest to implement:
- Bloom filter
- Quotient filter
- Cuckoo filter
What types of things seem most challenging to get right? Do those things affect performance or correctness?