Meeting 14 :: Filters

Filters

Bloom Filters

Bloom filters are tuned based on three parameters:

  • m - the size of the bit array
  • k - the number of hash functions
  • N - the number of elements in the set

A Bloom filter never returns a false negative, but it can return false positives. For a given N, m and k can be chosen to minimize the false positive rate.
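As a rough illustration of how m and k interact, the sketch below uses the standard approximation p ≈ (1 - e^(-kN/m))^k for the false positive rate, which assumes the hash functions are uniform and independent. Under that approximation the rate is minimized near k = (m/N) ln 2. The function names are hypothetical and chosen only for this example.

```python
import math


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate false positive rate of an m-bit filter with k hash
    functions after n insertions: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> int:
    """Number of hash functions near the minimum of the approximation.

    The real-valued optimum is (m/n) * ln 2; rounding to the nearest
    integer (and using at least 1) is a simplification for this sketch.
    """
    return max(1, round((m / n) * math.log(2)))


if __name__ == "__main__":
    m, n = 10_000, 1_000          # 10 bits per element
    k = optimal_k(m, n)
    print(k, false_positive_rate(m, k, n))
```

With 10 bits per element this picks k = 7 and gives a false positive rate a little under 1%. Using fewer or more hash functions with the same m and N makes the approximate rate worse.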

Operations

To insert into a Bloom filter, you hash the element using each of the k independent hash functions. You then set the corresponding bits in the m-bit array.

To query a Bloom filter, you hash the element using each of the k independent hash functions. You then check whether the corresponding bits in the m-bit array are set. The Bloom filter returns

  • definitely not in the set if any of the bits are 0
  • possibly in the set if all of the bits are 1

A query can return a false positive when previously inserted elements have, between them, already set all k of the bits that the queried element hashes to.
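The insert and query steps described above can be sketched as follows. This is a minimal sketch, not a tuned implementation: the class and method names are hypothetical, and the k hash functions are approximated by salting BLAKE2b with the index of each function, which is an assumption of this example rather than something the notes prescribe.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter over byte strings (illustrative only)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                            # number of bits in the array
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # m bits, all initially 0

    def _positions(self, item: bytes) -> Iterator[int]:
        # Assumption: a different salt per index stands in for
        # k independent hash functions.
        for i in range(self.k):
            digest = hashlib.blake2b(
                item, digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def insert(self, item: bytes) -> None:
        # Set the bit chosen by each of the k hash functions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely not in the set";
        # True means "possibly in the set".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    bf = BloomFilter(m=1024, k=3)
    bf.insert(b"apple")
    print(bf.might_contain(b"apple"))   # an inserted element is always reported
```

An inserted element always tests as possibly present, because its k bits are never cleared. An element that was never inserted usually tests as absent, but may test as present if other insertions happen to have set all of its bits.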

 

Questions

  1. What types of applications can benefit from using Bloom filters?
  2. What types of applications could never use a Bloom filter?
  3. How good/bad is the cache behavior of a standard Bloom filter?
       • Can you think of any ways to improve cache locality?
  4. For each of the following operations, decide whether a standard Bloom filter could support the operation, and why/why not (or under what conditions it is possible):
       • combining two Bloom filters
       • resizing a Bloom filter
       • deleting an element
  5. Can you think of any way to modify the standard Bloom filter to support deletes?
       • You may consider storing auxiliary data in the bit array, but keep in mind that you want to minimize the memory footprint.
  6. Which data structure do you think is the easiest to implement:
       • Bloom filter
       • Quotient filter
       • Cuckoo filter

What types of things seem most challenging to get right? Do those things affect performance or correctness?