Bloom filter in JavaScript

This article assumes basic familiarty with bit vectors. If you’re unsure how they work or need a refresher, check out the previous article about bit vectors which goes in depth both in explaining how they work, and how to implement one.

A Bloom filter is a simple yet powerful data structure. It allows you to answer the question have you seen this before? in a very fast and memory efficient way. Unlike a regular set, a Bloom filter only needs to store a few bits per item instead of storing the whole thing. The trick is, a Bloom filter will be able to tell you if something is not present in the set with 100% certainty, but if you ask it if something is present in the set, you might get a false positive. That means the response could be true, even if the item was never stored in the set.

To explain things, let’s first do a simple example. Consider we’re taking random numbers as input and checking if we’ve already seen a given number. We could use an array and store the numbers and check its contents whenever needed.

var arr = [];

// inserting a few elements in the array
arr.push(3);
arr.push(5);

// and checking for presence
console.log(arr.indexOf(3) !== -1)
console.log(arr.indexOf(4) !== -1)

arr.indexOf(...) returns the index of an element in the array if found, and -1 if the element was not present in the array.

This approach works, but it has a few issues. Firstly, indexOf needs to traverse the entire array to see if an element is present (\(O(n)\) time complexity), which means the bigger the array, the longer it will take. And second, we’re also using extra memory for each element we add to the array. This might seem dumb at first, considering we haven’t gotten to the Bloom filter part, but read on to see how we can save a lot of memory with a probabilistic approach. But first, let’s try fixing the lookup time by switching to a hash table. We could use JavaScript’s builtin objects to store a truthy value.

var table = {};

table[3] = true;
table[5] = true;

console.log(table[3]);
console.log(table[4]);

This helps us with the lookup time to an average constant time \(O(1)\). The thing is, if we were to count a lot of numbers, it would take up a lot of memory. Considering we’re only storing numbers here, you might be thinking that we could use a bit vector, which would only take 1 bit per number. If we were to store something like IPv4 addresses encoded as 32-bit integers, we would need to allocate a 128MB bit vector. That might still be feasible, as the set of possible values is still only 32 bits. Increasing this to IPv6 (which are 128-bit integers), the bit vector would be way too big (rougly \(10^{36}\) bytes).

Here comes the interesting part though, what if we only intend to store a portion of the numbers. If it was small enough, we could simply store them in an array. If it was bigger, we’d probably use a hash table or ideally a set data structure. But what if we need to store lots of them?

Storing 5 million IPv6 addresses (128bit numbers) would take up around 76MB of memory. That might not seem like a huge deal if all you’re doing is managing which IPv6 address you’ve seen before. But what if this is part of a side calculation that isn’t particularly important? Or what if you need to store even more? The storage requirements grow linearly together with each address. At 50 million we’d be closing in 1GB of memory.

Using a probabilistic approach to save memory

If we’re willing to relax our requirements on the data structure, we can save quite a bit of memory. The original bit vector approach is great in terms of memory, but fails when the set of possible values is too large, as we need to pre-allocate a slot for each possible value. But what if we allowed multiple values to share the same slot?

We can use a simple hash function to calculate the position in the bit vector, which allows us to have a bit vector that is smaller than the set of possible values. If we have 100 possible values, but only 10 bits, we can use a modulo 10 hash function to set the proper bit to true.

function hash(num) {
    return num % 10;
}

This approach does decrease the amount of memory required by 10x, but it’s also easy to see that collisions in the hash function can occur rather easily. For example 1, 11, 21, 31, … all have the same hash value. At first this seems like our data structure is completely broken and can’t tell us anything useful, but something interesting happens here.

When we query the data structure, there are two cases that can happen:

  1. The queried bit is true, which means the number we used to calculate the hash might have been used to set it, or it could have been a different number with the same hash.
  2. The queried bit is false, which menas the number was never present in the set! There is no way a different number could be used to set any bit to false.

There is a big difference between cases 1) and 2). While the first case gives a probabilistic response with a possible false positive (getting true while the answer should be false), the second case is always 100% correct. If we get a false, the element definitely isn’t in the set!

Before we move on to how an actual bloom filter works, let’s see a few examples where a data structure that can return false positives but never returns false negatives can be useful:

  • Tracking which pages/articles/profiles/websites a user has seen and which he hasn’t. This way you can recommend things they definitely have not seen.
  • A simple web crawler where you want to avoid visiting the same pages again and again. You will 100% recognize if the crawler has not seen a particular page, with a possibility of skipping a few pages due to a false positive.
  • Mapping tags/keywords to large data files/database where you might want to avoid a possibly expensive search if the keyword does not match the contents.
  • Any case where you want a set which should not be enumerable. For example, when storing privacy sensitive data, using the approach mentioned above doesn’t anyone allow to figure out which specific elements are in the set, as there are many possibilities for each slot.

Bloom filter

In our previous example, the probability of a collision (two elements sharing the same hash) is both very high, and using such a simple hash function yields predictable collisions, which is something we’d want to avoid. The reason is, the number of collisions could be very high for a certain set of items, while being low for a different set, leading to uneven distribution of false positives among the set of possible sets of items.

First, it is important to note that the above shown hash function is not a good example of a hash function. I won’t go into the details of constructing a hash function in this article, as it’s a rather involved topic and I can’t really think of any cases where you’d need to come up with your own. For now, let’s simply consider we have three different hash functions called h1, h2 and h3.

To create a bloom filter, we simply need a bit vector and k different hash functions, and use them to calculate multiple bit indexes for each element. That way if two elements have a collision using one hash function, they don’t necessarily have a collision with the other ones, as each of the hash functions is different and returns a different index.

// Here we assume that `bitvector` is an N-bit long bit vector. See references at the end of this
// article if you're unsure on how it works.
function insert(bitvector, value) {
    // We simply set all the relevant bits to `1`
    bitvector.set(h1(value), 1);
    bitvector.set(h2(value), 1);
    bitvector.set(h3(value), 1);
}

function isMember(bitvector, value) {
    var bit1 = bitvector.get(h1(value));
    var bit2 = bitvector.get(h2(value));
    var bit3 = bitvector.get(h3(value));

    // An element is in the set only if all of the relevant bits are `1`
    return bit1 && bit2 && bit3;
}

Now the question is, does having multiple hash functions reduce the probability of a false positive? Sure the probability that three hash functions collide all at once is lower than the probability of just one hash function colliding, but we’re also setting three times as many bits.

If you’re not interested in the math behind calculating the probabilites of collisions, feel free to skip to the next section for a high level overview.

If we have a bit vector with N bits, the probability that a hash function selects a specific bit is \(\frac{1}{N}\). This means, that if we do a single insert with a single hash function, the probability that a specific bit is kept at 0 is \(1 - \frac{1}{N}\). If we use k different hash functions, a single bit is kept at 0 if none of the hash functions choose it, which means the above test has to pass for all of them, which means the probability is \((1 - \frac{1}{N})^k\).

If we start inserting multiple elements, the probability that a particular bit is 0 after n elements have been inserted is again calculated by looking at this as passing the above test for each of the inserted elements (done n times), so the probability is \(((1 - \frac{1}{N})^k)^n = (1 - \frac{1}{N})^{kn}\).

The complement of this is \(1 - (1 - \frac{1}{N})^{kn}\), which is the probability that a specific bit is 1 after having inserted n elements. We get a false positive if and only if all of the hash functions select indexes which are all 1, which means the above test needs to pass for all of them, giving us the final probability:

\((1 - (1 - \frac{1}{N})^{kn})^k \approx (1 - e^{-\frac{kn}{N}})^k\)

Using a bit of analysis, we can find that the minimum is at \(\frac{kn}{N} = \ln 2\), which gives us a useful relationship \(k = \ln 2 \cdot \frac{N}{n}\).

Choosing the right constants and hash functions

When we want to use a Bloom filter, we don’t really start from the number of hash functions. We care about the probability of false positives and how many items we are approximately going to store in the filter. If we know those, it’s rather simple to calculate the number of bits and hash functions needed. There is actually a nice Bloom filter calculator which does exactly this, you put in a number of expected elements, your target probability for false positives, and the calculator will tell you the optimal number of hash functions, and how many bits the backing bit vector should have. I really recommend trying it out with a few different settings for p and n and see how the size of the Bloom filter changes to get a feel for how much memory you might need if you used one.

Now all that is left to do is figure out where to get those k different hash functions. A simple solution suggested by this paper suggests how to create k hash functions from just two different ones (let’s call them f and g). We calculate the i-th hash as h(i) = (f(x) + i * g(x)) % N where N is the number of bits in our bit vector.

An important point to note here is that we only need the hash function to be uniform, we don’t need a cryptographically secure hash function such as SHA-2. We can use a simpler and faster hash function, such as FNV (here and here), which simply does a few bitwise operations and multiplications. There are actually two variants, FNV-1 and FNV-1a, which are almost the same, except for the order of one particular operation. This means we can use these two hash functions as a basis for the trick mentioned above. You can take a look at the bloomfilter.js library which has an efficient implementation of both the Bloom filter, and of both variants of the FNV hash function.

References


Share