Hi people,

in this post I’ll present part of my master degree field study: probabilistic data structures or just sketches.

Introduction to sketches

Every programmer well know data structures like List, Stack, Trees and others. This type of data strucutures are called abstract data structures becouse them can store any kind of information and solve many kind of problems in general.

Sketches comes to solve special needs where time and space for computing is very restrict for amount of data that you must deal. For example, in my studies, I need make real-time computation in network devices to detect not desired behaviors in my network but switches and routers have limited resources to compute and store data.

Bloom Filter

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either “possibly in set” or “definitely not in set”. Elements can be added to the set, but not removed (though this can be addressed with a “counting” filter); the more elements that are added to the set, the larger the probability of false positives.

From Wikipedia.

How it works

  • Input data: google.com, twitter.com, instagram.com, apple.com, amazon.com, ebay.com, aliexmpress.com and register.com.
  • Hash functions: h1, h2 and h3.
  • Bloom filter array with 16 positions called M.

Now, iterating through all input data we’re going to perform this operation:

1
2
3
4
M[h1('google.com')] = 1
M[h2('google.com')] = 1
...
M[h3('register.com')] = 1

Suppose this result:

Checking if ‘amazon.com’ is presented at Bloom Filter?

1
2
3
h1('amazon.com') = 4
h2('amazon.com') = 9
h3('amazon.com') = 13

Looking M[4], M[9] and M[13] all are setted 1, so ‘amazon.com’ is presented at Bloom Filter.

Checking if ‘facebook.com’ is presented at Bloom Filter?

1
2
3
h1('facebook.com') = 0
h2('facebook.com') = 13
h3('facebook.com') = 15

Only M[13] in this case is 1, so ‘facebook.com’ isn’t presented at Bloom Filter.

In Python

To see Bloom Filter in practice I’m going to use PyProbables:

pyprobables is a pure-python library for probabilistic data structures. The goal is to provide the developer with a pure-python implementation of common probabilistic data-structures to use in their work.

Let’s try this sample code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
from probables import BloomFilter


# Init Bloom Filter estimated to store 8 items with 5% false positive rate
blm = BloomFilter(est_elements=8, false_positive_rate=0.05)

# Add some items
blm.add('google.com')
blm.add('twitter.com')
blm.add('instagram.com')
blm.add('apple.com')
blm.add('amazon.com')
blm.add('ebay.com')
blm.add('aliexpress.com')
blm.add('register.com')

# Checking
print('twitter.com is present?', blm.check('twitter.com'))
print('facebook.com is present?', blm.check('facebook.com'))
print('google.com is present?', blm.check('google.com'))

Output:

1
2
3
twitter.com is present? True
facebook.com is present? False
google.com is present? True

Home work:

  1. Look current_false_positive_rate function; and
  2. Look bloom attribute.

Bloom Filter at Open Source