Hi people,

in this post I’ll present part of my master degree field study: probabilistic data structures or just sketches. The latest post I describe what sketches means if you’re missed.

Count-min

In computing, the count–min sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear space, at the expense of overcounting some events due to collisions. The count–min sketch was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan and described by them in a 2005 paper.

Count–min sketches are essentially the same data structure as the counting Bloom filters introduced in 1998 by Fan et al. However, they are used differently and therefore sized differently: a count-min sketch typically has a sublinear number of cells, related to the desired approximation quality of the sketch, while a counting Bloom filter is more typically sized to match the number of elements in the set.

From Wikipedia.

How it works

  • Input data: google.com (1), twitter.com (1), instagram.com (4), apple.com (10), amazon.com (5), ebay.com (3), aliexmpress.com (12) and register.com (2).
  • Hash functions: h1, h2 and h3.
  • Count-min matrix with 10 positions by hash function called M.

Now, iterating through all input data we’re going to perform this operation:

1
2
3
4
M[0][h1('google.com')] += 1
M[1][('google.com')] += 1
...
M[2][('register.com')] += 1

Suppose this result:

Checking ‘amazon.com’ frequency at Count-min?

1
2
3
h1('amazon.com') = 3
h2('amazon.com') = 4
h3('amazon.com') = 7

Getting M[0][3], M[1][4] and M[2][7] values we obtains this result: 5, 8, 6.

Now, the answer for frequency of ‘amazon.com’ is probably 5, minimal value, because the others maybe influenced by hash collision.

Checking if ‘facebook.com’ is presented at Count-min?

1
2
3
h1('facebook.com') = 9
h2('facebook.com') = 6
h3('facebook.com') = 4

Only M[2][4] in this case is different of zero, so ‘facebook.com’ isn’t presented at Count-min.

In Python

To see Count-min in practice I’m going to use PyProbables:

pyprobables is a pure-python library for probabilistic data structures. The goal is to provide the developer with a pure-python implementation of common probabilistic data-structures to use in their work.

Let’s try this sample code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
from probables import CountMinSketch


# Init Count-min estimated with 50 positions and 5 hash functions
cms = CountMinSketch(width=50, depth=5)

# Add some items
cms.add('google.com')
cms.add('twitter.com',1)
cms.add('instagram.com',4)
cms.add('apple.com',10)
cms.add('amazon.com',5)
cms.add('ebay.com',3)
cms.add('aliexpress.com',12)
cms.add('register.com',2)

# Checking
print('twitter.com frequence:', cms.check('twitter.com'))
print('facebook.com frequence:', cms.check('facebook.com'))
print('amazon.com frequence:', cms.check('amazon.com'))

print(cms)

Output:

1
2
3
4
5
6
7
8
9
twitter.com frequence: 1
facebook.com frequence: 0
amazon.com frequence: 5
Count-Min Sketch:
	Width: 50
	Depth: 5
	Confidence: 0.96875
	Error Rate: 0.04
	Elements Added: 38

Home work:

  1. Look _bins attribute; and
  2. How to get highest frequency element.

Count-min at Open Source