The Count-Min sketch is a simple technique to summarize large amounts of frequency data. Count-min Sketch is a probabilistic data structure that uses fixed memory.
The Count-Min Sketch is a data structure that is used to summarize data streams. It stores information about how often an item occurs in the data without storing all the data from the data stream and helps with answering questions like ”What items have appeared more than k times in this data stream?”
You can think of Count-Min Sketch as a two-dimensional array. We define width of this array and its height. Width is usually in thousands, while height is small and represents a number of hash functions, for example 5.
How it works:
Count-min Sketch uses a 2-dimensional array.
When a new element comes we calculate multiple hash functions.
Then increment by 1 on all such positions returned by hash functions.
While retrieving data we take the minimum of all values at positions by hash functions. As this data structure is probabilistic, the values returned are approximate. Accuracy and probability can be adjusted by fixing the height and width 2-D array. The greater the height and width the lesser the collision and the higher the accuracy.
Issue with Count-min sketch and its solution:
What if one or more elements got the same hash values and then they all incremented. So, in that case, the value would have been increased because of the hash collision.
Thus sometimes (in very rare cases) Count-min sketch overcounts the frequencies because of the hash functions. So the more hash function we take there will be less collision. The fewer hash functions we take there will be the high probability of collision. Hence it is always recommended to take more hash functions. The Collison is shown in the above diagram.
Properties of Count-Min Sketch
What’s great about the Count-Min Sketch’s unusual structure is that we do not need to allocate new memory when we add a new event to the set. We increment some counters, but our space allocated is constant. Even if we add a hundred, a thousand, or a million new events, our program should run in the same amount of time with the same amount of memory. In other words, Count-Min Sketch is O(1) in both time and space.