Modern, high performance bloom filter in Python? [closed]

I'm looking for a production-quality Bloom filter implementation in Python to handle a fairly large number of items (say, 100M to 1B items with a 0.01% false-positive rate).

Pybloom is one option, but it seems to be showing its age: it regularly throws DeprecationWarning errors on Python 2.5. Joe Gregorio also has an implementation.

My requirements are fast lookup performance and stability. I'm also open to creating Python interfaces to particularly good C/C++ implementations, or even to using Jython if there's a good Java implementation.

Failing that, any recommendations on a bit-array / bit-vector representation that can handle ~16E9 bits?
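
For scale, here's my rough back-of-the-envelope sizing using the standard Bloom filter formulas (my own arithmetic, so double-check it):

import math

def bloom_parameters(n, p):
    # optimal number of bits m and hash functions k for n items at false-positive rate p
    m = int(math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
    k = int(round((float(m) / n) * math.log(2)))
    return m, k

# 1e9 items at p = 0.0001 (0.01%) works out to roughly 19.2 billion bits (~2.2 GiB) and k = 13,
# i.e. the same order of magnitude as the ~16E9 bits mentioned above
print(bloom_parameters(10**9, 0.0001))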


Asked by: Anna442 | Posted: 28-01-2022






Answer 1

I recently went down this path as well, though it sounds like my application was slightly different: I was interested in approximating set operations on a large number of strings.

You make the key observation that a fast bit vector is required. Depending on what you want to put in your Bloom filter, you may also need to give some thought to the speed of the hashing algorithm(s) used. You might find this library useful. You may also want to tinker with the random-number technique used below, which hashes your key only a single time.

In terms of non-Java bit array implementations:

I built my Bloom filter using BitVector. I spent some time profiling and optimizing the library and contributing my patches back to Avi; go to that BitVector link and scroll down to the acknowledgments in v1.5 for the details. In the end, I realized that performance was not a goal of that project and decided against using it.

Here's some code I had lying around. I may put this up on Google Code as python-bloom. Suggestions welcome.

from BitVector import BitVector
from random import Random
# get hashes from http://www.partow.net/programming/hashfunctions/index.html
from hashes import RSHash, JSHash, PJWHash, ELFHash, DJBHash


#
# ryan.a.cox@gmail.com / www.asciiarmor.com
#
# copyright (c) 2008, ryan cox
# all rights reserved 
# BSD license: http://www.opensource.org/licenses/bsd-license.php
#

class BloomFilter(object):
    def __init__(self, n=None, m=None, k=None, p=None, bits=None):
        # n (expected item count) and p (target error rate) are accepted but unused here
        self.m = m
        if k is None or k > 4 or k < 1:
            raise Exception('Must specify a value of k between 1 and 4')
        self.k = k
        if bits:
            self.bits = bits
        else:
            self.bits = BitVector( size=m )
        self.rand = Random()
        self.hashes = []
        self.hashes.append(RSHash)
        self.hashes.append(JSHash)
        self.hashes.append(PJWHash)
        self.hashes.append(DJBHash)

        # switch between hashing techniques
        self._indexes = self._rand_indexes
        #self._indexes = self._hash_indexes

    def __contains__(self, key):
        for i in self._indexes(key): 
            if not self.bits[i]:
                return False    
        return True 

    def add(self, key):
        # returns True if every bit for this key was already set,
        # i.e. the key was (probably) seen before
        dupe = True
        for i in self._indexes(key):
            if dupe and not self.bits[i]:
                dupe = False
            self.bits[i] = 1
        return dupe

    def __and__(self, filter):
        if (self.k != filter.k) or (self.m != filter.m):
            raise Exception('Must use bloom filters created with equal k / m parameters for bitwise AND')
        return BloomFilter(m=self.m, k=self.k, bits=(self.bits & filter.bits))

    def __or__(self, filter):
        if (self.k != filter.k) or (self.m != filter.m):
            raise Exception('Must use bloom filters created with equal k / m parameters for bitwise OR')
        return BloomFilter(m=self.m, k=self.k, bits=(self.bits | filter.bits))

    def _hash_indexes(self,key):
        ret = []
        for i in range(self.k):
            ret.append(self.hashes[i](key) % self.m)
        return ret

    def _rand_indexes(self, key):
        # seed the PRNG from the key's hash and draw k indexes; note that on Python 3,
        # hash() of a str is randomized per process unless PYTHONHASHSEED is fixed,
        # so don't persist a filter built this way across runs
        self.rand.seed(hash(key))
        ret = []
        for i in range(self.k):
            ret.append(self.rand.randint(0, self.m - 1))
        return ret

if __name__ == '__main__':
    e = BloomFilter(m=100, k=4)
    e.add('one')
    e.add('two')
    e.add('three')
    e.add('four')
    e.add('five')        

    f = BloomFilter(m=100, k=4)
    f.add('three')
    f.add('four')
    f.add('five')
    f.add('six')
    f.add('seven')
    f.add('eight')
    f.add('nine')
    f.add("ten")        

    # test check for dupe on add
    assert not f.add('eleven') 
    assert f.add('eleven') 

    # test membership operations
    assert 'ten' in f 
    assert 'one' in e 
    assert 'ten' not in e 
    assert 'one' not in f         

    # test set based operations
    union = f | e
    intersection = f & e

    assert 'ten' in union
    assert 'one' in union 
    assert 'three' in intersection
    assert 'ten' not in intersection
    assert 'one' not in intersection

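As an aside on the "hash the key only once" idea mentioned earlier: another common trick (not used in the code above, so treat this as a sketch) is Kirsch-Mitzenmacher double hashing, where two base hashes generate all k indexes. Something like the following could be dropped into the BloomFilter class above, pointing self._indexes at it:

    def _double_hash_indexes(self, key):
        # derive all k indexes from two base hashes: index_i = (h1 + i * h2) mod m
        h1 = RSHash(key)
        h2 = DJBHash(key) | 1    # force an odd stride so it can never be zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]
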
Also, in my case I found it useful to have a faster count_bits function for BitVector. Drop this code into BitVector 1.5 and it should give you a more performant bit counting method:

def fast_count_bits( self, v ):
    bits = (
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
            4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 )

    return bits[v & 0xff] + bits[(v >> 8) & 0xff] + bits[(v >> 16) & 0xff] + bits[v >> 24]
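
(On Python 3.10 and later the lookup table shouldn't be needed at all; something like the following ought to behave the same, though I haven't benchmarked it against the table:)

def fast_count_bits(self, v):
    # int.bit_count() is Python 3.10+; use bin(v).count('1') on older versions
    return v.bit_count()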

Answered by: Emily624 | Posted: 01-03-2022



Answer 2

In reaction to Parand's comment that "common practice seems to be using something like SHA1 and split up the bits to form multiple hashes": while that may be true in the sense that it's common practice (PyBloom also uses it), it still doesn't mean it's the right thing to do ;-)

For a Bloom filter, the only requirement a hash function has is that its output space must be uniformly distributed given the expected input. While a cryptographic hash certainly fulfils this requirement, it's also a little bit like shooting a fly with a bazooka.

Instead, try the FNV hash, which uses just one XOR and one multiplication per input byte; I estimate it's a few hundred times faster than SHA1 :)

The FNV hash is not cryptographically secure, but you don't need it to be. It has slightly imperfect avalanche behaviour, but you're not using it for integrity checking either.

About uniformity: note that the second link only ran a chi-square test on the 32-bit FNV hash. It's better to use more bits and the FNV-1a variant, which swaps the XOR and the multiply steps for better bit dispersion. For a Bloom filter there are a few more catches, such as mapping the output uniformly onto the index range of the bit array. If possible, I'd round the size of the bit array up to the nearest power of 2 and adjust k accordingly; that way you get better accuracy and you can use simple XOR-folding to map the range.
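
To make that concrete, here's a rough sketch of what I mean. The fnv1a_64 and bloom_index helpers are my own illustration, not code from any particular library; the constants are the standard 64-bit FNV offset basis and prime:

# 64-bit FNV-1a followed by XOR-folding down to a power-of-two index range
FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3

def fnv1a_64(data):
    # expects a byte string; encode text keys first
    h = FNV64_OFFSET
    for byte in bytearray(data):
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def bloom_index(key, m_bits):
    # m_bits must be a power of two; fold the high bits onto the low bits
    index_width = m_bits.bit_length() - 1
    h = fnv1a_64(key)
    return ((h >> index_width) ^ h) & (m_bits - 1)

print(bloom_index(b'example-key', 1 << 34))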

Additionally, here's a reference explaining why you don't want SHA1 (or any cryptographic hash) when you need a general purpose hash.

Answered by: Emma972 | Posted: 01-03-2022



Answer 3

Eventually I found pybloomfiltermmap. I haven't used it, but it looks like it'd fit the bill.

Answered by: First Name722 | Posted: 01-03-2022



Answer 4

Look at the array module.

import array

class Bit(object):
    def __init__(self, size):
        # one byte holds 8 bits; round up so that `size` bits fit
        self.bits = array.array('B', [0] * ((size + 7) // 8))
    def set(self, bit):
        self.bits[bit // 8] |= 1 << (bit % 8)
    def get(self, bit):
        return (self.bits[bit // 8] >> (bit % 8)) & 1

FWIW, all of the //8 and % 8 operations can be replaced with >>3 and &0x07. This may lead to slightly better speed at the risk of some obscurity.

Also, changing 'B' and 8 to 'L' and 32 should be faster on most hardware. [Changing to 'H' and 16 might be faster on some hardware, but it's doubtful.]
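
If it helps, here's a rough sketch of that word-sized variant (my adaptation, not part of the code above; the class name is mine, and since 'L' is 32 bits on some platforms and 64 on others, the shift amount is taken from itemsize rather than hard-coded):

import array

class WordBitArray(object):
    def __init__(self, size):
        self.wordbits = 8 * array.array('L').itemsize    # 32 or 64 depending on platform
        nwords = (size + self.wordbits - 1) // self.wordbits
        self.bits = array.array('L', [0]) * nwords       # repetition avoids a huge temporary list
    def set(self, bit):
        self.bits[bit // self.wordbits] |= 1 << (bit % self.wordbits)
    def get(self, bit):
        return (self.bits[bit // self.wordbits] >> (bit % self.wordbits)) & 1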

Answered by: Daisy560 | Posted: 01-03-2022



Answer 5

It's been almost a decade since the other answers here were written, and times do change.

As of late 2019, the most popular maintained Bloom filter package looks to be this one: https://github.com/joseph-fox/python-bloomfilter, available on PyPI as pybloom_live: https://pypi.org/project/pybloom_live/
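
If memory serves, basic usage looks something like this (the capacity / error_rate parameter names come from the classic pybloom interface, so double-check the project README if it has changed):

from pybloom_live import BloomFilter

bf = BloomFilter(capacity=1000000, error_rate=0.0001)
bf.add('some key')
print('some key' in bf)      # True
print('another key' in bf)   # almost certainly False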

Answered by: Chester377 | Posted: 01-03-2022



Answer 6

I've put up a python bloom filter implementation at http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/

It's in pure Python, has good hash functions and good automated tests, offers a selection of backends (disk, array, mmap, and more), and takes more intuitive arguments to its __init__ method: you specify an ideal number of elements and a desired maximum error rate, rather than somewhat ethereal, datastructure-specific tunables.

Answered by: Aida528 | Posted: 01-03-2022


