Skip to content

Accuracy of bloom2 impacted by compilers #240

@matthewvon

Description

@matthewvon

I have ported util/bloom2.cc to a different environment. The bloom filter, with 16 bits per key, is supposed to have 0.0004 false positive rate. The filter was showing 0.0117 and 0.0519 within two different compiler environments (Debian 10 and Debian 11 respectively using default gcc packages). The bloom filter is important due to Riak's read before write characteristics.

The problem appears related to these two lines:

https://github.com/basho/leveldb/blob/develop/util/bloom2.cc#L70

https://github.com/basho/leveldb/blob/develop/util/bloom2.cc#L97

The existing code uses a mix of variable declarations: size_t, unsigned, and uint32_t. This appears to cause portions of the code to be 64 bit and portions 32 bit. Forcing all variables around these two lines to consistent uint32_t declaration was the cure.

However, just patching the current code does not address how to "fix" bad bloom filters in existing data sets. The corrected bloom filter code has the potential to make data written with the old code to seemingly disappear. As is, there might be a scenario where data also disappears due to distribution of new Riak releases that use a different compiler than before. I have not proven this, just guessing.

The solution might be to create a corrected bloom3.cc. Then the existing leveldb code can automatically pick the bloom filter that matches existing data files and use the corrected bloom3.cc bloom filters for new data and data newly passing through a normal compaction.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions