BloomFilter description and implementation sample (JAVA)

1. Bloom Filter description and example

1-1. Prerequisite use case

It's a little difficult, so I'll give you an example. It's neither a proper use case nor a reality.

  1. Create an app that allows you to enter the other party's account number on your smartphone and send money
  2. You cannot send money to a non-existent account number

In this case

  1. Query the database of the server for the account number of the other party
  2. If it does not exist, an error will be returned, and if it exists, the deposit will be processed there.

Will be.

Here, as follows If the app has a list of whether the account number exists or not There is no need to bother to inquire the database for the existence of the account number via the network.

account number
111-111-111
222-222-222
333-333-333
.........

** However, if the account number is huge, it is difficult for the smartphone app to retain that much information. ** ** ** In such a case, if you use Bloom Filter, you can have very compact information that "almost" tells you whether or not the account number exists. ** **

1-2. Mechanism

For more information, see [Wiki](https://en.wikipedia.org/wiki/%E3%83%96%E3%83%AB%E3%83%BC%E3%83%A0%E3%83%95% E3% 82% A3% E3% 83% AB% E3% 82% BF).

  1. Prepare a record of a certain length that can have bit information of 0 and 1.
  2. Apply a hash function to the account number and convert the resulting hash to a bit string of record length 01.
  3. Set a bit at the corresponding position on the record

At this time, there are a plurality of positions where 1 stands.

0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0

If this is repeated for all account numbers, the number of positions where 1 stands will increase.

If you have this record when you enter the account number in the above app ** Exactly the same hash ⇒ Perform 01 bit string conversion and match the part where 1 stands with the record ** ** If all 1's are set, it means that the probability that the account number exists is extremely high. ** **

--Upper: Record with information on all account numbers --Lower: Hash ⇒ 01 conversion result of account number "111-111-111"

0 0 1 0 0 1 1 0 0 0 0 0 1 0 1 1 0
0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0

⇒ Since 1 is set in all the target positions, it is very likely that the record exists.

** "Extremely expensive" is not absolute. ** ** Because the position where 1 stands conflicts with other account numbers There is a possibility that 1 is set at that position by the bit string of other account number information.

--First line: Record with the following account number information --Second line: Bit information of the existing account number "111-111-111" --Third line: Bit information of existing account number "222-222-222"

0 0 1 0 0 1 1 0 0 0 0 0 1 0 1 1 0
0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0

Here, enter the non-existent account number "999-999-999" If the result of hash ⇒ 01 bit conversion is as follows, it is determined that the data exists by mistake.

--First line: Record with the following account number information (same as before) --Second line: Conversion result of non-existent account number "999-999-999"

0 0 1 0 0 1 1 0 0 0 0 0 1 0 1 1 0
0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0

If there is even one 0 in the position where 1 stands, it can be 100% definitely determined that there is no data. Even if there is a 1 in the position where all the 1s stand, it cannot be 100% confirmed that there is data. ** We say that there are no false negatives, but there are false positives. ** **

Taking a long enough record will reduce false positives, If you take a record that is too long, the capacity will be squeezed.

** However, for 1 million account numbers (any number of digits) ** ** False positives will converge to about 1% if there is an area of about 1.2MB. ** ** ** If a region of about 1.8 MB can be prepared, false positives will converge to about 0.1%. ** **

So, with an app like the one above, you only need to keep a few records It will be possible to significantly suppress network access and database access.

2. Bloom Filter implementation sample

I will post the implementation in Java as it is.

MyBloomFilter.java



import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

/**
 *It has only the most basic functions of Bloom Filter
 * @author owl
 * @param <E>
 */
public class MyBasicBloomFilter<E> implements Serializable {

    private final BitSet bitFilter;                                             //Filter body
    private final int bitFilterSize;                                            //Filter size
    private int numberOfAddedElements;                                          //Current number of elements
    private final int countOfHashNumCreatePerOneElement;                        //Number of True bits per element (hashing ⇒ number of number conversions)

    static final Charset DEFAULT_CHARSET = Charset.forName("UTF-8");            

    /**
     *False positive level
     * High:High false positives, but compact filter size and good performance
     *Middle: Balance false positives with filter size and performance
     * Low:Keep false positives as low as possible even if the filter size is increased or the processing time is long.
     */
    public static enum FalsePositiveProbability{ High, Middle, Low }
    
    /**
     *Run message digest
     */
    static final MessageDigest digestFunction;
    static final String hashName = "MD5";                                       
    static {                                                                    
        MessageDigest tmp;
        try { tmp = java.security.MessageDigest.getInstance(hashName); } catch (NoSuchAlgorithmException e) { tmp = null; }
        digestFunction = tmp;
    }    

    /**
     *constructor
     * @param c Number of bits used per element
     * @param n (maximum expected number of elements)
     * @Number of True bits per param k element (hashing ⇒ number of number conversions)
     */
    public MyBasicBloomFilter(double c, int n, int k) {

        this.countOfHashNumCreatePerOneElement = k;
        
        //Element count*The bit size of Filter is determined by the number of bits used per element.
        this.bitFilterSize = (int)Math.ceil(c * n);
        
        numberOfAddedElements = 0;
        this.bitFilter = new BitSet(bitFilterSize);
        
    }


    /**
     * 
     * @param expectedItemCount
     * @param fpp 
     */
    public MyBasicBloomFilter(int expectedItemCount, FalsePositiveProbability fpp) {
        
        switch(fpp){
            case High : {
                this.bitFilterSize = (int)Math.ceil(expectedItemCount * 4.8); 
                break;
            }
            case Low : {
                this.bitFilterSize = (int)Math.ceil(expectedItemCount * (9.6 + 4.8));
                break;
            }
            default : {
                this.bitFilterSize = (int)Math.ceil(expectedItemCount * 9.6);
            }
        }
        this.countOfHashNumCreatePerOneElement
                = (int) Math.round((this.bitFilterSize / (double)expectedItemCount) * Math.log(2.0));
        numberOfAddedElements = 0;
        this.bitFilter = new BitSet(bitFilterSize);        
        
    }
    public MyBasicBloomFilter(int expectedItemCount) {
        this(expectedItemCount, FalsePositiveProbability.Middle);
    }
    
    /**
     *Hashing
     * @param data input data
     * @param hashCount Count
     * @return Numerical value obtained by hashing (bit standing position)
     */
    public static int[] createHashes(byte[] data, int hashCount) {
        int[] result = new int[hashCount];

        int k = 0;
        byte seed = 0;
        while (k < hashCount) {
            
            //Hashing
            byte[] digest;
            synchronized (digestFunction) {
                digestFunction.update(++seed);
                digest = digestFunction.digest(data);                
            }
        
            //Convert to int with 4Byte delimiter and pack in array
            for (int i = 0; i < digest.length/4 && k < hashCount; i++) {
                int h = 0;
                for (int j = (i*4); j < (i*4)+4; j++) {
                    h <<= 8;
                    h |= ((int) digest[j]) & 0xFF;
                }
                result[k] = h;
                k++;
            }
        }
        return result;
    }    
    
    /**
     *Add element
     * @param element 
     */
    public void add(E element) {
       add(element.toString().getBytes(DEFAULT_CHARSET));
    }    
    
    /**
     *Add element
     * @param bytes 
     */
    public void add(byte[] bytes) {
       int[] hashes = createHashes(bytes, countOfHashNumCreatePerOneElement);
       for (int hash : hashes) bitFilter.set(Math.abs(hash % bitFilterSize), true);
       numberOfAddedElements ++;
    }

    /**
     *Judgment
     * @param element element
     * @return 
     */
    public boolean contains(E element) {
        return contains(element.toString().getBytes(DEFAULT_CHARSET));
    }
    
    /**
     *Judgment
     * @Byte array of param bytes element
     * @return 
     */
    public boolean contains(byte[] bytes) {
        int[] hashes = createHashes(bytes, countOfHashNumCreatePerOneElement);
        for (int hash : hashes) {
            if (!bitFilter.get(Math.abs(hash % bitFilterSize))) {
                return false;
            }
        }
        return true;
    }    
    
    /**
     *Get filter
     * @return filter
     */
    public BitSet getBitSet() {
        return bitFilter;
    }

    /**
     *Get the number of added items
     * @number of return items
     */
    public int getAddItemCount(){
        return this.numberOfAddedElements;
    }
    
    public double getFalsePositiveProbability() {
        // (1 - e^(-k * n / m)) ^ k
        return 
            Math.pow((1 - Math.exp(-countOfHashNumCreatePerOneElement * (double) this.numberOfAddedElements
                / (double) bitFilterSize)), countOfHashNumCreatePerOneElement);
    }    
    
}

Call sample


import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang.RandomStringUtils;

/**
 *Try Bloom Filter
 * @author owl
 */
public class ExecBloomFilterSample {
    
    public static void main(String argv[]){
        MyBasicBloomFilter<String> bf = new MyBasicBloomFilter<>(1000000, MyBasicBloomFilter.FalsePositiveProbability.Low);

        List<String> addList = new ArrayList<>();           //Data to add to BloomFilter
        List<String> notAddList = new ArrayList<>();        //Data not added to Bloom Filter
        
        //Added 10,000 data to Bloom Filter
        //At the same time, generate 10,000 verification data that is not added to Bloom Filter.
        for(int i=0;i<2000000;i++){
            if(i % 2 == 0) {
                String s = RandomStringUtils.randomAlphabetic(24);
                bf.add(s);
                addList.add(s);
            } else {
                notAddList.add(RandomStringUtils.randomAlphabetic(24));
            }
        }

        //Output Bloom Filter status
        for(int i=0; i < bf.getBitSet().size(); i++ ){
            System.out.print(bf.getBitSet().get(i) ? "1" : "0");
            if(i>1000) break;
        }
        
        //Output Bloom Filter data size and calculated false positive probability
        System.out.println();
        System.out.println("Data size--> " + bf.getBitSet().size() + " | K-Byte number of bytes" + (double)((double)(bf.getBitSet().size())/8/1024));
        System.out.println("False positive probability--> " + String.format("%.5f", bf.getFalsePositiveProbability()));

        //Check if all the data added to Bloom Filter is positive
        //(If there is no problem, nothing is displayed)
        System.out.println();
        System.out.println("■■■■■ Start checking if all additional data is positive ■■■■■");
        addList
            .stream()
            .sequential()
            .filter(e -> !bf.contains(e))
            .forEach(e-> System.out.println("Missing judgment:" + e));
        
        //Outputs the count that data not added to Bloom Filter is falsely positive
        System.out.println("■■■■■ Start checking if non-additional data is judged as false positive ■■■■■");
        System.out.println("In 1000000 data..." +
            notAddList
                .stream()
                .sequential()
                .filter(e -> bf.contains(e))
                .count() + "A false positive was detected.");
        
    }
    
}

Execution result


000011100100001110011011100010101001100110111000100011001(abridgement)
Data size--> 14400000 | K-Byte number of bytes 1757.8125
False positive probability--> 0.00099

■■■■■ Start checking if all additional data is positive ■■■■■
■■■■■ Start checking if non-additional data is judged as false positive ■■■■■
In 1000000 data...994 false positives were detected.

Recommended Posts

BloomFilter description and implementation sample (JAVA)
Java 15 implementation and VS Code preferences
Java 9 new features and sample code
[Java] Generics sample
Java sample code 02
Java sample code 03
Java GUI sample
Java sample code 04
Comparison of thread implementation methods in Java and lambda expression description method
XXE and Java
Java sample code 01
[ev3 × Java] Interface, implementation and inheritance (event handling)
Java and Swift comparison (3) Class implementation / Class inheritance / Class design
PrintObserver "Observer Design Pattern: Description and Implementation Example"
Interpreter implementation in Java
Getters and setters (Java)
Check Java toString () implementation
[Java] Thread and Runnable
Sample (Java) for OAuth 2.0 authentication and access token acquisition
Java true and false
[Java] String comparison and && and ||
[Java] Holiday judgment sample
Boyer-Moore implementation in Java
Java --Serialization and Deserialization
Heapsort implementation (in java)
[Java] Arguments and parameters
timedatectl and Java TimeZone
[Java] Branch and repeat
[Java] logback slf4j sample
About synchronized and Reentrant Lock (Java & Kotlin implementation example)
[Java] Variables and types
java (classes and instances)
[Java] Overload and override
Sample code to parse date and time with Java SimpleDateFormat
A Simple CRUD Sample Using Java Servlet / JSP and MySQL
Study Java # 2 (\ mark and operator)
Digital signature sample code (JAVA)
[Java] Difference between == and equals
[Java] Stack area and static area
Java parallelization code sample collection
[Java] Generics classes and generics methods
Java encryption and decryption PDF
Java and Iterator Part 1 External Iterator
[Implementation] Java Process class notes
Java if and switch statements
[Java] Implementation of Faistel Network
Apache Hadoop and Java 9 (Part 1)
[Java] About String and StringBuilder
[Java] HashCode and equals overrides
[Java] Timer processing implementation method
☾ Java / Iterative statement and iterative control statement
Java standard log output sample
Java methods and method overloads
java Generics T and? Difference
Implementation of gzip in java
Advantages and disadvantages of Java
java (conditional branching and repetition)
About Java Packages and imports
[Java] Upload images and base64
Implementation of tri-tree in Java
C # and Java Overrides Story