Boyer-Moore implementation in Java

In order to study algorithms and Java grammar, we have implemented the Boyer-Moore method, which is one of the string search algorithms! !!

What is the Boyer-Moore method?

Boyer-Moore Act (1976) Proposed by R.S.Boyer and J.S.Moore Algorithm faster than simple search algorithm Change the position to shift depending on the comparison result of characters Time complexity is less than the simple search algorithm

There are various other character string search algorithms such as the simple search method and the Rabin-Karp method.

Skip table

Create a skip table before searching.

Search pattern: ighi

If so

pattern i g h i
Shift width 3 2 1 0

However, if there are duplicate character strings, the smaller one has priority.

Therefore,

pattern g h i Default
Shift width 2 1 0 4

It will be.

The default is provided to skip to the point where character comparisons do not overlap. This is the value when the pattern is counted as a natural number.

Search example

First time

Search target a b c d e f g i g h i
pattern i g h i

The search starts from the beginning and compares the search target with the end of the pattern. Since the character string to be searched and the end of the search target are "d" and "i", skip the default "4".

Second time

Search target a b c d e i g i g h i
pattern i g h i

With "i" and "i", the Skip table is "0", so compare the adjacent characters. They are different because they are "g" and "h". Looking at the Skip table, when it is "g", Skip is "1". Skip only one.

Third time

Search target a b c d e i g i g h i
pattern i g h i

With "g" and "i", the Skip table is "2", so skip two.

4th

Search target a b c d e i g i g h i
pattern i g h i

With "i" and "i", the Skip table is "0", so compare the adjacent strings. Since "h" and "h" are the same, compare the adjacent strings. Since "g" and "g" are the same, compare the adjacent strings. Since "i" and "i" are the same, the search is complete.

Now that you have an image, I'd like to implement it in code from here.

Variable definition

python


public final static int bmTableSize = 1_114_111;
public final static String txt = "Aiue Oppao"; //Search target
public final static char[] txtChar = txt.toCharArray(); //Split into char type
public final static String pattern = "Oppao"; //pattern
public final static char[] patternChar = pattern.toCharArray();

It was a little boring to be able to search only ASCII, so I'd like to support Japanese as well.

Looking at the Java reference Character type,

The Java SE API documentation uses Unicode code points for character values in the range U + 0000-U + 10FFFF and Unicode code units for 16-bit char values, which are UTF-16 encoding code units. For more information on Unicode terminology, see Unicode Glossary.

Character set: Unicode And Encoding scheme: UTF-16 You can see that

So I converted 10FFFF to decimal and set the table size (bmTableSize) to 1,114,111.

Create table

BoyerMooreTableInit


public static int[] BoyerMooreTableInit(int[] table, char[] patternChar, int ptnLen) {
        int count = 0;

        /*For characters that are not in the pattern, shift the pattern character string length to the width.*/・ ・ ・ ①
        for(count = 0; count < bmTableSize; count++){
            table[count] = ptnLen;
        }

        /*Determine the shift width of the characters included in the pattern*/・ ・ ・ ②
        for(count = 0; count < ptnLen; count++){
            table[(int)patternChar[count]] = ptnLen - count - 1;
        }

        /*Debug output*/
        System.out.printf("[table]  : default: step=%d\n", ptnLen);
        for(count = 0; count < bmTableSize; count++){
            if(table[count] != ptnLen)
                System.out.printf("         : char=%c: table[%03d]: step=%d\n",
                                 (char)count,count,(int)table[count]);
        }
   return table;
}

Skip table,

pattern O Tsu Pa O
Shift width 3 2 1 0

If there are duplicate characters, the smaller one will be given priority.

pattern Tsu Pa O Default
Shift width 2 1 0 4

Eventually it will be like this.

Regarding (1), the default "4" is put in all the tables.

Regarding ②, table [(int) patternChar [count]] = ptnLen --count --1; is

[O, pa, tsu, o] is included in patternChar [count]. If you cast this to an int type, the value converted from UTF-16 to decimal will be returned.

"Tsu" ... table[12387] = 2
"Pa" ... table[12401] = 1
"O" ... table[12362] = 0
"Default" ... table[Places other than the above three] = 4

Search implementation

BoyerMooreSearch


public static int BoyerMooreSearch(char[] txtChar, char[] patternChar) {
        int table[] = new int[bmTableSize];
        int txtLen = 0;
        int ptnLen = 0;
        int i = 0; /*Text comparison position*/
        int j = 0; /*Pattern comparison position*/

        txtLen = txtChar.length;
        ptnLen = patternChar.length;

        /*Create a shift table*/
        table = BoyerMooreTableInit(table, patternChar, ptnLen);

        /*Comparison processing*/
        i = j = ptnLen - 1; /*Make the comparison position the end of the pattern*/
        while((i < txtLen) && (j >= 0)){
            PrintCompareProcess(txt, pattern, i, j);

            if(txtChar[i] != patternChar[j]){
                /*Refer to the shift table and set the next comparison position*/
                i += next_step(table, txtChar[i], (ptnLen - j));
                j = ptnLen - 1;   /*Make the comparison position the end of the pattern*/
            }else{
                /*Since the characters match, collate the previous characters*/
                j--;
                i--;
            }
        }

        if(j < 0) {
            return i + 2;
        }else {
            return 0;
        }
    }
}

・ First comparison

Search target Ah I U e Pa Tsu O Tsu Pa O
pattern O Tsu Pa O

Since the end "e" and "o" are different and the character string does not exist in the pattern, Skip 4 characters.

・ Second comparison

Search target Ah I U e Pa Tsu O Tsu Pa O
pattern O Tsu Pa O

Since the end is the character that exists in the pattern and "tsu" = 2, Two-letter sentence is skipped.

・ Third comparison

Search target Ah I U e Pa Tsu O Tsu Pa O
pattern O Tsu Pa O

Since the end is the character that exists in the pattern and "pa" = 1, One character sentence is skipped.

・ Fourth comparison

Search target Ah I U e Pa Tsu O Tsu Pa O
pattern O Tsu Pa O

Since the end is "O" = 0, search for the next character.

・ ・ ・ Repeat below

In the end, it looks like this.

code

Main.java


import java.io.UnsupportedEncodingException;

public class Main {
	public final static int bmTableSize = 1_114_111;
	public final static String txt = "Aiue Pappa";
	public final static char[] txtChar = txt.toCharArray();
	public final static String pattern = "Oppao";
	public final static char[] patternChar = pattern.toCharArray();

	public static void main(String[] args) throws UnsupportedEncodingException {
		int result;

		System.out.printf("[text]   :%s\n", txt);
		System.out.printf("[pattern]:%s\n", pattern);

		result = BoyerMooreSearch(txtChar, patternChar);

		if (0 < result) {
			System.out.println(result + "It was second! !!");
		}else {
			System.out.println("I couldn't find it");
		}

	}

	public static int[] BoyerMooreTableInit(int[] table, char[] patternChar, int ptnLen) {
		int count = 0;

	    /*For characters that are not in the pattern, shift the pattern character string length to the width.*/
	    for(count = 0; count < bmTableSize; count++){
	    	table[count] = ptnLen;
	    }

	    /*Determine the shift width of the characters included in the pattern*/
	    for(count = 0; count < ptnLen; count++){
	    	table[(int)patternChar[count]] = ptnLen - count - 1;
	    }

	    /*Debug output*/
	    System.out.printf("[table]  : default: step=%d\n", ptnLen);
	    for(count = 0; count < bmTableSize; count++){
	        if(table[count] != ptnLen)
	        	System.out.printf("         : char=%c: table[%03d]: step=%d\n",
	        					 (char)count,count,(int)table[count]);
	    }

	    return table;
	}

	public static void PrintCompareProcess(String txt, String pattern, int i, int j) {
	    int count = 0;

	    System.out.printf("-----------------------------------\n");
	    System.out.printf("[compare]:(text i=%d)(pattern j=%d)\n", i+1, j+1);
	    System.out.printf(" text    :%s\n", txt);

	    /*Overlay patterns at comparison positions*/
	    System.out.printf(" pattern :");
	    for(count = 0; count < (i - j); count++) System.out.print(" "); //It shifts in full-width and half-width.
	    System.out.printf("%s\n", pattern);

	    /*Mark the comparison point*/
	    System.out.printf("         :");
	    for(count = 0; count < i; count++) System.out.printf(" ");
	    System.out.printf("^\n");
	}

	public static int next_step(int[] table, char target, int remain) {
	    /*Check to prevent loops*/
	    if(table[(int)target] > remain){
	        /*Get the value from the staggered table*/
	        return(table[(int)target]);
	    }else{
	        /*Advance to the next character at the point where the collation started*/
	        return(remain);
	    }
	}

	public static int BoyerMooreSearch(char[] txtChar, char[] patternChar) {
		int table[] = new int[bmTableSize];
		int txtLen = 0;
		int ptnLen = 0;
		int i = 0; /*Text comparison position*/
		int j = 0; /*Pattern comparison position*/

		txtLen = txtChar.length;
		ptnLen = patternChar.length;

	    /*Create a shift table*/
		table = BoyerMooreTableInit(table, patternChar, ptnLen);

	    /*Comparison processing*/
	    i = j = ptnLen - 1; /*Make the comparison position the end of the pattern*/
	    while((i < txtLen) && (j >= 0)){
	    	PrintCompareProcess(txt, pattern, i, j);

	        if(txtChar[i] != patternChar[j]){
	            /*Refer to the shift table and set the next comparison position*/
	            i += next_step(table, txtChar[i], (ptnLen - j));
	            j = ptnLen - 1;   /*Make the comparison position the end of the pattern*/
	        }else{
	            /*Since the characters match, collate the previous characters*/
	            j--;
	            i--;
	        }
	    }

	    if(j < 0) {
	    	return i + 2;
	    }else {
		    return 0;
	    }
	}
}

Output result

[text]   :Aiue Pappa
[pattern]:Oppao
[table]  : default: step=4
         : char=O: table[12362]: step=0
         : char=Tsu: table[12387]: step=2
         : char=Pa: table[12401]: step=1
-----------------------------------
[compare]:(text i=4)(pattern j=4)
 text    :Aiue Pappa
 pattern :Oppao
         :   ^
-----------------------------------
[compare]:(text i=8)(pattern j=4)
 text    :Aiue Pappa
 pattern :Oppao
         :       ^
-----------------------------------
[compare]:(text i=10)(pattern j=4)
 text    :Aiue Pappa
 pattern :Oppao
         :         ^
-----------------------------------
[compare]:(text i=9)(pattern j=3)
 text    :Aiue Pappa
 pattern :Oppao
         :        ^
-----------------------------------
[compare]:(text i=8)(pattern j=2)
 text    :Aiue Pappa
 pattern :Oppao
         :       ^
-----------------------------------
[compare]:(text i=7)(pattern j=1)
 text    :Aiue Pappa
 pattern :Oppao
         :      ^
It was 7th! !!

I hope you find it helpful.

References

-Java Character type reference ・ [C Language Algorithm-BM Method](http://capm-network.com/?tag=C%E8%A8%80%E8%AA%9E%E3%82%A2%E3%83%AB%E3 % 82% B4% E3% 83% AA% E3% 82% BA% E3% 83% A0-BM% E6% B3% 95) -Unicode-compatible character code table -[Java] char type can be cast to int type

Recommended Posts

Boyer-Moore implementation in Java
Heapsort implementation (in java)
Implementation of gzip in java
Implementation of tri-tree in Java
Implementation of like function in Java
Changes in Java 11
Rock-paper-scissors in Java
Pi in Java
FizzBuzz in Java
Implementation of DBlayer in Java (RDB, MySQL)
[java] sort in list
Read JSON in Java
Make Blackjack in Java
Constraint programming in Java
Put java8 in centos7
NVL-ish guy in Java
Combine arrays in Java
"Hello World" in Java
Check Java toString () implementation
Callable Interface in Java
Comments in Java source
Azure functions in java
Format XML in Java
Simple htmlspecialchars in Java
Hello World in Java
Use OpenCV in Java
webApi memorandum in java
Type determination in Java
Ping commands in Java
Various threads in java
Zabbix API in Java
ASCII art in Java
Compare Lists in Java
POST JSON in Java
Express failure in Java
Create JSON in Java
Date manipulation in Java 8
What's new in Java 8
Use PreparedStatement in Java
What's new in Java 9,10,11
Parallel execution in Java
Initializing HashMap in Java
Try using RocksDB in Java
Avoid Yubaba's error in Java
Get EXIF information in Java
Save Java PDF in Excel
Edit ini in Java: ini4j
Java history in this world
Try calling JavaScript in Java
Try functional type in Java! ①
I made roulette in Java.
Create hyperlinks in Java PowerPoint
[Implementation] Java Process class notes
Implement two-step verification in Java
Refactoring: Make Blackjack in Java
Topic Analysis (LDA) in Java
NEologd preprocessing in Java neologdn-java
Change java encoding in windows
Java Stream API in 5 minutes
Cannot find javax.annotation.Generated in Java 11
Read standard input in Java