About the phenomenon that StackOverflowError occurs in processing using Java regular expressions

Confirmed environment

phenomenon

StackOverflowError occurs when processing a long sentence that satisfies a specific regular expression.

Test.java


public class Test {
    public static void main(final String[] args) {
        final StringBuilder input = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            input.append("a ");
        }
        System.out.println(input.toString().replaceAll("(a|\\s)+", ""));
    }
}
Exception in thread "main" java.lang.StackOverflowError
	at java.base/java.util.regex.Pattern$GroupHead.match(Pattern.java:4786)
	at java.base/java.util.regex.Pattern$Loop.match(Pattern.java:4923)
	at java.base/java.util.regex.Pattern$GroupTail.match(Pattern.java:4845)
	at java.base/java.util.regex.Pattern$BranchConn.match(Pattern.java:4695)
...

Cause

  1. Regular expressions are treated as a recursive structure
  2. Recursive method calls continue as long as the string to be processed meets the conditions.

Cause details

First, the regular expression given as a string is the Pattern of the inner class of [Pattern](https://docs.oracle.com/javase/jp/9/docs/api/java/util/regex/Pattern.html). It is decomposed into subclasses of $ Node. This is for processing Pattern.compile Equivalent. For example, the above-mentioned regular expression (a | \\ s) + has the following structure.

Start
  ---> Prolog
      ---> Loop  "+"
           ---> GroupHead "("
                 ---> Branch "|"
                     ---> BmpCharProperty "a"
                           ---> BranchConn
                                ---> GroupTail
                     ---> BmpCharProperty "\\s"
                           ---> BranchConn
                                ---> GroupTail
           ---> GroupTail ")"
                 ---> Loop

Then, according to the interpreted structure, it is confirmed whether the condition is satisfied one by one from the beginning of the character string to be processed, and the judgment method is called recursively as long as the condition is satisfied.

  1. Loop
  2. GroupHead
  3. Branch
  4. BmpCharProperty
  5. BranchConn
  6. GroupTail
  7. Loop
  8. ... Repeat

Reproduction pattern

Reproduced with the following combinations.

For example, regular expressions like (a \\ s?) + Have the same problem.

Countermeasures

Use character classes

If you don't need to group by parentheses, create an OR condition with the character class (square brackets). In the case of the example, if you set [a \\ s] +, no error will occur.

Do not use longest match number quantum

For patterns that just replace with blanks, the unit to match is not particularly important, so even if you simply ʻa | \ swithout specifying*or+` (longest match number quantum), all You can replace characters (processing time increases by a few milliseconds).

Or the shortest match number quantum(a|\\s)+?Or a greedy number quantum(a|\\s)++If you use, it will be interpreted as a different structure, so no error will occur.

Remarks

Be careful other than replaceAll

Since it is a structural problem that represents a regular expression, a similar problem occurs in the processing that uses a regular expression.

String

Matcher

Behavior in Android apps

The Android app did not have this issue. Java implementation is based on OpenJDK and regular expressions are based on icu native code It's implemented and it's for a completely different thing.

Recommended Posts

About the phenomenon that StackOverflowError occurs in processing using Java regular expressions
Match IP addresses using regular expressions in Java
About regular expressions in Ruby
About file copy processing in Java
Try using the Stream API in Java
ERRORCODE = -4471 occurs in Java application using Db2.
About the confusion seen in startup Java servers
About the idea of anonymous classes in Java
ChatWork4j for using the ChatWork API in Java
A story about the JDK in the Java 11 era
Impressions and doubts about using java for the first time in Android Studio
Display "Hello World" in the browser using Java
Try using the COTOHA API parsing in Java
About regular expressions used in ruby sub method
[Java] Sort the list using streams and lambda expressions
About the problem of deadlock in parallel processing in gem'sprockets' 4.0
[Java] Explains ConcurrentModificationException that occurs in java.util.ArrayList for newcomers
Try global hooking in Java using the JNativeHook library
Code that only displays the built-in camera in Processing
Think about the JAVA = JAVAscript problem (needed in the future)
Differences in code when using the length system in Java
Notes on how to use regular expressions in Java
The story that .java is also built in Unity 2018
About Java lambda expressions
Learn about the spec while shortening FizzBuzz written in Java
[Java] Comparison method of character strings and comparison method using regular expressions
Identify threads in the Java process that are wasting CPU
Think about the differences between functions and methods (in Java)
Talk about using Java input wait (Scanner) in VS Code
Shortcut keys that complement standard output processing in the IDE
Try using RocksDB in Java
[Java] Summary of regular expressions
Measured parallel processing in Java
About abstract classes in java
[Java] Get date information 10 days later using milliseconds in the Date class
[Java] Try to solve the Fizz Buzz problem using recursive processing
How to solve the unknown error when using slf4j in Java
Sample code that uses the Mustache template engine JMustache in Java