A lot of notes on how to use regular expressions in Java.
Several methods in String accept regular expressions.
matches(String)
package sample.regexp;
public class Main {
public static void main(String[] args) {
String text = "abc123";
System.out.println(text.matches("[a-z0-9]+"));
System.out.println(text.matches("[a-z]+"));
}
}
Execution result
true
false
--Verifies that the string **exactly matches the specified regular expression**, from start to end
--If only part of the string matches, the result is false
replaceAll(String, String)
package sample.regexp;
public class Main {
public static void main(String[] args) {
String text = "abc123";
System.out.println(text.replaceAll("[a-z]", "*"));
}
}
Execution result
***123
--Pass a regular expression as the first argument and replace all matching parts with the string of the second argument
package sample.regexp;
public class Main {
public static void main(String[] args) {
String text = "<<abc123>>";
System.out.println(text.replaceAll("([a-z]+)([0-9]+)", "$0, $1, $2"));
}
}
Execution result
<<abc123, abc, 123>>
--Including $n in the replacement string lets you reuse a matched group in the replacement
--n starts at 0
--$0 refers to the entire matched string
--Here that is abc123, the part that matches ([a-z]+)([0-9]+)
--From 1 onward, n refers to the groups enclosed in (), in order
--$1 is abc, which matched ([a-z]+)
--$2 is 123, which matched ([0-9]+)
--If n is larger than the number of groups, IndexOutOfBoundsException is thrown
--To replace with a literal $, escape it with a backslash (\)
- text.replaceAll("[a-z]+", "\\$")
--If it is not escaped, IllegalArgumentException is thrown
--Groups can also be referenced by name instead of by index
--See [here](#%E3%82%B0%E3%83%AB%E3%83%BC%E3%83%97) for details
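As a small sketch of the escaping rules above: when the replacement text comes from somewhere else and may contain $ or \, Matcher.quoteReplacement(String) can do the escaping for you. The class name DollarEscapeExample and the helper replaceLiterally are made up for this illustration.

```java
import java.util.regex.Matcher;

class DollarEscapeExample {
    // Hypothetical helper: replaces every run of lowercase letters with
    // the given text, taken literally.
    static String replaceLiterally(String text, String literal) {
        // quoteReplacement escapes '$' and '\' so they lose their special meaning
        return text.replaceAll("[a-z]+", Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        // Manual escape with a backslash
        System.out.println("abc123".replaceAll("[a-z]+", "\\$")); // $123
        // Same result without escaping by hand
        System.out.println(replaceLiterally("abc123", "$"));      // $123
        try {
            // Only two groups exist, so $3 is out of range
            "abc123".replaceAll("([a-z]+)([0-9]+)", "$3");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("caught: " + e.getClass().getSimpleName());
        }
    }
}
```

quoteReplacement is probably the safer choice whenever the replacement string is not a constant you wrote yourself.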
replaceFirst(String, String)
package sample.regexp;
public class Main {
public static void main(String[] args) {
String text = "abc123";
System.out.println(text.replaceFirst("[a-z]", "*"));
}
}
Execution result
*bc123
--Replaces only the first substring that matches the regular expression
--As with replaceAll(), groups can be referenced with $n
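A minimal sketch of using a group reference with replaceFirst(), next to replaceAll() for comparison. The class name ReplaceFirstGroupExample and the helper swapFirstPair are invented for this example.

```java
class ReplaceFirstGroupExample {
    // Hypothetical helper: swaps letters and digits in the first
    // letters-then-digits pair only; later pairs are left untouched.
    static String swapFirstPair(String text) {
        return text.replaceFirst("([a-z]+)([0-9]+)", "$2$1");
    }

    public static void main(String[] args) {
        System.out.println(swapFirstPair("abc123def456")); // 123abcdef456
        // replaceAll with the same arguments swaps every pair
        System.out.println("abc123def456".replaceAll("([a-z]+)([0-9]+)", "$2$1")); // 123abc456def
    }
}
```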
split(String, int)
package sample.regexp;
import java.util.Arrays;
public class Main {
public static void main(String[] args) {
String text = "a1b2";
for (int i=-1; i<5; i++) {
String[] elements = text.split("[0-9]", i);
System.out.println("limit=" + i + ",\telements=" + Arrays.toString(elements));
}
}
}
Execution result
limit=-1, elements=[a, b, ]
limit=0, elements=[a, b]
limit=1, elements=[a1b2]
limit=2, elements=[a, b2]
limit=3, elements=[a, b, ]
limit=4, elements=[a, b, ]
--Splits the string at each location that matches the regular expression given as the first argument
--The second argument, limit, sets the upper bound on the size of the returned array
--If limit is 1 or greater, the split is applied at most limit - 1 times
--If limit == 1, then limit - 1 => 0, so no split is performed (the resulting array has size 1)
--If limit == 2, then limit - 1 => 1, so the string is split only at the 1 in a1b2, the first match of [0-9], and splitting stops there (the resulting array has size 2)
--If limit is 0 or less, the number of splits is unlimited and splitting continues to the end of the string
--However, 0 and negative values differ in how they treat empty strings left at the end of the result
--For a negative value, trailing empty strings are kept as array elements
--For 0, trailing empty strings are discarded
If splitting leaves an empty string at the beginning, it is kept as an array element.
package sample.regexp;
import java.util.Arrays;
public class Main {
public static void main(String[] args) {
String text = "0a1b2";
String[] elements = text.split("[0-9]", 0);
System.out.println(Arrays.toString(elements));
}
}
Execution result
[, a, b]
split(String)
This behaves the same as calling split(String, int) with 0 as the second argument.
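A quick way to check this equivalence yourself, as a sketch. The class name SplitDefaultLimitExample and the helper sameAsLimitZero are hypothetical.

```java
import java.util.Arrays;

class SplitDefaultLimitExample {
    // Hypothetical helper: true if split(regex) and split(regex, 0)
    // produce the same array for the given input.
    static boolean sameAsLimitZero(String text, String regex) {
        return Arrays.equals(text.split(regex), text.split(regex, 0));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("a1b2".split("[0-9]")));    // [a, b]
        System.out.println(Arrays.toString("a1b2".split("[0-9]", 0))); // [a, b]
        System.out.println(sameAsLimitZero("a1b2", "[0-9]"));          // true
    }
}
```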
With some exceptions [^1], the methods in the String class that take regular expressions delegate their processing to the Pattern class behind the scenes.
For example, the implementation of the replaceAll() method looks like this:
String.replaceAll()
public String replaceAll(String regex, String replacement) {
return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}
The Pattern class (together with Matcher) handles regular expression processing in Java.
The Pattern class interprets the string passed to compile() as a regular expression.
If the regular expression is fixed, it is more efficient to run compile() only once and then reuse the Pattern instance.
(The Pattern class is immutable, so it can be safely shared across threads.)
However, the String methods that take regular expressions run compile() on every call.
Therefore, if you apply a fixed regular expression over and over through the String methods, processing is slower than when you reuse a Pattern instance.
Example of reusing Pattern
import java.util.regex.Pattern;
public class Hoge {
    //Reuse the compiled Pattern instance
    private static final Pattern HOGE_PATTERN = Pattern.compile("[0-9]+");
    public boolean test(String text) {
        return HOGE_PATTERN.matcher(text).matches(); //Behaves the same as text.matches("[0-9]+")
    }
}
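To get a rough feel for the difference, something like the following can be run. It is only a crude sketch with System.nanoTime(): there is no JIT warm-up or statistical treatment, so the numbers vary from run to run, and a harness such as JMH would be needed for a real measurement. The class and method names here are made up for illustration.

```java
import java.util.regex.Pattern;

class PatternReuseTiming {
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // Counts matching inputs via String.matches(), which compiles the regex on every call
    static int countWithStringMatches(String[] inputs) {
        int count = 0;
        for (String s : inputs) {
            if (s.matches("[0-9]+")) count++;
        }
        return count;
    }

    // Counts matching inputs with the precompiled Pattern
    static int countWithReusedPattern(String[] inputs) {
        int count = 0;
        for (String s : inputs) {
            if (DIGITS.matcher(s).matches()) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Half of the inputs are all digits, the other half start with a letter
        String[] inputs = new String[100_000];
        for (int i = 0; i < inputs.length; i++) {
            inputs[i] = (i % 2 == 0) ? Integer.toString(i) : "x" + i;
        }
        long t0 = System.nanoTime();
        int a = countWithStringMatches(inputs);
        long t1 = System.nanoTime();
        int b = countWithReusedPattern(inputs);
        long t2 = System.nanoTime();
        System.out.println("String.matches: " + a + " matches, " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("reused Pattern: " + b + " matches, " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Both methods return the same count; only the elapsed time should differ.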
package sample.regexp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("[0-9]+");
Matcher abc = pattern.matcher("123abc");
System.out.println(abc.matches());
Matcher _123 = pattern.matcher("123");
System.out.println(_123.matches());
}
}
Execution result
false
true
--First, compile the regular expression with Pattern.compile(String) to get a Pattern instance
--Next, pass the string you want to check (the **input sequence**) to the Pattern.matcher(CharSequence) method to get a Matcher instance
--Use the Matcher instance to check whether the input matches
--Matcher.matches() checks whether the entire input sequence matches the regular expression and returns the result as a boolean
package sample.regexp;
import java.util.Arrays;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("[a-z]+");
String[] elements = pattern.split("123abc456def789ghi");
System.out.println(Arrays.toString(elements));
elements = pattern.split("123abc456def789ghi", -1);
System.out.println(Arrays.toString(elements));
}
}
Execution result
[123, 456, 789]
[123, 456, 789, ]
--Pattern.split(CharSequence) splits the given string at the parts that match the regular expression
--It behaves the same as String.split(String) and String.split(String, int)
Matcher
--Pattern is the class that interprets regular expressions, while Matcher does the following:
--Checks whether the input sequence matches the regular expression
--Extracts the matched parts
--Replaces the matched parts
--Matcher is used in the following steps: first perform a match operation, then query the result of that match with these methods.
--start() Start index of the match in the input sequence
--end() End index of the match in the input sequence + 1
--group() The matched substring
--If you call these methods without performing a match operation, IllegalStateException is thrown.
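The order "match first, then query" can be sketched as follows. The class name MatcherStateExample and the helper throwsBeforeMatch are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherStateExample {
    // Hypothetical helper: true if querying the matcher in its current
    // state throws IllegalStateException.
    static boolean throwsBeforeMatch(Matcher matcher) {
        try {
            matcher.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Matcher matcher = Pattern.compile("[a-z]+").matcher("123abc");
        // No match operation has been performed yet
        System.out.println(throwsBeforeMatch(matcher)); // true
        // After a successful find(), the result can be queried
        if (matcher.find()) {
            System.out.println(matcher.start() + ".." + matcher.end() + " " + matcher.group()); // 3..6 abc
        }
    }
}
```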
Note that Matcher is **not thread-safe**.
There are three match operations in Matcher.
--matches() Checks whether the entire input sequence matches the regular expression
--lookingAt() Checks whether the regular expression matches from the beginning of the input sequence
--find() Checks in order whether the input sequence contains parts that match the regular expression
matches()
package sample.regexp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
test("abc");
test("abc123");
}
private static void test(String text) {
Pattern pattern = Pattern.compile("[a-z]+");
Matcher matcher = pattern.matcher(text);
System.out.println("[text=" + text + "]");
if (matcher.matches()) {
System.out.println("matches = true");
System.out.println("start = " + matcher.start());
System.out.println("end = " + matcher.end());
System.out.println("group = " + matcher.group());
} else {
System.out.println("matches = false");
}
}
}
Execution result
[text=abc]
matches = true
start = 0
end = 3
group = abc
[text=abc123]
matches = false
--matches() checks whether the entire input sequence matches the regular expression
--Returns true if it does
lookingAt()
package sample.regexp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
test("abc");
test("123abc");
test("ab12");
}
private static void test(String text) {
Pattern pattern = Pattern.compile("[a-z]+");
Matcher matcher = pattern.matcher(text);
System.out.println("[text=" + text + "]");
if (matcher.lookingAt()) {
System.out.println("lookingAt = true");
System.out.println("start = " + matcher.start());
System.out.println("end = " + matcher.end());
System.out.println("group = " + matcher.group());
} else {
System.out.println("lookingAt = false");
}
}
}
Execution result
[text=abc]
lookingAt = true
start = 0
end = 3
group = abc
[text=123abc]
lookingAt = false
[text=ab12]
lookingAt = true
start = 0
end = 2
group = ab
--lookingAt() checks whether the regular expression matches starting from the beginning of the input sequence
--Returns true if a match is found at the beginning (the whole input does not have to match)
find()
package sample.regexp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
test("abc");
test("123abc456def789");
}
private static void test(String text) {
Pattern pattern = Pattern.compile("[a-z]+");
Matcher matcher = pattern.matcher(text);
System.out.println("[text=" + text + "]");
while (matcher.find()) {
System.out.println("start = " + matcher.start());
System.out.println("end = " + matcher.end());
System.out.println("group = " + matcher.group());
}
}
}
Execution result
[text=abc]
start = 0
end = 3
group = abc
[text=123abc456def789]
start = 3
end = 6
group = abc
start = 9
end = 12
group = def
--The find() method scans the input sequence from the beginning for a substring that matches the regular expression
--Returns true if a matching substring is found
--Calling find() again resumes the scan from the end of the previous match
--Calling it repeatedly lets you extract each matched substring in turn
--start(), end(), and group() return the results of the most recent match
package sample.regexp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("[a-z]+");
Matcher matcher = pattern.matcher("abc123def");
System.out.println("replaceAll = " + matcher.replaceAll("*"));
System.out.println("replaceFirst = " + matcher.replaceFirst("*"));
}
}
Execution result
replaceAll = *123*
replaceFirst = *123def
--Matcher.replaceAll(String) replaces all matched substrings
--Matcher.replaceFirst(String) replaces only the first matched substring
package sample.regexp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("([a-z]+)([0-9]+)");
Matcher matcher = pattern.matcher("abc123de45fg");
int groupCount = matcher.groupCount();
System.out.println("groupCount=" + groupCount);
while (matcher.find()) {
System.out.println("==========");
String group = matcher.group();
System.out.println("group=" + group);
for (int i=0; i<=groupCount; i++) {
String g = matcher.group(i);
System.out.println("group(" + i + ")=" + g);
}
}
}
}
Execution result
groupCount=2
==========
group=abc123
group(0)=abc123
group(1)=abc
group(2)=123
==========
group=de45
group(0)=de45
group(1)=de
group(2)=45
--The following methods are provided for referring to the groups defined in the regular expression (the parts enclosed in ())
--groupCount() Gets the number of groups defined in the regular expression
--group() Gets the entire string matched by the most recent match operation
--group(int) Gets the group at the specified index from the most recent match operation
--Index 0 is the entire matched string, so it returns the same result as group()
--Indexes 1 and up refer to the substrings matched by each group
package sample.regexp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(?<alphabets>[a-z]+)(?<numbers>[0-9]+)");
Matcher matcher = pattern.matcher("abc123de45fg");
while (matcher.find()) {
System.out.println("==========");
System.out.println("group(alphabets)=" + matcher.group("alphabets"));
System.out.println("group(numbers)=" + matcher.group("numbers"));
}
}
}
Execution result
==========
group(alphabets)=abc
group(numbers)=123
==========
group(alphabets)=de
group(numbers)=45
--You can name a group by defining it as (?<group name>pattern)
--Passing that name to the group(String) method returns the substring matched by the group
To refer to the group name in the replacement string, do the following:
package sample.regexp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(?<alphabets>[a-z]+)(?<numbers>[0-9]+)");
Matcher matcher = pattern.matcher("abc123def456");
String replaced = matcher.replaceAll("${numbers}${alphabets}");
System.out.println(replaced);
}
}
Execution result
123abc456def
--You can refer to a group with ${group name}
When creating a Pattern instance, you can adjust how the regular expression is interpreted by specifying **flags**.
Compile with flags
Pattern pattern = Pattern.compile("[a-z]", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
--Flags are specified with the second argument of compile(String, int)
--You can specify the constants declared as static fields in the Pattern class
--Since the value is a bit mask, combine multiple flags with |
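Because the flags are a bit mask, they can also be read back with Pattern.flags() and tested with &. The following is a small sketch. The class name FlagBitmaskExample and the helper hasFlag are invented for this illustration.

```java
import java.util.regex.Pattern;

class FlagBitmaskExample {
    // Hypothetical helper: true if the given flag bit is set on the pattern
    static boolean hasFlag(Pattern pattern, int flag) {
        return (pattern.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("[a-z]", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        System.out.println(hasFlag(pattern, Pattern.CASE_INSENSITIVE)); // true
        System.out.println(hasFlag(pattern, Pattern.MULTILINE));        // true
        System.out.println(hasFlag(pattern, Pattern.DOTALL));           // false
    }
}
```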
package sample.regexp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("[a-z]+", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher("ABC");
System.out.println(matcher.matches());
}
}
Execution result
true
--If you specify CASE_INSENSITIVE, matching is case-insensitive
--By default, this applies only to US-ASCII characters
package sample.regexp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("[a-zA-Z]+", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
Matcher matcher = pattern.matcher("ABCabc");
System.out.println(matcher.matches());
}
}
Execution result
true
--Combining UNICODE_CASE with CASE_INSENSITIVE makes case-insensitive matching Unicode-aware
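The effect of UNICODE_CASE only shows up with characters outside US-ASCII. As a sketch, the following compares a lowercase and an uppercase accented letter (U+00E9 and U+00C9), written as Unicode escapes so that the source file encoding does not matter. The class name UnicodeCaseExample and the helper matchesIgnoringCase are hypothetical.

```java
import java.util.regex.Pattern;

class UnicodeCaseExample {
    // Hypothetical helper: matches input against regex with CASE_INSENSITIVE,
    // optionally adding UNICODE_CASE.
    static boolean matchesIgnoringCase(String regex, String input, boolean unicodeCase) {
        int flags = Pattern.CASE_INSENSITIVE;
        if (unicodeCase) {
            flags |= Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00e9"; // lowercase e with acute accent
        String upper = "\u00c9"; // uppercase E with acute accent
        // CASE_INSENSITIVE alone only folds US-ASCII letters
        System.out.println(matchesIgnoringCase(lower, upper, false)); // false
        // Adding UNICODE_CASE folds non-ASCII letters as well
        System.out.println(matchesIgnoringCase(lower, upper, true));  // true
    }
}
```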
package sample.regexp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("[a-z]+", Pattern.LITERAL);
Matcher matcher = pattern.matcher("abc");
System.out.println(matcher.matches());
matcher = pattern.matcher("[a-z]+");
System.out.println(matcher.matches());
}
}
Execution result
false
true
--If LITERAL is specified, the string passed as the first argument of compile(String, int) is treated as a plain string
--Characters that normally have special meaning in regular expressions, such as [] and +, are interpreted as the characters themselves
package sample.regexp;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
test("[default]", () -> Pattern.compile("^[a-z]+$"));
test("[MULTILINE]", () -> Pattern.compile("^[a-z]+$", Pattern.MULTILINE));
}
private static void test(String label, Supplier<Pattern> patternSupplier) {
System.out.println(label);
Pattern pattern = patternSupplier.get();
String text = "abc\n"
+ "def\n";
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
String group = matcher.group();
System.out.println(group);
}
}
}
Execution result
[default]
[MULTILINE]
abc
def
--When MULTILINE is specified, the handling of ^ and $, which represent the beginning and end of a line, changes
--By default, ^ and $ match only at the beginning and end of the entire string
--If MULTILINE is specified, each line separated by a line break is treated on its own, so ^ and $ match at the beginning and end of every line
package sample.regexp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String regexp = "#This line is ignored as a comment\n"
+ " [a-z]+ ";
Pattern pattern = Pattern.compile(regexp, Pattern.COMMENTS);
Matcher matcher = pattern.matcher("abc");
System.out.println(matcher.matches());
}
}
Execution result
true
--If COMMENTS is specified, the following are treated as comments and ignored
--Everything from # to the end of the line
--Whitespace
package sample.regexp;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
test("[default1]", () -> Pattern.compile(".+"));
test("[default2]", () -> Pattern.compile(".+$"));
test("[DOTALL]", () -> Pattern.compile(".+", Pattern.DOTALL));
}
private static void test(String label, Supplier<Pattern> patternSupplier) {
System.out.println(label);
Pattern pattern = patternSupplier.get();
String text = "abc\n"
+ "def\n";
Matcher matcher = pattern.matcher(text);
if (matcher.find()) {
String group = matcher.group();
System.out.println(group);
}
}
}
Execution result
[default1]
abc
[default2]
def
[DOTALL]
abc
def
--If you specify DOTALL, . also matches line terminators
--By default, . does not match line terminators
[^1]: For example, the split(String regexp) method splits without using Pattern when regexp is a plain string that contains no regular expression metacharacters.