About Python regular expressions. Until now I have googled, researched, and implemented them only when necessary, but I thought it was time to deepen my understanding. "Googling when needed and implementing it" may sound fine, but it is a beginner-level approach. I am writing with an awareness of **what beginners should learn from scratch** and **what the occasional user should relearn**. This article organizes what I learned for ["Chapter 3: Regular Expressions"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch3) of 100 Language Processing Knocks 2015.
Link | Remarks |
---|---|
Regular expression HOWTO | Python Official Regular Expression How To |
re: Regular expression operations | Official Python documentation for the re module |
Python provides regular expressions through the `re` module. The `import re` line is omitted from the Python code that follows.
import re
One way is to call module-level functions such as `re.match` and `re.sub` directly.
# The first argument is the regular expression pattern (the search term); the second is the string to search
result = re.match('Hel', 'Hellow python')
print(result)
# <re.Match object; span=(0, 3), match='Hel'>
print(result.group())
# Hel
The other way is to compile the regular expression pattern first and then call methods such as `match` and `sub` on the compiled object.
# Compile the regular expression pattern in advance
regex = re.compile('Hel')
result = regex.match('Hellow python')
print(result)
# <re.Match object; span=(0, 3), match='Hel'>
print(result.group())
# Hel
**If you use many regular expression patterns repeatedly, use the compile approach.** The official documentation says:
Using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program. The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.
So if you only reuse the same few patterns, compiling does not seem to give a speed advantage, because they are cached anyway. I have not checked how many patterns the cache holds.
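The two calling styles can be compared side by side. This is only a minimal sketch: the sample `lines` list is made up for illustration, and it shows equivalence of results, not a speed measurement.

```python
import re

# Hypothetical sample data, for illustration only.
lines = ["Hello python", "hello world", "Help me", "goodbye"]

# Compile once, then reuse the pattern object for every line.
starts_with_hel = re.compile(r"Hel")
matched_compiled = [line for line in lines if starts_with_hel.match(line)]

# The module-level function gives the same result without an explicit compile.
matched_module = [line for line in lines if re.match(r"Hel", line)]

print(matched_compiled)  # ['Hello python', 'Help me']
print(matched_compiled == matched_module)  # True
```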
Raw strings are not specific to regular expressions, but they let you **disable escape sequences**.
In the first line of the example below, `\t` becomes a tab and `\n` becomes a line feed; in the second line, the raw string keeps `\t` and `\n` as literal characters.
print('a\tb\nA\tB')
print(r'a\tb\nA\tB')
Terminal output result
a b
A B
a\tb\nA\tB
**I do not want to escape every backslash in a regular expression pattern, so I write patterns as raw strings.**
result = re.match(r'\d', '329')
The articles "Writing regular expressions using raw Python strings" and "Ignore (disable) escape sequences in Python: Raw string" explain this in detail.
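To see what the raw-string prefix saves, here is a small sketch comparing a hand-escaped pattern with its raw equivalent. The variable names are chosen only for this illustration.

```python
import re

escaped = '\\d+'  # backslash escaped by hand in a normal string
raw = r'\d+'      # the same pattern written as a raw string

same_pattern = escaped == raw
lengths = (len('\n'), len(r'\n'))  # one newline character vs. backslash + 'n'
digits = re.match(raw, '329').group()

print(same_pattern)  # True
print(lengths)       # (1, 2)
print(digits)        # 329
```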
re.VERBOSE
You can write a regular expression pattern across multiple lines by enclosing it in triple quotes `'''` (`"""` also works); a pattern without line breaks is fine too.
If you pass `re.VERBOSE`, whitespace and comments in the pattern are ignored.
**Triple quotes and `re.VERBOSE` together make a pattern very readable.**
For example, a pattern written like this is easy to follow:
a = re.compile(r'''\d + # the integral part
\. # the decimal point
\d * # some fractional digits''', re.VERBOSE)
You can read more about triple quotes in the article "String generation in Python (quotes, str constructor)" (https://note.nkmk.me/python-str-literal-constructor/).
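As a quick check that the verbose layout does not change what is matched, the sketch below compares it with the compact one-line form. The sample strings are assumptions made up for this example.

```python
import re

verbose = re.compile(r'''\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits''', re.VERBOSE)
compact = re.compile(r'\d+\.\d*')

samples = ['3.14', '42.', 'abc', '7']
verbose_hits = [bool(verbose.match(s)) for s in samples]
compact_hits = [bool(compact.match(s)) for s in samples]

print(verbose_hits)                  # [True, True, False, False]
print(verbose_hits == compact_hits)  # True
```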
By the way, to pass several compile flags in the `flags` parameter of `compile`, you can simply add them with `+`. The official documentation combines them with bitwise OR (`|`), which gives the same value.
a = re.compile(r'''\d''', re.VERBOSE+re.MULTILINE)
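The following sketch checks that `+` and `|` produce the same flag value for two different flags, then uses the combination. The pattern and the sample text are made up for illustration. Note that `|` stays correct even if the same flag is listed twice, whereas `+` would then produce a different number.

```python
import re

added = re.VERBOSE + re.MULTILINE
ored = re.VERBOSE | re.MULTILINE
same_value = added == ored

pattern = re.compile(r'''^ \d+   # digits at the start of each line
''', re.VERBOSE | re.MULTILINE)
numbers = pattern.findall("10 apples\n20 pears")

print(same_value)  # True
print(numbers)     # ['10', '20']
```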
Character | Description | Remarks | Example | Match | Does not match |
---|---|---|---|---|---|
\d | Digit | Same as [0-9] | |||
\D | Non-digit | Same as [^0-9] | |||
\s | Whitespace character | Same as [ \t\n\r\f\v] | |||
\S | Non-whitespace character | Same as [^ \t\n\r\f\v] | |||
\w | Alphanumeric character or underscore | Same as [a-zA-Z0-9_] | |||
\W | Non-alphanumeric character | Same as [^a-zA-Z0-9_] | |||
\A | Beginning of the string | Similar to ^ | |||
\Z | End of the string | Similar to $ | |||
\b | Word boundary (for example, next to a space) | ||||
. | Any single character | - | 1.3 | 123, 133 | 1223 |
^ | Beginning of the string | - | ^123 | 1234 | 0123 |
$ | End of the string | - | 123$ | 0123 | 1234 |
* | Repeat 0 or more times | - | 12* | 1, 12, 122 | 11, 22 |
+ | Repeat 1 or more times | - | 12+ | 12, 122 | 1, 11, 22 |
? | 0 times or 1 time | - | 12? | 1, 12 | 122 |
{m} | Repeat m times | - | 1{3} | 111 | 11, 1111 |
{m,n} | Repeat m to n times | - | 1{2,3} | 11, 111 | 1, 1111 |
[] | Character set | [^5] means any character other than 5 | [1-3] | 1, 2, 3 | 4, 5 |
\| | Alternation (or) | - | 1\|2 | 1, 2 | 3 |
() | Grouping | - | (12)+ | 12, 1212 | 1, 123 |
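The Match and Does not match columns for the repetition, set, alternation, and grouping rows read most naturally as whole-string matches. The sketch below checks those rows with `re.fullmatch`; that reading is my assumption, and the `^` and `$` rows are left out because they describe position rather than whole-string matching. The helper `check_row` is a hypothetical name for this example.

```python
import re

# (pattern, strings expected to match, strings expected not to match)
rows = [
    (r"1.3", ["123", "133"], ["1223"]),
    (r"12*", ["1", "12", "122"], ["11", "22"]),
    (r"12+", ["12", "122"], ["1", "11", "22"]),
    (r"12?", ["1", "12"], ["122"]),
    (r"1{3}", ["111"], ["11", "1111"]),
    (r"1{2,3}", ["11", "111"], ["1", "1111"]),
    (r"[1-3]", ["1", "2", "3"], ["4", "5"]),
    (r"1|2", ["1", "2"], ["3"]),
    (r"(12)+", ["12", "1212"], ["1", "123"]),
]

def check_row(pattern, should_match, should_not_match):
    """Return True if every example behaves as the table says."""
    positives_ok = all(re.fullmatch(pattern, s) for s in should_match)
    negatives_ok = not any(re.fullmatch(pattern, s) for s in should_not_match)
    return positives_ok and negatives_ok

results = {pattern: check_row(pattern, yes, no) for pattern, yes, no in rows}
print(all(results.values()))  # True
```

Note that `1{2,3}` is written without a space; with a space inside the braces, Python treats the braces as literal text rather than a repetition count.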
I often use the following functions.
Function | Purpose |
---|---|
match | Determine whether the regular expression matches at the beginning of the string |
search | Find where the regular expression matches |
findall | Finds all matching substrings and returns them as a list |
sub | String replacement |
`re.match` matches only at the beginning of the string, while `re.search` matches at any position in the string. See "search() vs. match()" in the official documentation for more information. Both return only the first match; the second and later matches are not returned.
>>> re.match("c", "abcdef")    # No match: the string does not begin with "c"
>>> re.search("c", "abcdef")   # Match
<re.Match object; span=(2, 3), match='c'>
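Because a failed match returns `None` rather than a match object, results usually need a check before use. A minimal sketch of the difference, with variable names chosen only for this example:

```python
import re

text = "abcdef"

at_start = re.match("c", text)    # None: "c" is not at the beginning
anywhere = re.search("c", text)   # match object covering position 2 to 3
anchored = re.search("^c", text)  # None: "^" makes search behave like match here

for label, m in [("match", at_start), ("search", anywhere), ("search ^", anchored)]:
    # Guard against None before calling methods on the result.
    print(label, m.span() if m else None)
```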
The result is retrieved with `group`. `group(0)` contains the entire match, and the captured groups are numbered sequentially from 1.
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0) # The entire match
'Isaac Newton'
>>> m.group(1) # The first parenthesized subgroup.
'Isaac'
>>> m.group(2) # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2) # Multiple arguments give us a tuple.
('Isaac', 'Newton')
findall
`findall` returns every substring that matches the pattern, as a list.
>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']
You can specify what to capture with `()`. If you specify more than one group, each match is returned as a tuple with one element per group, as follows.
>>> print(re.findall(r'''(1st)(2nd)''', '1st2nd1st2nd'))
[('1st', '2nd'), ('1st', '2nd')]
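The shape of the `findall` result depends on how many capturing groups the pattern has. The sketch below shows the three cases on a made-up date string (the text and variable names are assumptions for illustration).

```python
import re

text = "2021-01-15 and 2022-12-31"

# No group: a list of whole matches.
no_group = re.findall(r"\d{4}-\d{2}-\d{2}", text)
# One group: a list of that group's strings only.
one_group = re.findall(r"(\d{4})-\d{2}-\d{2}", text)
# Several groups: a list of tuples, one element per group.
many_groups = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)

print(no_group)     # ['2021-01-15', '2022-12-31']
print(one_group)    # ['2021', '2022']
print(many_groups)  # [('2021', '01', '15'), ('2022', '12', '31')]
```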
sub
`sub` replaces text. The arguments are, in order: 1. the regular expression pattern, 2. the replacement string, 3. the string to be processed.
>>> re.sub(r'Before replacement', 'After replacement', 'Before replacement Not applicable Before replacement')
'After replacement Not applicable After replacement'
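`re.sub` also accepts the optional `count` and `flags` parameters. Passing them by keyword keeps the two from being mixed up. A minimal sketch with a made-up sample string:

```python
import re

text = "Hello hello HELLO"

# flags by keyword: replace every occurrence, ignoring case
all_replaced = re.sub(r"hello", "bye", text, flags=re.IGNORECASE)

# count by keyword: replace only the first occurrence
first_only = re.sub(r"hello", "bye", text, count=1, flags=re.IGNORECASE)

print(all_replaced)  # bye bye bye
print(first_only)    # bye hello HELLO
```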
The fourth argument is `count`, and the fifth is `flags` (compile flags). I kept passing a compile flag without noticing the fourth argument, and wasted about 30 minutes getting nowhere.
The following compile flags are used often. Pass them to the `flags` parameter of a function.
flag | meaning |
---|---|
DOTALL | Makes `.` match any character, including line breaks |
IGNORECASE | Case insensitive |
MULTILINE | Makes `^` and `$` match at the beginning and end of each line of a multi-line string |
VERBOSE | Ignore comments and whitespace in regular expressions |
I will explain DOTALL and MULTILINE in a little more detail. VERBOSE has already been covered above.
DOTALL
`re.DOTALL` makes the wildcard `.` (dot) match newlines as well.
string = '''\
1st line
2nd line'''
print(re.findall(r'1st.*2nd', string, re.DOTALL))
# ['1st line\n2nd']
print(re.findall(r'1st.*2nd', string))
# [] (no match)
See the article "Python: Replacing multi-line matching with regular expressions" for details.
MULTILINE
Use this when you want to search each line of a multi-line string individually. In the example below, with `re.MULTILINE` the second line ("Beginning of line 2nd line") is also found.
Note: with the `match` function, `re.MULTILINE` has no effect, because `match` only ever matches at the beginning of the string.
string = '''\
Beginning of line 1st line
Beginning of line 2nd line'''
print(re.findall(r'^Beginning of line.*', string, re.MULTILINE))
# ['Beginning of line 1st line', 'Beginning of line 2nd line']
print(re.findall(r'^Beginning of line.*', string))
# ['Beginning of line 1st line']
See the article "Python: Replacing multi-line matching with regular expressions" for details.
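The interaction between `re.MULTILINE` and the different functions can be sketched as follows. The sample text and variable names are assumptions made up for this example.

```python
import re

text = "first line\nsecond line"

# re.match is anchored at the start of the string, even with re.MULTILINE.
match_result = re.match(r"second", text, re.MULTILINE)

# re.search with ^ and re.MULTILINE finds the start of the second line.
search_result = re.search(r"^second", text, re.MULTILINE)

# $ also matches at the end of each line in MULTILINE mode.
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

print(match_result)           # None
print(search_result.group())  # second
print(line_ends)              # ['line', 'line']
```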
Tips
If you write a group as `(?:...)`, the group is **not included in the search results** and is not captured. The official "Regular Expression Syntax" section explains:
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
In the example below, the `4` is used as part of the regular expression pattern, but it does not appear in the result.
>>> re.findall(r'(.012)(?:4)', 'A0123 B0124 C0123')
['B012']
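For comparison, the sketch below runs the same search with an ordinary capturing group, and also shows that the non-captured `4` is still part of the overall match. The variable names are chosen only for this illustration.

```python
import re

text = 'A0123 B0124 C0123'

non_capturing = re.findall(r'(.012)(?:4)', text)
capturing = re.findall(r'(.012)(4)', text)
whole_match = re.search(r'(.012)(?:4)', text).group(0)

print(non_capturing)  # ['B012']
print(capturing)      # [('B012', '4')]
print(whole_match)    # B0124
```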
**You can control the length of the matched string.** A greedy match takes the longest possible match, and a non-greedy match takes the shortest.
The default is greedy. To make a match non-greedy, append `?` to a repetition qualifier (`*`, `?`, `+`). Examples of both follow.
# Greedy match
>>> print(re.findall(r'.0.*2', 'A0123 B0123'))
['A0123 B012']
# Non-greedy match (? added after the *)
>>> print(re.findall(r'.0.*?2', 'A0123 B0123'))
['A012', 'B012']
See the article "Greedy and non-greedy matches" for more information.
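A common place where the difference shows up is text with repeated delimiters. The sketch below uses a made-up string containing tags; it is an illustration only.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string.
greedy = re.findall(r'<.+>', text)
# Non-greedy: .+? stops at the first '>' it can.
non_greedy = re.findall(r'<.+?>', text)

print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(non_greedy)  # ['<b>', '</b>', '<i>', '</i>']
```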
You can use `\number` to match the contents of an earlier group. The official syntax documentation describes it as follows.
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
Concretely, in the example below `\1` stands for whatever the earlier `(ab)` matched, so the pattern matches `abcab`. `abddd` does not match, because its 4th and 5th characters are not `ab`.
>>> print(re.findall(r'''(ab).\1''', 'abcab abddd'))
['ab']
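One typical use of a backreference is spotting a word that is repeated by mistake. The sketch below is an illustration with a made-up sentence; the `\b` word boundaries are my addition so that the group cannot start or end in the middle of a word.

```python
import re

text = 'this is is a test'

# \1 must repeat exactly what group 1 captured.
doubled = re.search(r'\b(\w+) \1\b', text)

# The same group can be referenced in the replacement string.
fixed = re.sub(r'\b(\w+) \1\b', r'\1', text)

print(doubled.group(0))  # is is
print(doubled.group(1))  # is
print(fixed)             # this is a test
```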
There are four assertions that let you require or forbid a string as a search condition without including it in the match itself.
- Positive Lookahead Assertions
- Negative Lookahead Assertions
- Positive Lookbehind Assertions
- Negative Lookbehind Assertions
Arranged as a matrix, they look like this.
 | Positive | Negative |
---|---|---|
Lookahead | (?=...) Matches if ... comes next | (?!...) Matches if ... does not come next |
Lookbehind | (?<=...) Matches if ... matches immediately before the current position | (?<!...) Matches if ... does not match immediately before the current position |
A concrete example is easier to understand than a detailed explanation.
>>> string = 'A01234 B91235 C01234'
# Positive lookahead assertion
# '123' followed by '5' (the '5' checked by '(?=5)' is captured only because the trailing '.' consumes it)
>>> print(re.findall(r'..123(?=5).', string))
['B91235']
# Negative lookahead assertion
# '123' not followed by '5' (the character checked by '(?!5)' is captured only because the trailing '.' consumes it)
>>> print(re.findall(r'..123(?!5).', string))
['A01234', 'C01234']
# Positive lookbehind assertion
# '123' preceded by '0' (the '0' checked by '(?<=0)' is captured only because the leading '..' consumes it)
>>> print(re.findall(r'..(?<=0)123', string))
['A0123', 'C0123']
# Negative lookbehind assertion
# '123' not preceded by '0' (the character checked by '(?<!0)' is captured only because the leading '..' consumes it)
>>> print(re.findall(r'..(?<!0)123', string))
['B9123']
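The examples above wrap each assertion in `.` characters so that the checked character shows up in the result. Dropping those dots makes it visible that the assertions themselves consume nothing. A minimal sketch on the same string, with variable names chosen only for this illustration:

```python
import re

string = 'A01234 B91235 C01234'

# The '5' is required but not included in the match.
ahead = re.findall(r'..123(?=5)', string)
# The '0' is required before '123' but not included in the match.
behind = re.findall(r'(?<=0)123', string)
# Only the '123' that is not preceded by '0' (the one in B91235).
not_behind = re.findall(r'(?<!0)123', string)

print(ahead)       # ['B9123']
print(behind)      # ['123', '123']
print(not_behind)  # ['123']
```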