This article is about Python regular expressions. Until now I have googled, looked things up, and implemented them whenever I needed them, but I decided it was time to deepen my understanding. "Googling when needed and implementing it" sounds fine, but it is still beginner level. I wrote this with **what a beginner should learn from scratch** and **what an occasional user should relearn** in mind. The article organizes what I learned from ["Chapter 3: Regular Expressions"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch3) of Language Processing 100 Knocks 2015.

Reference links

Link	Remarks
Regular Expression HOWTO	Official Python regular expression how-to
re --- Regular expression operations	Official Python description of the re package

Basics

Python implements regular expressions in the re package. `import re` is omitted in the Python snippets that follow.

import re

Two ways to use

1. Use the functions directly

Use functions such as re.match and re.sub.

# The first argument is the regular expression pattern (search term); the second is the string to search
result = re.match('Hel', 'Hello python')

print(result)
# <re.Match object; span=(0, 3), match='Hel'>  (Python 3.7+; older versions show _sre.SRE_Match)

print(result.group())
# Hel

2. Compile and use

Compile the regular expression pattern first, then call methods such as match and sub on the compiled object.

# Compile the regular expression pattern in advance
regex = re.compile('Hel')

result = regex.match('Hello python')

print(result)
# <re.Match object; span=(0, 3), match='Hel'>  (Python 3.7+; older versions show _sre.SRE_Match)

print(result.group())
# Hel

Choosing between the two

**If you use many regular expression patterns, each of them many times, use the compile approach.** The official documentation says the following.

Using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program. The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time do not need to worry about compiling them.

So, because of this cache, if you only reuse the same few patterns over and over, compiling in advance does not seem to bring much of a speed advantage. I have not checked how many patterns are cached.
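As a rough sketch of the two styles side by side (the names `WORD`, `count_words_direct` and `count_words_compiled` are made up for this illustration and are not from the official documentation), both give the same result. The only difference is whether the pattern is resolved on every call or once up front.

```python
import re

# Hypothetical pattern and helper names, chosen only for this illustration.
WORD = re.compile(r'\w+')  # compiled once, reused on every call

def count_words_direct(lines):
    # Module-level function: the pattern string is resolved through re's cache on each call
    return sum(len(re.findall(r'\w+', line)) for line in lines)

def count_words_compiled(lines):
    # Pre-compiled pattern object: no lookup by pattern string is needed
    return sum(len(WORD.findall(line)) for line in lines)

lines = ['Hello python', 'regular expressions are fun']
print(count_words_direct(lines), count_words_compiled(lines))
# 6 6
```

If speed matters for your program, measuring both versions with timeit on your own data is more reliable than guessing.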

Definition of regular expression patterns (search terms)

Disabling escape sequences with raw strings

Raw strings are not specific to regular expressions, but they let you **disable escape sequences**.

In the first line of the following example, \t becomes a tab and \n becomes a line feed. In the second line, \t and \n are kept as they are, as literal characters.

print('a\tb\nA\tB')
print(r'a\tb\nA\tB')

`Terminal output result`


a	b
A	B

a\tb\nA\tB

**I don't want to escape every backslash in a regular expression pattern, so I use raw strings.**

result = re.match(r'\d', '329')

The articles "Writing regular expressions using raw Python strings" and "Ignore (disable) escape sequences in Python: raw strings" explain this in detail.
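A small sketch of why this matters in practice, using `\b` as the example (the sample text is made up). In a normal string `'\b'` is the backspace character, so the pattern silently means something else.

```python
import re

text = 'cat concat'

# Normal string: '\b' is the backspace character (0x08), not a word boundary
print(re.findall('\bcat\b', text))
# []

# Raw string: the backslash reaches the regex engine, so \b means "word boundary"
print(re.findall(r'\bcat\b', text))
# ['cat']

# Without a raw string, every backslash has to be doubled
print(r'\bcat\b' == '\\bcat\\b')
# True
```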

Ignore line breaks, comments and whitespace with triple quotes and `re.VERBOSE`

Enclosing the regular expression pattern in triple quotes (''' or """) lets you write it across multiple lines (a single line is fine too). Passing re.VERBOSE additionally makes the engine ignore whitespace and comments inside the pattern. **Triple quotes combined with re.VERBOSE make patterns far more readable.** For example, the following pattern is easy to follow.

a = re.compile(r'''\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits''', re.VERBOSE)

You can read more about triple quotes in the article "String generation in Python (quotes, str constructor)" (https://note.nkmk.me/python-str-literal-constructor/).
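To see the verbose pattern in action, here is a minimal sketch that applies the same decimal-number pattern to a made-up sample sentence (the variable names `decimal` and `found` are only for illustration).

```python
import re

decimal = re.compile(r'''
    \d+   # the integral part
    \.    # the decimal point
    \d*   # some fractional digits
''', re.VERBOSE)

found = decimal.findall('pi is 3.14, e is 2.718, ten is 10')
print(found)
# ['3.14', '2.718']
```

Because re.VERBOSE ignores whitespace in the pattern, a literal space has to be written as `\ ` or `[ ]` when you actually want to match one.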

By the way, if you want to pass multiple compile flags in the flags parameter, combine them with | (bitwise OR). Simply adding them with + also works as long as each flag appears only once.

a = re.compile(r'''\d''', re.VERBOSE | re.MULTILINE)
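The following sketch shows why combining flags with `|` is the safer habit: flags are bit values, so adding the same flag twice with `+` produces a different flag, while `|` is unaffected. The sample text and the variable name `pattern` are made up for illustration.

```python
import re

# | combines flags safely, even if the same flag appears twice
print(int(re.IGNORECASE | re.IGNORECASE) == int(re.IGNORECASE))
# True

# + adds the underlying integers, so repeating a flag yields a different value
print(int(re.IGNORECASE + re.IGNORECASE) == int(re.IGNORECASE))
# False

pattern = re.compile(r'''
    ^hello   # the word at the start of a line
''', re.VERBOSE | re.MULTILINE | re.IGNORECASE)
print(pattern.findall('Hello\nHELLO\nsay hello'))
# ['Hello', 'HELLO']
```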

Special characters

Character	Description	Remarks	Example	Match	Does not match
\d	Digit	Same as [0-9]
\D	Non-digit	Same as [^0-9]
\s	Whitespace character	Same as [ \t\n\r\f\v]
\S	Non-whitespace character	Same as [^ \t\n\r\f\v]
\w	Alphanumeric character or underscore	Same as [a-zA-Z0-9_]
\W	Anything other than an alphanumeric character or underscore	Same as [^a-zA-Z0-9_]
\A	Beginning of the string	Similar to ^
\Z	End of the string	Similar to $
\b	Word boundary (for example, next to a space)
.	Any single character (except a newline by default)	-	1.3	123, 133	1223
^	Beginning of the string	-	^123	1234	0123
$	End of the string	-	123$	0123	1234
*	Repeat 0 or more times	-	12*	1, 12, 122	11, 22
+	Repeat 1 or more times	-	12+	12, 122	1, 11, 22
?	0 times or 1 time	-	12?	1, 12	122
{m}	Repeat m times	-	1{3}	111	11, 1111
{m,n}	Repeat m to n times	-	1{2,3}	11, 111	1, 1111
[]	Set	[^5] means anything other than 5	[1-3]	1, 2, 3	4, 5
|	Union (or)	-	1|2	1, 2	3
()	Grouping	-	(12)+	12, 1212	1, 123

The "Same as" equivalences are exact for ASCII text (or when re.ASCII is passed). For str patterns, \d, \s and \w also match the corresponding Unicode characters.
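The Match / Does not match columns in the table can be checked mechanically. This is a small sketch that assumes the columns describe matching the whole string, which is why it uses re.fullmatch; the `cases` list is just a selection of rows copied from the table.

```python
import re

# (pattern, candidate, expected) triples taken from the table above
cases = [
    (r'1.3', '123', True), (r'1.3', '1223', False),
    (r'12*', '1', True), (r'12*', '11', False),
    (r'12+', '122', True), (r'12+', '1', False),
    (r'12?', '12', True), (r'12?', '122', False),
    (r'1{2,3}', '111', True), (r'1{2,3}', '1111', False),
    (r'[1-3]', '2', True), (r'[1-3]', '4', False),
    (r'1|2', '2', True), (r'1|2', '3', False),
    (r'(12)+', '1212', True), (r'(12)+', '123', False),
]

# fullmatch succeeds only when the whole candidate string matches the pattern
results = [bool(re.fullmatch(p, s)) == expected for p, s, expected in cases]
print(all(results))
# True
```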

Matching functions

I often use the following functions.

Function	Purpose
match	Determines whether the regular expression matches at the beginning of the string
search	Finds the first place where the regular expression matches
findall	Finds all matching substrings and returns them as a list
sub	Replaces matching substrings

match and search

re.match matches only at the beginning of the string, while re.search matches at any position in the string. See "search() vs. match()" in the official documentation for details. Both return only the first match (they do not return the second and later matches).

>>> re.match("c", "abcdef")    # No match: the string does not start with "c"
>>> re.search("c", "abcdef")   # Match
<re.Match object; span=(2, 3), match='c'>
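Both functions return None when nothing matches, so calling group() directly on the result can raise AttributeError. A minimal sketch of the usual guard follows; `first_match` is a hypothetical helper name, not part of the re module.

```python
import re

def first_match(pattern, text):
    # Hypothetical helper: return the matched text, or None when nothing matches
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match(r'c', 'abcdef'))
# c
print(first_match(r'x', 'abcdef'))
# None
print(re.match(r'c', 'abcdef') is None)
# True
```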

The result is retrieved with group. group(0) holds the entire match, and the parenthesized groups are numbered sequentially from 1.

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')

findall

findall returns all the strings that match the pattern as a list.

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']

You can specify what to capture with (). If you specify more than one group, each match is returned as a tuple with one element per group, as follows.

>>> print(re.findall(r'''(1st)(2nd)''', '1st2nd1st2nd'))
[('1st', '2nd'), ('1st', '2nd')]
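The return shape of findall depends on how many groups the pattern has. The following sketch, using a made-up date string, shows the three cases next to each other.

```python
import re

text = '2015-04-01 2016-12-31'

# No group: the whole match is returned
no_group = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(no_group)
# ['2015-04-01', '2016-12-31']

# One group: only the group's text is returned
one_group = re.findall(r'(\d{4})-\d{2}-\d{2}', text)
print(one_group)
# ['2015', '2016']

# Two or more groups: one tuple per match
two_groups = re.findall(r'(\d{4})-(\d{2})-\d{2}', text)
print(two_groups)
# [('2015', '04'), ('2016', '12')]
```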

sub

sub replaces strings. The arguments are, in order: 1. the regular expression pattern, 2. the replacement string, 3. the string to operate on.

>>> re.sub(r'Before replacement', 'After replacement', 'Before replacement Not applicable Before replacement')
'After replacement Not applicable After replacement'

By the way, the 4th argument is count and the 5th is flags (compile flags). I kept passing a compile flag as the 4th argument without noticing, and wasted about 30 minutes wondering why it did not work... Passing it by keyword, as flags=..., avoids this mistake.
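A short sketch of passing count and flags by keyword, which sidesteps the argument-order trap described above (the sample text is made up).

```python
import re

text = 'Hello hello HELLO'

# Case-insensitive replacement of every occurrence
replaced_all = re.sub(r'hello', 'bye', text, flags=re.IGNORECASE)
print(replaced_all)
# bye bye bye

# Replace only the first two occurrences
replaced_two = re.sub(r'hello', 'bye', text, count=2, flags=re.IGNORECASE)
print(replaced_two)
# bye bye HELLO
```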

Compile Flags

The following compile flags are the ones I use most often. Pass them to the flags parameter of the functions.

Flag	Meaning
DOTALL	Makes `.` match any character, including newlines
IGNORECASE	Case-insensitive matching
MULTILINE	Makes `^` and `$` match at the start and end of each line of a multi-line string
VERBOSE	Ignores comments and whitespace in the regular expression

Below I explain the ones that are easy to confuse in a bit more detail, leaving out VERBOSE, which was already covered.

DOTALL

re.DOTALL is an option that makes the wildcard . (dot) also match a newline.

string = '''\
1st line
2nd line'''

print(re.findall(r'1st.*2nd', string, re.DOTALL))
# ['1st line\n2nd']

print(re.findall(r'1st.*2nd', string))
# [] (no match)

See the article Python: Replacing multi-line matching with regular expressions for details.

MULTILINE

Use this when you want to search each line of a multi-line string individually. In the example below, with re.MULTILINE the second line ("2nd line") is also matched, because ^ then matches at the start of every line. Note that with the match function, re.MULTILINE makes no difference, since match only looks at the beginning of the string.

string = '''\
1st line
2nd line'''

print(re.findall(r'^\w+ line', string, re.MULTILINE))
# ['1st line', '2nd line']

print(re.findall(r'^\w+ line', string))
# ['1st line']

See the article Python: Replacing multi-line matching with regular expressions for details.
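The remark that re.MULTILINE does not help the match function can be checked with a short sketch (the sample text is the same two-line string, written with an explicit \n).

```python
import re

text = '1st line\n2nd line'

# match only ever looks at the very beginning of the string, even with MULTILINE
print(re.match(r'^2nd', text, re.MULTILINE))
# None

# search with MULTILINE finds the start of the second line
print(re.search(r'^2nd', text, re.MULTILINE).group())
# 2nd

# search without MULTILINE: ^ matches only at the start of the whole string
print(re.search(r'^2nd', text))
# None
```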

Tips

Not subject to capture

If you write a group as (?:...), it is **not included in the search result strings and is not captured**. The official Regular Expression Syntax explains:

A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

In the example below, the 4 is part of the regular expression pattern, but it does not appear in the result.

>>> re.findall(r'(.012)(?:4)', 'A0123 B0124 C0123')
['B012']
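A typical use is grouping alternatives without changing what findall returns. The sketch below compares a capturing and a non-capturing group on a made-up string of CSS-like lengths.

```python
import re

text = '12px 3em'

# Capturing group for the unit: findall returns (number, unit) tuples
with_unit = re.findall(r'(\d+)(px|em)', text)
print(with_unit)
# [('12', 'px'), ('3', 'em')]

# Non-capturing group: the unit must still be present, but only the number is returned
numbers_only = re.findall(r'(\d+)(?:px|em)', text)
print(numbers_only)
# ['12', '3']
```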

Greedy / non-greedy match

**You can control the length of the matched string.** A greedy match takes the longest possible match, and a non-greedy match takes the shortest. The default is greedy; to make a match non-greedy, append ? to a quantifier (*, +, ?). Both are shown below.

# Greedy match
>>> print(re.findall(r'.0.*2',  'A0123 B0123'))
['A0123 B012']

# Non-greedy match (? after *)
>>> print(re.findall(r'.0.*?2', 'A0123 B0123'))
['A012', 'B012']

See the article "Greedy and non-greedy matches" for more information.
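A common place where the difference shows up is matching tag-like text. This is only a sketch on a made-up string, not a recommendation to parse HTML with regular expressions.

```python
import re

html = '<b>bold</b>'

# Greedy: .* runs to the last '>' in the string
greedy = re.findall(r'<.*>', html)
print(greedy)
# ['<b>bold</b>']

# Non-greedy: .*? stops at the first '>' it can
non_greedy = re.findall(r'<.*?>', html)
print(non_greedy)
# ['<b>', '</b>']
```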

Back reference

You can use \number to match the contents of an earlier group. The official syntax documentation describes it as follows.

Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it is interpreted not as a group match but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.

Specifically, in the example below, \1 matches the same text that the earlier (ab) matched, so abcab matches. abddd does not match, because its 4th and 5th characters are not ab.

Python lists and the like are indexed from 0, but regular expression groups are numbered from 1.

>>> print(re.findall(r'''(ab).\1''', 'abcab abddd'))
['ab']
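One practical use of a back reference is finding accidentally doubled words. The following is a minimal sketch on a made-up sentence; it also uses \1 in the replacement string of re.sub to keep a single copy.

```python
import re

text = 'the the cat sat sat down'

# \1 must repeat exactly what group 1 matched
doubled = re.findall(r'\b(\w+) \1\b', text)
print(doubled)
# ['the', 'sat']

# In the replacement string, \1 inserts the text captured by group 1
cleaned = re.sub(r'\b(\w+) \1\b', r'\1', text)
print(cleaned)
# the cat sat down
```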

Look-ahead / look-behind assertions

There are four ways to make a string part of the search condition, either requiring it or forbidding it, without including it in the match itself.

- Positive lookahead assertion
- Negative lookahead assertion
- Positive lookbehind assertion
- Negative lookbehind assertion

Arranged as a matrix:

	Positive	Negative
Look-ahead	`(?=...)` Matches if `...` comes next	`(?!...)` Matches if `...` does not come next
Look-behind	`(?<=...)` Matches if `...` matches immediately before the current position	`(?<!...)` Matches if `...` does not match immediately before the current position

A concrete example is easier to understand than a detailed explanation.

>>> string = 'A01234 B91235 C01234'

# Positive lookahead assertion
# Strings where '123' is followed by '5' ('(?=5)' itself consumes nothing; the '5' is in the result only because of the trailing '.')
>>> print(re.findall(r'..123(?=5).', string))
['B91235']

# Negative lookahead assertion
# Strings where '123' is not followed by '5' ('(?!5)' itself consumes nothing; the last character is in the result only because of the trailing '.')
>>> print(re.findall(r'..123(?!5).', string))
['A01234', 'C01234']

# Positive lookbehind assertion
# Strings where '0' comes right before '123' ('(?<=0)' itself consumes nothing; the '0' is in the result only because of the leading '..')
>>> print(re.findall(r'..(?<=0)123', string))
['A0123', 'C0123']

# Negative lookbehind assertion
# Strings where '0' does not come right before '123' ('(?<!0)' itself consumes nothing; the preceding characters are in the result only because of the leading '..')
>>> print(re.findall(r'..(?<!0)123', string))
['B9123']
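Without the surrounding dots, the assertions act as pure conditions, which is how they are usually used. A small sketch on made-up strings: digits are extracted only when a certain neighbor is present, and the neighbor itself stays out of the result.

```python
import re

# Lookahead: numbers that are followed by 'px' (the unit is not part of the match)
pixels = re.findall(r'\d+(?=px)', '10px 20em 30px')
print(pixels)
# ['10', '30']

# Lookbehind: numbers that are preceded by '$' (the sign is not part of the match)
prices = re.findall(r'(?<=\$)\d+', 'apple $30, pear 45, kiwi $7')
print(prices)
# ['30', '7']
```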

Python regular expression basics and tips to learn from scratch