While working through the 100 language processing knocks (NLP 100 Exercise), I learned a few new things about re, Python's regular expression module, so I'll summarize them here.

Difference between search and match

search matches even if the pattern does not start at the beginning of the target string.
match matches only if the pattern matches from the beginning of the target string.

For example, suppose you want to match the pattern ape against the string grape. With search, it matches because grape contains ape. With match, it does not match: grape contains ape, but does not start with it.

import re
m = re.match('ape', 'grape')   # No match: m is None
m = re.search('ape', 'grape')  # Match: m is a match object
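
As a small sketch of the difference: both functions return a match object on success and None on failure, so the result is usually checked before use. The helper name describe below is made up for this illustration and is not part of re.

```python
import re

def describe(m):
    """Return the matched text, or None when there was no match."""
    return m.group(0) if m else None

# match only succeeds when the pattern matches at the start of the string
print(describe(re.match('ape', 'grape')))   # None
print(describe(re.match('gr', 'grape')))    # gr

# search scans the whole string for the first place the pattern matches
print(describe(re.search('ape', 'grape')))  # ape
```

Calling m.group() directly on a failed match raises AttributeError, because None has no group method, which is why the check comes first.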

Take out the matched part

When you match with a regular expression, you may want to retrieve the matched part.

In the following example, we want to retrieve '1993', '7', and '2' from a Japanese-style date.

s = "1993年7月2日生まれ"  # "Born July 2, 1993"
p = "([0-9]+)年([0-9]+)月([0-9]+)日"  # 年 = year, 月 = month, 日 = day

m = re.search(p, s)

The parts of the target string (s in the snippet above) that match the parts of the pattern enclosed in () can be retrieved with the m.group(n) method. The argument n works as follows.

String that matches the entire pattern: n = 0
String that matches the i-th () from the beginning: n = i

In other words, the example above gives the following.

m.group(0) # -> '1993年7月2日'
m.group(1) # -> '1993'
m.group(2) # -> '7'
m.group(3) # -> '2'

As a caveat, the return value of .group() is always a string, not a number, so convert it with int() or similar if you want to treat it as a number.
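
A minimal sketch of that conversion, assuming the date is written in the Japanese 年/月/日 style. The use of m.groups() and datetime.date is my own addition for illustration and does not come from the steps above.

```python
import re
from datetime import date

s = "1993年7月2日生まれ"  # "Born July 2, 1993"
p = "([0-9]+)年([0-9]+)月([0-9]+)日"

m = re.search(p, s)

# groups() returns every captured group as a tuple of strings
year, month, day = (int(g) for g in m.groups())

# Once converted, the values can be used wherever numbers are required
birthday = date(year, month, day)
print(birthday)  # 1993-07-02
```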

Give a name to the part to match

With the approach in the previous section, it can be hard to tell which number refers to which part. In that case, write the group as (?P<name>regex), and you can retrieve it by passing that name as the argument to .group().

Specifically, do as follows.

s = "1993年7月2日生まれ"  # "Born July 2, 1993"
p = "(?P<year>[0-9]+)年(?P<month>[0-9]+)月(?P<day>[0-9]+)日"

m = re.search(p, s)

m.group(0) # -> '1993年7月2日'
m.group('year') # -> '1993'
m.group('month') # -> '7'
m.group('day') # -> '2'
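
As a possible extension (not covered above), a match object also has a groupdict() method that returns all named groups at once. The sketch below assumes the date is written in the Japanese 年/月/日 style.

```python
import re

s = "1993年7月2日生まれ"  # "Born July 2, 1993"
p = "(?P<year>[0-9]+)年(?P<month>[0-9]+)月(?P<day>[0-9]+)日"

m = re.search(p, s)

# groupdict() maps each group name to the text it matched
parts = m.groupdict()
print(parts)  # {'year': '1993', 'month': '7', 'day': '2'}

# Named groups are still numbered, so both styles of access work
print(m.group(1) == m.group('year'))  # True

# The values are strings, so convert them if numbers are needed
numbers = {name: int(value) for name, value in parts.items()}
print(numbers)  # {'year': 1993, 'month': 7, 'day': 2}
```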

Take out all the matched parts

When matching against a long sentence, you may want to extract every matching part, for example all the words that start with "con". In that case, use re.findall().

s = "It's convenient to conclude you are conservative."
p = r'con\w+'  # raw string, so \w reaches the regex engine unchanged

m = re.findall(p, s)

m # -> ['convenient', 'conclude', 'conservative']
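
One thing worth knowing when combining findall with (): if the pattern contains a group, findall returns the text of the group rather than the whole match. The following sketch shows this, along with re.finditer, which yields match objects instead of strings. Both points are my own additions for illustration.

```python
import re

s = "It's convenient to conclude you are conservative."

# No group in the pattern: each element is the whole match
whole = re.findall(r'con\w+', s)
print(whole)  # ['convenient', 'conclude', 'conservative']

# One group in the pattern: each element is only the group's text
tails = re.findall(r'con(\w+)', s)
print(tails)  # ['venient', 'clude', 'servative']

# finditer yields match objects, so positions are available as well
starts = [m.start() for m in re.finditer(r'con\w+', s)]
print(starts)
```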

Matching newline characters

If you use . in the pattern string, it matches any character except the newline character (\n).

For example, in the following case you might expect the whole of 'abc=def\nghi\njkl' to match, but only 'abc=def' matches.

s = 'abc=def\nghi\njkl'
p = '^abc=.+'

m = re.search(p, s)
m.group() # -> 'abc=def'

This is because the metacharacter . does not match \n by default. In such a case, pass the re.DOTALL flag as the third argument of search().

s = 'abc=def\nghi\njkl'
p = '^abc=.+'

m = re.search(p, s, re.DOTALL)
m.group() # -> 'abc=def\nghi\njkl'

Right? Isn't it easy?
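
For reference, here is a short sketch of two equivalent spellings that are not used above: re.S is an alias of re.DOTALL, and the inline flag (?s) at the start of the pattern has the same effect without a third argument.

```python
import re

s = 'abc=def\nghi\njkl'

# Default: . stops at the first newline
a = re.search('^abc=.+', s).group()

# re.S is the short alias of re.DOTALL
b = re.search('^abc=.+', s, re.S).group()

# (?s) at the very start of the pattern turns DOTALL on inline
c = re.search('(?s)^abc=.+', s).group()

print(repr(a))  # 'abc=def'
print(repr(b))  # 'abc=def\nghi\njkl'
print(b == c)   # True
```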

Matching multiple lines of text

When web scraping, there may be cases where you want to retrieve only the lines that start with a specific tag. (I've never done it before)

s = """<p>Peter Piper picked a peck of pickled peppers.</p>
<hr>
<p>A peck of pickled peppers Peter Piper picked.</p>
<p>If Peter Piper picked a peck of pickled peppers,</p>
<hr>
<p>Where's the peck of pickled peppers that Peter Piper picked?</p>"""

p = "^<p>.+$"

Each sentence is wrapped in <p>, with <hr> horizontal rules in between. Suppose you want to extract only the lines that start with <p> from this text.

findall is used here, but the text contains newline characters. With the re.MULTILINE flag, ^ and $ match at the beginning and end of every line, so the pattern is effectively applied line by line.

m = re.findall(p, s, re.MULTILINE)
m
# ['<p>Peter Piper picked a peck of pickled peppers.</p>',
#  '<p>A peck of pickled peppers Peter Piper picked.</p>',
#  '<p>If Peter Piper picked a peck of pickled peppers,</p>',
#  "<p>Where's the peck of pickled peppers that Peter Piper picked?</p>"]

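Only the last element is shown in double quotes. That is because it contains an apostrophe (Where's), and Python displays such a string in double quotes so that nothing needs escaping. Convenient.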
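
To see what the flag changes, here is a small sketch with a shortened, made-up input. Without re.MULTILINE, ^ only matches at the very start of the string and $ only at the very end, so the same pattern finds nothing.

```python
import re

# Shortened sample text, invented for this comparison
s = "<p>first</p>\n<hr>\n<p>second</p>"
p = "^<p>.+$"

# Without the flag, no single line spans from ^ to $
without_flag = re.findall(p, s)
print(without_flag)  # []

# With the flag, ^ and $ also match around each newline
with_flag = re.findall(p, s, re.MULTILINE)
print(with_flag)  # ['<p>first</p>', '<p>second</p>']

# The same result without a regex, for comparison
by_lines = [line for line in s.splitlines() if line.startswith('<p>')]
print(by_lines == with_flag)  # True
```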

References

re — Regular expression operations — Python 3.9.1 documentation
https://docs.python.org/3/library/re.html
Peter Piper - Wikipedia
https://ja.wikipedia.org/wiki/%E3%83%94%E3%83%BC%E3%82%BF%E3%83%BC%E3%83%BB%E3%83%91%E3%82%A4%E3%83%91%E3%83%BC

Makes you think that Python regular expressions are great