While working through the 100 language processing knocks, I learned a few new things about Python's re module for regular expressions, so I'll summarize them here.
For example, suppose you want to match the pattern ape against the string grape. If you use search, it matches because grape contains ape. If you use match, it does not match, because ape is contained in the string but does not start at the beginning.
import re
m = re.match('ape', 'grape')   # Does not match (m is None)
m = re.search('ape', 'grape')  # Matches
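As a small sketch of the difference, the following inspects the return values directly. Both functions return None when nothing matches. re.fullmatch is not covered in this article and is included here only for comparison.

```python
import re

# match() only tries at the start of the string, so 'ape' is not found in 'grape'.
m_match = re.match('ape', 'grape')
# search() scans the whole string and finds 'ape' starting at index 2.
m_search = re.search('ape', 'grape')
# fullmatch() requires the entire string to match the pattern.
m_full = re.fullmatch('ape', 'grape')

print(m_match)          # None
print(m_search.span())  # (2, 5)
print(m_full)           # None
```

Because a failed match is None, it is safer to check the result before calling .group() on it.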
When you match with a regular expression, you may want to retrieve the matched part.
In the following example, you want to retrieve '1993', '7', and '2'.
s = "1993年7月2日生まれ"  # "Born July 2, 1993"
p = "([0-9]+)年([0-9]+)月([0-9]+)日"
m = re.search(p, s)
The part of the target string (s in the snippet above) that matches a sub-pattern enclosed in () can be retrieved with the m.group(n) method. The argument n is as follows.
The whole match: n = 0
The i-th () from the beginning: n = i
In other words, in the above example, it becomes like this.
m.group(0) # -> '1993年7月2日'
m.group(1) # -> '1993'
m.group(2) # -> '7'
m.group(3) # -> '2'
As a caveat, the return value of .group() is a string, not a number, so if you want to treat it as a number, convert it as appropriate (for example with int()).
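As a minimal sketch of that conversion, the snippet below turns the captured strings into integers. The sample string and pattern are my own assumptions for illustration, not the ones used above.

```python
import re
from datetime import date

# Hypothetical sample string and pattern, chosen only for this illustration.
s = "1993-07-02"
m = re.search(r"([0-9]+)-([0-9]+)-([0-9]+)", s)

# groups() returns every captured group as a tuple of strings.
year, month, day = (int(g) for g in m.groups())
birthday = date(year, month, day)

print(m.group(2), month)  # 07 7
print(birthday)           # 1993-07-02
```

Note that the string '07' becomes the integer 7, so any leading zeros are lost after conversion.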
With the approach in the previous section, the numeric indices can be hard to follow. In that case, if you write the group as (?P<name>regex), you can extract that part by passing the name name as the argument to .group().
Specifically, do as follows.
s = "1993年7月2日生まれ"  # "Born July 2, 1993"
p = "(?P<year>[0-9]+)年(?P<month>[0-9]+)月(?P<day>[0-9]+)日"
m = re.search(p, s)
m.group(0) # -> '1993年7月2日'
m.group('year') # -> '1993'
m.group('month') # -> '7'
m.group('day') # -> '2'
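If you want all named groups at once, the match object also has a groupdict() method. Here is a short sketch. The sample string and pattern are assumptions of mine, chosen only for illustration.

```python
import re

# Hypothetical sample string, chosen only for this illustration.
s = "1993-07-02"
p = r"(?P<year>[0-9]+)-(?P<month>[0-9]+)-(?P<day>[0-9]+)"
m = re.search(p, s)

# groupdict() maps each group name to its matched string.
parts = m.groupdict()
print(parts)  # {'year': '1993', 'month': '07', 'day': '02'}

# Named groups can still be accessed by number.
print(m.group(1) == m.group('year'))  # True
```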
When searching a long sentence with a regular expression, you may want to extract every match, for example all the words that start with "con". In that case, use re.findall().
s = 'It\'s convenient to conclude you are conservative.'
p = r'con\w+'  # raw string, so \w is passed to re unchanged
m = re.findall(p, s)
m # -> ['convenient', 'conclude', 'conservative']
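One thing to watch out for: a pattern that begins with con also matches "con" in the middle of a word. The following sketch, using a sentence I made up for illustration, shows how adding the word boundary \b restricts matches to words that actually start with "con".

```python
import re

# Hypothetical sentence, chosen so that one word contains "con" in the middle.
s = 'The second contest was convenient.'

# Without a word boundary, the "cond" inside "second" is also picked up.
loose = re.findall(r'con\w+', s)
# \b anchors the match to the start of a word.
strict = re.findall(r'\bcon\w+', s)

print(loose)   # ['cond', 'contest', 'convenient']
print(strict)  # ['contest', 'convenient']
```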
If you use . in the pattern string, it matches any character except the newline character (\n).
For example, in the following case, you might expect the whole of 'abc=def\nghi\njkl' to match, but only 'abc=def' matches.
s = 'abc=def\nghi\njkl'
p = '^abc=.+'
m = re.search(p, s)
m.group() # -> 'abc=def'
This is because the metacharacter '.' does not match \n by default. In such a case, pass the re.DOTALL flag as the third argument of search().
s = 'abc=def\nghi\njkl'
p = '^abc=.+'
m = re.search(p, s, re.DOTALL)
m.group() # -> 'abc=def\nghi\njkl'
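For reference, here is a short sketch of two other ways to get the same behavior: re.S is an alias for re.DOTALL, and the inline flag (?s) at the start of the pattern switches it on without a third argument.

```python
import re

s = 'abc=def\nghi\njkl'

# re.S is a short alias for re.DOTALL.
m1 = re.search('^abc=.+', s, re.S)
# (?s) at the very start of the pattern enables the same behavior inline.
m2 = re.search('(?s)^abc=.+', s)

print(m1.group() == m2.group() == s)  # True
```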
Right? Isn't it easy?
When web scraping, there may be cases where you want to retrieve only the lines that start with a specific tag. (I've never done it before)
s = """<p>Pieter Pipar piked a peck of pickled pepers.</p>
<hr>
<p>A pek of pickled pepers Pieter Pipar piked.</p>
<p>If Pieter Pipar piked a pek of pickled pepers,<p>
<hr>
<p>Where's the pek of pickled pepers that Pieter Pipar picked?</p>"""
p = "^<p>.+$"
I wrapped each sentence in <p> and inserted horizontal rules with <hr>.
Suppose you want to extract only the lines that start with <p> from this text.
For this, findall is used. Since the string contains newline characters, pass the re.MULTILINE flag so that ^ and $ match at the start and end of each line rather than only at the start and end of the whole string.
m = re.findall(p, s, re.MULTILINE)
m
# ['<p>Pieter Pipar piked a peck of pickled pepers.</p>',
# '<p>A pek of pickled pepers Pieter Pipar piked.</p>',
# '<p>If Pieter Pipar piked a pek of pickled pepers,<p>',
# "<p>Where's the pek of pickled pepers that Pieter Pipar picked?</p>"]
Only the last item is shown in double quotes because it contains an apostrophe (Where's), and Python displays such strings with double quotes. Either way, it's convenient.
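To see what the flag changes, here is a minimal sketch that runs the same kind of pattern with and without re.MULTILINE. The short sample text is an assumption of mine, used in place of the longer one above.

```python
import re

# Hypothetical short sample with the same shape as the text above.
s = "<p>first</p>\n<hr>\n<p>second</p>"
p = "^<p>.+$"

# Without the flag, ^ and $ only anchor at the start and end of the whole string,
# and . cannot cross the newlines in between, so nothing matches.
without_flag = re.findall(p, s)
# With re.MULTILINE, ^ and $ also anchor at each line boundary.
with_flag = re.findall(p, s, re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['<p>first</p>', '<p>second</p>']
```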