Here is a regular expression for monetary amounts in Python.
The version for amounts ending in "yen" is below.
pattern = r'^(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+) yen'
# OK
# 0 yen
# 1,000 yen
# 100 yen
# 12345 yen
# 2000 yen
# 1234 yen
# 1000 yen
# NG
# 0,000 yen
# 000 yen
# , yen
# 10,00 yen
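As a quick sanity check, the OK and NG examples above can be run against the pattern. This is a minimal sketch: it assumes the suffix is written as " yen" (a space followed by "yen"), as in the listed examples, and the names `YEN_SUFFIX`, `ok_samples`, and `ng_samples` are illustrative only.

```python
import re

# Assumption: amounts end with " yen" (space + "yen"), as in the examples.
YEN_SUFFIX = re.compile(r'^(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+) yen')

ok_samples = ["0 yen", "1,000 yen", "100 yen", "12345 yen",
              "2000 yen", "1234 yen", "1000 yen"]
ng_samples = ["0,000 yen", "000 yen", ", yen", "10,00 yen"]

# True where the pattern matches, False where it does not.
ok_results = [bool(YEN_SUFFIX.match(s)) for s in ok_samples]
ng_results = [bool(YEN_SUFFIX.match(s)) for s in ng_samples]

print(all(ok_results))  # True: every OK sample matches
print(any(ng_results))  # False: no NG sample matches
```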
The version for amounts starting with ¥ (the yen sign) is as follows.
pattern = r'^¥(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)$'
# OK
# ¥0
# ¥1,000
# ¥100
# ¥12345
# ¥2000
# ¥1234
# ¥1000
# NG
# ¥0,000
# ¥000
# ¥,
# ¥10,00
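The ¥ version can be wrapped in a small helper to check the same lists. This is only a sketch; the function name `is_yen_sign_amount` is hypothetical and not part of the original article.

```python
import re

YEN_SIGN = re.compile(r'^¥(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)$')


def is_yen_sign_amount(text):
    """Return True if text is a well-formed amount such as '¥1,000'."""
    return YEN_SIGN.match(text) is not None


for sample in ["¥0", "¥1,000", "¥12345", "¥0,000", "¥000", "¥,", "¥10,00"]:
    # The first three are valid; the remaining four are rejected.
    print(sample, is_yen_sign_amount(sample))
```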
The environment is Google Colaboratory. The Python version is shown below.
import platform
print("python " + platform.python_version())
# python 3.6.9
The regular expression checker used is https://regex101.com/. We will build the regular expression while checking it there, then use it in the code.
For Python regular expressions in general, this article is easy to follow: https://qiita.com/luohao0404/items/7135b2b96f9b0b196bf3
Let's write the code right away. First, import the library for using regular expressions.
import re
First, let's create a regular expression that matches the string "1000 yen".
pattern = r'1000 yen'
Of course, this is an exact match, so it matches. Let's check with the code.
pattern = r'1000 yen'
string = r'1000 yen'
prog = re.compile(pattern)
result = prog.match(string)
if result:
    print(result.group())
# 1000 yen
The matched string is displayed. From here on, for simplicity, only the regular expression pattern is shown.
Besides "1000 yen", there are amounts such as "2000 yen" and "1234 yen". A regular expression that matches all of these is as follows.
pattern = r'\d\d\d\d yen'
The regular expression syntax used here is:
Syntax | Description |
---|---|
\d | Any digit |
Example | Matching string |
---|---|
\d\d\d\d | 1000, 2000, 1234 |
The regular expression above can be written more concisely.
pattern = r'\d{4} yen'
The newly used syntax is:
Syntax | Description |
---|---|
{m} | Exactly m repetitions of the preceding element |
Example | Matching string |
---|---|
\d{4} | 1000, 2000, 1234 |
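To see that the two spellings behave the same, both can be run over a few samples. This is a minimal sketch, assuming the suffix is written as " yen"; the variable names are illustrative.

```python
import re

# Assumption: amounts are written with a " yen" suffix, as in the examples.
long_form = re.compile(r'\d\d\d\d yen')
short_form = re.compile(r'\d{4} yen')

samples = ["1000 yen", "2000 yen", "1234 yen", "100 yen", "12345 yen"]

# Map each sample to (long-form result, short-form result).
results = {s: (bool(long_form.match(s)), bool(short_form.match(s)))
           for s in samples}

for sample, (long_ok, short_ok) in results.items():
    print(sample, long_ok, short_ok)
```

Both forms accept only the four-digit amounts and reject "100 yen" and "12345 yen".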
However, this matches only four-digit amounts, so strings such as "100 yen" and "12345 yen" are missed. Let's handle any number of digits.
The modified regular expression is as follows.
pattern = r'\d+ yen'
The newly used regular expressions are:
Syntax | Description |
---|---|
+ | One or more repetitions of the preceding element |
Example | Matching string |
---|---|
\d+ | 1000, 100, 12345 |
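A short check of this step follows. It is a sketch that assumes a " yen" suffix and uses `re.fullmatch` (my choice here, not the article's) so that each result reflects the whole string rather than a part of it.

```python
import re

# Assumption: amounts are written with a " yen" suffix.
any_digits = re.compile(r'\d+ yen')

samples = ["1000 yen", "100 yen", "12345 yen", "1,000 yen"]
results = {s: bool(any_digits.fullmatch(s)) for s in samples}

print(results)
# Amounts of any length match, but the comma-separated one does not.
```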
However, this cannot match strings containing a comma, such as "1,000 yen". Let's allow commas as well as digits.
The modified regular expression is as follows.
pattern = r'[\d,]+ yen'
The newly used regular expressions are:
Syntax | Description |
---|---|
[abc] | Any one of the characters a, b, or c |
Example | Matching string |
---|---|
[\d,] | A digit or a comma |
The following syntax is also used:
Syntax | Description |
---|---|
+ | One or more repetitions of the preceding element |
Example | Matching string |
---|---|
[\d,]+ | One or more digits or commas |
Digits and commas can now both be handled.
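The character class is permissive, which a quick check makes visible. This sketch assumes a " yen" suffix and uses `re.fullmatch` to test whole strings; the names are illustrative.

```python
import re

# Assumption: amounts are written with a " yen" suffix.
digits_or_commas = re.compile(r'[\d,]+ yen')

samples = ["1000 yen", "1,000 yen", ", yen", "10,00 yen"]
results = {s: bool(digits_or_commas.fullmatch(s)) for s in samples}

print(results)
# Every sample matches, including the malformed ", yen" and "10,00 yen".
```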
However, this also matches strings with misplaced commas, such as ", yen" and "10,00 yen". Let's require a comma every three digits, as in "1,000 yen" or "1,000,000 yen".
The modified regular expression is as follows.
pattern = r'\d{1,3}(,\d{3})+ yen'
The newly used regular expressions are:
Syntax | Description |
---|---|
{m,n} | Between m and n repetitions (inclusive) of the preceding element |
Example | Matching string |
---|---|
\d{1,3} | One to three digits |
The following syntax is also used:
Syntax | Description |
---|---|
(abc) | Treat the string abc as one group |
Example | Matching string |
---|---|
(,\d{3}) | A comma followed by three digits, such as ",000", treated as one group |
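The effect of grouping the comma with its three digits can be checked as follows. This is a sketch under the same assumptions as before: a " yen" suffix, and `re.fullmatch` to test whole strings.

```python
import re

# Assumption: amounts are written with a " yen" suffix.
comma_grouped = re.compile(r'\d{1,3}(,\d{3})+ yen')

samples = ["1,000 yen", "1,000,000 yen", "10,00 yen", ", yen", "1000 yen"]
results = {s: bool(comma_grouped.fullmatch(s)) for s in samples}

print(results)
# Well-placed commas match; misplaced commas are rejected.
# "1000 yen" is rejected too, because at least one ",ddd" group is required.
```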
With this change, strings without commas, which matched before, no longer match. Let's modify the pattern so that digits-only amounts match as well.
pattern = r'(\d+|\d{1,3}(,\d{3})+) yen'
The newly used regular expressions are:
Syntax | Description |
---|---|
(abc|efg) | Either the string abc or the string efg |
Example | Matching string |
---|---|
(\d+|\d{1,3}(,\d{3})+) | 1000, 1,000 |
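The alternation can be checked the same way. This sketch assumes a " yen" suffix and uses `re.fullmatch`; it also shows that amounts with leading zeros still slip through at this stage.

```python
import re

# Assumption: amounts are written with a " yen" suffix.
plain_or_grouped = re.compile(r'(\d+|\d{1,3}(,\d{3})+) yen')

samples = ["1000 yen", "1,000 yen", "10,00 yen", "0,000 yen", "000 yen"]
results = {s: bool(plain_or_grouped.fullmatch(s)) for s in samples}

print(results)
# "1000 yen" and "1,000 yen" both match and "10,00 yen" is rejected,
# but "0,000 yen" and "000 yen" are still accepted.
```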
However, this also matches strings that start with 0, such as "0,000 yen" and "000 yen".
The modified regular expression is as follows.
pattern = r'([1-9]\d*|[1-9]\d{0,2}(,\d{3})+) yen'
The newly used regular expressions are:
Syntax | Description |
---|---|
[a-c] | Any one character in the range a to c |
Example | Matching string |
---|---|
[1-9] | A digit from 1 to 9 (any digit except 0) |
The following syntax is also used:
Syntax | Description |
---|---|
* | Zero or more repetitions of the preceding element |
Example | Matching string |
---|---|
\d* | Zero or more digits |
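A check of the leading-digit restriction follows. As before, this is a sketch that assumes a " yen" suffix and uses `re.fullmatch` to test whole strings.

```python
import re

# Assumption: amounts are written with a " yen" suffix.
no_leading_zero = re.compile(r'([1-9]\d*|[1-9]\d{0,2}(,\d{3})+) yen')

samples = ["1000 yen", "1,000 yen", "0,000 yen", "000 yen", "0 yen"]
results = {s: bool(no_leading_zero.fullmatch(s)) for s in samples}

print(results)
# Leading zeros are rejected, but so is the legitimate "0 yen".
```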
Strings starting with 0 are now excluded. However, "0 yen" by itself must still be allowed, so add it as an alternative.
pattern = r'^(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+) yen'
The newly used regular expressions are:
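Syntax | Description |
---|---|
^ | The beginning of the string |
Without "^" (caret), a search would take the trailing "0 yen" of a string such as "0,000 yen" as a partial match. Note that re.match always starts at the beginning of the string, so the difference shows up with re.search or in tools such as regex101.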
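The partial match can be reproduced with `re.search`, which scans the whole string. This is a minimal sketch assuming a " yen" suffix; the names `unanchored` and `anchored` are illustrative.

```python
import re

# Assumption: amounts are written with a " yen" suffix.
unanchored = re.compile(r'(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+) yen')
anchored = re.compile(r'^(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+) yen')

text = "0,000 yen"

# search() without "^" finds the "0 yen" at the end of the string.
partial = unanchored.search(text)
print(partial.group())         # 0 yen

# With "^", the match must begin at the start of the string, so it fails.
print(anchored.search(text))   # None

# match() only tries the start of the string, so it fails even without "^".
print(unanchored.match(text))  # None
```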
Some amounts start with ¥ (the yen sign) instead of ending in "yen", so let's create a regular expression for those as well. In the regular expression above, delete the trailing "yen" and add "¥" at the beginning.
pattern = r'^¥(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)'
However, in this case, the "¥1" of "¥1,000" is taken as a partial match. The modified version is as follows.
pattern = r'^¥(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)$'
The newly used syntax is:
Syntax | Description |
---|---|
$ | The end of the string |
Adding $ at the end prevents partial matches.
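The difference the end anchor makes can be seen by comparing the two ¥ patterns. This is a short sketch; the names `without_end` and `with_end` are illustrative.

```python
import re

without_end = re.compile(r'^¥(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)')
with_end = re.compile(r'^¥(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)$')

# Without "$", the second alternative stops after "1" and the match succeeds.
print(without_end.match("¥1,000").group())  # ¥1

# With "$", the engine backtracks to the comma-grouped alternative.
print(with_end.match("¥1,000").group())     # ¥1,000

# A malformed amount is partially accepted without "$" and rejected with it.
print(without_end.match("¥10,00").group())  # ¥10
print(with_end.match("¥10,00"))             # None
```

In Python, `re.fullmatch` is another way to require that the entire string matches.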
This time, I used Python to create a regular expression for "amount".
Strings that follow a fixed pattern, such as dates, times, and amounts, are well suited to regular expressions. Try extracting other kinds of strings with regular expressions.