Conclusion

Here is the regular expression for an "amount" (a sum of money) in Python.

The version for amounts ending in 円 (yen) is below.

pattern = r'^(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)円'

# OK
# 0円
# 1,000円
# 100円
# 12345円
# 2000円
# 1234円
# 1000円

# NG
# 0,000円
# 000円
# ,円
# 10,00円
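As a quick check, the pattern can be run against the OK and NG lists. This is a minimal sketch, assuming the suffix is the literal character 円 (yen) and the amounts contain no spaces; the names `ok_samples` and `ng_samples` are only illustrative.

```python
import re

# Assumption: the suffix literal is the character 円 (yen).
pattern = r'^(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)円'
prog = re.compile(pattern)

# Illustrative sample lists taken from the OK / NG comments.
ok_samples = ['0円', '1,000円', '100円', '12345円', '2000円', '1234円', '1000円']
ng_samples = ['0,000円', '000円', ',円', '10,00円']

for s in ok_samples:
    print(s, bool(prog.match(s)))  # True for every OK sample

for s in ng_samples:
    print(s, bool(prog.match(s)))  # False for every NG sample
```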

The version for amounts beginning with ¥ (the yen sign) is as follows.

pattern = r'^¥(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)$'

# OK
# ¥0
# ¥1,000
# ¥100
# ¥12345
# ¥2000
# ¥1234
# ¥1000

# NG
# ¥0,000
# ¥000
# ¥,
# ¥10,00
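Once a string matches, the numeric part can be pulled out of the first group and converted to an integer. The following is a rough sketch of that idea; the helper `parse_yen` and the constant `AMOUNT_RE` are hypothetical names introduced here for illustration, not part of the article's code.

```python
import re

AMOUNT_RE = re.compile(r'^¥(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)$')

def parse_yen(text):
    """Hypothetical helper: return the amount as an int, or None if invalid."""
    m = AMOUNT_RE.match(text)
    if m is None:
        return None
    # group(1) is the numeric part without the leading yen sign.
    return int(m.group(1).replace(',', ''))

print(parse_yen('¥1,000'))  # 1000
print(parse_yen('¥12345'))  # 12345
print(parse_yen('¥10,00'))  # None
```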

Preparation

The environment is Google Colaboratory. The Python version is shown below.

import platform
print("python " + platform.python_version())
# python 3.6.9

The regular expression checker used is https://regex101.com/. We will build the regular expression while checking it there, then implement it in code.


For Python regular expressions in general, this article is easy to follow: https://qiita.com/luohao0404/items/7135b2b96f9b0b196bf3

Let's make a regular expression for the amount

Version ending in 円 (yen)

Let's write the code right away. First, import the module for working with regular expressions.

import re

First, let's create a regular expression that matches the string "1000円".

pattern = r'1000円'

Since the pattern and the string are identical, this matches, of course. Let's confirm it with code.

pattern = r'1000円'
string = '1000円'
prog = re.compile(pattern)
result = prog.match(string)
if result:
    print(result.group())
# 1000円

The matched string is printed. From here on, for simplicity, only the regular expression pattern is shown.

Besides "1000円", there are amounts such as "2000円" and "1234円". A regular expression that matches all of these is as follows.

pattern = r'\d\d\d\d円'

The regular expression syntax used is:

Syntax	Description
\d	Any digit

Example	Matches
\d\d\d\d	1000, 2000, 1234

The regular expression above can be written more concisely.

pattern = r'\d{4}円'

The newly used syntax is:

Syntax	Description
{m}	Exactly m repetitions of the preceding element

Example	Matches
\d{4}	1000, 2000, 1234

However, this matches only four-digit amounts, so it cannot handle "100円" or "12345円". Let's support any number of digits.
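The four-digit limitation can be confirmed with a quick check. This is a minimal sketch, assuming the suffix is the literal character 円 (yen) and using `re.match`, which only matches at the start of the string; the name `results` is illustrative.

```python
import re

# Assumption: the amount ends with the literal character 円 (yen).
prog = re.compile(r'\d{4}円')

# re.match anchors at the start, so exactly four digits must come first.
results = {s: bool(prog.match(s)) for s in ['1000円', '100円', '12345円']}
print(results)
# {'1000円': True, '100円': False, '12345円': False}
```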

The modified regular expression is as follows.

pattern = r'\d+円'

The newly used syntax is:

Syntax	Description
+	One or more repetitions of the preceding element

Example	Matches
\d+	1000, 100, 12345

However, this cannot match a string containing a comma, such as "1,000円". Let's allow commas as well as digits.
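A short check shows where the digits-only pattern stops working. This sketch again assumes the 円 suffix and uses `re.match`; `results` is just an illustrative name.

```python
import re

# Assumption: the amount ends with the literal character 円 (yen).
prog = re.compile(r'\d+円')

# Any number of digits is fine, but a comma breaks the run of digits.
results = {s: bool(prog.match(s)) for s in ['100円', '12345円', '1,000円']}
print(results)
# {'100円': True, '12345円': True, '1,000円': False}
```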

The modified regular expression is as follows.

pattern = r'[\d,]+円'

The newly used syntax is:

Syntax	Description
[abc]	Any one of the characters a, b, or c

Example	Matches
[\d,]	A digit or a comma

The following syntax, introduced earlier, is also used:

Syntax	Description
+	One or more repetitions of the preceding element

Example	Matches
[\d,]+	One or more digits or commas

Now the pattern handles both digits and commas.

However, it also matches strings with misplaced commas, such as ",円" and "10,00円". Let's require a comma every three digits, as in "1,000円" or "1,000,000円".
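To see how permissive the character class is, the sketch below feeds it one well-formed amount and two malformed ones. It assumes the 円 suffix; `results` is an illustrative name.

```python
import re

# Assumption: the amount ends with the literal character 円 (yen).
prog = re.compile(r'[\d,]+円')

# The class accepts digits and commas in any order, so all three match.
results = {s: bool(prog.match(s)) for s in ['1,000円', ',円', '10,00円']}
print(results)
# {'1,000円': True, ',円': True, '10,00円': True}
```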

The modified regular expression is as follows.

pattern = r'\d{1,3}(,\d{3})+円'

The newly used syntax is:

Syntax	Description
{m,n}	At least m and at most n repetitions of the preceding element

Example	Matches
\d{1,3}	One to three digits

The following syntax is also used:

Syntax	Description
(abc)	Treats the string abc as a single group

Example	Matches
(,\d{3})	A comma followed by three digits, such as ",000", treated as one group
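The grouped pattern can be tried on a few strings to see what it accepts. This is a sketch under the same assumption (円 suffix, `re.match`); note that the group must occur at least once, so a comma is now required.

```python
import re

# Assumption: the amount ends with the literal character 円 (yen).
prog = re.compile(r'\d{1,3}(,\d{3})+円')

samples = ['1,000円', '1,000,000円', '10,00円', ',円', '1000円']
results = {s: bool(prog.match(s)) for s in samples}
for s, ok in results.items():
    print(s, ok)
# 1,000円 True
# 1,000,000円 True
# 10,00円 False
# ,円 False
# 1000円 False
```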

With this change, however, the pattern no longer matches the comma-free strings it matched before. Let's modify it so that digit-only amounts match as well.

pattern = r'(\d+|\d{1,3}(,\d{3})+)円'

The newly used syntax is:

Syntax	Description
(abc|efg)	Either the string abc or the string efg

Example	Matches
(\d+|\d{1,3}(,\d{3})+)	1000, 1,000

However, this also matches strings that start with 0, such as "0,000円" and "000円".
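The leading-zero problem is easy to reproduce. The sketch below assumes the 円 suffix and checks the alternation against both valid amounts and zero-led strings.

```python
import re

# Assumption: the amount ends with the literal character 円 (yen).
prog = re.compile(r'(\d+|\d{1,3}(,\d{3})+)円')

samples = ['1000円', '1,000円', '000円', '0,000円']
results = {s: bool(prog.match(s)) for s in samples}
print(results)
# {'1000円': True, '1,000円': True, '000円': True, '0,000円': True}
```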

The modified regular expression is as follows.

pattern = r'([1-9]\d*|[1-9]\d{0,2}(,\d{3})+)円'

The newly used syntax is:

Syntax	Description
[a-c]	Any one character in the range a to c

Example	Matches
[1-9]	A digit from 1 to 9 (any digit except 0)

The following syntax is also used:

Syntax	Description
*	Zero or more repetitions of the preceding element

Example	Matches
\d*	Zero or more digits
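Requiring the first digit to be 1 to 9 can be checked the same way. In this sketch (円 suffix assumed), note that a plain "0円" is rejected along with the other zero-led strings.

```python
import re

# Assumption: the amount ends with the literal character 円 (yen).
prog = re.compile(r'([1-9]\d*|[1-9]\d{0,2}(,\d{3})+)円')

samples = ['1000円', '1,000円', '000円', '0,000円', '0円']
results = {s: bool(prog.match(s)) for s in samples}
for s, ok in results.items():
    print(s, ok)
# 1000円 True
# 1,000円 True
# 000円 False
# 0,000円 False
# 0円 False
```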

Strings that start with 0 are now excluded. However, "0円" by itself must still be allowed, so let's add it.

pattern = r'^(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)円'

The newly used syntax is:

Syntax	Description
^	The beginning of the string

Without "^" (caret), the trailing "0円" of a string such as "0,000円" would be picked up as a partial match.
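The effect of "^" shows up when the whole string is scanned, as `re.search` (or a tool like regex101) does. `re.match` already starts at the beginning of the string, so there the caret makes no difference. The sketch below assumes the 円 suffix; the names `without_caret` and `with_caret` are illustrative.

```python
import re

# Assumption: the amount ends with the literal character 円 (yen).
without_caret = re.compile(r'(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)円')
with_caret = re.compile(r'^(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)円')

s = '0,000円'

# search scans forward and finds the trailing "0円".
m = without_caret.search(s)
print(m.group() if m else None)  # 0円

# With the caret, the match may only start at position 0, so nothing is found.
print(with_caret.search(s))      # None

# match is anchored at the start even without the caret.
print(without_caret.match(s))    # None
```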

Version beginning with ¥

Some amounts begin with ¥ (the yen sign) instead of ending in 円, so let's create a regular expression for those as well. Starting from the regular expression above, delete the trailing "円" and add "¥" at the beginning.

pattern = r'^¥(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)'

However, with this pattern, the "¥1" of "¥1,000" is picked up as a partial match. The corrected version is as follows.

pattern = r'^¥(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)$'

The newly used syntax is:

Syntax	Description
$	The end of the string

Adding $ at the end prevents partial matches.
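The difference "$" makes can be seen by comparing the two patterns on the same string. This is a small sketch; the variable names are illustrative, and `re.fullmatch` is shown only as an alternative way to require that the whole string matches, not as the approach the article takes.

```python
import re

body = r'(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)'
no_dollar = re.compile(r'^¥' + body)
with_dollar = re.compile(r'^¥' + body + r'$')

s = '¥1,000'
print(no_dollar.match(s).group())    # ¥1
print(with_dollar.match(s).group())  # ¥1,000

# fullmatch requires the entire string to match, without writing ^ or $.
print(re.fullmatch(r'¥' + body, s).group())  # ¥1,000
print(re.fullmatch(r'¥' + body, '¥10,00'))   # None
```

One detail worth knowing: in Python, "$" also matches just before a trailing newline, whereas `re.fullmatch` does not accept the extra newline.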

Summary

In this article, I used Python to build a regular expression for an "amount".

Strings that follow a fixed pattern, such as dates, times, and amounts, are a good fit for regular expressions. Try extracting various kinds of strings with regular expressions.

