What is a regular expression?

Regular expressions let you search for and replace complex patterns in strings. In this article I use them to **read the input data and extract only the strings that represent dates**.

Since these are the basics, I will skip the details. Please read the official documentation listed in the references below.

Personally, I find regular expressions somewhat complicated and hard to grasp, but I'm gradually getting used to them.

Use the standard library re

import re
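As a quick refresher before the date example, here is a minimal sketch of searching and replacing with re. The sample sentence and the variable names are made up for illustration and are not part of the article's data.

```python
import re

text = "Meeting moved from 2020-1-5 to 2020-1-12."

# re.search returns the first match object, or None if nothing matches.
first = re.search(r"\d{4}-\d{1,2}-\d{1,2}", text)
print(first.group())  # 2020-1-5

# re.sub replaces every match; here the hyphens become slashes.
slashed = re.sub(r"(\d{4})-(\d{1,2})-(\d{1,2})", r"\1/\2/\3", text)
print(slashed)  # Meeting moved from 2020/1/5 to 2020/1/12.
```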

Preparation of input data

First, prepare the date data. I stored a variety of strings in a list named DATE: some have nothing to do with dates, some look like dates but are not valid, and some differ only in their delimiters.

`date.py`


DATE = ["2020/01/05",
        "2020/1/5",
        "2020年1月5日",  # "January 5, 2020" in Japanese notation
        "2020-1-5",
        "2020/1/5",
        "2020.1.5",
        "2020/20/20",
        "2020 1 5",
        "2020 01 05",
        "1995w44w47",
        "Thank you",
        "1998/33/52",
        "3020/1/1",
        ]

Regular expression for dates

For example, if you type "today" on a smartphone or computer, predictive conversion may offer expressions such as "2020年1月5日" (January 5, 2020), "2020/01/05", and "令和2年1月5日" (January 5, Reiwa 2). This time, we will use the Western calendar year and handle dates written in year-month-day order, such as YYYY-MM-DD.

A commonly used regular expression for dates looks like this: ^\d{4}-\d{1,2}-\d{1,2}$.
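To see what this sample pattern accepts, here is a small check. The list of samples and the names strict and strict_hits are chosen only for this illustration.

```python
import re

# Hyphen-only pattern: 4-digit year, then 1-2 digit month and day.
strict = re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$")

samples = ["2020-1-5", "2020-01-05", "2020/1/5", "2020.1.5"]

# Keep only the strings that the pattern matches.
strict_hits = [s for s in samples if strict.search(s)]
print(strict_hits)  # ['2020-1-5', '2020-01-05']
```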

If the symbols are unfamiliar, please take a closer look at the official documentation.

But this is still too narrow. The only notation it supports is a string separated by -. That's where \D comes in.

\D matches any single character that is not a digit. For ASCII text it is equivalent to [^0-9]; with Python's default Unicode matching, \d also accepts digits from other scripts, so \D is exactly [^0-9] only when the re.ASCII flag is set. You can therefore use it to match any non-digit separator, such as a hyphen, a letter, a space, or a dot.

Let's build the whole pattern in one go.
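The following sketch shows which single characters \D accepts, with and without the re.ASCII flag. The candidate list is an assumption made for this illustration; its last item is a full-width digit 5.

```python
import re

candidates = ["/", "-", ".", " ", "年", "w", "5", "５"]  # last item: full-width 5

# Default Unicode matching for str patterns: \D rejects every decimal digit.
unicode_non_digits = [c for c in candidates if re.fullmatch(r"\D", c)]
print(unicode_non_digits)  # ['/', '-', '.', ' ', '年', 'w']

# With re.ASCII, \D is exactly [^0-9], so the full-width 5 counts as a non-digit.
ascii_non_digits = [c for c in candidates if re.fullmatch(r"\D", c, re.ASCII)]
print(ascii_non_digits)  # ['/', '-', '.', ' ', '年', 'w', '５']
```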

`date_type.py`


date_type = re.compile(r"""(
    (^\d{4})        # 4-digit number at the start of the string
    (\D)            # Any single non-digit character
    (\d{1,2})       # 1- or 2-digit number
    (\D)            # Any single non-digit character
    (\d{1,2})       # 1- or 2-digit number
    )""", re.VERBOSE)

Done. The pattern is built with re.compile(), and the re.VERBOSE flag allows the whitespace and comments inside it.

Compared with the sample pattern shown earlier, the $ is gone. $ checks for a match at the end of the string, but this time the string does not necessarily end with the day digits \d{1,2}. That is because the input data contains 2020年1月5日, which has the character 日 (day) after the digits. Since the day may be followed by 日 or some other string, a fixed $ cannot be used.
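The effect of the trailing $ can be checked with a minimal sketch. The names with_end and without_end are made up for this comparison, and the patterns are written on one line without the capture groups for brevity.

```python
import re

with_end = re.compile(r"^\d{4}\D\d{1,2}\D\d{1,2}$")
without_end = re.compile(r"^\d{4}\D\d{1,2}\D\d{1,2}")

text = "2020年1月5日"

# The trailing 日 is not a digit, so the pattern that requires $ fails.
anchored = with_end.search(text)
print(anchored)  # None

# Without $, the leading part of the string still matches.
unanchored = without_end.search(text)
print(unanchored.group())  # 2020年1月5
```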

Extract the date

Now that the pattern is ready, let's extract the dates. First, use the .search() method to print the part of each string that matches the regular expression.

`hit_data_1.py`


for date in DATE:
    # Store the match result (a match object or None) in hit_date
    hit_date = date_type.search(date)
    print(hit_date)

`Output result_1.py`


<re.Match object; span=(0, 10), match='2020/01/05'>
<re.Match object; span=(0, 8), match='2020/1/5'>
<re.Match object; span=(0, 8), match='2020年1月5'>
<re.Match object; span=(0, 8), match='2020-1-5'>
<re.Match object; span=(0, 8), match='2020/1/5'>
<re.Match object; span=(0, 8), match='2020.1.5'>
<re.Match object; span=(0, 10), match='2020/20/20'>
<re.Match object; span=(0, 8), match='2020 1 5'>
<re.Match object; span=(0, 10), match='2020 01 05'>
<re.Match object; span=(0, 10), match='1995w44w47'>
None
<re.Match object; span=(0, 10), match='1998/33/52'>
<re.Match object; span=(0, 8), match='3020/1/1'>

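Naturally, None was returned for "Thank you". All the other strings still match at this stage.

Next, filter out None by testing the truth value of the result (a match object is truthy, None is falsy), and for each match get the captured groups as a tuple with .groups(). Let's improve the script a little.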

`hit_data_2.py`


for date in DATE:
    # Store the match result (a match object or None) in hit_date
    hit_date = date_type.search(date)
    if hit_date:  # a match object is truthy; None is falsy
        split = hit_date.groups()
        print(split)

`Output result_2.py`


('2020/01/05', '2020', '/', '01', '/', '05')
('2020/1/5', '2020', '/', '1', '/', '5')
('2020年1月5', '2020', '年', '1', '月', '5')
('2020-1-5', '2020', '-', '1', '-', '5')
('2020/1/5', '2020', '/', '1', '/', '5')
('2020.1.5', '2020', '.', '1', '.', '5')
('2020/20/20', '2020', '/', '20', '/', '20')
('2020 1 5', '2020', ' ', '1', ' ', '5')
('2020 01 05', '2020', ' ', '01', ' ', '05')
('1995w44w47', '1995', 'w', '44', 'w', '47')
('1998/33/52', '1998', '/', '33', '/', '52')
('3020/1/1', '3020', '/', '1', '/', '1')

Good, we're almost there. The information we want, the year, month, and day, is stored at indexes [1], [3], and [5] respectively. Use tuple unpacking to pull these out.
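As an alternative to remembering the positions [1], [3], and [5], the parts can be given names. This is only a sketch of a variation, not the pattern used in this article; the name named_date and the group names are assumptions made for illustration.

```python
import re

# Same structure as date_type, but the wanted parts are named and the
# separators are left uncaptured.
named_date = re.compile(r"""
    ^(?P<year>\d{4})     # 4-digit year at the start of the string
    \D                   # any single non-digit separator
    (?P<month>\d{1,2})   # 1- or 2-digit month
    \D                   # any single non-digit separator
    (?P<day>\d{1,2})     # 1- or 2-digit day
    """, re.VERBOSE)

m = named_date.search("2020.1.5")
print(m.group("year"), m.group("month"), m.group("day"))  # 2020 1 5
print(m.groupdict())  # {'year': '2020', 'month': '1', 'day': '5'}
```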

Also, the values in the tuple are of type <class 'str'>, so let's convert them to int. That makes the validity check easier.
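A short sketch of why the conversion matters: strings are compared character by character, so a range check on the raw text can give the wrong answer. The variable names here are made up for illustration.

```python
month_text, limit_text = "5", "12"

# As strings, "5" is compared with "1" first, so "5" sorts after "12".
text_compare = month_text > limit_text
print(text_compare)  # True

# As integers, the comparison is numeric.
number_compare = int(month_text) > int(limit_text)
print(number_compare)  # False

# int() also drops leading zeros such as the one in "01".
print(int("01"))  # 1
```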

Next, check whether the int year, month, and day are implausible values. I reject years after 3000 because I rarely use them in everyday work. There is no month 13 or later and no day 32 or later, so those are rejected as well. A stricter check would have to consider the number of days in each month and leap years, so feel free to change the conditions here.

Putting the above together gives the following.
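If you do want the stricter check, one possible approach (a swapped-in technique, not the one used in this article) is to let the standard library's datetime.date validate the combination. The helper name to_date is hypothetical. Note that datetime.date accepts years from 1 to 9999, so an upper limit on the year would still need its own condition.

```python
from datetime import date


def to_date(year, month, day):
    """Return a date object, or None if the combination is not a real day."""
    try:
        return date(year, month, day)
    except ValueError:
        # Raised for values such as month 20, day 0, or February 29
        # in a non-leap year.
        return None


print(to_date(2020, 2, 29))   # 2020-02-29 (2020 is a leap year)
print(to_date(2021, 2, 29))   # None
print(to_date(2020, 20, 20))  # None
```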

`hit_data_3.py`


for date in DATE:
    # Store the match result (a match object or None) in hit_date
    hit_date = date_type.search(date)
    if hit_date:  # a match object is truthy; None is falsy
        split = hit_date.groups()

        # Tuple unpacking, converting each captured string to int
        year, month, day = int(split[1]), int(split[3]), int(split[5])

        if year > 3000 or month > 12 or day > 31:
            print("False")
        else:
            print(year, month, day)

`Output result_3.py`


2020 1 5
2020 1 5
2020 1 5
2020 1 5
2020 1 5
2020 1 5
False
2020 1 5
2020 1 5
False
False
False

I think this extracts only the strings that look like dates.

Completed sample code

`main.py`


import re

# Input data: date-like strings mixed with invalid ones
DATE = ["2020/01/05",
        "2020/1/5",
        "2020年1月5日",  # "January 5, 2020" in Japanese notation
        "2020-1-5",
        "2020/1/5",
        "2020.1.5",
        "2020/20/20",
        "2020 1 5",
        "2020 01 05",
        "1995w44w47",
        "Thank you",
        "1998/33/52",
        "3020/1/1",
        ]

# Regular expression for date-like strings
date_type = re.compile(r"""(
    (^\d{4})        # 4-digit number at the start of the string
    (\D)            # Any single non-digit character
    (\d{1,2})       # 1- or 2-digit number
    (\D)            # Any single non-digit character
    (\d{1,2})       # 1- or 2-digit number
    )""", re.VERBOSE)

for date in DATE:
    # Store the match result (a match object or None) in hit_date
    hit_date = date_type.search(date)
    if hit_date:  # a match object is truthy; None is falsy
        split = hit_date.groups()

        # Tuple unpacking, converting each captured string to int
        year, month, day = int(split[1]), int(split[3]), int(split[5])

        if year > 3000 or month > 12 or day > 31:
            print("False")
        else:
            print(year, month, day)

`Output result.py`


2020 1 5
2020 1 5
2020 1 5
2020 1 5
2020 1 5
2020 1 5
False
2020 1 5
2020 1 5
False
False
False

Summary

How was it? There may well be better ways, but this is the best I could come up with.

I ran into this problem while thinking about building a tool to streamline my daily work, so I wrote it up on Qiita as well.

I hope it helps. The code is also available on GitHub.

References

- Python standard library: re --- Regular expression operations
- I don't want to google the regular expressions I use often!
