[Python] Get the text of the law from the e-GOV Law API

This article summarizes how to obtain and format Japanese legal data with the e-Gov Law API. I referred to the following Qiita articles.

- Access the legal API from Google Colab (Python)
- Extraction of decree including specific words using e-Gov decree API and XML Python
- Body formatting: [Python3] Delete parentheses and character strings in parentheses

All the code in this article, including the class that appears in the final "Summary" section, can be downloaded from the GitHub repository.

1. Motivation

I wanted to study natural language processing using a ministerial ordinance that I often consult at work: J-GCP, the Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals. The amount of text is small compared with, for example, posts on Twitter, but I thought it would make a good subject for natural language processing because legal text has little notational variation.

2. Environment

Use requests (requires pip install) to access the API and the xml package (standard library) to parse the XML data. In addition, functools.lru_cache caches function output to reduce the number of API accesses, pprint displays dictionaries and lists neatly, and re removes unnecessary strings with regular expressions.

# Standard library
from functools import lru_cache
from pprint import pprint
import re
from xml.etree import ElementTree
# pip install requests
import requests

Execution environment

- OS: Windows Subsystem for Linux / Ubuntu
- Package management: pipenv
- Language: Python 3.8.5
- requests: 2.24.0
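
To see how functools.lru_cache reduces the number of API accesses, here is a minimal sketch that replaces the API call with a stand-in function. fake_fetch and call_count are hypothetical names used only for this illustration, and no network access takes place.

```python
from functools import lru_cache

call_count = 0  # counts how often the function body really runs


@lru_cache(maxsize=None)
def fake_fetch(category):
    """Stand-in for an API call (hypothetical; no request is sent)."""
    global call_count
    call_count += 1
    return f"response for category {category}"


fake_fetch(2)
fake_fetch(2)  # same argument: served from the cache
fake_fetch(4)  # new argument: the body runs again
print(call_count)  # -> 2
print(fake_fetch.cache_info().hits)  # -> 1
```

The cache is keyed by the arguments, so only calls with the same category (or the same law number) are saved.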

3. Acquisition of law number

It seems that a unique ID called the "law number" is assigned to each law separately from its name. The number is not a simple serial number, but a Japanese string ...

The law number (horei bango) is a number assigned individually, for identification, to the various laws and regulations promulgated by national and local governments. Depending on the government, numbers restart from No. 1 at regular intervals (each calendar year, etc.) or run serially from a specific date (independence day, etc.); management and operation methods differ. Source: "Law number", Wikipedia, the free encyclopedia.

The text of a law is requested by its law number, so first look at how to find the law number from the name of the law.
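
Because the law number is a Japanese string, it has to be percent-encoded when it is placed in a URL path. As far as I know, requests does this automatically, so the sketch below only makes the step visible with urllib.parse.quote. The URL pattern is the lawdata endpoint that appears in section 4; build_lawdata_url is a hypothetical helper, and the Japanese notation of the J-GCP law number is my assumption about the form the API expects.

```python
from urllib.parse import quote

# Endpoint pattern used for retrieving the text of a law in this article
LAWDATA_URL = "https://elaws.e-gov.go.jp/api/1/lawdata/{number}"


def build_lawdata_url(number):
    """Return the lawdata URL with the law number percent-encoded as UTF-8."""
    return LAWDATA_URL.format(number=quote(number, safe=""))


# Assumption: the J-GCP law number written in its original Japanese notation
jgcp_number = "平成九年厚生省令第二十八号"
url = build_lawdata_url(jgcp_number)
print(url.isascii())  # -> True
print(url.startswith("https://elaws.e-gov.go.jp/api/1/lawdata/%E5%B9%B3"))  # -> True
```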

Dictionary of law numbers

First, create a function that retrieves the relationship between the law name and the law number as a dictionary.

law_number.py


@lru_cache
def get_law_dict(category=1):
    # Obtain the list of laws and regulations in the given category from the API
    url = f"https://elaws.e-gov.go.jp/api/1/lawlists/{category}"
    r = requests.get(url)
    # Parse the XML data
    root = ElementTree.fromstring(r.content.decode(encoding="utf-8"))
    # Create a dictionary {name: law number}
    names = [e.text for e in root.iter() if e.tag == "LawName"]
    numbers = [e.text for e in root.iter() if e.tag == "LawNo"]
    return {name: num for (name, num) in zip(names, numbers)}

The category argument selects one of four types of laws.

- 1: All laws and regulations
- 2: Constitution and laws
- 3: Cabinet Orders and Imperial Ordinances
- 4: Ministerial Ordinances

Output example:

pprint(get_law_dict(category=2), compact=True)
# ->
{
    "Law No. 34 of Meiji 22 (Fighting crimes)": "Meiji 22 Law No. 34",
    "Deposit rules": "Meiji 23 Law No. 1",
    "Currency and Securities Imitation Control Law": "Meiji 28 Law No. 28",
    "Government Bond Securities Purchase Rejection Law": "Meiji 29 Law No. 5",
    "Civil law": "Meiji 29 Law No. 89",
...
    "Law concerning extraordinary special provisions of national tax-related law to deal with the effects of new coronavirus infections, etc.": "Reiwa 2nd Year Law No. 25",
    "Law concerning prohibition of seizure related to special fixed amount benefits, etc. for the second year of Reiwa": "Reiwa 2nd Year Law No. 27",
    "Disaster Prevention Priority Agricultural Reservoir Special Measures Law Concerning Promotion of Disaster Prevention Work, etc.": "Reiwa 2nd Year Law No. 56"
}
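
The four category values can be kept in one place and checked before a request URL is built. The following is a minimal sketch; CATEGORIES and build_lawlists_url are hypothetical helper names, not part of the API or of the code above.

```python
# Category numbers accepted by the lawlists endpoint, as listed in this article
CATEGORIES = {
    1: "All laws and regulations",
    2: "Constitution and laws",
    3: "Cabinet Orders and Imperial Ordinances",
    4: "Ministerial Ordinances",
}


def build_lawlists_url(category=1):
    """Return the lawlists URL after checking the category number."""
    if category not in CATEGORIES:
        raise ValueError(
            f"category must be one of {sorted(CATEGORIES)}, got {category!r}"
        )
    return f"https://elaws.e-gov.go.jp/api/1/lawlists/{category}"


print(build_lawlists_url(4))
# -> https://elaws.e-gov.go.jp/api/1/lawlists/4
```

Rejecting an unknown category locally avoids sending a request that the API would not be able to answer.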

In the "create a dictionary {name: law number}" step, root.iter() walks the XML tree and returns an iterator over its elements. In Python 3.8, root.getiterator() can be used instead, but it raises the following DeprecationWarning (the method was removed in Python 3.9).

DeprecationWarning: This method will be removed in future versions.
Use 'tree.iter()' or 'list(tree.iter())' instead.

In addition, each Element has the attributes .tag and .text.

- When .tag == "LawName": .text is the name of the law
- When .tag == "LawNo": .text is the law number

Example of Elements


elements = [
    f"{e.tag=}, {e.text=}" for e in root.iter()
    if e.tag in {"LawName", "LawNo"}
]
pprint(elements[:4], compact=False)
# ->
["e.tag='LawName', e.text='Revenue and Expenditure Budget Approximate Order'",
 "e.tag='LawNo', e.text='Meiji 22nd Cabinet Decree No. 12'",
 "e.tag='LawName', e.text='Scheduled expense calculation outline'",
 "e.tag='LawNo', e.text='Meiji 22nd Cabinet Decree No. 19'"]

Using this, the following part creates the dictionary of law names and law numbers.

get_law_dict()


names = [e.text for e in root.iter() if e.tag == "LawName"]
numbers = [e.text for e in root.iter() if e.tag == "LawNo"]
return {name: num for (name, num) in zip(names, numbers)}
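
zip() pairs names and numbers purely by position, so it relies on every law having both a LawName and a LawNo in the same order. As an alternative, each pair can be read from the parent element of one law. This is a minimal sketch on hand-written sample XML; the parent tag LawNameListInfo is my assumption about the response layout and should be checked against the actual API response.

```python
from xml.etree import ElementTree

# Hand-written sample; the real response layout may differ (assumption)
SAMPLE_XML = """<DataRoot>
  <ApplData>
    <LawNameListInfo><LawName>Law A</LawName><LawNo>No. 1</LawNo></LawNameListInfo>
    <LawNameListInfo><LawName>Law B</LawName><LawNo>No. 2</LawNo></LawNameListInfo>
  </ApplData>
</DataRoot>"""


def pair_names_and_numbers(root, parent_tag="LawNameListInfo"):
    """Build {name: law number} by reading both values from the same parent."""
    pairs = {}
    for info in root.iter(parent_tag):
        name = info.findtext("LawName")
        number = info.findtext("LawNo")
        # Skip entries where either value is missing or empty
        if name and number:
            pairs[name] = number
    return pairs


sample_root = ElementTree.fromstring(SAMPLE_XML)
print(pair_names_and_numbers(sample_root))
# -> {'Law A': 'No. 1', 'Law B': 'No. 2'}
```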

Keyword search for names

Few people remember the official name of a law, so I will make it possible to search by keyword.

law_number.py


def get_law_number(keyword, category=1):
    """
    Return the law numbers of the laws whose names contain the keyword.
    The list of laws is retrieved from e-Gov (https://www.e-gov.go.jp/).

    Args:
        keyword (str): keyword of the law name
        category (int): category number, like 1 (all), 2 (Constitution and laws), 3 (Cabinet Order), 4 (Ministerial ordinance)

    Returns:
        dict(str, str): dictionary of law name (key) and law number (value)
    """
    law_dict = get_law_dict(category=category)
    return {k: v for (k, v) in law_dict.items() if keyword in k}

Output example:

Acquisition of law number


print(get_law_number("Clinical trials of pharmaceuticals", category=4))
# ->
{
    'Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals': '1997 Ministry of Health and Welfare Ordinance No. 28',
    'Ministerial Ordinance on Standards for Conducting Clinical Trials of Veterinary Drugs': '1997 Ministry of Agriculture, Forestry and Fisheries Ordinance No. 75'
}

The law number of the target J-GCP (Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals) turned out to be "1997 Ministry of Health and Welfare Ordinance No. 28".
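
When one keyword returns several candidates, as in the output above, the search can be narrowed by requiring more than one keyword. A minimal sketch on a hand-written dictionary; search_laws is a hypothetical helper, not part of the code in this article.

```python
def search_laws(law_dict, *keywords):
    """Return the entries whose law name contains every keyword."""
    return {
        name: number
        for name, number in law_dict.items()
        if all(keyword in name for keyword in keywords)
    }


# Sample entries taken from the output examples in this article
sample_dict = {
    "Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals":
        "1997 Ministry of Health and Welfare Ordinance No. 28",
    "Ministerial Ordinance on Standards for Conducting Clinical Trials of Veterinary Drugs":
        "1997 Ministry of Agriculture, Forestry and Fisheries Ordinance No. 75",
    "Civil law": "Meiji 29 Law No. 89",
}
hits = search_laws(sample_dict, "Clinical Trials", "Pharmaceuticals")
print(list(hits.values()))
# -> ['1997 Ministry of Health and Welfare Ordinance No. 28']
```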

4. Acquisition of the text of the law

Send the law number to the API to get the text. The function parses the XML, extracts the text of each element, and removes extra whitespace and blank lines.

law_contents.py


@lru_cache
def get_raw(number):
    """
    Retrieve contents of the law specified with law number from e-Gov API.

    Args:
        number (str): Number of the law, like '1997 Ministry of Health and Welfare Ordinance No. 28'

    Returns:
        list[str]: raw contents of the law
    """
    url = f"https://elaws.e-gov.go.jp/api/1/lawdata/{number}"
    r = requests.get(url)
    root = ElementTree.fromstring(r.content.decode(encoding="utf-8"))
    contents = [e.text.strip() for e in root.iter() if e.text]
    return [t for t in contents if t]

Output example:

gcp_raw = get_raw("1997 Ministry of Health and Welfare Ordinance No. 28")
pprint(gcp_raw, compact=False)
# ->
[
    "0",
    "1997 Ministry of Health and Welfare Ordinance No. 28",
...
    "table of contents",
...
    "Chapter 1 General Rules",
    "(Effect)",
    "First article",
    "This Ministerial Ordinance aims to protect the human rights of subjects, maintain safety and improve welfare, and the scientific quality of clinical trials and
Law Concerning Ensuring Quality, Effectiveness, and Safety of Pharmaceuticals, Medical Devices, etc. to Ensure Reliability of Results
(Hereinafter referred to as the "law") Article 14, paragraph 3 (applies mutatis mutandis in Article 14, paragraph 9 and Article 19-2, paragraph 5 of the law.
Including the case. same as below. ) And Article 14-4, paragraph 4 and Article 14-6, paragraph 4 of the Act (these provisions
Including cases where it is applied mutatis mutandis pursuant to Article 19-4 of the Act. same as below. ) Of the standards specified by the Ordinance of the Ministry of Health, Labor and Welfare
Those related to the implementation of clinical trials of pharmaceutical products and prescribed in Article 80-2, paragraphs 1, 4 and 5 of the Act
The standards specified by the Ordinance of the Ministry of Health, Labor and Welfare shall be established.",
    "(Definition)",
    "Article 2",
...
    "Supplementary provisions",
    "(Effective date)",
    "First article",
    "This Ministerial Ordinance shall come into effect on April 1, 1991."
]
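
One caveat of collecting e.text is that text following a child element (the child's .tail) is not included, so a sentence that contains inline markup would be cut short. A possible alternative is to join itertext() per sentence element. The sketch below uses hand-written XML; the tag names Sentence and Sub are assumptions for illustration and should be checked against the actual law XML.

```python
from xml.etree import ElementTree

# Hand-written sample with inline markup inside a sentence (assumed tag names)
SAMPLE_XML = (
    "<Law><Article>"
    "<ArticleTitle>Article 1</ArticleTitle>"
    "<Sentence>CO<Sub>2</Sub> is measured.</Sentence>"
    "</Article></Law>"
)
root = ElementTree.fromstring(SAMPLE_XML)

# Same approach as get_raw(): only the text before the first child is seen
by_text = [e.text.strip() for e in root.iter() if e.text and e.text.strip()]
print(by_text)
# -> ['Article 1', 'CO', '2']

# itertext() also yields the text that follows child elements (.tail)
by_itertext = ["".join(s.itertext()).strip() for s in root.iter("Sentence")]
print(by_itertext)
# -> ['CO2 is measured.']
```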

5. Text formatting

Extract and join only the lines that end with a full stop (。). Also remove strings in full-width parentheses (example: the "（Act No. 145 of 1960）" part of "Pharmaceutical Affairs Law（Act No. 145 of 1960）") and the brackets 「 and 」. Furthermore, in the case of J-GCP, Article 56 mainly concerns the replacement of terms and is not used for analysis, so it is removed.

law_contents.py


def preprocess_gcp(raw):
    """
    Perform pre-processing on raw contents of J-GCP.

    Args:
        raw (list[str]): raw contents of J-GCP

    Returns:
        str: pre-processed string of J-GCP

    Notes:
        - Article 56 and everything after it will be removed.
        - Strings enclosed with （ and ） (full-width parentheses) will be removed.
        - 「 and 」 will be removed.
    """
    # contents = raw[:]
    # Remove Article 56 and everything after it
    # (the API returns Japanese text, where this title reads "第五十六条")
    contents = raw[: raw.index("Article 56")]
    # Select sentences (lines that end with the full stop "。")
    contents = [s for s in contents if s.endswith("。")]
    # Join the sentences
    gcp = "".join(contents)
    # Remove 「 and 」
    gcp = gcp.translate(str.maketrans({"「": "", "」": ""}))
    # Remove strings enclosed in full-width parentheses
    return re.sub(r"（[^（）]*）", "", gcp)

Output example:

J-GCP formatting


gcp = preprocess_gcp(gcp_raw)
# ->
"Article 14 (3), Article 14-4 (4) and Article 14-5 (4) of the Pharmaceutical Affairs Law,
Based on the provisions of Article 80-2, paragraphs 1, 4 and 5, and Article 82
The ministerial ordinance on the criteria for conducting clinical trials of pharmaceutical products is stipulated as follows.
This Ministerial Ordinance aims to protect the human rights of subjects, maintain safety and improve welfare.
To ensure the scientific quality of clinical trials and the reliability of results, the quality of pharmaceuticals, medical devices, etc.
Law Concerning Ensuring Effectiveness and Safety...(Omitted)
Written consent must be obtained for participation in the trial."

For other laws and regulations, replace the line that removes Article 56 with, for example, contents = raw[:].
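
A single re.sub() pass with a pattern that excludes parentheses inside the match removes only the innermost pair, so nested parentheses (common in legal text, as in the "including the case ... same as below" passage of Article 1) leave the outer pair behind. One way to handle this is to repeat the substitution until nothing changes. A minimal sketch, assuming full-width parentheses （ ） as used in Japanese text; remove_parenthesized is a hypothetical helper name.

```python
import re

# Matches one innermost pair of full-width parentheses and its contents
INNERMOST_PAREN = re.compile(r"（[^（）]*）")


def remove_parenthesized(text):
    """Remove all parenthesized strings, including nested ones."""
    previous = None
    # Repeat until a pass no longer changes the text
    while previous != text:
        previous = text
        text = INNERMOST_PAREN.sub("", text)
    return text


sample = "the Act（hereinafter the same （including exceptions）applies） is defined."
print(INNERMOST_PAREN.sub("", sample))
# -> the Act（hereinafter the same applies） is defined.
print(remove_parenthesized(sample))
# -> the Act is defined.
```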

6. Summary

I put everything together in a class.

law_all.py


class LawLoader(object):
    """
    Prepare law data from the e-Gov (https://www.e-gov.go.jp/) site.

    Args:
        category (int): category number, like 1 (all), 2 (Constitution and laws), 3 (Cabinet Order), 4 (Ministerial ordinance)
    """

    def __init__(self, category=1):
        self.law_dict = self._get_law_dict(category=category)
        self.content_dict = {}

    @staticmethod
    def _get_xml(url):
        """
        Get XML data from e-Gov API.

        Args:
            url (str): URL of the API endpoint

        Returns:
            xml.etree.ElementTree.Element: root element of the XML data
        """
        r = requests.get(url)
        return ElementTree.fromstring(r.content.decode(encoding="utf-8"))

    def _get_law_dict(self, category):
        """
        Return dictionary of law names and numbers.

        Args:
            category (int): category number, like 1 (all), 2 (Constitution and laws), 3 (Cabinet Order), 4 (Ministerial ordinance)

        Returns:
            dict(str, str): dictionary of law names (keys) and numbers (values)
        """
        url = f"https://elaws.e-gov.go.jp/api/1/lawlists/{category}"
        root = self._get_xml(url)
        names = [e.text for e in root.iter() if e.tag == "LawName"]
        numbers = [e.text for e in root.iter() if e.tag == "LawNo"]
        return {name: num for (name, num) in zip(names, numbers)}

    def get_law_number(self, keyword, category=1):
        """
        Return the law numbers of the laws whose names contain the keyword.
        The list of laws is the one retrieved from e-Gov (https://www.e-gov.go.jp/) when the loader was created.

        Args:
            keyword (str): keyword of the law name
            category (int): unused; the category is fixed when the loader is created

        Returns:
            dict(str, str): dictionary of law name (key) and law number (value)
        """
        return {k: v for (k, v) in self.law_dict.items() if keyword in k}

    def get_raw(self, number):
        """
        Retrieve contents of the law specified with law number from e-Gov API.

        Args:
            number (str): Number of the law, like '1997 Ministry of Health and Welfare Ordinance No. 28'

        Returns:
            list[str]: raw contents of the law
        """
        # Return the cached contents if this law was already retrieved
        if number in self.content_dict:
            return self.content_dict[number]
        url = f"https://elaws.e-gov.go.jp/api/1/lawdata/{number}"
        root = self._get_xml(url)
        contents = [e.text.strip() for e in root.iter() if e.text]
        raw = [t for t in contents if t]
        # Add to the cache without discarding previously retrieved laws
        self.content_dict[number] = raw
        return raw

    @staticmethod
    def pre_process(raw):
        """
        Perform pre-processing on raw contents.

        Args:
            raw (list[str]): raw contents

        Returns:
            str: pre-processed string

        Notes:
            - Strings enclosed with （ and ） (full-width parentheses) will be removed.
            - 「 and 」 will be removed.
        """
        contents = [s for s in raw if s.endswith("。")]
        string = "".join(contents)
        string = string.translate(str.maketrans({"「": "", "」": ""}))
        return re.sub(r"（[^（）]*）", "", string)

    def gcp(self):
        """
        Retrieve J-GCP and perform pre-processing on its raw contents.

        Returns:
            str: pre-processed string of J-GCP

        Notes:
            - Article 56 and everything after it will be removed.
            - Strings enclosed with （ and ） (full-width parentheses) will be removed.
            - 「 and 」 will be removed.
        """
        number_dict = self.get_law_number("Clinical trials of pharmaceuticals")
        number = number_dict["Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals"]
        raw = self.get_raw(number)
        raw_without56 = raw[: raw.index("Article 56")]
        return self.pre_process(raw_without56)

How to use:

How to use LawLoader


# The Constitution of Japan
loader2 = LawLoader(category=2)
consti_number = loader2.get_law_number("The Constitution of Japan")
print(consti_number) # -> a dict such as {'The Constitution of Japan': 'Showa 21 Constitution', ...}
consti_raw = loader2.get_raw("Showa 21 Constitution")
consti = loader2.pre_process(consti_raw)
# J-GCP: Registered as a method including data formatting
loader4 = LawLoader(category=4)
gcp = loader4.gcp()

7. Postscript

As material for natural language processing, I downloaded and formatted Japanese laws and regulations.

Thank you for your hard work!
