[Python] Get the text of a law from the e-Gov Law API

This article summarizes how to obtain and format Japanese legal data from the e-Gov Law API. I referred to the following Qiita articles.

- Access the legal API from Google Colab (Python)
- Extraction of decree including specific words using e-Gov decree API and XML Python
- Body formatting: [Python3] Delete parentheses and character strings in parentheses

You can download the code in this article from the GitHub repository, including the class that appears in the final "Summary" section.

1. Motivation

I wanted to study natural language processing using a ministerial ordinance that I often consult at work: J-GCP, the Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals. I worry that it is small compared with a corpus of Twitter posts, but I thought it would make a good subject for natural language processing because it has little notational variation.

2. Environment

I use requests (install it with pip) to access the API and the xml package (standard library) to parse the XML data. In addition, functools.lru_cache caches function output to reduce the number of API calls, pprint displays dictionaries and lists neatly, and re removes unnecessary strings with regular expressions.

#Standard library
from functools import lru_cache
from pprint import pprint
import re
from xml.etree import ElementTree
# pip install requests
import requests

	Execution environment
OS	Windows Subsystem for Linux / Ubuntu
Package management	pipenv
Language	Python 3.8.5
requests	2.24.0
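
To see what functools.lru_cache buys here, the following minimal sketch counts how often a function body actually runs. `fake_api_call` and `call_log` are hypothetical names standing in for a real request; nothing is sent over the network.

```python
from functools import lru_cache

call_log = []  # records each time the function body really runs


@lru_cache(maxsize=None)
def fake_api_call(category):
    """Stand-in for an expensive API request (hypothetical, no network access)."""
    call_log.append(category)
    return f"response for category {category}"


fake_api_call(1)
fake_api_call(1)  # served from the cache; the body does not run again
fake_api_call(2)
print(call_log)  # [1, 2]
print(fake_api_call.cache_info().hits)  # 1
```

Because the cache key is the argument value, each category would be downloaded at most once per Python session.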

3. Acquisition of law number

Separately from its name, each law has a unique ID called a "law number". It is not a simple serial number but a Japanese string ...

The law number (Horeibangou) is a number assigned individually to the laws and regulations promulgated by national and local governments in order to identify them. Management and operation differ from government to government: some restart the numbering from No. 1 at regular intervals (such as each calendar year), while others number serially from a specific date (such as independence day). Source: "Law number", Wikipedia, the free encyclopedia

Because the text of a law is requested by law number, let's first see how to look up the law number from the name.

Dictionary of law numbers

First, create a function that retrieves the relationship between the law name and the law number as a dictionary.

`law_number.py`


@lru_cache
def get_law_dict(category=1):
    # Fetch the list of laws in the given category from the API
    url = f"https://elaws.e-gov.go.jp/api/1/lawlists/{category}"
    r = requests.get(url)
    # Parse the XML data
    root = ElementTree.fromstring(r.content.decode(encoding="utf-8"))
    # Build the dictionary {law name: law number}
    names = [e.text for e in root.iter() if e.tag == "LawName"]
    numbers = [e.text for e in root.iter() if e.tag == "LawNo"]
    return {name: num for (name, num) in zip(names, numbers)}

There are four law categories (the category argument).

- 1: All laws and regulations
- 2: Constitution and laws
- 3: Cabinet Orders and Imperial Ordinances
- 4: Ministerial Ordinances
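
As a small sketch of how the category argument maps onto the request URL, the helper below validates the number before building the URL used in get_law_dict(). `CATEGORIES` and `build_lawlist_url` are hypothetical names introduced only for illustration, and the English labels are my own wording.

```python
# Category numbers accepted by the list endpoint (labels are informal translations)
CATEGORIES = {
    1: "all laws and regulations",
    2: "Constitution and laws",
    3: "Cabinet Orders and Imperial Ordinances",
    4: "Ministerial Ordinances",
}


def build_lawlist_url(category=1):
    """Return the list URL for a category, rejecting unknown numbers."""
    if category not in CATEGORIES:
        raise ValueError(f"category must be one of {sorted(CATEGORIES)}")
    return f"https://elaws.e-gov.go.jp/api/1/lawlists/{category}"


print(build_lawlist_url(4))
# https://elaws.e-gov.go.jp/api/1/lawlists/4
```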

Output example:

pprint(get_law_dict(category=2), compact=True)
# ->
{
    "Law No. 34 of Meiji 22 (Fighting crimes)": "Meiji 22 Law No. 34",
    "Deposit rules": "Meiji 23 Law No. 1",
    "Currency and Securities Imitation Control Law": "Meiji 28 Law No. 28",
    "Government Bond Securities Purchase Rejection Law": "Meiji 29 Law No. 5",
    "Civil law": "Meiji 29 Law No. 89",
...
    "Law concerning extraordinary special provisions of national tax-related law to deal with the effects of new coronavirus infections, etc.": "Reiwa 2nd Year Law No. 25",
    "Law concerning prohibition of seizure related to special fixed amount benefits, etc. for the second year of Reiwa": "Reiwa 2nd Year Law No. 27",
    "Disaster Prevention Priority Agricultural Reservoir Special Measures Law Concerning Promotion of Disaster Prevention Work, etc.": "Reiwa 2nd Year Law No. 56"
}

The root.iter() call in the "build the dictionary {law name: law number}" step walks the XML data element by element and returns an iterator. Replacing it with root.getiterator() also works on Python 3.8, but it raises the DeprecationWarning below, and the method was removed in Python 3.9.

DeprecationWarning: This method will be removed in future versions.
Use 'tree.iter()' or 'list(tree.iter())' instead.

In addition, each Element has the attributes .tag and .text.

- When .tag == "LawName": .text holds the name of the law
- When .tag == "LawNo": .text holds the law number

`Image of Element`


elements = [
    f"{e.tag=}, {e.text=}" for e in root.iter()
    if e.tag in {"LawName", "LawNo"}
]
pprint(elements[:4], compact=False)
# ->
["e.tag='LawName', e.text='Revenue and Expenditure Budget Approximate Order'",
 "e.tag='LawNo', e.text='Meiji 22nd Cabinet Decree No. 12'",
 "e.tag='LawName', e.text='Scheduled expense calculation outline'",
 "e.tag='LawNo', e.text='Meiji 22nd Cabinet Decree No. 19'"]

Using this, I created a dictionary of names and law numbers in the following part.

`get_law_dict()`


names = [e.text for e in root.iter() if e.tag == "LawName"]
numbers = [e.text for e in root.iter() if e.tag == "LawNo"]
return {name: num for (name, num) in zip(names, numbers)}
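
The same pairing can be tried offline on a hand-written XML fragment. This is only a sketch: the `DataRoot`, `ApplData` and `LawNameListInfo` wrapper elements are my assumption about the layout, and only the LawName and LawNo tags come from the code above. It also uses the tag argument of iter() instead of comparing e.tag by hand.

```python
from xml.etree import ElementTree

# Hand-written sample; the wrapper elements are assumed, not a verbatim API response
SAMPLE_XML = """
<DataRoot>
  <ApplData>
    <LawNameListInfo>
      <LawName>Civil law</LawName>
      <LawNo>Meiji 29 Law No. 89</LawNo>
    </LawNameListInfo>
    <LawNameListInfo>
      <LawName>Deposit rules</LawName>
      <LawNo>Meiji 23 Law No. 1</LawNo>
    </LawNameListInfo>
  </ApplData>
</DataRoot>
""".strip()

root = ElementTree.fromstring(SAMPLE_XML)
# iter(tag) yields only the elements with that tag, in document order
names = [e.text for e in root.iter("LawName")]
numbers = [e.text for e in root.iter("LawNo")]
law_dict = dict(zip(names, numbers))
print(law_dict)
# {'Civil law': 'Meiji 29 Law No. 89', 'Deposit rules': 'Meiji 23 Law No. 1'}
```

Pairing by position with zip() relies on every entry having both a name and a number; if one were missing, the two lists would drift out of step.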

Keyword search for names

Few people remember the official name of a law, so I make it possible to search by keyword.

`law_number.py`


def get_law_number(keyword, category=1):
    """
    Return the law number.
    This will be retrieved from e-Gov (https://www.e-gov.go.jp/)

    Args:
        keyword (str): keyword of the law name
        category (int): category number, like 1 (all), 2 (Constitution and laws), 3 (Cabinet Order), 4 (Ministerial ordinance)

    Returns:
        dict(str, str): dictionary of law name (key) and law number (value)
    """
    law_dict = get_law_dict(category=category)
    return {k: v for (k, v) in law_dict.items() if keyword in k}

Output example:

`Acquisition of law number`


print(get_law_number("Clinical trials of pharmaceuticals", category=4))
# ->
{
    'Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals': '1997 Ministry of Health and Welfare Ordinance No. 28',
    'Ministerial Ordinance on Standards for Conducting Clinical Trials of Veterinary Drugs': '1997 Ministry of Agriculture, Forestry and Fisheries Ordinance No. 75'
}

The law number of the target J-GCP (Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals) turned out to be "1997 Ministry of Health and Welfare Ordinance No. 28".
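
If a single keyword returns too many candidates, one possible extension is to require several keywords at once. The sketch below works on a plain dictionary so that it runs without the API; `search_laws` is a hypothetical helper, not part of the code above.

```python
def search_laws(law_dict, *keywords):
    """Return the entries whose name contains every keyword (hypothetical helper)."""
    return {
        name: number
        for name, number in law_dict.items()
        if all(keyword in name for keyword in keywords)
    }


# Small sample dictionary built from the output examples in this article
sample = {
    "Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals":
        "1997 Ministry of Health and Welfare Ordinance No. 28",
    "Ministerial Ordinance on Standards for Conducting Clinical Trials of Veterinary Drugs":
        "1997 Ministry of Agriculture, Forestry and Fisheries Ordinance No. 75",
    "Civil law": "Meiji 29 Law No. 89",
}
hits = search_laws(sample, "Clinical Trials", "Pharmaceuticals")
print(list(hits.values()))
# ['1997 Ministry of Health and Welfare Ordinance No. 28']
```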

4. Acquisition of the text of the law

Send the law number to the API to get the text. The function parses the XML to extract the body and removes extra whitespace and blank lines.

`law_contents.py`


@lru_cache
def get_raw(number):
    """
    Retrieve contents of the law specified with law number from e-Gov API.

    Args:
        number (str): Number of the law, like '1997 Ministry of Health and Welfare Ordinance No. 28'

    Returns:
        raw (list[str]): raw contents of the law
    """
    url = f"https://elaws.e-gov.go.jp/api/1/lawdata/{number}"
    r = requests.get(url)
    root = ElementTree.fromstring(r.content.decode(encoding="utf-8"))
    # Collect the text of every element, stripping surrounding whitespace
    contents = [e.text.strip() for e in root.iter() if e.text]
    # Drop strings that were whitespace only
    return [t for t in contents if t]

Output example:

gcp_raw = get_raw("1997 Ministry of Health and Welfare Ordinance No. 28")
pprint(gcp_raw, compact=False)
# ->
[
    "0",
    "1997 Ministry of Health and Welfare Ordinance No. 28",
...
    "table of contents",
...
    "Chapter 1 General Rules",
    "(Effect)",
    "First article",
    "This Ministerial Ordinance aims to protect the human rights of subjects, maintain safety and improve welfare, and the scientific quality of clinical trials and
Law Concerning Ensuring Quality, Effectiveness, and Safety of Pharmaceuticals, Medical Devices, etc. to Ensure Reliability of Results
(Hereinafter referred to as the "law") Article 14, paragraph 3 (applies mutatis mutandis in Article 14, paragraph 9 and Article 19-2, paragraph 5 of the law.
Including the case. same as below. ) And Article 14-4, paragraph 4 and Article 14-6, paragraph 4 of the Act (these provisions
Including cases where it is applied mutatis mutandis pursuant to Article 19-4 of the Act. same as below. ) Of the standards specified by the Ordinance of the Ministry of Health, Labor and Welfare
Those related to the implementation of clinical trials of pharmaceutical products and prescribed in Article 80-2, paragraphs 1, 4 and 5 of the Act
The standards specified by the Ordinance of the Ministry of Health, Labor and Welfare shall be established.",
    "(Definition)",
    "Article 2",
...
    "Supplementary provisions",
    "(Effective date)",
    "First article",
    "This Ministerial Ordinance shall come into effect on April 1, 1991."
]
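
The leading "0" in the output looks like a status code returned by the API. Assuming the response wraps it as `Result/Code` and puts the law itself under `ApplData` (an assumption on my part, so check it against a real response), a more defensive parser might look like the sketch below. `extract_texts` and both XML samples are hypothetical and run without network access.

```python
from xml.etree import ElementTree


def extract_texts(xml_text):
    """Return the non-empty texts under ApplData, or raise if the status is not "0"."""
    root = ElementTree.fromstring(xml_text)
    code = root.findtext("Result/Code")
    if code != "0":
        message = root.findtext("Result/Message", default="")
        raise ValueError(f"API returned code {code!r}: {message}")
    body = root.find("ApplData")
    if body is None:
        return []
    texts = (e.text.strip() for e in body.iter() if e.text)
    return [t for t in texts if t]


# Hand-written samples; the element layout is assumed, not copied from the API
OK_XML = (
    "<DataRoot><Result><Code>0</Code><Message/></Result>"
    "<ApplData><LawNo>1997 Ministry of Health and Welfare Ordinance No. 28</LawNo>"
    "<LawFullText><ArticleTitle>Article 2</ArticleTitle>"
    "<Sentence>  Sample sentence.  </Sentence></LawFullText></ApplData></DataRoot>"
)
ERROR_XML = "<DataRoot><Result><Code>1</Code><Message>not found</Message></Result></DataRoot>"

print(extract_texts(OK_XML))
# ['1997 Ministry of Health and Welfare Ordinance No. 28', 'Article 2', 'Sample sentence.']
```

Unlike get_raw(), this variant leaves the status code out of the returned list, and passing `ERROR_XML` raises a ValueError instead of returning an almost empty list.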

5. Text shaping

Extract and join only the lines that end with a full stop (。). Also remove strings in parentheses (example: the "(Act No. 145 of 1960)" in "Pharmaceutical Affairs Law (Act No. 145 of 1960)") and the brackets 「 and 」. Furthermore, in the case of J-GCP, Article 56 mainly defines word replacements and is not used for analysis, so it is removed.

`law_contents.py`


def preprocess_gcp(raw):
    """
    Perform pre-processing on raw contents of J-GCP.

    Args:
        raw (list[str]): raw contents of J-GCP

    Returns:
        str: pre-processed string of J-GCP

    Notes:
        - Article 56 will be removed.
        - Strings enclosed with （ and ） will be removed.
        - 「 and 」 will be removed.
    """
    # For other laws, use `contents = raw[:]` instead of the slice below
    # Remove Article 56 and everything after it
    contents = raw[: raw.index("Article 56")]
    # Select sentences (lines that end with a full stop)
    contents = [s for s in contents if s.endswith("。")]
    # Join the sentences
    gcp = "".join(contents)
    # Remove 「 and 」
    gcp = gcp.translate(str.maketrans({"「": "", "」": ""}))
    # Remove strings enclosed in （ and ）
    return re.sub("（[^（）]*）", "", gcp)

Output example:

`J-GCP shaping`


gcp = preprocess_gcp(gcp_raw)
# ->
"Article 14 (3), Article 14-4 (4) and Article 14-5 (4) of the Pharmaceutical Affairs Law,
Based on the provisions of Article 80-2, paragraphs 1, 4 and 5, and Article 82
The ministerial ordinance on the criteria for conducting clinical trials of pharmaceutical products is stipulated as follows.
This Ministerial Ordinance aims to protect the human rights of subjects, maintain safety and improve welfare.
To ensure the scientific quality of clinical trials and the reliability of results, the quality of pharmaceuticals, medical devices, etc.
Law Concerning Ensuring Effectiveness and Safety...(Omitted)
Written consent must be obtained for participation in the trial."
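
The regular expression in preprocess_gcp() matches only parenthesized spans that contain no further parentheses, so a single pass strips just the innermost level. Legal text often nests parentheses, and one way to handle that is to repeat the substitution until nothing changes. This is a sketch of that idea, not what the code above does; `remove_parenthesized` is a hypothetical helper and the sample strings are made up.

```python
import re

# Matches one innermost span enclosed in full-width parentheses
INNERMOST = re.compile("（[^（）]*）")


def remove_parenthesized(text):
    """Remove full-width parenthesized spans, including nested ones."""
    previous = None
    while previous != text:  # stop once a pass changes nothing
        previous = text
        text = INNERMOST.sub("", text)
    return text


print(remove_parenthesized("薬事法（昭和三十五年法律第百四十五号）"))
# 薬事法
print(remove_parenthesized("法律（以下「法」という（注）。）第一条"))
# 法律第一条
```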

The slice that removes Article 56 is specific to J-GCP; for other laws and regulations, replace it with contents = raw[:] or similar.
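
One way to avoid editing the function for each law is to make the cut-off an optional argument. The following is a minimal sketch under that assumption; `preprocess` and its `stop_at` parameter are hypothetical, and the sample list is made up for illustration.

```python
import re


def preprocess(raw, stop_at=None):
    """Join sentence lines, optionally discarding everything from `stop_at` onward."""
    if stop_at is not None and stop_at in raw:
        raw = raw[: raw.index(stop_at)]
    # Keep only lines that end with a full stop
    sentences = [s for s in raw if s.endswith("。")]
    text = "".join(sentences).translate(str.maketrans({"「": "", "」": ""}))
    # Remove strings enclosed in full-width parentheses (innermost level only)
    return re.sub("（[^（）]*）", "", text)


sample = ["第一条", "この省令は「基準」を定める（例）。", "第五十六条", "読替えの規定。"]
print(preprocess(sample, stop_at="第五十六条"))
# この省令は基準を定める。
print(preprocess(sample))
# この省令は基準を定める。読替えの規定。
```

Checking `stop_at in raw` first means a law without that heading is processed in full instead of raising the ValueError that list.index() would throw.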

6. Summary

I bundled the functions above into a class.

`law_all.py`


class LawLoader:
    """
    Prepare law data from the e-Gov (https://www.e-gov.go.jp/) site.

    Args:
        category (int): category number, like 1 (all), 2 (Constitution and laws), 3 (Cabinet Order), 4 (Ministerial ordinance)
    """

    def __init__(self, category=1):
        self.law_dict = self._get_law_dict(category=category)
        self.content_dict = {}

    @staticmethod
    def _get_xml(url):
        """
        Get XML data from e-Gov API.

        Args:
            url (str): URL of the API endpoint

        Returns:
            xml.etree.ElementTree.Element: root element of the XML data
        """
        r = requests.get(url)
        return ElementTree.fromstring(r.content.decode(encoding="utf-8"))

    def _get_law_dict(self, category):
        """
        Return dictionary of law names and numbers.

        Args:
            category (int): category number, like 1 (all), 2 (Constitution and laws), 3 (Cabinet Order), 4 (Ministerial ordinance)

        Returns:
            dict(str, str): dictionary of law names (keys) and numbers (values)
        """
        url = f"https://elaws.e-gov.go.jp/api/1/lawlists/{category}"
        root = self._get_xml(url)
        names = [e.text for e in root.iter() if e.tag == "LawName"]
        numbers = [e.text for e in root.iter() if e.tag == "LawNo"]
        return {name: num for (name, num) in zip(names, numbers)}

    def get_law_number(self, keyword, category=1):
        """
        Return the law number.
        This will be retrieved from e-Gov (https://www.e-gov.go.jp/)

        Args:
            keyword (str): keyword of the law name
            category (int): unused here; the category is fixed when the LawLoader is created

        Returns:
            dict(str, str): dictionary of law name (key) and law number (value)
        """
        return {k: v for (k, v) in self.law_dict.items() if keyword in k}

    def get_raw(self, number):
        """
        Retrieve contents of the law specified with law number from e-Gov API.

        Args:
            number (str): Number of the law, like '1997 Ministry of Health and Welfare Ordinance No. 28'

        Returns:
            raw (list[str]): raw contents of the law
        """
        # Return the cached contents if this law was already downloaded
        if number in self.content_dict:
            return self.content_dict[number]
        url = f"https://elaws.e-gov.go.jp/api/1/lawdata/{number}"
        root = self._get_xml(url)
        contents = [e.text.strip() for e in root.iter() if e.text]
        raw = [t for t in contents if t]
        # Add to the cache without discarding previously cached laws
        self.content_dict[number] = raw
        return raw

    @staticmethod
    def pre_process(raw):
        """
        Perform pre-processing on raw contents.

        Args:
            raw (list[str]): raw contents

        Returns:
            str: pre-processed string

        Notes:
            - Strings enclosed with （ and ） will be removed.
            - 「 and 」 will be removed.
        """
        contents = [s for s in raw if s.endswith("。")]
        string = "".join(contents)
        string = string.translate(str.maketrans({"「": "", "」": ""}))
        return re.sub("（[^（）]*）", "", string)

    def gcp(self):
        """
        Download J-GCP and perform pre-processing on its raw contents.

        Returns:
            str: pre-processed string of J-GCP

        Notes:
            - Article 56 will be removed.
            - Strings enclosed with （ and ） will be removed.
            - 「 and 」 will be removed.
        """
        number_dict = self.get_law_number("Clinical trials of pharmaceuticals")
        number = number_dict["Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals"]
        raw = self.get_raw(number)
        raw_without56 = raw[: raw.index("Article 56")]
        return self.pre_process(raw_without56)

How to use:

`How to use LawLoader`


# The Constitution of Japan
loader2 = LawLoader(category=2)
consti_number = loader2.get_law_number("The Constitution of Japan")
print(consti_number)  # -> a dict that contains 'The Constitution of Japan': 'Showa 21 Constitution'
consti_raw = loader2.get_raw("Showa 21 Constitution")
consti = loader2.pre_process(consti_raw)
# J-GCP: Registered as a method including data formatting
loader4 = LawLoader(category=4)
gcp = loader4.gcp()
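
The content_dict attribute acts as a per-law cache, so repeated get_raw() calls for the same number do not need to hit the API again. The pattern can be sketched in isolation as follows; `ContentCache` and `fake_fetch` are hypothetical names, and the fetch function replaces the real request so the example runs offline.

```python
class ContentCache:
    """Cache fetched contents by law number (illustrative sketch)."""

    def __init__(self, fetch):
        self._fetch = fetch    # callable that downloads the contents for a number
        self._contents = {}    # law number -> contents

    def get(self, number):
        # Fetch only on the first request for this number
        if number not in self._contents:
            self._contents[number] = self._fetch(number)
        return self._contents[number]


calls = []


def fake_fetch(number):
    """Stand-in for the API request; records which numbers were fetched."""
    calls.append(number)
    return [f"text of {number}"]


cache = ContentCache(fake_fetch)
cache.get("A")
cache.get("B")
cache.get("A")  # already cached, so fake_fetch is not called again
print(calls)  # ['A', 'B']
```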

7. Postscript

I downloaded and formatted Japanese laws and regulations as material for natural language processing.

Thank you for your hard work!