This article summarizes how to obtain and format Japanese legal data from the e-Gov Law API. I referred to the following Qiita articles.
- Access the legal API from Google Colab (Python)
- Extraction of decree including specific words using e-Gov decree API and XML Python
- Body formatting: [Python3] Delete parentheses and character strings in parentheses
The code in this article, including the class that appears in the final "Summary", can be downloaded from the GitHub repository.
I wanted to study natural language processing using a ministerial ordinance that I often consult at work: J-GCP, the Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals. I am concerned that the volume is small compared with text posted on Twitter, but I thought it would make a useful subject for natural language processing because it has little notational variation.
Use the requests package (requires pip install) to access the API and the xml package (standard library) to parse the XML data. functools.lru_cache reduces the number of API accesses by caching function output, pprint displays dictionaries and lists neatly, and re removes unnecessary strings with regular expressions.
# Standard library
from functools import lru_cache
from pprint import pprint
import re
from xml.etree import ElementTree
# pip install requests
import requests
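As a rough illustration of how lru_cache cuts down repeated API accesses, the sketch below caches a hypothetical stand-in function and counts how often its body actually runs. fetch_law_list is an assumption for demonstration only; it is not part of this article's code and makes no network call.

```python
from functools import lru_cache

call_count = 0  # how many times the function body really executes


@lru_cache(maxsize=None)
def fetch_law_list(category):
    """Hypothetical stand-in for an API request (no network access)."""
    global call_count
    call_count += 1
    return f"<payload for category {category}>"


fetch_law_list(2)  # body runs (cache miss)
fetch_law_list(2)  # served from the cache, body does not run
fetch_law_list(4)  # different argument, body runs again

print(call_count)                   # 2
print(fetch_law_list.cache_info())  # hit/miss statistics
```

Because the cache is keyed on the arguments, requesting the same category twice costs only one call. The arguments must be hashable for this to work.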
Execution environment | |
---|---|
OS | Windows Subsystem for Linux / Ubuntu |
Package management | pipenv |
Language | Python 3.8.5 |
requests | 2.24.0 |
It seems that a unique ID called the "law number" is set separately from the name of the law. The number is not a simple serial number, but a Japanese string ...
The law number (Horeibangou) is a number assigned individually, for identification, to the various laws and regulations promulgated by national and local governments. Management and operation methods differ from government to government: numbers may be reset (starting from No. 1) at regular intervals (calendar year, etc.), or may run serially from a specific date (Independence Day, etc.). "Law number" Source: Free encyclopedia "Wikipedia"
Because the text of a law is requested by its law number, we first need a way to look up the law number from the name.
First, create a function that retrieves the relationship between the law name and the law number as a dictionary.
law_number.py
@lru_cache
def get_law_dict(category=1):
    # Obtain a list of laws and regulations included in each law type from the API
    url = f"https://elaws.e-gov.go.jp/api/1/lawlists/{category}"
    r = requests.get(url)
    # Parse the XML data
    root = ElementTree.fromstring(r.content.decode(encoding="utf-8"))
    # Create dictionary {name: law number}
    names = [e.text for e in root.iter() if e.tag == "LawName"]
    numbers = [e.text for e in root.iter() if e.tag == "LawNo"]
    return {name: num for (name, num) in zip(names, numbers)}
There are four types of laws (the category argument).
- 1: All laws and regulations
- 2: Constitution, law
- 3: Cabinet Order, Royal Decree
- 4: Ministerial Ordinance
Output example:
pprint(get_law_dict(category=2), compact=True)
# ->
{
"Law No. 34 of Meiji 22 (Fighting crimes)": "Meiji 22 Law No. 34",
"Deposit rules": "Meiji 23 Law No. 1",
"Currency and Securities Imitation Control Law": "Meiji 28 Law No. 28",
"Government Bond Securities Purchase Rejection Law": "Meiji 29 Law No. 5",
"Civil law": "Meiji 29 Law No. 89",
...
"Law concerning extraordinary special provisions of national tax-related law to deal with the effects of new coronavirus infections, etc.": "Reiwa 2nd Year Law No. 25",
"Law concerning prohibition of seizure related to special fixed amount benefits, etc. for the second year of Reiwa": "Reiwa 2nd Year Law No. 27",
"Disaster Prevention Priority Agricultural Reservoir Special Measures Law Concerning Promotion of Disaster Prevention Work, etc.": "Reiwa 2nd Year Law No. 56"
}
root.iter() in the "Create dictionary {name: law number}" step walks the XML data element by element and returns an iterator. In Python 3.8 it can be replaced with root.getiterator(), but that raises the following DeprecationWarning (getiterator() was removed in Python 3.9).
DeprecationWarning: This method will be removed in future versions.
Use 'tree.iter()' or 'list(tree.iter())' instead.
In addition, each Element has the attributes .tag and .text.
- When .tag == "LawName", .text is the name of the law
- When .tag == "LawNo", .text is the law number
Image of Element
elements = [
    f"{e.tag=}, {e.text=}" for e in root.iter()
    if e.tag in {"LawName", "LawNo"}
]
pprint(elements[:4], compact=False)
# ->
["e.tag='LawName', e.text='Revenue and Expenditure Budget Approximate Order'",
"e.tag='LawNo', e.text='Meiji 22nd Cabinet Decree No. 12'",
"e.tag='LawName', e.text='Scheduled expense calculation outline'",
"e.tag='LawNo', e.text='Meiji 22nd Cabinet Decree No. 19'"]
Using this, I created a dictionary of names and law numbers in the following part.
get_law_dict()
names = [e.text for e in root.iter() if e.tag == "LawName"]
numbers = [e.text for e in root.iter() if e.tag == "LawNo"]
return {name: num for (name, num) in zip(names, numbers)}
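The pairing of names and numbers can be tried offline. The following is a minimal sketch that parses a small hand-written XML sample; the sample and the LawNameListInfo wrapper element are assumptions about the response layout, and parse_law_dict is a hypothetical helper. It reads each name and number from the same record instead of zipping two separate lists, so a record with a missing field cannot shift the pairs.

```python
from xml.etree import ElementTree

# Hand-written sample; the real API response layout is assumed, not verified.
SAMPLE_XML = """
<DataRoot>
  <Result><Code>0</Code><Message></Message></Result>
  <ApplData>
    <Category>2</Category>
    <LawNameListInfo>
      <LawName>Civil law</LawName>
      <LawNo>Meiji 29 Law No. 89</LawNo>
    </LawNameListInfo>
    <LawNameListInfo>
      <LawName>Deposit rules</LawName>
      <LawNo>Meiji 23 Law No. 1</LawNo>
    </LawNameListInfo>
  </ApplData>
</DataRoot>
"""


def parse_law_dict(xml_text):
    """Build {law name: law number} from one record element at a time."""
    root = ElementTree.fromstring(xml_text.strip())
    law_dict = {}
    for info in root.iter("LawNameListInfo"):
        name = info.findtext("LawName")
        number = info.findtext("LawNo")
        if name and number:  # skip incomplete records
            law_dict[name] = number
    return law_dict


print(parse_law_dict(SAMPLE_XML))
```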
I think that it is rare to remember the official name of the law, so I will make it possible to search by keyword.
law_number.py
def get_law_number(keyword, category=1):
    """
    Return the law number.
    This will be retrieved from e-Gov (https://www.e-gov.go.jp/)
    Args:
        keyword (str): keyword of the law name
        category (int): category number, like 1 (all), 2 (Constitution and laws), 3 (Cabinet Order), 4 (Ministerial ordinance)
    Returns:
        dict(str, str): dictionary of law name (key) and law number (value)
    """
    law_dict = get_law_dict(category=category)
    return {k: v for (k, v) in law_dict.items() if keyword in k}
Output example:
Acquisition of law number
print(get_law_number("Clinical trials of pharmaceuticals", category=4))
# ->
{
'Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals': '1997 Ministry of Health and Welfare Ordinance No. 28',
'Ministerial Ordinance on Standards for Conducting Clinical Trials of Veterinary Drugs': '1997 Ministry of Agriculture, Forestry and Fisheries Ordinance No. 75'
}
The law number of the target J-GCP (Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals) was found to be "1997 Ministry of Health and Welfare Ordinance No. 28".
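When one keyword matches several laws, as in the example above, the search can be narrowed by requiring several keywords at once. This is a small sketch under that assumption; search_laws is a hypothetical helper that works on any {name: number} dictionary and does not access the API.

```python
def search_laws(law_dict, *keywords):
    """Return entries whose law name contains every given keyword."""
    return {
        name: number
        for name, number in law_dict.items()
        if all(keyword in name for keyword in keywords)
    }


# Sample dictionary copied from the output example above.
sample = {
    "Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals":
        "1997 Ministry of Health and Welfare Ordinance No. 28",
    "Ministerial Ordinance on Standards for Conducting Clinical Trials of Veterinary Drugs":
        "1997 Ministry of Agriculture, Forestry and Fisheries Ordinance No. 75",
}

print(search_laws(sample, "Clinical Trials", "Pharmaceuticals"))
```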
Send the law number to the API and get the text. The function parses the XML to get the body and removes extra whitespace and blank lines.
law_contents.py
@lru_cache
def get_raw(number):
    """
    Retrieve contents of the law specified with law number from e-Gov API.
    Args:
        number (str): Number of the law, like '1997 Ministry of Health and Welfare Ordinance No. 28'
    Returns:
        raw (list[str]): raw contents of the law
    """
    url = f"https://elaws.e-gov.go.jp/api/1/lawdata/{number}"
    r = requests.get(url)
    root = ElementTree.fromstring(r.content.decode(encoding="utf-8"))
    contents = [e.text.strip() for e in root.iter() if e.text]
    return [t for t in contents if t]
Output example:
gcp_raw = get_raw("1997 Ministry of Health and Welfare Ordinance No. 28")
pprint(gcp_raw, compact=False)
# ->
[
"0",
"1997 Ministry of Health and Welfare Ordinance No. 28",
...
"table of contents",
...
"Chapter 1 General Rules",
"(Effect)",
"First article",
"This Ministerial Ordinance aims to protect the human rights of subjects, maintain safety and improve welfare, and the scientific quality of clinical trials and
Law Concerning Ensuring Quality, Effectiveness, and Safety of Pharmaceuticals, Medical Devices, etc. to Ensure Reliability of Results
(Hereinafter referred to as the "law") Article 14, paragraph 3 (applies mutatis mutandis in Article 14, paragraph 9 and Article 19-2, paragraph 5 of the law.
Including the case. same as below. ) And Article 14-4, paragraph 4 and Article 14-6, paragraph 4 of the Act (these provisions
Including cases where it is applied mutatis mutandis pursuant to Article 19-4 of the Act. same as below. ) Of the standards specified by the Ordinance of the Ministry of Health, Labor and Welfare
Those related to the implementation of clinical trials of pharmaceutical products and prescribed in Article 80-2, paragraphs 1, 4 and 5 of the Act
The standards specified by the Ordinance of the Ministry of Health, Labor and Welfare shall be established.",
"(Definition)",
"Article 2",
...
"Supplementary provisions",
"(Effective date)",
"First article",
"This Ministerial Ordinance shall come into effect on April 1, 1991."
]
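The leading "0" in the output above appears to be the result code of the API response rather than part of the law. The sketch below checks that code before extracting text and reads only the data part. The element names Result, Code, Message and ApplData, and the meaning of "0" as success, are assumptions about the response layout; extract_texts is a hypothetical helper that runs on hand-written samples.

```python
from xml.etree import ElementTree

OK_XML = """
<DataRoot>
  <Result><Code>0</Code><Message></Message></Result>
  <ApplData>
    <LawNum>Sample Ordinance No. 1</LawNum>
    <LawFullText>
      <ArticleTitle>Article 1</ArticleTitle>
      <Sentence>This is a sample sentence.</Sentence>
    </LawFullText>
  </ApplData>
</DataRoot>
"""

ERROR_XML = "<DataRoot><Result><Code>1</Code><Message>not found</Message></Result></DataRoot>"


def extract_texts(xml_text):
    """Return non-empty text fragments, failing early on an error response."""
    root = ElementTree.fromstring(xml_text.strip())
    code = root.findtext("Result/Code")
    if code != "0":  # assumption: "0" means success
        message = root.findtext("Result/Message") or "unknown error"
        raise ValueError(f"API returned code {code}: {message}")
    appl_data = root.find("ApplData")
    if appl_data is None:
        return []
    texts = (e.text.strip() for e in appl_data.iter() if e.text)
    return [t for t in texts if t]


print(extract_texts(OK_XML))
```

With this approach the result code no longer appears in the returned list, and a failed request raises an error instead of producing a nearly empty list.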
Extract and join only the lines that end with the full stop 。. Also remove strings in parentheses (example: the "(Act No. 145 of 1960)" in "Pharmaceutical Affairs Law (Act No. 145 of 1960)") and the corner brackets 「 and 」. Furthermore, in the case of J-GCP, Article 56 mainly concerns the replacement of terms and is not used for analysis, so it is removed.
law_contents.py
def preprocess_gcp(raw):
    """
    Perform pre-processing on raw contents of J-GCP.
    Args:
        raw (list[str]): raw contents of J-GCP
    Returns:
        str: pre-processed string of J-GCP
    Notes:
        - Article 56 will be removed.
        - Strings enclosed with （ and ） will be removed.
        - 「 and 」 will be removed.
    """
    # contents = raw[:]
    # Remove article 56
    contents = raw[: raw.index("Article 56")]
    # Select sentences
    contents = [s for s in contents if s.endswith("。")]
    # Join the sentences
    gcp = "".join(contents)
    # 「 and 」 will be removed
    gcp = gcp.translate(str.maketrans({"「": "", "」": ""}))
    # Strings enclosed with full-width （ and ） will be removed
    return re.sub(r"（[^（）]*）", "", gcp)
Output example:
J-GCP shaping
gcp = preprocess_gcp(gcp_raw)
# ->
"Article 14 (3), Article 14-4 (4) and Article 14-5 (4) of the Pharmaceutical Affairs Law,
Based on the provisions of Article 80-2, paragraphs 1, 4 and 5, and Article 82
The ministerial ordinance on the criteria for conducting clinical trials of pharmaceutical products is stipulated as follows.
This Ministerial Ordinance aims to protect the human rights of subjects, maintain safety and improve welfare.
To ensure the scientific quality of clinical trials and the reliability of results, the quality of pharmaceuticals, medical devices, etc.
Law Concerning Ensuring Effectiveness and Safety...(Omitted)
Written consent must be obtained for participation in the trial."
When processing other laws and regulations, replace the line that removes Article 56 with contents = raw[:] or similar.
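The parenthesis removal can be checked on its own. The sketch below assumes that the law text uses full-width parentheses （ ） and that they may be nested, as in the Article 1 text shown earlier. A single substitution only removes the innermost pair, so the hypothetical helper remove_parenthesized repeats the substitution until nothing changes. The sample sentence is made up for illustration.

```python
import re

# Matches one innermost pair of full-width parentheses.
PAREN = re.compile(r"（[^（）]*）")


def remove_parenthesized(text):
    """Remove full-width parenthesized spans, innermost first, until none remain."""
    previous = None
    while previous != text:
        previous = text
        text = PAREN.sub("", text)
    return text


sample = "薬事法（昭和三十五年法律第百四十五号）第十四条第三項（同条第九項（例）を含む。）の規定"
print(remove_parenthesized(sample))  # 薬事法第十四条第三項の規定
```

The loop ends when a pass removes nothing, so unbalanced parentheses are left in place rather than causing an endless loop.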
I put it together in a class.
law_all.py
class LawLoader(object):
    """
    Prepare law data with e-Gov (https://www.e-gov.go.jp/) site.
    Args:
        category (int): category number, like 1 (all), 2 (Constitution and laws), 3 (Cabinet Order), 4 (Ministerial ordinance)
    """

    def __init__(self, category=1):
        self.law_dict = self._get_law_dict(category=category)
        # Cache of raw contents: {law number: list of strings}
        self.content_dict = {}

    @staticmethod
    def _get_xml(url):
        """
        Get XML data from e-Gov API.
        Args:
            url (str): URL of the API endpoint
        Returns:
            xml.etree.ElementTree.Element: root element of the XML data
        """
        r = requests.get(url)
        return ElementTree.fromstring(r.content.decode(encoding="utf-8"))

    def _get_law_dict(self, category):
        """
        Return dictionary of law names and numbers.
        Args:
            category (int): category number, like 1 (all), 2 (Constitution and laws), 3 (Cabinet Order), 4 (Ministerial ordinance)
        Returns:
            dict(str, str): dictionary of law names (keys) and numbers (values)
        """
        url = f"https://elaws.e-gov.go.jp/api/1/lawlists/{category}"
        root = self._get_xml(url)
        names = [e.text for e in root.iter() if e.tag == "LawName"]
        numbers = [e.text for e in root.iter() if e.tag == "LawNo"]
        return {name: num for (name, num) in zip(names, numbers)}

    def get_law_number(self, keyword, category=1):
        """
        Return the law number.
        This will be retrieved from e-Gov (https://www.e-gov.go.jp/)
        Args:
            keyword (str): keyword of the law name
            category (int): not used here; the category is fixed when the loader is created
        Returns:
            dict(str, str): dictionary of law name (key) and law number (value)
        """
        return {k: v for (k, v) in self.law_dict.items() if keyword in k}

    def get_raw(self, number):
        """
        Retrieve contents of the law specified with law number from e-Gov API.
        Args:
            number (str): Number of the law, like '1997 Ministry of Health and Welfare Ordinance No. 28'
        Returns:
            raw (list[str]): raw contents of the law
        """
        if number in self.content_dict:
            return self.content_dict[number]
        url = f"https://elaws.e-gov.go.jp/api/1/lawdata/{number}"
        root = self._get_xml(url)
        contents = [e.text.strip() for e in root.iter() if e.text]
        raw = [t for t in contents if t]
        # Add to the cache without discarding entries for other laws
        self.content_dict[number] = raw
        return raw

    @staticmethod
    def pre_process(raw):
        """
        Perform pre-processing on raw contents.
        Args:
            raw (list[str]): raw contents
        Returns:
            str: pre-processed string
        Notes:
            - Strings enclosed with （ and ） will be removed.
            - 「 and 」 will be removed.
        """
        contents = [s for s in raw if s.endswith("。")]
        string = "".join(contents)
        string = string.translate(str.maketrans({"「": "", "」": ""}))
        return re.sub(r"（[^（）]*）", "", string)

    def gcp(self):
        """
        Retrieve J-GCP and perform pre-processing on its raw contents.
        Returns:
            str: pre-processed string of J-GCP
        Notes:
            - Article 56 will be removed.
            - Strings enclosed with （ and ） will be removed.
            - 「 and 」 will be removed.
        """
        number_dict = self.get_law_number("Clinical trials of pharmaceuticals")
        number = number_dict["Ministerial Ordinance on Standards for Conducting Clinical Trials of Pharmaceuticals"]
        raw = self.get_raw(number)
        raw_without56 = raw[: raw.index("Article 56")]
        return self.pre_process(raw_without56)
How to use:
How to use LawLoader
# The Constitution of Japan
loader2 = LawLoader(category=2)
consti_number = loader2.get_law_number("The Constitution of Japan")
print(consti_number) # -> a dictionary such as {'The Constitution of Japan': 'Showa 21 Constitution', ...}
consti_raw = loader2.get_raw("Showa 21 Constitution")
consti = loader2.pre_process(consti_raw)
# J-GCP: Registered as a method including data formatting
loader4 = LawLoader(category=4)
gcp = loader4.gcp()
As a subject of natural language processing, I downloaded and shaped Japanese laws and regulations.
Thank you for your hard work!