In the 5th post of Advent Calendar, I will try to extract the desired information from the XBRL data acquired by the EDINET API. This time, I tried to target the average salary and average age of companies that are often seen in job hunting.
(The program in this article is provided as it is without any guarantee, and XBRL Japan takes all responsibility for any disadvantages or problems caused by using this program, regardless of the cause. I will not bear it.)
This is a Python language program that decompresses and analyzes the zip file (including XBRL data) acquired by the EDINET API, extracts information such as salary from the securities report, and then outputs the necessary information as CSV. (All codes are listed in "3. Source code") The items to be extracted from the XBRL data are as shown in the table below.
item | Element name defined in XBRL | Acquisition example |
---|---|---|
EDINETCODE | EDINETCodeDEI | E00004 |
Company name | FilerNameInJapaneseDEI | Kaneko Seed Co., Ltd. |
Average annual salary (yen) | AverageAnnualSalaryInformationAboutReportingCompanyInformationAboutEmployees | 5,554,933 |
Average years of service (years) | AverageLengthOfServiceYearsInformationAboutReportingCompanyInformationAboutEmployees | 13.0 |
Average years of service (month) | AverageLengthOfServiceMonthsInformationAboutReportingCompanyInformationAboutEmployees | - |
Average age (years) | AverageAgeYearsInformationAboutReportingCompanyInformationAboutEmployees | 42.2 |
Average age (month) | AverageAgeMonthsInformationAboutReportingCompanyInformationAboutEmployees | - |
Number of employees (persons) | NumberOfEmployees | 625 |
Other characteristic points are described below.
○ Use the EDINET code list to identify the industry
of the company
○ It does not correspond to the acquisition of the company's unique extension item
○ An OSS parser called ʻArellewas used for XBRL analysis. ○ If there is an average length of service (month) and average age (month),
annual conversion` (rounded to the first decimal place).
Please take the following actions before executing the program. In addition, it is necessary to install other libraries in advance. Depending on the execution environment, change the encoding and path specification method as appropriate.
Download the zip file (including XBRL data) from EDINET by utilizing "Let's search for the data described in XBRL" and "Let's collect the data described in XBRL". The average annual salary (yen) and other items are newly added items on EDINET, so for the time being, 2019 Download the XBRL data disclosed after April 1st.
Install the XBRL Parser called Arelle. When ʻArelle` loads the instance, it analyzes with the DTS set including external taxonomy information. Therefore, it is suitable for utilizing advanced XBRL data, but if you use only the instance, it takes time to process, so consider replacing it with the OSS XML library or the parser you are developing. ..
The qiita article also has information on how to install Arelle and other parsers under development, so I will list some of them here. ・ XBRL analysis that does not start from zero (utilization of Arelle) -I made a Python library for EDINET's XBRL --Parser ・ I made a class to download and parse XBRL from UFO Catcher ・ Acquire the securities report XBRL of a listed company and read the value (← Mainly improve the acquisition part of namespace
You can download it from the ʻEDINET code list` at the bottom of the page after the [Download] tab transition of the EDINET site. This list is listed with ʻEDINET CODE and industry. Therefore, by using this list, you can also acquire the type of business of the company.
○ Decide the storage folder to read the downloaded EDINET code list.
edinetcodedlinfo_filepath = 'C://Users//xxx//Desktop//xbrlReport//EdinetcodeDlInfo.csv'
○ Decide the folder to store the zip file obtained by EDINET API.
zip_dir = 'C://Users//xxx//Desktop//xbrlReport//SR//'
xbrl_file_expressions = 'C://Users//xxx//Desktop//xbrlReport//SR//XBRL//PublicDoc//*.xbrl'
○ Decide the folder and file name to output CSV.
employee_frame.to_csv("C://Users//xxx//Desktop//xbrlReport//xbrl_qiita.csv", encoding='cp932')
Unzip the zip file obtained by EDINET API, and get the list of instance files
(.xbrl extension) under the placed PublicDoc folder with glob
. (The explanation for the process per zip decompression is omitted)
Code1
xbrl_file_expressions = glob.glob('C:\\Users\\xxx\\Desktop\\xbrlReport\\SR\\XBRL\\PublicDoc\\*.xbrl')
Result1
['C:\\Users\\XXX\\Desktop\\xbrlReport\\SR\\XBRL\\PublicDoc\\jpcrp030000-asr-001_E00112-000_2019-03-31_01_2019-06-27.xbrl', 'C:\\Users\\XXX\\Desktop\\xbrlReport\\SR\\XBRL\\PublicDoc\\jpcrp030000-asr-001_E01422-000_2019-03-31_01_2019-06-27.xbrl', 'C:\\Users\\XXX\\Desktop\\xbrlReport\\SR\\XBRL\\PublicDoc\\jpcrp030000-asr-001_E02770-000_2019-03-31_01_2019-06-21.xbrl', 'C:\\Users\\XXX\\Desktop\\xbrlReport\\SR\\XBRL\\PublicDoc\\jpcrp030000-asr-001_E04510-000_2019-03-31_01_2019-06-27.xbrl']
Loop processing is performed using xbrl_files
, and XBRL data is read by Arelle's modelManager.load (xbrl_file) for each instance file. As a result, you can get Arelle's model_xbrl object that holds the XBRL data information.
Code2
for index, xbrl_file in enumerate(xbrl_files):
ctrl = Cntlr.Cntlr()
model_manager= ModelManager.initialize(ctrl)
model_xbrl= modelManager.load(xbrl_file)
Result2
<arelle.ModelXbrl.Model_xbrl object at 0x0EF6F350>
Next, loop processing is performed using model_xbrl.facts
in which all fact
s (actual data such as the numerical values of the elements) included in the instance are defined in list format, and the element name of each fact
(fact. get concept.qname.localName
). When the element name is the element name to be extracted this time, it is a mechanism to acquire the numerical value (fact.value
) of the element. When processing EDINETCODE, incorporate the logic to identify the industry of the company from ʻEdinetcodeDlInfo.csv. Regarding the number of employees, since the same element name exists multiple times, the current period (
CurrentYearInstant) is specified in the period attribute called
context`.
Code3
for fact in model_xbrl.facts:
if fact.concept.qname.localName == 'EDINETCodeDEI':
edinet_code = fact.value
for code_name in edinet_info_list:
if code_name[0] == edinet_code:
industry_code = code_name[1]
break
elif fact.concept.qname.localName=='FilerNameInJapaneseDEI':
elif fact.concept.qname.localName=='AverageAnnualSalaryInformationAboutReportingCompanyInformationAboutEmployees':
elif fact.concept.qname.localName=='AverageLengthOfServiceYearsInformationAboutReportingCompanyInformationAboutEmployees':
elif fact.concept.qname.localName=='AverageLengthOfServiceMonthsInformationAboutReportingCompanyInformationAboutEmployees':
elif fact.concept.qname.localName=='AverageAgeYearsInformationAboutReportingCompanyInformationAboutEmployees':
elif fact.concept.qname.localName=='AverageAgeMonthsInformationAboutReportingCompanyInformationAboutEmployees':
elif fact.concept.qname.localName=='NumberOfEmployees':
if fact.contextID == 'CurrentYearInstant_NonConsolidatedMember':
After that, the information of each acquired element (ʻedinet_code,
filer_name_jp, ʻindustry_code
, salary_info
, service_years
, ʻage_years,
number_of_employees) is stored in
company_info_list. At this time, if the average years of service (months) (
service_months) and average age (months) (ʻage_months
) exist, we have incorporated a process to convert them into years. After that, if you store company_info_list
in EDINETCODE unit (instance file unit) in ʻedinet_company_info_list`, a list containing information of employees of each company will be completed.
Code4
company_info_list.append(edinet_code)
company_info_list.append(filer_name_jp)
company_info_list.append(industry_code)
company_info_list.append(salary_info)
if len(service_months) != 0:
service_years_decimal= round(int(service_months)/12,1)
service_years = int(service_years) + service_years_decimal
service_years = str(service_years)
company_info_list.append(service_years)
if len(age_months) != 0:
age_years_decimal= round(int(age_months)/12,1)
age_years = int(age_years) + age_years_decimal
age_years = str(age_years)
company_info_list.append(age_years)
company_info_list.append(number_of_employees)
edinet_company_info_list.append(Company_info_list)
ʻEdinet_company_info_list` looks like this:
Result4
[['E00112', 'Totetsu Kogyo Co., Ltd.', 'Construction industry', '8547489', '13.8', '41.2', '1673'], ['E01422', 'Fuji Sash Co., Ltd.', 'Metal products', '5527000', '20.7', '44.7', '850'], ['E02770', 'MISUMI Group Inc. Headquarters', 'Wholesale business', '', '', '', '1293'], ['E04510', 'Electric Power Development Co., Ltd.', 'Electricity and gas industry', '7980312', '19.6', '40.9', '2445']・ ・ ・]
Finally, the obtained ʻedinet_company_info_list` is output in CSV format. This completes the extraction of the required elements.
Code5
employee_frame = pd.DataFrame(edinet_company_info_list,
columns=['EDINETCODE', 'Company name', 'Industry', 'Average annual salary (yen)', 'Average years of service (years)', 'Average age (years)', 'Number of employees (persons)'])
employee_frame.to_csv("C://Users//xxx//Desktop//xbrlReport//xbrl_qiita.csv", encoding='cp932')
By the way, when you run the program, the following CSV file will be finally output. There are several types of average annual salary units (yen, 1,000 yen, etc.) depending on the method of corporate disclosure. In addition, although the data is targeted for data from April 2019 onward, due to the fiscal year end of each company, there are some that have not been tagged yet and salary-related data cannot be obtained.
# -*- coding: utf-8 -*-
from arelle import ModelManager
from arelle import Cntlr
import os
import zipfile
import glob
import pandas as pd
def make_edinet_info_list(edinetcodedlinfo_filepath):
edinet_info = pd.read_csv(edinetcodedlinfo_filepath, skiprows=1,
encoding='cp932')
edinet_info = edinet_info[["EDINET code", "Submitter industry"]]
edinet_info_list = edinet_info.values.tolist()
return edinet_info_list
def unzip_file(zip_dir,xbrl_file_expressions):
zip_files = glob.glob(os.path.join(zip_dir, '*.zip'))
number_of_zip_lists = len(zip_files)
print("number_of_zip_lists:", number_of_zip_lists)
for index, zip_file in enumerate(zip_files):
print(zip_file, ":", index + 1, "/", number_of_zip_lists)
with zipfile.ZipFile(zip_file) as zip_f:
zip_f.extractall(zip_dir)
zip_f.close()
xbrl_files = glob.glob(xbrl_file_expressions)
return xbrl_files
def make_edinet_company_info_list(xbrl_files,edinet_info_list):
edinet_company_info_list = []
for index, xbrl_file in enumerate(xbrl_files):
edinet_code = "" # EDINETCODE
filer_name_jp = "" #Company name
industry_code = "" #Industry
salary_info = "" #Average annual salary (yen)
service_years = "" #Average years of service (years)
service_months = "" #Average years of service (month)
age_years = "" #Average age (years)
age_months = "" #Average age (month)
number_of_employees = "" #Number of employees (persons)
company_info_list = [] #Corporate information
ctrl = Cntlr.Cntlr()
model_manager = ModelManager.initialize(ctrl)
model_xbrl = model_manager.load(xbrl_file)
print(xbrl_file, ":", index + 1, "/", len(xbrl_files))
for fact in model_xbrl.facts:
if fact.concept.qname.localName == 'EDINETCodeDEI':
print("EDINET code", fact.value)
edinet_code = fact.value
for code_name in edinet_info_list:
if code_name[0] == edinet_code:
print("Industry",code_name[1])
industry_code = code_name[1]
break
elif fact.concept.qname.localName == 'FilerNameInJapaneseDEI':
print("Company name", fact.value)
filer_name_jp = fact.value
elif fact.concept.qname.localName == 'AverageAnnualSalaryInformationAboutReportingCompanyInformationAboutEmployees':
print("Average annual salary (yen)", fact.value)
salary_info = fact.value
elif fact.concept.qname.localName == 'AverageLengthOfServiceYearsInformationAboutReportingCompanyInformationAboutEmployees':
print("Average years of service (years)", fact.value)
service_years = fact.value
elif fact.concept.qname.localName == 'AverageLengthOfServiceMonthsInformationAboutReportingCompanyInformationAboutEmployees':
print("Average years of service (month)", fact.value)
service_months = fact.value
elif fact.concept.qname.localName == 'AverageAgeYearsInformationAboutReportingCompanyInformationAboutEmployees':
print("Average age (years)", fact.value)
age_years = fact.value
elif fact.concept.qname.localName == 'AverageAgeMonthsInformationAboutReportingCompanyInformationAboutEmployees':
print("Average age (month)", fact.value)
age_months = fact.value
elif fact.concept.qname.localName == 'NumberOfEmployees':
if fact.contextID == 'CurrentYearInstant_NonConsolidatedMember':
print("Number of employees (persons)", fact.value)
number_of_employees = fact.value
print("")
company_info_list.append(edinet_code)
company_info_list.append(filer_name_jp)
company_info_list.append(industry_code)
company_info_list.append(salary_info)
if len(service_months) != 0:
service_years_decimal = round(int(service_months) / 12, 1)
service_years = int(service_years) + service_years_decimal
service_years = str(service_years)
company_info_list.append(service_years)
if len(age_months) != 0:
age_years_decimal = round(int(age_months) / 12, 1)
age_years = int(age_years) + age_years_decimal
age_years = str(age_years)
company_info_list.append(age_years)
company_info_list.append(number_of_employees)
edinet_company_info_list.append(company_info_list)
return edinet_company_info_list
def write_csv_of_employee_info(edinet_company_info_list):
employee_frame = pd.DataFrame(edinet_company_info_list,
columns=['EDINETCODE', 'Company name', 'Industry', 'Average annual salary (yen)', 'Average years of service (years)', 'Average age (years)', 'Number of employees (persons)'])
print(employee_frame)
employee_frame.to_csv("C://Users//xxx//Desktop//xbrlReport//xbrl_qiita.csv", encoding='cp932')
def main():
edinetcodedlinfo_filepath = 'C://Users//xxx//Desktop//xbrlReport//EdinetcodeDlInfo.csv'
edinet_info_list = make_edinet_info_list(edinetcodedlinfo_filepath)
zip_dir = 'C://Users//xxx//Desktop//xbrlReport//SR//'
xbrl_file_expressions = 'C://Users//xxx//Desktop//xbrlReport//SR//XBRL//PublicDoc//*.xbrl'
xbrl_files = unzip_file(zip_dir,xbrl_file_expressions)
edinet_company_info_list = make_edinet_company_info_list(xbrl_files,edinet_info_list)
print(edinet_company_info_list)
write_csv_of_employee_info(edinet_company_info_list)
print("extract finish")
if __name__ == "__main__":
main()
The element name defined in XBRL
(ex.EDINETCodeDEI) was described without any explanation, but all of them are published on the EDINET site. The latest version as of December 2019 is the taxonomy element list
andaccount list of [About the publication of the 2020 EDINET taxonomy](https://www.fsa.go.jp/search/20191101.html). You can check from
. The subjects extracted this time are listed in the taxonomy element list (1e_ElementList.xlsx
). For details, see [Cabinet Office Ordinance No. 3 Form Securities Report on Disclosure of Corporate Information, etc. (jpcrp030000-asr)]-[9] Sheet --212000b Employee Status (jpcrp_212000-002_2019-11-01_pre.xml) ), Please check. By utilizing these lists, it is possible to extract various subjects other than employee information. The EDINET taxonomy may be updated in order to respond to the revision of laws and accounting standards. Therefore, when processing an instance, make sure that the EDINET taxonomy version is correct.
For inquiries regarding this article, please contact the following e-mail address. e-mail:[email protected] (Of course, comments on qiita are also welcome)
This e-mail address will be the contact point for inquiries about the Development Committee of XBRL Japan, which writes the article for qiita. I will. Therefore, we cannot answer general inquiries about the organization depending on the content, but please feel free to contact us with any technical questions, opinions, requests, advice, etc. regarding XBRL. Please note that it may take some time to respond because the committee members are volunteers.
Recommended Posts