Let's play with the corporate analysis data set "CoARiJ" created by TIS ②

Purpose

For background on "CoARiJ", see the following:
https://www.tis.co.jp/news/2019/tis_news/20191114_1.html
https://github.com/chakki-works/CoARiJ/blob/master/README.md

Last time

https://qiita.com/vbnshin/items/09be86b4793c68f70172

Things to do

Summary

Data

The data provided by "CoARiJ" is as follows

[Figure: overview of the data provided by "CoARiJ"]

  • Non-financial data
  • Annual reports (from EDINET, XBRL file format)
  • Files with the above parsed item by item (txt format)
  • CSR reports (pdf format)
  • CSR reports are not available in txt format
  • The types of documents obtained from EDINET are as follows (FY 2018)

[Figure: types of documents obtained from EDINET, FY 2018]

Points to note in analysis

There is duplicate data

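import pandas as pd

# documents.csv is tab-separated
df_14 = pd.read_csv('../data/finance_reports/2014/2014/documents.csv', sep='\t')

# Take the filer name of the first fully duplicated row and show all of its rows
dup_name = df_14[df_14.duplicated()].iloc[0]['filer_name']
df_14[df_14['filer_name'] == dup_name]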
	edinet_code	sec_code	jcn	filer_name	fiscal_year	fiscal_period	submit_date	period_start	period_end	doc_id	...	operating_income_on_sales	ordinary_income_on_sales	capital_ratio	dividend_payout_ratio	doe	open	high	low	close	average
55	E00091	19710	2010001034861	Chuo Built Industry Co., Ltd.	2014	FY	2015-06-24	2014-04-01	2015-03-31	S10053TB	...	7.78	7.41	31.99	14.01	1.69	139.0	208.0	108.0	118.0	139.25
56	E00091	19710	2010001034861	Chuo Built Industry Co., Ltd.	2014	FY	2015-06-24	2014-04-01	2015-03-31	S10053TB	...	7.78	7.41	31.99	14.01	1.69	139.0	208.0	108.0	118.0	139.25
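
A minimal sketch of how such rows could be listed and removed, assuming that fully identical rows are safe to drop. The small DataFrame is toy data with only a few of the real columns, and the second filer is made up; it is not the actual documents.csv.

```python
import pandas as pd

# Toy stand-in for documents.csv (assumption: only a few columns are shown)
toy = pd.DataFrame({
    'edinet_code': ['E00091', 'E00091', 'E99999'],
    'filer_name': ['Chuo Built Industry Co., Ltd.',
                   'Chuo Built Industry Co., Ltd.',
                   'Hypothetical Co., Ltd.'],  # made-up filer
    'doc_id': ['S10053TB', 'S10053TB', 'S0000000'],
    'capital_ratio': [31.99, 31.99, 50.0],
})

# keep=False marks every member of a duplicated group, not only the repeats
dup_rows = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each identical row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(len(toy), len(dup_rows), len(deduped))
```

Using `keep=False` makes it easier to inspect both copies side by side before deciding to drop one.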

The same filer name appears under different EDINET codes

df_14 = pd.read_csv('../data/finance_reports/2014/2014/documents.csv', sep='\t')

# Collapse to one row per EDINET code (column-wise maximum)
df_14 = df_14.groupby('edinet_code').max().reset_index()
# Find a filer name that still appears more than once
dup_name = df_14[df_14['filer_name'].duplicated()].iloc[0]['filer_name']
df_14[df_14['filer_name'] == dup_name][['edinet_code', 'sec_code', 'jcn', 'filer_name', 'fiscal_year', 'fiscal_period', 'submit_date']]
	edinet_code	sec_code	jcn	filer_name	fiscal_year	fiscal_period	submit_date
245	E00484	28140	5180001075845	Sato Foods Industries, Ltd.	2014	FY	2015-06-26
263	E00510	29230	8110001002068	Sato Foods Industries, Ltd.	2014	FY	2015-07-24
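
To see how widespread this is, one could count the distinct EDINET codes per filer name. The sketch below reuses the two rows shown above plus one made-up filer. Comparing `jcn` as well is an extra check assumed to be useful here: differing corporate numbers may mean that the rows belong to separate legal entities that happen to share a name.

```python
import pandas as pd

# The two rows from the output above, plus one made-up filer for contrast
toy = pd.DataFrame({
    'edinet_code': ['E00484', 'E00510', 'E99999'],
    'sec_code': [28140, 29230, 99990],
    'jcn': [5180001075845, 8110001002068, 1234567890123],
    'filer_name': ['Sato Foods Industries, Ltd.',
                   'Sato Foods Industries, Ltd.',
                   'Hypothetical Co., Ltd.'],  # made-up filer
})

# Count distinct identifiers per filer name
per_name = toy.groupby('filer_name')[['edinet_code', 'jcn']].nunique()

# Names that map to more than one EDINET code
ambiguous = per_name[per_name['edinet_code'] > 1]
print(ambiguous)
```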

No companies with negative ROE (an error?)

# Load each fiscal year and collapse it to one row per EDINET code
frames = []
for year in range(2014, 2019):
    df_year = pd.read_csv(f'../data/finance_reports/{year}/{year}/documents.csv', sep='\t')
    frames.append(df_year.groupby('edinet_code').max().reset_index())

# Stack all years and drop rows that are exact duplicates
df = pd.concat(frames)
df = df[~df.duplicated()]

df[df['filer_name'].isin(['Sato Foods Industry Co., Ltd.', 'Alpha Corporation', 'FUJI CORPORATION'])]

print(len(df[df['roe'] < 0]))

>>> 0
...
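
One thing that may be worth ruling out is the aggregation step itself: `groupby('edinet_code').max()` takes the maximum of each column separately, so if an EDINET code has several rows with different ROE values, only the largest one survives. The sketch below uses made-up numbers to show the effect. Whether this actually happens in documents.csv is not something the output above confirms.

```python
import pandas as pd

# Made-up rows: one code has two filings with different ROE values
toy = pd.DataFrame({
    'edinet_code': ['E00001', 'E00001', 'E00002'],
    'fiscal_year': [2014, 2014, 2014],
    'roe': [-12.5, 3.0, -4.0],
})

negatives_before = int((toy['roe'] < 0).sum())

# Same aggregation as used above: column-wise maximum per EDINET code
agg = toy.groupby('edinet_code').max().reset_index()
negatives_after = int((agg['roe'] < 0).sum())

print(negatives_before, negatives_after)
```

Counting `roe < 0` on each raw DataFrame right after `pd.read_csv`, before any grouping, would show whether the source files contain negative values at all.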

Checking against the correct data

ROE (Return on Equity) of Japan Display

  • [Securities Report-16th Term (April 1, 2017-March 31, 2018)] (https://disclosure.edinet-fsa.go.jp/E01EW/download?uji.verb=W0EZA104CXP001003Action&uji.bean=ee.bean.parent.EECommonSearchBean&PID=W1E63011&SESSIONKEY=1575770510504&lgKbn=2&pkbn=0&skbn=1&dskbxxxaskb= = & preId = 1 & mul = Japan Display & fls = on & cal = 2 & yer = 2018 & mon = & pfs = 5 & row = 100 & idx = 0 & str = & kbn = 1 & flg = & syoruiKanriNo = & s = S100D87L)
[Screenshot: 2019-12-08 12.22.38.png]
  • Values in "CoARiJ"
df[df['edinet_code'] == 'E30481'][['edinet_code', 'filer_name', 'fiscal_year', 'roe']]
edinet_code 	filer_name 	fiscal_year 	roe
3160 E30481 Japan Display Co., Ltd. 2014 4.13
3196 E30481 Japan Display Co., Ltd. 2015 2.92
3270 E30481 Japan Display Co., Ltd. 2016 10.64
2884 E30481 Japan Display Co., Ltd. 2018 734.39
  • Every ROE value is positive, and there is no FY2017 data in the first place.
  • Do the values differ depending on whether they are consolidated or non-consolidated?
  • Even so, it is strange that no company has a negative ROE.
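
The missing FY2017 row can also be checked mechanically. A rough sketch, using only the four rows shown above and assuming the expected range is fiscal years 2014 to 2018 (the years loaded earlier):

```python
import pandas as pd

# Rows copied from the CoARiJ output above
jdi = pd.DataFrame({
    'edinet_code': ['E30481'] * 4,
    'fiscal_year': [2014, 2015, 2016, 2018],
    'roe': [4.13, 2.92, 10.64, 734.39],
})

expected_years = set(range(2014, 2019))  # assumption: FY2014 to FY2018

# For each EDINET code, list the fiscal years that have no row
missing = {
    code: sorted(expected_years - set(group['fiscal_year']))
    for code, group in jdi.groupby('edinet_code')
}
print(missing)
```

Run against the full concatenated DataFrame, the same loop would list every filer with gaps, which helps separate "missing year" problems from "wrong value" problems.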

Going forward

  • The data does not appear to be accurate, so I will not analyze it further for now.

  • Since the CSR reports are in pdf format, several steps are needed before they can be used for analysis.

  • Helpfully, the EDINET code is included in the file names (this makes it easy to link with other information).

  • I thought about extracting information from the CSR reports, such as color usage, the number of photos, and the number of characters, but how much would that cost on GCP?

  • In any case, I cannot tell whether the performance data to match against is correct, so I will stop the analysis here.

  • Please let me know if there is an error in my analysis.

  • I would not expect TIS, of all companies, to have made a mistake, but...
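
Because the file names carry the EDINET code, linking a CSR report back to the financial data could be as simple as a regular expression. This is a sketch only: the EDINET codes in this article look like "E" followed by five digits, but the file names below are hypothetical, and the real naming scheme in the dataset may differ.

```python
import re
from pathlib import Path

# EDINET codes in this article look like "E" + five digits (e.g. E30481)
EDINET_RE = re.compile(r'E\d{5}')

def edinet_code_from_path(path):
    """Return the first EDINET-like code in the file name, or None."""
    match = EDINET_RE.search(Path(path).name)
    return match.group(0) if match else None

# Hypothetical file names; the real naming scheme is not shown in the article
print(edinet_code_from_path('csr_reports/2018/E30481_csr.pdf'))
print(edinet_code_from_path('csr_reports/2018/readme.txt'))
```

The extracted code could then serve as a join key against the `edinet_code` column of documents.csv.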
