[Stock price analysis] Learning pandas with fictitious data (001: environment setup to file reading)

Introduction

I want to use stocks as a subject for learning, but I am not confident that I can analyze real data. So I will create some artificial data and work with it in Python. The main purpose is to learn Python programming.

About the data to be handled

The data covers trading days from January 4, 2016 to November 8, 2019. It is a fictitious stock whose closing price follows the trends below.

Period  Trend
2016    Declining trend
2017    Neutral
2018    Increasing trend
2019    Increasing trend (strong)

I would like to upload the data (a text file) I used, but it seems that Qiita only accepts image uploads...

The data is a 944-line CSV file with the following contents.

SampleStock01.csv


Fictitious company 01
date,Open price,High price,Low price,closing price
2016/1/4,9,934,10,055,9,933,10,000
2016/1/5,10,062,10,092,9,942,10,015
2016/1/6,9,961,10,041,9,928,10,007
2016/1/7,9,946,10,060,9,889,9,968
2016/1/8,9,812,9,952,9,730,9,932
2016/1/12,9,912,9,966,9,907,9,940
2016/1/13,9,681,9,964,9,607,9,928
2016/1/14,9,748,9,864,9,686,9,858
(Omission)

Advance preparation

Since this is just for study, I start from a clean environment. I create a virtual environment for learning as follows.

command prompt


python -m venv stock
.\stock\Scripts\Activate

After upgrading pip, install matplotlib, pandas, and seaborn.

command prompt


python -m pip install --upgrade pip
pip install matplotlib
pip install pandas
pip install seaborn

Check the installed packages

command prompt


pip list

Execution result

Package         Version
--------------- -------
cycler          0.10.0
kiwisolver      1.1.0
matplotlib      3.1.1
numpy           1.17.4
pandas          0.25.3
pip             19.3.1
pyparsing       2.4.5
python-dateutil 2.8.1
pytz            2019.3
scipy           1.3.2
seaborn         0.9.0
setuptools      40.8.0
six             1.13.0
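The same check can also be done from inside Python. This is only a minimal sketch, assuming Python 3.8 or later (where importlib.metadata is part of the standard library); installed_version is a helper name made up for this example.

```python
from importlib import metadata


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


# Report the packages installed in the steps above
for pkg in ("pandas", "matplotlib", "seaborn"):
    print(pkg, installed_version(pkg))
```

A package that is not installed in the active virtual environment is reported as None instead of raising an error.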

Read file

Failure example 01

First of all, without thinking too hard, try reading the file with pd.read_csv().

fail_case01.py


import pandas as pd

dframe = pd.read_csv('SampleStock01.csv')

Execution result

As expected, an error is returned.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

File I/O is the part where I always stumble when working with Python... My train of thought gets interrupted here every time.
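When the encoding of a file is unknown, one option is to try a few candidate encodings on the raw bytes and keep the first one that decodes without error. The sketch below is only an illustration: guess_encoding is a hypothetical helper, the candidate list is an assumption, and the Japanese company name is made up for the example (the real first line of the file is not shown in this article). A successful decode does not prove that the guess is right; it only rules out encodings that fail.

```python
def guess_encoding(raw, candidates=("utf-8", "cp932")):
    """Return the first candidate encoding that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical first line of the file, encoded the way a Japanese
# Windows tool would typically save it (Shift_JIS / cp932)
sample = "架空会社01\n".encode("cp932")

detected = guess_encoding(sample)
print(detected)
```

In practice the bytes would come from open('SampleStock01.csv', 'rb').read(), and the detected name could then be passed to the encoding argument of pd.read_csv().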

Failure example 02

Failure example 01 was within expectations, though. It is just a matter of specifying the encoding.

fail_case02.py


import pandas as pd

# Specify the character encoding of the CSV file (SampleStock01.csv)
dframe = pd.read_csv('SampleStock01.csv', encoding="SJIS")

Execution result

Yes, I knew it. This fails too.

pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 9

The first line of the CSV file holds the stock name, so the column header has to be read from the second line.

Failure example 03

This is just a matter of skipping the first line and using the second line as the header.

fail_case03.py


import pandas as pd

# Specify the character encoding of the CSV file (SampleStock01.csv)
# and use the second line as the header
dframe = pd.read_csv('SampleStock01.csv', encoding="SJIS", header=1)

print(dframe)

Execution result

                     Date  Open Price  High  Low Price  Close
2016/1/4  9  934 10    55           9   933         10      0
2016/1/5  10 62  10    92           9   942         10     15
2016/1/6  9  961 10    41           9   928         10      7
2016/1/7  9  946 10    60           9   889          9    968
2016/1/8  9  812 9    952           9   730          9    932
...                   ...          ..   ...         ..    ...
2019/11/1 13 956 15    59          13   940         14    928
2019/11/5 13 893 15    54          13   820         14    968
2019/11/6 14 3   15   155          13   919         15     47
2019/11/7 14 180 15    54          14    57         15     41
2019/11/8 14 76  15    52          13   939         15     41

[942 rows x 5 columns]

At first glance the file seemed to have been read into the DataFrame properly, and I was pleased.

However, the CSV delimiter "," and the thousands separator "," are mixed, so the values are not read correctly into the DataFrame.
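The odd layout of the result can be reproduced on a small in-memory sample. This is a sketch based on my understanding of pandas: when the data rows have more fields than the header, the extra leading fields are used as the index. The sample uses the first two data rows of the file and reads them from a string, so no encoding is needed.

```python
import io

import pandas as pd

# First lines of the comma-separated file, held in memory
sample = (
    "Fictitious company 01\n"
    "date,Open price,High price,Low price,closing price\n"
    "2016/1/4,9,934,10,055,9,933,10,000\n"
    "2016/1/5,10,062,10,092,9,942,10,015\n"
)

dframe = pd.read_csv(io.StringIO(sample), header=1)

# Number of index levels built from the surplus leading fields
print(dframe.index.nlevels)
# Values that ended up under the five header names in the first row
print(dframe.iloc[0].tolist())
```

The date and the first few price fragments end up in the index, and only the remaining fragments land under the column names, which matches the broken output above.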

The CSV file I wanted to read

SampleStock01.csv


Fictitious company 01
date,Open price,High price,Low price,closing price
2016/1/4,9,934,10,055,9,933,10,000
2016/1/5,10,062,10,092,9,942,10,015
2016/1/6,9,961,10,041,9,928,10,007
2016/1/7,9,946,10,060,9,889,9,968
2016/1/8,9,812,9,952,9,730,9,932
2016/1/12,9,912,9,966,9,907,9,940
2016/1/13,9,681,9,964,9,607,9,928
2016/1/14,9,748,9,864,9,686,9,858
(Omission)
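The mix-up becomes visible when the comma-separated fields on each line are counted. The following is a small sketch using the standard csv module on the first lines of the file, held in a string for illustration.

```python
import csv
import io

# First lines of the file, held in memory
sample = (
    "Fictitious company 01\n"
    "date,Open price,High price,Low price,closing price\n"
    "2016/1/4,9,934,10,055,9,933,10,000\n"
    "2016/1/5,10,062,10,092,9,942,10,015\n"
)

# Count how many comma-separated fields each line contains
field_counts = [len(row) for row in csv.reader(io.StringIO(sample))]
print(field_counts)  # [1, 5, 9, 9]
```

The header has 5 fields, but each data line has 9, because every price is split in two at its thousands separator. This is also where the "Expected 5 fields in line 3, saw 9" message in failure example 02 comes from.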

To be honest, I could not think of anything other than modifying the input file, so I changed the CSV delimiter from "," to a tab character. But what should I do if I run into this kind of situation when analyzing business logs? If anyone knows a good way, please let me know.
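One possible approach that avoids editing the file by hand is to rejoin the split fields in code. The sketch below is not a general solution: it assumes that every price contains exactly one thousands separator (that is, every value is between 1,000 and 999,999), which holds for the rows shown here but has to be verified for real data. repair_rows is a hypothetical helper name.

```python
# First lines of the comma-separated file, held in memory
raw = (
    "Fictitious company 01\n"
    "date,Open price,High price,Low price,closing price\n"
    "2016/1/4,9,934,10,055,9,933,10,000\n"
    "2016/1/5,10,062,10,092,9,942,10,015\n"
)


def repair_rows(text):
    """Rejoin prices that were split at their thousands separator.

    Assumes each price has exactly one "," (1,000 to 999,999).
    """
    lines = text.splitlines()
    header = lines[1].split(",")  # the second line holds the column names
    rows = []
    for line in lines[2:]:
        date, *parts = line.split(",")
        # Each price column must have produced exactly two fragments
        if len(parts) != 2 * (len(header) - 1):
            raise ValueError(f"unexpected field count: {line!r}")
        prices = [int(high + low) for high, low in zip(parts[0::2], parts[1::2])]
        rows.append([date, *prices])
    return header, rows


header, rows = repair_rows(raw)
print(header)
print(rows[0])  # ['2016/1/4', 9934, 10055, 9933, 10000]
```

The result could be handed to pd.DataFrame(rows, columns=header). Lines that do not fit the assumption raise an error instead of being silently misread.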

Anyway, modify the CSV to be read as follows.

SampleStock01_t1.csv


Fictitious company 01
Date	Open Price	High	Low Price	Close
2016/1/4	9,934 	10,055 	9,933 	10,000 
2016/1/5	10,062 	10,092 	9,942 	10,015 
2016/1/6	9,961 	10,041 	9,928 	10,007 
2016/1/7	9,946 	10,060 	9,889 	9,968 
2016/1/8	9,812 	9,952 	9,730 	9,932 
2016/1/12	9,912 	9,966 	9,907 	9,940 
2016/1/13	9,681 	9,964 	9,607 	9,928 
2016/1/14	9,748 	9,864 	9,686 	9,858 
(Omission)

Success example

Hoping that the fourth time would be the charm, I added an option to the code so far that specifies the tab character as the delimiter.

Success_case.py


import pandas as pd

# Read the tab-separated file (SampleStock01_t1.csv): specify the character
# encoding, use the second line as the header, and set the delimiter to a tab
dframe = pd.read_csv('SampleStock01_t1.csv', encoding='SJIS',
                     header=1, sep='\t')

print(dframe)

Execution result

          Date  Open Price    High  Low Price   Close
0     2016/1/4       9,934  10,055      9,933  10,000
1     2016/1/5      10,062  10,092      9,942  10,015
2     2016/1/6       9,961  10,041      9,928  10,007
3     2016/1/7       9,946  10,060      9,889   9,968
4     2016/1/8       9,812   9,952      9,730   9,932
..         ...         ...     ...        ...     ...
937  2019/11/1      13,956  15,059     13,940  14,928
938  2019/11/5      13,893  15,054     13,820  14,968
939  2019/11/6      14,003  15,155     13,919  15,047
940  2019/11/7      14,180  15,054     14,057  15,041
941  2019/11/8      14,076  15,052     13,939  15,041

[942 rows x 5 columns]

There are still many concerns, such as specifying the index and the column types, but read_csv finally works. In reference books this is only a few lines of work, but...
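As for the index and the column types, read_csv has options that may help: thousands=',' to parse values such as "9,934" as numbers, index_col to choose the index column, and parse_dates to convert the index to dates. The following is a minimal sketch on an in-memory sample with the same layout as the tab-separated file. The sample omits the trailing spaces seen in the listing above, and I have not run it against the full 944-line file.

```python
import io

import pandas as pd

# Same layout as SampleStock01_t1.csv, held in memory for illustration
sample = (
    "Fictitious company 01\n"
    "Date\tOpen Price\tHigh\tLow Price\tClose\n"
    "2016/1/4\t9,934\t10,055\t9,933\t10,000\n"
    "2016/1/5\t10,062\t10,092\t9,942\t10,015\n"
)

dframe = pd.read_csv(
    io.StringIO(sample),
    header=1,          # the second line holds the column names
    sep="\t",          # fields are separated by tabs
    thousands=",",     # treat "," inside numbers as a thousands separator
    index_col="Date",  # use the date column as the index
    parse_dates=True,  # convert the index to datetime values
)

print(dframe.dtypes)
print(dframe.index)
```

With the real file, io.StringIO(sample) would be replaced by the file name and encoding='SJIS' added, as in Success_case.py.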

In closing

File I/O is the biggest hurdle for me when working with DataFrames. Does it go smoothly for everyone else? This is not limited to DataFrames, or even to Python as a whole: file I/O has been my nemesis since my C days.

Once the data is loaded, the rest is easy because it is just a programming problem.
