Aidemy 2020/10/30
Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I studied at the AI-specialized school "Aidemy". I want to share what I learned there, so I am summarizing it on Qiita. I am very happy that so many people read the previous summary article. Thank you! This is the second post on stock price forecasting. Thank you in advance for reading.
What to learn this time
・③ Acquire time-series data of the Nikkei Stock Average
・④ Create a model that predicts whether the stock price goes up or down

Ⅰ. Acquire time-series data of stock prices and convert it to a DataFrame
Ⅱ. Delete everything except the "closing price" and sort by date
Ⅲ. Combine the tweet data and the time-series data
・__"pd.read_csv()"__ is used this time as well, but because the csv is fetched from a URL, the URL has to be opened with the __urllib.request module__ learned in "Scraping 1". We therefore write our own "read_csv" function that bundles these steps.
・First, open the URL with __urllib.request.urlopen()__, read it with __read()__, and decode it with __decode('shift_jis')__. After that, the data can be loaded as a DataFrame with __"pd.read_csv"__ just like an ordinary csv file.
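The steps above can be sketched as follows. This is a minimal sketch, not the code from the original article: the helper name `decode_csv`, the sample rows, and the Japanese header names are assumptions made for illustration, and the actual data URL is not given in the text.

```python
import io
import urllib.request

import pandas as pd


def decode_csv(raw: bytes, encoding: str = "shift_jis") -> pd.DataFrame:
    """Decode raw CSV bytes and parse them into a DataFrame."""
    text = raw.decode(encoding)
    return pd.read_csv(io.StringIO(text))


def read_csv(url: str, encoding: str = "shift_jis") -> pd.DataFrame:
    """Open the URL, read the body, decode it, and hand it to pd.read_csv."""
    with urllib.request.urlopen(url) as res:
        return decode_csv(res.read(), encoding)


# Offline check with hypothetical rows (header names are an assumption);
# real data would come from read_csv(url) with the csv's URL.
sample = "データ日付,終値\n2020/01/06,23204.86\n2020/01/07,23575.72\n".encode("shift_jis")
df_sample = decode_csv(sample)
print(df_sample)
```

Splitting the decoding into its own helper keeps the network call separate, so the parsing step can be tried without access to the URL.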
・This forecast uses only the __"data date"__ and the __"closing price"__, so __the other columns are deleted__. We also want the data sorted by __date__, so the "data date" is converted to the __index__ before sorting.
・__"pd.to_datetime()"__ turns "data date" into time-series (datetime) values, and __"set_index()"__ makes it the index.
・The unused columns __['Open price', 'High price', 'Low price']__ are deleted with __"drop()"__, and __"sort_index(ascending=True)"__ sorts the dates in ascending order (oldest first).
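As a rough illustration of these three calls, here is a small sketch on hypothetical rows. The English column names follow the wording of this article; the real csv headers may differ (they are probably Japanese), and the function name `keep_close_sorted` is made up for the example.

```python
import pandas as pd


def keep_close_sorted(df: pd.DataFrame) -> pd.DataFrame:
    """Index by date, drop unused price columns, and sort oldest first."""
    out = df.copy()
    out["data date"] = pd.to_datetime(out["data date"])
    out = out.set_index("data date")
    out = out.drop(["Open price", "High price", "Low price"], axis=1)
    return out.sort_index(ascending=True)


# Hypothetical rows, newest first, standing in for the downloaded data.
raw = pd.DataFrame({
    "data date": ["2020/01/08", "2020/01/07", "2020/01/06"],
    "closing price": [23204.76, 23575.72, 23204.86],
    "Open price": [23217.49, 23320.12, 23319.76],
    "High price": [23303.21, 23577.44, 23365.36],
    "Low price": [22951.18, 23299.92, 23148.53],
})
df = keep_close_sorted(raw)
print(df)
```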
・Join the time-series data __"df"__ organized in the previous section with the tweet data __"df_tweets"__ created in Chapter 1 (which carries the PN value) using __"join()"__, then delete the rows containing NaN with __"dropna()"__. Save the result as "table.csv" with __"to_csv()"__.
・Code ![Screenshot 2020-10-20 18.40.26.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/e66eda5b-959c-bb46-2e0b-fa7d9051bc85.png)
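For readers who cannot see the screenshot, the join step might look roughly like this. It is a sketch on made-up values, assuming both DataFrames are indexed by date; the column names "pn" and "close" and the temporary output directory are assumptions.

```python
import os
import tempfile

import pandas as pd

# Hypothetical closing prices indexed by date (stand-in for "df").
df = pd.DataFrame(
    {"close": [23204.86, 23575.72, 23739.87]},
    index=pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-09"]),
)
# Hypothetical daily PN values indexed by date (stand-in for "df_tweets").
df_tweets = pd.DataFrame(
    {"pn": [-0.12, 0.05, 0.20]},
    index=pd.to_datetime(["2020-01-06", "2020-01-08", "2020-01-09"]),
)

# join() aligns on the index; days without a closing price become NaN
# and are removed by dropna().
table = df_tweets.join(df).dropna()

# Written to a temporary directory here; the article saves "table.csv".
out_path = os.path.join(tempfile.gettempdir(), "table.csv")
table.to_csv(out_path)
print(table)
```

Only the dates present in both inputs survive, which is why days such as weekends with tweets but no trading drop out of the table.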
Ⅰ. Split the data into training data and test data
Ⅱ. Create data with "date" as the index and "PN value" and "closing price" as the columns
Ⅲ. For the training data, calculate the day-over-day difference of the stock price (closing price) and of the PN value
Ⅳ. For each day i, take the differences of the stock price (closing price) and the PN value over the preceding three days and use them as features
Ⅴ. Build a model from the features
・This time, we use __"technical analysis"__ to forecast the stock price. In this method, whether the stock price rises or falls on the next day is predicted using __the day-to-day changes (differences) in the Nikkei average over the past three days and the changes (differences) in the PN value__ as features.
・First, split the data "table" created in the previous section, which stores the daily "PN value" and "closing price", into training data and test data with __"train_test_split()"__. The __PN value__ is stored in "X" and the __closing price__ in "y".
・Next, the training data (PN value) is standardized, and the __test data is standardized using the mean and variance of the training data__ as well.
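The split and the standardization could be sketched as below. The data is synthetic, and `test_size=0.2` and `shuffle=False` (to keep the rows in date order) are assumptions; the article does not state which arguments were used. `StandardScaler` is one way to reuse the training mean and variance on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for "table": daily PN value and closing price.
dates = pd.date_range("2020-01-06", periods=10, freq="D")
table = pd.DataFrame(
    {"pn": np.linspace(-0.5, 0.4, 10), "close": np.linspace(23000.0, 23900.0, 10)},
    index=dates,
)

X = table[["pn"]].values   # PN value
y = table["close"].values  # closing price

# shuffle=False keeps chronological order (an assumption, see lead-in).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Fit the scaler on the training data only, then reuse its mean and
# variance to transform the test data.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
```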
・After standardization, convert the training data and the test data into DataFrames. The columns are __"PN value" and "closing price"__, holding the __standardized "X" and "y"__ respectively, and the __index is the date__.
・Save the resulting DataFrames as csv so they can be used in the following sections.
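A small sketch of this conversion, using made-up standardized values. The helper `to_frame`, the column names "pn" and "close", and the temporary output path are assumptions; the article itself only names the file df_train.csv.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical standardized PN values, closing prices, and their dates.
dates_train = pd.date_range("2020-01-06", periods=4, freq="D")
X_train_std = np.array([[-1.2], [-0.3], [0.4], [1.1]])
y_train = np.array([23204.86, 23575.72, 23739.87, 23850.57])


def to_frame(X_std, y, dates):
    """Build a DataFrame with the date as index and pn/close as columns."""
    return pd.DataFrame(
        {"pn": X_std.ravel(), "close": y},
        index=pd.Index(dates, name="date"),
    )


df_train = to_frame(X_train_std, y_train, dates_train)

# Saved to a temporary directory here so the sketch leaves no files behind.
train_path = os.path.join(tempfile.gettempdir(), "df_train.csv")
df_train.to_csv(train_path)
```

The same helper can be called with the test arrays to produce the test DataFrame.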
・__The goal is to calculate the change (difference) in the PN value and the closing price over the past 3 days__, so first __process the data so that the change per day can be calculated__.
・First, open df_train.csv created in the previous section. Because this DataFrame is about to be split into its elements, prepare an empty list for each one: "exchange_dates" for the dates, "pn_rates" for the PN values, "pn_rates_diff" for the difference from the previous row's PN value, and likewise "exchange_rates" for the stock prices (closing prices) and "exchange_rates_diff" for the difference from the previous row's stock price.
・Also prepare "prev_pn" and "prev_exch", which hold the previous day's PN value and closing price.
・With that in place, __split the DataFrame__. We want to handle one day at a time, so we take it out __row by row__. Split each row on __","__: the first column is the date, the second the PN value, and the third the stock price, so __append each to the corresponding empty list created earlier__.
・In the same way, append the difference from the previous row's PN value and stock price to their lists.
・By assigning the current values to __"prev_pn" and "prev_exch"__ at the end of the for loop, they serve as the previous day's data on the next iteration.
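The loop described above can be sketched like this. The list and variable names come from the article, but the function wrapper `daily_diffs`, the sample rows, and the choice to record a difference of 0 for the first row are assumptions.

```python
def daily_diffs(lines):
    """Split "date,pn,close" rows and collect values and day-over-day diffs."""
    exchange_dates, pn_rates, pn_rates_diff = [], [], []
    exchange_rates, exchange_rates_diff = [], []
    prev_pn = prev_exch = None

    for line in lines:
        date, pn, exch = line.strip().split(",")
        pn_val, exch_val = float(pn), float(exch)
        if prev_pn is None:
            # First row has no previous day; use itself so the diff is 0.
            prev_pn, prev_exch = pn_val, exch_val

        exchange_dates.append(date)
        pn_rates.append(pn_val)
        pn_rates_diff.append(pn_val - prev_pn)
        exchange_rates.append(exch_val)
        exchange_rates_diff.append(exch_val - prev_exch)

        # Current values become "the previous day" on the next iteration.
        prev_pn, prev_exch = pn_val, exch_val

    return {
        "exchange_dates": exchange_dates,
        "pn_rates": pn_rates,
        "pn_rates_diff": pn_rates_diff,
        "exchange_rates": exchange_rates,
        "exchange_rates_diff": exchange_rates_diff,
    }


# Hypothetical rows as they would appear in df_train.csv after the header.
rows = ["2020-01-06,0.5,100.0", "2020-01-07,0.25,110.0", "2020-01-08,1.0,105.0"]
result = daily_diffs(rows)
print(result["pn_rates_diff"], result["exchange_rates_diff"])
```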
・Once the per-day changes are in place, describe the changes over __every 3 days__ in the same way.
・To address entries by index number (date), first store __the number of days per window__ in the variable "INPUT_LEN". Also prepare two empty lists: "tr_input_mat", which stores the changes over the last three days, and "tr_angle_mat", which stores whether the stock price rose or fell on the reference day (day i).
・Next, for each __day i within "data_len", the length of the PN-value change data__, collect the __PN value and stock price changes for the preceding INPUT_LEN days__ in a temporary list "tmp_arr" and append it to "tr_input_mat", which is one dimension higher.
・Likewise, the change in stock price on day i (exchange_rates_diff) is recorded in "tr_angle_mat" as __"1" for a rise and "0" for a fall__.
・Converting "tr_input_mat" and "tr_angle_mat" to __NumPy arrays__ gives __"train_feature_arr"__ (training data) and __"train_label_arr"__ (teacher labels), which finally completes data that can be passed to a model.
・The explanation so far covers only the training data, but test data is of course also needed to build a model, so __steps Ⅲ and Ⅳ must be applied to the test data in the same way__.
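The windowing step might be sketched as follows. The names "INPUT_LEN", "tmp_arr", "tr_input_mat", and "tr_angle_mat" follow the article; the function wrapper `make_windows`, the ordering of values inside each window, and treating a change of exactly 0 as a rise are assumptions.

```python
import numpy as np

INPUT_LEN = 3  # number of preceding days used as features


def make_windows(pn_rates_diff, exchange_rates_diff, input_len=INPUT_LEN):
    """Build feature rows from the previous days and up/down labels for day i."""
    data_len = len(pn_rates_diff)
    tr_input_mat, tr_angle_mat = [], []

    for i in range(input_len, data_len):
        tmp_arr = []
        for j in range(input_len):
            # Changes on each of the input_len days before day i.
            tmp_arr.append(exchange_rates_diff[i - input_len + j])
            tmp_arr.append(pn_rates_diff[i - input_len + j])
        tr_input_mat.append(tmp_arr)
        # Label: 1 if the price did not fall on day i, else 0.
        tr_angle_mat.append(1 if exchange_rates_diff[i] >= 0 else 0)

    return np.array(tr_input_mat), np.array(tr_angle_mat)


# Hypothetical daily changes.
pn_diff = [0.0, 0.1, -0.2, 0.3, -0.1]
exch_diff = [0.0, 10.0, -5.0, 20.0, -15.0]
train_feature_arr, train_label_arr = make_windows(pn_diff, exch_diff)
print(train_feature_arr.shape, train_label_arr)
```

Calling the same function on the test-data differences yields the test features and labels.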
・Once the features and teacher labels have been created for both the training data and the test data, we can finally build a model from them.
・Create prediction models with logistic regression, SVM, and random forest, and compare the prediction accuracy of each.
・(Review) A model is trained with __"model.fit()"__ and applied to new data with __"model.predict()"__.
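A comparison of the three models could look roughly like this. The feature and label arrays here are random synthetic data, so the printed accuracies say nothing about real stock prices; the hyperparameters are scikit-learn defaults apart from the assumed `n_estimators=50` and `random_state=0`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-ins for the arrays built in the previous steps
# (6 features = 3 days x [price change, PN change]).
rng = np.random.default_rng(0)
train_feature_arr = rng.normal(size=(80, 6))
train_label_arr = (train_feature_arr[:, 0] >= 0).astype(int)
test_feature_arr = rng.normal(size=(20, 6))
test_label_arr = (test_feature_arr[:, 0] >= 0).astype(int)

models = {
    "logistic regression": LogisticRegression(),
    "SVM": SVC(),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(train_feature_arr, train_label_arr)    # training
    predicted = model.predict(test_feature_arr)      # prediction
    scores[name] = float((predicted == test_label_arr).mean())  # accuracy

for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```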
・To acquire the time-series data, fetch the Nikkei Stock Average data from a URL, apply the necessary processing such as decoding, and load it as a csv file. Once loaded, __delete the unused columns__.
・When creating a stock price forecast model, the first task is to build the __data to be passed to the model__. __"The differences in stock price (closing price) and PN value over the three days before day i"__ are the training data and __"whether the stock price went up or down"__ is the teacher label, so most of the effort goes into creating these.
・The model itself can be trained with __"model.fit()"__.
That is all for this time. Thank you for reading to the end.