There are two approaches to investing in stocks: technical analysis and fundamental analysis. This time we will deal with technical analysis.
We will predict the Nikkei Stock Average using Twitter. First, I will explain the general flow.
1. Use the Twitter API to get the past tweets of an account.
2. Analyze the sentiment of each day's tweets using a polarity dictionary.
3. Get time series data for the Nikkei Stock Average.
4. Use machine learning to predict whether the stock price will rise or fall the next day from the daily sentiment.
You will need an access token to get tweets from Twitter. It plays the same role as the ID and password of a user account.
It consists of two character strings, the "Access Token Key" and the "Access Token Secret".
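The scripts in this article paste the keys directly into the source as empty-string placeholders. As a minimal sketch of one alternative, the four strings (consumer key, consumer secret, access token, access token secret) could be read from environment variables instead. The variable names below are hypothetical, chosen only for this illustration.

```python
import os

# Hypothetical environment variable names; any names can be used.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(env=os.environ):
    """Return the four credential strings, raising KeyError if any is missing or empty."""
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}
```

The returned values can then be used in place of the empty strings assigned to CK, CS, AT, and AS.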
First, let's get tweets that contain a certain word.
import json
from requests_oauthlib import OAuth1Session

CK = ''  # Enter your Consumer Key
CS = ''  # Enter your Consumer Secret
AT = ''  # Enter your Access Token
AS = ''  # Enter your Access Token Secret

session = OAuth1Session(CK, CS, AT, AS)
url = 'https://api.twitter.com/1.1/search/tweets.json'

# Request up to 100 tweets that contain the word "python"
res = session.get(url, params={'q': 'python', 'count': 100})
res_text = json.loads(res.text)

# Print the creation time and the text of each tweet
for tweet in res_text['statuses']:
    print('-----')
    print(tweet['created_at'])
    print(tweet['text'])
The following gets tweets that include "Artificial intelligence".
import json
from requests_oauthlib import OAuth1Session

CK = ''  # Enter your Consumer Key
CS = ''  # Enter your Consumer Secret
AT = ''  # Enter your Access Token
AS = ''  # Enter your Access Token Secret

session = OAuth1Session(CK, CS, AT, AS)
url = 'https://api.twitter.com/1.1/search/tweets.json'

# Request up to 100 tweets that contain "Artificial intelligence"
res = session.get(url, params={'q': 'Artificial intelligence', 'count': 100})
res_text = json.loads(res.text)

# Print the creation time and the text of each tweet
for tweet in res_text['statuses']:
    print('-----')
    print(tweet['created_at'])
    print(tweet['text'])
Next, let's get the tweets of the Nikkei Sangyo Shimbun account.
import tweepy
import csv

consumer_key = ""     # Enter the consumer_key obtained with your personal account
consumer_secret = ""  # Enter the consumer_secret obtained with your personal account
access_key = ""       # Enter the access_key obtained with your personal account
access_secret = ""    # Enter the access_secret obtained with your personal account

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

# Get tweets: one row per tweet with id, creation time, text (newlines removed), favorites, retweets
tweet_data = []
tweets = tweepy.Cursor(api.user_timeline, screen_name="@nikkei_bizdaily", exclude_replies=True)
for tweet in tweets.items():
    tweet_data.append([tweet.id, tweet.created_at, tweet.text.replace('\n', ''), tweet.favorite_count, tweet.retweet_count])
tweet_data

# Save as tweets.csv in the data folder
with open('./6050_stock_price_prediction_data/tweets.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, lineterminator='\n')
    writer.writerow(["id", "text", "created_at", "fav", "RT"])
    writer.writerows(tweet_data)
Sentiment analysis is a technique that uses natural language processing to judge whether a text has a positive or a negative meaning.
It is widely used in marketing and customer support, for example by analyzing the sentiment of product reviews.
The basic mechanism is to judge whether each word that appears in a sentence has a positive, negative, or neutral meaning.
The criterion for this judgment is a polarity dictionary, in which morphemes are defined in advance as positive or negative.
Sentiment analysis is performed by looking up each word of the document in the polarity dictionary. Let's first perform morphological analysis using MeCab.
import MeCab
import re

# Create a MeCab instance. If no argument is specified, the IPA dictionary is used.
m = MeCab.Tagger('')

# A function that morphologically parses text and returns a list of dictionaries
def get_diclist(text):
    parsed = m.parse(text)       # Morphological analysis result (a string that includes line breaks)
    lines = parsed.split('\n')   # Split the result into lines (one word per line)
    lines = lines[0:-2]          # The last two lines are unnecessary, so drop them
    diclist = []
    for word in lines:
        l = re.split('\t|,', word)  # Each line is separated by a tab and commas
        d = {'Surface': l[0], 'POS1': l[1], 'POS2': l[2], 'BaseForm': l[7]}
        diclist.append(d)
    return diclist
To parse the sentence "It will be sunny tomorrow.", pass it as the argument:
get_diclist("明日は晴れるでしょう。")  # Japanese for "It will be sunny tomorrow."; the IPA dictionary expects Japanese text
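To see what get_diclist does with each line without installing MeCab, the same split can be tried on a single hand-written line. The sample below is an assumption written in the usual IPA dictionary layout (surface form, a tab, then comma-separated features). Real output depends on the installed dictionary.

```python
import re

# Hand-written sample line (an assumption for illustration, not captured output).
# Assumed layout: surface<TAB>part of speech, three subcategories,
# conjugation type, conjugation form, base form, reading, pronunciation
sample = "晴れ\t動詞,自立,*,*,一段,連用形,晴れる,ハレ,ハレ"

# Same split as in get_diclist: one tab, then commas
fields = re.split(r"\t|,", sample)

# Index 7 is the base (uninflected) form, which is what gets looked up later
token = {
    "Surface": fields[0],
    "POS1": fields[1],
    "POS2": fields[2],
    "BaseForm": fields[7],
}
print(token)
```

Using the base form rather than the surface form matters because a polarity dictionary lists words in their uninflected form.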
This time, we will use the word emotion polarity correspondence table as the polarity dictionary.
It assigns each word a real number from -1 to +1, with reference to the "Iwanami Japanese Dictionary (Iwanami Shoten)".
The closer a value is to -1, the more negative the word is; the closer it is to +1, the more positive.
Next, read the polarity dictionary and create lists and a dictionary from it.
import pandas as pd

# Read the polarity dictionary
pn_df = pd.read_csv('./6050_stock_price_prediction_data/pn_ja.csv', encoding='utf-8', names=('Word', 'Reading', 'POS', 'PN'))

# Store the Word and PN columns as lists in word_list and pn_list, respectively
word_list = list(pn_df['Word'])
pn_list = list(pn_df['PN'])

# Create pn_dict, a dictionary that maps each word in word_list to its value in pn_list
pn_dict = dict(zip(word_list, pn_list))
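As a small sketch of what the dictionary built above looks like, the same word-to-score mapping can be reproduced with the standard csv module on a few in-memory rows. The rows below are made up for illustration and only follow the assumed column order Word, Reading, POS, PN. They are not entries from the real table.

```python
import csv
import io

# Made-up rows in the column order Word, Reading, POS, PN (not real dictionary entries)
sample_csv = (
    "good,guddo,adjective,0.9\n"
    "bad,baddo,adjective,-0.8\n"
    "good,guddo,noun,0.5\n"
)

toy_pn_dict = {}
for word, reading, pos, pn in csv.reader(io.StringIO(sample_csv)):
    # A later row with the same word overwrites the earlier one,
    # which is also what dict(zip(word_list, pn_list)) does
    toy_pn_dict[word] = float(pn)

print(toy_pn_dict)
```

Because the mapping is keyed on the word alone, a word listed under two parts of speech keeps only its last score.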
Next, implement a function that returns the PN value of each word by referring to the polarity dictionary.
We also pass the result of get_diclist for the sentence "It will be sunny tomorrow." to the add_pnvalue function to see how it works, and then pass that result to the get_mean function to find the mean of the PN values.
import numpy as np

# Add a PN value to each word dictionary
def add_pnvalue(diclist_old, pn_dict):
    diclist_new = []
    for word in diclist_old:
        base = word['BaseForm']        # Get the base (uninflected) form from each word dictionary
        if base in pn_dict:
            pn = float(pn_dict[base])
        else:
            pn = 'notfound'            # The word is not in the PN table
        word['PN'] = pn
        diclist_new.append(word)
    return diclist_new

# Find the mean PN value of one tweet
def get_mean(dictlist):
    pn_list = []
    for word in dictlist:
        pn = word['PN']
        if pn != 'notfound':
            pn_list.append(pn)
    if len(pn_list) > 0:
        pnmean = np.mean(pn_list)
    else:
        pnmean = 0                     # No word was found in the PN table
    return pnmean

dl_old = get_diclist("明日は晴れるでしょう。")  # Japanese for "It will be sunny tomorrow."
# Pass the result of get_diclist to add_pnvalue to see how it works.
dl_new = add_pnvalue(dl_old, pn_dict)
print(dl_new)
# Also pass it to get_mean to find the mean of the PN values.
pnmean = get_mean(dl_new)
print(pnmean)
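The lookup-and-average logic of add_pnvalue and get_mean can also be tried without MeCab or the real dictionary. The following is a minimal sketch under stated assumptions: the toy dictionary, its scores, and the pre-split English word lists are all made up, and mean_polarity is a hypothetical helper that combines both steps.

```python
# Toy polarity dictionary; words and scores are invented for illustration
toy_pn = {"good": 0.9, "happy": 0.8, "bad": -0.9, "loss": -0.7}


def mean_polarity(words, pn_dict):
    """Average the scores of the words found in pn_dict; return 0.0 if none are found."""
    scores = [pn_dict[w] for w in words if w in pn_dict]
    return sum(scores) / len(scores) if scores else 0.0


print(mean_polarity(["a", "good", "day"], toy_pn))       # only "good" is scored
print(mean_polarity(["bad", "loss", "today"], toy_pn))   # two negative words are averaged
print(mean_polarity(["nothing", "known"], toy_pn))       # no word found, so 0.0
```

Words missing from the dictionary are skipped rather than counted as zero, so a long neutral sentence with one scored word gets that word's score.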
Next, display the change in the PN value over time as a graph.
import matplotlib.pyplot as plt
%matplotlib inline

# Read the saved tweets. header=0 skips the header row in the file,
# and names= supplies the column names used below.
df_tweets = pd.read_csv('./6050_stock_price_prediction_data/tweets.csv', header=0, names=['id', 'date', 'text', 'fav', 'RT'], index_col='date')
df_tweets.index = pd.to_datetime(df_tweets.index)
df_tweets = df_tweets[['text']].sort_index(ascending=True)

# Create an empty list called means_list and find the mean PN value of each tweet.
means_list = []
for tweet in df_tweets['text']:
    dl_old = get_diclist(tweet)
    dl_new = add_pnvalue(dl_old, pn_dict)
    pnmean = get_mean(dl_new)
    means_list.append(pnmean)
df_tweets['pn'] = means_list

# Average the PN values for each day (only the numeric pn column is kept).
df_tweets = df_tweets[['pn']].resample('D').mean()

# Plot the date on the x-axis and the PN value on the y-axis.
x = df_tweets.index
y = df_tweets.pn
plt.plot(x, y)
plt.grid(True)

# Save df_tweets again under the name df_tweets.csv.
df_tweets.to_csv('./6050_stock_price_prediction_data/df_tweets.csv')
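The daily averaging done with resample in the code above can be illustrated on a few values using only the standard library. This is a sketch, not the pandas implementation, and the timestamps and scores are made up.

```python
from collections import defaultdict
from datetime import datetime

# Made-up (timestamp, mean PN of the tweet) pairs standing in for scored tweets
scored = [
    ("2020-01-06 09:00:00", -0.6),
    ("2020-01-06 15:30:00", -0.2),
    ("2020-01-07 10:00:00", -0.5),
]

# Group the scores by calendar day
by_day = defaultdict(list)
for stamp, pn in scored:
    day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date()
    by_day[day].append(pn)

# One mean per day, in date order
daily_mean = {day: sum(values) / len(values) for day, values in sorted(by_day.items())}
for day, value in daily_mean.items():
    print(day, value)
```

One difference to keep in mind: resample('D') also produces a row (with NaN) for each day in the range that has no tweets, while this grouping simply has no entry for such days.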
Looking at the graph of daily PN values, most of the values appear to be negative.
This is because the polarity dictionary contains many words with negative connotations. We standardize the values to adjust for this.
Standardize the PN values, then average them for each date again and plot the result.
# Standardize the PN values (subtract the mean, divide by the standard deviation)
df_tweets['pn'] = (df_tweets['pn'] - df_tweets['pn'].mean()) / df_tweets['pn'].std()

# Also, average the PN values for each date and plot them.
df_tweets = df_tweets[['pn']].resample('D').mean()
x = df_tweets.index
y = df_tweets.pn
plt.plot(x, y)
plt.grid(True)
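The standardization step above can be checked on a handful of numbers. The sketch below uses the standard statistics module on made-up daily PN values that are all negative, as in the graph, and shows that the standardized values are centered on zero.

```python
import statistics

# Made-up daily PN values, all negative
daily_pn = [-0.6, -0.4, -0.5, -0.3, -0.7]

mean = statistics.mean(daily_pn)
std = statistics.stdev(daily_pn)  # sample standard deviation (n - 1), the same default as pandas .std()

# Subtract the mean and divide by the standard deviation
standardized = [(x - mean) / std for x in daily_pn]
print(standardized)
```

After this step, a positive value means "more positive than this account's average day", not "positive in absolute terms".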