This started when I wanted to learn scraping for building a horse racing AI and was looking for a practice task. In any case, I thought it would be good to go through the whole sequence of data acquisition → preprocessing → training → inference, so I settled on **estimating the number of likes from the content of tweets by my favorite member, Oba Hana** as the learning task. This time the aim is just to get a simple pipeline running end to end; I will work on accuracy later.
Data collection uses the Twitter API. For how to register, I referred to the article below: "Tips for passing Twitter API registration in 30 minutes (with Japanese translation)". Be warned that having to write out your reasons for using the API is a pain; I lost a whole day here. (Not being able to write it in Japanese is inconvenient...)
Once registered you can fetch tweets, so next I wrote the acquisition program. Normally the Twitter API only returns up to 200 tweets per request, but by following the article below, "Get a lot of tweets with TwitterAPI. Consider server-side errors (in python)", you can get around this and fetch all of @hana_oba's tweets. Since most of the program is copied almost verbatim from that article, only the part I wrote myself is shown here.
get_twitter.py
import pandas as pd
import category_encoders as ce
#TweetsGetter is the class from the referenced article and is assumed to be defined in the same file

#Get tweets by specifying the user (screen_name)
getter = TweetsGetter.byUser('hana_oba')

df = pd.DataFrame(columns = ['week_day','have_photo','have_video','tweet_time','text_len','favorite_count','retweet_count'])

#Specify the columns you want to one-hot encode as a list. You can specify more than one.
list_cols = ['week_day']
#Also specify how Null / unknown values are handled.
ce_ohe = ce.OneHotEncoder(cols=list_cols, handle_unknown='impute')

cnt = 0
for tweet in getter.collect(total = 10000):
    cnt += 1
    week_day = tweet['created_at'].split()[0]
    tweet_time = tweet['created_at'].split()[3][:2]

    #Check whether the tweet has a photo or a video attached
    photo = 0
    video = 0
    if 'media' in tweet['entities']:
        if 'photo' in tweet['entities']['media'][0]['expanded_url']:
            photo = 1
        else:
            video = 1

    df = df.append(pd.Series([week_day, photo, video, int(tweet_time), len(tweet['text']), tweet['favorite_count'], tweet['retweet_count']], index=df.columns), ignore_index=True)

df_session_ce_onehot = ce_ohe.fit_transform(df)
df_session_ce_onehot.to_csv('oba_hana_data.csv', index=False)
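For context, the TweetsGetter class used above comes from the referenced article and is not reproduced here. Roughly speaking, it pages through the Twitter API v1.1 statuses/user_timeline endpoint, which returns at most 200 tweets per call, by repeatedly passing max_id. A simplified, hypothetical sketch (not the article's actual code; the keys are placeholders) might look like this:

from requests_oauthlib import OAuth1Session

#Placeholders - use the keys obtained when registering for the API
session = OAuth1Session('CONSUMER_KEY', 'CONSUMER_SECRET',
                        'ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
URL = 'https://api.twitter.com/1.1/statuses/user_timeline.json'

def collect(screen_name, total):
    #Fetch up to 200 tweets per request and page backwards through older tweets with max_id
    params = {'screen_name': screen_name, 'count': 200}
    fetched = 0
    while fetched < total:
        res = session.get(URL, params=params)
        if res.status_code != 200:
            break
        tweets = res.json()
        if not tweets:
            break
        for tweet in tweets:
            yield tweet
            fetched += 1
        params['max_id'] = tweets[-1]['id'] - 1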
I haven't done anything particularly clever here, but it is hard to handle the data well without understanding the Twitter API's specifications, so expect some googling and trial and error to get the data you want. The features in this dataset are only:
--Day of the week
--Presence or absence of an image
--Presence or absence of a video
--Tweet time
--Number of characters in the tweet
The day of the week is split into seven columns using one-hot encoding. We will train on this information alone.
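To check the result of the encoding, you can print the columns of the encoded DataFrame. Assuming category_encoders names the one-hot columns week_day_1 through week_day_7 (its default naming), the 13 columns should look roughly like this:

print(df_session_ce_onehot.columns.tolist())
#Expected (13 columns):
#['week_day_1', ..., 'week_day_7', 'have_photo', 'have_video',
# 'tweet_time', 'text_len', 'favorite_count', 'retweet_count']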
Up to this point I was running the program on my own PC. It is not a heavy workload, but considering future developments my machine's specs are not enough, so from here on I will work in Google Colaboratory.
First, read the data output by the previous program.
import pandas as pd

#Mount Google Drive (assumed setup step) so the CSV saved earlier can be read in Colab
from google.colab import drive
drive.mount('/content/drive')

datapath = '/content/drive/My Drive/data_science/'
df = pd.read_csv(datapath + 'oba_hana_data.csv')
The total number of rows is
df.shape
(2992, 13)
There were 2,992 tweets (13 columns after one-hot encoding).
I will use 70% of the data for training and the remaining 30% for validation. The tweets are in chronological order, so simply taking the first 70% would mean different conditions (such as the number of followers at the time); instead I want to sample 70% at random. It is a somewhat rough approach, but this time I will use 2,400 tweets, roughly 70% of the total, as training data and the rest as validation data.
#Randomly sample 2400 rows for training (no random_state is set, so results vary between runs); the rest become validation data
df_train = df.sample(n=2400)
df_test = df.drop(df_train.index)

#The first 11 columns are the features (7 one-hot weekday columns, photo, video, time, text length)
x_train = df_train.iloc[:,:11]
t_train = df_train['favorite_count']
x_test = df_test.iloc[:,:11]
t_test = df_test['favorite_count']
First, let's train a random forest on this data as it is.
#Model declaration
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=2000, max_depth=10,
min_samples_leaf=4, max_features=0.2, random_state=0)
#Model learning
model.fit(x_train, t_train)
Let's look at the score.
#Model validation
print(model.score(x_train, t_train))
print(model.score(x_test, t_test))
0.5032870524389081
0.3102920436689621
Because this is a regression problem, the score is the coefficient of determination (R²). It takes values up to 1 (and can even go negative for a very poor fit); the closer it is to 1, the better the model.
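As a quick sanity check (not in the original), the value returned by model.score is the same as computing R² on the model's predictions with scikit-learn's r2_score:

from sklearn.metrics import r2_score

#For a regressor, score() returns the coefficient of determination R^2
print(r2_score(t_test, model.predict(x_test)))  #matches model.score(x_test, t_test)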
Looking at these scores, the accuracy is not great. So let's do some preprocessing to improve it. First, let's look at how much each feature contributes.
import numpy as np
import matplotlib.pyplot as plt

#Plot the features in descending order of importance
feat_names = x_train.columns.values
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 10))
plt.title('Feature importances')
plt.barh(range(len(indices)), importances[indices])
plt.yticks(range(len(indices)), feat_names[indices], rotation='horizontal')
plt.show()
Think about what should matter for this estimate. have_photo has the highest contribution; in other words, whether a photo is attached matters a lot. Videos don't seem to matter much, but that is probably because video tweets make up no more than about 3% of the total. The day of the week appears to have almost no effect, so those columns can be dropped from the data.
We will also look at outliers.
#Plot the sorted like counts of the training data to look for outliers
plt.figure(figsize=(8, 6))
plt.scatter(range(x_train.shape[0]), np.sort(t_train.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.show()
You can see that a few of the training examples are clearly out of line with the rest. I will remove them, since training could be skewed by them.
#This is the reverse of the order explained above, but unless the outliers are
#removed first, the features and targets cannot be re-split without them
#Outlier removal
df_train = df_train[df_train['favorite_count'] < 4500]
df_train.shape

#Drop the day-of-week columns (keep columns 7-10: photo, video, time, text length)
x_train = df_train.iloc[:,7:11]
t_train = df_train['favorite_count']
x_test = df_test.iloc[:,7:11]
t_test = df_test['favorite_count']
Now let's learn again and see the score.
#Model re-training and validation
model.fit(x_train, t_train)
print(model.score(x_train, t_train))
print(model.score(x_test, t_test))
0.5175871090277164
0.34112337762190204
It's better than before. Finally, let's compare the distributions of the actual and estimated numbers of likes with a histogram.
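The plotting code is not included in the original; a minimal sketch that overlays the two histograms on the validation data (using matplotlib's default colors, blue then orange) could look like this:

#Overlay histograms of the estimated and actual numbers of likes on the validation data
y_pred = model.predict(x_test)
plt.figure(figsize=(8, 6))
plt.hist(y_pred, bins=50, alpha=0.5, label='estimated')
plt.hist(t_test, bins=50, alpha=0.5, label='actual')
plt.xlabel('favorite_count')
plt.ylabel('frequency')
plt.legend()
plt.show()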
Blue is the estimated number of likes and orange is the actual number. There is certainly a gap, but it is hard to say much more without looking at individual tweets.
Improving the accuracy will have to wait until next time.
--How to register for the Twitter API
Tips for passing Twitter API registration in 30 minutes (with Japanese translation)
--Data acquisition with the Twitter API
Get a lot of tweets with TwitterAPI. Consider server-side errors (in python)