This is a continuation of the previous article, "Estimate the number of likes on Twitter", in which the number of likes was estimated from the content of a tweet.
This time we aim for accuracy.
There is too little data, so we will add more relevant features.
That said, the following two things seemed useful for telling whether a tweet takes off:

--Whether it is a reply
--Whether it is a quote retweet

Let's add these two.
get_twitter.py
import pandas as pd
import category_encoders as ce

# TweetsGetter is the class defined in the previous article.

# Get tweets by specifying the user (screen_name)
getter = TweetsGetter.byUser('hana_oba')

df = pd.DataFrame(columns=['week_day', 'have_photo', 'have_video', 'tweet_time', 'text_len',
                           'favorite_count', 'retweet_count', 'quoted_status', 'reply',
                           'year_2018', 'year_2019', 'year_2020'])

# Specify the columns you want to one-hot encode in a list (you can specify more than one),
# along with how to handle nulls and unknown values.
list_cols = ['week_day']
ce_ohe = ce.OneHotEncoder(cols=list_cols, handle_unknown='impute')

cnt = 0
for tweet in getter.collect(total=10000):
    cnt += 1
    week_day = tweet['created_at'].split()[0]
    tweet_time = tweet['created_at'].split()[3][:2]
    year = tweet['created_at'].split()[5]

    photo = 0
    video = 0
    quoted_status = 0
    reply = 0
    year_2018 = 0
    year_2019 = 0
    year_2020 = 0

    # Photo or video attached?
    if 'media' in tweet['entities']:
        if 'photo' in tweet['entities']['media'][0]['expanded_url']:
            photo = 1
        else:
            video = 1

    # Is it a quote retweet?
    if 'quoted_status_id' in tweet:
        quoted_status = 1

    # Is it a reply?
    if tweet['in_reply_to_user_id_str'] is not None:
        reply = 1

    # One-hot encode the year
    if year == '2018':
        year_2018 = 1
    elif year == '2019':
        year_2019 = 1
    elif year == '2020':
        year_2020 = 1

    df = df.append(pd.Series([week_day, photo, video, int(tweet_time), len(tweet['text']),
                              tweet['favorite_count'], tweet['retweet_count'], quoted_status,
                              reply, year_2018, year_2019, year_2020],
                             index=df.columns), ignore_index=True)

df_session_ce_onehot = ce_ohe.fit_transform(df)
df_session_ce_onehot.to_csv('oba_hana_data.csv', index=False)
Now let's score a model on this data.
IhaveOBAHANAfullyunderstood.ipynb
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

datapath = '/content/drive/My Drive/data_science/'
df = pd.read_csv(datapath + 'oba_hana_data.csv')

# 70/30 train/test split
train_count = int(df.shape[0] * 0.7)
df_train = df.sample(n=train_count)
df_test = df.drop(df_train.index)

# Column names used below
have_photo = 'have_photo'
have_video = 'have_video'
tweet_time = 'tweet_time'
text_len = 'text_len'
favorite_count = 'favorite_count'
retweet_count = 'retweet_count'
quoted_status = 'quoted_status'
reply = 'reply'
year_2018 = 'year_2018'
year_2019 = 'year_2019'
year_2020 = 'year_2020'

# Outlier removal
df_train = df_train[df_train['favorite_count'] < 4500]
df_train.shape

x_train = df_train.loc[:, [have_photo, have_video, tweet_time, text_len, quoted_status,
                           reply, year_2018, year_2019, year_2020]]
t_train = df_train['favorite_count']
x_test = df_test.loc[:, [have_photo, have_video, tweet_time, text_len, quoted_status,
                         reply, year_2018, year_2019, year_2020]]
t_test = df_test['favorite_count']

# Model declaration
model = RandomForestRegressor(n_estimators=2000, max_depth=10,
                              min_samples_leaf=4, max_features=0.2, random_state=0)

# Model training
model.fit(x_train, t_train)

# Model validation
print(model.score(x_train, t_train))
print(model.score(x_test, t_test))
0.7189988420451674
0.6471214647821018
The accuracy has improved dramatically! The quote-retweet flag does not contribute much, but it does not seem to be as irrelevant as the day of the week.
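How much each feature contributes can be checked from the forest's impurity-based importances; a minimal sketch (not in the original), using the model and x_train from above:

importances = pd.Series(model.feature_importances_, index=x_train.columns)
print(importances.sort_values(ascending=False))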
However, another thing I was curious about is the tweet time. At the moment it is treated as a plain number, so it seems better to split it into several time bands.
import seaborn as sns
import matplotlib.pyplot as plt

# Average favorite_count for each hour of the day
time_mean = pd.DataFrame(columns=['time', 'favorite_mean'])
for i in range(24):  # iterate over all 24 hours
    time_mean = time_mean.append(pd.Series([i, df_train[df_train['tweet_time'] == i].favorite_count.mean()],
                                           index=time_mean.columns), ignore_index=True)
time_mean['time'] = time_mean['time'].astype(int)

sns.set_style('darkgrid')
plt.figure(figsize=(12, 8))
sns.catplot(x="time", y="favorite_mean", data=time_mean,
            height=6, kind="bar", palette="muted")
plt.show()
Since Eikolab members have a rule that SNS is allowed until 24:00 (going slightly over is tolerated), there are entries around 0, but apart from that you can see that the average is high around midnight Japan time (probably happy-birthday tweets).
Above I said it would be better to split the time into several bands, but rather than hand-made bands it is probably better to use TargetEncoding (drawing rough boundaries by hour seems difficult).
!pip install category_encoders
from category_encoders.target_encoder import TargetEncoder

# Treat the hour as a category and encode it with the (smoothed) mean favorite_count
df_train["tweet_time"] = df_train["tweet_time"].astype(str)
df_test["tweet_time"] = df_test["tweet_time"].astype(str)

TE = TargetEncoder(smoothing=0.1)
df_train["target_enc_tweet_time"] = TE.fit_transform(df_train["tweet_time"], df_train["favorite_count"])
df_test["target_enc_tweet_time"] = TE.transform(df_test["tweet_time"])
Train using target_enc_tweet_time instead of tweet_time and check the score:
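A minimal sketch of that step, assuming the same random-forest settings as above:

features = [have_photo, have_video, text_len, quoted_status, reply,
            year_2018, year_2019, year_2020, 'target_enc_tweet_time']
x_train = df_train.loc[:, features]
x_test = df_test.loc[:, features]

model = RandomForestRegressor(n_estimators=2000, max_depth=10,
                              min_samples_leaf=4, max_features=0.2, random_state=0)
model.fit(x_train, t_train)
print(model.score(x_train, t_train))
print(model.score(x_test, t_test))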
0.6999237089367164
0.6574824327192588
The training score went down, but the validation score went up.
By the way, adopting both tweet_time and target_enc_tweet_time gives the following.
0.7210047209796951
0.6457969793382683
The training score is the best of the three, but the validation score is not. It is hard to pick a clear winner, so let's keep all the possibilities and move on.
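To keep the bookkeeping straight, the three candidates can be written down explicitly; a hypothetical sketch (the ① ② ③ labels are mine, in the order the variants were introduced above):

base_features = [have_photo, have_video, text_len, quoted_status, reply,
                 year_2018, year_2019, year_2020]
feature_sets = {
    '① tweet_time only': base_features + [tweet_time],
    '② target_enc_tweet_time only': base_features + ['target_enc_tweet_time'],
    '③ both': base_features + [tweet_time, 'target_enc_tweet_time'],
}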
So far I have been using a random forest without touching any of its settings. Next I would like to switch the model to XGBoost and use optuna to search for the best hyperparameters.
Install optuna
!pip install optuna
Next, we wrap XGBoost in an objective function so that optuna can drive it.
The candidate values listed for each hyperparameter below are rough guesses; they get fine-tuned over many repeated runs.
# Import the XGBoost library
import xgboost as xgb
import optuna

def objective(trial):
    # Candidate ranges for the hyperparameters.
    # Note: min_child_samples, num_leaves and subsample_for_bin are LightGBM-style
    # names; XGBRegressor passes unknown keyword arguments through and they have
    # no effect here.
    min_child_samples = trial.suggest_int('min_child_samples', 60, 75)
    max_depth = trial.suggest_int('max_depth', -60, -40)  # suggested but not passed to the model
    learning_rate = trial.suggest_uniform('learning_rate', 0.075, 0.076)
    min_child_weight = trial.suggest_uniform('min_child_weight', 0.1, 0.8)
    num_leaves = trial.suggest_int('num_leaves', 2, 3)
    n_estimators = trial.suggest_int('n_estimators', 100, 180)
    subsample_for_bin = trial.suggest_int('subsample_for_bin', 450000, 600000)

    model = xgb.XGBRegressor(min_child_samples=min_child_samples,
                             min_child_weight=min_child_weight,
                             num_leaves=num_leaves,
                             subsample_for_bin=subsample_for_bin,
                             learning_rate=learning_rate,
                             n_estimators=n_estimators)

    # Training
    model.fit(x_train, t_train)

    # optuna minimizes, so return 1 - R^2 on the validation set
    return 1 - model.score(x_test, t_test)
First, let's run it 100 times.
# Create a study and run the specified number of trials
study = optuna.create_study()
study.optimize(objective, n_trials=100)

print('Hyperparameters:', study.best_params)
print('Accuracy:', 1 - study.best_value)
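To see the training and validation scores of the best trial, the model can be rebuilt from study.best_params; a minimal sketch (not shown in the original):

best = study.best_params
model = xgb.XGBRegressor(min_child_samples=best['min_child_samples'],
                         min_child_weight=best['min_child_weight'],
                         num_leaves=best['num_leaves'],
                         subsample_for_bin=best['subsample_for_bin'],
                         learning_rate=best['learning_rate'],
                         n_estimators=best['n_estimators'])
model.fit(x_train, t_train)
print(model.score(x_train, t_train))  # training score
print(model.score(x_test, t_test))    # validation score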
Let's see the score for each (training score first, then validation):

①
0.690093409305073
0.663908038217022

②
0.6966901697205284
0.667797061960107

③
0.6972461315076879
0.6669948080176482
Slight as the difference is, ② seems to be the most accurate.
Trying every possibility is clearly the right instinct, but if we narrow down to one from here, which is best?
② and ③ are almost the same: ② has the better validation score, and ③ has the better training score.
From here, when fine-tuning with optuna to improve accuracy, which should we favor?

--The one with the higher training score, on the theory that it has more room to grow
--The one with the higher validation score, since it is overfitting less

Either may be defensible, but I would appreciate it if anyone who knows which is generally the right path could tell me.
This time, we will proceed with ②.
Once a wide range of hyperparameters has been explored, increase the number of trials to 1000; then narrow the ranges around the best hyperparameters found and run another 1000 trials, as sketched below.
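The narrowing step is not shown here; a minimal sketch, with hypothetical narrowed ranges that would in practice be centered on study.best_params from the wide run:

def objective_narrow(trial):
    # Hypothetical narrowed ranges around the best wide-search values
    params = {
        'min_child_weight': trial.suggest_uniform('min_child_weight', 0.3, 0.5),
        'learning_rate': trial.suggest_uniform('learning_rate', 0.0755, 0.0758),
        'n_estimators': trial.suggest_int('n_estimators', 140, 160),
    }
    model = xgb.XGBRegressor(**params)
    model.fit(x_train, t_train)
    return 1 - model.score(x_test, t_test)

study = optuna.create_study()
study.optimize(objective_narrow, n_trials=1000)

The result obtained by this is as follows.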
0.6962221939011508
0.6685252235753019
We got the best validation result so far. I have tried every accuracy-improvement method I know, so I will stop here this time.
Finally, let's look at the inferred values and a histogram of the actual values.
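The plotting code is not shown in the original; a minimal sketch that overlays histograms of the actual likes, the inferred likes, and the error on the test set:

pred = model.predict(x_test)
plt.figure(figsize=(12, 8))
plt.hist(t_test, bins=50, alpha=0.5, label='actual likes')
plt.hist(pred, bins=50, alpha=0.5, label='inferred likes')
plt.hist(t_test - pred, bins=50, alpha=0.5, label='error')
plt.legend()
plt.show()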
It would have been nice to figure out what the peak around 1600 was actually drawn from. I cannot tell what changed from the first, low-accuracy version, so perhaps I chose the wrong kind of graph to plot...
Hard to see, but: orange is the actual likes, green is the inferred likes, and blue is the error.
Basically, the predictions come out low. Oba Hana gets more likes than the AI predicted... (your model is hardly something that can be called AI). Fin