This is a continuation of the previous article, "Estimate the number of likes on Twitter", in which the number of likes was estimated from the content of a tweet.
This time we aim for accuracy.
There is too little data, so we will add more relevant features.
That said, the following two things seemed useful for telling whether a tweet takes off:

--Whether it is a reply
--Whether it is a quote retweet

Let's add these two.
get_twitter.py
import pandas as pd
import category_encoders as ce

# TweetsGetter is the class defined in the previous article.

# Get tweets by specifying the user (screen_name)
getter = TweetsGetter.byUser('hana_oba')

df = pd.DataFrame(columns=['week_day', 'have_photo', 'have_video', 'tweet_time', 'text_len',
                           'favorite_count', 'retweet_count', 'quoted_status', 'reply',
                           'year_2018', 'year_2019', 'year_2020'])

# Specify the columns you want to one-hot encode in a list (you can specify more than one),
# along with how to handle nulls and unknown values.
list_cols = ['week_day']
ce_ohe = ce.OneHotEncoder(cols=list_cols, handle_unknown='impute')

cnt = 0
for tweet in getter.collect(total=10000):
    cnt += 1
    week_day = tweet['created_at'].split()[0]
    tweet_time = tweet['created_at'].split()[3][:2]
    year = tweet['created_at'].split()[5]

    photo = 0
    video = 0
    quoted_status = 0
    reply = 0
    year_2018 = 0
    year_2019 = 0
    year_2020 = 0

    # Photo or video attached?
    if 'media' in tweet['entities']:
        if 'photo' in tweet['entities']['media'][0]['expanded_url']:
            photo = 1
        else:
            video = 1

    # Is it a quote retweet?
    if 'quoted_status_id' in tweet:
        quoted_status = 1

    # Is it a reply?
    if tweet['in_reply_to_user_id_str'] is not None:
        reply = 1

    # One-hot encode the year
    if year == '2018':
        year_2018 = 1
    elif year == '2019':
        year_2019 = 1
    elif year == '2020':
        year_2020 = 1

    df = df.append(pd.Series([week_day, photo, video, int(tweet_time), len(tweet['text']),
                              tweet['favorite_count'], tweet['retweet_count'], quoted_status,
                              reply, year_2018, year_2019, year_2020],
                             index=df.columns), ignore_index=True)

df_session_ce_onehot = ce_ohe.fit_transform(df)
df_session_ce_onehot.to_csv('oba_hana_data.csv', index=False)
Now let's score a model on this data.
IhaveOBAHANAfullyunderstood.ipynb
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

datapath = '/content/drive/My Drive/data_science/'
df = pd.read_csv(datapath + 'oba_hana_data.csv')

# 70/30 train/test split
train_count = int(df.shape[0] * 0.7)
df_train = df.sample(n=train_count)
df_test = df.drop(df_train.index)

# Column names used below
have_photo = 'have_photo'
have_video = 'have_video'
tweet_time = 'tweet_time'
text_len = 'text_len'
favorite_count = 'favorite_count'
retweet_count = 'retweet_count'
quoted_status = 'quoted_status'
reply = 'reply'
year_2018 = 'year_2018'
year_2019 = 'year_2019'
year_2020 = 'year_2020'

# Outlier removal
df_train = df_train[df_train['favorite_count'] < 4500]
df_train.shape

x_train = df_train.loc[:, [have_photo, have_video, tweet_time, text_len, quoted_status,
                           reply, year_2018, year_2019, year_2020]]
t_train = df_train['favorite_count']
x_test = df_test.loc[:, [have_photo, have_video, tweet_time, text_len, quoted_status,
                         reply, year_2018, year_2019, year_2020]]
t_test = df_test['favorite_count']

# Model declaration
model = RandomForestRegressor(n_estimators=2000, max_depth=10,
                              min_samples_leaf=4, max_features=0.2, random_state=0)

# Model training
model.fit(x_train, t_train)

# Model validation
print(model.score(x_train, t_train))
print(model.score(x_test, t_test))
0.7189988420451674
0.6471214647821018
The accuracy has improved dramatically! The quote-retweet flag does not contribute much, but it does not seem to be as irrelevant as the day of the week.
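How much each feature contributes can be checked from the forest's impurity-based importances; a minimal sketch (not in the original), using the model and x_train from above:

importances = pd.Series(model.feature_importances_, index=x_train.columns)
print(importances.sort_values(ascending=False))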
However, another thing I was curious about is the tweet time. At the moment it is treated as a plain number, so it seems better to split it into several time bands.
import seaborn as sns
import matplotlib.pyplot as plt

# Average favorite_count for each hour of the day
time_mean = pd.DataFrame(columns=['time', 'favorite_mean'])
for i in range(24):  # iterate over all 24 hours
    time_mean = time_mean.append(pd.Series([i, df_train[df_train['tweet_time'] == i].favorite_count.mean()],
                                           index=time_mean.columns), ignore_index=True)
time_mean['time'] = time_mean['time'].astype(int)

sns.set_style('darkgrid')
plt.figure(figsize=(12, 8))
sns.catplot(x="time", y="favorite_mean", data=time_mean,
            height=6, kind="bar", palette="muted")
plt.show()
Since Eikolab members have a rule that SNS is allowed until 24:00 (going slightly over is tolerated), there are entries around 0, but apart from that you can see that the average is high around midnight Japan time (probably happy-birthday tweets).
Above I said it would be better to split the time into several bands, but rather than hand-made bands it is probably better to use TargetEncoding (drawing rough boundaries by hour seems difficult).
!pip install category_encoders
from category_encoders.target_encoder import TargetEncoder

# Treat the hour as a category and encode it with the (smoothed) mean favorite_count
df_train["tweet_time"] = df_train["tweet_time"].astype(str)
df_test["tweet_time"] = df_test["tweet_time"].astype(str)

TE = TargetEncoder(smoothing=0.1)
df_train["target_enc_tweet_time"] = TE.fit_transform(df_train["tweet_time"], df_train["favorite_count"])
df_test["target_enc_tweet_time"] = TE.transform(df_test["tweet_time"])
Train using target_enc_tweet_time instead of tweet_time and check the score:
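A minimal sketch of that step, assuming the same random-forest settings as above:

features = [have_photo, have_video, text_len, quoted_status, reply,
            year_2018, year_2019, year_2020, 'target_enc_tweet_time']
x_train = df_train.loc[:, features]
x_test = df_test.loc[:, features]

model = RandomForestRegressor(n_estimators=2000, max_depth=10,
                              min_samples_leaf=4, max_features=0.2, random_state=0)
model.fit(x_train, t_train)
print(model.score(x_train, t_train))
print(model.score(x_test, t_test))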
0.6999237089367164
0.6574824327192588
The training score went down, but the validation score went up.
By the way, adopting both tweet_time and target_enc_tweet_time gives the following.
0.7210047209796951
0.6457969793382683
The training score is the best of the three, but the validation score is not. It is hard to pick a clear winner, so let's keep all the possibilities and move on.
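To keep the bookkeeping straight, the three candidates can be written down explicitly; a hypothetical sketch (the ① ② ③ labels are mine, in the order the variants were introduced above):

base_features = [have_photo, have_video, text_len, quoted_status, reply,
                 year_2018, year_2019, year_2020]
feature_sets = {
    '① tweet_time only': base_features + [tweet_time],
    '② target_enc_tweet_time only': base_features + ['target_enc_tweet_time'],
    '③ both': base_features + [tweet_time, 'target_enc_tweet_time'],
}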
So far I have been using a random forest without touching any of its settings. Next I would like to switch the model to XGBoost and use optuna to search for the best hyperparameters.
Install optuna
!pip install optuna
Next, we wrap XGBoost in an objective function so that optuna can drive it.
The candidate values listed for each hyperparameter below are rough guesses; they get fine-tuned over many repeated runs.
# Import the XGBoost library
import xgboost as xgb
import optuna

def objective(trial):
    # Candidate ranges for the hyperparameters.
    # Note: min_child_samples, num_leaves and subsample_for_bin are LightGBM-style
    # names; XGBRegressor passes unknown keyword arguments through and they have
    # no effect here.
    min_child_samples = trial.suggest_int('min_child_samples', 60, 75)
    max_depth = trial.suggest_int('max_depth', -60, -40)  # suggested but not passed to the model
    learning_rate = trial.suggest_uniform('learning_rate', 0.075, 0.076)
    min_child_weight = trial.suggest_uniform('min_child_weight', 0.1, 0.8)
    num_leaves = trial.suggest_int('num_leaves', 2, 3)
    n_estimators = trial.suggest_int('n_estimators', 100, 180)
    subsample_for_bin = trial.suggest_int('subsample_for_bin', 450000, 600000)

    model = xgb.XGBRegressor(min_child_samples=min_child_samples,
                             min_child_weight=min_child_weight,
                             num_leaves=num_leaves,
                             subsample_for_bin=subsample_for_bin,
                             learning_rate=learning_rate,
                             n_estimators=n_estimators)

    # Training
    model.fit(x_train, t_train)

    # optuna minimizes, so return 1 - R^2 on the validation set
    return 1 - model.score(x_test, t_test)
First, let's run it 100 times.
# Create a study and run the specified number of trials
study = optuna.create_study()
study.optimize(objective, n_trials=100)

print('Hyperparameters:', study.best_params)
print('Accuracy:', 1 - study.best_value)
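To see the training and validation scores of the best trial, the model can be rebuilt from study.best_params; a minimal sketch (not shown in the original):

best = study.best_params
model = xgb.XGBRegressor(min_child_samples=best['min_child_samples'],
                         min_child_weight=best['min_child_weight'],
                         num_leaves=best['num_leaves'],
                         subsample_for_bin=best['subsample_for_bin'],
                         learning_rate=best['learning_rate'],
                         n_estimators=best['n_estimators'])
model.fit(x_train, t_train)
print(model.score(x_train, t_train))  # training score
print(model.score(x_test, t_test))    # validation score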
Let's see the score for each (training score first, then validation):

①
0.690093409305073
0.663908038217022

②
0.6966901697205284
0.667797061960107

③
0.6972461315076879
0.6669948080176482
Slight as the difference is, ② seems to be the most accurate.
Trying every possibility is clearly the right instinct, but if we narrow down to one from here, which is best?
② and ③ are almost the same: ② has the better validation score, and ③ has the better training score.
From here, when fine-tuning with optuna to improve accuracy, which should we favor?

--The one with the higher training score, on the theory that it has more room to grow
--The one with the higher validation score, since it is overfitting less

Either may be defensible, but I would appreciate it if anyone who knows which is generally the right path could tell me.
This time, we will proceed with ②.
Once a wide range of hyperparameters has been explored, increase the number of trials to 1000; then narrow the ranges around the best hyperparameters found and run another 1000 trials, as sketched below.
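The narrowing step is not shown here; a minimal sketch, with hypothetical narrowed ranges that would in practice be centered on study.best_params from the wide run:

def objective_narrow(trial):
    # Hypothetical narrowed ranges around the best wide-search values
    params = {
        'min_child_weight': trial.suggest_uniform('min_child_weight', 0.3, 0.5),
        'learning_rate': trial.suggest_uniform('learning_rate', 0.0755, 0.0758),
        'n_estimators': trial.suggest_int('n_estimators', 140, 160),
    }
    model = xgb.XGBRegressor(**params)
    model.fit(x_train, t_train)
    return 1 - model.score(x_test, t_test)

study = optuna.create_study()
study.optimize(objective_narrow, n_trials=1000)

The result obtained by this is as follows.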
0.6962221939011508
0.6685252235753019
We got the best validation result so far. I have tried every accuracy-improvement method I know, so I will stop here this time.
Finally, let's look at the inferred values and a histogram of the actual values.
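The plotting code is not shown in the original; a minimal sketch that overlays histograms of the actual likes, the inferred likes, and the error on the test set:

pred = model.predict(x_test)
plt.figure(figsize=(12, 8))
plt.hist(t_test, bins=50, alpha=0.5, label='actual likes')
plt.hist(pred, bins=50, alpha=0.5, label='inferred likes')
plt.hist(t_test - pred, bins=50, alpha=0.5, label='error')
plt.legend()
plt.show()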
It would have been nice to figure out what the peak around 1600 was actually drawn from. I cannot tell what changed from the first, low-accuracy version, so perhaps I chose the wrong kind of graph to plot...
Hard to see, but: orange is the actual likes, green is the inferred likes, and blue is the error.
Basically, the predictions come out low. Oba Hana gets more likes than the AI predicted... (your model is hardly something that can be called AI). Fin