This time, I built a model that predicts the number of views of a Jaru Jaru video from its title. Since I'm a complete amateur at NLP, I tried to imitate what others have done, referring to their articles.
Jaru Jaru is **one of the hottest comedy duos around right now**, composed of Junpei Goto and Shusuke Fukutoku, who belong to Yoshimoto Kogyo's Tokyo headquarters. They currently post a new sketch video every day on the Jaru Jaru Official YouTube Channel as part of the JARU JARU TOWER project.
Checking the sketches posted to YouTube every day takes a lot of time. Also, Jaru Jaru videos with nonsensical titles tend to rack up more views (a very subjective impression). For example, "The guy who was turned into a bad customer by a bad clerk" and "[The budding dictator guy](https://www.youtube.com/watch?v=RPXFYBRJVMw)". When a title contains words like "dangerous" or "crazy", the view count generally seems to be high. Conversely, sketches like "The guy who watches Chara Bancho's sketch" tend to get fewer views, and **it has become an annual event for everyone to give videos with this title a low rating, no questions asked**.
This time, I used the YouTube Data API to collect pairs of video titles and view counts. I referred to the articles "Get the videos of a specific channel using YouTube Data API v3 from Python [^1]" and "Gently get video view counts using YouTube Data API v3 from Python [^2]". Since an API key is needed to use the YouTube API, I obtained one by following "How to get a YouTube API key [^3]". First, the code below collects each video's title and video ID (the ID is needed to fetch the video's view count).
jarujaru_scraping1.py
```python
import os
import time
import requests
import pandas as pd

API_KEY = os.environ['API_KEY']  # read the API key from an environment variable
CHANNEL_ID = 'UChwgNUWPM-ksOP3BbfQHS5Q'

base_url = 'https://www.googleapis.com/youtube/v3'
url = base_url + '/search?key=%s&channelId=%s&part=snippet,id&maxResults=50&order=date'

infos = []

while True:
    time.sleep(30)
    response = requests.get(url % (API_KEY, CHANNEL_ID))
    if response.status_code != 200:
        print('Ended with an error')
        print(response)
        break
    result = response.json()
    infos.extend([
        [item['id']['videoId'], item['snippet']['title'], item['snippet']['description'], item['snippet']['publishedAt']]
        for item in result['items'] if item['id']['kind'] == 'youtube#video'
    ])
    # follow the pagination token until the last page
    if 'nextPageToken' in result.keys():
        if 'pageToken' in url:
            url = url.split('&pageToken')[0]
        url += f'&pageToken={result["nextPageToken"]}'
    else:
        print('Finished successfully')
        break

videos = pd.DataFrame(infos, columns=['videoId', 'title', 'description', 'publishedAt'])
videos.to_csv('data/videos.csv', index=None)
```
After collecting the video titles and IDs, use the code below to collect the number of views.
jarujaru_scraping2.py
```python
import os
import time
import requests
import pandas as pd

API_KEY = os.environ['API_KEY']
videos = pd.read_csv('data/videos.csv')

base_url = 'https://www.googleapis.com/youtube/v3'
stat_url = base_url + '/videos?key=%s&id=%s&part=statistics'

# the videos endpoint accepts up to 50 IDs per request, so query in blocks of 50
len_block = 50
video_ids_per_block = []
video_ids = videos.videoId.values

count = 0
end_flag = False
while not end_flag:
    start = count * len_block
    end = (count + 1) * len_block
    if end >= len(video_ids):
        end = len(video_ids)
        end_flag = True
    video_ids_per_block.append(','.join(video_ids[start:end]))
    count += 1

stats = []
for block in video_ids_per_block:
    time.sleep(30)
    response = requests.get(stat_url % (API_KEY, block))
    if response.status_code != 200:
        print('Ended with an error')
        break
    result = response.json()
    stats.extend([item['statistics'] for item in result['items']])

pd.DataFrame(stats).to_csv('data/stats.csv', index=None)

# join the titles and the statistics row by row
videos = pd.read_csv('data/videos.csv')
stats = pd.read_csv('data/stats.csv')
pd.merge(videos, stats, left_index=True, right_index=True).to_csv('data/jarujaru_data.csv')
```
Data like the following is saved: each video's videoId, title, description, and publishedAt, joined with its viewCount, likeCount, dislikeCount, and commentCount.
This time, I split the view counts into three bins and treat this as a classification problem. Looking at a histogram of the view counts, I chose the boundaries with overwhelming subjectivity: fewer than 100,000 views, 100,000 to fewer than 250,000, and 250,000 or more.
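The histogram itself is not reproduced here; below is a minimal sketch of how it can be drawn, assuming matplotlib and the `data/jarujaru_data.csv` file produced above.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data/jarujaru_data.csv')
# distribution of view counts, with the 100,000 and 250,000 label boundaries marked
plt.hist(df['viewCount'], bins=50)
plt.axvline(100000, color='red')
plt.axvline(250000, color='red')
plt.xlabel('viewCount')
plt.ylabel('number of videos')
plt.show()
```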
The code below does the labeling and extracts only the sketch name from each video title. Jaru Jaru's sketch videos always enclose the sketch name in 『 』.
jarujaru_scraping3.py
```python
import re
import pandas as pd

info = []
df = pd.read_csv('data/jarujaru_data.csv')
for row, item in df.iterrows():
    if '『' in item['title']:
        # prepend a dummy character so the text before the first 『 is non-empty,
        # then take the text between 『 and 』 (the sketch name)
        title = 'x' + item['title']
        title = re.split('[『』]', title)[1]
        if item['viewCount'] >= 250000:
            label = 2
        elif 100000 <= item['viewCount'] < 250000:
            label = 1
        else:
            label = 0
        info.append([title, item['viewCount'], item['likeCount'], item['dislikeCount'], item['commentCount'], label])

pd.DataFrame(info, columns=['title', 'viewCount', 'likeCount', 'dislikeCount', 'commentCount', 'label']).to_csv('data/jarujaru_norm.csv')
```
Referring to this article [^4], I morphologically analyze the sketch names and convert each title into a feature vector (Bag-of-Words format). Part of the code is shown below; the full implementation is posted on GitHub [^5].
jarujaru.py
```python
import analysis  # my own module; it is posted on GitHub
import pandas as pd
from gensim import corpora
from gensim import matutils

def vec2dense(vec, num_terms):
    return list(matutils.corpus2dense([vec], num_terms=num_terms).T[0])

df = pd.read_csv('data/jarujaru_norm.csv')
words = analysis.get_words(df['title'])  # the morphologically analyzed titles go in here

# build the dictionary
dictionary = corpora.Dictionary(words)
dictionary.filter_extremes(no_below=2, keep_tokens=['チャラ', '男', '番長'])
dictionary.save('data/jarujaru.dict')
corpus = [dictionary.doc2bow(word) for word in words]

# convert each title to a dense Bag-of-Words vector
data_all = [vec2dense(bow, len(dictionary)) for bow in corpus]
```
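The `analysis` module above is my own code and is only available on GitHub [^5]; it isn't reproduced in this article. As a rough sketch of what `get_words` could look like, assuming the janome morphological analyzer and keeping only content words (the actual implementation may differ):

```python
from janome.tokenizer import Tokenizer

def get_words(titles):
    """Morphologically analyze each title and return a list of word lists."""
    tokenizer = Tokenizer()
    words = []
    for title in titles:
        # keep the surface forms of content words (nouns, verbs, adjectives)
        words.append([
            token.surface for token in tokenizer.tokenize(title)
            if token.part_of_speech.split(',')[0] in ('名詞', '動詞', '形容詞')
        ])
    return words
```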
Since the amount of data is small, I adopted an SVM as the model. Split the data into training and test sets and feed them to the model.
jarujaru.py
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# split into training and test data
train_data = data_all
X_train, X_test, y_train, y_test = train_test_split(train_data, df['label'], test_size=0.2, random_state=1)

# standardize the features
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# train the model
clf = SVC(C=1, kernel='rbf')
clf.fit(X_train_std, y_train)

# save the trained model
import pickle
with open('data/model.pickle', mode='wb') as fp:
    pickle.dump(clf, fp)
```
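For reference, here is a minimal usage sketch of the saved model: loading it back and predicting the label of a new sketch name. It assumes `dictionary`, `vec2dense`, the fitted scaler `sc`, and the `analysis` module from the code above are still in memory, and the sample title is purely hypothetical.

```python
import pickle

# load the trained SVM back from disk
with open('data/model.pickle', mode='rb') as fp:
    clf = pickle.load(fp)

new_titles = ['Hypothetical sketch name']  # placeholder; a real sketch name would go here
new_words = analysis.get_words(new_titles)
new_vecs = [vec2dense(dictionary.doc2bow(w), len(dictionary)) for w in new_words]
# 0: fewer than 100k views, 1: 100k to 250k, 2: 250k or more
print(clf.predict(sc.transform(new_vecs)))
```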
Let's evaluate the model.
jarujaru.py
```python
score = clf.score(X_test_std, y_test)
print("{:.3g}".format(score))
predicted = clf.predict(X_test_std)
```
The accuracy was 53%. Since random guessing over three classes would give about 33%, the model has learned something (even if the result is terrible). Looking at the confusion matrix as well, it turns out to be a rough model that predicts 100,000 or more views for most videos.
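The confusion matrix itself isn't reproduced here either; computing it from the predictions above is a one-liner with scikit-learn:

```python
from sklearn.metrics import confusion_matrix

# rows are true labels, columns are predicted labels
# (0: fewer than 100k views, 1: 100k to 250k, 2: 250k or more)
print(confusion_matrix(y_test, predicted))
```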
This time I made a model that predicts the number of views of a Jaru Jaru video from its title. Being an NLP amateur, I didn't know much about vectorizing sentences, but I managed to build a model end to end. The full implementation is posted on GitHub [^5]. Next time, I will use this model to develop a LINE bot that notifies me whether a newly posted Jaru Jaru video is worth watching. I would also like to study sentence vectorization methods and models that handle time-series data (LSTM, etc.).
[^1]: Get the videos of a specific channel using YouTube Data API v3 from Python
[^2]: Gently get video view counts using YouTube Data API v3 from Python
[^3]: How to get a YouTube API key
[^4]: Predict the category of news articles with machine learning
[^5]: Source code for this article