Hello. This is Kushima, in my first year at NTT DoCoMo. In this article, the entry for day 16 of the Advent Calendar, we explain in detail how to use deep learning to analyze the characteristics of YouTube thumbnails that are likely to be played. The programming language used is Python.
This article consists of the following two parts.
- **How to use the YouTube Data API in Python**
- **Classification of video thumbnails based on deep learning**

"How to use the YouTube Data API in Python" covers everything from the preparation needed to retrieve YouTube video information to a walkthrough of the code that actually retrieves it. After reading this part,
**you will be able to retrieve the view counts and thumbnail images of YouTube videos.**
"Classification of video thumbnails based on deep learning" covers everything from a brief description of the convolutional neural network (CNN), a type of deep learning model, to a walkthrough of the code that classifies video thumbnails with it. After reading this part,
**you will be able to build a CNN classification model and apply it to video thumbnails.**
We hope this is useful for those who want to analyze YouTube video data with the YouTube Data API, and for those who simply want to try image classification with deep learning.
Finally, based on the image classification results, we considered what kinds of thumbnails tend to be viewed often.
Thumbnails that are likely to be played on YouTube had the following features.
**Features of thumbnails that are likely to be played on YouTube**
- **High color saturation**
- **Many colors**
- **A lot of on-screen caption text (telop)**
- **Faces of people or characters are shown**
The YouTube Data API is an API for retrieving information about videos posted on YouTube.
Official YouTube Data API documentation: https://developers.google.com/youtube/v3/getting-started?hl=ja
Below is an example of information about videos that can be obtained with the YouTube Data API.
- **Title**
- **Channel name**
- **View count**
- **Number of likes**
- **Thumbnail URL**

This article uses the view count and the thumbnail.
The following preparations are required to use the YouTube Data API.
- Create a Google account
- **Create a new project**
- **Enable the API and service**
- **Get an API key**

I will explain each preparation in detail.
Open the following page and click "Create Project" to create a new project with any name you like. http://console.developers.google.com/project
When you create a project, you can access the project management screen from notifications.
Once you have access to the project management screen, click Go to API Overview, then Enable APIs and Services to access the API Library. In the API library, you can search for the API you want to enable. Search for "YouTube Data API" etc., select "YouTube Data API v3" in the search results, and click "Enable" to complete the activation of YouTube Data API.
There is a tab called "Credentials" on the left side of the project management screen or the YouTube Data API management screen. Click on it (after selecting YouTube Data API if you clicked from the project management screen), then click "Create Credentials" and then "API Key" to create the API key.
This completes the preparation for using the YouTube Data API.
To use the YouTube Data API from Python, install the client library in advance with the following pip command.

```shell
pip install google-api-python-client
```

You can retrieve YouTube video information by running the following code.

```python
from googleapiclient.discovery import build

YOUTUBE_API_KEY = '{Obtained API key}'
youtube = build('youtube', 'v3', developerKey=YOUTUBE_API_KEY)

search_response = youtube.search().list(
    part='snippet',
    # Search query
    q='Game commentary',
    # Sort by view count, highest first
    order='viewCount',
    type='video',
).execute()
```

You can inspect the details of the retrieved information by checking the contents of `search_response`, as shown below.

```python
search_response['items'][0]
```

Examples of the elements you can check (`videoId` is found under `id`, the others under `snippet`):
- `videoId`: Video ID
- `channelId`: Channel ID
- `title`: Video title
- `description`: Video description
- `thumbnails`: Video thumbnails (URL information)
- `channelTitle`: Channel name
- `publishTime`: Publication date

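As a minimal sketch of where these fields live, the snippet below reads them from a hand-written dictionary shaped like one `search().list` result item. The sample values and the helper name `summarize_item` are made up for illustration; only the key names come from the list above.

```python
# A hand-written sample shaped like one element of search_response['items'].
# All values here are placeholders made up for illustration.
sample_item = {
    'id': {'kind': 'youtube#video', 'videoId': 'VIDEO_ID_PLACEHOLDER'},
    'snippet': {
        'channelId': 'CHANNEL_ID_PLACEHOLDER',
        'title': 'Sample title',
        'description': 'Sample description',
        'thumbnails': {'high': {'url': 'https://example.com/thumb.jpg'}},
        'channelTitle': 'Sample channel',
        'publishTime': '2020-07-01T00:00:00Z',
    },
}


def summarize_item(item):
    """Pick out the fields used in this article (hypothetical helper)."""
    snippet = item['snippet']
    return {
        'videoId': item['id']['videoId'],
        'title': snippet['title'],
        'channelTitle': snippet['channelTitle'],
        'thumbnail_url': snippet['thumbnails']['high']['url'],
    }


print(summarize_item(sample_item))
```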
Using the `videoId` above, you can also check the view count and the number of likes of a video by running the following code.

```python
statistics = youtube.videos().list(
    # Statistics
    part='statistics',
    id='{Video ID}'
).execute()['items'][0]['statistics']
```

Examples of the elements you can check:
- `viewCount`: Number of views
- `likeCount`: Number of likes

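The statistics values are returned as strings, which is why the later code casts `viewCount` to `int`. The sketch below shows one way to convert them, and to group video IDs so that several can be requested in one `videos().list` call through a comma-separated `id` value. The batch size of 50 is an assumption to check against the official documentation, and `chunked` and `to_int_stats` are hypothetical helper names.

```python
def chunked(items, size):
    """Split a list into consecutive pieces of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def to_int_stats(statistics):
    """Convert string counts such as {'viewCount': '123'} to integers."""
    return {key: int(value) for key, value in statistics.items()}


video_ids = [f'id{n}' for n in range(120)]  # placeholder IDs
batches = chunked(video_ids, 50)            # assumed batch size
# One comma-separated string per batch, usable as the `id` parameter
id_params = [','.join(batch) for batch in batches]

print([len(batch) for batch in batches])
print(to_int_stats({'viewCount': '123456', 'likeCount': '789'}))
```

Each string in `id_params` could then be passed as `id=` to `youtube.videos().list(part='statistics', id=...)`, which would reduce the number of API calls compared with one call per video.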
If you want to know more about what information is available and how to use the API, see the official documentation.
Official YouTube Data API documentation: https://developers.google.com/youtube/v3/getting-started?hl=ja
If you want to store information on multiple videos in a data frame for analysis, you can proceed as follows.
First, specify the search conditions.
In the code below, various parameters are specified: the search query `Game commentary`, ordering by view count, 50 results per request, and an upload period from 2020/07/01 to 2020/12/01.

```python
# Build the request object (kept so that it can be passed to list_next() later)
search_response = youtube.search().list(
    part='snippet',
    # Search query
    q='Game commentary',
    # Sort by view count, highest first
    order='viewCount',
    type='video',
    # 50 results per page
    maxResults=50,
    # Uploaded on or after 2020/07/01
    publishedAfter='2020-07-01T00:00:00Z',
    # Uploaded before 2020/12/01
    publishedBefore='2020-12-01T00:00:00Z'
)
# Execute the request to get the first page of results
output = search_response.execute()
```

Next, use a for loop to store the search results in a list.
Note that the YouTube Data API has a usage limit when used for free.
The YouTube Data API management screen has a tab called "Quotas" on the left side; if you check it, you will see the figure `10,000 Queries / day`.
I have not verified how much quota a single run of the code consumes, but keep in mind that free use is limited.

```python
# Number of loops (pages to collect)
num = 20
# List to store video information
video_list = []
for i in range(num):
    video_list.extend(output['items'])
    # list_next() returns None when there are no more pages
    search_response = youtube.search().list_next(search_response, output)
    if search_response is None:
        break
    output = search_response.execute()
```

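`list_next` returns `None` once there are no further pages, so a loop that always calls `execute()` on its result can fail when fewer pages exist than expected. The sketch below shows the stopping rule in isolation. `collect_items` is a hypothetical helper, and `FakeRequest` and `fake_list_next` are stand-ins so that the example runs without an API key.

```python
def collect_items(first_request, list_next, max_pages):
    """Gather 'items' from up to max_pages pages, stopping when pages run out."""
    items = []
    request = first_request
    for _ in range(max_pages):
        if request is None:
            break
        response = request.execute()
        items.extend(response.get('items', []))
        request = list_next(request, response)
    return items


class FakeRequest:
    """Stand-in for an API request object; returns a canned response."""

    def __init__(self, pages, index=0):
        self.pages = pages
        self.index = index

    def execute(self):
        return {'items': self.pages[self.index]}


def fake_list_next(previous_request, previous_response):
    """Return the next fake request, or None after the last page."""
    next_index = previous_request.index + 1
    if next_index >= len(previous_request.pages):
        return None
    return FakeRequest(previous_request.pages, next_index)


pages = [['a', 'b'], ['c'], ['d', 'e']]
print(collect_items(FakeRequest(pages), fake_list_next, max_pages=20))
```

With the real client, the call would presumably look like `collect_items(youtube.search().list(...), youtube.search().list_next, 20)`.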
Finally, convert the list created above into a data frame.
In the code below, videos are filtered by view count using a variable called `HighViewCount`.

```python
import pandas as pd

# Function to get the statistics of one video
def get_statistics(video_id):
    statistics = youtube.videos().list(part='statistics', id=video_id).execute()['items'][0]['statistics']
    return statistics

# View count threshold used for filtering
HighViewCount = 100000

df = pd.DataFrame(video_list)
# Video IDs
df1 = pd.DataFrame(list(df['id']))['videoId']
# Basic video information
df2 = pd.DataFrame(list(df['snippet']))[['channelTitle', 'publishedAt', 'channelId', 'title', 'description']]
# URL of the `high` thumbnail
df3 = pd.DataFrame(list(pd.DataFrame(list(pd.DataFrame(list(df['snippet']))['thumbnails']))['high']))['url']
ddf = pd.concat([df1, df2, df3], axis=1)
# Statistics (one API call per video)
df_static = pd.DataFrame(list(ddf['videoId'].apply(get_statistics)))
df_output = pd.concat([ddf, df_static], axis=1)
df_output['viewCount'] = df_output['viewCount'].astype(int)
# Keep only videos with at least HighViewCount views
df_highview = df_output[df_output['viewCount'] >= HighViewCount]
```

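As a rough estimate of the quota the code so far can consume, the sketch below multiplies call counts by per-call costs. The costs used (100 units per `search().list` call and 1 unit per `videos().list` call) are assumptions based on the published quota cost table and should be verified, since they can change. The call counts assume the loop runs all 20 iterations and every page returns 50 videos.

```python
# Assumed per-call quota costs (verify against the official quota cost table).
SEARCH_LIST_COST = 100
VIDEOS_LIST_COST = 1
DAILY_QUOTA = 10_000

num_pages_in_loop = 20      # `num` in the loop above
results_per_page = 50       # maxResults

# One initial search plus one more search per loop iteration.
search_calls = 1 + num_pages_in_loop
# get_statistics() is called once per collected video.
videos_calls = num_pages_in_loop * results_per_page

estimated_units = search_calls * SEARCH_LIST_COST + videos_calls * VIDEOS_LIST_COST
runs_per_day = DAILY_QUOTA // estimated_units

print(estimated_units, runs_per_day)
```

Under these assumptions the search calls dominate the cost, so reducing the number of pages would be the first thing to try when the quota runs short.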
Use the data frame obtained in the previous section to download the video thumbnails themselves. Below is a code example for downloading thumbnails. **Note: please read the caveats below the code.**

```python
import requests

df_highview = df_highview.drop_duplicates()
df_highview = df_highview.reset_index(drop=True)
df_loop = df_highview
for i in range(len(df_loop)):
    # Access the thumbnail URL to get the image itself
    response = requests.get(df_loop.loc[i, 'url'])
    image = response.content
    filename = './image_' + str(i) + '.jpg'
    with open(filename, "wb") as f:
        f.write(image)
```

This code extracts the thumbnail URL from the information retrieved in the previous section and downloads the image. **However, when writing code that accesses image URLs like this, take care not to put a burden on the server. The code above is only an example, so please take appropriate measures such as leaving an interval between requests.** If you would like to know more about this point, please refer to the following articles.
- [For beginners] Download images by specifying a URL in Python
- Let's scrape images with Python

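One simple way to space out requests, as recommended above, is to sleep between downloads. The following is a sketch under stated assumptions rather than the method used in this article: it uses the standard library's `urllib` instead of `requests` so that it is self-contained, the one-second default interval is an arbitrary choice, and `fetch_bytes` and `download_all` are hypothetical helper names. The demo at the end uses a fake fetch function so that it runs without network access.

```python
import time
from urllib.request import urlopen


def fetch_bytes(url, timeout=10):
    """Download the body of one URL as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch=fetch_bytes, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between consecutive requests."""
    images = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one
            sleep(delay_seconds)
        images.append(fetch(url))
    return images


# Demo with a fake fetch function and a very short delay (no network access)
demo = download_all(['url_a', 'url_b'], fetch=lambda u: u.encode(), delay_seconds=0.01)
print(demo)
```

With real data, `download_all(list(df_highview['url']))` would return the image bytes in order, ready to be written to files as in the code above.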
This completes the first goal of this article, "**retrieve the view counts and thumbnail images of YouTube videos**".
A CNN is a deep learning model that introduces convolution operations into a neural network. Its structure is well suited to image recognition and classification, and it is widely used in those fields. In this article, we apply a CNN with a typical structure to video thumbnails to solve a classification problem.
Using the view counts and thumbnails, we consider the problem of classifying thumbnails into those with many views and those with few views.
Specifically, among the videos retrieved with the search query `Game commentary`, those with 100,000 or more views are treated as positive examples and those with 10,000 or fewer views as negative examples, and a CNN model is built on their thumbnail images.
By changing the `HighViewCount` threshold that appeared in the section "Getting YouTube video information" and searching again, you can obtain thumbnails for both positive and negative examples. Load the image data from the folders with the following code. The number of images used in this article is 749 positive examples and 748 negative examples.

```python
import glob
import PIL  # Pillow is required by load_img
import keras
from keras.preprocessing import image

# Image size to resize to
input_shape = (256, 256, 3)
# Number of classes
num_classes = 2
# Image data
x = []
# Labels (1: positive example, 0: negative example)
y = []
# Image file names
z = []

# `*` matches any number of characters, so image_10.jpg and later are included
image_list_positive = glob.glob('{Positive image folder directory}/image_*.jpg')
for f in image_list_positive:
    x.append(image.img_to_array(image.load_img(f, target_size=input_shape[:2])))
    y.append(1)
    z.append(f)

image_list_negative = glob.glob('{Negative image folder directory}/image_*.jpg')
for f in image_list_negative:
    x.append(image.img_to_array(image.load_img(f, target_size=input_shape[:2])))
    y.append(0)
    z.append(f)
```

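The positive/negative rule described earlier (100,000 or more views versus 10,000 or fewer) can be written as a small labeling function. This is only a sketch: the names `HIGH_VIEW_COUNT`, `LOW_VIEW_COUNT`, and `label_by_views` are not from the original code, and treating the range in between as excluded is an interpretation of the description.

```python
HIGH_VIEW_COUNT = 100_000   # positive examples: this many views or more
LOW_VIEW_COUNT = 10_000     # negative examples: this many views or fewer


def label_by_views(view_count):
    """Return 1 (positive), 0 (negative), or None for the excluded middle range."""
    if view_count >= HIGH_VIEW_COUNT:
        return 1
    if view_count <= LOW_VIEW_COUNT:
        return 0
    return None


for views in (250_000, 100_000, 50_000, 10_000, 300):
    print(views, label_by_views(views))
```

Leaving a gap between the two thresholds keeps borderline videos out of both classes, which tends to make the two groups easier to tell apart.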
Preprocess the images with the following code.

```python
import numpy as np
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

x = np.asarray(x)
# Scale pixel values to the range 0-1
x /= 255
y = np.asarray(y)
# Convert labels to one-hot vectors
y = to_categorical(y, num_classes)
# Split the image dataset into training and test sets
x_train, x_test, y_train, y_test, z_train, z_test = train_test_split(x, y, z, test_size=0.33, random_state=3)
# Split the training set further into a part used for training and a part used for validation
x_train_train, x_train_val, y_train_train, y_train_val, z_train_train, z_train_val = train_test_split(x_train, y_train, z_train, test_size=0.1, random_state=3)
```

The sizes of the datasets split as above are as follows.

```python
len(x_train), len(x_test)
# (1002, 495)
len(x_train_train), len(x_train_val)
# (901, 101)
```

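These sizes follow from how a fractional `test_size` is turned into a sample count. As far as I understand, scikit-learn rounds the test count up and gives the remainder to the training side; the arithmetic below reproduces the numbers under that assumption, without importing scikit-learn. `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test count is rounded up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


n_total = 749 + 748                       # positive + negative images
n_train, n_test = split_sizes(n_total, 0.33)
n_train_train, n_train_val = split_sizes(n_train, 0.1)

print(n_train, n_test)
print(n_train_train, n_train_val)
```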
Build the model with the following code.

```python
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPool2D
from keras.optimizers import Adam
from keras.layers import Dense, Activation, Dropout, Flatten

model = Sequential()
# Two convolution layers followed by max pooling
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
# Fully connected layers for classification
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

# `learning_rate` was called `lr` in older Keras versions
adam = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=adam,
              metrics=['accuracy'])
```

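To get a feel for the size of this network, the arithmetic below traces the tensor shape through the layers and counts the trainable parameters. It assumes the Keras defaults that the code above does not override (unpadded `valid` convolutions with stride 1, and a pooling stride equal to the pool size); `conv_out` is a hypothetical helper.

```python
def conv_out(size, kernel):
    """Output width/height of an unpadded, stride-1 convolution."""
    return size - kernel + 1


side = 256
side = conv_out(side, 3)        # after Conv2D(32, 3x3)
side = conv_out(side, 3)        # after Conv2D(64, 3x3)
side = side // 2                # after MaxPool2D(2x2)
flat = side * side * 64         # after Flatten

# weights + biases for each layer with trainable parameters
params = {
    'conv1': 3 * 3 * 3 * 32 + 32,
    'conv2': 3 * 3 * 32 * 64 + 64,
    'dense1': flat * 128 + 128,
    'dense2': 128 * 2 + 2,
}

print(side, flat)
print(params, sum(params.values()))
```

Almost all of the parameters sit in the first `Dense` layer, which is one place to look when changing the model structure.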
Train the model on the training data.

```python
# Batch size
batch_size = 100
# Number of epochs
epochs = 100
history = model.fit(x_train_train, y_train_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_train_val, y_train_val))
```

Classify the test data using the trained model.

```python
predictions = model.predict(x_test)
```

Check the classification accuracy by comparing the classification results with the true labels.
- Accuracy: 0.76
- Confusion matrix: True Positive: 178, True Negative: 196, False Positive: 78, False Negative: 43
- Recall, precision, and F-measure: Recall: 0.81, Precision: 0.70, F-measure: 0.75

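The figures above can be reproduced from the four confusion-matrix counts using the standard definitions of each metric. The helper name `metrics` is hypothetical.

```python
def metrics(tp, tn, fp, fn):
    """Return (accuracy, recall, precision, f_measure) from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f_measure


accuracy, recall, precision, f_measure = metrics(tp=178, tn=196, fp=78, fn=43)
print(f"Accuracy: {accuracy:.2f}")
print(f"Recall: {recall:.2f} Precision: {precision:.2f} F-measure: {f_measure:.2f}")
```

The four counts add up to 495, which matches the size of the test set.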
Since this article is only a trial application of a CNN, higher accuracy can be expected by tuning the model structure, batch size, and number of epochs. Also, because this is deep learning, more training images would be desirable, but the usage limits of the free tier of the YouTube Data API meant we could not collect as many images as we had hoped. Keep this in mind if you use the YouTube Data API.
This completes the second goal of this article, "**build a classification model with a CNN and apply it to video thumbnails**".
Finally, we consider what kinds of images tend to be viewed often, based on the classification results for the test images.
The characteristics of positive examples that were classified correctly and of negative examples that were classified incorrectly (that is, images the model judged to be thumbnails likely to be played on YouTube) are listed below.
**Features of thumbnails that are likely to be played on YouTube**
- **High color saturation**
- **Many colors**
- **A lot of on-screen caption text (telop)**
- **Faces of people or characters are shown**

This is only a subjective evaluation, but there does seem to be a slight tendency. Visualizing the feature maps and increasing the number of training images should reveal clearer features.
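One way to make the subjective impressions of "high saturation" and "many colors" measurable is to compute simple statistics over an image's pixels. The sketch below does so for a list of RGB tuples using only the standard library. The function names and the coarse color quantization (8 levels per channel) are illustrative choices, not part of the analysis in this article, and the two pixel lists are made-up samples.

```python
import colorsys


def mean_saturation(pixels):
    """Average HSV saturation (0.0 to 1.0) over (R, G, B) tuples in 0-255."""
    total = 0.0
    for r, g, b in pixels:
        _, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        total += s
    return total / len(pixels)


def distinct_colors(pixels, levels=8):
    """Count distinct colors after quantizing each channel to `levels` bins."""
    step = 256 // levels
    return len({(r // step, g // step, b // step) for r, g, b in pixels})


# Made-up sample pixels: a vivid palette and a muted, grayish one
vivid = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]
muted = [(128, 128, 128), (130, 130, 130), (120, 125, 120), (128, 128, 128)]

print(mean_saturation(vivid), distinct_colors(vivid))
print(mean_saturation(muted), distinct_colors(muted))
```

With Pillow, the pixel list for a thumbnail could be obtained with `list(Image.open(path).convert('RGB').getdata())`, so the two statistics could be compared between positive and negative examples.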
In this article, we explained in detail how to use deep learning to analyze the characteristics of YouTube thumbnails that are likely to be played. Specifically, we described how to achieve the following two goals.
- **Retrieve the view counts and thumbnail images of YouTube videos**
- **Build a classification model with a CNN and apply it to video thumbnails**

What did you think? We hope that this article will be of some help to you.
With the YouTube Data API, you can get a lot of information other than the information used in this article. In the future, I would like to use other information to analyze data and build models. In addition, since this article was a trial application of CNN construction, I would like to take on the challenge of searching for an appropriate model structure and introducing the latest methods.
- Get Youtube data in Python using Youtube Data API
- Try using YouTube Data API
- [For beginners] Download images by specifying a URL in Python
- Let's scrape images with Python
- Create a machine learning model for image classification (1) CNN from scratch
- Image classification with Keras: from preprocessing to classification test