I tried using the YouTube Data API v3

Overview

I tried using the YouTube Data API v3 while working through an introductory scraping book. I got stuck in several places, so I am leaving notes here.

Implemented code

youtube_api_metadata.py


import os
import logging
import csv

from googleapiclient.discovery import build  # 'apiclient' is only a legacy alias of this package
from pymongo import MongoClient, ReplaceOne, DESCENDING
from typing import Iterator, List

from pymongo.collection import Collection

YOUTUBE_API_KEY = os.environ['YOUTUBE_API_KEY']

logging.getLogger('googleapiclient.discovery_cache').setLevel(logging.ERROR)


def main():

    mongo_client = MongoClient('localhost', 27017)
    collections = mongo_client.youtube.videos

    query = input('Please specify the search value.:')

    for items_per_page in search_videos(query):
        save_to_momgo(collections, items_per_page)

    save_to_csv(collections)





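def search_videos(query: str, max_pages: int = 5) -> Iterator[List[dict]]:
    """Yield one list of video resources (snippet + statistics) per search result page."""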

    youtube = build('youtube', 'v3', developerKey=YOUTUBE_API_KEY)

    search_request = youtube.search().list(
        part='id',
        q=query,
        type='video',
        maxResults=15,
    )

    i = 0
    while search_request and i < max_pages:
        search_response = search_request.execute()

        video_ids = [item['id']['videoId'] for item in search_response['items']]
        video_response = youtube.videos().list(
            part='snippet,statistics',
            id=','.join(video_ids),
        ).execute()

        yield video_response['items']

        # list_next() returns None when there are no more pages, which ends the loop.
        search_request = youtube.search().list_next(search_request, search_response)
        i += 1


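def save_to_momgo(collection: Collection, items: List[dict]):
    """Upsert the given video resources into MongoDB, keyed by video ID."""

    for item in items:
        # Use the video ID as the primary key so repeated runs overwrite, not duplicate.
        item['_id'] = item['id']

        # The API returns statistics as strings; convert them to int so they sort numerically.
        for key, value in item['statistics'].items():
            item['statistics'][key] = int(value)

    operation = [ReplaceOne({'_id': item['_id']}, item, upsert=True) for item in items]
    result = collection.bulk_write(operation)
    # upserted_count covers only newly inserted documents; replaced ones are not counted here.
    logging.info(f'Upserted {result.upserted_count} documents.')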




def save_to_csv(collection: Collection):

    with open('top_videos_list.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, ['title', 'viewCount'])
        writer.writeheader()
        # Write one row per video, most viewed first.
        for item in collection.find().sort('statistics.viewCount', DESCENDING):
            writer.writerow({'title': item['snippet']['title'], 'viewCount': item['statistics']['viewCount']})






if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    main()
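As a side note, the paging loop in search_videos can be tried out without calling the API at all. The following is only a minimal sketch: FakeRequest, PAGES, list_next and iter_pages are hypothetical stand-ins I made up for illustration, not part of the Google client library. It assumes the same contract the real loop relies on, namely that the "next" call returns a new request, or None when no page is left.

```python
from typing import Iterator, List, Optional

# Hypothetical data: three "pages" of search results.
PAGES = [['a', 'b'], ['c', 'd'], ['e']]


class FakeRequest:
    """Hypothetical stand-in for an API request; remembers which page to fetch."""

    def __init__(self, page: int):
        self.page = page

    def execute(self) -> dict:
        return {'items': PAGES[self.page], 'page': self.page}


def list_next(request: FakeRequest, response: dict) -> Optional[FakeRequest]:
    """Return a request for the following page, or None after the last page."""
    next_page = response['page'] + 1
    return FakeRequest(next_page) if next_page < len(PAGES) else None


def iter_pages(max_pages: int = 5) -> Iterator[List[str]]:
    request: Optional[FakeRequest] = FakeRequest(0)
    fetched = 0
    while request is not None and fetched < max_pages:
        response = request.execute()
        yield response['items']
        # Reassign the loop variable; otherwise the same page is fetched every time.
        request = list_next(request, response)
        fetched += 1


print(list(iter_pages()))             # [['a', 'b'], ['c', 'd'], ['e']]
print(list(iter_pages(max_pages=2)))  # [['a', 'b'], ['c', 'd']]
```

The key point is that the request variable is reassigned on every pass. The loop ends either when the page limit is reached or when there is no next page.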

Sticking point ①

After implementing the code and running it several times, the following error occurred.

sample.py


googleapiclient.errors.HttpError: <HttpError 403 when requesting https://www.googleapis.com/youtube/v3/videos?part=snippet%2Cstatistics&id=wXeR58bjCak%2CujxXyCrDnU0%2CQhhUVI0sxCc%2CKZz-7KSMjZA%2CWq-WeVQoE6U%2CG-qWwfG9mBE%2CYqUwEPSZQGQ%2CIuopzT_TWPQ%2CmcFspy1WhL8%2Ck9dcl7F6IFY%2C--Z5cvZ4JEw%2C3hidJgc9Zyw%2CdYSmEkcM_8s%2Ch6Hc4RuK8D8%2CRQfN2re3u4w&key=<YOUTUBE_API_KEY>&alt=json returned "The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.">

At first I did not understand the error, so I looked into the cause.

It turned out that I had simply used up the API quota. I created a new project in GCP and issued a new API key from the Credentials page. I then set the key in the environment variable as shown below and ran the Python file again.

set YOUTUBE_API_KEY=<YOUTUBE_API_KEY>

With the new key, the script ran without any problem and a CSV file was output.

(scraping3.7) C:\Users\user\scraping3.7\files>python  save_youtube_videos_matadata.py
Please specify the search value.:Hikakin
INFO:root:Upserted 15 documents.
INFO:root:Upserted 0 documents.
INFO:root:Upserted 0 documents.
INFO:root:Upserted 0 documents.
INFO:root:Upserted 0 documents.

image.png
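To get a feel for why the quota in sticking point ① ran out after only a few runs, here is a rough estimate. The numbers are assumptions based on my reading of the quota documentation (100 units per search.list call, 1 unit per videos.list call, 10,000 units per day by default) and should be checked against the current documentation. The helper names are hypothetical.

```python
# Assumed quota figures; verify against the current YouTube Data API documentation.
SEARCH_LIST_COST = 100   # units per search.list call (assumption)
VIDEOS_LIST_COST = 1     # units per videos.list call (assumption)
DAILY_QUOTA = 10_000     # default daily allocation per project (assumption)


def units_per_run(max_pages: int = 5) -> int:
    """Each page costs one search.list call plus one videos.list call."""
    return max_pages * (SEARCH_LIST_COST + VIDEOS_LIST_COST)


def runs_per_day(max_pages: int = 5) -> int:
    """How many full runs fit into the daily quota."""
    return DAILY_QUOTA // units_per_run(max_pages)


print(units_per_run())  # 505
print(runs_per_day())   # 19
```

Under these assumptions a single run with max_pages=5 costs about 505 units, so roughly 19 runs fit into one day. That is easy to exceed while debugging.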

Sticking point ②

In the code above, I did not understand the following line, and it stopped my progress.

operation = [ReplaceOne({'_id': item['_id']}, item, upsert=True) for item in items]

According to the official documentation, the arguments are passed in the following form.

ReplaceOne(filter, replacement, options)

For example, in the following case:

ReplaceOne({'city': 'Tokyo'}, {'city':'Gunma'}, upsert=True)

If a document matching 'city': 'Tokyo' exists, it is replaced with 'city': 'Gunma'. If no document matching 'city': 'Tokyo' exists, 'city': 'Gunma' is inserted as a new document.
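The behaviour described above can be mimicked on a plain Python list. This is only a sketch of the semantics, not pymongo itself: replace_one here is a hypothetical helper, and it ignores details such as _id handling.

```python
from typing import Dict, List


def replace_one(docs: List[Dict], flt: Dict, replacement: Dict, upsert: bool = False) -> str:
    """Mimic replace-with-upsert on a list of dicts (illustration only, not pymongo)."""
    for i, doc in enumerate(docs):
        # A document matches when every key/value pair of the filter is present in it.
        if all(doc.get(key) == value for key, value in flt.items()):
            docs[i] = dict(replacement)
            return 'replaced'
    if upsert:
        # Nothing matched, so insert the replacement as a new document.
        docs.append(dict(replacement))
        return 'upserted'
    return 'no-op'


docs = [{'city': 'Tokyo'}]
print(replace_one(docs, {'city': 'Tokyo'}, {'city': 'Gunma'}, upsert=True))  # replaced
print(replace_one(docs, {'city': 'Tokyo'}, {'city': 'Gunma'}, upsert=True))  # upserted
print(docs)  # [{'city': 'Gunma'}, {'city': 'Gunma'}]
```

The first call finds Tokyo and replaces it. The second call no longer finds Tokyo, so it inserts a new document. As far as I understand, pymongo's upserted_count counts only the second kind of outcome, which would explain the "Upserted 0 documents" lines in the log when the same videos are written again.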

Reference book

Python Crawling & Scraping [Augmented and Revised Edition]: Practical Development Guide for Data Collection and Analysis (Amazon)
