I'm a beginner who has been using Python for two weeks, but I wanted to collect Google search results for my seminar research, so I followed the article "[Get Google search results using Custom Search API](https://qiita.com/zak_y/items/42ca0f1ea14f7046108c#1-api%E3%82%AD%E3%83%BC%E3%81%AE%E5%8F%96%E5%BE%97)".
Much of this overlaps with the reference article, but I would like to record how I did it.
**Environment** Windows 10, Python 3.7, Anaconda Navigator
**Goal** Collect previous research on my seminar research theme, "What are the determinants that influence the increase and decrease in the number of foreign visitors to Japan?" → Create a file that lists the titles and URLs of the articles found
Open the navigation menu of Google Cloud Platform and click "APIs and Services" → "Credentials".
Create an API key from "Create Credentials".
You will need the API key later, so copy it and save it somewhere.
Open the navigation menu of Google Cloud Platform and click "APIs and Services" → "Library".
Select "Custom Search API" from "Other" at the bottom of the page to open the details page, then click "Enable".
① Go to the Custom Search Engine page and click "Add".
② Fill in the form:
・ Enter the URL of any site under "Sites to search" (anything is fine).
・ Set the language to "Japanese".
・ Enter a name for the search engine.
・ Click "Create".
③ Under "Edit search engine", select the search engine you just created and edit it:
・ Copy the "Search engine ID" and save it somewhere.
・ Select Japanese for "Language".
・ Delete the site listed under "Sites to search".
・ Turn on "Search the entire web".
・ Click "Update".
Install "Google API Python Client" by referring to "Google API Client Library for Python".
I created a virtual environment with virtualenv and then installed the library in it.
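I wrote the code and ran it ... and got an error!
**Cause** As far as I understand it, on Windows the `open` function uses cp932 (Shift-JIS) by default, and the response contained characters that cp932 cannot encode. Reference article: "Causes and workarounds of UnicodeEncodeError (cp932, Shift-JIS encoding) when using Python3 on Windows"
**Workaround** Pass `encoding='utf-8'` as an argument to the `open` function.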
scrape.py
with open(os.path.join(save_response_dir, 'response_' + today + '.json'), mode='w', encoding='utf-8') as response_file:
    response_file.write(jsonstr)
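Here is a minimal sketch of the same idea in isolation. The sample dictionary and the temporary file name are made up for illustration and are not part of scrape.py. Because `json.dumps(..., ensure_ascii=False)` leaves Japanese characters unescaped, the file has to be written with an encoding that can represent them.

```python
import json
import os
import tempfile

# Hypothetical sample data containing non-ASCII text.
sample = {"title": "訪日外国人 要因 研究"}

# ensure_ascii=False keeps the characters unescaped in the output string.
jsonstr = json.dumps(sample, ensure_ascii=False)

# Write into a throwaway directory so nothing in the project is touched.
path = os.path.join(tempfile.mkdtemp(), "response_sample.json")

# Passing encoding='utf-8' avoids relying on the platform default encoding.
with open(path, mode="w", encoding="utf-8") as f:
    f.write(jsonstr)

# Read the file back with the same encoding to confirm the round trip.
with open(path, mode="r", encoding="utf-8") as f:
    restored = json.loads(f.read())

print(restored == sample)
```

Reading the file later needs the same `encoding='utf-8'` argument, otherwise the platform default is used again.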
With a little tinkering, the final code looks like this:
scrape.py
import os
import datetime
import json
from time import sleep

from googleapiclient.discovery import build

GOOGLE_API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CUSTOM_SEARCH_ENGINE_ID = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
DATA_DIR = 'data'


def makeDir(path):
    # Create the directory only if it does not exist yet
    if not os.path.isdir(path):
        os.mkdir(path)


def getSearchResponse(keyword):
    today = datetime.datetime.today().strftime("%Y%m%d")
    timestamp = datetime.datetime.today().strftime("%Y/%m/%d %H:%M:%S")
    makeDir(DATA_DIR)
    service = build("customsearch", "v1", developerKey=GOOGLE_API_KEY)
    page_limit = 10
    start_index = 1
    response = []
    for n_page in range(0, page_limit):
        try:
            # Wait one second between requests
            sleep(1)
            response.append(service.cse().list(
                q=keyword,
                cx=CUSTOM_SEARCH_ENGINE_ID,
                lr='lang_ja',
                num=10,
                start=start_index
            ).execute())
            # Read the start position of the next page from the response.
            # If there is no next page, this raises and the loop ends below.
            start_index = response[n_page].get("queries").get("nextPage")[
                0].get("startIndex")
        except Exception as e:
            print(e)
            break

    # Save the response in JSON format
    save_response_dir = os.path.join(DATA_DIR, 'response')
    makeDir(save_response_dir)
    out = {'snapshot_ymd': today, 'snapshot_timestamp': timestamp, 'response': []}
    out['response'] = response
    jsonstr = json.dumps(out, ensure_ascii=False)
    with open(os.path.join(save_response_dir, 'response_' + today + '.json'), mode='w', encoding='utf-8') as response_file:
        response_file.write(jsonstr)


if __name__ == '__main__':
    target_keyword = 'Foreign Visitors in Japan Factor Research'
    getSearchResponse(target_keyword)
When I run it this time, a "response" folder is created under the "data" folder, and a json file is created under that!
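In scrape.py, the start position of the next page is read from each response, and when there is no "nextPage" entry the lookup raises an exception that ends the loop. As a rough sketch, that lookup could be written as a small helper that returns `None` instead. The function name `next_start_index` and the trimmed-down sample responses are my own assumptions for illustration; only the `queries` → `nextPage` → `startIndex` path comes from the script.

```python
def next_start_index(response):
    """Return the start index of the next page, or None if there is none."""
    next_pages = response.get("queries", {}).get("nextPage")
    if not next_pages:
        return None
    return next_pages[0].get("startIndex")


# Hypothetical, trimmed-down responses shaped like the ones the script saves.
page_with_next = {"queries": {"nextPage": [{"startIndex": 11}]}}
last_page = {"queries": {"request": [{"startIndex": 91}]}}

print(next_start_index(page_with_next))
print(next_start_index(last_page))
```

With a helper like this, the loop could stop as soon as `None` comes back, instead of relying on the `except` branch.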
Next, I format the saved JSON file into a tab-separated file that is easier to read. The code is below.
prettier.py
import os
import datetime
import json

import pandas as pd

DATA_DIR = 'data'


def makeDir(path):
    # Create the directory only if it does not exist yet
    if not os.path.isdir(path):
        os.mkdir(path)


def makeSearchResults():
    today = datetime.datetime.today().strftime("%Y%m%d")
    response_filename = os.path.join(
        DATA_DIR, 'response', 'response_' + today + '.json')
    # Use a with block so the file is closed after reading
    with open(response_filename, 'r', encoding='utf-8') as response_file:
        response_json = response_file.read()
    response_tmp = json.loads(response_json)
    ymd = response_tmp['snapshot_ymd']
    response = response_tmp['response']
    results = []
    cnt = 0
    # Each element of response is one page of search results
    for one_res in range(len(response)):
        if 'items' in response[one_res] and len(response[one_res]['items']) > 0:
            for i in range(len(response[one_res]['items'])):
                cnt += 1
                display_link = response[one_res]['items'][i]['displayLink']
                title = response[one_res]['items'][i]['title']
                link = response[one_res]['items'][i]['link']
                snippet = response[one_res]['items'][i]['snippet'].replace(
                    '\n', '')
                results.append({'ymd': ymd, 'no': cnt, 'display_link': display_link,
                                'title': title, 'link': link, 'snippet': snippet})

    # Save the results as a tab-separated file
    save_results_dir = os.path.join(DATA_DIR, 'results')
    makeDir(save_results_dir)
    df_results = pd.DataFrame(results)
    df_results.to_csv(os.path.join(save_results_dir, 'results_' + ymd + '.tsv'), sep='\t',
                      index=False, columns=['ymd', 'no', 'display_link', 'title', 'link', 'snippet'])


if __name__ == '__main__':
    makeSearchResults()
When I ran it, the results were written out in columns in the order date, number, site URL, title, article URL, and snippet!
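To make the column layout easier to see, here is a small sketch of the same flattening step using only the standard library `csv` module instead of pandas. This is a swapped-in technique for illustration, not what prettier.py does. The function name `flatten_items` and the sample response are hypothetical; the field names (`displayLink`, `title`, `link`, `snippet`) and the column order are taken from prettier.py.

```python
import csv
import io

COLUMNS = ["ymd", "no", "display_link", "title", "link", "snippet"]


def flatten_items(ymd, response):
    """Turn a list of result pages into one row per search result."""
    rows = []
    for page in response:
        # Pages without an "items" key contribute no rows.
        for item in page.get("items", []):
            rows.append({
                "ymd": ymd,
                "no": len(rows) + 1,
                "display_link": item["displayLink"],
                "title": item["title"],
                "link": item["link"],
                "snippet": item["snippet"].replace("\n", ""),
            })
    return rows


# Hypothetical response with one result page and one empty page.
sample_response = [
    {"items": [{"displayLink": "example.com", "title": "Sample title",
                "link": "https://example.com/a", "snippet": "line1\nline2"}]},
    {},
]

rows = flatten_items("20200101", sample_response)

# Write to an in-memory buffer instead of a real file for this sketch.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                        lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The header row comes out in the same order as the `columns` argument passed to `to_csv` in prettier.py.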
If you open it in Excel, it looks like this ↓
The article I referred to this time ([Get Google search results using Custom Search API](https://qiita.com/zak_y/items/42ca0f1ea14f7046108c#1-api%E3%82%AD%E3%83%BC%E3%81%AE%E5%8F%96%E5%BE%97)) was so clear and easy to understand that even a beginner like me could implement it without trouble! I still need to understand what the code means properly, but for now I'm happy to have made a program I can use in everyday life. However, the free tier of the Custom Search API (Google Custom Search JSON API) seems to have various restrictions, so I will need to be careful when I use it again in the future.
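As a rough illustration of why the restrictions matter: each run of scrape.py sends up to `page_limit` (10) requests. The sketch below estimates how many runs fit into a daily quota. The figure of 100 free queries per day is my assumption here and should be checked against the current Custom Search JSON API documentation; the function name `runs_per_day` is also made up.

```python
PAGE_LIMIT = 10  # same value as page_limit in scrape.py
FREE_QUERIES_PER_DAY = 100  # assumption: verify against the current API docs


def runs_per_day(daily_quota, queries_per_run):
    """Return how many full runs fit into the daily quota."""
    return daily_quota // queries_per_run


print(runs_per_day(FREE_QUERIES_PER_DAY, PAGE_LIMIT))
```

Under that assumption, a handful of test runs while debugging would already use up a good part of one day's quota.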