Googling by typing in whatever keywords come to mind tends to be inefficient: unless you record what you searched for and how, you end up repeating the same searches.
Taking notes on every search is tedious, though, so I suspect many people save Google's result page as page source or as a web archive and review it later at their leisure. Even that is a bit of a chore.
So I condensed the whole process, from entering the search keywords to saving the HTML, into a single step.
- Given (1) a search keyword and (2) the number of results to display per page, the search result page is saved as an HTML file in the current working directory (CWD).
- The links to the second and subsequent result pages, and the "Next" link at the bottom of the page, do not work, because they are relative paths.
- To compensate, increase (2) the number of results per page (up to 100 results/page).
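The search URL the script builds from those two inputs can be sketched in isolation. `build_search_url` is a hypothetical helper name; the `quote_plus`/`urlunsplit` logic mirrors what the script does with the query and the per-page count:

```python
from urllib.parse import quote_plus, urlunsplit

def build_search_url(query, num):
    # URL-encode the query (spaces -> '+', ':' -> '%3A') and
    # attach the per-page result count as the 'num' parameter.
    query_string = 'q=' + quote_plus(query) + '&num=' + str(num)
    return urlunsplit(('https', 'www.google.com', '/search', query_string, ''))

print(build_search_url('site:go.jp filetype:pdf', 100))
# https://www.google.com/search?q=site%3Ago.jp+filetype%3Apdf&num=100
```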
You will be prompted twice:

- At the first prompt, enter the query (search keywords). If you want to add search operators (site:go.jp, filetype:pdf, etc.), include them here as well. (cf. improving the accuracy of web searches)
- At the second prompt, enter the number of results to display per result page.
- For various reasons this ended up as a class, perhaps needlessly; please bear with me.
- Here, `html_text` is written straight to an HTML file as soon as it is obtained, but of course you can use it as-is without writing it to a file.
- Setting `my_headers` is not mandatory, but the returned HTML differs slightly depending on whether it is set. That difference matters when you go on to process the HTML, as in the "utilization" scenario above.
- Looking at the HTML, the link to each result appears four times per entry, each time in a different form. As you might expect from the home of structured data (?), it is really quite well made...
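Since each result URL shows up several times, any post-processing needs to deduplicate. Below is a minimal sketch, assuming (this is an assumption about the no-JavaScript variant of the page, not something guaranteed by Google) that result links appear as `href="/url?q=..."` redirect attributes; `extract_result_links` is a hypothetical helper:

```python
import re
from urllib.parse import unquote

def extract_result_links(html_text):
    # Collect every href that points through a '/url?q=' redirect,
    # then deduplicate while preserving first-seen order.
    hits = re.findall(r'href="/url\?q=([^&"]+)', html_text)
    seen, links = set(), []
    for url in (unquote(h) for h in hits):
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

# Toy sample: the same target appears twice, as it does in real pages.
sample = ('<a href="/url?q=https://example.com/a&sa=U">A</a>'
          '<a href="/url?q=https://example.com/a&sa=U">A again</a>'
          '<a href="/url?q=https://example.com/b&sa=U">B</a>')
print(extract_result_links(sample))
# ['https://example.com/a', 'https://example.com/b']
```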
google_fetcher.py
import os
import re
from urllib.parse import quote_plus, urlunsplit

import requests

PROJECT_ROOT_PATH = '.'


class GoogleResultsPage:
    '''Query text, results number per page -> search results response'''

    def __init__(self, query, rslts_num):
        self.__qry = query
        self.__num = rslts_num
        query_string = 'q=' + quote_plus(self.__qry) + '&num=' + str(self.__num)
        self.__sstr = urlunsplit(
            ('https', 'www.google.com', '/search', query_string, ''))

    def page_fetcher(self):
        '''Fetch the result page and return it as text.'''
        my_headers = {'user-agent':
                      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/84.0.4147.105 Safari/537.36'}
        response = requests.get(self.__sstr,
                                headers=my_headers, timeout=(3.05, 27))
        response.raise_for_status()
        return response.text


################################
# Output to a file.
def html_to_file(html_text, query):
    '''Write the text response content to an HTML file.'''
    # Replace filesystem-unfriendly characters in the query with '_'.
    output_file_name = re.sub(r'[\/.:;*?"<>| ]', '_', query) + '.html'
    output_file_path = os.path.join(PROJECT_ROOT_PATH, output_file_name)
    with open(output_file_path, 'w', encoding='utf-8') as f:
        f.write(html_text)
    print('Done! ', end='')
    print('File path:', output_file_path)


if __name__ == '__main__':
    query = input('Query? >> ')
    rslts_num = input('Results per page (up to 100)? >> ')
    html_text = GoogleResultsPage(query, rslts_num).page_fetcher()
    html_to_file(html_text, query)
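The filename sanitization inside `html_to_file` can be checked on its own. `safe_file_name` is a hypothetical helper that just repeats that one `re.sub` step:

```python
import re

def safe_file_name(query):
    # Replace characters that are awkward in file names
    # (path separators, punctuation, spaces) with underscores.
    return re.sub(r'[\/.:;*?"<>| ]', '_', query) + '.html'

print(safe_file_name('site:go.jp filetype:pdf'))
# site_go_jp_filetype_pdf.html
```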