When I previously did SEO writing, I manually collected the top-10 URLs and titles for each search keyword. Scraping saved me a great deal of that work at the time, so I will describe how to do it.
For those who want to write their own blog and earn money from it, this shows which titles rank well, and since you can jump to each URL straight from Excel, the writing research work is greatly reduced.
**1. Launch Google**
**2. Enter the search keyword in the search box and press Enter**
**3. Get the URLs from the search results**
**4. Access each URL and get its title and meta description**
**5. Export the collected data as an Excel file**

The full script is below.
```python
import time  # Required to use sleep
from selenium import webdriver  # Automates the web browser (python -m pip install selenium)
from selenium.webdriver.common.keys import Keys
import chromedriver_binary  # Puts the matching ChromeDriver on the PATH
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
# Change here if you like
from selenium.webdriver.chrome.options import Options


class Chrome_search:
    def __init__(self):
        self.url = "https://www.google.co.jp/search"
        self.search_word = input("Please enter a search word:")
        self.search_num = int(input("How many results do you want:"))  # Number of results to fetch
        self.options = Options()
        # self.options.add_argument('--headless')  # Run without opening a browser window
        # self.options.add_argument('--no-sandbox')  # Removes access restrictions; risky, so leave it off unless needed
        self.options.add_argument('--disable-dev-shm-usage')  # Avoid Chrome crashes when /dev/shm is small by letting it use regular memory

    def search(self):
        driver = webdriver.Chrome(options=self.options)  # mac users: comment out this line
        driver.get(self.url)
        search = driver.find_element_by_name('q')  # Locate the search box in the HTML (name='q')
        search.send_keys(self.search_word)  # Send the search word
        search.submit()  # Perform the search
        time.sleep(1)
        # Create storage lists
        title_list = []  # Stores titles
        url_list = []  # Stores URLs
        description_list = []  # Stores meta descriptions
        ## Get the HTML
        html = driver.page_source.encode('utf-8')
        soup = BeautifulSoup(html, "html.parser")
        # Get the search result titles and links
        link_elem01 = soup.select('.yuRUbf > a')
        # Keep only the link and strip the extra part
        if self.search_num <= len(link_elem01):  # If fewer URLs were found than requested, analyze only what we have
            for i in range(self.search_num):
                url_text = link_elem01[i].get('href').replace('/url?q=', '')
                url_list.append(url_text)
        elif self.search_num > len(link_elem01):
            for i in range(len(link_elem01)):
                url_text = link_elem01[i].get('href').replace('/url?q=', '')
                url_list.append(url_text)
        time.sleep(1)
        # At this point the URL list is complete
        # Get the title of each URL in url_list
        for i in range(len(url_list)):
            driver.get(url_list[i])
            ## Get the HTML
            html2 = driver.page_source.encode('utf-8')
            ## Parse with BeautifulSoup (not used below, but kept for further scraping)
            soup2 = BeautifulSoup(html2, "html.parser")
            # Get the title
            title_list.append(driver.title)
            # Get the description
            try:
                description = driver.find_element_by_xpath("//meta[@name='description']").get_attribute("content")
                description_list.append(description)
            except:
                description_list.append("")
            # Go back in the browser
            driver.back()
            time.sleep(0.3)
        # Check the collected data here
        print(url_list)
        print(title_list)
        print(description_list)
        search_ranking = np.arange(1, len(url_list) + 1)
        my_list = {"url": url_list, "ranking": search_ranking, "title": title_list, "description": description_list}
        my_file = pd.DataFrame(my_list)
        driver.quit()
        my_file.to_excel(self.search_word + ".xlsx", sheet_name=self.search_word, startcol=2, startrow=1)
        df = pd.read_excel(self.search_word + ".xlsx")
        return df


if __name__ == '__main__':
    se = Chrome_search()
    df = se.search()
    df.head()
```
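Running the script prompts you for a search word and the number of results, then writes a file named `<search word>.xlsx` to the current directory.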
I will explain the code.
```python
import time  # Required to use sleep
from selenium import webdriver  # Automates the web browser (python -m pip install selenium)
from selenium.webdriver.common.keys import Keys
import chromedriver_binary  # Puts the matching ChromeDriver on the PATH
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
# Change here if you like
from selenium.webdriver.chrome.options import Options
```
The key libraries are as follows.

- time: controls timing (used for sleep)
- selenium: a library that controls the web browser
- beautifulsoup4: a scraping/HTML-parsing library
- chromedriver_binary: lets the Selenium browser be Google Chrome
The following part launches the browser and opens Google.

```python
driver = webdriver.Chrome(options=self.options)  # mac users: comment out this line
driver.get(self.url)
```
`driver = webdriver.Chrome(options=self.options)` is the declaration that launches Selenium. `driver.get("url you want to open")` opens that URL in Chrome. In this script, `self.url` is the Google search page.
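As a minimal standalone sketch (assuming chromedriver_binary is installed so the driver is found on the PATH), launching Chrome headlessly and opening a page looks like this:

```python
import chromedriver_binary  # Adds the matching ChromeDriver to the PATH
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # Run without showing a browser window
driver = webdriver.Chrome(options=options)
driver.get("https://www.google.co.jp/search")
print(driver.title)
driver.quit()
```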
The code to search from the Google search form using Selenium is as follows.

```python
search = driver.find_element_by_name('q')  # Locate the search box in the HTML (name='q')
search.send_keys(self.search_word)  # Send the search word
search.submit()  # Perform the search
time.sleep(1)  # Pause for 1 second
```
`driver.find_element_by_name()` extracts the element that has the given name attribute. After `find_element_by_`, you can also specify `class_name`, `id`, and so on.
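For example, a minimal sketch of the other locator styles (the `id` and class values here are hypothetical, and note that Selenium 4 replaces these methods with `find_element(By..., ...)`):

```python
# Selenium 3 style (used throughout this article, deprecated in Selenium 4)
elem = driver.find_element_by_name('q')               # by name attribute
elem = driver.find_element_by_id('main')              # by id (hypothetical value)
elems = driver.find_elements_by_class_name('result')  # by class (hypothetical value)

# Selenium 4 equivalent
from selenium.webdriver.common.by import By
elem = driver.find_element(By.NAME, 'q')
```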
`.send_keys("word you want to type")` enters that word into the element located by `driver.find_element...`. `submit()` acts as pressing Enter.
`time.sleep(seconds)` stops execution for the specified number of seconds. Using it while the browser loads lets you wait until the page is displayed, so lag errors caused by slow communication are less likely.
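A fixed sleep is the simplest approach. As an alternative sketch, Selenium's explicit waits (bundled with the library) wait only as long as actually needed:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the search box to be present, instead of a blind sleep
search = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.NAME, 'q'))
)
```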
With this, step 2 actually changes the screen and displays the search results.
The code to get the URLs from the search results (step 3) is as follows.
```python
# Create storage lists
title_list = []  # Stores titles
url_list = []  # Stores URLs
description_list = []  # Stores meta descriptions
## Get the HTML
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, "html.parser")
# Get the search result titles and links
link_elem01 = soup.select('.yuRUbf > a')
# Keep only the link and strip the extra part
if self.search_num <= len(link_elem01):  # If fewer URLs were found than requested, analyze only what we have
    for i in range(self.search_num):
        url_text = link_elem01[i].get('href').replace('/url?q=', '')
        url_list.append(url_text)
elif self.search_num > len(link_elem01):
    for i in range(len(link_elem01)):
        url_text = link_elem01[i].get('href').replace('/url?q=', '')
        url_list.append(url_text)
time.sleep(1)
```
The most important parts of this code are:
```python
## Get the HTML
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, "html.parser")
# Get the search result titles and links
link_elem01 = soup.select('.yuRUbf > a')
```
`driver.page_source.encode('utf-8')` forcibly sets the character code to UTF-8. `BeautifulSoup(html, "html.parser")` is a fixed declaration; think of it as a spell that builds the parse tree. `soup.select()` extracts the elements matching a CSS selector. The later `link_elem01[i].get('href')` then reads the `href` attribute from each element that `soup.select` returned.
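As a self-contained sketch of those two calls (the HTML string here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="yuRUbf"><a href="https://example.com">Example</a></div>'
soup = BeautifulSoup(html, "html.parser")

links = soup.select('.yuRUbf > a')  # CSS selector: <a> directly under class yuRUbf
print(links[0].get('href'))  # -> https://example.com
print(links[0].text)         # -> Example
```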
The code to get the title and description is below.
```python
for i in range(len(url_list)):
    driver.get(url_list[i])
    ## Get the HTML
    html2 = driver.page_source.encode('utf-8')
    ## Parse with BeautifulSoup (not used below, but kept for further scraping)
    soup2 = BeautifulSoup(html2, "html.parser")
    # Get the title
    title_list.append(driver.title)
    # Get the description
    try:
        description = driver.find_element_by_xpath("//meta[@name='description']").get_attribute("content")
        description_list.append(description)
    except:
        description_list.append("")
    # Go back in the browser
    driver.back()
    time.sleep(0.3)
```
Here we use Selenium to visit each of the URLs collected in step 3. The code only combines the BeautifulSoup and Selenium techniques covered so far. `driver.back()` is the command that makes the browser go back one page.
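Incidentally, `soup2` is created but never used above; as a sketch, the same title and description could also be read from the BeautifulSoup tree instead of through Selenium:

```python
# Alternative: read title and description from the parsed HTML
title = soup2.title.string if soup2.title else ""
meta = soup2.find("meta", attrs={"name": "description"})
description = meta.get("content", "") if meta else ""
```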
Steps up to 4 built the url, title, and description lists. Finally, we use pandas to shape the data and write it to an Excel file. The corresponding code is below.
```python
search_ranking = np.arange(1, len(url_list) + 1)
my_list = {"url": url_list, "ranking": search_ranking, "title": title_list, "description": description_list}
my_file = pd.DataFrame(my_list)
driver.quit()
my_file.to_excel(self.search_word + ".xlsx", sheet_name=self.search_word, startcol=2, startrow=1)
df = pd.read_excel(self.search_word + ".xlsx")
```
The pandas operations need little explanation: the lists are combined into a dict, turned into a DataFrame, and written out with `to_excel` (the sheet name is the search word). Finally, the browser is shut down with `driver.quit()`.
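For reference, a minimal sketch of the export call (the file and sheet names are illustrative; writing `.xlsx` requires the openpyxl package):

```python
import pandas as pd

df = pd.DataFrame({"url": ["https://example.com"], "title": ["Example"]})
df.to_excel("example.xlsx", sheet_name="example", index=False)
```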
The full source code is available on GitHub below. Feel free to use it for your writing. https://github.com/marumaru1019/python_scraping/tree/master