When I previously did SEO writing, I manually collected the top-10 URLs and titles for each search keyword. Scraping saved me a great deal of that work at the time, so I will describe how to do it.
For those who want to write their own blog and earn money from it, this shows which titles rank well, and since you can jump to each URL straight from Excel, the writing research work is greatly reduced.
**1. Launch Google**
**2. Enter the search keyword in the search box and press Enter**
**3. Get the URLs from the search results**
**4. Access each URL and get its title and meta description**
**5. Export the collected data as an Excel file**

The full script is below.
```python
import time  # Required to use sleep
from selenium import webdriver  # Automates the web browser (python -m pip install selenium)
from selenium.webdriver.common.keys import Keys
import chromedriver_binary  # Puts the matching ChromeDriver on the PATH
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
# Change here if you like
from selenium.webdriver.chrome.options import Options


class Chrome_search:
    def __init__(self):
        self.url = "https://www.google.co.jp/search"
        self.search_word = input("Please enter a search word:")
        self.search_num = int(input("How many results do you want:"))  # Number of results to fetch
        self.options = Options()
        # self.options.add_argument('--headless')  # Run without opening a browser window
        # self.options.add_argument('--no-sandbox')  # Removes access restrictions; risky, so leave it off unless needed
        self.options.add_argument('--disable-dev-shm-usage')  # Avoid Chrome crashes when /dev/shm is small by letting it use regular memory

    def search(self):
        driver = webdriver.Chrome(options=self.options)  # mac users: comment out this line
        driver.get(self.url)
        search = driver.find_element_by_name('q')  # Locate the search box in the HTML (name='q')
        search.send_keys(self.search_word)  # Send the search word
        search.submit()  # Perform the search
        time.sleep(1)
        # Create storage lists
        title_list = []  # Stores titles
        url_list = []  # Stores URLs
        description_list = []  # Stores meta descriptions
        ## Get the HTML
        html = driver.page_source.encode('utf-8')
        soup = BeautifulSoup(html, "html.parser")
        # Get the search result titles and links
        link_elem01 = soup.select('.yuRUbf > a')
        # Keep only the link and strip the extra part
        if self.search_num <= len(link_elem01):  # If fewer URLs were found than requested, analyze only what we have
            for i in range(self.search_num):
                url_text = link_elem01[i].get('href').replace('/url?q=', '')
                url_list.append(url_text)
        elif self.search_num > len(link_elem01):
            for i in range(len(link_elem01)):
                url_text = link_elem01[i].get('href').replace('/url?q=', '')
                url_list.append(url_text)
        time.sleep(1)
        # At this point the URL list is complete
        # Get the title of each URL in url_list
        for i in range(len(url_list)):
            driver.get(url_list[i])
            ## Get the HTML
            html2 = driver.page_source.encode('utf-8')
            ## Parse with BeautifulSoup (not used below, but kept for further scraping)
            soup2 = BeautifulSoup(html2, "html.parser")
            # Get the title
            title_list.append(driver.title)
            # Get the description
            try:
                description = driver.find_element_by_xpath("//meta[@name='description']").get_attribute("content")
                description_list.append(description)
            except:
                description_list.append("")
            # Go back in the browser
            driver.back()
            time.sleep(0.3)
        # Check the collected data here
        print(url_list)
        print(title_list)
        print(description_list)
        search_ranking = np.arange(1, len(url_list) + 1)
        my_list = {"url": url_list, "ranking": search_ranking, "title": title_list, "description": description_list}
        my_file = pd.DataFrame(my_list)
        driver.quit()
        my_file.to_excel(self.search_word + ".xlsx", sheet_name=self.search_word, startcol=2, startrow=1)
        df = pd.read_excel(self.search_word + ".xlsx")
        return df


if __name__ == '__main__':
    se = Chrome_search()
    df = se.search()
    df.head()
```
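Running the script prompts you for a search word and the number of results, then writes a file named `<search word>.xlsx` to the current directory.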
I will explain the code.
```python
import time  # Required to use sleep
from selenium import webdriver  # Automates the web browser (python -m pip install selenium)
from selenium.webdriver.common.keys import Keys
import chromedriver_binary  # Puts the matching ChromeDriver on the PATH
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
# Change here if you like
from selenium.webdriver.chrome.options import Options
```
The key libraries are as follows.

- time: controls timing (used for sleep)
- selenium: a library that controls the web browser
- beautifulsoup4: a scraping/HTML-parsing library
- chromedriver_binary: lets the Selenium browser be Google Chrome
The following part launches the browser and opens Google.

```python
driver = webdriver.Chrome(options=self.options)  # mac users: comment out this line
driver.get(self.url)
```
`driver = webdriver.Chrome(options=self.options)` is the declaration that launches Selenium. `driver.get("url you want to open")` opens that URL in Chrome. In this script, `self.url` is the Google search page.
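As a minimal standalone sketch (assuming chromedriver_binary is installed so the driver is found on the PATH), launching Chrome headlessly and opening a page looks like this:

```python
import chromedriver_binary  # Adds the matching ChromeDriver to the PATH
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # Run without showing a browser window
driver = webdriver.Chrome(options=options)
driver.get("https://www.google.co.jp/search")
print(driver.title)
driver.quit()
```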
The code to search from the Google search form using Selenium is as follows.

```python
search = driver.find_element_by_name('q')  # Locate the search box in the HTML (name='q')
search.send_keys(self.search_word)  # Send the search word
search.submit()  # Perform the search
time.sleep(1)  # Pause for 1 second
```
`driver.find_element_by_name()` extracts the element that has the given name attribute. After `find_element_by_`, you can also specify `class_name`, `id`, and so on.
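For example, a minimal sketch of the other locator styles (the `id` and class values here are hypothetical, and note that Selenium 4 replaces these methods with `find_element(By..., ...)`):

```python
# Selenium 3 style (used throughout this article, deprecated in Selenium 4)
elem = driver.find_element_by_name('q')               # by name attribute
elem = driver.find_element_by_id('main')              # by id (hypothetical value)
elems = driver.find_elements_by_class_name('result')  # by class (hypothetical value)

# Selenium 4 equivalent
from selenium.webdriver.common.by import By
elem = driver.find_element(By.NAME, 'q')
```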
`.send_keys("word you want to type")` enters that word into the element located by `driver.find_element...`. `submit()` acts as pressing Enter.
`time.sleep(seconds)` stops execution for the specified number of seconds. Using it while the browser loads lets you wait until the page is displayed, so lag errors caused by slow communication are less likely.
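A fixed sleep is the simplest approach. As an alternative sketch, Selenium's explicit waits (bundled with the library) wait only as long as actually needed:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the search box to be present, instead of a blind sleep
search = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.NAME, 'q'))
)
```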
With this, step 2 actually changes the screen and displays the search results.
The code to get the URLs from the search results (step 3) is as follows.
```python
# Create storage lists
title_list = []  # Stores titles
url_list = []  # Stores URLs
description_list = []  # Stores meta descriptions
## Get the HTML
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, "html.parser")
# Get the search result titles and links
link_elem01 = soup.select('.yuRUbf > a')
# Keep only the link and strip the extra part
if self.search_num <= len(link_elem01):  # If fewer URLs were found than requested, analyze only what we have
    for i in range(self.search_num):
        url_text = link_elem01[i].get('href').replace('/url?q=', '')
        url_list.append(url_text)
elif self.search_num > len(link_elem01):
    for i in range(len(link_elem01)):
        url_text = link_elem01[i].get('href').replace('/url?q=', '')
        url_list.append(url_text)
time.sleep(1)
```
The most important parts of this code are:
```python
## Get the HTML
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, "html.parser")
# Get the search result titles and links
link_elem01 = soup.select('.yuRUbf > a')
```
`driver.page_source.encode('utf-8')` forcibly sets the character code to UTF-8. `BeautifulSoup(html, "html.parser")` is a fixed declaration; think of it as a spell that builds the parse tree. `soup.select()` extracts the elements matching a CSS selector. The later `link_elem01[i].get('href')` then reads the `href` attribute from each element that `soup.select` returned.
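As a self-contained sketch of those two calls (the HTML string here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="yuRUbf"><a href="https://example.com">Example</a></div>'
soup = BeautifulSoup(html, "html.parser")

links = soup.select('.yuRUbf > a')  # CSS selector: <a> directly under class yuRUbf
print(links[0].get('href'))  # -> https://example.com
print(links[0].text)         # -> Example
```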
The code to get the title and description is below.
```python
for i in range(len(url_list)):
    driver.get(url_list[i])
    ## Get the HTML
    html2 = driver.page_source.encode('utf-8')
    ## Parse with BeautifulSoup (not used below, but kept for further scraping)
    soup2 = BeautifulSoup(html2, "html.parser")
    # Get the title
    title_list.append(driver.title)
    # Get the description
    try:
        description = driver.find_element_by_xpath("//meta[@name='description']").get_attribute("content")
        description_list.append(description)
    except:
        description_list.append("")
    # Go back in the browser
    driver.back()
    time.sleep(0.3)
```
Here we use Selenium to visit each of the URLs collected in step 3. The code only combines the BeautifulSoup and Selenium techniques covered so far. `driver.back()` is the command that makes the browser go back one page.
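Incidentally, `soup2` is created but never used above; as a sketch, the same title and description could also be read from the BeautifulSoup tree instead of through Selenium:

```python
# Alternative: read title and description from the parsed HTML
title = soup2.title.string if soup2.title else ""
meta = soup2.find("meta", attrs={"name": "description"})
description = meta.get("content", "") if meta else ""
```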
Steps up to 4 built the url, title, and description lists. Finally, we use pandas to shape the data and write it to an Excel file. The corresponding code is below.
```python
search_ranking = np.arange(1, len(url_list) + 1)
my_list = {"url": url_list, "ranking": search_ranking, "title": title_list, "description": description_list}
my_file = pd.DataFrame(my_list)
driver.quit()
my_file.to_excel(self.search_word + ".xlsx", sheet_name=self.search_word, startcol=2, startrow=1)
df = pd.read_excel(self.search_word + ".xlsx")
```
The pandas operations need little explanation: the lists are combined into a dict, turned into a DataFrame, and written out with `to_excel` (the sheet name is the search word). Finally, the browser is shut down with `driver.quit()`.
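For reference, a minimal sketch of the export call (the file and sheet names are illustrative; writing `.xlsx` requires the openpyxl package):

```python
import pandas as pd

df = pd.DataFrame({"url": ["https://example.com"], "title": ["Example"]})
df.to_excel("example.xlsx", sheet_name="example", index=False)
```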
The full source code is available on GitHub below. Feel free to use it for your writing. https://github.com/marumaru1019/python_scraping/tree/master