ElasticSearch + Kibana + Selenium + Python for SEO

Background

My boss's remark that "we will take SEO measures" turned into a request to create a batch job that checks and visualizes the Google search rankings of our company and its competitors. I didn't even know what SEO meant, but my seniors were using Elasticsearch and Kibana, so I decided to give them a try. There is no particular reason for implementing it in Python, other than that I like Python.

By the way, the companies surveyed this time were chosen arbitrarily: drone makers. The reason is simply that I've been interested in drones lately.

Environment

- Python: 3.8.3
- Selenium: 3.141.0
- Elasticsearch: 7.10.1-SNAPSHOT
- Kibana: 7.10.1

For Elasticsearch, the version is the "number" field of the JSON returned by `curl -XGET "http://localhost:9200"`. For Kibana, you can simply run `kibana -V`.
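
For reference, here is a small sketch (my own, not from the original setup) that reads that version field in Python; it assumes a local Elasticsearch listening on the default port:

```python
# Sketch: read the Elasticsearch version from the root endpoint's JSON.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:9200") as resp:
    info = json.load(resp)

print(info["version"]["number"])  # e.g. "7.10.1-SNAPSHOT"
```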

Preparation

rank_observer.json


```json
{
    "mappings": {
        "properties": {
            "keyword": {
                "type": "keyword"
            },
            "ranking": {
                "type": "integer"
            },
            "target_domain": {
                "type": "keyword"
            },
            "get_date": {
                "type": "date",
                "format": "yyyy/MM/dd"
            }
        }
    }
}
```
```console
$ brew install elasticsearch
$ brew install kibana
```

```console
$ pip3 install chromedriver-binary==87.1234.56   # the version here is arbitrary; pick the one that matches your Chrome
$ pip3 install selenium
```

```console
$ elasticsearch
$ curl -H "Content-Type: application/json" -XPUT "http://localhost:9200/sample_index?pretty" -d @rank_observer.json   # an index named sample_index is created
$ kibana
```

Elasticsearch can now be accessed at http://localhost:9200 and Kibana at http://localhost:5601. Create the index in Elasticsearch first.
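
To double-check that the index was created with the intended mapping, a quick look-up like this can help (my own sketch, assuming the elasticsearch Python client used later is already installed):

```python
# Sketch: confirm sample_index exists and inspect its mapping.
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")
print(client.indices.exists(index="sample_index"))       # True if the PUT above succeeded
print(client.indices.get_mapping(index="sample_index"))  # should show the four mapped fields
```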

Implementation

I made three files. The flow of the program is: get the keywords you want to look up from keyword.txt, get the domains of the competitors you want to track from domain.txt, let rank_observer.py do the heavy lifting, and automate it all with cron.

keyword.txt

keyword.txt


```text
programmable drone
drone company
```

→ I chose the keywords arbitrarily, like this. The more keywords there are, the slower the run.

domain.txt

domain.txt


```text
www.dji.com
www.parrot.com
www.yuneec.com
www.kespry.com
www.skydio.com
www.insitu.com
www.delair.aero
www.ehang.com
```

→ As for the domains, I picked several from the top of "10 recommended drone makers"-style sites. The aim is to investigate how these manufacturers' (domains') search rankings change over time.

rank_observer.py

rank_observer.py


```python
import datetime
import os
import time
import traceback
import smtplib
from lxml import html
from selenium import webdriver
import chromedriver_binary
from selenium.webdriver.chrome.options import Options
from elasticsearch import Elasticsearch

#Launch Chrome and search for the given word
def search(driver, word):
    driver.get("https://www.google.com")
    search = driver.find_element_by_name('q')
    search.send_keys(word)
    search.submit()
    return driver.page_source

#Parse the first results page
def analyze(source):
    path_to_link = "//div[@class='yuRUbf']/a/@href"
    root = html.fromstring(source)
    #Returns a list with addresses
    address = root.xpath(path_to_link)
    return address

#Follow the link of the "Next" button at the bottom of the results page and return the source of the destination page
def next_page_source(source, driver):
    path_to_next_page = "//td[@class='d6cvqb']/a[@id='pnnext']/@href"
    root = html.fromstring(source)
    address = root.xpath(path_to_next_page)
    if not address:  #xpath returns an empty list when there is no next page
        return 0
    else:
        driver.get("https://www.google.com/" + str(address[0]))
        return driver.page_source

#Append a search result row to a csv file under ./data/YYYY/MM/
def write(filename, keyword, line, i):
    today = datetime.datetime.today()
    path = "./data/" + str(today.year) + "/" + str(today.month).zfill(2) + "/"
    os.makedirs(path, exist_ok=True)
    with open(os.path.abspath(path + keyword + ".csv"), "a") as f:
        f.write(str(int(i)+1) + "," + line + "\n")

#Get keywords from a text file that contains the keywords you want to look up
def get_keyword():
    with open("./keyword.txt", "r", encoding="utf-8") as f:
        line = f.read()
        keyword = line.splitlines()
    return keyword

#Get the domain from a text file that contains the domain you want to look up
def get_domain():
    with open("./domain.txt", "r", encoding="utf-8") as f:
        line = f.read()
        domain = line.splitlines()
    return domain

#Returns only the addresses on the domain list
def check_domain(address, domain):
    ok = False
    for d in domain:
        if d in address:
            ok = True
    return ok

#Convert the link list into dictionary form to associate each domain with its rank
# args = {
#   "address" :List of links,  list
#   "page_num":Current number of pages,  int
#   "keyword" :keyword,     string
#   "date"    :date,          date
#   "domain"  :Specified domain,list
# }
def sophisticate_data(address, page_num, keyword, date, domain):
    address_list = []
    if len(address) != 0:
        for i, content in enumerate(address):
            address_dict = {}
            #Is the link in the specified domain?
            print(content, domain)
            print(check_domain(content, domain))
            if check_domain(content, domain):
                #Search rank = current page index * 10 + position on the page + 1
                address_dict["keyword"] = keyword
                address_dict["rank"]    = i + page_num*10 + 1
                address_dict["domain"]  = content
                address_dict["date"]    = date
                address_list.append(address_dict)
    return address_list

#Use the above function to get the search ranking of the domain specified for each keyword.
def parse():
    data = []
    #Set keywords and domains
    keyword = get_keyword()
    domain = get_domain()
    date = datetime.datetime.today().strftime("%Y/%m/%d")
    dir_title = datetime.datetime.today().strftime("%Y_%m_%d")
    #How many pages to crawl; there are about 10 results per page
    page_num = 5

    for kw in keyword:
        time.sleep(10)
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-gpu')
        driver = webdriver.Chrome(options=options)

        #The source of the result screen that appears after searching by entering a keyword in the search field
        source = search(driver, kw)
        address = analyze(source)
        results = sophisticate_data(address, 0, kw, date, domain)

        #Loads the specified number of pages
        for i in range(1, page_num):
            next_source = next_page_source(source, driver)
            if next_source == 0:
                break
            results.extend(sophisticate_data(analyze(next_source), i, kw, date, domain))
            source = next_source
            time.sleep(10)

        driver.quit()

        #Save to file
        filename = datetime.datetime.today().strftime("%Y_%m_%d") + "_" + kw

        for item in results:
            write(filename, item["keyword"], str(item["domain"]), item["rank"])
        results = sorted(results, key=lambda x:x["rank"])

        data.extend(results)
    #ElasticSearch from here
    #Write data for each keyword
    client = Elasticsearch("http://localhost:9200")
    for d in data:
        body = {}
        body["keyword"]       = d["keyword"]
        body["ranking"]       = d["rank"]
        body["target_domain"] = d["domain"]
        body['get_date']      = d["date"]
        client.index(index='sample_index', body=body)

#Send email
def send_mail(exception, error=True):
    program = "rank_observer.py"
    FROM = '[email protected]'
    #Set the notification destination email address
    TO = ['[email protected]']
    if error == False:
        SUBJECT = f"{exception}"
        TEXT = f"{exception}"
    else:
        SUBJECT = u'An error has been detected.'
        TEXT = f'The following error was detected in {program} on the monitoring server. Please check the logs.\n\n{exception}'
    message = "Subject: {}\n\n{}".format(SUBJECT, TEXT)
    s = smtplib.SMTP()
    s.connect()
    s.sendmail(FROM, TO, message.encode("utf-8"))
    s.close()

#Run
#Exception handling

try:
    parse()
except Exception as e:
    #Send email only when there is an error
    send_mail(traceback.format_exc())
```
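
As a quick sanity check that documents actually land in Elasticsearch, they can be queried back like this (my own sketch, not part of the script above):

```python
# Sketch: fetch the indexed results for one keyword, best rank first.
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")
res = client.search(index="sample_index", body={
    "query": {"term": {"keyword": "drone company"}},
    "sort": [{"ranking": "asc"}],
})
for hit in res["hits"]["hits"]:
    print(hit["_source"]["ranking"], hit["_source"]["target_domain"])
```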

The code's readability leaves something to be desired, but it passed my seniors' review once I added more comments.

An email is sent when an error occurs. I can't guarantee this works if you copy it as-is; you will probably need to adjust the paths and email addresses.
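
If you want to test the mail part without a real mail server, one option (my suggestion, not from the original setup) is Python's built-in debugging SMTP server, which still exists in Python 3.8 (it was removed in 3.12). Start it in another terminal with `python3 -m smtpd -n -c DebuggingServer localhost:1025`, then:

```python
# Sketch: smoke-test mail sending against a local debugging SMTP server.
import smtplib

s = smtplib.SMTP("localhost", 1025)  # port 1025 matches the debugging server above
s.sendmail("[email protected]", ["[email protected]"],
           "Subject: test\n\nhello".encode("utf-8"))
s.quit()
```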

```
0 1 * * * sudo python3 ~/path/to/rank_observer.py >> ~/path/to/data/log.log 2>&1
```

↑ This is the crontab entry. The program runs at 1:00 a.m. every day. Exception stack traces and other output are appended to a log file with the uninspired name log.log.

Errors

1. `elasticsearch.exceptions.SSLError [SSL: WRONG_VERSION_NUMBER]` — It's clear from the log that this error occurred, but I forgot exactly how I fixed it. I believe it was here:

rank_observer.py


```python
client = Elasticsearch("http://localhost:9200")
```

I feel like I deleted an SSL-related argument here. I believe it was `use_ssl=True`.
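
If that memory is right, the before/after would have looked roughly like this (an unverified sketch from recollection; WRONG_VERSION_NUMBER is what you get when the client speaks SSL to a plain-HTTP server):

```python
from elasticsearch import Elasticsearch

# Before (recollection): SSL enabled against a plain-HTTP Elasticsearch -> SSLError
client = Elasticsearch("http://localhost:9200", use_ssl=True)

# After: plain HTTP on both sides
client = Elasticsearch("http://localhost:9200")
```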

2. `ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed` — I couldn't get chromedriver-binary to install no matter what I tried, but I was able to get past it with this article.

Remarks

Source code

The code is written redundantly, so I should go back and reread "The Art of Readable Code". Next, I'd like to study further, either with another BI tool or by visualizing stock prices myself. How to visualize the data with Kibana is omitted here; I'll just include an image.

(Screenshot: Kibana visualization of the rankings, 2021-01-04)

You can see which manufacturers have recently ranked high or low. This may serve as a stepping stone for SEO measures.
