My boss's remark that "we will take SEO measures" turned into a request to create a batch job that checks and visualizes the Google search rankings of our company and its competitors. I didn't even know the word SEO at first, but my seniors were using Elasticsearch and Kibana, so I decided to try them. There is no particular reason for implementing it in Python, other than that I like Python.
By the way, the companies surveyed this time were chosen fairly arbitrarily: I went with drone makers, simply because I've been interested in drones lately.
- Python: 3.8.3
- Selenium: 3.141.0
- Elasticsearch: 7.10.1-SNAPSHOT
- Kibana: 7.10.1
For Elasticsearch, I took the version from the `number` field of the JSON returned by `curl -XGET "http://localhost:9200"`. For Kibana, I could simply check with `kibana -V`.
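If you want to check the version programmatically instead, a minimal sketch like this works (assuming Elasticsearch is listening on localhost:9200 with security disabled; the root endpoint reports its version in `version.number`):

```python
import json
import urllib.request

# Query the Elasticsearch root endpoint and read the version field
with urllib.request.urlopen("http://localhost:9200") as res:
    info = json.load(res)

print(info["version"]["number"])  # e.g. "7.10.1-SNAPSHOT"
```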
### rank_observer.json

```json
{
"mappings": {
"properties": {
"keyword": {
"type": "keyword"
},
"ranking": {
"type": "integer"
},
"target_domain": {
"type": "keyword"
},
"get_date": {
"type": "date",
"format": "yyyy/MM/dd"
}
}
}
}
```
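For reference, here is a minimal sketch of how this mapping could be registered, assuming the elasticsearch-py 7.x client and the index name `sample_index` (the name rank_observer.py writes to below):

```python
import json
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")

# Load the mapping file shown above
with open("rank_observer.json", "r", encoding="utf-8") as f:
    mapping = json.load(f)

# Create the index with that mapping (raises an error if it already exists)
client.indices.create(index="sample_index", body=mapping)
```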
I made three files. The flow of the program is: get the keywords you want to look up from keyword.txt, get the competitor domains you want to look up from domain.txt, let rank_observer.py grind through them, and automate the whole thing with cron. Something like that; the file layout I'm assuming is sketched below.
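(The `data/YYYY/MM/` hierarchy is what the `write()` function in rank_observer.py creates; `log.log` comes from the cron entry shown later.)

```
.
├── rank_observer.py     # main script
├── rank_observer.json   # the index mapping shown above
├── keyword.txt
├── domain.txt
└── data/
    ├── log.log          # cron output
    └── YYYY/
        └── MM/          # one CSV per keyword is appended here
```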
### keyword.txt

```text
programmable drone
drone company
```

→ I chose the keywords fairly arbitrarily, like this. The more keywords there are, the longer the run takes.
### domain.txt

```text
www.dji.com
www.parrot.com
www.yuneec.com
www.kespry.com
www.skydio.com
www.insitu.com
www.delair.aero
www.ehang.com
```

→ For the domains, I picked several from the top of "10 recommended drone makers"-style articles. The plan is to investigate how these manufacturers' (domains'?) search rankings change over time.
### rank_observer.py

```python
import datetime
import os
import smtplib
import time
import traceback
from lxml import html
from selenium import webdriver
import chromedriver_binary  # noqa: F401 -- puts the bundled chromedriver on PATH
from selenium.webdriver.chrome.options import Options
from elasticsearch import Elasticsearch


# Open Google in Chrome and search for the given word
def search(driver, word):
    driver.get("https://www.google.com")
    search_box = driver.find_element_by_name('q')
    search_box.send_keys(word)
    search_box.submit()
    return driver.page_source


# Parse a results page and return the list of result links
def analyze(source):
    path_to_link = "//div[@class='yuRUbf']/a/@href"
    root = html.fromstring(source)
    # xpath() returns a list of addresses
    return root.xpath(path_to_link)


# Follow the "Next" button at the bottom of the results page and return
# the source of the page it leads to; return 0 when there is no next page
def next_page_source(source, driver):
    path_to_next_page = "//td[@class='d6cvqb']/a[@id='pnnext']/@href"
    root = html.fromstring(source)
    address = root.xpath(path_to_next_page)
    if not address:  # xpath() returns an empty list, never None
        return 0
    driver.get("https://www.google.com" + str(address[0]))
    return driver.page_source


# Append one search result to a CSV file under ./data/YYYY/MM/
def write(keyword, line, rank):
    today = datetime.datetime.today()
    dir_path = "./data/{}/{:02d}/".format(today.year, today.month)
    os.makedirs(dir_path, exist_ok=True)
    with open(os.path.abspath(dir_path + keyword + ".csv"), "a") as f:
        f.write(str(rank) + "," + line + "\n")


# Get the keywords from the text file that lists what you want to look up
def get_keyword():
    with open("./keyword.txt", "r", encoding="utf-8") as f:
        return f.read().splitlines()


# Get the domains from the text file that lists what you want to look up
def get_domain():
    with open("./domain.txt", "r", encoding="utf-8") as f:
        return f.read().splitlines()


# Return True only for addresses that are on the domain list
def check_domain(address, domain):
    for d in domain:
        if d in address:
            return True
    return False


# Convert the raw links into dictionaries that associate domain with rank
# args:
#     address  -- list of links (list)
#     page_num -- current page number, zero-based (int)
#     keyword  -- keyword (str)
#     date     -- date string, yyyy/MM/dd (str)
#     domain   -- list of target domains (list)
def sophisticate_data(address, page_num, keyword, date, domain):
    address_list = []
    for i, content in enumerate(address):
        # Keep the link only if it belongs to one of the target domains
        if check_domain(content, domain):
            # rank = page number * 10 + position on the page + 1
            address_list.append({
                "keyword": keyword,
                "rank": i + page_num * 10 + 1,
                "domain": content,
                "date": date,
            })
    return address_list


# Use the functions above to get the ranking of the target domains per keyword
def parse():
    data = []
    # Set keywords and domains
    keyword = get_keyword()
    domain = get_domain()
    date = datetime.datetime.today().strftime("%Y/%m/%d")
    # How many result pages to walk through; there are about 10 items per page
    page_num = 5
    for kw in keyword:
        time.sleep(10)
        options = Options()
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-gpu')
        driver = webdriver.Chrome(options=options)
        # Source of the results page shown after searching for the keyword
        source = search(driver, kw)
        address = analyze(source)
        results = sophisticate_data(address, 0, kw, date, domain)
        # Load the remaining pages
        for i in range(1, page_num):
            next_source = next_page_source(source, driver)
            if next_source == 0:  # no "Next" button, so stop here
                break
            results.extend(sophisticate_data(analyze(next_source), i, kw, date, domain))
            source = next_source
            time.sleep(10)
        driver.quit()
        # Save to file
        for item in results:
            write(item["keyword"], str(item["domain"]), item["rank"])
        results = sorted(results, key=lambda x: x["rank"])
        data.extend(results)
    # From here on, Elasticsearch: index one document per result
    client = Elasticsearch("http://localhost:9200")
    for d in data:
        body = {
            "keyword": d["keyword"],
            "ranking": d["rank"],
            "target_domain": d["domain"],
            "get_date": d["date"],
        }
        client.index(index='sample_index', body=body)


# Send a notification email
def send_mail(exception, error=True):
    program = "rank_observer.py"
    FROM = '[email protected]'
    # Set the notification destination email address
    TO = ['[email protected]']
    if not error:
        SUBJECT = f"{exception}"
        TEXT = f"{exception}"
    else:
        SUBJECT = 'An error has been detected.'
        TEXT = (f'The following error was detected in {program} on the '
                f'monitoring server. Please check the logs.\n\n{exception}')
    message = "Subject: {}\n\n{}".format(SUBJECT, TEXT)
    s = smtplib.SMTP()
    s.connect()  # assumes an SMTP server listening on localhost:25
    s.sendmail(FROM, TO, message.encode("utf-8"))
    s.close()


# Run, with exception handling
try:
    parse()
except Exception:
    # Send an email only when there is an error
    send_mail(traceback.format_exc())
```
The readability of this code is questionable, but it got through my seniors' review once I added plenty of comments.
It sends an email when an error occurs. Honestly, I don't know whether it will run if you just copy it as is; it should be fine once you adjust the mail server settings and email addresses (see the sketch below).
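For reference, here is a rough sketch of sending through an authenticated SMTP server instead of a local one; the host, port, and credentials are placeholders, not anything from my actual setup:

```python
import smtplib

def send_mail_with_auth(subject, text,
                        host="smtp.example.com", port=587,
                        user="[email protected]", password="app-password"):
    # Minimal message: a Subject header, a blank line, then the body
    message = "Subject: {}\n\n{}".format(subject, text)
    with smtplib.SMTP(host, port) as s:
        s.starttls()             # upgrade the connection to TLS
        s.login(user, password)  # authenticate before sending
        s.sendmail(user, ["[email protected]"], message.encode("utf-8"))
```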
```
0 1 * * * sudo python3 ~/path/to/rank_observer.py >> ~/path/to/data/log.log 2>&1
```
↑ This is the contents of crontab, so the program runs at 1:00 a.m. every day. Exception stack traces and the like are appended as they occur to a log file with the silly name log.log.
1. elasticsearch.exceptions.SSLError [SSL: WRONG_VERSION_NUMBER]

The log makes it clear that this error occurred, but I've forgotten exactly how I fixed it. If I remember correctly, in
rank_observer.py:

```python
client = Elasticsearch("http://localhost:9200")
```
I deleted an SSL-related argument here. I believe it was `use_ssl=True`.
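Reconstructing it from memory (so take this as a guess; `use_ssl` was a connection option in the elasticsearch-py 7.x client), the change was roughly:

```python
from elasticsearch import Elasticsearch

# Before (my guess): use_ssl=True makes the client speak TLS to a plain-HTTP
# port, which is exactly what [SSL: WRONG_VERSION_NUMBER] complains about
# client = Elasticsearch("http://localhost:9200", use_ssl=True)

# After: plain HTTP, matching how Elasticsearch was actually running
client = Elasticsearch("http://localhost:9200")
```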
2. ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed
I could not get chromedriver-binary to install no matter how long I tried, but I managed to get through it with this article.
The code is probably redundant in places, so I should go back and reread "Readable Code" carefully. Next, I'd like to try another BI tool, or study stock prices on my own. I've omitted how to visualize things with Kibana and will just include a screenshot. You can see which manufacturers have been ranked high or low recently. This may become a stepping stone for SEO measures.