Re: Life in Heroku starting from scratch with Flask ~ Selenium & PhantomJS & Beautifulsoup ~

Introduction

This was my first time building an API that returns scraped information as JSON using Python's Flask and Heroku, so I would like to summarize how I did it.

The first part, Re: Heroku life begin with Flask from zero - Environment and Hello world -, covers setting up the Python environment and getting Hello World running on Heroku. The second part, Re: Life in Heroku starting from scratch with Flask ~ PhantomJS to Heroku ~, covers deploying the program created this time to Heroku. Please see those as well.

What to do this time

This is for study, so reinventing the wheel is fine.

This time, I will explain how to scrape with Selenium and PhantomJS, using SlideShare as the example.

Because a single article became too long, the flow of deploying to Heroku is split off into the second part.

Specific movement

When you access [HerokuURL]/api/[Search word]/[Number of pages] (example: ~herokuapp.com/api/python/2), the following happens:

  1. PhantomJS works on Heroku and opens Slideshare search page

  2. Enter the [Search word] of the URL in the search field of the search page to search.

  3. Change the language setting of search results to Japanese

  4. Extract the slide information from the web page by scraping.

  5. Click Next on the pager at the bottom and repeat the scraping for the number of pages given in the URL.

  6. After scraping, put the results into JSON format and return them!

I would like to create an API that does that.

What I used this time

Preparation

Environment construction to be done before this time

By the time you finish Hello World in Re: Heroku life begin with Flask from zero - Environment and Hello world -, Flask and Gunicorn should already be installed; please include them. Set up whichever environment you prefer, such as pyenv-virtualenv.

Building an additional environment to do this time

PhantomJS

Install PhantomJS locally so that you can check the behavior on your own machine before running it on Heroku. It is fine to think of it as a browser without a GUI that can be operated from code. Reference: Try various things with PhantomJS

$  brew install phantomjs

Selenium

Selenium is a cross-browser, cross-platform UI testing tool. With ordinary scraping you can only work with what is displayed at the specified URL, but with Selenium you can press a button to go to the next page, or type text and press the search button. Impressive.

It uses Ruby, but this article gives a good idea of what it looks like: Web UI Test Automation-Try Selenium

$ pip install selenium

beautifulsoup

Used to process the acquired web page data. Reference: Scraping with Python and Beautiful Soup

$ pip install beautifulsoup4

lxml

Used in combination with Beautiful Soup as its parser.

$ pip install lxml

flask-cors, which gets around cross-domain restrictions

Even if you create an API that returns JSON, cross-domain restrictions make it troublesome to use from Chrome and other browsers unless the API handles them, so I will deal with that up front. A link is included in the code commentary.

$ pip install -U flask-cors

What was made

https://github.com/ymgn/SlideShare-API

api.py


# -*- coding: utf-8 -*-

import json
# Needed for scraping
from bs4 import BeautifulSoup
# Needed to operate the browser with Selenium
from selenium import webdriver 
from selenium.webdriver.common.keys import Keys # Used when sending key input

# Needed for Flask
import os
from flask import Flask

# CORS settings for Flask, to get around cross-domain restrictions when the API is called via Ajax
from flask_cors import CORS, cross_origin
app = Flask(__name__)
CORS(app)

@app.route('/')
def index():
    return "How to use: /api/Word to search/Number of pages acquired"

@app.route('/api/<string:word>/<int:page>') # Receive the search word and the number of pages from the path as variables
def slide(word,page):
 
    driver = webdriver.PhantomJS() # Use PhantomJS
    driver.set_window_size(1124, 850) # Specify the window size of PhantomJS
    driver.implicitly_wait(20) # If the specified element does not exist yet, the driver waits up to 20 seconds for it to appear

    URL = "http://www.slideshare.net/search/"
    driver.get(URL) # Open the SlideShare search page
    data_list = [] # List that collects the data from all pages

    search = driver.find_element_by_id("nav-search-query") # Get the search field element
    search.send_keys(word) # Enter the search word
    search.submit() # Submit the search

    lang = driver.find_element_by_xpath("//select[@id='slideshows_lang']/option[@value='ja']") # Pick the Japanese option from the language selection list
    lang.click() # Select Japanese as the language

    for i in range(0,page): 
        print(u"Page " + str(i+1))
        data = driver.page_source.encode('utf-8') # Encode the page source as UTF-8
        soup = BeautifulSoup(data,"lxml") # Parse with lxml so it is easy to process
        slide_list = soup.find_all("div",class_="thumbnail-content") # Extract one element per slide
        for slide in slide_list:
            slide_in = {} #Organize slide information in dictionary format
            
            # Get the name of the slide's author
            name = slide.find("div",class_="author").text
            slide_in["name"] = name.strip() # strip() removes whitespace and line breaks at both ends
            
            # Get the title of the slide
            title = slide.find("a",class_="title title-link antialiased j-slideshow-title").get("title") # Read the title attribute of the tag with the specified class
            slide_in["title"] = title

            # Get the link to the slide
            link = slide.find("a",class_="title title-link antialiased j-slideshow-title").get("href") # Read the href attribute of the tag with the specified class
            slide_in["link"] = "http://www.slideshare.net" + link
            
            # Get the link to the slide's thumbnail
            imagetag = slide.find("a",class_="link-bg-img").get("style") # Read the style attribute of the tag with the specified class
            image = imagetag[imagetag.find("url(")+4:imagetag.find(");")] # Keep only the URL inside url(...)
            slide_in["image"] = image
            
            # Get the number of slides (pages) and the number of likes
            info = slide.find("div",class_="small-info").string # Get the string that contains slides and likes
            slides = info[7:info.find("slides")] # Extract the slides part
            slide_in["slides"] = slides.strip() # strip() removes whitespace and line breaks at both ends
            if "likes" in info:
                likes = info[info.find(", ")+2:info.find("likes")] # Extract the likes part
            else:
                likes = "0"
            slide_in["likes"] = likes.strip() # strip() removes whitespace and line breaks at both ends

            data_list.append(slide_in) # Add this slide's data to data_list

        driver.execute_script('window.scrollTo(0, 3000)') # Scroll down to where the pager is
        next = driver.find_element_by_xpath("//li[@class='arrow']/a[@rel='next']") # Get the Next element of the pager
        next.click() # Click the Next button

    driver.quit() # End the browser session and shut down PhantomJS
    jsonstring = json.dumps(data_list,ensure_ascii=False,indent=2) # Output the collected list in JSON format
    return jsonstring
 
# Start the app only when this file is run directly, not when it is imported
if __name__ == '__main__':
    app.run()

Code commentary

I also wrote comments in the code, but I would like to explain the important parts from the top. The imports are as written, so I will skip them.

CORS

I think you have had the experience of a program that uses an API not working on Chrome and other browsers! Since we are going to the trouble of creating an API, let's deal with it.

from flask_cors import CORS, cross_origin

app = Flask(__name__)
CORS(app)

Just writing this seems to take care of it for you. Handy. Reference: https://flask-cors.readthedocs.io/en/latest/

Take an argument from the Flask path

If you write a variable in Flask's route and put the same variable name in the () of def, you can receive that part of the path as an argument.

@app.route('/api/<string:word>/<int:page>') # Receive the search word and the number of pages from the path as variables
def slide(word,page):

Reference: Let's master Flask
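
As a small sketch of how the converters behave, assuming Flask is installed: the route below mirrors the path used in this article but only echoes its arguments instead of scraping. The function name `echo` and the returned fields are made up for illustration. `<int:page>` hands the function an `int`, and a path whose last segment is not an integer never reaches the function and gets a 404 instead.

```python
import json

from flask import Flask, Response

app = Flask(__name__)

@app.route('/api/<string:word>/<int:page>')
def echo(word, page):
    # word arrives as str, page as int thanks to the <int:...> converter
    body = json.dumps({"word": word, "page": page}, ensure_ascii=False)
    # Declaring the mimetype lets clients treat the response as JSON
    return Response(body, mimetype="application/json")

# Flask's built-in test client calls the route without starting a server
client = app.test_client()
print(client.get('/api/python/2').get_data(as_text=True))
print(client.get('/api/python/abc').status_code)
```

Returning a `Response` with an explicit mimetype is a design choice of this sketch; the article's code returns the JSON string directly.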

Determine the browser size of PhantomJS

driver.set_window_size(1124, 850)

If you don't set a browser size, you won't be able to pick or scroll to elements properly. I don't know the reasoning behind these particular numbers; I used them as they were written in the material I looked up.

Waiting for element to be read

driver.implicitly_wait(20)

With this setting, when you use `driver.find ~~` with an ID or class to get and operate on an element, the driver waits up to 20 seconds for the element and continues immediately once it is available. This is very convenient when working with Selenium because you do not have to hard-code an estimated wait such as `time.sleep(3)`. Reference: see the Implicit Waits section of this site
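
To make the behavior concrete, here is a pure-Python sketch of the idea behind such a wait: poll until something turns up or a deadline passes. This only illustrates the concept and is not Selenium's actual implementation; `wait_for`, `find`, and the sample lookup function are hypothetical names.

```python
import time

def wait_for(find, timeout=20.0, interval=0.5):
    """Call find() repeatedly until it returns something other than None.

    Returns as soon as a result is available; raises TimeoutError once
    `timeout` seconds have passed without one.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = find()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("nothing found within %s seconds" % timeout)
        time.sleep(interval)

# Hypothetical lookup that only succeeds on the third attempt
attempts = {"count": 0}

def find_after_three_tries():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None

found = wait_for(find_after_three_tries, timeout=2.0, interval=0.01)
print(found, attempts["count"])
```

Compared with a fixed `time.sleep(3)`, this returns as soon as the lookup succeeds and only spends the full timeout when the element never appears.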

Enter the text in the form box and submit

search = driver.find_element_by_id("nav-search-query") # Get the search field element
search.send_keys(word) # Enter the search word
search.submit() # Submit the search

After getting an input element or similar from the browser by id, you can enter a value with `send_keys("hoge")` and so on. If the element is inside a form, you can submit the form by calling `.submit()` on it.

Select a dropdown

lang = driver.find_element_by_xpath("//select[@id='slideshows_lang']/option[@value='ja']") # Pick the Japanese option from the language selection list
lang.click() # Select Japanese as the language

This time, the element is specified by XPath instead of id or class. The reason is that when several elements share the same id, or when you want a child element and only its parent has an id, you have to locate the element by something other than id and class alone.

By the way, if you don't switch to Japanese like this, you may get Japanese slides locally, but on Heroku you will get slides in all languages.

When there are multiple ids

lang = driver.find_elements_by_id("slideshows_lang")
lang[1].find_elements_by_tag_name("option")
# To get multiple matches, use find_elements (plural) instead of find_element

If only the parent has an id

lang = driver.find_element_by_id("slideshows_lang")
lang.find_element_by_tag_name("option")

Please refer to the reference for how to write XPATH and other extraction methods. Reference: Locating Elements

Get ready for scraping

data = driver.page_source.encode('utf-8') # Encode the page source as UTF-8
soup = BeautifulSoup(data,"lxml") # Parse with lxml so it is easy to process

The page data obtained by the webdriver is encoded as UTF-8 and then parsed with lxml, which works well with BeautifulSoup, to make it easy to scrape. This is inside the for loop because the page has to be re-read every time it changes.
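
Once the soup is built, the values pulled out of each slide are plain strings, and the main code slices them by fixed positions. As a hedged alternative, regular expressions can do the same job without depending on exact offsets. The helper names and the sample strings below are assumptions for illustration; the real SlideShare markup may differ.

```python
import re

def thumbnail_url(style):
    """Return the URL inside url(...) in an inline style string, or None."""
    match = re.search(r"url\((.*?)\)", style)
    if match is None:
        return None
    # Drop optional quotes around the URL
    return match.group(1).strip("'\"")

def slides_and_likes(info):
    """Return (slides, likes) as strings; a missing count becomes "0"."""
    slides = re.search(r"(\d[\d,]*)\s+slides?", info)
    likes = re.search(r"(\d[\d,]*)\s+likes?", info)
    return (slides.group(1) if slides else "0",
            likes.group(1) if likes else "0")

# Hypothetical sample values shaped like the ones the article's code handles
style = "background-image: url(http://example.com/thumb.jpg);"
print(thumbnail_url(style))
print(slides_and_likes("\n  42 slides, 3 likes\n"))
print(slides_and_likes("\n  10 slides\n"))
```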

Scroll

driver.execute_script('window.scrollTo(0, 3000)') # Scroll down to where the pager is

This scrolls PhantomJS to a position 3000 pixels from the top using JavaScript. You might think scrolling is pointless because PhantomJS has no GUI, but without scrolling you get an error. I wanted to go to the bottom of the page, so I set it to 3000 for the time being.

Handling an a tag that has neither id nor class, under parents that share a class name

next = driver.find_element_by_xpath("//li[@class='arrow']/a[@rel='next']") # Get the Next element of the pager
next.click() # Click the Next button

When I tried to press Next on the Slideshare pager, both Previous and Next had `class="arrow"`, and the a tag inside had neither id nor class. Since the child a tag has `rel="next"`, I used an XPath that specifies both the parent and rel.
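
One possible refinement of a paging loop like this is to click Next only between pages, so that no click is attempted after the last requested page has been scraped. The sketch below shows just that control flow; `scrape_pages`, `parse_page`, `click_next`, and the `FakeDriver` stand-in are all hypothetical names, and in real use `driver` would be the Selenium driver.

```python
def scrape_pages(driver, pages, parse_page, click_next):
    """Collect parse_page() results from `pages` consecutive pages."""
    results = []
    for i in range(pages):
        results.extend(parse_page(driver.page_source))
        if i < pages - 1:
            # Only move on when another page is still wanted
            click_next(driver)
    return results


class FakeDriver:
    """Stand-in for a browser driver so the loop can run without Selenium."""
    def __init__(self):
        self.page_number = 1
        self.clicks = 0

    @property
    def page_source(self):
        return "page %d" % self.page_number


def fake_click_next(driver):
    # Pretend that clicking Next loads the following page
    driver.clicks += 1
    driver.page_number += 1


driver = FakeDriver()
collected = scrape_pages(driver, 3, lambda source: [source], fake_click_next)
print(collected, driver.clicks)
```

With three pages requested, the stand-in driver is clicked twice, once between each pair of pages.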

Convert the created array to json format

jsonstring = json.dumps(data_list,ensure_ascii=False,indent=2) # Output the created list in JSON format
return jsonstring
# json.dumps(list or dictionary data, ensure_ascii=False if Japanese is included, indent=number of spaces for pretty-printing)

If you pass a list or a dictionary, it is converted to JSON format. indent is optional; with indent=2 the output is indented with two half-width spaces to make it easier to read. Reference: [Python] Handle JSON
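
A quick way to see what `ensure_ascii=False` changes, assuming Python 3; the sample record is made up for illustration:

```python
import json

# Hypothetical record containing Japanese text
data_list = [{"title": "Python入門", "likes": "3"}]

# Default: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(data_list)

# ensure_ascii=False keeps the characters readable; indent=2 pretty-prints
readable = json.dumps(data_list, ensure_ascii=False, indent=2)

print(escaped)
print(readable)
```

Both strings decode back to the same data, so the choice only affects how readable the raw output is.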

Flow from confirming the movement to deploying to Heroku

Launch Flask locally and check

Suppose you have everything you need installed.

Preparing folders and Flask

$ mkdir slide
$ cd slide

$ touch api.py Procfile
# Create the Flask script and the Procfile that holds the settings

Files needed to launch Flask

Procfile

web: gunicorn api:app --log-file=-

api.py


See above

First, let's check the movement with Flask

$ python api.py
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

Access http://127.0.0.1:5000/api/python/2 to get two pages of SlideShare search results for python. In Safari the JSON just looks like a list, but if you install JSONView or similar in Chrome, you can view it neatly.

reference

Various libraries that can be used for crawling and scraping in Python

Deploy to Heroku

The flow of deploying to Heroku is written in the second part, Re: Life in Heroku starting from scratch with Flask ~ PhantomJS to Heroku ~, so please continue there.

Afterword

For the time being, I think I was able to cover basic browser operations with Selenium (text input, submit, drop-down list selection, element click, XPath specification) and scraping (text, images, URLs, string processing, and conversion to JSON), so I hope it helps someone.

I would appreciate it if you could point out any improvements or mistakes in the comments section. Twitter:@ymgn_ll
