I recently built my first API that returns scraped data as JSON, using Python with Flask and Heroku, so I'd like to summarize how I did it.
How to set up the environment and get to Hello World on Heroku is covered in part 1, Re: Heroku life - Environment start with Flask from zero and the Hello World ~, and deploying the program built here to Heroku is covered in part 2, Re: Life in Heroku starting from scratch with Flask ~ PhantomJS to Heroku ~, so please see those too.
This time, I will show how to scrape with Selenium and PhantomJS, using SlideShare as the example.
***Since the article got long when put together as one, the flow of deploying to Heroku is split off into part 2.***
When you access `[Heroku URL]/api/[search word]/[number of pages]` (for example, `~herokuapp.com/api/python/2`), the API:

1. Runs PhantomJS on Heroku and opens the SlideShare search page
2. Enters the `[search word]` from the URL into the search field and searches
3. Switches the language setting of the search results to Japanese
4. Scrapes the slide information out of the web page
5. Clicks Next in the pager and repeats the scraping for the `[number of pages]` in the URL
6. When the scraping is done, packs everything into JSON and returns it!

That is the API I want to create.
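To give a concrete picture, one element of the JSON the API returns would look roughly like this (the field names match the code later in this article; the values are made up for illustration):

```json
[
  {
    "name": "Taro Yamada",
    "title": "Introduction to Python",
    "link": "http://www.slideshare.net/example-slide",
    "image": "https://cdn.slidesharecdn.com/example-thumbnail.jpg",
    "slides": "24",
    "likes": "3"
  }
]
```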
By the time you finish Hello World in part 1, Re: Heroku life - Environment start with Flask from zero and the Hello World ~, you should have ***Flask*** and ***Gunicorn*** installed. Please also prepare your favorite environment, such as pyenv-virtualenv.
PhantomJS
Install PhantomJS locally so you can check that things work before running them on Heroku. It's fine to think of it as a browser without a GUI that you can drive from code. Reference: Try various things with PhantomJS
$ brew install phantomjs
Selenium
Selenium is a cross-browser, cross-platform UI testing tool. With ordinary scraping you can only work with what is displayed at the URL you fetch, but with Selenium you can press a button to move to the next page, or type text and press the search button.
The examples are in Ruby, but this article gives a good feel for what it can do: Web UI Test Automation - Try Selenium
$ pip install selenium
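As a minimal sketch of what "driving a browser from code" means (assuming PhantomJS is installed and on your PATH):

```python
# A minimal Selenium + PhantomJS sketch: open a page without any GUI
from selenium import webdriver

driver = webdriver.PhantomJS()                    # headless browser
driver.get("http://www.slideshare.net/search/")   # navigate to a URL
print(driver.title)                               # proves the page really loaded
driver.close()                                    # end the session
```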
BeautifulSoup
Used to process the web page data once you have acquired it. Reference: Scraping with Python and Beautiful Soup
$ pip install beautifulsoup4
lxml
Used as the parser together with BeautifulSoup.
$ pip install lxml
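A minimal sketch of the BeautifulSoup + lxml combination (the HTML snippet is made up to mirror the SlideShare markup used later):

```python
# Parse an HTML fragment with BeautifulSoup using the lxml parser
from bs4 import BeautifulSoup

html = '<div class="author">  Taro Yamada\n</div>'  # made-up sample markup
soup = BeautifulSoup(html, "lxml")
name = soup.find("div", class_="author").text
print(name.strip())  # -> "Taro Yamada"
```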
flask-cors
Even if you build an API that returns JSON correctly, it is a pain to use from Chrome and the like if cross-domain restrictions haven't been dealt with, so let's take countermeasures now. There is a link in the code walkthrough below.
$ pip install -U flask-cors
https://github.com/ymgn/SlideShare-API
api.py
# -*- coding: utf-8 -*-
import json

# Needed for scraping
from bs4 import BeautifulSoup

# Needed to drive the browser with Selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # Used when typing text

# Needed for Flask
import os
from flask import Flask

# CORS settings, for the cross-domain restrictions you hit when calling the Flask app via Ajax
from flask_cors import CORS, cross_origin

app = Flask(__name__)
CORS(app)

@app.route('/')
def index():
    return "How to use: /api/<word to search>/<number of pages to get>"

@app.route('/api/<string:word>/<int:page>')  # Receive the search word and page count from the path as variables
def slide(word, page):
    driver = webdriver.PhantomJS()  # Use PhantomJS
    driver.set_window_size(1124, 850)  # Set the PhantomJS window size
    driver.implicitly_wait(20)  # If an element is not there yet, automatically wait up to 20 seconds for it to appear

    URL = "http://www.slideshare.net/search/"
    driver.get(URL)  # Open the SlideShare search page
    data_list = []  # Accumulates the data for all pages

    search = driver.find_element_by_id("nav-search-query")  # Get the search field element
    search.send_keys(word)  # Type in the search word
    search.submit()  # Submit the search

    lang = driver.find_element_by_xpath("//select[@id='slideshows_lang']/option[@value='ja']")  # The Japanese option in the language selection list
    lang.click()  # Select Japanese as the result language

    for i in range(0, page):
        print("Page " + str(i + 1))
        data = driver.page_source.encode('utf-8')  # Get the page source as UTF-8
        soup = BeautifulSoup(data, "lxml")  # Parse with lxml so it is easy to process
        slide_list = soup.find_all("div", class_="thumbnail-content")  # One element per slide

        for slide in slide_list:
            slide_in = {}  # Collect one slide's information as a dictionary

            # Get the name of the person who posted the slide
            name = slide.find("div", class_="author").text
            slide_in["name"] = name.strip()  # strip() removes whitespace and line breaks at both ends

            # Get the slide title
            title = slide.find("a", class_="title title-link antialiased j-slideshow-title").get("title")  # The title attribute of the tag with this class
            slide_in["title"] = title

            # Get the slide link
            link = slide.find("a", class_="title title-link antialiased j-slideshow-title").get("href")  # The href of the tag with this class
            slide_in["link"] = "http://www.slideshare.net" + link

            # Get the slide thumbnail link
            imagetag = slide.find("a", class_="link-bg-img").get("style")  # The style attribute of the tag with this class
            image = imagetag[imagetag.find("url(") + 4:imagetag.find(");")]  # Cut away the unneeded parts
            slide_in["image"] = image

            # Get the page count ("slides") and the "likes" of the slide
            info = slide.find("div", class_="small-info").string  # The string that contains slides and likes
            slides = info[7:info.find("slides")]  # Extract the slides part
            slide_in["slides"] = slides.strip()  # strip() removes whitespace and line breaks at both ends
            if "likes" in info:
                likes = info[info.find(", ") + 2:info.find("likes")]  # Extract the likes part
            else:
                likes = "0"
            slide_in["likes"] = likes.strip()  # strip() removes whitespace and line breaks at both ends

            data_list.append(slide_in)  # Add this slide's data to data_list

        driver.execute_script('window.scrollTo(0, 3000)')  # Scroll down to the pager
        next = driver.find_element_by_xpath("//li[@class='arrow']/a[@rel='next']")  # Get the Next element of the pager
        next.click()  # Click the Next button

    driver.close()  # End the browser session
    jsonstring = json.dumps(data_list, ensure_ascii=False, indent=2)  # Serialize the collected list as JSON
    return jsonstring

# Run the dev server when executed directly, not when imported
if __name__ == '__main__':
    app.run()
I also put comments in the code, but let me explain the important parts from the top. The imports are as written, so I will skip them.
CORS
You have probably had the experience of a program that calls an API simply not working in Chrome and the like. Since we went to the trouble of building an API, let's take countermeasures.
from flask_cors import CORS, cross_origin
app = Flask(__name__)
CORS(app)
Apparently writing just this is enough to take care of it. Convenient. Reference: https://flask-cors.readthedocs.io/en/latest/
URL routing: by writing
@app.route('/api/<string:word>/<int:page>') #Search word/Receive the number of pages from the path into a variable
def slide(word,page):
you receive the search word and the number of pages from the URL path as the function's arguments. Reference: Let's master Flask
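As a side note, here is a minimal, self-contained sketch of how these path converters behave (separate from this article's app; the route mirrors the one above):

```python
from flask import Flask

app = Flask(__name__)

@app.route('/api/<string:word>/<int:page>')
def slide(word, page):
    # GET /api/python/2 calls slide(word="python", page=2);
    # the <int:...> converter delivers page as an int, not a string.
    return "word=%s page=%d" % (word, page)

if __name__ == '__main__':
    app.run()
```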
driver.set_window_size(1124, 850)
If you don't set a browser size, you can't pick up or scroll to elements properly. I took the size values as-is from the article I referred to, so I don't know the reason for those particular numbers.
driver.implicitly_wait(20)
By writing this, whenever you locate and operate on an element by ID or class with `driver.find_~~`, the driver waits up to 20 seconds for the element to appear and proceeds as soon as loading finishes. This is very convenient when working with Selenium, because you don't have to hard-code explicit waits like `time.sleep(3)`.
Reference: see the Implicit Waits section of this site
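If you ever want to wait for one specific element instead of relying on the implicit wait, Selenium also offers explicit waits. A sketch, assuming `driver` is the PhantomJS webdriver from the code above:

```python
# Explicit wait: block up to 20 seconds until this particular element exists
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

element = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.ID, "nav-search-query"))
)
```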
search = driver.find_element_by_id("nav-search-query") #Get search field element
search.send_keys(word) #Enter a search word
search.submit() #Submit search
After getting an input element from the browser by ID, you can type a value into it with `send_keys("hoge")` and the like. If the element is inside a form, you can submit the form by calling `.submit()` on it.
lang = driver.find_element_by_xpath("//select[@id='slideshows_lang']/option[@value='ja']") #Extract the Japanese part of the language selection list
lang.click() #Select Japanese as the language selection
This time the element is located by XPath instead of by ID or class. The reason is that when the same ID appears multiple times, or when only the parent has an ID, you need to pin down the element by something other than ID and class in order to select a child element.
Incidentally, if you don't switch the language to Japanese like this, Heroku will fetch slides in every language even when you get Japanese ones locally.
When the same ID appears multiple times:
lang = driver.find_elements_by_id("slideshows_lang")
lang[1].find_elements_by_tag_name("option")
# When extracting several matches, find_element becomes find_elements
When only the parent has an ID:
lang = driver.find_element_by_id("slideshows_lang")
lang.find_element_by_tag_name("option")
For how to write XPath and other ways of locating elements, see the reference. Reference: Locating Elements
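For reference, the same `<option>` element can be reached with other locating strategies as well; a sketch using the SlideShare selectors from the code above:

```python
# Two ways to locate the same element: XPath and a CSS selector
lang = driver.find_element_by_xpath("//select[@id='slideshows_lang']/option[@value='ja']")
lang = driver.find_element_by_css_selector("select#slideshows_lang option[value='ja']")
```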
data = driver.page_source.encode('utf-8') #UTF the information in the page-Prepare in 8
soup = BeautifulSoup(data,"lxml") #Make it in lxml format for easy processing
The page data obtained by the webdriver is encoded as UTF-8 and then parsed with lxml, which works well with BeautifulSoup, so it becomes easy to scrape. This is inside the for loop because the source has to be re-read every time the page changes.
driver.execute_script('window.scrollTo(0, 3000)') #Move down with pager
This makes PhantomJS scroll down 3000 pixels with JavaScript. You might think scrolling is meaningless since PhantomJS has no GUI, but if you don't scroll, you get an error. I wanted to reach the bottom of the page, so I set it to 3000 for the time being.
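If you want the very bottom of the page rather than a fixed 3000 pixels, scrolling to the document height should also work; a sketch I have not verified on SlideShare:

```python
# Scroll to the full height of the document instead of a hard-coded offset
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
```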
next = driver.find_element_by_xpath("//li[@class='arrow']/a[@rel='next']") #Extract the NEXT element of the pager
next.click() #Click the Next button
When I tried to press Next in the SlideShare pager, both Previous and Next had `class="arrow"`, and the `a` tag inside had neither an ID nor a class. Since the child `a` tag does have `rel="next"`, I used an XPath that pins the element down by the parent's class together with that `rel` attribute.
jsonstring = json.dumps(data_list, ensure_ascii=False, indent=2)  # Serialize the collected list as JSON
return jsonstring
`json.dumps(<list or dictionary>, ensure_ascii=False, indent=2)`
Passing a list or a dictionary serializes it as JSON. Set `ensure_ascii=False` when the data contains Japanese. `indent` is optional; with `indent=2` the output is indented with two half-width spaces, which makes it much easier to read.
Reference: [Python] Handle JSON
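A quick illustration of the options (the data here is made up):

```python
import json

data = [{"title": u"Python入門", "likes": "12"}]
print(json.dumps(data, ensure_ascii=False, indent=2))
# [
#   {
#     "title": "Python入門",
#     "likes": "12"
#   }
# ]
```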
Assuming everything you need is installed:
$ mkdir slide
$ cd slide
$ touch api.py Procfile
# Create the Flask app file and the settings file
Procfile
web: gunicorn api:app --log-file=-
api.py
See above
$ python api.py
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
http://127.0.0.1:5000/api/python/2
Accessing this path returns two pages of SlideShare results for a search on "python". Result: in Safari the JSON just looks like a flat list, but if you install JSONView or similar in Chrome, it is displayed nicely.
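If you prefer checking from Python rather than a browser, here is a small sketch using the requests library (assuming it is installed and the server above is running):

```python
# Call the local API and print each slide's title and link
import requests

res = requests.get("http://127.0.0.1:5000/api/python/2")
for slide in res.json():
    print(slide["title"], slide["link"])
```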
Reference: Various libraries that can be used for crawling and scraping in Python
The flow of deploying to Heroku is written in part 2, Re: Life in Heroku starting from scratch with Flask ~ PhantomJS to Heroku ~, so please read that as well.
I believe this covered the basics of driving a browser with Selenium (typing text, submitting, selecting from a drop-down list, clicking elements, locating by XPath) and of scraping (text, images, URLs, string processing, and conversion to JSON), so I hope it helps someone.
I would appreciate it if you could point out improvements or mistakes in the comments. Twitter: @ymgn_ll