As the title suggests, this post is about saving information scraped with requests and lxml into MySQL. Since there were few sites that explained the whole flow end to end, I am recording it here as a memo. Scraping obviously puts a load on the target server, so you need to insert delays and check robots.txt.
- requests 2.13.0: HTTP library
- lxml 3.7.3: XML/HTML parsing library
- PyMySQL 0.7.10: MySQL driver
- fake-useragent 0.1.5: library for generating request headers
Other libraries commonly used for scraping include urllib, htmllib, BeautifulSoup, and Scrapy, but this time I went with the configuration above. Since fake-useragent is only available via pip, I used a pip environment this time.
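As a side note on the robots.txt point above, here is a minimal sketch (not part of the original script) of how the standard library's urllib.robotparser can be used to check whether a URL may be fetched; the URL is the index page scraped in this article:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://news.yahoo.co.jp/robots.txt")
rp.read()  # download and parse robots.txt

url = "http://news.yahoo.co.jp/list/?c=computer&p=1"
print(rp.can_fetch("*", url))  # True if the generic user agent is allowed to fetch this URL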
We scrape the article title, the URL linking to the individual article, the category name, and the article creation date from the IT category of Yahoo News (http://news.yahoo.co.jp/list/?c=computer).
As for the page structure, each index page lists 20 individual articles, and there are 714 index pages in total. The URL of an index page is the base http://news.yahoo.co.jp/list/?c=computer&p= with a page number appended, e.g. http://news.yahoo.co.jp/list/?c=computer&p=1. Each individual article falls into one of three types: articles with thumbnails, articles without thumbnails, and deleted articles.
Article example with thumbnail
Article example without thumbnail
Deleted article example
lxml uses XPath to specify the part you want to parse. To get the XPath of a piece of information, use Chrome's inspection tool: open the index page in Chrome, right-click and choose Inspect, then expand the HTML tree while watching the highlighted region on the page. When you reach the element you want to scrape, select it, open the right-click menu, and choose Copy > Copy XPath to obtain its XPath.
- Example article with a thumbnail (1st from the top)
- Article title: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[1]/a/span[2]
- Link URL: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[1]/a
- Category: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[1]/a/span[3]/span[1]
- Article creation date: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[1]/a/span[3]/span[2]
- Example article with a thumbnail (20th from the top)
- Article title: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[20]/a/span[2]
- Link URL: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[20]/a
- Category: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[20]/a/span[3]/span[1]
- Article creation date: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[20]/a/span[3]/span[2]
- Example article without a thumbnail (8th from the top)
- Article title: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[8]/a/span[1]
- Link URL: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[8]/a
- Category: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[8]/a/span[2]/span[1]
- Article creation date: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[8]/a/span[2]/span[2]
- Example of a deleted article (9th from the top)
- Article title: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[9]/div/span[1]
- Link URL: none
- Category: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[9]/div/span[2]/span[1]
- Article creation date: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[9]/div/span[2]/span[2]
From the XPath samples collected above, consider the regularity of the XPath for each piece of information.

- The position of the individual article is specified by li[num], where num is its position counted from the top.
- The span[num] parts after /a change from 2, 3, 3 (with a thumbnail) to 1, 2, 2 (without one).
- For deleted articles, the /a part changes to /div.

Based on these observations, we generate the actual XPath strings and pass them to lxml (see the small sketch below).
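As an illustration of that regularity (an independent sketch, not the Xpath class defined later), the li index and the span number can simply be parameterised when building the XPath string:

# Sketch: build the title XPath for the num-th article, with or without a thumbnail
def title_xpath(article_num, has_thumbnail):
    span = 2 if has_thumbnail else 1
    return ('//*[@id="main"]/div[1]/div[1]/div[4]/ul/li[' + str(article_num) +
            ']/a/span[' + str(span) + ']')

print(title_xpath(1, True))   # .../li[1]/a/span[2]
print(title_xpath(8, False))  # .../li[8]/a/span[1]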
requests, fake-useragent
import requests  # GET request to a URL to obtain the HTML
from fake_useragent import UserAgent  # header generation

def make_scraping_url(page_num):
    scraping_url = "http://news.yahoo.co.jp/list/?c=computer&p=" + str(page_num)
    return scraping_url

counter_200 = 0
counter_404 = 0

def make_html_text(scraping_url):
    global counter_200
    global counter_404

    ua = UserAgent()  # fake-useragent object
    ran_header = {"User-Agent": ua.random}  # generate a random header on every call
    html = requests.get(scraping_url, headers=ran_header)  # response object returned by requests
    html_text = html.text

    if html.status_code == 200:
        counter_200 += 1
    elif html.status_code == 404:  # count unsuccessful requests; used later for the termination processing
        counter_404 += 1

    return html_text
The request header is randomly generated with fake_useragent. Two global variables serve as counters: counter_200 counts successful GET requests, and counter_404 counts unsuccessful ones and is used later for the termination processing.
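A minimal usage sketch of the two functions above (the page number 1 is just an example):

scraping_url = make_scraping_url(1)       # http://news.yahoo.co.jp/list/?c=computer&p=1
html_text = make_html_text(scraping_url)  # fetched with a randomly generated User-Agent
print(counter_200, counter_404)           # 1 0 if the request succeeded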
lxml
import lxml.html  # XML/HTML parser

# Generate a DOM (Document Object Model) by converting the HTML text into lxml's tree format
def make_dom(html_text):
    dom = lxml.html.fromstring(html_text)
    return dom

# Take a dom and an xpath and return the text
def make_text_from_xpath(dom, xpath):
    text = dom.xpath(xpath)[0].text
    return text

# Take a dom and an xpath and return the link
def make_link_from_xpath(dom, xpath):
    link = dom.xpath(xpath)[0].attrib["href"]
    return link

# Used when the text you want is returned by a text() XPath expression
def make_text_from_xpath_function(dom, xpath):
    text = dom.xpath(xpath)[1]
    return text
You can create a DOM (Document Object Model) by passing HTML text to lxml, and then obtain each piece of information by passing an XPath to that object. How the XPath has to be written differs depending on the kind of information, so it is best to check the XPath syntax each time. dom.xpath() basically returns a list, so you need to check where in the list the information you want sits, for example by looping over it.
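For example, a quick way to check what an XPath returns and where the wanted value sits in the list (a sketch, assuming dom was built with make_dom() above; the XPath is the title example from earlier):

results = dom.xpath('//*[@id="main"]/div[1]/div[1]/div[4]/ul/li[1]/a/span[2]')
print(len(results))             # number of matching nodes
for index, element in enumerate(results):
    print(index, element.text)  # position in the list and the text of each node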
def debug_check_where_link_number(page_num, article_num):
    scraping_url = make_scraping_url(page_num)
    html_text = make_html_text(scraping_url)
    dom = make_dom(html_text)

    xpath = Xpath(article_num)

    thumbnail_link = check_thumbnail_link(dom, article_num)
    if thumbnail_link[0] is True and thumbnail_link[1] is True:
        xpath.with_thumbnail()
    elif thumbnail_link[0] is False:
        xpath.no_thumbnail()
    elif thumbnail_link[1] is False:
        xpath.deleted()

    get_link = make_link_from_xpath(dom, xpath.info_list[1])
    l_get_link = get_link.split("/")

    counter = 0
    for item in l_get_link:
        print(counter)
        counter += 1
        print(item)
thumbnail_link (and the check_thumbnail_link function that produces it) will be explained a little further down.
# A class that bundles the XPaths of the items you want to scrape; instantiate it with the position of the individual article counted from the top
class Xpath:
    def __init__(self, article_num):
        self.article_num = article_num
        self.info_list = []

    # For articles with thumbnails
    def with_thumbnail(self):
        self.info_list = ["//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[2]",  # title
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a",  # link
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[3]/span[1]",  # category
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[3]/span[2]"]  # timestamp

    # For articles without thumbnails
    def no_thumbnail(self):
        self.info_list = ["//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[1]",  # title
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a",  # link
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[2]/span[1]",  # category
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[2]/span[2]"]  # timestamp

    # For deleted articles
    def deleted(self):
        self.info_list = ["//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/div/span[1]",  # title
                          None,  # link
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/div/span[2]/span[1]",  # category
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/div/span[2]/span[2]"]  # timestamp
The Xpath class is built from the XPath observations made earlier.
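A short usage example of the class (illustrative only):

xpath = Xpath(3)           # 3rd article from the top
xpath.with_thumbnail()     # assuming it has a thumbnail
print(xpath.info_list[0])  # //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[3]/a/span[2]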
# Determine whether an individual article has a thumbnail and a link
def check_thumbnail_link(dom, article_num):
    # Array indicating the presence or absence of a thumbnail and a link
    thumbnail_link = [True, True]

    # Throw the category XPath of a no-thumbnail article; it raises an error if a thumbnail exists
    try:  # no error, i.e. there is no thumbnail
        make_text_from_xpath(dom, "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(article_num) + "]/a/span[2]/span[1]")
        thumbnail_link[0] = False
    except IndexError:  # error, i.e. there is a thumbnail
        pass

    # Throw the link XPath; if the link does not exist, an error is raised
    try:  # no error, i.e. the link exists
        make_link_from_xpath(dom, "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(article_num) + "]/a")
    except IndexError:  # error, i.e. the link does not exist
        thumbnail_link[1] = False

    return thumbnail_link
Whether a thumbnail or link exists is determined by throwing an XPath and seeing whether it raises an error. It is a bit of a hack, so I am still thinking about a better way.
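One possible cleaner alternative (my own sketch, not from the original article) is to check the length of the node list that dom.xpath() returns instead of catching IndexError:

# Sketch: existence checks via the length of the XPath result list instead of try/except
def check_thumbnail_link_alt(dom, article_num):
    base = '//*[@id="main"]/div[1]/div[1]/div[4]/ul/li[' + str(article_num) + ']'
    no_thumbnail = len(dom.xpath(base + '/a/span[2]/span[1]')) > 0  # this node only exists when there is no thumbnail
    has_link = len(dom.xpath(base + '/a')) > 0                      # deleted articles have no <a> element
    return [not no_thumbnail, has_link]  # same [thumbnail, link] shape as check_thumbnail_link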
PyMySQL
$ mysql.server start  # start the MySQL server
$ mysql -u root -p  # launch the MySQL client
mysql> show databases;  # check the databases
mysql> create database database_name;  # create a database
mysql> use database_name;  # select the database to use
mysql> show tables;  # check the tables in the database
mysql> show columns from table_name;  # check the columns in a table
mysql> select * from table_name;  # show all rows in the table
mysql> delete from table_name;  # delete all rows in the table; use it when you scraped something wrong
mysql> create table table_name(column_name_1 column_type_1, column_name_2 column_type_2, ...);  # create a table
# The table created this time
mysql> create table yahoo_news(page_num int not null,
                               title varchar(255),
                               link_num varchar(255),
                               category varchar(255),
                               article_time varchar(255));
I created a database named scraping and a table named yahoo_news. The reason link_num is varchar rather than int is that a leading 0 would be dropped if link_num were stored as an int.
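A tiny illustration of that leading-zero point (not from the original article):

link_num = "0123456"
print(int(link_num))  # 123456: the leading zero disappears as an int
print(link_num)       # 0123456: preserved when kept as a string (varchar)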
import pymysql.cursors  # MySQL client

# MySQL connection information
connection = pymysql.connect(host="localhost",
                             user="root",
                             password="hogehoge",
                             db="scraping",
                             charset="utf8")

# Cursor for issuing queries, and the statements
cur = connection.cursor()
sql_insert = "INSERT INTO yahoo_news VALUES (%s, %s, %s, %s, %s)"  # match the number of placeholders with the number of columns
sql_check = "SELECT * FROM yahoo_news WHERE page_num = %s"

# Check page_num and skip the page if it is already in the database (this part runs inside the page loop in main())
cur.execute(sql_check, page_num)
check_num = cur.fetchall()  # fetch whatever came back
if check_num:  # not empty, i.e. the page is already in the database
    continue

# Pass the information for 20 articles to MySQL
cur.executemany(sql_insert, l_all_get_text)  # executemany when passing a list of tuples, execute when passing a single tuple
connection.commit()  # the insert does not happen without commit

# Close the cursor and the connection
cur.close()
connection.close()
Using the %s placeholder makes it easy to build statements. It was surprisingly simple: use execute when passing an int, a str, or a single tuple, and executemany when passing a list of tuples.
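A minimal sketch of the difference (the values are made up; they just match the five columns of yahoo_news):

cur.execute(sql_check, 3)  # a single value for a single %s placeholder
cur.execute(sql_insert, (3, "some title", "0123456", "IT", "2/21(Tue) 10:00"))  # one row as a tuple
cur.executemany(sql_insert, [(3, "title 1", "0000001", "IT", "10:00"),
                             (3, "title 2", "0000002", "IT", "10:05")])  # several rows as a list of tuples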
# coding: utf-8

import requests  # GET request to a URL to obtain the HTML
from fake_useragent import UserAgent  # header generation
import lxml.html  # XML/HTML parser
import pymysql.cursors  # MySQL client
import time  # for generating delays


def make_scraping_url(page_num):
    scraping_url = "http://news.yahoo.co.jp/list/?c=computer&p=" + str(page_num)
    return scraping_url


counter_200 = 0
counter_404 = 0


def make_html_text(scraping_url):
    global counter_200
    global counter_404

    ua = UserAgent()  # fake-useragent object
    ran_header = {"User-Agent": ua.random}  # generate a random header on every call
    html = requests.get(scraping_url, headers=ran_header)  # response object returned by requests
    html_text = html.text

    if html.status_code == 200:
        counter_200 += 1
    elif html.status_code == 404:  # count unsuccessful requests; used for the termination processing
        counter_404 += 1

    return html_text


# Generate a DOM (Document Object Model) by converting the HTML text into lxml's tree format
def make_dom(html_text):
    dom = lxml.html.fromstring(html_text)
    return dom


# Take a dom and an xpath and return the text
def make_text_from_xpath(dom, xpath):
    text = dom.xpath(xpath)[0].text
    return text


# Take a dom and an xpath and return the link
def make_link_from_xpath(dom, xpath):
    link = dom.xpath(xpath)[0].attrib["href"]
    return link


# Used when the text you want is returned by a text() XPath expression
def make_text_from_xpath_function(dom, xpath):
    text = dom.xpath(xpath)[1]
    return text


# A class that bundles the XPaths of the items you want to scrape; instantiate it with the position of the individual article counted from the top
class Xpath:
    def __init__(self, article_num):
        self.article_num = article_num
        self.info_list = []

    # For articles with thumbnails
    def with_thumbnail(self):
        self.info_list = ["//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[2]",  # title
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a",  # link
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[3]/span[1]",  # category
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[3]/span[2]"]  # timestamp

    # For articles without thumbnails
    def no_thumbnail(self):
        self.info_list = ["//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[1]",  # title
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a",  # link
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[2]/span[1]",  # category
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[2]/span[2]"]  # timestamp

    # For deleted articles
    def deleted(self):
        self.info_list = ["//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/div/span[1]",  # title
                          None,  # link
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/div/span[2]/span[1]",  # category
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/div/span[2]/span[2]"]  # timestamp


# Determine whether an individual article has a thumbnail and a link
def check_thumbnail_link(dom, article_num):
    # Array indicating the presence or absence of a thumbnail and a link
    thumbnail_link = [True, True]

    # Throw the category XPath of a no-thumbnail article; it raises an error if a thumbnail exists
    try:  # no error, i.e. there is no thumbnail
        make_text_from_xpath(dom, "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(article_num) + "]/a/span[2]/span[1]")
        thumbnail_link[0] = False
    except IndexError:  # error, i.e. there is a thumbnail
        pass

    # Throw the link XPath; if the link does not exist, an error is raised
    try:  # no error, i.e. the link exists
        make_link_from_xpath(dom, "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(article_num) + "]/a")
    except IndexError:  # error, i.e. the link does not exist
        thumbnail_link[1] = False

    return thumbnail_link


# Crawling operation: receives dom, page_num and article_num and returns a tuple of the scraped information
def crawling(dom, page_num, article_num):
    list_get_text = []  # list collecting the information for one article; converted to a tuple later because the MySQL insert uses tuples
    list_get_text.append(page_num)

    # Create an Xpath object
    xpath = Xpath(article_num)

    # Check for thumbnail and link and generate info_list
    thumbnail_link = check_thumbnail_link(dom, article_num)
    if thumbnail_link[0] is True and thumbnail_link[1] is True:
        xpath.with_thumbnail()
    elif thumbnail_link[0] is False:
        xpath.no_thumbnail()
    elif thumbnail_link[1] is False:
        xpath.deleted()

    for index, xpath_info in enumerate(xpath.info_list):
        if index == 1 and thumbnail_link[1] is True:  # link
            get_link = make_link_from_xpath(dom, xpath_info)
            l_get_link = get_link.split("/")
            list_get_text.append(l_get_link[4])  # the right index of the split link was found with debug_check_where_link_number()
        elif index == 1 and thumbnail_link[1] is False:
            list_get_text.append(None)  # None becomes NULL in MySQL
        else:  # text
            get_text = make_text_from_xpath(dom, xpath_info)
            list_get_text.append(get_text)

    tuple_get_text = tuple(list_get_text)
    return tuple_get_text


# MySQL connection information
connection = pymysql.connect(host="localhost",
                             user="root",
                             password="hogehoge",
                             db="scraping",
                             charset="utf8")

# Cursor for issuing queries, and the statements
cur = connection.cursor()
sql_insert = "INSERT INTO yahoo_news VALUES (%s, %s, %s, %s, %s)"  # match the number of placeholders with the number of columns
sql_check = "SELECT * FROM yahoo_news WHERE page_num = %s"


def main():
    global counter_200

    start_page = 1
    end_page = 5

    for page_num in range(start_page, end_page + 1):
        # Check page_num and skip the page if it is already in the database
        cur.execute(sql_check, page_num)
        check_num = cur.fetchall()  # fetch whatever came back
        if check_num:  # not empty, i.e. the page is already in the database
            continue

        # Termination processing
        if counter_404 == 5:
            break

        l_all_get_text = []  # list collecting the information for 20 articles

        # Generate the URL and access it here
        scraping_url = make_scraping_url(page_num)
        html_text = make_html_text(scraping_url)
        dom = make_dom(html_text)

        for article_num in range(1, 21):
            tuple_get_text = crawling(dom, page_num, article_num)
            l_all_get_text.append(tuple_get_text)

        # Pass the information for 20 articles to MySQL
        cur.executemany(sql_insert, l_all_get_text)  # executemany when passing a list of tuples, execute when passing a single tuple
        connection.commit()  # the insert does not happen without commit

        # Delay for each crawl
        time.sleep(3)  # in seconds

        # Longer delay after several crawls
        if counter_200 == 10:
            counter_200 = 0
            time.sleep(60)  # in seconds

    # Close the cursor and the connection
    cur.close()
    connection.close()


def debug_check_where_link_number(page_num, article_num):
    scraping_url = make_scraping_url(page_num)
    html_text = make_html_text(scraping_url)
    dom = make_dom(html_text)

    xpath = Xpath(article_num)

    thumbnail_link = check_thumbnail_link(dom, article_num)
    if thumbnail_link[0] is True and thumbnail_link[1] is True:
        xpath.with_thumbnail()
    elif thumbnail_link[0] is False:
        xpath.no_thumbnail()
    elif thumbnail_link[1] is False:
        xpath.deleted()

    get_link = make_link_from_xpath(dom, xpath.info_list[1])
    l_get_link = get_link.split("/")

    counter = 0
    for item in l_get_link:
        print(counter)
        counter += 1
        print(item)


if __name__ == "__main__":
    main()
If anything, this ended up as an implementation that builds a site map within the site. Each individual article page could then be accessed using the link_num values obtained by this script.
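For example, a sketch of reading the stored link_num values back out as a starting point (assuming the connection and cursor are still open; the exact URL pattern of the individual pages still needs to be checked on the site):

cur.execute("SELECT link_num FROM yahoo_news WHERE link_num IS NOT NULL")
for (link_num,) in cur.fetchall():
    print(link_num)  # build each individual article URL from this value and fetch it with make_html_text(), with delays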
It would be helpful if you could point out any mistakes or improvements.