Recently, I've been getting more and more scraping work. I used to implement scraping with a PHP library called simplehtml.
- Automatic form submission
- A dedicated CLI
- Simply popular
For those reasons, I've recently been doing my scraping work with Python's Scrapy. (PHP is easy, but I also have a personal desire to move on from PHP.)
The main reasons why Scrapy is good:

- It can handle complicated scraping
- You can experiment with the CLI's command tools

Until now, my scraping amounted to reading URL patterns. Scrapy provides methods for page transitions, so you can, for example, submit forms, and it uses far less memory than browser automation with ***selenium***.
** Installing Scrapy **
$ pip install scrapy
** Start a Scrapy Spider project **
$ scrapy startproject [project_name] [project_dir]
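For reference, `startproject` should generate roughly the following layout (names depend on what you pass):

```
project_dir/
    scrapy.cfg            # deploy configuration
    project_name/
        __init__.py
        items.py          # Item definitions (see the Items section below)
        middlewares.py
        pipelines.py
        settings.py       # project settings
        spiders/          # your Spiders live here
            __init__.py
```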
** List the Spiders in the project **
$ scrapy list
** Create a new Spider in the created project **
# Specify the domain name
$ scrapy genspider [spider_name] mydomain.com
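The generated file should look roughly like this skeleton (the class name follows the spider name you pass):

```python
import scrapy

class SpiderNameSpider(scrapy.Spider):
    name = 'spider_name'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://mydomain.com/']

    def parse(self, response):
        # parsing logic goes here
        pass
```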
** Specify URLs when running from the command line **
$ scrapy crawl -a start_urls="http://example1.com,http://example2.com" [spider_name]
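Spider arguments passed with `-a` arrive as strings in the spider's constructor, so the spider must split the comma-separated list itself. A minimal sketch, assuming a spider named `example_spider`:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example_spider"

    def __init__(self, start_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if start_urls:
            # -a start_urls="http://example1.com,http://example2.com"
            # arrives as one comma-separated string
            self.start_urls = start_urls.split(",")

    def parse(self, response):
        self.logger.info("visited %s", response.url)
```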
** Output as CSV **
$ scrapy crawl -o csv_file_name.csv [spider_name]
** Output as JSON **
$ scrapy crawl -o json_file_name.json [spider_name]
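What gets written to the file are the items the spider yields; for example, yielding dicts like these (the selectors are just an illustration) produces a CSV with `title` and `url` columns:

```python
def parse(self, response):
    for a in response.css("a"):
        # Each yielded dict becomes one CSV row / JSON object
        yield {
            "title": a.css("::text").get(),
            "url": a.css("::attr(href)").get(),
        }
```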
** Launch the Scrapy shell **
$ scrapy shell [URL]
** Show the entire page **
# response is predefined in the shell, no setup needed
response.body
** Get all links **
for link in response.css('a::attr(href)'):
    print(link.get())
** Use regular expressions **
# When the href of an a tag matches a specific file
matched = response.css('a::attr(href)').re(r'detail\.php')
if len(matched) > 0:
    print('matched')
# When the text of an a tag matches a specific (Japanese) string
matched = response.css('a::text').re(u'Summary')
if len(matched) > 0:
    print('matched')
** Get tags **
# Get a tags
response.css('a')
** Get with selectors **
# Get a tags with the class "link"
response.css('a.link')
# Match multiple classes, e.g. <li class="page next"></li>
response.css('li.page.next')
** Convert relative paths to URLs **
for link in response.css('a::attr(href)'):
    print(response.urljoin(link.get()))
** Submit form information **
scrapy.FormRequest.from_response(response, formdata={"username": "login_username", "password": "login_password"})
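In a spider, a login flow with `FormRequest.from_response` might look like this sketch (the URL and form field names are assumptions):

```python
import scrapy

class LoginSpider(scrapy.Spider):
    name = "login_spider"  # hypothetical name
    start_urls = ["http://example.com/login"]  # hypothetical login page

    def parse(self, response):
        # Fill in and submit the login form found on this page
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "login_username", "password": "login_password"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Continue scraping as the logged-in user
        self.logger.info("logged in, now at %s", response.url)
```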
** Iterate over the child elements of elements obtained with XPath **
# Get the DIV elements
divs = response.xpath('//div')
# Iterate over the P elements inside those DIVs
for p in divs.xpath('.//p'):
    print(p.get())
** Move to another page **
# Pass self.parse as the callback function
yield scrapy.Request([url], callback=self.parse)
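For example, following a pagination link (the `a.next` selector is an assumption):

```python
def parse(self, response):
    # ... extract items from the current page ...

    # Follow the "next" link, if any, and parse it with the same method
    next_page = response.css('a.next::attr(href)').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```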
** Create an Item (edit items.py directly under the project) ** (based on the Scrapy documentation)
class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
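A spider can then fill in and yield the Item much like a dict (the selectors here are just an illustration):

```python
def parse(self, response):
    item = Product(
        name=response.css("h1::text").get(),
        price=response.css(".price::text").get(),
    )
    item["stock"] = 10  # fields can also be set like dict keys
    yield item
```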
** Follow detail pages until the list runs out (this will not work as-is; put these methods in your Spider class, as in the class sketch after the code) **
def parse(self, response):
    for a in response.css('a'):
        title = a.css('::text').extract_first()
        title_match = a.css('::text').re(u'training')
        if len(title_match) > 0:
            link_param = a.css('::attr(href)').extract_first()
            url = response.urljoin(link_param)
            item = {
                "title": title,
                "url": url
            }
            yield item
            # Queue detail pages whose URL matches the target pattern
            ptn = re.search(r"\/jinzaiikusei\/\w+\/", url)
            if ptn:
                self.scraping_list.append(url)
    if self.scraping_list:
        yield scrapy.Request(self.scraping_list[0], callback=self.parse_detail)
def parse_detail(self, response):
    for a in response.css('a'):
        title = a.css('::text').extract_first()
        url = a.css('::attr(href)').extract_first()
        title_matched = a.css('::text').re(u'training')
        url_matched = a.css('::attr(href)').re(r'jinzaiikusei\/.*\/.*\.html')
        if url_matched:
            item = {
                "title": title,
                "url": url
            }
            yield item
    # Move on to the next list page, if any remain
    self.current_index = self.current_index + 1
    if self.current_index < len(self.scraping_list):
        yield scrapy.Request(self.scraping_list[self.current_index], callback=self.parse_detail)
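A minimal sketch of the class these two methods live in (the spider name and start URL are hypothetical):

```python
import re

import scrapy

class TrainingSpider(scrapy.Spider):
    name = "training_spider"  # hypothetical name
    start_urls = ["http://example.com/"]  # hypothetical list page

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.scraping_list = []  # detail-page URLs collected by parse()
        self.current_index = 0   # current position in scraping_list

    # paste parse() and parse_detail() from above here
```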
- 2019/12/06 Newly created
- 2019/12/07 Added library techniques
- 2019/12/09 Added library techniques (form input, etc.)
- 2019/12/16 Added a chapter about Items
- 2019/12/21 Added commands
- 2020/1/20 Added to the shell section
- 2020/2/12 Added urljoin
- 2020/2/13 Added a sample