I like Perl as a lightweight language, but lately I've been interested in Python, so as a study exercise I tried scraping the titles and body text, in order, from the Yahoo! News list page: http://news.yahoo.co.jp/list/
Environment: Python 2.7.11, lxml, requests, selenium, PhantomJS. Note that on El Capitan, PhantomJS cannot be installed via brew, so install it via npm instead (reference: http://qiita.com/labeneko/items/e3b790e06778900f5719):

```bash
npm install phantom phantomjs -g
```
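As a quick check that selenium can actually drive the npm-installed binary, something like this should print the browser name and version (a minimal sketch; it assumes phantomjs ended up on your PATH, and the exact capability keys may vary by version):

```python
#coding: UTF-8
from selenium import webdriver

driver = webdriver.PhantomJS()  # raises if the phantomjs binary is not on PATH
# capabilities is a plain dict reported by the driver
print driver.capabilities.get("browserName"), driver.capabilities.get("version")
driver.quit()
```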
Bam, here's the whole thing:
```python
#coding: UTF-8
import json
import lxml.html
import requests
from datetime import datetime
from selenium import webdriver
from time import sleep

#--------------------------------------------------
# WebSpider
class WebSpider:
    def __init__(self, rootUrl):
        self._webDriver = webdriver.PhantomJS()
        self._pageSourceMap = {}          # url -> (cached unix time, parsed page source)
        self._expireTime = (60 * 60) * 1  # cache entries expire after one hour
        self._rootUrl = rootUrl

    def __del__(self):
        del self._webDriver
        self._pageSourceMap.clear()

    def eachContents(self, url, selector, proc):
        # Apply proc to every element matching selector on the page at url
        for content in self.getContents(url, selector):
            proc(content)

    def getContents(self, url, selector):
        self._releaseCaches()
        # Serve from the cache when possible (the root list page is always refetched)
        if self._hasCachedPage(url) and self._rootUrl != url:
            print "> [!] use cached source: " + url
            return self._pageSourceMap[url][1].cssselect(selector)
        sleep(1)  # wait a second between requests so we don't hammer the server
        self._webDriver.get(url)
        pageSource = lxml.html.fromstring(self._webDriver.page_source)
        self._pageSourceMap[url] = (self._getCurrentUnixTime(), pageSource)
        print "> [i] cached page source: " + url
        return self._pageSourceMap[url][1].cssselect(selector)

    def _hasCachedPage(self, url):
        return url in self._pageSourceMap

    def _releaseCaches(self):
        # Drop cache entries older than _expireTime
        for key, value in self._pageSourceMap.items():
            isExpire = (self._getCurrentUnixTime() - value[0]) >= long(self._expireTime)
            if isExpire:
                print "> [!!!] pop cached source: " + key
                self._pageSourceMap.pop(key, None)

    def _getCurrentUnixTime(self):
        return long(datetime.now().strftime("%s"))

#--------------------------------------------------
# create instance
rootUrl = "http://news.yahoo.co.jp/list/"
webSpider = WebSpider(rootUrl)

#--------------------------------------------------
# eachProcs
def pickUpContents(content):
    webSpider.eachContents(content.attrib["href"], "#link", summaryContents)

def summaryContents(content):
    webSpider.eachContents(content.attrib["href"], "#ym_newsarticle > div.hd > h1", titleContents)
    webSpider.eachContents(content.attrib["href"], "#ym_newsarticle > div.articleMain > div.paragraph > p", mainTextContents)

def titleContents(content):
    print content.text.encode("utf_8")

def mainTextContents(content):
    print lxml.html.tostring(content, encoding="utf-8", method="text")

#--------------------------------------------------
# run
webSpider.eachContents(rootUrl, "#main > div.mainBox > div.backnumber > div.listArea > ul > li > a", pickUpContents)
del webSpider
```
Running it produces output like this:

```
> [i] cached page source: http://news.yahoo.co.jp/pickup/6215860
> [!] use cached source: http://headlines.yahoo.co.jp/hl?a=20160927-00000132-spnannex-base
Pinch hitter Otani doubles ... Ham loses to Seibu and waits for the result of Soft B
> [!] use cached source: http://headlines.yahoo.co.jp/hl?a=20160927-00000132-spnannex-base
◇ Pacific League Nippon-Ham 0-3 Seibu (September 27, 2016, Seibu Prince)
Nippon-Ham, whose magic number to clinch is "1", lost to Seibu 0-3. (omitted ...)
> [i] cached page source: http://news.yahoo.co.jp/pickup/6215858
> [i] cached page source: http://headlines.yahoo.co.jp/hl?a=20160927-00000361-oric-ent
Nana Mizuki explains on her official website the circumstances of the damage to Koshien's natural grass after her concert
> [!] use cached source: http://headlines.yahoo.co.jp/hl?a=20160927-00000361-oric-ent
Voice actress and singer Nana Mizuki updated her official website on the 27th, addressing reports that the natural grass of Hanshin Koshien Stadium in Hyogo was damaged after her concert there on the 22nd. (omitted ...)
...
```
Done. ~~**This Python code is hard to read, isn't it?**~~
Reference article: http://qiita.com/beatinaniwa/items/72b777e23ef2390e13f8 What it does is simple: give **webSpider.eachContents** a target URL and a CSS selector, feed the resulting data back into **eachContents**, and keep nesting.
In the end, the titles and bodies two links deep from the Yahoo! News list page are fetched and displayed. If I do say so myself, it would have been overwhelmingly easier to read written with a plain **for statement** (see the sketch below).
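For comparison, here is roughly what the same traversal looks like as nested for loops, reusing the getContents method from the class above (a sketch, not tested against the live pages):

```python
# Flat rewrite of the callback nesting, using the same WebSpider instance
listSelector = "#main > div.mainBox > div.backnumber > div.listArea > ul > li > a"
for pickUp in webSpider.getContents(rootUrl, listSelector):
    for link in webSpider.getContents(pickUp.attrib["href"], "#link"):
        articleUrl = link.attrib["href"]
        for title in webSpider.getContents(articleUrl, "#ym_newsarticle > div.hd > h1"):
            print title.text.encode("utf_8")
        for paragraph in webSpider.getContents(articleUrl, "#ym_newsarticle > div.articleMain > div.paragraph > p"):
            print lxml.html.tostring(paragraph, encoding="utf-8", method="text")
```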
As I noticed later, the selectors implemented in **summaryContents** are not sufficient: some page layouts (articles with embedded video, etc.) yield no data, so capturing everything would require preparing several selector patterns, along the lines of the sketch below.
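One way to cover those layouts would be a small fallback helper that tries candidate selectors in order until one matches; the second selector below is a hypothetical placeholder, since I haven't catalogued the actual video-page markup:

```python
# Candidate selectors tried in order; the second is a hypothetical example
TITLE_SELECTORS = [
    "#ym_newsarticle > div.hd > h1",
    "#ym_newsarticle > div.movieNews h1",  # hypothetical video-article pattern
]

def getContentsWithFallback(spider, url, selectors):
    for selector in selectors:
        contents = spider.getContents(url, selector)
        if contents:  # cssselect returns an empty list when nothing matches
            return contents
    return []
```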
I also added caching and expiry, though rather pointlessly; with this code alone they don't accomplish much. I'm planning to extend it further, store the data in MongoDB, and play with MeCab and word2vec.
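As a rough idea of the MongoDB part, storing each article keyed by URL might look like this (a sketch assuming pymongo and a local mongod; the database, collection, and field names are made up):

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
articles = client["yahoo_news"]["articles"]  # hypothetical db/collection names

def storeArticle(url, title, body):
    # Upsert keyed by URL so re-crawling the same article doesn't duplicate it
    articles.update({"url": url}, {"url": url, "title": title, "body": body}, upsert=True)
```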
end.