I tried web scraping with Python.

Introduction

I like Perl as a lightweight language, but lately I've been interested in Python, so as a study exercise I tried scraping the titles and article bodies, in order, from the Yahoo! News list page: http://news.yahoo.co.jp/list/

Environment

- Python 2.7.11
- lxml
- requests
- selenium
- PhantomJS

On El Capitan, PhantomJS can't be installed via brew (reference: http://qiita.com/labeneko/items/e3b790e06778900f5719), so install it via npm instead:

```
npm install phantom phantomjs -g
```
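The Python-side libraries should be installable with pip (a minimal sketch, assuming a plain Python 2.7 environment):

```
pip install lxml requests selenium
```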

Straight to the implementation

Boom, here's the whole thing:

```python
# coding: utf-8
import json
import lxml.html
import requests
from datetime import datetime
from selenium import webdriver
from time import sleep, time

#--------------------------------------------------
# WebSpider
class WebSpider:
    def __init__(self, rootUrl):
        self._webDriver = webdriver.PhantomJS()  # headless PhantomJS browser
        self._pageSourceMap = {}                 # url -> (cached_at, parsed html)
        self._expireTime = (60 * 60) * 1         # cache lifetime: one hour, in seconds
        self._rootUrl = rootUrl
    
    def __del__(self):
        # quit() shuts down the PhantomJS process; a bare del would leak it
        self._webDriver.quit()
        self._pageSourceMap.clear()
    
    def eachContents(self, url, selector, proc):
        for content in self.getContents(url, selector):
            proc(content)
    
    def getContents(self, url, selector):
        self._releaseCaches()
        if self._hasCachedPage(url) and self._rootUrl != url:
            print "> [!] use cached source: " + url
            return self._pageSourceMap[url][1].cssselect(selector)
        sleep(1)  # be polite: wait a second between fetches
        self._webDriver.get(url)
        pageSource = lxml.html.fromstring(self._webDriver.page_source)
        self._pageSourceMap[url] = (self._getCurrentUnixTime(), pageSource)
        print "> [i] cached page source: " + url
        return self._pageSourceMap[url][1].cssselect(selector)
    
    def _hasCachedPage(self, url):
        return url in self._pageSourceMap
    
    def _releaseCaches(self):
        # evict any cached pages older than the expiry time
        for key, value in self._pageSourceMap.items():
            isExpire = (self._getCurrentUnixTime() - value[0]) >= long(self._expireTime)
            if isExpire:
                print "> [!!!] pop cached source: " + key
                self._pageSourceMap.pop(key, None)

    def _getCurrentUnixTime(self):
        # time() is portable; strftime("%s") only works on some platforms
        return long(time())

#--------------------------------------------------
# create instance
rootUrl = "http://news.yahoo.co.jp/list/"
webSpider = WebSpider(rootUrl)

#--------------------------------------------------
# eachProcs
def pickUpContents(content):
    webSpider.eachContents(content.attrib["href"], "#link", summaryContents)

def summaryContents(content):
    webSpider.eachContents(content.attrib["href"], "#ym_newsarticle > div.hd > h1", titleContents)
    webSpider.eachContents(content.attrib["href"], "#ym_newsarticle > div.articleMain > div.paragraph > p", mainTextContents)

def titleContents(content):
    print content.text.encode("utf_8")

def mainTextContents(content):
    print lxml.html.tostring(content, encoding="utf-8", method="text")

#--------------------------------------------------
# run
webSpider.eachContents(rootUrl, "#main > div.mainBox > div.backnumber > div.listArea > ul > li > a", pickUpContents)

del webSpider
```

Running it produces output like the following:

```
[i] cached page source: http://news.yahoo.co.jp/pickup/6215860
[!] use cached source: http://headlines.yahoo.co.jp/hl?a=20160927-00000132-spnannex-base
Pinch hitter Otani doubles... Nippon-Ham lose to Seibu, must wait on the SoftBank result
[!] use cached source: http://headlines.yahoo.co.jp/hl?a=20160927-00000132-spnannex-base

◇ Pacific League: Nippon-Ham 0-3 Seibu (September 27, 2016, Seibu Prince Dome)

Nippon-Ham, whose magic number to clinch stood at "1", lost to Seibu 0-3. (omitted...)
[i] cached page source: http://news.yahoo.co.jp/pickup/6215858
[i] cached page source: http://headlines.yahoo.co.jp/hl?a=20160927-00000361-oric-ent
Nana Mizuki explains on her official website what happened with the damaged natural grass at Koshien after her concert
[!] use cached source: http://headlines.yahoo.co.jp/hl?a=20160927-00000361-oric-ent

Voice actress and singer Nana Mizuki updated her official website on the 27th, addressing reports that the stadium's natural grass was damaged after her concert at Hanshin Koshien Stadium in Hyogo on the 22nd. (omitted...)
```

And it's done. ~~**This is some hard-to-read Python code, right?**~~

Quick explanation

Reference article: http://qiita.com/beatinaniwa/items/72b777e23ef2390e13f8 What it does is simple: you give **webSpider.eachContents** a target URL and a CSS selector, and inside the callback you pass each resulting element to another **eachContents** call, nesting your way down.
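Conceptually, the nesting looks like this (a stripped-down sketch using the same WebSpider API; the selectors here are placeholders, not the real ones):

```python
# each callback drills one level deeper by calling eachContents again
def innerProc(content):
    print content.text.encode("utf_8")  # leaf level: use the element

def outerProc(content):
    # follow the link found at the outer level
    webSpider.eachContents(content.attrib["href"], "h1", innerProc)

webSpider.eachContents(rootUrl, "ul > li > a", outerProc)
```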

At the end (after following two links from the Yahoo! News list page) it fetches and prints each article's title and body. If I do say so myself, this would have been overwhelmingly easier to read written with plain **for statements**, as the sketch below shows.
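For comparison, a rough sketch of the same two-hop traversal flattened into for loops (same WebSpider instance and selectors as above):

```python
# list page -> pickup page -> article page, as nested loops
for link in webSpider.getContents(rootUrl, "#main > div.mainBox > div.backnumber > div.listArea > ul > li > a"):
    for pickup in webSpider.getContents(link.attrib["href"], "#link"):
        articleUrl = pickup.attrib["href"]
        for title in webSpider.getContents(articleUrl, "#ym_newsarticle > div.hd > h1"):
            print title.text.encode("utf_8")
        for paragraph in webSpider.getContents(articleUrl, "#ym_newsarticle > div.articleMain > div.paragraph > p"):
            print lxml.html.tostring(paragraph, encoding="utf-8", method="text")
```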

As I noticed later, the selectors used in **summaryContents** are not sufficient. Some page patterns (articles with embedded video, for example) can't be scraped with them, so to capture all the data you would need to prepare several selector patterns.
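One way to handle that would be to try a list of selector patterns in order until one matches. A sketch (the second selector string is a made-up example, not a verified Yahoo! News layout):

```python
# return the first non-empty result among several selector patterns
def getContentsAny(spider, url, selectors):
    for selector in selectors:
        contents = spider.getContents(url, selector)
        if contents:
            return contents
    return []

titleSelectors = [
    "#ym_newsarticle > div.hd > h1",   # standard article layout
    "#ym_videoarticle > div.hd > h1",  # hypothetical video-embed layout
]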

I also added caching and expiry, somewhat gratuitously; with this code alone they don't accomplish much. I'm planning to extend it further: store the data in MongoDB and play around with MeCab and word2vec.
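For example, saving each article into MongoDB might look like this (a sketch assuming pymongo is installed; the database and collection names are made up):

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
articles = client["news"]["articles"]  # hypothetical db/collection names

def saveArticle(url, title, body):
    # upsert keyed on the URL so repeated crawls don't create duplicates
    articles.update_one({"url": url},
                        {"$set": {"title": title, "body": body}},
                        upsert=True)
```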

The end.

Impressions

(animated GIF: 8.gif)
