This is an introduction to the practice of web scraping with Python. Beyond the general background, I want to take a hands-on style that builds intuition. The eventual goal is a program that "accesses the Nihon Keizai Shimbun (Nikkei) every hour and records the Nikkei Stock Average at that time to CSV."
One note before we start: web scraping has legal pitfalls, so please read this carefully: [Okazaki Municipal Central Library incident (Librahack incident) - Wikipedia](https://ja.wikipedia.org/wiki/%E5%B2%A1%E5%B4%8E%E5%B8%82%E7%AB%8B%E4%B8%AD%E5%A4%AE%E5%9B%B3%E6%9B%B8%E9%A4%A8%E4%BA%8B%E4%BB%B6), as well as the usual lists of precautions for web scraping.
Language: Python 2.7.12
Libraries: urllib2, BeautifulSoup, csv, datetime, time
urllib2 is used to access URLs. BeautifulSoup is an HTML/XML-style parser for the fetched page. csv is for working with CSV files, and datetime and time are for getting and handling the time. urllib2 is installed along with Python itself; install BeautifulSoup with pip:
shell.sh
$ pip install beautifulsoup4
First, let's access the Nihon Keizai Shimbun with Python and fetch the HTML. Then we hand it to BeautifulSoup and extract and print the page title. Jumping straight to the title string alone can be hard to picture, so we will first get the title element, and then take the title string out of that element.
getNikkeiWebPageTitle.py
# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup

# URL to access
url = "http://www.nikkei.com/"
# urlopen returns the HTML of the URL
# → <html><head><title>Economic, stock, business and political news: Nikkei electronic version</title></head><body...
html = urllib2.urlopen(url)
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Get the title element
# → <title>Economic, stock, business and political news: Nikkei electronic version</title>
title_tag = soup.title
# Get the element's string
# → Economic, stock, business and political news: Nikkei electronic version
title = title_tag.string
# Print the title element
print title_tag
# Print the title as a string
print title
Running this produces the following:
shell.sh
$ python getNikkeiWebPageTitle.py
<title>Economic, stock, business and political news:Nikkei electronic version</title>
Economic, stock, business and political news:Nikkei electronic version
By the way,
print.py
print soup.title.string
gives the same result in one line.
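If you want to try this without any network access, BeautifulSoup can also parse an HTML string directly. A minimal sketch (the HTML here is a stand-in string, not the real Nikkei page):

```python
from bs4 import BeautifulSoup

# Stand-in HTML string, not fetched from the network
html = "<html><head><title>Sample title</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title    # the <title> element
title = title_tag.string  # its text content

print(title_tag)  # <title>Sample title</title>
print(title)      # Sample title
```

This makes it easy to experiment with selectors before pointing the script at a live site.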
I think you now have a rough idea of how this works.
To restate the goal: "access the Nihon Keizai Shimbun every hour and record the Nikkei Stock Average at that time to CSV." The procedure is to fetch the page, extract the average, and append it to a CSV file; no header row is used in the CSV. Let's do it.
First, access the page that shows the Nikkei Stock Average. The standard approach is to look up the URL yourself in a browser beforehand. If you do, you can find it under Nikkei → Markets → Stocks.
We will reuse the previous program.
getNikkeiHeikin.py
# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup

# URL to access
url = "http://www.nikkei.com/markets/kabu/"
# urlopen returns the HTML of the URL
html = urllib2.urlopen(url)
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Print the title as a string
print soup.title.string
This prints the page title:
shell.sh
$ python getNikkeiHeikin.py
>>Stocks: Market: Nikkei electronic version
Next, get the Nikkei Stock Average itself.
Open Nikkei > Markets > Stocks in your browser. The average is shown slightly below the top of the page. To extract it, we need to find where this value lives in the HTML, so right-click the Nikkei Stock Average figure and choose "Inspect". You should see something like this: the value sits in a span element with class="mkc-stock_prices".
Now that we know its position, let's actually extract and print it with BeautifulSoup.
getNikkeiHeikin.py
# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup

# URL to access
url = "http://www.nikkei.com/markets/kabu/"
# urlopen returns the HTML of the URL
html = urllib2.urlopen(url)
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Extract all span elements; find_all returns them as a list
# → [<span class="m-wficon triDown"></span>, <span class="l-h...
span = soup.find_all("span")
# Declare it up front so the print at the end does not fail if nothing is found
nikkei_heikin = ""
# Look through all span elements for the one whose class is "mkc-stock_prices"
for tag in span:
    # Elements with no class attribute make tag.get("class").pop(0) raise an
    # exception, so guard it with try/except
    try:
        # Extract the tag's first class name. Because multiple classes may be
        # set, get("class") returns a list, and pop(0) takes its first entry:
        # <span class="hoge foo"> → ["hoge", "foo"] → "hoge"
        string_ = tag.get("class").pop(0)
        # Check whether the extracted class name is "mkc-stock_prices"
        if string_ == "mkc-stock_prices":
            # It is, so take the string enclosed in the tag with .string
            nikkei_heikin = tag.string
            # Extraction done; leave the for loop
            break
    except:
        # No class attribute; do nothing
        pass
# Print the extracted Nikkei Stock Average
print nikkei_heikin
Result:
shell.sh
$ python getNikkeiHeikin.py
>>18,513.12
The code is explained in the comments. Put simply, the flow is:
1. Fetch the HTML of the page
2. Parse it with BeautifulSoup
3. Collect all span elements
4. Find the one whose class is mkc-stock_prices and take its string
This flow is not difficult and can be applied to most scraping situations, which is its main advantage. The caveat: if the span changes to a different element, or the content of the class changes, the value can no longer be found.
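As a side note, BeautifulSoup can also search by class directly, which removes the need for the loop and the try/except. A minimal sketch, using a made-up HTML fragment that mimics the structure rather than the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page structure; the live markup may differ
html = ('<span class="m-wficon triDown"></span>'
        '<span class="mkc-stock_prices">18,513.12</span>')
soup = BeautifulSoup(html, "html.parser")

# find() with the class_ keyword returns the first matching element, or None
tag = soup.find("span", class_="mkc-stock_prices")
nikkei_heikin = tag.string if tag is not None else ""
print(nikkei_heikin)  # 18,513.12
```

The loop version in the listing above is more general, but `find` with `class_` is the idiomatic shortcut when you only need the first match.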
Finally, write this result to CSV and repeat it every hour.
getNikkeiHeikin.py
# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup
from datetime import datetime
import csv
import time

# Run forever
while True:
    # If the minute is anything other than 59, wait
    if datetime.now().minute != 59:
        # Wait 58 seconds rather than a full minute, to allow for timing drift
        time.sleep(58)
        continue
    # Open the csv in append mode here, because opening gets slower as the
    # file grows
    f = open('nikkei_heikin.csv', 'a')
    writer = csv.writer(f, lineterminator='\n')
    # It is minute 59; poll once a second until second 59, so we measure at
    # the correct time
    while datetime.now().second != 59:
        time.sleep(1)
    # The steps below can finish within the same second and run twice, so
    # wait one more second; this also lands us at the top of the hour
    time.sleep(1)
    # Build the record to write to the csv
    csv_list = []
    # Get the current time as year/month/day hour:minute:second
    time_ = datetime.now().strftime("%Y/%m/%d %H:%M:%S")
    # Put the time in the first column
    csv_list.append(time_)
    # URL to access
    url = "http://www.nikkei.com/markets/kabu/"
    # urlopen returns the HTML of the URL
    html = urllib2.urlopen(url)
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    # Extract all span elements; find_all returns them as a list
    span = soup.find_all("span")
    # Declare it up front so the print below does not fail if nothing is found
    nikkei_heikin = ""
    # Look through all span elements for the one whose class is "mkc-stock_prices"
    for tag in span:
        # Elements with no class attribute make tag.get("class").pop(0) raise
        # an exception, so guard it with try/except
        try:
            # Extract the tag's first class name; get("class") returns a list
            # and pop(0) takes its first entry
            string_ = tag.get("class").pop(0)
            # Check whether the extracted class name is "mkc-stock_prices"
            if string_ == "mkc-stock_prices":
                # It is, so take the string enclosed in the tag with .string
                nikkei_heikin = tag.string
                # Extraction done; leave the for loop
                break
        except:
            # No class attribute; do nothing
            pass
    # Print the extracted Nikkei Stock Average together with the time
    print time_, nikkei_heikin
    # Record the Nikkei average in the second column
    csv_list.append(nikkei_heikin)
    # Append the record to the csv
    writer.writerow(csv_list)
    # Close the file to avoid corruption
    f.close()
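The CSV-writing part can be tried on its own. A minimal sketch that appends one timestamped row (the price here is a placeholder value; in the real program it comes from the scrape):

```python
# coding: UTF-8
import csv
from datetime import datetime

# Placeholder value; in the real program this comes from the scrape above
nikkei_heikin = "18,513.12"
time_ = datetime.now().strftime("%Y/%m/%d %H:%M:%S")

# Open in append mode so each hourly run adds one row
f = open('nikkei_heikin.csv', 'a')
writer = csv.writer(f, lineterminator='\n')
writer.writerow([time_, nikkei_heikin])
f.close()
```

Because the value "18,513.12" contains a comma, csv.writer quotes the field automatically, which is one reason to use the csv module instead of writing the line by hand.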
To put the flow in words: wait until just before the top of the hour, fetch the page, extract the Nikkei Stock Average, and append the time and value to the CSV. That's all there is to it.
Left running, this accesses the site once an hour and records the Nikkei 225.
You can apply this pattern to all sorts of things. For example, you could add items to a cart at high speed (so-called scripting) during a sale on a certain long river in South America... I don't particularly recommend it.
See you!