This is an introduction to the practice of web scraping with Python. Beyond the general background, I want to take a hands-on style that builds intuition. The eventual goal is a program that "accesses the Nihon Keizai Shimbun (Nikkei) every hour and records the Nikkei Stock Average at that time to CSV."
One note before we start: web scraping has legal pitfalls, so please read this carefully: [Okazaki Municipal Central Library incident (Librahack incident) - Wikipedia](https://ja.wikipedia.org/wiki/%E5%B2%A1%E5%B4%8E%E5%B8%82%E7%AB%8B%E4%B8%AD%E5%A4%AE%E5%9B%B3%E6%9B%B8%E9%A4%A8%E4%BA%8B%E4%BB%B6), as well as the usual lists of precautions for web scraping.
Language: Python 2.7.12
Libraries: urllib2, BeautifulSoup, csv, datetime, time
urllib2 is used to access URLs. BeautifulSoup is an HTML/XML-style parser for the fetched page. csv is for working with CSV files, and datetime and time are for getting and handling the time. urllib2 is installed along with Python itself; install BeautifulSoup with pip:
shell.sh
$ pip install beautifulsoup4
First, let's access the Nihon Keizai Shimbun with Python and fetch the HTML. Then we hand it to BeautifulSoup and extract and print the page title. Jumping straight to the title string alone can be hard to picture, so we will first get the title element, and then take the title string out of that element.
getNikkeiWebPageTitle.py
# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup

# URL to access
url = "http://www.nikkei.com/"
# urlopen returns the HTML of the URL
# → <html><head><title>Economic, stock, business and political news: Nikkei electronic version</title></head><body...
html = urllib2.urlopen(url)
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Get the title element
# → <title>Economic, stock, business and political news: Nikkei electronic version</title>
title_tag = soup.title
# Get the element's string
# → Economic, stock, business and political news: Nikkei electronic version
title = title_tag.string
# Print the title element
print title_tag
# Print the title as a string
print title
Running this produces the following:
shell.sh
$ python getNikkeiWebPageTitle.py
<title>Economic, stock, business and political news:Nikkei electronic version</title>
Economic, stock, business and political news:Nikkei electronic version
By the way,
print.py
print soup.title.string
gives the same result in one line.
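If you want to try this without any network access, BeautifulSoup can also parse an HTML string directly. A minimal sketch (the HTML here is a stand-in string, not the real Nikkei page):

```python
from bs4 import BeautifulSoup

# Stand-in HTML string, not fetched from the network
html = "<html><head><title>Sample title</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title    # the <title> element
title = title_tag.string  # its text content

print(title_tag)  # <title>Sample title</title>
print(title)      # Sample title
```

This makes it easy to experiment with selectors before pointing the script at a live site.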
I think you now have a rough idea of how this works.
To restate the goal: "access the Nihon Keizai Shimbun every hour and record the Nikkei Stock Average at that time to CSV." The procedure is to fetch the page, extract the average, and append it to a CSV file; no header row is used in the CSV. Let's do it.
First, access the page that shows the Nikkei Stock Average. The standard approach is to look up the URL yourself in a browser beforehand. If you do, you can find it under Nikkei → Markets → Stocks.
We will reuse the previous program.
getNikkeiHeikin.py
# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup

# URL to access
url = "http://www.nikkei.com/markets/kabu/"
# urlopen returns the HTML of the URL
html = urllib2.urlopen(url)
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Print the title as a string
print soup.title.string
This prints the page title:
shell.sh
$ python getNikkeiHeikin.py
>>Stocks: Market: Nikkei electronic version
Next, get the Nikkei Stock Average itself.
Open Nikkei > Markets > Stocks in your browser. The average is shown slightly below the top of the page. To extract it, we need to find where this value lives in the HTML, so right-click the Nikkei Stock Average figure and choose "Inspect". You should see something like this: the value sits in a span element with class="mkc-stock_prices".
Now that we know its position, let's actually extract and print it with BeautifulSoup.
getNikkeiHeikin.py
# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup

# URL to access
url = "http://www.nikkei.com/markets/kabu/"
# urlopen returns the HTML of the URL
html = urllib2.urlopen(url)
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Extract all span elements; find_all returns them as a list
# → [<span class="m-wficon triDown"></span>, <span class="l-h...
span = soup.find_all("span")
# Declare it up front so the print at the end does not fail if nothing is found
nikkei_heikin = ""
# Look through all span elements for the one whose class is "mkc-stock_prices"
for tag in span:
    # Elements with no class attribute make tag.get("class").pop(0) raise an
    # exception, so guard it with try/except
    try:
        # Extract the tag's first class name. Because multiple classes may be
        # set, get("class") returns a list, and pop(0) takes its first entry:
        # <span class="hoge foo"> → ["hoge", "foo"] → "hoge"
        string_ = tag.get("class").pop(0)
        # Check whether the extracted class name is "mkc-stock_prices"
        if string_ == "mkc-stock_prices":
            # It is, so take the string enclosed in the tag with .string
            nikkei_heikin = tag.string
            # Extraction done; leave the for loop
            break
    except:
        # No class attribute; do nothing
        pass
# Print the extracted Nikkei Stock Average
print nikkei_heikin
Result:
shell.sh
$ python getNikkeiHeikin.py
>>18,513.12
The code is explained in the comments. Put simply, the flow is:
1. Fetch the HTML of the page
2. Parse it with BeautifulSoup
3. Collect all span elements
4. Find the one whose class is mkc-stock_prices and take its string
This flow is not difficult and can be applied to most scraping situations, which is its main advantage. The caveat: if the span changes to a different element, or the content of the class changes, the value can no longer be found.
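As a side note, BeautifulSoup can also search by class directly, which removes the need for the loop and the try/except. A minimal sketch, using a made-up HTML fragment that mimics the structure rather than the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page structure; the live markup may differ
html = ('<span class="m-wficon triDown"></span>'
        '<span class="mkc-stock_prices">18,513.12</span>')
soup = BeautifulSoup(html, "html.parser")

# find() with the class_ keyword returns the first matching element, or None
tag = soup.find("span", class_="mkc-stock_prices")
nikkei_heikin = tag.string if tag is not None else ""
print(nikkei_heikin)  # 18,513.12
```

The loop version in the listing above is more general, but `find` with `class_` is the idiomatic shortcut when you only need the first match.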
Finally, write this result to CSV and repeat it every hour.
getNikkeiHeikin.py
# coding: UTF-8
import urllib2
from bs4 import BeautifulSoup
from datetime import datetime
import csv
import time

# Run forever
while True:
    # If the minute is anything other than 59, wait
    if datetime.now().minute != 59:
        # Wait 58 seconds rather than a full minute, to allow for timing drift
        time.sleep(58)
        continue
    # Open the csv in append mode here, because opening gets slower as the
    # file grows
    f = open('nikkei_heikin.csv', 'a')
    writer = csv.writer(f, lineterminator='\n')
    # It is minute 59; poll once a second until second 59, so we measure at
    # the correct time
    while datetime.now().second != 59:
        time.sleep(1)
    # The steps below can finish within the same second and run twice, so
    # wait one more second; this also lands us at the top of the hour
    time.sleep(1)
    # Build the record to write to the csv
    csv_list = []
    # Get the current time as year/month/day hour:minute:second
    time_ = datetime.now().strftime("%Y/%m/%d %H:%M:%S")
    # Put the time in the first column
    csv_list.append(time_)
    # URL to access
    url = "http://www.nikkei.com/markets/kabu/"
    # urlopen returns the HTML of the URL
    html = urllib2.urlopen(url)
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    # Extract all span elements; find_all returns them as a list
    span = soup.find_all("span")
    # Declare it up front so the print below does not fail if nothing is found
    nikkei_heikin = ""
    # Look through all span elements for the one whose class is "mkc-stock_prices"
    for tag in span:
        # Elements with no class attribute make tag.get("class").pop(0) raise
        # an exception, so guard it with try/except
        try:
            # Extract the tag's first class name; get("class") returns a list
            # and pop(0) takes its first entry
            string_ = tag.get("class").pop(0)
            # Check whether the extracted class name is "mkc-stock_prices"
            if string_ == "mkc-stock_prices":
                # It is, so take the string enclosed in the tag with .string
                nikkei_heikin = tag.string
                # Extraction done; leave the for loop
                break
        except:
            # No class attribute; do nothing
            pass
    # Print the extracted Nikkei Stock Average together with the time
    print time_, nikkei_heikin
    # Record the Nikkei average in the second column
    csv_list.append(nikkei_heikin)
    # Append the record to the csv
    writer.writerow(csv_list)
    # Close the file to avoid corruption
    f.close()
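The CSV-writing part can be tried on its own. A minimal sketch that appends one timestamped row (the price here is a placeholder value; in the real program it comes from the scrape):

```python
# coding: UTF-8
import csv
from datetime import datetime

# Placeholder value; in the real program this comes from the scrape above
nikkei_heikin = "18,513.12"
time_ = datetime.now().strftime("%Y/%m/%d %H:%M:%S")

# Open in append mode so each hourly run adds one row
f = open('nikkei_heikin.csv', 'a')
writer = csv.writer(f, lineterminator='\n')
writer.writerow([time_, nikkei_heikin])
f.close()
```

Because the value "18,513.12" contains a comma, csv.writer quotes the field automatically, which is one reason to use the csv module instead of writing the line by hand.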
To put the flow in words: wait until just before the top of the hour, fetch the page, extract the Nikkei Stock Average, and append the time and value to the CSV. That's all there is to it.
Left running, this accesses the site once an hour and records the Nikkei 225.
You can apply this pattern to all sorts of things. For example, you could add items to a cart at high speed (so-called scripting) during a sale on a certain long river in South America... I don't particularly recommend it.
See you!