Hello. I work as an accountant at an IT company. Since I'm at an IT company anyway, I wanted to put programming to use in my back-office work...! That's the spirit in which I tried Python.
- People who want to know about scraping
- People looking for something they can do with a little programming
- People in back-office departments considering whether to automate their work
- What can you do with scraping?
- How does scraping work?
- Sample ①
- Sample ② (application example)
From the acquired HTML, **arbitrary information can be extracted and processed**. Concretely, scraping lets you do things like the following:
- Automate the daily task of visiting the same website and posting its information to a CSV file
- Download every image that matches a specific keyword on social media
- Build a sales lead list by extracting companies that match a specific keyword
Roughly speaking, there are three steps: **acquiring information from the website → extracting the information → outputting the results**.
In this article, I will show how to scrape with Python. This time, I used the following two tools:
- The BeautifulSoup library (extracting information, outputting results)
- The urllib module (acquiring information from the website)
Specifically, the following processing is performed.
For specifying HTML elements, the find methods and the select methods are provided. This article focuses on the find methods:
- find(): returns only the first matching element
- findAll(): returns a list of all elements that match the conditions
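A quick way to see the difference (a minimal sketch; the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

# a tiny made-up HTML fragment with three list items
html = "<ul><li>A</li><li>B</li><li>C</li></ul>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find("li").text)                    # A (only the first match)
print([li.text for li in soup.findAll("li")])  # ['A', 'B', 'C'] (all matches)
```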
Let's get right into a sample.
① Download Python https://www.javadrive.jp/python/install/index1.html
② Install Beautiful Soup. It can be installed with pip:
```
$ pip install beautifulsoup4
```
③ Prepare urllib. No installation is needed for this one: urllib is part of the Python standard library, so it is available as soon as Python is installed. (Note that `pip install urllib3` would install urllib3, a separate third-party library that this article does not use.)
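You can confirm it is available with a quick check (this just imports the module and prints ok):

```
$ python -c "import urllib.request; print('ok')"
```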
I will try to get the title from our homepage.
sample.py
```python
import urllib.request          # make urllib available
from bs4 import BeautifulSoup  # make BeautifulSoup available

r = urllib.request.urlopen('https://www.is-tech.co.jp/')  # fetch the HTML of the page to work with
html = r.read().decode('utf-8', 'ignore')  # decode the HTML as UTF-8, ignoring undecodable bytes
parser = "html.parser"              # specify the parser for the HTML
soup = BeautifulSoup(html, parser)  # build the BeautifulSoup object
title = soup.find("title").text     # target the text inside the title tag
print(title)                        # display the text of the title tag
```
I was able to extract the title.
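For reference, the select methods mentioned earlier can do the same extraction with a CSS selector. A minimal sketch, using the same page as above:

```python
import urllib.request
from bs4 import BeautifulSoup

r = urllib.request.urlopen('https://www.is-tech.co.jp/')
soup = BeautifulSoup(r.read().decode('utf-8', 'ignore'), "html.parser")
print(soup.select_one("title").text)  # CSS-selector equivalent of soup.find("title").text
```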
Now suppose there is a fictitious website like the one shown below: a site that summarizes financial information for fictitious listed companies, one page per company. We will extract each company's sales for the last 3 terms and output them to a CSV file. The URL is https://www.example.com/NNNN (where NNNN is a 4-digit securities code).
example.com/NNNN.html
```html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Financial information</title>
</head>
<body>
<h1 class="company">Hogehoge Co., Ltd.</h1>
<table class="information">
<tr>
<td> </td>
<td>Previous term</td>
<td>2 terms ago</td>
<td>3 terms ago</td>
</tr>
<tr>
<td>Sales</td>
<td>1,200 million yen</td>
<td>1,100 million yen</td>
<td>1,000 million yen</td>
</tr>
<tr>
<td>Ordinary income</td>
<td>240 million yen</td>
<td>220 million yen</td>
<td>200 million yen</td>
</tr>
<tr>
<td>Net income</td>
<td>120 million yen</td>
<td>110 million yen</td>
<td>100 million yen</td>
</tr>
</table>
</body>
</html>
```
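Before writing the full script, it helps to confirm how the td cells are numbered. A minimal sketch that parses an abbreviated copy of the page above as a string:

```python
from bs4 import BeautifulSoup

# abbreviated copy of the fictitious page above: header row plus the sales row
html = """
<table class="information">
  <tr><td> </td><td>Previous term</td><td>2 terms ago</td><td>3 terms ago</td></tr>
  <tr><td>Sales</td><td>1,200 million yen</td><td>1,100 million yen</td><td>1,000 million yen</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
data = soup.find("table").findAll("td")  # all td cells, in document order
print(data[5].text)  # "1,200 million yen" -- the 6th td holds the previous term's sales
```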
Let's extend the earlier sample that extracted the title from a homepage with the following changes:
- Loop over the 4-digit securities codes (1000 to 9999)
- Do nothing if no company exists for a given securities code
- Also use the findAll() method, to specify an HTML element by position (the nth △△ inside ○○)
- Pause for 5 seconds after each request (to reduce the load on the server being accessed)
- Finally, output the results to a CSV file
sample2.py
```python
import urllib.request
from bs4 import BeautifulSoup
import csv   # CSV reading and writing
import time  # for pausing between requests

class Scraper:
    def __init__(self, site, code):  # constructor: store the site and the securities code
        self.site = site
        self.code = code

    def scrape(self):  # the scraping process
        url = str(self.site) + str(self.code)  # URL of the company's page
        r = urllib.request.urlopen(url)
        html = r.read().decode('utf-8', 'ignore')
        parser = "html.parser"
        soup = BeautifulSoup(html, parser)  # build the BeautifulSoup object
        company = soup.find("h1")  # get the company name
        # the fictitious site is assumed to return this message for unused codes
        if "Company information page not found" in company.text:
            pass  # no company for this securities code, so do nothing
        else:
            table = soup.find("table")  # table with the last 3 terms of financial data
            data = table.findAll("td")  # all td cells in the table, in document order
            sales = data[5].text.split(' million yen')[0]                   # previous term's sales (6th td)
            sales_two_years_ago = data[6].text.split(' million yen')[0]    # sales 2 terms ago (7th td)
            sales_three_years_ago = data[7].text.split(' million yen')[0]  # sales 3 terms ago (8th td)
            row = [self.code, company.text, sales, sales_two_years_ago, sales_three_years_ago]
            with open('test.csv', 'a', newline='') as f:  # append the row to the CSV
                writer = csv.writer(f)
                writer.writerow(row)
        time.sleep(5)  # pause for 5 seconds to reduce the load on the server

source = "https://www.example.com/"  # common part of the URL
for i in range(1000, 10000):  # loop over securities codes 1000 to 9999
    Scraper(source, i).scrape()
```
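For the fictitious page above, the row appended to test.csv would look something like this (1234 is a placeholder securities code; fields containing commas are quoted automatically by csv.writer):

```
1234,"Hogehoge Co., Ltd.","1,200","1,100","1,000"
```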
In this way, every company's information can be output to the CSV file.
I originally wrote this code to scrape Yahoo! Finance and automate the work, only to find that scraping is prohibited by its terms of use... When scraping, please **check the website's terms of use in advance**. Some sites make an API publicly available, in which case you can do the same processing through the API instead.
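As a rough illustration of the API route, a minimal sketch with urllib and json (the endpoint URL and the "sales" field name are made up for illustration; consult the real API's documentation):

```python
import json
import urllib.request

# hypothetical endpoint for one company's financials -- not a real API
url = "https://api.example.com/companies/1234/financials"
with urllib.request.urlopen(url) as r:
    data = json.loads(r.read().decode('utf-8'))

print(data["sales"])  # assumes the response is a JSON object with a "sales" field
```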