Scraping is a technique for **searching and extracting arbitrary information from websites**. Besides retrieving data from the Web, it also lets you analyze a page's structure.
Before you start scraping, here are some things to check first and some things to keep in mind while working.

**Whether an API exists**: If the service provides an API, use it to get the data. If you still have problems, such as the API not exposing the data you need, then consider scraping.

**How the acquired data will be used**: Be careful when using the data you acquire. The data you collect is someone else's copyrighted work, so you need to make sure your use does not conflict with copyright law.
- Reproduction for private use (Article 30): http://www.houko.com/00/01/S45/048.HTM#030
- Reproduction for information analysis, etc. (Article 47-7): http://www.houko.com/00/01/S45/048.HTM#047-6
In addition, the following three rights are of particular concern.
The reproduction right is one of the rights included in copyright and is stipulated in Article 21 of the Copyright Act (Article 21: "The author shall have the exclusive right to reproduce his work."). Reproduction means copying in any form: recording audio or video, printing, photographing, photocopying, reading electronically with a scanner, and storing the result. Reference: https://www.jrrc.or.jp/guide/outline.html
The translation right and the adaptation right are copyright property rights stipulated in Article 27 of the Copyright Act. Article 27 states that "the author shall have the exclusive right to translate, arrange musically, transform, dramatize, cinematize, or otherwise adapt his work" (from the Copyright Research and Information Center, http://www.cric.or.jp/db/article/a1.html#021). Conversely, doing any of these without the author's permission is copyright infringement. Quote: http://www.iprchitekizaisan.com/chosakuken/zaisan/honyaku_honan.html
The public transmission right is a copyright property right stipulated in Article 23 of the Copyright Act, which states that "the author shall have the exclusive right to transmit his work publicly (including making it transmittable in the case of interactive transmission), and to communicate publicly, by means of a receiving apparatus, a work thus transmitted." Quote: http://www.iprchitekizaisan.com/chosakuken/zaisan/kousyusoushin.html
Beyond the points above, make sure the code you write does not overwhelm the target server when you actually scrape. Excessive access strains the server and can be treated as an attack; in the worst case, the service may become unavailable for a period of time. There has even been a case in Japan where a user was arrested after their access triggered a system failure, so stay within the bounds of common sense. https://ja.wikipedia.org/wiki/岡崎市立中央図書館事件
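As a concrete precaution, inserting a short pause between consecutive requests keeps the load on the server low. A minimal sketch (the URL list here is just a placeholder; about one second per request is a common rule of thumb, not a universal rule):

```python
import time

import requests

# Hypothetical list of pages to fetch; replace with your actual targets.
urls = [
    'http://example.com/page1',
    'http://example.com/page2',
]

for url in urls:
    response = requests.get(url)
    print(response.status_code)
    time.sleep(1)  # wait about one second so we don't hammer the server
```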
With the above in mind, let's move on.
It is useful to know the basics of HTML when practicing web scraping, because **you acquire data by specifying the tags used in the HTML (`<a>`, `<p>`, and so on)**.
Let me give you an example.
sample.html
```html
<html>
  <head>
    <title>neet-AI</title>
  </head>
  <body>
    <div id="main">
      <p>Click here for the neet-AI link</p>
      <a href="https://neet-ai.com">neet-AI</a>
    </div>
  </body>
</html>
```
If you open the above code in your browser, a page like this will appear.
Let's explain the HTML tags used on this page.
| Tag name | Description |
|---|---|
| `<html></html>` | Declares that this is HTML code. |
| `<head></head>` | Holds basic information about the page (character code, page title). |
| `<title></title>` | Represents the page title. |
| `<body></body>` | Represents the body of the page. |
| `<div></div>` | Has no meaning by itself, but is often used to group content into one block. |
| `<p></p>` | The text enclosed by this tag is rendered as one paragraph. |
| `<a></a>` | Represents a link to another page. |
There are many more tags than the ones described above. Look up each tag as you encounter it and check what it does.
Now that you understand HTML tags, let's move on to scraping. The basic procedure is: fetch the page's HTML, parse it, then search for tags and extract the data you want.
When web scraping with Python, we will use the following libraries:

- **Requests**: used to fetch web pages.
- **BeautifulSoup4**: parses the fetched page, searches for tags, and formats the data.

We will do web scraping using these two libraries.
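If they are not installed yet, they can usually be installed with `pip install requests beautifulsoup4 lxml` (lxml is the parser passed to BeautifulSoup in the examples below).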
Before scraping, you need to fetch the HTML of the web page in Python.
get_html.py
```python
import requests

response = requests.get('http://test.neet-ai.com')
print(response.text)
```
Let's explain each line.
```python
response = requests.get('http://test.neet-ai.com')
```

This line fetches the HTML from http://test.neet-ai.com. The fetched result goes into the variable response.
```python
print(response.text)
```

The response object cannot be handed to Beautiful Soup as-is; its text attribute holds the fetched HTML as a string, and that is what we will pass on.
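As a quick illustration of what the response object holds, here is a small sketch (the exact values depend on the server):

```python
import requests

response = requests.get('http://test.neet-ai.com')
print(response.status_code)  # e.g. 200 if the request succeeded
print(response.encoding)     # the encoding guessed from the response headers
print(response.text[:100])   # the first 100 characters of the HTML string
```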
Next, let's extract the page title from the HTML we fetched.

title_scraping.py
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com')
soup = BeautifulSoup(response.text, 'lxml')
title = soup.title.string
print(title)
```
Seeing is believing, so let's look at the program. The first part, up through fetching the page with requests.get(), is the same as the fetching program above. The scraping itself starts at the soup = line, so let's explain each line from there.
```python
soup = BeautifulSoup(response.text, 'lxml')
```

Here we prepare a variable called soup so that the fetched HTML data can be scraped. The 'lxml' in parentheses specifies the parser; it means **"convert response.text with a tool called lxml"**.
```python
title = soup.title.string
```

Once the HTML has been converted, you can extract specific data by writing it in BeautifulSoup's fixed notation. Read this line as: **search the soup variable for the title tag, and return the string inside the title tag**. If that is hard to follow programmatically, it may be easier to picture it intuitively in exactly those terms.
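To make the distinction concrete, compare accessing the tag itself with accessing only its string (a small sketch continuing from the program above):

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com')
soup = BeautifulSoup(response.text, 'lxml')
print(soup.title)         # <title>neet-AI</title>  (the whole tag)
print(soup.title.string)  # neet-AI  (only the text inside)
```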
There is not enough space to introduce more detailed notation here, so please refer to the BeautifulSoup documentation.
If you get the following results by running this program, you are successful.
```
neet-AI
```
First of all, the `<a>` tag is used to represent a link in HTML. This time we want the URL inside the a tag's href attribute, so the string notation from before won't work.
get_link.py
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com')
soup = BeautifulSoup(response.text, 'lxml')
link = soup.a.get('href')
print(link)
```
**You can get the link's href attribute by using a method called get().** The get() method is handy and will be used frequently from here on, so keep it in mind.
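For reference, get() has a convenient property worth knowing: it returns None when the attribute is missing, while dictionary-style access raises an error. A small sketch against the same page:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com')
soup = BeautifulSoup(response.text, 'lxml')

link_tag = soup.a
print(link_tag.get('href'))   # the URL, e.g. https://neet-ai.com
print(link_tag.get('class'))  # None, since this tag has no class attribute
print(link_tag['href'])       # dictionary-style access also works, but raises KeyError when missing
```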
The page we have used so far had only one a tag. So how do you scrape a page with multiple a tags? First, let's run the previous program on such a page, changing only the URL of the line that fetches the page.
link_get.py
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index2.html')
soup = BeautifulSoup(response.text, 'lxml')
link = soup.a.get('href')
print(link)
```
When you run it, only the neet-AI link is printed. This is because soup.a.get('href') extracts only the first a tag found. To extract all the a tags, write the following.
link_all_get.py
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index2.html')
soup = BeautifulSoup(response.text, 'lxml')
links = soup.findAll('a')
for link in links:
    print(link.get('href'))
```
Let's explain each line.
```python
links = soup.findAll('a')
```

Here **all the a tags are extracted and stored in a list called links**.
```python
for link in links:
    print(link.get('href'))
```

Since links is a list, you can process its elements one by one by looping over it with for. Calling the get() method on each link variable yields each URL. **Remember this pattern of collecting all the matching tags and then looping over them with for**; you will use it often from now on.
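The same pattern can also be written as a list comprehension, which is idiomatic Python for collecting values into a list:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index2.html')
soup = BeautifulSoup(response.text, 'lxml')

# Equivalent to the for loop above: gather every href into one list.
hrefs = [link.get('href') for link in soup.findAll('a')]
print(hrefs)
```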
To put it simply, the flow is: **scrape a URL from a page → request the page at that URL → scrape again**. If you know basic Python grammar, this is easy. Let's scrape https://test.neet-ai.com/index3.html to get a URL, then scrape the page at that URL, https://test.neet-ai.com/index4.html, to get the Twitter and Facebook links.
scraping_to_scraping.py
```python
import requests
from bs4 import BeautifulSoup

# First scraping
response = requests.get('http://test.neet-ai.com/index3.html')
soup = BeautifulSoup(response.text, 'lxml')
link = soup.a.get('href')

# Second scraping
response = requests.get(link)
soup = BeautifulSoup(response.text, 'lxml')
sns = soup.findAll('a')
twitter = sns[0].get('href')
facebook = sns[1].get('href')

print(twitter)
print(facebook)
```
**By chaining multiple requests and scraping steps like this, you can scrape across sites and pages.**
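When the request-then-parse pair repeats like this, it can be tidied into a small helper function. The name fetch_soup below is just something introduced here for illustration; a sketch:

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url):
    """Fetch a page and return it parsed as a BeautifulSoup object (sketch)."""
    response = requests.get(url)
    response.raise_for_status()  # stop early if the request failed
    return BeautifulSoup(response.text, 'lxml')

# First scraping: get the link, then follow it for the second scraping.
first = fetch_soup('http://test.neet-ai.com/index3.html')
second = fetch_soup(first.a.get('href'))
for link in second.findAll('a'):
    print(link.get('href'))
```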
So far, the tags have had no id or class attributes. On typical sites, however, id and class are set on tags to make web design easier and to improve the readability of the code. Having ids and classes does not make scraping much harder. On the contrary, it can make things easier when you want to say "I want to scrape only this content!"
index5.html
```html
<html>
  <head>
    <title>neet-AI</title>
  </head>
  <body>
    <div id="main">
      <a id="neet-ai" href="https://neet-ai.com">neet-AI</a>
      <a id="twitter" href="https://twitter.com/neetAI_official">Twitter</a>
      <a id="facebook" href="https://www.facebook.com/Neet-AI-1116273381774200/">Facebook</a>
    </div>
  </body>
</html>
```
For example, suppose you have a site like the one above. Looking at the a tags, you can see that every one of them has an id. If you want to get the Twitter URL in this case, you can write it like this.
twitter_scra.py
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index5.html')
soup = BeautifulSoup(response.text, 'lxml')
twitter = soup.find('a', id='twitter').get('href')
print(twitter)
```
You can get it easily by specifying the id name as the second argument of find().
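Incidentally, the tag name can also be omitted; searching by id alone works as well, since ids are unique within a page:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index5.html')
soup = BeautifulSoup(response.text, 'lxml')

# Equivalent: find the element whose id is "twitter", whatever its tag name.
twitter = soup.find(id='twitter').get('href')
print(twitter)
```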
Next, let's make it a class.
index6.html
```html
<html>
  <head>
    <title>neet-AI</title>
  </head>
  <body>
    <div id="main">
      <a class="neet-ai" href="https://neet-ai.com">neet-AI</a>
      <a class="twitter" href="https://twitter.com/neetAI_official">Twitter</a>
      <a class="facebook" href="https://www.facebook.com/Neet-AI-1116273381774200/">Facebook</a>
    </div>
  </body>
</html>
```
twitter_scra_clas.py
```python
import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index6.html')
soup = BeautifulSoup(response.text, 'lxml')
twitter = soup.find('a', class_='twitter').get('href')
print(twitter)
```
Note that it is **class_**, not **class**. This is because class is registered in advance as a reserved word (a word that has a special meaning in the language specification) in Python. To avoid the collision, the author of the BeautifulSoup library presumably appended an underscore.
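If the class_ spelling bothers you, CSS selectors offer an equivalent route via select_one(), which takes an ordinary CSS selector string. A sketch against the same page:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('http://test.neet-ai.com/index6.html')
soup = BeautifulSoup(response.text, 'lxml')

# 'a.twitter' means: an <a> tag whose class is "twitter".
twitter = soup.select_one('a.twitter').get('href')
print(twitter)
```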
The basics of web scraping so far used HTML pages designed to make scraping easy. However, **ordinary websites are not designed with scraping in mind, so they can have very complex structures**. Because of that complexity, you also need knowledge beyond scraping itself, such as how web pages typically behave. In the advanced part, you will be able to scrape fairly complicated sites once you get the hang of it, so let's build up that know-how there.
When scraping, there are many situations where you scrape a URL and then scrape the page at that URL. In such cases, **do not try to write the whole program at once; debug each scraping step as you build it**. If debugging shows that all the expected URLs are printed, then build the next scraping step on top. This probably holds for programming in general.
The following technique also comes in handy. Let's take Nifty News as an example.
For example, the IT category page supports paging. Let's actually press the "2" at the bottom to turn the page. Looking at the URL, it is now https://news.nifty.com/technology/2. Next, let's move to the third page: the URL becomes https://news.nifty.com/technology/3.
As anyone who has done server-side development knows, paged listings are usually built so that **the page number is put at the end of the URL or in a parameter, and changing it updates the page**. If you use this mechanism, **you can turn pages simply by replacing the number in the URL**. Try changing the trailing number to whatever you like; you should jump to that page (within limits).
Now, let's create a program that scrapes pages 1 through 10 of the results.
paging_scraping.py
```python
import requests
from bs4 import BeautifulSoup

for page in range(1, 11):
    r = requests.get("https://news.nifty.com/technology/" + str(page))
    r.encoding = r.apparent_encoding
    print(r.text)
```
That's all there is to it. This technique is useful when scraping search results or other serially numbered URLs.
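The URL assembly itself can be written several ways; string concatenation as above, % formatting, or an f-string all produce the same result:

```python
page = 3
url1 = "https://news.nifty.com/technology/" + str(page)  # concatenation, as above
url2 = "https://news.nifty.com/technology/%d" % page     # % formatting
url3 = f"https://news.nifty.com/technology/{page}"       # f-string (Python 3.6+)
print(url1 == url2 == url3)  # True
```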
In "Using the communication characteristics of the Web page" above, we scraped by manipulating the URL, but **if you scrape without knowing the limit, requests past the last page will return None or 404 data**. To prevent this, find out the page limit manually in advance and build it into your program.
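If checking the limit manually is impractical, one defensive pattern is to stop as soon as the server signals a missing page. This sketch assumes the site returns a 404 status past the last page, which varies from site to site:

```python
import requests

for page in range(1, 1000):  # generous upper bound as a safety cap
    r = requests.get("https://news.nifty.com/technology/" + str(page))
    if r.status_code == 404:
        # Past the last page; stop scraping here.
        break
    r.encoding = r.apparent_encoding
    print(r.text)
```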
Now that you have the basics and tips, let's actually scrape a large amount of data from a real website automatically.
Challenge: let's get past weather data from the Japan Meteorological Agency website, covering January 1, 2000 through December 31, 2003. http://www.data.jma.go.jp/obd/stats/etrn/view/daily_s1.php?prec_no=44&block_no=47662&year=2000&month=06&day=1&view=a2
Sample output:

```
>python ●●●●.py
29
1013.8
1018.2
1012.5
19:27
--
--
--
--
--
8.1
13.8
16:24
2.6
07:16
4.9
46
28
16:24
30
1013.6
1018.0
1013.2
00:05
--
--
--
--
--
9.0
13.1
12:16
5.3
02:51
4.6
41
27
21:50
```
Scraping is a success as long as you get the data you want, so the program does not have to be identical for everyone. Here, for reference, is the program I created.
weather_scraping.py
```python
import requests
from bs4 import BeautifulSoup

# The year and month can be set in the URL, so embed them with %s placeholders.
base_url = "http://www.data.jma.go.jp/obd/stats/etrn/view/daily_s1.php?prec_no=44&block_no=47662&year=%s&month=%s&day=1&view=a2"

# Loop over the years 2000 through 2003.
for year in range(2000, 2004):
    # Nested loop over the months January through December.
    for month in range(1, 13):
        # Embedding year and month lets the loop walk January 2000, February 2000, ...
        r = requests.get(base_url % (year, month))
        r.encoding = r.apparent_encoding

        # Scrape the target table.
        soup = BeautifulSoup(r.text, 'lxml')
        rows = soup.findAll('tr', class_='mtx')
        # The first few rows of the table are column headers, so slice them off.
        rows = rows[4:]

        # Each remaining row is one day, from the 1st to the last day of the month.
        for row in rows:
            data = row.findAll('td')
            # Each row contains several data cells; print them all.
            for d in data:
                print(d.text)
            print("")
```