In the previous post, I used urllib and requests to get the contents of a web page. Here, we will extract only the data we need from those contents. To be precise, this extraction work is called scraping.
There are two main methods for scraping.
One is to treat the HTML or XML as a plain character string and extract the necessary parts. For example, you can use the re module of the Python standard library to retrieve arbitrary strings fairly flexibly (a small sketch follows below).
The other is to use a dedicated scraping library, which is by far the most common method. There are multiple libraries that can scrape HTML and similar markup, and they make the work easy. Typical Python libraries for this include Beautiful Soup and lxml, both of which are used below.
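As a minimal sketch of the string-based method, the following uses only the standard library's re module; the HTML fragment and the pattern are made up for illustration.
import re
# A made-up HTML fragment for illustration
html = "<html><head><title>Sample Page</title></head><body></body></html>"
# Extract the text between <title> and </title> (non-greedy match)
match = re.search(r"<title>(.*?)</title>", html)
if match:
    print(match.group(1))
>>>Output result
Sample Page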
We will explain the terms used in XML and HTML using a page written in XML. XML is a markup language like HTML, but one that is more extensible. As a sample, let's get the code of the Yahoo! News XML (RSS) page.
import requests
r = requests.get("https://news.yahoo.co.jp/pickup/rss.xml")
print(r.text)
>>>Output result
<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:blogChannel="http://backend.userland.com/blogChannelModule" version="2.0">
<channel>
<title>Yahoo! News Topics - Major</title>
<link>https://news.yahoo.co.jp/</link>
<description>We deliver the latest headlines featured in Yahoo! JAPAN News Topics.</description>
<language>ja</language>
<pubDate>Thu, 06 Dec 2018 19:42:33 +0900</pubDate>
<item>
<title>Measures against "brain fatigue" that you are not aware of</title>
<link>https://news.yahoo.co.jp/pickup/6305814</link>
<pubDate>Thu, 06 Dec 2018 19:35:20 +0900</pubDate>
<enclosure length="133" url="https://s.yimg.jp/images/icon/photo.gif" type="image/gif">
</enclosure>
<guid isPermaLink="false">yahoo/news/topics/6305814</guid>
</item>
......(The following is omitted)......
In the output, there are descriptions like <title>Yahoo! News Topics - Major</title>. Here, title is called the element name, and the part enclosed between the start tag <title> and the end tag </title> is the element's content.
BeautifulSoup is a simple, easy-to-learn scraping library. I will continue explaining how to use it with the Yahoo! News XML page.
# Import libraries and modules
from bs4 import BeautifulSoup
import requests
# Get the Yahoo! News RSS data
r = requests.get("https://news.yahoo.co.jp/pickup/rss.xml")
# BeautifulSoup() cannot take a file name or URL directly, so pass the retrieved text
soup = BeautifulSoup(r.text, "xml")
Note that BeautifulSoup does not fetch the web page itself; it parses markup you have already retrieved. As the first argument, pass the HTML (or XML) as a str or bytes object. As the second argument, specify the parser. A parser is a program that performs parsing: here, the markup string is broken down element by element and converted into a form that is easy to work with.
The parsers that can be used with Beautiful Soup include the following; choose the right parser for your purpose.
- "html.parser": the Python standard library's HTML parser; no extra installation required
- "lxml": a fast HTML parser (requires the lxml package)
- "xml": an XML parser, also provided by the lxml package
- "html5lib": a lenient parser that handles broken HTML the way browsers do (requires the html5lib package)
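As a quick illustration of how the parser argument changes behavior, here is a minimal sketch that parses the same made-up fragment with two parsers; note the uppercase tag names in the fragment.
from bs4 import BeautifulSoup
# A made-up fragment for illustration
fragment = "<Item><Title>Example</Title></Item>"
# HTML parsers treat tag names as case-insensitive and normalize them to lowercase
print(BeautifulSoup(fragment, "html.parser").find("title"))
# The xml parser is case-sensitive, so the original tag names are kept
print(BeautifulSoup(fragment, "xml").find("Title"))
>>>Output result
<title>Example</title>
<Title>Example</Title>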
Now that you have specified the appropriate parser, the web page is parsed and ready. Let's extract a specific part of it.
There are several ways to specify which elements to acquire; here we will use the following two.
The find method: given a tag name, an attribute, or both, it returns only the first element in the parsed data that matches. The find_all method likewise returns all matching elements, in a list.
The select_one method: given a CSS selector, it returns only the first element in the parsed data that matches. The select method likewise returns all matching elements, in a list.
import requests
from bs4 import BeautifulSoup
# Get the Yahoo! News RSS data
r = requests.get("https://news.yahoo.co.jp/pickup/rss.xml")
# Parse as XML
soup = BeautifulSoup(r.text, "xml")
# Extract only the first title element
print(soup.find("title"))
print()
# Extract all title elements
print(soup.find_all("title"))
>>>Output result
<title>Yahoo! News Topics - Major</title>
[<title>Yahoo! News Topics - Major</title>,
<title>Go: Iyama wins 43rd title, setting a new record</title>,
<title>Are Mitsuki Takahata and Sakaguchi still dating?</title>,
<title>Mieko Hanada remarries a man 13 years her junior</title>,
....(The following is omitted)
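find and find_all can also filter on attributes, not just tag names. Here is a minimal sketch against the same feed, using the isPermaLink attribute seen in the sample output earlier.
import requests
from bs4 import BeautifulSoup
r = requests.get("https://news.yahoo.co.jp/pickup/rss.xml")
soup = BeautifulSoup(r.text, "xml")
# Find the first guid element whose isPermaLink attribute is "false"
guid = soup.find("guid", attrs={"isPermaLink": "false"})
print(guid)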
A CSS selector is the notation CSS uses to specify which elements to style. For example, if you specify "body > h1", you get the h1 elements that are direct children of the body element.
# (First half omitted)
# Extract only the first h1 element that is a direct child of the body element
print(soup.select_one("body > h1"))
print()
# Extract all h1 elements that are direct children of the body element
print(soup.select("body > h1"))
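Since the RSS feed used above has no body or h1 elements, here is a self-contained sketch of select_one and select on a made-up HTML string:
from bs4 import BeautifulSoup
# A made-up HTML string for illustration
html = "<body><h1>First heading</h1><h1>Second heading</h1></body>"
soup = BeautifulSoup(html, "html.parser")
# First h1 that is a direct child of body
print(soup.select_one("body > h1"))
# All h1 elements that are direct children of body
print(soup.select("body > h1"))
>>>Output result
<h1>First heading</h1>
[<h1>First heading</h1>, <h1>Second heading</h1>]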
In the information acquired in the previous sections, the tags themselves (for example, the h3 tags in the exercise) remained attached, because each element is put into the list together with its tags. With the text attribute, you can retrieve only the text inside each retrieved element.
import requests
from bs4 import BeautifulSoup
# Get the Yahoo! News RSS data
r = requests.get("https://news.yahoo.co.jp/pickup/rss.xml")
# Parse as XML
soup = BeautifulSoup(r.text, "xml")
# Extract all title elements
titles = soup.find_all("title")
# Take each element from the list with a for statement
# Using text, the tags are removed and only the text is output
for title in titles:
    print(title.text)
>>>Output result
Yahoo! News Topics - Major
Explosion investigated as possible gross negligence
NEWS's Koyama to leave "news every."
Mieko Hanada remarries a man 13 years her junior
...
So far we have scraped only a single web page, but in practice you will often scrape multiple pages, for example by following "next page" links.
To scrape multiple pages, you first need to collect the URLs of all the pages you want to scrape.
The exercise web page has page numbers at the bottom, and each number links to its page, so it is a good idea to collect those links. The URL of each link is described in the href attribute of the a element.
for url in soup.find_all("a"):
    print(url.get("href"))
import requests
from bs4 import BeautifulSoup
# Get Aidemy's exercise web page
authority = "http://scraping.aidemy.net"
r = requests.get(authority)
# Parse with lxml
soup = BeautifulSoup(r.text, "lxml")
# Get the <a> elements that link to the other pages
urls = soup.find_all("a")
# ----- Collect the URLs you want to scrape into a list -----
url_list = []
# Add the URL of each page to url_list
for url in urls:
    url = authority + url.get("href")
    url_list.append(url)
# Output the list
print(url_list)
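One design note: the code above builds each URL by simple string concatenation, which only works if every href is a path relative to the site root. The standard library's urllib.parse.urljoin handles both relative paths and absolute URLs correctly; a minimal sketch (the paths are made up for illustration):
from urllib.parse import urljoin
base = "http://scraping.aidemy.net"
# urljoin resolves relative paths against the base URL
print(urljoin(base, "/page2"))              # http://scraping.aidemy.net/page2
# Absolute hrefs are kept as-is instead of being mangled
print(urljoin(base, "http://example.com"))  # http://example.com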
In the previous section, we were able to list the URLs we wanted to scrape.
By repeating the scraping for each of the acquired URLs, you can get various pieces of information, such as the photo titles on the exercise pages.
Also, if you write the acquired information to a database or a file, it becomes available for further data processing (a small sketch of writing to a CSV file follows the code below).
import urllib.request
import requests
from bs4 import BeautifulSoup
# Get Aidemy's exercise web page
authority = "http://scraping.aidemy.net"
r = requests.get(authority)
# Parse with lxml
soup = BeautifulSoup(r.text, "lxml")
# Get the <a> elements that link to the other pages
urls = soup.find_all("a")
# ----- Collect the URLs you want to scrape into a list -----
url_list = []
# Add the URL of each page to url_list
for url in urls:
    url = authority + url.get("href")
    url_list.append(url)
# ----- Scrape the photo titles -----
# Define a scraping function
def scraping(url):
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    # The photo titles are in h3 elements
    photos = soup.find_all("h3")
    photos_list = []
    # Collect the text of each h3 element
    for photo in photos:
        photo = photo.text
        photos_list.append(photo)
    return photos_list

for url in url_list:
    print(scraping(url))
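As mentioned above, writing the acquired information to a file makes it available for later processing. Here is a minimal sketch, reusing url_list and the scraping function defined above, that saves the photo titles to a CSV file; the file name photos.csv is made up for illustration.
import csv
# Write every scraped title to photos.csv, one row per title
# (assumes url_list and scraping() from the code above)
with open("photos.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])  # header row
    for url in url_list:
        for title in scraping(url):
            writer.writerow([title])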