The HTML of a web page contains all kinds of information, and it is hard to pick out what you need by reading it yourself. So we use a library called Requests to fetch the HTML.
This time, we will learn how to use Requests by extracting the headlines of the articles in the domestic (national) news section of MSN Japan.
In [1] Import Beautiful Soup, Requests, and re
In[1]
from bs4 import BeautifulSoup
import requests
import re
In [2] Store the HTML of the page in the variable urlshutoku
In[2]
urlshutoku = requests.get("https://www.msn.com/ja-jp")
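As a side note, you can check whether the request actually succeeded before going on. This is an optional sketch using the standard Requests API:
urlshutoku.status_code          # 200 means the page was fetched successfully
urlshutoku.raise_for_status()   # raises an exception on a 4xx/5xx error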
In [3] Try to display the entire page
In[3]
urlshutoku.text
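If the full dump is overwhelming, one optional trick is to slice the string and preview only the beginning:
urlshutoku.text[:500]  # show only the first 500 characters of the HTML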
The output of In [3] is mostly noise, so this time we want to display only the information we need: the headlines. To do that, we first have to find out where the headline information lives in the HTML. That's where Google Chrome's developer tools come in.
First, right-click a headline and click Inspect (I). The developer tools panel then opens.
The information used for scraping is the HTML shown in the left pane of that panel. Make sure the headline you right-clicked is highlighted in blue. Next, look at the <a> tag that corresponds to the URL of the article headline. The other headlines look the same, so the href attribute of the <a> tag seems to be the clue.
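To see how BeautifulSoup reads such a tag, here is a minimal sketch with a made-up snippet (the markup below is hypothetical; the real MSN page will differ):
from bs4 import BeautifulSoup
# hypothetical headline markup modeled on the developer tools view
rei = BeautifulSoup('<a href="/ja-jp/news/national/example-article">Example headline</a>', "html.parser")
a = rei.find("a")
print(a["href"])   # -> /ja-jp/news/national/example-article
print(a.string)    # -> Example headline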
In [4] Parse the HTML with BeautifulSoup and html.parser
In[4]
soup = BeautifulSoup(urlshutoku.text,"html.parser")
In [5] Extract the domestic headlines with find_all
In[5]
midashi = soup.find_all(href=re.compile("/ja-jp/news/national"))
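As a side note, if the pattern ever matched elements other than links, the search could be narrowed to <a> tags explicitly (an optional variant of the same find_all call):
midashi = soup.find_all("a", href=re.compile("/ja-jp/news/national"))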
If you type midashi in the Jupyter Notebook, the headline information is displayed, but it also includes the URL information. Since that is hard to read as it is, let's display only the text.
In [6] Display only the headline text with a for statement and .string
In[6]
for ichiran in midashi:
    print(ichiran.string)
Now only the headlines are displayed.
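One caveat: .string returns None when a tag contains more than one child, so blank lines may show up in the output. If that happens, get_text() from the same BeautifulSoup API is a more robust alternative:
for ichiran in midashi:
    print(ichiran.get_text(strip=True))   # joins all the text inside the tag and trims whitespace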