As a programming beginner, I tried scraping Yahoo News with Python, so I am writing down the procedure here (mostly as a memo to myself). You need to install Beautiful Soup and requests in advance, for example with pip install beautifulsoup4 requests.
This time I will get the access ranking from Yahoo News. In the image, the "access ranking" is at the bottom right. We will scrape the 1st through 5th places.
The code is short, but it is enough to scrape. (Beautiful Soup is amazing.)
import requests
from bs4 import BeautifulSoup
url = 'https://news.yahoo.co.jp/'
res = requests.get(url)
soup = BeautifulSoup(res.content, "html.parser")
ranking = soup.find(class_="yjnSub_section")
items = ranking.find_all("li",class_="yjnSub_list_item")
print("【Access Ranking】")
for i,item in enumerate(items):
    text = item.find("p", class_="yjnSub_list_headline")
    news_url = item.find("a")
    print(str(i+1) + "Rank:" + text.getText())
    print(news_url.get('href'))
Now run it in the terminal. It worked properly. (The ranking had shifted a little in the meantime, but it is almost the same, so good enough!)
In Chrome, you can view a page's HTML by right-clicking and choosing Inspect, so use that to find the tags and classes you need.
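The class names in the script come from the Yahoo News markup at the time of writing, so they may change, and find returns None when nothing matches. Below is a minimal sketch of a more defensive variant, assuming the same class names as the script above. The names extract_ranking and fetch_ranking and the limit parameter are made up for this illustration.

```python
import requests
from bs4 import BeautifulSoup


def extract_ranking(html, limit=5):
    """Return up to `limit` (title, url) pairs from the ranking section."""
    soup = BeautifulSoup(html, "html.parser")
    ranking = soup.find(class_="yjnSub_section")
    if ranking is None:
        # The class was not found; the page layout may have changed.
        return []
    results = []
    for item in ranking.find_all("li", class_="yjnSub_list_item"):
        headline = item.find("p", class_="yjnSub_list_headline")
        link = item.find("a")
        if headline is None or link is None:
            continue  # skip entries that do not have the expected shape
        results.append((headline.get_text(strip=True), link.get("href")))
        if len(results) >= limit:
            break
    return results


def fetch_ranking(url="https://news.yahoo.co.jp/", limit=5):
    """Download the page and extract the ranking (hypothetical helper)."""
    res = requests.get(url, timeout=10)
    res.raise_for_status()  # raise an exception on 4xx/5xx responses
    return extract_ranking(res.content, limit)
```

Calling fetch_ranking() and looping over the result with enumerate(..., start=1) would print the same kind of output as the original script. An empty list means the expected classes were not found on the page.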
ranking = soup.find(class_="yjnSub_section")
The access ranking was inside a section tag with the class yjnSub_section, so we get that part.
items = ranking.find_all("li",class_="yjnSub_list_item")
The ranking variable still holds the whole ranking block, with all the news from 1st to 5th place in one piece, so we split it into a list and store it in the items variable. Each news entry was in an li tag with the class yjnSub_list_item, so we get those.
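To see the difference between find and find_all without hitting the real site, here is a small sketch that runs on a hand-written HTML string. The markup and the example.com URLs are assumptions that only imitate the structure described above. They are not copied from Yahoo News.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described above.
html = """
<section class="yjnSub_section">
  <ul>
    <li class="yjnSub_list_item"><a href="https://example.com/1"><p class="yjnSub_list_headline">First headline</p></a></li>
    <li class="yjnSub_list_item"><a href="https://example.com/2"><p class="yjnSub_list_headline">Second headline</p></a></li>
  </ul>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag (or None if there is no match).
ranking = soup.find(class_="yjnSub_section")

# find_all returns a list-like collection of every matching tag.
items = ranking.find_all("li", class_="yjnSub_list_item")

print(ranking.name)  # section
print(len(items))    # 2
```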
text = item.find("p", class_="yjnSub_list_headline")
Get the title of the article. It was in a p tag with the class yjnSub_list_headline, so we get that part.
news_url = item.find("a")
The URL of the article was in the a tag, so we get that part.
print(str(i+1) + "Rank:" + text.getText())
print(news_url.get('href'))
text and news_url still contain the surrounding tags, so use text.getText() and news_url.get('href') to output only the parts we need. With that, the scraping is done.
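The following sketch shows what "unnecessary parts" means here: printing the tag itself includes the markup, while getText() and get('href') return only the text and the attribute value. The HTML string is a hypothetical single entry written for this example.

```python
from bs4 import BeautifulSoup

# One hypothetical ranking entry, written by hand for illustration.
html = (
    '<li class="yjnSub_list_item">'
    '<a href="https://example.com/1">'
    '<p class="yjnSub_list_headline">First headline</p>'
    '</a></li>'
)

item = BeautifulSoup(html, "html.parser").find("li")
text = item.find("p", class_="yjnSub_list_headline")
news_url = item.find("a")

print(text)                  # <p class="yjnSub_list_headline">First headline</p>
print(text.getText())        # First headline
print(news_url.get("href"))  # https://example.com/1
```

get returns None when the attribute is missing, so it does not raise an error for an a tag without href.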
With Python, you can scrape this easily. If you are interested, please give it a try. Note, however, that some sites prohibit scraping.
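One way to check part of a site's scraping policy is its robots.txt file, which Python's standard urllib.robotparser module can read. The sketch below parses made-up rules offline so that it runs without network access. The rules and the example.com URLs are assumptions, not the policy of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(user_agent, url) reports whether the rules allow the URL.
allowed = parser.can_fetch("*", "https://example.com/news/1")
blocked = parser.can_fetch("*", "https://example.com/private/page")

print(allowed, blocked)  # True False
```

For a real site you would call set_url() with the address of its robots.txt and then read() instead of parse(). A site's terms of use may add restrictions that robots.txt does not express, so it is worth checking those as well.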