I tried scraping Yahoo News with Python

Introduction

As a programming beginner, I tried scraping Yahoo News with Python, so I am writing down the procedure here. (This is mostly a memo for myself.) You need to install Beautiful Soup and requests in advance.

What to do

This time I will get the access ranking from Yahoo News. The "access ranking" appears at the bottom right of the image below. I will scrape the 1st through 5th places.

[Image: Screenshot 2021-01-02 12.19.31.png]

Code I wrote

The code is short, but it is enough to scrape the ranking. (Beautiful Soup is amazing.)

import requests
from bs4 import BeautifulSoup

# Download the Yahoo News top page
url = 'https://news.yahoo.co.jp/'
res = requests.get(url)

# Parse the HTML and locate the access ranking section
soup = BeautifulSoup(res.content, "html.parser")
ranking = soup.find(class_="yjnSub_section")
# Collect each ranked article inside that section
items = ranking.find_all("li",class_="yjnSub_list_item")
print("【Access Ranking】")
for i,item in enumerate(items):
    # Headline and link of one ranked article
    text = item.find("p", class_="yjnSub_list_headline")
    news_url = item.find("a")
    print(str(i+1) + "Rank:" + text.getText())
    print(news_url.get('href'))

Now run it in the terminal.

[Image: Screenshot 2021-01-02 12.34.18.png]

It worked as expected. (The ranking shifted a little between the two screenshots, but it is almost the same, so that is good enough!)
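The script above assumes that every lookup succeeds. As a rough sketch of a more defensive variant (the function names `extract_ranking` and `fetch_ranking` and the 10-second timeout are my own assumptions, not part of the original code), the parsing can be separated from the download so that a missing section returns an empty list instead of raising an error:

```python
import requests
from bs4 import BeautifulSoup


def extract_ranking(html):
    """Return a list of (headline, url) pairs from the ranking section.

    Hypothetical helper: returns an empty list when the section is missing.
    """
    soup = BeautifulSoup(html, "html.parser")
    ranking = soup.find(class_="yjnSub_section")
    if ranking is None:  # the class name may have changed on the live site
        return []
    results = []
    for item in ranking.find_all("li", class_="yjnSub_list_item"):
        headline = item.find("p", class_="yjnSub_list_headline")
        link = item.find("a")
        if headline is None or link is None:
            continue  # skip entries that do not have the expected shape
        results.append((headline.get_text(strip=True), link.get("href")))
    return results


def fetch_ranking(url="https://news.yahoo.co.jp/"):
    """Download the page and extract the ranking (performs a network request)."""
    res = requests.get(url, timeout=10)  # timeout value is an assumption
    res.raise_for_status()  # raise an exception on 4xx/5xx responses
    return extract_ranking(res.content)
```

Calling `fetch_ranking()` would then return the pairs to print, and `extract_ranking` can be tried on any saved HTML string without touching the network.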

Rough commentary

In Chrome, you can view a page's HTML by right-clicking and choosing "Inspect", so use that to find the tags and class names you need.

ranking = soup.find(class_="yjnSub_section")

The access ranking was inside a section tag with the class yjnSub_section, so this line gets that part.

items = ranking.find_all("li",class_="yjnSub_list_item")

The ranking variable still holds the news from 1st to 5th place as one block, so split it into a list and store it in the items variable. Each news item was in an li tag with the class yjnSub_list_item, so this line gets those.
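To make the difference between find and find_all concrete, here is a small sketch on a hand-written HTML string. The HTML is a hypothetical miniature of the page; only the class names come from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page; only the class names come from the article.
html = """
<section class="yjnSub_section">
  <ul>
    <li class="yjnSub_list_item">Article A</li>
    <li class="yjnSub_list_item">Article B</li>
  </ul>
</section>
<section class="other">
  <ul><li class="yjnSub_list_item">Outside the ranking</li></ul>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag, or None when nothing matches.
ranking = soup.find(class_="yjnSub_section")

# find_all() called on `ranking` searches only inside it and returns a list-like ResultSet.
items = ranking.find_all("li", class_="yjnSub_list_item")

print(len(items))                        # 2
print([li.get_text() for li in items])   # ['Article A', 'Article B']
print(soup.find(class_="no_such_class")) # None
```

Because find_all is called on ranking rather than on soup, the li outside the ranking section is not included.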

text = item.find("p", class_="yjnSub_list_headline")

Get the title of the article. The title was in a p tag with the class yjnSub_list_headline, so this line gets that part.

news_url = item.find("a")

The URL of the article was in the href attribute of the a tag, so this line gets the a tag.

print(str(i+1) + "Rank:" + text.getText())
print(news_url.get('href'))

At this point text and news_url still hold whole tags, including markup we do not need, so use text.getText() and news_url.get('href') to output only the headline text and the link. That completes the scraping.
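As a small illustration of what these two calls strip away, the sketch below uses a made-up li element (the URL and headline are placeholders, not real Yahoo News data) and prints the tag before and after:

```python
from bs4 import BeautifulSoup

# Made-up list item; the href and headline are placeholders.
html = (
    '<li class="yjnSub_list_item">'
    '<a href="https://example.com/articles/1">'
    '<p class="yjnSub_list_headline">Sample headline</p>'
    '</a></li>'
)
item = BeautifulSoup(html, "html.parser").find("li")

text = item.find("p", class_="yjnSub_list_headline")
news_url = item.find("a")

print(text)                   # <p class="yjnSub_list_headline">Sample headline</p>
print(text.getText())         # Sample headline
print(news_url.get("href"))   # https://example.com/articles/1
print(news_url.get("title"))  # None, because the attribute does not exist
```

getText() is an alias of get_text() in Beautiful Soup, and get() returns None for a missing attribute instead of raising an error.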

Finally

With Python, you can scrape a page this easily. If you are interested, please try it. However, note that some sites prohibit scraping, so check a site's terms before you scrape it.
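One way to check part of this programmatically is a site's robots.txt file. The sketch below uses the standard library's urllib.robotparser on a hypothetical robots.txt string (not any real site's rules) so that it runs without a network request. Note that robots.txt is only one signal; a site's terms of service may prohibit scraping even where robots.txt allows it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example needs no network access.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) reports whether the rules allow fetching the URL.
print(parser.can_fetch("*", "https://example.com/news/"))         # True
print(parser.can_fetch("*", "https://example.com/private/page"))  # False

# For a live site you would instead call parser.set_url(".../robots.txt")
# followed by parser.read(), which downloads the file.
```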
