When you scrape web pages with Python, the extracted data often still contains HTML tags and other extra information that gets in the way of later analysis.
In such cases, readability-lxml is all you need. I will explain how to use it here.
(env)$ pip install readability-lxml
Create a utility module like the one below.
utils.py
# -*- coding:utf8 -*-
import lxml.html
import readability


def get_content(html):
    """
    Return a (title, text) tuple extracted from an HTML string.
    """
    document = readability.Document(html)
    # summary() returns the main article content as cleaned-up HTML.
    content_html = document.summary()
    # Remove HTML tags to get only the body text.
    content_text = lxml.html.fromstring(content_html).text_content().strip()
    short_title = document.short_title()
    return short_title, content_text
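To make the tag-removal step more concrete, here is a minimal sketch of what "strip the tags and keep only the text" means. It does not use readability-lxml or lxml at all: it is a stand-in built on the standard library's html.parser, and the names TextExtractor and strip_tags are hypothetical helpers introduced only for this illustration. It only approximates what text_content() does in utils.py and is not a replacement for it.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects every text node and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for the text found between tags.
        self._chunks.append(data)

    def text(self):
        # Join the pieces in document order and trim surrounding whitespace.
        return "".join(self._chunks).strip()


def strip_tags(html):
    """Hypothetical helper: return the text of an HTML string without tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text that is still buffered
    return parser.text()


if __name__ == "__main__":
    sample = "<div><h1>Title</h1><p>Hello <b>world</b>!</p></div>"
    print(strip_tags(sample))
```

Note that text from adjacent elements is joined without any separator, which is also why the text returned by get_content can run headings and paragraphs together.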
Test whether you can actually get the title and body text using the utility module (I used an article from Yahoo News).
import utils
import requests
obj = requests.get('https://headlines.yahoo.co.jp/hl?a=20191230-00000310-oric-ent')
title, content = utils.get_content(obj.content)
print(title)
print(content)
Confirm that the title and body text of the article are printed.
--2019/12/31 Newly created