Extract only the article text from web pages in Python

A library that allows you to easily extract text from web pages

When you scrape a web page with Python, you often end up with HTML tags and other extra information that is of no use later.

In such cases, ***readability-lxml*** is all you need. This post explains how to use it.

First, install it.

(env)$ pip install readability-lxml

Create a utility module like the one below.

`utils.py`


# -*- coding: utf-8 -*-
import lxml.html
import readability


def get_content(html):
    """
    Return a (title, body text) tuple extracted from an HTML string.
    """
    # Let readability pick out the main article block.
    document = readability.Document(html)
    content_html = document.summary()
    # Remove HTML tags to get only the body text.
    content_text = lxml.html.fromstring(content_html).text_content().strip()
    short_title = document.short_title()
    return short_title, content_text
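
To make the tag-removal step easier to picture, here is a minimal sketch of the same idea using only the standard library. It is not what `utils.py` does (that relies on lxml's `text_content()`); the `TextExtractor` class and `strip_tags` function are hypothetical names introduced for illustration, and skipping the contents of `script` and `style` elements is a choice made here, not something taken from the article.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering an element whose text should not be collected.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        # Leaving a skipped element.
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside skipped elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks).strip()


def strip_tags(html):
    """Return the text of an HTML fragment with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_html = "<div><h1>Title</h1><p>Hello, <b>world</b>!</p></div>"
print(strip_tags(sample_html))
```

Note that the text nodes are simply concatenated, so text from adjacent elements runs together with no separator between them.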

Check that the title and body text can actually be retrieved with the utility function (I used an article from Yahoo News).

import utils
import requests
response = requests.get('https://headlines.yahoo.co.jp/hl?a=20191230-00000310-oric-ent')
title, content = utils.get_content(response.content)
print(title)
print(content)

Confirm that the article's title and body text are printed.
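
Depending on the page, the extracted body text may contain runs of blank lines and stray indentation left over from the HTML layout. The following is a small optional sketch for tidying such text before printing or storing it; `tidy_text` is a hypothetical helper that is not part of readability-lxml, and the sample string is made up for illustration.

```python
def tidy_text(text):
    """Collapse whitespace runs inside each line and drop empty lines."""
    # str.split() with no argument splits on any whitespace run,
    # so joining with a single space normalizes each line.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


raw = "  First paragraph.  \n\n\n   Second    paragraph.\n"
print(tidy_text(raw))
```

In the test script above, this could be applied as `tidy_text(content)` before printing.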

Change log

- 2019/12/31: Newly created
