[Python] How to deal with pandas read_html read error

I wondered whether the pandas method "read_html" could read HTML pages. When I tried it, I ran into one error after another ...


** ▼ 4 issues (errors) **

①「ImportError: lxml not found, please install it」

②「ImportError: html5lib not found, please install it」

③「ImportError: BeautifulSoup4 (bs4) not found, please install it」

④「ValueError: No tables found」


** ▼ Key points **

- You need to install several libraries before you can use read_html.

- It can only retrieve data from tables.

■ Examples of countermeasures

** Install the following 3 libraries **, then call read_html on ** a page that contains a table **.

① Library installation


pip install lxml
pip install html5lib
pip install beautifulsoup4  # provides the bs4 module
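Before calling read_html, it can help to confirm which of the three libraries are actually importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `check_parsers` is made up for illustration and is not part of pandas.

```python
from importlib.util import find_spec

# Import names of the three parser libraries that read_html may need.
# The import name "bs4" is provided by the beautifulsoup4 package.
REQUIRED_MODULES = ("lxml", "html5lib", "bs4")


def check_parsers(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in modules}


status = check_parsers()
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any name reported as missing is a candidate for one of the ImportError messages listed above.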

② Data acquisition execution


import pandas as pd

url = 'https://stocks.finance.yahoo.co.jp/stocks/detail/?code=998407.O'
df = pd.read_html(url)  # returns a list of DataFrames, one per table found

URL: Nikkei Stock Average [998407]: Domestic Index-Yahoo! Finance

The data was retrieved without any problems.
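Note that read_html returns a list of DataFrames rather than a single DataFrame. The sketch below illustrates this offline, assuming lxml is installed. The inline HTML is hypothetical markup standing in for a downloaded page, wrapped in StringIO so that pandas treats it as a file-like object.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup standing in for a downloaded page with two tables.
html = """
<table>
  <tr><th>index</th><th>price</th></tr>
  <tr><td>Nikkei Stock Average</td><td>19389.43</td></tr>
</table>
<table>
  <tr><th>market</th></tr>
  <tr><td>TOPIX</td></tr>
  <tr><td>NY Dow</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # always a list, even for a single table

print(type(tables).__name__, len(tables))
first = tables[0]  # each element is an ordinary DataFrame
print(first)
```

Indexing into the list (here `tables[0]`) gives a DataFrame that can be processed as usual.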


### Supplement

** ▼ Why do you need to install all 3? **

By default, read_html first tries to parse the page with lxml. If that fails, it falls back to Beautiful Soup 4 with html5lib as the parser.

html5lib is a lenient parser that can repair incomplete HTML, but it is slow.

Even so, installing html5lib as well is recommended.
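The parser can also be chosen explicitly through the `flavor` parameter of read_html. The following is a small sketch, assuming all three libraries are installed. The inline HTML and the variable names are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical, well-formed markup so that both parsers succeed.
html = (
    "<table>"
    "<tr><th>name</th><th>value</th></tr>"
    "<tr><td>a</td><td>1</td></tr>"
    "</table>"
)

# flavor="lxml" uses lxml only; flavor="bs4" uses BeautifulSoup4 with html5lib.
fast = pd.read_html(StringIO(html), flavor="lxml")[0]
lenient = pd.read_html(StringIO(html), flavor="bs4")[0]

print(fast)
print(lenient)
```

For valid markup like this, both flavors should produce the same table, so the choice mainly matters for speed and for how broken pages are handled.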

** ▼ What you can do with read_html **

It only works on <table> elements and reads their child elements such as <tr>, <th>, and <td>.

Official page

Notes on HTML parsing library


## ■ Error details

First error


import pandas as pd

url = 'https://www.yahoo.co.jp/'
df = pd.read_html(url)

#output
# ImportError: lxml not found, please install it

The message says that lxml is not installed.

What is lxml?

A Python library that parses HTML and XML.

It is very fast. However, parsing may fail because it only gives reliable results for strictly valid markup.

lxml installation

pip install lxml

Second error

After installing lxml successfully, I ran the code again and got another error ...

「ImportError: html5lib not found, please install it」

What is html5lib?

A library that parses HTML5. It is more lenient than lxml and can automatically generate correct markup from invalid markup.

In exchange, it is slow.

pandas uses it when parsing with lxml fails.
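To see what "generating correct markup from invalid markup" means in practice, here is a short sketch that feeds deliberately broken HTML to Beautiful Soup with html5lib as the parser. It assumes both bs4 and html5lib are installed, and the broken snippet is invented for illustration.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: no closing </td>, </tr>, or </table> tags.
broken = "<table><tr><td>TOPIX<td>NY Dow"

soup = BeautifulSoup(broken, "html5lib")  # html5lib builds a repaired tree
repaired = str(soup.table)

print(repaired)
```

The printed table has its missing closing tags filled in and a tbody element added, following the same rules a browser uses to recover from broken HTML.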

Install html5lib

pip install html5lib


Third error

After installing html5lib successfully, I ran the code again and got a new error ...

「ImportError: BeautifulSoup4 (bs4) not found, please install it」

What is BeautifulSoup4 (bs4)?

BeautifulSoup4 (bs4) is a library that parses HTML and XML.

It is the standard choice for page parsing. It uses lxml or html5lib as its backend parser.

Installation of BeautifulSoup4 (bs4)

pip install beautifulsoup4  # provides the bs4 module

Fourth error

When I successfully installed bs4 and ran the code again, I got a new error ...

「ValueError: No tables found」

Apparently, read_html can only retrieve table data.
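If a page may or may not contain a table, the ValueError can be caught instead of letting the script stop. The following is a minimal sketch, assuming all three parser libraries are installed. The helper name `read_tables_or_empty` and the sample markup are hypothetical.

```python
from io import StringIO

import pandas as pd


def read_tables_or_empty(html):
    """Return the tables found in html, or an empty list if there are none."""
    try:
        return pd.read_html(StringIO(html))
    except ValueError as exc:
        # Only swallow the "No tables found" case; re-raise anything else.
        if "No tables found" not in str(exc):
            raise
        return []


print(len(read_tables_or_empty("<div><p>no table here</p></div>")))
print(len(read_tables_or_empty("<table><tr><td>1</td></tr></table>")))
```

Checking the message before swallowing the exception keeps other ValueErrors visible, which seems safer than a bare except.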

Get table data

For a site that has table data, I tried the Nikkei Stock Average [998407] page on Yahoo! Finance.

https://m.finance.yahoo.co.jp/stock?code=998407.O


import pandas as pd

url = 'https://stocks.finance.yahoo.co.jp/stocks/detail/?code=998407.O'
df = pd.read_html(url)

df

Output results


[        0   1         2                   3
0 Nikkei Stock Average NaN 19389.43 Compared to the previous day+724.83(+3.88%),
       0             1
0 What will happen to the stock price forecast? Tomorrow's Nikkei average,
               0         1
0 Nikkei Stock Average NY Dow
1 TOPIX NASDAQ Composite
2 Jasdaq Index S & P 500
3 Hang Seng FTSE 100
4 Shanghai Composite DAX
5 Mumbai SENSEX 30 CAC 40,
                                              0  \
0 Picte Gold(With H)Other returns(1 year)19.94%   
1 Global SDGs Equity Fund Other Returns(1 year)5.27%   
2 eMAXIS Slim Global Equity(All country)Other returns(1 year)3.91%   
 
                                       1  
0 Pinebridge Capital Securities F(With H)Other returns(1 year)6.35%  
1 Risk Control Global Asset Diversification Fund Other Returns(1 year)4.81%  
2 US Stock Dividend Aristocrat(Settlement type four times a year)International stock returns(1 year)2.21%  ,
                                                    0  \
0 Overall price increase rate 1.Shinto HLD+45.45% 2.Sugai+28.30% 3.Amaga...   
 
                                                    1  
0 TSE First Section Price increase rate 1.Kaneko species+19.95% 2.Segue G+18.28% 3.Kobayashi...  ]
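Because the result is a list containing every table on the page, it is often convenient to narrow it down. One option is the `match` parameter of read_html, which keeps only tables whose text matches a string or regular expression. The sketch below assumes lxml is installed and uses invented inline HTML modeled loosely on the output above, so it does not depend on the live page.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup with two tables, loosely modeled on the page output.
html = """
<table>
  <tr><td>Nikkei Stock Average</td><td>19389.43</td></tr>
</table>
<table>
  <tr><td>Nikkei Stock Average</td><td>NY Dow</td></tr>
  <tr><td>TOPIX</td><td>NASDAQ Composite</td></tr>
</table>
"""

all_tables = pd.read_html(StringIO(html))
# Keep only the tables that contain the text "TOPIX".
index_tables = pd.read_html(StringIO(html), match="TOPIX")

print(len(all_tables), len(index_tables))
print(index_tables[0])
```

On a real page the text passed to `match` would need to be something that appears only in the table of interest.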
