[Python] How to deal with pandas read_html read error
I wondered whether pandas' "read_html" method could read HTML pages. When I tried it, I ran into one error after another ...
** ▼ 4 issues (errors) **
①「ImportError: lxml not found, please install it」
②「ImportError: html5lib not found, please install it」
③「ImportError: BeautifulSoup4 (bs4) not found, please install it」
④「ValueError: No tables found」
** ▼ Key points for the fix **
- You need to install three additional libraries before read_html works.
- read_html can only extract data from tables.
■ Example fix
** Install the following 3 libraries **, then run read_html on ** a page that contains a table **.
① Installing the libraries
pip install lxml
pip install html5lib
pip install bs4
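Before running read_html, it can help to confirm which of the three libraries Python can actually see. The following is a minimal sketch that uses only the standard library. `find_missing_parsers` is a hypothetical helper name, not part of pandas.

```python
import importlib.util

# Import names of the three parser libraries.
# Beautiful Soup 4 is imported under the name "bs4".
PARSER_MODULES = ("lxml", "html5lib", "bs4")


def find_missing_parsers(modules=PARSER_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing_parsers()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All parser libraries are installed.")
```

Anything reported as missing can then be installed with the pip commands above.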
② Fetching the data
import pandas as pd
url = 'https://stocks.finance.yahoo.co.jp/stocks/detail/?code=998407.O'
# read_html returns a list of DataFrames, one per table found on the page
df = pd.read_html(url)
URL: Nikkei Stock Average [998407]: Domestic Index-Yahoo! Finance
The data was retrieved without any problems.
### Supplement
** ▼ Why do you need to install all 3? **
By default, read_html first tries to parse the page with lxml.
If parsing with lxml fails, it falls back to Beautiful Soup 4 with html5lib as the parser.
html5lib parses very leniently and can repair incomplete HTML, but it is slow.
Even so, installing html5lib as well is recommended.
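The parser can also be chosen explicitly through the `flavor` argument of read_html. The sketch below tries lxml first and then bs4 (Beautiful Soup 4 with html5lib), which mirrors the default order in pandas. `read_tables_with_fallback` is a hypothetical helper, and the HTML string is made-up sample data.

```python
from io import StringIO


def read_tables_with_fallback(html_text, flavors=("lxml", "bs4")):
    """Try each flavor in order; return (flavor_used, tables) or (None, [])."""
    try:
        import pandas as pd
    except ImportError:
        # pandas itself is not installed, so nothing can be parsed.
        return None, []
    for flavor in flavors:
        try:
            # A fresh StringIO per attempt, so each parser reads from the start.
            tables = pd.read_html(StringIO(html_text), flavor=flavor)
        except (ImportError, ValueError):
            # ImportError: the parser library for this flavor is missing.
            # ValueError: this parser found no table.
            continue
        return flavor, tables
    return None, []


html = "<table><tr><th>index</th></tr><tr><td>TOPIX</td></tr></table>"
flavor, tables = read_tables_with_fallback(html)
print("Parsed with:", flavor, "| tables found:", len(tables))
```

Which flavor ends up being reported depends on the libraries installed in your environment.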
** ▼ What you can do with read_html **
It can only read `<table>` elements.
Within a table, it reads the rows (`<tr>`) and the cells (`<th>`, `<td>`).
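To see whether a piece of HTML contains anything read_html could pick up, you can count the relevant tags with the standard library. This is only a rough sketch: `TableCounter` and `count_table_tags` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Hypothetical helper: counts <table>, <tr>, <th>, and <td> start tags."""

    def __init__(self):
        super().__init__()
        self.counts = {"table": 0, "tr": 0, "th": 0, "td": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.counts:
            self.counts[tag] += 1


def count_table_tags(html_text):
    """Return how many table-related start tags appear in html_text."""
    counter = TableCounter()
    counter.feed(html_text)
    counter.close()
    return counter.counts


sample = (
    "<table><tr><th>Index</th><th>Value</th></tr>"
    "<tr><td>TOPIX</td><td>1</td></tr></table>"
)
print(count_table_tags(sample))
# {'table': 1, 'tr': 2, 'th': 2, 'td': 2}
print(count_table_tags("<p>No table on this page</p>"))
# {'table': 0, 'tr': 0, 'th': 0, 'td': 0}
```

A count of zero tables suggests that read_html has nothing to read on that page.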
Official page
Notes on HTML parsing library
## ■ Error details
## First error
import pandas as pd
url = 'https://www.yahoo.co.jp/'
df = pd.read_html(url)
#output
# ImportError: lxml not found, please install it
The message says that lxml is not installed.
### What is lxml?
A Python library for parsing HTML and XML.
It is very fast. However, parsing may fail because it only handles strict markup.
lxml installation
pip install lxml
## Second error
### What is html5lib?
After installing lxml, I ran the code again and got a different error ...
「ImportError: html5lib not found, please install it」
html5lib is a library that parses HTML5.
It parses more leniently than lxml and can automatically generate correct markup from invalid markup.
In exchange, it is slow.
It is used when lxml fails to parse the HTML.
html5lib installation
pip install html5lib
## Third error
### What is BeautifulSoup4 (bs4)?
After installing html5lib, I ran the code again and got yet another error ...
「ImportError: BeautifulSoup4 (bs4) not found, please install it」
BeautifulSoup4 (bs4) is a library that parses HTML and XML.
It is the main library used for parsing the page.
lxml and html5lib work as its backends when the tables are parsed.
BeautifulSoup4 (bs4) installation
pip install bs4
## Fourth error
After installing bs4, I ran the code again and got yet another error ...
「ValueError: No tables found」
Apparently, read_html can only retrieve table data.
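If you would rather get a readable hint than a traceback, the errors seen so far can be caught in one place. The following is a sketch under assumptions: `read_html_with_hint` is a hypothetical wrapper, the hint texts are my own wording, and the HTML string is made-up sample data.

```python
from io import StringIO


def read_html_with_hint(html_text):
    """Hypothetical wrapper: return (tables, hint); hint is None on success."""
    try:
        import pandas as pd
        return pd.read_html(StringIO(html_text)), None
    except ImportError as exc:
        # Raised when pandas or one of the parser libraries is not installed.
        return [], f"A required library is missing (install it with pip): {exc}"
    except ValueError as exc:
        # Raised when no table could be parsed from the HTML.
        return [], f"pandas could not find a table to parse: {exc}"


tables, hint = read_html_with_hint("<p>This page has no table.</p>")
print(len(tables), hint)
```

With this input the function returns an empty list and a hint. Which hint you see depends on what is installed in your environment.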
### Getting table data
For a page that contains tables, I tried the Nikkei Stock Average [998407] page on Yahoo! Finance.
https://m.finance.yahoo.co.jp/stock?code=998407.O
read_html
import pandas as pd
url = 'https://stocks.finance.yahoo.co.jp/stocks/detail/?code=998407.O'
df = pd.read_html(url)
df
Output results
[ 0 1 2 3
0 Nikkei Stock Average NaN 19389.43 Compared to the previous day+724.83(+3.88%),
0 1
0 What will happen to the stock price forecast? Tomorrow's Nikkei average,
0 1
0 Nikkei Stock Average NY Dow
1 TOPIX NASDAQ Composite
2 Jasdaq Index S & P 500
3 Hang Seng FTSE 100
4 Shanghai Composite DAX
5 Mumbai SENSEX 30 CAC 40,
0 \
0 Picte Gold(With H)Other returns(1 year)19.94%
1 Global SDGs Equity Fund Other Returns(1 year)5.27%
2 eMAXIS Slim Global Equity(All country)Other returns(1 year)3.91%
1
0 Pinebridge Capital Securities F(With H)Other returns(1 year)6.35%
1 Risk Control Global Asset Diversification Fund Other Returns(1 year)4.81%
2 US Stock Dividend Aristocrat(Settlement type four times a year)International stock returns(1 year)2.21% ,
0 \
0 Overall price increase rate 1.Shinto HLD+45.45% 2.Sugai+28.30% 3.Amaga...
1
0 TSE First Section Price increase rate 1.Kaneko species+19.95% 2.Segue G+18.28% 3.Kobayashi... ]