Check the status of your data using pandas_profiling

Overview

If you work as a data engineer or maintain data, you can check data for inconsistencies with various tools or by querying it with SQL. I have been doing this a lot recently, especially when a new data integration starts and I need to look at what the data actually contains. That is where pandas_profiling comes in handy.

Installation method

pip install pandas-profiling[notebook]

How to use

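import pandas as pd
import pandas_profiling as pdp
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version.
from sklearn.datasets import load_boston

data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)

# correlations=None skips the correlation calculations.
profile = pdp.ProfileReport(df, correlations=None)
# Write the report to an HTML file for sharing.
profile.to_file("profile.html")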

I often just want to see how the data is distributed, so I pass an option that skips the correlation calculation. The report is also written to an HTML file so that it can be shared with other people.

Result

When you run this in a Jupyter notebook, a progress bar shows how far the processing has come. The finished report shows the state of each column. I care most about missing values, so it is very useful that the report shows both their count and their percentage.
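To make the missing-value figures concrete, here is a minimal sketch of how the same two numbers (count and percentage per column) could be computed with plain pandas. This is not how pandas_profiling works internally, only a rough way to cross-check a report. The sample DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are invented for this example.
df = pd.DataFrame(
    {
        "age": [25, np.nan, 31, 47],
        "city": ["Tokyo", "Osaka", None, None],
        "score": [0.5, 0.7, 0.9, 0.4],
    }
)

# isna() marks missing cells; sum() counts them per column,
# and mean() gives the fraction of rows that are missing.
missing = pd.DataFrame(
    {
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    }
)
print(missing)
```

If the numbers here disagree with the report for the same data, that is usually worth a closer look before trusting either one.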
