I made a method that automatically selects and draws an appropriate graph for a pandas DataFrame

Introduction

I often wonder which graph to use when visualizing data. So last time I summarized which graphs suit each combination of explanatory-variable and objective-variable types (Visualization method of data by explanatory variable and objective variable). While writing it, though, I thought, "I'll forget this soon!" So I wrote a method that determines the type of each variable automatically and draws a suitable graph.

Recap of the previous article

The appropriate seaborn method for each combination of explanatory-variable and objective-variable type (discrete or continuous) is shown below. For details, see the previous article linked above.

[Figure: sns_summary.png]
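The mapping can also be written down as a small lookup table. This is only a sketch: the names PLOT_FOR_TYPES and choose_plot are hypothetical, and the four entries are read off the four branches of visualize_data shown in the next section, not taken from the figure.

```python
# Lookup table: (explanatory is discrete?, objective is discrete?) -> plot.
# The entries mirror the four branches of visualize_data.
PLOT_FOR_TYPES = {
    (True, True): "countplot with hue",
    (True, False): "violinplot",
    (False, True): "histogram split by hue",
    (False, False): "jointplot",
}


def choose_plot(explanatory_is_discrete, objective_is_discrete):
    """Return the name of the plot used for this combination of types."""
    return PLOT_FOR_TYPES[(explanatory_is_discrete, objective_is_discrete)]


print(choose_plot(True, False))   # violinplot
print(choose_plot(False, False))  # jointplot
```

Keeping the rule in a table like this makes it easy to see at a glance which combination leads to which graph.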

The method

Below is the code for the method I wrote.

import matplotlib.pyplot as plt
import seaborn as sns

def visualize_data(data, target_col):
    """Plot every column of data against target_col with a graph suited to their types."""

    length=10
    subplot_size=(length, length/2)

    for key in data.keys():

        # The objective variable is not plotted against itself
        if key==target_col:
            continue

        if is_categorical(data, key) and is_categorical(data, target_col):

            # Discrete vs. discrete: overall counts, then counts split by the objective variable
            fig, axes=plt.subplots(1, 2, figsize=subplot_size)
            sns.countplot(x=key, data=data, ax=axes[0])
            sns.countplot(x=key, data=data, hue=target_col, ax=axes[1])
            plt.tight_layout()
            plt.show()

        elif is_categorical(data, key) and not is_categorical(data, target_col):

            # Discrete vs. continuous: counts, then the objective's distribution per category
            fig, axes=plt.subplots(1, 2, figsize=subplot_size)
            sns.countplot(x=key, data=data, ax=axes[0])
            sns.violinplot(x=key, y=target_col, data=data, ax=axes[1])
            plt.tight_layout()
            plt.show()

        elif not is_categorical(data, key) and is_categorical(data, target_col):

            # Continuous vs. discrete: overall histogram, then one histogram per class.
            # sns.distplot is deprecated, so histplot (seaborn 0.11 or later) is used instead.
            fig, axes=plt.subplots(1, 2, figsize=subplot_size)
            sns.histplot(data[key], ax=axes[0])
            sns.histplot(data=data, x=key, hue=target_col, ax=axes[1])
            plt.tight_layout()
            plt.show()

        else:

            # Continuous vs. continuous: scatter plot with marginal histograms
            sns.jointplot(x=key, y=target_col, data=data, height=length*2/3)
            plt.show()

The is_categorical helper is as follows.

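from pandas.api.types import is_float_dtype, is_integer_dtype

def is_categorical(data, key):
    """Return True if column key should be treated as a discrete quantity."""

    col_type=data[key].dtype

    # Comparing with the string 'int' only matches the platform's default
    # integer type, so the dtype family is tested instead.
    if is_integer_dtype(col_type):

        # Integers count as discrete only when they take 5 or fewer distinct values
        nunique=data[key].nunique()
        return nunique<6

    elif is_float_dtype(col_type):
        return False

    else:
        # Strings, booleans, and other types are treated as discrete
        return True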

In outline:

- Pass the data you want to visualize (a pandas.DataFrame) as data and the column name of the objective variable as target_col.
- The is_categorical method decides whether each explanatory variable and the objective variable are discrete or continuous, and each pair is drawn with the appropriate seaborn method.

When a column's data type is int, it is treated as continuous if it takes 6 or more distinct values and as discrete if it takes 5 or fewer. To be honest, there is room for improvement in this rule.
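One possible refinement, as a rough sketch and not part of the method above: make the threshold a parameter, and optionally require that the number of distinct values be small relative to the number of rows. The function name looks_discrete and the parameters max_unique and max_unique_ratio are hypothetical. To keep the rule easy to try on its own, it takes a plain list of non-missing values in place of a DataFrame column.

```python
def _is_number(value):
    """True for ints and floats, but not for bools (bool is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def looks_discrete(values, max_unique=5, max_unique_ratio=None):
    """Guess whether a column is discrete from its non-missing values.

    - any non-numeric value (str, bool, ...) -> discrete
    - numeric with at least one float        -> continuous
    - all ints                               -> discrete only if few distinct values
    """
    values = list(values)
    if not values or not all(_is_number(v) for v in values):
        return True
    if any(isinstance(v, float) for v in values):
        return False

    n_unique = len(set(values))
    if n_unique > max_unique:
        return False
    # Optional extra check (an assumption, not in the original rule):
    # distinct values must also be rare relative to the number of rows.
    if max_unique_ratio is not None and n_unique / len(values) > max_unique_ratio:
        return False
    return True


print(looks_discrete([0, 1, 1, 0]))                          # True
print(looks_discrete(list(range(10))))                       # False
print(looks_discrete([1, 2, 3, 4], max_unique_ratio=0.5))    # False
```

With a DataFrame, the values could be passed as data[key].dropna().tolist(). With the defaults, the result matches the rule described above: 5 or fewer distinct integers are discrete, 6 or more are continuous.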

Application

Let's apply it to the Titanic data (only part of the output is shown because it is long).

import pandas as pd

data=pd.read_csv("train.csv")
data=data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1) # Drop identifier-like columns

visualize_data(data, "Survived")

[Figure: visualize_data.png]

An appropriate graph was drawn automatically for each type!

Conclusion

I uploaded this to GitHub together with the previously posted "Method to get an overview of data with Pandas". Please use it! I want to automate various kinds of preprocessing in the future.
