Most wanted pandas functions (Part 02)

Here, I'm going to extend the last article I have written some time ago. You guys can read it in here Most wanted pandas functions (Part 01)

Applying a function. When dealing with the data, we often have to use functions. There are few ways to use a function over a DataFrame. We should avoid iterating approaches since it is slow going through the panda's rows. We can use .apply() in this case.

`pandas_funcs.ipynb`


# Add $ to values of the Fare column.
def add_dolor_mark(money_str):
    money_str = '$ '+str(money_str)
    return money_str

df['Fare'] = df['Fare'].apply(add_dolor_mark)
df.head()

pandas apply.png

Renaming columns. Sometimes it is necessary to rename columns of our DataFrame. We can pass dictionary {‘column_1’ : ’new_column_1’, ‘column_2’ : ’new_column_2’} to .rename() Let's try to rename Embarked into Port and Fare into Price.

`pandas_funcs.ipynb`


df.rename(columns = {'Embarked':'Port', 'Fare':'Price'}, inplace=True)
df.head()

Screen Shot 2021-01-15 at 15.57.56.png

Adding prefixes or suffixes to column names. In this case we can use add_prefix() and add_suffix()

`pandas_funcs.ipynb`


df = df.add_prefix('X_')
df.head()

Screen Shot 2021-01-15 at 16.01.50.png

Conditional replacing. “Where” can be used to replace the values in rows or columns based on a condition. The way where is works a bit different. It is crucial to identify the way “where” works. "Where" select values based on the condition, then the remaining values are replaced with the specified value. So "where" is behaving like a masking operation.

`pandas_funcs.ipynb`


# we need to replace survied people with True
df['X_Survived'] = df['X_Survived'].where(df['X_Survived'] == 0 , True)
df.head()

Melting This is a bit infamous function. But very useful. We can use melt to convert wide DataFrame to narrow ones. Let's assume we have measured the temperature of 5 cities and for a week. And the data is looking like this.

(we can create a sample DataFrame as follows)

`pandas_funcs.ipynb`


import pandas as pd

temp = {'City': ['Ofuna','Tokyo','Yokohama','Okayama', 'Tsukuba'],
        'Day_1': [22,23,21,20,19],
        'Day_2': [23,22,24,29,18],
        'Day_3': [25,24,23,22,23],
        'Day_4': [21,20,25,21,20],
        'Day_5': [20,19,25,22,21],
        }

df = pd.DataFrame(temp, columns = ['City', 'Day_1', 'Day_2', 'Day_3', 'Day_4', 'Day_5’])

`pandas_funcs.ipynb`


df = df.melt(id_vars=['City'])
df

Screen Shot 2021-01-15 at 16.55.44.png

Save as csv In some situations, we have to save processed DataFrame. In that case, we can use the to_csv function.

`pandas_funcs.ipynb`


df.to_csv('myfile.csv')

Query Among the several filtering methods query is another filtering method that we can quickly and easily access to the subsections of the DataFrame.

`pandas_funcs.ipynb`


#letes assume that we want to get list of age of 10 and 15
seek_age = [10,15]
df.query("X_Age in @seek_age")

Memory usage

How much memory is used by a pandas DataFrame is really beneficial when dealing with large DataFrames. So we can avoid trouble with dead kernel due to out of memory

`pandas_funcs.ipynb`


df.memory_usage()

For total Memory usage we can use,

`pandas_funcs.ipynb`


df.memory_usage().sum()

Profile report It generates reports from a pandas DataFrame. The pandas df.describe() function is handy but a little basic for good exploratory data analysis. pandas_profiling extends the capability for quick data analysis.

`pandas_funcs.ipynb`


from pandas_profiling import ProfileReport
profile = ProfileReport(df, title='Pandas Profiling Report', style={'full_width':True}, correlations={'kendall': False})
profile

Explode Let's think that our data set holds multiple entries in a single row. But for our analysis, we need to analyze them on separate rows. In that case, we can use the explode function. (First lets create example DataFrame)

`pandas_funcs.ipynb`


df_ = pd.DataFrame({'id':['a','b','c','d'],
                  'humidity %':[62,55,[29,39,81],77],
                  'day':[1,2,3,4]})
df_

`pandas_funcs.ipynb`


df_.explode('humidity %')

Crosstab It is used to group 2 or more variables and make a summary table. It is used to get insight of columns. Please note that crosstab() works with categorical data.

For example, let's think that we need to get how many people survived/not-survived according to their ports.

`pandas_funcs.ipynb`


#First lets replace True with 1 in X_Survived column.
df['X_Survived'] = df['X_Survived'].where(df['X_Survived'] == 0 , 1)

port_survived = pd.crosstab(df["X_Port"],df["X_Survived"])
port_survived

There are more many useful functions for analyzing data. I will introduce few more functions in the next article.

Hope this will help you....

This article was written by @nuwan, one of @qualitia_cdev. *This article is written by @nuwan a member of @qualitia_cdev.