A memo of basic Pandas usage

Pandas is essential for machine learning work in Python, but I often forget how to use it, so I wrote down how to use the functions I reach for most often. I plan to cover other Pandas operations I learn in separate articles. I hope this is helpful for people who have just started using Pandas and for those who want a quick refresher on a particular operation.

Since this is a beginner's memorandum, some of the content may be incorrect. If you find any mistakes, I would appreciate it if you could let me know.

Operating environment

python 3.7.4
pandas 0.24.0

Operation contents introduced in the article

The following operation methods are described in this article.

Basic operation of Pandas
- Import Library
- Read csv file
- Export to csv file
- Check data type
- Display the number of data
- Check the number of missing data
- Check the basic statistics of the data
- Perform one-hot-encoding for category data
- Add label and data
- Delete Label
- Fill missing data with specified value

Use Pandas conveniently
- Extract data by specifying conditions
- Change data by specifying conditions (a warning is issued, so improvement is required)
- Perform processing for groups with groupby

Summary

Here are some of the Pandas features I use most often. The following two work for the time being, but my understanding of them is still vague, so I will investigate and summarize them at another time.
- Change data by specifying conditions
- Perform processing for groups with groupby

1. Basic operation of Pandas

Import the Pandas library.

import pandas as pd

Use the read_csv method to read the csv file as a DataFrame object. This time, I'm reading a file called "student.csv" in my working directory.

data = pd.read_csv("student.csv")
display(data.head(5))

[Screenshot: output of data.head(5)]

[Supplement 1: Reading a csv without a header] If the csv file does not contain a header row ("sex", "age", "height", "weight"), the first data row (NaN, 13, 151.7, 59.1) is read as the header. To prevent this, specify header=None.
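As a minimal sketch of the difference, the snippet below reads the same headerless text with and without header=None. The in-memory csv_text and the use of io.StringIO are stand-ins I made up for this example (a real file such as student.csv would be passed by name instead), and the names argument is an optional extra for supplying the labels.

```python
import io

import pandas as pd

# Hypothetical headerless CSV content, standing in for a file on disk.
csv_text = ",13,151.7,59.1\nmale,14,160.2,50.3\n"

# Default behaviour: the first data row is consumed as the header.
wrong = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data; names supplies the labels.
right = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["sex", "age", "height", "weight"],
)

print(len(wrong))  # one row fewer than the text contains
print(right)
```

If names is omitted, pandas numbers the labels 0, 1, 2, ... instead.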

To write the DataFrame object to a csv file, use the to_csv method. In the example, it is saved in the working directory with the file name "student_out.csv".

data.to_csv("student_out.csv", index=False)

Specify index=False to avoid writing the index (the row labels) to the file. If you are not sure what the index is, compare with a csv file generated without index=False.
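To see what index=False changes without opening a file, here is a small sketch. It relies on to_csv returning the csv text as a string when no path is given; the two-row DataFrame is made up for illustration.

```python
import pandas as pd

# Small made-up DataFrame for illustration.
df = pd.DataFrame({"age": [13, 14], "height": [151.7, 160.2]})

# With no path, to_csv returns the csv text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)     # has an extra unnamed first column holding the index
print(without_index)  # only the "age" and "height" columns
```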

To see the types of data contained in a DataFrame, look at the dtypes attribute of the DataFrame object.

display(data.dtypes)

The result is as follows.

[Screenshot: output of data.dtypes]

To get the data type of a single label, do as follows.

display(data["age"].dtype)

[Screenshot: output of data["age"].dtype]
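The following sketch shows how the reported types depend on the data, using a made-up DataFrame rather than student.csv. One thing I find easy to forget is that a label of integers containing a missing value is stored as a float type, because NaN itself is a float.

```python
import numpy as np
import pandas as pd

# Made-up data: "age" is complete, "age_with_gap" has one missing value.
df = pd.DataFrame({
    "age": [13, 14, 15],
    "age_with_gap": [13, np.nan, 15],
    "height": [151.7, 160.2, 158.0],
})

print(df.dtypes)        # types of all labels at once
print(df["age"].dtype)  # type of a single label
```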

Use the count method to display the number of data. There are 1000 rows in total, but the count for "sex" is less than 1000 because missing data is not counted.

display(data.count())

[Screenshot: output of data.count()]

[Supplement 1: Get the number of data for a single label] To get the number of data for a single label, do as follows.

display(data["sex"].count())

[Screenshot: output of data["sex"].count()]
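A small sketch of the difference between the number of rows and what count reports, on made-up data in which two of the "sex" values are missing:

```python
import numpy as np
import pandas as pd

# Made-up data: two of the four "sex" values are missing.
df = pd.DataFrame({
    "sex": ["male", np.nan, "female", None],
    "age": [13, 14, 15, 16],
})

total_rows = len(df)           # counts every row, missing or not
non_missing = df.count()       # per-label count that skips missing values
sex_count = df["sex"].count()  # the same count for a single label

print(total_rows)
print(non_missing)
print(sex_count)
```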

To check the number of missing data, use the isnull method and the sum method.

display(data.isnull().sum())

[Screenshot: output of data.isnull().sum()]

[Supplement 1: How the isnull method works] According to the official documentation, the isnull method returns a DataFrame object of the same size as the original DataFrame, in which None and numpy.NaN are mapped to True and everything else is mapped to False.

display(data.isnull().head(5))

[Screenshot: output of data.isnull().head(5)]

[Supplement 2: How the sum method works] The sum method returns the sum along the specified axis. In Python, True is treated as 1 and False as 0, so the total is the number of True values (the number of missing data). The following is a quote from Reference 4. The quote is from the Python 3.8.1 documentation; I think it is the same for other versions, but I have not confirmed it.

Boolean values are the two constant objects False and True. They are used to represent truth values (although other values can also be considered false or true). In numeric contexts (for example, when used as the argument to an arithmetic operator), they behave like the integers 0 and 1, respectively. The built-in function bool() can be used to convert any value to a Boolean, if the value can be interpreted as a truth value (see section Truth Value Testing above).
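The two steps can be checked separately on a small made-up DataFrame: first the True/False table from isnull, then the sums. The axis=1 call is an extra I added to show counting per row instead of per label.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values given as both np.nan and None.
df = pd.DataFrame({
    "sex": ["male", np.nan, None],
    "age": [13.0, np.nan, 15.0],
})

mask = df.isnull()                  # same shape as df, True where missing
missing_per_label = mask.sum()      # True counts as 1, False as 0
missing_per_row = mask.sum(axis=1)  # sum across labels instead

print(True + True)                  # plain Python: booleans add like integers
print(missing_per_label)
print(missing_per_row)
```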


To see summary statistics of the data contained in the DataFrame, use the describe method. The describe method ignores NaN when computing them.

display(data.describe(include="all"))

[Screenshot: output of data.describe(include="all")]

[Supplement 1: Aggregating non-numerical data] By default, only numerical data is aggregated, so include="all" is specified here to aggregate "sex" as well. Note that the statistics reported for numerical data differ from those reported for other data.
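A sketch of the difference on made-up data: the default call reports only the numerical label, while include="all" also reports the non-numerical one, with rows such as unique, top and freq that do not apply to numbers.

```python
import numpy as np
import pandas as pd

# Made-up data: one text label with a missing value, one numerical label.
df = pd.DataFrame({
    "sex": ["male", "female", "male", np.nan],
    "age": [13, 14, 15, 16],
})

numeric_only = df.describe()             # default: numerical labels only
everything = df.describe(include="all")  # also covers non-numerical labels

print(numeric_only)
print(everything)
```

Statistics that do not apply to a label, such as the mean of "sex", appear as NaN in the combined table.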

Reference 1: pandas official documentation, describe()

Use the get_dummies method to perform one-hot-encoding. The following is an example of performing one-hot-encoding for "sex".

# Perform one-hot encoding.
dummydf_sex = pd.get_dummies(data, columns=["sex"], dummy_na=True)

# Original data
display(data.head(5))
# One-hot-encoded data
display(dummydf_sex.head(5))

[Screenshot: output of data.head(5) and dummydf_sex.head(5)]

In this way, one-hot encoding creates a new label for each value of "sex" ("male", "female", NaN), namely "sex_male", "sex_female" and "sex_nan", and uses 1 and 0 to show what the original value was.

[Supplement 1: Treat missing data as a label] By default, missing data (NaN) is ignored, but the fact that data is missing is also useful information, so dummy_na=True is passed to the get_dummies method so that NaN is also one-hot encoded as a value of its own.
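The effect of dummy_na can be seen by running get_dummies both ways on a made-up three-row DataFrame. Depending on the pandas version, the new labels hold 0/1 or False/True; the sketch only counts them, so it does not rely on either.

```python
import numpy as np
import pandas as pd

# Made-up data: the third "sex" value is missing.
df = pd.DataFrame({"sex": ["male", "female", np.nan], "age": [13, 14, 15]})

# Default: no label is created for the missing value.
default_dummies = pd.get_dummies(df, columns=["sex"])

# dummy_na=True: an extra "sex_nan" label marks the missing rows.
with_nan = pd.get_dummies(df, columns=["sex"], dummy_na=True)

print(list(default_dummies.columns))
print(list(with_nan.columns))
print(int(with_nan["sex_nan"].sum()))  # number of rows where "sex" was missing
```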

[reference]

pandas official documentation: get_dummies()

Try adding new labels and data to the DataFrame object. Let's create a BMI label as an example. An easy way is to (step 1) create a list of data and (step 2) add it as a new label, as shown below.

# Step 1: Create a list of BMI values (weight divided by height in metres squared). This step is plain Python, not a Pandas operation.
bmi = [ w / (h / 100)**2 for w, h in zip(data["weight"], data["height"]) ]

# Step 2: Add the list to the DataFrame as the data for the "bmi" label.
data["bmi"] = bmi

Let's display the result.

display(data.head(5))

[Screenshot: output of data.head(5) with the "bmi" label]

You can see that the "bmi" label and data have been added to the DataFrame.
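As an alternative to building a list first, the same kind of calculation can be written directly on the labels, since arithmetic on a pandas Series is applied element by element. This is only a sketch on made-up values, assuming height is in centimetres and weight in kilograms, with BMI taken as weight divided by height in metres squared.

```python
import pandas as pd

df = pd.DataFrame({
    "height": [150.0, 160.0, 200.0],  # cm (assumed unit)
    "weight": [45.0, 64.0, 80.0],     # kg (assumed unit)
})

# BMI = weight [kg] / (height [m]) ** 2, computed for every row at once.
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

print(df)
```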

[Supplement 1: How to use the assign method] You can also add a label with the assign method. The assign method also accepts a function that computes the new data from the DataFrame. Let's try adding a label for the proper weight (proper_weight).

data = data.assign(proper_weight = lambda x : (x.height / 100.0)**2 * 22)
display(data.head(5))

[Screenshot: output of data.head(5) with the "proper_weight" label]
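One point worth checking is that assign returns a new DataFrame and leaves the original alone, which is why its result has to be stored somewhere. A sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({"height": [150.0, 200.0], "weight": [45.0, 80.0]})

# The lambda receives the DataFrame and returns the values of the new label.
result = df.assign(proper_weight=lambda x: (x["height"] / 100.0) ** 2 * 22)

print(list(df.columns))      # the original is unchanged
print(list(result.columns))  # the returned DataFrame has the new label
```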


Delete a label and the data it contains. To delete the "bmi" label added above, do as follows.

#Remove label
data.drop(columns=["bmi"], inplace=True)

Let's display the result.

[Screenshot: data.head(5) after deleting the "bmi" label]

[Supplement 1: Reflect changes in the original DataFrame] By default, the drop method returns a new DataFrame object and makes no changes to the original DataFrame object. You can apply the change to the original DataFrame object by specifying inplace=True.
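The two behaviours of drop can be compared side by side on a made-up DataFrame. Note that with inplace=True the method returns None, so its result should not be assigned back to the variable.

```python
import pandas as pd

df = pd.DataFrame({"age": [13, 14], "bmi": [20.0, 25.0]})

# Default: a new DataFrame is returned and df keeps its "bmi" label.
dropped = df.drop(columns=["bmi"])
columns_after_default = list(df.columns)

# inplace=True: df itself is modified and the return value is None.
returned = df.drop(columns=["bmi"], inplace=True)

print(columns_after_default)
print(list(dropped.columns))
print(returned, list(df.columns))
```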

[reference]

pandas official documentation: drop()

Use the fillna method to fill the missing data with the specified value. Let's fill in the missing part of "sex" with "unknown".

data.fillna(value={"sex": "unknown"}, inplace=True)

Let's check the result.

display(data.head(5))

[Screenshot: output of data.head(5) after fillna]

I was able to confirm that "unknown" was entered in the missing part of sex.
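Passing a dict to fillna fills only the labels named in it. The sketch below, on made-up data, shows that a missing value in another label is left as it is. It uses the returned DataFrame rather than inplace=True so that the before and after can be compared.

```python
import numpy as np
import pandas as pd

# Made-up data: the second row is missing both "sex" and "age".
df = pd.DataFrame({"sex": ["male", np.nan], "age": [13.0, np.nan]})

# Only "sex" is named in the dict, so only "sex" is filled.
filled = df.fillna(value={"sex": "unknown"})

print(filled)
```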

[reference]

pandas official documentation: fillna()

2. Use Pandas conveniently

Specify a condition and extract the data that matches it. As an example, let's create a new DataFrame object (data_over) by extracting the data whose weight exceeds the proper weight (proper_weight).

data_over = data[data.weight > data.proper_weight]
display(data_over.head(5))

[Screenshot: output of data_over.head(5)]
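What happens inside the brackets is that the comparison produces a Series of True/False values, and only the rows marked True are kept. The sketch below, on made-up values, shows that intermediate Series, and also how two conditions can be combined with & (each condition needs its own parentheses). The second condition on height is my own addition for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "weight": [60.0, 50.0, 80.0],
    "proper_weight": [49.5, 56.32, 63.58],
})

# Step 1: the comparison gives one True/False value per row.
mask = df["weight"] > df["proper_weight"]

# Step 2: indexing with the mask keeps only the True rows.
over = df[mask]

# Two conditions combined with & (use | for "or").
tall_and_over = df[(df["weight"] > df["proper_weight"]) & (df["height"] >= 165)]

print(mask)
print(over)
print(tall_and_over)
```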


Specify a condition to change only the data that matches the condition. As an example, let's set the weight of data with a height of 150 or less to 0.

data_over["weight"][data_over.height <= 150] = 0
display(data_over.head(5))

[Screenshot: output of data_over.head(5) after the change]

This does what I wanted for the time being, but pandas issues a warning. I looked into it briefly but could not fully understand it, so I would like to investigate it and write an article at a later date.
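For what it's worth, the warning is most likely pandas' SettingWithCopyWarning, which is issued when a value is assigned through two chained indexing steps on a DataFrame that was itself extracted from another one. A commonly recommended pattern, sketched below on made-up data, is to take an explicit copy when extracting and then assign with a single .loc call. This has not been checked against the environment listed at the top of the article, so treat it as a starting point rather than a confirmed fix.

```python
import pandas as pd

df = pd.DataFrame({
    "height": [148.0, 160.0, 150.0],
    "weight": [60.0, 70.0, 65.0],
    "proper_weight": [48.2, 56.3, 49.5],
})

# .copy() makes the extracted DataFrame independent of the original.
over = df[df["weight"] > df["proper_weight"]].copy()

# One .loc call selects the rows (condition) and the label ("weight") together.
over.loc[over["height"] <= 150, "weight"] = 0

print(over)
print(df)  # the original DataFrame is not modified
```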

Use the groupby method when you want to split the data into groups that share the same value and process each group separately. As an example, let's output the average value for each gender.

display(data.groupby("sex").mean())

[Screenshot: output of data.groupby("sex").mean()]
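Here is a small sketch of what groupby followed by mean computes, on made-up values that are easy to check by hand. Selecting the labels with [["height", "weight"]] before calling mean is my own addition; it limits the averaging to numerical labels, which I assume is the safer habit when the DataFrame also holds text.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "height": [170.0, 160.0, 180.0, 150.0],
    "weight": [60.0, 50.0, 70.0, 40.0],
})

# Rows are split by the value of "sex", then mean is taken within each group.
means = df.groupby("sex")[["height", "weight"]].mean()

print(means)
```

The group values ("female", "male") become the index of the result, so a single figure can be read with means.loc["male", "height"].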

[reference]

Reference: Pandas Official Document

Pandas Official Document