Pandas is essential for machine learning work in Python, but I often forget how to use it, so I wrote down how to use the functions I need most often. I plan to cover further Pandas operations, as I learn them, in separate articles. I hope this helps people who have just started using Pandas, as well as those who want a quick reminder of how an operation works.
Since this is a memo written by a beginner, some of the content may be incorrect. If you find any mistakes, I would appreciate it if you let me know.
This article describes the Pandas features I use most often.
The following two operations work for now, but I still use them with only a vague understanding, so I will investigate and summarize them another time.
- Change data by specifying conditions
- Perform processing for groups with groupby
Import the Pandas library.
import pandas as pd
Use the read_csv method to read the csv file as a DataFrame object. This time, I'm reading a file called "student.csv" in my working directory.
data = pd.read_csv("student.csv")
display(data.head(5))
[**Supplement 1: Reading a CSV without a header**] If the CSV file has no header row ("sex", "age", "height", "weight"), the first data row (NaN, 13, 151.7, 59.1) would be read as the header. To prevent this, specify *header=None*.
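As a minimal sketch of the difference, the snippet below reads the same header-less text twice. The CSV text is made up for illustration, and passing `names=` to supply the column labels is an addition that the article itself does not use.

```python
import io

import pandas as pd

# Hypothetical header-less CSV text standing in for a file on disk.
csv_text = ",13,151.7,59.1\nmale,14,160.2,52.3\n"

# Default behaviour: the first row is consumed as the header,
# so only one data row remains.
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data; names= supplies the labels.
no_header = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["sex", "age", "height", "weight"],
)

print(len(with_default), len(no_header))
print(no_header)
```

Without `names=`, the columns would simply be numbered 0, 1, 2, 3.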
To write the DataFrame object to a csv file, use the to_csv method. In the example, it is saved in the working directory with the file name "student_out.csv".
data.to_csv("student_out.csv", index=False)
Specify *index=False* so that the index (the row labels) is not written to the file. To see what the index looks like, compare with a CSV file generated without *index=False*.
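The effect of *index=False* can be seen without writing any file. The sketch below uses a small made-up DataFrame and relies on `to_csv` returning the CSV text as a string when no path is given.

```python
import pandas as pd

# Small illustrative DataFrame (not the article's student.csv).
df = pd.DataFrame({"age": [13, 14], "height": [151.7, 160.2]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)     # first column holds the index 0, 1
print(without_index)  # only the "age" and "height" columns
```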
To see the types of data contained in a DataFrame, look at the dtypes attribute of the DataFrame object.
display(data.dtypes)
The result is as follows.
To get the data type of a single label, do as follows.
display(data["age"].dtype)
Use the count method to display the number of data points for each label. There are 1000 rows in total, but the count for "sex" is less than 1000 because missing values are not counted.
display(data.count())
[**Supplement 1: Get the count for a single label**] To get the number of data points for one label, do as follows.
display(data["sex"].count())
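A small sketch of how count treats missing values, using a made-up three-row DataFrame. `len(df)` is shown for comparison because it counts rows regardless of missing values.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing value in "sex".
df = pd.DataFrame({"sex": ["male", np.nan, "female"], "age": [13, 14, 15]})

per_column = df.count()        # non-missing values for each label
sex_count = df["sex"].count()  # non-missing values for a single label
total_rows = len(df)           # number of rows, missing or not

print(per_column)
print(sex_count, total_rows)
```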
To check the number of missing data, use the isnull method and the sum method.
display(data.isnull().sum())
[**Supplement 1: How the isnull method works**] According to the official documentation, the isnull method returns a DataFrame of the same size as the original, in which **None** and **numpy.NaN** are mapped to True and everything else to False.
display(data.isnull().head(5))
[**Supplement 2: How the sum method works**] The sum method returns the sum along the specified axis. In Python, True is treated as 1 and False as 0, so the total equals the number of True values (the number of missing data). The following is a quote from Reference 4. The quote is from the Python 3.8.1 documentation; I expect other versions behave the same way, but I have not confirmed this.
Boolean values are two constant objects False and True. These are used to represent truth values (although other values are also considered false or true). **In a numeric processing context (for example, when used as an argument to an arithmetic operator), they behave like 0 and 1, respectively.** For any value, if it can be interpreted as a truth value, the built-in function bool() is used to convert the value to a Boolean value (see the Truth Value Determination section above).
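The two steps described above (build a True/False mask, then sum it) can be sketched on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Illustrative data: both None and NaN count as missing.
df = pd.DataFrame({"sex": ["male", None, np.nan], "age": [13, 14, 15]})

mask = df.isnull()               # True where a value is missing
missing_per_column = mask.sum()  # True counts as 1, False as 0

print(mask)
print(missing_per_column)
print(True + True)  # plain Python arithmetic on booleans
```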
To see summary statistics for the data in a DataFrame, use the describe method. The describe method ignores NaN when computing them.
display(data.describe(include="all"))
[**Supplement 1: Summarizing non-numeric data**] By default, only numeric data is summarized, so **include="all"** is specified here to include "sex" as well. Note that the statistics reported differ between numeric data and other data.
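To make the difference concrete, here is a minimal sketch on made-up data that compares the default call with *include="all"*. Numeric labels get rows such as mean and std, while text labels get rows such as unique and top.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in "sex".
df = pd.DataFrame(
    {"sex": ["male", "female", "male", np.nan], "age": [13, 14, 15, 16]}
)

numeric_only = df.describe()             # only "age" is summarized
everything = df.describe(include="all")  # "sex" is summarized as well

print(numeric_only)
print(everything)
```

Rows that do not apply to a label (for example mean for "sex") are filled with NaN in the combined table.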
Reference 1: pandas official documentation, describe()
Use the get_dummies method to perform one-hot-encoding. The following is an example of performing one-hot-encoding for "sex".
# Perform one-hot encoding.
dummydf_sex = pd.get_dummies(data, columns=["sex"], dummy_na=True)
# Original data
display(data.head(5))
# One-hot encoded data
display(dummydf_sex.head(5))
When one-hot encoding is performed this way, a new label ("sex_male", "sex_female", "sex_nan") is created for each value of "sex" ("male", "female", NaN), and 1 and 0 indicate which value the original data had.
[**Supplement 1: Treat missing data as a label**] By default, missing data (NaN) is ignored. However, the fact that a value is missing is also useful information, so *dummy_na=True* is passed to get_dummies so that NaN is one-hot encoded as a value of its own.
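The following sketch compares the default behaviour with *dummy_na=True* on a small made-up DataFrame. Depending on the pandas version, the new labels may hold True/False rather than 1/0, so the example converts to int before summing.

```python
import numpy as np
import pandas as pd

# Illustrative data: the third row has no "sex" value.
df = pd.DataFrame({"sex": ["male", "female", np.nan], "age": [13, 14, 15]})

default = pd.get_dummies(df, columns=["sex"])                  # NaN is ignored
with_nan = pd.get_dummies(df, columns=["sex"], dummy_na=True)  # adds "sex_nan"

print(list(default.columns))
print(list(with_nan.columns))
print(int(with_nan["sex_nan"].astype(int).sum()))  # rows that were missing
```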
Try adding new labels and data to the DataFrame object. Let's create a BMI label as an example. An easy way is to (step 1) create a list of data and (step 2) add it as a new label, as shown below.
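# Step 1: Create a list of BMIs (plain Python, not a Pandas operation).
# BMI = weight [kg] / (height [m])^2; height is divided by 100 to convert cm to m.
bmi = [w / (h / 100) ** 2 for w, h in zip(data["weight"], data["height"])]
# Step 2: Add the list to the DataFrame as the data for the "bmi" label.
data["bmi"] = bmi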
Let's display the result.
display(data.head(5))
You can see that the "bmi" label and data have been added to the DataFrame.
[**Supplement 1: How to use the assign method**] You can also add a label with the assign method, which accepts a function that computes the new data. Let's add a label for the proper weight (proper_weight), computed as (height in m)^2 × 22.
data = data.assign(proper_weight=lambda x: (x.height / 100.0) ** 2 * 22)
display(data.head(5))
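As a sketch of both approaches on made-up data: the BMI can also be computed directly on the columns instead of building a list (an alternative to the list comprehension in the article, not the article's own method), and assign returns a new DataFrame rather than modifying the one it is called on. Heights are assumed to be in cm and weights in kg.

```python
import pandas as pd

# Illustrative data: height in cm, weight in kg (assumed units).
df = pd.DataFrame({"height": [150.0, 160.0], "weight": [45.0, 64.0]})

# Column arithmetic is applied row by row, so no explicit loop is needed.
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

# assign returns a new DataFrame; df itself keeps its current labels.
with_proper = df.assign(proper_weight=lambda x: (x.height / 100.0) ** 2 * 22)

print(df)
print(with_proper)
```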
Delete a label and the data it contains. To delete the "bmi" label added earlier, use the drop method as follows.
# Remove the label
data.drop(columns=["bmi"], inplace=True)
Let's display the result.
display(data.head(5))
[**Supplement 1: Reflect changes in the original DataFrame**] By default, the drop method returns a new DataFrame object and leaves the original unchanged. Specify *inplace=True* to apply the change to the original DataFrame object.
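The difference between the two modes can be sketched on a small made-up DataFrame. Note that with *inplace=True* the method returns None, so its result should not be assigned back to the variable.

```python
import pandas as pd

# Illustrative data with a "bmi" label to remove.
df = pd.DataFrame({"age": [13, 14], "bmi": [20.0, 25.0]})

# Default: a new DataFrame is returned and df keeps "bmi".
dropped = df.drop(columns=["bmi"])
columns_after_plain_drop = list(df.columns)

# inplace=True: df itself is modified and the call returns None.
result = df.drop(columns=["bmi"], inplace=True)

print(columns_after_plain_drop, list(dropped.columns))
print(result, list(df.columns))
```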
Use the fillna method to fill the missing data with the specified value. Let's fill in the missing part of "sex" with "unknown".
data.fillna(value={"sex": "unknown"}, inplace=True)
Let's check the result.
display(data.head(5))
I was able to confirm that "unknown" was entered in the missing part of sex.
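Passing a dict to fillna fills only the labels named in it. The sketch below uses made-up data with missing values in two labels to show that "age" is left untouched.

```python
import numpy as np
import pandas as pd

# Illustrative data: missing values in both "sex" and "age".
df = pd.DataFrame({"sex": ["male", np.nan], "age": [13.0, np.nan]})

# Only "sex" appears in the dict, so only "sex" is filled.
filled = df.fillna(value={"sex": "unknown"})

print(filled)
```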
Specify a condition and extract the data that matches it. As an example, let's create a new DataFrame object (data_over) by extracting the rows whose weight exceeds the proper weight (proper_weight).
data_over = data[data.weight > data.proper_weight]
display(data_over.head(5))
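What happens inside the brackets can be sketched in two steps on made-up data: the comparison produces a True/False Series, and indexing with it keeps only the True rows. Combining two conditions with `&` is an extra illustration that the article does not cover; each condition needs its own parentheses.

```python
import pandas as pd

# Illustrative data (values are made up).
df = pd.DataFrame(
    {"weight": [60.0, 45.0, 70.0], "proper_weight": [55.0, 50.0, 65.0]}
)

mask = df.weight > df.proper_weight  # True/False for each row
over = df[mask]                      # keeps only the rows where mask is True

# Two conditions combined with & (element-wise "and").
heavy_and_over = df[(df.weight > df.proper_weight) & (df.weight >= 65)]

print(mask.tolist())
print(over)
print(heavy_and_over)
```

The extracted rows keep their original index values, which is why the index of the result can have gaps.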
Specify a condition to change only the data that matches the condition. As an example, let's set the weight of data with a height of 150 or less to 0.
data_over["weight"][data_over.height <= 150] = 0
display(data_over.head(5))
This does what I wanted for the time being, but it produces a warning. I looked into it briefly but could not fully understand it, so I would like to investigate and write about it in a later article.
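The warning is most likely pandas' SettingWithCopyWarning; this is an assumption, since the message itself is not quoted here. It is typically raised when a value is assigned through two chained indexing steps on a DataFrame that was itself extracted from another one. A commonly recommended pattern, sketched below on made-up data, is to take an explicit copy of the extracted rows and then assign with a single `.loc` call that names both the row condition and the label.

```python
import pandas as pd

# Illustrative data (values are made up).
df = pd.DataFrame(
    {"height": [148.0, 155.0, 150.0], "weight": [40.0, 50.0, 45.0]}
)

# Take an explicit copy so the extracted rows are independent of df.
subset = df[df.weight > 30].copy()

# One .loc call: rows selected by the condition, column selected by name.
subset.loc[subset.height <= 150, "weight"] = 0

print(subset)
print(df)  # the original data is unchanged
```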
Use the groupby method when you want to split the data into groups by the value of a label and process each group. As an example, let's output the average values for each sex.
display(data.groupby("sex").mean())
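A minimal sketch of the grouping on made-up data: rows that share the same "sex" value form one group, and mean is computed for each remaining label within each group. Selecting a single label before calling mean is an extra illustration, not part of the article.

```python
import pandas as pd

# Illustrative data (values are made up).
df = pd.DataFrame(
    {
        "sex": ["male", "female", "male", "unknown"],
        "height": [160.0, 150.0, 170.0, 155.0],
        "weight": [50.0, 45.0, 60.0, 48.0],
    }
)

means = df.groupby("sex").mean()                    # one row per group
height_means = df.groupby("sex")["height"].mean()   # a single label only

print(means)
print(height_means)
```

The group values become the index of the result, sorted in ascending order by default.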