Until now, my stance on data manipulation with Pandas in Python was to just google how to use it whenever I needed it. But while studying data analysis and machine learning, I feel I can't get anywhere unless I can handle data fluently in Jupyter Notebook, so I have started to relearn how to use Pandas from the basics. This article is my study notes.
The code in this article was tested in a Jupyter Notebook environment set up according to the following article: Easy installation and startup of Jupyter Notebook using Docker (also supports nbextensions and Scala) - Qiita
In this environment, you can access port 8888 in a browser and use Jupyter Notebook. You can open a new notebook via New > Python 3 in the upper right.
A Pandas DataFrame represents tabular data with rows and columns. Rows are usually numbered starting from 0, but row labels can apparently also be strings. Columns have column names. While a DataFrame is tabular, there is also a one-dimensional, single-column object called a Series.
The following article was very helpful for understanding DataFrame and Series: My pandas.Series and DataFrame images were wrong - Qiita
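To make the DataFrame/Series distinction concrete, here is a minimal sketch with made-up data (the column names and string row labels are only illustrative):

```python
import pandas as pd

# A DataFrame is a table with row labels (the index) and column names
df = pd.DataFrame(
    {"data1": [10, 20, 30], "data2": [1.5, 2.5, 3.5]},
    index=["a", "b", "c"],  # row labels can be strings, not just 0, 1, 2, ...
)

# Extracting a single column yields a Series, which keeps the same index
s = df["data1"]
print(type(df).__name__)  # DataFrame
print(type(s).__name__)   # Series
print(list(s.index))      # ['a', 'b', 'c']
```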
import pandas as pd
# If the file has a header, the header row becomes the column names
df = pd.read_csv("data.csv")
# If there is no header, integer column names starting from 0 are assigned
df = pd.read_csv("data.csv", header=None)
# If there is no header and you want to specify column names yourself
df = pd.read_csv("data.csv", names=["id", "target", "data1", "data2", "data3"])
From this point on, I use a CSV file filled with random numbers: https://github.com/suzuki-navi/sample-data/blob/master/sample-data-1.csv
(GitHub also formats and displays CSV files nicely.)
You can easily check the contents of a DataFrame object in Jupyter Notebook.
To see only part of the data:
# First 5 rows
df.head()
# First 3 rows
df.head(3)
# or
df[:3]
# Last 5 rows
df.tail()
# Last 3 rows
df.tail(3)
# Extract only the 11th to 20th rows
# (indexes 10 to 19, counting from 0)
df[10:20]
# Extract from the 11th row to the end
# (index 10 onward, counting from 0)
df[10:]
# Look at only the 11th row
# (index 10, counting from 0)
df.loc[10]
df.loc[10]
# Extract only specific columns
df[["target", "data1"]]
# Extract only a single column
# This becomes a Series instead of a DataFrame
df["data1"]
# Note: different from df[["data1"]]
# Extract only specific columns in a specific row range
df[["target", "data1"]][10:20]
# or
df[10:20][["target", "data1"]]
Even if you extract only some rows, the index attached to each row is preserved.
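A small sketch of this behavior with a made-up DataFrame; `reset_index(drop=True)` renumbers from 0 if you want a fresh index:

```python
import pandas as pd

df = pd.DataFrame({"data1": range(100)})

sub = df[10:20]
# The original index labels 10..19 are kept on the slice
print(list(sub.index))         # [10, 11, ..., 19]

# reset_index(drop=True) renumbers the rows from 0 if needed
renumbered = sub.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1, ..., 9]
```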
df.shape
# => (300, 5)
df.columns
# => Index(['id', 'target', 'data1', 'data2', 'data3'], dtype='object')
df.dtypes
# => id int64
# target int64
# data1 float64
# data2 float64
# data3 float64
# dtype: object
You can perform operations on columns.
df["data1"] is a Series, and if you write df["data1"] / 100, the operation / 100 is applied to each element of the Series, and the result is also a Series.
You can also perform operations between columns.
df["data1"] + df["data2"]
# Generate a DataFrame consisting only of the rows where df["data1"] >= 0 is True
# The row index is preserved, so the index values become discontinuous
df[df["data1"] >= 0]
#You can also query like SQL
df.query('data1 >= 30 and target == 1')
# To use a string in a query, enclose it in double quotes
df.query('target == "1"')
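As a supplementary sketch with made-up data: a Python variable can also be referenced inside a query string with the @ prefix.

```python
import pandas as pd

df = pd.DataFrame({
    "target": [1, 2, 1],
    "data1": [35.0, 10.0, 50.0],
})

# @variable references a Python variable inside the query string
threshold = 30
result = df.query('data1 >= @threshold and target == 1')
print(result["data1"].tolist())  # [35.0, 50.0]
```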
df["target"].unique()
# => array([3, 2, 1])
df.describe()
The following returns a DataFrame with rows sorted by the data1 column.
# data1 column, ascending
df.sort_values("data1")
# data1 column, descending
df.sort_values("data1", ascending=False)
# Sort by multiple columns
df.sort_values(["target", "data1"], ascending=False)
How can I sort with the first key target descending and the second key data1 ascending?
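One way to do this, sketched with made-up data: sort_values accepts a list for ascending=, one flag per sort key.

```python
import pandas as pd

df = pd.DataFrame({
    "target": [1, 2, 1, 2],
    "data1":  [30, 10, 20, 40],
})

# Pass a list to ascending= to mix directions per key:
# target descending, then data1 ascending
mixed = df.sort_values(["target", "data1"], ascending=[False, True])
print(mixed[["target", "data1"]].values.tolist())
# [[2, 10], [2, 40], [1, 20], [1, 30]]
```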
In the following example, a new column computed from existing columns is added at the right end.
df["data_sum"] = df["data1"] + df["data2"] + df["data3"]
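With made-up numbers, the effect of adding such a derived column can be checked like this:

```python
import pandas as pd

df = pd.DataFrame({
    "data1": [1.0, 2.0],
    "data2": [10.0, 20.0],
    "data3": [100.0, 200.0],
})

# Assigning to a new column name appends the column on the right
df["data_sum"] = df["data1"] + df["data2"] + df["data3"]
print(list(df.columns))      # ['data1', 'data2', 'data3', 'data_sum']
print(list(df["data_sum"]))  # [111.0, 222.0]
```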
That's all.