Until now, my stance on data manipulation with Pandas in Python was to just google how to use it whenever I needed it. But while studying data analysis and machine learning, I feel I can't get anywhere unless I can handle data fluently in Jupyter Notebook, so I have started to relearn how to use Pandas from the basics. This article is my study notes.
The code in this article was tested in a Jupyter Notebook environment set up according to the following article: Easy installation and startup of Jupyter Notebook using Docker (also supports nbextensions and Scala) - Qiita
In this environment, you can access port 8888 in a browser and use Jupyter Notebook. You can open a new notebook via New > Python 3 in the upper right.
A Pandas DataFrame represents tabular data with rows and columns. Rows are usually numbered starting from 0, but row labels can apparently also be strings. Columns have column names. While a DataFrame is tabular, there is also a one-dimensional, single-column object called a Series.
The following article was very helpful for understanding DataFrame and Series: My pandas.Series and DataFrame images were wrong - Qiita
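To make the DataFrame/Series distinction concrete, here is a minimal sketch with made-up data (the column names and string row labels are only illustrative):

```python
import pandas as pd

# A DataFrame is a table with row labels (the index) and column names
df = pd.DataFrame(
    {"data1": [10, 20, 30], "data2": [1.5, 2.5, 3.5]},
    index=["a", "b", "c"],  # row labels can be strings, not just 0, 1, 2, ...
)

# Extracting a single column yields a Series, which keeps the same index
s = df["data1"]
print(type(df).__name__)  # DataFrame
print(type(s).__name__)   # Series
print(list(s.index))      # ['a', 'b', 'c']
```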
import pandas as pd
# If the file has a header, the header row becomes the column names
df = pd.read_csv("data.csv")
# If there is no header, integer column names starting from 0 are assigned
df = pd.read_csv("data.csv", header=None)
# If there is no header and you want to specify column names yourself
df = pd.read_csv("data.csv", names=["id", "target", "data1", "data2", "data3"])
From this point on, I use a CSV file filled with random numbers: https://github.com/suzuki-navi/sample-data/blob/master/sample-data-1.csv
(GitHub also formats and displays CSV files nicely.)
You can easily check the contents of a DataFrame object in Jupyter Notebook.
To see only part of the data:
# First 5 rows
df.head()
# First 3 rows
df.head(3)
# or
df[:3]
# Last 5 rows
df.tail()
# Last 3 rows
df.tail(3)
# Extract only the 11th to 20th rows
# (indexes 10 to 19, counting from 0)
df[10:20]
# Extract from the 11th row to the end
# (index 10 onward, counting from 0)
df[10:]
# Look at only the 11th row
# (index 10, counting from 0)
df.loc[10]
df.loc[10]
# Extract only specific columns
df[["target", "data1"]]
# Extract only a single column
# This becomes a Series instead of a DataFrame
df["data1"]
# Note: different from df[["data1"]]
# Extract only specific columns in a specific row range
df[["target", "data1"]][10:20]
# or
df[10:20][["target", "data1"]]
Even if you extract only some rows, the index attached to each row is preserved.
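A small sketch of this behavior with a made-up DataFrame; `reset_index(drop=True)` renumbers from 0 if you want a fresh index:

```python
import pandas as pd

df = pd.DataFrame({"data1": range(100)})

sub = df[10:20]
# The original index labels 10..19 are kept on the slice
print(list(sub.index))         # [10, 11, ..., 19]

# reset_index(drop=True) renumbers the rows from 0 if needed
renumbered = sub.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1, ..., 9]
```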
df.shape
# => (300, 5)
df.columns
# => Index(['id', 'target', 'data1', 'data2', 'data3'], dtype='object')
df.dtypes
# => id int64
# target int64
# data1 float64
# data2 float64
# data3 float64
# dtype: object
You can perform operations on columns.
df["data1"] is a Series, and if you write df["data1"] / 100, the operation / 100 is applied to each element of the Series, and the result is also a Series.
You can also perform operations between columns.
df["data1"] + df["data2"]
# Generate a DataFrame consisting only of the rows where df["data1"] >= 0 is True
# The row index is preserved, so the index values become discontinuous
df[df["data1"] >= 0]
#You can also query like SQL
df.query('data1 >= 30 and target == 1')
# To use a string in a query, enclose it in double quotes
df.query('target == "1"')
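As a supplementary sketch with made-up data: a Python variable can also be referenced inside a query string with the @ prefix.

```python
import pandas as pd

df = pd.DataFrame({
    "target": [1, 2, 1],
    "data1": [35.0, 10.0, 50.0],
})

# @variable references a Python variable inside the query string
threshold = 30
result = df.query('data1 >= @threshold and target == 1')
print(result["data1"].tolist())  # [35.0, 50.0]
```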
df["target"].unique()
# => array([3, 2, 1])
df.describe()
The following returns a DataFrame with rows sorted by the data1 column.
# data1 column, ascending
df.sort_values("data1")
# data1 column, descending
df.sort_values("data1", ascending=False)
# Sort by multiple columns
df.sort_values(["target", "data1"], ascending=False)
How can I sort with the first key target descending and the second key data1 ascending?
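One way to do this, sketched with made-up data: sort_values accepts a list for ascending=, one flag per sort key.

```python
import pandas as pd

df = pd.DataFrame({
    "target": [1, 2, 1, 2],
    "data1":  [30, 10, 20, 40],
})

# Pass a list to ascending= to mix directions per key:
# target descending, then data1 ascending
mixed = df.sort_values(["target", "data1"], ascending=[False, True])
print(mixed[["target", "data1"]].values.tolist())
# [[2, 10], [2, 40], [1, 20], [1, 30]]
```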
In the following example, a new column computed from existing columns is added at the right end.
df["data_sum"] = df["data1"] + df["data2"] + df["data3"]
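With made-up numbers, the effect of adding such a derived column can be checked like this:

```python
import pandas as pd

df = pd.DataFrame({
    "data1": [1.0, 2.0],
    "data2": [10.0, 20.0],
    "data3": [100.0, 200.0],
})

# Assigning to a new column name appends the column on the right
df["data_sum"] = df["data1"] + df["data2"] + df["data3"]
print(list(df.columns))      # ['data1', 'data2', 'data3', 'data_sum']
print(list(df["data_sum"]))  # [111.0, 222.0]
```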
That's all.