Try basic operations for Pandas DataFrame on Jupyter Notebook

Until now, my stance on data manipulation with Pandas in Python was to just google how to do something whenever I needed it. While studying data analysis and machine learning, however, I came to feel that I can't get anywhere unless I can handle data fluently in Jupyter Notebook, so I have started to reorganize and study how to use Pandas from the basics. This article is my study note.

The code in this article was tested in a Jupyter Notebook environment set up as described in the following article: "Easy installation and startup of Jupyter Notebook using Docker (also supports nbextensions and Scala)" (Qiita).

In this environment, you can use Jupyter Notebook by opening port 8888 in a browser. To open a new notebook, click the New button at the upper right and choose Python 3.

DataFrame overview

A Pandas DataFrame can be pictured as tabular data with rows and columns. Rows are labeled by an index, which by default is a sequence of integers starting at 0 but can also consist of strings. Columns are identified by their column names.

While a DataFrame is tabular, there is also an object called Series that holds just a single column.

The following article was very helpful for understanding DataFrame and Series: "My pandas.Series and DataFrame images were wrong" (Qiita).
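To make the relationship between DataFrame, Series, and the row index concrete, here is a minimal sketch. The toy data and the names df_small, col, and row are my own assumptions for illustration and are not taken from the sample CSV used later.

```python
import pandas as pd

# Hypothetical toy data, not the article's CSV.
df_small = pd.DataFrame(
    {"target": [1, 2, 3], "data1": [0.5, -1.2, 3.4]},
    index=["a", "b", "c"],  # string row labels instead of the default 0, 1, 2
)

col = df_small["data1"]  # selecting one column gives a Series
row = df_small.loc["b"]  # selecting one row by label also gives a Series

print(type(col).__name__)    # Series
print(list(df_small.index))  # ['a', 'b', 'c']
print(list(row.index))       # ['target', 'data1']
```

Note that the Series for a row is indexed by the column names, while the Series for a column is indexed by the row labels.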

Python package import

import pandas as pd

Read from CSV file

# If the file has a header row
df = pd.read_csv("data.csv")
# The header row becomes the column names

# If the file has no header row
df = pd.read_csv("data.csv", header=None)
# Integers starting from 0 become the column names

# If the file has no header row and you want to specify the column names
df = pd.read_csv("data.csv", names=["id", "target", "data1", "data2", "data3"])
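The three variants above can be tried without any file on disk. The following is a small sketch that feeds read_csv from in-memory text through io.StringIO. The CSV contents and the variable names are hypothetical stand-ins for data.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for data.csv.
csv_with_header = "id,target,data1\n1,3,0.5\n2,1,-1.2\n"
csv_without_header = "1,3,0.5\n2,1,-1.2\n"

# The header row becomes the column names.
df_a = pd.read_csv(io.StringIO(csv_with_header))

# No header row: columns are named 0, 1, 2.
df_b = pd.read_csv(io.StringIO(csv_without_header), header=None)

# No header row, column names given explicitly.
df_c = pd.read_csv(io.StringIO(csv_without_header), names=["id", "target", "data1"])

print(list(df_a.columns))  # ['id', 'target', 'data1']
print(list(df_b.columns))  # [0, 1, 2]
print(list(df_c.columns))  # ['id', 'target', 'data1']
```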

From this point onward, I am using a CSV file that I created from arbitrary random numbers: https://github.com/suzuki-navi/sample-data/blob/master/sample-data-1.csv

(GitHub also formats and displays CSV files)

Check the contents of the data

You can easily check the contents of a DataFrame object on Jupyter Notebook: when a DataFrame is the last expression in a cell, it is rendered as a table.


If you want to see only part of the data

# First 5 rows
df.head()

# First 3 rows
df.head(3)
# or
df[:3]

# Last 5 rows
df.tail()

# Last 3 rows
df.tail(3)

# Extract only the 11th to 20th rows
# (positions 10 to 19 when counting from 0)
df[10:20]

# Extract from the 11th row to the end
# (position 10 onward when counting from 0)
df[10:]

# Check only the 11th row
# (loc looks up the index label 10, which is the 11th row with the default 0-based index)
df.loc[10]

# Extract only specific columns
df[["target", "data1"]]

# Extract a single column
# The result is a Series instead of a DataFrame
df["data1"]
# Note that df[["data1"]] is different: it returns a one-column DataFrame

# Extract only specific columns in a specific row range
df[["target", "data1"]][10:20]
# or
df[10:20][["target", "data1"]]

Even if you extract only some rows, the index attached to the rows is maintained.
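Because the index is kept, label-based and position-based lookups give different answers on an extracted piece. Below is a minimal sketch of that behavior using hypothetical toy data (df_demo is not the sample CSV). It relies on loc for labels and iloc for positions, and uses reset_index(drop=True) as one way to renumber the rows from 0.

```python
import pandas as pd

# Hypothetical toy data with the default 0-based index.
df_demo = pd.DataFrame({"data1": range(100, 130)})

part = df_demo[10:20]    # rows at positions 10..19
print(list(part.index))  # the labels 10..19 are kept, not reset to 0..9

first_by_label = part.loc[10]     # label-based: the row labeled 10
first_by_position = part.iloc[0]  # position-based: the first row of the slice

# Relabel the rows as 0..9 if the original labels are not needed.
renumbered = part.reset_index(drop=True)
print(list(renumbered.index))
```

On the slice, part.loc[0] would raise a KeyError because no row carries the label 0 any more, whereas part.iloc[0] still works.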


Check the data format

df.shape
# => (300, 5)

df.columns
# => Index(['id', 'target', 'data1', 'data2', 'data3'], dtype='object')

df.dtypes
# => id          int64
#    target      int64
#    data1     float64
#    data2     float64
#    data3     float64
#    dtype: object

Operate on columns

You can perform operations on columns.

df["data1"] is a Series. If you write df["data1"] / 100, the operation / 100 is applied to each element of the Series, and the result is returned as a new Series.


You can also perform operations between columns.

df["data1"] + df["data2"]
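Here is a small sketch of both kinds of operation on hypothetical toy data (df_ops and its values are my own, not the sample CSV). Both results are new Series, and the original DataFrame is left unchanged.

```python
import pandas as pd

# Hypothetical toy data.
df_ops = pd.DataFrame({"data1": [100.0, 250.0], "data2": [1.0, 2.0]})

# Scalar operation: applied to every element of the Series.
scaled = df_ops["data1"] / 100

# Operation between columns: applied element by element, row by row.
total = df_ops["data1"] + df_ops["data2"]

print(scaled.tolist())  # [1.0, 2.5]
print(total.tolist())   # [101.0, 252.0]
```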

Extract rows conditionally

# Generate a DataFrame consisting only of the rows where df["data1"] >= 0 is True
# The row index is kept, so the index values are no longer consecutive
df[df["data1"] >= 0]

# You can also write the condition like SQL
df.query('data1 >= 30 and target == 1')

# To compare against a string inside the query, surround it with double quotes
# (this matches only when the target column holds strings)
df.query('target == "1"')
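As a sketch of what happens behind the bracket syntax: the comparison first produces a boolean Series, and indexing with it keeps the rows marked True. The toy data below is hypothetical. Combining conditions with & (each wrapped in parentheses) is, as far as I understand, equivalent to the and form in query.

```python
import pandas as pd

# Hypothetical toy data.
df_f = pd.DataFrame({
    "target": [1, 2, 1, 3],
    "data1": [35.0, 40.0, -5.0, 10.0],
})

mask = df_f["data1"] >= 0  # boolean Series, one value per row
non_negative = df_f[mask]  # keeps the rows where mask is True
print(list(non_negative.index))  # [0, 1, 3]

# Two conditions combined with & ...
both = df_f[(df_f["data1"] >= 30) & (df_f["target"] == 1)]
# ... and the same conditions written as a query string.
same = df_f.query("data1 >= 30 and target == 1")
print(both.equals(same))  # True
```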

Get a list of values with duplicates removed

df["target"].unique()
# => array([3, 2, 1])
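The result above is not sorted because unique() returns values in order of first appearance. The sketch below shows this on a hypothetical Series, together with two related methods, nunique() and value_counts(), which are not used elsewhere in this article.

```python
import pandas as pd

# Hypothetical toy Series.
s = pd.Series([3, 2, 3, 1, 2], name="target")

values = s.unique()        # NumPy array, in order of first appearance
n_distinct = s.nunique()   # number of distinct values
counts = s.value_counts()  # how many times each value occurs

print(values.tolist())  # [3, 2, 1]
print(n_distinct)       # 3
```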

Get statistics about a column of numbers

df.describe()
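The result of describe() is itself a DataFrame, so individual statistics can be picked out with loc. Here is a minimal sketch on hypothetical toy data. With the default arguments, the non-numeric label column is left out of the summary.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one string column.
df_stats = pd.DataFrame({
    "data1": [1.0, 2.0, 3.0, 4.0],
    "label": ["a", "b", "c", "d"],
})

stats = df_stats.describe()

print(list(stats.index))    # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(list(stats.columns))  # ['data1']
print(stats.loc["mean", "data1"])  # 2.5
```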


Sort

The following returns a DataFrame with rows sorted by the data1 column.

# data1 column, ascending order
df.sort_values("data1")

# data1 column, descending order
df.sort_values("data1", ascending=False)

# Sort by multiple columns (both descending)
df.sort_values(["target", "data1"], ascending=False)

How can I sort the first key, target, in descending order and the second key, data1, in ascending order?
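One way to do this, as far as I can tell from the sort_values documentation, is to pass a list of booleans to ascending, one per sort column. The sketch below uses hypothetical toy data rather than the sample CSV.

```python
import pandas as pd

# Hypothetical toy data.
df_s = pd.DataFrame({
    "target": [1, 2, 1, 2],
    "data1": [5.0, 3.0, 1.0, 4.0],
})

# target descending, then data1 ascending within each target value.
result = df_s.sort_values(["target", "data1"], ascending=[False, True])

print(result["target"].tolist())  # [2, 2, 1, 1]
print(result["data1"].tolist())   # [3.0, 4.0, 1.0, 5.0]
```

The list passed to ascending must have the same length as the list of sort columns.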

Add column

In the following example, a new column whose values are computed from existing columns is added at the right end.

df["data_sum"] = df["data1"] + df["data2"] + df["data3"]
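The assignment above modifies df in place. If you would rather keep the original DataFrame untouched, assign() is an alternative that returns a new DataFrame with the extra column. The sketch below shows it on hypothetical toy data. Using assign() here is my own addition, not something the steps above require.

```python
import pandas as pd

# Hypothetical toy data.
df_add = pd.DataFrame({
    "data1": [1.0, 2.0],
    "data2": [10.0, 20.0],
    "data3": [100.0, 200.0],
})

# assign() returns a new DataFrame; df_add itself is not changed.
with_sum = df_add.assign(
    data_sum=df_add["data1"] + df_add["data2"] + df_add["data3"]
)

print(list(with_sum.columns))  # ['data1', 'data2', 'data3', 'data_sum']
print(list(df_add.columns))    # ['data1', 'data2', 'data3']
```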


That's all.
