Pandas in 10 minutes

Introduction

This article is a hand-typed walkthrough of the official pandas tutorial "10 minutes to pandas", with commentary.

Reference: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

Environment

Python 3.7
Jupyter Lab

Import the modules first

import numpy as np
import pandas as pd
np
pd

If each module is displayed without an error, the import succeeded.

If an error occurs

If you get **ModuleNotFoundError: No module named 'pandas'**, install pandas first.


---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-59ab05e21164> in <module>
      1 import numpy as np
----> 2 import pandas as pd

ModuleNotFoundError: No module named 'pandas'

Command:

python -m pip install pandas

1. Create an object

You can easily create a Series by passing a list to the Series class.

# A simple sequence of values
s = pd.Series(data=[1, 3, 5, np.nan, 6, 8])
s
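As a small check on what the Series above contains, the following sketch inspects its dtype, its default index, and its missing value. It only uses the same list as the example above.

```python
import numpy as np
import pandas as pd

# Same values as above; np.nan marks a missing entry.
s = pd.Series(data=[1, 3, 5, np.nan, 6, 8])

print(s.dtype)         # float64, because NaN is a floating-point value
print(list(s.index))   # default integer index: [0, 1, 2, 3, 4, 5]
print(s.isna().sum())  # number of missing values: 1
```

Because of the NaN, the integers in the list are stored as floats.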

You can use date_range() to create a sequence of dates covering a specific period.

# Six days of dates starting January 1, 2020
dates = pd.date_range("20200101", periods=6)
dates
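The sketch below confirms the first and last dates of that range. It also builds the same range from explicit start and end dates; the variable name `same` is only for illustration.

```python
import pandas as pd

# Six consecutive days starting January 1, 2020 (daily frequency by default)
dates = pd.date_range("20200101", periods=6)
print(dates[0], dates[-1])

# The same range, written with explicit start and end dates
same = pd.date_range(start="2020-01-01", end="2020-01-06")
print(dates.equals(same))  # True
```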

By specifying the **index** argument of the pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas-dataframe) class, you can set the row index.

# Use the dates starting January 1, 2020 as the row index
# Fill each cell with a random number
df = pd.DataFrame(np.random.randn(6, 4), index=dates)
df

You can also set the column names by specifying the **columns** argument of the DataFrame class.

# Name the columns A, B, C, D
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

If you pass a dictionary to the DataFrame class, the dictionary keys become the column names.

df2 = pd.DataFrame(
    {
        "A": 1.,
        "B": pd.Timestamp("20200101"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

You can check the data type of each column by referring to the **dtypes** attribute.

df2.dtypes
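To see how the dictionary values turn into columns, here is a reduced sketch of the same idea with four of the columns. Scalar values such as 1.0 and "foo" are repeated for every row, and each column keeps its own dtype.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {
        "A": 1.0,                                                  # scalar, repeated per row
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),  # sets the 4-row index
        "D": np.array([3] * 4, dtype="int32"),
        "F": "foo",                                                # scalar, repeated per row
    }
)

print(df2.shape)          # (4, 4)
print(df2["A"].tolist())  # [1.0, 1.0, 1.0, 1.0]
print(df2.dtypes)         # one dtype per column
```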

If you are using Jupyter Notebook or Jupyter Lab, column names are offered by tab completion.

df2.<TAB>

2. View data

Using the [head() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head) of the DataFrame class, you can display the first rows of the data.

df.head(2)

Similarly, using tail() of the DataFrame class, you can display the last rows.

df.tail(2)

By referring to the **index** attribute of the DataFrame class, you can display the row index of the data.

df.index
df2.index

Using to_numpy() of the DataFrame class, you can convert the data into a NumPy array, which is easy to work with in NumPy.

df.to_numpy()
df2.to_numpy()
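The two calls above return arrays with different dtypes, because a NumPy array has a single dtype for all of its elements. The sketch below shows this with two small frames; the names `homogeneous` and `mixed` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# All columns are floats, so the array can stay float64
homogeneous = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# Floats and strings have no common numeric type, so the array becomes object
mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["foo", "bar"]})

print(homogeneous.to_numpy().dtype)  # float64
print(mixed.to_numpy().dtype)        # object
```

Note that the row index and column names are not part of the returned array.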

Using describe() of the DataFrame class, you can get quick summary statistics for each column of the data.

df2.describe()
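When a frame mixes numeric and non-numeric columns, as df2 does, describe() summarizes only the numeric columns by default. A minimal sketch with a made-up frame (the name `frame` and its values are for illustration only):

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "label": ["a", "b", "a", "b"],
    }
)

stats = frame.describe()
print(list(stats.columns))     # ['x'], the string column is skipped
print(stats.loc["mean", "x"])  # 2.5
```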

If you refer to the T attribute of the DataFrame class, you can access the data with its rows and columns swapped.

df.T

You can also get the transposed data with transpose() of the DataFrame class.

df.transpose()
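The T attribute and transpose() give the same result. A small sketch with a made-up 2x2 frame (the name `frame` is for illustration only):

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

t = frame.T
print(list(t.index))    # ['A', 'B'], the former column names
print(list(t.columns))  # ['x', 'y'], the former row labels
print(t.equals(frame.transpose()))  # True
```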

Using [sort_index()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html#pandas-dataframe-sort-index) of the DataFrame class, you can sort the data by its row or column labels.

df.sort_index()

Set the **axis** argument to 0 or "index" to sort by the row labels, or to 1 or "columns" to sort by the column labels (default 0). If you pass False for the **ascending** argument, the sort order is descending (default True).

df.sort_index(axis=0, ascending=False)
df.sort_index(axis=1, ascending=False)
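Since df holds random numbers, the effect of sort_index() is easier to follow on a small frame with fixed values. The sketch below is such an example; the name `frame` and its labels are made up.

```python
import pandas as pd

frame = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=["b", "c", "a"],
    columns=["Y", "X"],
)

# Sort by the row labels, descending
by_rows = frame.sort_index(axis=0, ascending=False)
print(list(by_rows.index))    # ['c', 'b', 'a']

# Sort by the column labels, ascending
by_cols = frame.sort_index(axis=1)
print(list(by_cols.columns))  # ['X', 'Y']

# sort_index() returns a new DataFrame; the original order is unchanged
print(list(frame.index))      # ['b', 'c', 'a']
```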

Using [sort_values()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html#pandas-dataframe-sort-values) of the DataFrame class, you can sort by the values in a column or a row.

df.sort_values(by="B")
df.sort_values(by="2020-01-01", axis=1)
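The two directions of sort_values() can be checked on a small frame with fixed values. In this sketch the name `frame` and the labels r1 to r3 are made up: by="A" reorders the rows using column A, and by="r1" with axis=1 reorders the columns using row r1.

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [3, 1, 2], "B": [9, 8, 7]},
    index=["r1", "r2", "r3"],
)

# Reorder rows by the values in column A (ascending)
by_col = frame.sort_values(by="A")
print(list(by_col.index))    # ['r2', 'r3', 'r1']

# Reorder columns by the values in row r1 (descending)
by_row = frame.sort_values(by="r1", axis=1, ascending=False)
print(list(by_row.columns))  # ['B', 'A']
```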

(Added on 2020-03-07)

3. Select data

Simple data acquisition

You can get the specified column with **df["A"]** or **df.A**.

df["A"]
df.A

If you pass a slice to **[]**, you can select rows with a Python slice operation.

# Display the first 3 rows
df[0:3]

You can also slice by a range of index labels.

# Display January 2, 2020 through January 4, 2020
df['20200102':'20200104']
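One difference between the two slices is worth checking: a positional slice excludes its end position, while a label slice includes both endpoints. The sketch below rebuilds df with fixed values (np.arange instead of random numbers, an assumption made only to keep the output predictable).

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Positions 0, 1, 2: the end position 3 is excluded
by_position = df[0:3]
print(by_position.index[-1])  # 2020-01-03

# Labels January 2 through January 4: both endpoints are included
by_label = df["20200102":"20200104"]
print(by_label.index[0], by_label.index[-1])  # 2020-01-02 and 2020-01-04
```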

Select data by label

By passing index labels (dates in this case) to loc of the DataFrame class, you can select rows by label. A single label returns that row as a Series.

df.loc[dates]
df.loc[dates[0]]

You can select multiple columns by using loc.

df.loc[:, ["A", "B"]]

Without the leading colon, loc interprets the list as row labels, so an error (KeyError) occurs.
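The role of the leading colon can be checked with a short sketch. df is rebuilt here with fixed values (np.arange instead of random numbers, an assumption for reproducibility), and the flag `raised` is only for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# ":" selects all rows; the list selects columns A and B
cols = df.loc[:, ["A", "B"]]
print(cols.shape)  # (6, 2)

# Without ":", the list is looked up in the row index, where "A" and "B" do not exist
try:
    df.loc[["A", "B"]]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```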

By combining loc with slice operations, you can select multiple rows and multiple columns.

df.loc['20200102':'20200104', ['A', 'B']]

By specifying both a row label and a column label in loc, you can get a single value.

df.loc[dates[0], 'A']

By using at, you can get a single value faster.

df.at[dates[0], 'A']
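Both accessors return the same value for a single cell. A minimal sketch, again with df rebuilt from fixed values (np.arange, an assumption for reproducibility); it does not measure the speed difference.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

via_loc = df.loc[dates[0], "A"]  # general label-based selection
via_at = df.at[dates[0], "A"]    # scalar-only access by label

print(via_loc, via_at)  # 0 0
```

Unlike loc, at accepts only a single row label and a single column label.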

[Select data by position](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection-by-position)

By using iloc of the DataFrame class, you can select data by integer position.

df.iloc[3]
df.iloc[3:5, 0:2]
df.iloc[[1, 2, 4], [0, 2]]
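With random numbers it is hard to see which cells each call picks, so the sketch below repeats the three calls on a frame filled with 0 to 23 (np.arange, an assumption made for readability). The names `row`, `block` and `picked` are for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# A single integer returns that row (position 3, the fourth row) as a Series
row = df.iloc[3]
print(row.tolist())                # [12, 13, 14, 15]

# Slices work like Python slices: rows 3-4 and columns 0-1, end excluded
block = df.iloc[3:5, 0:2]
print(block.to_numpy().tolist())   # [[12, 13], [16, 17]]

# Lists of positions pick arbitrary rows and columns
picked = df.iloc[[1, 2, 4], [0, 2]]
print(picked.to_numpy().tolist())  # [[4, 6], [8, 10], [16, 18]]
```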

By passing a bare slice (:), with the start and end positions omitted, to iloc of the DataFrame class, you can get all rows or all columns of a selection.