Overview of dataframe connection / combination

DataFrames can be concatenated and combined in Pandas.

Concatenation: Operation to connect DataFrames in a certain direction as they are
Join: Operation to connect by referring to a specific Key

DataFrame concatenation

If the index and column match

The operation of connecting DataFrames in a certain direction as they are is called concatenation. Concatenation of DataFrames with matching indexes or columns

axis=0 #By specifying this, the list will be concatenated vertically from the beginning.

Also, if axis = 1 is specified, they will be connected horizontally.

#Connected vertically
pandas.concat([List of DataFrames], axis=0)

#Connected horizontally
pandas.concat([List of DataFrames], axis=1)
#When concatenating horizontally, even if each DataFrame has the same column, it does not distinguish
#Note that the columns will be duplicated as they will be concatenated.

When connecting in the vertical direction, connect in the same column When concatenating horizontally, concatenate with the same index.

If the index and column do not match

The point is that empty rows and columns are set once with NaN.

For more information When concatenating DataFrames whose indexes and columns do not match Cells with NaNs in rows and columns that are not common indexes or columns are automatically created.

Even if the indexes and columns do not match, the method of concatenation is the same as if they do.

pandas.concat([List of DataFrames], axis=0) 
#Connect vertically from the top of the list
# axis=If 1 is specified, they will be connected horizontally.

Specifying the label when concatenating

Since the concatenation connects the DataFrames without distinguishing them There may be duplicate labels.

For example, in the concatenation of Example 1 below, "apple", "orange", "banana" are You can see that they are duplicated as labels.

In this case, by adding the label specified by keys to pd.concat () You can avoid duplicate labels.

concat_df=pd.concat([df_data1,df_data2],axis=1,keys=["X", "Y"]) #Set with keys

The data frame after concatenation will be a Multi Index that uses multiple labels.

In the case of Example 2, you can see that the new "X" and "Y" columns have been added above the existing columns. In this case, df ["X"] can refer to the column labeled "X". You can refer to the "apple" column in the "X" column by writing df ["X", "apple"].

DataFrame Join

Describes the join. Joining is also known as merging. Join specifies a column called Key It is an operation to connect rows with matching values in the Key of two databases horizontally.

Type of bond

There are two main methods of joining, inner joining and outer joining. Join the following two DataFrames in the "fruits" column.

Inner join

Rows that do not have a common value in the Key column are discarded. Also, leave other rows that have the same column and their values do not match Alternatively, you can specify to discard it. You can see that only the common data remains in the "fruits" column data of the two DataFrames.

Outer join

Some rows have no common value in the Key column. Elements that have no value are filled with NaN. You can see that NaN is inserted in the data in the row direction of "kiwifruit" and "mango".

Inner join basics how = "inner"

Let's try to internally combine two DataFrames, df1 and df2.

pandas.merge(df1, df2, on= "Key column", how="inner")

In this case, df1 is moved to the left Rows with mismatched values in the Key column are discarded.

In addition, common columns other than the Key column whose values do not match are left. The column that belonged to the DataFrame on the left is _x The column that belonged to the right side has _y as a suffix. Unless otherwise specified, the DataFrame index is not involved in the process.

Outer join basics how = "outer"

Let's outer join two DataFrames, df1 and df2.

pandas.merge(df1, df2, on="Key column", how="outer")

In this case, the row where df1 is moved to the left and the values do not match in the Key column Elements that have no residual value are filled with NaN.

In addition, common columns with mismatched values other than the Key column are left. The column that belonged to the DataFrame on the left is _x, The column that belonged to the right side has _y as a suffix. Unless otherwise specified, the DataFrame index is not involved in the process.

Combine columns that do not have the same name as Key

Of the two DataFrames Order_df (left), one with order information Suppose the other is customer_df (right) with customer information.

In the order information, the column indicating the ID of the purchasing customer is "customer_id". In customer information, the column indicating the customer's ID is "id".

The columns of the columns you want to correspond do not match.

In this case, by specifying the column you want to use as Key You can combine columns of DataFrames with different columns in correspondence.

pandas.merge(Left DF,Right DF, left_on="Left DFのカラム", right_on="Right DF column", how="Joining method")

Join with index as Key

When the Key used to join DataFrames is an index Specify the index instead of specifying the columns of the DataFrame.

left_index=True #Left DF index

right_index=True #Index of DF on the right

import pandas as pd

#Order information
order_df = pd.DataFrame([[1000, 2546, 103],
                         [1001, 4352, 101],
                         [1002, 342, 101],
                         [1003, 1192, 102]],
                         columns=["id", "item_id", "customer_id"])
#Customer information
customer_df = pd.DataFrame([["Tanaka"],
                           ["Suzuki"],
                           ["Kato"]],
                           columns=["name"])
customer_df.index = [101, 102, 103] #Change the index value here

# order_df and customer_Combine df with customer ID to key and order_Substitute in df
order_df = pd.merge(order_df, customer_df, left_on="customer_id", right_index=True, how="inner")
# right_index=True and customer_Joins the same line as id.

print(order_df)

#output
    id  item_id  customer_id    name
0  1000     2546          103    Kato
1  1001     4352          101  Tanaka
2  1002      342          101  Tanaka
3  1003     1192          102  Suzuki

#output(customer_df.without index, how="outer"in the case of)

       id  item_id  customer_id    name
0  1000.0   2546.0          103     NaN
1  1001.0   4352.0          101     NaN
2  1002.0    342.0          101     NaN
3  1003.0   1192.0          102     NaN
3     NaN      NaN            0  Tanaka
3     NaN      NaN            1  Suzuki
3     NaN      NaN            2    Kato

Data analysis using DataFrame

Get some lines

When the amount of data to be handled is huge, the output data is limited.

df.head() #Returns a DataFrame containing only the first 5 rows.

df.tail() #Returns a DataFrame containing only the last 5 rows.

If you specify an integer value as an argument, it will be for any number of lines at the beginning or end. You can get a DataFrame. The head () and tail () methods can also be used with Series type variables.

Apply calculation process

If you pass Series or DataFrame to the function provided by NumPy You can apply the calculation process to all the elements.

If you pass a DataFrame to a function that receives a NumPy array The calculation process is performed collectively for each column.

Also, Pandas supports broadcasting like NumPy. Use the regular expression "+-* /" to calculate between Pandas and between Pandas and integers. Can be processed.

Get summary statistics

Summary statistic: A summary of statistical information such as mean, maximum, and minimum values for each column.

When the dataframe type variable is df

df.describe() 
#Includes df number per column, mean, standard deviation, minimum, quartile, maximum, etc.
#Returns a DataFrame.

Note that the summary statistics displayed will vary depending on the type of data. Also, rows and elements can be obtained by using loc and at together.

The name of the summary statistic is used for the index of the obtained DataFrame.

import numpy as np
import pandas as pd

np.random.seed(0)
columns = ["apple", "orange", "banana", "strawberry", "kiwifruit"]

#Generate a DataFrame and add a column
df = pd.DataFrame()
for column in columns:
    df[column] = np.random.choice(range(1, 11), 10)
df.index = range(1, 11)

#Of the df summary statistics"mean", "max", "min"Take out df_Substitute in des.
df_des = df.describe().loc[["mean","max","min"]]
# df.describe()Only returns a Dataframe, so you need to specify it with loc.


print(df_des)
#output
      apple  orange  banana  strawberry  kiwifruit
mean    5.1     6.9     5.6         4.1        5.3
max    10.0     9.0    10.0         9.0       10.0
min     1.0     2.0     1.0         1.0        1.0

Find the difference between rows or columns of a DataFrame

The operation to find the difference between lines is a function especially used in time series analysis. When the dataframe type variable is df

df.diff() #Returns a DataFrame that calculates the difference between rows or columns.

df.diff(periods=Row or column spacing, axis=Axis direction)
# periods=Is optional
#If the first argument is positive, the difference from the previous line is calculated, and if it is negative, the difference from the subsequent line is calculated.
#The axis is 0 for the row direction and 1 for the column direction.

import numpy as np
import pandas as pd

np.random.seed(0)
columns = ["apple", "orange", "banana", "strawberry", "kiwifruit"]

#Generate a DataFrame and add a column
df = pd.DataFrame()
for column in columns:
    df[column] = np.random.choice(range(1, 11), 10)
df.index = range(1, 11)

#For each row of df, df the DataFrame that calculated the difference from the row after 2 rows_Substitute for diff.
df_diff = df.diff(periods="-2",axis=0)
#Or df_diff = df.diff(-2, axis=0)

#df and df_Check the processing contents by comparing the diff data.
print(df)
print()
print(df_diff)

#output
apple  orange  banana  strawberry  kiwifruit
1       6       8       6           3         10
2       1       7      10           4         10
3       4       9       9           9          1
4       4       9      10           2          5
5       8       2       5           4          8
6      10       7       4           4          4
7       4       8       1           4          3
8       6       8       4           8          8
9       3       9       6           1          3
10      5       2       1           2          1

    apple  orange  banana  strawberry  kiwifruit
1     2.0    -1.0    -3.0        -6.0        9.0
2    -3.0    -2.0     0.0         2.0        5.0
3    -4.0     7.0     4.0         5.0       -7.0
4    -6.0     2.0     6.0        -2.0        1.0
5     4.0    -6.0     4.0         0.0        5.0
6     4.0    -1.0     0.0        -4.0       -4.0
7     1.0    -1.0    -5.0         3.0        0.0
8     1.0     6.0     3.0         6.0        7.0
9     NaN     NaN     NaN         NaN        NaN
10    NaN     NaN     NaN         NaN        NaN

Grouping

For databases and DataFrames Aggregate rows with the same value for a particular column This is called grouping.