DataFrames can be concatenated and combined in Pandas.
The operation of connecting DataFrames in a certain direction as they are is called concatenation. Concatenation of DataFrames with matching indexes or columns
axis=0 #By specifying this, the list will be concatenated vertically from the beginning.
Also, if axis = 1 is specified, they will be connected horizontally.
#Connected vertically
pandas.concat([List of DataFrames], axis=0)
#Connected horizontally
pandas.concat([List of DataFrames], axis=1)
#When concatenating horizontally, even if each DataFrame has the same column, it does not distinguish
#Note that the columns will be duplicated as they will be concatenated.
When connecting in the vertical direction, connect in the same column When concatenating horizontally, concatenate with the same index.
The point is that empty rows and columns are set once with NaN.
For more information When concatenating DataFrames whose indexes and columns do not match Cells with NaNs in rows and columns that are not common indexes or columns are automatically created.
Even if the indexes and columns do not match, the method of concatenation is the same as if they do.
pandas.concat([List of DataFrames], axis=0)
#Connect vertically from the top of the list
# axis=If 1 is specified, they will be connected horizontally.
Since the concatenation connects the DataFrames without distinguishing them There may be duplicate labels.
For example, in the concatenation of Example 1 below, "apple", "orange", "banana" are You can see that they are duplicated as labels.
In this case, by adding the label specified by keys to pd.concat () You can avoid duplicate labels.
concat_df=pd.concat([df_data1,df_data2],axis=1,keys=["X", "Y"]) #Set with keys
The data frame after concatenation will be a Multi Index that uses multiple labels.
In the case of Example 2, you can see that the new "X" and "Y" columns have been added above the existing columns. In this case, df ["X"] can refer to the column labeled "X". You can refer to the "apple" column in the "X" column by writing df ["X", "apple"].
Describes the join. Joining is also known as merging. Join specifies a column called Key It is an operation to connect rows with matching values in the Key of two databases horizontally.
There are two main methods of joining, inner joining and outer joining. Join the following two DataFrames in the "fruits" column.
Rows that do not have a common value in the Key column are discarded. Also, leave other rows that have the same column and their values do not match Alternatively, you can specify to discard it. You can see that only the common data remains in the "fruits" column data of the two DataFrames.
Some rows have no common value in the Key column. Elements that have no value are filled with NaN. You can see that NaN is inserted in the data in the row direction of "kiwifruit" and "mango".
Let's try to internally combine two DataFrames, df1 and df2.
pandas.merge(df1, df2, on= "Key column", how="inner")
In this case, df1 is moved to the left Rows with mismatched values in the Key column are discarded.
In addition, common columns other than the Key column whose values do not match are left. The column that belonged to the DataFrame on the left is _x The column that belonged to the right side has _y as a suffix. Unless otherwise specified, the DataFrame index is not involved in the process.
Let's outer join two DataFrames, df1 and df2.
pandas.merge(df1, df2, on="Key column", how="outer")
In this case, the row where df1 is moved to the left and the values do not match in the Key column Elements that have no residual value are filled with NaN.
In addition, common columns with mismatched values other than the Key column are left. The column that belonged to the DataFrame on the left is _x, The column that belonged to the right side has _y as a suffix. Unless otherwise specified, the DataFrame index is not involved in the process.
Of the two DataFrames Order_df (left), one with order information Suppose the other is customer_df (right) with customer information.
In the order information, the column indicating the ID of the purchasing customer is "customer_id". In customer information, the column indicating the customer's ID is "id".
The columns of the columns you want to correspond do not match.
In this case, by specifying the column you want to use as Key You can combine columns of DataFrames with different columns in correspondence.
pandas.merge(Left DF,Right DF, left_on="Left DFのカラム", right_on="Right DF column", how="Joining method")
When the Key used to join DataFrames is an index Specify the index instead of specifying the columns of the DataFrame.
left_index=True #Left DF index
right_index=True #Index of DF on the right
import pandas as pd
#Order information
order_df = pd.DataFrame([[1000, 2546, 103],
[1001, 4352, 101],
[1002, 342, 101],
[1003, 1192, 102]],
columns=["id", "item_id", "customer_id"])
#Customer information
customer_df = pd.DataFrame([["Tanaka"],
["Suzuki"],
["Kato"]],
columns=["name"])
customer_df.index = [101, 102, 103] #Change the index value here
# order_df and customer_Combine df with customer ID to key and order_Substitute in df
order_df = pd.merge(order_df, customer_df, left_on="customer_id", right_index=True, how="inner")
# right_index=True and customer_Joins the same line as id.
print(order_df)
#output
id item_id customer_id name
0 1000 2546 103 Kato
1 1001 4352 101 Tanaka
2 1002 342 101 Tanaka
3 1003 1192 102 Suzuki
#output(customer_df.without index, how="outer"in the case of)
id item_id customer_id name
0 1000.0 2546.0 103 NaN
1 1001.0 4352.0 101 NaN
2 1002.0 342.0 101 NaN
3 1003.0 1192.0 102 NaN
3 NaN NaN 0 Tanaka
3 NaN NaN 1 Suzuki
3 NaN NaN 2 Kato
When the amount of data to be handled is huge, the output data is limited.
df.head() #Returns a DataFrame containing only the first 5 rows.
df.tail() #Returns a DataFrame containing only the last 5 rows.
If you specify an integer value as an argument, it will be for any number of lines at the beginning or end. You can get a DataFrame. The head () and tail () methods can also be used with Series type variables.
If you pass Series or DataFrame to the function provided by NumPy You can apply the calculation process to all the elements.
If you pass a DataFrame to a function that receives a NumPy array The calculation process is performed collectively for each column.
Also, Pandas supports broadcasting like NumPy. Use the regular expression "+-* /" to calculate between Pandas and between Pandas and integers. Can be processed.
Summary statistic: A summary of statistical information such as mean, maximum, and minimum values for each column.
When the dataframe type variable is df
df.describe()
#Includes df number per column, mean, standard deviation, minimum, quartile, maximum, etc.
#Returns a DataFrame.
Note that the summary statistics displayed will vary depending on the type of data. Also, rows and elements can be obtained by using loc and at together.
The name of the summary statistic is used for the index of the obtained DataFrame.
import numpy as np
import pandas as pd
np.random.seed(0)
columns = ["apple", "orange", "banana", "strawberry", "kiwifruit"]
#Generate a DataFrame and add a column
df = pd.DataFrame()
for column in columns:
df[column] = np.random.choice(range(1, 11), 10)
df.index = range(1, 11)
#Of the df summary statistics"mean", "max", "min"Take out df_Substitute in des.
df_des = df.describe().loc[["mean","max","min"]]
# df.describe()Only returns a Dataframe, so you need to specify it with loc.
print(df_des)
#output
apple orange banana strawberry kiwifruit
mean 5.1 6.9 5.6 4.1 5.3
max 10.0 9.0 10.0 9.0 10.0
min 1.0 2.0 1.0 1.0 1.0
The operation to find the difference between lines is a function especially used in time series analysis. When the dataframe type variable is df
df.diff() #Returns a DataFrame that calculates the difference between rows or columns.
df.diff(periods=Row or column spacing, axis=Axis direction)
# periods=Is optional
#If the first argument is positive, the difference from the previous line is calculated, and if it is negative, the difference from the subsequent line is calculated.
#The axis is 0 for the row direction and 1 for the column direction.
import numpy as np
import pandas as pd
np.random.seed(0)
columns = ["apple", "orange", "banana", "strawberry", "kiwifruit"]
#Generate a DataFrame and add a column
df = pd.DataFrame()
for column in columns:
df[column] = np.random.choice(range(1, 11), 10)
df.index = range(1, 11)
#For each row of df, df the DataFrame that calculated the difference from the row after 2 rows_Substitute for diff.
df_diff = df.diff(periods="-2",axis=0)
#Or df_diff = df.diff(-2, axis=0)
#df and df_Check the processing contents by comparing the diff data.
print(df)
print()
print(df_diff)
#output
apple orange banana strawberry kiwifruit
1 6 8 6 3 10
2 1 7 10 4 10
3 4 9 9 9 1
4 4 9 10 2 5
5 8 2 5 4 8
6 10 7 4 4 4
7 4 8 1 4 3
8 6 8 4 8 8
9 3 9 6 1 3
10 5 2 1 2 1
apple orange banana strawberry kiwifruit
1 2.0 -1.0 -3.0 -6.0 9.0
2 -3.0 -2.0 0.0 2.0 5.0
3 -4.0 7.0 4.0 5.0 -7.0
4 -6.0 2.0 6.0 -2.0 1.0
5 4.0 -6.0 4.0 0.0 5.0
6 4.0 -1.0 0.0 -4.0 -4.0
7 1.0 -1.0 -5.0 3.0 0.0
8 1.0 6.0 3.0 6.0 7.0
9 NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN
For databases and DataFrames Aggregate rows with the same value for a particular column This is called grouping.
When the dataframe type variable is df
df.groupby("column") # 指定したcolumn(列)でグループ化を行います。
At this time, a GroupBy object is returned, You cannot directly view the grouped results.
You can perform the following operations on the GroupBy object.
mean() #Find the average value for each group
sum () #Perform operations such as finding the sum
Recommended Posts