If you try to add a series to a pandas dataframe, it behaves like a join, so be careful.

Introduction

When processing data with pandas, adding columns to a data frame is a frequent operation. There are two main ways to add columns to a data frame.

Add column by specifying column name
Add with pd.DataFrame.assign method

**1. Add column by specifying column name**

df['new_col'] = data

**2. Add column using assign method**

df.assign(new_col=data)

In either case, you can pass a list or np.array with the same length as the data frame, a pd.Series, and so on.
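As a minimal sketch of the two approaches above (the column names `from_list` and `from_array` are made up for illustration), note that the first modifies the data frame in place while `assign` returns a new one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Approach 1: item assignment modifies df in place.
df["from_list"] = [10, 20, 30]

# Approach 2: assign returns a new data frame and leaves df unchanged.
df2 = df.assign(from_array=np.array([0.1, 0.2, 0.3]))

print(df.columns.tolist())   # ['a', 'from_list']
print(df2.columns.tolist())  # ['a', 'from_list', 'from_array']
```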

Unintuitive behavior that occurs when assigning a series

Prepare a data frame and a series with the same number of records.

import pandas as pd

df = pd.DataFrame(
    [[1,2,3], [4,5,6], [7,8,9]],
    columns=['a', 'b', 'c'],
    index=[1,2,3]
)
sr = pd.Series([-1, -2, -3])

df
#   	a 	b 	c
# 1 	1 	2 	3
# 2 	4 	5 	6
# 3 	7 	8 	9

sr
# 0   -1
# 1   -2
# 2   -3
# dtype: int64

If you want to add the data of sr to df as a new column 'd', you would do the following.

df = df.assign(d=sr)

You would expect the following table to be created.

	a	b	c	d
1	1	2	3	-1
2	4	5	6	-2
3	7	8	9	-3

However, the data frame that is actually returned looks like this.

	a	b	c	d
1	1	2	3	-2.0
2	4	5	6	-3.0
3	7	8	9	NaN

What's going on

Comparing the data frame and the series again, you can see that their indexes do not match: the data frame is indexed 1, 2, 3, while the series is indexed 0, 1, 2. With such data, even a plain assignment behaves like a join on the index.
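The mismatch can be checked directly. The following sketch rebuilds the same `df` and `sr` and looks at which index labels the two share; only those rows receive a value from the series (the variable names `shared` and `result` are just for illustration).

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "b", "c"],
    index=[1, 2, 3],
)
sr = pd.Series([-1, -2, -3])  # default index: 0, 1, 2

# Labels present in both indexes; only these rows get a value from sr.
shared = df.index.intersection(sr.index)
print(shared.tolist())  # [1, 2]

result = df.assign(d=sr)
# Label 3 exists only in df, so its entry in 'd' is missing.
print(result["d"].isna().tolist())  # [False, False, True]
```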

Workaround

This can be avoided by passing the data as an np.array instead of a series.

df.assign(new_col=new_series.values)

Note: The official documentation recommends the to_numpy method over the values attribute. This appears to be for clearly distinguishing the `ExtensionArray` types added in pandas 0.24.
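Applied to the example data, the workaround might look like the following sketch, here using `to_numpy` as the note suggests. Because the array carries no index, the values are placed by position.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "b", "c"],
    index=[1, 2, 3],
)
sr = pd.Series([-1, -2, -3])

# to_numpy() drops the series index, so no alignment takes place.
fixed = df.assign(d=sr.to_numpy())
print(fixed["d"].tolist())  # [-1, -2, -3]
```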

Why this happens

First, if the value passed to the pd.DataFrame.assign method is not callable, assign internally just performs approach 1 shown at the beginning on a copy of the data frame. Therefore, the phenomenon described here occurs with either approach.

(By the way, "the passed value is callable" covers cases such as passing a lambda expression that refers to the data frame's own columns.) [^ callable]
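Both points can be checked with a small sketch. The first part compares item assignment on a copy with `assign`; the second passes a callable, which receives the data frame and whose result is then assigned. The column name `e` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "b", "c"],
    index=[1, 2, 3],
)
sr = pd.Series([-1, -2, -3])

# Non-callable value: item assignment and assign give the same result.
df_setitem = df.copy()
df_setitem["d"] = sr
df_assign = df.assign(d=sr)
print(df_setitem.equals(df_assign))  # True

# Callable value: it is called with the data frame itself.
df_callable = df.assign(e=lambda x: x["a"] + x["b"])
print(df_callable["e"].tolist())  # [3, 9, 15]
```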

When you assign with something like df['X'] = value, pd.DataFrame.__setitem__() is called. Following the code, I found the following docstring. [^ setitem]

        """
        Add series to DataFrame in specified column.
        If series is a numpy-array (not a Series/TimeSeries), it must be the
        same length as the DataFrames index or an error will be thrown.
        Series/TimeSeries will be conformed to the DataFrames index to
        ensure homogeneity.
        """

In other words, the docstring states that the passed data is handled as follows.

A numpy array of the same length as the data frame: added in the given order
A Series: conformed to the index of the data frame before being added

If you follow the code further, you'll see that the series is reindexed to the index of the data frame before it's added. [^ reindex]
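The observable effect is the same as calling `reindex` on the series yourself. The sketch below is not the internal pandas code, only a check that the assigned column matches `sr.reindex(df.index)` for the example data.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "b", "c"],
    index=[1, 2, 3],
)
sr = pd.Series([-1, -2, -3])

# Conform the series to the data frame's index by hand.
aligned = sr.reindex(df.index)
print(aligned.tolist())  # [-2.0, -3.0, nan]

# The column produced by assign holds the same values on the same index.
result = df.assign(d=sr)
pd.testing.assert_series_equal(result["d"], aligned, check_names=False)
```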

The surprise came from treating the series the same way as an array. As shown below, the size of the series doesn't even have to match the number of records in the data frame, so a series behaves completely differently from a list or array.

df.assign(
    x=pd.Series([3], index=[2])
)
#  	 	a 	b 	c 	x
# 1 	1 	2 	3 	NaN
# 2 	4 	5 	6 	3.0
# 3 	7 	8 	9 	NaN
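For contrast, here is a sketch of the same kind of call with a plain list of the wrong length. Unlike the series, this is expected to fail with a ValueError, so the mistake is noticed immediately. The exact error message depends on the pandas version, so it is only printed here.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "b", "c"],
    index=[1, 2, 3],
)

# A series of a different length is accepted silently.
with_series = df.assign(x=pd.Series([3], index=[2]))

# A list of a different length is rejected.
raised = False
try:
    df.assign(x=[3, 4])
except ValueError as exc:
    raised = True
    print("list of wrong length:", exc)
```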

Summary

Don't treat a series like a list or array when adding one-dimensional data of the same size to a data frame. When a series is assigned to a column of a data frame, processing continues without an error even if the sizes differ, so the risk of introducing a bug without noticing it increases. I have resolved to always convert the data to a numpy array before assigning it.
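If converting to an array every time feels easy to forget, one option is a small guard. The helper below, `assign_checked`, is hypothetical and not part of pandas. It is a sketch that refuses any series whose index differs from the data frame's, so a mismatch raises an error instead of silently producing NaN.

```python
import pandas as pd


def assign_checked(df, **columns):
    """Hypothetical helper: like df.assign, but reject misaligned series."""
    for name, value in columns.items():
        if isinstance(value, pd.Series) and not value.index.equals(df.index):
            raise ValueError(f"index of {name!r} does not match the data frame")
    return df.assign(**columns)


df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "b", "c"],
    index=[1, 2, 3],
)
sr = pd.Series([-1, -2, -3])

# The misaligned series is rejected.
rejected = False
try:
    assign_checked(df, d=sr)
except ValueError:
    rejected = True

# A series rebuilt on the data frame's index is accepted.
ok = assign_checked(df, d=pd.Series(sr.to_numpy(), index=df.index))
print(rejected, ok["d"].tolist())  # True [-1, -2, -3]
```

Arrays and lists pass through unchanged, since pandas already checks their length.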

This verification was done with pandas 1.0.3, but the behavior was the same in previous versions.