If you try to add a series to a pandas dataframe, it behaves like a join, so be careful.
When processing data with pandas, you frequently add columns to a data frame. There are two main ways to add a column to a data frame: specifying the column name, and using the pd.DataFrame.assign method.
**1. Add a column by specifying the column name**
df['new_col'] = data
**2. Add a column using the assign method**
df.assign(new_col=data)
In either case, you can pass a scalar value, a list of the same size as the data frame, an np.array, a pd.Series, and so on.
Prepare an arbitrary data frame and a series with the same number of records.
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=['a', 'b', 'c'],
    index=[1, 2, 3],
)
# No index is given, so sr gets the default index 0, 1, 2
sr = pd.Series([-1, -2, -3])
df
#    a  b  c
# 1  1  2  3
# 2  4  5  6
# 3  7  8  9
sr
# 0   -1
# 1   -2
# 2   -3
# dtype: int64
If you want to add the data of sr as a new column 'd' to df, you would write the following.
df = df.assign(d=sr)
You would expect the following table to be created.
| | a | b | c | d |
|---|---|---|---|---|
| 1 | 1 | 2 | 3 | -1 |
| 2 | 4 | 5 | 6 | -2 |
| 3 | 7 | 8 | 9 | -3 |
However, in reality, the following data frame is returned.
| | a | b | c | d |
|---|---|---|---|---|
| 1 | 1 | 2 | 3 | -2.0 |
| 2 | 4 | 5 | 6 | -3.0 |
| 3 | 7 | 8 | 9 | NaN |
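Comparing the data frame and the series again, you can see that their indexes do not match: df is indexed 1, 2, 3, while sr has the default index 0, 1, 2. With data like this, even a plain assignment behaves like a join on the index. Labels 1 and 2 pick up the matching values from sr, label 3 has no match and becomes NaN, and the value at label 0 of sr is silently dropped.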
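To make the "behaves like a join" claim concrete, the following minimal sketch compares the assignment with an explicit left join on the index. It rebuilds the same df and sr as above; the variable names assigned, joined, and same_result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "b", "c"],
    index=[1, 2, 3],
)
sr = pd.Series([-1, -2, -3])  # default index 0, 1, 2

# Assignment aligns sr to df.index by label
assigned = df.assign(d=sr)

# An explicit left join on the index; the series needs a name to become a column
joined = df.join(sr.rename("d"), how="left")

# DataFrame.equals treats NaN values in the same position as equal
same_result = assigned.equals(joined)
print(same_result)
```

If the two frames compare equal, the assignment can be read as a left join that keeps every row of df and discards labels that exist only in sr.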
This can be avoided by passing the values as an np.array, which carries no index and is therefore assigned by position.
df.assign(new_col=new_series.values)
Note: According to the official documentation, the to_numpy method is recommended over the values attribute.
This seems to be in order to clearly distinguish the ExtensionArray added in pandas 0.24.
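As a small sketch of the workaround with to_numpy, using the same df and sr as in the example above (the name positional is only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["a", "b", "c"],
    index=[1, 2, 3],
)
sr = pd.Series([-1, -2, -3])  # default index 0, 1, 2

# to_numpy() drops the index, so the values are assigned by position
positional = df.assign(d=sr.to_numpy())
print(positional["d"].tolist())  # [-1, -2, -3]
```

Because the array has no labels, the first value goes to the first row regardless of the index of df.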
First, if the value passed to the pd.DataFrame.assign method is not callable, assign simply performs method 1 shown at the beginning (assignment by column name) internally. Therefore, this phenomenon occurs with either method.
(By the way, "the passed value is callable" refers to the case where you compute the new column from the data frame itself, for example with a lambda expression.) [^ callable]
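The two cases can be illustrated with a short sketch. The data frame and column names here are made up for the example and are not from the original post.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8]}, index=[1, 2, 3])

# Callable: assign calls it with the data frame and stores the result.
# The result is derived from df, so its index already matches.
with_callable = df.assign(total=lambda d: d["a"] + d["b"])
print(with_callable["total"].tolist())  # [3, 9, 15]

# Non-callable: assign and column-name assignment give the same result,
# including the index alignment of the series.
sr = pd.Series([10, 20, 30])  # default index 0, 1, 2
via_assign = df.assign(x=sr)
via_setitem = df.copy()
via_setitem["x"] = sr
same_behavior = via_assign.equals(via_setitem)
print(same_behavior)
```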
If you assign something like df['X'] = hogehoge, pd.DataFrame.__setitem__() is called. Following the code, I found the following docstring. [^ setitem]
"""
Add series to DataFrame in specified column.
If series is a numpy-array (not a Series/TimeSeries), it must be the
same length as the DataFrames index or an error will be thrown.
Series/TimeSeries will be conformed to the DataFrames index to
ensure homogeneity.
"""
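In other words, the docstring states that the passed data is handled as follows: a numpy array must have the same length as the index of the data frame, otherwise an error is raised, while a Series is conformed to the index of the data frame. If you follow the code further, you can see that the series is reindexed to the index of the data frame before it is added. [^ reindex]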
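Both branches of the docstring can be checked with a small sketch. It assumes that the alignment is observably equivalent to calling reindex with the index of the data frame; the internal code path of pandas may differ in detail, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 4, 7]}, index=[1, 2, 3])
sr = pd.Series([-1, -2, -3])  # default index 0, 1, 2

# Series: the assigned column matches the series reindexed to df.index
aligned = sr.reindex(df.index).rename("d")
matches_reindex = df.assign(d=sr)["d"].equals(aligned)
print(matches_reindex)

# numpy array: a length mismatch is an error instead of a silent NaN
try:
    df.assign(d=np.array([-1, -2]))
    raised = False
except ValueError:
    raised = True
print(raised)
```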
This phenomenon was caused by treating a series as if it were just an array. As you can see below, the size of the series does not even have to match the number of records in the data frame, which is completely different from how a list or array behaves.
df.assign(
    x=pd.Series([3], index=[2])
)
#    a  b  c    x
# 1  1  2  3  NaN
# 2  4  5  6  3.0
# 3  7  8  9  NaN
Don't add a series to a data frame as if it were just one-dimensional data of the same size, like a list or array. When you assign a series to a column of a data frame, processing continues without an error even if the size differs, so the risk of introducing a bug without noticing it increases. I have resolved to always convert a series to a numpy array before assigning it.
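As one possible way to enforce this habit, here is a sketch of a guard function. assign_strict is a hypothetical helper written for this example, not part of pandas; it refuses a series whose index differs from the data frame's index, so a misalignment becomes an error rather than NaN.

```python
import pandas as pd


def assign_strict(df, name, values):
    """Hypothetical helper: add a column, rejecting a misaligned Series."""
    if isinstance(values, pd.Series) and not values.index.equals(df.index):
        raise ValueError(
            f"index of series for column {name!r} does not match the data frame"
        )
    return df.assign(**{name: values})


df = pd.DataFrame({"a": [1, 4, 7]}, index=[1, 2, 3])
sr = pd.Series([-1, -2, -3])  # default index 0, 1, 2

try:
    assign_strict(df, "d", sr)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True

# Passing the raw values makes the positional intent explicit
ok = assign_strict(df, "d", sr.to_numpy())
print(ok["d"].tolist())  # [-1, -2, -3]
```

Raising an error is a design choice of this sketch; if label-based alignment is actually what you want, a plain assignment remains the right tool.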