If you want to drop duplicates or aggregate with pandas, you can use drop_duplicates or groupby.

Related articles: How to remove duplicate elements in a pandas DataFrame or Series / How to use pandas groupby

However, I sometimes want to assign a group_id to each group using the same conditions as a groupby, and I didn't know how to do that, so I implemented it myself. (It may not be best practice, but it was easy to implement.)
```python
# Import pandas
import pandas as pd

# Prepare the sample data frame
df = pd.DataFrame({
    'building_name': ['Building A', 'Building A', 'B building', 'C building', 'B building', 'B building', 'D building'],
    'property_scale': ['large', 'large', 'small', 'small', 'small', 'small', 'large'],
    'city_code': [1, 1, 1, 2, 1, 1, 1]
})
df
```
| building_name | property_scale | city_code |
|---|---|---|
| Building A | large | 1 |
| Building A | large | 1 |
| B building | small | 1 |
| C building | small | 2 |
| B building | small | 1 |
| B building | small | 1 |
| D building | large | 1 |
```python
# Create the groupby object
group_info = df.groupby(['property_scale', 'city_code'])

# Take a look at its contents
group_info.groups
```

{('large', 1): Int64Index([0, 1, 6], dtype='int64'), ('small', 1): Int64Index([2, 4, 5], dtype='int64'), ('small', 2): Int64Index([3], dtype='int64')}
```python
# Also look at one of the groups
group_info.get_group(('large', 1))
```
| building_name | property_scale | city_code |
|---|---|---|
| Building A | large | 1 |
| Building A | large | 1 |
| D building | large | 1 |
```python
# Assign a group_id to each group
df = pd.concat([
    group_info.get_group(group_name).assign(group_id=group_id)
    for group_id, group_name
    in enumerate(group_info.groups.keys())])
df
```
| building_name | property_scale | city_code | group_id |
|---|---|---|---|
| Building A | large | 1 | 0 |
| Building A | large | 1 | 0 |
| D building | large | 1 | 0 |
| B building | small | 1 | 1 |
| B building | small | 1 | 1 |
| B building | small | 1 | 1 |
| C building | small | 2 | 2 |
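Note that the concat above emits rows group by group, so the result is no longer in the original row order. A minimal sketch (my own addition, using the same sample data) showing that `sort_index()` restores the original order afterwards:

```python
import pandas as pd

df = pd.DataFrame({
    'building_name': ['Building A', 'Building A', 'B building', 'C building',
                      'B building', 'B building', 'D building'],
    'property_scale': ['large', 'large', 'small', 'small', 'small', 'small', 'large'],
    'city_code': [1, 1, 1, 2, 1, 1, 1]
})

group_info = df.groupby(['property_scale', 'city_code'])
df2 = pd.concat([
    group_info.get_group(name).assign(group_id=gid)
    for gid, name in enumerate(group_info.groups.keys())
])

# The concat reordered the rows group by group; sort_index restores the original order
df2 = df2.sort_index()
print(df2['group_id'].tolist())  # [0, 0, 1, 2, 1, 1, 0]
```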
Let's also turn this into a function.
```python
import pandas as pd
from pandas.core.frame import DataFrame

def add_group_id(df: DataFrame, by: list) -> DataFrame:
    """Assign a group_id to records that share the same values in the `by` columns.

    Args:
        df (DataFrame): Any data frame
        by (list): Column names to group by

    Returns:
        DataFrame
    """
    # If a group_id column already exists, include it in `by` as well
    if 'group_id' in df.columns:
        by = by + ['group_id']  # avoid mutating the caller's list
    group_info = df.groupby(by=by)
    new_df = pd.concat([
        group_info.get_group(group_name).assign(group_id=group_id)
        for group_id, group_name
        in enumerate(group_info.groups.keys())])
    return new_df
```
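A quick usage sketch on a small frame (the function body is repeated here so the snippet runs standalone):

```python
import pandas as pd

def add_group_id(df, by):
    """Assign a group_id to records sharing the same values in `by` (same logic as above)."""
    if 'group_id' in df.columns:
        by = by + ['group_id']
    group_info = df.groupby(by=by)
    return pd.concat([
        group_info.get_group(name).assign(group_id=gid)
        for gid, name in enumerate(group_info.groups.keys())
    ])

df = pd.DataFrame({
    'building_name': ['Building A', 'B building', 'B building'],
    'property_scale': ['large', 'small', 'small'],
    'city_code': [1, 1, 1],
})
out = add_group_id(df, ['property_scale', 'city_code'])
print(out['group_id'].tolist())  # [0, 1, 1]
```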
Thanks to a comment from @r_beginners, it turns out that groupby already has a built-in way to compute group ids: ngroup.
```python
import pandas as pd
from pandas.core.frame import DataFrame

def add_group_id(df: DataFrame, by: list) -> DataFrame:
    """Assign a group_id to records that share the same values in the `by` columns.

    Args:
        df (DataFrame): Any data frame
        by (list): Column names to group by

    Returns:
        DataFrame
    """
    # If a group_id column already exists, include it in `by` as well
    if 'group_id' in df.columns:
        by = by + ['group_id']  # avoid mutating the caller's list
    new_df = df.assign(group_id=df.groupby(by).ngroup())
    return new_df
```
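Unlike the concat approach, ngroup keeps the rows in their original order. A minimal sketch on the sample data (my own addition):

```python
import pandas as pd

df = pd.DataFrame({
    'building_name': ['Building A', 'Building A', 'B building', 'C building',
                      'B building', 'B building', 'D building'],
    'property_scale': ['large', 'large', 'small', 'small', 'small', 'small', 'large'],
    'city_code': [1, 1, 1, 2, 1, 1, 1]
})

# ngroup numbers the groups (in sorted key order by default) and keeps the row order
out = df.assign(group_id=df.groupby(['property_scale', 'city_code']).ngroup())
print(out['group_id'].tolist())  # [0, 0, 1, 2, 1, 1, 0]
```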
As @nkay commented, pd.factorize() also works.
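One way to apply pd.factorize here (my own sketch, not taken from the original comment) is to factorize the tuples of key columns. Note that factorize numbers keys in order of first appearance, unlike ngroup's sorted order, though the two happen to coincide on this sample:

```python
import pandas as pd

df = pd.DataFrame({
    'property_scale': ['large', 'large', 'small', 'small', 'small', 'small', 'large'],
    'city_code': [1, 1, 1, 2, 1, 1, 1]
})

# factorize numbers the key tuples in order of first appearance
codes, uniques = pd.factorize(list(zip(df['property_scale'], df['city_code'])))
out = df.assign(group_id=codes)
print(out['group_id'].tolist())  # [0, 0, 1, 2, 1, 1, 0]
```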
I should keep studying pandas methods.