A summary of how to check for duplicate elements in a table and how to remove duplicate values.
It is possible to perform the same processing as "duplicate removal" which is frequently used in excel.
-Row0 and row5 are completely duplicated. ・ All elements of col0 and col4 are duplicated ・ Col2 is row0 = row5, row1 = row4 ・ Col3 is row0 = row5, row1 = row3
-** Duplicate lines are True **.
-** By default, the upper duplicate line is regarded as False **, and the lines below it are regarded as True. It is also possible to set the bottom side to True as an option. └ Decide whether the target is the top or bottom when deleting. └ It is also possible to delete all duplicate lines.
-It is also possible to specify the target column by column name.
①duplicated ②duplicated(keep='last') ③duplicated(keep=False) ④duplicated(['aaa']) ⑤duplicated(['aaa','bbb'])
duplicated
df.duplicated()
#output
row0 False
row1 False
row2 False
row3 False
row4 False
row5 True
dtype: bool
Only row5 is judged as duplicate (True).
Row = 0 and row = 5 are the same in all columns. By default, the first row is False, so row5 is judged as duplicate.
Default: "keep ='first'" The following two processes are the same.
df.duplicated()
df.duplicated(keep='first')
keep=last
df.duplicated(keep='last')
#output
row0 True
row1 False
row2 False
row3 False
row4 False
row5 False
dtype: bool
row0 is a duplicate judgment (True).
keep=False
df.duplicated(keep=False)
#output
row0 True
row1 False
row2 False
row3 False
row4 False
row5 True
dtype: bool
Row0 and row5 are duplicate judgment (True).
** ▼ In case of duplicated (['col0']) **
Specify one column
df.duplicated(['col0'])
#output
row0 False
row1 True
row2 True
row3 True
row4 True
row5 True
dtype: bool
Since all the elements of col0 are 1, all except the first column are judged as duplicates.
df.duplicated(['col0','col3'])
#output
row0 False
row1 False
row2 False
row3 True
row4 False
row5 True
dtype: bool
Only row3 and row5 are True. └ row3 of col0 and col3 is True → True └ row5 of col0 and col3 is True → True └ col0 and row1 are True, row1 of col3 is False → False
The line that is judged as duplicate (True) by the duplicated method is deleted.
①drop_duplicates ②drop_duplicates(keep='last') ③drop_duplicates(keep=False) ④drop_duplicates(['aaa']) ⑤drop_duplicates(['aaa','bbb'])
drop_duplicates
df.drop_duplicates()
row5 has been deleted.
Duplicate row0 and row5. Since the top row is not judged as duplicate (default keep ='first'), row5 was deleted.
keep='last'
df.drop_duplicates(keep='last')
keep=False
df.drop_duplicates(keep=False)S
Deleted both row0 and row5.
Specify column
df.drop_duplicates(['col0'])
All elements of col0 are duplicated from row0 to row5. Since it is a duplicate judgment other than row1, it will be deleted.
Specify column
drop_duplicates(['col0'], keep='last')
If keep ='last', the bottom line will not be a duplicate check.
Specify multiple columns
df.drop_duplicates(['col2', 'col3'])
row5 deleted.
The duplicates in col2 are row1, row4, row5. The duplicates in col3 are row1 and row5. ⇛ Common is row1 and row5. Since the first duplicate row is set to be left (default keep ='first'), only row5 is deleted.
① Confirmation of duplicate line names ② Delete the duplicate line name and check
** ▼ Available options **
keep='first'
└ Do not judge duplicates for the first line name
keep='last'
└ Do not judge the last line name of duplicates as duplicates
keep=False
└ All duplicate line names are considered duplicates
└ If False, no quotation is required
Since the option of ".duplicated ()" is "keep ='first'" by default, the first duplicate line is False, and the second and subsequent lines are True.
Check for duplicate line names
df.index.duplicated()
#output
# array([False, False, False, False, True, True])
▼ Original table
The 0th and 5th, 1st and 4th line names are duplicated. ⇛ The 5th and 4th are duplicate judgments.
All duplicate line names are True
df.index.duplicated(keep=False)
#output
# array([ True, True, False, False, True, True])
How to get a list of row names by removing duplicate row names.
Used when you want to check the line name.
Get by deleting the line name
df.index.drop_duplicates()
#output
# Index(['AAA', 'BBB', 'CCC', 'DDD'], dtype='object')
The second "AAA" and "BBB" that were duplicated have been deleted.
▼ Original table
df.columns.duplicated()
df.columns.drop_duplicates()
└ ".columns": Extract the column names of the table with list.
Recommended Posts