There is an argument ʻindicator` when merging with pandas, but although it is very convenient, I feel that there are few people using it, so this is an article for the purpose of spreading.
The Documentation states as follows.
indicator : bool or str, default False If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.
Generally, if you join left, all records that exist only in the data on the left will be lost at the time of merge. However, when analyzing data, it is often the case that you want to see how the records are linked and which records are not linked. If you merge normally, it will be quite troublesome to check this, or later you will notice that it is not tied at all and it will be a disaster, but if you use ʻindicator` introduced in this article, you can easily check this. I will.
There is no particular problem if the target to be analyzed is only in-house data and all the data is completely linked with the key, but it is useful when pulling unfamiliar data from the outside and linking it with the company's own data. To do.
Let's take a concrete example.
Suppose you want to attach the weight data b picked up from the outside to the height data a you have.
a = pd.DataFrame({
"ID" : [1, 2, 3, 4, 6],
"name": ["Tom", "Bob", "Alex", "Mike", "John"],
"heigt": [169, 173, 163, 170, 182]
})
b = pd.DataFrame({
"ID" : [1, 2, 3, 4, 5],
"weight": [65, 70, 72, 58, 61]
})
a
b
If you want to analyze without thinking
c = a.merge(b, on="ID", how="left")
# or
c = a.merge(b, on="ID", how="inner")
I think you should do it according to your purpose.
Here, b is the data brought from the outside, so we want to know the following.
――How much are a and b linked? --I want to see the details of the record where a (b) does not stick
You can easily check this with the ʻindicator option. It's easy to use, just add ʻindicator = True
as shown below.
(Duplicate check of join key is not mentioned here)
c = a.merge(b, on="ID", how="outer", indicator=True)
c
Then the _merge
column will be added along with the merged dataset as above.
The _merge
column contains one of the values both
, left_only
, or right_only
, which tells you where the record came from. After that, use this column
c_left_only = c[c["_merge"]=="left_only"]
c_left_only
Then you can get a list of people who are tall but not weight.
It is convenient to set how = "outer" at the time of merging and later extract the data with the value of "_merge" according to the purpose.
Also, ʻindicator` supports not only bool but also str type,
d = a.merge(b, on="ID", how="outer", indicator="flg_weight")
d
The `_merge` part can be the specified string.
It's a convenient function, so please use it! Let's have a good data analysis life ~~
Recommended Posts