Data processing starts with reading the data. When you read in a data set, you often find empty columns or columns of differing lengths, and you want to delete the ones you don't need. Suppose you read the following table.
a | b | c | d |
---|---|---|---|
1 | 2 | 3 | NA |
2 | 3 | 4 | NA |
3 | 4 | NA | NA |
df <- data.frame(a = c(1, 2, 3),
                 b = c(2, 3, 4),
                 c = c(3, 4, NA),
                 d = c(NA, NA, NA))
> df
a b c d
1 1 2 3 NA
2 2 3 4 NA
3 3 4 NA NA
library(dplyr)
Apply `anyNA()` to each column to check whether it contains NA, then negate the result as a logical vector.
df %>% lapply(.,anyNA) %>% unlist %>% !.
a b c d
TRUE TRUE FALSE FALSE
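The same logical vector can also be built without the pipe. The following is a minimal sketch in base R, assuming the same example data frame; `sapply()` already returns a named logical vector, so no `unlist` step is needed. The variable name `keep` is only for illustration.

```r
# Example data frame from the text
df <- data.frame(a = c(1, 2, 3),
                 b = c(2, 3, 4),
                 c = c(3, 4, NA),
                 d = c(NA, NA, NA))

# TRUE for every column that contains no NA
keep <- !sapply(df, anyNA)
keep

# Base R subsetting with the logical vector keeps only those columns;
# drop = FALSE keeps the result a data frame even if one column remains
df[, keep, drop = FALSE]
```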
Pass this to `select_if()`, which selects the columns that meet a condition.
df %>% select_if(lapply(.,anyNA) %>% unlist %>% !.)
a b
1 1 2
2 2 3
3 3 4
I managed to get rid of unnecessary columns with a one-line script. I think the readability is not so bad thanks to the pipe.
The comment from @hkzm was helpful.
Using `purrr::negate()`, which returns the negation of a given predicate function, made this simpler and more direct to write.
library(tidyverse)
df %>% select_if(negate(anyNA))
a b
1 1 2
2 2 3
3 3 4
This is equivalent to negating with the `!` operator and writing the predicate as a formula with `~`.
df %>% select_if(~ !anyNA(.))
a b
1 1 2
2 2 3
3 3 4
Personally, I find the version using `purrr::negate()` the more readable of the two.
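In dplyr 1.0.0 and later, `select_if()` is superseded in favor of `select()` combined with `where()`. As a sketch, assuming a recent dplyr is installed, the same column filter could be written as follows; the variable name `result` is only for illustration.

```r
library(dplyr)

# Example data frame from the text
df <- data.frame(a = c(1, 2, 3),
                 b = c(2, 3, 4),
                 c = c(3, 4, NA),
                 d = c(NA, NA, NA))

# where() applies the predicate to each column and keeps the columns
# for which it returns TRUE (here: columns with no NA)
result <- df %>% select(where(~ !anyNA(.x)))
result
```

`where()` accepts either a function or a formula, so the choice between the formula style and a negated function is the same readability trade-off as with `select_if()`.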
Python's pandas has a `dropna()` method that removes missing values and works in both the row and column directions.
Reference: Exclude (delete) / replace (fill in) / extract missing value NaN with pandas
- Delete columns that contain at least one NA
dropna(how='any', axis=1)
- Delete columns that are entirely NA
dropna(how='all', axis=1)
However, in R the only built-in function for deleting missing values works in the row direction.
- Delete rows that contain at least one NA
na.omit()
I searched Google for a way to delete in the column direction but couldn't find one easily. I finally found this article.
Means to remove columns containing NA in R
Every programming language has its strengths and weaknesses. Why isn't there a similar function in R?
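For comparison, a column-direction counterpart is short to write by hand. The sketch below uses only base R; `drop_na_cols()` is a hypothetical helper name, not a function from base R or dplyr, and its `how` argument is modeled on the pandas options shown above.

```r
# Hypothetical helper (not part of base R or dplyr): drop columns by NA
# content, mirroring pandas' dropna(how = ..., axis = 1)
drop_na_cols <- function(data, how = c("any", "all")) {
  how <- match.arg(how)
  na_count <- colSums(is.na(data))  # number of NAs in each column
  drop <- if (how == "any") na_count > 0 else na_count == nrow(data)
  data[, !drop, drop = FALSE]
}

# Example data frame from the text
df <- data.frame(a = c(1, 2, 3),
                 b = c(2, 3, 4),
                 c = c(3, 4, NA),
                 d = c(NA, NA, NA))

drop_na_cols(df, how = "any")  # drops c and d
drop_na_cols(df, how = "all")  # drops only d
```

With `how = "all"`, column `c` survives because only one of its values is missing, which is the case the `anyNA()` approach cannot express.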