Drop column in R using dplyr: A column can be dropped in R by placing a minus sign before the column name inside the select() function. The dplyr package provides select(), which selects or drops columns by name, by position, or by conditions on the column name such as starts with, ends with, contains and matches a pattern. Dropping columns with a regular expression and dropping columns with missing values are also covered, each with an example.
- Drop column by column name in R dplyr.
- Drop column by column position in dplyr.
- Drop column whose name contains a string or matches a pattern.
- Drop column whose name starts with or ends with certain characters.
- Drop column by name with a regular expression using the grepl() function.
- Drop column with missing values.
We will use the mtcars data to demonstrate dropping variables.
Drop by column names in Dplyr R:
The select() function, with a minus sign before the column names, drops columns by name.
library(dplyr)
mydata <- mtcars
# Drop the mpg, cyl and wt columns of the dataframe
select(mydata, -c(mpg, cyl, wt))
The above code drops the mpg, cyl and wt columns, so the columns have been dropped by name.
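When the names to drop are stored in a character vector, a minimal sketch along the same lines could use all_of() inside select(). The vector name cols_to_drop is a hypothetical name introduced for illustration, and dplyr 1.0.0 or later is assumed.

```r
library(dplyr)

mydata <- mtcars

# Hypothetical character vector holding the names of the columns to drop
cols_to_drop <- c("mpg", "cyl", "wt")

# all_of() raises an error if any listed name is missing from the dataframe
result <- select(mydata, -all_of(cols_to_drop))
names(result)
```

Because all_of() fails on a missing name, a misspelled column is caught instead of being silently ignored.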
Drop column by position in R Dplyr:
Drop 3rd, 4th and 5th columns of the dataframe:
To drop columns by position, pass the column positions as a vector with a negative sign to the select() function, as shown below.
library(dplyr)
mydata <- mtcars
# Drop the 3rd, 4th and 5th columns of the dataframe
select(mydata, -c(3, 4, 5))
The above code drops the 3rd, 4th and 5th columns, so the columns have been dropped by position.
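Since positions are easy to get wrong, it can help to check which names sit at those positions before dropping them. The following is a small sketch, assuming the default column order of mtcars; the name by_position is hypothetical.

```r
library(dplyr)

mydata <- mtcars

# Inspect the names at positions 3 to 5 before dropping them
names(mydata)[3:5]

# A negated range is a compact way to drop consecutive positions
by_position <- select(mydata, -(3:5))
names(by_position)
```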
Dropping by Matching with patterns
starts_with() function in R:
To drop the columns whose names start with a certain label, use the select() function along with starts_with(), passing the label to starts_with() as shown below.
library(dplyr)
mydata <- mtcars
# Drop the columns of the dataframe whose names start with "mpg"
select(mydata, -starts_with("mpg"))
The columns whose names start with "mpg" are dropped using the starts_with() and select() functions.
ends_with() function in R:
To drop the columns whose names end with a certain label, use the select() function along with ends_with(), passing the label to ends_with() as shown below.
library(dplyr)
mydata <- mtcars
# Drop the columns of the dataframe whose names end with "cyl"
select(mydata, -ends_with("cyl"))
The columns whose names end with "cyl" are dropped using the ends_with() and select() functions.
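The two helpers can also be combined in a single call. The sketch below is an illustration rather than part of the original examples: the prefix "d" and suffix "t" are chosen only because several mtcars column names match them, and the name trimmed is hypothetical.

```r
library(dplyr)

mydata <- mtcars

# Drop every column whose name starts with "d" or ends with "t"
# (disp and drat start with "d"; drat and wt end with "t")
trimmed <- select(mydata, -c(starts_with("d"), ends_with("t")))
names(trimmed)
```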
contains() function in R:
To drop the columns whose names contain a certain string, use the select() function along with contains(), passing the text to contains() as shown below.
library(dplyr)
mydata <- mtcars
# Drop the columns of the dataframe whose names contain "s"
select(mydata, -contains("s"))
The columns whose names contain "s" (disp, qsec and vs) are dropped using the contains() and select() functions.
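Note that contains() ignores case by default. The following sketch shows the effect of the ignore.case argument; the result names insensitive and sensitive are hypothetical.

```r
library(dplyr)

mydata <- mtcars

# Default behaviour: "S" also matches lower-case "s" (disp, qsec, vs)
insensitive <- select(mydata, -contains("S"))

# Case-sensitive match: no mtcars column name has an upper-case "S",
# so nothing is dropped
sensitive <- select(mydata, -contains("S", ignore.case = FALSE))

ncol(insensitive)
ncol(sensitive)
```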
matches() function:
Drop the columns whose names match "di". To drop the columns whose names match a certain pattern, use the select() function along with matches(), passing the text or regular expression to matches() as shown below.
library(dplyr)
mydata <- mtcars
# Drop the columns of the dataframe whose names match "di"
select(mydata, -matches("di"))
The column whose name matches "di" (disp) is dropped using the matches() and select() functions.
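Because matches() accepts a regular expression, it can express conditions that contains() cannot. As a sketch, the hypothetical pattern below drops every column whose name is exactly two characters long; the name longer_names is also hypothetical.

```r
library(dplyr)

mydata <- mtcars

# "^.{2}$" matches names made of exactly two characters (hp, wt, vs, am)
longer_names <- select(mydata, -matches("^.{2}$"))
names(longer_names)
```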
Drop Column names using Regular Expression in R Regex:
Columns whose names match a regular expression can also be dropped with the base R grepl() function. grepl() takes the regular expression and the column names as arguments and returns a logical vector that is TRUE for each matching name; negating that vector keeps only the columns that do not match, as shown below.
mydata = mtcars
# Drop the columns whose names start with "c" using a regular expression
mydata1 = mydata[, !grepl("^c", names(mydata))]
mydata1
The columns whose names start with "c" (cyl and carb) are dropped using the grepl() function along with a regular expression.
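To see what grepl() produces, the logical vector can be stored and inspected before it is used for subsetting. This is a minimal sketch; the name starts_with_c is hypothetical, and drop = FALSE is added as a precaution so the result stays a dataframe even if only one column remains.

```r
mydata <- mtcars

# TRUE for every column name that starts with "c"
starts_with_c <- grepl("^c", names(mydata))
names(mydata)[starts_with_c]

# Keep the columns that do not match; drop = FALSE preserves the dataframe class
mydata1 <- mydata[, !starts_with_c, drop = FALSE]
names(mydata1)
```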
Drop columns with missing values in R:
To illustrate dropping a column with missing values, first create the dataframe as shown below.
my_basket = data.frame(
  ITEM_GROUP = c("Fruit","Fruit","Fruit","Fruit","Fruit","Vegetable","Vegetable","Vegetable","Vegetable","Dairy","Dairy","Dairy","Dairy","Dairy"),
  ITEM_NAME = c("Apple","Banana","Orange","Mango","Papaya","Carrot","Potato","Brinjal","Raddish","Milk","Curd","Cheese","Milk","Paneer"),
  Price = c(100,80,80,90,65,70,60,70,25,60,40,35,50,60),
  Tax = c(2,4,5,NA,2,3,NA,1,NA,4,5,NA,4,NA))
my_basket
The resulting dataframe has 14 rows and 4 columns, and 5 of the 14 values in the Tax column are NA.
The sapply() function applies a built-in or user-defined function to each column of a dataframe, which makes it an alternative to an explicit for loop. sapply(df, function(x) mean(is.na(x))) returns the proportion of missing values in each column of the dataframe df.
# Drop the columns whose proportion of missing values exceeds 0.3
my_basket = my_basket[, !(sapply(my_basket, function(x) mean(is.na(x))) > 0.3)]
my_basket
The above code removes the Tax column because 5 of its 14 values (about 36%) are missing, which exceeds the 30% threshold. The final dataframe therefore has no Tax column.
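The same threshold rule can be sketched in dplyr style with where() inside select(). This is an alternative to the sapply() approach rather than the method used above; it assumes dplyr 1.0.0 or later, and the small dataframe basket and the name na_threshold are hypothetical stand-ins introduced for illustration.

```r
library(dplyr)

# Small illustrative dataframe (hypothetical values, not the full my_basket)
basket <- data.frame(
  ITEM_NAME = c("Apple", "Carrot", "Milk", "Curd"),
  Price = c(100, 70, 60, 40),
  Tax = c(2, NA, NA, 5)
)

# Maximum share of missing values a column may have and still be kept
na_threshold <- 0.3

# where() keeps the columns for which the function returns TRUE
kept <- select(basket, where(function(x) mean(is.na(x)) <= na_threshold))
names(kept)
```

Here half of the Tax values are missing, so Tax is dropped while ITEM_NAME and Price are kept.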
For further details on dropping columns with the dplyr package, refer to the dplyr documentation.