The select() function from the dplyr package is used to select variables (columns) of a data frame in R. Columns can be selected by name or by position, or with helper functions that test the column names: starts_with(), ends_with(), contains() and matches(). This tutorial also shows how to select columns with a regular expression and how to select columns based on their missing values, with an example for each.
- Select column by column name in R dplyr.
- Select column by column position in dplyr.
- Select column which contains a value or matches a pattern.
- Select column which starts with or ends with a certain character.
- Select column name with regular expression using the grepl() function.
- Select columns without missing values.
We will use the built-in mtcars data set to demonstrate the select() function.
Select column by column name and Position
select() Function in dplyr: Select Column by Name
The select() function selects columns by name: pass the data frame as the first argument, followed by the names of the columns to keep.
library(dplyr)
mydata <- mtcars
# Select columns of the dataframe
select(mydata, mpg, cyl, wt)
The code above selects the mpg, cyl and wt columns.
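As a small check of selection by name, the sketch below stores the result and inspects its column names. It assumes the built-in mtcars data set; the variable names by_name and by_name_piped are only illustrative. The second call uses the %>% pipe that dplyr exports, which is a common way to write the same selection.

```r
library(dplyr)

mydata <- mtcars

# Keep three columns by name; the result is still a data frame
by_name <- select(mydata, mpg, cyl, wt)
names(by_name)

# The same selection written with the pipe operator
by_name_piped <- mydata %>% select(mpg, cyl, wt)
identical(by_name, by_name_piped)
```

Both forms return the same data frame, so the choice between them is a matter of style.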
Select Column by Position:
The select() function can also select columns by position: pass the data frame followed by the column positions.
Select the 3rd and 4th columns of the dataframe:
library(dplyr)
mydata <- mtcars
# Select 3rd and 4th columns of the dataframe
select(mydata, 3:4)
The code above selects the 3rd column (disp) and the 4th column (hp).
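Selecting by position and selecting by name are two views of the same operation. The sketch below compares them, assuming the standard column order of mtcars, where disp and hp are the 3rd and 4th columns; the variable names are illustrative.

```r
library(dplyr)

mydata <- mtcars

# Positions 3 and 4 correspond to disp and hp in mtcars
by_position <- select(mydata, 3:4)
names(by_position)

# The same two columns selected by name
by_name <- select(mydata, disp, hp)

# A name range works like a position range
by_range <- select(mydata, disp:hp)

identical(by_position, by_name)
identical(by_position, by_range)
```

Names are usually the safer choice, since positions silently point at different columns when the column order of the data changes.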
Select Column with conditions and pattern matching in R dplyr
starts_with() function:
Select the columns whose names start with "mpg":
library(dplyr)
mydata <- mtcars
# Select the columns whose names start with "mpg"
select(mydata, starts_with("mpg"))
Select the columns whose names do not start with "mpg":
library(dplyr)
mydata <- mtcars
# Drop the columns whose names start with "mpg"
select(mydata, -starts_with("mpg"))
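As a quick check of the two calls above, the sketch below inspects what comes back. It assumes the built-in mtcars data set with its usual 11 columns; the prefix "d" is only an illustrative choice and is not part of the original examples.

```r
library(dplyr)

mydata <- mtcars

# Only mpg starts with "mpg", so a single column is returned
mpg_cols <- names(select(mydata, starts_with("mpg")))
mpg_cols

# The minus sign drops the matching column and keeps the other ten
n_without_mpg <- ncol(select(mydata, -starts_with("mpg")))
n_without_mpg

# A shorter prefix can match several columns (here disp and drat)
d_cols <- names(select(mydata, starts_with("d")))
d_cols
```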
ends_with() function:
Select the columns whose names end with "cyl":
library(dplyr)
mydata <- mtcars
# Select the columns whose names end with "cyl"
select(mydata, ends_with("cyl"))
contains() function:
Select the columns whose names contain “s”:
library(dplyr)
mydata <- mtcars
# Select the columns whose names contain "s"
select(mydata, contains("s"))
matches() function:
Select the columns whose names match the regular expression “di”:
library(dplyr)
mydata <- mtcars
# Select the columns whose names match the pattern "di"
select(mydata, matches("di"))
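The difference between contains() and matches() is easiest to see with a pattern that has a special meaning in regular expressions. The sketch below uses a dot for that purpose, assuming the built-in mtcars data set, in which no column name contains a literal dot; the variable names are illustrative.

```r
library(dplyr)

mydata <- mtcars

# contains() looks for the literal substring "s"
with_s <- names(select(mydata, contains("s")))
with_s

# contains(".") looks for a literal dot; no mtcars column name has one
literal_dot <- select(mydata, contains("."))
ncol(literal_dot)

# matches(".") treats the dot as "any character", so every column matches
regex_dot <- select(mydata, matches("."))
ncol(regex_dot)
```

In short, contains() is the safer helper for fixed text, while matches() is meant for patterns.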
everything() function:
Select everything, that is, all columns of the dataframe:
library(dplyr)
mydata <- mtcars
# Select everything
select(mydata, everything())
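On its own, everything() simply returns the data frame unchanged. A common use, sketched below under the assumption of the built-in mtcars data set, is to move one or more columns to the front and let everything() fill in the rest; the choice of wt is only for illustration.

```r
library(dplyr)

mydata <- mtcars

# Put wt first, then all remaining columns in their original order
wt_first <- select(mydata, wt, everything())
names(wt_first)

# No column is lost or duplicated by the reordering
ncol(wt_first) == ncol(mydata)
```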
Select Column Names Using Regular Expression:
Columns whose names match a regular expression can also be selected in base R with the grepl() function. grepl() takes the regular expression and the column names as arguments and returns a logical vector that is TRUE for each matching name; indexing the data frame with that vector returns the matched columns, as shown below.
mydata = mtcars
# select the column names using Regular Expression
mydata1 = mydata[, grepl("^c", names(mydata))]
mydata1
The regular expression "^c" matches names that start with “c”, so the code above returns the cyl and carb columns.
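The same selection can be written with dplyr's matches() helper, and the base R version has one pitfall worth knowing. The sketch below shows both, assuming the built-in mtcars data set; the pattern "^m" is an illustrative choice that happens to match only one column.

```r
library(dplyr)

mydata <- mtcars

# dplyr counterpart of the grepl() approach: columns starting with "c"
c_cols <- select(mydata, matches("^c"))
names(c_cols)

# Only mpg starts with "m", so plain [ , ] indexing returns a bare vector
m_vector <- mydata[, grepl("^m", names(mydata))]
class(m_vector)

# drop = FALSE keeps the single match as a one-column data frame
m_frame <- mydata[, grepl("^m", names(mydata)), drop = FALSE]
class(m_frame)
```

select() always returns a data frame, so the drop = FALSE precaution is only needed for the base R form.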
Select columns without missing values:
To show how to select columns without missing values, let's first create the dataframe shown below.
my_basket = data.frame(
  ITEM_GROUP = c("Fruit","Fruit","Fruit","Fruit","Fruit","Vegetable","Vegetable","Vegetable","Vegetable","Dairy","Dairy","Dairy","Dairy","Dairy"),
  ITEM_NAME = c("Apple","Banana","Orange","Mango","Papaya","Carrot","Potato","Brinjal","Raddish","Milk","Curd","Cheese","Milk","Paneer"),
  Price = c(100,80,80,90,65,70,60,70,25,60,40,35,50,60),
  Tax = c(2,4,5,NA,2,3,NA,1,NA,4,5,NA,4,NA))
my_basket
The resulting dataframe has 14 rows and 4 columns; the Tax column contains 5 missing values.
The sapply() function is an alternative to a for loop: it applies a built-in or user-defined function to each column of a data frame. sapply(df, function(x) mean(is.na(x))) returns the proportion of missing values (a number between 0 and 1) in each column of a dataframe.
###### select columns with at most 30% missing values
my_basket = my_basket[, !(sapply(my_basket, function(x) mean(is.na(x))) > 0.3)]
my_basket
The code above drops the “Tax” column because 5 of its 14 values (about 36%) are missing, which exceeds the 30% threshold we set. The final dataframe therefore contains ITEM_GROUP, ITEM_NAME and Price, that is, only the columns whose share of missing values is at or below the threshold.
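The same idea can be expressed inside select() with the where() helper, which keeps the columns for which a function returns TRUE. The sketch below is one possible version, not the method used above: it assumes dplyr 1.0.0 or later (where where() is available) and uses a small hypothetical data frame named basket instead of my_basket so that it runs on its own.

```r
library(dplyr)

# Small hypothetical data frame; Tax has 2 missing values out of 4
basket <- data.frame(
  ITEM_NAME = c("Apple", "Carrot", "Milk", "Curd"),
  Price = c(100, 70, 60, 40),
  Tax = c(2, NA, NA, 5)
)

# Share of missing values per column
na_share <- colMeans(is.na(basket))
na_share

# Keep columns with at most 30% missing values
kept <- select(basket, where(function(x) mean(is.na(x)) <= 0.3))
names(kept)

# Stricter variant: keep only columns with no missing values at all
complete_only <- select(basket, where(function(x) !anyNA(x)))
names(complete_only)
```

With this data both variants keep the same columns; they differ only when a column has some missing values but stays under the threshold.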
For more on selecting columns with the dplyr package, refer to the dplyr documentation.