Remove Duplicate Rows in R using dplyr – distinct() function

The distinct() function from the dplyr package removes duplicate rows from a data frame, either across all columns or based on one or more selected variables. Base R offers two alternatives. duplicated() flags every row that repeats an earlier one, so the flagged rows can be filtered out. unique() returns only the unique rows of a data frame, or the unique values of a vector.

We will look at examples of how to:

Get distinct rows of the dataframe in R using the distinct() function
Remove duplicate rows based on one or more variables/columns in R
Drop duplicates of the dataframe using the duplicated() function in R
Get unique rows (remove duplicate rows) of the dataframe in R using the unique() function


Create Dataframe

We will use the following dataframe to demonstrate the functions above. Let's first create the dataframe.

# Create a simple data frame

mydata <- data.frame(NAME = c('Alisa','Bobby','jodha','jack','raghu','Cathrine',
                              'Alisa','Bobby','kumar','Alisa','jack','Cathrine'),
                     Age = c(26,24,26,22,23,24,26,24,22,26,22,25),
                     Score = c(85,63,55,74,31,77,85,63,42,85,74,78))

mydata

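The resulting data frame has 12 rows and 3 columns (NAME, Age and Score). Several rows are exact duplicates; for example, the row for Alisa (26, 85) appears three times.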

distinct() Function in dplyr – Remove duplicate rows of a dataframe in R:

library(dplyr)

# Remove duplicate rows of the dataframe
distinct(mydata)

# Equivalent, using the pipe
mydata %>% distinct()

All duplicate rows are eliminated, so the result contains only the 8 unique rows of mydata.
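To check how many rows distinct() actually removed, the row counts before and after can be compared. The following is a minimal sketch that recreates the example data so it runs on its own; the names deduped and n_removed are illustrative and not part of dplyr.

```r
library(dplyr)

# Recreate the example data so the snippet is self-contained
mydata <- data.frame(
  NAME  = c("Alisa", "Bobby", "jodha", "jack", "raghu", "Cathrine",
            "Alisa", "Bobby", "kumar", "Alisa", "jack", "Cathrine"),
  Age   = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
  Score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)

# Keep only the unique rows, comparing all columns
deduped <- distinct(mydata)

# Number of rows dropped as exact duplicates (illustrative variable name)
n_removed <- nrow(mydata) - nrow(deduped)
n_removed                 # 4

# anyDuplicated() returns 0 when no duplicate rows remain
anyDuplicated(deduped)    # 0
```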

Remove Duplicate Rows based on a variable

We will be removing duplicate rows using a particular variable.

library(dplyr)

# Remove duplicate rows of the dataframe using NAME variable
distinct(mydata, NAME, .keep_all = TRUE)

# Equivalent, using the pipe
mydata %>% distinct(NAME, .keep_all = TRUE)

The .keep_all = TRUE argument retains all other variables in the output data frame; for each NAME, the first matching row is kept. The output has 7 rows, one per distinct NAME.
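The effect of .keep_all is easiest to see by running distinct() with and without it. This is a small sketch on the same example data; names_only and full_rows are illustrative variable names.

```r
library(dplyr)

mydata <- data.frame(
  NAME  = c("Alisa", "Bobby", "jodha", "jack", "raghu", "Cathrine",
            "Alisa", "Bobby", "kumar", "Alisa", "jack", "Cathrine"),
  Age   = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
  Score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)

# Default (.keep_all = FALSE): only the NAME column is returned
names_only <- distinct(mydata, NAME)
names(names_only)   # "NAME"

# .keep_all = TRUE: all columns are returned, taken from the
# first row in which each NAME appears
full_rows <- distinct(mydata, NAME, .keep_all = TRUE)
names(full_rows)    # "NAME" "Age" "Score"
```

Because the first occurrence wins, Cathrine is kept with Age 24 and Score 77; her later row (25, 78) is dropped.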

Remove Duplicate Rows based on multiple variables

The example below removes duplicate rows based on multiple variables.

library(dplyr)

# Remove duplicate rows of the dataframe using NAME and Age variables
distinct(mydata, NAME, Age, .keep_all = TRUE)

# Equivalent, using the pipe
mydata %>% distinct(NAME, Age, .keep_all = TRUE)

The .keep_all = TRUE argument again retains all other variables in the output data frame. The result has 8 rows rather than 7, because Cathrine appears with two different ages (24 and 25) and both combinations are kept.
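Before dropping rows, it can help to see which NAME and Age combinations occur more than once. One possible way, sketched here with dplyr's count() and filter(), is shown below; dup_keys is an illustrative name.

```r
library(dplyr)

mydata <- data.frame(
  NAME  = c("Alisa", "Bobby", "jodha", "jack", "raghu", "Cathrine",
            "Alisa", "Bobby", "kumar", "Alisa", "jack", "Cathrine"),
  Age   = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
  Score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)

# Count rows per NAME/Age combination and keep the repeated ones
dup_keys <- mydata %>%
  count(NAME, Age) %>%
  filter(n > 1)
dup_keys

# Rows that distinct(NAME, Age, .keep_all = TRUE) would drop:
# every occurrence after the first within each repeated combination
sum(dup_keys$n - 1)   # 4
```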

Drop duplicates in R using unique() function

The base R unique() function can be applied directly to the data frame above:

## Apply unique function for data frame in R
unique(mydata)

Duplicate rows are eliminated, leaving the same 8 unique rows that distinct(mydata) returns.

Unique rows of the dataframe, keeping last occurrences

With the argument fromLast = TRUE, unique() scans the rows from the end, so when a row is duplicated its last occurrence is kept instead of its first.

## unique rows of the dataframe in R, keeping last occurrences

unique(mydata, fromLast = TRUE)

The result contains the same 8 distinct rows, but each duplicated row is represented by its last occurrence (for example, row 10 rather than row 1 for Alisa).
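The difference between the two settings shows up in the row names that survive. A short base R sketch on the same example data follows; first_kept and last_kept are illustrative names.

```r
mydata <- data.frame(
  NAME  = c("Alisa", "Bobby", "jodha", "jack", "raghu", "Cathrine",
            "Alisa", "Bobby", "kumar", "Alisa", "jack", "Cathrine"),
  Age   = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
  Score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)

# Default: the first occurrence of each duplicated row is kept
first_kept <- unique(mydata)
rownames(first_kept)   # "1" "2" "3" "4" "5" "6" "9" "12"

# fromLast = TRUE: the last occurrence of each duplicated row is kept
last_kept <- unique(mydata, fromLast = TRUE)
rownames(last_kept)    # "3" "5" "6" "8" "9" "10" "11" "12"
```

Both results hold the same set of values; only the surviving row positions differ, which matters when row order or row names carry meaning.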

Unique values of a column in the dataframe

When given a single column, unique() returns the distinct values of that column, in order of first appearance, as shown below

## unique value of the column in R dataframe

unique(mydata$NAME)

The NAME column has 7 unique values: Alisa, Bobby, jodha, jack, raghu, Cathrine and kumar.
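When only the number of unique values is needed, the result of unique() can be measured with length(), or dplyr's n_distinct() can be used. A brief sketch on the same NAME values:

```r
library(dplyr)

# The NAME column from the example data
name_values <- c("Alisa", "Bobby", "jodha", "jack", "raghu", "Cathrine",
                 "Alisa", "Bobby", "kumar", "Alisa", "jack", "Cathrine")

# Base R: count the distinct values
length(unique(name_values))   # 7

# dplyr equivalent
n_distinct(name_values)       # 7
```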

Remove Duplicates based on a column using duplicated() function

duplicated() returns a logical vector that is TRUE for every value that has already appeared earlier in the column. Negating it with ! and using the result as a row index keeps only the first row for each NAME, as shown below

## keep the first row for each distinct NAME
mydata[!duplicated(mydata$NAME), ]

The result has 7 rows, one per distinct NAME, with the same values as the output of distinct(mydata, NAME, .keep_all = TRUE).
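Looking at the logical vector itself makes the indexing step clearer. The sketch below uses base R only; dup_flags is an illustrative name.

```r
mydata <- data.frame(
  NAME  = c("Alisa", "Bobby", "jodha", "jack", "raghu", "Cathrine",
            "Alisa", "Bobby", "kumar", "Alisa", "jack", "Cathrine"),
  Age   = c(26, 24, 26, 22, 23, 24, 26, 24, 22, 26, 22, 25),
  Score = c(85, 63, 55, 74, 31, 77, 85, 63, 42, 85, 74, 78)
)

# TRUE marks a NAME that has already been seen in an earlier row
dup_flags <- duplicated(mydata$NAME)
dup_flags
# FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

# Number of rows that will be dropped
sum(dup_flags)   # 5

# Keep the rows that are NOT flagged as duplicates
first_per_name <- mydata[!dup_flags, ]
nrow(first_per_name)   # 7
```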

For further details on dropping duplicate rows in R with dplyr, refer to the dplyr documentation.

Author

Sridhar Venkatachalam

With close to 10 years of experience in data science and machine learning, has worked extensively with programming languages such as R, Python (Pandas), SAS and PySpark.