Get, Keep or Check duplicate rows in pyspark

In order to get duplicate rows in pyspark we use a roundabout method: first we do a groupby count of all the columns, and then we filter the rows with count greater than 1. Thereby we keep or get the duplicate rows in pyspark. We can also assign a flag which indicates the duplicate records, which is nothing but flagging the duplicate rows, and thereby check whether a duplicate row is present.

  • Get Duplicate rows in pyspark using groupby count function – Keep or extract duplicate records.
  • Flag or check the duplicate rows in pyspark – check whether a row is a duplicate row or not.


We will be using the dataframe df_basket1

[df_basket1 dataframe]

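The original post shows df_basket1 only as a screenshot, so here is a minimal sketch of a comparable dataframe; the sample rows (Apple, Banana, Carrot, Onion and their prices) are assumptions chosen so that two rows repeat:

### create df_basket1 (sample values are assumed for illustration)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("duplicate_rows").getOrCreate()

data = [("Fruits", "Apple", 100),
        ("Fruits", "Apple", 100),        # duplicate row
        ("Fruits", "Banana", 80),
        ("Vegetables", "Carrot", 60),
        ("Vegetables", "Carrot", 60),    # duplicate row
        ("Vegetables", "Onion", 50)]

df_basket1 = spark.createDataFrame(data, ["Item_group", "Item_name", "price"])
df_basket1.show()
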
Get or Keep Duplicate rows in pyspark:

In order to keep only the duplicate rows in pyspark we will be using the groupBy() function along with the count() function.

### Get Duplicate rows in pyspark

df1 = df_basket1.groupBy("Item_group", "Item_name", "price").count().filter("count > 1")
df1.drop('count').show()

  • First we do a groupby count of all the columns, i.e. “Item_group”, “Item_name”, “price”.
  • Secondly, we filter the rows with count greater than 1.

So the resultant duplicate rows are

[Output: duplicate rows of df_basket1]

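With the sample data sketched above, the repeated rows (the Apple row and the Carrot row) are the ones that would be extracted here.
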
Flag or Check Duplicate rows in pyspark:

In order to check whether a row is a duplicate or not, we will be generating the flag “Duplicate_indicator”, where 1 indicates the row is a duplicate and 0 indicates it is not. This is accomplished by grouping the dataframe by all the columns and taking the count; if the count is more than 1 the flag is assigned as 1, else 0, as shown below.

### flag or check Duplicate rows in pyspark

import pyspark.sql.functions as f

df_basket1.join(
    # group by all columns and compute a 0/1 indicator per distinct row
    df_basket1.groupBy(df_basket1.columns)
              .agg((f.count("*") > 1).cast("int").alias("Duplicate_indicator")),
    on=df_basket1.columns,   # join the indicator back onto every original row
    how="inner"
).show()
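
As a side note, the same flag can be computed without the self-join by counting each full row's occurrences over a window partitioned by all columns; this variant is a sketch and not part of the original post:

### alternative sketch: flag duplicates with a window function (not from the original post)
from pyspark.sql import Window
import pyspark.sql.functions as f

w = Window.partitionBy(df_basket1.columns)
df_basket1.withColumn(
    "Duplicate_indicator",
    (f.count("*").over(w) > 1).cast("int")   # 1 if the full row occurs more than once, else 0
).show()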

  • So the resultant dataframe, with duplicate rows flagged as 1 and non-duplicate rows flagged as 0, will be

[Output: df_basket1 with the Duplicate_indicator column]

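Again with the sketched sample data, the Apple and Carrot rows would carry Duplicate_indicator = 1 and the remaining rows 0.
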

Author

  • Sridhar Venkatachalam

    With close to 10 years of experience in data science and machine learning, he has extensively worked on programming languages like R, Python (Pandas), SAS, and Pyspark.
