Set difference in PySpark returns the rows that are present in one dataframe but not in the other. It is performed with the subtract() function.
We will see an example of each of the following:
- Set difference of two dataframes in PySpark
- Set difference of a single column across two dataframes in PySpark
We will be using two dataframes, df_summerfruits and df_fruits.
Difference of two dataframes in PySpark – set difference
Syntax:
df1.subtract(df2)
df1 – dataframe1
df2 – dataframe2
dataframe1.subtract(dataframe2) returns the rows that are present in dataframe1 but not in dataframe2. The result contains distinct rows only, so it is equivalent to EXCEPT DISTINCT in SQL.
########## Difference of two dataframes in pyspark
df_summerfruits.subtract(df_fruits).show()
The set difference of the two dataframes is calculated: the result contains the rows of df_summerfruits that do not appear in df_fruits.
Difference of a column in two dataframes in PySpark – set difference of a column
We will use the subtract() function along with select() to get the difference of a single column between the two dataframes. The column values that are present in the first dataframe but not in the second dataframe will be returned.
########## Difference of a column in two dataframes in pyspark
df_summerfruits.select('color').subtract(df_fruits.select('color')).show()
The set difference of the “color” column of the two dataframes is calculated: the “color” values that are present in the first dataframe but not in the second dataframe are returned.