To get the distinct values of a column in pyspark, we use the select() function together with the distinct() function. Another way to get the distinct values of a column is the dropDuplicates() function. dropDuplicates() can also return the distinct combinations of multiple columns, and both functions can return the distinct rows across all the columns. Let's see an example of each:
- Distinct value of a column in pyspark using distinct() function
- Distinct value of the column in pyspark using dropDuplicates() function
- Unique/Distinct value of multiple columns in pyspark using distinct() function & dropDuplicates() function
- Unique/Distinct value of all the columns using distinct() function & dropDuplicates() function
We will be using the dataframe df_basket.
Get distinct value of a column in pyspark – distinct() – Method 1
The distinct values of a column are obtained by using the select() function along with the distinct() function. select() takes the column name as its argument, and the distinct() call that follows returns the distinct values of that column.
### Get distinct value of column
df_basket.select("Price").distinct().show()
This displays the distinct values of the “Price” column.
Get distinct value of a column – dropDuplicates() – Method 2
The dropDuplicates() function takes a list of column names as its argument and keeps one row for each distinct value of that column. Selecting the column afterwards gives its distinct values.
### Drop duplicates of the column
df_basket.dropDuplicates(["Price"]).select("Price").show()
This displays the distinct values of the “Price” column.
Distinct Value of multiple columns in pyspark: Method 1
The distinct values of multiple columns in pyspark are obtained by using the select() function along with the distinct() function. select() takes multiple column names as arguments, and the distinct() call that follows returns the distinct combinations of values of those columns.
### Get distinct value of multiple columns
df_basket.select("Item_group","Price").distinct().show()
This displays the distinct combinations of the “Item_group” & “Price” columns.
Get distinct value of multiple columns in pyspark – Method 2
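The dropDuplicates() function takes a list of multiple column names as its argument and keeps one row for each distinct combination of those columns. Selecting the same columns afterwards gives their distinct values.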
### Drop duplicates on multiple columns
df_basket.dropDuplicates(["Price","Item_group"]).select("Item_group","Price").show()
This displays the distinct combinations of the “Item_group” & “Price” columns.
Distinct value of all the columns in pyspark using distinct() function: Method 1
The distinct() function, called without select(), returns the distinct rows of the dataframe, i.e. the distinct combinations of values across all the columns.
### distinct of all the columns
df_basket.distinct().show()
This displays the distinct rows across all the columns.
Distinct value of all the columns using dropDuplicates() function: Method 2
dropDuplicates() function without any arguments gets the distinct value of all the columns as shown below.
### Drop duplicates of all the columns
df_basket.dropDuplicates().show()
This displays the distinct rows across all the columns.