To calculate the quantile rank, decile rank, and n-tile rank of a column in pyspark we use the ntile() window function. Passing 4 to ntile() calculates the quantile rank of the column; passing 10 calculates the decile rank. Let's see an example of each.
- Quantile Rank of the column in pyspark
- Quantile rank of the column by group in pyspark
- Decile Rank of the column in pyspark using ntile() function
- Decile rank of the column by group in pyspark
- N tile rank of the column in pyspark
We will be using the dataframe df_basket1
Quantile Rank of the column in pyspark
The quantile rank of the "Price" column is calculated by passing 4 to the ntile() function, with an empty partitionBy() (the whole dataframe is one partition) and orderBy() on the "Price" column.
```python
### Quantile Rank in pyspark
from pyspark.sql.window import Window
import pyspark.sql.functions as F

df_basket1 = df_basket1.select(
    "Item_group", "Item_name", "Price",
    F.ntile(4).over(Window.partitionBy().orderBy(df_basket1['Price'])).alias("quantile_rank"))
df_basket1.show()
```
The result has a new quantile_rank column with values from 1 to 4.
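Under the hood, ntile(4) orders the rows, splits them into 4 buckets of as-equal-as-possible size, and gives any leftover rows to the earliest buckets. A minimal pure-Python sketch of that logic (the `ntile` helper below is illustrative, not part of the pyspark API):

```python
def ntile(values, k):
    """Mimic SQL ntile(k): order the values, split them into k buckets,
    and give the first len(values) % k buckets one extra row each."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), k)
    result, idx = [], 0
    for bucket in range(1, k + 1):
        size = base + (1 if bucket <= extra else 0)
        for _ in range(size):
            result.append((ordered[idx], bucket))
            idx += 1
    return result

prices = [10, 20, 30, 40, 50, 60, 70, 80]
print(ntile(prices, 4))
# 8 rows / 4 buckets -> two rows per quantile:
# [(10, 1), (20, 1), (30, 2), (40, 2), (50, 3), (60, 3), (70, 4), (80, 4)]
```

Note that when the row count is not divisible by the bucket count, the earlier buckets end up one row larger, which matches what Spark's ntile() produces.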
Quantile Rank of the column by group in pyspark
The quantile rank of the column by group is calculated by passing 4 to the ntile() function, with partitionBy() on the "Item_group" column and orderBy() on the "Price" column.
```python
### Quantile Rank of the column by group in pyspark
from pyspark.sql.window import Window
import pyspark.sql.functions as F

df_basket1 = df_basket1.select(
    "Item_group", "Item_name", "Price",
    F.ntile(4).over(Window.partitionBy(df_basket1['Item_group']).orderBy(df_basket1['Price'])).alias("quantile_rank"))
df_basket1.show()
```
The result has a quantile_rank column computed separately within each Item_group.
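Adding partitionBy() simply restarts the bucketing inside each group: rows are ordered and split into buckets per partition rather than over the whole dataframe. A pure-Python sketch of that per-group behavior (the `ntile_by_group` helper and the sample rows are illustrative, not pyspark API):

```python
from collections import defaultdict

def ntile_by_group(rows, k):
    """Mimic ntile(k) OVER (PARTITION BY group ORDER BY value):
    compute the bucket separately inside each group."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    out = []
    for group, values in groups.items():
        values.sort()
        base, extra = divmod(len(values), k)
        idx = 0
        for bucket in range(1, k + 1):
            size = base + (1 if bucket <= extra else 0)
            for _ in range(size):
                out.append((group, values[idx], bucket))
                idx += 1
    return out

rows = [("Fruit", 10), ("Fruit", 30), ("Fruit", 20), ("Fruit", 40),
        ("Vegetable", 5), ("Vegetable", 15)]
print(ntile_by_group(rows, 4))
```

One detail worth noticing: a group with fewer rows than buckets (here "Vegetable" with 2 rows and 4 buckets) only uses buckets 1 and 2; the later buckets stay empty, which is also how Spark behaves.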
Decile Rank of the column in pyspark
The decile rank of the "Price" column is calculated by passing 10 to the ntile() function, with an empty partitionBy() and orderBy() on the "Price" column.
```python
### Decile Rank of the column in pyspark
from pyspark.sql.window import Window
import pyspark.sql.functions as F

df_basket1 = df_basket1.select(
    "Item_group", "Item_name", "Price",
    F.ntile(10).over(Window.partitionBy().orderBy(df_basket1['Price'])).alias("decile_rank"))
df_basket1.show()
```
The result has a new decile_rank column with values from 1 to 10.
Decile Rank of the column by group in pyspark
The decile rank of the column by group is calculated by passing 10 to the ntile() function, with partitionBy() on the "Item_group" column and orderBy() on the "Price" column.
```python
### Decile Rank of the column by group in pyspark
from pyspark.sql.window import Window
import pyspark.sql.functions as F

df_basket1 = df_basket1.select(
    "Item_group", "Item_name", "Price",
    F.ntile(10).over(Window.partitionBy(df_basket1['Item_group']).orderBy(df_basket1['Price'])).alias("decile_rank"))
df_basket1.show()
```
The result has a decile_rank column computed separately within each Item_group.
NOTE: N-tile rank of the column in pyspark – ntile() accepts any positive integer n and splits the ordered rows into n buckets, so the same pattern computes any n-tile rank (for example, ntile(100) gives percentile buckets).
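For an arbitrary n, the only thing that changes is how the rows are divided: each bucket holds rows // n values, and the first rows % n buckets hold one extra. A small sketch of that size calculation (the `bucket_sizes` helper is illustrative, not pyspark API):

```python
def bucket_sizes(n_rows, n_buckets):
    """Row count of each ntile bucket: n_rows // n_buckets per bucket,
    with the first n_rows % n_buckets buckets one row larger."""
    base, extra = divmod(n_rows, n_buckets)
    return [base + (1 if b < extra else 0) for b in range(n_buckets)]

print(bucket_sizes(10, 4))   # -> [3, 3, 2, 2]
print(bucket_sizes(8, 4))    # -> [2, 2, 2, 2]
```

This is why, for a dataframe whose row count is not a multiple of n, the lower-ranked buckets contain slightly more rows than the higher-ranked ones.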