This article shows two ways to calculate the mean of two or more columns in pyspark. The first method adds the columns with the + operator inside select() and divides the sum by the number of columns, which returns the mean as a new single-column dataframe. The second method performs the same calculation with withColumn(), so the mean is appended to the existing dataframe as a new column. Let's see an example of each.
- Mean of two or more columns in pyspark using + and select(), dividing by the number of columns
- Mean of multiple columns in pyspark appended to the dataframe, dividing by the number of columns
We will be using the dataframe df_student_detail.
Mean of two or more columns in pyspark: Method 1
- In Method 1 we use the + operator to add the columns and then divide the sum by the number of columns, which gives the mean.
    ### Mean of two or more columns in pyspark
    from pyspark.sql.functions import col, lit
    df1 = df_student_detail.select(((col("mathematics_score") + col("science_score")) / lit(2)).alias("mean"))
    df1.show()
This method computes the mean of the two columns and returns it as a single resultant column named "mean". Because select() is used, the other columns of the dataframe are not carried into the result.
Mean of multiple columns in pyspark and appending to dataframe: Method 2
In Method 2 we again use the + operator and divide the sum by the number of columns to calculate the mean, but this time the result is appended to the dataframe as a new column.
    ### Mean of two or more columns in pyspark
    from pyspark.sql.functions import col
    df1 = df_student_detail.withColumn("mean_of_col", (col("mathematics_score") + col("science_score")) / 2)
    df1.show()
Here we find the mean of the two columns "mathematics_score" and "science_score" and store the result in a new column named "mean_of_col". The resultant dataframe keeps all of its original columns alongside the new one.