Mean of single and multiple columns in PySpark

To calculate the mean of a column in PySpark, use the mean() function from the pyspark.sql.functions module. Either agg() or select() can compute the mean of a single column or of multiple columns. Let's see how to calculate:

  • Mean of a single column in PySpark using the agg() and mean() functions
  • Mean of a single column in PySpark using the select() and mean() functions
  • Mean of multiple columns in PySpark using the mean() and agg() functions
  • Mean of multiple columns in PySpark using the mean() and select() functions

We will use the dataframe named df.

[Image: Calculate Mean of the Column in PySpark 1]

 

Mean of a single column in PySpark: Method 1 using agg() function

To calculate the mean of a single column, use the mean() function together with agg(), as shown below.

### Mean of single column in pyspark

from pyspark.sql import functions as F

# Calculate the mean of the column named 'science_score';
# collect()[0][0] pulls the single value out of the one-row result
df.agg(F.mean('science_score')).collect()[0][0]

The resulting mean of the “science_score” column is:

Output:

[Image: Calculate Mean of the Column in PySpark 2]

 

 

Mean of a single column in PySpark: Method 2 using select() function

To calculate the mean of a single column, you can also use the mean() function together with select(), as shown below.


from pyspark.sql.functions import mean

#calculate mean of column named 'science_score'
df.select(mean("science_score")).show()

The resulting mean of the “science_score” column is:

Output:

[Image: Calculate Mean of the Column in PySpark 3]

 

 

Mean of multiple columns in PySpark: Method 1 using mean() and agg() function

To calculate the mean of multiple columns in PySpark, you can use the agg() function, which allows you to apply aggregate functions like mean() to more than one column at a time.


from pyspark.sql.functions import mean

#calculate mean of column named 'science_score' and 'mathematics_score'
df.agg(mean("science_score"),mean("mathematics_score")).show()

agg(mean("science_score"), mean("mathematics_score")) applies the mean() function to both columns and returns the two means in a single row.

The resulting means of the “science_score” and “mathematics_score” columns are:

Output:

[Image: Calculate Mean of the Column in PySpark 4]

 

 

Mean of multiple columns in PySpark: Method 2 using mean() and select() function

An alternative way to calculate the mean of multiple columns in PySpark is the select() function. If you only need the overall means, with no grouping of rows, you can pass several mean() expressions to select() to compute the mean of more than one column at a time.


from pyspark.sql.functions import mean

#calculate mean of column named 'science_score' and 'mathematics_score'
df.select(mean("science_score"), mean("mathematics_score")).show()

The resulting means of the “science_score” and “mathematics_score” columns are:

Output:

[Image: Calculate Mean of the Column in PySpark 5]


Other Related Topics:

Mean of two or more columns in pyspark

Author

  • Sridhar Venkatachalam

    Close to 10 years of experience in data science and machine learning. Has worked extensively with programming languages such as R, Python (Pandas), SAS, and PySpark.
