Mean of single and multiple columns in PySpark

To calculate the mean of a column in PySpark, use the mean() function from the pyspark.sql.functions module. You can pass it to either agg() or select() to calculate the mean of a single column or of multiple columns. Let's see how to calculate:

Mean of a single column in PySpark using the agg() and mean() functions
Mean of a single column in PySpark using the select() and mean() functions
Mean of multiple columns in PySpark using the agg() and mean() functions
Mean of multiple columns in PySpark using the select() and mean() functions

We will use the dataframe named df.

Mean of a single column in PySpark: Method 1 using agg() function

To calculate the mean of a single column, pass mean() to agg() as shown below. Calling collect()[0][0] pulls the value out of the one-row result.

### Mean of single column in pyspark

from pyspark.sql import functions as F

#calculate mean of column named 'science_score'
df.agg(F.mean('science_score')).collect()[0][0]

The expression returns the mean of the "science_score" column as a single Python value rather than a dataframe.

Mean of a single column in PySpark: Method 2 using select() function

To calculate the mean of a single column, you can also pass mean() to select() as shown below.


from pyspark.sql.functions import mean

#calculate mean of column named 'science_score'
df.select(mean("science_score")).show()

show() prints a one-row table containing the mean of the "science_score" column.

Mean of multiple columns in PySpark: Method 1 using mean() and agg() function

To calculate the mean of multiple columns in PySpark, you can use the agg() function, which allows you to apply aggregate functions like mean() to more than one column at a time.


from pyspark.sql.functions import mean

# calculate the mean of the 'science_score' and 'mathematics_score' columns
df.agg(mean("science_score"), mean("mathematics_score")).show()

agg(mean("science_score"), mean("mathematics_score")): This applies the mean() function to both columns.

show() prints a one-row table with one column for the mean of "science_score" and one for the mean of "mathematics_score".

Mean of multiple columns in PySpark: Method 2 using mean() and select() function

An alternative way to calculate the mean of multiple columns in PySpark is the select() function. If you only need the overall mean, without aggregating over groups, you can pass several mean() expressions to select() to calculate the mean of more than one column at a time.


from pyspark.sql.functions import mean

#calculate mean of column named 'science_score' and 'mathematics_score'
df.select(mean("science_score"), mean("mathematics_score")).show()

show() prints a one-row table with the mean of "science_score" and the mean of "mathematics_score".

Author

Sridhar Venkatachalam

Close to 10 years of experience in data science and machine learning. Has worked extensively with programming languages such as R, Python (Pandas), SAS, and PySpark.