To calculate the mean of a column in PySpark, use the mean() function from the pyspark.sql.functions module. You can pass it to either agg() or select() to calculate the mean of a single column or of multiple columns. Let's see how to calculate:
- Mean of a single column in PySpark using agg() and mean()
- Mean of a single column in PySpark using select() and mean()
- Mean of multiple columns in PySpark using agg() and mean()
- Mean of multiple columns in PySpark using select() and mean()
We will use the dataframe named df.
Mean of a single column in PySpark: Method 1 using agg() function
To calculate the mean of a single column, you can use the mean() function together with the agg() function as shown below:

    # Mean of a single column in PySpark
    from pyspark.sql import functions as F

    # Calculate the mean of the column named 'science_score'
    df.agg(F.mean('science_score')).collect()[0][0]
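agg() returns a one-row dataframe, and collect()[0][0] extracts the single value from it, so the expression evaluates to the mean of the "science_score" column as a plain Python number.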
Mean of a single column in PySpark: Method 2 using select() function
To calculate the mean of a single column, you can also use the mean() function together with the select() function as shown below:

    from pyspark.sql.functions import mean

    # Calculate the mean of the column named 'science_score'
    df.select(mean("science_score")).show()
Here show() prints a one-row table containing the mean of the "science_score" column.
Mean of multiple columns in PySpark: Method 1 using mean() and agg() function
To calculate the mean of multiple columns in PySpark, you can use the agg() function, which allows you to apply aggregate functions like mean() to more than one column at a time.

    from pyspark.sql.functions import mean

    # Calculate the mean of the columns named 'science_score' and 'mathematics_score'
    df.agg(mean("science_score"), mean("mathematics_score")).show()

agg(mean("science_score"), mean("mathematics_score")): this applies the mean() function to both columns.
The result is a one-row table with one column for the mean of "science_score" and one for the mean of "mathematics_score".
Mean of multiple columns in PySpark: Method 2 using mean() and select() function
An alternative way to calculate the mean of multiple columns in PySpark is the select() function. If you only need the overall mean, without aggregating over groups, you can pass several mean() expressions to select() to calculate the mean of more than one column at a time.

    from pyspark.sql.functions import mean

    # Calculate the mean of the columns named 'science_score' and 'mathematics_score'
    df.select(mean("science_score"), mean("mathematics_score")).show()
As with agg(), the result is a one-row table holding the mean of "science_score" and the mean of "mathematics_score".