Sum of column values in pyspark

To calculate the sum of column values in PySpark, you can use the sum() function from the pyspark.sql.functions module, applied through either agg() or select(), for a single column or for multiple columns. Let's see how to calculate:

  • Sum of column values of a single column in pyspark using the agg() and sum() functions
  • Sum of column values of a single column in pyspark using the select() and sum() functions
  • Sum of column values of multiple columns in pyspark using the agg() and sum() functions
  • Sum of column values of multiple columns in pyspark using the select() and sum() functions

We will use the dataframe named df.

(Image: the sample dataframe df)
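
If you want to follow along, here is a minimal sketch of how a dataframe like df could be created; the names and values below are purely illustrative and only assume the two score columns used in this post.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sum_of_column_values").getOrCreate()

# illustrative sample data only; your df will have its own values
data = [("Alex", 60, 70), ("Bob", 75, 80), ("Cathy", 85, 90)]
df = spark.createDataFrame(data, ["name", "science_score", "mathematics_score"])
df.show()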

 

 

Sum of column values of a single column in pyspark: Method 1 using agg() function

To calculate the sum of column values of a single column, you can use the sum() function together with the agg() function as shown below

from pyspark.sql import functions as F

#calculate Sum of column values of column named 'science_score'
df.agg(F.sum('science_score')).collect()[0][0]

So the resultant sum of column values of the “science_score” column will be

(Output image: sum of the science_score column)
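
Because collect()[0][0] returns a plain Python number rather than a dataframe, you can store it in a variable for later use. A small sketch reusing the same df (the variable name is just for illustration):

from pyspark.sql import functions as F

# store the aggregated value as a Python scalar
total_science = df.agg(F.sum('science_score')).collect()[0][0]
print(total_science)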

 

 

Sum of column values of a single column in pyspark: Method 2 using select() function

To calculate the sum of column values of a single column, you can also use the sum() function together with the select() function as shown below

from pyspark.sql.functions import sum

#calculate Sum of column values of column named 'science_score'
df.select(sum("science_score")).show()


So the resultant sum of column values of the “science_score” column will be

(Output image: sum of the science_score column)
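
By default the output column is labelled sum(science_score); if you prefer a friendlier header, you can alias the aggregated column. A small variation on the same select() call (the alias name is just an example):

from pyspark.sql.functions import sum

# give the aggregated column a readable name in the output
df.select(sum("science_score").alias("total_science_score")).show()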

 

 

 

Sum of column values of multiple columns in pyspark: Method 1 using sum() and agg() function

To calculate the sum of column values of multiple columns in PySpark, you can use the agg() function, which allows you to apply aggregate functions like sum() to more than one column at a time.

from pyspark.sql.functions import sum
#calculate Sum of column values of column named 'science_score' and 'mathematics_score'
df.agg(sum("science_score"), sum("mathematics_score")).show()

agg(sum("science_score"), sum("mathematics_score")) applies the sum() function to both columns, returning a single row with one aggregated value per column.

So the resultant sum of column values of the “science_score” and “mathematics_score” columns will be

Output:

(Output image: sums of the science_score and mathematics_score columns)
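
agg() also accepts a dictionary mapping column names to aggregate function names, which is a compact alternative when each column needs only one function. A sketch using the same df:

# dictionary form: {column name: aggregate function name}
df.agg({"science_score": "sum", "mathematics_score": "sum"}).show()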

 

Sum of column values of multiple columns in pyspark: Method 2 using sum() and select() function

An alternative approach to calculating the sum of column values of multiple columns in PySpark is the select() function. If you only need the column sums without aggregation over groups, you can use select() with multiple sum() expressions to aggregate more than one column at a time.

from pyspark.sql.functions import sum

#calculate Sum of column values of column named 'science_score' and 'mathematics_score'
df.select(sum("science_score"), sum("mathematics_score")).show()

So the resultant sum of column values of the “science_score” and “mathematics_score” columns will be

Output:

(Output image: sums of the science_score and mathematics_score columns)
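
If you need the sum of every numeric column without typing each name, you can build the list of sum() expressions from the dataframe's schema. A sketch assuming df's numeric columns are the score columns used above:

from pyspark.sql.functions import sum
from pyspark.sql.types import NumericType

# build one sum() expression per numeric column in the schema
numeric_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
df.select([sum(c) for c in numeric_cols]).show()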

 


Other Related Topics:

 

Sum of two or more columns in pyspark

Author

  • Sridhar Venkatachalam

    With close to 10 years of experience in data science and machine learning, he has worked extensively with programming languages like R, Python (Pandas), SAS, and PySpark.
