Deleting or dropping a column in pyspark is accomplished using the drop() function. drop() with a column name as argument drops that column. Dropping single and multiple columns in pyspark can be done in several ways; we will also look at how to drop columns by position, and by column names that start with, end with, or contain a certain string. Dropping columns which have NA or null values is also depicted with an example. We will see how to
- Drop a single column in pyspark with example
- Drop multiple columns in pyspark with example
- Drop columns whose name contains a specific string in pyspark
- Drop columns whose name starts with a specific string in pyspark
- Drop columns whose name ends with a specific string in pyspark
- Drop a column by column position in pyspark
- Drop columns with NA/NaN and null values in pyspark
We will be using the dataframe df_orders.
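Here is a minimal sketch of what df_orders might look like; the column names are taken from the examples below, while the values and types are assumptions for illustration only.

## a minimal, assumed stand-in for df_orders -- column names come from
## the examples in this post, the values are made-up placeholders
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_orders = spark.createDataFrame(
    [
        (float('nan'), 1001, 101, '2020-01-05', None),
        (float('nan'), 1002, 102, '2020-01-07', None),
    ],
    'order_no double, cust_no int, eno int, received_date string, shipped_date string',
)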
Drop single column in pyspark – Method 1:
Drop a single column in pyspark using the drop() function. drop() with the column name as argument drops that particular column.

## drop single column
df_orders.drop('cust_no').show()

So the resultant dataframe has the “cust_no” column dropped.
Drop single column in pyspark – Method 2:
Drop a single column in pyspark using the drop() function. drop() with df.column_name as argument drops that particular column.

## drop single column
df_orders.drop(df_orders.cust_no).show()

So the resultant dataframe has the “cust_no” column dropped.
Drop multiple columns in pyspark – Method 1:
Drop multiple columns in pyspark using the drop() function. drop() with several column names as arguments drops those columns.

## drop multiple columns
df_orders.drop('cust_no', 'eno').show()

So the resultant dataframe has the “cust_no” and “eno” columns dropped.
Drop multiple columns in pyspark – Method 2:
Drop multiple columns in pyspark using the drop() function. The column names to be dropped are collected in a list named “columns_to_drop”, which is then unpacked into the drop() function.

## drop multiple columns
columns_to_drop = ['cust_no', 'eno']
df_orders.drop(*columns_to_drop).show()

So the resultant dataframe has the “cust_no” and “eno” columns dropped.
Drop multiple columns in pyspark – Method 3:
Drop multiple columns in pyspark using two chained drop() calls; since each drop() returns a new dataframe, the columns are dropped one after another in a single statement as shown below.

## drop multiple columns
df_orders.drop(df_orders.eno).drop(df_orders.cust_no).show()

So the resultant dataframe has the “cust_no” and “eno” columns dropped.
Drop column using position in pyspark:
Dropping columns by position in pyspark is accomplished in a roundabout way: first the required columns (and rows) are extracted using the select() function, and the result is then converted back to a dataframe as shown below.

## drop multiple columns using position
spark.createDataFrame(df_orders.select(df_orders.columns[:2]).take(5)).show()

So the resultant dataframe retains only the first two columns; every other column is dropped (note that take(5) also limits the result to the first five rows).
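A more direct sketch, assuming the columns to be dropped sit at positions 1 and 2 of df_orders.columns, is to slice the column list and pass the resulting names straight to drop():

## drop columns by position -- positions 1 and 2 are assumed here
positions_to_drop = [1, 2]
columns_to_drop = [df_orders.columns[i] for i in positions_to_drop]
df_orders.drop(*columns_to_drop).show()

This avoids the select()/take() round trip and keeps all rows intact.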
Drop column name which starts with a specific string in pyspark:
Dropping multiple columns whose names start with a specific string in pyspark is accomplished in a roundabout way: first the list of column names starting with the specific string is built using the startswith() method, and that list is then passed to the drop() function as shown below.

## drop multiple columns starting with a string
some_list = df_orders.columns
columns_to_drop = [i for i in some_list if i.startswith('cust')]
df_orders.drop(*columns_to_drop).show()

So the columns whose names start with “cust” are dropped, and the resultant dataframe will be
Drop column name which ends with a specific string in pyspark:
Dropping multiple columns whose names end with a specific string in pyspark is accomplished in a roundabout way: first the list of column names ending with the specific string is built using the endswith() method, and that list is then passed to the drop() function as shown below.

## drop multiple columns ending with a string
some_list = df_orders.columns
columns_to_drop = [i for i in some_list if i.endswith('date')]
df_orders.drop(*columns_to_drop).show()

So the columns whose names end with “date” are dropped, and the resultant dataframe will be
Drop column name which contains a specific string in pyspark:
Dropping multiple columns whose names contain a specific string in pyspark is accomplished in a roundabout way: first the list of column names containing the specific string is built using the in operator, and that list is then passed to the drop() function as shown below.

## drop multiple columns containing a string
some_list = df_orders.columns
columns_to_drop = [i for i in some_list if 'ved' in i]
df_orders.drop(*columns_to_drop).show()

So the column whose name contains “ved” is dropped, and the resultant dataframe with “received_date” dropped will be
Drop the columns which have null values in pyspark:
Dropping columns which contain null values in pyspark is accomplished in a roundabout way by creating a user defined function: the names of columns containing null values are collected using the isNull() function, and that list is then passed to the drop() function as shown below.

import pyspark.sql.functions as F

def drop_null_columns(df):
    """
    This function drops all columns which contain null values.
    :param df: A PySpark DataFrame
    """
    ## count the null values in every column
    null_counts = df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
                             for c in df.columns]).collect()[0].asDict()
    ## keep the names of columns where at least one null was counted
    to_drop = [k for k, v in null_counts.items() if v > 0]
    return df.drop(*to_drop)

drop_null_columns(df_orders).show()

So the columns which contain null values are dropped, and the resultant dataframe with the “shipped_date” column dropped will be
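Note that df_orders.na.drop() (equivalently dropna()) removes rows that contain nulls rather than columns, which is why a helper like drop_null_columns() is needed here.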
Drop the columns which have NA/NaN values in pyspark:
Dropping columns which contain NaN/NA values in pyspark is accomplished in a roundabout way by creating a user defined function: the names of columns containing NA/NaN values are collected using the isnan() function, and that list is then passed to the drop() function as shown below. Since isnan() only applies to float and double columns, the check is restricted to numeric columns.

import pyspark.sql.functions as F

def drop_nan_columns(df):
    """
    This function drops all columns which contain NaN values.
    :param df: A PySpark DataFrame
    """
    ## isnan() is only defined for float/double columns
    numeric_cols = [c for c, t in df.dtypes if t in ('float', 'double')]
    ## count the NaN values in every numeric column
    nan_counts = df.select([F.count(F.when(F.isnan(F.col(c)), c)).alias(c)
                            for c in numeric_cols]).collect()[0].asDict()
    to_drop = [k for k, v in nan_counts.items() if v > 0]
    return df.drop(*to_drop)

drop_nan_columns(df_orders).show()

So the columns which contain NaN values are dropped, and the resultant dataframe with the “order_no” column dropped will be
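If both checks are needed in a single pass, the two helpers above can be combined; this is a sketch under the same assumptions (NaN is only tested on float/double columns):

import pyspark.sql.functions as F

def drop_null_or_nan_columns(df):
    """
    Drop every column that contains at least one null or NaN value.
    :param df: A PySpark DataFrame
    """
    numeric_cols = {c for c, t in df.dtypes if t in ('float', 'double')}
    counts = df.select([
        F.count(
            ## test nulls everywhere; additionally test NaN on numeric columns
            F.when(F.col(c).isNull() | F.isnan(F.col(c)), c)
            if c in numeric_cols
            else F.when(F.col(c).isNull(), c)
        ).alias(c)
        for c in df.columns
    ]).collect()[0].asDict()
    return df.drop(*[k for k, v in counts.items() if v > 0])

drop_null_or_nan_columns(df_orders).show()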
For other functions, refer to the cheatsheet.
Other Related Topics:
- Remove leading zero of column in pyspark
- Left and Right pad of column in pyspark – lpad() & rpad()
- Add Leading and Trailing space of column in pyspark – add space
- Remove Leading, Trailing and all space of column in pyspark – strip & trim space
- String split of the columns in pyspark
- Repeat the column in Pyspark
- Get Substring of the column in Pyspark
- Get String length of column in Pyspark
- Typecast string to date and date to string in Pyspark
- Typecast Integer to string and String to integer in Pyspark
- Extract First N and Last N character in pyspark
- Drop rows in pyspark – drop rows with condition
- Distinct value of a column in pyspark
- Distinct value of dataframe in pyspark – drop duplicates
- Count of Missing (NaN,Na) and null values in Pyspark
- Drop column in pyspark – drop single & multiple columns
- Convert to upper case, lower case and title case in pyspark
- Add leading zeros to the column in pyspark
- Concatenate two columns in pyspark