In order to select columns in pyspark we will be using the select() function. The select() function is used to select single and multiple columns in pyspark, to select columns whose names match a pattern, and to select columns by position or by regular expression. We will explain each of these with an example.
- Select a single column in pyspark using the select() function.
- Select multiple columns in pyspark
- Select columns with names like a pattern in pyspark using the select() function
- Select a column in pyspark by column position.
- Select column names using a regular expression in pyspark with the colRegex() function
Syntax:
df.select('colname1', 'colname2', …)
df – dataframe
colname1..n – names of the columns to be selected
We will use the dataframe named df_basket1.
Select single column in pyspark
The select() function with a single column name passed as an argument is used to select that column in pyspark.
df_basket1.select('Price').show()
We use the select() function together with show() to display the selected column. In our case we select the ‘Price’ column as shown above.
Select multiple columns in pyspark
The select() function with a set of column names passed as arguments is used to select that set of columns.
df_basket1.select('Price','Item_name').show()
We use the select() function to select the columns and show() to display them. In our case we select the ‘Price’ and ‘Item_name’ columns as shown above.
Select column by column position in pyspark:
We can use the select() function in order to select a column by position. In the example below the columns are selected using their positions: we select the first column (position 0) and the last column (position 2) by passing the positions as arguments, as shown below.
## select column by position
df_basket1.select(df_basket1.columns[0],df_basket1.columns[2]).show()
The resultant dataframe, with the columns selected by position, will be
Select using a regex with column name like in pyspark (select column name like):
The colRegex() function with a regular expression inside is used to select columns by regular expression. In our example we use a regular expression to capture the columns whose names start with “Item”.
## select using Regex with column name like
df_basket1.select(df_basket1.colRegex("`(Item)+?.+`")).show()
The above code selects the columns whose names are like Item%, so the resultant dataframe will be
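The pattern between the backticks, (Item)+?.+, is an ordinary regular expression. As a quick sketch, Python's re module (a plain-Python approximation, not Spark itself) shows which column names a full match of that pattern would pick out; the column names here are assumed for illustration:

```python
import re

# The pattern passed to colRegex(), without the surrounding backticks.
pattern = re.compile(r"(Item)+?.+")

# Assumed column names for illustration.
columns = ["Item_group", "Item_name", "Price"]

# Keep the columns whose full name matches the pattern.
matched = [c for c in columns if pattern.fullmatch(c)]
print(matched)  # → ['Item_group', 'Item_name']
```

Note that the pattern requires the name to start with “Item” followed by at least one more character, which is why ‘Price’ is excluded.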
Other Related Topics:
- Distinct value of a column in pyspark
- Distinct value of dataframe in pyspark – drop duplicates
- Count of Missing (NaN,Na) and null values in Pyspark
- Mean, Variance and standard deviation of column in Pyspark
- Maximum or Minimum value of column in Pyspark
- Raised to power of column in pyspark – square, cube , square root and cube root in pyspark
- Drop column in pyspark – drop single & multiple columns
- Subset or Filter data with multiple conditions in pyspark
- Frequency table or cross table in pyspark – 2 way cross table
- Groupby functions in pyspark (Aggregate functions) – Groupby count, Groupby sum, Groupby mean, Groupby min and Groupby max
- Descriptive statistics or Summary Statistics of dataframe in pyspark
- Rearrange or reorder column in pyspark
- cumulative sum of column and group in pyspark
- Calculate Percentage and cumulative percentage of column in pyspark
- Select column in Pyspark (Select single & Multiple columns)
- Get data type of column in Pyspark (single & Multiple columns)
- Get List of columns and its data type in Pyspark