To extract the first N and last N characters of a string column in pyspark, we will be using the substr() function. This section shows how to extract the first N characters from the left and the last N characters from the right. Let's see how to
- Extract First N characters in pyspark – First N characters from left
- Extract Last N characters in pyspark – Last N characters from right
- Extract characters from string column of the dataframe in pyspark using substr() function.
with an example for each.
We will be using the dataframe named df_states, which has a string column state_name.
Extract First N character in pyspark – First N character from left
The first N characters of a column in pyspark are obtained using the substr() function, passing 1 as the starting position and N as the length.
########## Extract first N characters from left in pyspark
df = df_states.withColumn("first_n_char", df_states.state_name.substr(1, 6))
df.show()
The first 6 characters from the left are extracted using the substr() function, so the resultant dataframe has them in the new column first_n_char.
Extract Last N characters in pyspark – Last N character from right
The last N characters of a column in pyspark are obtained using the substr() function by passing a negative value as the first argument, as shown below.
########## Extract last N characters from right in pyspark
df = df_states.withColumn("last_n_char", df_states.state_name.substr(-2, 2))
df.show()
The last 2 characters from the right are extracted using the substr() function, so the resultant dataframe has them in the new column last_n_char.
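To check the indexing rules without starting a Spark session, here is a rough plain-Python model of how substr(pos, length) picks characters from a single string. The helper name spark_substr and the sample value "California" are made up for illustration, and the handling of position 0 is my understanding of Spark's behavior rather than something shown in this article, so treat it as a sketch and not as Spark's actual implementation.

```python
def spark_substr(s: str, pos: int, length: int) -> str:
    """Plain-Python model of Column.substr(pos, length) for one string value.

    Hypothetical helper for illustration only; it is not part of pyspark.
    """
    n = len(s)
    if pos > 0:
        start = pos - 1   # positions are 1-based, Python indexes are 0-based
    elif pos < 0:
        start = n + pos   # a negative position counts back from the end
    else:
        start = 0         # assumption: position 0 behaves like position 1
    end = start + length
    if end <= start:
        return ""         # zero or negative length gives an empty string
    # Clamp to the start of the string; slicing already clamps at the end.
    return s[max(start, 0):max(end, 0)]


first_six = spark_substr("California", 1, 6)   # like substr(1, 6)
last_two = spark_substr("California", -2, 2)   # like substr(-2, 2)
print(first_six, last_two)
```

Note that a value shorter than N is returned whole instead of raising an error, so spark_substr("Ohio", 1, 6) gives "Ohio".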
Extract characters from string column in pyspark – substr()
Characters are extracted from a string column in pyspark using the substr() function by passing two values: the first one is the starting position of the substring (counted from 1) and the second one is its length. In our example we extract two substrings and concatenate them with an underscore using the concat() function, as shown below.
########## Extract N characters from string column in pyspark
from pyspark.sql.functions import concat, lit

df_states_new = df_states.withColumn(
    'new_string',
    concat(df_states.state_name.substr(1, 3), lit('_'), df_states.state_name.substr(6, 2)))
df_states_new.show()
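So the resultant dataframe has a new column new_string that holds the first three characters of state_name, an underscore, and the characters at positions 6 and 7.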
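To see what the concatenated value looks like for individual rows, the same arithmetic can be worked through in plain Python with slices. The function name new_string and the three sample state names are assumptions chosen for illustration; the actual contents of df_states are not listed here.

```python
def new_string(state_name: str) -> str:
    """Model of concat(substr(1, 3), lit('_'), substr(6, 2)) for one value."""
    first_part = state_name[0:3]    # substr(1, 3): positions 1 to 3
    second_part = state_name[5:7]   # substr(6, 2): positions 6 and 7
    return first_part + "_" + second_part


# Hypothetical sample values, not taken from the original dataframe.
samples = ["Alabama", "California", "Ohio"]
results = {name: new_string(name) for name in samples}
print(results)
```

For "Ohio" the second part is empty, because the name has no sixth character, so only the prefix and the underscore remain.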