To get a substring of a column in pyspark we use the substr() function. The examples below show how to extract a substring from a column in pyspark.
- Get substring of the column in pyspark using the substr() function.
- Get substring from the end of the column in pyspark using substr().
- Extract characters from string column in pyspark
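Syntax:
df.colname.substr(start, length)
df – dataframe
colname – column name
start – starting position (1-based; a negative value counts from the end of the string)
length – number of characters to extract from the starting position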
We will be using a dataframe named df_states, which has a string column state_name.
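The contents of df_states are not reproduced here, so the following is only a minimal sketch of how a comparable dataframe could be built. The sample rows and the helper name build_df_states are assumptions made for illustration; only the column name state_name comes from the examples in this article.

```python
# Hypothetical sample rows; the real df_states data is not shown in this article.
STATE_ROWS = [("Alabama",), ("Alaska",), ("Arizona",), ("Ohio",)]


def build_df_states(spark):
    """Build a one-column dataframe from the sample rows.

    `spark` is expected to be an existing SparkSession.
    """
    return spark.createDataFrame(STATE_ROWS, ["state_name"])


# Usage, assuming pyspark is installed:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   df_states = build_df_states(spark)
#   df_states.show()
```

Passing the session in as an argument keeps the helper free of any global state, so it can be reused in a notebook or a script.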
Substring from the start of the column in pyspark – substr():
df.colname.substr() returns a substring of the column. Extracting the first 6 characters of the column in pyspark is achieved as follows.
### Get Substring of the column in pyspark
df = df_states.withColumn("substring_statename", df_states.state_name.substr(1, 6))
df.show()
substr(1,6) returns the first 6 characters from column “state_name”
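Because substr() counts positions from 1 while Python slices count from 0, it can help to see the same extraction in plain Python. This is only a stand-in for checking expected values, not how Spark evaluates the expression; the sample names and the helper first_chars are assumptions for illustration.

```python
# Hypothetical sample values; the article's df_states data is not shown.
state_names = ["Alabama", "Alaska", "Arizona"]


def first_chars(value, start, length):
    """Plain-Python stand-in for substr(start, length) with a positive, 1-based start."""
    begin = start - 1  # convert the 1-based position to a 0-based index
    return value[begin:begin + length]


first_six = [first_chars(name, 1, 6) for name in state_names]
print(first_six)  # ['Alabam', 'Alaska', 'Arizon']
```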
Get Substring from end of the column in pyspark
df.colname.substr() returns a substring of the column in pyspark. To get a substring from the end, we specify the first parameter with a minus (-) sign.
### Get Substring from end of the column in pyspark
df = df_states.withColumn("substring_from_end", df_states.state_name.substr(-2, 2))
df.show()
In our example we extract the substring from the end, i.e. the last two characters of the column. The first parameter is specified with a minus (-) sign, followed by the length as the second parameter, so the new column substring_from_end holds the last two characters of state_name.
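The interaction between a negative start and the length is easy to misread, so here is a small plain-Python sketch of it. It assumes the absolute value of start does not exceed the string length, and the helper name last_chars is hypothetical; it only mirrors the arithmetic and is not Spark code.

```python
def last_chars(value, start, length):
    """Plain-Python stand-in for substr(start, length) with a negative start.

    Assumes abs(start) <= len(value).
    """
    begin = len(value) + start  # e.g. start=-2 on a 7-character string gives index 5
    return value[begin:begin + length]


print(last_chars("Alabama", -2, 2))  # ma
print(last_chars("Alabama", -3, 2))  # am
```

The second call shows that the start only fixes where extraction begins: substr(-3, 2) yields two characters starting three from the end, not the last three characters.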
Extract characters from string column in pyspark – substr()
Characters are extracted from a string column in pyspark using the substr() function by passing two values: the first is the starting position and the second is the length of the substring. In our example we extract two substrings and concatenate them using the concat() function, as shown below.
########## Extract N characters from string column in pyspark
from pyspark.sql.functions import concat, lit

df_states_new = df_states.withColumn('new_string', concat(df_states.state_name.substr(1, 3), lit('_'), df_states.state_name.substr(6, 2)))
df_states_new.show()
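The resultant dataframe has a new column new_string holding the first three characters of state_name, an underscore, and the two characters starting at position 6.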
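To check what new_string should contain without running Spark, the same expression can be mimicked in plain Python. This is a rough sketch with hypothetical sample names; the helper new_string_for is an assumption for illustration. It also reflects that concat() returns null when any of its inputs is null, modelled here with None.

```python
def new_string_for(state_name):
    """Plain-Python stand-in for concat(substr(1, 3), lit('_'), substr(6, 2))."""
    if state_name is None:
        return None  # concat() yields null if any input is null
    # substr(1, 3) -> characters 1..3; substr(6, 2) -> characters 6..7 (1-based)
    return state_name[0:3] + "_" + state_name[5:7]


for name in ["Alabama", "Alaska", "Ohio", None]:
    print(name, "->", new_string_for(name))
# Alabama -> Ala_ma
# Alaska -> Ala_a
# Ohio -> Ohi_
# None -> None
```

Names shorter than six characters simply contribute an empty second part, which is why "Ohio" ends in the underscore.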
Other Related Topics:
- Remove leading zero of column in pyspark
- Left and Right pad of column in pyspark –lpad() & rpad()
- Add Leading and Trailing space of column in pyspark – add space
- Remove Leading, Trailing and all space of column in pyspark – strip & trim space
- String split of the columns in pyspark
- Repeat the column in Pyspark
- Get String length of column in Pyspark
- Typecast string to date and date to string in Pyspark
- Typecast Integer to string and String to integer in Pyspark
- Extract First N and Last N character in pyspark
- Convert to upper case, lower case and title case in pyspark
- Add leading zeros to the column in pyspark
- Concatenate two columns in pyspark.