We can merge or join two dataframes in PySpark by using the join() function. The different arguments to join() allow you to perform a left join, right join, full outer join, or inner (natural) join. Each type of join in PySpark (merge) is explained below with an example:
- Inner join in pyspark with example with join() function
- Outer join in pyspark with example
- Left join in pyspark with example
- Right join in pyspark with example
- Left semi join in pyspark with example
- Left anti join in pyspark with example
- Full join in pyspark with example
- Anti join in pyspark with example
Syntax: `df1.join(df2, on, how)`
- df1 – Dataframe1 (the left dataframe).
- df2 – Dataframe2 (the right dataframe).
- on – column name(s) to join on; they must be found in both df1 and df2.
- how – the type of join to perform: 'left', 'right', 'outer', 'inner', 'left_semi', or 'left_anti'. The default is an inner join.
We will be using dataframes df1 and df2:
df1:
df2:
Inner join in pyspark with example
An inner join is the simplest and most common type of join. It is also known as a simple join or natural join. An inner join returns only the rows for which the matching condition is met.
```python
### Inner join in pyspark
df_inner = df1.join(df2, on=['Roll_No'], how='inner')
df_inner.show()
```
The resultant inner join keeps only the rows whose Roll_No appears in both df1 and df2.
Outer join in pyspark with example
An outer join in PySpark combines the results of both the left and right outer joins. The joined dataframe contains all records from both dataframes, with nulls where a match is missing.
```python
### Outer join in pyspark
df_outer = df1.join(df2, on=['Roll_No'], how='outer')
df_outer.show()
```
The resultant outer join contains every Roll_No from both dataframes; columns from the side with no match are filled with null.
Left join in pyspark with example
The left join in PySpark returns all records from the left dataframe (df1) and the matched records from the right dataframe (df2).
```python
### Left join in pyspark
df_left = df1.join(df2, on=['Roll_No'], how='left')
df_left.show()
```
The resultant left join keeps every row of df1; df2's columns are null where no matching Roll_No exists.
Right join in pyspark with example
The right join in PySpark returns all records from the right dataframe (df2) and the matched records from the left dataframe (df1).
```python
### Right join in pyspark
df_right = df1.join(df2, on=['Roll_No'], how='right')
df_right.show()
```
The resultant right join keeps every row of df2; df1's columns are null where no matching Roll_No exists.
Left Anti join in pyspark with example
A left anti join behaves like df1 − df2: it selects all rows from df1 that have no match in df2.
```python
### Left Anti join in pyspark
df_left_anti = df1.join(df2, on=['Roll_No'], how='left_anti')
df_left_anti.show()
```
The resultant left anti join contains only the df1 rows whose Roll_No does not appear in df2, and only df1's columns are returned.
Left Semi join in pyspark with example
A left semi join is like an inner join, except that only the left dataframe's columns and values are returned.
```python
### Left Semi join in pyspark
df_left_semi = df1.join(df2, on=['Roll_No'], how='left_semi')
df_left_semi.show()
```
The resultant left semi join contains the df1 rows whose Roll_No appears in df2, but only df1's columns.
Full join in pyspark with example
A full join in PySpark is the same as the outer join above: it combines the results of both the left and right outer joins, so the joined dataframe contains all records from both dataframes.
```python
### Full join in pyspark
df_full = df1.join(df2, on=['Roll_No'], how='full')
df_full.show()
```
The resultant full join is identical to the outer join above: all rows from both dataframes, with nulls where a match is missing.
Anti join in pyspark with example
An anti join in PySpark returns the rows from the first dataframe for which no match is found in the second dataframe; it behaves the same as the left anti join above.
```python
### Anti join in pyspark
df_anti = df1.join(df2, on=['Roll_No'], how='anti')
df_anti.show()
```
The resultant anti join is identical to the left anti join above: the df1 rows with no matching Roll_No in df2.
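To summarize which rows each value of `how` keeps, here is a plain-Python sketch of the join semantics on a single key column. It models only which keys survive (not the joined columns or null filling), and the sample key sets are hypothetical Roll_No values, not the post's actual data.

```python
# Plain-Python sketch of join semantics on one key column (no Spark needed).
# Each call returns the set of keys present in the result of that join type.

def join_keys(left, right, how):
    """Keys surviving a join of `left` with `right` on a single key column."""
    l, r = set(left), set(right)
    if how == "inner":
        return l & r              # only keys present on both sides
    if how in ("outer", "full"):
        return l | r              # every key from either side
    if how == "left":
        return l                  # all left keys, matched or not
    if how == "right":
        return r                  # all right keys, matched or not
    if how in ("left_semi", "semi"):
        return l & r              # inner match, but only left columns kept
    if how in ("left_anti", "anti"):
        return l - r              # left keys with no match on the right
    raise ValueError(f"unknown join type: {how}")

df1_keys = {1, 2, 3}   # hypothetical Roll_No values in df1
df2_keys = {2, 3, 4}   # hypothetical Roll_No values in df2

print(join_keys(df1_keys, df2_keys, "inner"))      # {2, 3}
print(join_keys(df1_keys, df2_keys, "left_anti"))  # {1}
```

Note that the semi and anti variants keep the same key sets as `inner` and `left` minus `inner` respectively; what distinguishes them in Spark is that only the left dataframe's columns appear in the result.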
Other Related Topics:
- Distinct value of dataframe in pyspark – drop duplicates
- Count of Missing (NaN,Na) and null values in Pyspark
- Mean, Variance and standard deviation of column in Pyspark
- Maximum or Minimum value of column in Pyspark
- Raised to power of column in pyspark – square, cube, square root and cube root in pyspark
- Drop column in pyspark – drop single & multiple columns
- Subset or Filter data with multiple conditions in pyspark
- Frequency table or cross table in pyspark – 2 way cross table
- Groupby functions in pyspark (Aggregate functions) – Groupby count, Groupby sum, Groupby mean, Groupby min and Groupby max
- Descriptive statistics or Summary Statistics of dataframe in pyspark
- Rearrange or reorder column in pyspark
- cumulative sum of column and group in pyspark
- Join in pyspark (Merge) inner, outer, right, left join in pyspark
- Get duplicate rows in pyspark
- Quantile rank, decile rank & n tile rank in pyspark – Rank by Group
- Calculate Percentage and cumulative percentage of column in pyspark
- Select column in Pyspark (Select single & Multiple columns)
- Get data type of column in Pyspark (single & Multiple columns)