Joining pandas DataFrames by Column names
Categories:
Joining pandas DataFrames by Column Names

Learn how to effectively merge or join pandas DataFrames based on common column names, covering various join types and practical examples.
Joining DataFrames is a fundamental operation in data manipulation with pandas. It allows you to combine two or more DataFrames based on common columns, similar to SQL JOIN operations. This article will guide you through the process of joining DataFrames by column names using the merge()
function, exploring different join types and providing clear examples.
Understanding pandas merge()
for Column-Based Joins
The pandas.merge()
function is the primary tool for combining DataFrames based on shared columns or indices. It offers robust functionality to handle various joining scenarios. When joining by column names, you specify the columns to join on using the on
parameter, or pandas will infer common columns if not specified. The how
parameter dictates the type of join (e.g., inner, left, right, outer).
flowchart LR DF1[DataFrame 1] --> Merge(pandas.merge()) DF2[DataFrame 2] --> Merge Merge -->|on='key_column'| Result[Joined DataFrame] Merge -->|how='inner'| Result Merge -->|how='left'| Result Merge -->|how='right'| Result Merge -->|how='outer'| Result
Conceptual flow of pandas.merge() operation
Let's start by creating two sample DataFrames to demonstrate the join operations.
import pandas as pd
# DataFrame 1: Customer information
df1 = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'city': ['New York', 'London', 'Paris', 'Tokyo']
})
# DataFrame 2: Order information
df2 = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105],
'customer_id': [2, 3, 1, 5, 2],
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
print("DataFrame 1 (df1):")
print(df1)
print("\nDataFrame 2 (df2):")
print(df2)
Creating two sample pandas DataFrames
Common Join Types
The how
parameter in merge()
is crucial for defining how rows are matched and included in the result. Here are the most common types:
on
is not specified, merge()
defaults to using the intersection of column names in both DataFrames as join keys. It's generally good practice to explicitly specify on
for clarity and to avoid unexpected behavior.1. Inner Join
An inner join returns only the rows where the join key exists in both DataFrames. It's the default join type if how
is not specified.
# Inner Join: Only common customer_ids are included
inner_join_df = pd.merge(df1, df2, on='customer_id', how='inner')
print("Inner Join Result:")
print(inner_join_df)
Performing an inner join on 'customer_id'
2. Left Join (Left Outer Join)
A left join returns all rows from the left DataFrame (df1
in this case) and the matching rows from the right DataFrame (df2
). If there's no match in the right DataFrame, NaN
(Not a Number) is filled for columns from the right DataFrame.
# Left Join: All rows from df1, matching rows from df2
left_join_df = pd.merge(df1, df2, on='customer_id', how='left')
print("Left Join Result:")
print(left_join_df)
Performing a left join on 'customer_id'
3. Right Join (Right Outer Join)
A right join returns all rows from the right DataFrame (df2
) and the matching rows from the left DataFrame (df1
). If there's no match in the left DataFrame, NaN
is filled for columns from the left DataFrame.
# Right Join: All rows from df2, matching rows from df1
right_join_df = pd.merge(df1, df2, on='customer_id', how='right')
print("Right Join Result:")
print(right_join_df)
Performing a right join on 'customer_id'
4. Outer Join (Full Outer Join)
An outer join returns all rows when there is a match in either the left or right DataFrame. If there's no match, NaN
is filled for the missing values from the non-matching DataFrame.
# Outer Join: All rows from both df1 and df2
outer_join_df = pd.merge(df1, df2, on='customer_id', how='outer')
print("Outer Join Result:")
print(outer_join_df)
Performing an outer join on 'customer_id'
on
parameter, e.g., on=['col1', 'col2']
. If the join columns have different names in each DataFrame, use left_on
and right_on
parameters.