Joining pandas DataFrames by Column names
Categories:
Joining pandas DataFrames by Column Names

Learn how to effectively merge or join pandas DataFrames based on common column names, covering various join types and practical examples.
Joining DataFrames is a fundamental operation in data manipulation with pandas. It allows you to combine two or more DataFrames based on common columns, similar to SQL JOIN operations. This article will guide you through the process of joining DataFrames by column names using the merge() function, exploring different join types and providing clear examples.
Understanding pandas merge() for Column-Based Joins
The pandas.merge() function is the primary tool for combining DataFrames based on shared columns or indices. It offers robust functionality to handle various joining scenarios. When joining by column names, you specify the columns to join on using the on parameter, or pandas will infer common columns if not specified. The how parameter dictates the type of join (e.g., inner, left, right, outer).
flowchart LR
DF1[DataFrame 1] --> Merge(pandas.merge())
DF2[DataFrame 2] --> Merge
Merge -->|on='key_column'| Result[Joined DataFrame]
Merge -->|how='inner'| Result
Merge -->|how='left'| Result
Merge -->|how='right'| Result
Merge -->|how='outer'| ResultConceptual flow of pandas.merge() operation
Let's start by creating two sample DataFrames to demonstrate the join operations.
import pandas as pd
# DataFrame 1: Customer information
df1 = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'city': ['New York', 'London', 'Paris', 'Tokyo']
})
# DataFrame 2: Order information
df2 = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105],
'customer_id': [2, 3, 1, 5, 2],
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})
print("DataFrame 1 (df1):")
print(df1)
print("\nDataFrame 2 (df2):")
print(df2)
Creating two sample pandas DataFrames
Common Join Types
The how parameter in merge() is crucial for defining how rows are matched and included in the result. Here are the most common types:
on is not specified, merge() defaults to using the intersection of column names in both DataFrames as join keys. It's generally good practice to explicitly specify on for clarity and to avoid unexpected behavior.1. Inner Join
An inner join returns only the rows where the join key exists in both DataFrames. It's the default join type if how is not specified.
# Inner Join: Only common customer_ids are included
inner_join_df = pd.merge(df1, df2, on='customer_id', how='inner')
print("Inner Join Result:")
print(inner_join_df)
Performing an inner join on 'customer_id'
2. Left Join (Left Outer Join)
A left join returns all rows from the left DataFrame (df1 in this case) and the matching rows from the right DataFrame (df2). If there's no match in the right DataFrame, NaN (Not a Number) is filled for columns from the right DataFrame.
# Left Join: All rows from df1, matching rows from df2
left_join_df = pd.merge(df1, df2, on='customer_id', how='left')
print("Left Join Result:")
print(left_join_df)
Performing a left join on 'customer_id'
3. Right Join (Right Outer Join)
A right join returns all rows from the right DataFrame (df2) and the matching rows from the left DataFrame (df1). If there's no match in the left DataFrame, NaN is filled for columns from the left DataFrame.
# Right Join: All rows from df2, matching rows from df1
right_join_df = pd.merge(df1, df2, on='customer_id', how='right')
print("Right Join Result:")
print(right_join_df)
Performing a right join on 'customer_id'
4. Outer Join (Full Outer Join)
An outer join returns all rows when there is a match in either the left or right DataFrame. If there's no match, NaN is filled for the missing values from the non-matching DataFrame.
# Outer Join: All rows from both df1 and df2
outer_join_df = pd.merge(df1, df2, on='customer_id', how='outer')
print("Outer Join Result:")
print(outer_join_df)
Performing an outer join on 'customer_id'
on parameter, e.g., on=['col1', 'col2']. If the join columns have different names in each DataFrame, use left_on and right_on parameters.