Joining pandas DataFrames by Column names

Learn joining pandas dataframes by column names with practical examples, diagrams, and best practices. Covers python, python-3.x, pandas development techniques with visual explanations.

Joining pandas DataFrames by Column Names

Hero image for Joining pandas DataFrames by Column names

Learn how to effectively merge or join pandas DataFrames based on common column names, covering various join types and practical examples.

Joining DataFrames is a fundamental operation in data manipulation with pandas. It allows you to combine two or more DataFrames based on common columns, similar to SQL JOIN operations. This article will guide you through the process of joining DataFrames by column names using the merge() function, exploring different join types and providing clear examples.

Understanding pandas merge() for Column-Based Joins

The pandas.merge() function is the primary tool for combining DataFrames based on shared columns or indices. It offers robust functionality to handle various joining scenarios. When joining by column names, you specify the columns to join on using the on parameter, or pandas will infer common columns if not specified. The how parameter dictates the type of join (e.g., inner, left, right, outer).

flowchart LR
    DF1[DataFrame 1] --> Merge(pandas.merge())
    DF2[DataFrame 2] --> Merge
    Merge -->|on='key_column'| Result[Joined DataFrame]
    Merge -->|how='inner'| Result
    Merge -->|how='left'| Result
    Merge -->|how='right'| Result
    Merge -->|how='outer'| Result

Conceptual flow of pandas.merge() operation

Let's start by creating two sample DataFrames to demonstrate the join operations.

import pandas as pd

# DataFrame 1: Customer information
df1 = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['New York', 'London', 'Paris', 'Tokyo']
})

# DataFrame 2: Order information
df2 = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [2, 3, 1, 5, 2],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam']
})

print("DataFrame 1 (df1):")
print(df1)
print("\nDataFrame 2 (df2):")
print(df2)

Creating two sample pandas DataFrames

Common Join Types

The how parameter in merge() is crucial for defining how rows are matched and included in the result. Here are the most common types:

1. Inner Join

An inner join returns only the rows where the join key exists in both DataFrames. It's the default join type if how is not specified.

# Inner Join: Only common customer_ids are included
inner_join_df = pd.merge(df1, df2, on='customer_id', how='inner')

print("Inner Join Result:")
print(inner_join_df)

Performing an inner join on 'customer_id'

2. Left Join (Left Outer Join)

A left join returns all rows from the left DataFrame (df1 in this case) and the matching rows from the right DataFrame (df2). If there's no match in the right DataFrame, NaN (Not a Number) is filled for columns from the right DataFrame.

# Left Join: All rows from df1, matching rows from df2
left_join_df = pd.merge(df1, df2, on='customer_id', how='left')

print("Left Join Result:")
print(left_join_df)

Performing a left join on 'customer_id'

3. Right Join (Right Outer Join)

A right join returns all rows from the right DataFrame (df2) and the matching rows from the left DataFrame (df1). If there's no match in the left DataFrame, NaN is filled for columns from the left DataFrame.

# Right Join: All rows from df2, matching rows from df1
right_join_df = pd.merge(df1, df2, on='customer_id', how='right')

print("Right Join Result:")
print(right_join_df)

Performing a right join on 'customer_id'

4. Outer Join (Full Outer Join)

An outer join returns all rows when there is a match in either the left or right DataFrame. If there's no match, NaN is filled for the missing values from the non-matching DataFrame.

# Outer Join: All rows from both df1 and df2
outer_join_df = pd.merge(df1, df2, on='customer_id', how='outer')

print("Outer Join Result:")
print(outer_join_df)

Performing an outer join on 'customer_id'