Parsing a JSON string which was loaded from a CSV using Pandas

Learn parsing a json string which was loaded from a csv using pandas with practical examples, diagrams, and best practices. Covers python, pandas development techniques with visual explanations.

Parsing JSON Strings from CSVs with Pandas

Hero image for Parsing a JSON string which was loaded from a CSV using Pandas

Learn how to effectively load CSV data containing JSON strings into Pandas DataFrames and parse them into structured columns.

Working with data often involves combining different formats. A common scenario is encountering CSV files where one or more columns contain JSON strings. While Pandas excels at reading CSVs, these embedded JSON strings require special handling to be properly parsed into a structured format within a DataFrame. This article will guide you through the process, from loading the data to transforming the JSON strings into usable columns.

Loading CSV Data with Embedded JSON

The first step is to load your CSV file into a Pandas DataFrame. Pandas' read_csv function is robust, but it will initially treat the JSON strings as regular text. We'll use a sample CSV file with a column named json_data containing JSON objects.

import pandas as pd
import io

# Simulate a CSV file with embedded JSON
csv_data = """
id,name,json_data
1,Alice,"{\"age\": 30, \"city\": \"New York\"}"
2,Bob,"{\"age\": 24, \"city\": \"London\", \"occupation\": \"Engineer\"}"
3,Charlie,"{\"age\": 35, \"city\": \"Paris\"}"
"""

df = pd.read_csv(io.StringIO(csv_data))
print("Original DataFrame:\n", df)
print("\nData type of 'json_data' column:", df['json_data'].dtype)

Loading a CSV with JSON strings into a Pandas DataFrame.

As you can see from the output, the json_data column is of object dtype, meaning Pandas treats its contents as strings. Our goal is to extract the age, city, and occupation fields into their own DataFrame columns.

Parsing JSON Strings into DataFrame Columns

To parse the JSON strings, we need to apply a function that converts each string into a Python dictionary. The json module's loads function is perfect for this. After converting to dictionaries, we can then normalize this semi-structured data into new columns using pd.json_normalize() or by directly accessing dictionary keys.

flowchart TD
    A[CSV File] --> B{pd.read_csv()}
    B --> C[DataFrame with JSON Strings]
    C --> D{Apply json.loads()}
    D --> E[DataFrame with Python Dictionaries]
    E --> F{pd.json_normalize() or Dictionary Access}
    F --> G[DataFrame with Structured Columns]

Workflow for parsing JSON strings from a CSV into a Pandas DataFrame.

import json

# Convert JSON strings to Python dictionaries
df['parsed_json'] = df['json_data'].apply(json.loads)
print("\nDataFrame after parsing JSON strings:\n", df)
print("\nData type of 'parsed_json' column:", df['parsed_json'].dtype)

# Normalize the 'parsed_json' column into new columns
json_df = pd.json_normalize(df['parsed_json'])
print("\nNormalized JSON DataFrame:\n", json_df)

# Concatenate the original DataFrame with the new JSON columns
final_df = pd.concat([df.drop(columns=['json_data', 'parsed_json']), json_df], axis=1)
print("\nFinal DataFrame with structured JSON data:\n", final_df)

Parsing JSON strings and normalizing them into new DataFrame columns.

Handling Malformed JSON and Missing Keys

Real-world data is rarely perfect. You might encounter malformed JSON strings or JSON objects where certain keys are missing. It's crucial to handle these cases gracefully to prevent your script from crashing.

import numpy as np

# Simulate a CSV with malformed JSON and missing keys
csv_data_error = """
id,name,json_data
1,Alice,"{\"age\": 30, \"city\": \"New York\"}"
2,Bob,"{\"age\": 24, \"city\": \"London\", \"occupation\": \"Engineer\"}"
3,Charlie,"{\"age\": 35, \"city\": \"Paris\"}"
4,David,"invalid json string"
5,Eve,"{\"age\": 28, \"country\": \"Germany\"}"
"""
df_error = pd.read_csv(io.StringIO(csv_data_error))

def parse_json_safe(json_str):
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        return {}

df_error['parsed_json'] = df_error['json_data'].apply(parse_json_safe)

# Normalize, filling missing values with NaN
json_df_error = pd.json_normalize(df_error['parsed_json'])

final_df_error = pd.concat([df_error.drop(columns=['json_data', 'parsed_json']), json_df_error], axis=1)
print("\nFinal DataFrame with error handling:\n", final_df_error)

Robust parsing of JSON strings with error handling for malformed data and missing keys.