Parsing JSON Strings from CSVs with Pandas

Learn how to load CSV data containing JSON strings into a Pandas DataFrame and parse those strings into structured columns.
Working with data often involves combining different formats. A common scenario is encountering CSV files where one or more columns contain JSON strings. While Pandas excels at reading CSVs, these embedded JSON strings require special handling to be properly parsed into a structured format within a DataFrame. This article will guide you through the process, from loading the data to transforming the JSON strings into usable columns.
Loading CSV Data with Embedded JSON
The first step is to load your CSV file into a Pandas DataFrame. Pandas' read_csv function is robust, but it will initially treat the JSON strings as plain text. We'll use a sample CSV with a column named json_data containing JSON objects; note that the quotes inside each JSON value are doubled (""), which is the standard CSV convention for escaping quotes inside a quoted field.
import pandas as pd
import io
# Simulate a CSV file with embedded JSON
csv_data = """id,name,json_data
1,Alice,"{""age"": 30, ""city"": ""New York""}"
2,Bob,"{""age"": 24, ""city"": ""London"", ""occupation"": ""Engineer""}"
3,Charlie,"{""age"": 35, ""city"": ""Paris""}"
"""
df = pd.read_csv(io.StringIO(csv_data))
print("Original DataFrame:\n", df)
print("\nData type of 'json_data' column:", df['json_data'].dtype)
Loading a CSV with JSON strings into a Pandas DataFrame.
As you can see from the output, the json_data column is of object dtype, meaning Pandas treats its contents as strings. Our goal is to extract the age, city, and occupation fields into their own DataFrame columns.
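To confirm this before parsing, a quick check (a minimal sketch reusing the df created above) shows that each cell is still an ordinary Python string:
# Inspect a single cell of the 'json_data' column
first_cell = df.loc[0, 'json_data']
print(type(first_cell))  # <class 'str'>
print(first_cell)        # {"age": 30, "city": "New York"}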
Parsing JSON Strings into DataFrame Columns
To parse the JSON strings, we need to apply a function that converts each string into a Python dictionary. The json module's loads function is perfect for this. After converting to dictionaries, we can normalize this semi-structured data into new columns using pd.json_normalize() or by directly accessing dictionary keys.
flowchart TD
    A[CSV File] --> B{"pd.read_csv()"}
    B --> C[DataFrame with JSON Strings]
    C --> D{"Apply json.loads()"}
    D --> E[DataFrame with Python Dictionaries]
    E --> F{"pd.json_normalize() or Dictionary Access"}
    F --> G[DataFrame with Structured Columns]
Workflow for parsing JSON strings from a CSV into a Pandas DataFrame.
import json
# Convert JSON strings to Python dictionaries
df['parsed_json'] = df['json_data'].apply(json.loads)
print("\nDataFrame after parsing JSON strings:\n", df)
print("\nData type of 'parsed_json' column:", df['parsed_json'].dtype)
# Normalize the 'parsed_json' column into new columns
json_df = pd.json_normalize(df['parsed_json'])
print("\nNormalized JSON DataFrame:\n", json_df)
# Concatenate the original DataFrame with the new JSON columns
final_df = pd.concat([df.drop(columns=['json_data', 'parsed_json']), json_df], axis=1)
print("\nFinal DataFrame with structured JSON data:\n", final_df)
Parsing JSON strings and normalizing them into new DataFrame columns.
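If you only need a couple of fields, you can also skip json_normalize and read keys straight from the parsed dictionaries, as mentioned earlier. Here is a minimal sketch of that dictionary-access approach, using dict.get so that a missing key yields None instead of raising KeyError:
# Alternative: pull individual keys directly from the parsed dictionaries
df['age'] = df['parsed_json'].apply(lambda d: d.get('age'))
df['city'] = df['parsed_json'].apply(lambda d: d.get('city'))
print("\nDataFrame with directly extracted fields:\n", df[['id', 'name', 'age', 'city']])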
The pd.json_normalize() function is incredibly powerful for flattening nested JSON structures. If your JSON strings contain deeply nested objects, json_normalize can handle them efficiently, creating new columns with appropriate prefixes.
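To illustrate, here is a small standalone sketch; the nested address object below is made up for demonstration and is not part of the CSV used in this article:
# Hypothetical nested records, purely to show the prefixing behaviour
nested_records = [
    {"age": 30, "address": {"city": "New York", "zip": "10001"}},
    {"age": 24, "address": {"city": "London", "zip": "SW1A 1AA"}},
]
nested_df = pd.json_normalize(nested_records)
print(nested_df.columns.tolist())  # ['age', 'address.city', 'address.zip']
The sep parameter of pd.json_normalize controls the separator used in these prefixes (a dot by default), and max_level limits how deep the flattening goes.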
Handling Malformed JSON and Missing Keys
Real-world data is rarely perfect. You might encounter malformed JSON strings or JSON objects where certain keys are missing. It's crucial to handle these cases gracefully to prevent your script from crashing.
import numpy as np
# Simulate a CSV with malformed JSON and missing keys
csv_data_error = """id,name,json_data
1,Alice,"{""age"": 30, ""city"": ""New York""}"
2,Bob,"{""age"": 24, ""city"": ""London"", ""occupation"": ""Engineer""}"
3,Charlie,"{""age"": 35, ""city"": ""Paris""}"
4,David,"invalid json string"
5,Eve,"{""age"": 28, ""country"": ""Germany""}"
"""
df_error = pd.read_csv(io.StringIO(csv_data_error))
def parse_json_safe(json_str):
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        # Fall back to an empty dict so json_normalize can still process the row
        return {}
df_error['parsed_json'] = df_error['json_data'].apply(parse_json_safe)
# Normalize, filling missing values with NaN
json_df_error = pd.json_normalize(df_error['parsed_json'])
final_df_error = pd.concat([df_error.drop(columns=['json_data', 'parsed_json']), json_df_error], axis=1)
print("\nFinal DataFrame with error handling:\n", final_df_error)
Robust parsing of JSON strings with error handling for malformed data and missing keys.
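Because rows with missing keys or unparsable JSON simply end up with NaN in the affected columns, you can post-process them like any other missing data. The following is a sketch only, assuming the column names produced by the code above:
# Fill gaps left by missing keys
final_df_error['occupation'] = final_df_error['occupation'].fillna('unknown')
# Rows where parsing failed have NaN in every JSON-derived column, e.g. 'age'
print("\nRows that failed to parse:\n", final_df_error[final_df_error['age'].isna()])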
Always use defensive parsing (for example, try-except blocks) when handling external data. Malformed JSON raises json.JSONDecodeError, halting your script. Returning an empty dictionary {} for invalid JSON lets pd.json_normalize handle the row gracefully: every JSON-derived column simply gets NaN. An empty dict is generally safer than returning None here, since json_normalize expects dict-like records.
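As a quick standalone illustration of that behaviour (not part of the dataset above), normalizing a list that contains an empty dict simply yields NaN for that row:
# An empty dict produces a row of NaN values after normalization
print(pd.json_normalize([{}, {"age": 28, "country": "Germany"}]))
# Row 0 is all NaN; row 1 has age 28.0 and country Germany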