Overlay normal curve to histogram in R
Overlaying a Normal Curve on a Histogram in R

Learn how to visually compare your data's distribution to a theoretical normal distribution by overlaying a normal curve on a histogram in R.
Histograms are powerful tools for visualizing the distribution of a dataset. Often, it's useful to compare this empirical distribution to a theoretical distribution, such as the normal (Gaussian) distribution. This comparison helps in assessing whether your data approximately follows a normal pattern, which is a common assumption for many statistical tests. This article will guide you through the process of overlaying a normal curve on a histogram in R, providing clear code examples and explanations.
Understanding the Components
Before we dive into the code, let's understand the key components involved:
- Histogram: Visualizes the frequency distribution of continuous data. In R, the hist() function creates a histogram.
- Normal Distribution: Characterized by its mean (μ) and standard deviation (σ), it is a symmetric, bell-shaped curve. Its probability density function (PDF) gives the relative likelihood for a random variable to take on a given value.
- Overlay: To superimpose the normal curve, we calculate the normal probability density over the range of our data and then plot it on the same graph as the histogram.
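To make the density idea concrete, the sketch below writes the normal PDF out from its formula and compares it with R's built-in dnorm(). The helper name normal_pdf, the evaluation points, and the values mu = 50 and sigma = 10 are illustrative assumptions, not part of the original example.

```r
# Hypothetical helper: the normal PDF written out from its formula
normal_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- c(30, 40, 50, 60, 70)   # illustrative evaluation points
manual  <- normal_pdf(x, mu = 50, sigma = 10)
builtin <- dnorm(x, mean = 50, sd = 10)

# The two should agree up to floating-point error
print(all.equal(manual, builtin))

# Rough numerical check that the density covers an area of about 1:
# a Riemann sum over mean +/- 5 standard deviations
step <- 0.1
grid <- seq(0, 100, by = step)
total_area <- sum(dnorm(grid, mean = 50, sd = 10)) * step
print(total_area)
```

Because the curve is a density, its height is not a probability or a count; only the area under it is meaningful, which is why the histogram has to be drawn on the density scale before the two can be compared.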
Workflow for overlaying a normal curve on a histogram: generate sample data, calculate the mean and SD, create the histogram, generate the normal curve data, overlay the normal curve, and add labels and a title.
Step-by-Step Implementation in R
Let's walk through the process with a practical example. We'll generate some sample data, create its histogram, and then overlay a normal curve calculated from the sample's mean and standard deviation.
# 1. Generate sample data (e.g., 1000 random numbers from a normal distribution)
set.seed(123) # for reproducibility
data <- rnorm(1000, mean = 50, sd = 10)
# 2. Calculate mean and standard deviation of the sample data
sample_mean <- mean(data)
sample_sd <- sd(data)
# 3. Create the histogram
# Use 'freq = FALSE' to plot densities instead of frequencies
# This is crucial for overlaying a density curve
hist_info <- hist(data,
                  freq = FALSE,
                  main = "Histogram with Normal Curve Overlay",
                  xlab = "Value",
                  ylab = "Density",
                  col = "lightblue",
                  border = "black")
# 4. Generate data for the normal curve
# Create a sequence of x-values covering the range of the histogram
x_values <- seq(min(data), max(data), length.out = 100)
# Calculate the normal probability density for each x-value
# using the sample's mean and standard deviation
y_values <- dnorm(x_values, mean = sample_mean, sd = sample_sd)
# 5. Overlay the normal curve
lines(x_values, y_values, col = "red", lwd = 2)
R code to generate data, create a histogram, and overlay a normal curve.
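As a quick check on the density scaling used in the code above, the sketch below confirms that the bar areas of a density histogram sum to 1. It also shows one way to overlay the curve on a frequency histogram instead: multiply the density by the sample size times the bin width. The block regenerates the same sample so it runs on its own; the names h, area, bin_width, and scaled_y are illustrative, and the scaling assumes equal-width bins.

```r
set.seed(123)
data <- rnorm(1000, mean = 50, sd = 10)

# Compute the histogram without drawing it
h <- hist(data, plot = FALSE)

# Bar heights times bar widths: the total area of a density histogram
area <- sum(h$density * diff(h$breaks))
print(area)

# Assumes equal-width bins, which hist() produces by default
bin_width <- diff(h$breaks)[1]

# Frequency histogram with the curve rescaled to the count axis
hist(data,
     main = "Frequency Histogram with Scaled Normal Curve",
     xlab = "Value",
     col = "lightblue",
     border = "black")
x_values <- seq(min(data), max(data), length.out = 100)
scaled_y <- dnorm(x_values, mean = mean(data), sd = sd(data)) *
  length(data) * bin_width
lines(x_values, scaled_y, col = "red", lwd = 2)
```

The factor length(data) * bin_width is the same factor that converts a bar's density into its count, so the rescaled curve and the frequency bars share one axis.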
The freq = FALSE argument in the hist() function is critical. It makes the histogram plot densities (so the total bar area is 1) rather than frequencies, which puts the bars on the same scale as the normal probability density function (dnorm()). If you omit freq = FALSE, the y-axis shows counts, the scales will not match, and the curve will typically appear as a nearly flat line near the bottom of the plot.
Customizing Your Plot
You can further customize your plot to improve its clarity and aesthetics. This includes adjusting colors, line types, adding a legend, or changing axis limits. The plot() and lines() functions in R offer extensive customization options.
# Example of further customization
# Re-create histogram with more bins and custom colors
hist(data,
     freq = FALSE,
     main = "Customized Histogram with Normal Curve",
     xlab = "Data Values",
     ylab = "Probability Density",
     col = "#A9D0F5",       # Light blue
     border = "darkblue",
     breaks = 30,           # Suggest about 30 bins; hist() may adjust the count
     xlim = c(min(data) - 5, max(data) + 5)) # Extend x-axis limits
# Overlay the normal curve with a different line type and color
lines(x_values, y_values, col = "darkgreen", lwd = 3, lty = 2) # Dashed line
# Add a legend
legend("topright",
       legend = c("Data Histogram", "Normal Curve"),
       col = c("#A9D0F5", "darkgreen"),
       lwd = c(NA, 3),      # No line for histogram, line for curve
       lty = c(NA, 2),      # Match the dashed line used for the curve
       pch = c(15, NA),     # Square for histogram, no point for curve
       bty = "n")           # No box around legend
Customizing the histogram and normal curve overlay.
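One detail in the customized plot: x_values only spans the observed data range, so the curve stops short of the extended x-axis limits. As an alternative sketch, base R's curve() with add = TRUE draws an expression in x on the existing plot and, by default, takes its range from that plot's x-limits. The variable names mirror the earlier example, curve_pts is an illustrative name, and the block regenerates the data so it runs on its own.

```r
set.seed(123)
data <- rnorm(1000, mean = 50, sd = 10)
sample_mean <- mean(data)
sample_sd <- sd(data)

hist(data,
     freq = FALSE,
     breaks = 30,
     main = "Histogram with curve() Overlay",
     xlab = "Data Values",
     ylab = "Probability Density",
     col = "#A9D0F5",
     border = "darkblue",
     xlim = c(min(data) - 5, max(data) + 5))

# curve() evaluates the expression in x; with add = TRUE it draws on the
# current plot and, by default, spans that plot's x-axis limits.
# It invisibly returns the points it drew, captured here for inspection.
curve_pts <- curve(dnorm(x, mean = sample_mean, sd = sample_sd),
                   add = TRUE, col = "darkgreen", lwd = 3, lty = 2)
range(curve_pts$x)
```

This avoids building x_values and y_values by hand, at the cost of slightly less control over exactly where the curve is evaluated.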
When choosing breaks for your histogram, consider the nature of your data. Too few breaks can obscure the distribution's shape, while too many can make it look noisy. Experiment to find a balance that best represents your data.
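As a starting point for that experimentation, base R ships three bin-count rules (Sturges, Scott, and Freedman-Diaconis), which hist() accepts by name through breaks. The sketch below compares them on the same simulated sample used earlier. Note that hist() treats a single number or rule name as a suggestion and rounds to "pretty" break points, so the actual bin count can differ from the suggested one. The names suggested and actual are illustrative.

```r
set.seed(123)
data <- rnorm(1000, mean = 50, sd = 10)

# Suggested number of bins under each rule (functions from grDevices)
suggested <- c(Sturges = nclass.Sturges(data),
               Scott   = nclass.scott(data),
               FD      = nclass.FD(data))
print(suggested)

# Number of bins hist() actually uses after rounding to "pretty" breaks
actual <- sapply(c("Sturges", "Scott", "FD"), function(rule) {
  length(hist(data, breaks = rule, plot = FALSE)$counts)
})
print(actual)
```

None of these rules is universally best; they are reasonable defaults to compare against a manually chosen value such as breaks = 30.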