How to calculate median in AWS Redshift?
Calculating Median in AWS Redshift: A Comprehensive Guide

Learn various methods to calculate the median value in AWS Redshift, including window functions, approximate functions, and custom SQL approaches, with practical examples and performance considerations.
AWS Redshift, a powerful, fully managed, petabyte-scale data warehouse service, is optimized for analytical workloads. While it provides a rich set of SQL functions, calculating a median is not as simple as calling AVG() or SUM(): Redshift's median support comes through inverse-distribution functions such as PERCENTILE_CONT, which have their own syntax and restrictions. This article explores several effective strategies to calculate the median in Redshift, catering to different accuracy requirements and performance considerations.
Understanding the Median
The median is the middle value in a list of numbers sorted in ascending or descending order. If the list has an odd number of observations, the median is the single middle value; if it has an even number, the median is typically the average of the two middle values. Unlike the mean, the median is less affected by outliers, making it a robust measure of central tendency.
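This odd/even rule, and the median's robustness to outliers, can be checked with a few lines of Python (the standard-library statistics module implements exactly this definition):

```python
import statistics

odd = [3, 1, 7, 5, 9]              # sorted: 1 3 5 7 9 -> middle value is 5
even = [3, 1, 7, 5]                # sorted: 1 3 5 7 -> average of 3 and 5 is 4.0
outlier = [1, 2, 3, 4, 1_000_000]  # one extreme value

print(statistics.median(odd))      # the single middle element
print(statistics.median(even))     # mean of the two middle elements
print(statistics.mean(outlier))    # pulled far upward by the outlier
print(statistics.median(outlier))  # still 3: barely affected by the outlier
```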
Method 1: Using Window Functions (PERCENTILE_CONT)
Redshift's PERCENTILE_CONT window function is the most accurate and generally recommended method for calculating the median. It computes a percentile based on a continuous distribution model, interpolating between adjacent input values. For the median, you specify a percentile of 0.5 (50%).
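For intuition, here is a small Python sketch of the continuous-distribution interpolation that PERCENTILE_CONT performs. This is an illustration of the standard SQL definition, not Redshift's actual implementation:

```python
def percentile_cont(values, p):
    """Continuous percentile: linearly interpolate between adjacent
    sorted values when no single row sits exactly at the requested
    percentile (the standard SQL PERCENTILE_CONT definition)."""
    xs = sorted(values)
    n = len(xs)
    pos = p * (n - 1)      # 0-based fractional position in the sorted list
    lo = int(pos)          # index of the lower neighbour
    frac = pos - lo        # how far we are between the two neighbours
    if lo + 1 < n:
        return xs[lo] + frac * (xs[lo + 1] - xs[lo])
    return xs[lo]

print(percentile_cont([10, 20, 30, 40], 0.5))  # 25.0: halfway between 20 and 30
print(percentile_cont([10, 20, 30], 0.5))      # 20.0: an exact middle row exists
```

With an even count there is no single middle row, so the result falls between the two middle values; with an odd count the interpolation fraction is zero and the middle row is returned as-is.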
SELECT DISTINCT
    column_name,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value_column) OVER (PARTITION BY column_name) AS median_value
FROM
    your_table;
Calculating the median per group using PERCENTILE_CONT with a PARTITION BY clause. Because window functions return a value for every input row, DISTINCT collapses the output to one row per group.
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value_column) OVER () AS overall_median
FROM
your_table
LIMIT 1;
Calculating the overall median for the entire table using PERCENTILE_CONT.
The empty OVER () clause calculates the median across the entire dataset; LIMIT 1 collapses the identical per-row results to a single value. If you need the median per group (e.g., per product, per region), use OVER (PARTITION BY group_column).
Method 2: Approximate Median (APPROXIMATE PERCENTILE_DISC)
For very large datasets where exact precision is not critical and performance is paramount, Redshift offers the APPROXIMATE PERCENTILE_DISC aggregate function. It estimates the requested percentile using a quantile summary algorithm, which can be significantly faster than an exact computation at the cost of a small margin of error.
SELECT
    column_name,
    APPROXIMATE PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY value_column) AS approximate_median
FROM
    your_table
GROUP BY
    column_name;
Calculating an approximate median per group with APPROXIMATE PERCENTILE_DISC. Note that this is an aggregate function used with GROUP BY, not a window function, so it takes no OVER clause.
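Redshift's approximation algorithm is internal, but the accuracy-versus-speed trade-off can be illustrated with a simple random-sample estimate in Python. This is a hypothetical stand-in to show why a small error margin buys speed, not the algorithm Redshift uses:

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# A large, skewed "table" of one million values.
data = [random.expovariate(1.0) for _ in range(1_000_000)]

exact = statistics.median(data)       # requires sorting all rows
sample = random.sample(data, 10_000)  # inspect only 1% of the rows
approx = statistics.median(sample)    # far cheaper estimate

print(f"exact:  {exact:.4f}")
print(f"approx: {approx:.4f}")  # lands close to the exact value
```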
Use APPROXIMATE PERCENTILE_DISC only when a slight deviation from the true median is acceptable. The performance gains can be substantial for extremely large tables.
Method 3: Custom SQL with ROW_NUMBER() (Less Efficient)
Before PERCENTILE_CONT was available, a common approach involved using ROW_NUMBER() to assign ranks and COUNT(*) to find the total number of rows, then selecting the middle value(s). This method is generally less efficient than PERCENTILE_CONT for large datasets, because the entire dataset must be fully sorted and materialized before the middle rows can be picked out, but it demonstrates the underlying logic.
flowchart TD
    A[Start] --> B[Calculate row numbers RNK]
    B --> C[Calculate total count CNT]
    C --> D[Determine median position MED_POS]
    D --> E[Filter for RNK = MED_POS or RNK = MED_POS + 1]
    E --> F[Average the two values if CNT is even, else take the single value]
    F --> G[End]
Flowchart for calculating the median using ROW_NUMBER() and COUNT().
WITH RankedData AS (
    SELECT
        value_column,
        ROW_NUMBER() OVER (ORDER BY value_column) AS rn,
        COUNT(*) OVER () AS total_count
    FROM
        your_table
)
SELECT
    AVG(value_column) AS median_value
FROM
    RankedData
WHERE
    rn IN ((total_count + 1) / 2, (total_count + 2) / 2);
Custom SQL median calculation using ROW_NUMBER() for the entire table. With integer division, both expressions resolve to the same single middle row when the count is odd, and to the two middle rows (whose average is the median) when the count is even, so no CASE branching is needed.
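The same ranking logic can be simulated in Python to confirm it reproduces the textbook median. This is a sketch that substitutes a plain list for your_table:

```python
import statistics

def median_via_row_number(values):
    """Mimic the SQL approach: rank the sorted values 1..n, then
    average the rows at positions (n + 1) // 2 and (n + 2) // 2
    (the same row when n is odd, the two middle rows when n is even)."""
    ranked = sorted(values)            # ROW_NUMBER() OVER (ORDER BY value_column)
    n = len(ranked)                    # COUNT(*) OVER ()
    middle = {(n + 1) // 2, (n + 2) // 2}
    picked = [v for rn, v in enumerate(ranked, start=1) if rn in middle]
    return sum(picked) / len(picked)   # AVG over the selected row(s)

for sample in ([5, 1, 9, 3, 7], [5, 1, 9, 3]):
    assert median_via_row_number(sample) == statistics.median(sample)
print("matches statistics.median for odd and even counts")
```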
To calculate per-group medians with this approach, add a PARTITION BY clause to both the ROW_NUMBER() and COUNT() window functions.
Performance Considerations
When choosing a method, consider the size of your dataset and your precision requirements:
- PERCENTILE_CONT: generally the best balance of accuracy and performance for exact medians; it is highly optimized within Redshift.
- APPROXIMATE PERCENTILE_DISC: ideal for very large datasets where a small margin of error is acceptable and query speed is critical.
- Custom SQL with ROW_NUMBER(): least efficient for large datasets due to the full sort and extra pass over the data, but useful for understanding the underlying logic or for porting to systems without built-in percentile functions.
Always test different approaches with your specific data volumes and query patterns to determine the most suitable method for your use case.