postgresql COUNT(DISTINCT ...) very slow

Learn postgresql count(distinct ...) very slow with practical examples, diagrams, and best practices. Covers performance, postgresql, count development techniques with visual explanations.

Optimizing COUNT(DISTINCT ...) in PostgreSQL for Performance

A magnifying glass hovering over a database icon, symbolizing performance optimization and distinct counting in PostgreSQL.

Learn why COUNT(DISTINCT ...) can be slow in PostgreSQL and discover effective strategies, including indexing, subqueries, and alternative approaches, to significantly improve query performance.

The COUNT(DISTINCT column_name) aggregate function in PostgreSQL is a powerful tool for determining the number of unique values within a column. However, it's notorious for being a performance bottleneck, especially on large datasets. This article delves into the reasons behind its slowness and provides practical, actionable strategies to optimize these queries, ensuring your PostgreSQL database remains responsive and efficient.

Understanding the Performance Bottleneck

At its core, COUNT(DISTINCT ...) requires PostgreSQL to process every row that matches the WHERE clause, extract the specified column's value, and then build a unique set of these values before counting them. This process is inherently resource-intensive for several reasons:

Consider a table with millions of rows. Even if an index exists on the column, a simple COUNT(*) can often use an index-only scan, which is very fast. But for COUNT(DISTINCT ...), the database typically cannot rely solely on a standard B-tree index to get the distinct count directly. It still needs to read the actual values to ensure uniqueness.

flowchart TD
    A[Start Query] --> B{Scan Table/Index for WHERE clause}
    B --> C[Extract 'distinct_column' values]
    C --> D{Sort or Hash values}
    D --> E[Identify Unique Values]
    E --> F[Count Unique Values]
    F --> G[Return Result]
    style D fill:#f9f,stroke:#333,stroke-width:2px

Process flow for COUNT(DISTINCT ...) in PostgreSQL

Optimization Strategies

Several techniques can be employed to mitigate the performance issues associated with COUNT(DISTINCT ...). The best approach often depends on your specific data, query patterns, and acceptable data freshness.

1. Indexing the Column

While a standard B-tree index doesn't directly speed up the distinct counting process itself, it can significantly accelerate the initial filtering step (the WHERE clause). If your COUNT(DISTINCT ...) query includes a WHERE clause, an index on the filtered column(s) can drastically reduce the number of rows PostgreSQL needs to process.

CREATE INDEX idx_your_table_filtered_column ON your_table (filtered_column);

-- Example query benefiting from the index on 'status'
SELECT COUNT(DISTINCT user_id)
FROM orders
WHERE status = 'completed';

Creating an index on a filtered column to speed up WHERE clauses.

2. Using Subqueries or CTEs for Distinct Values

Sometimes, explicitly selecting the distinct values first and then counting them can be more efficient, especially if the distinct set is relatively small. This allows PostgreSQL to materialize the distinct set before counting.

SELECT COUNT(*)
FROM (
    SELECT DISTINCT user_id
    FROM orders
    WHERE order_date >= '2023-01-01'
) AS distinct_users;

Using a subquery to first get distinct values, then count them.

3. Leveraging Partial Indexes

If you frequently count distinct values for a specific subset of your data (e.g., active users, completed orders), a partial index can be highly effective. It's smaller and faster to maintain than a full index.

CREATE UNIQUE INDEX idx_active_distinct_users ON users (user_id) WHERE status = 'active';

-- This index can help if 'user_id' is unique for active users
SELECT COUNT(user_id) FROM users WHERE status = 'active';

Creating a partial unique index for specific subsets of data.

4. Materialized Views for Pre-aggregation

For reports or dashboards where real-time accuracy isn't strictly required, materialized views are an excellent solution. You can pre-calculate the distinct count and refresh the view periodically.

CREATE MATERIALIZED VIEW mv_distinct_users_by_month AS
SELECT
    DATE_TRUNC('month', order_date) AS month,
    COUNT(DISTINCT user_id) AS distinct_user_count
FROM orders
GROUP BY 1;

-- Refresh the view periodically
REFRESH MATERIALIZED VIEW mv_distinct_users_by_month;

-- Query the fast materialized view
SELECT distinct_user_count FROM mv_distinct_users_by_month WHERE month = '2023-01-01';

Using a materialized view to pre-aggregate distinct counts.

5. Approximate Distinct Counts (HyperLogLog)

When an exact distinct count isn't critical and a good approximation is sufficient, extensions like pg_hll (HyperLogLog) can provide extremely fast approximate counts with a very small memory footprint. This is ideal for analytics where trends are more important than absolute precision.

-- Install the extension (requires superuser privileges)
CREATE EXTENSION IF NOT EXISTS pg_hll;

-- Create an HLL aggregate for your column
SELECT hll_cardinality(hll_add_agg(hll_hash_text(user_id::text)))
FROM orders
WHERE order_date >= '2023-01-01';

Using the pg_hll extension for approximate distinct counts.

6. Optimizing Data Types

The data type of the column being counted distinctly can also impact performance. Smaller, fixed-width data types (like INTEGER or UUID) are generally faster to process and compare than variable-length types (like TEXT or VARCHAR) due to reduced memory overhead and simpler comparison logic.

If user_id is stored as TEXT but only contains numeric values, consider changing it to BIGINT if possible. This can reduce the cost of hashing or sorting distinct values.

7. Analyzing Query Plans

Always use EXPLAIN ANALYZE to understand how PostgreSQL is executing your COUNT(DISTINCT ...) queries. This will reveal where the time is being spent (e.g., sequential scans, sorts, hash aggregates) and guide your optimization efforts.

EXPLAIN ANALYZE
SELECT COUNT(DISTINCT user_id)
FROM orders
WHERE order_date >= '2023-01-01';

Using EXPLAIN ANALYZE to understand query execution.

Look for HashAggregate or GroupAggregate nodes that consume a lot of time and memory. These are often the culprits for slow distinct counts.