Delete duplicate rows keeping the first row

Learn how to delete duplicate rows while keeping the first row, with practical examples, diagrams, and best practices. Covers SQL and SQL Server 2008 techniques for handling duplicates.

Efficiently Delete Duplicate Rows in SQL, Keeping the First Occurrence

Database table with duplicate rows highlighted, and one row remaining after deletion

Learn various SQL techniques to identify and remove duplicate rows from your tables, ensuring data integrity by preserving the initial record.

Duplicate data can compromise database integrity and lead to inaccurate reports. This article explores three ways to delete duplicate rows from a SQL table while preserving the first occurrence of each unique record: ROW_NUMBER() inside a CTE, a temporary table, and a join against a GROUP BY subquery. The examples use SQL Server syntax, and the same ideas carry over to similar relational database systems with minor changes.

Understanding Duplicates and the 'First Row' Concept

Before deleting, it's crucial to define what constitutes a 'duplicate' and what 'keeping the first row' means. A duplicate row typically refers to a row where a specific set of columns (or all columns) has identical values to another row. 'Keeping the first row' usually implies retaining the row with the lowest value in a designated ordering column (like an IDENTITY column, a timestamp, or even an arbitrary ROW_NUMBER() assignment) among the duplicates.
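Before deleting anything, it can help to list the duplicate groups and see how many rows each one contains. The query below is a minimal sketch, assuming the same placeholder names used throughout this article (YourTable, and Column1 to Column3 as the columns that define a duplicate):

```sql
-- List each duplicate group and the number of rows in it.
-- YourTable and Column1..Column3 are placeholder names.
SELECT
    Column1, Column2, Column3,
    COUNT(*) AS DuplicateCount
FROM
    YourTable
GROUP BY
    Column1, Column2, Column3
HAVING
    COUNT(*) > 1              -- keep only groups with more than one row
ORDER BY
    DuplicateCount DESC;
```

If this returns no rows, there is nothing to delete for that choice of columns.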

flowchart TD
    A[Start] --> B{Identify Duplicates?}
    B -- Yes --> C[Define 'First' Row Criteria]
    C --> D[Select Duplicates to Delete]
    D --> E[Execute DELETE Statement]
    E --> F[Verify Deletion]
    F --> G[End]
    B -- No --> G

Workflow for identifying and deleting duplicate rows.

Method 1: Using ROW_NUMBER() with a Common Table Expression (CTE)

This is one of the most common and flexible methods for deleting duplicates. It assigns a sequential number to each row within partitions defined by the columns that determine uniqueness. Rows with a ROW_NUMBER() greater than 1 are considered duplicates to be deleted. In SQL Server, a DELETE against a CTE that reads from a single table removes the matching rows from the underlying table.

WITH CTE_Duplicates AS (
    SELECT
        Column1, Column2, Column3, -- Columns that define uniqueness
        ROW_NUMBER() OVER (PARTITION BY Column1, Column2, Column3 ORDER BY (SELECT NULL)) as rn
    FROM
        YourTable
)
DELETE FROM CTE_Duplicates
WHERE rn > 1;

SQL query to delete duplicate rows using ROW_NUMBER() and a CTE, keeping one arbitrary row from each group.

💡

ORDER BY (SELECT NULL) lets SQL Server number the rows in any order, so which row survives is not guaranteed. To keep the 'first' row based on a specific order (e.g., the oldest record), replace it with ORDER BY YourPrimaryKeyColumn ASC or ORDER BY YourTimestampColumn ASC.
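As an illustration of that substitution, here is a sketch of the same delete with a deterministic order. It assumes the table has a unique key column, here given the placeholder name PrimaryKeyColumn:

```sql
-- Assumes YourTable has a unique key column named PrimaryKeyColumn (placeholder).
WITH CTE_Duplicates AS (
    SELECT
        PrimaryKeyColumn,
        ROW_NUMBER() OVER (
            PARTITION BY Column1, Column2, Column3   -- columns that define a duplicate
            ORDER BY PrimaryKeyColumn ASC            -- lowest key gets rn = 1 and is kept
        ) AS rn
    FROM
        YourTable
)
DELETE FROM CTE_Duplicates
WHERE rn > 1;
```

To preview what would be removed, swap the final DELETE FROM for SELECT * FROM and run the statement first.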

Method 2: Using a Temporary Table or Table Variable

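This approach copies the distinct rows into a temporary table (or table variable), truncates the original table, and re-inserts the unique data. Because SELECT DISTINCT compares every selected column, it only collapses rows that are identical in all of them; the copies are indistinguishable, so it does not matter which one is kept. It can be useful when a large share of the rows are duplicates, or when you need to perform additional operations on the unique set before re-insertion.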

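SET XACT_ABORT ON; -- Any run-time error rolls back the open transaction

-- Copy one instance of each fully identical row into a temporary table
SELECT DISTINCT Column1, Column2, Column3, Column4 -- List every column of the table
INTO #TempUniqueTable
FROM YourTable;

BEGIN TRANSACTION; -- Keeps the TRUNCATE and the re-insert atomic

TRUNCATE TABLE YourTable;

INSERT INTO YourTable (Column1, Column2, Column3, Column4)
SELECT Column1, Column2, Column3, Column4
FROM #TempUniqueTable;

COMMIT TRANSACTION;

DROP TABLE #TempUniqueTable;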

SQL query using a temporary table to remove duplicates.

⚠️

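The temporary table method needs enough space in tempdb to hold the unique rows, and copying the data out and back in can be slow for extremely large tables. TRUNCATE TABLE also fails if the table is referenced by a foreign key constraint, resets any IDENTITY seed, and takes a schema modification lock, so the process effectively requires exclusive access to the table.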
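SELECT DISTINCT only helps when the duplicate rows match in every column. When duplicates are defined by a subset of columns, the staging step can instead pick one row per group with ROW_NUMBER(). The following is a sketch under assumptions: PrimaryKeyColumn is a hypothetical unique key used to decide which row is 'first', Column4 stands for a column that may differ between duplicates, and #TempFirstRows is an arbitrary temporary table name.

```sql
-- Stage one row per duplicate group, choosing the lowest PrimaryKeyColumn.
-- PrimaryKeyColumn, Column4 and #TempFirstRows are placeholder names.
SELECT PrimaryKeyColumn, Column1, Column2, Column3, Column4
INTO #TempFirstRows
FROM (
    SELECT
        PrimaryKeyColumn, Column1, Column2, Column3, Column4,
        ROW_NUMBER() OVER (
            PARTITION BY Column1, Column2, Column3   -- columns that define a duplicate
            ORDER BY PrimaryKeyColumn ASC            -- first row = lowest key
        ) AS rn
    FROM YourTable
) AS ranked
WHERE rn = 1;

-- Inspect the staged rows before touching the original table.
SELECT COUNT(*) AS RowsToKeep FROM #TempFirstRows;
```

The truncate and re-insert steps then proceed as before, reading from the staged table. If the key is an IDENTITY column, re-inserting its original values requires SET IDENTITY_INSERT YourTable ON for the duration of the insert.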

Method 3: Using MIN/MAX with a Self-Join (Less Common for Deletion)

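While often used for selecting unique rows, a join against a grouped subquery with MIN() or MAX() can also be adapted for deletion, provided the table has a primary key or other unique identifier. The subquery finds the MIN() (or MAX()) primary key for each group of duplicates, and the DELETE removes every other row in that group. Note that the equality join does not match NULL values, so groups in which one of the compared columns is NULL are left untouched.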

DELETE t1
FROM YourTable t1
INNER JOIN (
    SELECT Column1, Column2, Column3, MIN(PrimaryKeyColumn) as MinPK
    FROM YourTable
    GROUP BY Column1, Column2, Column3
) t2 ON t1.Column1 = t2.Column1
AND t1.Column2 = t2.Column2
AND t1.Column3 = t2.Column3
WHERE t1.PrimaryKeyColumn > t2.MinPK;

SQL query to delete duplicates using a self-join and MIN() on a primary key.
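Whichever method is used, the delete can be wrapped in a transaction that checks the result before committing. The sketch below reuses the self-join delete with the same placeholder names and rolls back if any duplicate group is still present afterwards, which can happen, for example, when a grouping column contains NULL:

```sql
BEGIN TRANSACTION;

-- Delete every row whose key is higher than the lowest key in its group.
DELETE t1
FROM YourTable t1
INNER JOIN (
    SELECT Column1, Column2, Column3, MIN(PrimaryKeyColumn) AS MinPK
    FROM YourTable
    GROUP BY Column1, Column2, Column3
) t2 ON t1.Column1 = t2.Column1
AND t1.Column2 = t2.Column2
AND t1.Column3 = t2.Column3
WHERE t1.PrimaryKeyColumn > t2.MinPK;

-- Verify: any group with more than one row means duplicates remain.
IF EXISTS (
    SELECT 1
    FROM YourTable
    GROUP BY Column1, Column2, Column3
    HAVING COUNT(*) > 1
)
    ROLLBACK TRANSACTION;   -- duplicates remain, undo the delete
ELSE
    COMMIT TRANSACTION;     -- table is clean, keep the change
```

This is only one way to structure the check; on a busy table the transaction holds locks until it commits or rolls back, so it is worth trying on a copy of the data first.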
