Programming, SQL, SQL Server 2000, SQL Server 2005, SQL Server 2008, SQL Tips, Transact-SQL, tSQL

SQL Server: Delete Duplicate Records Except the First Duplicate Record

Problem:

I encountered an interesting problem today. I had a SQL Server table with many records in it that I needed to clean up. Specifically, I needed to find all duplicate records in the table, but keep the first of each of the duplicate records.

The table also contained several rows of detail information that was different for each entry, but I only wanted to keep whatever detail information that the first of each of the duplicate records contained.

Solution:

To do the clean-up of the data in the table I needed to write a query that is not immediately intuitive. In the code below I’m providing a simple example of the statement that I used to clean up a temporary table called #myTest:

-- Delete the IDs of duplicated records but leave out the first duplicate record
 DELETE FROM #myTest WHERE ID IN(
 SELECT Id FROM #myTest tblMain
 WHERE tblMain.Id NOT IN (
 SELECT TOP 1 tblNested.id
 FROM #myTest tblNested
 WHERE tblNested.SerialNum=tblMain.SerialNum
 ORDER BY tblNested.SerialNum
 )
 AND tblMain.SerialNum IN(
 SELECT SerialNum
 FROM #myTest
 GROUP BY SerialNum
 HAVING COUNT(serialNum)>1 )
 )
 GO

This is a multi-part statement that looks somewhat complex, but that definitely does the job it was designed for!

Within the main select statement, the second nested query restricts the results of returned IDs to only show duplicate records.

Then the first nested query ties everything together by removing from the list of duplicates the first record of each duplicate.

The order of these nested queries is not really important. The main thing is that they exist.

In the end this query allows us to remove duplicate records, but to keep all information from the first of each of the duplicate records, exactly what we want!

Below is a full working example of the process in action. I ran this in SQL Server 2008 R2, but this SQL can run in any version of SQL Server 2000+.

-- Create the temporary table
CREATE TABLE #myTest(
	[Id] [int] IDENTITY(1,1) NOT NULL,
	[SerialNum] [varchar](20) NOT NULL,
	[RandomInfo] [varchar](150) NOT NULL
)
GO

-- Populate the table. Insert three duplicates of 111111, two of 222222, and four of 4444
INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('000000','123abc')
INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('111111','456abc')
INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('111111','789abc')
INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('111111','123def')
INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('222222','123ghi')
INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('222222','123jkl')
INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('333333','123mno')
INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('444444','123pqr')
INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('444444','123stu')
INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('444444','123vwx')
INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('444444','123yz')
GO

-- Select the full list of records
SELECT * FROM #myTest
GO

-- Select the IDs of duplicated records but leave out the first duplicate record
SELECT Id FROM #myTest tblMain
WHERE tblMain.Id NOT IN (
SELECT TOP 1 tblNested.id 
FROM #myTest tblNested
WHERE tblNested.SerialNum=tblMain.SerialNum 
ORDER BY tblNested.SerialNum
)
AND tblMain.SerialNum IN( 
SELECT SerialNum
FROM #myTest 
GROUP BY SerialNum
HAVING COUNT(serialNum)>1 )
GO

-- Delete the IDs of duplicated records but leave out the first duplicate record
DELETE FROM #myTest WHERE ID IN(
SELECT Id FROM #myTest tblMain
WHERE tblMain.Id NOT IN (
SELECT TOP 1 tblNested.id 
FROM #myTest tblNested
WHERE tblNested.SerialNum=tblMain.SerialNum 
ORDER BY tblNested.SerialNum
)
AND tblMain.SerialNum IN( 
SELECT SerialNum
FROM #myTest 
GROUP BY SerialNum
HAVING COUNT(serialNum)>1 )
)
GO

-- Select the full list of records
SELECT * FROM #myTest
GO
-- Drop the temporary table
DROP TABLE #myTest
GO

The resulting data from the code above looks like this:

These are the results of the SQL to remove duplicates from a table but keep the first of the duplicate records.
These are the results of the SQL to remove duplicates from a table but keep the first of the duplicate records.
Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s