Problem:
I encountered an interesting problem today. I had a SQL Server table with many records in it that I needed to clean up. Specifically, I needed to find all duplicate records in the table, but keep the first of each of the duplicate records.
The table also contained several rows of detail information that was different for each entry, but I only wanted to keep whatever detail information that the first of each of the duplicate records contained.
Solution:
To do the clean-up of the data in the table I needed to write a query that is not immediately intuitive. In the code below I’m providing a simple example of the statement that I used to clean up a temporary table called #myTest:
-- Delete the IDs of duplicated records but leave out the first duplicate record DELETE FROM #myTest WHERE ID IN( SELECT Id FROM #myTest tblMain WHERE tblMain.Id NOT IN ( SELECT TOP 1 tblNested.id FROM #myTest tblNested WHERE tblNested.SerialNum=tblMain.SerialNum ORDER BY tblNested.SerialNum ) AND tblMain.SerialNum IN( SELECT SerialNum FROM #myTest GROUP BY SerialNum HAVING COUNT(serialNum)>1 ) ) GO
This is a multi-part statement that looks somewhat complex, but that definitely does the job it was designed for!
Within the main select statement, the second nested query restricts the results of returned IDs to only show duplicate records.
Then the first nested query ties everything together by removing from the list of duplicates the first record of each duplicate.
The order of these nested queries is not really important. The main thing is that they exist.
In the end this query allows us to remove duplicate records, but to keep all information from the first of each of the duplicate records, exactly what we want!
Below is a full working example of the process in action. I ran this in SQL Server 2008 R2, but this SQL can run in any version of SQL Server 2000+.
-- Create the temporary table CREATE TABLE #myTest( [Id] [int] IDENTITY(1,1) NOT NULL, [SerialNum] [varchar](20) NOT NULL, [RandomInfo] [varchar](150) NOT NULL ) GO -- Populate the table. Insert three duplicates of 111111, two of 222222, and four of 4444 INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('000000','123abc') INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('111111','456abc') INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('111111','789abc') INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('111111','123def') INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('222222','123ghi') INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('222222','123jkl') INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('333333','123mno') INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('444444','123pqr') INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('444444','123stu') INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('444444','123vwx') INSERT INTO #myTest ([SerialNum],[RandomInfo]) VALUES ('444444','123yz') GO -- Select the full list of records SELECT * FROM #myTest GO -- Select the IDs of duplicated records but leave out the first duplicate record SELECT Id FROM #myTest tblMain WHERE tblMain.Id NOT IN ( SELECT TOP 1 tblNested.id FROM #myTest tblNested WHERE tblNested.SerialNum=tblMain.SerialNum ORDER BY tblNested.SerialNum ) AND tblMain.SerialNum IN( SELECT SerialNum FROM #myTest GROUP BY SerialNum HAVING COUNT(serialNum)>1 ) GO -- Delete the IDs of duplicated records but leave out the first duplicate record DELETE FROM #myTest WHERE ID IN( SELECT Id FROM #myTest tblMain WHERE tblMain.Id NOT IN ( SELECT TOP 1 tblNested.id FROM #myTest tblNested WHERE tblNested.SerialNum=tblMain.SerialNum ORDER BY tblNested.SerialNum ) AND tblMain.SerialNum IN( SELECT SerialNum FROM #myTest GROUP BY SerialNum HAVING COUNT(serialNum)>1 ) ) GO -- Select the full list of records SELECT * FROM #myTest GO -- Drop the temporary table DROP TABLE #myTest GO
The resulting data from the code above looks like this:
