Microsoft SQL (MS SQL): Deleting Duplicate Records
The following is an analysis of records in a table, some of which may be duplicated. This method can accomplish many different things, including finding duplicates, finding records that are duplicated n times, and deleting duplicates.
People have developed many ways to delete duplicate records from a table. You can create a new table and select only the unique records into it. You can concatenate fields together, take a distinct representation of those records, and place them in another table. But these methods are normally used when people think they have no other way to accomplish the task.
Let's suppose we have a table structure with a multi-field Primary Key. The key comes from an old ordering system we imported into a table that includes an IDENTITY field:
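A sketch of such a table might look like this (the table name [orders] and the data types are assumptions for illustration; the key columns match the target key used below):

    CREATE TABLE [orders]
    (
        [id]             INT IDENTITY(1,1) NOT NULL, -- surrogate row identifier
        [shipment]       INT               NOT NULL,
        [purchase_order] INT               NOT NULL,
        [item_sku]       VARCHAR(20)       NOT NULL
    )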
If, for some reason, your current table doesn't have an identity column, you can add one with this syntax:
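    ALTER TABLE [orders]                    -- the assumed table name from above
        ADD [id] INT IDENTITY(1,1) NOT NULL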
After you've added the identity column, MS SQL automatically generates the values for every existing row.
For the sake of argument, there is no Primary Key on this table. Let's assume that some duplicate data was introduced into it. For our discussion, the target Primary Key will be [shipment], [purchase_order], [item_sku]. We want to get some statistical information from the data first; our end goal is to remove the duplicate records. I will now provide the base query for all of our operations:
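Here is one form of it, using the assumed [orders] table (the numbers in the comments match the sections explained below):

    SELECT *                                                    -- (1)
    FROM [orders]                                               -- (2)
    WHERE EXISTS
    (
        SELECT 1
        FROM [orders] AS [t2]                                   -- (3)
        WHERE [t2].[shipment]       = [orders].[shipment]       -- (4)
          AND [t2].[purchase_order] = [orders].[purchase_order]
          AND [t2].[item_sku]       = [orders].[item_sku]
        GROUP BY [t2].[shipment],                               -- (5)
                 [t2].[purchase_order],
                 [t2].[item_sku]
        HAVING MAX([t2].[id]) <> [orders].[id]                  -- (6)
    )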
1. The kind of query we want will be either SELECT * or DELETE FROM. We will view our results first, then change this to delete the offending records if necessary.
2. This is the table we are performing the above action on.
3. This is the same table, duplicated. We are self-referencing our target table.
4. We must compare each field to its copy (a concatenation-based alternative is sketched after this list). This is far better than concatenation, since we do not need to worry about the data lining up. The same technique works when you have two (or more) tables with similar data that you need to compare. The way most people do this is to concatenate all the target fields together, which is one of the most expensive ways to compare a group of fields in a table.
5. We GROUP BY the same fields so that each distinct key combination forms a single group in our analysis.
6. This condition ultimately determines which records appear in the result.
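For contrast, here is a sketch of the concatenation approach mentioned in point 4, next to the field-to-field comparison (column names as above; the CASTs are required because the key fields are not all character data):

    -- Concatenation (avoid): every field must be converted, indexes are
    -- defeated, and values can falsely line up ('12' + '3' = '1' + '23')
    WHERE CAST([t2].[shipment] AS VARCHAR(12))
          + CAST([t2].[purchase_order] AS VARCHAR(12))
          + [t2].[item_sku]
        = CAST([orders].[shipment] AS VARCHAR(12))
          + CAST([orders].[purchase_order] AS VARCHAR(12))
          + [orders].[item_sku]

    -- Field-to-field (preferred): direct, index-friendly comparisons
    WHERE [t2].[shipment]       = [orders].[shipment]
      AND [t2].[purchase_order] = [orders].[purchase_order]
      AND [t2].[item_sku]       = [orders].[item_sku]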
After running this query, we would see the duplicated records (every extra copy, excluding the one occurrence that will be kept). For instance, if we had the following records in the table:
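(Sample values invented purely for illustration.)

    [id]  [shipment]  [purchase_order]  [item_sku]
    ----  ----------  ----------------  ----------
    1     1001        2500              A100
    2     1001        2500              A100
    3     1002        2501              B200
    4     1002        2501              B200
    5     1002        2501              B200
    6     1003        2502              C300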
the result after running the above query would be:
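    [id]  [shipment]  [purchase_order]  [item_sku]
    ----  ----------  ----------------  ----------
    1     1001        2500              A100
    3     1002        2501              B200
    4     1002        2501              B200

Rows 2 and 5 carry the maximum [id] of their groups, so they are the copies that will survive.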
This would be a report of the duplicated records that are present in the data. Now, if we look at section 6 in the query:
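    HAVING MAX([t2].[id]) <> [orders].[id]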
we can see that an alternate query could use MIN() instead of MAX(). This would not change which groups are affected; it would simply keep the row with the minimum value of the field instead of the maximum. In the case of our table, the only field that is not duplicated is [id]. If the table had a unique field such as a datetime, that field could serve as a pseudo-IDENTITY field.
Suppose we want a report of all records that have exactly one duplicate, i.e., that appear exactly twice. This is easily obtained by changing one part of the query (6):
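    HAVING COUNT(*) = 2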
Based on the previous data, our output would be:
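    [id]  [shipment]  [purchase_order]  [item_sku]
    ----  ----------  ----------------  ----------
    1     1001        2500              A100
    2     1001        2500              A100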
Notice that we get both records of the pair.
Remember the first query we ran? It showed all of the duplicated records in the table. If we change the first two words (SELECT *) of that statement to DELETE, we accomplish our goal of removing all duplicates from the table.
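Using the assumed [orders] table, the statement becomes:

    DELETE                                                      -- was: SELECT *
    FROM [orders]
    WHERE EXISTS
    (
        SELECT 1
        FROM [orders] AS [t2]
        WHERE [t2].[shipment]       = [orders].[shipment]
          AND [t2].[purchase_order] = [orders].[purchase_order]
          AND [t2].[item_sku]       = [orders].[item_sku]
        GROUP BY [t2].[shipment], [t2].[purchase_order], [t2].[item_sku]
        HAVING MAX([t2].[id]) <> [orders].[id]
    )

With the sample data above, this removes rows 1, 3, and 4, leaving one copy of each record (rows 2, 5, and 6).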
To recap, we did not:
1. create a new table
2. concatenate fields
3. use the DISTINCT keyword
4. alter the table
and we did:
1. use the original table to compare
2. compare field to field without altering them
3. rely on the existence of data, not generate new data
4. use the IDENTITY field to ensure unique records
This query template can be used for many different types of reports and operations on data. At first glance, the query may seem rather complicated, but once it is dissected, the sections are not complex at all.