

Article Time Stamp: 19 November 2008, 00:10:38 GMT+7

Microsoft SQL (MS SQL): Deleting Duplicate Records



The following is an analysis of records in a table, some of which may be duplicated. Many different things can be accomplished with this one method: finding duplicates, finding records that are duplicated n times, deleting duplicates, and more.

People have developed many ways to delete duplicate records from a table. You can create a new table and select only the unique records into it. You can concatenate fields together, take a DISTINCT set of those values, and place the results in another table. But these workarounds are usually chosen when people believe they have no other way to accomplish this.

Let's suppose we have a table with a multiple-field natural key. The key comes from an old ordering system that we imported into a table that includes an IDENTITY field:


CREATE TABLE SHIPMENT_ORDERS (
[id] [int] IDENTITY (1, 1) NOT NULL
, [shipment] [varchar] (20) NOT NULL
, [purchase_order] [varchar] (20) NOT NULL
, [item_sku] [char] (20) NOT NULL
, [units_ordered] [int] NOT NULL
, [units_shipped] [int] NOT NULL
, [shipment_date] [datetime] NOT NULL
, [order_date] [datetime] NOT NULL
, [last_name] [varchar] (30) NOT NULL
, [first_name] [varchar] (30) NOT NULL
, [full_address] [varchar] (30) NOT NULL
, [email] [varchar] (30) NOT NULL
)




If, for some reason, your current table doesn't have an identity column, you can add one with this syntax (note that the data type is required):


ALTER TABLE SHIPMENT_ORDERS
ADD [id] int IDENTITY (1, 1) NOT NULL
GO



After you've added the identity column, MS SQL will automatically generate the values.

For the sake of argument, there is no Primary Key on this table, and let's assume that some duplicate data was introduced into it. For our discussion, the target Primary Key will be [shipment], [purchase_order], [item_sku]. We want to gather some statistical information from the data first; our end goal is to remove the duplicate records. Here is the base query for all of our operations:


/* 1. Query heading */
SELECT
* /* In production, avoid SELECT * and write out the column names */

FROM
/* 2. Table with duplicates */
SHIPMENT_ORDERS

WHERE
EXISTS (SELECT NULL
FROM
/* 3. Table with duplicates, with an alias */
SHIPMENT_ORDERS b

WHERE
/* 4. Join each field with *itself*. These are fields that could be Primary Keys */
b.[shipment] = SHIPMENT_ORDERS.[shipment]
AND b.[purchase_order] = SHIPMENT_ORDERS.[purchase_order]
AND b.[item_sku] = SHIPMENT_ORDERS.[item_sku]

GROUP BY
/* 5. I must GROUP BY these fields because of the HAVING
clause and because these are the possible PK */
b.[shipment], b.[purchase_order], b.[item_sku]

HAVING
/* 6. This is the determining factor. We can control our
output from here. In this case, we want to pick records
where the ID is less than the MAX ID */
SHIPMENT_ORDERS.[id] < MAX(b.[id])
)



1. The kind of query we want will be either SELECT * or DELETE FROM. We will view our results first, then change this to delete the offending records if necessary.
2. This is the table we are performing the above action on.
3. This is the same table, duplicated. We are self-referencing our target table.
4. We must compare each field to its copy. This is far more reliable than concatenation, since we do not need to worry about the data lining up (concatenated values such as 'AB' + 'C' and 'A' + 'BC' compare as equal even though the underlying fields differ). The same approach works when you have two (or more) tables with similar data to compare. Most people concatenate all the target fields together, which is among the most expensive ways to compare a group of fields in a table.
5. We need to GROUP BY the same fields because we want to ensure that our data is unique in our analysis.
6. This ultimately determines what our condition for the resulting data will be.
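Before deleting anything, the "statistical information" step can also be done with a plain GROUP BY report. The following sketch (not part of the base query above, but using the same table and candidate key) counts how many copies exist for each key combination:

```sql
/* Report: how many copies exist for each candidate-key combination?
   A sketch for an initial survey; assumes the SHIPMENT_ORDERS table above. */
SELECT
  [shipment], [purchase_order], [item_sku]
, COUNT(*) AS copies
FROM SHIPMENT_ORDERS
GROUP BY [shipment], [purchase_order], [item_sku]
HAVING COUNT(*) > 1
ORDER BY copies DESC
```

Groups with copies = 1 are excluded by the HAVING clause, so only the problem keys appear.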

After running this query, we would see all of the duplicated records. For instance, if we had the following records in the table:

id    shipment   purchase_order   item_sku
3     435436     A123765          453987001201
46    435436     A123765          453987001201
354   435436     A123765          453987001201
23    981123     C543219          843209132209
613   981123     C543219          843209132209


the result after running the above query would be:

id    shipment   purchase_order   item_sku
3     435436     A123765          453987001201
46    435436     A123765          453987001201
23    981123     C543219          843209132209


This would be a report of the duplicated records that are present in the data. Now, if we look at section 6 in the query:


SHIPMENT_ORDERS.[id] < MAX(b.[id])



we can see that an alternate query could use MIN() instead of MAX(). This does not change which groups are affected; it only changes which copy of each group survives, the lowest [id] instead of the highest (note that the comparison must also flip from < to > so the query still targets the duplicates). In the case of our table, the only field that is not duplicated is [id]. If the table we were dealing with had a field such as a datetime column, it could be used as a pseudo-IDENTITY field for the same purpose.
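The MIN() variant would replace section 6 with a flipped comparison, so the earliest row in each group is the one kept:

```sql
HAVING
/* Keep the row with the lowest [id] per group: the query now
   targets everything greater than the group's minimum */
SHIPMENT_ORDERS.[id] > MIN(b.[id])
```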

Suppose we want a report of all records that appear exactly twice (one original plus one duplicate). This is easily obtained by changing one part of the query (6):


COUNT(b.[id]) = 2



Based on the previous data, our output would be:

id    shipment   purchase_order   item_sku
23    981123     C543219          843209132209
613   981123     C543219          843209132209


Notice that we get both records.
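The same slot generalizes to "duplicated n times." For example, to report every group that has three or more copies, one could change section 6 to:

```sql
HAVING
/* Groups with at least 3 copies; change the constant for other n */
COUNT(b.[id]) >= 3
```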

Remember the first query we ran? It showed all of the duplicated records in the table. If we change the first two words of the statement (SELECT *) to DELETE, we accomplish our goal of removing all duplicates from the table. Afterwards, only one record per group remains:

id    shipment   purchase_order   item_sku
354   435436     A123765          453987001201
613   981123     C543219          843209132209
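For completeness, here is the full DELETE form of the base query, assembled from the same pieces shown above. Run the SELECT version first to verify exactly what will be removed:

```sql
/* Removes every duplicate, keeping the row with the highest [id]
   in each ([shipment], [purchase_order], [item_sku]) group */
DELETE FROM SHIPMENT_ORDERS
WHERE EXISTS (SELECT NULL
    FROM SHIPMENT_ORDERS b
    WHERE b.[shipment] = SHIPMENT_ORDERS.[shipment]
      AND b.[purchase_order] = SHIPMENT_ORDERS.[purchase_order]
      AND b.[item_sku] = SHIPMENT_ORDERS.[item_sku]
    GROUP BY b.[shipment], b.[purchase_order], b.[item_sku]
    HAVING SHIPMENT_ORDERS.[id] < MAX(b.[id])
)
```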



Conclusion

To recap, we did not:

1. create a new table
2. concatenate fields
3. use the DISTINCT keyword
4. alter the table

and we did:

1. use the original table to compare
2. compare field to field without altering them
3. rely on the existence of data, not generate new data
4. use the IDENTITY field to ensure unique records

This query template can be used for many different types of reports and operations on data. At first glance it may seem complicated, but once dissected, each section is straightforward.



Source: http://www.15seconds.com

Article Source: Monx Digital Library

Copyright © Monx Digital Library, unless otherwise stated


