Delete Duplicate Records
Because of some unorganized OLTP (transactional system), sometimes it may occur that duplicate records get entered in the OLTP database. It is the most common thing in case of legacy database and bad database. For a better output in analysis service or in reporting, the data must be cleaned, scrubbed and must be flawless. And it becomes worse when the records are alike even for values in each column.
Before taking any step to clean the redundant data, you must first see and check for data redundancy.
For this, the most commonly used query is DISTINCT. The DISTINCT query returns exactly number of distinct rows in a given table. But it has some flaws, the COUNT function does not work with “DISTINCT” keyword for more than one column. Below stated are some alternate ways to get this job done.
Let us consider the table named dbo.Cars containing the data given below:
|
CarCompany |
CarBodyType |
CarName |
EngineType |
|
Maruti |
sedan |
Swift-Dzire |
diesel |
|
TATA |
small |
Nano |
petrol |
|
TATA |
SUV |
Sumo |
diesel |
|
Maruti |
sedan |
SX4 |
petrol |
|
Maruti |
sedan |
Swift-Dzire |
diesel |
|
TATA |
small |
Nano |
petrol |
|
Maruti |
sedan |
SX4 |
petrol |
Here the above data has some redundant rows. In order to get the unique rows out of it, the below query can be executed.
SELECT DISTINCT * FROM dbo.Cars
Output:
|
CarCompany |
CarBodyType |
CarName |
EngineType |
|
Maruti |
sedan |
Swift-Dzire |
diesel |
|
Maruti |
sedan |
SX4 |
petrol |
|
TATA |
Small |
Nano |
petrol |
|
TATA |
SUV |
Sumo |
diesel |
Note: Important fact about DISTINCT keyword:
DISTINCT keyword automatically sorts the whole result set before displaying the output according to the columns specified, resulting into high processing time as well as memory usage. Thus, it is advisable to avoid DISTINCT keyword in case of long running queries.
To get the total number of rows and the total number of distinct rows, one can use the below query.
SELECT COUNT(*) AS TotalCount, (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM dbo.Cars) "C") AS CountDistinct, (COUNT(*) - (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM dbo.Cars) "C")) AS SurplusRows FROM dbo.Cars
Output:
|
TotalCount |
CountDistinct |
SurplusRows |
|
7 |
4 |
3 |
Note: As stated above, DISTINCT keywords conflicts with the Count function for more than one column. So the below query will yield error message for incorrect syntax.
SELECT COUNT(*), COUNT(DISTINCT *) FROM dbo.Cars
Now we got to know that we have 3 redundant rows in the table. As we don’t have any unique columns like surrogate keys or primary keys or identity columns related to this table, it’s a bit difficult task to hunt it down. If you are going to delete on any condition, then it may happen that all the rows get deleted matching that condition. Let us find out various ways to remove these redundant rows by using T – SQL.
The following methods can be used to remove redundancy from a table.
- Using TOP keyword
- Using SET ROWCOUNT
- Using IDENTITY column
- Filtering Distinct records to another table
Method 1: Using TOP Keyword
Syntax of TOP: Here only the first set of rows will be returned from the query result that is specified in the expression.
[
TOP (expression) [PERCENT]
[WITH TIES]
]
Expression: Numeric expression that specifies the number of rows to be returned
[PERCENT]: Query returns only the first expression percent of rows from the result set
[WITH TIES]: Additional rows matching the last row of [Expression] will be added to the result set. This is bit complicated one. Consider one example to show the first 3 records order by CarName.
Example: without [WITH TIES]
SELECT TOP (3) * FROM dbo.Cars ORDER BY carname DESC
Output:
|
CarCompany |
CarBodyType |
CarName |
EngineType |
|
Maruti |
sedan |
SX4 |
petrol |
|
Maruti |
sedan |
SX4 |
petrol |
|
Maruti |
sedan |
Swift-Dzire |
diesel |
Example: with [WITH TIES]
SELECT TOP (3) WITH TIES * FROM dbo.Cars ORDER BY carname DESC
Output:
|
CarCompany |
CarBodyType |
CarName |
EngineType |
|
Maruti |
sedan |
SX4 |
petrol |
|
Maruti |
sedan |
SX4 |
petrol |
|
Maruti |
sedan |
Swift-Dzire |
diesel |
|
Maruti |
sedan |
Swift-Dzire |
diesel |
So to delete the duplicate records using TOP operator, follow the below steps:
In our example, we have 2 redundant rows with the same values as “TATA”, “small”, “Nano” and “petrol”. So, to remove one row from it, the below query can be used.
DELETE TOP (1) FROM dbo.Cars WHERE CarCompany = 'TATA' AND CarBodyType ='small' and CarName = 'Nano' and EngineType = 'petrol'
Output:
(1 row(s) affected)
The above query might have deleted both the rows in case if we haven’t used TOP keyword. Using the TOP keyword thus helps in deleting only the duplicate row. In this way, one can keep the data safe and also get rid of the redundancy.
It’s a bit boring and cumbersome work to delete manually all the redundant rows. In order to delete all the redundant rows automatically, we have to take help of the CURSOR (another good utility provided by SQL Server to hold the result set of any query). Let’s see how CURSOR can be used to solve our problem.
--Variables to take care the values of the columns DECLARE @Col1CarCompany NVARCHAR(50) DECLARE @Col2CarBodyType NVARCHAR(50) DECLARE @Col3CarName NVARCHAR(50) DECLARE @Col4EngineType NVARCHAR(50) --Variable to keep the count of redundant rows DECLARE @RowCount INT DECLARE Rmv_Redt_Cur CURSOR FOR SELECT CarCompany, CarBodyType, CarName, EngineType, COUNT(*) AS RedCount FROM dbo.Cars GROUP BY CarCompany, CarBodyType, CarName, EngineType HAVING COUNT(*) > 1 OPEN Rmv_Redt_Cur FETCH NEXT FROM Rmv_Redt_Cur INTO @Col1CarCompany, @Col2CarBodyType, @Col3CarName, @Col4EngineType, @RowCount WHILE @@FETCH_STATUS = 0 BEGIN -- 1 should be subtracted from @RowCount so as to keep atleast one record behind DELETE TOP(@RowCount - 1) FROM dbo.Cars WHERE CarCompany = @Col1CarCompany AND CarBodyType = @Col2CarBodyType and CarName = @Col3CarName and EngineType = @Col4EngineType FETCH NEXT FROM Rmv_Redt_Cur INTO @Col1CarCompany, @Col2CarBodyType, @Col3CarName, @Col4EngineType, @RowCount END CLOSE Rmv_Redt_Cur DEALLOCATE Rmv_Redt_Cur
Output:
(1 row(s) affected)
(1 row(s) affected)
(1 row(s) affected)
The above cursor contains information about redundant rows. Now one by one each redundant data is fetched and fed to the DELETE statement. The TOP operator has (@RowCount – 1) number of rows so that it will leave single row for each group.
Method 2: Using SET ROWCOUNT
SET ROWCOUNT cause SQL Server to stop processing the query after the specified number of rows is returned.
Syntax:
SET ROWCOUNT { number | @number_var }
[number | @number_var ] : This specify the number of rows to be processed before stopping the particular query.
The default value for SET ROWCOUNT is 0, which will return all rows while executing the query.
Note: SET ROWCOUNT operator will not affect DML statements i.e. DELETE, INSERT, and UPDATE in the future versions.
So as per our earlier example to delete a redundant row manually, use the below query:
SET ROWCOUNT 1 DELETE FROM dbo.Cars WHERE CarCompany = 'TATA' AND CarBodyType ='small' and CarName = 'Nano' and EngineType = 'petrol'
Output:
(1 row(s) affected)
As per our earlier example in TOP operator, to delete all the redundant rows automatically, we have to use a CURSOR for this case too.
CURSOR command can be used in the following manner.
--Variables to take care the values of the columns DECLARE @Col1CarCompany NVARCHAR(50) DECLARE @Col2CarBodyType NVARCHAR(50) DECLARE @Col3CarName NVARCHAR(50) DECLARE @Col4EngineType NVARCHAR(50) --Variable to keep the count of redundant rows DECLARE @RowCount INT DECLARE Rmv_Redt_Cur CURSOR FOR SELECT CarCompany, CarBodyType, CarName, EngineType, COUNT(*) AS RedCount FROM dbo.Cars GROUP BY CarCompany, CarBodyType, CarName, EngineType HAVING COUNT(*) > 1 OPEN Rmv_Redt_Cur FETCH NEXT FROM Rmv_Redt_Cur INTO @Col1CarCompany, @Col2CarBodyType, @Col3CarName, @Col4EngineType, @RowCount WHILE @@FETCH_STATUS = 0 BEGIN SET @RowCount = @RowCount - 1 SET ROWCOUNT @RowCount DELETE FROM dbo.Cars WHERE CarCompany = @Col1CarCompany AND CarBodyType = @Col2CarBodyType and CarName = @Col3CarName and EngineType = @Col4EngineType SET ROWCOUNT 0 FETCH NEXT FROM Rmv_Redt_Cur INTO @Col1CarCompany, @Col2CarBodyType, @Col3CarName, @Col4EngineType, @RowCount END CLOSE Rmv_Redt_Cur DEALLOCATE Rmv_Redt_Cur
Output:
(1 row(s) affected)
(1 row(s) affected)
(1 row(s) affected)
Method 3: Using IDENTITY Column
In this method, one has to insert one new column to the existing table with IDENTITY property.
IDENTITY:
When a new row is added to the table, SQL Server provides a unique, incremental value for the column. Only Int data type can be assigned to IDENTITY columns and only one identity column is possible for a given table. It takes two arguments- seed (Initial value of the counter) and increment (incremental value added to the identity value). The default value for both is (1, 1).
Execute the below query to add an IDENTITY column to dbo.Cars table.
ALTER TABLE dbo.Cars ADD IdentityColmn INT IDENTITY(1,1) SELECT * FROM dbo.Cars
Output:
|
CarCompany |
CarBodyType |
CarName |
EngineType |
IdentityColmn |
|
Maruti |
sedan |
Swift-Dzire |
diesel |
1 |
|
TATA |
small |
Nano |
petrol |
2 |
|
TATA |
SUV |
Sumo |
diesel |
3 |
|
Maruti |
sedan |
SX4 |
petrol |
4 |
|
Maruti |
sedan |
Swift-Dzire |
diesel |
5 |
|
TATA |
small |
Nano |
petrol |
6 |
|
Maruti |
sedan |
SX4 |
petrol |
7 |
Now, there are unique rows in the table i.e. one can distinguish each row in the table by using the IDENTITY column named IdentityColmn.
The below query will delete the redundant rows from dbo.Cars table.
DELETE FROM dbo.Cars WHERE IdentityColmn = 5
Output:
(1 row(s) affected)
The combination of Identity and Cursor can help in removing all the redundancy at one go. The below code has to be used.
--Variables to take care the values of the columns DECLARE @Col1CarCompany NVARCHAR(50) DECLARE @Col2CarBodyType NVARCHAR(50) DECLARE @Col3CarName NVARCHAR(50) DECLARE @Col4EngineType NVARCHAR(50) DECLARE @Col5IdentityColmn INT DECLARE Rmv_Redt_Cur CURSOR FOR SELECT CarCompany, CarBodyType, CarName, EngineType, MIN(IdentityColmn) FROM dbo.Cars GROUP BY CarCompany, CarBodyType, CarName, EngineType HAVING COUNT(*) > 1 OPEN Rmv_Redt_Cur FETCH NEXT FROM Rmv_Redt_Cur INTO @Col1CarCompany, @Col2CarBodyType, @Col3CarName, @Col4EngineType, @Col5IdentityColmn WHILE @@FETCH_STATUS = 0 BEGIN DELETE FROM dbo.Cars WHERE CarCompany = @Col1CarCompany AND CarBodyType = @Col2CarBodyType and CarName = @Col3CarName and EngineType = @Col4EngineType and IdentityColmn <> @Col5IdentityColmn FETCH NEXT FROM Rmv_Redt_Cur INTO @Col1CarCompany, @Col2CarBodyType, @Col3CarName, @Col4EngineType, @Col5IdentityColmn END CLOSE Rmv_Redt_Cur DEALLOCATE Rmv_Redt_Cur
Output:
(1 row(s) affected)
(1 row(s) affected)
(1 row(s) affected)
Method 4: Filtering Distinct records to another table
Here the distinct data from the main table will be filtered and transferred to a temporary table. Then the main table will be replaced by the temporary table. It may happen that the constraints and all other parameters related to the given table may not be restored. Although this is a bad idea as this will hamper the integrity constraints of the database, for small examples, this is the quickest and simple trick that one can use.
This can be achieved by two ways:
1) By GROUP BY clause :
For more information about GROUP BY clause, please refer to my article GROUP BY Clause (SQL Server 2005 Vs 2008).
SELECT CarCompany, CarBodyType, CarName, EngineType INTO tempCar FROM dbo.Cars GROUP BY CarCompany, CarBodyType, CarName, EngineType go DROP TABLE dbo.Cars SELECT * INTO dbo.Cars FROM tempCar go DROP TABLE tempCar go SELECT * FROM dbo.Cars
Output:
|
CarCompany |
CarBodyType |
CarName |
EngineType |
|
Maruti |
sedan |
Swift-Dzire |
diesel |
|
Maruti |
sedan |
SX4 |
petrol |
|
TATA |
small |
Nano |
petrol |
|
TATA |
SUV |
Sumo |
diesel |
2) By Using DISTINCT keyword:
SELECT DISTINCT CarCompany, CarBodyType, CarName, EngineType INTO tempCar FROM dbo.Cars go DROP TABLE dbo.Cars SELECT * INTO dbo.Cars FROM tempCar go DROP TABLE tempCar go SELECT * FROM dbo.Cars
Output:
|
CarCompany |
CarBodyType |
CarName |
EngineType |
|
Maruti |
sedan |
Swift-Dzire |
diesel |
|
Maruti |
sedan |
SX4 |
petrol |
|
TATA |
small |
Nano |
petrol |
|
TATA |
SUV |
Sumo |
diesel |
As Database experts we all know that redundancy of data is inevitable but hope you got enough information for dwindling SQL Server database to tackle this problem more effectively.
To make the work of Data warehouse users more flexible and comfortable, SQL Lion team has come up with a tool named “Duplicate Remover” which will automate the process of deleting redundant rows from any database or table. Download the free version from here.
Reference:
Microsoft Books Online



Dear Readers,
Here is another way of deleting redundant rows by using IDENTITY column.
DELETE FROM dbo.Cars
WHERE
EXISTS
(SELECT IdentityColmn FROM dbo.Cars T
WHERE T.CarCompany = dbo.Cars.CarCompany and T.CarBodyType= dbo.Cars.CarBodyType and T.CarName= dbo.Cars.CarName and T.EngineType = dbo.Cars.EngineType AND T.IdentityColmn < dbo.Cars.IdentityColmn)
Thanks.
Hi all, another way to find duplicates and remove duplicate.
This query will help you to find the duplicates rows
;WITH CTE(RowNum,CarCompany,CarBodyType,CarName, EngineType)
AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY CarCompany,CarBodyType,CarName, EngineType
ORDER BY CarCompany,CarBodyType,CarName, EngineType) as RowNum,
CarCompany,CarBodyType,CarName, EngineType
FROM TEST.[dbo].[CARS]
)
SELECT * FROM CTE WHERE RowNum >1
To delete/remove the duplicate Please execute the below query.
;WITH CTE(RowNum,CarCompany,CarBodyType,CarName, EngineType)
AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY CarCompany,CarBodyType,CarName, EngineType
ORDER BY CarCompany,CarBodyType,CarName, EngineType) as RowNum,
CarCompany,CarBodyType,CarName, EngineType
FROM TEST.[dbo].[CARS]
)
DELTE FROM CTE WHERE RowNum >1
Hope this is helpful.
Lal
Hi,usually i use this simple logic to delete duplicate records.
CREATE PROC dbo.DELETE_DUPLICATE_RECORD
AS
BEGIN
— Group duplicate table using distinct to temp table
SELECT DISTINCT * INTO Single_Result FROM Original_table
— Clear all data from source table
TRUNCATE TABLE Original_table
— Copy from temp table to source table (original table)
INSERT INTO Original_table
SELECT * FROM Single_Result
END
GO
Hope this is more helpful!
Cheers!
FEBRIE