Warning: is_writable() [function.is-writable]: open_basedir restriction in effect. File(/f5/sqllion/public/wp-content/uploads/shareaholic/) is not within the allowed path(s): (/fs3d/sqllion/:/nfsn/apps/php53/lib/php/:/nfsn/apps/php5/lib/php/:/nfsn/apps/php/lib/php/) in /fs3d/sqllion/public/wp-content/plugins/sexybookmarks/sexy-bookmarks.php on line 335
Delete Duplicate Records | SQL Lion
Home > T - SQL > Delete Duplicate Records

Delete Duplicate Records

Because of some unorganized OLTP (transactional system), sometimes it may occur that duplicate records get entered in the OLTP database. It is the most common thing in case of legacy database and bad database. For a better output in analysis service or in reporting, the data must be cleaned, scrubbed and must be flawless.  And it becomes worse when the records are alike even for values in each column.

Before taking any step to clean the redundant data, you must first see and check for data redundancy.

For this, the most commonly used query is DISTINCT. The DISTINCT query returns exactly number of distinct rows in a given table.  But it has some flaws, the COUNT function does not work with “DISTINCT” keyword for more than one column. Below stated are some alternate ways to get this job done.
Let us consider the table named dbo.Cars containing the data given below:

CarCompany

CarBodyType

CarName

EngineType

Maruti

sedan

Swift-Dzire

diesel

TATA

small

Nano

petrol

TATA

SUV

Sumo

diesel

Maruti

sedan

SX4

petrol

Maruti

sedan

Swift-Dzire

diesel

TATA

small

Nano

petrol

Maruti

sedan

SX4

petrol

Here the above data has some redundant rows. In order to get the unique rows out of it, the below query can be executed.

select distinct * from dbo.Cars

Output:

CarCompany

CarBodyType

CarName

EngineType

Maruti

sedan

Swift-Dzire

diesel

Maruti

sedan

SX4

petrol

TATA

Small

Nano

petrol

TATA

SUV

Sumo

diesel

Note: Important fact about DISTINCT keyword:
DISTINCT keyword automatically sorts the whole result set before displaying the output according to the columns specified, resulting into high processing time as well as memory usage. Thus, it is advisable to avoid DISTINCT keyword in case of long running queries.

To get the total number of rows and the total number of distinct rows, one can use the below query.

select count(*) as TotalCount,
(select Count(*)
      from (select distinct * from dbo.Cars) "C") as CountDistinct,
 (count(*) - (select Count(*)
      from (select distinct * from dbo.Cars) "C")) as SurplusRows
from  dbo.Cars

Output:

TotalCount

CountDistinct

SurplusRows

7

4

3

Note: As stated above, DISTINCT keywords conflicts with the Count function for more than one column. So the below query will yield error message for incorrect syntax.

select count(*), count(distinct *) from dbo.Cars

Now we got to know that we have 3 redundant rows in the table. As we don’t have any unique columns like surrogate keys or primary keys or identity columns related to this table, it’s a bit difficult task to hunt it down. If you are going to delete on any condition, then it may happen that all the rows get deleted matching that condition. Let us find out various ways to remove these redundant rows by using T – SQL.





The following methods can be used to remove redundancy from a table.

  • Using TOP keyword
  • Using SET ROWCOUNT
  • Using IDENTITY column
  • Filtering Distinct records to another table

Method 1: Using TOP Keyword

Syntax of TOP:  Here only the first set of rows will be returned from the query result that is specified in the expression.


[
TOP (expression) [PERCENT]
[WITH TIES]
]

Expression: Numeric expression that specifies the number of rows to be returned

[PERCENT]:  Query returns only the first expression percent of rows from the result set

[WITH TIES]: Additional rows matching the last row of [Expression] will be added to the result set.  This is bit complicated one. Consider one example to show the first 3 records order by CarName.

Example: without [WITH TIES]

select TOP (3)
  * from dbo.Cars order by carname desc

Output:

CarCompany

CarBodyType

CarName

EngineType

Maruti

sedan

SX4

petrol

Maruti

sedan

SX4

petrol

Maruti

sedan

Swift-Dzire

diesel

Example: with [WITH TIES]

select TOP (3) WITH TIES
  * from dbo.Cars order by carname desc

Output:

CarCompany

CarBodyType

CarName

EngineType

Maruti

sedan

SX4

petrol

Maruti

sedan

SX4

petrol

Maruti

sedan

Swift-Dzire

diesel

Maruti

sedan

Swift-Dzire

diesel

So to delete the duplicate records using TOP operator, follow the below steps:

In our example, we have 2 redundant rows with the same values as “TATA”, “small”, “Nano” and “petrol”. So, to remove one row from it, the below query can be used.

delete top (1) from dbo.Cars
where CarCompany = 'TATA' AND CarBodyType ='small' and CarName = 'Nano' and EngineType = 'petrol'

Output:

(1 row(s) affected)

The above query might have deleted both the rows in case if we haven’t used TOP keyword. Using the TOP keyword thus helps in deleting only the duplicate row.  In this way, one can keep the data safe and also get rid of the redundancy.

It’s a bit boring and cumbersome work to delete manually all the redundant rows. In order to delete all the redundant rows automatically, we have to take help of the CURSOR (another good utility provided by SQL Server to hold the result set of any query). Let’s see how CURSOR can be used to solve our problem.

--Variables to take care the values of the columns
DECLARE @Col1CarCompany nvarchar(50)
DECLARE @Col2CarBodyType nvarchar(50)
DECLARE @Col3CarName nvarchar(50)
DECLARE @Col4EngineType nvarchar(50)
--Variable to keep the count of redundant rows
DECLARE @RowCount int
 
DECLARE Rmv_Redt_Cur CURSOR
FOR SELECT CarCompany, CarBodyType, CarName, EngineType, Count(*) as RedCount FROM  dbo.Cars
      GROUP BY  CarCompany, CarBodyType, CarName, EngineType
      HAVING Count(*) > 1
OPEN Rmv_Redt_Cur
FETCH NEXT FROM Rmv_Redt_Cur
INTO @Col1CarCompany, @Col2CarBodyType, @Col3CarName, @Col4EngineType, @RowCount
WHILE @@FETCH_STATUS = 0
BEGIN
-- 1 should be subtracted from @RowCount so as to keep atleast one record behind
DELETE TOP(@RowCount - 1) FROM dbo.Cars
WHERE CarCompany = @Col1CarCompany AND CarBodyType = @Col2CarBodyType and CarName = @Col3CarName and EngineType = @Col4EngineType
 
FETCH NEXT FROM Rmv_Redt_Cur
INTO @Col1CarCompany, @Col2CarBodyType, @Col3CarName, @Col4EngineType, @RowCount
END
 
CLOSE Rmv_Redt_Cur
DEALLOCATE Rmv_Redt_Cur

Output:

(1 row(s) affected)

(1 row(s) affected)

(1 row(s) affected)

The above cursor contains information about redundant rows. Now one by one each redundant data is fetched and fed to the DELETE statement. The TOP operator has (@RowCount – 1) number of rows so that it will leave single row for each group.




Method 2: Using SET ROWCOUNT

SET ROWCOUNT cause SQL Server to stop processing the query after the specified number of rows is returned.

Syntax:

SET ROWCOUNT { number | @number_var }

[number | @number_var ] : This specify the number of rows to be processed before stopping the particular query.

The default value for SET ROWCOUNT is 0, which will return all rows while executing the query.

Note: SET ROWCOUNT operator will not affect DML statements i.e. DELETE, INSERT, and UPDATE in the future versions.

So as per our earlier example to delete a redundant row manually, use the below query:

SET ROWCOUNT 1
delete from dbo.Cars
         where CarCompany = 'TATA' AND CarBodyType ='small' and CarName = 'Nano' and EngineType = 'petrol'

Output:

(1 row(s) affected)

As per our earlier example in TOP operator, to delete all the redundant rows automatically, we have to use a CURSOR for this case too.

CURSOR command can be used in the following manner.

--Variables to take care the values of the columns
DECLARE @Col1CarCompany nvarchar(50)
DECLARE @Col2CarBodyType nvarchar(50)
DECLARE @Col3CarName nvarchar(50)
DECLARE @Col4EngineType nvarchar(50)
--Variable to keep the count of redundant rows
DECLARE @RowCount int
 
DECLARE Rmv_Redt_Cur CURSOR
FOR SELECT CarCompany, CarBodyType, CarName, EngineType, Count(*) as RedCount FROM  dbo.Cars
       GROUP BY  CarCompany, CarBodyType, CarName, EngineType
       HAVING Count(*) > 1
 
OPEN Rmv_Redt_Cur
FETCH NEXT FROM Rmv_Redt_Cur
INTO @Col1CarCompany, @Col2CarBodyType, @Col3CarName, @Col4EngineType, @RowCount
WHILE @@FETCH_STATUS = 0
BEGIN
     SET @RowCount = @RowCount - 1
     SET ROWCOUNT @RowCount
     DELETE FROM dbo.Cars
              WHERE CarCompany = @Col1CarCompany AND CarBodyType = @Col2CarBodyType and CarName = @Col3CarName and EngineType = @Col4EngineType
     SET ROWCOUNT 0
 
FETCH NEXT FROM Rmv_Redt_Cur
INTO @Col1CarCompany, @Col2CarBodyType, @Col3CarName, @Col4EngineType, @RowCount
END
 
CLOSE Rmv_Redt_Cur
DEALLOCATE Rmv_Redt_Cur

Output:

(1 row(s) affected)

(1 row(s) affected)

(1 row(s) affected)

Method 3: Using IDENTITY Column

In this method, one has to insert one new column to the existing table with IDENTITY property.

IDENTITY:

When a new row is added to the table, SQL Server provides a unique, incremental value for the column. Only Int data type can be assigned to IDENTITY columns and only one identity column is possible for a given table. It takes two arguments- seed (Initial value of the counter) and increment (incremental value added to the identity value). The default value for both is (1, 1).

Execute the below query to add an IDENTITY column to dbo.Cars table.

ALTER TABLE dbo.Cars ADD IdentityColmn int IDENTITY(1,1)
SELECT * FROM dbo.Cars

Output:

CarCompany

CarBodyType

CarName

EngineType

IdentityColmn

Maruti

sedan

Swift-Dzire

diesel

1

TATA

small

Nano

petrol

2

TATA

SUV

Sumo

diesel

3

Maruti

sedan

SX4

petrol

4

Maruti

sedan

Swift-Dzire

diesel

5

TATA

small

Nano

petrol

6

Maruti

sedan

SX4

petrol

7

Now, there are unique rows in the table i.e. one can distinguish each row in the table by using the IDENTITY column named IdentityColmn.

The below query will delete the redundant rows from dbo.Cars table.

DELETE FROM dbo.Cars where IdentityColmn = 5

Output:

(1 row(s) affected)

The combination of Identity and Cursor can help in removing all the redundancy at one go. The below code has to be used.

--Variables to take care the values of the columns
DECLARE @Col1CarCompany nvarchar(50)
DECLARE @Col2CarBodyType nvarchar(50)
DECLARE @Col3CarName nvarchar(50)
DECLARE @Col4EngineType nvarchar(50)
DECLARE @Col5IdentityColmn int
 
DECLARE Rmv_Redt_Cur CURSOR
FOR SELECT CarCompany, CarBodyType, CarName, EngineType, Min(IdentityColmn) FROM  dbo.Cars
       GROUP BY  CarCompany, CarBodyType, CarName, EngineType
       HAVING Count(*) > 1
 
OPEN Rmv_Redt_Cur
FETCH NEXT FROM Rmv_Redt_Cur
INTO @Col1CarCompany, @Col2CarBodyType, @Col3CarName, @Col4EngineType, @Col5IdentityColmn
WHILE @@FETCH_STATUS = 0
BEGIN
 
DELETE FROM dbo.Cars
WHERE CarCompany = @Col1CarCompany AND CarBodyType = @Col2CarBodyType and CarName = @Col3CarName and EngineType = @Col4EngineType and IdentityColmn <> @Col5IdentityColmn
 
FETCH NEXT FROM Rmv_Redt_Cur
INTO @Col1CarCompany, @Col2CarBodyType, @Col3CarName, @Col4EngineType, @Col5IdentityColmn
END
 
CLOSE Rmv_Redt_Cur
DEALLOCATE Rmv_Redt_Cur

Output:

(1 row(s) affected)

(1 row(s) affected)

(1 row(s) affected)

Method 4: Filtering Distinct records to another table

Here the distinct data from the main table will be filtered and transferred to a temporary table. Then the main table will be replaced by the temporary table. It may happen that the constraints and all other parameters related to the given table may not be restored. Although this is a bad idea as this will hamper the integrity constraints of the database, for small examples, this is the quickest and simple trick that one can use.

This can be achieved by two ways:

1) By GROUP BY clause :

For more information about GROUP BY clause, please refer to my article GROUP BY Clause (SQL Server 2005 Vs 2008).

SELECT CarCompany, CarBodyType, CarName, EngineType INTO tempCar
FROM  dbo.Cars
GROUP BY  CarCompany, CarBodyType, CarName, EngineType
go
DROP TABLE dbo.Cars
select * into dbo.Cars from tempCar
go
DROP TABLE tempCar
go
select * from dbo.Cars

Output:

CarCompany

CarBodyType

CarName

EngineType

Maruti

sedan

Swift-Dzire

diesel

Maruti

sedan

SX4

petrol

TATA

small

Nano

petrol

TATA

SUV

Sumo

diesel

2) By Using DISTINCT keyword:

SELECT DISTINCT CarCompany, CarBodyType, CarName, EngineType INTO tempCar
FROM  dbo.Cars
go
DROP TABLE dbo.Cars
select * into dbo.Cars from tempCar
go
DROP TABLE tempCar
go
select * from dbo.Cars

Output:

CarCompany

CarBodyType

CarName

EngineType

Maruti

sedan

Swift-Dzire

diesel

Maruti

sedan

SX4

petrol

TATA

small

Nano

petrol

TATA

SUV

Sumo

diesel

As Database experts we all know that redundancy of data is inevitable but hope you got enough information for dwindling SQL Server database to tackle this problem more effectively.

To make the work of Data warehouse users more flexible and comfortable, SQL Lion team has come up with a tool named “Duplicate Remover” which will automate the process of deleting redundant rows from any database or table. Download the free version from here.

Reference:
Microsoft Books Online

Categories: T - SQL
  1. May 29th, 2009 at 16:35 | #1

    Dear Readers,
    Here is another way of deleting redundant rows by using IDENTITY column.
    DELETE FROM dbo.Cars
    WHERE
    EXISTS
    (SELECT IdentityColmn FROM dbo.Cars T
    WHERE T.CarCompany = dbo.Cars.CarCompany and T.CarBodyType= dbo.Cars.CarBodyType and T.CarName= dbo.Cars.CarName and T.EngineType = dbo.Cars.EngineType AND T.IdentityColmn < dbo.Cars.IdentityColmn)

    Thanks.

  2. June 10th, 2009 at 21:55 | #2

    Hi all, another way to find duplicates and remove duplicate.

    This query will help you to find the duplicates rows

    ;WITH CTE(RowNum,CarCompany,CarBodyType,CarName, EngineType)
    AS
    (
    SELECT ROW_NUMBER() OVER(PARTITION BY CarCompany,CarBodyType,CarName, EngineType
    ORDER BY CarCompany,CarBodyType,CarName, EngineType) as RowNum,
    CarCompany,CarBodyType,CarName, EngineType
    FROM TEST.[dbo].[CARS]
    )
    SELECT * FROM CTE WHERE RowNum >1

    To delete/remove the duplicate Please execute the below query.

    ;WITH CTE(RowNum,CarCompany,CarBodyType,CarName, EngineType)
    AS
    (
    SELECT ROW_NUMBER() OVER(PARTITION BY CarCompany,CarBodyType,CarName, EngineType
    ORDER BY CarCompany,CarBodyType,CarName, EngineType) as RowNum,
    CarCompany,CarBodyType,CarName, EngineType
    FROM TEST.[dbo].[CARS]
    )
    DELTE FROM CTE WHERE RowNum >1

    Hope this is helpful.

    Lal

  3. FEBRIE DHARMA KUNCORO
    July 21st, 2009 at 14:11 | #3

    Hi,usually i use this simple logic to delete duplicate records.

    CREATE PROC dbo.DELETE_DUPLICATE_RECORD
    AS
    BEGIN
    — Group duplicate table using distinct to temp table
    SELECT DISTINCT * INTO Single_Result FROM Original_table
    — Clear all data from source table
    TRUNCATE TABLE Original_table
    — Copy from temp table to source table (original table)
    INSERT INTO Original_table
    SELECT * FROM Single_Result
    END
    GO

    Hope this is more helpful!

    Cheers! :)

    FEBRIE

  4. Babita
    July 23rd, 2010 at 10:50 | #4

    Very good post.

  5. Bikas
    February 22nd, 2011 at 16:19 | #5

    Good Queries…….

  6. Vinod
    May 29th, 2012 at 11:28 | #6

    Your code will give performance flaws. why don’t we use Common table expressions with row_number()?

  7. April 10th, 2013 at 09:15 | #7

    It’s really a nice and useful piece of information. I’m glad that you shared this useful info with us. Please keep us up to date like this. Thanks for sharing.

  1. June 12th, 2010 at 20:29 | #1
  2. August 16th, 2011 at 20:14 | #2
  3. June 14th, 2014 at 21:21 | #3
You must be logged in to post a comment.