SSIS package to Remove Duplicates from any Database
Here comes one more challenging yet interesting topic to tide over. The requirement goes something like this: You have some sources, let it be some sales data or some Call center data coming from different sources. Data can be of different media. But you are able to load it in your staging tables i.e. one staging database you are maintaining like StageDB for storing these incremental data.
So, here we have both master i.e. Dimensional data and Detail i.e. Fact data in our stage database. But before processing further and loading it into our Mart or Data warehouse, we need to check if there is any redundancy at the row level for each of these tables in the staging database. Here comes the main problem, what if you don’t know how many tables are there in the staging database and how many columns are there for each table in the staging database but still you have to keep only unique records for each table.
This means there must be some mechanism to deal with this problem dynamically, i.e.
- It has to detect the number of tables for the StageDB database.
- It has to detect all the columns for each table in the StageDB database.
- It has to detect only the redundant records for each table in the StageDB database.
Here I built one package in SSIS 2008 to overcome this problem. To understand the logic behind this, you need to follow the below steps:
Step 1: Get tables available in a given database:
Here is the simple query for this:
SELECT name FROM sysobjects where sysobjects.xtype = 'U'
Use the above query in an “Execute SQL Task” and store it in a variable named “TableNamesCollection” of data type “Object” (this is the only data type that can hold any dataset).
Configure the “Execute SQL Task” as below.

Set the Result Set of the Query to the user defined variable.

Up to this, we got all the tables available in a given database.
Step 2: Iterating through each database:
For iteration of unknown collection, I will always prefer to have a Foreach Loop Container as we are not sure about the number of tables present at each run of the package.
Drag one Foreach Loop Container to the control flow, which will iterate for each table and do certain tasks at each iteration.

Out of the many collections available in a Foreach loop container, we will use ADO (AcitveX Data Object ) enumerator for enumeration of datasets. Configure the Foreach loop container as below:

Specify the variable name containing the Result Set for all the table names. At each iteration, store the current table name in another variable named “EachTableName” of data type String as shown below.

Now each table has to be processed for deletion of redundant data.
Step 3: Deleting Redundancy in a table at row level:
Drag an “Execute SQL Task” to accomplish this task as we are going to write a T-SQL code to detect redundancy and remove them. One of my previously published articles named “Delete Duplicate Records “has some more detail information for this.

The code used to do the above operation is given below:
DECLARE @TableName varchar(200) = ? IF EXISTS (SELECT * FROM sysobjects WHERE name = 'tempDupData' AND type = 'U') DROP TABLE tempDupData DECLARE @SQLQuery nvarchar(MAX) = '' DECLARE @CondQuery nvarchar(MAX) = '' DECLARE @Colmns varchar(8000) = '' DECLARE @ColName varchar(500) Declare ColInfo Cursor for select C.name from sysobjects T inner join syscolumns C on T.id = C.id and T.type = 'U' and T.name = @TableName inner join systypes DT on C.xusertype = DT.xusertype and DT.name not in ('image','text','ntext') OPEN ColInfo FETCH NEXT FROM ColInfo INTO @ColName WHILE @@FETCH_STATUS = 0 BEGIN SET @Colmns = @Colmns + '[' + upper(ltrim(rtrim(@ColName))) + '], ' SET @CondQuery = @CondQuery + 'A.[' + upper(ltrim(rtrim(@ColName))) + '] = B.[' + upper(ltrim(rtrim(@ColName))) + '] and ' FETCH NEXT FROM ColInfo INTO @ColName END SET @CondQuery = @CondQuery + ' B.RecID = ' CLOSE ColInfo DEALLOCATE ColInfo SET @SQLQuery = 'SELECT ' + @Colmns + ' COUNT(*) as [TCount], IDENTITY(bigint,1,1) as RecID into tempDupData FROM ' + @TableName + ' GROUP BY ' + LEFT(@Colmns, len(@Colmns) - 1) + ' HAVING count(*) > 1' Print @SQLQuery Print @CondQuery EXEC sp_executeSQL @SQLQuery; DECLARE @RowCount INT DECLARE @RecID bigint = (select MAX(RecID) from tempDupData) while @RecID > 0 BEGIN SET @RowCount = (select [TCount] from tempDupData where RecID = @RecID) - 1 SET ROWCOUNT @RowCount SET @SQLQuery = 'Delete A FROM ' + @TableName + ' A inner join tempDupData B on ' + @CondQuery + cast(@RecID as nvarchar(50)) EXEC sp_executeSQL @SQLQuery; SET ROWCOUNT 0 SET @RecID = @RecID - 1 END DROP table tempDupData
Detail description of the T-SQL code used in Exucute SQL Task named “Delete Redundant data from each table” is given below:
DECLARE @TableName varchar(200) = ?
On each iteration of the loop, the code is feed with a new table name and that is going to be stored in a local variable name @TableName. Here, “?” represents the parameter passed to the code as below:
The next important thing to keep in mind is the columns for the given table. And here as we don’t know what and how many columns will be there for the given table, it has to be derived dynamically by the below code.
select C.name from sysobjects T inner join syscolumns C on T.id = C.id and T.type = 'U' and T.name = @TableName inner join systypes DT on C.xusertype = DT.xusertype and DT.name not in ('image','text','ntext')
As we are going to use GROUP BY method to detect the redundant records, so the columns with the data types ‘image’,‘text’,‘ntext’ has to be omitted. GROUP BY or any other sorting method does not support these types of data types. Hence, we have to check for redundancy without considering these data types.
The below code uses a CURSOR to properly align the columns so that it can be execute under any T-SQL statement dynamically.
OPEN ColInfo FETCH NEXT FROM ColInfo INTO @ColName WHILE @@FETCH_STATUS = 0 BEGIN SET @Colmns = @Colmns + '[' + upper(ltrim(rtrim(@ColName))) + '], ' SET @CondQuery = @CondQuery + 'A.[' + upper(ltrim(rtrim(@ColName))) + '] = B.[' + upper(ltrim(rtrim(@ColName))) + '] and ' FETCH NEXT FROM ColInfo INTO @ColName END SET @CondQuery = @CondQuery + ' B.RecID = ' CLOSE ColInfo DEALLOCATE ColInfo
Here the two variables @Columns and @CondQuery are used to keep information regarding the columns like
@Columns = Keep information about the columns used in upper part of the query like “[Col1],[Col2],….[ColN]”
@CondQuery = Keep information about the columns used in lower part of the query i.e. in where clause like “A.[Col1] = B.[Col1] and A.[Col2] = B.[Col2] and … A.[ColN] = B.[ColN]” where A is the alias for original table and B is the alias for the table containing information about duplicate records only. Yah, we are using one separate temporary table (standard table but will be dropped at the end) and this is done by running the below query.
SET @SQLQuery = 'SELECT ' + @Colmns + ' COUNT(*) as [TCount], IDENTITY(bigint,1,1) as RecID into tempDupData FROM ' + @TableName + ' GROUP BY ' + LEFT(@Colmns, len(@Colmns) - 1) + ' HAVING count(*) > 1' Print @SQLQuery Print @CondQuery EXEC sp_executeSQL @SQLQuery;
This query will give us only those records that are redundant in the table. We keep these records in a table named tempDupData which will compare with the main table to delete redundant records from it.
The below query will delete the surplus redundant records from the main table.
DECLARE @RowCount INT DECLARE @RecID bigint = (select MAX(RecID) from tempDupData) while @RecID > 0 BEGIN SET @RowCount = (select [TCount] from tempDupData where RecID = @RecID) - 1 SET ROWCOUNT @RowCount SET @SQLQuery = 'Delete A FROM ' + @TableName + ' A inner join tempDupData B on ' + @CondQuery + cast(@RecID as nvarchar(50)) EXEC sp_executeSQL @SQLQuery; SET ROWCOUNT 0 SET @RecID = @RecID - 1 END
As we know that we have to keep at least one record for each redundant records, so for this ROWCOUNT is set to max count of a particular redundant record – 1, so as to keep delete all except one record. The query engine will stop after deleting N – 1 records.
The STEP 3 will continue till it processed all the tables in a given database.
Note: Download the Full SSIS Remove redundancy Package to detect and remove redundancy for FREE. ![]()
To make the work of Data warehouse users more flexible and comfortable, SQL Lion team has come up with a tool named “Duplicate Remover” which will automate the process of deleting redundant rows from any database or table. Download the free version from here.

What about using the Windowing function (CTE) technique so you don’t have to use a temp table?
WITH tempDupData AS (
SELECT RecID = ROW_NUMBER() OVER (PARTITION BY ORDER BY ),
FROM WITH(NOLOCK))
DELETE FROM tempDupData
WHERE RecID >= 1
Beautiful example and explanation!!!
Thanks for the posting.
Hi,
How to execute the package using .net (web)page and pass the table names as parameters.
Hi,
i need a help regrading eliminate duplicates in particular table.