
Faster Extraction & Loading by SSIS

Many factors matter when building an efficient, robust database and the corresponding package that extracts and loads its data, and they matter even more when the database is very large.

This is when we need methods to optimize the extraction and loading process in the package. In other words, we need to increase the performance of the SSIS package so that it can retrieve and update the data it needs very quickly.

The key to faster extraction is to put indexes on the target table, but indexes reduce performance when you load data into that table. So when and where to put indexes is very important. Create indexes on the joining columns before starting extraction, especially on columns whose data types are not integer, such as varchar, nvarchar, and datetime.

Secondly, the data type chosen for a particular column also plays a significant role in the performance of your SSIS package. Choose the data type and size of each column wisely.




Try to avoid casting or otherwise manipulating data types inside the SSIS package, as it adds overhead for SSIS. Do it before the package runs, or do the casting in the source query of the OLE DB Source component.
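
For instance, a minimal sketch of pushing the casts into the OLE DB Source query (table and column names here are hypothetical):

SELECT
    CAST(OrderID AS INT)           AS OrderID,    -- cast once at the source
    CONVERT(DATE, OrderDateText)   AS OrderDate,  -- instead of a Data Conversion transformation
    CAST(Amount AS DECIMAL(18, 2)) AS Amount
FROM dbo.SourceOrders;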

Try to use numeric values in fact tables and in the joining columns of the corresponding dimensions as much as possible; it will increase efficiency. A foreign key in the referring table uses less space if you choose a numeric key; less space means more records per page, which means less I/O for the same amount of data. Also, joining on a numeric key is usually faster than joining on a varchar/nvarchar key.
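
As a hedged illustration (the schema and names are hypothetical), an integer surrogate key keeps both the fact table and the join narrow:

CREATE TABLE dbo.DimCustomer (
    CustomerKey  INT IDENTITY(1, 1) PRIMARY KEY,  -- 4 bytes per key
    CustomerCode NVARCHAR(20) NOT NULL            -- up to 40 bytes; avoid as the join key
);

CREATE TABLE dbo.FactSales (
    CustomerKey INT NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    SalesAmount DECIMAL(18, 2) NOT NULL
);

-- The join runs on the 4-byte integer key, not the wide character code
SELECT d.CustomerCode, SUM(f.SalesAmount) AS Total
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS d ON d.CustomerKey = f.CustomerKey
GROUP BY d.CustomerCode;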

To optimize memory usage, SELECT only the columns you actually need. If you select all columns from a table (e.g., SELECT * FROM), you needlessly use memory and bandwidth to store and retrieve columns that never get used.

The OLE DB Destination provides more than one way to load data into the destination (five types of Data Access Mode). Use the fast-load option when loading data into the destination.

  • Data Access Mode – defines the method used to upload data into the destination. The fast-load option uses a BULK INSERT statement instead of an INSERT statement. If the fast-load option is not selected, INSERT is used by default.
  • Keep Identity – If selected, the identity values of the source are preserved and uploaded into the destination table. Otherwise, the destination table generates its own identity values if it has a column of identity type.
  • Keep Nulls – If selected, the null values from the source are preserved and uploaded into the destination table. Otherwise, if a destination column has a default constraint and NULL comes from the source for that column, the default value is inserted into the destination table.
  • Table Lock – If selected, a TABLOCK is acquired on the table during the data upload. This is the recommended option if the table is not being used by any other application at the time of the upload, as it removes the overhead of lock escalation.
  • Check Constraints – If selected, constraints are checked for all data coming through the pipeline. It is preferable to uncheck this option if constraint checking is not required; this reduces the overhead on the pipeline engine.
  • Rows per batch – the number of rows you want in one buffer. SSIS sets this automatically based on the row size and the MaxBufferRows property, but the number of rows coming from the pipeline per batch can also be defined by the user. The default value is -1 if it is left blank. You can specify a positive integer N so that the records arrive in small segments or batches, each containing N rows.
  • Maximum insert commit size – specifies the batch size that the OLE DB Destination tries to commit during fast-load operations; it splits the data into chunks as they are inserted into the destination. If you provide a value for this property, the destination commits rows in batches that are the smaller of (a) the Maximum insert commit size or (b) the remaining rows in the buffer currently being processed.
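
For reference, these checkboxes surface as properties of the OLE DB Destination in the Advanced Editor; the values below are a hypothetical illustration, not recommended settings:

FastLoadKeepIdentity        = False
FastLoadKeepNulls           = False
FastLoadMaxInsertCommitSize = 100000
FastLoadOptions             = TABLOCK, ROWS_PER_BATCH = 100000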

Try to run all load processes in parallel as much as possible. Run as many load processes as you have available CPUs; the number of parallel processes you can run is directly proportional to the number of logical processors (with hyper-threading) in your system.
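
In SSIS, the degree of parallelism is governed by the package-level MaxConcurrentExecutables property (and, inside a Data Flow Task, by EngineThreads); a hedged sketch with the documented default:

MaxConcurrentExecutables = -1   -- default: number of logical processors + 2
EngineThreads            = 10   -- per Data Flow Task; the default varies by version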

An important thing to keep in mind with parallel loads is table locking. You have to ensure that no two processes access the same table simultaneously under normal conditions. This problem generally arises with logging. For example, suppose we use a single ExtractLog or LoadLog table to keep track of the number of records extracted from the source or loaded into the destination, and we use an Execute SQL Task to write to it. While performing this task, a deadlock may occur when more than one process accesses the same table.

For example, in the diagram below, the user is loading data simultaneously using three parallel sequence containers, but each Log Extract task uses the same table to log the extract information. In this scenario, a table-lock exception will arise.

Parallel processes

To avoid this type of problem, change the IsolationLevel property of all Log Extract tasks from Serializable to ReadUncommitted. This will not place table-level locks and thus will not trigger any lock exception.
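
If you write the logging statement yourself inside the Execute SQL Task, the T-SQL counterpart of that setting looks like this (the table, columns, and parameters are hypothetical; the ? markers are OLE DB parameter placeholders):

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

INSERT INTO dbo.ExtractLog (PackageName, RecordsExtracted, LoggedAt)
VALUES (?, ?, GETDATE());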

Network limitations: you can only transfer data as fast as your network supports, but you can use that capacity efficiently by configuring SSIS to use the maximum bandwidth of your network.

You can set the Packet Size property of the connection manager to an integer value that suits your environment. The maximum value you can set is 32767.

Network packet size (this helps extract source data over the network much faster):

-- Enable advanced options so the 'network packet size (B)' setting becomes visible
EXEC sp_configure 'show advanced options', '1';
RECONFIGURE;
EXEC sp_configure;   -- review the current settings
 
-- Raise the network packet size to its maximum (the default is 4096 bytes)
EXEC sp_configure 'network packet size (B)', '32767';
RECONFIGURE WITH OVERRIDE;
 
EXEC sp_configure;   -- confirm the new value




Derived Column Transformation:

We usually use the Derived Column transformation for data type conversion or for handling NULL values, and in that case we can do it in a more efficient way. Consider the example below, where we handle NULL values in a varchar column by substituting “NA” (Not Applicable). Suppose we have two thousand or more records to extract from the source, some of which contain NULLs that need to be converted to “NA” before loading into the destination.
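
The substitution itself is a single expression in the Derived Column transformation; a minimal sketch, assuming a hypothetical CustomerName column:

ISNULL(CustomerName) ? "NA" : CustomerName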

Derived Column Transformation Editor

For this, we have two approaches with the Derived Column transformation:

1) Apply the Derived Column transformation to all records coming from the source, computing the derived column for every row.

Without Conditional split

Start, 12:56:09 PM

Finished, 12:56:09 PM, Elapsed time: 00:00:00.703

2) Use a Conditional Split first, and apply the Derived Column transformation only to those records that contain NULL values.

With Conditional Split

Start, 12:46:04 PM
Finished, 12:46:05 PM, Elapsed time: 00:00:00.609

Although the difference here is very small, it becomes significant for huge amounts of data.
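
A hedged sketch of the Conditional Split condition that routes only the NULL rows to the Derived Column transformation (the column name is hypothetical); the two paths can then be recombined, for example with a Union All, before the destination:

ISNULL(CustomerName)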

  • If your package and SQL Server are running on the same machine, consider using the SQL Server Destination instead of the OLE DB Destination to increase data-loading performance.
  • Avoid sorting large amounts of data inside the SSIS package; better still, do not use Sort transformations in SSIS at all, as they consume a lot of memory and degrade performance. If possible, sort the data with an ORDER BY clause in the SQL statement and then pull it into SSIS (see the sketch after this list).
  • Truncate vs. Delete: prefer TRUNCATE operations over DELETE operations. TRUNCATE removes all data from the table in one go and writes only a short log, whereas DELETE logs every row and therefore decreases performance (also covered in the sketch below).
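
A hedged sketch of the last two points, using hypothetical tables: sort at the source with ORDER BY, and clear the staging table with TRUNCATE rather than DELETE.

-- Sort in the database engine, not in an SSIS Sort transformation
SELECT OrderID, CustomerKey, Amount
FROM dbo.SourceOrders
ORDER BY CustomerKey;
-- (if a downstream Merge Join needs it, also set IsSorted = True and
--  SortKeyPosition on the source output in the Advanced Editor)

-- Minimally logged: removes all rows in one go
TRUNCATE TABLE dbo.StagingOrders;

-- Fully logged per row: avoid for full-table clears
-- DELETE FROM dbo.StagingOrders;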

Lookup transformation with Derived Column transformation:

Sometimes we need a Lookup transformation to compare incoming values against a target table and pick up values when the data matches. But this is not always the case: some values, such as NULLs, may not be present in the target table. For those, we generally convert the values to predefined values such as 0 or -1 so they can be matched against the target table. But performing a Derived Column transformation on all records is, again, not a good idea. Instead, perform the Lookup transformation first, pass through all the records that match, and apply the Derived Column transformation only to the remaining unmatched records. This will definitely increase the performance of the overall package.
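
A hedged sketch of the pattern (names hypothetical): configure the Lookup to redirect unmatched rows to its no-match output, derive the fallback key only there, and recombine the paths afterwards.

-- Lookup reference query: keep it narrow
SELECT CustomerKey, CustomerCode
FROM dbo.DimCustomer;

-- Derived Column expression applied only on the no-match output:
--     CustomerKey  =  -1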

A common table expression (CTE) is a much easier and faster approach than nesting subqueries when dealing with multiple aggregate functions.
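
A hedged illustration with hypothetical tables: the CTE computes the aggregates once, and the outer query reuses them instead of repeating nested subqueries.

WITH CustomerTotals AS (
    SELECT CustomerKey,
           SUM(SalesAmount) AS TotalAmount,
           COUNT(*)         AS OrderCount
    FROM dbo.FactSales
    GROUP BY CustomerKey
)
SELECT d.CustomerCode, c.TotalAmount, c.OrderCount
FROM CustomerTotals AS c
JOIN dbo.DimCustomer AS d ON d.CustomerKey = c.CustomerKey
WHERE c.TotalAmount > 1000;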

Conclusion:

In this article, you have seen how to architect a package for more efficient and reliable performance. It is also highly recommended to keep your package as simple as possible for better understandability.

Categories: SSIS
  1. July 7th, 2009 at 00:58 | #1

    Hi! I like your article and would very much like to read more information on this issue. Will you post more?

  2. September 22nd, 2010 at 03:42 | #2

    Thanks, cool article.
    Also, setting FastParse on the Flat File source can help load performance.

  3. Medona
    November 12th, 2010 at 12:51 | #3

    Very informative. Thanks for the article.

  4. February 27th, 2011 at 13:02 | #4

    Restarting Packages with Checkpoints: If you’re like most other information analysts and update your data warehouse every night, this feature will be of much interest to you. After having set up logging for your packages, every morning you’d be checking [...]

    for more please visit this url
    http://server2008.org/?p=866

  5. Vinod Andani
    March 8th, 2011 at 15:03 | #5

    Hi,
    What are your thoughts on the SCD component wizard? Some follow a different approach, say, Lookup and Conditional Split components with an OLE DB Command.

  6. BIUser_2011
    October 21st, 2011 at 13:57 | #6

    Very informative…

  7. Sean
    April 11th, 2012 at 01:31 | #7

    Some useful information, but some general architecture questions came to mind. What processing do you do on the database side, versus the ETL, etc? It’s important to define the “who handles what”, then optimize. Sometimes you don’t want all of those casts on the source side and want to offload to the ETL, etc. I knew a few of these, but I learned a little more. Good article.

  8. Praveen Kodi
    July 16th, 2012 at 11:52 | #8

    Hello Team,

    Good Afternoon!

    I was looking for a tool that handles the data validation needs of my project.

    My project requirements:

    1. Upload the Excel file into a BI tool
    2. Data validation from Excel to the BI tool
    3. Less time to upload the file. Presently it takes 5 to 10 minutes to upload the Excel sheet (the file size is 10 MB).

    Kindly suggest the best data validation tool.

    Thanks,
    Praveen Kodi
    Business Analyst
    iConcept

  9. rina
    September 29th, 2012 at 01:49 | #9

    Thanks. Very thorough details with diagrams. I have read two or three articles so far; the same style applies throughout and makes it easier for the reader to understand. I hope to read them all slowly.

  10. Revontulet
    February 20th, 2013 at 11:58 | #10

    The network packet size is a cool trick, but it later gives this error: Native Error: 233, no process is on the other end of the pipe.
    Ref: http://technet.microsoft.com/en-us/library/ms177437.aspx

  11. anil
    March 7th, 2013 at 09:08 | #11

    Thanks for the detailed explanation and the easy approach.

  13. July 5th, 2013 at 16:49 | #13

    I enjoyed reading this blog post.
