Faster Extraction and Loading with SSIS
Many factors matter when building an efficient, robust database and the corresponding package that extracts and loads its data, and they matter even more when the database is very large. At that scale we need methods to optimize the extraction and loading process in the package. This means increasing the performance of the SSIS package so that it can retrieve and update data as quickly as possible.
The key to faster extraction is indexing the target table, but those same indexes reduce performance when you load data into it. Knowing when and where to create indexes is therefore important. Create indexes on the joining columns before extraction begins, especially on columns with data types other than integer, such as varchar, nvarchar, and datetime.
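As a sketch, an index on a joining column can be created before extraction and, if it slows a subsequent heavy load into the same table, disabled for the load and rebuilt afterwards (the table and column names here are hypothetical):

```sql
-- Hypothetical example: index a varchar joining column before extraction
CREATE NONCLUSTERED INDEX IX_Customer_CustomerCode
    ON dbo.Customer (CustomerCode);

-- Optionally disable it before a heavy load, then rebuild afterwards
ALTER INDEX IX_Customer_CustomerCode ON dbo.Customer DISABLE;
-- ... run the load ...
ALTER INDEX IX_Customer_CustomerCode ON dbo.Customer REBUILD;
```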
Secondly, the data type chosen for each column also plays a significant role in the performance of your SSIS package. Choose column data types and sizes wisely.
Avoid type casting or data type manipulation inside the SSIS package, as it adds overhead for SSIS. Do the conversion before the package runs, or perform the cast in the source query of the OLE DB Source component.
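For instance, a conversion that would otherwise require a Data Conversion or Derived Column transformation in the data flow can be pushed into the source query (table and column names are illustrative):

```sql
-- Cast in the source query instead of inside the SSIS data flow
SELECT OrderID,
       CAST(OrderDate AS date)        AS OrderDate,
       CAST(Amount AS decimal(18, 2)) AS Amount
FROM dbo.SourceOrders;
```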
Use numeric values in fact tables and in the joining columns of the corresponding dimensions wherever possible; it improves efficiency. The foreign key in the referring tables uses less space if you choose a numeric key; less space means more records per page, which means less I/O for the same amount of data. Joining on a numeric key is also usually faster than joining on a varchar/nvarchar key.
To optimize memory usage, SELECT only the columns you actually need. If you SELECT all columns from a table (e.g., SELECT * FROM) you will needlessly use memory and bandwidth to store and retrieve columns that do not get used.
The OLE DB Destination provides more than one way to load data into the destination (five Data Access Modes). Use the Fast Load option when loading data into the destination.
- Data Access Mode – Defines the method used to upload data into the destination. The fast load option uses a BULK INSERT statement instead of an INSERT statement; if fast load is not selected, INSERT is used by default.
- Keep Identity – If selected, the identity values from the source are preserved and loaded into the destination table. Otherwise, the destination table generates its own values for any identity column.
- Keep Nulls – If selected, the NULL values from the source are preserved and loaded into the destination table. Otherwise, if a destination column has a default constraint and a NULL arrives from the source for that column, the default value is inserted instead.
- Table Lock – If selected, a TABLOCK is acquired on the table during the data upload. This is the recommended option when no other application is using the table at load time, as it removes the overhead of lock escalation.
- Check Constraints – If selected, every row coming through the pipeline is checked against the table's constraints. Uncheck this option when constraint checking is not required; this reduces the overhead on the pipeline engine.
- Rows per batch – The number of rows from the pipeline that the destination treats as one batch. If it is left blank, the default value is -1, meaning all rows are treated as a single batch. You can specify a positive integer N so that the records arrive in smaller segments (batches) of N rows each.
- Maximum insert commit size – Specifies the batch size that the OLE DB Destination tries to commit during fast load operations; it effectively splits the data into chunks as it is inserted into the destination. If you set this property, the destination commits rows in batches that are the smaller of (a) the Maximum insert commit size or (b) the remaining rows in the buffer currently being processed.
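As a rough T-SQL analogy (an illustration of how the options map onto bulk insert behavior, not what SSIS literally emits), the fast load settings correspond to hints like these:

```sql
-- Illustrative only: fast load options expressed as BULK INSERT hints
BULK INSERT dbo.TargetTable
FROM 'C:\staging\extract.dat'   -- hypothetical data file
WITH (
    TABLOCK,            -- Table Lock
    KEEPIDENTITY,       -- Keep Identity
    KEEPNULLS,          -- Keep Nulls
    CHECK_CONSTRAINTS,  -- Check Constraints (omit to skip checking)
    BATCHSIZE = 10000   -- commit-size behavior
);
```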
Run load processes in parallel wherever possible, and run as many load processes as you have available CPUs. The number of parallel processes you can run is directly proportional to the number of logical processors (with hyper-threading) in your system.
An important thing to keep in mind with parallel loads is table locking. You must ensure that no two processes access the same table simultaneously under normal conditions. This problem typically arises with logging. For example, a single extract-log or load-log table is often used to track the number of records extracted from the source or loaded to the destination, usually via an Execute SQL Task. While performing this task, a deadlock can occur when more than one process accesses that table at the same time.
For example, in the diagram below, the user loads data simultaneously using three sequence containers in parallel, but each Log Extract task writes to the same logging table. In this scenario, a table lock exception will arise.
To avoid this type of problem, change the IsolationLevel property from Serializable to ReadUncommitted for all the Log Extract tasks. This avoids table-level locks and thus prevents the lock exception.
Network limitations: You can only transfer data as fast as your network supports, but you can use that capacity efficiently; SSIS can be configured to use the maximum bandwidth of your network.
You can set the Packet Size property of the connection manager to an integer value that suits your environment. The maximum value you can enter is 32767.
Network packet size (this speeds up pulling source data across the network):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'network packet size (B)', 32767;
RECONFIGURE WITH OVERRIDE;
EXEC sp_configure;
Derived Column Transformation:
The Derived Column transformation is usually used for data type conversion or for handling NULL values, and there is a more efficient way to apply it. Consider the example below, where NULL values in a varchar column are replaced with “NA” (Not Applicable). Suppose two thousand or more records are to be extracted from the source, and some of them contain NULLs that must be converted to “NA” before loading to the destination.
For this we have two approaches with Derived Column Transformation:
1) Apply the Derived Column transformation to all the records coming from the source, deriving the replacement column for every row.
Start, 12:56:09 PM
Finished, 12:56:09 PM, Elapsed time: 00:00:00.703
2) Apply the Derived Column transformation only to the records that actually contain NULL values.
Start, 12:46:04 PM
Finished, 12:46:05 PM, Elapsed time: 00:00:00.609
Although the difference here is small, it becomes significant for large amounts of data.
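Consistent with the earlier advice to push work into the source query, the same NULL replacement can often be done there instead of in the data flow (table and column names are illustrative):

```sql
-- Replace NULLs at the source instead of in a Derived Column transformation
SELECT CustomerID,
       ISNULL(Region, 'NA') AS Region
FROM dbo.SourceCustomer;
```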
- If your package and SQL Server are running on the same machine, prefer the SQL Server Destination over the OLE DB Destination to increase load performance.
- Avoid sorting large amounts of data inside the SSIS package. It is better not to use the Sort transformation at all, as it consumes a lot of memory and degrades performance. If possible, sort the data with an ORDER BY clause in the SQL statement and then pull it into SSIS.
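When sorting in the source query, you can also tell SSIS the data is already ordered by setting IsSorted = True on the source output and SortKeyPosition on the sorted columns (in the source's advanced editor), so downstream components such as Merge Join do not require a Sort transformation. A minimal source query (names illustrative):

```sql
-- Sort at the source; mark the output as sorted in SSIS afterwards
SELECT OrderID, CustomerID, Amount
FROM dbo.SourceOrders
ORDER BY OrderID;
```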
- Truncate vs Delete: Prefer TRUNCATE over DELETE where possible. TRUNCATE removes all data from the table in one go and writes only minimal log records, while DELETE logs every row, which reduces performance.
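The two statements look similar but behave very differently in the transaction log (the table name is illustrative):

```sql
-- Minimally logged: deallocates the table's pages in one operation
TRUNCATE TABLE dbo.StagingOrders;

-- Fully logged: every deleted row is written to the transaction log
DELETE FROM dbo.StagingOrders;
```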
Lookup transformation with Derived Column transformation:
Sometimes we need a Lookup transformation to compare incoming values with the target table and pick values for the rows that match. But this does not always succeed: some incoming values, such as NULLs, may not be present in the target table, so they are usually first converted to predefined values like 0 or -1 so that they can match. Performing a Derived Column transformation on all records for this is, again, not a good idea. Instead, perform the Lookup first, pass through all the records that match, and apply the Derived Column transformation only to the remaining unmatched (bad) records. This will definitely improve the performance of the overall package.
A common table expression (CTE) is usually an easier and faster approach than nested subqueries when dealing with multiple aggregate functions.
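For instance, aggregates can be computed once in a CTE and then joined, instead of being repeated inside nested subqueries (the schema here is hypothetical):

```sql
-- Aggregate once in a CTE, then join, instead of nesting subqueries
WITH SalesTotals AS (
    SELECT CustomerID,
           SUM(Amount) AS TotalAmount,
           COUNT(*)    AS OrderCount
    FROM dbo.FactSales
    GROUP BY CustomerID
)
SELECT c.CustomerName, s.TotalAmount, s.OrderCount
FROM dbo.DimCustomer AS c
JOIN SalesTotals     AS s ON s.CustomerID = c.CustomerID;
```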
In this article, you have seen how to architect a package for more efficient and reliable performance. It is also generally recommended to keep your package as simple as possible for better understandability.