Sitting at work the other day going through some old packages that are still in production. We are in the middle of upgrading from 2008 to 2016 so it gives us a great opportunity to review any package that may be a slow running package or one that gives constant failures. Many of the packages are on there 2nd or 3rd upgrade and many still have VB script in script tasks performing simple tasks like setting dates for files. This has led to a discussion among us about the best way to go about fixing the issues, since many of the packages need to have the VB removed or changed to C# to work efficiently with catalogs in SQL Server 2016.
After some discussing and proving why my point was correct once again lol, we figured that wherever possible we are going to remove any script task and replace those with expressions in variables. Why replace something that works, some may ask. The answer is a simple one with a more complex logic behind it. Variables are almost free, yes, I know that a script task has no monetary value to it either but in terms of overhead when executing the package compared to a script task.
When a SSIS package is executed it validates connection strings, data flows and any other task that is within the package. It also gets the variables ready, so if you have a date variable or a collation for a file name or any of the other 1000s uses for expressions, those also get ready when first executing the package. Now that does not seem like a very good argument that we want the variables to be first and that’s why we should get rid of the script tasks.
Let’s look at this with script tasks used to set the dates and collate the file names. In order to make those dates usable or to have the file path in a connection string what needs to happen. That piece of the puzzle needs to be converted to yes, a variable. So now the variable has been set when the package first executed and now it is being set again with the script task. Why do something twice why not use an expression set the variable once and be done with it.
Now some of you might be saying that the script task is only adding a second here or there what’s the big deal. Well simple math shows a different story, we have about 150 packages all running on the same server throughout the day, many of those are even using the same databases and sometimes even the same tables. If out of those 150 packages 75 of them use script tasks, but some are using 2 to 3 script tasks to set multiple variables, so we are back up to the 150 possibly even higher for the number of script tasks being executed.
Now if each one takes one second, we are possibly losing a little more than 2 minutes every day because of the script tasks. Yes, I know, in the grand scheme 2 minutes is not very much time but when looking at it from 150 packages many of them running every 15 to 20 minutes that starts adding up more time. So, let’s say of the 75 packages that contain a script task, 30 of them need to run every 20 minutes and half of those have more than one script task. So now we are at 45 seconds every 20 minutes so 45(number of script tasks) * (60/20)(3 how many times the packages execute every hour) * 24(hours in a day) = 3240 seconds every day or 54 minutes.
The 2 minutes has now increased to 54 minutes throughout the day. If the rest of the package all ran as originally designed or the database is properly indexed, then the 54 minutes from those packages may never be an issue. As we all know, databases do what they are designed for and that is to collect data. Now as more data is collected the speed to read and write to those tables now is getting slower. A package that used to take 2 minutes to run now takes 5 and the next one and the next one, eventually you have multiple packages performing large amounts of ETL processes and they just stop. You check the logs, no errors are showing, packages are showing that they executed or are running. That’s when it hits you, packages are still running, some that should never run at the same time are now running at the same time and the data is not in the tables, the business side is starting to call and send emails because they can’t get their reports and why. Because you wanted a script task to set the date for you.
Now I know there are several other factors in the scenario that would contribute to the failure and many are enough to cover pages of discussions and arguments. The one thing to be always thinking when designing, building or repairing packages is to optimize performance. One way to optimize performance is to save the 1 to 2 seconds that the script task is using up.
Besides performance there is an excellent reason not to use script tasks for things such as dates and filenames. Think of the next person that needs to rerun the package to fix a data issue. If the date is set in a script task then the code in the task needs to be changed, run the package to fix the data, change the script back to what it was and hopefully not messing anything up. If the date was set in a variable and maybe use two for the date then one variable can be easily found, changed and changed back to fix the data. When I build packages, I prefer to build any date as flexible as possible, I use multiple variables to get to one date, this allows me to change one variable to number of days back and execute the package to recover lost data.
I always build my packages using a back-date loop, this allows me to go back and fix data in an easy and smooth way. The loop provides three parts, first a Boolean variable that if needing to back date it is set to true, next start and stop variables to set the dates that need to be fixed and finally a loop that executes the package and advances the day. I plan to eventually do a tutorial on how to set up this easy but very useful task that I use all the time in new builds and every repair that I must make that uses dates.
If you only have a few packages or they run very little, then the time constraint is not a big deal and for most people this is the case. If looking for maximum optimization and ease of data repair later then please please please stop using script tasks for something that can be done in an expression. You may want to show off you C# or VB skills to others but save those for bigger projects like pulling Active Directory information.