Tuesday, August 7, 2012

Datastage - Sequence with Retries

Have you ever come across a scenario in which you need to build a datastage sequence that is smart enough to start calling the actual jobs only if all the required data exists? Then you need create a sequence that can poll the source and retry until it gets the signal that the data is ready or it reaches the retry attempts you configured.

Very frequently, a data warehouse sources its data from another operational data store (ODS), which is usually a database in which data from various other production databases/data sources are replicated to or pushed to on a regular basis (but usually not real time data syncing).

Suppose you have a nightly data warehouse ETL job that processes yesterday's data. You have a dimension table which sources data from 2 tables that originally came from 2 different databases. The 2 databases are replicated to your company's ODS, but for database A, the data is pushed to the ODS around 7 pm, while the other is pushed to the ODS around the same time but could be a couple of minutes behind.  There is also a very likely chance that replication to the ODS broke down. You want to call your datawarehouse ETL job only if both tables have the complete data for  yesterday's date.

In order to solve this issue, I built a datastage sequence with a retry attempt built in. The generic sequence flow is this. Create a loop that takes in the number of retries as a variable. Then inside the loop, you call a job that polls the ODS to see if your data is ready. Perhaps you can write a sql to check the last row_create_data in your source tables, or do a rowcount check and write the output to a textfile. Every environment is different so you'll have to come up with your own strategy as regards to what signals your data is ready. You then create a job to read the textfile output of the job above, and use a Nested Condition stage to decide if it meets your data completion criteria. If so, exit out of the loop and move on to your main ETL job, if not, stay inside the loop and call a simple routine to sleep for a certain amount of time (also use a variable), then continue the loop.

To create the sleep routine:
Create a Server Routine, select Transform Function. It should take in 1 argument, which is the number of seconds it should sleep for. In code, just type:

vSeconds = sleep_seconds
vArray = 0

sleep vSeconds

Then inside your loop, use a Routine Activity stage, and call the above routine. And pass in a job parameter for the sleep_seconds as the argument to this routine.

Below is the image of how the sequence should be designed:


  1. What is your loop definition ??

  2. Assume that your sequence is running overnight? What is the resource usage of this sequence when it is continuously polling for the data? Is it going to block any compute nodes for this process?