Monday, November 11, 2013

Datastage stop and reset long hanging jobs

We run a lot of loading jobs from our source SQL Server databases into Netezza using Datastage. These are simple table-to-table loads with no data transformation, and they run every hour.

Every now and then, some of these jobs hang without aborting and stay in a perpetual running state until one of us manually stops the job and recompiles it. After that, the next hourly scheduled run kicks off successfully.

I wrote a little shell script that checks for Datastage jobs that have been running longer than a given number of minutes. If a job is on the "okay to stop and reset" list (stored in a text file), the script stops it and resets it using dsjob commands.

#!/bin/bash

PROG=$(basename "${0}")
EXIT_STATUS=1

if [ ${#} -ne 2 ]; then
   echo "${PROG}    : Invalid parameter list. The script needs 2 parameters:"
   echo "Param 1    : DS Project Name "
   echo "Param 2    : MinutesBeforeReset"

   NOW=$(date '+%Y-%m-%d %H:%M:%S')
   echo "${NOW} ${PROG} Exiting without completion with status [${EXIT_STATUS}]"
   exit ${EXIT_STATUS}
fi

ProjectName=${1}
MaxMinutesBeforeReset=${2}

#go to the DSEngine bin directory, e.g. /opt/IBM/InformationServer/Server/DSEngine/bin
BinFileDirectory=$(cat /.dshome)/bin
cd "${BinFileDirectory}" || exit ${EXIT_STATUS}

#Get current epoch time to use as a baseline
CurrentTimeEpoch=$(date +%s)

#check for currently running Datastage jobs (job name is the 13th field of the ps output)
ps aux | grep 'DSD.RUN' | tr -s ' ' | cut -d" " -f13 | tr -d '\.' | while read JobName;
do
   #check if it is in the jobs to monitor & reset file, if not skip it
   if grep -Fxq "$JobName" /home/myfolder/JobsToMonitorAndReset.txt
   then
     #Get start time, which is on the 3rd row after the colon
     StartTime=$(./dsjob -jobinfo "$ProjectName" "$JobName" | sed -n 3p | grep -o ": .*" | grep -o " .*")
     StartTimeEpoch=$(date -d "$StartTime" +%s)
     DifferenceInMinutes=$(((CurrentTimeEpoch - StartTimeEpoch)/60))
     echo "$JobName has been running for $DifferenceInMinutes minutes"

     #if it has been running for x (2nd argument) minutes or more, stop and reset job
     if [ "$DifferenceInMinutes" -ge "$MaxMinutesBeforeReset" ];
     then
       echo "$JobName will be stopped and reset."
       ./dsjob -stop "$ProjectName" "$JobName"
       ./dsjob -run -mode RESET -wait -jobstatus "$ProjectName" "$JobName"
       #this exit ends the loop (it runs in a subshell), so at most one job is reset per run
       exit 0
     fi
   fi
done

exit 0
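
The start-time parsing is the fiddliest part of the script, so here is a small standalone sketch of that step that runs without Datastage. The sample text is only my assumption of what `dsjob -jobinfo` prints (check it against your own output), the function names `start_time_from_jobinfo` and `minutes_since` are made up for illustration, and `date -d` needs GNU date.

```shell
#!/bin/bash

# Sample text standing in for `dsjob -jobinfo` output. The layout is assumed;
# the only thing the script relies on is "start time on the 3rd row, after the colon".
sample_jobinfo='Job Status      : RUNNING (0)
Job Controller  : not available
Job Start Time  : Mon Nov 11 10:00:00 2013'

# Print whatever follows the first colon on line 3 of stdin.
start_time_from_jobinfo() {
    sed -n '3s/^[^:]*: *//p'
}

# Whole minutes between a start-time string ($1) and a "now" epoch ($2).
minutes_since() {
    local start_epoch
    start_epoch=$(date -d "$1" +%s) || return 1
    echo $(( ($2 - start_epoch) / 60 ))
}

start_time=$(printf '%s\n' "$sample_jobinfo" | start_time_from_jobinfo)
now_epoch=$(date -d 'Mon Nov 11 11:30:00 2013' +%s)

echo "Started at: $start_time"
echo "Running for $(minutes_since "$start_time" "$now_epoch") minutes"
```

Using a single sed expression instead of the two chained greps also drops the leading space in front of the timestamp, although `date -d` copes with it either way.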

If you want to monitor only specific jobs, just add the Datastage job names to your file, one per line. I stored mine in /home/myfolder/JobsToMonitorAndReset.txt.
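
The list is matched with `grep -Fxq`, which means a job name has to match a whole line exactly; partial names do not count. A minimal sketch of that check, using a temporary file and made-up job names (`Load_Customers`, `Load_Orders`) and a hypothetical helper called `is_monitored`:

```shell
#!/bin/bash

# Hypothetical helper: succeeds only if the job name ($1) appears
# as a complete line in the list file ($2).
is_monitored() {
    grep -Fxq -- "$1" "$2"
}

# Example list file with made-up job names, one per line.
list_file=$(mktemp)
printf '%s\n' 'Load_Customers' 'Load_Orders' > "$list_file"

if is_monitored 'Load_Orders' "$list_file"; then
    echo 'Load_Orders is on the list'
fi

if ! is_monitored 'Load_Order' "$list_file"; then
    echo 'Load_Order is not on the list (partial names do not match)'
fi

rm -f "$list_file"
```

The `-F` flag treats the name as a literal string rather than a regular expression, so dots or other special characters in a job name are safe.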

You can also send yourself an email listing the jobs that were stopped and reset.
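
One way to do the email part, as a rough sketch rather than what I actually run: collect the names of the jobs that were reset, build a short report from them, and pipe it to a mail command. The helper `build_reset_report`, the project name `MyProject`, the job names, and the address are all placeholders, and the mail step assumes `mailx` is installed, so it is left as a comment.

```shell
#!/bin/bash

# Hypothetical helper: reads job names (one per line) from stdin and prints
# a short report for the project given in $1. Prints nothing if no jobs were reset.
build_reset_report() {
    local project=$1
    local jobs count
    jobs=$(grep -v '^$' || true)          # read stdin, drop blank lines
    [ -n "$jobs" ] || return 0
    count=$(printf '%s\n' "$jobs" | grep -c '')
    printf 'Project %s: %d job(s) stopped and reset\n' "$project" "$count"
    printf '%s\n' "$jobs" | sed 's/^/  - /'
}

# Example with made-up job names.
printf '%s\n' 'Load_Customers' 'Load_Orders' | build_reset_report 'MyProject'

# To send it (assumes mailx is available; the address is a placeholder):
# build_reset_report 'MyProject' < /tmp/reset_jobs.txt | mailx -s 'Datastage jobs reset' you@example.com
```

In the main script this would mean appending each `$JobName` to a file right after the reset command, then calling the helper once after the loop finishes.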
