Monitoring Stream - Stuck and Continuous

The following applies to the November 2018 release and later.

The monitoring stream performs two checks: a continuous (periodic) execution check and a check for stuck workflows, that is, workflows that take longer to execute than expected. The parameters for these checks are set in the *.spf profiles described below.

SPF Profiles

Example of *.spf profile contents:

…
unit|workflowCheck|workflow|eagle_dkm_source_collector|type|checkPeriodicWorkflow|timeperiod|300|error_prefix|XXX3 Periodic Alert|notificationemail|email_example@eagleinvsys.com|patterns|ON SEND ALERTS
unit|workflowCheck|workflow|eagle_dkm_source_launcher|type|checkPeriodicWorkflow|timeperiod|300|error_prefix|XXX3 Periodic Alert|notificationemail|email_example@eagleinvsys.com|patterns|ON SEND ALERTS
unit|workflowCheck|workflow|eagle_dkm_source_launcher|type|checkStuckWorkflow|timethreshold|600|error_prefix|XXX3 Stuck Workflow Alert|notificationemail|email_example@eagleinvsys.com|patterns|ON SEND ALERTS
unit|workflowCheck|workflow|eagle_dkm_source_collector|type|checkStuckWorkflow|timethreshold|600|error_prefix|XXX3 Stuck Workflow Alert|notificationemail|email_example@eagleinvsys.com|patterns|ON SEND ALERTS
…

where

  • unit is a special parameter that acts as a pointer to the type of include used when new alerts are processed;

  • workflow is the workflow name;

  • type sets the type of processing (checkPeriodicWorkflow or checkStuckWorkflow in the example above);

  • timeperiod is the time period in seconds;

  • timethreshold is the processing time limit in seconds;

  • error_prefix sets the default alert type for this workflow;

  • notificationemail is the default email address for notifications;

  • patterns switches the sending of alerts on or off.
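
The record layout above reads as alternating key|value tokens. As a minimal sketch of that reading (the function name parse_spf_line is hypothetical, and treating "ON SEND ALERTS" as the whole value of patterns is an assumption), one record could be turned into a dictionary like this:

```python
def parse_spf_line(line: str) -> dict:
    """Split one profile record into a {key: value} dict.

    Assumes the record is a flat sequence of key|value pairs,
    as in the example profile; this is not the stream's own parser.
    """
    tokens = [token.strip() for token in line.strip().split("|")]
    if len(tokens) % 2 != 0:
        raise ValueError(f"expected key|value pairs, got {len(tokens)} tokens")
    # Even positions are keys, odd positions are their values.
    return dict(zip(tokens[0::2], tokens[1::2]))


record = parse_spf_line(
    "unit|workflowCheck|workflow|eagle_dkm_source_launcher|"
    "type|checkStuckWorkflow|timethreshold|600|"
    "error_prefix|XXX3 Stuck Workflow Alert|"
    "notificationemail|email_example@eagleinvsys.com|patterns|ON SEND ALERTS"
)
print(record["workflow"], record["type"], int(record["timethreshold"]))
```

Numeric values such as timeperiod and timethreshold arrive as strings and need an explicit int() conversion before they are compared with elapsed seconds.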

Processing Details

Query

read_workflow_profiles.inc executes the following SQL query:

select req_def.INSTANCE,
       req_def.CORRELATION_ID,
       case
         when INSTR(queue_def.ProcessCorrelationId, 'DATA') > 0
           then substr(queue_def.ProcessCorrelationId, 1, INSTR(queue_def.ProcessCorrelationId, 'DATA'))
         else queue_def.ProcessCorrelationId
       end as PROCESS_CORRELATION_ID,
       req_def.BUS_TASK_ID,
       req_def.PROC_STATUS,
       to_char(req_def.UPDATE_DATE, 'YYYYMMDD HH24MISS') UPDATE_DATE,
       queue_def.SCHED_QUEUE_INSTANCE,
       queue_def.orch_state_clob ORCH_STATE,
       to_char(queue_def.CREATE_DATE, 'YYYYMMDD HH24MISS') CREATE_DATE
from PACE_MASTERDBO.ORCH_REQUEST_DEF req_def,
     (select q.SCHED_QUEUE_INSTANCE,
             q.ORCH_REQ_DEF_INSTANCE,
             q.orch_state_clob,
             q.CREATE_DATE,
             case
               when INSTR(q.orch_state_clob, ':ProcessCorrelationId:') > 0
                 then to_char(substr(q.orch_state_clob,
                                     INSTR(q.orch_state_clob, ':ProcessCorrelationId:') + 22,
                                     INSTR(q.orch_state_clob, ':', INSTR(q.orch_state_clob, ':ProcessCorrelationId:') + 22)
                                       - INSTR(q.orch_state_clob, ':ProcessCorrelationId:') - 22))
               else ''
             end as ProcessCorrelationId
      from PACE_MASTERDBO.ORCH_queue q
      where q.CREATE_DATE >= trunc(sysdate)) queue_def
where req_def.instance = queue_def.orch_req_def_instance
  and req_def.correlation_id = queue_def.ProcessCorrelationId
  and req_def.instance not in (select orch_instance
                               from PACE_MASTERDBO.ORCH_REQUEST_PARAMS
                               where PARAMETER_NAME in ('sn_stuck_workflow')
                                 and update_date >= trunc(sysdate))
order by req_def.INSTANCE desc

The query returns all the information needed about workflows launched since the start of the current day, except for those that have already been processed.

The result set is written to a file. The include then checks the existing *.spf profiles in the ml2-0_cm_profiles folder.
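
The INSTR/substr arithmetic in the query can be hard to follow, so here is a small Python sketch of the same string logic. The function names and the sample state string are hypothetical; only the marker text, the 22-character offset, and the 'DATA' cut come from the query itself.

```python
MARKER = ":ProcessCorrelationId:"  # 22 characters, hence the +22 offsets in the SQL


def extract_process_correlation_id(orch_state: str) -> str:
    """Mirror the inner CASE: the text between the marker and the next ':'."""
    start = orch_state.find(MARKER)
    if start < 0:
        return ""  # marker absent: the SQL takes the ELSE '' branch
    start += len(MARKER)
    end = orch_state.find(":", start)
    if end < 0:
        return ""  # no closing ':': Oracle substr with a negative length yields NULL
    return orch_state[start:end]


def trim_at_data(correlation_id: str) -> str:
    """Mirror the outer CASE: substr(value, 1, INSTR(value, 'DATA')).

    As written in the SQL, this keeps everything up to and including
    the first character of 'DATA'.
    """
    pos = correlation_id.find("DATA")
    if pos < 0:
        return correlation_id
    return correlation_id[: pos + 1]


# Hypothetical state string; the real orch_state_clob layout is not shown in the text.
sample_state = "x:ProcessCorrelationId:84040C2ADCC6D46:y"
print(extract_process_correlation_id(sample_state))
print(trim_at_data("ABC123DATA42"))
```

Note that the join condition in the query compares CORRELATION_ID with the untrimmed value; the 'DATA' cut is applied only to the PROCESS_CORRELATION_ID output column.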

Alerts Processing

Alert processing uses methods from the time utils include. The query results file is searched for every workflow in the profile. For periodic alerts, the most recent run of the workflow named in the profile is checked: if the difference between the current time and its update_date is greater than the timeperiod parameter, the workflow is written to the current event array. For stuck workflow alerts, all workflows that are still processing are checked: if the processing time of a workflow is greater than timethreshold, it is also written to the current event array.
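
The two checks described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the include's actual code: the row keys (workflow, update_date, create_date, processing) are hypothetical, and measuring processing time from create_date is an assumption, since the text does not say which timestamp is used.

```python
from datetime import datetime

DATE_FORMAT = "%Y%m%d %H%M%S"  # Python equivalent of the query's 'YYYYMMDD HH24MISS'


def seconds_since(timestamp: str, now: datetime) -> float:
    """Seconds elapsed between a query timestamp string and `now`."""
    return (now - datetime.strptime(timestamp, DATE_FORMAT)).total_seconds()


def check_periodic(rows, workflow, timeperiod, now):
    """Return the latest row for `workflow` if it is older than `timeperiod` seconds."""
    matching = [row for row in rows if row["workflow"] == workflow]
    if not matching:
        return None
    # Timestamps in this format sort chronologically as plain strings.
    last = max(matching, key=lambda row: row["update_date"])
    return last if seconds_since(last["update_date"], now) > timeperiod else None


def check_stuck(rows, workflow, timethreshold, now):
    """Return rows for `workflow` still processing for more than `timethreshold` seconds."""
    return [
        row for row in rows
        if row["workflow"] == workflow
        and row["processing"]  # assumption: derived from PROC_STATUS somehow
        and seconds_since(row["create_date"], now) > timethreshold
    ]


now = datetime(2018, 10, 9, 9, 30, 0)
rows = [
    {"workflow": "eagle_dkm_source_launcher", "update_date": "20181009 091845",
     "create_date": "20181009 091500", "processing": True},
    {"workflow": "eagle_dkm_source_launcher", "update_date": "20181009 091000",
     "create_date": "20181009 090900", "processing": False},
]
current_events = []
late = check_periodic(rows, "eagle_dkm_source_launcher", 300, now)
if late is not None:
    current_events.append(("periodic", late))
current_events.extend(
    ("stuck", row) for row in check_stuck(rows, "eagle_dkm_source_launcher", 600, now)
)
print(len(current_events))
```

Passing `now` explicitly keeps the checks deterministic, which makes them easy to test against fixed timestamps.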

TSR Generation

Every line in the current event array is parsed, and create_tsr.inc creates a TSR with the necessary information. The type of the TSR message is chosen from the type recorded in the current event line.

<EagleML xmlns="http://www.eagleinvsys.com/2011/EagleML-2-0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="TaskStatusResponse" eaglemlVersion="2-0" xsi:schemaLocation="http://www.eagleinvsys.com/2011/EagleML-2-0 eagleml-main-2-0.xsd" eaglemlType="TaskStatusResponse">
  <header>
    <messageId>B810J250HIKUOGBG</messageId>
    <sentBy>monitoring</sentBy>
    <sendTo>eas_distribution</sendTo>
    <creationTimestamp></creationTimestamp>
  </header>
  <statusItem>
    <taskIdentifier>
      <correlationId>84040C2ADCC6D46</correlationId>
      <businessTaskId>B810J250HCFFDWIC</businessTaskId>
    </taskIdentifier>
    <status>REPORTING</status>
    <severityCode>1</severityCode>
    <reason>
      <reasonTypeEnum>INFO</reasonTypeEnum>
      <reasonCode>1</reasonCode>
      <description>================================== PPFT3 PERIODIC ALERT: Workflow eagle_dkm_source_launcher didn't launch in time period 300 seconds. Last launch was in 20181009 091845 CorrId 84040C2ADCC6D46 AlertNotificationEmail:vmironov@eagleinvsys.com</description>
      <reasonTag>PPFT3 PERIODIC ALERT</reasonTag>
    </reason>
  </statusItem>
</EagleML>

TSR Parsing Within eas_distribution

Incoming TSRs are parsed by tsr_to_w_state.inc.

This include applies an XSLT transformation to the incoming TSR to extract the correlation ID and, if the description contains AlertNotificationEmail, the email address from the error description. Using the correlation ID, get_w_state.inc then retrieves the AlertNotificationEmail task parameter (this parameter is unique!). If the parameter exists, it is used as the email address; if it does not exist, the include uses the address from the description or the default email.
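
The order of precedence described above (task parameter first, then the address in the description, then the default) can be sketched as follows. The function name, the regular expression, and the sample addresses are illustrative assumptions; the actual include works through XSLT and get_w_state.inc.

```python
import re

# Assumption: the address follows "AlertNotificationEmail:" and runs to the next whitespace.
EMAIL_IN_DESCRIPTION = re.compile(r"AlertNotificationEmail:(\S+)")


def resolve_email(description: str, task_parameter, default_email: str) -> str:
    """Pick the notification address in the order the text describes."""
    if task_parameter:
        return task_parameter  # the AlertNotificationEmail task parameter wins
    match = EMAIL_IN_DESCRIPTION.search(description)
    if match:
        return match.group(1)  # address embedded in the error description
    return default_email


description = (
    "PPFT3 PERIODIC ALERT: Workflow eagle_dkm_source_launcher didn't launch "
    "in time period 300 seconds. CorrId 84040C2ADCC6D46 "
    "AlertNotificationEmail:email_example@eagleinvsys.com"
)
print(resolve_email(description, None, "default@example.com"))
```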

The result of this stage is a new email address (if the TSR was for one of the new alert types).

XSLT Translation Patterns

Currently, there are four patterns (the pattern to use is determined by keywords in the error description):

1). PPFT3 PERIODIC ALERT

This pattern builds an email containing the error description from the TSR.

2). PPFT3 STUCK WORKFLOW ALERT

This pattern builds an email containing the error description from the TSR.

3). PPFT3 ERROR

This pattern builds an email containing all errors from all tasks that can be extracted from the incoming TSR.

4). PPFT3 WARNING

This pattern produces a blank email subject; eas_distribution does not send emails with a blank subject, so no email is sent.
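
As a rough illustration of keyword-based pattern selection, the sketch below picks the first listed keyword found in the error description. The function names are hypothetical, case-sensitive matching is an assumption, and the text does not say what happens when no keyword matches, so the sketch simply returns None in that case.

```python
PATTERN_KEYWORDS = (
    "PPFT3 PERIODIC ALERT",
    "PPFT3 STUCK WORKFLOW ALERT",
    "PPFT3 ERROR",
    "PPFT3 WARNING",
)


def select_pattern(description: str):
    """Return the first listed keyword found in the description, or None."""
    for keyword in PATTERN_KEYWORDS:
        if keyword in description:
            return keyword
    return None  # behavior for unmatched descriptions is not specified in the text


def is_suppressed(pattern) -> bool:
    """The WARNING pattern yields a blank subject, so its email is never sent."""
    return pattern == "PPFT3 WARNING"


pattern = select_pattern(
    "PPFT3 PERIODIC ALERT: Workflow eagle_dkm_source_launcher didn't launch "
    "in time period 300 seconds."
)
print(pattern, is_suppressed(pattern))
```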