"The 18000" Issue

This hang can occur only in EagleML versions released between Sep 2017 and Oct 2018 (it was fixed in the Nov 2018 version). The EagleML version can be found in the first line of \tpe\dynamic\msgcenter\eagle_ml-2-0_cm\w_config.inc

How to tell you are seeing this issue:

  1. There are long-running tasks (> 20 min) for the parallel_exec stream. These can be seen in the Tasks screen of MCC with default parameters, or in the File Statistics screen.
    AND
  2. MC was restarted or crashed after the time shown as "start processing" for these files.

You can tell when a restart happened by looking at the log files, because MC starts a new set of logs after a restart. If you see more than 4 MC log files for the same hour, that is a red flag. Check the timestamps inside the logs to get the exact restart time.

Where does 18000 come from?

  • The value is the hard-coded timeout in the EagleML include file tpe\dynamic\msgcenter\eagle_ml-2-0_cm\waiting_processor.inc that the parallel_exec stream uses when checking whether a file has been loaded.
  • It can be overridden in w_config_custom with the W_REQ_CNT_LOOP_MAX parameter (but this is too late for a task that is already processing).

Solution

How to fix the issue (how to recover the stuck files that are waiting out those 18000 seconds, i.e. 5 hours)

No MC restart will be required.

The Message Center console shows which files are stuck.

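The plan is:

1. Assign the file to the stream where the running process is (erroneously) looking for it.

You will need the file name and the stream name from the line that stays in progress (green circular arrow). In the statement below, the file name contains AK_PE_108 and the file currently belongs to the eagle_default_in_csv_all stream.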

update msgcenter_dbo.msg_message_stat
   set stream_id = (select msg_stream_id from msgcenter_dbo.msg_streams where msg_stream_title = 'eagle_ml-2-0_default_cm_acquire_data')
 where file_name like '%AK_PE_108%'
   and stream_id = (select msg_stream_id from msgcenter_dbo.msg_streams where msg_stream_title = 'eagle_default_in_csv_all');
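Before running the update, it may help to preview which records the LIKE pattern matches, since a pattern such as '%AK_PE_108%' could match more files than intended. The query below is a minimal sketch that uses only the tables and columns that already appear in the update statement; the file name pattern and stream title are the same example values and should be replaced with the ones from your stuck line.

```sql
-- Preview the records that the step 1 update would reassign.
-- Tables and columns are the ones used in the update statement;
-- the pattern and stream title are example values.
select s.file_name,
       s.stream_id,
       t.msg_stream_title
  from msgcenter_dbo.msg_message_stat s
  join msgcenter_dbo.msg_streams t
    on t.msg_stream_id = s.stream_id
 where s.file_name like '%AK_PE_108%'
   and t.msg_stream_title = 'eagle_default_in_csv_all';
```

If the result contains records other than the stuck file, narrow the pattern before running the update.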


2. Wait until the waiting process stops polling and marks itself as complete.
The statement in step 1 changes the stream name of the topmost (failed) record. The *parallel_exec stream is in progress only because it does not “see” that completed record.

The picture will turn into something like this:

Notice that the stream name of the failed record has changed, so the *parallel_exec stream can now finish its polling.

3. Assign the file back to the correct stream.

This is the same statement with the stream names swapped:

update msgcenter_dbo.msg_message_stat
   set stream_id = (select msg_stream_id from msgcenter_dbo.msg_streams where msg_stream_title = 'eagle_default_in_csv_all')
 where file_name like '%AK_PE_108%'
   and stream_id = (select msg_stream_id from msgcenter_dbo.msg_streams where msg_stream_title = 'eagle_ml-2-0_default_cm_acquire_data');


The result is now the same as it would be without the “18000 issue”:

In our case we will have to repeat these three steps for all the stuck files.

It is OK to run the update statement from step 1 for all the stuck files, wait for all the waiting processes to complete, and then recover all the files using the statement from step 3.
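When several files are stuck, step 1 can be applied to all of them in one statement. The following is only a sketch of that idea: '%AK_PE_109%' and '%AK_PE_110%' are hypothetical patterns added for illustration, and the stream titles are the example values used above.

```sql
-- Step 1 for several stuck files at once.
-- '%AK_PE_109%' and '%AK_PE_110%' are hypothetical patterns;
-- substitute the names of the files that are actually stuck.
update msgcenter_dbo.msg_message_stat
   set stream_id = (select msg_stream_id
                      from msgcenter_dbo.msg_streams
                     where msg_stream_title = 'eagle_ml-2-0_default_cm_acquire_data')
 where (file_name like '%AK_PE_108%'
     or file_name like '%AK_PE_109%'
     or file_name like '%AK_PE_110%')
   and stream_id = (select msg_stream_id
                      from msgcenter_dbo.msg_streams
                     where msg_stream_title = 'eagle_default_in_csv_all');
```

Once all the waiting processes have completed, the same list of patterns can be reused for step 3 with the two stream titles swapped. Keep the patterns narrow, so that the step 3 statement does not also move records that genuinely belong to the acquire_data stream.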