The story happens to be about a PeopleSoft system, but it could have happened to any application. It starts with a pair of ERP systems. Changes in the HR system generate cases in the CRM system that the back office HR helpdesk processes. The HR system sends a message to the CRM system, where it may turn into one or more workflow messages. Some HR actions can give rise to many hundreds or even thousands of CRM workflows. For each Workflow, a run control record is created on a table that is keyed on the operator ID and a Run Control ID (an arbitrary ID to identify the control record).
The vanilla version of this code used a sequence number generated, in the usual PeopleSoft platform agnostic way, from a table. In extreme situations, a batch HR process can generate many thousands of workflows. These messages are processed by the subscription servers of an application server domain. This system is busy enough to have several of these domains, each with several subscription handler processes. So isn't long before contention on the row level lock taken during the sequence generation table became an issue. It was noticed that adding additional subscription handlers didn't increase throughput, and that database locking also impeded online users who would also execute this code.
So a developer changed the code and now the run control was a string derived by concatenating the operator ID (originally it used just the first three characters of the operator ID) and the current Time. In case this run control already existed (usually because a message generated several work flows) the code goes into a loop, generating a new run control from the operator and time each time. Adding more subscription handlers still doesn't increase message throughput - if anything it makes it worse.
Three more facts to bear in mind:
The situation for the message processing application server is even more severe. It can generate hundreds or even thousands of workflows in an evening. The subscription handlers execute the loop in the code until a time of day when they have not processed a message before.
On a day that I was on this site, I heard that a worker in the HR call centre had had nothing to do all day, and then at 5.30 in the evening, as she was about to go home, 15 new cases appeared on her list and she was not happy (call centre workers are measured on the number of cases they process and the length of their queues). Sure enough, we found that there were over 50000 records, 3600 records per hour for user PSAPPS continuously from 3am to 5.30pm. When we got to 86400 records processing of workflow from inbound messages would have stopped completely! We also found that some operators have generated over 40 messages in a given minute and so the save time of some of their operations is starting to be seriously affected.
The correct solution to all of this would be to have used an Oracle sequence. You get a unique number every time, and there is no contention on its allocation. In this application, there is no objection to any gaps in the sequence.
The original code suffered from database locking problems that effectively serialised run control processing. The code change worked well in test and initially worked fairly well in production. Although the message processing application servers could only produce a maximum of one workflow per second before they would start to execute the loop, which they could do over 100 times per second. Thus database locking was replaced with high CPU consumption in the application server as it looped, and additional SQL execution and logical I/O in the database leading to increased latch contention and also increasing database CPU consumption. As the problem became more severe, other parts of the application also suffered as users waited for their requests to be processed by an available process in the application server. It all amounts to a viscous and downward spiral.
Three more facts to bear in mind:
- The time is in the format HH24:MI:SS.
- The publish and subscribe application server processes all connect as the operator PSAPPS.
- The run control table is not purged.
Function CreateRunControl(...) Returns string
REM &run_cntl_id = String(GetNextNumberWithGaps( RB_WF_RTEKEYMAX.WF_SCHED_NBR_NEXT, 1999999999, 1));
/* workaround to avoid dead locks ... */
&run_cntl_id = %OperatorId String(%Time);
&chk_rc = "N";
SQLExec("SELECT 'Y' FROM PS_RB_RUN_CNTL_WF WHERE RUN_CNTL_ID = :1", &run_cntl_id, &chk_rc);
While &chk_rc = "Y"
&chk_rc = "N";
&run_cntl_id = %OperatorId String(%Time);
SQLExec("SELECT 'Y' FROM PS_RB_RUN_CNTL_WF WHERE RUN_CNTL_ID = :1", &run_cntl_id, &chk_rc);
End-While;
...
&return = &recRouteKeys.Insert(); /*Inserts run control*/
So yesterday when user FRED generated a workflow at 10.27.42 the run control is 'FRED10.27.42'. If today he generates another workflow at 10.27.42 the code will loop until 10.27.43 until it can generate run control that has not been used. In fact, Fred can never generate a run control at 10.27.42 ever again. Also, if a single operation generates 3 run controls then it will take at least 2 extra seconds to allocate the second and third run control.
The situation for the message processing application server is even more severe. It can generate hundreds or even thousands of workflows in an evening. The subscription handlers execute the loop in the code until a time of day when they have not processed a message before.
On a day that I was on this site, I heard that a worker in the HR call centre had had nothing to do all day, and then at 5.30 in the evening, as she was about to go home, 15 new cases appeared on her list and she was not happy (call centre workers are measured on the number of cases they process and the length of their queues). Sure enough, we found that there were over 50000 records, 3600 records per hour for user PSAPPS continuously from 3am to 5.30pm. When we got to 86400 records processing of workflow from inbound messages would have stopped completely! We also found that some operators have generated over 40 messages in a given minute and so the save time of some of their operations is starting to be seriously affected.
The correct solution to all of this would be to have used an Oracle sequence. You get a unique number every time, and there is no contention on its allocation. In this application, there is no objection to any gaps in the sequence.
The original code suffered from database locking problems that effectively serialised run control processing. The code change worked well in test and initially worked fairly well in production. Although the message processing application servers could only produce a maximum of one workflow per second before they would start to execute the loop, which they could do over 100 times per second. Thus database locking was replaced with high CPU consumption in the application server as it looped, and additional SQL execution and logical I/O in the database leading to increased latch contention and also increasing database CPU consumption. As the problem became more severe, other parts of the application also suffered as users waited for their requests to be processed by an available process in the application server. It all amounts to a viscous and downward spiral.
Conclusions
- Oracle recognised the problems of sequence generation, so they created Sequences. You've paid good money for them, so use them.
- While date and time functions can appear to produce unique values, you need to think carefully about what will happen in a busy multi-user system.
2 comments :
I can see the main point is being instructive about using Oracle sequences. But couldn't the problem also be solved by including date in the run control id? PS. That while loop is pounding really hard on the database, it should have contained at least a second delay between iterations.
The interim solution was to do exactly that. We added the date or the run control ID. However, this limits the loop to one insertion per second.
Post a Comment