Monday, February 24, 2025

Quarantined SQL Plans for PS/Queries

This follows on from my previous post, Management of Long Running PS/Queries Cancelled by Resource Manager.

From 19c, on Engineered Systems only (such as Exadata and Exadata Cloud Service) the 'Oracle Database automatically quarantines the plans for SQL statements terminated by … the Resource Manager for exceeding resource limits.

The Resource Manager can set a maximum estimated execution time for a SQL statement, for example, 20 minutes. If a statement execution exceeds this limit, then the Resource Manager terminates the statement. However, the statement may run repeatedly before being terminated, wasting 20 minutes of resources each time it is executed.

Starting in Oracle Database 19c, if a statement exceeds the specified resource limit, then the Resource Manager terminates the execution and “quarantines” the plan. To quarantine the plan means to put it on a blacklist of plans that the database will not execute. Note that the plan is quarantined, not the statement itself.'

[Oracle SQL Tuning Guide: 4.7 About Quarantined SQL Plans]

When an attempt is made to execute a quarantined execution plan an error is produced: ORA-56955: quarantined plan used.

Oracle does not log timed-out or quarantined queries.  On V$SQL and V$SQLSTAT, AVOIDED_EXECUTIONS records the number of times a SQL query has been prevented from running.  However, this will not stay in the library cache long on a PeopleSoft system, due to the continual parse of dynamically generated SQL statements.  As of Oracle 19.20, it is not recorded in AWR on DBA_HIST_SQLSTAT.  

If an error condition occurs during a PSQUERY process run on the process scheduler, the process terminates with an error.  The SQL statement and the error message are recorded in the Application Engine message log.  As demonstrated in the previous blog, we can detect such failures by inspecting the message log of the PSQUERY process that did not finish successfully (ie. it has an end time, but the run status does not indicate success).

Matching Quarantine Directives to Cancelled Queries

Quarantine directives are visible via DBA_SQL_QUARANTINE, including SQL text and execution plan hash value.

It would be useful to know which quarantine directive relates to which query.  However, it is not easy to match the SQL in the PeopleSoft message log entries with that in the quarantine entries. The SQL text in the message log can have multiple spaces. These are stripped out in the DBA_SQL_QUARANTINE view where the normalised SQL statement is visible.  

The timestamp of creation and last execution of the quarantine directive is stored on it, but matching these to when the query was running can result in false positives.

Also, you cannot tell which quarantine directive was created by which consumer group.  The maximum CPU timeout is recorded on DBA_SQL_QUARANTINE.  In my example, it is only possible to distinguish the originating consumer group because the two consumer groups happen to have different timeouts.

A method that matches exactly, but only returns partial rows is to:

  • Obtain ASH data for queries terminated by the resource manager.  It can be matched by timestamp, MODULE, and ACTION (provided that EnableAEMonitoring is enabled).
    • Profile the ASH to find the statement that took the longest during each PSQUERY process, and that is almost certain to be the SQL query.  Thus obtaining the SQL_ID, SQL Plan Hash Value and consumer group ID.  It is also possible to determine the total database time for the query, and the database time spent on CPU.
    • The consumer group name can then be obtained from DBA_HIST_RSRC_CONSUMER_GROUP
  • Obtain the SQL text for the long-running query.  It would also have to be captured by an AWR snapshot.  This does often occur because it was a long-running SQL, but it is not certain.
  • The signature for the SQL statement (not the force-matching signature) can be derived using the SQLTEXT_TO_SIGNATURE function in DBMS_SQLTUNE.  This can be matched to the signature recorded in DBA_SQL_QUARANTINE.
  • You can have multiple quarantine directives for the same signature (i.e. the same SQL statement), each with a different plan hash value.  
    • NB: The plan hash value on DBA_SQL_QUARANTINE is the adaptive plan hash value (with all of its possible plan alternatives), and therefore it matches SQL_FULL_PLAN_HASH_VALUE in the ASH data, and not SQL_PLAN_HASH_VALUE (the plan that actually executed).
Note that
  • When a query executes until timed-out, producing ORA-00040, you usually can find the SQL statement in the AWR repository and so generate the signature to exactly match the quarantine record.
  • When an attempt is made to run a quarantined statement and execution plan, you usually cannot find the SQL statement because it hasn't run for long enough to produce an ASH sample.  Even when it has, you also have to rely on the statement having been captured previously by AWR.  Those conditions only come together occasionally.
This matching process is done by this query: message_log_checker-psquery2.sql.  Below is a sample output.  
  • We can see the quarantine directives that were created when the resource manager cancelled a query, raising error ORA-00040: active time limit exceeded - call aborted.  
  • However, where quarantine directives have prevented SQL from executing, raising error ORA-56955: quarantined plan used, the ASH data from the event that originally created the directive has since been purged, so we cannot use it to match directives.
Mon Feb 24                                                                                                                                                           page    1
                                                     PS/Queries terminated by Resource Manager/quarantined Execution Plan

                                                          Public/                                                              ASH
                                                          Private                                Ru Oracle      Exec    ASH    CPU Message Log
     P.I. DBNAME  OPRID    RUNCNTLID                      Query   QRYNAME                        St Err. #      Secs   Secs   Secs Date/Time Stamp              SQL_ID
--------- ------- -------- ------------------------------ ------- ------------------------------ -- --------- ------ ------ ------ ---------------------------- -------------
   SQL Plan   Full Plan                                                                                      CPU
 Hash Value  Hash Value Consumer Group Name                   SIGNATURE Quarantine Name                      Time  Quarantine Created           Quarantine Last Executed
----------- ----------- ------------------------- --------------------- ------------------------------------ ----- ---------------------------- ----------------------------
…
 31452465 FSPROD  USR0001  GBR_JRNL_LN_DETAIL_ACCT        Public  GBR_JRNL_LN_DETAIL_ACCT        10 ORA-56955     36     10     10 20-FEB-25 06.28.03.578218 PM 0wm9g6xkys12h
 4009529842   653370716 PSQUERY_BATCH_GROUP         5584654620166156419

 31451318 FSPROD  USR0002  GBR_JRNL_LN_DETAIL_ACCT        Public  GBR_JRNL_LN_DETAIL_ACCT        10 ORA-56955     36               20-FEB-25 02.36.38.590841 PM


 31451292 FSPROD  USR0002  GBR_JRNL_LN_DETAIL_ACCT        Public  GBR_JRNL_LN_DETAIL_ACCT        10 ORA-56955     36               20-FEB-25 02.30.51.777351 PM


 31438602 FSPROD  USR0003  1                              Private DK_GBR_GL_DETAIL_NEW           10 ORA-00040  28316  28275  14203 18-FEB-25 11.39.19.502412 PM 5qrbrf775whky
 3446094907  3491308607 PSQUERY_BATCH_GROUP        16266909742923016361 SQL_QUARANTINE_f3gxc76u48u59d019243f 14400 18-FEB-25 11.49.33.091081 PM

 31437925 FSPROD  USR0004  16                             Private TB_TEST2                       10 ORA-00040  17684  17654  17541 18-FEB-25 06.09.14.060615 PM 06xqrgj18wp05
 4256462904  2062199471 PSQUERY_BATCH_GROUP         6341132966559464532 SQL_QUARANTINE_5h01uuscnrg2n7aeaaaaf 14400 18-FEB-25 06.17.20.679769 PM

 31437907 FSPROD  USR0004  16                             Private TB_TEST2                       10 ORA-00040  17694  17695  17592 18-FEB-25 06.04.05.942470 PM 4yurn75y2p0t2
 3232504707   121066138 PSQUERY_BATCH_GROUP         4966087806133732884 SQL_QUARANTINE_49usqjjc001hn0737529a 14400 18-FEB-25 06.17.24.869185 PM
 …

Thursday, February 20, 2025

Management of Long Running PS/Queries Cancelled by Resource Manager CPU Limit

I have written previously about using the Oracle database resource manager to prioritise the allocation of CPU to different processes in a PeopleSoft system.  I proposed a sample resource plan that can be used as a starting point to build a resource plan that meets a system's specific objectives and requirements.

This post looks at 

  • How to configure the resource manager to cancel long-running queries,
  • What happens when it does and what PeopleSoft users experience,
  • How the system administrators can monitor such queries,
  • What action could they take?

Configuring SQL Cancellation in a Resource Manager Consumer Group

The sample resource plan, PSFT_PLAN, contains various server consumer groups.  It relies upon MODULE and ACTION being set by enabling PeopleSoft instrumentation (EnableAEMonitoring=1) and/or the psftapi_store_prcsinstance trigger on PSPRCSRQST (see Effective PeopleSoft Performance Monitoring),
  • PSQUERY_BATCH_GROUP
    • Applies to scheduled PSQUERY Application Engine Programs
    • Limited to 4 hours on CPU (or estimated at >= 4 hours)
  • PSQUERY_ONLINE
    • Applies to queries run online via the PeopleSoft Internet Architecture (PIA).
    • There is no resource manager limit for this consumer group.  
      • The PIA has a session timeout (default 20 minutes).  
      • The ICQuery Tuxedo service that runs the queries also has a timeout (default 20 minutes)
    • When the resource manager cancels a SQL call, it simply raises an Oracle error that appears in a PIA message box without further explanation.  It is better to let the PIA timeouts handle online queries in a more controlled fashion.
    • To prevent queries continuing to run on the database after the client has been terminated
  • LOW_LIMITED_GROUP
  • Applies to SQL*Plus, SQL Developer and Toad.
  • Limited to 2 hours on CPU (or estimated at >= 2 hours)

  • Recommendations: 

    • Users should generally be encouraged to schedule queries that will take more than a few minutes to run on the process scheduler.  
    • Resist the temptation to increase either the PIA inactivity timeout or ICQuery service timeout from the delivered setting of 20 minutes. 

    Plan Directives

    Plan directives are created with DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE
      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        'PSFT_PLAN', 'PSQUERY_BATCH_GROUP'
        ,mgmt_p6 => 1
        ,switch_group => 'CANCEL_SQL'
        ,switch_time => 14400
        ,switch_estimate => TRUE 
        ,switch_for_call => TRUE
      );
    
      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        'PSFT_PLAN', 'PSQUERY_ONLINE_GROUP'
        ,mgmt_p6 => 90
      );
    
      DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
        'PSFT_PLAN', 'LOW_LIMITED_GROUP'
        ,mgmt_p8 => 1
        ,switch_group => 'CANCEL_SQL'
        ,switch_time => 7200
        ,switch_elapsed_time => 7200
        ,switch_estimate => TRUE 
        ,switch_for_call => TRUE
      );
    Four parameters control cancellation behaviour.
    • SWITCH_GROUP specifies the consumer group to which the session is switched when a switch condition is met.  If the group switches to CANCEL_SQL the current call is cancelled, raising error ORA-00400.
    • SWITCH_TIME specified the number of seconds on CPU (not elapsed time).
    • If SWITCH_ESTIMATE is true, the resource manager also switches group if the estimated run time is greater than the switch time
    • SWITCH_FOR_CALL is set to true so that if the consumer group is switched, it is then restored to the original consumer group at the end of the top call.  Thus a persistent session is not permanently switched. This is important if switching an application engine server (PSAESRV) session.

    Cancellation Behaviour

    The resource manager can cancel long-running queries in these consumer groups raising ORA-00040: active time limit exceeded - call aborted 
    • The query may be cancelled immediately because the estimated execution time is greater than the limit.  
    • Otherwise, it is quite likely to run for an elapsed time that is greater than the CPU time limit. The database session is only on CPU if the event reported on the session is NULL.  Otherwise, it is doing something else.
    • Some time will be consumed in the client process, during which the database session will be idle waiting for the next fetch request from the client.  This is usually reported as event SQL*Net message from client.
    • Some of the database time may not be on CPU because it may be doing something else, such as physical IO.
    • The database session may be held back by the resource manager allocating CPU to higher priority processes,  in which case again the session will not be on CPU, and will report being on event resmgr: cpu quantum.

    Querying the PeopleSoft Message Log

    The full message text is stored in multiple pieces in PS_MESSAGE_LOGPARM and must be reconstructed so that it can be searched for the error code.  I demonstrated this technique in another blog post: Querying the PeopleSoft Message Log with SQL.

    For this analysis, I have made some alterations to the message log query (see message_log_checker-psquery.sql).

    • This query is restricted to messages generated by PSQUERY processes that did not run to success (not run status 9).
    • PeopleSoft messages are typically defined with up to 9 substitution variables, but long SQL statements can have many more entries in PS_MESSAGE_LOGPARM.  So the PL/SQL function in this query simply appends any additional log parameter rows beyond the 9 substitution variables to the end of the generated string.
    • Once the message has been generated we can look for one of the error messages associated with the resource manager terminating a query:
      • ORA-00040: active time limit exceeded - call aborted
      • ORA-56955: quarantined plan used

    It is necessary to filter by message number because even in PS/Query users can write invalid SQL that produces other error messages.  However, all this text processing for each row retrieved makes the query quite slow.

    Here is an example output.  User USR001 ran a private query MY_TEST2 with run control 42.  It ran for 20772s (5h 46m) until terminated by the resource manager.  As explained above, the 4-hour limit is on CPU time that will be less than the elapsed time.
                                                             Public/
                                                              Private                                Ru   Exec                              Msg Msg  Msg
         P.I. DBNAME  OPRID    RUNCNTLID                      Query   QRYNAME                        St   Secs DTTM_STAMP_SEC               Seq Set  Nbr
    --------- ------- -------- ------------------------------ ------- ------------------------------ -- ------ ---------------------------- --- --- ----
    MSG
    ---------------------------------------------------------------------------------------------------------------------------------------------------- 
     12395311 FSPROD  USR001   42                             Private MY_TEST2                       10  20772 10-FEB-25 04.41.47.384694 PM   1  65   30
    File: C:\PT860P13B_2403250500-retail\peopletools\src\pssys\qpm.cppSQL error. Stmt #: 8526  Error Position: 189  Return: 40 - ORA-00040: active time
     limit exceeded - call abortedFailed SQL stmt: SELECT A.LEDGER, A.FISCAL_YEAR, A.BUSINESS_UNIT, …
                                                                                                               10-FEB-25 04.41.47.421800 PM   2  50  380
    Error in running query because of SQL Error, Code=40, Message=ORA-00040: active time limit exceeded - call aborted
    …
    From Oracle 19c on Exadata, timed-out statements are automatically quarantined.  If a quarantined statement is run and a quarantined execution plan is generated, then error ORA-56955 is generated immediately.  Therefore, it can also be detected in the logs.  The query searches for both messages.
                                                              Public/
                                                              Private                                Ru   Exec                              Msg Msg  Msg
         P.I. DBNAME  OPRID    RUNCNTLID                      Query   QRYNAME                        St   Secs DTTM_STAMP_SEC               Seq Set  Nbr
    --------- ------- -------- ------------------------------ ------- ------------------------------ -- ------ ---------------------------- --- --- ----
    MSG
    ----------------------------------------------------------------------------------------------------------------------------------------------------
     12319513 FSPROD  USR002   Transactions                   Public  GBR_JRNL_LINE_DTL_ACCT         10     25 13-FEB-25 11.13.35.746644 PM   1  65   30
    File: C:\PT860P13B_2403250500-retail\peopletools\src\pssys\qpm.cppSQL error. Stmt #: 8526  Error Position: 2783  Return: -8581 - ORA-56955: quarant
    ined plan usedFailed SQL stmt: SELECT A.FISCAL_YEAR, A.ACCOUNTING_PERIOD, A.JRNL_HDR_STATUS, B.LED
    …
                                                                                                               13-FEB-25 11.13.35.814112 PM   2  50  380
    Error in running query because of SQL Error, Code=-8581, Message=ORA-56955: quarantined plan used
    I discuss automatic SQL quarantine for PS/Query in a subsequent blog.

    Opinion

    So far, I have explained how to set a maximum CPU time limit for PS/Queries in a resource manager consumer group and how to detect cancelled PS/Queries by examining the message log.

    The final stage is to close the feedback loop and go back to the users producing the queries, find out what they are trying to do, and why the queries are running for such a long time.