Thursday, June 26, 2008

Discrepancy in Timings between Application Engine and Oracle SQL Trace

I would like to draw attention to the risk of uncritically accepting performance information from either a tool or an instrumented program. It is always necessary to consider whether you can believe what you are being told, or whether something is lying to you.

In this case, I want to highlight a discrepancy in the Application Engine Timings report. Take the following example from Time and Labor. The AE Timings report claims that the process spends 91.5% of its execution time in SQL. This would lead you to believe that the process is constrained by the performance of its SQL, and that you need to do some SQL tuning.

---------------------------------------------------------------------------------
Total run time                :   1411.6
Total time in application SQL :   1291.7   Percent time in application SQL : 91.5%
Total time in PeopleCode      :    119.4   Percent time in PeopleCode      :  8.5%
Total time in cache           :      0.1   Number of calls to cache        : 3
---------------------------------------------------------------------------------

However, all is not quite as it appears. Faced with a potential SQL problem, I usually turn to Oracle SQL Trace. The following is taken from a TKPROF analysis of the trace file for the same process.

OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current       rows
------- ------  -------- ---------- ---------- ---------- ---------- ----------
Parse   153883      1.80       2.41          0          4          0          0
Execute 192806    429.94     464.96       8398     276000     588543     123140
Fetch    52740      2.75       3.55        590     202419          0      15265
------- ------  -------- ---------- ---------- ---------- ---------- ----------
total   399429    434.49     470.93       8988     478423     588543     138405

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                  196521        0.00          0.13
  SQL*Net message from client                196521        0.32        849.98
  ...
  SQL*Net more data to client                114884        0.00          1.91
  SQL*Net more data from client              250825        0.05          1.88
  ...

The database reports 849.98s on the idle event SQL*Net message from client, but only 470.93s on non-idle activity. So only 36% of the total elapsed time (470.93 / (470.93 + 849.98)) is spent in the database; the other 64% is time that the Application Engine process is busy executing its own code.
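
This is broadly how such a trace can be captured with DBMS_MONITOR. It is a sketch rather than the exact commands used here: the sid and serial# values (123, 456) are placeholders for the Application Engine session, which can be identified in V$SESSION by its PROGRAM or MODULE.

EXECUTE dbms_monitor.session_trace_enable(session_id=>123, serial_num=>456, waits=>TRUE, binds=>FALSE);
-- ...let the process run, then...
EXECUTE dbms_monitor.session_trace_disable(session_id=>123, serial_num=>456);
-- profile the resulting trace file from user_dump_dest, for example:
--   tkprof <tracefile>.trc <output>.prf sort=prsela,exeela,fchela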

Why the discrepancy?

This is certainly a very chatty process: there are nearly 200,000 round-trip SQL*Net messages between Application Engine and Oracle, so network latency is an obvious suspect. However, very little time is spent on the SQL*Net more data events, so it is unlikely that there is a significant network component.
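
These waits can also be watched accumulating against the running session while the process executes. A minimal sketch, assuming you know the session's SID (123 below is a placeholder):

SELECT event, total_waits, time_waited/100 AS secs_waited
FROM   v$session_event
WHERE  sid = 123                 -- placeholder for the AE session's SID
AND    event LIKE 'SQL*Net%'     -- time_waited is in centiseconds
ORDER BY time_waited DESC;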

Let's look at the statements that have the longest times in the AE timings report.

                         PeopleSoft Application Engine Timings
                               (All timings in seconds)

                                C o m p i l e    E x e c u t e     F e t c h      Total
SQL Statement                    Count     Time   Count     Time   Count     Time   Time
------------------------------ ------- -------- ------- -------- ------- -------- --------
PeopleCode

DELETE PS_SCH_ADHOC_DTL                           36540     60.6       0      0.0     60.6
INSERT PS_SCH_ADHOC_DTL                           71887    885.6       0      0.0    885.6
Record.SelectByKey SCH_ADHOC_DTL                  36596    118.0   36596      0.0    118.0
                                                                                   --------
                                                                                    1291.7

In less than 24 minutes, some statements have been executed tens of thousands of times. So there were lots of SQL*Net messages because there were lots of SQL statements. These SQL statements are submitted from PeopleCode, so it is very likely that we are in some sort of procedural loop, and therein lies the problem.

PeopleSoft programs can only measure the time taken to execute SQL from the perspective of the client program. In this case the timings are calculated by instrumentation inside Application Engine and PeopleCode. Meanwhile, the timings in the Oracle trace are obtained from instrumentation within the Oracle kernel.

There is still a fair amount of PeopleSoft code that executes during the time counted as execution of SQL statements. This is not just the SQL*Net libraries: I think the timings include most of the time it takes the SQLExec() and Rowset functions to execute. The discrepancy with the Oracle time therefore mounts with every SQL operation.

In this case, there is very little to be gained by tuning the SQL. We can't do much about the PeopleCode because it is delivered vanilla functionality. All we can do is to bring more CPU to bear upon the problem by running multiple concurrent instances of the process.

Tuesday, June 03, 2008

Oracle Optimizer Statistics and Optimizer Dynamic Sampling with PeopleSoft Temporary Records

PeopleSoft Temporary Records are used for working storage during Application Engine programs. Typically, AE programs truncate and repopulate the tables before using them. PeopleSoft recognised the need to keep the statistics on these tables in line with the data that they contain, and so used the %UpdateStats macro in many places in delivered programs to update the statistics.

However, frequently gathering statistics on even small tables can become time consuming. Recently, I have been working on PeopleSoft Time and Labor, which makes heavy use of temporary records. In a single execution of TL_TIMEADMIN, several tables associated with temporary records are truncated, repopulated and analysed many times. I have discussed elsewhere the problem of excessive use of TRUNCATE giving rise to local write wait.

Oracle also recognised this problem, and in version 9 of the database they introduced Optimizer Dynamic Sampling, where the database samples the data to generate statistics at statement parse time.
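
You can see the effect on a single statement with the DYNAMIC_SAMPLING hint, without changing anything at instance level. The table and predicate below are purely illustrative:

SELECT /*+ DYNAMIC_SAMPLING(tmp 4) */ *
FROM   ps_tl_pmtch1_tmp1 tmp  -- an instance table of a T&L temporary record
WHERE  emplid = 'K0G001';     -- illustrative predicate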

I am still testing, but on Oracle 10gR2 (version 10.2.0.3) I have obtained improvements in performance and stability of T&L AE processes by:
  1. Deleting optimizer statistics on tables associated with temporary records, in order to force the optimizer to sample at parse time.
  2. Locking the optimizer statistics to prevent the %UpdateStats macro from putting them back. Tables with locked statistics are also omitted by GATHER_SCHEMA_STATS and GATHER_DATABASE_STATS (unless the force option is specified), and hence also by the delivered maintenance window job that refreshes stale statistics.
  3. Implementing an alternative DDL model that uses a PL/SQL packaged function to suppress the error raised when attempting to collect statistics on a table whose statistics are locked (see %UpdateStats() -v- Optimizer Dynamic Sampling, and the sketch after the list of sampling levels below). This also addresses the mix-up in the delivered DDL models.
  4. The final piece of the puzzle has been to set OPTIMIZER_DYNAMIC_SAMPLING to 4 at instance level. I have certainly had problems with this parameter at its default value of 2.
The dynamic sampling levels are described in the Performance Tuning Guide, section 14.5.6.4:
  • Level 2: Apply dynamic sampling to all unanalyzed tables.
  • Level 3: As Level 2, plus all tables for which standard selectivity estimation used a guess for some predicate that is a potential dynamic sampling predicate.
  • Level 4: As Level 3, plus all tables that have single-table predicates that reference 2 or more columns.
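
Returning to point 3, a minimal sketch of such a DDL model wrapper follows. The package and procedure names are my own invention for illustration; the key point is that dbms_stats raises ORA-20005 when statistics are locked, and the wrapper suppresses just that error:

CREATE OR REPLACE PACKAGE wrapper AS
 PROCEDURE ps_stats(p_ownname VARCHAR2, p_tabname VARCHAR2);
END wrapper;
/

CREATE OR REPLACE PACKAGE BODY wrapper AS
 PROCEDURE ps_stats(p_ownname VARCHAR2, p_tabname VARCHAR2) IS
  table_stats_locked EXCEPTION;
  PRAGMA EXCEPTION_INIT(table_stats_locked,-20005);
 BEGIN
  dbms_stats.gather_table_stats(ownname=>p_ownname, tabname=>p_tabname);
 EXCEPTION
  WHEN table_stats_locked THEN
   NULL; -- statistics are locked, so leave them to dynamic sampling
 END ps_stats;
END wrapper;
/
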
So the next stage is to identify the working storage records and their associated tables. I started off by looking for tables that had recently been analysed.
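
Something like this lists the candidates ("recently" here meaning within the last day, which is an arbitrary choice):

SELECT table_name, num_rows, last_analyzed
FROM   user_tables
WHERE  last_analyzed >= SYSDATE - 1  -- analysed within the last day
ORDER BY last_analyzed DESC;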


The following script identifies all instances of temporary tables associated with temporary records, and then deletes and locks their statistics. I started by restricting it to a list of specific tables, but I think it would be perfectly reasonable to take this approach with all temporary records.


BEGIN
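 -- For each instance table of each temporary record (PSRECDEFN.RECTYPE = '7')
 -- whose statistics are not already locked, delete any existing optimizer
 -- statistics and then lock them.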
 FOR x IN (
  SELECT /*+LEADING(o i r v)*/ t.table_name, t.last_analyzed, t.num_rows
  ,      s.stattype_locked
  FROM pstemptblcntvw i
    INNER JOIN psrecdefn r
    ON r.recname = i.recname
    AND r.rectype = '7'
  , psoptions o
  , user_tables t
     LEFT OUTER JOIN user_tab_statistics s
     ON  s.table_name = t.table_name
     AND s.partition_name IS NULL
  , (SELECT rownum row_number
     FROM   psrecdefn 
     WHERE  ROWNUM <= 100) v                 
  WHERE  v.row_number <= i.temptblinstances + o.temptblinstances
  AND    t.table_name = DECODE(r.sqltablename,' ','PS_'||r.recname,r.sqltablename)
                      ||DECODE(v.row_number*r.rectype,100,'',LTRIM(TO_CHAR(v.row_number))) 
/*---------------------------------------------------------------------            
--AND    r.recname IN('TL_PMTCH1_TMP' --TL_TA000600.SLCTPNCH.STATS1.S…
--                   ,'TL_PMTCH2_TMP' --TL_TA000600.CALC_DUR.STATS1.S…)
-----------------------------------------------------------------------*/
  AND   (/*  t.num_rows        IS NOT NULL --table has been analysed
        OR   t.last_analyzed   IS NOT NULL --table has been analysed
        OR*/ s.stattype_locked IS     NULL --stats not locked
        ) 
) LOOP
  IF x.last_analyzed IS NOT NULL THEN --delete stats
   dbms_output.put_line('Deleting Statistics on '||user||'.'||x.table_name);
   dbms_stats.delete_table_stats(ownname=>user,tabname=>x.table_name,force=>TRUE);
  END IF;
  IF x.stattype_locked IS NULL THEN --lock stats
   dbms_output.put_line('Locking Statistics on '||user||'.'||x.table_name); 
   dbms_stats.lock_table_stats(ownname=>user,tabname=>x.table_name);
  END IF;
 END LOOP;
END;
/
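
Afterwards, the tables whose statistics are now locked can be confirmed with:

SELECT table_name, stattype_locked
FROM   user_tab_statistics
WHERE  stattype_locked IS NOT NULL
AND    partition_name IS NULL;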


Updated 11.2.2009: I have updated my advice on the use of Optimizer Dynamic Sampling (see %UpdateStats -v- Optimizer Dynamic Sampling). I still consider this to be a useful feature, but I have found scenarios where Oracle did not choose as good a plan as it chose with explicitly gathered statistics. Therefore, I still suggest locking statistics on temporary working storage records, but where batch programs have been coded to explicitly update statistics, dbms_stats should be called with the force option to override the lock. I have updated my DDL model wrapper script accordingly.
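
In the wrapper, that means the gather call becomes something like this (again a sketch; the table name is illustrative):

BEGIN
 dbms_stats.gather_table_stats(ownname=>user
                              ,tabname=>'PS_TL_PMTCH1_TMP1' -- illustrative table name
                              ,force  =>TRUE);              -- override the statistics lock
END;
/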

The scripts in this posting can be downloaded from my website.