Monday, November 03, 2014

Filtering PeopleTools SQL from Performance Monitor Traces


I have been doing some on-line performance tuning on a PeopleSoft Financials system using PeopleSoft Performance Monitor (PPM).  End-users have collect verbose PPM traces. Usually, when I use PPM in a production system, all the components are fully cached by the normal activity of the user (except when the application server caches have recently been cleared).  However, when working in a user test environment it is common to find that the components are not fully cached. This presents two problems.
  • The application servers spend quite a lot of time executing queries on the PeopleTools tables to load the components, pages and PeopleCode into their caches. We can see in the screenshot of the component trace that there is a warning message that component objects are not fully cached, and that these  cache misses skew timings.
  • In verbose mode, the PPM traces collect a lot of additional transactions capturing executions and fetches against PeopleTools tables. The PPM analytic components cannot always manage the resultant volume of transactions.
Figure 1. Component trace as collected by PPM
Figure 1. Component trace as collected by PPM
If I go further down the same page and look in the SQL Summary, I can see SQL operations against PeopleTools tables (they are easily identifiable in that they generally do not have an underscore in the third character). Not only are 5 of the top 8 SQL operations related to PeopleTools tables, we can also see that they also account for over 13000 executions, which means there are at least 13000 rows of additional data to be read from PSPMTRANSHIST.
Figure 2. SQL Summary of PPM trace with PeopleTools SQL
Figure 2. SQL Summary of PPM trace with PeopleTools SQL
When I open the longest running server round trip (this is also referred to as a Performance Monitoring Unit or PMU), I can only load 1001 rows before I get a message warning that the maximum row limit has been reached. The duration summary and the number of executions and fetches cannot be calculated and hence 0 is displayed.
Figure 3: Details of longest PMU with PeopleTools SQL
Figure 3: Details of longest PMU with PeopleTools SQL

Another consequence of the PeopleTools data is that it can take a long time to open the PMU tree. There is no screenshot of the PMU tree here because in this case I had so much data that I couldn't open it before the transaction timed out!

Solution 

My solution to this problem is to delete the transactions that relate to PeopleTools SQL and correct the durations, and the number of executions and fetches held in summary transactions. The rationale is that these transactions would not normally occur in significant quantities in a real production system, and there is not much I can do about them when they do.
The first step is to clone the trace. I could work on the trace directly, but I want to preserve the original data.
PPM transactions are held in the table PSPMTRANSHIST. They have a unique identifier PM_INSTANCE_ID. A single server round trip, also called a Performance Monitoring Unit (PMU), will consist of many transactions. They can be shown as a tree and each transaction has another field PM_PARENT_INST_ID which holds the instance of the parent. This links the data together and we can use hierarchical queries in Oracle SQL to walk the tree. Another field PM_TOP_INST_ID identifies the root transaction in the tree.
Cloning a PPM trace is simply a matter of inserting data into PSPMTRANSHIST. However, when I clone a PPM trace I have to make sure that the instance numbers are distinct but still link correctly. In my system I can take a very simple approach. All the instance numbers actually collected by PPM are greater than 1016. So, I will simply use the modulus function to consistently alter the instances to be different. This approach may break down in future, but it will do for now.
On an Oracle database, PL/SQL is a simple and effective way to write simple procedural processes.  I have written two anonymous blocks of code.
Note that the cloned trace will be purged from PPM like any other data by the delivered PPM archive process.

REM xPT.sql
BEGIN --duplicate PPM traces
  FOR i IN (
    SELECT h.*
    FROM   pspmtranshist h
    WHERE  pm_perf_trace != ' ' /*rows must have a trace name*/
--  AND    pm_perf_trace = '9b. XXXXXXXXXX' /*I could specify a specific trace by name*/
    AND    pm_instance_id > 1E16 /*only look at instance > 1e16 so I do not clone cloned traces*/
  ) LOOP
    INSERT INTO pspmtranshist 
    (PM_INSTANCE_ID, PM_TRANS_DEFN_SET, PM_TRANS_DEFN_ID, PM_AGENTID, PM_TRANS_STATUS,
    OPRID, PM_PERF_TRACE, PM_CONTEXT_VALUE1, PM_CONTEXT_VALUE2, PM_CONTEXT_VALUE3,
    PM_CONTEXTID_1, PM_CONTEXTID_2, PM_CONTEXTID_3, PM_PROCESS_ID, PM_AGENT_STRT_DTTM,
    PM_MON_STRT_DTTM, PM_TRANS_DURATION, PM_PARENT_INST_ID, PM_TOP_INST_ID, PM_METRIC_VALUE1,
    PM_METRIC_VALUE2, PM_METRIC_VALUE3, PM_METRIC_VALUE4, PM_METRIC_VALUE5, PM_METRIC_VALUE6,
    PM_METRIC_VALUE7, PM_ADDTNL_DESCR)
    VALUES
    (MOD(i.PM_INSTANCE_ID,1E16) /*apply modulus to instance number*/
    ,i.PM_TRANS_DEFN_SET, i.PM_TRANS_DEFN_ID, i.PM_AGENTID, i.PM_TRANS_STATUS,
    i.OPRID, 
    SUBSTR('xPT'||i.PM_PERF_TRACE,1,30) /*adjust trace name*/,
    i.PM_CONTEXT_VALUE1, i.PM_CONTEXT_VALUE2, i.PM_CONTEXT_VALUE3,
    i.PM_CONTEXTID_1, i.PM_CONTEXTID_2, i.PM_CONTEXTID_3, i.PM_PROCESS_ID, i.PM_AGENT_STRT_DTTM,
    i.PM_MON_STRT_DTTM, i.PM_TRANS_DURATION, 
    MOD(i.PM_PARENT_INST_ID,1E16), MOD(i.PM_TOP_INST_ID,1E16), /*apply modulus to parent and top instance number*/
    i.PM_METRIC_VALUE1, i.PM_METRIC_VALUE2, i.PM_METRIC_VALUE3, i.PM_METRIC_VALUE4, i.PM_METRIC_VALUE5, 
    i.PM_METRIC_VALUE6, i.PM_METRIC_VALUE7, i.PM_ADDTNL_DESCR);
  END LOOP;
  COMMIT;
END;
/ 
Now I will work on the cloned trace. I want to remove certain transaction.
  • PeopleTools SQL. Metric value 7 reports the SQL operation and SQL table name. So if the first word is SELECT and the second word is a PeopleTools table name then it is a PeopleTools SQL operation. A list of PeopleTools tables can be obtained from the object security table PSOBJGROUP.
  • Implicit Commit transactions. This is easy - it is just transaction type 425. 
Having deleted the PeopleTools transactions, I must also
  • Correct transaction duration for any parents of transaction. I work up the hierarchy of transactions and deduct the duration of the transaction that I am deleting from all of the parent.
  • Transaction types 400, 427 and 428 all record PeopleTools SQL time (metric 66). When I come to that transaction I also deduct the duration of the deleted transaction from the PeopleTools SQL time metric in an parent transaction.
  • Delete any children of the transactions that I delete. 
  • I must also count each PeopleTools SQL Execution transaction (type 408) and each PeopleTools SQL Fetch transaction (type 414) that I delete. These counts are also deducted from the summaries on the parent transaction 400. 
The summaries in transaction 400 are used on the 'Round Trip Details' components, and if they are not adjusted you can get misleading results. Without the adjustments, I have encountered PMUs where more than 100% of the total duration is spent in SQL - which is obviously impossible.
Although this technique of first cloning the whole trace and then deleting the PeopleTools operations can be quite slow, it is not something that you are going to do very often. 
REM xPT.sql
REM (c)Go-Faster Consultancy Ltd. 2014
set serveroutput on echo on
DECLARE 
  l_pm_instance_id_m4 INTEGER;
  l_fetch_count INTEGER;
  l_exec_count INTEGER;
BEGIN /*now remove PeopleTools SQL transaction and any children and adjust trans durations*/
  FOR i IN (
    WITH x AS ( /*returns PeopleTools tables as defined in Object security*/
      SELECT o.entname recname
      FROM   psobjgroup o
      WHERE  o.objgroupid = 'PEOPLETOOLS'
      AND    o.enttype = 'R'
    )
    SELECT h.pm_instance_id, h.pm_parent_inst_id, h.pm_trans_duration, h.pm_trans_defn_id
    FROM   pspmtranshist h
           LEFT OUTER JOIN x
           ON h.pm_metric_value7 LIKE 'SELECT '||x.recname||'%'
           AND x.recname = upper(regexp_substr(pm_metric_value7,'[^ ,]+',8,1)) /*first word after select*/
    WHERE  pm_perf_trace like 'xPT%' /*restrict to cloned traces*/
--  AND    pm_perf_trace = 'xPT9b. XXXXXXXXXX' /*work on a specific trace*/
    AND    pm_instance_id < 1E16 /*restrict to cloned traces*/
    AND   (   x.recname IS NOT NULL 
           OR h.pm_trans_defn_id IN(425 /*Implicit Commit*/))
    ORDER BY pm_instance_id DESC
  ) LOOP
    l_pm_instance_id_m4 := TO_NUMBER(NULL);
 
    IF i.pm_parent_inst_id>0 AND i.pm_trans_duration>0 THEN
      FOR j IN(
        SELECT  h.pm_instance_id, h.pm_parent_inst_id, h.pm_top_inst_id, h.pm_trans_defn_id
        ,       d.pm_metricid_3, d.pm_metricid_4
        FROM    pspmtranshist h
          INNER JOIN pspmtransdefn d
          ON         d.pm_trans_defn_set = h.pm_trans_defn_set
          AND        d.pm_trans_defn_id = h.pm_trans_Defn_id
        START WITH h.pm_instance_id = i.pm_parent_inst_id
        CONNECT BY prior h.pm_parent_inst_id = h.pm_instance_id 
      ) LOOP
        /*decrement parent transaction times*/
        IF j.pm_metricid_4 = 66 /*PeopleTools SQL Time (ms)*/ THEN --decrement metric 4 on transaction 400
          --dbms_output.put_line('ID:'||i.pm_instance_id||' Type:'||i.pm_trans_defn_id||' decrement metric_value4 by '||i.pm_trans_duration);
          UPDATE pspmtranshist 
          SET    pm_metric_value4 = pm_metric_value4 - i.pm_trans_duration
          WHERE  pm_instance_id = j.pm_instance_id
          AND    pm_trans_Defn_id = j.pm_trans_defn_id
          AND    pm_metric_value4 >= i.pm_trans_duration
          RETURNING pm_instance_id INTO l_pm_instance_id_m4;
        ELSIF j.pm_metricid_3 = 66 /*PeopleTools SQL Time (ms)*/ THEN --SQL time on serialisation
          --dbms_output.put_line('ID:'||i.pm_instance_id||' Type:'||i.pm_trans_defn_id||' decrement metric_value3 by '||i.pm_trans_duration);
          UPDATE pspmtranshist 
          SET    pm_metric_value3 = pm_metric_value3 - i.pm_trans_duration
          WHERE  pm_instance_id = j.pm_instance_id
          AND    pm_trans_Defn_id = j.pm_trans_defn_id
          AND    pm_metric_value3 >= i.pm_trans_duration;
        END IF;

        UPDATE pspmtranshist 
        SET    pm_trans_duration = pm_trans_duration - i.pm_trans_duration
        WHERE  pm_instance_id = j.pm_instance_id
        AND    pm_trans_duration >= i.pm_trans_duration;
      END LOOP;
    END IF;

    l_fetch_count := 0;
    l_exec_count := 0;
    FOR j IN( /*identify transaction to be deleted and any children*/
      SELECT  pm_instance_id, pm_parent_inst_id, pm_top_inst_id, pm_trans_defn_id, pm_metric_value3
      FROM    pspmtranshist
      START WITH pm_instance_id = i.pm_instance_id
      CONNECT BY PRIOR pm_instance_id = pm_parent_inst_id 
    ) LOOP
      IF j.pm_trans_defn_id = 408 THEN /*if PeopleTools SQL*/
        l_exec_count := l_exec_count + 1;
      ELSIF j.pm_trans_defn_id = 414 THEN /*if PeopleTools SQL Fetch*/
        l_fetch_count := l_fetch_count + j.pm_metric_value3;
      END IF;
      DELETE FROM pspmtranshist h /*delete tools transaction*/
      WHERE h.pm_instance_id = j.pm_instance_id;
   END LOOP;

   IF l_pm_instance_id_m4 > 0 THEN 
     --dbms_output.put_line('ID:'||l_pm_instance_id_m4||' Decrement '||l_exec_Count||' executions, '||l_fetch_count||' fetches');
     UPDATE pspmtranshist  
     SET    pm_metric_value5 = pm_metric_value5 - l_exec_count
     ,      pm_metric_value6 = pm_metric_value6 - l_fetch_count
     WHERE  pm_instance_id = l_pm_instance_id_m4;
    l_fetch_count := 0;
    l_exec_count := 0;
   END IF;

  END LOOP;
END;
/
Now, I have a second PPM trace that I can open in the analytic component.
Figure 4: Original and Cloned PPM traces
Figure 4: Original and Cloned PPM traces


When I open the cloned trace, both timings in the duration summary have reduced as have the number of executions and fetches.  The durations of the individual server round trips have also reduced.
Figure 5: Component Trace without PeopleTools transactions
Figure 5: Component Trace without PeopleTools transactions

All of the PeopleTools SQL operations have disappeared from the SQL summary.
Figure 6: SQL Summary of PPM trace after removing PeopleTools SQL transactions
Figure 6: SQL Summary of PPM trace after removing PeopleTools SQL transactions

The SQL summary now only has 125 rows of data.
Figure 7: SQL Summary of PMU without PeopleTools SQL

Now, the PPM tree component opens quickly and without error.
Figure 8: PMU Tree after removing PeopleTools SQL
Figure 8: PMU Tree after removing PeopleTools SQL

There may still be more transactions in a PMU than I can show in a screenshot, but I can now find the statement that took the most time quite quickly.

Figure 9: Long SQL transaction further down same PMU tree
Figure 9: Long SQL transaction further down same PMU tree

Conclusions 

I think that it is reasonable and useful to remove PeopleTools SQL operations from a PPM trace.
In normal production operation, components will mostly be cached, and this approach renders traces collected in non-production environments both usable in the PPM analytic components and more realistic for performance tuning. However, it is essential that when deleting some transactions from a PMU, that summary data held in other transactions in the same PMU are also corrected so that the metrics remain consistent.