Thursday, June 17, 2010

Configuring Large PeopleSoft Application Servers

Occasionally, I see very large PeopleSoft systems running on large proprietary Unix servers with many CPUs.  In an extreme case, I needed to configure application server domains with up to 14 PSAPPSRV processes per domain (each domain was on a virtual server with 8 CPU cores, co-resident with the Process Scheduler).

The first and most important point to make is don't have too many server processes.  If you run out of CPU or if you fully utilise all the physical memory and start to page memory from disk, then you have too many server processes.  It is better to queue on a Tuxedo queue rather than the CPU run queue, or disk queue during paging.

Multiple APPQ/PSAPPSRV Queues

A piece of advice that I originally got from BEA (prior to their acquisition by Oracle) was that you should not have more than 10 server processes on a single queue in Tuxedo.  Otherwise, you are likely to suffer from contention on the IPC queue structure because processes must acquire exclusive access to the queue in order to enqueue a service request to the queue or dequeue a request from it.  Instead multiple queues should be configured that are both serviced by the same server processes and so advertise the same services. 

If you look at the 'large' template delivered by PeopleSoft, you will see that it produces a domain that runs between 9 and 15 PSAPPSRV processes.  This does not conform to the advice I received from BEA.  I repeated this advice in PeopleSoft for the Oracle DBA.  Though I cannot now find the source for it, I stand by it.  I have recently been able to conduct some analysis to confirm it on a real production system.  Domains with two queues of 8 PSAPPPSRV server process each out performed domains with only a single queue.

Load Balancing Across Queues

If the same service is advertised on multiple queues, then Tuxedo recommended that you should specify realistic service loads and use Tuxedo load balancing to determine where to enqueue requests.  I want to emphasise that I am talking about load balancing across queues within a Tuxedo domain, and not about load balancing across Tuxedo domains in the web server.

This is what the Tuxedo documentation says about load balancing:

"Load balancing is a technique used by the BEA Tuxedo system for distributing service requests evenly among servers that offer the same service. Load balancing avoids overburdening some servers while leaving others idle or infrequently used. Before sending a request to a service routine, the BEA Tuxedo system identifies all servers capable of handling the request and selects the one most appropriate for maintaining a balanced load across all the servers in the configuration.


You can control whether a load-balancing algorithm is used on the system as a whole. Such as algorithm should be used only when necessary, that is, only when a service is offered by servers that use more than one queue. Services offered by only one server, or by multiple servers in a Multiple Server, Single Queue (MSSQ) do not need load balancing. The LDBAL parameter for these services should be set to N. In other cases, you may want to set LDBAL to Y."

It doesn't state that load balancing is mandatory for multi-queue domains, and only hints that it might improve performance.  If load balancing is not used, the listener process puts the messages on the first empty queue (one where no requests are queued).  If all queues have requests the listener round-robins between the queues.

You could consider giving ICScript, GetCertificate and other services with small service times a higher Tuxedo Service priority.  This means they jump the queue 9 times out of 10.  ICScript is generally used during navigation, GetCertificate is used at log on.  Giving these services higher priority will mean they perform well even when the system is busy.  Users often need to do several mouse clicks to navigate around the system, but these services are usually quick.  This will improve the user experience without changing the overall performance of the system.

Data

I have recently been able to test the performance of a domains with up to 14 PSAPPSRVs on a single IPC queue, versus domains with two queues with up to 7 PSAPPSRVs each, both with and without Tuxedo queue balancing.   These results come from a real production system where the multiple queue configuration was implemented on 2 of the 4 application servers.  The system has a short-lived weekly peak period of on-line processing.  During that time Tuxedo spawns additional PSAPPSRV processes, and so I get different sets of times for different numbers of process. 

The timings are produced from transactions sampled by PeopleSoft Performance Monitor.  I capture the number of spawned processes using the Tuxmon scripts on my website that use tmadmin to collect Tuxedo metrics.

1 Queue2 Queue

Server Processes per Queue

Number of Services

Mean ICPanel Service Time

Server Processes per Queue

Number of Services

Mean ICPanel Service Time

6

2,616

1.33

3

6945

1.05

7

1,949

0.97




8

1,774

1.06

4

7595

1.16

9

1,713

1.02




10

1,553

1.25

5

4629

1.17

11

1,250

1.30




12

969

1.32

6

3397

1.16

13

427

1.21




14

1,057

1.10

7

3445

1.13

 Total

13,308

1.17


26011

1.13

The first thing to acknowledge is that this data is quite noisy because it comes from a real production system, and the effects we are looking for are quite small.

I am satisfied that the domains with two PSAPPSRV queues generally perform better under high load, than those under 1.  Not only does the queue time increase on the single queue domain, the service time also increases.

However, I cannot demonstrate that Tuxedo Load Balancing makes a significant difference in either direction.

My results suggest that domains with multiple queues for requests handled by PSAPPSRV process perform slightly better without load balancing if there is no queue of requests, but perform slightly better if there is a queue of pending requests.  However, the difference is small.  It is not large enough to be statistically significant in my test data.

Conclusion

If you have a busy system with lots of on-line users, and sufficient hardware to resource it, then you might reach a point when you need more than 10 PSAPPSRVs.  In which case, I recommend that you configure multiple Tuxedo queues.

On the whole, I would recommend that Tuxedo Load Balancing should be configured.  I would not expect it to improve performance, but it will not degrade it either.

Friday, June 11, 2010

Life Cycle of a Process Request

Oracle's Flashback Query facility lets you query a past version of a row by using the information in the undo segment.  The VERSIONS option lets you seen all the versions that are available. Thus, it is possible to write a simple query to retrieve the all values that changed on a process request record through its life cycle.

The Oracle parameter undo_retention determines how long that data remains in the undo segment. In my example, it is set to 900 seconds, so I can only query versions in the last 15 minutes. If I attempt to go back further than this I will get an error.

column prcsinstance heading 'P.I.' format 9999999
column rownum heading '#' format 9 
column versions_starttime format a22
column versions_endtime format a22

SELECT rownum, prcsinstance
, begindttm, enddttm
, runstatus, diststatus
, versions_operation, versions_xid
, versions_starttime, versions_endtime
FROM sysadm.psprcsrqst
VERSIONS BETWEEN timestamp
systimestamp - INTERVAL '15' MINUTE AND
systimestamp 
WHERE prcsinstance = 2920185
/

#     P.I. BEGINDTTM           ENDDTTM             RU DI V VERSIONS_XID     VERSIONS_STARTTIME    VERSIONS_ENDTIME
-- -------- ------------------- ------------------- -- -- - ---------------- --------------------- ----------------------
 1  2920185 10:51:13 11/06/2010 10:52:11 11/06/2010 9  5  U 000F00070017BD63 11-JUN-10 10.52.10 AM
 2  2920185 10:51:13 11/06/2010 10:52:11 11/06/2010 9  5  U 001A002C001CB1FF 11-JUN-10 10.52.10 AM 11-JUN-10 10.52.10 AM
 3  2920185 10:51:13 11/06/2010 10:52:11 11/06/2010 9  7  U 002C001F000F87C0 11-JUN-10 10.52.10 AM 11-JUN-10 10.52.10 AM
 4  2920185 10:51:13 11/06/2010 10:52:11 11/06/2010 9  1  U 000E000A001771CE 11-JUN-10 10.52.10 AM 11-JUN-10 10.52.10 AM
 5  2920185 10:51:13 11/06/2010 10:52:02 11/06/2010 9  1  U 002A000F00125D89 11-JUN-10 10.52.01 AM 11-JUN-10 10.52.10 AM
 6  2920185 10:51:13 11/06/2010                     7  1  U 0021000B00132582 11-JUN-10 10.51.10 AM 11-JUN-10 10.52.01 AM
 7  2920185                                         6  1  U 0004002000142955 11-JUN-10 10.51.10 AM 11-JUN-10 10.51.10 AM
 8  2920185                                         5  1  I 0022002E0013F260 11-JUN-10 10.51.04 AM 11-JUN-10 10.51.10 AM

Now, I can see each of the committed versions of the record. Note that each version is the result of a different transaction ID.
Reading up from the last and earliest row in the report, you can see the history of this process request record.

  • At line 8 it was inserted (the value of psuedocolumn VERSION_OPERATION is 'I') at RUNSTATUS 5 (queued) by the component the operator used to submit the record.
  • At line 7, RUNSTATUS was updated to status 6 (Initiated) by the process scheduler.
  • At line 6 the process begins and updates the BEGINDTTM with the current database time, and sets RUNSTATUS to 7 (processing).
  • At line 5 the process completes, updates ENDDTTM to the current database time, and sets RUNSTATUS to 9 (success).
  • At line 4 the ENDDTTM is updated again. This update is performed by the Distribution Server process in the Process Scheduler domain as report output is posted to the report repository.  Note that the value is 1 second later than the VERSIONS_ENDTIME, therefore this time stamp is based on the operating system time for the host running the process scheduler. This server's clock is slightly out of sync with that of the database server.
  • At lines 3 to 1 there are 3 further updates as the distribution status is updated twice more.

For me, the most interesting point is that ENDDTTM is updated twice; first with the database time at which the process ended, and then again with the time at which any report output was successfully completed.

I frequently want measure the performance of a processes. I often write script that calculate the duration of the process as being the difference between ENDDTTM and BEGINDTTM, but now it is clear that this includes the time taken to post the report and log files to the report repository.

For Application Engine processes, you can still recover the time when the process ended. If batch timings are enabled and written to the database, the BEGINDTTM and ENDDTTM are logged in PS_BAT_TIMINGS_LOG.

select * from ps_bat_timings_log where process_instance = 2920185

PROCESS_INSTANCE PROCESS_NAME OPRID                          RUN_CNTL_ID
---------------- ------------ ------------------------------ ------------------------------
BEGINDTTM           ENDDTTM             TIME_ELAPSED TIME_IN_PC TIME_IN_SQL TRACE_LEVEL
------------------- ------------------- ------------ ---------- ----------- -----------
TRACE_LEVEL_SAM
---------------
         2920185 XXX_XXX_XXXX 52630500                       16023
10:51:12 11/06/2010 10:52:02 11/06/2010        49850      35610       13730        1159
            128

You can see above that ENDDTTM is the time when the process ended.

That gives me some opportunities. For Application Engine programs, I can measure the amount of time taken to posting report content, separately from the process execution time.  This query shows me that this particular process took 49 seconds, but the report output took a further 9 seconds to post.

SELECT r.begindttm begindttm
, NVL(l.enddttm, r.enddttm) enddttm
, (NVL(l.enddttm, r.enddttm)-r.begindttm)*86400 exec_secs
, r.enddttm posttime
, (r.enddttm-NVL(l.enddttm, r.enddttm))*86400 post_secs
FROM sysadm.psprcsrqst r
    LEFT OUTER JOIN sysadm.ps_bat_timings_log l
    ON l.process_instance = r.prcsinstance
WHERE r.prcsinstance = 2920185

BEGINDTTM           ENDDTTM              EXEC_SECS POSTTIME             POST_SECS
------------------- ------------------- ---------- ------------------- ----------
10:51:13 11/06/2010 10:52:02 11/06/2010         49 10:52:11 11/06/2010          9  

For more detail on the Flashback Query syntax see the Oracle SQL Reference.