Monday, January 26, 2009

Factors Affecting Performance of Concurrent Truncate of Working Storage Tables

A longer version of this posting, with detailed experimental results, is available on my website.

Over the past year, I have seen problems with Local Write Wait in the Oracle database on two different Oracle systems. One occasion was in a PeopleSoft Time and Labor batch process; the other was in a custom PL/SQL process in a non-PeopleSoft system.

In both cases, normal tables in the databases were being used for temporary working storage before that data was then written to another table. The content of the working storage tables was then cleared out by periodically truncating them. In order to increase overall batch throughput, several instances of the program were run in parallel. The resulting concurrent truncate operations contended with each other, and the processes did not scale well.

I have written about this subject previously. These problems prompted me to do some research and testing, and I am now able to make definite recommendations.

Oracle Note 334822.1 (which I have also quoted before) provides a good technical description of the database's internal behaviour. Put simply: truncate (and drop) operations serialise. Only one process can truncate at any one time. If you have multiple concurrent processes all trying to truncate their own working storage tables, you could experience performance problems, and such processes do not scale well as the number of concurrent processes increases.

Real Problems

In the case of the non-PeopleSoft PL/SQL process, I was able to recreate the working storage tables as Global Temporary Tables (GTTs) that deleted the rows on commit, because the process committed only when each transaction was complete. Local write wait disappeared completely in this case: temporary objects do not need to be recovered, so this mechanism does not apply to them.
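To illustrate, this is a minimal sketch of the change, assuming a working storage table whose name and columns are purely hypothetical:

-- Recreating a working storage table as a GTT.  ON COMMIT DELETE
-- ROWS empties it at each commit, so the process no longer needs
-- to truncate it, and the RO enqueue is never taken.
CREATE GLOBAL TEMPORARY TABLE wrk_storage
(process_instance NUMBER
,emplid           VARCHAR2(11)
,dur              DATE
,hours            NUMBER
) ON COMMIT DELETE ROWS;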

The PeopleSoft scenario involved one of the 'Time & Labor' batch processes, TL_TIMEADMIN. However, GTTs cannot easily be introduced into the T&L batches because they are 'restartable'. The contents of the temporary working storage tables must be preserved after the process and its session terminate, and this precludes the use of GTTs.

The combination of Local Write Wait and enq: RO - fast object reuse accounted for a significant 31% of the total response time.
  • local write wait occurs, as the name suggests, when the session is waiting for its own write operations. The RO enqueue is used to protect the buffer cache chain while it is scanned for dirty blocks in an object for the database writer to then write to the data files.
  • enq: RO - fast object reuse occurs when a process waits to acquire the RO enqueue, in other words, while somebody else is truncating or dropping an object.
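To gauge whether these events are significant on your own system, a query such as this (against v$system_event, using the 10g event names) gives a first, instance-wide indication:

-- Cumulative time spent on these waits since instance startup.
-- TIME_WAITED is in centiseconds.  For a per-process view, use
-- v$session_event or an extended SQL trace instead.
SELECT event, total_waits, time_waited/100 AS secs_waited
FROM   v$system_event
WHERE  event IN ('local write wait', 'enq: RO - fast object reuse')
ORDER BY time_waited DESC;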
Two factors affect the time for which the RO enqueue is held:
  1. The time taken to write the blocks to disk. Processes that frequently truncate temporary working storage are also doing a lot of DML to populate the working storage and other tables, so the disks under the data files are going to be busy. If the disk becomes a bottleneck, the duration of the local write wait will certainly increase.
  2. The time taken to scan the buffer cache for dirty blocks to be written to disk and flushed from cache. The larger the buffer cache, the longer it will take to find these blocks.
The Metalink note also suggests using a different block size, saying that "a separate buffer pool for temporary tables will also reduce RO enqueue". It is not clear whether the different block size or the separate buffer pool is the more important factor, so I wanted to find out.

Tests

I created a simple test to model the behaviour of T&L. I created pairs of simple tables, populated one of each pair, and then repeatedly copied the data back and forth between them, truncating the source after each copy. The test script has evolved into a PL/SQL package procedure, mainly so that the tests could be submitted to, and run concurrently by, the Oracle job scheduler. There are also procedures to create, populate, and drop the pairs of working storage tables. The scripts can be downloaded from my website.
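The full scripts are on my website; the core of each test is a loop that does something like this (the table and procedure names here are illustrative, not the ones in the package):

-- One test iteration: copy the data from one table of the pair to
-- the other, truncate the source, then repeat in reverse.
CREATE OR REPLACE PROCEDURE copy_truncate
(p_iterations IN INTEGER DEFAULT 100) AS
BEGIN
  FOR i IN 1..p_iterations LOOP
    INSERT INTO wrk_t2 SELECT * FROM wrk_t1;
    COMMIT;
    EXECUTE IMMEDIATE 'TRUNCATE TABLE wrk_t1'; -- DDL, so dynamic SQL
    INSERT INTO wrk_t1 SELECT * FROM wrk_t2;
    COMMIT;
    EXECUTE IMMEDIATE 'TRUNCATE TABLE wrk_t2';
  END LOOP;
END copy_truncate;
/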

I was able to run the same controlled test in a variety of scenarios. I have run the tests on Oracle 10.2.0.3 on various platforms with similar results. A detailed set of results is available in the longer version of this document on my website.

General Recommendations

If you have to store working data temporarily in a database table, it is much better to use a Global Temporary Table, although the design of the application may preclude this. For example, it is not possible with data used by restartable Application Engine processes, because the contents of the GTT would be lost when the process terminates.

The Metalink note references unpublished bug 414780, in which a PeopleSoft customer reported this problem, but "they seemed to fix it by changing some PeopleSoft code to implement delete rather than truncate on small temporary tables". However, my tests show that this probably degraded performance further. The individual delete statements take longer than the truncate operations, and the overall test times increased. Although the truncate operations serialise on the RO enqueue and wait for local writes, this is still better than deleting the data and waiting for the undo information to be written to the redo log. Furthermore, although the truncate operations did not scale well, the delete operations exhibited negative scalability at the same volumes and concurrency: they became bottlenecked on the redo log.
  • Using a recycle pool of the same block size as the rest of the database was not effective, possibly because these pools use the same LRU latches.
  • Using a larger non-default block size improved the performance of truncate operations. The performance with 32Kb blocks was better than with 16Kb.
  • Using a larger uniform extent size also improved performance, both for the truncate operations and for the test as a whole. Fewer, larger extents were involved, and hence less time was spent on CPU and on row cache lock. The overall throughput of truncate operations still degraded as the number of processes increased, although the throughput of the test as a whole did scale (see the tablespace sketch after this list).
  • The presence or absence of indexes did not have a significant effect on the relative test timings, and does not alter my advice.
  • The effect of truncating with the REUSE STORAGE option is less clear cut. There are no waits on row cache lock because the blocks do not have to be cleared out of the buffer cache, but on the other hand more time is spent on local write wait because all the dirty blocks have to be written to disk; hence the RO enqueue is held for longer, and more time is spent on enq: RO - fast object reuse. If you are using an AUTOALLOCATE tablespace then you would do better to use the REUSE STORAGE option, but generally you would do slightly better to use a larger uniform extent size and not use the REUSE STORAGE option.
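Putting those findings together, the working storage tablespace might be created along these lines. This is only a sketch: the tablespace name, datafile path, and sizes are all illustrative, and the instance must already have a 32Kb buffer cache (see the PeopleSoft recommendations below).

-- Illustrative tablespace for working storage tables: a 32Kb
-- block size and large uniform extents.
CREATE TABLESPACE ptwork32k
DATAFILE '/u01/oradata/ptwork32k01.dbf' SIZE 1G
BLOCKSIZE 32K
EXTENT MANAGEMENT LOCAL UNIFORM SIZE 10M;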
PeopleSoft Recommendations

Over time, PeopleSoft batch processing has moved slightly away from SQR and COBOL. These types of process cannot be restarted, and so tables used for temporary working storage within the process can usually be recreated as Global Temporary Tables. This will produce better performance and scalability than any option that involves retaining the permanent table.

However, we are seeing more processing in PeopleSoft applications done with Application Engine. If restart has been disabled for an Application Engine program, then its temporary records can also be rebuilt as Global Temporary Tables, because their contents do not need to be preserved for another session to pick up.

Otherwise, move the temporary records and their indexes to a tablespace with a 32Kb block size. The change of assigned tablespace can be managed within Application Designer, and released like any other patch or customisation. A 32Kb buffer cache must be created in the database instance. Sizing this is a trade-off between how much memory can be taken from other activities to cache just the working storage tables, and how much physical I/O you are going to have to wait for. Oracle's Automatic Shared Memory Management is of no assistance here (until Oracle 11g); the KEEP, RECYCLE, and non-default block size buffer caches must be sized manually (see the Oracle Reference Manual for SGA_TARGET).
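The 32Kb buffer cache is created by setting the DB_32K_CACHE_SIZE initialisation parameter, for example (the 256M figure is only a placeholder for whatever your own trade-off suggests):

-- Must be set before any 32Kb tablespace can be created or used.
-- SCOPE=BOTH assumes the instance uses an spfile.
ALTER SYSTEM SET db_32k_cache_size = 256M SCOPE=BOTH;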

No change to the application code is required. There is no performance improvement to be obtained by customising the application code, either to add the REUSE STORAGE option to the TRUNCATE TABLE commands or to use DELETE commands instead.

Added 4.7.2009: Tests on Oracle 10.2.0.3 have shown that there is an advantage to putting the working storage tables into a non-ASSM tablespace. ASSM introduces additional blocks to map the storage in a segment, and these are also maintained during a truncate. Not having to maintain these blocks saved 25% of the elapsed run time on some T&L batch processes.
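So, combining this with the earlier recommendations, the tablespace sketched above would be created with manual segment space management (again, the name and sizes are illustrative):

-- As before, but with manual (freelist) segment space management,
-- so there are no ASSM bitmap blocks to maintain during truncate.
CREATE TABLESPACE ptwork32k
DATAFILE '/u01/oradata/ptwork32k01.dbf' SIZE 1G
BLOCKSIZE 32K
EXTENT MANAGEMENT LOCAL UNIFORM SIZE 10M
SEGMENT SPACE MANAGEMENT MANUAL;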

Oracle Bug 4224840/4260477

Added 2.4.2009: Unfortunately, nothing is quite as simple as it seems. If you have a transaction that locks more than 4095 rows in a 32Kb block, you can encounter block corruption (bug 4224840). The fix/workaround in Oracle 10g (bug 4260477) is that the transaction will fail with the message below before the corruption occurs.

There is an excellent explanation of this problem, and a test script to reproduce it, on Hemant's Oracle DBA Blog:

ORA-08007: Further changes to this block by this transaction not allowed

This error will not be resolved until Oracle 11g. However, it does not occur with smaller block sizes. The workaround is either to commit more frequently (sketched below) or to move the table concerned back to a tablespace with a smaller block size. I have run into this with Time & Labor in a particular scenario.
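For completeness, the frequent-commit workaround amounts to batching the DML so that no single transaction can lock more than 4095 rows in any one block. This is only a sketch; the table, predicate, and batch size are hypothetical:

-- Update in batches of 1000 rows, committing after each batch, so
-- a transaction never holds more than 1000 row locks.
DECLARE
  l_rows INTEGER;
BEGIN
  LOOP
    UPDATE wrk_t1 SET hours = 0
    WHERE  hours IS NULL AND ROWNUM <= 1000;
    l_rows := SQL%ROWCOUNT;
    COMMIT;
    EXIT WHEN l_rows = 0;
  END LOOP;
END;
/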

5 comments:

Noons said...

Hey David:
This is only for 9.2.0.4, isn't it?
I went through all the related entries in Metalink and it seems to be specific to that version, with patches available in later versions and all fixed up in 10g.
Is that indeed the case?

David Kurtz said...

Use of the RO enqueue during truncate operations and the resulting serialisation is not a fault in Oracle. It is how the database works. My tests have been run on 10.2.0.3 and on various platforms.

Anonymous said...

Can you use a partitioned table, with a partition for each process?

David Kurtz said...

Partitioning the working storage tables for Application Engine isn't viable because the default behaviour is to truncate the whole of the temporary table. Deleting the rows for a single process (which happens when you run out of instances of temporary records and have to use the shared table) is very likely to be slower than truncating.

Even if you could devise a partitioning strategy that would map processes to partitions, and could also change the code to truncate just the partition on which you are working, you would still have the same contention problems with the RO enqueue.

However, partitioning permanent result tables is a very effective strategy for processes such as PeopleSoft Global Payroll, where you define ranges of EMPLIDs and process different ranges in different processes. GP uses COBOL processes, which are not restartable. The working storage tables can be recreated as Global Temporary Tables, and there is no inter-process contention on delete or truncate. Nor is there any need for consistent read on the partitioned tables.

Mike said...

Hi David,

To optimize the performance of our batch processes (T&L Load, Paysheet, Paycalc, Payconfirm, Benefit Calc, plus other custom processes) we tried moving the PSINDEX tablespace to a 32KB block size, and when we tested the processes we saw a performance improvement. However, online navigation (OLTP) performance degraded because of the contention. So we analyzed the usage and moved the indexes used by online activity back to the 8KB block size, and performance improved. So we rolled the changes out to production. However, instead of improving, performance in production became worse. We checked the init.ora, DB architecture, DB patches, OS and hardware layout: everything is the same, except that production has more resources allocated and more memory available.

Have you seen other clients implement multiple database block sizes within the same instance for a PeopleSoft application on 11gR2 with a Unicode character set, running PeopleTools 8.51 and PeopleSoft HCM 9.1?

Any insights or experience you can share will be appreciated.

Thanks