Tuesday, July 14, 2009

Oracle Terminated Connection Timeout

I have recently come across situations on two different PeopleSoft sites where ad-hoc queries continue to run on the Oracle database server long after the Application Server process, which is the Oracle session client, has terminated. Often, queries perform poorly because they are poorly coded, but that is another story. To help guard against this situation Oracle has mechanism called Terminated Connection Timeout (also known as Dead Connection Detection (DCD) when it was introduced in Net8).

Oracle Support Note 615782.1 explains the mechanism. "DCD is initiated on the server when a connection is established. At this time SQL*Net reads the SQL*Net parameter files and sets a timer to generate an alarm. The timer interval is set by providing a non-zero value in minutes for the SQLNET.EXPIRE_TIME parameter in the sqlnet.ora file on the database server side. When the timer expires, SQL*Net on the server sends a 'probe' packet, essentially an empty SQL*Net packet, to the client. If the client end of the connection is still active, the probe is discarded, and the timer mechanism is reset. If the client has terminated abnormally, the server will receive an error from the send call issued for the probe, and SQL*Net on the server will signal the operating system to release the connection's resources."

Thus, if a PeopleSoft operator initiates an ad-hoc query that runs on the PSQRYSRV server for longer than the ICQuery service time-out (default 1200 seconds), then Tuxedo will terminate and restart the busy server process. However, the query will continue run on the database server until the current fetch operation returns. In the case of a query performing a large sort or hash operation, it might be a long time before the first row is returned. All the while, the query is continuing to consume resources on the database.

PeopleTools 8.44, also introduced the ability to kill a query via the Query Monitor that had reached a maximum run time. This is one of the functions of the PSMONITORSRV server process (see Oracle Support Note 624339.1. The maximum run time is specified in a permission list (see Security Administration PeopleBook Permission List Query Profile) and then the ability to kill queries that have timed out can be enabled or disabled system-wide (see PeopleSoft Query PeopleBook Query Administration. It will kill the application server process that submitted the query, but for the same reasons, the query may continue to run on the database.

Thus, setting Terminated Connection Timeout is not merely a good idea for a PeopleSoft system running on Oracle, it is effectively mandatory. Otherwise. some PeopleSoft functionality simply won't work as intended.

What is an appropriate value for SQLNET.EXPIRE_TIME?

The value for this parameter is the time between successive SQL*Net probes sent by the Oracle shadow server process to the client. Setting it is a balance between the maximum time that a query can be left to consume resources after a client process terminates, against the additional overhead of every client process sending a probe every few minutes.

The SQL*Net documents often talk about additional network traffic generated by DCD. This was a consideration in the past on client-server applications that ran across a wide area network. However, it is rarely a consideration in relatively modern systems such as PeopleTools 8, the database connections are made by Application Server and Process Scheduler, which are usually physically close to the database server.

The time-out can be set independently of any of the other time-outs for the Application Server and Web Server. Documents on Metalink often suggest 5 or 10 minutes, and I don't think that is unreasonable.

My thanks to Colin Kilpatrick who prompted me to look at this again.

1 comment :

Anonymous said...

Thera are a lot of problem with SQLNET.EXPIRE_TIME in 10R2, simple it doesn't work on various platform.
I recomend to use 11g client and param"
tcp.connect_timeout
sqlnet.send_timeout
sqlnet.recv_timeout
This work's more efficient, it's really work :)