Oracle stuff: SQLNET expire

Thursday, 13 September 2007

SQLNET expire_time

Had a lot of problems with SQLNET over our VPN recently. The problem is that any queries that did not return a result for long periods of time would just hang. This typically occurred when doing counts from large tables. If the query took longer than about 15 mins then ,intermittently, it would hang. Sometimes it would work fine and the next time hang, no error message...nothing. This made it virtually impossible to handle or code around. Oddly, if you just logged into SQLPLUS and did nothing for 24 hours there was never a problem.

I discovered that the hanging was nothing to do with the database but had to be somewhere in the network. I concluded this after I found I could reproduce the problem simply by doing a dbms_lock.sleep(2100) (35 mins). This proved the database was not hanging but had something to do with the comms between the sqlnet client and the database server. Furthermore when the problem occurred the server was showing that it was 'waiting for client' even after the allotted sleep time had finished. I tried many different combinations of Oracle client and server versions running both Windows and Unix, all produced the same behaviour...so not a version bug.

During a long and difficult investigation we discovered no packets are sent between client and server while a long running SQL executes. When the server eventually responds, having finished the query execution, the firewall fails to sent the packet on. The details of why this should happen we have never worked out.

Someone came up with the bright idea of turning on EXPIRE_TIME (EXPIRE_TIME=5) in the sqlnet.ora on the server. This causes the server to check the client is still alive every 5 minutes. If the client has died then SQLNET tidys up it's processes on the server. A side effect of this checking is that a conversation occurs between server and client every 5 minutes, enough to convince the firewall that the session is still valid and not chop the comms between the two.

After a brief hiccup (discussed in next blog entry) it fixed the problem and SQLNET is now 100% reliable.

The whole investigation took approximately one month and many, many man hours and the solution was depressingly simple!

6 comments:

NYCynik said...: This was very helpful! Thank you for posting this. We are having a similar issue with a client who recently moved a server. Thanks!; 8 June 2010 at 08:15
Bond, James Bond said...: Firewalls will typically time out a session after a period of inactivity, depending on the brand of firewall, ie checkpoint typically 60 mins, netscreen 30 mins and so on. This is to prevent denial of service attacks (as long held sessions tie up finite resources) DBA's typically complain that the firewall has broken their DB, because it worked fine in the lab with no firewall. Application keepalives, or Dead host detection is the best solution for this, glad you figured that out.; 24 March 2011 at 08:43
pant said...: Really Thanks.It will more helpfull; 2 May 2011 at 22:59
SriRaj said...: Nice! Topic
It has help me a lot.

------
SriRaj; 15 July 2011 at 05:09
Unknown said...: Thanx for the insight! I was actually reading this a day or so after we implemented the expire_time but your post increases my confidence that it will do the trick in our situation. I'm connecting to oracle using OLE DB from a job using SQL Server Integration Services and that is scheduled to run every 5 minutes. And we can see from our logs that a connection is first dropped and then all subsequent connects fails, seemingly because the failed connection is orphaned and never actually cleaned up. I'm hoping that the expire_time setting will make sure that things are properly cleaned up so we can connect again at the next scheduled run of the job.; 18 October 2012 at 01:36
Gil said...: This was very, very helpful. Thank You!; 12 April 2013 at 22:05

Oracle stuff

Thursday, 13 September 2007

SQLNET expire_time

6 comments:

Blog Archive

Contributors