Had a lot of problems with SQLNET over our VPN recently. The problem is that any queries that did not return a result for long periods of time would just hang. This typically occurred when doing counts from large tables. If the query took longer than about 15 mins then ,intermittently, it would hang. Sometimes it would work fine and the next time hang, no error message...nothing. This made it virtually impossible to handle or code around. Oddly, if you just logged into SQLPLUS and did nothing for 24 hours there was never a problem.
I discovered that the hanging was nothing to do with the database but had to be somewhere in the network. I concluded this after I found I could reproduce the problem simply by doing a dbms_lock.sleep(2100) (35 mins). This proved the database was not hanging but had something to do with the comms between the sqlnet client and the database server. Furthermore when the problem occurred the server was showing that it was 'waiting for client' even after the allotted sleep time had finished. I tried many different combinations of Oracle client and server versions running both Windows and Unix, all produced the same behaviour...so not a version bug.
During a long and difficult investigation we discovered no packets are sent between client and server while a long running SQL executes. When the server eventually responds, having finished the query execution, the firewall fails to sent the packet on. The details of why this should happen we have never worked out.
Someone came up with the bright idea of turning on EXPIRE_TIME (EXPIRE_TIME=5) in the sqlnet.ora on the server. This causes the server to check the client is still alive every 5 minutes. If the client has died then SQLNET tidys up it's processes on the server. A side effect of this checking is that a conversation occurs between server and client every 5 minutes, enough to convince the firewall that the session is still valid and not chop the comms between the two.
After a brief hiccup (discussed in next blog entry) it fixed the problem and SQLNET is now 100% reliable.
The whole investigation took approximately one month and many, many man hours and the solution was depressingly simple!
Thursday, 13 September 2007
Subscribe to:
Post Comments (Atom)
6 comments:
This was very helpful! Thank you for posting this. We are having a similar issue with a client who recently moved a server. Thanks!
Firewalls will typically time out a session after a period of inactivity, depending on the brand of firewall, ie checkpoint typically 60 mins, netscreen 30 mins and so on. This is to prevent denial of service attacks (as long held sessions tie up finite resources) DBA's typically complain that the firewall has broken their DB, because it worked fine in the lab with no firewall. Application keepalives, or Dead host detection is the best solution for this, glad you figured that out.
Really Thanks.It will more helpfull
Nice! Topic
It has help me a lot.
------
SriRaj
Thanx for the insight! I was actually reading this a day or so after we implemented the expire_time but your post increases my confidence that it will do the trick in our situation. I'm connecting to oracle using OLE DB from a job using SQL Server Integration Services and that is scheduled to run every 5 minutes. And we can see from our logs that a connection is first dropped and then all subsequent connects fails, seemingly because the failed connection is orphaned and never actually cleaned up. I'm hoping that the expire_time setting will make sure that things are properly cleaned up so we can connect again at the next scheduled run of the job.
This was very, very helpful. Thank You!
Post a Comment