Help with Failure

Giganews Newsgroups
Subject: Help with Failure
Posted by:  Phil Shrimpton (phil@nospam.co.uk)
Date: Sun, 31 Aug 2003

Hi,

Difficult one this, any help would be appreciated.

We have an NT Service based application that uses TidTCPServer/Client
(Indy 8).  This application receives small (<1Kb) 'packets' from remote
clients at very regular intervals, but the clients are not connected
'constantly'.  Basically the process is...

Client Connect
Handshake
Send Packet
Client Disconnect

..and the whole process takes only a few milliseconds.
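
For readers who want something concrete, the cycle above can be
sketched roughly as follows. This is a Python illustration only, not
the Delphi/Indy code in question; the HELLO/OK handshake and the
function name send_packet are assumptions made up for the example,
since the real handshake is not described here.

```python
import socket

HANDSHAKE = b"HELLO\n"   # hypothetical greeting; the real protocol is not shown
EXPECTED_REPLY = b"OK"   # hypothetical server acknowledgement


def send_packet(host, port, payload, timeout=5.0):
    """Run one short-lived cycle: connect, handshake, send, disconnect."""
    # Both the socket and its buffered reader are closed on leaving the block.
    with socket.create_connection((host, port), timeout=timeout) as sock, \
            sock.makefile("rb") as reader:
        sock.sendall(HANDSHAKE)
        reply = reader.readline()
        if not reply.startswith(EXPECTED_REPLY):
            raise ConnectionError("handshake rejected: %r" % reply)
        sock.sendall(payload)
    return len(payload)
```

The timeout matters: without one, a client talking to a server whose
listener has died can hang on the handshake instead of failing fast.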

This system works fine in our test lab and on other sites, except at one
customer's setup.  Basically what happens is that after a few days our
'server' stops responding to clients and needs to be restarted.  After
putting extensive logging and stack tracing into the system, the only
difference we can see between when the system is working and when it
is not is that there is no TidListener thread in the stack trace (which
makes sense, as it is not listening).  All the other 'code' in the
application seems to be working as expected.

We also added code to log every exception raised in the OnExecute event
of the TidTCPServer, but other than a few 10054 (connection reset by
peer) errors a day, nothing else of use showed up.

Unfortunately, this application is running on a very secure site, and
it is hard enough to get people on site, let alone install debug tools
and 'test versions'.  The customer is also getting fed up with us
sending new versions (containing logging stuff, and fixes to things we
are guessing are going wrong), and now just wants a fixed version.  So
we are basically down to our last shot.  At this stage we are happy not
to find (and fix) the real error, as long as the service can detect
when it goes down and recover by itself.

So my questions are...

- What do people think would cause the TidListener thread to
'disappear'?
- Any ideas for additional error logging and other logging to help
track down the problem?
- How can we detect the failure, and what is the best way to recover
from it?
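
To make the last question concrete, here is a minimal sketch of the
kind of external check we have in mind, again in Python rather than
Delphi.  Everything in it is an assumption for illustration: the probe
function, the Watchdog class, the HELLO greeting and the restart
callback are hypothetical names, not part of Indy or of our system.
The probe waits for a reply rather than just connecting, because a
plain TCP connect can still succeed from the listen backlog when
nothing is accepting connections any more.

```python
import socket


def probe(host, port, timeout=3.0, hello=b"HELLO\n"):
    """Return True only if the server answers the (hypothetical) greeting."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(hello)
            # An empty read means the peer closed without answering.
            return bool(sock.recv(1))
    except OSError:
        # Covers refused connections, resets and timeouts.
        return False


class Watchdog:
    """Call restart() after max_failures consecutive failed checks."""

    def __init__(self, check, restart, max_failures=3):
        self.check = check
        self.restart = restart
        self.max_failures = max_failures
        self.failures = 0

    def tick(self):
        """Run one check; return True if a restart was triggered."""
        if self.check():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.failures = 0
            self.restart()
            return True
        return False
```

A caller would run tick() on a timer, with check wrapping probe and
restart wrapping whatever restarts the service.  Requiring several
consecutive failures is meant to avoid restarting on a single dropped
connection.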

If we can't find a solution in the newsgroups etc., we are considering
using 'commercial support', but are concerned that without being able
to recreate this issue on anything but this customer's setup (to which
access will be impossible), nothing much can be done.

I am able to post code snippets/units if anyone wishes to have a look,
but due to the sensitivity of some of the code, I am unable to post the
complete, compilable system.

Many thanks for any thoughts you may have.

Cheers

Phil

Replies