This is the mail archive of the cygwin mailing list for the Cygwin project.



1.5.25-11: Process not exiting fully on Win2k3 server


I'm continuing to debug the problem I reported earlier [1], where some
executables called from within shell scripts don't fully terminate.
This happens fairly often while sourcing the default /etc/profile, but
not always, and not always in the same place.  When this happens, and I
use Windows tools to look at the running processes, I see that the
WinPID process has exited, but that the PID (bash.exe) process has not.
The process is reported in ps, but is not killable (since kill doesn't
think that it exists).  The entries exist in /proc/<pid>/, but are not
usable.  (/proc/<pid>/{cwd,root} point to <defunct>, cat
/proc/<pid>/status gives a newline character, etc.)

I believe that I've isolated approximately where in the code this seems
to go wrong.  For the purposes of this example, a /bin/bash instance
with PID 3436 spawned another /bin/bash with PID 6036 which spawned
/bin/echo with PID 1932.  This shows up in ps as follows:

$ ps
      PID    PPID    PGID     WINPID  TTY  UID    STIME COMMAND
     6036    3436    3436       1932  con  500   Mar 18 /usr/bin/echo

Looking through strace output, echo finishes its processing:
...
16:45:02 [main] echo 6036 pinfo::exit: Calling ExitProcess n 0x0,
exitcode 0x0
...
16:45:02 [main] echo 6036 pinfo::exit: Calling ExitProcess n 0x0,
exitcode 0x0
...

Later in the strace log, the parent bash process considers whether or
not to clean up the process:
16:45:02 [sig] bash 3436 checkstate: nprocs 2
16:45:02 [sig] bash 3436 stopped_or_terminated: considering pid 6036
16:45:02 [sig] bash 3436 stopped_or_terminated: considering pid 5652
16:45:02 [sig] bash 3436 remove_proc: removing procs[1], pid 5652,
nprocs 2
16:45:02 [main] bash 3436 wait4: 0 = WaitForSingleObject (...)
16:45:02 [main] bash 3436 wait4: intpid -1, status 0x23BAA8, w->status
0, options 0, res 5652
16:45:02 [sig] bash 3436 checkstate: returning 1

And again:

16:45:02 [main] bash 3436 checkstate: nprocs 1
16:45:02 [main] bash 3436 stopped_or_terminated: considering pid 6036
16:45:02 [main] bash 3436 checkstate: no matching terminated children
found
16:45:02 [main] bash 3436 checkstate: returning -1

Looking in sigproc.cc, I believe that it passes the first few
conditions, since PID 3436 is greater than 0, and is different from the
child PID of 6036.  Given that remove_proc was never called for 6036,
stopped_or_terminated must have returned 0.  That means the following
check must have taken its early return, i.e. neither the PID_EXITED
test nor the WUNTRACED/stopsig test held for 6036:

  if (!((terminated = (child->process_state == PID_EXITED)) ||
      ((w->options & WUNTRACED) && child->stopsig)))
    return 0;

Attempts to attach gdb to any of the processes 1932, 6036, or 3436 were
all unsuccessful, and pstack does not appear to exist in Cygwin.  (Even
if it did, I'd be surprised if it could attach to a process that gdb
could not.)  Since other users on this machine rely on Cygwin to do
their work, I'd really rather not recompile the DLL to add more
debugging output to determine what is going on here.

Could anyone with better knowledge of the source guess at the problem?
Is there some other method I could use to get that information out of
the released binaries?  Any other suggestions of workarounds or
alternate approaches?

Thanks in advance for any help!

[1] http://cygwin.com/ml/cygwin/2008-03/msg00322.html

-Sam 




