This is the mail archive of the
cygwin
mailing list for the Cygwin project.
RE: cygwin 1.5.20-1, spinning pdksh, 100% CPU
- From: Igor Peshansky <pechtcha at cs dot nyu dot edu>
- To: Ernie Coskrey <Ernie dot Coskrey at steeleye dot com>
- Cc: cygwin at cygwin dot com
- Date: Wed, 1 Aug 2007 11:50:25 -0400 (EDT)
- Subject: RE: cygwin 1.5.20-1, spinning pdksh, 100% CPU
- References: <76087731258D2545B1016BB958F00ADA1234D7@STEELPO.steeleye.com> <76087731258D2545B1016BB958F00ADA123578@STEELPO.steeleye.com> <76087731258D2545B1016BB958F00ADA123580@STEELPO.steeleye.com>
- Reply-to: cygwin at cygwin dot com
On Wed, 1 Aug 2007, Ernie Coskrey wrote:
> > -----Original Message-----
> > From: Igor Peshansky
> >
> > On Tue, 31 Jul 2007, Ernie Coskrey wrote:
> >
> > > I've run into a problem with cygwin 1.5.20-1 and pdksh 5.2.14. We've
> > > got a pdksh.exe process that is spinning, using all the CPU.
> > >
> > > This scenario is very hard to reproduce, but has happened on our test
> > > systems occasionally. It occurred recently, and I currently have gdb
> > > attached to the process and have the symbols loaded.
> >
> > I assume you've rebuilt pdksh from source, since the packaged binary is
> > stripped... Do you also have the symbols for the Cygwin DLL?
>
> Yes, I've built both pdksh and cygwin1.dll from source and have the
> symbols.
Ok, great.
> > > I see that pdksh is continually calling "sigsuspend()", which is
> > > immediately returning from cancelable_wait due to the fact that the
> > > signal_arrived event is set.
> >
> > Do you mean the sigpause() call? Can you see which signal it attempts
> > to suspend? Can you email me (privately, if you wish) the stack dump
> > from gdb?
>
> It's sigsuspend() in j_waitj - line 1191 in jobs.c. It calls
> sigsuspend(&sm_default), and sm_default is 0 (no signals are blocked).
>
> This immediately returns, and I see that j->state is still PRUNNING
> every time.
Hmm. That should be fine -- since it does not know the job terminated,
the status reflects pdksh's current knowledge. And since sm_default is 0,
sigsuspend ought to be the no-op it is.
> > > I also see that pdksh is waiting for a subprocess to complete, and
> > > has a handle to the PID of that process - however the process has
> > > long since terminated.
> >
> > That's normal (I think). Cygwin may not deliver SIGCHLD immediately
> > after process termination. Until pdksh gets SIGCHLD, it'll keep the
> > process handle.
> >
> > > It appears that something went wrong during delivery of SIGCHLD.
> >
> > Does this happen before or after j_sigchld() gets invoked?
>
> I suspect that j_sigchld never got invoked, or didn't run properly, but
> can't definitvely prove that.
It's set as a handler for SIGCHLD. If your theory is correct, and SIGCHLD
delivery is interrupted, then it won't have executed.
> > > I've got two questions related to this:
> > >
> > > - have there been changes between 1.5.20-1 and 1.5.24-2, or the
> > > latest snapshot, that might have fixed this issue? We've done some
> > > limited testing with 1.5.24-2 and haven't seen this happen yet, but
> > > as I said the it only happens rarely.
> >
> > Quite possibly. There were changes to signal handling since 1.5.20,
> > IIRC. Unless I'm mistaken, there's even a patch for a race condition
> > in process handling code (though it's not in 1.5.24, I think).
> >
> > > - is there anything I can look at in gdb to help identify what the
> > > issue is?
> > >
> > > Any suggestions would be appreciated!
> >
> > Posting a sequence of steps that reliably reproduces the problem for
> > you would be great (but not necessarily easy).
>
> I wish I could supply this, but the problem happens very rarely. I've
> run many thousands of test shell iterations and haven't seen it reoccur
> yet.
Right, which is why it's not necessarily easy. :-)
> > As I said above, a stack dump (with full pdksh symbols) would help...
> > That might mean that you'd need to build an unstripped pdksh and
> > attempt to reproduce the problem again.
> > Igor
>
> Here's a stack trace of the thread where the spin is occurring. The
> other threads in the process are quiet - the signal thread is is
> ReadFile as expected, and the other threads are all in stub routines
> doing WaitForSingleObject.
>
> (gdb) bt
> #0 handle_sigsuspend (tempmask=0)
> at ../../../../src/winsup/cygwin/exceptions.cc:694
> #1 0x61094b93 in sigsuspend (set=0x42db80)
> at ../../../../src/winsup/cygwin/signal.cc:477
> #2 0x610917b8 in _sigfe () at
> ../../../../src/winsup/cygwin/cygserver.h:82
> #3 0x0022c588 in ?? ()
> #4 0x600301dc in ?? ()
^^^^^^^^^^
This frame is suspicious. It's some sort of a DLL, but not Cygwin1.dll.
Can you use Process Explorer or something to find out what DLL is loaded
in that range? It might be something injected by an antivirus/firewall on
the list of dodgy apps...
> #5 0x006854d8 in ?? ()
> #6 0x00000003 in ?? ()
> #7 0x0022c588 in ?? ()
> #8 0x006874b8 in ?? ()
> #9 0x006854d8 in ?? ()
> #10 0x00000003 in ?? ()
> #11 0x0022c5a8 in ?? ()
> #12 0x004126e0 in waitlast () at ../src/jobs.c:729
> #13 0x004126e0 in waitlast () at ../src/jobs.c:729
> #14 0x0040b160 in expand (
> cp=0x6874b8
> "\001R\001M\001T\001I\001N\001S\001R\001E\001A\001S\001O\001N\001=\003$L
> KBIN/ins_list -d \"$EQVRMTSYS\" -t \"$EQVRMTTAG\" 2>NUL: | cut -d\001
> -f8", wp=0x22c6b0, f=32) at ../src/eval.c:533
> #15 0x0040a654 in evalstr (
> cp=0x6874b8
> "\001R\001M\001T\001I\001N\001S\001R\001E\001A\001S\001O\001N\001=\003$L
> KBIN/ins_list -d \"$EQVRMTSYS\" -t \"$EQVRMTTAG\" 2>NUL: | cut -d\001
> -f8", f=32) at ../src/eval.c:113
> #16 0x0040d80a in comexec (t=0x6871e0, tp=0x0, ap=0x687350, flags=0)
> at ../src/exec.c:555
> #17 0x0040cc7d in execute (t=0x6871e0, flags=0) at ../src/exec.c:155
> #18 0x0040ce39 in execute (t=0x687778, flags=0) at ../src/exec.c:192
> #19 0x0040d311 in execute (t=0x686620, flags=1) at ../src/exec.c:367
> #20 0x004124c1 in exchild (t=0x686620, flags=74, close_fd=0)
> at ../src/jobs.c:641
> #21 0x0040cdf6 in execute (t=0x686620, flags=10) at ../src/exec.c:185
> #22 0x0040ce62 in execute (t=0x688470, flags=0) at ../src/exec.c:195
> #23 0x0040d311 in execute (t=0x684ee0, flags=0) at ../src/exec.c:367
> #24 0x0041766e in shell (s=0x6839b8, toplevel=1) at ../src/main.c:616
> #25 0x00417204 in main (argc=6, argv=0x61171f74) at ../src/main.c:429
>
> Please let me know if there's any other information that would be
> useful. Thanks!
It would also be nice to find out what static library is linked into the
0x0022XXXX space...
Igor
--
http://cs.nyu.edu/~pechtcha/
|\ _,,,---,,_ pechtcha@cs.nyu.edu | igor@watson.ibm.com
ZZZzz /,`.-'`' -. ;-;;,_ Igor Peshansky, Ph.D. (name changed!)
|,4- ) )-,_. ,\ ( `'-' old name: Igor Pechtchanski
'---''(_/--' `-'\_) fL a.k.a JaguaR-R-R-r-r-r-.-.-. Meow!
Belief can be manipulated. Only knowledge is dangerous. -- Frank Herbert
--
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
Problem reports: http://cygwin.com/problems.html
Documentation: http://cygwin.com/docs.html
FAQ: http://cygwin.com/faq/