This is the mail archive of the cygwin mailing list for the Cygwin project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

signal delivery problem (with pthreads)


Hello,

In short:
First signal delivery on call of pthread_kill(tid, SIGUSR1) works fine
but second one delivers signal with _delay_ and _twice_, and then my
own daemon crash. This is on cygwin1.dll 1.5.10-3 and snapshot of
2004-08-21. But on cygwin1.dll 1.5.7-1 all works _fine_ and _long_ time.

In detail:
(Sorry for my bad English.)
(And sorry for long letter.)
(And much more sorry if this is not appropriate place to ask this question.)

I'm writing a daemon. Multi-threaded and TCP/IP-oriented. And it works.
On Linux and FreeBSD. And on cygwin 1.5.7-1. But there is a problem
which I'm currently trying to solve: one of the servicing client thread
hangs up with 100% CPU usage in random time periods (but with probability
nearly equal to 1) and do not response to wakeup signals.

The first investigation brought me an idea that the problem is lying in
select() call in cygwin implementation. (I have only one socket in
select() call and nothing more.) But maybe I'm wrong.

Anyway, in that moment I made the decision to update cygwin1.dll because
my cygwin1.dll 1.5.7-1 is too old and maybe a newer version solves my
problem. I got 1.5.10-3 and... my daemon crashed :( on very simple action.
I had never seen something like this with 1.5.7-1.

Then I got the (latest on that moment) snapshot of cygwin1.dll -
2004-08-21. And my daemon crashed again with same symptoms.

gdb 2003-09-20-cvs (cygwin-special) didn't help because it received
SIGSEGV in KERNEL32!IsBadWritePtr on my daemon run before the piece
of code with problem was executed.

But strace (cygwin) 1.21 gave some interesting results.

Here is some data and logs.

$ strace -o strace7.log -t -m sigp  yolopd -l yolopd7.log
$ strace -o straceB.log -t -m sigp  yolopd -l yolopdB.log

yolopd7.log (1.5.7-1) and yolopdB.log (snapshot 20040821) are
my daemon logs. Just in case.

SIGUSR1 is used by main thread to awake other sleeping threads which
perform some actions.

The SIGUSR1 handler is very simple:

static void handler_sigusr1(int sig)
{
#ifdef DEBUG
    log_msg("got SIGUSR1");
#endif
    // do nothing, only for exit from sleep()
}


$ grep -i sig yolopd7.log
2004-08-27 13:03:41 yolopd[195]: setting signal handlers done.
2004-08-27 13:03:53 yolopd[195]: ATM001: send SIGUSR1 (refresh).
2004-08-27 13:03:53 yolopd[195]: got SIGUSR1.
2004-08-27 13:04:24 yolopd[195]: ATM001: send SIGUSR1 (refresh).
2004-08-27 13:04:24 yolopd[195]: got SIGUSR1.
[...and then normal exit...]

$ grep -i sig yolopdB.log
2004-08-27 13:01:35 yolopd[149]: setting signal handlers done.
2004-08-27 13:01:50 yolopd[149]: ATM001: send SIGUSR1 (refresh).
2004-08-27 13:01:50 yolopd[149]: got SIGUSR1.
2004-08-27 13:01:59 yolopd[149]: ATM001: send SIGUSR1 (refresh).
2004-08-27 13:02:20 yolopd[149]: got SIGUSR1.
2004-08-27 13:02:20 yolopd[149]: got SIGUSR1.      <-- twice!
[...immediate crash...]

$ ls -l strace?.log
-rw-r--r--    1 Valery   None         6277 Aug 27 13:04 strace7.log
-rw-r--r--    1 Valery   None       171388 Aug 27 13:02 straceB.log

Compare the size of strace log files!  >:-[  ]

$ grep -c pending strace?.log
strace7.log:5
straceB.log:679       <-- !!!

$ grep -c "signal.*not delivered" strace?.log
strace7.log:0
straceB.log:6         <-- !!!

$ grep -E '(arriv|returning.*signal 30)' strace7.log
13:03:53 [sig] yolopd 195 _threadinfo::interrupt_setup: armed signal_arrived 0x60, sig 30, res 1
13:03:53 [main] yolopd 195 reset_signal_arrived: reset signal_arrived
13:03:53 [unknown (0x6A)] yolopd 195 sig_send: returning 0x0 from sending signal 30
13:04:24 [sig] yolopd 195 _threadinfo::interrupt_setup: armed signal_arrived 0x60, sig 30, res 1
13:04:24 [main] yolopd 195 reset_signal_arrived: reset signal_arrived
13:04:24 [unknown (0x6A)] yolopd 195 sig_send: returning 0x0 from sending signal 30

$ grep -E '(arriv|returning.*signal 30)' straceB.log
13:01:50 [sig] yolopd 149 _cygtls::interrupt_setup: armed signal_arrived 0x9C, sig 30, res 1
13:01:50 [unknown (0xC3)] yolopd 149 sig_send: returning 0x0 from sending signal 30
13:01:51 [unknown (0xC8)] yolopd 149 reset_signal_arrived: reset signal_arrived
13:02:00 [unknown (0xC3)] yolopd 149 sig_send: returning 0x0 from sending signal 30
13:02:20 [unknown (0xC8)] yolopd 149 reset_signal_arrived: reset signal_arrived
13:02:20 [sig] yolopd 149 _cygtls::interrupt_setup: armed signal_arrived 0x9C, sig 30, res 1
13:02:20 [unknown (0xC8)] yolopd 149 reset_signal_arrived: reset signal_arrived

Two things which I've found out in strace?.log.

First. In strace7.log exactly one "reset_signal_arrived" record after
each armed signal. In straceB.log for first signal there is _two_
"reset_signal_arrived" records with ~30 seconds interval.

Second. In strace7.log "reset_signal_arrived" record is present _before_
"returning 0x0 from sending signal 30". In straceB.log - _after_.

Maybe I'm saying nonsense and problem is placed in my daemon code
but not in cygwin but...

Whole trace files (strace7.log & straceB.log) are attached to this letter.

I can't make small source code to reproduce the problem. The conditions
of run of my daemon is very specific.

I played myself with this problem about a week.
And I no any idea to make something more.

Please, any suggestion?
(for signals delivering; with hang of threads I'll play myself later)

gcc 2.95.3-10 (cygwin special)   <-- not 3.x
Windows NT 4.0 Workstation SP6a+
AMD Athlon 800 MHz, 512 MB RAM

WBR,
 Valery

--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Problem reports:       http://cygwin.com/problems.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]