This is the mail archive of the cygwin-developers mailing list for the Cygwin project.
Re: Cygwin multithreading performance
- From: Mark Geisert <mark at maxrnd dot com>
- To: cygwin-developers at cygwin dot com
- Date: Tue, 8 Dec 2015 22:10:24 -0800
- Subject: Re: Cygwin multithreading performance
- Authentication-results: sourceware.org; auth=none
- References: <20151126093427 dot GJ2755 at calimero dot vinschen dot de> <5656DDEF dot 9070603 at maxrnd dot com> <5662C199 dot 7040906 at maxrnd dot com> <CABPLAST5EnifrAQ2xKZmohKhyxQHh=K3x3DeCL+BTdHN8nN98w at mail dot gmail dot com> <566367C8 dot 5020703 at maxrnd dot com> <CABPLASSY3WWpHAeh=5gqRKdg85M8Wzkrq9qMaDhzhk2zvxgcOw at mail dot gmail dot com> <5663EB9A dot 40002 at maxrnd dot com> <CABPLASQZrDMnN32GG3-qRsnHhjsoroaY7ti1wx5uASDqdU7M+g at mail dot gmail dot com> <5666B61F dot 9050209 at maxrnd dot com> <20151208153438 dot GL22073 at calimero dot vinschen dot de> <20151208170244 dot GA10155 at calimero dot vinschen dot de>
Corinna Vinschen wrote:
> On Dec 8 16:34, Corinna Vinschen wrote:
>> On Dec 8 02:51, Mark Geisert wrote:
>>> caller 0x018014CC77, count 1, L, /oss/src/winsup/cygwin/thread.cc:475
>>> caller 0x018014CD00, count 1, U, /oss/src/winsup/cygwin/thread.cc:496
>>> caller 0x018014CDAF, count 432, L, /oss/src/winsup/cygwin/thread.cc:971
>>> caller 0x018014CDE6, count 432, U, /oss/src/winsup/cygwin/thread.cc:982
>>> caller 0x018014D07E, count 1, L, /oss/src/winsup/cygwin/thread.cc:1946
>>> caller 0x018014D090, count 1, U, /oss/src/winsup/cygwin/thread.cc:1951
>>> caller 0x018014D7E6, count 1, L, /oss/src/winsup/cygwin/thread.cc:525
>>> caller 0x018014D7FF, count 1, U, /oss/src/winsup/cygwin/thread.cc:533
>>> caller 0x018014EDD7, count 1, U, /oss/src/winsup/cygwin/thread.cc:2400
>>> caller 0x018014EE97, count 1, L, /oss/src/winsup/cygwin/thread.cc:2389
>> This is interesting.  I'm not sure if anything in the rest of the
>> output shows how much is wasted on the above two calls, though.
>>
>> thread.cc:971 and thread.cc:982 are pthread_setcancelstate, and it's
>> called pretty often as part of stdio functions.  Every stdio function
>> which has to lock the FILE structure also calls pthread_setcancelstate
>> to disable and re-enable cancellation before and after locking.  That's
>> almost any stdio function.
>>
>> This may be one of the problems which lower performance, but there's no
>> easy or quick way around that, AFAICS.
>> There's also the fact that, even for tools using __fsetlocking to
>> disable stdio locking, pthread_setcancelstate will still be called
>> unconditionally.  The question is whether that's wrong and
>> pthread_setcancelstate should be skipped if the application sets
>> FSETLOCKING_BYCALLER.
> For a start, I simply removed the mutex lock/unlock in calls to
> pthread_setcancelstate and pthread_setcanceltype.  These locks are
> completely unnecessary.  These functions are only called for the
> current thread anyway.
>
> I'm just creating a developer snapshot which I'll upload to
> https://cygwin.com/snapshots/ in half an hour at the latest.  Please
> have a look if your testcase behaves better now.
I'll give the snapshot a try, thanks, but it's hard for me to see :) how
eliminating a few hundred locks and unlocks, 75% of which complete in
under 1 usec anyway, will help much. I may have unduly influenced anybody
following this by only showing detailed stats for the last two locks,
which I picked at random.
What I'm more focused on are the locks being heavily used by git and
heavily contended by the four git threads in the OP's testcase. Here are
the caller stats for this set of locks...
lock latency buckets (usec): <=1 <=10 <=100 <=1000 >1000
mtx 4908/06002A00F0 lks 155066, ulks 155066, 66359 4395 78895 4742 675
caller 0x0100454FAD, count 154792, L, /usr/src/git/builtin/pack-objects.c:1758
caller 0x0100454FFC, count 154777, U, /usr/src/git/builtin/pack-objects.c:1769
caller 0x010045557F, count 15, U, /usr/src/git/builtin/pack-objects.c:1760
caller 0x010045588D, count 15, L, /usr/src/git/builtin/pack-objects.c:1944
caller 0x01004558AC, count 15, U, /usr/src/git/builtin/pack-objects.c:1947
caller 0x01004598AC, count 15, L, /usr/src/git/builtin/pack-objects.c:2039
caller 0x01004599B9, count 11, U, /usr/src/git/builtin/pack-objects.c:2078
caller 0x0100459A7B, count 4, U, /usr/src/git/builtin/pack-objects.c:2078
caller 0x018014DB4F, count 244, U, /oss/src/winsup/cygwin/thread.cc:1254
caller 0x018014DB93, count 15, L, /oss/src/winsup/cygwin/thread.cc:1286
caller 0x018014DBD3, count 229, L, /oss/src/winsup/cygwin/thread.cc:1286
mtx 4908/0600FA6860 lks 154745, ulks 154745, 56092 3217 64883 25435 5118
caller 0x0100455659, count 1011, L, /usr/src/git/builtin/pack-objects.c:1640
caller 0x0100455687, count 1011, U, /usr/src/git/builtin/pack-objects.c:1642
caller 0x01004556E0, count 153734, L, /usr/src/git/builtin/pack-objects.c:1628
caller 0x010045570E, count 153734, U, /usr/src/git/builtin/pack-objects.c:1630
mtx 4908/0603498600 lks 316825, ulks 316825, 146254 4986 155345 9686 554
caller 0x0100455269, count 196769, L, /usr/src/git/builtin/pack-objects.c:1695
caller 0x01004552C4, count 15148, U, /usr/src/git/builtin/pack-objects.c:1705
caller 0x0100455478, count 181621, U, /usr/src/git/builtin/pack-objects.c:1702
caller 0x010045554C, count 120056, L, /usr/src/git/builtin/pack-objects.c:1834
caller 0x010045556E, count 120056, U, /usr/src/git/builtin/pack-objects.c:1837
If you look at the ">1000" column, you see that these 3 locks account for
675+5118+554 = 6347 waits of more than 1000 usec each, i.e. at least
6347ms = 6.347sec spent waiting on these locks, best case, for just this
1000+ usec latency bucket. The next column to the left (a faster bucket,
for latencies of 101us up to 1000us) has 4742+25435+9686 = 39863 waits,
at a best case of 101us each, giving at least another 4.026sec spent
waiting on these locks.
I probably need to modify my stats-gathering program to display the total
time spent waiting, to make it clearer how things add up for each mutex
across the whole testcase. Again, these stats are not timing how long
locks are being held, but rather how long it takes just to acquire a lock.
It may be, when all is said and done, that this is the price one pays for
POSIX-compliant pthread_mutexes. It just kind of bugs me that the testcase
works fine under MinGW using MS CriticalSection ops, but we don't seem to
have anything similarly light-weight in our pthread_mutex implementation.
Or I'm missing it :).
Thanks for reading,
..mark