This is the mail archive of the cygwin-developers mailing list for the Cygwin project.
Re: Cygwin multithreading performance
- From: Mark Geisert <mark at maxrnd dot com>
- To: cygwin-developers at cygwin dot com
- Date: Tue, 8 Dec 2015 22:10:24 -0800
- Subject: Re: Cygwin multithreading performance
- Authentication-results: sourceware.org; auth=none
- References: <20151126093427 dot GJ2755 at calimero dot vinschen dot de> <5656DDEF dot 9070603 at maxrnd dot com> <5662C199 dot 7040906 at maxrnd dot com> <CABPLAST5EnifrAQ2xKZmohKhyxQHh=K3x3DeCL+BTdHN8nN98w at mail dot gmail dot com> <566367C8 dot 5020703 at maxrnd dot com> <CABPLASSY3WWpHAeh=5gqRKdg85M8Wzkrq9qMaDhzhk2zvxgcOw at mail dot gmail dot com> <5663EB9A dot 40002 at maxrnd dot com> <CABPLASQZrDMnN32GG3-qRsnHhjsoroaY7ti1wx5uASDqdU7M+g at mail dot gmail dot com> <5666B61F dot 9050209 at maxrnd dot com> <20151208153438 dot GL22073 at calimero dot vinschen dot de> <20151208170244 dot GA10155 at calimero dot vinschen dot de>
Corinna Vinschen wrote:
> On Dec 8 16:34, Corinna Vinschen wrote:
>> On Dec 8 02:51, Mark Geisert wrote:
>>> caller 0x018014CC77, count 1, L, /oss/src/winsup/cygwin/thread.cc:475
>>> caller 0x018014CD00, count 1, U, /oss/src/winsup/cygwin/thread.cc:496
>>> caller 0x018014CDAF, count 432, L, /oss/src/winsup/cygwin/thread.cc:971
>>> caller 0x018014CDE6, count 432, U, /oss/src/winsup/cygwin/thread.cc:982
>>> caller 0x018014D07E, count 1, L, /oss/src/winsup/cygwin/thread.cc:1946
>>> caller 0x018014D090, count 1, U, /oss/src/winsup/cygwin/thread.cc:1951
>>> caller 0x018014D7E6, count 1, L, /oss/src/winsup/cygwin/thread.cc:525
>>> caller 0x018014D7FF, count 1, U, /oss/src/winsup/cygwin/thread.cc:533
>>> caller 0x018014EDD7, count 1, U, /oss/src/winsup/cygwin/thread.cc:2400
>>> caller 0x018014EE97, count 1, L, /oss/src/winsup/cygwin/thread.cc:2389
>> This is interesting.  I'm not sure if anything in the rest of the
>> output shows how much is wasted on the above two calls, though.
>>
>> thread.cc:971 and thread.cc:982 are pthread_setcancelstate, and it's
>> called pretty often as part of stdio functions.  Every stdio function
>> which has to lock the FILE structure also calls pthread_setcancelstate
>> to disable and re-enable cancellation before and after locking.  That's
>> almost any stdio function.
>>
>> This may be one of the problems which lower performance, but there's no
>> easy or quick way around that, AFAICS.
>> There's also the fact that, even for tools using __fsetlocking to
>> disable stdio locking, pthread_setcancelstate will still be called
>> unconditionally.  The question is whether that's wrong and
>> pthread_setcancelstate should be skipped if the application sets
>> FSETLOCKING_BYCALLER.
> For a start, I simply removed the mutex lock/unlock in calls to
> pthread_setcancelstate and pthread_setcanceltype.  These locks are
> completely unnecessary.  These functions are only called for the
> current thread anyway.
>
> I'm just creating a developer snapshot which I'll upload to
> https://cygwin.com/snapshots/ in half an hour at the latest.  Please
> have a look if your testcase behaves better now.
I'll give the snapshot a try, thanks, but it's hard for me to see :) how
eliminating a few hundred locks and unlocks, 75% of which complete in
under 1 usec anyway, will help much. I may have unduly influenced anybody
following this by only showing detailed stats for the last two locks,
which I picked at random.
What I'm more focused on are the locks being heavily used by git and
heavily contended by the four git threads in the OP's testcase. Here are
the caller stats for this set of locks...
lock latency buckets (usec): <=1 <=10 <=100 <=1000 >1000
mtx 4908/06002A00F0 lks 155066, ulks 155066, 66359 4395 78895 4742 675
caller 0x0100454FAD, count 154792, L, /usr/src/git/builtin/pack-objects.c:1758
caller 0x0100454FFC, count 154777, U, /usr/src/git/builtin/pack-objects.c:1769
caller 0x010045557F, count 15, U, /usr/src/git/builtin/pack-objects.c:1760
caller 0x010045588D, count 15, L, /usr/src/git/builtin/pack-objects.c:1944
caller 0x01004558AC, count 15, U, /usr/src/git/builtin/pack-objects.c:1947
caller 0x01004598AC, count 15, L, /usr/src/git/builtin/pack-objects.c:2039
caller 0x01004599B9, count 11, U, /usr/src/git/builtin/pack-objects.c:2078
caller 0x0100459A7B, count 4, U, /usr/src/git/builtin/pack-objects.c:2078
caller 0x018014DB4F, count 244, U, /oss/src/winsup/cygwin/thread.cc:1254
caller 0x018014DB93, count 15, L, /oss/src/winsup/cygwin/thread.cc:1286
caller 0x018014DBD3, count 229, L, /oss/src/winsup/cygwin/thread.cc:1286
mtx 4908/0600FA6860 lks 154745, ulks 154745, 56092 3217 64883 25435 5118
caller 0x0100455659, count 1011, L, /usr/src/git/builtin/pack-objects.c:1640
caller 0x0100455687, count 1011, U, /usr/src/git/builtin/pack-objects.c:1642
caller 0x01004556E0, count 153734, L, /usr/src/git/builtin/pack-objects.c:1628
caller 0x010045570E, count 153734, U, /usr/src/git/builtin/pack-objects.c:1630
mtx 4908/0603498600 lks 316825, ulks 316825, 146254 4986 155345 9686 554
caller 0x0100455269, count 196769, L, /usr/src/git/builtin/pack-objects.c:1695
caller 0x01004552C4, count 15148, U, /usr/src/git/builtin/pack-objects.c:1705
caller 0x0100455478, count 181621, U, /usr/src/git/builtin/pack-objects.c:1702
caller 0x010045554C, count 120056, L, /usr/src/git/builtin/pack-objects.c:1834
caller 0x010045556E, count 120056, U, /usr/src/git/builtin/pack-objects.c:1837
If you look at the ">1000" column, you see that these 3 locks account for
675+5118+554 = 6347 waits of more than 1000 usec each, i.e. at least
6347ms = 6.347sec spent waiting on these locks, best case, for just this
1000+ usec latency bucket. The next column to the left (a faster bucket,
for latencies of 101us up to 1000us) has 4742+25435+9686 = 39863 waits,
at a best case of 101us each, giving at least another 4.026sec spent
waiting on these locks.
I probably need to modify my stats-gathering program to display the total
time spent waiting, to make it clearer how things add up for each mutex
across the whole testcase. Again, these stats are not timing how long
locks are being held, but rather how long it takes just to acquire a lock.
It may be, when all is said and done, that this is the price one pays for
POSIX-compliant pthread_mutexes. It just kind of bugs me that the testcase
works fine under MinGW using MS CriticalSection ops, but we don't seem to
have anything similarly light-weight in our pthread_mutex implementation.
Or I'm missing it :).
Thanks for reading,
..mark