Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 17 Feb 2012 11:17:41 -0500
From:      Jung-uk Kim <jkim@FreeBSD.org>
To:        "davidxu@freebsd.org" <davidxu@freebsd.org>
Cc:        Alexander Kabaev <kan@freebsd.org>, "threads@freebsd.org" <threads@freebsd.org>, Julian Elischer <julian@freebsd.org>, FreeBSD Stable <freebsd-stable@freebsd.org>, Andriy Gapon <avg@freebsd.org>
Subject:   Re: pthread_cond_timedwait() broken in 9-stable? (from JAN 10)
Message-ID:  <201202171117.46626.jkim@FreeBSD.org>
In-Reply-To: <4F3E39EF.3030209@gmail.com>
References:  <4F3C2671.3090808__7697.00510795719$1329343207$gmane$org@freebsd.org> <4F3E0A90.6080400@freebsd.org> <4F3E39EF.3030209@gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On Friday 17 February 2012 06:28 am, David Xu wrote:
> On 2012/2/17 16:06, Julian Elischer wrote:
> > On 2/16/12 11:41 PM, Julian Elischer wrote:
> >> adding jkim as he seems to be the last person working with TSC.
> >>
> >> On 2/16/12 6:42 PM, David Xu wrote:
> >>> On 2012/2/17 10:19, Julian Elischer wrote:
> >>>> On 2/16/12 5:56 PM, David Xu wrote:
> >>>>> On 2012/2/17 8:42, Julian Elischer wrote:
> >>>>>> Adding David Xu for his thoughts since he reqrote the code
> >>>>>> in quesiton in revision 213098
> >>>>>>
> >>>>>> On 2/16/12 2:57 PM, Julian Elischer wrote:
> >>>>>>> On 2/16/12 1:06 PM, Julian Elischer wrote:
> >>>>>>>> On 2/16/12 9:34 AM, Andriy Gapon wrote:
> >>>>>>>>> on 15/02/2012 23:41 Julian Elischer said the following:
> >>>>>>>>>> The program fio (an IO test in ports) uses pthreads
> >>>>>>>>>>
> >>>>>>>>>> the following code (from fio-2.0.3, but its in earlier
> >>>>>>>>>> code too) has suddenly started misbehaving.
> >>>>>>>>>>
> >>>>>>>>>>          clock_gettime(CLOCK_REALTIME,&t);
> >>>>>>>>>>          t.tv_sec += seconds + 10;
> >>>>>>>>>>
> >>>>>>>>>>          pthread_mutex_lock(&mutex->lock);
> >>>>>>>>>>
> >>>>>>>>>>          while (!mutex->value&&  !ret) {
> >>>>>>>>>>                  mutex->waiters++;
> >>>>>>>>>>                  ret =
> >>>>>>>>>> pthread_cond_timedwait(&mutex->cond,&mutex->lock,&t);
> >>>>>>>>>>                  mutex->waiters--;
> >>>>>>>>>>          }
> >>>>>>>>>>
> >>>>>>>>>>          if (!ret) {
> >>>>>>>>>>                  mutex->value--;
> >>>>>>>>>>                  pthread_mutex_unlock(&mutex->lock);
> >>>>>>>>>>          }
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> It turns out that 'ret' sometimes comes back instantly
> >>>>>>>>>> (on my machine) with a
> >>>>>>>>>> value of 60 (ETIMEDOUT)
> >>>>>>>>>> despite the fact that we set the timeout 10 seconds into
> >>>>>>>>>> the future.
> >>>>>>>>>>
> >>>>>>>>>> Has anyone else seen anything like this?
> >>>>>>>>>> (and yes the condition variable attribute have been set
> >>>>>>>>>> to use the REALTIME clock).
> >>>>>>>>>
> >>>>>>>>> But why?
> >>>>>>>>>
> >>>>>>>>> Just a hypothesis that maybe there is some issue with
> >>>>>>>>> time keeping on that system.
> >>>>>>>>> How would that code work out for you with MONOTONIC?
> >>>>>>>>
> >>>>>>>> Jens Axboe, (CC'd) tried both CLOCK_REALTIME and
> >>>>>>>> CLOCK_MONOTONIC, and they both had the same problem..
> >>>>>>>> i.e. random early returns with ETIMEDOUT.
> >>>>>>>>
> >>>>>>>> I think we will try move out machine forward to a newer
> >>>>>>>> -stable to see if it resolves.
> >>>>>>>
> >>>>>>> Kan upgraded the machine today to today's 9.x branch tip
> >>>>>>> and the problem still occurs.
> >>>>>>> 8.x does not have this problem.
> >>>>>>>
> >>>>>>> I have not got a 9-RELEASE machine to test on.. so I can
> >>>>>>> not tell if this came in with the burst of stuff
> >>>>>>> that came in after the 9.x branch was unfrozen after the
> >>>>>>> release of 9.0.
> >>>>>
> >>>>> I am trying to reproduce the problem,  do you have complete
> >>>>> sample code to test ?
> >>>>
> >>>> I'm still looking the exact set
> >>>> but on my machine (4 cpus) the program from ports sysutils/fio
> >>>> exhibits the problem when used with
> >>>> kern.timecounter.hardware=TSC-low and with the following
> >>>> config file:
> >>>>
> >>>> pu05 # cat config.fio
> >>>>
> >>>> [global]
> >>>> #clocksource=cpu
> >>>> direct=1
> >>>> rw=randread
> >>>> bs=4096
> >>>> fill_device=1
> >>>> numjobs=16
> >>>> iodepth=16
> >>>> #ioengine=posixaio
> >>>> #ioengine=psync
> >>>> ioengine=psync
> >>>> group_reporting
> >>>> norandommap
> >>>> time_based
> >>>> runtime=60000
> >>>> randrepeat=0
> >>>>
> >>>> [file1]
> >>>> filename=/dev/ada0
> >>>>
> >>>> pu05 #
> >>>> pu05 # fio config.fio
> >>>> fio: this platform does not support process shared mutexes,
> >>>> forcing use of threads. Use the 'thread' option to get rid of
> >>>> this warning. file1: (g=0): rw=randread, bs=4K-4K/4K-4K,
> >>>> ioengine=psync, iodepth=16 ...
> >>>> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync,
> >>>> iodepth=16 fio 2.0.3
> >>>> Starting 15 threads and 1 process
> >>>> fio: job startup hung? exiting.
> >>>> fio: 5 jobs failed to start
> >>>> Segmentation fault (core dumped)
> >>>> pu05#
> >>>>
> >>>>
> >>>> The reason 5 jobs failed to start is because the parent timed
> >>>> out on them immediately.
> >>>> It didn't time out on 10 of them apparently.
> >>>>
> >>>>
> >>>> if I set the timer to ACPI-fast it works as expected..
> >>>
> >>> maybe following code can check to see if TSC-LOW works by let
> >>> the thread run
> >>> on each cpu.
> >>>
> >>> gettimeofday(&prev, NULL);
> >>> int cpu = 0;
> >>> for (;;) {
> >>>      cpuset_t set;
> >>>      cpu = ++cpu % 4;
> >>>      CPU_ZERO(&set);
> >>>      CPU_SET(cpu, &set);
> >>>      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
> >>>      gettimeofday(&cur, NULL);
> >>>      if ( timercmp(&prev, &cur, >=)) {
> >>>         abort();
> >>>    }
> >>> }
> >
> > pu05# sysctl kern.timecounter.hardware=TSC-low
> > kern.timecounter.hardware: ACPI-fast -> TSC-low
> > pu05# ./test
> > ^C
> > pu05# cat test.c
> >
> > #include <stdlib.h>
> > #include <sys/param.h>
> > #include <sys/cpuset.h>
> > #include <pthread_np.h>
> >
> > #include <sys/time.h>
> >
> > main()
> > {
> >     int cpu = 0;
> >     struct timeval prev, cur;
> >
> >     gettimeofday(&prev, NULL);
> >     for (;;) {
> >          cpuset_t set;
> >          cpu = ++cpu % 4;
> >          CPU_ZERO(&set);
> >          CPU_SET(cpu, &set);
> >          pthread_setaffinity_np(pthread_self(), sizeof(set),
> > &set); gettimeofday(&cur, NULL);
> >          if ( timercmp(&prev, &cur, >)) {
> >             abort();
> >        }
> >        prev = cur;
> >     }
> > }
> >
> > pu05# ./test
> >
> > minutes pass.......
> >
> > ^C
> > pu05#
> >
> > so it looks as if the TSC is working ok..
> > I'm just going to check that the program is actually moving
> > CPU... yes it is moving around but I can't tell at what speed.
> > (according to top).
> >
> > so we are still left with a question of "where is the problem?"
> >
> > kernel TSC driver?
> > generic gettimeofday() code?
> > pthreads cond code?
> > the application?
>
> I am running the fio test on my notebook which is using TSC-low,
> it is on 9.0-RC3, I can not reproduce the problem for
> minutes, then I interrupt it with ctrl-c:
>
> http://people.freebsd.org/~davidxu/tsc_pthread/dmesg.txt
> http://people.freebsd.org/~davidxu/tsc_pthread/tc.txt
> http://people.freebsd.org/~davidxu/tsc_pthread/fio.txt

Your CPU is single-package, dual-core, and SMT-enabled.  All cores 
should be in perfect sync.

Jung-uk Kim



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201202171117.46626.jkim>