From owner-freebsd-stable@FreeBSD.ORG Fri Feb 17 16:17:55 2012 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from [127.0.0.1] (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by hub.freebsd.org (Postfix) with ESMTP id ADFB41065674; Fri, 17 Feb 2012 16:17:54 +0000 (UTC) (envelope-from jkim@FreeBSD.org) From: Jung-uk Kim To: "davidxu@freebsd.org" Date: Fri, 17 Feb 2012 11:17:41 -0500 User-Agent: KMail/1.6.2 References: <4F3C2671.3090808__7697.00510795719$1329343207$gmane$org@freebsd.org> <4F3E0A90.6080400@freebsd.org> <4F3E39EF.3030209@gmail.com> In-Reply-To: <4F3E39EF.3030209@gmail.com> MIME-Version: 1.0 Content-Disposition: inline Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201202171117.46626.jkim@FreeBSD.org> Cc: Alexander Kabaev , "threads@freebsd.org" , Julian Elischer , FreeBSD Stable , Andriy Gapon Subject: Re: pthread_cond_timedwait() broken in 9-stable? (from JAN 10) X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 17 Feb 2012 16:17:55 -0000 On Friday 17 February 2012 06:28 am, David Xu wrote: > On 2012/2/17 16:06, Julian Elischer wrote: > > On 2/16/12 11:41 PM, Julian Elischer wrote: > >> adding jkim as he seems to be the last person working with TSC. > >> > >> On 2/16/12 6:42 PM, David Xu wrote: > >>> On 2012/2/17 10:19, Julian Elischer wrote: > >>>> On 2/16/12 5:56 PM, David Xu wrote: > >>>>> On 2012/2/17 8:42, Julian Elischer wrote: > >>>>>> Adding David Xu for his thoughts since he reqrote the code > >>>>>> in quesiton in revision 213098 > >>>>>> > >>>>>> On 2/16/12 2:57 PM, Julian Elischer wrote: > >>>>>>> On 2/16/12 1:06 PM, Julian Elischer wrote: > >>>>>>>> On 2/16/12 9:34 AM, Andriy Gapon wrote: > >>>>>>>>> on 15/02/2012 23:41 Julian Elischer said the following: > >>>>>>>>>> The program fio (an IO test in ports) uses pthreads > >>>>>>>>>> > >>>>>>>>>> the following code (from fio-2.0.3, but its in earlier > >>>>>>>>>> code too) has suddenly started misbehaving. > >>>>>>>>>> > >>>>>>>>>> clock_gettime(CLOCK_REALTIME,&t); > >>>>>>>>>> t.tv_sec += seconds + 10; > >>>>>>>>>> > >>>>>>>>>> pthread_mutex_lock(&mutex->lock); > >>>>>>>>>> > >>>>>>>>>> while (!mutex->value&& !ret) { > >>>>>>>>>> mutex->waiters++; > >>>>>>>>>> ret = > >>>>>>>>>> pthread_cond_timedwait(&mutex->cond,&mutex->lock,&t); > >>>>>>>>>> mutex->waiters--; > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> if (!ret) { > >>>>>>>>>> mutex->value--; > >>>>>>>>>> pthread_mutex_unlock(&mutex->lock); > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> It turns out that 'ret' sometimes comes back instantly > >>>>>>>>>> (on my machine) with a > >>>>>>>>>> value of 60 (ETIMEDOUT) > >>>>>>>>>> despite the fact that we set the timeout 10 seconds into > >>>>>>>>>> the future. > >>>>>>>>>> > >>>>>>>>>> Has anyone else seen anything like this? > >>>>>>>>>> (and yes the condition variable attribute have been set > >>>>>>>>>> to use the REALTIME clock). > >>>>>>>>> > >>>>>>>>> But why? > >>>>>>>>> > >>>>>>>>> Just a hypothesis that maybe there is some issue with > >>>>>>>>> time keeping on that system. > >>>>>>>>> How would that code work out for you with MONOTONIC? > >>>>>>>> > >>>>>>>> Jens Axboe, (CC'd) tried both CLOCK_REALTIME and > >>>>>>>> CLOCK_MONOTONIC, and they both had the same problem.. > >>>>>>>> i.e. random early returns with ETIMEDOUT. > >>>>>>>> > >>>>>>>> I think we will try move out machine forward to a newer > >>>>>>>> -stable to see if it resolves. > >>>>>>> > >>>>>>> Kan upgraded the machine today to today's 9.x branch tip > >>>>>>> and the problem still occurs. > >>>>>>> 8.x does not have this problem. > >>>>>>> > >>>>>>> I have not got a 9-RELEASE machine to test on.. so I can > >>>>>>> not tell if this came in with the burst of stuff > >>>>>>> that came in after the 9.x branch was unfrozen after the > >>>>>>> release of 9.0. > >>>>> > >>>>> I am trying to reproduce the problem, do you have complete > >>>>> sample code to test ? > >>>> > >>>> I'm still looking the exact set > >>>> but on my machine (4 cpus) the program from ports sysutils/fio > >>>> exhibits the problem when used with > >>>> kern.timecounter.hardware=TSC-low and with the following > >>>> config file: > >>>> > >>>> pu05 # cat config.fio > >>>> > >>>> [global] > >>>> #clocksource=cpu > >>>> direct=1 > >>>> rw=randread > >>>> bs=4096 > >>>> fill_device=1 > >>>> numjobs=16 > >>>> iodepth=16 > >>>> #ioengine=posixaio > >>>> #ioengine=psync > >>>> ioengine=psync > >>>> group_reporting > >>>> norandommap > >>>> time_based > >>>> runtime=60000 > >>>> randrepeat=0 > >>>> > >>>> [file1] > >>>> filename=/dev/ada0 > >>>> > >>>> pu05 # > >>>> pu05 # fio config.fio > >>>> fio: this platform does not support process shared mutexes, > >>>> forcing use of threads. Use the 'thread' option to get rid of > >>>> this warning. file1: (g=0): rw=randread, bs=4K-4K/4K-4K, > >>>> ioengine=psync, iodepth=16 ... > >>>> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync, > >>>> iodepth=16 fio 2.0.3 > >>>> Starting 15 threads and 1 process > >>>> fio: job startup hung? exiting. > >>>> fio: 5 jobs failed to start > >>>> Segmentation fault (core dumped) > >>>> pu05# > >>>> > >>>> > >>>> The reason 5 jobs failed to start is because the parent timed > >>>> out on them immediately. > >>>> It didn't time out on 10 of them apparently. > >>>> > >>>> > >>>> if I set the timer to ACPI-fast it works as expected.. > >>> > >>> maybe following code can check to see if TSC-LOW works by let > >>> the thread run > >>> on each cpu. > >>> > >>> gettimeofday(&prev, NULL); > >>> int cpu = 0; > >>> for (;;) { > >>> cpuset_t set; > >>> cpu = ++cpu % 4; > >>> CPU_ZERO(&set); > >>> CPU_SET(cpu, &set); > >>> pthread_setaffinity_np(pthread_self(), sizeof(set), &set); > >>> gettimeofday(&cur, NULL); > >>> if ( timercmp(&prev, &cur, >=)) { > >>> abort(); > >>> } > >>> } > > > > pu05# sysctl kern.timecounter.hardware=TSC-low > > kern.timecounter.hardware: ACPI-fast -> TSC-low > > pu05# ./test > > ^C > > pu05# cat test.c > > > > #include > > #include > > #include > > #include > > > > #include > > > > main() > > { > > int cpu = 0; > > struct timeval prev, cur; > > > > gettimeofday(&prev, NULL); > > for (;;) { > > cpuset_t set; > > cpu = ++cpu % 4; > > CPU_ZERO(&set); > > CPU_SET(cpu, &set); > > pthread_setaffinity_np(pthread_self(), sizeof(set), > > &set); gettimeofday(&cur, NULL); > > if ( timercmp(&prev, &cur, >)) { > > abort(); > > } > > prev = cur; > > } > > } > > > > pu05# ./test > > > > minutes pass....... > > > > ^C > > pu05# > > > > so it looks as if the TSC is working ok.. > > I'm just going to check that the program is actually moving > > CPU... yes it is moving around but I can't tell at what speed. > > (according to top). > > > > so we are still left with a question of "where is the problem?" > > > > kernel TSC driver? > > generic gettimeofday() code? > > pthreads cond code? > > the application? > > I am running the fio test on my notebook which is using TSC-low, > it is on 9.0-RC3, I can not reproduce the problem for > minutes, then I interrupt it with ctrl-c: > > http://people.freebsd.org/~davidxu/tsc_pthread/dmesg.txt > http://people.freebsd.org/~davidxu/tsc_pthread/tc.txt > http://people.freebsd.org/~davidxu/tsc_pthread/fio.txt Your CPU is single-package, dual-core, and SMT-enabled. All cores should be in perfect sync. Jung-uk Kim