Date: Fri, 17 Feb 2012 10:42:24 +0800 From: David Xu <listlog2011@gmail.com> To: Julian Elischer <julian@freebsd.org> Cc: Alexander Kabaev <kan@freebsd.org>, threads@freebsd.org, FreeBSD Stable <freebsd-stable@freebsd.org>, davidxu@freebsd.org, Andriy Gapon <avg@freebsd.org> Subject: Re: pthread_cond_timedwait() broken in 9-stable? (from JAN 10) Message-ID: <4F3DBE90.5030305@gmail.com> In-Reply-To: <4F3DB91A.2090806@freebsd.org> References: <4F3C2671.3090808__7697.00510795719$1329343207$gmane$org@freebsd.org> <4F3D3E2D.9090100@FreeBSD.org> <4F3D6FDD.9050808@freebsd.org> <4F3D89CD.9050309@freebsd.org> <4F3DA27A.3090903@freebsd.org> <4F3DB3DB.2060603@gmail.com> <4F3DB91A.2090806@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On 2012/2/17 10:19, Julian Elischer wrote: > On 2/16/12 5:56 PM, David Xu wrote: >> On 2012/2/17 8:42, Julian Elischer wrote: >>> Adding David Xu for his thoughts since he reqrote the code in >>> quesiton in revision 213098 >>> >>> On 2/16/12 2:57 PM, Julian Elischer wrote: >>>> On 2/16/12 1:06 PM, Julian Elischer wrote: >>>>> On 2/16/12 9:34 AM, Andriy Gapon wrote: >>>>>> on 15/02/2012 23:41 Julian Elischer said the following: >>>>>>> The program fio (an IO test in ports) uses pthreads >>>>>>> >>>>>>> the following code (from fio-2.0.3, but its in earlier code too) >>>>>>> has suddenly started misbehaving. >>>>>>> >>>>>>> clock_gettime(CLOCK_REALTIME,&t); >>>>>>> t.tv_sec += seconds + 10; >>>>>>> >>>>>>> pthread_mutex_lock(&mutex->lock); >>>>>>> >>>>>>> while (!mutex->value&& !ret) { >>>>>>> mutex->waiters++; >>>>>>> ret = >>>>>>> pthread_cond_timedwait(&mutex->cond,&mutex->lock,&t); >>>>>>> mutex->waiters--; >>>>>>> } >>>>>>> >>>>>>> if (!ret) { >>>>>>> mutex->value--; >>>>>>> pthread_mutex_unlock(&mutex->lock); >>>>>>> } >>>>>>> >>>>>>> >>>>>>> It turns out that 'ret' sometimes comes back instantly (on my >>>>>>> machine) with a >>>>>>> value of 60 (ETIMEDOUT) >>>>>>> despite the fact that we set the timeout 10 seconds into the >>>>>>> future. >>>>>>> >>>>>>> Has anyone else seen anything like this? >>>>>>> (and yes the condition variable attribute have been set to use >>>>>>> the REALTIME clock). >>>>>> But why? >>>>>> >>>>>> Just a hypothesis that maybe there is some issue with time >>>>>> keeping on that system. >>>>>> How would that code work out for you with MONOTONIC? >>>>> >>>>> Jens Axboe, (CC'd) tried both CLOCK_REALTIME and CLOCK_MONOTONIC, >>>>> and they both had the same problem.. >>>>> i.e. random early returns with ETIMEDOUT. >>>>> >>>>> I think we will try move out machine forward to a newer -stable to >>>>> see if it resolves. >>>> Kan upgraded the machine today to today's 9.x branch tip and the >>>> problem still occurs. >>>> 8.x does not have this problem. >>>> >>>> I have not got a 9-RELEASE machine to test on.. so I can not tell >>>> if this came in with the burst of stuff >>>> that came in after the 9.x branch was unfrozen after the release of >>>> 9.0. >>>> >>>> >>> >> I am trying to reproduce the problem, do you have complete sample >> code to test ? > > I'm still looking the exact set > but on my machine (4 cpus) the program from ports sysutils/fio > exhibits the problem when used with > kern.timecounter.hardware=TSC-low and with the following config file: > > pu05 # cat config.fio > > [global] > #clocksource=cpu > direct=1 > rw=randread > bs=4096 > fill_device=1 > numjobs=16 > iodepth=16 > #ioengine=posixaio > #ioengine=psync > ioengine=psync > group_reporting > norandommap > time_based > runtime=60000 > randrepeat=0 > > [file1] > filename=/dev/ada0 > > pu05 # > pu05 # fio config.fio > fio: this platform does not support process shared mutexes, forcing > use of threads. Use the 'thread' option to get rid of this warning. > file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync, iodepth=16 > ... > file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync, iodepth=16 > fio 2.0.3 > Starting 15 threads and 1 process > fio: job startup hung? exiting. > fio: 5 jobs failed to start > Segmentation fault (core dumped) > pu05# > > > The reason 5 jobs failed to start is because the parent timed out on > them immediately. > It didn't time out on 10 of them apparently. > > > if I set the timer to ACPI-fast it works as expected.. maybe following code can check to see if TSC-LOW works by let the thread run on each cpu. gettimeofday(&prev, NULL); int cpu = 0; for (;;) { cpuset_t set; cpu = ++cpu % 4; CPU_ZERO(&set); CPU_SET(cpu, &set); pthread_setaffinity_np(pthread_self(), sizeof(set), &set); gettimeofday(&cur, NULL); if ( timercmp(&prev, &cur, >=)) { abort(); } }
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4F3DBE90.5030305>