Date: Thu, 16 Feb 2012 23:41:31 -0800 From: Julian Elischer <julian@freebsd.org> To: davidxu@freebsd.org Cc: David Xu <listlog2011@gmail.com>, Alexander Kabaev <kan@freebsd.org>, Andriy Gapon <avg@freebsd.org>, threads@freebsd.org, FreeBSD Stable <freebsd-stable@freebsd.org>, Jung-uk Kim <jkim@freebsd.org> Subject: Re: pthread_cond_timedwait() broken in 9-stable? (from JAN 10) Message-ID: <4F3E04AB.2000508@freebsd.org> In-Reply-To: <4F3DBE90.5030305@gmail.com> References: <4F3C2671.3090808__7697.00510795719$1329343207$gmane$org@freebsd.org> <4F3D3E2D.9090100@FreeBSD.org> <4F3D6FDD.9050808@freebsd.org> <4F3D89CD.9050309@freebsd.org> <4F3DA27A.3090903@freebsd.org> <4F3DB3DB.2060603@gmail.com> <4F3DB91A.2090806@freebsd.org> <4F3DBE90.5030305@gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
adding jkim as he seems to be the last person working with TSC. On 2/16/12 6:42 PM, David Xu wrote: > On 2012/2/17 10:19, Julian Elischer wrote: >> On 2/16/12 5:56 PM, David Xu wrote: >>> On 2012/2/17 8:42, Julian Elischer wrote: >>>> Adding David Xu for his thoughts since he reqrote the code in >>>> quesiton in revision 213098 >>>> >>>> On 2/16/12 2:57 PM, Julian Elischer wrote: >>>>> On 2/16/12 1:06 PM, Julian Elischer wrote: >>>>>> On 2/16/12 9:34 AM, Andriy Gapon wrote: >>>>>>> on 15/02/2012 23:41 Julian Elischer said the following: >>>>>>>> The program fio (an IO test in ports) uses pthreads >>>>>>>> >>>>>>>> the following code (from fio-2.0.3, but its in earlier code too) >>>>>>>> has suddenly started misbehaving. >>>>>>>> >>>>>>>> clock_gettime(CLOCK_REALTIME,&t); >>>>>>>> t.tv_sec += seconds + 10; >>>>>>>> >>>>>>>> pthread_mutex_lock(&mutex->lock); >>>>>>>> >>>>>>>> while (!mutex->value&& !ret) { >>>>>>>> mutex->waiters++; >>>>>>>> ret = >>>>>>>> pthread_cond_timedwait(&mutex->cond,&mutex->lock,&t); >>>>>>>> mutex->waiters--; >>>>>>>> } >>>>>>>> >>>>>>>> if (!ret) { >>>>>>>> mutex->value--; >>>>>>>> pthread_mutex_unlock(&mutex->lock); >>>>>>>> } >>>>>>>> >>>>>>>> >>>>>>>> It turns out that 'ret' sometimes comes back instantly (on my >>>>>>>> machine) with a >>>>>>>> value of 60 (ETIMEDOUT) >>>>>>>> despite the fact that we set the timeout 10 seconds into the >>>>>>>> future. >>>>>>>> >>>>>>>> Has anyone else seen anything like this? >>>>>>>> (and yes the condition variable attribute have been set to >>>>>>>> use the REALTIME clock). >>>>>>> But why? >>>>>>> >>>>>>> Just a hypothesis that maybe there is some issue with time >>>>>>> keeping on that system. >>>>>>> How would that code work out for you with MONOTONIC? >>>>>> >>>>>> Jens Axboe, (CC'd) tried both CLOCK_REALTIME and >>>>>> CLOCK_MONOTONIC, and they both had the same problem.. >>>>>> i.e. random early returns with ETIMEDOUT. >>>>>> >>>>>> I think we will try move out machine forward to a newer -stable >>>>>> to see if it resolves. >>>>> Kan upgraded the machine today to today's 9.x branch tip and the >>>>> problem still occurs. >>>>> 8.x does not have this problem. >>>>> >>>>> I have not got a 9-RELEASE machine to test on.. so I can not >>>>> tell if this came in with the burst of stuff >>>>> that came in after the 9.x branch was unfrozen after the release >>>>> of 9.0. >>>>> >>>>> >>>> >>> I am trying to reproduce the problem, do you have complete sample >>> code to test ? >> >> I'm still looking the exact set >> but on my machine (4 cpus) the program from ports sysutils/fio >> exhibits the problem when used with >> kern.timecounter.hardware=TSC-low and with the following config file: >> >> pu05 # cat config.fio >> >> [global] >> #clocksource=cpu >> direct=1 >> rw=randread >> bs=4096 >> fill_device=1 >> numjobs=16 >> iodepth=16 >> #ioengine=posixaio >> #ioengine=psync >> ioengine=psync >> group_reporting >> norandommap >> time_based >> runtime=60000 >> randrepeat=0 >> >> [file1] >> filename=/dev/ada0 >> >> pu05 # >> pu05 # fio config.fio >> fio: this platform does not support process shared mutexes, forcing >> use of threads. Use the 'thread' option to get rid of this warning. >> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync, iodepth=16 >> ... >> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync, iodepth=16 >> fio 2.0.3 >> Starting 15 threads and 1 process >> fio: job startup hung? exiting. >> fio: 5 jobs failed to start >> Segmentation fault (core dumped) >> pu05# >> >> >> The reason 5 jobs failed to start is because the parent timed out >> on them immediately. >> It didn't time out on 10 of them apparently. >> >> >> if I set the timer to ACPI-fast it works as expected.. > maybe following code can check to see if TSC-LOW works by let the > thread run > on each cpu. > > gettimeofday(&prev, NULL); > int cpu = 0; > for (;;) { > cpuset_t set; > cpu = ++cpu % 4; > CPU_ZERO(&set); > CPU_SET(cpu, &set); > pthread_setaffinity_np(pthread_self(), sizeof(set), &set); > gettimeofday(&cur, NULL); > if ( timercmp(&prev, &cur, >=)) { > abort(); > } > } > >
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4F3E04AB.2000508>