Date: Fri, 17 Feb 2012 17:30:29 -0800 From: Julian Elischer <julian@freebsd.org> To: Jung-uk Kim <jkim@freebsd.org> Cc: Alexander Kabaev <kan@freebsd.org>, "threads@freebsd.org" <threads@freebsd.org>, FreeBSD Stable <freebsd-stable@freebsd.org>, "davidxu@freebsd.org" <davidxu@freebsd.org>, Andriy Gapon <avg@freebsd.org> Subject: Re: pthread_cond_timedwait() broken in 9-stable? (from JAN 10) Message-ID: <4F3EFF35.8040901@freebsd.org> In-Reply-To: <201202171117.46626.jkim@FreeBSD.org> References: <4F3C2671.3090808__7697.00510795719$1329343207$gmane$org@freebsd.org> <4F3E0A90.6080400@freebsd.org> <4F3E39EF.3030209@gmail.com> <201202171117.46626.jkim@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
> On Friday 17 February 2012 06:28 am, David Xu wrote: >> On 2012/2/17 16:06, Julian Elischer wrote: >>> On 2/16/12 11:41 PM, Julian Elischer wrote: >>>> adding jkim as he seems to be the last person working with TSC. >>>> >>>> On 2/16/12 6:42 PM, David Xu wrote: >>>>> On 2012/2/17 10:19, Julian Elischer wrote: >>>>>> On 2/16/12 5:56 PM, David Xu wrote: >>>>>>> On 2012/2/17 8:42, Julian Elischer wrote: >>>>>>>> Adding David Xu for his thoughts since he reqrote the code >>>>>>>> in quesiton in revision 213098 >>>>>>>> >>>>>>>> On 2/16/12 2:57 PM, Julian Elischer wrote: >>>>>>>>> On 2/16/12 1:06 PM, Julian Elischer wrote: >>>>>>>>>> On 2/16/12 9:34 AM, Andriy Gapon wrote: >>>>>>>>>>> on 15/02/2012 23:41 Julian Elischer said the following: >>>>>>>>>>>> The program fio (an IO test in ports) uses pthreads >>>>>>>>>>>> >>>>>>>>>>>> the following code (from fio-2.0.3, but its in earlier >>>>>>>>>>>> code too) has suddenly started misbehaving. >>>>>>>>>>>> >>>>>>>>>>>> clock_gettime(CLOCK_REALTIME,&t); >>>>>>>>>>>> t.tv_sec += seconds + 10; >>>>>>>>>>>> >>>>>>>>>>>> pthread_mutex_lock(&mutex->lock); >>>>>>>>>>>> >>>>>>>>>>>> while (!mutex->value&& !ret) { >>>>>>>>>>>> mutex->waiters++; >>>>>>>>>>>> ret = >>>>>>>>>>>> pthread_cond_timedwait(&mutex->cond,&mutex->lock,&t); >>>>>>>>>>>> mutex->waiters--; >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> if (!ret) { >>>>>>>>>>>> mutex->value--; >>>>>>>>>>>> pthread_mutex_unlock(&mutex->lock); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> It turns out that 'ret' sometimes comes back instantly >>>>>>>>>>>> (on my machine) with a >>>>>>>>>>>> value of 60 (ETIMEDOUT) >>>>>>>>>>>> despite the fact that we set the timeout 10 seconds into >>>>>>>>>>>> the future. >>>>>>>>>>>> >>>>>>>>>>>> Has anyone else seen anything like this? >>>>>>>>>>>> (and yes the condition variable attribute have been set >>>>>>>>>>>> to use the REALTIME clock). >>>>>>>>>>> But why? >>>>>>>>>>> >>>>>>>>>>> Just a hypothesis that maybe there is some issue with >>>>>>>>>>> time keeping on that system. >>>>>>>>>>> How would that code work out for you with MONOTONIC? >>>>>>>>>> Jens Axboe, (CC'd) tried both CLOCK_REALTIME and >>>>>>>>>> CLOCK_MONOTONIC, and they both had the same problem.. >>>>>>>>>> i.e. random early returns with ETIMEDOUT. >>>>>>>>>> >>>>>>>>>> I think we will try move out machine forward to a newer >>>>>>>>>> -stable to see if it resolves. >>>>>>>>> Kan upgraded the machine today to today's 9.x branch tip >>>>>>>>> and the problem still occurs. >>>>>>>>> 8.x does not have this problem. >>>>>>>>> >>>>>>>>> I have not got a 9-RELEASE machine to test on.. so I can >>>>>>>>> not tell if this came in with the burst of stuff >>>>>>>>> that came in after the 9.x branch was unfrozen after the >>>>>>>>> release of 9.0. >>>>>>> I am trying to reproduce the problem, do you have complete >>>>>>> sample code to test ? >>>>>> I'm still looking the exact set >>>>>> but on my machine (4 cpus) the program from ports sysutils/fio >>>>>> exhibits the problem when used with >>>>>> kern.timecounter.hardware=TSC-low and with the following >>>>>> config file: >>>>>> >>>>>> pu05 # cat config.fio >>>>>> >>>>>> [global] >>>>>> #clocksource=cpu >>>>>> direct=1 >>>>>> rw=randread >>>>>> bs=4096 >>>>>> fill_device=1 >>>>>> numjobs=16 >>>>>> iodepth=16 >>>>>> #ioengine=posixaio >>>>>> #ioengine=psync >>>>>> ioengine=psync >>>>>> group_reporting >>>>>> norandommap >>>>>> time_based >>>>>> runtime=60000 >>>>>> randrepeat=0 >>>>>> >>>>>> [file1] >>>>>> filename=/dev/ada0 >>>>>> >>>>>> pu05 # >>>>>> pu05 # fio config.fio >>>>>> fio: this platform does not support process shared mutexes, >>>>>> forcing use of threads. Use the 'thread' option to get rid of >>>>>> this warning. file1: (g=0): rw=randread, bs=4K-4K/4K-4K, >>>>>> ioengine=psync, iodepth=16 ... >>>>>> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync, >>>>>> iodepth=16 fio 2.0.3 >>>>>> Starting 15 threads and 1 process >>>>>> fio: job startup hung? exiting. >>>>>> fio: 5 jobs failed to start >>>>>> Segmentation fault (core dumped) >>>>>> pu05# >>>>>> >>>>>> >>>>>> The reason 5 jobs failed to start is because the parent timed >>>>>> out on them immediately. >>>>>> It didn't time out on 10 of them apparently. >>>>>> >>>>>> >>>>>> if I set the timer to ACPI-fast it works as expected.. >>>>> maybe following code can check to see if TSC-LOW works by let >>>>> the thread run >>>>> on each cpu. >>>>> >>>>> gettimeofday(&prev, NULL); >>>>> int cpu = 0; >>>>> for (;;) { >>>>> cpuset_t set; >>>>> cpu = ++cpu % 4; >>>>> CPU_ZERO(&set); >>>>> CPU_SET(cpu,&set); >>>>> pthread_setaffinity_np(pthread_self(), sizeof(set),&set); >>>>> gettimeofday(&cur, NULL); >>>>> if ( timercmp(&prev,&cur,>=)) { >>>>> abort(); >>>>> } >>>>> } >>> pu05# sysctl kern.timecounter.hardware=TSC-low >>> kern.timecounter.hardware: ACPI-fast -> TSC-low >>> pu05# ./test >>> ^C >>> pu05# cat test.c >>> >>> #include<stdlib.h> >>> #include<sys/param.h> >>> #include<sys/cpuset.h> >>> #include<pthread_np.h> >>> >>> #include<sys/time.h> >>> >>> main() >>> { >>> int cpu = 0; >>> struct timeval prev, cur; >>> >>> gettimeofday(&prev, NULL); >>> for (;;) { >>> cpuset_t set; >>> cpu = ++cpu % 4; >>> CPU_ZERO(&set); >>> CPU_SET(cpu,&set); >>> pthread_setaffinity_np(pthread_self(), sizeof(set), >>> &set); gettimeofday(&cur, NULL); >>> if ( timercmp(&prev,&cur,>)) { >>> abort(); >>> } >>> prev = cur; >>> } >>> } >>> >>> pu05# ./test >>> >>> minutes pass....... >>> >>> ^C >>> pu05# >>> >>> so it looks as if the TSC is working ok.. >>> I'm just going to check that the program is actually moving >>> CPU... yes it is moving around but I can't tell at what speed. >>> (according to top). >>> >>> so we are still left with a question of "where is the problem?" >>> >>> kernel TSC driver? >>> generic gettimeofday() code? >>> pthreads cond code? >>> the application? >> I am running the fio test on my notebook which is using TSC-low, >> it is on 9.0-RC3, I can not reproduce the problem for >> minutes, then I interrupt it with ctrl-c: >> >> http://people.freebsd.org/~davidxu/tsc_pthread/dmesg.txt >> http://people.freebsd.org/~davidxu/tsc_pthread/tc.txt >> http://people.freebsd.org/~davidxu/tsc_pthread/fio.txt > Your CPU is single-package, dual-core, and SMT-enabled. All cores > should be in perfect sync. > > Jung-uk Kim > mine is too, yet it still has problems.. CPU: Intel(R) Xeon(R) CPU E5420 @ 2.50GHz (2500.14-MHz K8-class CPU) Origin = "GenuineIntel" Id = 0x10676 Family = 6 Model = 17 Stepping = 6 Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE> Features2=0xce3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,DCA,SSE4.1> AMD Features=0x20100800<SYSCALL,NX,LM> AMD Features2=0x1<LAHF> TSC: P-state invariant, performance statistics real memory = 8589934592 (8192 MB) avail memory = 8214368256 (7833 MB) Event timer "LAPIC" quality 400 ACPI APIC Table: <PTLTD APIC > FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs FreeBSD/SMP: 1 package(s) x 4 core(s) cpu0 (BSP): APIC ID: 0 cpu1 (AP): APIC ID: 1 cpu2 (AP): APIC ID: 2 cpu3 (AP): APIC ID: 3 ioapic0 <Version 2.0> irqs 0-23 on motherboard ioapic1 <Version 2.0> irqs 24-47 on motherboard
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4F3EFF35.8040901>