Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 17 Feb 2012 17:30:29 -0800
From:      Julian Elischer <julian@freebsd.org>
To:        Jung-uk Kim <jkim@freebsd.org>
Cc:        Alexander Kabaev <kan@freebsd.org>, "threads@freebsd.org" <threads@freebsd.org>, FreeBSD Stable <freebsd-stable@freebsd.org>, "davidxu@freebsd.org" <davidxu@freebsd.org>, Andriy Gapon <avg@freebsd.org>
Subject:   Re: pthread_cond_timedwait() broken in 9-stable? (from JAN 10)
Message-ID:  <4F3EFF35.8040901@freebsd.org>
In-Reply-To: <201202171117.46626.jkim@FreeBSD.org>
References:  <4F3C2671.3090808__7697.00510795719$1329343207$gmane$org@freebsd.org> <4F3E0A90.6080400@freebsd.org> <4F3E39EF.3030209@gmail.com> <201202171117.46626.jkim@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
> On Friday 17 February 2012 06:28 am, David Xu wrote:
>> On 2012/2/17 16:06, Julian Elischer wrote:
>>> On 2/16/12 11:41 PM, Julian Elischer wrote:
>>>> adding jkim as he seems to be the last person working with TSC.
>>>>
>>>> On 2/16/12 6:42 PM, David Xu wrote:
>>>>> On 2012/2/17 10:19, Julian Elischer wrote:
>>>>>> On 2/16/12 5:56 PM, David Xu wrote:
>>>>>>> On 2012/2/17 8:42, Julian Elischer wrote:
>>>>>>>> Adding David Xu for his thoughts since he reqrote the code
>>>>>>>> in quesiton in revision 213098
>>>>>>>>
>>>>>>>> On 2/16/12 2:57 PM, Julian Elischer wrote:
>>>>>>>>> On 2/16/12 1:06 PM, Julian Elischer wrote:
>>>>>>>>>> On 2/16/12 9:34 AM, Andriy Gapon wrote:
>>>>>>>>>>> on 15/02/2012 23:41 Julian Elischer said the following:
>>>>>>>>>>>> The program fio (an IO test in ports) uses pthreads
>>>>>>>>>>>>
>>>>>>>>>>>> the following code (from fio-2.0.3, but its in earlier
>>>>>>>>>>>> code too) has suddenly started misbehaving.
>>>>>>>>>>>>
>>>>>>>>>>>>           clock_gettime(CLOCK_REALTIME,&t);
>>>>>>>>>>>>           t.tv_sec += seconds + 10;
>>>>>>>>>>>>
>>>>>>>>>>>>           pthread_mutex_lock(&mutex->lock);
>>>>>>>>>>>>
>>>>>>>>>>>>           while (!mutex->value&&   !ret) {
>>>>>>>>>>>>                   mutex->waiters++;
>>>>>>>>>>>>                   ret =
>>>>>>>>>>>> pthread_cond_timedwait(&mutex->cond,&mutex->lock,&t);
>>>>>>>>>>>>                   mutex->waiters--;
>>>>>>>>>>>>           }
>>>>>>>>>>>>
>>>>>>>>>>>>           if (!ret) {
>>>>>>>>>>>>                   mutex->value--;
>>>>>>>>>>>>                   pthread_mutex_unlock(&mutex->lock);
>>>>>>>>>>>>           }
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> It turns out that 'ret' sometimes comes back instantly
>>>>>>>>>>>> (on my machine) with a
>>>>>>>>>>>> value of 60 (ETIMEDOUT)
>>>>>>>>>>>> despite the fact that we set the timeout 10 seconds into
>>>>>>>>>>>> the future.
>>>>>>>>>>>>
>>>>>>>>>>>> Has anyone else seen anything like this?
>>>>>>>>>>>> (and yes the condition variable attribute have been set
>>>>>>>>>>>> to use the REALTIME clock).
>>>>>>>>>>> But why?
>>>>>>>>>>>
>>>>>>>>>>> Just a hypothesis that maybe there is some issue with
>>>>>>>>>>> time keeping on that system.
>>>>>>>>>>> How would that code work out for you with MONOTONIC?
>>>>>>>>>> Jens Axboe, (CC'd) tried both CLOCK_REALTIME and
>>>>>>>>>> CLOCK_MONOTONIC, and they both had the same problem..
>>>>>>>>>> i.e. random early returns with ETIMEDOUT.
>>>>>>>>>>
>>>>>>>>>> I think we will try move out machine forward to a newer
>>>>>>>>>> -stable to see if it resolves.
>>>>>>>>> Kan upgraded the machine today to today's 9.x branch tip
>>>>>>>>> and the problem still occurs.
>>>>>>>>> 8.x does not have this problem.
>>>>>>>>>
>>>>>>>>> I have not got a 9-RELEASE machine to test on.. so I can
>>>>>>>>> not tell if this came in with the burst of stuff
>>>>>>>>> that came in after the 9.x branch was unfrozen after the
>>>>>>>>> release of 9.0.
>>>>>>> I am trying to reproduce the problem,  do you have complete
>>>>>>> sample code to test ?
>>>>>> I'm still looking the exact set
>>>>>> but on my machine (4 cpus) the program from ports sysutils/fio
>>>>>> exhibits the problem when used with
>>>>>> kern.timecounter.hardware=TSC-low and with the following
>>>>>> config file:
>>>>>>
>>>>>> pu05 # cat config.fio
>>>>>>
>>>>>> [global]
>>>>>> #clocksource=cpu
>>>>>> direct=1
>>>>>> rw=randread
>>>>>> bs=4096
>>>>>> fill_device=1
>>>>>> numjobs=16
>>>>>> iodepth=16
>>>>>> #ioengine=posixaio
>>>>>> #ioengine=psync
>>>>>> ioengine=psync
>>>>>> group_reporting
>>>>>> norandommap
>>>>>> time_based
>>>>>> runtime=60000
>>>>>> randrepeat=0
>>>>>>
>>>>>> [file1]
>>>>>> filename=/dev/ada0
>>>>>>
>>>>>> pu05 #
>>>>>> pu05 # fio config.fio
>>>>>> fio: this platform does not support process shared mutexes,
>>>>>> forcing use of threads. Use the 'thread' option to get rid of
>>>>>> this warning. file1: (g=0): rw=randread, bs=4K-4K/4K-4K,
>>>>>> ioengine=psync, iodepth=16 ...
>>>>>> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync,
>>>>>> iodepth=16 fio 2.0.3
>>>>>> Starting 15 threads and 1 process
>>>>>> fio: job startup hung? exiting.
>>>>>> fio: 5 jobs failed to start
>>>>>> Segmentation fault (core dumped)
>>>>>> pu05#
>>>>>>
>>>>>>
>>>>>> The reason 5 jobs failed to start is because the parent timed
>>>>>> out on them immediately.
>>>>>> It didn't time out on 10 of them apparently.
>>>>>>
>>>>>>
>>>>>> if I set the timer to ACPI-fast it works as expected..
>>>>> maybe following code can check to see if TSC-LOW works by let
>>>>> the thread run
>>>>> on each cpu.
>>>>>
>>>>> gettimeofday(&prev, NULL);
>>>>> int cpu = 0;
>>>>> for (;;) {
>>>>>       cpuset_t set;
>>>>>       cpu = ++cpu % 4;
>>>>>       CPU_ZERO(&set);
>>>>>       CPU_SET(cpu,&set);
>>>>>       pthread_setaffinity_np(pthread_self(), sizeof(set),&set);
>>>>>       gettimeofday(&cur, NULL);
>>>>>       if ( timercmp(&prev,&cur,>=)) {
>>>>>          abort();
>>>>>     }
>>>>> }
>>> pu05# sysctl kern.timecounter.hardware=TSC-low
>>> kern.timecounter.hardware: ACPI-fast ->  TSC-low
>>> pu05# ./test
>>> ^C
>>> pu05# cat test.c
>>>
>>> #include<stdlib.h>
>>> #include<sys/param.h>
>>> #include<sys/cpuset.h>
>>> #include<pthread_np.h>
>>>
>>> #include<sys/time.h>
>>>
>>> main()
>>> {
>>>      int cpu = 0;
>>>      struct timeval prev, cur;
>>>
>>>      gettimeofday(&prev, NULL);
>>>      for (;;) {
>>>           cpuset_t set;
>>>           cpu = ++cpu % 4;
>>>           CPU_ZERO(&set);
>>>           CPU_SET(cpu,&set);
>>>           pthread_setaffinity_np(pthread_self(), sizeof(set),
>>> &set); gettimeofday(&cur, NULL);
>>>           if ( timercmp(&prev,&cur,>)) {
>>>              abort();
>>>         }
>>>         prev = cur;
>>>      }
>>> }
>>>
>>> pu05# ./test
>>>
>>> minutes pass.......
>>>
>>> ^C
>>> pu05#
>>>
>>> so it looks as if the TSC is working ok..
>>> I'm just going to check that the program is actually moving
>>> CPU... yes it is moving around but I can't tell at what speed.
>>> (according to top).
>>>
>>> so we are still left with a question of "where is the problem?"
>>>
>>> kernel TSC driver?
>>> generic gettimeofday() code?
>>> pthreads cond code?
>>> the application?
>> I am running the fio test on my notebook which is using TSC-low,
>> it is on 9.0-RC3, I can not reproduce the problem for
>> minutes, then I interrupt it with ctrl-c:
>>
>> http://people.freebsd.org/~davidxu/tsc_pthread/dmesg.txt
>> http://people.freebsd.org/~davidxu/tsc_pthread/tc.txt
>> http://people.freebsd.org/~davidxu/tsc_pthread/fio.txt
> Your CPU is single-package, dual-core, and SMT-enabled.  All cores
> should be in perfect sync.
>
> Jung-uk Kim
>

mine is too, yet it still has problems..
CPU: Intel(R) Xeon(R) CPU           E5420  @ 2.50GHz (2500.14-MHz 
K8-class CPU)
   Origin = "GenuineIntel"  Id = 0x10676  Family = 6  Model = 17  
Stepping = 6
   
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
   
Features2=0xce3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,DCA,SSE4.1>
   AMD Features=0x20100800<SYSCALL,NX,LM>
   AMD Features2=0x1<LAHF>
   TSC: P-state invariant, performance statistics
real memory  = 8589934592 (8192 MB)
avail memory = 8214368256 (7833 MB)
Event timer "LAPIC" quality 400
ACPI APIC Table: <PTLTD          APIC >
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
  cpu0 (BSP): APIC ID:  0
  cpu1 (AP): APIC ID:  1
  cpu2 (AP): APIC ID:  2
  cpu3 (AP): APIC ID:  3
ioapic0 <Version 2.0> irqs 0-23 on motherboard
ioapic1 <Version 2.0> irqs 24-47 on motherboard




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4F3EFF35.8040901>