From owner-svn-src-all@FreeBSD.ORG Wed Jul 25 18:00:43 2012 Return-Path: Delivered-To: svn-src-all@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4C6001065673; Wed, 25 Jul 2012 18:00:43 +0000 (UTC) (envelope-from jim.harris@gmail.com) Received: from mail-wg0-f42.google.com (mail-wg0-f42.google.com [74.125.82.42]) by mx1.freebsd.org (Postfix) with ESMTP id 5C8D18FC18; Wed, 25 Jul 2012 18:00:42 +0000 (UTC) Received: by wgbfm10 with SMTP id fm10so4736505wgb.1 for ; Wed, 25 Jul 2012 11:00:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=VGNZxUqeN19sKL5EevWhrAOX8ocXi3j3e/ik8tDCBzo=; b=fQesXM/bjJlNsfoiB48T2Rq2IrA0oyZn2OcRPW6lBGhGzuGiGRHOBkjIQV+2tyayMi BoFUNgBLFbBGResHbaYQisTjizhfTlKWMvyjLKSwbQWfbVcnaouoSm/hZRmWyCAk+xMZ npEstndz3x5QsNoLhJm3NuLZ3w+yl2cder4F93qWS//H7Hb4+ZDMiwVw2z7O1kUnJCGH Ev378wzNaeaX+LIwI1KqzEEJI+mtSDvsYAr4855iN4ZuoiuFxgmU7hH2qvmT68hn0eA9 V8HxmDNjyMgat9taYGsjWlKYzcMtFCCLr+1Za1LNkAPrPZ5jXk02Dcmlh9rpbNu0rQNQ JqqA== MIME-Version: 1.0 Received: by 10.216.196.94 with SMTP id q72mr429421wen.149.1343239241444; Wed, 25 Jul 2012 11:00:41 -0700 (PDT) Received: by 10.216.178.212 with HTTP; Wed, 25 Jul 2012 11:00:41 -0700 (PDT) In-Reply-To: <20120725173212.GN2676@deviant.kiev.zoral.com.ua> References: <201207242210.q6OMACqV079603@svn.freebsd.org> <500F9E22.4080608@FreeBSD.org> <20120725102130.GH2676@deviant.kiev.zoral.com.ua> <20120725233033.N5406@besplex.bde.org> <20120725173212.GN2676@deviant.kiev.zoral.com.ua> Date: Wed, 25 Jul 2012 11:00:41 -0700 Message-ID: From: Jim Harris To: Konstantin Belousov Content-Type: text/plain; charset=ISO-8859-1 Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Andriy Gapon , Bruce Evans Subject: Re: svn commit: r238755 - head/sys/x86/x86 X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 25 Jul 2012 18:00:43 -0000 On Wed, Jul 25, 2012 at 10:32 AM, Konstantin Belousov wrote: > On Thu, Jul 26, 2012 at 12:15:54AM +1000, Bruce Evans wrote: >> On Wed, 25 Jul 2012, Konstantin Belousov wrote: >> >> >On Wed, Jul 25, 2012 at 10:20:02AM +0300, Andriy Gapon wrote: >> >>on 25/07/2012 01:10 Jim Harris said the following: >> >>>Author: jimharris >> >>>Date: Tue Jul 24 22:10:11 2012 >> >>>New Revision: 238755 >> >>>URL: http://svn.freebsd.org/changeset/base/238755 >> >>> >> >>>Log: >> >>> Add rmb() to tsc_read_##x to enforce serialization of rdtsc captures. >> >>> >> >>> Intel Architecture Manual specifies that rdtsc instruction is not >> >>> serialized, >> >>> so without this change, TSC synchronization test would periodically >> >>> fail, >> >>> resulting in use of HPET timecounter instead of TSC-low. This caused >> >>> severe performance degradation (40-50%) when running high IO/s >> >>> workloads due to >> >>> HPET MMIO reads and GEOM stat collection. >> >>> >> >>> Tests on Xeon E5-2600 (Sandy Bridge) 8C systems were seeing TSC >> >>> synchronization >> >>> fail approximately 20% of the time. >> >> >> >>Should rather the synchronization test be fixed if it's the culprit? >> >Synchronization test for what ? >> > >> >>Or is this change universally good for the real uses of TSC? >> >> It's too slow for real uses. But synchronization code, and some uses >> that requires serialization may need it for, er, synchronization and >> serialization. >> >> It's hard to think of many uses that need serialization. I often use >> it for timing instructions. For timng a large number of instructions, >> serialization doesn't matter since errors of a few tens in a few billion >> done matter. For timing a small number of instructions, I don't want >> serialization, since the serialization invalidates the timing. >> >> Most uses in FreeBSD are for timecounters. Timecounters deliver the >> current time. This is unrelated to whatever instructions haven't >> completed when the TSC is read. Except possibly when the time needs >> to be synchronized across CPUs, and when the uncompleted instruction >> is a TSC read. >> >> >For tsc test, this means that after the change RDTSC executions are not >> >reordered on the single core among themself. As I understand, CPU has >> >no dependency noted between two reads of tsc by RDTSC, which allows >> >later read to give lower value of counter. >> >> Gak. Even when they are in the same instruction sequence? Even though >> the TSC reads fixed registers and some other instructions in the sequence >> between the TSC use these registers? The CPU would have to do significant >> register renaming to break this. > As I could only speculate, I believe that any modern CPU executes RDTSC > as at least two separate steps, one is read from internal counter, and > second is the registers update. It seems that the first kind of action > is not serialized. I have no other explanation for the Jim findings. > > I also asked Jim to test whether the cause the TSC sync test failure > is the lack of synchronization between gathering data and tasting it, > but ut appeared that the reason is genuine timecounter value going > backward. I wonder if instead of timecounter going backward, that TSC test fails because CPU speculatively performs rdtsc instruction in relation to waiter checks in smp_rendezvous_action. Or maybe we are saying the same thing. > > Sp the bug seems real, and I cannot imagine we will live with the known > defect in timecounters which can step back. >> >> >This is fixed by Intel by >> >introduction of RDTSCP instruction, which is defined to be serialization >> >point, and use of which (instead of LFENCE; RDTSC sequence) also fixes >> >test, as confirmed by Jim. >> >> This is not a fix if it is full serialization. It just gives slowness >> using a single instruction instead of a couple. >> >> >In fact, I now think that we should also apply the following patch. >> >Otherwise, consequtive calls to e.g. binuptime(9) could return decreased >> >time stamps. Note that libc __vdso_gettc.c already has LFENCE nearby the >> >tsc reads, which was done not for this reason, but apparently needed for >> >the reason too. >> > >> >diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c >> >index 085c339..229b351 100644 >> >--- a/sys/x86/x86/tsc.c >> >+++ b/sys/x86/x86/tsc.c >> >@@ -594,6 +594,7 @@ static u_int >> >tsc_get_timecount(struct timecounter *tc __unused) >> >{ >> > >> >+ rmb(); >> > return (rdtsc32()); >> >} >> >> Please don't pessimize this further. The time for rdtsc went from 6.5 >> cycles on AthlonXP to 65 cycles on core2 (mainly for for >> P-state-invariance hardware synchronization I think). Pretty soon it >> will be as slow as an HPET and heading towards an i8254. Adding rmb() >> only makes it 12 cycles slower on core2, but 16 cycles (almost 3 times) >> slower on AthlonXP. > AthlonXP does not look as interesting target for optimizations. Fom what I > can find this is PIII-era CPU. > >> >> >@@ -602,8 +603,9 @@ tsc_get_timecount_low(struct timecounter *tc) >> >{ >> > uint32_t rv; >> > >> >+ rmb(); >> > __asm __volatile("rdtsc; shrd %%cl, %%edx, %0" >> >- : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx"); >> >+ : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx"); >> > return (rv); >> >} >> >> The previous TSC-low/shrd pessimization adds only 2 cycles on AthlonXP >> and core2. I think it only "works" by backing the TSC's resolution >> so low that it usually can't see its own, or at least other TSC's lack of >> serialness. The shift count is usually 7 or 8, so the resolution is >> reduced from 1 cycle to 128 or 256. Out of order times that fall in >> the same block of 128 or 256 cycles would appear to be the same, but >> out of order times like 129 and 127 would apear to be even more out >> of order after a shift of 7 turns them into 128 and 0. >> >> Bruce