From owner-freebsd-arch@FreeBSD.ORG Sun Dec 16 07:14:18 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 017C5BF8; Sun, 16 Dec 2012 07:14:18 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail28.syd.optusnet.com.au (mail28.syd.optusnet.com.au [211.29.133.169]) by mx1.freebsd.org (Postfix) with ESMTP id 865DD8FC17; Sun, 16 Dec 2012 07:14:17 +0000 (UTC) Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail28.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qBG7E7jR006811 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 16 Dec 2012 18:14:09 +1100 Date: Sun, 16 Dec 2012 18:14:07 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Garrett Cooper Subject: Re: [RFC/RFT] calloutng In-Reply-To: <35705A81-690A-4993-B0C3-C8BC0BC89C67@gmail.com> Message-ID: <20121216175614.V1027@besplex.bde.org> References: <50CCAB99.4040308@FreeBSD.org> <20121215203458.GA22361@oddish> <35705A81-690A-4993-B0C3-C8BC0BC89C67@gmail.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=Zvi1sKHG c=1 sm=1 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=mUAV9h2nInsA:10 a=6I5d2MoRAAAA:8 a=QqY8FBgQtirPoTa2VqoA:9 a=CjuIK1q_8ugA:10 a=bxQHXO5Py4tHmhUgaywp5w==:117 Cc: Davide Italiano , freebsd-arch@FreeBSD.org, Alexander Motin , FreeBSD Current , Mark Johnston X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 16 Dec 2012 07:14:18 -0000 On Sat, 15 Dec 2012, Garrett Cooper wrote: > On Dec 15, 2012, at 12:34 PM, Mark Johnston wrote: > >> On Sat, Dec 15, 2012 at 06:55:53PM +0200, Alexander Motin wrote: 
>>> Hi. >>> >>> I'm sorry to interrupt review, but as usual good ideas came during the >>> final testing, causing another round. :) Here is updated patch for >>> HEAD, that includes several new changes: >>> http://people.freebsd.org/~mav/calloutng_12_15.patch >> >> This patch breaks the libprocstat build. >> >> Specifically, the OpenSolaris sys/time.h defines the preprocessor >> symbols gethrestime and gethrestime_sec. These symbols are also defined >> in cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h. >> libprocstat:zfs.c is compiled using include paths that pick up the >> OpenSolaris time.h, and with this patch _callout.h includes sys/time.h. >> >> zfs.c includes taskqueue.h (with _KERNEL defined), which includes >> _callout.h, so both time.h and zfs_context.h are included in zfs.c, and >> the symbols are thus defined twice. Gross namespace pollution. sys/_callout.h exists so that the full namespace pollution of sys/callout.h doesn't get included nested. But sys/time.h is much more polluted than sys/callout.h. However, sys/time.h is old standard pollution in sys/param.h, and sys/callout.h is not so old standard pollution in sys/systm.h. It is a bug to not include sys/param.h and sys/systm.h in most kernel source code, so these nested includes are just style bugs -- they have no effect for correct kernel source code. >> The patch below fixes the build for me. Another approach might be to >> include sys/_task.h instead of taskqueue.h at the beginning of zfs.c. Good if it works. > I had a patch open once upon a time to cleanup inclusion of sys/time.h all over the tree and deal with the sys/time.h <-> time.h pollution issue, but it got dropped due to lack of interest (20~30 apps/libs were affected IIRC and I only really got assistance in fixing the UFS and bsnmpd pieces, and gave up due to lack of response from maintainers). dtrace/zfs is a definite instigator in this pollution (I remember nasty cddl/... pollution with the compat sys/time.h header). 
Please use the unix newline character in mail. The above is difficult to quote. The standard sys/time.h pollution in sys/param.h is only in the kernel, and there aren't many direct includes of sys/time.h in the kernel. Userland is different and many of the direct includes were correct. But now POSIX specifies that struct timespec and struct timeval be defined in most places where they are needed, so the includes of sys/time.h are not necessary for POSIX or FreeBSD, although FreeBSD man pages still say that they are necessary. The sys/time.h <-> time.h pollution issue is also only for userland. Many places depend on one including the other, and include the wrong one themselves. > Bottom line: make sure anything new you're defining isn't already defined via POSIX or other OSes, and if so please try to make the implementations match (so that eventual POSIX inclusion might be possible) and when in doubt I suggest consulting standards@ / brde@. Bruce From owner-freebsd-arch@FreeBSD.ORG Sun Dec 16 23:38:02 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 95BFAE12; Sun, 16 Dec 2012 23:38:02 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-wg0-f52.google.com (mail-wg0-f52.google.com [74.125.82.52]) by mx1.freebsd.org (Postfix) with ESMTP id CA0258FC0C; Sun, 16 Dec 2012 23:38:01 +0000 (UTC) Received: by mail-wg0-f52.google.com with SMTP id 12so2282032wgh.31 for ; Sun, 16 Dec 2012 15:38:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=EzfZxHJTGDewpVwapnVhmR8/61Cb4ukJOWIIGg2UEeA=; b=cqJwGKUCQwqkhTfJiARESFkdJ7l/e3tOVWZrp5WEIc53HlQQQslem/tlzoCoJl5R9O T3sOT4z+Qxu2zvLi0VtAVfOlgdc6W+bDh538sYzuHbhdKSVREkoHkPxlp1yfLePXFkyE 
LyNbNLZ3t+uJc4YGrvaRJ0dBEpE9UC4XTQwzMhRNWWHI2Fd4uHeTtfzRokJlyKDbkge1 zc0YFOzz55VxqASWzMlAbXm8bSFWs2ZBoqU+gOvtaO20B9BfOENSyQblTfL1+NfNpuh6 nWRHl5TqXPTAEXLK1h41hYW4MwRLMOomiGV0QLLYheyysg+Slhiu29P5O/R20/r6Bd+e xqBg== Received: by 10.194.57.206 with SMTP id k14mr13795024wjq.26.1355701080847; Sun, 16 Dec 2012 15:38:00 -0800 (PST) Received: from mavbook.mavhome.dp.ua (mavhome.mavhome.dp.ua. [213.227.240.37]) by mx.google.com with ESMTPS id i2sm8826403wiw.3.2012.12.16.15.37.58 (version=TLSv1/SSLv3 cipher=OTHER); Sun, 16 Dec 2012 15:37:59 -0800 (PST) Sender: Alexander Motin Message-ID: <50CE5B54.3050905@FreeBSD.org> Date: Mon, 17 Dec 2012 01:37:56 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:13.0) Gecko/20120628 Thunderbird/13.0.1 MIME-Version: 1.0 To: FreeBSD Current , freebsd-arch@freebsd.org Subject: Re: [RFC/RFT] calloutng References: <50CCAB99.4040308@FreeBSD.org> In-Reply-To: <50CCAB99.4040308@FreeBSD.org> Content-Type: text/plain; charset=KOI8-R; format=flowed Content-Transfer-Encoding: 7bit Cc: Davide Italiano X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 16 Dec 2012 23:38:02 -0000 Hi. Here is one more version. Unless something new will be found/reported this may be the last one, because me and Davide are quite satisfied with the results. If everything will be fine, I think we could commit it to HEAD closer to the end of the week: http://people.freebsd.org/~mav/calloutng_12_16.patch Changes in this version: -- Removed a couple of redundant variables in the callout implementation, which reduced sizeof(struct callout) by two pointers and simplified some internal code. -- The syscons driver was made to schedule only 1-2 callouts per second, instead of the previous 20-30, when the console is in graphical mode and there is little else to do. 
Now my laptop has only about 30 interrupts per second total during idle periods with X running. -- The i8254 eventtimer driver was optimized to work faster in one-shot mode, which is disabled by default. -- A few kernel functions were added to make the KPIs more complete. -- Man pages were updated. -- Some style fixes were made. On 15.12.2012 18:55, Alexander Motin wrote: > I'm sorry to interrupt review, but as usual good ideas came during the > final testing, causing another round. :) Here is updated patch for > HEAD, that includes several new changes: > http://people.freebsd.org/~mav/calloutng_12_15.patch > > The new changes are: > -- Precision and event aggregation code was reworked. Instead of the > previous -prec/+prec representation, precision is now single-sided: > -0/+prec. This significantly improves precision on long time > intervals for APIs which imply that an event should not happen before the > specified time. Depending on CPU activity, the error for long time > intervals will now never be more than 1-500ms, even if the specified > precision allows more. > -- Some minor optimizations were made to reduce callout overhead and > latency by 1.5-2us. Now, on a Core2Duo amd64 system with LAPIC eventtimer > and TSC timecounter, a usleep(1) call from user level executes in just > 5-6us, instead of 7-8us before. It can now do 180K cycles per second on > a single CPU with only partial CPU load. > -- A number of kernel subsystems (dcons, syscons, yarrow, led, atkbd, > setrlimit) were modified to reduce the number of interrupts, also through event > aggregation, by explicitly specifying the acceptable event > precision. Now my Core2Duo test system has only 30 interrupts per second > when idle. If not for the remaining syscons events, it could easily be 15. My > IvyBridge ultrabook, for the first time in its history, showed 5.5 hours of battery > time with full screen brightness and 10 hours with the lid closed. > -- Some kernel functions were added to make KPIs more complete. 
> > I've successfully tested this patch on amd64 and arm. > -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 01:29:41 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id EAFD87E9; Mon, 17 Dec 2012 01:29:41 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-wi0-f170.google.com (mail-wi0-f170.google.com [209.85.212.170]) by mx1.freebsd.org (Postfix) with ESMTP id 0A8AD8FC14; Mon, 17 Dec 2012 01:29:40 +0000 (UTC) Received: by mail-wi0-f170.google.com with SMTP id hq7so1722704wib.1 for ; Sun, 16 Dec 2012 17:29:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=6kLopwV/0d4yFTfYbv979vnBKt/HrQoxDC7WgiGFrSo=; b=YVeKThTQ8nMJ3hE3C+z7XjKSFEq+pi8pABbd17XMnAsUnbKXJZVX6ZdZC2bpZKyubT O0RLb6yVj2pZDafR7fIlD9ojwsqNGNOnbGwing8fZLyT+OYOWQL17Hg3om2WgBy4m8NU PuZBaFG0t7JEbar2dHm+Qc/mLSqK0enXbptu0gs9z3Lxd9via7phQ21tczfATCH43KYW AhyfEuKP0SWVcH4+VX503Ga+a40wBGXSG0n2TIV/ckSXMYuKRY+dj6u2XHYka3R0QL4B tNDrF7a2W46IlO9uDjXEd4Np98QBdqG+iGCrhijc83Bt8QsaTStADlL5wZweCiEGIE5O EkhQ== MIME-Version: 1.0 Received: by 10.194.93.40 with SMTP id cr8mr14203847wjb.16.1355707774499; Sun, 16 Dec 2012 17:29:34 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.217.57.9 with HTTP; Sun, 16 Dec 2012 17:29:34 -0800 (PST) In-Reply-To: <50CE5B54.3050905@FreeBSD.org> References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> Date: Sun, 16 Dec 2012 17:29:34 -0800 X-Google-Sender-Auth: whr3NmJDdOYbOmYt_qvpgXk1_Vw Message-ID: Subject: Re: [RFC/RFT] calloutng From: Adrian Chadd To: Alexander Motin Content-Type: text/plain; charset=ISO-8859-1 Cc: Davide Italiano , FreeBSD Current , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: 
Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 01:29:42 -0000 On 16 December 2012 15:37, Alexander Motin wrote: > Hi. > > Here is one more version. Unless something new will be found/reported this > may be the last one, because me and Davide are quite satisfied with the > results. If everything will be fine, I think we could commit it to HEAD > closer to the end of the week: > http://people.freebsd.org/~mav/calloutng_12_16.patch Hi, Don't you think one week is a little on the low side for reviewing something this critical? Would you mind approaching some of the cluster peeps and seeing if they'll run this up on the ref10* boxes and VMs, just to get some further exposure? Thanks, Adrian From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 02:31:13 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9ED58DD7; Mon, 17 Dec 2012 02:31:13 +0000 (UTC) (envelope-from yanegomi@gmail.com) Received: from mail-da0-f54.google.com (mail-da0-f54.google.com [209.85.210.54]) by mx1.freebsd.org (Postfix) with ESMTP id 4E7478FC0C; Mon, 17 Dec 2012 02:31:13 +0000 (UTC) Received: by mail-da0-f54.google.com with SMTP id n2so2210558dad.13 for ; Sun, 16 Dec 2012 18:31:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=references:mime-version:in-reply-to:content-type :content-transfer-encoding:message-id:cc:x-mailer:from:subject:date :to; bh=Jfs/csDriVbtDw72bcSnp4VNSlZjMDVG8UxRQuJocgU=; b=pyTeD21uNMineh7fIl3yPc4JBRxOq1qQ6aGZNFFRFeltOinO7BJ/McrADjlJuYIHQS wcoT5K9CFvLExugTnuYEERwFqdPaTuCHZ2w0nO2B9cU37iiKH8ZWjnc8Vr7NCIgJZZcb /IZx7Yv4P/fFpSSM5AuU38+mUpXf9nqid/Ly3ylffjO0zqHxurZ6uloKylxDjQHUGgM3 B68d06vJmXT90hOO3o+sd/+PzO9FdcLr4hVSAK8Wm50pQNQPZlBEn6agiZcGfwSXeeuo 83uSrvTLUOjmEqw06+dmH2zdnoqaW6wkEDMUSx6qtgM9n1n3hI2HcLWaPHC46mSLbQ7r 6b9A== 
Received: by 10.66.84.3 with SMTP id u3mr38017849pay.51.1355711472913; Sun, 16 Dec 2012 18:31:12 -0800 (PST) Received: from [192.168.20.12] (c-24-19-191-56.hsd1.wa.comcast.net. [24.19.191.56]) by mx.google.com with ESMTPS id nm2sm7267267pbc.43.2012.12.16.18.31.11 (version=SSLv3 cipher=OTHER); Sun, 16 Dec 2012 18:31:12 -0800 (PST) References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> Mime-Version: 1.0 (1.0) In-Reply-To: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: X-Mailer: iPhone Mail (10A523) From: Garrett Cooper Subject: Re: [RFC/RFT] calloutng Date: Sun, 16 Dec 2012 18:31:10 -0800 To: Adrian Chadd Cc: Davide Italiano , Alexander Motin , FreeBSD Current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 02:31:13 -0000 On Dec 16, 2012, at 5:29 PM, Adrian Chadd wrote: > On 16 December 2012 15:37, Alexander Motin wrote: >> Hi. >> >> Here is one more version. Unless something new will be found/reported this >> may be the last one, because me and Davide are quite satisfied with the >> results. If everything will be fine, I think we could commit it to HEAD >> closer to the end of the week: >> http://people.freebsd.org/~mav/calloutng_12_16.patch > > Hi, > > Don't you think one week is a little on the low side for reviewing > something this critical? > > Would you mind approaching some of the cluster peeps and seeing if > they'll run this up on the ref10* boxes and VMs, just to get some > further exposure? And maybe tinderbox..? 
Thanks, -Garrett From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 03:38:19 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 584A1C95; Mon, 17 Dec 2012 03:38:19 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by mx1.freebsd.org (Postfix) with ESMTP id 5DC168FC0C; Mon, 17 Dec 2012 03:38:17 +0000 (UTC) Received: by mail-wi0-f180.google.com with SMTP id hj13so1634686wib.13 for ; Sun, 16 Dec 2012 19:38:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=kgbYkSvIJeXPpgoSE9zP1o/keHXf6yUaW4vs/3YK6O4=; b=gjBCke9sbYnUmjHI8hbdFshBfxPDCqpWdCLmMGmQN4uDE2dz0zSBist5+Ebh7XWEq0 Iguw/I7M8R4Hfq26YSVGPPYzqF1uHXzJnonUD4+jQMSFtgThiArN/fuaeBNskUaCbTm2 tPPVB1SrrGzPXcaQdBN1CwBSN/vTdPfCNJp2NvWe/uNbfNtXRruJUqL88IXaEISPFDqK rC2Ad4WecVXY5zZCEVQsz/XD6DQTlncKQ2TCD0afBKUwAj+X0ROwFBOVpWkMhAThR3vA DtVv+qdokm7CKTgsW6AkGOp+oTfS3bieJEauJcCZmypIKLK2g7KyOmSmwTS1pj0jIzzQ GHlg== MIME-Version: 1.0 Received: by 10.180.73.202 with SMTP id n10mr12640921wiv.17.1355715496969; Sun, 16 Dec 2012 19:38:16 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.217.57.9 with HTTP; Sun, 16 Dec 2012 19:38:16 -0800 (PST) In-Reply-To: References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> Date: Sun, 16 Dec 2012 19:38:16 -0800 X-Google-Sender-Auth: 5Ej0KTuR1hczwAWCIFzf3_8FMmU Message-ID: Subject: Re: [RFC/RFT] calloutng From: Adrian Chadd To: Garrett Cooper Content-Type: text/plain; charset=ISO-8859-1 Cc: Davide Italiano , Alexander Motin , FreeBSD Current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 03:38:19 -0000 On 16 December 2012 18:31, Garrett Cooper wrote: >> Would you mind approaching some of the cluster peeps and seeing if >> they'll run this up on the ref10* boxes and VMs, just to get some >> further exposure? > > And maybe tinderbox..? Tinderbox is a great idea. Maybe hit up the altq/pf using crowd and see if they'll test this stuff out too? What else gets heavily callout /timer driven? Try some computational workloads that stress the fairness of ULE/4BSD, maybe? Adrian From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 05:52:51 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 49703739; Mon, 17 Dec 2012 05:52:51 +0000 (UTC) (envelope-from markjdb@gmail.com) Received: from mail-ie0-f182.google.com (mail-ie0-f182.google.com [209.85.223.182]) by mx1.freebsd.org (Postfix) with ESMTP id D32CE8FC0C; Mon, 17 Dec 2012 05:52:50 +0000 (UTC) Received: by mail-ie0-f182.google.com with SMTP id s9so8564453iec.13 for ; Sun, 16 Dec 2012 21:52:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:date:from:to:cc:subject:message-id:references :mime-version:content-type:content-disposition:in-reply-to :user-agent; bh=pEQVqOAblMbQuUoYbG8D0QuMm6MYG5VIQNhxjQxaN3A=; b=NWNsiHw1pYoEK8NB1Bw9/Tsdv33jdbJOrJZKp33o6C8Q5P1JJutzuVAk0DATnBIQQ8 w0e+Is3+TmGdmWRa94ZWZRmxRL0zhnlnYRMJo5P5J76gepzFZaJvPZtdyPyOugAqA88U CGQH3WnkcxesjI4EWniTaxgCiOEq/z9tRo/iyaksIspEjzWbmNbRmUAEYxU5bzHxL48T xF23Sfc4wN3aRVODgjWaJcJAI1lxaGfU5h5blf29zJCD035kMcf/h8/4PlUccBd0G7JP jSKf/6kdE4731zYKU2MK2N+RExTQMIca6g6QF8Hz7gr4A3kt7e0FXboAhB86Rz+qqr7D P3kQ== X-Received: by 10.50.34.225 with SMTP id c1mr8011375igj.67.1355723569999; Sun, 16 Dec 2012 21:52:49 -0800 (PST) Received: from oddish ([66.11.160.25]) by mx.google.com with ESMTPS id 
ex10sm5498832igc.15.2012.12.16.21.52.48 (version=TLSv1/SSLv3 cipher=OTHER); Sun, 16 Dec 2012 21:52:49 -0800 (PST) Date: Mon, 17 Dec 2012 00:52:41 -0500 From: Mark Johnston To: Bruce Evans Subject: Re: [RFC/RFT] calloutng Message-ID: <20121217055241.GA5228@oddish> References: <50CCAB99.4040308@FreeBSD.org> <20121215203458.GA22361@oddish> <35705A81-690A-4993-B0C3-C8BC0BC89C67@gmail.com> <20121216175614.V1027@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121216175614.V1027@besplex.bde.org> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Garrett Cooper , Davide Italiano , Alexander Motin , FreeBSD Current , freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 05:52:51 -0000 On Sun, Dec 16, 2012 at 06:14:07PM +1100, Bruce Evans wrote: > On Sat, 15 Dec 2012, Garrett Cooper wrote: > > > On Dec 15, 2012, at 12:34 PM, Mark Johnston wrote: > > > >> On Sat, Dec 15, 2012 at 06:55:53PM +0200, Alexander Motin wrote: > >>> Hi. > >>> > >>> I'm sorry to interrupt review, but as usual good ideas came during the > >>> final testing, causing another round. :) Here is updated patch for > >>> HEAD, that includes several new changes: > >>> http://people.freebsd.org/~mav/calloutng_12_15.patch > >> > >> This patch breaks the libprocstat build. > >> > >> Specifically, the OpenSolaris sys/time.h defines the preprocessor > >> symbols gethrestime and gethrestime_sec. These symbols are also defined > >> in cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h. > >> libprocstat:zfs.c is compiled using include paths that pick up the > >> OpenSolaris time.h, and with this patch _callout.h includes sys/time.h. 
> >> > >> zfs.c includes taskqueue.h (with _KERNEL defined), which includes > >> _callout.h, so both time.h and zfs_context.h are included in zfs.c, and > >> the symbols are thus defined twice. > > Gross namespace pollution. sys/_callout.h exists so that the full > namespace pollution of sys/callout.h doesn't get included nested. But > sys/time.h is much more polluted than sys/callout.h. > > However, sys/time.h is old standard pollution in sys/param.h, and > sys/callout.h is not so old standard pollution in sys/systm.h. It is > a bug to not include sys/param.h and sys/systm.h in most kernel source > code, so these nested includes are just style bugs -- they have no > effect for correct kernel source code. > > >> The patch below fixes the build for me. Another approach might be to > >> include sys/_task.h instead of taskqueue.h at the beginning of zfs.c. > > Good if it works. The diff below is what I had in mind. taskqueue.h is used so that sizeof(struct task) can be used, but I don't see why that's preferable to just including _task.h. 
-Mark diff --git a/lib/libprocstat/zfs.c b/lib/libprocstat/zfs.c index aa6d78e..f04eedf 100644 --- a/lib/libprocstat/zfs.c +++ b/lib/libprocstat/zfs.c @@ -27,15 +27,12 @@ */ #include +#include #define _KERNEL #include -#include #undef _KERNEL #include -#undef lbolt -#undef lbolt64 -#undef gethrestime_sec #include #include #include From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 07:57:49 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5E365C6C; Mon, 17 Dec 2012 07:57:49 +0000 (UTC) (envelope-from avg@FreeBSD.org) Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140]) by mx1.freebsd.org (Postfix) with ESMTP id D5AAA8FC0A; Mon, 17 Dec 2012 07:57:47 +0000 (UTC) Received: from porto.starpoint.kiev.ua (porto-e.starpoint.kiev.ua [212.40.38.100]) by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id JAA23533; Mon, 17 Dec 2012 09:57:38 +0200 (EET) (envelope-from avg@FreeBSD.org) Received: from localhost ([127.0.0.1]) by porto.starpoint.kiev.ua with esmtp (Exim 4.34 (FreeBSD)) id 1TkVZy-0004iV-DQ; Mon, 17 Dec 2012 09:57:38 +0200 Message-ID: <50CED06F.6080909@FreeBSD.org> Date: Mon, 17 Dec 2012 09:57:35 +0200 From: Andriy Gapon User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Adrian Chadd Subject: Re: [RFC/RFT] calloutng References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> In-Reply-To: X-Enigmail-Version: 1.4.6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Davide Italiano , Alexander Motin , FreeBSD Current , freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 07:57:49 -0000 on 17/12/2012 03:29 Adrian Chadd said the 
following: > On 16 December 2012 15:37, Alexander Motin wrote: >> Hi. >> >> Here is one more version. Unless something new will be found/reported this >> may be the last one, because me and Davide are quite satisfied with the >> results. If everything will be fine, I think we could commit it to HEAD >> closer to the end of the week: >> http://people.freebsd.org/~mav/calloutng_12_16.patch > > Hi, > > Don't you think one week is a little on the low side for reviewing > something this critical? Thank god that this feature was developed in a branch, it was developed for a long period of time and there were people who routinely reviewed and tested (and really used) it. And yeah, its design was announced and discussed well in advance too. > Would you mind approaching some of the cluster peeps and seeing if > they'll run this up on the ref10* boxes and VMs, just to get some > further exposure? -- Andriy Gapon From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 08:01:03 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4FD45EEC; Mon, 17 Dec 2012 08:01:03 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-we0-f182.google.com (mail-we0-f182.google.com [74.125.82.182]) by mx1.freebsd.org (Postfix) with ESMTP id 5DB2F8FC12; Mon, 17 Dec 2012 08:01:02 +0000 (UTC) Received: by mail-we0-f182.google.com with SMTP id u54so2590387wey.13 for ; Mon, 17 Dec 2012 00:01:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:message-id:date:from:user-agent:mime-version:to :cc:subject:references:in-reply-to:content-type :content-transfer-encoding; bh=V27VYDtiEN0I5WUehtWQ00IsO4YmLnQb5LP8VwO4KG0=; b=nhIBmSZK9WEeobW3CvXBLhAh5e9RmH0ua4RTaTtdMLjTY3oxwHd2AXq9Ej1y0r6pyc x9IPopjFABZqG27J4TcYt6hGjnbhpciwee+B5sv7AWZt6LahE1L+rzqpBJlYgV4zEIy2 LQxvR/X21KYmLExeGbpX1uljj8S2okDHxMMPjqc1VnRpBOcmF+mgz55RCgXwv2aGwbdl 
6frw3dNmJcaQutdvf/Jx8dTGU+LaiRjeTe7AHUkM+nrtRGhIamd/fNRPwXVzMaJze0HZ NiT7QaOMmuaLTlK9pHz+o6+6eKUPoGUq3IVpl+IHj96B8t4cvymZqp33pbWukZPxtcV1 rsIA== X-Received: by 10.181.11.234 with SMTP id el10mr13968450wid.7.1355731261377; Mon, 17 Dec 2012 00:01:01 -0800 (PST) Received: from mavbook.mavhome.dp.ua (mavhome.mavhome.dp.ua. [213.227.240.37]) by mx.google.com with ESMTPS id w5sm9979350wif.11.2012.12.17.00.00.59 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 17 Dec 2012 00:01:00 -0800 (PST) Sender: Alexander Motin Message-ID: <50CED13A.50209@FreeBSD.org> Date: Mon, 17 Dec 2012 10:00:58 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:13.0) Gecko/20120628 Thunderbird/13.0.1 MIME-Version: 1.0 To: Adrian Chadd Subject: Re: [RFC/RFT] calloutng References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Davide Italiano , FreeBSD Current , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 08:01:03 -0000 On 17.12.2012 03:29, Adrian Chadd wrote: > On 16 December 2012 15:37, Alexander Motin wrote: >> Here is one more version. Unless something new will be found/reported this >> may be the last one, because me and Davide are quite satisfied with the >> results. If everything will be fine, I think we could commit it to HEAD >> closer to the end of the week: >> http://people.freebsd.org/~mav/calloutng_12_16.patch > > Don't you think one week is a little on the low side for reviewing > something this critical? It has been in public development for the last half year. IIRC, the previous announcement by Davide a few months ago drew no feedback at all. If you say you will review it in two weeks, I will wait for two weeks. 
But I don't want to wait without a purpose. > Would you mind approaching some of the cluster peeps and seeing if > they'll run this up on the ref10* boxes and VMs, just to get some > further exposure? Do the ref* systems have enough load to show anything? -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 08:14:34 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8AF6E2EC; Mon, 17 Dec 2012 08:14:34 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by mx1.freebsd.org (Postfix) with ESMTP id 8BE6D8FC0C; Mon, 17 Dec 2012 08:14:33 +0000 (UTC) Received: by mail-wi0-f180.google.com with SMTP id hj13so1731017wib.13 for ; Mon, 17 Dec 2012 00:14:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=w/zkQSr9Q5/ISD7Iud0VXFj1xFWCvFP+QpH7bdfudHo=; b=ynuSDXv0Qf5Og62Mt9CewcJLHbqKw3YlwIuU9obgspbJiV0PND35mRS+FIHMTA3Kb7 R523E/6KCNxcjG3AiRijIuMg7xUQDPSw8Hh9rv+/K3Tmq/wKc8UM6zxPx8TTPBsTL0dw ChMewPE6f6eirGnafnE55xhG5PHAev4l0xYQd/mXHkBhPgmYHDirSi3AjaOPlDmXe6dB sLsmVZ+bMLZVYEYdITuWOE0v/cdBJWvPvo7oqCPXXEsjowoArzF7X0Ce2uH6XVHGyE2m ynwLKiwQUI7BjHY+0Uh+/lDh5xDqvlBCbaI/JULJJv89OdHTULGEAClDkvZoNt55pknI GOAQ== Received: by 10.194.235.6 with SMTP id ui6mr15359807wjc.12.1355732072371; Mon, 17 Dec 2012 00:14:32 -0800 (PST) Received: from mavbook.mavhome.dp.ua (mavhome.mavhome.dp.ua. 
[213.227.240.37]) by mx.google.com with ESMTPS id df2sm840955wib.0.2012.12.17.00.14.30 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 17 Dec 2012 00:14:31 -0800 (PST) Sender: Alexander Motin Message-ID: <50CED465.3010501@FreeBSD.org> Date: Mon, 17 Dec 2012 10:14:29 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:13.0) Gecko/20120628 Thunderbird/13.0.1 MIME-Version: 1.0 To: Adrian Chadd Subject: Re: [RFC/RFT] calloutng References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Garrett Cooper , Davide Italiano , FreeBSD Current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 08:14:34 -0000 On 17.12.2012 05:38, Adrian Chadd wrote: > On 16 December 2012 18:31, Garrett Cooper wrote: > >>> Would you mind approaching some of the cluster peeps and seeing if >>> they'll run this up on the ref10* boxes and VMs, just to get some >>> further exposure? >> >> And maybe tinderbox..? > > Tinderbox is a great idea. > > Maybe hit up the altq/pf using crowd and see if they'll test this stuff out too? It would be good to test, though I know that at least dummynet is written awfully from this project's point of view, with its callout_reset(&dn_timeout, 1, dummynet, NULL); It should work, but it will kill most of the power benefits. I was promised it will be fixed after this project ends. > What else gets heavily callout /timer driven? Try some computational > workloads that stress the fairness of ULE/4BSD, maybe? Schedulers are driven directly by hardclock()/statclock(), so fairness is not affected here. If the CPU is not idle, it will receive the full set of required events with the maximum available precision. 
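As a rough userland illustration (not code from the patch) of why a callout rearmed every tick, like the dummynet one mentioned above, defeats aggregation: the sketch below counts how many wakeups per second are needed when each timer expiry may be serviced anywhere in a one-sided window [due, due + prec]. Everything in it is an assumption made for illustration: struct timer, struct event and wakeups_per_second() are hypothetical names, one tick is taken to be 1000 us (hz = 1000), and the coalescing is the textbook greedy interval rule, not the algorithm calloutng actually uses.

```c
#include <assert.h>
#include <stdlib.h>

#define	SECOND_US	1000000L	/* one second, in microseconds */

/* Hypothetical periodic timer: fires every period_us, may be late by up to prec_us. */
struct timer {
	long period_us;
	long prec_us;
};

/* One expiry of a timer: it may be serviced anywhere in [due, latest]. */
struct event {
	long due;
	long latest;
};

/* qsort comparator: order events by the closing time of their window. */
static int
cmp_latest(const void *a, const void *b)
{
	const struct event *x = a, *y = b;

	return ((x->latest > y->latest) - (x->latest < y->latest));
}

/*
 * Count the wakeups needed in one second to service every expiry of the
 * given timers, letting expiries whose windows overlap share a wakeup.
 * Returns -1 on allocation failure.
 */
int
wakeups_per_second(const struct timer *t, int ntimers)
{
	struct event *ev;
	long due, fire = 0;
	int i, n = 0, total = 0, wakeups = 0;

	for (i = 0; i < ntimers; i++) {
		assert(t[i].period_us > 0 && t[i].prec_us >= 0);
		total += (int)(SECOND_US / t[i].period_us);
	}
	ev = malloc((size_t)(total > 0 ? total : 1) * sizeof(*ev));
	if (ev == NULL)
		return (-1);

	/* Expand each timer into its expiries within the first second. */
	for (i = 0; i < ntimers; i++) {
		for (due = t[i].period_us; due <= SECOND_US;
		    due += t[i].period_us) {
			ev[n].due = due;
			ev[n].latest = due + t[i].prec_us;
			n++;
		}
	}

	/*
	 * Greedy rule: take windows in order of closing time and, when one
	 * is not covered by the last wakeup, wake as late as it allows.
	 */
	qsort(ev, (size_t)n, sizeof(*ev), cmp_latest);
	for (i = 0; i < n; i++) {
		if (wakeups == 0 || ev[i].due > fire) {
			fire = ev[i].latest;
			wakeups++;
		}
	}
	free(ev);
	return (wakeups);
}
```

Under these assumptions, two timers with 100 ms and 250 ms periods need 12 wakeups per second when both must be exact and 10 when each tolerates 50 ms of lateness, while adding a zero-tolerance timer rearmed every 1 ms tick brings the total to 1000 no matter how tolerant the other timers are.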
-- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 08:36:01 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 122BC83F; Mon, 17 Dec 2012 08:36:01 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-wi0-f170.google.com (mail-wi0-f170.google.com [209.85.212.170]) by mx1.freebsd.org (Postfix) with ESMTP id E6AA58FC12; Mon, 17 Dec 2012 08:35:59 +0000 (UTC) Received: by mail-wi0-f170.google.com with SMTP id hq7so1865699wib.1 for ; Mon, 17 Dec 2012 00:35:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=mVs6lqD0bgHlaOfmxq94n5D5qrBcpUi17FOo3c4f24A=; b=iF8LVV3BjzZYaOPDwL1KezEXAY4ebs7svqxVvRuV+93T+BFxPWWS3nJvQ2l72NrWX4 dAZMqVM+hdJZ+bbLWfwtfWGKWVCc1G3rxuNnsebAIbZHjE/Joyw2wHgMq8Fk3RhlD+6d WU3a4zqr0wifX1O0qgkzHyImsgT3n/Ifq+y2k5V8kw4G4M9aMjO/GYGbEv5k24G6FNgj 17wYUqPYxWgvplbr3Ii5bBn7heGmdfZm/rxhg35ki4WKUn1phWX1ohCi4itbRfZdxwbf 4KsDG7L6y1VaVq3OHE3AbwAPM6Q1WSQlrY1KdqQAJS2tnjZ1qHaR7f/ODIiuohPFisqa 5xBg== MIME-Version: 1.0 Received: by 10.180.88.138 with SMTP id bg10mr14124048wib.13.1355733358907; Mon, 17 Dec 2012 00:35:58 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.217.57.9 with HTTP; Mon, 17 Dec 2012 00:35:58 -0800 (PST) In-Reply-To: <50CED06F.6080909@FreeBSD.org> References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> <50CED06F.6080909@FreeBSD.org> Date: Mon, 17 Dec 2012 00:35:58 -0800 X-Google-Sender-Auth: hOVybu-Q6r561y4QjoqUU-3x2Nw Message-ID: Subject: Re: [RFC/RFT] calloutng From: Adrian Chadd To: Andriy Gapon Content-Type: text/plain; charset=ISO-8859-1 Cc: Davide Italiano , Alexander Motin , FreeBSD Current , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion 
related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 08:36:01 -0000 On 16 December 2012 23:57, Andriy Gapon wrote: > > Thank god that this feature was developed in a branch, it was developed for a > long period of time and there were people who routinely reviewed and tested (and > really used) it. And yeah, its design was announced and discussed well in > advance too. I can see that; I was even there at David's presentation at BSDCan 2012 about it. Now that it's finished though, it would be nice to get some more stress testing before committing it. Just because it's been developed in a branch doesn't at all imply that it's had wide testing. It's now imminently going into the tree, so that may scare (heh!) a few people into testing it. I'm sure it'll mostly just work, that it'll not really break anything. It just to me feels that a week of warning before committing something like this is a little soon. It's _just_ been finished and the authors have been doing some last minute changes as they get better ideas on implementing things. I think it's great work, I'd just like to see some wider testing. So to put my money where my mouth is, I bought a new hard disk for my T60 last week. I'll install -HEAD on it tomorrow and give it a whirl. 
Thanks, Adrian From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 19:07:49 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 1728C697; Mon, 17 Dec 2012 19:07:49 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id BE2248FC13; Mon, 17 Dec 2012 19:07:48 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 606BF7300A; Mon, 17 Dec 2012 20:06:20 +0100 (CET) Date: Mon, 17 Dec 2012 20:06:20 +0100 From: Luigi Rizzo To: Alexander Motin Subject: calloutng and dummynet (Re: [RFC/RFT] calloutng) Message-ID: <20121217190620.GA83203@onelab2.iet.unipi.it> References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> <50CED465.3010501@FreeBSD.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50CED465.3010501@FreeBSD.org> User-Agent: Mutt/1.4.2.3i Cc: Garrett Cooper , Davide Italiano , Adrian Chadd , FreeBSD Current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 19:07:49 -0000 On Mon, Dec 17, 2012 at 10:14:29AM +0200, Alexander Motin wrote: > On 17.12.2012 05:38, Adrian Chadd wrote: ... > >Maybe hit up the altq/pf using crowd and see if they'll test this stuff > >out too? > > It would be good to test, though I know that at least dummynet is > written awful from the point of this project with its > callout_reset(&dn_timeout, 1, dummynet, NULL); > It should work, but kill most of power benefits. I was promised it will > be fixed after this project end. 
never trust italians :) but it is good that you are reminding me this, hopefully i will be able to give it a shot after the holidays cheers luigi From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 19:28:52 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 995E92C6; Mon, 17 Dec 2012 19:28:52 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 50B7D8FC0C; Mon, 17 Dec 2012 19:28:52 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id CF1B17300A; Mon, 17 Dec 2012 20:27:31 +0100 (CET) Date: Mon, 17 Dec 2012 20:27:31 +0100 From: Luigi Rizzo To: Davide Italiano Subject: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng Message-ID: <20121217192731.GA83405@onelab2.iet.unipi.it> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i Cc: freebsd-current , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 19:28:52 -0000 [addressing the various items separately] On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote: > On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo wrote: ... > > - for several functions the only change is the name of an argument > > from "busy" to "us". Can you elaborate the reason for the change, > > and whether "us" means microseconds or the pronoun ?) > > > > Please see r242905 by mav@. i see the goal of this patch is to pass along the amount of time till the next timer. 
I wonder why the choice is to use (actually, call) the value "microseconds" rather than use a bintime or something scaled and with a well defined resolution. In fact looking at the relevant diff http://svnweb.freebsd.org/base/projects/calloutng/sys/kern/kern_clocksource.c?r1=242905&r2=242904&pathrev=242905 cpu_idleclock() actually returns a value that is not even in microseconds but in units of 1/(2^20) seconds. The value seems to be ignored right now so it would be a good time to discuss the resolution. I am concerned that at some point (5 years from now perhaps ?) microseconds might start to become too coarse and we would like something with a more fine-grained resolution. On the other hand, for the purposes of this change, we can probably live with an upper limit of some seconds (waking up the machine once per second is not going to kill performance). So i would suggest to make the argument to these functions uint32_t or uint64_t (preferably the same for 32- and 64-bit machines), rename it to something different from 'us' and have at least 28-30 fractional bits to represent a bintime. Right now you are using a bintime with 20 fractional bits and 11 or 43 bits for the integer part.
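To make the unit question concrete, here is a minimal sketch, not taken from the patch, of a fixed-point value with 20 fractional bits. The names FRAC_BITS, us_to_q20(), q20_to_us() and q20_range_seconds() are hypothetical and exist only for illustration. It shows that one unit is 1/1048576 s rather than 1 us (so the two scales differ by roughly 5%), and where the 11 or 43 integer bits come from for signed 32-bit and 64-bit types.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; none of this is from the patch. */
#define FRAC_BITS	20	/* one unit is 2^-20 s, about 0.954 us */

/* Convert microseconds to units of 2^-20 s, rounding down. */
static inline int64_t
us_to_q20(int64_t us)
{
	return ((us << FRAC_BITS) / 1000000);
}

/* Convert units of 2^-20 s back to microseconds, rounding down. */
static inline int64_t
q20_to_us(int64_t q)
{
	return ((q * 1000000) >> FRAC_BITS);
}

/*
 * Seconds representable by a signed type of the given width:
 * one sign bit and FRAC_BITS fractional bits leave
 * width - 1 - FRAC_BITS integer bits.
 */
static inline int64_t
q20_range_seconds(int width)
{
	return ((int64_t)1 << (width - 1 - FRAC_BITS));
}
```

With these helpers one second is 1048576 units, not 1000000, and a signed 32-bit value tops out at 2048 seconds.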
cheers luigi From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 19:59:46 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3208DC41; Mon, 17 Dec 2012 19:59:46 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-wg0-f52.google.com (mail-wg0-f52.google.com [74.125.82.52]) by mx1.freebsd.org (Postfix) with ESMTP id 679BC8FC14; Mon, 17 Dec 2012 19:59:45 +0000 (UTC) Received: by mail-wg0-f52.google.com with SMTP id 12so2782571wgh.31 for ; Mon, 17 Dec 2012 11:59:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :content-type:content-transfer-encoding; bh=CVxVzJmOM1P1VQmTd5HRCxSNaY494vnOGnGSy77hHng=; b=0q3GrivsgQyVGHiIWG0cMfWQa9rN1+za+GLH6QWvMQ1fImO1P2LUpqnPDc1af3SLWN wHLwy95OoE5CQ+k/5MxtvapYL7yk6bYj/zfcAQHwNrANB1LLaURoxpgBxcJKhFKRkJh4 scZJZceJsO4UtccppeUK84apuNoHX/1oajlG9GVbDRTL3+xBLYvEL3BmqtKxs/tovJs0 0+478P/wmJMBHTpFjKIS/swUDvsypF1C/HVbnI40WmvmG4ZW453s5lDnaLnJelmk+pA0 AxSHmqC88cbWE2BohE7IiKWg1dRnCJFYFd2p0GIma8J+NMP/DbRadjxRizharIClRezt r4kg== Received: by 10.180.97.68 with SMTP id dy4mr17502497wib.7.1355774384480; Mon, 17 Dec 2012 11:59:44 -0800 (PST) Received: from mavbook.mavhome.dp.ua (mavhome.mavhome.dp.ua. 
[213.227.240.37]) by mx.google.com with ESMTPS id u6sm14120591wif.2.2012.12.17.11.59.42 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 17 Dec 2012 11:59:43 -0800 (PST) Sender: Alexander Motin Message-ID: <50CF79AD.9040600@FreeBSD.org> Date: Mon, 17 Dec 2012 21:59:41 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:13.0) Gecko/20120628 Thunderbird/13.0.1 MIME-Version: 1.0 To: rizzo@iet.unipi.it Subject: Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng Content-Type: text/plain; charset=KOI8-R; format=flowed Content-Transfer-Encoding: 7bit Cc: Davide Italiano , freebsd-current , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 19:59:46 -0000 Hi. > I wonder why the choice is to use (actually, call) the value > "microseconds" rather use a bintime or something scaled and with a > well defined resolution. It was a kind of engineering choice. I chose microseconds, following the values ACPI uses to represent CPU sleep state exit latencies. For now that is the only use of this value. If CPUs reduce wakeup latencies so much that this scale becomes too coarse, this type will be the smallest of our optimization tasks. On the other hand, I have some doubts that we will be able to reach the supported limit of 2048 seconds on the integer side. Right now even a completely empty idle system has about 30 interrupts per second, which is far from the 0.0005 per second (one wakeup per 2048 seconds) that the limit would allow. And I don't know of any system where CPUs have a 2048-second wakeup latency.
-- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 20:02:23 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8C9A3F3; Mon, 17 Dec 2012 20:02:23 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id D94858FC13; Mon, 17 Dec 2012 20:02:22 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 68BF97300A; Mon, 17 Dec 2012 21:01:02 +0100 (CET) Date: Mon, 17 Dec 2012 21:01:02 +0100 From: Luigi Rizzo To: Davide Italiano Subject: API explosion (Re: [RFC/RFT] calloutng) Message-ID: <20121217200102.GA83832@onelab2.iet.unipi.it> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i Cc: freebsd-current , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 20:02:23 -0000 [again, response to another issue i raised] On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote: > On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo wrote: ... > > Finally, a more substantial comment: > > - a lot of functions which formerly had only a "timo" argument > > now have "timo, bt, precision, flags". Take seltdwait() as an example. > > > > seltdwait() is not part of the public KPI. It has been modified to > avoid code duplication. Having seltdwait() and seltdwait_bt(), i.e. > two separate functions, even though we could share most of the code is > not a clever approach, IMHO. > As I told before, seltdwait() is not exposed so we can modify its > argument without breaking anything. 
> > > It seems that you have been undecided between two approaches: > > for some of these functions you have preserved the original function > > that deals with ticks and introduced a new one that deals with the > > bintime, > > whereas in other cases you have modified the original function to add > > "bt, precision, flags". > > > > I'm not. All the functions which are part of the public KPI (e.g. > condvar(9), sleepq(9), *) are still available. *_flags variants have > been introduced so that consumers can take advantage of the new > 'precision tolerance mechanism' implemented. Also, *_bt variants have > been introduced. I don't see any "undecision" between the two > approaches. > Please note that now the callout backend deals with bintime, so every > time callout_reset_on() is called, the 'tick' argument passed is > silently converted to bintime. I will try to give more specific example, using the latest patch from mav http://people.freebsd.org/~mav/calloutng_12_16.patch In the manpage, for instance, the existing functions now are extended with two more variants (sometimes; msleep_spin() for instance is missing msleep_spin_bt() but perhaps that is just an oversight). .Nm sleepq_set_timeout , +.Nm sleepq_set_timeout_flags , +.Nm sleepq_set_timeout_bt , .Nm msleep , +.Nm msleep_flags , +.Nm msleep_bt , .Nm msleep_spin , +.Nm msleep_spin_flags , .Nm pause , +.Nm pause_flags , +.Nm pause_bt , .Nm tsleep , +.Nm tsleep_flags , +.Nm tsleep_bt , .Nm cv_timedwait , +.Nm cv_timedwait_bt , +.Nm cv_timedwait_flags , .Nm cv_timedwait_sig , +.Nm cv_timedwait_sig_bt , +.Nm cv_timedwait_sig_flags , .Nm callout_reset , +.Nm callout_reset_flags , .Nm callout_reset_on , +.Nm callout_reset_flags_on , +.Nm callout_reset_bt_on , If you look at the backends, they take both a timo and a bintime. 
-int _cv_timedwait(struct cv *cvp, struct lock_object *lock, int timo);
-int _cv_timedwait_sig(struct cv *cvp, struct lock_object *lock, int timo);
+int _cv_timedwait(struct cv *cvp, struct lock_object *lock,
+    struct bintime *bt, struct bintime *precision, int timo,
+    int flags);
+int _cv_timedwait_sig(struct cv *cvp, struct lock_object *lock,
+    struct bintime *bt, struct bintime *precision, int timo,
+    int flags);

and then internally they call the 'timo' or the 'bt' version depending on the case

+       if (bt == NULL)
+               sleepq_set_timeout_flags(cvp, timo, flags);
+       else
+               sleepq_set_timeout_bt(cvp, bt, precision);

So basically you are doing the following:

+ create two new variants for each existing function

        foo(..., timo, ...)                     old method
        foo_flags(..., timo, ...)               new method
        foo_bt(..., bt, precision, ...)         new method

  (the 'API explosion' i am mentioning in the Subject)

+ the variants are mapped to the same internal function

        _foo_internal(..., timo, bt, precision, flags, ...)

+ and then the internal function has a (runtime) conditional

        if (bt == NULL) {
                // convert timo to bt
        } else {
                // have a bt + precision
        }
        ...

I would instead do the following:

+ create a new API function that takes only bintime+precision+flags, no ticks. I am not sure if there is even a need to have an internal name _cv_timedwait_sig( .... ) or you can just have cv_timedwait_sig_bt(...)

+ use a macro or an inline function to remap the old API to the (single) new one, making the argument conversion immediately.

        #define cv_timedwait_sig(...) cv_timedwait_sig_bt( ...)

  This has the advantage that conversions might be done at compile time, perhaps with some advantage in terms of code and performance.

+ do not bother creating yet another cv_timedwait_sig_flags() function. Since it did not exist before, you have to do the conversion manually anyways, at which point you just change the argument to use bintime instead of ticks.
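The remapping idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the patch: struct xbintime, XHZ, xticks_to_bt(), xwait_bt() and xwait() are all hypothetical names, the tick rate is assumed fixed at 1000, tick counts are assumed non-negative, and the bintime is treated as a relative interval, which sidesteps the question of relative versus absolute time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a bintime-like type; not the kernel's. */
struct xbintime {
	int64_t		sec;	/* whole seconds */
	uint64_t	frac;	/* fraction of a second, units of 2^-64 s */
};

#define XHZ	1000		/* assumed tick rate, ticks per second */

/* Convert a non-negative relative tick count to a relative interval. */
static inline struct xbintime
xticks_to_bt(int ticks)
{
	struct xbintime bt;

	bt.sec = ticks / XHZ;
	/* (2^63 / XHZ) << 1 approximates 2^64 / XHZ without overflow. */
	bt.frac = (uint64_t)(ticks % XHZ) *
	    ((((uint64_t)1 << 63) / XHZ) << 1);
	return (bt);
}

static struct xbintime xlast_bt;	/* what the backend last received */

/* The single backend: interval, tolerated precision, flags; no ticks. */
static int
xwait_bt(struct xbintime bt, struct xbintime precision, int flags)
{
	(void)precision;
	(void)flags;
	xlast_bt = bt;	/* a real backend would arm a timer and sleep */
	return (0);
}

/* The old tick-based entry point, remapped onto the single backend. */
#define XBT_ZERO	((struct xbintime){ 0, 0 })
#define xwait(timo)	xwait_bt(xticks_to_bt(timo), XBT_ZERO, 0)
```

A call such as xwait(1500) reaches the backend as one second plus a fraction just under one half. When the argument is a constant, an optimizing compiler may fold the conversion at compile time, which is the advantage claimed for this approach.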
Note that what i am proposing is a simplification of your code and should not require much extra effort. cheers luigi From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 20:17:56 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 041128C6; Mon, 17 Dec 2012 20:17:56 +0000 (UTC) (envelope-from davide.italiano@gmail.com) Received: from mail-vb0-f54.google.com (mail-vb0-f54.google.com [209.85.212.54]) by mx1.freebsd.org (Postfix) with ESMTP id 853D38FC0A; Mon, 17 Dec 2012 20:17:55 +0000 (UTC) Received: by mail-vb0-f54.google.com with SMTP id l1so7768137vba.13 for ; Mon, 17 Dec 2012 12:17:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=dgTMgSHlhvv2LNpvLOg3HmEKKueXCoDfISnnecrQaH4=; b=vDNtf0VWQtycePzfTy/Fd95STGm2kvY1RTAmhZGVDExFhO20d8MVqDMK9bSHvdk29F 0Xs+wnho935+Sr8RT9/3+BVG40CBG3DZhha7OavodTqTWd0E4AwJeTyegmVpe7592/Oo eypEN23eiyly3vkgEiJ/Y3bTw+ORd88WNIRm4TPMWcyahhhDMkDDGMt00fk1YqiHOOdp 6I7kVMG33gKhsobZHWz75pWqkEaC1bNBZZjzr/SZECqHjwNwGDNktQKaAxueay9/OZ1I H2v7fhiBnD/levRsU0prVixJDEOr2zr8NN/344gpJr0jhqwex+e4uiQTRFiuBlwScPVy 0DgQ== MIME-Version: 1.0 Received: by 10.52.66.70 with SMTP id d6mr22146251vdt.30.1355775474752; Mon, 17 Dec 2012 12:17:54 -0800 (PST) Sender: davide.italiano@gmail.com Received: by 10.58.229.136 with HTTP; Mon, 17 Dec 2012 12:17:54 -0800 (PST) In-Reply-To: <20121217192731.GA83405@onelab2.iet.unipi.it> References: <20121217192731.GA83405@onelab2.iet.unipi.it> Date: Mon, 17 Dec 2012 12:17:54 -0800 X-Google-Sender-Auth: 2atf83R1CkUURAQQX-9NZ0rna8M Message-ID: Subject: Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng From: Davide Italiano To: Luigi Rizzo Content-Type: text/plain; charset=ISO-8859-1 Cc: freebsd-current , freebsd-arch@freebsd.org 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 20:17:56 -0000 On Mon, Dec 17, 2012 at 11:27 AM, Luigi Rizzo wrote: > [addressing the various items separately] > > On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote: >> On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo wrote: > ... >> > - for several functions the only change is the name of an argument >> > from "busy" to "us". Can you elaborate the reason for the change, >> > and whether "us" means microseconds or the pronoun ?) >> > >> >> Please see r242905 by mav@. > > i see the goal of this patch is to pass along the amount of > time till the next timer. > > I wonder why the choice is to use (actually, call) the value > "microseconds" rather use a bintime or something scaled and with a > well defined resolution. > > In fact looking at the relevant diff > > http://svnweb.freebsd.org/base/projects/calloutng/sys/kern/kern_clocksource.c?r1=242905&r2=242904&pathrev=242905 > > cpu_idleclock() actually returns a value that is not even microseconds > but 1/(2^20) seconds. The value seems to be ignored right now > so it would be a good time to discuss the resolution. > > I am concerned that at some point (5 years from now perhaps ?) > microseconds might start to become too coarse and we would like > something with a more fine-grained resolution. On the > other hand, for the purposes of this change, we can probably > live with an upper limit of some seconds (waking up the machine > once per second is not going to kill performance). > I would talk more about power consumption problem rather than performances. Yes, you're right because now, even with calloutng changes, the CPU is woken up at least twice per second. 
The wheel scan, in case it doesn't find a new callout to schedule in the next half-second, schedules an interrupt half a second from "now" (where now is the time obtained using getbinuptime()/binuptime()). This is a threshold we set up empirically, and it resulted is "good" for now. But in the future someone may raise the threshold to 1 second, 10 seconds, or something like. So, I don't agree with your statement. > So i would suggest to make the argument to these functions > uint_32 or uint_64 (preferably the same for 32- and 64-bit machines), > rename it to something different from 'us' > and have at least 28-30 fractional bits to represent a bintime. > > Right now you are using a bintime with 20 fractional and 11 or 43 > bits for the integer part. > > > cheers > luigi Thanks Davide From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 21:03:59 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 33FA49C2; Mon, 17 Dec 2012 21:03:59 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-we0-f182.google.com (mail-we0-f182.google.com [74.125.82.182]) by mx1.freebsd.org (Postfix) with ESMTP id 5E3B88FC12; Mon, 17 Dec 2012 21:03:58 +0000 (UTC) Received: by mail-we0-f182.google.com with SMTP id u54so3022735wey.13 for ; Mon, 17 Dec 2012 13:03:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :content-type:content-transfer-encoding; bh=BSdFSZV7b+8Je0QAPEO+KvvoTC+MZ3f0TujYVgfjsJE=; b=iARPxZ6cMYac1CYvTjyUBzHk7GuXZnuNTdrCBK/1tCTohajTh5+oLIIX9k1I1LkOFR zjmYDn5gXX4U6RgQc/c7onTnOIeuY6OWBCg08Gy++1cSsvo4nMm6UiQiw59sKJdMWwJ7 0vCy8anjZ3NbXOSwAofkuTQEQX+rMAWPlaAFY9qDcETCypYCQ+PJpngIbR88OvFTehCA k/bm9LnthNCDIqdVgDjJkAsPdVBt4puaHc/DUMgcZ5WlSjNjfdKbrWxWCGWrvNITKFjX WY3SRiQh3XI7iHINYdFAZAfLR2kLaJHuuzJrvHFpahtphost9CnYIc6c5h/lupOr2VAV Avrw== Received: by 
10.180.87.102 with SMTP id w6mr17883446wiz.19.1355778236465; Mon, 17 Dec 2012 13:03:56 -0800 (PST) Received: from mavbook.mavhome.dp.ua (mavhome.mavhome.dp.ua. [213.227.240.37]) by mx.google.com with ESMTPS id i6sm12853292wix.5.2012.12.17.13.03.54 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 17 Dec 2012 13:03:55 -0800 (PST) Sender: Alexander Motin Message-ID: <50CF88B9.6040004@FreeBSD.org> Date: Mon, 17 Dec 2012 23:03:53 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:13.0) Gecko/20120628 Thunderbird/13.0.1 MIME-Version: 1.0 To: Luigi Rizzo Subject: Re: API explosion (Re: [RFC/RFT] calloutng) Content-Type: text/plain; charset=KOI8-R; format=flowed Content-Transfer-Encoding: 7bit Cc: Davide Italiano , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 21:03:59 -0000 Hi. > I would instead do the following: I also don't much like the wide API and want to hear fresh ideas, but the approaches to time measurement there are too different to do what you are proposing. The main problem is that while a ticks value is relative, bintime is absolute. It is not easy to make the conversion between them fast and precise. I've managed to do it, but the only function that does it now is _callout_reset_on(). All other functions just pass the values down. I am not sure I want to duplicate that code in each place, though doing it at least for callout may be a good idea. In creating sets of three functions I had three different goals: - callout_reset() -- it is the legacy variant required to keep API compatibility; - callout_reset_flags() -- it is for cases where custom precision specification needs to be added to the existing code, or where direct callout execution is needed.
Conversion to bintime would additionally complicate consumer code, which I would like to avoid. - callout_reset_bt() -- API for the new code, which needs high precision and doesn't mind operating on bintime. There are only three such places in the kernel now, and I don't think there will be many more. Respectively, these three options are replicated to other APIs where time intervals are used. PS: Please keep me in CC. -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 21:12:33 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id DE114D98; Mon, 17 Dec 2012 21:12:33 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 95E458FC15; Mon, 17 Dec 2012 21:12:33 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id D07967300A; Mon, 17 Dec 2012 22:11:12 +0100 (CET) Date: Mon, 17 Dec 2012 22:11:12 +0100 From: Luigi Rizzo To: Davide Italiano Subject: Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng Message-ID: <20121217211112.GA84347@onelab2.iet.unipi.it> References: <20121217192731.GA83405@onelab2.iet.unipi.it> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i Cc: freebsd-current , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 21:12:33 -0000 On Mon, Dec 17, 2012 at 12:17:54PM -0800, Davide Italiano wrote: > On Mon, Dec 17, 2012 at 11:27 AM, Luigi Rizzo wrote: > > [addressing the various items separately] > > > > On Fri, Dec 14, 2012 at 01:57:36PM +0100, Davide Italiano wrote:
> >> On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo wrote: > > ... > >> > - for several functions the only change is the name of an argument > >> > from "busy" to "us". Can you elaborate the reason for the change, > >> > and whether "us" means microseconds or the pronoun ?) > >> > > >> > >> Please see r242905 by mav@. > > > > i see the goal of this patch is to pass along the amount of > > time till the next timer. > > > > I wonder why the choice is to use (actually, call) the value > > "microseconds" rather use a bintime or something scaled and with a > > well defined resolution. > > > > In fact looking at the relevant diff > > > > http://svnweb.freebsd.org/base/projects/calloutng/sys/kern/kern_clocksource.c?r1=242905&r2=242904&pathrev=242905 > > > > cpu_idleclock() actually returns a value that is not even microseconds > > but 1/(2^20) seconds. The value seems to be ignored right now > > so it would be a good time to discuss the resolution. > > > > I am concerned that at some point (5 years from now perhaps ?) > > microseconds might start to become too coarse and we would like > > something with a more fine-grained resolution. On the > > other hand, for the purposes of this change, we can probably > > live with an upper limit of some seconds (waking up the machine > > once per second is not going to kill performance). > > > > I would talk more about power consumption problem rather than performances. > Yes, you're right because now, even with calloutng changes, the CPU is > woken up at least twice per second. > The wheel scan, in case it doesn't find a new callout to schedule in > the next half-second, schedules an interrupt half a second from "now" > (where now is the time obtained using getbinuptime()/binuptime()). > This is a threshold we set up empirically, and it resulted is "good" > for now. But in the future someone may raise the threshold to 1 > second, 10 seconds, or something like. So, I don't agree with your > statement. 
this is only an issue if we want to use 32 bits. If we go to 64 bits, there is enogh space for picoseconds on the fractional part, and a few years in the integer part. but keep in mind, even powerwise, i doubt the exit from deep sleep and a callout takes more than 500us so even doing that once per second gives less than 0.5 per mille increase in power compared to a machine that is always idle This is really noise that we should not worry about. cheers luigi From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 21:54:33 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 78B86450; Mon, 17 Dec 2012 21:54:33 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-we0-f177.google.com (mail-we0-f177.google.com [74.125.82.177]) by mx1.freebsd.org (Postfix) with ESMTP id AAA328FC12; Mon, 17 Dec 2012 21:54:32 +0000 (UTC) Received: by mail-we0-f177.google.com with SMTP id x48so2973194wey.8 for ; Mon, 17 Dec 2012 13:54:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=5s2Y48be157PAg2mxRaAAWvMFobgVyxf3UwS9g2FFvc=; b=Gx9k2yP3Pey6Zpf8wudzTbPgkiJkSweuyvUi11a5lBJTVvVuqK9Cb7tr4x71K5F1O/ fBDFayr32c5mOPz1f550mB7+Pb1lDdOVyDE6jNhIevqYEmyZ2T+rw9almhqc4snjvBdZ Ej+XGehKesuh24T8mafu8p4zc8wIToyMvin5FIl7OcASX0EuNfwDLIRbeXVKZMWaAu7f 6Bd93ReDogU7xH7tU69hQ+1KyYqTwx0PZu9h45wuRaLWy+9rRlcT0SRZOzXHbjj90MWD heWjbXV6hGWZhEzoYAu5ay5WZ5A151EYpAztAkOnSwZJ2206jG+0TZHi/uN+lqRLxCJQ PgEw== MIME-Version: 1.0 Received: by 10.180.24.4 with SMTP id q4mr14835wif.19.1355779379090; Mon, 17 Dec 2012 13:22:59 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.217.57.9 with HTTP; Mon, 17 Dec 2012 13:22:59 -0800 (PST) In-Reply-To: <20121217211112.GA84347@onelab2.iet.unipi.it> References: 
<20121217192731.GA83405@onelab2.iet.unipi.it> <20121217211112.GA84347@onelab2.iet.unipi.it> Date: Mon, 17 Dec 2012 13:22:59 -0800 X-Google-Sender-Auth: yElOrvB5M6U2oSN7CV9Sg3YZFE4 Message-ID: Subject: Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng From: Adrian Chadd To: Luigi Rizzo Content-Type: text/plain; charset=ISO-8859-1 Cc: Davide Italiano , freebsd-current , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 21:54:33 -0000 Personally, I'd rather see some consistently used units here.. Adrian From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 22:18:50 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3ED192C0; Mon, 17 Dec 2012 22:18:50 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id CC47C8FC16; Mon, 17 Dec 2012 22:18:49 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 529438A3FC; Mon, 17 Dec 2012 22:18:48 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBHMIlI2006709; Mon, 17 Dec 2012 22:18:47 GMT (envelope-from phk@phk.freebsd.dk) To: Alexander Motin Subject: Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng In-reply-to: <50CF79AD.9040600@FreeBSD.org> From: "Poul-Henning Kamp" References: <50CF79AD.9040600@FreeBSD.org> Date: Mon, 17 Dec 2012 22:18:47 +0000 Message-ID: <6708.1355782727@critter.freebsd.dk> Cc: Davide Italiano , freebsd-current , freebsd-arch@FreeBSD.org X-BeenThere: 
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 22:18:50 -0000

--------
In message <50CF79AD.9040600@FreeBSD.org>, Alexander Motin writes:
>Hi.
>
> > I wonder why the choice is to use (actually, call) the value
> > "microseconds" rather use a bintime or something scaled and with a
> > well defined resolution.
>
>It was kind of engineering choice. I've chosen microseconds [...]

And that was the wrong choice, the format should be a binary number
so arithmetic and comparisons do not become a nightmare.

If people need milliseconds or microseconds, they can get that using
suitable #defined multiplication factors.

A 64 bit type, with 32bit before and after the binary point would
be sufficient for now, and easily extensible to something larger
should one or more laws of computing nature be changed.

So do the following:

	typedef int64_t dur_t;	/* signed for bug catching */
	#define DURSEC	((dur_t)1 << 32)
	#define DURMIN	(DURSEC * 60)
	#define DURMSEC	(DURSEC / 1000)
	#define DURUSEC	(DURSEC / 1000000)
	#define DURNSEC	(DURSEC / 1000000000)

And stop crapping around with mixed-radix numbers, even the British
changed to decimal coinage to avoid that crap...

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
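A minimal sketch of how the 32.32 fixed-point duration proposed above might be used, written with the typedef in standard C order and with divisors of 10^6 and 10^9 for microseconds and nanoseconds. The helpers dur_from_us() and dur_to_us() are hypothetical names introduced only for illustration; they are not part of the patch or of any FreeBSD API, and they assume non-negative values.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t dur_t;                  /* signed 32.32 fixed point */

#define DURSEC   ((dur_t)1 << 32)       /* one second */
#define DURMSEC  (DURSEC / 1000)
#define DURUSEC  (DURSEC / 1000000)     /* 4294 after integer truncation */
#define DURNSEC  (DURSEC / 1000000000)  /* 4 after integer truncation */

/*
 * Hypothetical helper: microseconds -> dur_t.
 * Assumes us >= 0 and fewer than 2^31 whole seconds.
 */
static dur_t
dur_from_us(int64_t us)
{
	assert(us >= 0);
	/* Split into seconds and remainder so the multiply cannot overflow. */
	return ((us / 1000000) * DURSEC +
	    ((us % 1000000) * DURSEC) / 1000000);
}

/* Hypothetical helper: dur_t -> microseconds, truncating toward zero. */
static int64_t
dur_to_us(dur_t d)
{
	assert(d >= 0);
	return ((d / DURSEC) * 1000000 +
	    ((d % DURSEC) * 1000000) / DURSEC);
}
```

Once a value is in this form, durations can be added, subtracted and compared as plain integers, e.g. `dur_from_us(250) < DURMSEC`. Note that the small factors are coarse: DURNSEC truncates 4.29 to 4, so scaling a nanosecond count by it is off by roughly 7%, and a divide-based conversion like the one above is more accurate.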
From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 22:39:43 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: by hub.freebsd.org (Postfix, from userid 664) id D504FC9D; Mon, 17 Dec 2012 22:39:43 +0000 (UTC) Date: Mon, 17 Dec 2012 14:39:43 -0800 From: David O'Brien To: Alexander Motin Subject: Re: [RFC/RFT] calloutng Message-ID: <20121217223943.GA88856@dragon.NUXI.org> Mail-Followup-To: obrien@freebsd.org, Alexander Motin , Davide Italiano , freebsd-arch@freebsd.org, FreeBSD Current , Mark Johnston References: <50CCAB99.4040308@FreeBSD.org> <20121215203458.GA22361@oddish> <50CCE59F.1080107@FreeBSD.org> <50CCE7F6.2090505@FreeBSD.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50CCE7F6.2090505@FreeBSD.org> X-Operating-System: FreeBSD 10.0-CURRENT X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger? User-Agent: Mutt/1.5.20 (2009-06-14) Cc: Davide Italiano , Mark Johnston , FreeBSD Current , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: obrien@freebsd.org List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 22:39:43 -0000 On Sat, Dec 15, 2012 at 11:13:26PM +0200, Alexander Motin wrote: > On 15.12.2012 23:03, Alexander Motin wrote: > > Sorry, it's my fault. I've tried to save some time on patch generation > > and forgot about that change in lib/. We haven't touched user-level in > > our work except that file. Here is patch with that chunk added: > > http://people.freebsd.org/~mav/calloutng_12_15_1.patch > > And one more part I've missed is manual pages update, that probably > needs more improvements: > http://people.freebsd.org/~mav/calloutng_12_15.man.patch Perhaps use the SCM at what its good at? Sync your branch with HEAD and then do an 'svn diff ^/head' and your branch. 
-- -- David (obrien@FreeBSD.org) From owner-freebsd-arch@FreeBSD.ORG Mon Dec 17 22:53:34 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E9DE83BE; Mon, 17 Dec 2012 22:53:34 +0000 (UTC) (envelope-from davide.italiano@gmail.com) Received: from mail-vc0-f182.google.com (mail-vc0-f182.google.com [209.85.220.182]) by mx1.freebsd.org (Postfix) with ESMTP id 302AF8FC17; Mon, 17 Dec 2012 22:53:34 +0000 (UTC) Received: by mail-vc0-f182.google.com with SMTP id fy27so5898303vcb.13 for ; Mon, 17 Dec 2012 14:53:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:content-type; bh=H6KlhSiR6C2nko2l4TkcphHSJnA30idh3OLsi8lL+jU=; b=om6qrOk72mWOCMhzNyu/HQjEU/aku5j/YI0jvbNdPuRJmhmrJbqBgSesx95FGcnwFl z35cTj3Tm7YzxjLqgFMjjnzqM3MvWyicHlNYvnISJeKCNXknFMb2QCDrWMrJjSlpYqmI WwdF931alSCgb2iIpnq5CyUS4sVNB2oC3/jkRmtOPiHI77S8L+qpbogoyXHPqCYtRw2H U3HKZ7D/zYlmP0pEQe0yROmhAhoJlqc/SYvgXOYEUmH/DFvQT2A1Pf9LooC3tYjdznTj 9XuzyF44FSNjmVLgic7+pFYURf6lyC73wuB+XWL6KuZBg/4BfzeJeww1gOxD8b6T3Xv6 vPDQ== MIME-Version: 1.0 Received: by 10.220.108.79 with SMTP id e15mr25745037vcp.61.1355784813492; Mon, 17 Dec 2012 14:53:33 -0800 (PST) Sender: davide.italiano@gmail.com Received: by 10.58.229.136 with HTTP; Mon, 17 Dec 2012 14:53:33 -0800 (PST) In-Reply-To: <20121217223943.GA88856@dragon.NUXI.org> References: <50CCAB99.4040308@FreeBSD.org> <20121215203458.GA22361@oddish> <50CCE59F.1080107@FreeBSD.org> <50CCE7F6.2090505@FreeBSD.org> <20121217223943.GA88856@dragon.NUXI.org> Date: Mon, 17 Dec 2012 14:53:33 -0800 X-Google-Sender-Auth: xnCvV9VqCGLkO6JabtmmVnG1LTE Message-ID: Subject: Re: [RFC/RFT] calloutng From: Davide Italiano To: obrien@freebsd.org, Alexander Motin , Davide Italiano , freebsd-arch@freebsd.org, FreeBSD Current , Mark Johnston Content-Type: text/plain; 
charset=ISO-8859-1 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 17 Dec 2012 22:53:35 -0000

On Mon, Dec 17, 2012 at 2:39 PM, David O'Brien wrote:
> On Sat, Dec 15, 2012 at 11:13:26PM +0200, Alexander Motin wrote:
>> On 15.12.2012 23:03, Alexander Motin wrote:
>> > Sorry, it's my fault. I've tried to save some time on patch generation
>> > and forgot about that change in lib/. We haven't touched user-level in
>> > our work except that file. Here is patch with that chunk added:
>> > http://people.freebsd.org/~mav/calloutng_12_15_1.patch
>>
>> And one more part I've missed is manual pages update, that probably
>> needs more improvements:
>> http://people.freebsd.org/~mav/calloutng_12_15.man.patch
>
> Perhaps use the SCM at what its good at?
>
> Sync your branch with HEAD and then do an 'svn diff ^/head' and your
> branch.
>
> --
> -- David (obrien@FreeBSD.org)

Last time I tried doing that the way you describe, I got output
with tons of svn:mergeinfo property changes, and I didn't find a way
to suppress them other than manually, e.g.:

Property changes on: usr.bin/procstat
___________________________________________________________________
Modified: svn:mergeinfo
   Merged /head/usr.bin/procstat:r236314-239017
Index: usr.bin/calendar
===================================================================
--- usr.bin/calendar (.../head) (revision 239166)
+++ usr.bin/calendar (.../projects/calloutng) (revision 239166)

Can you help me understand what I did wrong?
Thanks From owner-freebsd-arch@FreeBSD.ORG Tue Dec 18 09:03:52 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D8F5459E; Tue, 18 Dec 2012 09:03:52 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-wg0-f46.google.com (mail-wg0-f46.google.com [74.125.82.46]) by mx1.freebsd.org (Postfix) with ESMTP id 09EF18FC17; Tue, 18 Dec 2012 09:03:51 +0000 (UTC) Received: by mail-wg0-f46.google.com with SMTP id dr13so157694wgb.13 for ; Tue, 18 Dec 2012 01:03:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:message-id:date:from:user-agent:mime-version:to :cc:subject:references:in-reply-to:content-type :content-transfer-encoding; bh=jUmn3xQGBB1Ucfa2lVNrmWszs0brmWUzyg2wEgyTH1w=; b=N3sA2gvIxnW1tXnkv0FYd6Rf+XOWCI3/kIkXIE0doJcNmouw3l0h+Z4k7Wcq1a9Ser 92Gr+GuHd5Nn02vWTeWpBVHwwnfG0E/ZN58n2uuR9XIjLnVvpquSutpEBDZ5V4YQ/VSE pxv8R47rUs52C5MkJlco+IA6+6ymffz3BJooXGkJjEG2g3mHeq8Q9U1HUvzoKa4P21kE 4DYVnaf19GBOadDebwT1/31SnQ9MkOPSDGKN8KV4m8S/ke2lILnWBbtX5z+WJl4eajOB fUGp2r6O7qTaYJ48/ZJLkNNR41Ebv3q+obKytua9Szp8dPqRs1RVnmFchbXW4v0zPYGh H80Q== X-Received: by 10.194.57.206 with SMTP id k14mr2559373wjq.26.1355821430972; Tue, 18 Dec 2012 01:03:50 -0800 (PST) Received: from mavbook.mavhome.dp.ua (mavhome.mavhome.dp.ua. 
[213.227.240.37]) by mx.google.com with ESMTPS id hg17sm14947309wib.1.2012.12.18.01.03.49 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 18 Dec 2012 01:03:49 -0800 (PST) Sender: Alexander Motin Message-ID: <50D03173.9080904@FreeBSD.org> Date: Tue, 18 Dec 2012 11:03:47 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:13.0) Gecko/20120628 Thunderbird/13.0.1 MIME-Version: 1.0 To: FreeBSD Current , freebsd-arch@freebsd.org Subject: Re: [RFC/RFT] calloutng References: <50CCAB99.4040308@FreeBSD.org> <50CE5B54.3050905@FreeBSD.org> In-Reply-To: <50CE5B54.3050905@FreeBSD.org> Content-Type: text/plain; charset=KOI8-R; format=flowed Content-Transfer-Encoding: 7bit Cc: Davide Italiano X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 18 Dec 2012 09:03:52 -0000

Experiments with dummynet showed ineffective support for very short
tick-based callouts. The new version fixes that, allowing as many
tick-based callout events as the hz value permits, while still being
able to aggregate events and generate a minimum of interrupts.

This version also modifies the system load average calculation to fix
some cases existing in the HEAD and 9 branches, which could be fixed
with the new direct callout functionality.

http://people.freebsd.org/~mav/calloutng_12_17.patch

With several important changes made last time, I am going to delay the
commit to HEAD for another week to do more testing. Comments and new
test cases are welcome. Thanks for staying tuned and commenting.

On 17.12.2012 01:37, Alexander Motin wrote:
> Here is one more version. Unless something new will be found/reported
> this may be the last one, because me and Davide are quite satisfied with
> the results.
If everything will be fine, I think we could commit it to > HEAD closer to the end of the week: > http://people.freebsd.org/~mav/calloutng_12_16.patch > > Changes in this version: > -- Removed couple of redundant variables in callout implementation, > that reduced sizeof(struct callout) by two pointers and simplified some > internal code. > -- syscons driver was made to schedule only 1-2 callouts per second > instead of 20-30 before when console is in graphical mode and there are > few other things to do. Now my laptop has only about 30 interrupts per > second total during idle periods with X running. > -- i8254 eventtimer driver was optimized to work faster in disabled by > default one-shot mode. > -- Few kernel functions were added to make KPIs more complete. > -- Man pages were updated. > -- Some style fixes were made. > > On 15.12.2012 18:55, Alexander Motin wrote: >> I'm sorry to interrupt review, but as usual good ideas came during the >> final testing, causing another round. :) Here is updated patch for >> HEAD, that includes several new changes: >> http://people.freebsd.org/~mav/calloutng_12_15.patch >> >> The new changes are: >> -- Precision and event aggregation code was reworked. Instead of >> previous -prec/+prec representation, precision is now single-sided -- >> -0/+prec. It allowed to significantly improve precision on long time >> intervals for APIs which imply that event should not happen before the >> specified time. Depending on CPU activity, mistake for long time >> intervals now will never be more then 1-500ms, even if specified >> precision allows more. >> -- Some minor optimizations were made to reduce callout overhead and >> latency by 1.5-2us. Now on Core2Duo amd64 system with LAPIC eventtimer >> and TSC timecounter usleep(1) call from user-level executes in just >> 5-6us, instead of 7-8us before. Now it can do 180K cycles per second on >> single CPU with only partial CPU load. 
>> -- Number of kernel subsystems (dcons, syscons, yarrow, led, atkbd, >> setrlimit) were modified to reduce number of interrupts, also with event >> aggregation by explicit specification of the acceptable events >> precision. Now my Core2Duo test system has only 30 interrupts per second >> in idle. If not remaining syscons events, it could easily be 15. My >> IvyBridge ultrabook first time in its history shown 5.5 hours of battery >> time with full screen brightness and 10 hours with lid closed. >> -- Some kernel functions were added to make KPIs more complete. >> >> I've successfully tested this patch on amd64 and arm. -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Tue Dec 18 12:22:04 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id BF7983BB; Tue, 18 Dec 2012 12:22:04 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 744058FC18; Tue, 18 Dec 2012 12:22:03 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 929CF7300A; Tue, 18 Dec 2012 13:20:43 +0100 (CET) Date: Tue, 18 Dec 2012 13:20:43 +0100 From: Luigi Rizzo To: Adrian Chadd Subject: Re: regarding r242905 ('us' argument to some callout functions) was Re: [RFC/RFT] calloutng Message-ID: <20121218122043.GB84347@onelab2.iet.unipi.it> References: <20121217192731.GA83405@onelab2.iet.unipi.it> <20121217211112.GA84347@onelab2.iet.unipi.it> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i Cc: Davide Italiano , freebsd-current , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: 
Tue, 18 Dec 2012 12:22:04 -0000 On Mon, Dec 17, 2012 at 01:22:59PM -0800, Adrian Chadd wrote: > Personally, I'd rather see some consistently used units here.. bintime (or something similar) is the correct choice here. If we are concerned about the size (128 bit) then we can map it to a shorter, fixed point format, such as sign+31+32 as phk was suggesting. cheers luigi From owner-freebsd-arch@FreeBSD.ORG Tue Dec 18 17:38:03 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id DD567DB6; Tue, 18 Dec 2012 17:38:03 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 6192E8FC13; Tue, 18 Dec 2012 17:38:03 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id CA6527300A; Tue, 18 Dec 2012 18:36:43 +0100 (CET) Date: Tue, 18 Dec 2012 18:36:43 +0100 From: Luigi Rizzo To: Alexander Motin Subject: Re: API explosion (Re: [RFC/RFT] calloutng) Message-ID: <20121218173643.GA94266@onelab2.iet.unipi.it> References: <50CF88B9.6040004@FreeBSD.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50CF88B9.6040004@FreeBSD.org> User-Agent: Mutt/1.4.2.3i Cc: Davide Italiano , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 18 Dec 2012 17:38:03 -0000 On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote: > Hi. > > > I would instead do the following: > > I also don't very like the wide API and want to hear fresh ideas, but > approaches to time measurement there are too different to do what you > are proposing. 
Main problem is that while ticks value is relative,
> bintime is absolute. It is not easy to make conversion between them fast
> and precise. I've managed to do it, but the only function that does it
> now is _callout_reset_on(). All other functions are just passing values
> down. I am not sure I want to duplicate that code in each place, though
> doing it at least for for callout may be a good idea.

I am afraid the above is not convincing.

Most/all of the APIs i mentioned still have the conversion from
ticks to bintime, and the code in your patch is just
building multiple parallel paths (one for each of the
three versions of the same function) to some final
piece of code where the conversion takes place.

The problem is that all of this goes through a set of obfuscating
macros and the end result is horrible.

To be clear, i believe the work you have been doing on cleaning up
callout is great, i am just saying that this is the time to look
at the code from a few steps away and clean up all those design
decisions that perhaps were made in haste to make things work.

I will give you another example to show how convoluted
the code is now:

cv_timedwait() and cv_timedwait_sig() now have three
versions each (plain, bt, flags).

These six are remapped through macros to two functions, _cv_timedwait()
and _cv_timedwait_sig(), with a possible bug (cv_timedwait_bt()
maps to _cv_timedwait_sig()).

These two _cv_timedwait*() take both ticks and bintimes,
and contain this sequence:

+	if (bt == NULL)
+		sleepq_set_timeout_flags(cvp, timo, flags);
+	else
+		sleepq_set_timeout_bt(cvp, bt, precision);

Guess what, both sleepq_* are macros that remap to the same
_sleepq_set_timeout(...). So the above "if (bt == NULL)" is useless.

But then if you dig into _sleepq_set_timeout() you'll see

+	if (bt == NULL)
+		callout_reset_flags_on(&td->td_slpcallout, timo,
+		    sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);
+	else
+		callout_reset_bt_on(&td->td_slpcallout, bt, precision,
+		    sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC);

and again both callout_reset*() are remapped through
macros to _callout_reset_on(), so another useless "if (bt == NULL)".
And in the end you have the conversion from ticks to bintime.

So basically the code path for cv_timedwait() has those
two useless switches and one useless extra argument,
and the conversion from ticks to bintime is buried
deep down in _callout_reset_on() where it can only
be resolved at runtime, whereas by doing the conversion
at the beginning the decision could have been made at compile
time.

So I believe my proposal would give large simplifications in
the code and lead to a much cleaner implementation of what
you have designed:

1. acknowledge the fact that the only representation of time
   that callouts use internally is a bintime+precision, define one
   single function (instead of two or three or six) that implements
   the blessed API, and implement the others with macros or
   inline functions doing the appropriate conversions;

2. specifically, the *_flags() variant has no reason to exist.
   It can be implemented through the *_bt() variant, and
   being a new function the only places where you introduce it
   require manual modifications so you can directly invoke
   the new function.
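A minimal sketch of the shape point 1 describes: one full-featured function that takes absolute time, precision and flags, plus a thin inline wrapper that converts ticks before calling it. Every name here (xtime_t, X_HZ, struct xtimer, xtimer_arm(), xtimer_arm_ticks()) is hypothetical and stands in for the real kernel types; the current time is passed in explicitly as an assumption, so this does not address how the kernel would obtain it cheaply.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t xtime_t;                /* hypothetical 32.32 fixed-point time */
#define XTIME_1S  ((xtime_t)1 << 32)
#define X_HZ      1000                  /* assumed tick rate, for illustration */

struct xtimer {
	xtime_t	when;                   /* absolute expiry time */
	xtime_t	prec;                   /* acceptable lateness */
	int	flags;
};

/* The single "blessed" entry point: every argument is explicit. */
static void
xtimer_arm(struct xtimer *t, xtime_t when, xtime_t prec, int flags)
{
	t->when = when;
	t->prec = prec;
	t->flags = flags;
}

/* Conversion done once, at the API boundary. */
static inline xtime_t
ticks_to_xtime(int ticks)
{
	assert(ticks >= 0);
	return ((xtime_t)ticks * XTIME_1S / X_HZ);
}

/* Legacy-style wrapper: relative ticks, precision of one tick, no flags. */
static inline void
xtimer_arm_ticks(struct xtimer *t, xtime_t now, int ticks)
{
	xtimer_arm(t, now + ticks_to_xtime(ticks), ticks_to_xtime(1), 0);
}
```

With this layout the lower layers see only one time representation, so no "if (bt == NULL)" style branch is needed below the wrapper, and a caller that passes a constant tick count lets the compiler fold the conversion.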
Again, please take this as constructive criticism, as i really like the work you have been doing and appreciate the time and effort you are putting on it cheers luigi > > Creating sets of three functions I had three different goals: > - callout_reset() -- it is legacy variant required to keep API > compatibility; > - callout_reset_flags() -- it is for cases where custom precision > specification needs to be added to the existing code, or where direct > callout execution is needed. Conversion to bintime would additionally > complicate consumer code, that I would try to avoid. > - callout_reset_bt() -- API for the new code, which needs high > precision and doesn't mind to operate bintime. Now there is only three > such places in kernel now, and I don't think there will be much more. > > Respectively, these three options are replicated to other APIs where > time intervals are used. > > PS: Please keep me in CC. > > -- > Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Tue Dec 18 17:51:33 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 31BF43EC; Tue, 18 Dec 2012 17:51:33 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id DAAE08FC0C; Tue, 18 Dec 2012 17:51:32 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id AC5847300A; Tue, 18 Dec 2012 18:50:13 +0100 (CET) Date: Tue, 18 Dec 2012 18:50:13 +0100 From: Luigi Rizzo To: Alexander Motin Subject: Re: API explosion (Re: [RFC/RFT] calloutng) Message-ID: <20121218175013.GB94266@onelab2.iet.unipi.it> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20121218173643.GA94266@onelab2.iet.unipi.it> User-Agent: Mutt/1.4.2.3i Cc: 
Davide Italiano , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 18 Dec 2012 17:51:33 -0000 On Tue, Dec 18, 2012 at 06:36:43PM +0100, Luigi Rizzo wrote: > On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote: ... > So I believe my proposal would give large simplifications in > the code and lead to a much cleaner implementation of what > you have designed: > > 1. acknowledge the fact that the only representation of time > that callouts use internally is a bintime+precision, define one > single function (instead of two or three or six) that implements > the blessed API, and implement the others with macros or > inline functions doing the appropriate conversions; > > 2. specifically, the *_flags() variant has no reason to exist. > It can be implemented through the *_bt() variant, and > being a new function the only places where you introduce it > require manual modifications so you can directly invoke > the new function. to clarify: i am not sure if now the *_bt() variant takes flags too, but my point is that the main API function should take all supported arguments (including flags) and others should simply be regarded as simplified versions. 
More or less what we have for sockets, with send() and sendmsg() and friend cheers luigi From owner-freebsd-arch@FreeBSD.ORG Tue Dec 18 18:10:59 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5D1BBCCD; Tue, 18 Dec 2012 18:10:59 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-wg0-f53.google.com (mail-wg0-f53.google.com [74.125.82.53]) by mx1.freebsd.org (Postfix) with ESMTP id 883558FC0A; Tue, 18 Dec 2012 18:10:58 +0000 (UTC) Received: by mail-wg0-f53.google.com with SMTP id ei8so463317wgb.20 for ; Tue, 18 Dec 2012 10:10:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:message-id:date:from:user-agent:mime-version:to :cc:subject:references:in-reply-to:content-type :content-transfer-encoding; bh=9mYs28aKuBSA7F6k48Zxj2pbpDAEkuPkA8AqKpsPtdk=; b=asaQT2MMU7tpEWfR4+lRVRq66SQyrtGL+ubc/O1iKnpboCgodhDhUeAnYkEN52DtvX KlRG5dsPpjgG8oMv8o6C/t56iSoDp6Cwb4M9m3ic6+jT5TveFJsN4t5T/PLXv0buMEkb R3qd7Gk6/Pob4M4dwc550kq7V0xFYUG06KNWRFybCTtR5J3exWfYjvq1IvSw8sfWp7Ym h8ILiWbQywUwQtYCE2USCjGfmXG63LtrVnF96kvHtFuB2R2JzBDyl9YeeKwCsFs7n77x P2+oqn7Sc+UFwr/hC8PJCdAfX1sw6g7+s8bfFRiXoowwQ4dHCte5W8KeTSziKPLVKRDL DaIQ== X-Received: by 10.180.106.34 with SMTP id gr2mr6093537wib.18.1355853858666; Tue, 18 Dec 2012 10:04:18 -0800 (PST) Received: from pc.mavhome.dp.ua (mavhome.mavhome.dp.ua. 
[213.227.240.37]) by mx.google.com with ESMTPS id h19sm16857143wiv.7.2012.12.18.10.04.15 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 18 Dec 2012 10:04:17 -0800 (PST) Sender: Alexander Motin Message-ID: <50D0B00D.8090002@FreeBSD.org> Date: Tue, 18 Dec 2012 20:03:57 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Luigi Rizzo Subject: Re: API explosion (Re: [RFC/RFT] calloutng) References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> In-Reply-To: <20121218173643.GA94266@onelab2.iet.unipi.it> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Davide Italiano , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 18 Dec 2012 18:10:59 -0000 On 18.12.2012 19:36, Luigi Rizzo wrote: > On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote: >>> I would instead do the following: >> >> I also don't very like the wide API and want to hear fresh ideas, but >> approaches to time measurement there are too different to do what you >> are proposing. Main problem is that while ticks value is relative, >> bintime is absolute. It is not easy to make conversion between them fast >> and precise. I've managed to do it, but the only function that does it >> now is _callout_reset_on(). All other functions are just passing values >> down. I am not sure I want to duplicate that code in each place, though >> doing it at least for for callout may be a good idea. > > I am afraid the above is not convincing. 
> > Most/all of the APIs i mentioned still have the conversion from > ticks to bintime, and the code in your patch is just > building multiple parallel paths (one for each of the > three versions of the same function) to some final > piece of code where the conversion takes place. > > The problem is that all of this goes through a set of obfuscating > macros and the end result is horrible. > > To be clear, i believe the work you have been doing on cleaning up > callout is great, i am just saying that this is the time to look > at the code from a few steps away and clean up all those design > decisions that perhaps were made in a haste to make things work. > > I will give you another example to show how convoluted > is the code now: > > cv_timedwait() and cv_timedwait_sig() now have three > versions each (plain, bt, flags). > > These six are remapped through macros to two functions, _cv_timedwait() > and _cv_timedwait_sig(), with a possible bug (cv_timedwait_bt() > maps to _cv_timedwait_sig() ) > > These two _cv_timedwait*() take both ticks and bintimes, > and contain this sequence: > > + if (bt == NULL) > + sleepq_set_timeout_flags(cvp, timo, flags); > + else > + sleepq_set_timeout_bt(cvp, bt, precision); > > Guess what, both sleepq_* are macros that remap to the same > _sleepq_set_timeout(...) . So the above "if (bt == NULL)" is useless. > > But then if you dig into _sleepq_set_timeout() you'll see > > + if (bt == NULL) > + callout_reset_flags_on(&td->td_slpcallout, timo, > + sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); > + else > + callout_reset_bt_on(&td->td_slpcallout, bt, precision, > + sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); > > and again both callout_reset*() are again remapped through > macros to _callout_reset_on(), so another useless "if (bt == NULL)" > And in the end you have the conversion from ticks to bintime. 
> > So basically the code path for cv_timedwait() has those > two useless switches and one useless extra argument, > and the conversion from ticks to bintime is down > deep down in _callout_reset_on() where it can only > be resolved at runtime, whereas by doing the conversion > at the beginning the decision could have been made at compile > time. > > So I believe my proposal would give large simplifications in > the code and lead to a much cleaner implementation of what > you have designed: > > 1. acknowledge the fact that the only representation of time > that callouts use internally is a bintime+precision, define one > single function (instead of two or three or six) that implements > the blessed API, and implement the others with macros or > inline functions doing the appropriate conversions; > > 2. specifically, the *_flags() variant has no reason to exist. > It can be implemented through the *_bt() variant, and > being a new function the only places where you introduce it > require manual modifications so you can directly invoke > the new function. > > Again, please take this as constructive criticism, as i really > like the work you have been doing and appreciate the time and > effort you are putting on it Your words about useless cascaded ifs touched me. Actually I've looked on _callout_reset_bt_on() yesterday, thinking about moving tick to bt conversion to separate function or wrapper. The only thing we would save in such case is single integer argument (ticks), as all others (bt, prec, flags) are used in the new world order. From the other side, to make the conversion process really effective and correct, I've used quite specific way of obtaining time, that may change in the future. I would not really like it to be inlined in every consumer function and become an ABI. 
So I see two possible ways: make that conversion a separate non-inline function (that will require two temporary variables to return results and will consume some time on call/return), or make callout_reset_bt_on() to have extra ticks argument, allowing to use it in one or another way without external ifs and macros. In last case all _bt functions in other APIs will also obtain ticks, bt, pr and flags arguments. Actually flags there could be used to specify time scale (monotonic or wall) and time base (relative or absolute), if we decide to implement all of them at some point. >> Creating sets of three functions I had three different goals: >> - callout_reset() -- it is legacy variant required to keep API >> compatibility; >> - callout_reset_flags() -- it is for cases where custom precision >> specification needs to be added to the existing code, or where direct >> callout execution is needed. Conversion to bintime would additionally >> complicate consumer code, that I would try to avoid. >> - callout_reset_bt() -- API for the new code, which needs high >> precision and doesn't mind to operate bintime. Now there is only three >> such places in kernel now, and I don't think there will be much more. >> >> Respectively, these three options are replicated to other APIs where >> time intervals are used. 
-- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Tue Dec 18 21:46:26 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 22788D0B; Tue, 18 Dec 2012 21:46:26 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-wi0-f177.google.com (mail-wi0-f177.google.com [209.85.212.177]) by mx1.freebsd.org (Postfix) with ESMTP id 4C96E8FC13; Tue, 18 Dec 2012 21:46:24 +0000 (UTC) Received: by mail-wi0-f177.google.com with SMTP id hm2so740685wib.4 for ; Tue, 18 Dec 2012 13:46:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:message-id:date:from:user-agent:mime-version:to :cc:subject:references:in-reply-to:content-type :content-transfer-encoding; bh=A68M3grTghYB4r+16AGzieL1sMBAZo7c+1yR/mVHgog=; b=iaQtlPdUBNka2Ll0yOAcNFmT+A3Xdh8nkw7polEH7nqp6PlY4niYz3KUyanBWIKbfH /GZqHWm9exHES9bTyAnLN7k+37aenGG1GqixMmngrttu+y0Nok6Y3nIRjQ8zGN7BpWcw X8ZctZtsxOmOqBmv9W/lgtjXvODtdmWgLxTSjiRTwIAe+EWN50KNZmviV2+GHzeXk/IA K0A+5Aif/Y2jZ/Tp3DMr7zSDvyQWLhp3xAYUsDzbjSm+jOPTyAasWESa/ltJNIkTplZW I2AzXEe9Fe/Pru+/h6oRaKfF5qzYmz7AIA6L4o7I8Z+gsTdKB+XJk26dxsxFOCpe5CaQ VXUg== X-Received: by 10.194.118.229 with SMTP id kp5mr7520317wjb.2.1355867183807; Tue, 18 Dec 2012 13:46:23 -0800 (PST) Received: from mavbook.mavhome.dp.ua (mavhome.mavhome.dp.ua. 
[213.227.240.37]) by mx.google.com with ESMTPS id bd7sm4666335wib.8.2012.12.18.13.46.21 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 18 Dec 2012 13:46:22 -0800 (PST) Sender: Alexander Motin Message-ID: <50D0E42B.6030605@FreeBSD.org> Date: Tue, 18 Dec 2012 23:46:19 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:13.0) Gecko/20120628 Thunderbird/13.0.1 MIME-Version: 1.0 To: Luigi Rizzo Subject: Re: API explosion (Re: [RFC/RFT] calloutng) References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> In-Reply-To: <50D0B00D.8090002@FreeBSD.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Davide Italiano , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 18 Dec 2012 21:46:26 -0000 On 18.12.2012 20:03, Alexander Motin wrote: > On 18.12.2012 19:36, Luigi Rizzo wrote: >> On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote: >>>> I would instead do the following: >>> >>> I also don't very like the wide API and want to hear fresh ideas, but >>> approaches to time measurement there are too different to do what you >>> are proposing. Main problem is that while ticks value is relative, >>> bintime is absolute. It is not easy to make conversion between them fast >>> and precise. I've managed to do it, but the only function that does it >>> now is _callout_reset_on(). All other functions are just passing values >>> down. I am not sure I want to duplicate that code in each place, though >>> doing it at least for for callout may be a good idea. >> >> I am afraid the above is not convincing. 
>> >> Most/all of the APIs i mentioned still have the conversion from >> ticks to bintime, and the code in your patch is just >> building multiple parallel paths (one for each of the >> three versions of the same function) to some final >> piece of code where the conversion takes place. >> >> The problem is that all of this goes through a set of obfuscating >> macros and the end result is horrible. >> >> To be clear, i believe the work you have been doing on cleaning up >> callout is great, i am just saying that this is the time to look >> at the code from a few steps away and clean up all those design >> decisions that perhaps were made in a haste to make things work. >> >> I will give you another example to show how convoluted >> is the code now: >> >> cv_timedwait() and cv_timedwait_sig() now have three >> versions each (plain, bt, flags). >> >> These six are remapped through macros to two functions, _cv_timedwait() >> and _cv_timedwait_sig(), with a possible bug (cv_timedwait_bt() >> maps to _cv_timedwait_sig() ) >> >> These two _cv_timedwait*() take both ticks and bintimes, >> and contain this sequence: >> >> + if (bt == NULL) >> + sleepq_set_timeout_flags(cvp, timo, flags); >> + else >> + sleepq_set_timeout_bt(cvp, bt, precision); >> >> Guess what, both sleepq_* are macros that remap to the same >> _sleepq_set_timeout(...) . So the above "if (bt == NULL)" is useless. >> >> But then if you dig into _sleepq_set_timeout() you'll see >> >> + if (bt == NULL) >> + callout_reset_flags_on(&td->td_slpcallout, timo, >> + sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); >> + else >> + callout_reset_bt_on(&td->td_slpcallout, bt, precision, >> + sleepq_timeout, td, PCPU_GET(cpuid), flags | C_DIRECT_EXEC); >> >> and again both callout_reset*() are again remapped through >> macros to _callout_reset_on(), so another useless "if (bt == NULL)" >> And in the end you have the conversion from ticks to bintime. 
>> >> So basically the code path for cv_timedwait() has those >> two useless switches and one useless extra argument, >> and the conversion from ticks to bintime is down >> deep down in _callout_reset_on() where it can only >> be resolved at runtime, whereas by doing the conversion >> at the beginning the decision could have been made at compile >> time. >> >> So I believe my proposal would give large simplifications in >> the code and lead to a much cleaner implementation of what >> you have designed: >> >> 1. acknowledge the fact that the only representation of time >> that callouts use internally is a bintime+precision, define one >> single function (instead of two or three or six) that implements >> the blessed API, and implement the others with macros or >> inline functions doing the appropriate conversions; >> >> 2. specifically, the *_flags() variant has no reason to exist. >> It can be implemented through the *_bt() variant, and >> being a new function the only places where you introduce it >> require manual modifications so you can directly invoke >> the new function. >> >> Again, please take this as constructive criticism, as i really >> like the work you have been doing and appreciate the time and >> effort you are putting on it > > Your words about useless cascaded ifs touched me. Actually I've looked > on _callout_reset_bt_on() yesterday, thinking about moving tick to bt > conversion to separate function or wrapper. The only thing we would save > in such case is single integer argument (ticks), as all others (bt, > prec, flags) are used in the new world order. From the other side, to > make the conversion process really effective and correct, I've used > quite specific way of obtaining time, that may change in the future. I > would not really like it to be inlined in every consumer function and > become an ABI. 
So I see two possible ways: make that conversion a > separate non-inline function (that will require two temporary variables > to return results and will consume some time on call/return), or make > callout_reset_bt_on() to have extra ticks argument, allowing to use it > in one or another way without external ifs and macros. In last case all > _bt functions in other APIs will also obtain ticks, bt, pr and flags > arguments. Actually flags there could be used to specify time scale > (monotonic or wall) and time base (relative or absolute), if we decide > to implement all of them at some point. What would you say about this patch: http://people.freebsd.org/~mav/calloutng_api2.patch ? It takes the second of the two ways above: somewhat fewer functions, one extra argument, no branching. >>> Creating sets of three functions I had three different goals: >>> - callout_reset() -- it is legacy variant required to keep API >>> compatibility; >>> - callout_reset_flags() -- it is for cases where custom precision >>> specification needs to be added to the existing code, or where direct >>> callout execution is needed. Conversion to bintime would additionally >>> complicate consumer code, that I would try to avoid. >>> - callout_reset_bt() -- API for the new code, which needs high >>> precision and doesn't mind to operate bintime. Now there is only three >>> such places in kernel now, and I don't think there will be much more. >>> >>> Respectively, these three options are replicated to other APIs where >>> time intervals are used.
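One possible shape for the "single primary function plus thin wrappers" arrangement from point 1 of the quoted proposal, sketched under several assumptions: struct demo_bt is a local stand-in for struct bintime, DEMO_HZ is an assumed tick rate, demo_now replaces the real clock read, and every demo_* name is invented rather than taken from the patch. The legacy tick-based entry point only converts ticks to a relative interval; the clock is read in one non-inline place.

```c
#include <stdint.h>

/* Local stand-in for struct bintime: seconds plus a 64-bit binary fraction. */
struct demo_bt {
	int64_t		sec;
	uint64_t	frac;
};

#define	DEMO_HZ		1000	/* assumed tick rate */
#define	DEMO_C_RELATIVE	0x1u	/* hypothetical flag: time is an interval */

static struct demo_bt demo_now = { 100, 0 };	/* fake clock reading */
static struct demo_bt demo_last_deadline;	/* what was last armed */

/* Add two values, carrying from the fraction into the seconds. */
static struct demo_bt
demo_bt_add(struct demo_bt a, struct demo_bt b)
{
	uint64_t old = a.frac;

	a.frac += b.frac;
	if (a.frac < old)
		a.sec++;
	a.sec += b.sec;
	return (a);
}

/* Convert a non-negative tick count to a relative interval, by value. */
static struct demo_bt
demo_ticks_to_bt(int ticks)
{
	struct demo_bt bt;
	uint64_t tick_frac = (((uint64_t)1 << 63) / DEMO_HZ) << 1;

	bt.sec = ticks / DEMO_HZ;
	bt.frac = (uint64_t)(ticks % DEMO_HZ) * tick_frac;
	return (bt);
}

/* The one primary entry point: the only place that looks at the clock. */
static void
demo_callout_arm(struct demo_bt t, unsigned flags)
{
	if (flags & DEMO_C_RELATIVE)
		t = demo_bt_add(demo_now, t);
	demo_last_deadline = t;
}

/* Legacy-style wrapper: no branching on a NULL bintime, no extra argument. */
static inline void
demo_callout_arm_ticks(int ticks)
{
	demo_callout_arm(demo_ticks_to_bt(ticks), DEMO_C_RELATIVE);
}
```

Because the wrapper is inline and the tick-to-interval conversion touches no clock state, the method of obtaining the current time stays private to the primary function and does not leak into every consumer.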
> -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Tue Dec 18 22:59:50 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9D221637; Tue, 18 Dec 2012 22:59:50 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 0ECCF8FC15; Tue, 18 Dec 2012 22:59:49 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id B16B07300A; Tue, 18 Dec 2012 23:58:23 +0100 (CET) Date: Tue, 18 Dec 2012 23:58:23 +0100 From: Luigi Rizzo To: Alexander Motin Subject: Re: API explosion (Re: [RFC/RFT] calloutng) Message-ID: <20121218225823.GA96962@onelab2.iet.unipi.it> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50D0E42B.6030605@FreeBSD.org> User-Agent: Mutt/1.4.2.3i Cc: Davide Italiano , freebsd-current , phk@onelab2.iet.unipi.it, "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 18 Dec 2012 22:59:50 -0000 [top posting for readability; in summary we were discussing the new callout API trying to avoid an explosion of methods and arguments while at the same time supporting the old API and the new one] (I am also Cc-ing phk as he might have better insight on the topic). I think the patch you propose is a step in the right direction, but i still remain concerned by having to pass two bintimes (by reference, but they should really go by value) and one 'ticks' value to all these functions. 
I am also dubious that we need a full 128 bits to specify the 'precision': there would be absolutely no loss of functionality if we decided to specify the precision in powers of 2, so a precision 'k' (signed) means 2^k seconds. This way 8 bits are enough to represent any precision we want. The key difference between 'ticks' and bintimes (and the main difficulty in the conversion) is that ticks are relative and bintimes are interpreted as absolute. This could be easily solved by using a flag to specify if the 'bt' argument is absolute or relative, and passing the argument by value. So now the flags could contain C_DIRECT_EXEC, C_BT_IS_RELATIVE, the precision, and another 64 or 128 bit field contains the bintime. How does this look ? cheers luigi > On 18.12.2012 20:03, Alexander Motin wrote: > >On 18.12.2012 19:36, Luigi Rizzo wrote: > >>On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote: > >>>>I would instead do the following: > >>> > >>>I also don't very like the wide API and want to hear fresh ideas, but > >>>approaches to time measurement there are too different to do what you > >>>are proposing. Main problem is that while ticks value is relative, > >>>bintime is absolute. It is not easy to make conversion between them fast > >>>and precise. I've managed to do it, but the only function that does it > >>>now is _callout_reset_on(). All other functions are just passing values > >>>down. I am not sure I want to duplicate that code in each place, though > >>>doing it at least for for callout may be a good idea. > >> > >>I am afraid the above is not convincing. > >> > >>Most/all of the APIs i mentioned still have the conversion from > >>ticks to bintime, and the code in your patch is just > >>building multiple parallel paths (one for each of the > >>three versions of the same function) to some final > >>piece of code where the conversion takes place. 
> > > > > -- > Alexander Motin > > From owner-freebsd-arch@FreeBSD.ORG Tue Dec 18 23:28:07 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8BF2A978; Tue, 18 Dec 2012 23:28:07 +0000 (UTC) (envelope-from freebsd@damnhippie.dyndns.org) Received: from duck.symmetricom.us (duck.symmetricom.us [206.168.13.214]) by mx1.freebsd.org (Postfix) with ESMTP id A84468FC15; Tue, 18 Dec 2012 23:28:02 +0000 (UTC) Received: from damnhippie.dyndns.org (daffy.symmetricom.us [206.168.13.218]) by duck.symmetricom.us (8.14.5/8.14.5) with ESMTP id qBINRtBF014465; Tue, 18 Dec 2012 16:27:55 -0700 (MST) (envelope-from freebsd@damnhippie.dyndns.org) Received: from [172.22.42.240] (revolution.hippie.lan [172.22.42.240]) by damnhippie.dyndns.org (8.14.3/8.14.3) with ESMTP id qBINRjZu064428; Tue, 18 Dec 2012 16:27:45 -0700 (MST) (envelope-from freebsd@damnhippie.dyndns.org) Subject: Re: API explosion (Re: [RFC/RFT] calloutng) From: Ian Lepore To: Luigi Rizzo In-Reply-To: <20121218225823.GA96962@onelab2.iet.unipi.it> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> Content-Type: text/plain; charset="us-ascii" Date: Tue, 18 Dec 2012 16:27:45 -0700 Message-ID: <1355873265.1198.183.camel@revolution.hippie.lan> Mime-Version: 1.0 X-Mailer: Evolution 2.32.1 FreeBSD GNOME Team Port Content-Transfer-Encoding: 7bit Cc: Davide Italiano , Alexander Motin , freebsd-current , phk@onelab2.iet.unipi.it, "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 18 Dec 2012 23:28:07 -0000 On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: > [top 
posting for readability; > in summary we were discussing the new callout API trying to avoid > an explosion of methods and arguments while at the same time > supporting the old API and the new one] > (I am also Cc-ing phk as he might have better insight > on the topic). > > I think the patch you propose is a step in the right direction, > but i still remain concerned by having to pass two bintimes > (by reference, but they should really go by value) > and one 'ticks' value to all these functions. > > I am also dubious that we need a full 128 bits to specify > the 'precision': there would be absolutely no loss of functionality > if we decided to specify the precision in powers of 2, so a precision > 'k' (signed) means 2^k seconds. This way 8 bits are enough to > represent any precision we want. > > The key difference between 'ticks' and bintimes (and the main > difficulty in the conversion) is that ticks are relative and bintimes > are interpreted as absolute. This could be easily solved by using > a flag to specify if the 'bt' argument is absolute or relative, and > passing the argument by value. > > So now the flags could contain C_DIRECT_EXEC, C_BT_IS_RELATIVE, the > precision, and another 64 or 128 bit field contains the bintime. > > How does this look ? > > cheers > luigi > I tend to agree that the bintime should be passed by value instead of reference. That would allow an inline tickstobintime() that converts relative ticks to an absolute bintime returned by value and passed right along in one tidy line/clump of code without any temporary variables cluttering things up. While the 1980s C programmer in me still wants to avoid returning complex objects by value, the reality is that modern compilers tend to generate really nice code for such constructs, usually without any copying of the return value at all. I'm not so sure about the 2^k precision. You speak of seconds, but I would be worrying about sub-second precision in my work. 
It would be typical to want a 500us timeout but be willing to be late by up to 250us if that helped scheduling and performance. Also, my idea of precision would virtually always be "I'm willing to be late by this much, but never early by any amount." To reinforce the point of that last paragraph... the way I'm looking at these changes has nothing to do with power saving (I've never owned a battery-operated computer, probably never will) and everything to do with performance and being able to sleep accurately for less than a tick. -- Ian > > On 18.12.2012 20:03, Alexander Motin wrote: > > >On 18.12.2012 19:36, Luigi Rizzo wrote: > > >>On Mon, Dec 17, 2012 at 11:03:53PM +0200, Alexander Motin wrote: > > >>>>I would instead do the following: > > >>> > > >>>I also don't very like the wide API and want to hear fresh ideas, but > > >>>approaches to time measurement there are too different to do what you > > >>>are proposing. Main problem is that while ticks value is relative, > > >>>bintime is absolute. It is not easy to make conversion between them fast > > >>>and precise. I've managed to do it, but the only function that does it > > >>>now is _callout_reset_on(). All other functions are just passing values > > >>>down. I am not sure I want to duplicate that code in each place, though > > >>>doing it at least for for callout may be a good idea. > > >> > > >>I am afraid the above is not convincing. > > >> > > >>Most/all of the APIs i mentioned still have the conversion from > > >>ticks to bintime, and the code in your patch is just > > >>building multiple parallel paths (one for each of the > > >>three versions of the same function) to some final > > >>piece of code where the conversion takes place. > > >> > > >>The problem is that all of this goes through a set of obfuscating > > >>macros and the end result is horrible.
> > > > > > > > > -- > > Alexander Motin > > > > > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" From owner-freebsd-arch@FreeBSD.ORG Tue Dec 18 23:31:15 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 073A1B67; Tue, 18 Dec 2012 23:31:15 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id B1A7F8FC0C; Tue, 18 Dec 2012 23:31:14 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id AD3B37300A; Wed, 19 Dec 2012 00:29:55 +0100 (CET) Date: Wed, 19 Dec 2012 00:29:55 +0100 From: Luigi Rizzo To: Ian Lepore Subject: Re: API explosion (Re: [RFC/RFT] calloutng) Message-ID: <20121218232955.GA97440@onelab2.iet.unipi.it> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1355873265.1198.183.camel@revolution.hippie.lan> User-Agent: Mutt/1.4.2.3i Cc: Davide Italiano , Alexander Motin , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 18 Dec 2012 23:31:15 -0000 On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote: > On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: > > [top posting for readability; > > in summary we were 
discussing the new callout API trying to avoid > > an explosion of methods and arguments while at the same time > > supporting the old API and the new one] > > (I am also Cc-ing phk as he might have better insight > > on the topic). > > > > I think the patch you propose is a step in the right direction, > > but i still remain concerned by having to pass two bintimes > > (by reference, but they should really go by value) > > and one 'ticks' value to all these functions. > > > > I am also dubious that we need a full 128 bits to specify > > the 'precision': there would be absolutely no loss of functionality > > if we decided to specify the precision in powers of 2, so a precision > > 'k' (signed) means 2^k seconds. This way 8 bits are enough to > > represent any precision we want. ... > I'm not so sure about the 2^k precision. You speak of seconds, but I > would be worrying about sub-second precision in my work. It would > typical to want a 500uS timeout but be willing to late by up to 250uS if i said k is signed so negative values represent fractions of a second. 
2^-128 is pretty short :) cheers luigi From owner-freebsd-arch@FreeBSD.ORG Tue Dec 18 23:37:16 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 49399D9F; Tue, 18 Dec 2012 23:37:16 +0000 (UTC) (envelope-from freebsd@damnhippie.dyndns.org) Received: from duck.symmetricom.us (duck.symmetricom.us [206.168.13.214]) by mx1.freebsd.org (Postfix) with ESMTP id E01328FC0A; Tue, 18 Dec 2012 23:37:13 +0000 (UTC) Received: from damnhippie.dyndns.org (daffy.symmetricom.us [206.168.13.218]) by duck.symmetricom.us (8.14.5/8.14.5) with ESMTP id qBINbCI3036231; Tue, 18 Dec 2012 16:37:12 -0700 (MST) (envelope-from freebsd@damnhippie.dyndns.org) Received: from [172.22.42.240] (revolution.hippie.lan [172.22.42.240]) by damnhippie.dyndns.org (8.14.3/8.14.3) with ESMTP id qBINbAla064484; Tue, 18 Dec 2012 16:37:10 -0700 (MST) (envelope-from freebsd@damnhippie.dyndns.org) Subject: Re: API explosion (Re: [RFC/RFT] calloutng) From: Ian Lepore To: Luigi Rizzo In-Reply-To: <20121218232955.GA97440@onelab2.iet.unipi.it> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <20121218232955.GA97440@onelab2.iet.unipi.it> Content-Type: text/plain; charset="us-ascii" Date: Tue, 18 Dec 2012 16:37:10 -0700 Message-ID: <1355873830.1198.189.camel@revolution.hippie.lan> Mime-Version: 1.0 X-Mailer: Evolution 2.32.1 FreeBSD GNOME Team Port Content-Transfer-Encoding: 7bit Cc: Davide Italiano , Alexander Motin , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 18 Dec 2012 23:37:16 
-0000 On Wed, 2012-12-19 at 00:29 +0100, Luigi Rizzo wrote: > On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote: > > On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: > > > [top posting for readability; > > > in summary we were discussing the new callout API trying to avoid > > > an explosion of methods and arguments while at the same time > > > supporting the old API and the new one] > > > (I am also Cc-ing phk as he might have better insight > > > on the topic). > > > > > > I think the patch you propose is a step in the right direction, > > > but i still remain concerned by having to pass two bintimes > > > (by reference, but they should really go by value) > > > and one 'ticks' value to all these functions. > > > > > > I am also dubious that we need a full 128 bits to specify > > > the 'precision': there would be absolutely no loss of functionality > > > if we decided to specify the precision in powers of 2, so a precision > > > 'k' (signed) means 2^k seconds. This way 8 bits are enough to > > > represent any precision we want. > > ... > > I'm not so sure about the 2^k precision. You speak of seconds, but I > > would be worrying about sub-second precision in my work. It would be > > typical to want a 500uS timeout but be willing to be late by up to 250uS if > > i said k is signed so negative values represent fractions of a > second. 2^-128 is pretty short :) > > cheers > luigi Ahh, I missed that. Good enough then! Hmmm, if that idea survives further review, then could precision be encoded in 8 bits of the flags, eliminating another parameter? 
-- Ian From owner-freebsd-arch@FreeBSD.ORG Tue Dec 18 23:44:00 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 1E482243; Tue, 18 Dec 2012 23:44:00 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id C1D398FC0C; Tue, 18 Dec 2012 23:43:59 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 073547300A; Wed, 19 Dec 2012 00:42:41 +0100 (CET) Date: Wed, 19 Dec 2012 00:42:41 +0100 From: Luigi Rizzo To: Ian Lepore Subject: Re: API explosion (Re: [RFC/RFT] calloutng) Message-ID: <20121218234240.GA97678@onelab2.iet.unipi.it> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <20121218232955.GA97440@onelab2.iet.unipi.it> <1355873830.1198.189.camel@revolution.hippie.lan> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1355873830.1198.189.camel@revolution.hippie.lan> User-Agent: Mutt/1.4.2.3i Cc: Davide Italiano , Alexander Motin , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 18 Dec 2012 23:44:00 -0000 On Tue, Dec 18, 2012 at 04:37:10PM -0700, Ian Lepore wrote: > On Wed, 2012-12-19 at 00:29 +0100, Luigi Rizzo wrote: > > On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote: > > > On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: > > > > [top posting for readability; > > > > in summary we were discussing the new callout API trying to 
avoid > > > > an explosion of methods and arguments while at the same time > > > > supporting the old API and the new one] > > > > (I am also Cc-ing phk as he might have better insight > > > > on the topic). > > > > > > > > I think the patch you propose is a step in the right direction, > > > > but i still remain concerned by having to pass two bintimes > > > > (by reference, but they should really go by value) > > > > and one 'ticks' value to all these functions. > > > > > > > > I am also dubious that we need a full 128 bits to specify > > > > the 'precision': there would be absolutely no loss of functionality > > > > if we decided to specify the precision in powers of 2, so a precision > > > > 'k' (signed) means 2^k seconds. This way 8 bits are enough to > > > > represent any precision we want. > > > > ... > > > I'm not so sure about the 2^k precision. You speak of seconds, but I > > > would be worrying about sub-second precision in my work. It would > > > typical to want a 500uS timeout but be willing to late by up to 250uS if > > > > i said k is signed so negative values represent fractions of a > > second. 2^-128 is pretty short :) > > > > cheers > > luigi > > Ahh, I missed that. Good enough then! Hmmm, if that ideas survives > further review, then could precision be encoded in 8 bits of the flags, > eliminating another parm? 
that was also what i wrote later in the message :) now we should figure out some use for the remaining 22 bits of the flags cheers luigi From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 08:34:46 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8E15B48D; Wed, 19 Dec 2012 08:34:46 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by mx1.freebsd.org (Postfix) with ESMTP id B50A88FC0C; Wed, 19 Dec 2012 08:34:45 +0000 (UTC) Received: by mail-wi0-f180.google.com with SMTP id hj13so997337wib.1 for ; Wed, 19 Dec 2012 00:34:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:message-id:date:from:user-agent:mime-version:to :cc:subject:references:in-reply-to:content-type :content-transfer-encoding; bh=DadxxGdYJHOV3q/w6EouO8LdB8CbMOLMQf5BpToPM1E=; b=s50Z4udQ6HeeDIGg5napEbdv8lah3hPMbzf/nlG5xwMD6g+73iSGwMtqqB+HNajOca H3GLWdg53OE2ZxEx6clVchLC4moT/DsKTuMntsQpOUOHyY0+K2vNIGD//beVOcr5coH+ 0PduBSki1Tk1HvBIbFaJ6kmugYhlZ+2QXF94k6KSgk5MEmnyxRZqmtc+KvHR/hGetI49 Q/xeLC9wSaPIxKjZmKDwnS0jlsuSqQaMCzQr35l21E8qyvNo8cI3zq3Ra5QAEJWnzYwc mXAeTLzvI4tyxh9bcQOuV6x3XvdmyIc9ncx1xfe0J7jVL95/kU5eDAHjZqiMizOD/uHr NH/g== X-Received: by 10.194.122.98 with SMTP id lr2mr9798965wjb.55.1355906079195; Wed, 19 Dec 2012 00:34:39 -0800 (PST) Received: from mavbook.mavhome.dp.ua (mavhome.mavhome.dp.ua. 
[213.227.240.37]) by mx.google.com with ESMTPS id l5sm6806692wia.10.2012.12.19.00.34.37 (version=TLSv1/SSLv3 cipher=OTHER); Wed, 19 Dec 2012 00:34:38 -0800 (PST) Sender: Alexander Motin Message-ID: <50D17C1B.8010207@FreeBSD.org> Date: Wed, 19 Dec 2012 10:34:35 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:13.0) Gecko/20120628 Thunderbird/13.0.1 MIME-Version: 1.0 To: Ian Lepore Subject: Re: API explosion (Re: [RFC/RFT] calloutng) References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <20121218232955.GA97440@onelab2.iet.unipi.it> <1355873830.1198.189.camel@revolution.hippie.lan> In-Reply-To: <1355873830.1198.189.camel@revolution.hippie.lan> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Davide Italiano , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 08:34:46 -0000 On 19.12.2012 01:37, Ian Lepore wrote: > On Wed, 2012-12-19 at 00:29 +0100, Luigi Rizzo wrote: >> On Tue, Dec 18, 2012 at 04:27:45PM -0700, Ian Lepore wrote: >>> On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: >>>> [top posting for readability; >>>> in summary we were discussing the new callout API trying to avoid >>>> an explosion of methods and arguments while at the same time >>>> supporting the old API and the new one] >>>> (I am also Cc-ing phk as he might have better insight >>>> on the topic). 
>>>> >>>> I think the patch you propose is a step in the right direction, >>>> but i still remain concerned by having to pass two bintimes >>>> (by reference, but they should really go by value) >>>> and one 'ticks' value to all these functions. >>>> >>>> I am also dubious that we need a full 128 bits to specify >>>> the 'precision': there would be absolutely no loss of functionality >>>> if we decided to specify the precision in powers of 2, so a precision >>>> 'k' (signed) means 2^k seconds. This way 8 bits are enough to >>>> represent any precision we want. >> >> ... >>> I'm not so sure about the 2^k precision. You speak of seconds, but I >>> would be worrying about sub-second precision in my work. It would be >>> typical to want a 500uS timeout but be willing to be late by up to 250uS if >> >> i said k is signed so negative values represent fractions of a >> second. 2^-128 is pretty short :) > > Ahh, I missed that. Good enough then! Hmmm, if that idea survives > further review, then could precision be encoded in 8 bits of the flags, > eliminating another parm? Those who tracked the branch may have seen that this was actually our first approach to handling precision. Unfortunately, it turned out to be inconvenient when you need to convert a relative precision, given as a percentage or a fraction of the interval, to an absolute precision expressed as a power of 2. We were worried that using something like ffsll() for that could be inconvenient or expensive. But since we are now talking about passing a relative bintime as an argument, that may be a more viable option. I'll make another try. Thanks for the input. Pity it didn't happen a couple of months ago. 
-- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 09:54:17 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5A2D914F; Wed, 19 Dec 2012 09:54:17 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id D3CEE8FC13; Wed, 19 Dec 2012 09:54:16 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id AA90289EAF; Wed, 19 Dec 2012 09:54:09 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBJ9s8Kn014605; Wed, 19 Dec 2012 09:54:08 GMT (envelope-from phk@phk.freebsd.dk) To: Ian Lepore Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-reply-to: <1355873265.1198.183.camel@revolution.hippie.lan> From: "Poul-Henning Kamp" References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> Date: Wed, 19 Dec 2012 09:54:08 +0000 Message-ID: <14604.1355910848@critter.freebsd.dk> Cc: Davide Italiano , Alexander Motin , phk@onelab2.iet.unipi.it, freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 09:54:17 -0000 -------- In message <1355873265.1198.183.camel@revolution.hippie.lan>, Ian Lepore writes : >On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: >I'm not so sure about the 2^k precision. You speak of seconds, but I >would be worrying about sub-second precision in my work. 
It is a bad idea, and it is physically pointless, given the stabilities of the timebases available for computers in general. Please just take my word as a time-nut, and use a 32.32 binary format in seconds (see previous email) and be done with it. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 10:03:34 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 2563A531; Wed, 19 Dec 2012 10:03:34 +0000 (UTC) (envelope-from davide.italiano@gmail.com) Received: from mail-vc0-f177.google.com (mail-vc0-f177.google.com [209.85.220.177]) by mx1.freebsd.org (Postfix) with ESMTP id 898658FC16; Wed, 19 Dec 2012 10:03:33 +0000 (UTC) Received: by mail-vc0-f177.google.com with SMTP id m8so2063896vcd.36 for ; Wed, 19 Dec 2012 02:03:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=e2OnjHZR5gvjn99Aks5MWvPKVpUHayeKe6gYI/7v3no=; b=kCneGXbC1/yYEy9kNEE9Up3LjxNNqV7T57XQmTPOvPx77cvdN58rpME15iNYAKwHEH OU0bQtNBwxHRctgcV4r55u9cYrbxRF2Ds+1jaOm0F8auwbCL79JRx22YoIVmNkwZAK5X cWKFquIhEn69zIZdWRovbPSNCB8TyIJPzNy1CWysDjfqmyvLYMpsQ85hR+P/Igv2s1Nc 3PF10qqa4qXV/zwD+c518rjKtR31pzQWTc+j2l5TAoB4vDfYMJCMFB1X9leEMs/QSrUd 1BdHNXUuz7J6mA9Fkj7GaKubPa6tiXbnGYXAXiGtC+ww2uI+ovPRI+pCohjL67jWCuCL HXKQ== MIME-Version: 1.0 Received: by 10.220.154.148 with SMTP id o20mr7834553vcw.54.1355911412603; Wed, 19 Dec 2012 02:03:32 -0800 (PST) Sender: davide.italiano@gmail.com Received: by 10.58.229.136 with HTTP; Wed, 19 Dec 2012 02:03:32 -0800 (PST) In-Reply-To: <14604.1355910848@critter.freebsd.dk> References: <50CF88B9.6040004@FreeBSD.org> 
<20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> Date: Wed, 19 Dec 2012 02:03:32 -0800 X-Google-Sender-Auth: 5j9_ioOEHcmA53K-Ww31vnAW9gM Message-ID: Subject: Re: API explosion (Re: [RFC/RFT] calloutng) From: Davide Italiano To: Poul-Henning Kamp Content-Type: text/plain; charset=ISO-8859-1 Cc: Ian Lepore , Alexander Motin , phk@onelab2.iet.unipi.it, freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 10:03:34 -0000 On Wed, Dec 19, 2012 at 1:54 AM, Poul-Henning Kamp wrote: > -------- > In message <1355873265.1198.183.camel@revolution.hippie.lan>, Ian Lepore writes > : >>On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: > >>I'm not so sure about the 2^k precision. You speak of seconds, but I >>would be worrying about sub-second precision in my work. > > It is a bad idea, and it is physically pointless, given the stabilities > of the timebases available for computers in general. > > Please just take my word as a time-nut, and use a 32.32 binary format > in seconds (see previous email) and be done with it. > > -- > Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > phk@FreeBSD.ORG | TCP/IP since RFC 956 > FreeBSD committer | BSD since 4.3-tahoe > Never attribute to malice what can adequately be explained by incompetence. Right now -- the precision is specified in 'bintime', which is a binary number. It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in the specific platform. I do not really think it worth to create another structure for handling time (e.g. 
struct bintime32), as it will lead to code duplication for all the basic conversion/math operation. On the other hand, 32.32 may not be enough in the long future. What do you think about that? Thanks, Davide From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 10:18:13 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9E6F89AB; Wed, 19 Dec 2012 10:18:13 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-wg0-x229.google.com (wg-in-x0229.1e100.net [IPv6:2a00:1450:400c:c00::229]) by mx1.freebsd.org (Postfix) with ESMTP id C255D8FC12; Wed, 19 Dec 2012 10:18:12 +0000 (UTC) Received: by mail-wg0-f41.google.com with SMTP id ds1so210184wgb.0 for ; Wed, 19 Dec 2012 02:18:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:message-id:date:from:user-agent:mime-version:to :cc:subject:references:in-reply-to:content-type :content-transfer-encoding; bh=CIHZ1ahh1M/SzSEHIHjnikoUbcY3hIh/SFfqZZf3g20=; b=w9iSy3YziITUVq18IAKOzUDq039HYSUfHl+uzQ9O+9ZYCs3tkLvfjegPgHHHIgX9xI yHsS9eObV6wldPOhkWi7SkUV/qrs4p7hNIG6c29TohkiTlGfn+JUK7dhxtmDEDjEjyRH j27i37M3sUT5heC80Bjib8/ihENHokimMau7W3BjmzBjb5fzTuEEjeMbcShkFj7x06pJ GbgXxfBCVR7zoKxU7JRgX8zTDiYpvE95rXH0CRKm8iT536N82HZZ07qVnqlG37ywi720 /HDlivYcSdiQ6gQ2jmni6LQzJmrBjRqTgjlvKJ/4hhEZNmgeFBuddlQ1AFPpaxNWX0Ah rJ4A== X-Received: by 10.180.24.4 with SMTP id q4mr10538733wif.19.1355911916210; Wed, 19 Dec 2012 02:11:56 -0800 (PST) Received: from mavbook.mavhome.dp.ua (mavhome.mavhome.dp.ua. 
[213.227.240.37]) by mx.google.com with ESMTPS id ew4sm7163770wid.11.2012.12.19.02.11.54 (version=TLSv1/SSLv3 cipher=OTHER); Wed, 19 Dec 2012 02:11:55 -0800 (PST) Sender: Alexander Motin Message-ID: <50D192E8.3020704@FreeBSD.org> Date: Wed, 19 Dec 2012 12:11:52 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:13.0) Gecko/20120628 Thunderbird/13.0.1 MIME-Version: 1.0 To: Davide Italiano Subject: Re: API explosion (Re: [RFC/RFT] calloutng) References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Ian Lepore , phk@onelab2.iet.unipi.it, Poul-Henning Kamp , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 10:18:13 -0000 On 19.12.2012 12:03, Davide Italiano wrote: > On Wed, Dec 19, 2012 at 1:54 AM, Poul-Henning Kamp wrote: >> -------- >> In message <1355873265.1198.183.camel@revolution.hippie.lan>, Ian Lepore writes >> : >>> On Tue, 2012-12-18 at 23:58 +0100, Luigi Rizzo wrote: >> >>> I'm not so sure about the 2^k precision. You speak of seconds, but I >>> would be worrying about sub-second precision in my work. >> >> It is a bad idea, and it is physically pointless, given the stabilities >> of the timebases available for computers in general. >> >> Please just take my word as a time-nut, and use a 32.32 binary format >> in seconds (see previous email) and be done with it. > > Right now -- the precision is specified in 'bintime', which is a binary number. 
> It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in > the specific platform. > I do not really think it worth to create another structure for > handling time (e.g. struct bintime32), as it will lead to code > duplication for all the basic conversion/math operation. On the other > hand, 32.32 may not be enough in the long future. > What do you think about that? Linux uses 32.32 format in their eventtimers code. Given that we currently have no timer hardware with a frequency around 4GHz, that precision is probably sufficient. But if at some point we want to be able to handle absolute wall time, then a 32-bit integer part may not be enough. Then we return to the question: "how many different data types do we want to see in one subsystem"? Sure, using a single 64-bit value would be much easier than struct bintime from many perspectives, but what about the edge cases? -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 10:51:50 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C300E28B; Wed, 19 Dec 2012 10:51:50 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 5FBE48FC0C; Wed, 19 Dec 2012 10:51:50 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 15DB389EAF; Wed, 19 Dec 2012 10:51:49 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBJApmOF015883; Wed, 19 Dec 2012 10:51:48 GMT (envelope-from phk@phk.freebsd.dk) To: Davide Italiano Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-reply-to: From: "Poul-Henning Kamp" References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> 
<20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> Date: Wed, 19 Dec 2012 10:51:48 +0000 Message-ID: <15882.1355914308@critter.freebsd.dk> Cc: Ian Lepore , Alexander Motin , phk@onelab2.iet.unipi.it, freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 10:51:50 -0000 -------- In message , Davide Italiano writes: >Right now -- the precision is specified in 'bintime', which is a binary number. >It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in >the specific platform. And that is way overkill for specifying a callout, at best your clock has short term stabilities approaching 1e-8, but likely as bad as 1e-6. (The reason why bintime is important for timekeeping is that we accumulate timeintervals approx 1e3 times a second, so the rounding error has to be much smaller than the short term stability in order to not dominate) >I do not really think it worth to create another structure for >handling time (e.g. struct bintime32), as it will lead to code No, that was exactly my point: It should be an integer so that comparisons and arithmetic is trivial. A 32.32 format fits nicely into a int64_t which is readily available in the language. 
As I said in my previous email:

    typedef int64_t dur_t;    /* signed for bug catching */
    #define DURSEC   ((dur_t)1 << 32)
    #define DURMIN   (DURSEC * 60)
    #define DURMSEC  (DURSEC / 1000)
    #define DURUSEC  (DURSEC / 1000000)
    #define DURNSEC  (DURSEC / 1000000000)

(Bikeshed the names at your convenience)

Then you can say

    callout_foo(34 * DURSEC)
    callout_foo(2400 * DURMSEC)
or
    callout_foo(500 * DURNSEC)

With this format you can specify callouts 68 years into the future with quarter nanosecond resolution, and you can trivially and efficiently compare dur_t's with

    if (d1 < d2)

-- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 11:00:09 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0C515955; Wed, 19 Dec 2012 11:00:09 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id AB49B8FC1E; Wed, 19 Dec 2012 11:00:08 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 812708A512; Wed, 19 Dec 2012 11:00:07 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBJB06xL015948; Wed, 19 Dec 2012 11:00:06 GMT (envelope-from phk@phk.freebsd.dk) To: Alexander Motin Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-reply-to: <50D192E8.3020704@FreeBSD.org> From: "Poul-Henning Kamp" References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> 
<14604.1355910848@critter.freebsd.dk> <50D192E8.3020704@FreeBSD.org> Date: Wed, 19 Dec 2012 11:00:06 +0000 Message-ID: <15947.1355914806@critter.freebsd.dk> Cc: Davide Italiano , Ian Lepore , phk@onelab2.iet.unipi.it, freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 11:00:09 -0000 -------- In message <50D192E8.3020704@FreeBSD.org>, Alexander Motin writes: >Linux uses 32.32 format in their eventtimers code. (And that is no accident, I know who they got the number from :-) >But if at some point we want to be able to >handle absolute wall time, [...] Then you have other problems, including but not limited to clock being stepped, leap-seconds, suspend/resume and frequency stability. If you want to support callouts of the type ("At 14:00 UTC tomorrow") (disregarding the time-zone issue), you need to catch all significant changes to our UTC estimate and recalibrate your callout based on that. It is not obvious that we have applications for such an API that warrant the complexity. Either way, such a facility should be layered on top of the callout facility, which should always run in "elapsed time"[1] with no attention paid to what NTPD might do to the UTC estimate. So summary: 32.32 is the right format. Poul-Henning [1] Notice that "elapsed time" needs a firm definition with respect to suspend/resume, and that this decision has big implications for the API use and code duplication. 
I think it is prudent to specify a flag to callouts, to tell what should happen on suspend/resume, something like:

    SR_CANCEL    /* Cancel the callout on S/R */
    /* no flag */ /* Toll this callout only when system is running */
    SR_IGNORE    /* Toll suspended time from callout */

If you get this right, callouts from device drivers will just "DTRT"; if you get it wrong, all device drivers will need boilerplate code to handle S/R.

-- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 12:19:05 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B07BB562; Wed, 19 Dec 2012 12:19:05 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx07.syd.optusnet.com.au (fallbackmx07.syd.optusnet.com.au [211.29.132.9]) by mx1.freebsd.org (Postfix) with ESMTP id CE74B8FC0C; Wed, 19 Dec 2012 12:19:04 +0000 (UTC) Received: from mail36.syd.optusnet.com.au (mail36.syd.optusnet.com.au [211.29.133.76]) by fallbackmx07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qBJCIuSs016828; Wed, 19 Dec 2012 23:18:56 +1100 Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail36.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qBJCIcZS017294 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 19 Dec 2012 23:18:40 +1100 Date: Wed, 19 Dec 2012 23:18:38 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Poul-Henning Kamp Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-Reply-To: <15882.1355914308@critter.freebsd.dk> Message-ID: <20121219221518.E1082@besplex.bde.org> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> 
<50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=SfSv7Ytu c=1 sm=1 a=5xuQJAhXp8AA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=o_YUZdvV9usA:10 a=pGLkceISAAAA:8 a=9Mupeb60rhJ5CC9yY78A:9 a=CjuIK1q_8ugA:10 a=bxQHXO5Py4tHmhUgaywp5w==:117 Cc: Davide Italiano , Ian Lepore , Alexander Motin , phk@onelab2.iet.unipi.it, freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 12:19:05 -0000 On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: > -------- > In message > , Davide Italiano writes: > >> Right now -- the precision is specified in 'bintime', which is a binary number. >> It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in >> the specific platform. > > And that is way overkill for specifying a callout, at best your clock > has short term stabilities approaching 1e-8, but likely as bad as 1e-6. So you always agreed with me that bintimes are unsuitable for almost everything, and especially unsuitable for timeouts? :-) > (The reason why bintime is important for timekeeping is that we > accumulate timeintervals approx 1e3 times a second, so the rounding > error has to be much smaller than the short term stability in order > to not dominate) bintimes are not unsuitable for timekeeping, but they are painful to use for other APIs. You have to either put bintimes in layers in the other APIs, or convert them to a more suitable format, and there is a problem placing the conversion at points where it is efficient. 
This thread seems to be mostly about putting the conversion in wrong places. My original objection was about using bintimes for almost everything at the implementation level.

>> I do not really think it worth to create another structure for
>> handling time (e.g. struct bintime32), as it will lead to code
>
> No, that was exactly my point: It should be an integer so that
> comparisons and arithmetic is trivial. A 32.32 format fits
> nicely into a int64_t which is readily available in the language.

I would have tried a 32 bit format with a variable named 'ticks'. Something like:

- ticks >= 0. Same meaning as now. No changes in ABIs or APIs to use this. The tick period would be constant but for virtual ticks and not too small. hz = 1000 now makes the period too small, and not a power of 2. So make the period 1/128 second. This gives a 1.24.7 binary format. 2**24 seconds is 194 days.
- ticks < 0. The 31 value bits are now a cookie (descriptor) referring to a bintime or whatever. This case should rarely be used.

I don't like it that a tickless kernel, which is needed mainly for power saving, has expanded into complications to support short timeouts which should rarely be used.

> As I said in my previous email:
>
>     typedef int64_t dur_t;    /* signed for bug catching */
>     #define DURSEC   ((dur_t)1 << 32)
>     #define DURMIN   (DURSEC * 60)
>     #define DURMSEC  (DURSEC / 1000)
>     #define DURUSEC  (DURSEC / 1000000)
>     #define DURNSEC  (DURSEC / 1000000000)
>
> (Bikeshed the names at your convenience)
>
> Then you can say
>
>     callout_foo(34 * DURSEC)
>     callout_foo(2400 * DURMSEC)
> or
>     callout_foo(500 * DURNSEC)

Constructing the cookie for my special case would not be so easy.

> With this format you can specify callouts 68 years into the future
> with quarter nanosecond resolution, and you can trivially and
> efficiently compare dur_t's with
>     if (d1 < d2)

This would make a better general format than timevals, timespecs and of course bintimes :-). 
It is a bit wasteful for timeouts since its extremes are rarely used. Malicious and broken callers can still cause overflow at 68 years, so you have to check for it and handle it. The limit of 194 days is just as good for timeouts. Bruce From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 13:04:45 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B9018571; Wed, 19 Dec 2012 13:04:45 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 3F7AB8FC12; Wed, 19 Dec 2012 13:04:45 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id C90E189EAF; Wed, 19 Dec 2012 13:04:43 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBJD4gQR016440; Wed, 19 Dec 2012 13:04:42 GMT (envelope-from phk@phk.freebsd.dk) To: Bruce Evans Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-reply-to: <20121219221518.E1082@besplex.bde.org> From: "Poul-Henning Kamp" References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> <20121219221518.E1082@besplex.bde.org> Date: Wed, 19 Dec 2012 13:04:42 +0000 Message-ID: <16439.1355922282@critter.freebsd.dk> Cc: Davide Italiano , Ian Lepore , Alexander Motin , phk@onelab2.iet.unipi.it, freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
X-List-Received-Date: Wed, 19 Dec 2012 13:04:45 -0000 -------- In message <20121219221518.E1082@besplex.bde.org>, Bruce Evans writes: >> With this format you can specify callouts 68 years into the future >> with quarter nanosecond resolution, and you can trivially and >> efficiently compare dur_t's with >> if (d1 < d2) > >This would make a better general format than timevals, timespecs and >of course bintimes :-). Except that for absolute timescales, we're running out of the 32 bits integer part. Bintimes is a necessary superset of the 32.32 which tries to work around the necessary but missing int96_t or int128_t[1]. Poul-Henning [1] A good addition to C would be a general multi-word integer type where you could ask for any int%d_t or uint%d_t you cared for, and have the compiler DTRT. In difference from using a multiword-library, this would still give these types their natural integer behaviour. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
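For reference, the 32.32 fixed-point duration discussed in this thread might look roughly like the sketch below. It is only an illustration and not code from the calloutng patch. The macro names follow the quoted proposal, the divisors are taken to be 10^3, 10^6 and 10^9, and dur_seconds(), dur_frac_nsec() and dur_before() are hypothetical helpers introduced here.

```c
#include <stdint.h>

/* Sketch only: a 32.32 fixed-point duration held in a signed 64-bit integer. */
typedef int64_t dur_t;			/* signed, so bugs show up as negative values */

#define DURSEC	((dur_t)1 << 32)	/* one second */
#define DURMSEC	(DURSEC / 1000)		/* about 1 ms, truncated to 4294967 units */
#define DURUSEC	(DURSEC / 1000000)	/* about 1 us, truncated to 4294 units */
#define DURNSEC	(DURSEC / 1000000000)	/* truncated to 4 units, i.e. about 0.93 ns */

/* Hypothetical helper: whole seconds of a non-negative duration. */
static inline int64_t
dur_seconds(dur_t d)
{
	return (d >> 32);
}

/* Hypothetical helper: fractional part in nanoseconds, rounded down. */
static inline uint32_t
dur_frac_nsec(dur_t d)
{
	uint64_t frac = (uint64_t)d & 0xffffffffu;

	/* frac < 2^32 and 10^9 < 2^30, so the product fits in 64 bits. */
	return ((uint32_t)((frac * 1000000000u) >> 32));
}

/* Hypothetical helper: comparison is a plain integer comparison. */
static inline int
dur_before(dur_t a, dur_t b)
{
	return (a < b);
}
```

Note that the integer division in DURNSEC truncates 4.29 units down to 4, so a value such as 500 * DURNSEC comes out near 466 ns rather than 500 ns. Callers needing exact sub-microsecond values would have to scale differently.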
From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 13:05:08 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 50B1F714; Wed, 19 Dec 2012 13:05:08 +0000 (UTC) (envelope-from davide.italiano@gmail.com) Received: from mail-vb0-f45.google.com (mail-vb0-f45.google.com [209.85.212.45]) by mx1.freebsd.org (Postfix) with ESMTP id AED108FC0C; Wed, 19 Dec 2012 13:05:07 +0000 (UTC) Received: by mail-vb0-f45.google.com with SMTP id p1so2209257vbi.4 for ; Wed, 19 Dec 2012 05:05:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=MX5cw1qHYhNlodpk7btfMXKeyR7NnCm68rmxmtC5ebg=; b=Z2/gXbAKOyEmkjIzySBTvavzH4mRlpyHpZqD+PfPzf2mTNEcFgHLhmNixEwLYClvDQ 0TMQHVRslhhdWkuJkVhdsA6aRRAVfAs3WD15fN+ZuoAAnplMo3jJ9Z7DjKvSsxcuoU/K Keo+0jMcH+lVipoNmhVVuZSiNiYxOjgxVT4wnHgo2WkNbqqMGh4dM844ad1x2mJNgrDA GwoX9OqYerjEXQghNBuq+8Pd1VLv0olpNn83sSQK5Kc4AmbM2TQAdpAR3KajBKQnKZzJ 8oXCkZjr5TcHQGJAerXKcsFIi18EOnwFMJ53JNKcL8v8yKcBm+mS41WHuPkbPbp7GpCZ SMdg== MIME-Version: 1.0 Received: by 10.52.66.70 with SMTP id d6mr7575902vdt.30.1355922306590; Wed, 19 Dec 2012 05:05:06 -0800 (PST) Sender: davide.italiano@gmail.com Received: by 10.58.229.136 with HTTP; Wed, 19 Dec 2012 05:05:06 -0800 (PST) In-Reply-To: <20121219221518.E1082@besplex.bde.org> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> <20121219221518.E1082@besplex.bde.org> Date: Wed, 19 Dec 2012 05:05:06 -0800 X-Google-Sender-Auth: 0-_PGXea5bUURKIlbagvU5Ucxbs Message-ID: Subject: Re: API explosion (Re: [RFC/RFT] calloutng) 
From: Davide Italiano To: Bruce Evans Content-Type: text/plain; charset=ISO-8859-1 Cc: Ian Lepore , Alexander Motin , phk@onelab2.iet.unipi.it, Poul-Henning Kamp , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 13:05:08 -0000 On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans wrote: > On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: > >> -------- >> In message >> >> , Davide Italiano writes: >> >>> Right now -- the precision is specified in 'bintime', which is a binary >>> number. >>> It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in >>> the specific platform. >> >> >> And that is way overkill for specifying a callout, at best your clock >> has short term stabilities approaching 1e-8, but likely as bad as 1e-6. > > > So you always agreed with me that bintimes are unsuitable for almost > everything, and especially unsuitable for timeouts? :-) > > >> (The reason why bintime is important for timekeeping is that we >> accumulate timeintervals approx 1e3 times a second, so the rounding >> error has to be much smaller than the short term stability in order >> to not dominate) > > > bintimes are not unsuitable for timekeeping, but they a painful to use > for other APIs. You have to either put bintimes in layers in the other > APIs, or convert them to a more suitable format, and there is a problem > placing the conversion at points where it is efficient. This thread > seems to be mostly about putting the conversion in wrong places. My > original objection was about using bintimes for almost everything at > the implementation level. > > >>> I do not really think it worth to create another structure for >>> handling time (e.g. 
struct bintime32), as it will lead to code
>>
>>
>> No, that was exactly my point: It should be an integer so that
>> comparisons and arithmetic is trivial. A 32.32 format fits
>> nicely into a int64_t which is readily available in the language.
>
>
> I would have tried a 32 bit format with a variable named 'ticks'.
> Something like:
> - ticks >= 0. Same meaning as now. No changes in ABIs or APIs to use
>   this. The tick period would be constant but for virtual ticks and
>   not too small. hz = 1000 now makes the period too small, and not a
>   power of 2. So make the period 1/128 second. This gives a 1.24.7
>   binary format. 2**24 seconds is 194 days.
> - ticks < 0. The 31 value bits are now a cookie (descriptor) referring
>   to a bintime or whatever. This case should rarely be used. I don't
>   like it that a tickless kernel, which is needed mainly for power
>   saving, has expanded into complications to support short timeouts
>   which should rarely be used.
>

Bruce, I don't really agree with this.
The data addressed by the cookie would still have to be stored somewhere,
and the KBI would end up broken. Admittedly, that is not a real problem,
since the current calloutng code already breaks the KBI heavily, but if
preserving the KBI was your point, I don't see how your proposed change
would help.

>
>> As I said in my previous email:
>>
>>     typedef int64_t dur_t;  /* signed for bug catching */
>>     #define DURSEC  ((dur_t)1 << 32)
>>     #define DURMIN  (DURSEC * 60)
>>     #define DURMSEC (DURSEC / 1000)
>>     #define DURUSEC (DURSEC / 1000000)
>>     #define DURNSEC (DURSEC / 1000000000)
>>
>> (Bikeshed the names at your convenience)
>>
>> Then you can say
>>
>>     callout_foo(34 * DURSEC)
>>     callout_foo(2400 * DURMSEC)
>> or
>>     callout_foo(500 * DURNSEC)
>
>
> Constructing the cookie for my special case would not be so easy.
> > >> With this format you can specify callouts 68 years into the future >> with quarter nanosecond resolution, and you can trivially and >> efficiently compare dur_t's with >> if (d1 < d2) > > > This would make a better general format than timevals, timespecs and > of course bintimes :-). It is a bit wasteful for timeouts since > its extremes are rarely used. Malicious and broken callers can > still cause overflow at 68 years, so you have to check for it and > handle it. The limit of 194 days is just as good for timeouts. > > Bruce I think the phk's proposal is better. About your overflow objection, I think is really unlikely to happen, but better safe than sorry. Thanks Davide From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 13:55:03 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A181592D for ; Wed, 19 Dec 2012 13:55:03 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 0FEFD8FC17 for ; Wed, 19 Dec 2012 13:55:02 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.5/8.14.5) with ESMTP id qBJDsp4N027304 for ; Wed, 19 Dec 2012 15:54:51 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.3 kib.kiev.ua qBJDsp4N027304 Received: (from kostik@localhost) by tom.home (8.14.5/8.14.5/Submit) id qBJDspVQ027303 for arch@freebsd.org; Wed, 19 Dec 2012 15:54:51 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 19 Dec 2012 15:54:51 +0200 From: Konstantin Belousov To: arch@freebsd.org Subject: Unmapped I/O Message-ID: <20121219135451.GU71906@kib.kiev.ua> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="qim0fXNpvdl5D74Y" Content-Disposition: inline 
User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 13:55:03 -0000 --qim0fXNpvdl5D74Y Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

One of the known performance bottlenecks in the FreeBSD I/O path is the
necessity to map the pages of each I/O buffer into KVA. The problem is
that on multi-core machines, the mapping must flush the TLB on all cores,
due to the global mapping of the buffer pages into the kernel. This
means that buffer creation and destruction disrupt execution of all
other cores to perform TLB shootdown through IPI, and the thread
initiating the shootdown must wait for all other cores to execute and
report.

The patch at http://people.freebsd.org/~kib/misc/unmapped.4.patch
implements 'unmapped buffers'. This is the ability to create a VMIO
struct buf that does not point to a KVA mapping of the buffer pages
into kernel addresses. Since there is no mapping, the kernel does not
need to flush the TLB. The unmapped buffers are marked with the new
B_NOTMAPPED flag, and should be requested explicitly by passing the
GB_NOTMAPPED flag to the buffer allocation routines. If a mapped buffer
is requested but an unmapped buffer already exists, the buffer subsystem
automatically maps the pages.

The clustering code is also made aware of the not-mapped buffers, but
this required a KPI change, which accounts for the diff in the non-UFS
filesystems.

UFS is adapted to request not-mapped buffers when the kernel does not
need to access the content, i.e. mostly for the file data.
The new helper function vn_io_fault_pgmove() operates on the unmapped
array of pages. It calls the new pmap method pmap_copy_pages() to move
the data to and from usermode.

Besides not-mapped buffers, not-mapped BIOs are introduced, marked with
the flag BIO_NOTMAPPED. Unmapped buffers are directly translated to
unmapped BIOs. Geom providers may indicate that they accept unmapped
BIOs. If a provider does not handle unmapped i/o requests, geom now
automatically establishes a transient mapping for the i/o pages.

Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The
gpart providers indicate support for unmapped BIOs if the underlying
provider can do unmapped i/o. I also hacked ahci(4) to handle unmapped
i/o, but this should be changed after Jeff's physbio patch is
committed, to use the proper busdma interface.

Besides, the swap pager does unmapped swapping if the swap partition
indicated that it can do unmapped i/o. At Jeff's request, the buffer
allocation code may reserve the KVA for an unmapped buffer in advance.
The unmapped page-in for the vnode pager is also implemented if the
filesystem supports it, but the page-out is not. The page-out, as well
as the vnode-backed md(4), currently require mappings, mostly due to
the use of VOP_WRITE().

As it stands, the patch worked in my test environment, where I used
ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no
statistically significant difference in the buildworld -j 10 times on
the 4-core machine with HT. On the other hand, when doing sha1 over a
5GB file, the system time was reduced by 30%.

Unfinished items:
- Integration with physbio; this will be done after physbio is
  committed to HEAD.
- The key per-architecture function needed for unmapped i/o is
  pmap_copy_pages(). I have implemented it only for amd64 and i386 so
  far; it shall be done for all other architectures.
- The sizing of the submap used for transient mapping of the BIOs is
  naive. It should be adjusted, esp. for KVA-lean architectures.
- Conversion of the other filesystems. Low priority. I am interested in reviews, tests and suggestions. Note that this only works now for md(4) and ahci(4), for other drivers the patched kernel should fall back to the mapped i/o. sys/amd64/amd64/pmap.c | 24 +++ sys/cam/ata/ata_da.c | 5 +- sys/cam/cam_ccb.h | 30 ++++ sys/dev/ahci/ahci.c | 53 +++++- sys/dev/md/md.c | 255 ++++++++++++++++++++++++----- sys/fs/cd9660/cd9660_vnops.c | 2 +- sys/fs/ext2fs/ext2_balloc.c | 2 +- sys/fs/ext2fs/ext2_vnops.c | 9 +- sys/fs/msdosfs/msdosfs_vnops.c | 4 +- sys/fs/udf/udf_vnops.c | 5 +- sys/geom/geom.h | 1 + sys/geom/geom_disk.c | 2 + sys/geom/geom_disk.h | 1 + sys/geom/geom_io.c | 44 ++++- sys/geom/geom_vfs.c | 10 +- sys/geom/part/g_part.c | 1 + sys/i386/i386/pmap.c | 42 +++++ sys/kern/vfs_bio.c | 356 +++++++++++++++++++++++++++++++++----= ---- sys/kern/vfs_cluster.c | 118 +++++++------- sys/kern/vfs_vnops.c | 39 +++++ sys/sys/bio.h | 7 + sys/sys/buf.h | 22 ++- sys/sys/mount.h | 1 + sys/sys/vnode.h | 2 + sys/ufs/ffs/ffs_alloc.c | 10 +- sys/ufs/ffs/ffs_balloc.c | 58 ++++--- sys/ufs/ffs/ffs_vfsops.c | 3 +- sys/ufs/ffs/ffs_vnops.c | 35 ++-- sys/ufs/ufs/ufs_extern.h | 1 + sys/vm/pmap.h | 2 + sys/vm/swap_pager.c | 43 +++-- sys/vm/swap_pager.h | 1 + sys/vm/vm.h | 2 + sys/vm/vm_init.c | 6 +- sys/vm/vm_kern.c | 9 +- sys/vm/vnode_pager.c | 30 +++- 36 files changed, 989 insertions(+), 246 deletions(-) --qim0fXNpvdl5D74Y Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQIcBAEBAgAGBQJQ0ccfAAoJEJDCuSvBvK1BsN4QAKFcmhXwCzuwBTZcWIKK/J/Z 9BFBWG0hFKcIVOLyrwkEbwYdumjiVriJbGTl9PWrjc1e41YQBr4FNhrO/fitR31U rlEXuNaXjc/e5BuKg18nUGnrLBGQFryeT2ZYaomU06qtvMYknwXnbM4y+GmfYEnz FzoGICsoDpDZo9TKInL1Y/bM6gEgW5AjjdXJyOs/5Vb/ZrQJVBc/DMw7vg/U7olb EW6T7KxBc3d3zIkPkFtSHVRA6c3905gBYmKN/p11/GtZQpGsjLizYmK4WkwHvmR9 WVDkxRIK1XVq003om5HnTXZ+LPngDvZTC1djMAWjsHTAXwb8lLrewmcKIGQNaIxf 9qnIIuxX4FPHkpay7/EdlDQxR1gphSLbGFtLZBFMBxnCgAYMZXLguvLNnh/Jk1KC 
eRl8mgN7M2+E8JwcHgIsJTKMDrGuUvgIvCXJDHG8OuXKtdzzxrZi+fWeRfg/cTel K0sgvG49vWACpLoylCl0LcxXdtbBtYgNfjDdi/UaAqBPqUvRCrU9EuJlWhq7MgYp kJzlMcjKq1nxofy/bsXnztQ85KMgl88DN2CAXAqOpcfB9dVR5CbBYVw4UYeBuAoi Us9oIM09BddUHgunrdE3VAiwYDJWwfHgZI6t7dvma72eOChhJ0pmlJmDUGITCZH3 JxCckXc2yFKXR6UXGB5Y =BZQ1 -----END PGP SIGNATURE----- --qim0fXNpvdl5D74Y-- From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 14:07:00 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 096C8BE7; Wed, 19 Dec 2012 14:07:00 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au [211.29.132.186]) by mx1.freebsd.org (Postfix) with ESMTP id 8814D8FC14; Wed, 19 Dec 2012 14:06:58 +0000 (UTC) Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qBJE6mBV001435 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 20 Dec 2012 01:06:51 +1100 Date: Thu, 20 Dec 2012 01:06:48 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Poul-Henning Kamp Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-Reply-To: <16439.1355922282@critter.freebsd.dk> Message-ID: <20121220005706.I1675@besplex.bde.org> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> <20121219221518.E1082@besplex.bde.org> <16439.1355922282@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=e5de0tV/ c=1 sm=1 a=5xuQJAhXp8AA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 
a=JzwRw_2MAAAA:8 a=o_YUZdvV9usA:10 a=sKz94c0OIJE1EZh3uQcA:9 a=CjuIK1q_8ugA:10 a=bxQHXO5Py4tHmhUgaywp5w==:117 Cc: Davide Italiano , Ian Lepore , Alexander Motin , phk@onelab2.iet.unipi.it, freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 14:07:00 -0000 On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: > -------- > In message <20121219221518.E1082@besplex.bde.org>, Bruce Evans writes: > >>> With this format you can specify callouts 68 years into the future >>> with quarter nanosecond resolution, and you can trivially and >>> efficiently compare dur_t's with >>> if (d1 < d2) >> >> This would make a better general format than timevals, timespecs and >> of course bintimes :-). > > Except that for absolute timescales, we're running out of the 32 bits > integer part. Except 32 bit time_t works until 2106 if it is unsigned. > Bintimes is a necessary superset of the 32.32 which tries to work > around the necessary but missing int96_t or int128_t[1]. > > [1] A good addition to C would be a general multi-word integer type > where you could ask for any int%d_t or uint%d_t you cared for, and > have the compiler DTRT. In difference from using a multiword-library, > this would still give these types their natural integer behaviour. That would be convenient, but bad for efficiency if it were actually used much. 
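The dates in this exchange can be checked with a little arithmetic. The sketch below is only an approximation using the mean Gregorian year, and last_year() is a hypothetical helper introduced here, not an existing interface. It also shows where the 68-year figure for a signed 32-bit seconds field comes from.

```c
#include <stdint.h>

/* Mean Gregorian year (365.2425 days) in seconds. */
#define SECS_PER_YEAR	31556952LL

/*
 * Hypothetical helper: approximate last calendar year reachable when
 * counting max_secs seconds forward from the start of 1970.
 */
static inline int
last_year(int64_t max_secs)
{
	return (1970 + (int)(max_secs / SECS_PER_YEAR));
}
```

With a signed 32-bit count, last_year(INT32_MAX) gives 2038 (about 68 years after 1970), and with an unsigned 32-bit count, last_year(UINT32_MAX) gives 2106 (about 136 years).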
Bruce From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 14:14:52 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A5B6BFB1; Wed, 19 Dec 2012 14:14:52 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 3B2378FC13; Wed, 19 Dec 2012 14:14:52 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 316E18A3FC; Wed, 19 Dec 2012 14:14:50 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBJEEnN9016665; Wed, 19 Dec 2012 14:14:49 GMT (envelope-from phk@phk.freebsd.dk) To: Bruce Evans Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-reply-to: <20121220005706.I1675@besplex.bde.org> From: "Poul-Henning Kamp" References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> <20121219221518.E1082@besplex.bde.org> <16439.1355922282@critter.freebsd.dk> <20121220005706.I1675@besplex.bde.org> Date: Wed, 19 Dec 2012 14:14:49 +0000 Message-ID: <16664.1355926489@critter.freebsd.dk> Cc: Davide Italiano , Ian Lepore , Alexander Motin , phk@onelab2.iet.unipi.it, freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 14:14:52 -0000 -------- In message <20121220005706.I1675@besplex.bde.org>, Bruce Evans writes: >On Wed, 19 Dec 2012, Poul-Henning Kamp 
wrote: >> Except that for absolute timescales, we're running out of the 32 bits >> integer part. > >Except 32 bit time_t works until 2106 if it is unsigned. That's sort of not an option. The real problem was that time_t was not defined as a floating point number. >> [1] A good addition to C would be a general multi-word integer type >> where you could ask for any int%d_t or uint%d_t you cared for, and >> have the compiler DTRT. In difference from using a multiword-library, >> this would still give these types their natural integer behaviour. > >That would be convenient, but bad for efficiency if it were actually >used much. You can say that about anything but CPU-native operations, and I doubt it would be as inefficient as struct bintime, which does not have access to the carry bit. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 14:21:19 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id AA4BF480; Wed, 19 Dec 2012 14:21:19 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx07.syd.optusnet.com.au (fallbackmx07.syd.optusnet.com.au [211.29.132.9]) by mx1.freebsd.org (Postfix) with ESMTP id 2C8CC8FC17; Wed, 19 Dec 2012 14:21:18 +0000 (UTC) Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au [211.29.132.184]) by fallbackmx07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qBJELHww025480; Thu, 20 Dec 2012 01:21:17 +1100 Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qBJEKum1011107 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 20 Dec 2012 01:20:57 +1100 
Date: Thu, 20 Dec 2012 01:20:56 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Davide Italiano Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-Reply-To: Message-ID: <20121220010702.B1675@besplex.bde.org> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> <20121219221518.E1082@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=L9pF2Jv8 c=1 sm=1 a=5xuQJAhXp8AA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=o_YUZdvV9usA:10 a=vXJ0KzY1-sXwhEubLOkA:9 a=CjuIK1q_8ugA:10 a=bxQHXO5Py4tHmhUgaywp5w==:117 Cc: Ian Lepore , Alexander Motin , phk@onelab2.iet.unipi.it, Poul-Henning Kamp , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 14:21:19 -0000 On Wed, 19 Dec 2012, Davide Italiano wrote: > On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans wrote: >> I would have tried a 32 bit format with a variable named 'ticks'. >> Something like: >> - ticks >= 0. Same meaning as now. No changes in ABIs or APIs to use >> this. The tick period would be constant but for virtual ticks and >> not too small. hz = 1000 now makes the period too small, and not a >> power of 2. So make the period 1/128 second. This gives a 1.24.7 >> binary format. 2**24 seconds is 194 days. >> - ticks < 0. The 31 value bits are now a cookie (descriptor) referring >> to a bintime or whatever. This case should rarely be used. 
I don't
>> like it that a tickless kernel, which is needed mainly for power
>> saving, has expanded into complications to support short timeouts
>> which should rarely be used.
>
> Bruce, I don't really agree with this.
> The data addressed by cookie should be still stored somewhere, and KBI
> will result broken. This, indeed, is not real problem as long as
> current calloutng code heavily breaks KBI, but if that was your point,
> I don't see how your proposed change could help.

In the old API, it is an error to pass ticks < 0, so only broken old
callers are affected. Of course, if there are any then it would be
hard to detect their garbage cookies.

Anyway, it's too late to change to this, and maybe also to a 32.32
format.

[32.32 format]
>> This would make a better general format than timevals, timespecs and
>> of course bintimes :-). It is a bit wasteful for timeouts since
>> its extremes are rarely used. Malicious and broken callers can
>> still cause overflow at 68 years, so you have to check for it and
>> handle it. The limit of 194 days is just as good for timeouts.
>
> I think the phk's proposal is better. About your overflow objection,
> I think is really unlikely to happen, but better safe than sorry.

It's very easy for applications to cause kernel overflow using valid
syscall args like tv_sec = TIME_T_MAX for a relative time in
nanosleep(). Adding TIME_T_MAX to the current time in seconds
overflows for all current times except for the first second after the
Epoch. There is no difference between the overflow for 32-bit and
64-bit time_t's for this. This is now mostly handled so that the
behaviour is harmless although wrong. E.g., the timeout might become
negative, and then since it is not a cookie it is silently replaced by
a timeout of 1 tick. In nanosleep(), IIRC there are further overflows
that result in returning early instead of retrying the 1-tick timeouts
endlessly.
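One way to avoid the wraparound described above is to clamp the sum instead of letting it overflow. The following is a minimal sketch under stated assumptions, not what the kernel actually does: int64_t stands in for time_t, deadline_add_sat() is a hypothetical name, and treating a negative relative timeout as zero is a choice made here for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: add a relative timeout to the current time,
 * saturating at INT64_MAX instead of wrapping to a negative value.
 */
static inline int64_t
deadline_add_sat(int64_t now, int64_t rel)
{
	if (rel < 0)
		rel = 0;		/* assumption: negative means "expire now" */
	if (now > INT64_MAX - rel)
		return (INT64_MAX);	/* sum would overflow: clamp to "never" */
	return (now + rel);
}
```

With now = 0 the maximum relative value still fits, and for any later time it is clamped, which matches the observation that the plain addition overflows for every current time except at the Epoch itself.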
Bruce From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 14:44:23 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id CBA86ACB for ; Wed, 19 Dec 2012 14:44:23 +0000 (UTC) (envelope-from mjacob@freebsd.org) Received: from ns1.feral.com (ns1.feral.com [192.67.166.1]) by mx1.freebsd.org (Postfix) with ESMTP id A07408FC0C for ; Wed, 19 Dec 2012 14:44:23 +0000 (UTC) Received: from [192.168.135.2] (quaver.net [76.14.49.207]) (authenticated bits=0) by ns1.feral.com (8.14.5/8.14.4) with ESMTP id qBJEiGFi076972 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO) for ; Wed, 19 Dec 2012 06:44:16 -0800 (PST) (envelope-from mjacob@freebsd.org) Message-ID: <50D1D2BD.80107@freebsd.org> Date: Wed, 19 Dec 2012 06:44:13 -0800 From: Matthew Jacob Organization: FreeBSD User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: freebsd-arch@freebsd.org Subject: Re: Unmapped I/O References: <20121219135451.GU71906@kib.kiev.ua> In-Reply-To: <20121219135451.GU71906@kib.kiev.ua> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (ns1.feral.com [192.67.166.1]); Wed, 19 Dec 2012 06:44:16 -0800 (PST) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: mjacob@freebsd.org List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 14:44:23 -0000 On 12/19/2012 5:54 AM, Konstantin Belousov wrote: > One of the known FreeBSD I/O path performance bootleneck is the > neccessity to map each I/O buffer pages into KVA. The problem is that > on the multi-core machines, the mapping must flush TLB on all cores, > due to the global mapping of the buffer pages into the kernel. 
This > means that buffer creation and destruction disrupts execution of all > other cores to perform TLB shootdown through IPI, and the thread > initiating the shootdown must wait for all other cores to execute and > report. > About time! From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 14:57:21 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 2161ED86; Wed, 19 Dec 2012 14:57:21 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail13.syd.optusnet.com.au (mail13.syd.optusnet.com.au [211.29.132.194]) by mx1.freebsd.org (Postfix) with ESMTP id 9EE668FC18; Wed, 19 Dec 2012 14:57:19 +0000 (UTC) Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail13.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qBJEv5aD027522 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 20 Dec 2012 01:57:08 +1100 Date: Thu, 20 Dec 2012 01:57:05 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Poul-Henning Kamp Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-Reply-To: <16664.1355926489@critter.freebsd.dk> Message-ID: <20121220012223.F1772@besplex.bde.org> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> <20121219221518.E1082@besplex.bde.org> <16439.1355922282@critter.freebsd.dk> <20121220005706.I1675@besplex.bde.org> <16664.1355926489@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=BrrFWvr5 c=1 sm=1 a=5xuQJAhXp8AA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 
a=o_YUZdvV9usA:10 a=coF_l0KmdXCzCSlSFisA:9 a=CjuIK1q_8ugA:10 a=bxQHXO5Py4tHmhUgaywp5w==:117 Cc: Davide Italiano , Ian Lepore , Alexander Motin , phk@onelab2.iet.unipi.it, freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 14:57:21 -0000 On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: > -------- > In message <20121220005706.I1675@besplex.bde.org>, Bruce Evans writes: >> On Wed, 19 Dec 2012, Poul-Henning Kamp wrote: > >>> Except that for absolute timescales, we're running out of the 32 bits >>> integer part. >> >> Except 32 bit time_t works until 2106 if it is unsigned. > > That's sort of not an option. I think it is. It is just probably not necessary since 32-bit systems will go away before 2038. > The real problem was that time_t was not defined as a floating > point number. That would be convenient too, but bad for efficiency on some systems. Kernels might not be able to use it, and then would have to use an alternative representation, which they should have done all along. >>> [1] A good addition to C would be a general multi-word integer type >>> where you could ask for any int%d_t or uint%d_t you cared for, and >>> have the compiler DTRT. In difference from using a multiword-library, >>> this would still give these types their natural integer behaviour. >> >> That would be convenient, but bad for efficiency if it were actually >> used much. > > You can say that about anything but CPU-native operations, and I doubt > it would be as inefficient as struct bintime, which does not have access > to the carry bit. Yes, I would say that about non-native. It goes against the spirit of C. OTOH, compilers are getting closer to giving full access to the carry bit. 
I just checked what clang does in a home-made 128-bit add function:

% static void __noinline
% uadd(struct u *xup, struct u *yup)
% {
%     unsigned long long t;
%
%     t = xup->w[0] + yup->w[0];
%     if (t < xup->w[0])
%         xup->w[1]++;
%     xup->w[0] = t;
%     xup->w[1] += yup->w[1];
% }
%
%     .align 16, 0x90
%     .type uadd,@function
% uadd: # @uadd
%     .cfi_startproc
% # BB#0: # %entry
%     movq (%rdi), %rcx
%     movq 8(%rdi), %rax
%     addq (%rsi), %rcx

gcc generates an additional cmpq instruction here.

%     jae .LBB2_2

clang uses the carry bit set by the first addition to avoid the comparison,
but still branches.

% # BB#1: # %if.then
%     incq %rax
%     movq %rax, 8(%rdi)

This adds 1 explicitly instead of using adcq, but this is the slow path.

% .LBB2_2: # %if.end
%     movq %rcx, (%rdi)
%     addq 8(%rsi), %rax

This is as efficient as possible except for the extra branch, and the
branch is almost perfectly predictable.

%     movq %rax, 8(%rdi)
%     ret
% .Ltmp22:
%     .size uadd, .Ltmp22-uadd
%     .cfi_endproc

Bruce

From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 15:09:28 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4848264C; Wed, 19 Dec 2012 15:09:28 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id EFF978FC16; Wed, 19 Dec 2012 15:09:27 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 0E9237300B; Wed, 19 Dec 2012 16:08:09 +0100 (CET) Date: Wed, 19 Dec 2012 16:08:09 +0100 From: Luigi Rizzo To: Poul-Henning Kamp Subject: Re: API explosion (Re: [RFC/RFT] calloutng) Message-ID: <20121219150809.GA98673@onelab2.iet.unipi.it> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it>
<1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <15882.1355914308@critter.freebsd.dk> User-Agent: Mutt/1.4.2.3i Cc: Davide Italiano , Ian Lepore , Alexander Motin , phk@onelab2.iet.unipi.it, freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 15:09:28 -0000 On Wed, Dec 19, 2012 at 10:51:48AM +0000, Poul-Henning Kamp wrote: > -------- > In message > , Davide Italiano writes: > > >Right now -- the precision is specified in 'bintime', which is a binary number. > >It's not 32.32, it's 32.64 or 64.64 depending on the size of time_t in > >the specific platform. > > And that is way overkill for specifying a callout, at best your clock > has short term stabilities approaching 1e-8, but likely as bad as 1e-6. > > (The reason why bintime is important for timekeeping is that we > accumulate timeintervals approx 1e3 times a second, so the rounding > error has to be much smaller than the short term stability in order > to not dominate) > > >I do not really think it worth to create another structure for > >handling time (e.g. struct bintime32), as it will lead to code > > No, that was exactly my point: It should be an integer so that > comparisons and arithmetic is trivial. A 32.32 format fits > nicely into a int64_t which is readily available in the language. 
> > As I said in my previous email: > > > typedef dur_t int64_t; /* signed for bug catching */ > #define DURSEC ((dur_t)1 << 32) > #define DURMIN (DURSEC * 60) > #define DURMSEC (DURSEC / 1000) > #define DURUSEC (DURSEC / 10000000) > #define DURNSEC (DURSEC / 10000000000) > > (Bikeshed the names at your convenience) > > Then you can say > > callout_foo(34 * DURSEC) > callout_foo(2400 * DURMSEC) > or > callout_foo(500 * DURNSEC) only thing, we must be careful with the parentheses For instance, in your macro, DURNSEC evaluates to 0 and so does any multiple of it. We should define them as #define DURNSEC DURSEC / 10000000000 ... so DURNSEC is still 0 and 500*DURNSEC gives 214 I am curious that Bruce did not mention this :) (btw the typedef is swapped, should be "typedef int64_t dur_t") cheers luigi From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 15:37:31 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 61D7AF3; Wed, 19 Dec 2012 15:37:31 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id E77A28FC12; Wed, 19 Dec 2012 15:37:30 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 5BF5489EAF; Wed, 19 Dec 2012 15:37:29 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBJFbSun017058; Wed, 19 Dec 2012 15:37:28 GMT (envelope-from phk@phk.freebsd.dk) To: Luigi Rizzo Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-reply-to: <20121219150809.GA98673@onelab2.iet.unipi.it> From: "Poul-Henning Kamp" References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> 
<1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> <20121219150809.GA98673@onelab2.iet.unipi.it> Date: Wed, 19 Dec 2012 15:37:28 +0000 Message-ID: <17057.1355931448@critter.freebsd.dk> Cc: Davide Italiano , Ian Lepore , Alexander Motin , phk@onelab2.iet.unipi.it, freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 15:37:31 -0000

--------
In message <20121219150809.GA98673@onelab2.iet.unipi.it>, Luigi Rizzo writes:

>> typedef dur_t int64_t; /* signed for bug catching */
>> #define DURSEC ((dur_t)1 << 32)
>> #define DURMIN (DURSEC * 60)
>> #define DURMSEC (DURSEC / 1000)
>> #define DURUSEC (DURSEC / 10000000)
>> #define DURNSEC (DURSEC / 10000000000)

>only thing, we must be careful with the parentheses

Actually, it's more important to be careful with zeros: if you
adjust the above to the correct number of zeros, DURNSEC is 4,
which is within seven percent of the correct value.

>(btw the typedef is swapped, should be "typedef int64_t dur_t")

Yes, I'm trying to find out if people even listen to me :-)

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 15:44:16 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id BFDA659C; Wed, 19 Dec 2012 15:44:16 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au [211.29.132.184]) by mx1.freebsd.org (Postfix) with ESMTP id 4C3428FC13; Wed, 19 Dec 2012 15:44:15 +0000 (UTC) Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qBJFi70F027414 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 20 Dec 2012 02:44:08 +1100 Date: Thu, 20 Dec 2012 02:44:07 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Luigi Rizzo Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-Reply-To: <20121219150809.GA98673@onelab2.iet.unipi.it> Message-ID: <20121220022926.C1961@besplex.bde.org> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> <20121219150809.GA98673@onelab2.iet.unipi.it> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=EvuKNlgA c=1 sm=1 a=5xuQJAhXp8AA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=o_YUZdvV9usA:10 a=j58EWA6DYnG8U02EjcUA:9 a=CjuIK1q_8ugA:10 a=bxQHXO5Py4tHmhUgaywp5w==:117 Cc: Davide Italiano , Ian Lepore , Alexander Motin , Poul-Henning Kamp , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 15:44:16 -0000

I finally remembered to remove the .it phk :-).

On Wed, 19 Dec 2012, Luigi Rizzo wrote:

> On Wed, Dec 19, 2012 at 10:51:48AM +0000, Poul-Henning Kamp wrote:
>> ...
>> As I said in my previous email:
>>
>>
>> typedef dur_t int64_t; /* signed for bug catching */
>> #define DURSEC ((dur_t)1 << 32)
>> #define DURMIN (DURSEC * 60)
>> #define DURMSEC (DURSEC / 1000)
>> #define DURUSEC (DURSEC / 10000000)
>> #define DURNSEC (DURSEC / 10000000000)
>>
>> (Bikeshed the names at your convenience)
>>
>> Then you can say
>>
>> callout_foo(34 * DURSEC)
>> callout_foo(2400 * DURMSEC)
>> or
>> callout_foo(500 * DURNSEC)
>
> only thing, we must be careful with the parentheses
>
> For instance, in your macro, DURNSEC evaluates to 0 and so
> does any multiple of it.
> We should define them as
>
> #define DURNSEC DURSEC / 10000000000
> ...
>
> so DURNSEC is still 0 and 500*DURNSEC gives 214
>
> I am curious that Bruce did not mention this :)

Er, he was careful. DURNSEC gives 4, not 0. This is not very accurate,
but probably good enough. Your version without parentheses is not so
careful and depends on a magic order of operations and no overflow from
this. E.g.:

    500*DURNSEC = 500*DURSEC / 1000000000
                = 500*((dur_t)1 << 32) / 1000000000

This is very accurate and happens not to overflow. But 5 seconds
represented a little strangely in nanoseconds would overflow:

    5000000000*DURNSEC = 5000000000*((dur_t)1 << 32) / 1000000000

So would 5 billion times DURSEC, but 5 billion seconds is more
unreasonable than 5 billion nanoseconds and the format just can't
represent that.

>
> (btw the typedef is swapped, should be "typedef int64_t dur_t")

Didn't notice this.
Bruce From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 15:44:31 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 75623646; Wed, 19 Dec 2012 15:44:31 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-bk0-f50.google.com (mail-bk0-f50.google.com [209.85.214.50]) by mx1.freebsd.org (Postfix) with ESMTP id 92A7B8FC14; Wed, 19 Dec 2012 15:44:30 +0000 (UTC) Received: by mail-bk0-f50.google.com with SMTP id jf3so1077931bkc.23 for ; Wed, 19 Dec 2012 07:44:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:sender:message-id:date:from:user-agent:mime-version:to :cc:subject:references:in-reply-to:content-type :content-transfer-encoding; bh=Q+HCfy6mvCBs/gmeiWofRAlvBBzh+GP4niu6XB1oumw=; b=Bb800yhX0+Ylcu5xxGH736OvrPQxaTW7f8srbZb4a1C7lqeWm2RWZYe1LaIKTkVsPm S1DnQo30T4vtlYPCUwR8m59+rjcKlZ020SGCERuY7ha9H8UhnQpur+M9UHS2GP5xwWSk jgN4dp4uPupes0Zc6191o2E4bhxogbM2fKm4B9Xtcz+fh2tD+9uTtKn42m2QY0g68iQ4 0Rqv0wqc3WIykm/BTcagXokSmEX/bF3f6rKfCdofOeiKo+v9MTjDRlBz0yOSG+KQgv5X ZWbg4/vZuG1fbjE7vI/L9NHAu4HNxo6IGe5QcGN4Kadt8+DuzEpMVa0O55QUFXZGFnP3 PShg== X-Received: by 10.204.5.145 with SMTP id 17mr2763593bkv.98.1355931868799; Wed, 19 Dec 2012 07:44:28 -0800 (PST) Received: from mavbook.mavhome.dp.ua (mavhome.mavhome.dp.ua. 
[213.227.240.37]) by mx.google.com with ESMTPS id o9sm4615748bko.15.2012.12.19.07.44.26 (version=TLSv1/SSLv3 cipher=OTHER); Wed, 19 Dec 2012 07:44:27 -0800 (PST) Sender: Alexander Motin Message-ID: <50D1E0D8.9070209@FreeBSD.org> Date: Wed, 19 Dec 2012 17:44:24 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:13.0) Gecko/20120628 Thunderbird/13.0.1 MIME-Version: 1.0 To: Bruce Evans Subject: Re: API explosion (Re: [RFC/RFT] calloutng) References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> <20121219221518.E1082@besplex.bde.org> <20121220010702.B1675@besplex.bde.org> In-Reply-To: <20121220010702.B1675@besplex.bde.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Davide Italiano , Ian Lepore , phk@onelab2.iet.unipi.it, Poul-Henning Kamp , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 15:44:31 -0000 On 19.12.2012 16:20, Bruce Evans wrote: > On Wed, 19 Dec 2012, Davide Italiano wrote: > >> On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans >> wrote: > >>> I would have tried a 32 bit format with a variable named 'ticks'. >>> Something like: >>> - ticks >= 0. Same meaning as now. No changes in ABIs or APIs to use >>> this. The tick period would be constant but for virtual ticks and >>> not too small. hz = 1000 now makes the period too small, and not a >>> power of 2. So make the period 1/128 second. This gives a 1.24.7 >>> binary format. 2**24 seconds is 194 days. >>> - ticks < 0. 
The 31 value bits are now a cookie (descriptor) referring
>>> to a bintime or whatever. This case should rarely be used. I don't
>>> like it that a tickless kernel, which is needed mainly for power
>>> saving, has expanded into complications to support short timeouts
>>> which should rarely be used.
>>
>> Bruce, I don't really agree with this.
>> The data addressed by cookie should be still stored somewhere, and KBI
>> will result broken. This, indeed, is not real problem as long as
>> current calloutng code heavily breaks KBI, but if that was your point,
>> I don't see how your proposed change could help.
>
> In the old API, it is an error to pass ticks < 0, so only broken old
> callers are affected. Of course, if there are any then it would be
> hard to detect their garbage cookies.
>
> Anyway, it's too late to change to this, and maybe also to a 32.32
> format.

It would be too late to change this after committing. I would definitely
have liked it to be done earlier, to avoid redoing all the tests, but I
think we could convert the callout and eventtimers code to the 32.32
format in several days. There are only two questions: do we really want
it (will we have any reason to regret it?), and how do we want it to look?

For the first question, my personal showstopper ever since eventtimers
were created has been the wish to keep consistency. But the benefits of
the 32.32 format are clear, and if there are more votes for it, let's
consider it. If it is now decided that the full range will never be
useful for the callout subsystem, then it is obviously not needed for
eventtimers either.

About the second question, what do you think about prototypes like these:

typedef int64_t sbintime_t;
static __inline sbintime_t bintime_shrink(struct bintime *bt) {}
static __inline struct bintime bintime_expand(sbintime_t sbt) {}
...
int callout_reset_bt(struct callout *, sbintime_t sbt, sbintime_t pr, void (*fn)(void *), void *arg, int flags); , where pr used only for absolute precision, and flags includes: direct execution, absolute/relative time in argument, relative precision in case of relative sbt, flag for aligning to hardclock() to emulate legacy behavior, and potentially flags for reaction on suspend/resume. Another option is to move absolute precision also to flags, using log2() representation, as we tried and as was proposed before. With possibility to use precise relative time there will be few cases requiring absolute value of precision, that should depend on period. Then there will be no extra arguments in the most usual cases. Wrapper for existing API compatibility will look just like this: #define callout_reset(c, ticks, fn, arg) \ callout_reset_bt(c, ticks2sbintime(ticks), -1, \ (fn), (arg), C_HARDCLOCK) -- Alexander Motin From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 15:58:15 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 2386FDA1; Wed, 19 Dec 2012 15:58:15 +0000 (UTC) (envelope-from davide.italiano@gmail.com) Received: from mail-vb0-f53.google.com (mail-vb0-f53.google.com [209.85.212.53]) by mx1.freebsd.org (Postfix) with ESMTP id 8F2BE8FC13; Wed, 19 Dec 2012 15:58:14 +0000 (UTC) Received: by mail-vb0-f53.google.com with SMTP id b23so2410492vbz.26 for ; Wed, 19 Dec 2012 07:58:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=MZwBLAXR5ZQuT6zg0qGkR7r8/PYRwl3fCa5jYWpUoE8=; b=Zrlkhba+OFc6g/mdMC4zHJFh+hTCQV0Vzfxru89bbZy8R6SCg9JrC9Xss2XQMUPqF+ 4o9E/wAMDk0RrkssIWvc7bC1YwOTwAJPwAzcocSmwia46cwbsYC2LfbEfOAu2+uVL8+E 0Xa5O8vivrtuoFb0m6RB1jkvjdvMFQieRJGNyPELZOYctSoerbhiCnwbxDxbrIgyqVOd 
feJShTg6A5wePTT+mhWzsnu0chfnAw7fOsPAVMwclgLvucOI6s54SY08V2iNMZxt2wJr oFR7eNI+fhHQYT2OJDcjvfceUqzC0emGeIOPeYUbMByFVxdDVgnxnB4/yjsIV2PKRguh IHIg== MIME-Version: 1.0 Received: by 10.220.148.205 with SMTP id q13mr9184719vcv.6.1355928707925; Wed, 19 Dec 2012 06:51:47 -0800 (PST) Sender: davide.italiano@gmail.com Received: by 10.58.229.136 with HTTP; Wed, 19 Dec 2012 06:51:47 -0800 (PST) In-Reply-To: <20121220010702.B1675@besplex.bde.org> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> <20121219221518.E1082@besplex.bde.org> <20121220010702.B1675@besplex.bde.org> Date: Wed, 19 Dec 2012 06:51:47 -0800 X-Google-Sender-Auth: spBVXsoWBq0qXz79KesqcXmg_C8 Message-ID: Subject: Re: API explosion (Re: [RFC/RFT] calloutng) From: Davide Italiano To: Bruce Evans Content-Type: text/plain; charset=ISO-8859-1 Cc: Ian Lepore , Poul-Henning Kamp , freebsd-current , Alexander Motin , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 15:58:15 -0000 dropping phk _AT_ onelab2 _DOT_ something from CC as long as it doesn't seem a valid mail address and I'm annoyed mails bounce back. On Wed, Dec 19, 2012 at 6:20 AM, Bruce Evans wrote: > On Wed, 19 Dec 2012, Davide Italiano wrote: > >> On Wed, Dec 19, 2012 at 4:18 AM, Bruce Evans wrote: > > >>> I would have tried a 32 bit format with a variable named 'ticks'. >>> Something like: >>> - ticks >= 0. Same meaning as now. No changes in ABIs or APIs to use >>> this. The tick period would be constant but for virtual ticks and >>> not too small. 
hz = 1000 now makes the period too small, and not a
>>> power of 2. So make the period 1/128 second. This gives a 1.24.7
>>> binary format. 2**24 seconds is 194 days.
>>> - ticks < 0. The 31 value bits are now a cookie (descriptor) referring
>>> to a bintime or whatever. This case should rarely be used. I don't
>>> like it that a tickless kernel, which is needed mainly for power
>>> saving, has expanded into complications to support short timeouts
>>> which should rarely be used.
>>
>>
>> Bruce, I don't really agree with this.
>> The data addressed by cookie should be still stored somewhere, and KBI
>> will result broken. This, indeed, is not real problem as long as
>> current calloutng code heavily breaks KBI, but if that was your point,
>> I don't see how your proposed change could help.
>
>
> In the old API, it is an error to pass ticks < 0, so only broken old
> callers are affected. Of course, if there are any then it would be
> hard to detect their garbage cookies.
>
> Anyway, it's too late to change to this, and maybe also to a 32.32
> format.
>
> [32.32 format]

It's not too late. What I'd like to do, now that people are interested in
the problem, is agree on the interface to use. Following this thread, and
as I've already discussed with mav@, we would like to decide which of the
two is better:
- specify precision as an additional argument (as we're doing right now)
- use the 'flags' argument
If we allow the time argument to be relative and not absolute, as
suggested by luigi@, we can easily use relative precision where we had to
use ffl() before.

>>>
>>> This would make a better general format than timevals, timespecs and
>>> of course bintimes :-). It is a bit wasteful for timeouts since
>>> its extremes are rarely used. Malicious and broken callers can
>>> still cause overflow at 68 years, so you have to check for it and
>>> handle it. The limit of 194 days is just as good for timeouts.
>>
>>
>> I think the phk's proposal is better.
About your overflow objection, >> I think is really unlikely to happen, but better safe than sorry. > > > It's very easy for applications to cause kernel overflow using valid > syscall args like tv_sec = TIME_T_MAX for a relative time in > nanosleep(). Adding TIME_T_MAX to the current time in seconds overflow > for all current times except for the first second after the Epoch. > There is no difference between the overflow for 32-bit and 64-bit > time_t's for this. This is now mostly handled so that the behaviour is > harmless although wrong. E.g., the timeout might become negative, > and then since it is not a cookie it is silently replaced by a timeout > of 1 tick. In nanosleep(), IIRC there are further overflows that result > in returning early instead of retrying the 1-tick timeouts endlessly. > > Bruce Thanks, Davide From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 16:18:44 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id DBDCF8B5; Wed, 19 Dec 2012 16:18:43 +0000 (UTC) (envelope-from rizzo.unipi@gmail.com) Received: from mail-ee0-f48.google.com (mail-ee0-f48.google.com [74.125.83.48]) by mx1.freebsd.org (Postfix) with ESMTP id E37578FC13; Wed, 19 Dec 2012 16:18:42 +0000 (UTC) Received: by mail-ee0-f48.google.com with SMTP id b57so1128130eek.7 for ; Wed, 19 Dec 2012 08:18:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=xIViPU9CBvgN8soYJD56Hq21FV9fLdRC7/cdE3yQUsk=; b=tcthfqTekjPLVMtY6Kl/0uAMJIVJBABDkvyNQKIGJ9IWFX7nQo/WnQmpSY6bURLzrc fJSl8dbiQFaB6qw9GOiNpLgRcfPbodUPXlSK8GHfA7uToaJhXGRYQQuOfvSghqq5V5oX KZyRtmd+Oi8/x+lvTaSvB+yxd0su06xLLbc5JLfCrzfBAKPABEd7hwLyS6bBDOyNy7BP t78udfSFIBKrXtWXqlkI3iEcVBuQW01re9BKCtRPq4hwDC/APS4ahDsSu82f7Z2FU29V 
kxQ9AdZepsHPa532RXxVYW0iiXBvIMP2AeG3AKkcbHo0Rd+VyNOeYfQ0nccjSfK3f+VP AamA== MIME-Version: 1.0 Received: by 10.14.225.72 with SMTP id y48mr15124371eep.46.1355933921454; Wed, 19 Dec 2012 08:18:41 -0800 (PST) Sender: rizzo.unipi@gmail.com Received: by 10.14.0.2 with HTTP; Wed, 19 Dec 2012 08:18:41 -0800 (PST) In-Reply-To: <17057.1355931448@critter.freebsd.dk> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> <20121219150809.GA98673@onelab2.iet.unipi.it> <17057.1355931448@critter.freebsd.dk> Date: Wed, 19 Dec 2012 08:18:41 -0800 X-Google-Sender-Auth: ePNWaUYAQ9S9zl1G8hfm2qQ3X08 Message-ID: Subject: Re: API explosion (Re: [RFC/RFT] calloutng) From: Luigi Rizzo To: Poul-Henning Kamp Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: Davide Italiano , Ian Lepore , Alexander Motin , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 16:18:44 -0000 On Wed, Dec 19, 2012 at 7:37 AM, Poul-Henning Kamp wrote: > -------- > In message <20121219150809.GA98673@onelab2.iet.unipi.it>, Luigi Rizzo > writes: > > >> typedef dur_t int64_t; /* signed for bug catching */ > >> #define DURSEC ((dur_t)1 << 32) > >> #define DURMIN (DURSEC * 60) > >> #define DURMSEC (DURSEC / 1000) > >> #define DURUSEC (DURSEC / 10000000) > >> #define DURNSEC (DURSEC / 10000000000) > > >only thing, we must be careful with the parentheses > > Actually, it's more impportant to be careful with zeros, if you > adjust the above to the correct number of zeros, DURNSEC is 
4, > which is within seven percent of the correct value. > counting digits is impossible for people over 45. But i have a solution for that #define DURNSEC (DURSEC / 1003006009) which is within 0.5% of the desired value. (and of course (1000*1000*1000) might do the job too) cheers luigi From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 16:35:33 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5339123D; Wed, 19 Dec 2012 16:35:33 +0000 (UTC) (envelope-from julian@freebsd.org) Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16]) by mx1.freebsd.org (Postfix) with ESMTP id 1FD628FC12; Wed, 19 Dec 2012 16:35:32 +0000 (UTC) Received: from JRE-MBP-2.local (ppp121-45-232-233.lns20.per1.internode.on.net [121.45.232.233]) (authenticated bits=0) by vps1.elischer.org (8.14.5/8.14.5) with ESMTP id qBJGZNfI034321 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Wed, 19 Dec 2012 08:35:25 -0800 (PST) (envelope-from julian@freebsd.org) Message-ID: <50D1ECC5.2070209@freebsd.org> Date: Thu, 20 Dec 2012 00:35:17 +0800 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: mjacob@freebsd.org Subject: Re: Unmapped I/O References: <20121219135451.GU71906@kib.kiev.ua> <50D1D2BD.80107@freebsd.org> In-Reply-To: <50D1D2BD.80107@freebsd.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 16:35:33 -0000 On 12/19/12 10:44 PM, Matthew Jacob wrote: > On 12/19/2012 5:54 AM, Konstantin Belousov wrote: >> One of the known FreeBSD I/O path performance 
bootleneck is the >> neccessity to map each I/O buffer pages into KVA. The problem is that >> on the multi-core machines, the mapping must flush TLB on all cores, >> due to the global mapping of the buffer pages into the kernel. This >> means that buffer creation and destruction disrupts execution of all >> other cores to perform TLB shootdown through IPI, and the thread >> initiating the shootdown must wait for all other cores to execute and >> report. >> > About time! yeah.. Bill Jolitz had patches for this in 92 ... that disappeared with him. > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > > From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 16:51:08 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3B173968; Wed, 19 Dec 2012 16:51:08 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id C68498FC12; Wed, 19 Dec 2012 16:51:07 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 5928B8A3FC; Wed, 19 Dec 2012 16:51:06 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBJGp5O1017237; Wed, 19 Dec 2012 16:51:05 GMT (envelope-from phk@phk.freebsd.dk) To: Alexander Motin Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-reply-to: <50D1E0D8.9070209@FreeBSD.org> From: "Poul-Henning Kamp" References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> 
<14604.1355910848@critter.freebsd.dk> <15882.1355914308@critter.freebsd.dk> <20121219221518.E1082@besplex.bde.org> <20121220010702.B1675@besplex.bde.org> <50D1E0D8.9070209@FreeBSD.org> Date: Wed, 19 Dec 2012 16:51:05 +0000 Message-ID: <17236.1355935865@critter.freebsd.dk> Cc: Davide Italiano , Ian Lepore , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 16:51:08 -0000 -------- In message <50D1E0D8.9070209@FreeBSD.org>, Alexander Motin writes: >It would be late to change this after committing. I would definitely >like it to be done earlier to not redo all the tests, but I think we >could convert callout and eventtimers code to 32.32 format in several >days. The only two questions are: do we really want it (won't there be >any reasons to regret about it) and how do we want it to look? As much as it pains me to raise this point, we would regret it if we did not use 32.32, because Linux already went that way. As much as there is to be said for doing things right, we should also try to avoid pointless incompatibilities which will make it needlessly hard for people to move code, particular device drivers forth and back. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 16:52:42 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3E57DC4A; Wed, 19 Dec 2012 16:52:42 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id E9F028FC0A; Wed, 19 Dec 2012 16:52:41 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 13D0D89EAF; Wed, 19 Dec 2012 16:52:41 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBJGqeEj017253; Wed, 19 Dec 2012 16:52:40 GMT (envelope-from phk@phk.freebsd.dk) To: Julian Elischer Subject: Re: Unmapped I/O In-reply-to: <50D1ECC5.2070209@freebsd.org> From: "Poul-Henning Kamp" References: <20121219135451.GU71906@kib.kiev.ua> <50D1D2BD.80107@freebsd.org> <50D1ECC5.2070209@freebsd.org> Date: Wed, 19 Dec 2012 16:52:40 +0000 Message-ID: <17252.1355935960@critter.freebsd.dk> Cc: mjacob@freebsd.org, freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 16:52:42 -0000 -------- In message <50D1ECC5.2070209@freebsd.org>, Julian Elischer writes: >yeah.. Bill Jolitz had patches for this in 92 ... that disappeared >with him. You know, I've never seen a shred of evidence supporting that claim or any of the many similarly improbable claims Bill Jolitz made, and in this particular case I very much did look for such evidence. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 17:23:28 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 42F37563; Wed, 19 Dec 2012 17:23:28 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 9C6E78FC0A; Wed, 19 Dec 2012 17:23:27 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.5/8.14.5) with ESMTP id qBJHNKI9047899; Wed, 19 Dec 2012 19:23:20 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.3 kib.kiev.ua qBJHNKI9047899 Received: (from kostik@localhost) by tom.home (8.14.5/8.14.5/Submit) id qBJHNKcl047898; Wed, 19 Dec 2012 19:23:20 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 19 Dec 2012 19:23:20 +0200 From: Konstantin Belousov To: Poul-Henning Kamp Subject: Re: Unmapped I/O Message-ID: <20121219172320.GW71906@kib.kiev.ua> References: <20121219135451.GU71906@kib.kiev.ua> <50D1D2BD.80107@freebsd.org> <50D1ECC5.2070209@freebsd.org> <17252.1355935960@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="25+pmc5/5D73GfAV" Content-Disposition: inline In-Reply-To: <17252.1355935960@critter.freebsd.dk> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: mjacob@freebsd.org, freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
X-List-Received-Date: Wed, 19 Dec 2012 17:23:28 -0000 --25+pmc5/5D73GfAV Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Dec 19, 2012 at 04:52:40PM +0000, Poul-Henning Kamp wrote: > -------- > In message <50D1ECC5.2070209@freebsd.org>, Julian Elischer writes: > > >yeah.. Bill Jolitz had patches for this in 92 ... that disappeared > >with him. > > You know, I've never seen a shred of evidence supporting that claim > or any of the many similarly improbable claims Bill Jolitz made, and > in this particular case I very much did look for such evidence. This is definitely not a discussion I hoped for. Still, i386 cannot benefit much from the unmapped buffers, just because there are no facilities similar to the direct map for amd64. i386 must use transient mapping even for unmapped buffers to copy the data to usermode. Also, as I understand the history, VMIO buffers, or unified page/buffer cache, only appeared in FreeBSD. 
--25+pmc5/5D73GfAV Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQIcBAEBAgAGBQJQ0fgIAAoJEJDCuSvBvK1BMj8P/Au3nPzZZgXNjSEVpxSQ8gjD Wj3sr1XdtTuz3Y4lhmE6XLSy7mHyeS1HtNeoSSqZktHuteUE4DmtgyMuscHe+de+ heBQ7eob1sG7N363dibIUHMhkkWe/MZX8YZ50Jp/p13ij2ME4Um+jWtFVyn1VGNj 6hbH2uq0ClSYV9yI3iE+TJReWIeTfrSpNL535rGb7zoY7xlQh0LW74m0E7OE0/E+ oIaphXOkOn1OjL5PhYqX0TXeU+IB4vAxNPJmeetj8h9anWj8J+wMdP3s9Ki6xEwu dfj2ul0R2ddGG1Kkz+Dh3/22EXlpJIj2ahOkcRQkLB+K0PST3PGj7NQDnysh/bab OVCLc/o5S3t5LH4AF4LAsYc4OYrlkgVUx8O/AUoNSrZBZHu3RbxcSv3+FoQ3Dd9v UwqNkzuMPm6W8ZggJiiSZ6nwqR4Xc2EN6R+6qoI9fT3D6nPwcANDHOeOd/Xk7fz0 ntKgotw6zZtHjjpB7j3rJ243WgjNLJntYHfcvathCkD88IprLaLIU2RSMU8p6JRN mbBa7Mq4ksNVvDO+zaQz+8PawdkEaQv7mMsI/SP1owPU4uVCvJaOaEi5CIdumHum +BMj6bMBShkiZLjZUBx3hjdElU9SrGr+HYskWx0CPQP9X/n1FWejq98pKXfy2h22 eKRpEhjrmnpScYOcZEUg =FEJY -----END PGP SIGNATURE----- --25+pmc5/5D73GfAV-- From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 17:47:47 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 13852FA0; Wed, 19 Dec 2012 17:47:47 +0000 (UTC) (envelope-from julian@freebsd.org) Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16]) by mx1.freebsd.org (Postfix) with ESMTP id D23A98FC12; Wed, 19 Dec 2012 17:47:46 +0000 (UTC) Received: from JRE-MBP-2.local (ppp121-45-232-233.lns20.per1.internode.on.net [121.45.232.233]) (authenticated bits=0) by vps1.elischer.org (8.14.5/8.14.5) with ESMTP id qBJHld9F034581 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Wed, 19 Dec 2012 09:47:41 -0800 (PST) (envelope-from julian@freebsd.org) Message-ID: <50D1FDB5.70804@freebsd.org> Date: Thu, 20 Dec 2012 01:47:33 +0800 From: Julian Elischer User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Konstantin Belousov Subject: Re: Unmapped I/O References: 
<20121219135451.GU71906@kib.kiev.ua> <50D1D2BD.80107@freebsd.org> <50D1ECC5.2070209@freebsd.org> <17252.1355935960@critter.freebsd.dk> <20121219172320.GW71906@kib.kiev.ua> In-Reply-To: <20121219172320.GW71906@kib.kiev.ua> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Poul-Henning Kamp , mjacob@freebsd.org, freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 17:47:47 -0000 On 12/20/12 1:23 AM, Konstantin Belousov wrote: > On Wed, Dec 19, 2012 at 04:52:40PM +0000, Poul-Henning Kamp wrote: >> -------- >> In message <50D1ECC5.2070209@freebsd.org>, Julian Elischer writes: >> >>> yeah.. Bill Jolitz had patches for this in 92 ... that disappeared >>> with him. >> You know, I've never seen a shred of evidence supporting that claim >> or any of the many similarly improbable claims Bill Jolitz made, and >> in this particular case I very much did look for such evidence. > This is definitely not a discussion I hoped for. > > Still, the i386 cannot have much benefit from the unmapped buffers, > just because thre is no facilities similar to the direct map for amd64. > i386 must use transient mapping even for unmapped buffers to copy > the data to the usermode. > > Also, as I understand the history, VMIO buffers, or unified page/buffer > cache, only appeared in the FreeBSD. If you look at the old physio code then you will see that the driver can DMA directly to user space, even in BSD4.3 and earlier. The system however insisted on mapping it to kernel addresses in case the device needed to be spoon fed. Even if it didn't need to. The case of buffered IO is of course different. Bill did explain his changes to me once when I visited him at his home in Oakland, and showed me code. 
I do not remember the details but the impression I retain is that there was some sort of "just-in-time" mapping that was used "if required" and that buffer caches were entirely non mapped most of the time, being maintained and managed in physical memory. From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 18:24:26 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 2B2F279F; Wed, 19 Dec 2012 18:24:26 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id D6B4A8FC14; Wed, 19 Dec 2012 18:24:25 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id A3CFB89EAF; Wed, 19 Dec 2012 18:24:23 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBJION9k017480; Wed, 19 Dec 2012 18:24:23 GMT (envelope-from phk@phk.freebsd.dk) To: Konstantin Belousov Subject: Re: Unmapped I/O In-reply-to: <20121219172320.GW71906@kib.kiev.ua> From: "Poul-Henning Kamp" References: <20121219135451.GU71906@kib.kiev.ua> <50D1D2BD.80107@freebsd.org> <50D1ECC5.2070209@freebsd.org> <17252.1355935960@critter.freebsd.dk> <20121219172320.GW71906@kib.kiev.ua> Date: Wed, 19 Dec 2012 18:24:23 +0000 Message-ID: <17479.1355941463@critter.freebsd.dk> Cc: mjacob@freebsd.org, freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 18:24:26 -0000 -------- In message <20121219172320.GW71906@kib.kiev.ua>, Konstantin Belousov writes: >Still, the i386 cannot have much benefit from the unmapped buffers, >just because thre is no facilities similar to the direct map for 
amd64. >i386 must use transient mapping even for unmapped buffers to copy >the data to the usermode. Wrong, an Adaptec 1542 could DMA directly into or out of any spot of memory and that could have been mapped in userland but not in kernel. >Also, as I understand the history, VMIO buffers, or unified page/buffer >cache, only appeared in the FreeBSD. Correct, but truth be told, they have probably delayed our implementation of unmapped buffers by about 10 years... I don't blame John & David, however: making that full leap in one go would have required the mythical HeldenProgrammer, and there was a lot of cruft we had to get out of the way first. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 18:36:05 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 02728925; Wed, 19 Dec 2012 18:36:04 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 902808FC0A; Wed, 19 Dec 2012 18:36:04 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.5/8.14.5) with ESMTP id qBJIa026055191; Wed, 19 Dec 2012 20:36:00 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.3 kib.kiev.ua qBJIa026055191 Received: (from kostik@localhost) by tom.home (8.14.5/8.14.5/Submit) id qBJIa0aF055190; Wed, 19 Dec 2012 20:36:00 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 19 Dec 2012 20:36:00 +0200 From: Konstantin Belousov To: Poul-Henning Kamp Subject: Re: Unmapped I/O Message-ID: <20121219183600.GX71906@kib.kiev.ua> References: 
<20121219135451.GU71906@kib.kiev.ua> <50D1D2BD.80107@freebsd.org> <50D1ECC5.2070209@freebsd.org> <17252.1355935960@critter.freebsd.dk> <20121219172320.GW71906@kib.kiev.ua> <17479.1355941463@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="Ge5ZftkQPdHHxjIb" Content-Disposition: inline In-Reply-To: <17479.1355941463@critter.freebsd.dk> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: mjacob@freebsd.org, freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 18:36:05 -0000 --Ge5ZftkQPdHHxjIb Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Dec 19, 2012 at 06:24:23PM +0000, Poul-Henning Kamp wrote: > -------- > In message <20121219172320.GW71906@kib.kiev.ua>, Konstantin Belousov writes: > > >Still, the i386 cannot have much benefit from the unmapped buffers, > >just because there is no facilities similar to the direct map for amd64. > >i386 must use transient mapping even for unmapped buffers to copy > >the data to the usermode. > > Wrong, a Adaptec 1542 could DMA directly into or out of any spot > of memory and that could have been mapped in userland but not in > kernel. And how can this be used while keeping on-disk data coherent with the buffer? It can be used by physio, but not for the normal file i/o, which caches the file data in the vnode pages, or in buffers for a non-unified cache. The transient mapping is needed to copy between the kernel buffer and the usermode address on i386. 
> > >Also, as I understand the history, VMIO buffers, or unified page/buffer > >cache, only appeared in the FreeBSD. > > Correct, but truth to be told, they have probably delayed our > implementation of unmapped buffers by about 10 years... Mapped buffers only become an issue on truly multi-core machines. Before large SMP became ubiquitous, the additional complexity of the transient mappings was definitely not worth it. > > I don't blame John & David however, making that full leap in > one go would have required the mythical HeldenProgrammer, there > were a lot of cruft we had to get out of the way first. > > -- > Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > phk@FreeBSD.ORG | TCP/IP since RFC 956 > FreeBSD committer | BSD since 4.3-tahoe > Never attribute to malice what can adequately be explained by incompetence. --Ge5ZftkQPdHHxjIb Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQIcBAEBAgAGBQJQ0gkPAAoJEJDCuSvBvK1B98cP/RVpqg6fWzCmvT2do78GRYhX +zDWRQyfYppKmer9ymth2xuFBpa+OVNhhRqnSQ2iV9SBEIG834lFK18t1hUOkP++ 8Lg1YZMWRQ8WW5eRaLeA3T7pf07YhuTZchMgwxG9zi5lGXBOOHIrDcEQ92qQq8fY ImMPj8cBusWL+Z09s85KFAjGvdsC8pVZXRVxXhaRfC59pLEowdxMIF5GQKUzbYqz FYpwd2NFCMN5ZMCAJHudxg7dwqfEFAIKLYpouEzzNXS4VOdgher4+WS7sdYsMeMn Htn15r/qc/TikItxzwrEA8LZbw6w/ASLau61dMc0alfc2RbPcLfTdbtQ62v83nIp TsglyGnuaSbs7+h8B5kz/hHZKe9Y8T6lF9KJC/YkmvuVh5mLnK/vmSZromUkpDKR cPaY2cZ7z8E2g8kRND2JLjUjXh083BiDdB+0F0eYdW+QbDJnJQYGRgwXaUegnDBA ZqAwVlg22px/tizYKpw5r2KBpXmFqd3GngH/KKQVkAxS6HItWBfzkTXjK2ReKGVq Sb9IihLcENNETgBO2xWMCd0ohAj9jYUpyMrtxSjqhDutw6ubk2kBH5rY89cQKgD7 Ks/fhn2Bd1VxkOQB4X1+LOpQzDPvDMPNDXzDKKdx6LSQN2aXec8wASogOrxnkjEZ bbQSqukY0Hp8PJYyaDjv =PCNw -----END PGP SIGNATURE----- --Ge5ZftkQPdHHxjIb-- From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 18:47:25 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 1B2A6E68; Wed, 19 Dec 
2012 18:47:25 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id C5F2F8FC0C; Wed, 19 Dec 2012 18:47:24 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 8E5DF89EAF; Wed, 19 Dec 2012 18:47:23 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBJIlNId017564; Wed, 19 Dec 2012 18:47:23 GMT (envelope-from phk@phk.freebsd.dk) To: Konstantin Belousov Subject: Re: Unmapped I/O In-reply-to: <20121219183600.GX71906@kib.kiev.ua> From: "Poul-Henning Kamp" References: <20121219135451.GU71906@kib.kiev.ua> <50D1D2BD.80107@freebsd.org> <50D1ECC5.2070209@freebsd.org> <17252.1355935960@critter.freebsd.dk> <20121219172320.GW71906@kib.kiev.ua> <17479.1355941463@critter.freebsd.dk> <20121219183600.GX71906@kib.kiev.ua> Date: Wed, 19 Dec 2012 18:47:23 +0000 Message-ID: <17563.1355942843@critter.freebsd.dk> Cc: mjacob@freebsd.org, freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 18:47:25 -0000 -------- In message <20121219183600.GX71906@kib.kiev.ua>, Konstantin Belousov writes: >> Wrong, a Adaptec 1542 could DMA directly into or out of any spot >> of memory and that could have been mapped in userland but not in >> kernel. > >And how this can be used while keeping on-disk data coherent with the >buffer ?[...] You simply don't care about the kernel buffer (most of the time). The kernel doesn't need to see the data, all it has to do is move it from disk to userland. 
Of course there are boundary issues and cornercases that need to be handled, for instance if userland does not issue requests which are disk-sector-integral, in which case the buffers will be needed, but for the common case, they will not be necessary. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 18:58:44 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 302FF33E for ; Wed, 19 Dec 2012 18:58:44 +0000 (UTC) (envelope-from alan.l.cox@gmail.com) Received: from mail-la0-f45.google.com (mail-la0-f45.google.com [209.85.215.45]) by mx1.freebsd.org (Postfix) with ESMTP id 7C17F8FC16 for ; Wed, 19 Dec 2012 18:58:42 +0000 (UTC) Received: by mail-la0-f45.google.com with SMTP id p9so1790415laa.32 for ; Wed, 19 Dec 2012 10:58:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:reply-to:in-reply-to:references:date:message-id :subject:from:to:cc:content-type; bh=4Doj3lgRSUAsiPGo4+qP5Sj1NUhVR/ZCKlRGyFSRpFg=; b=t6CYBsdsaAe9FVzl6fKwrzYEPK9coY8iAq4P8kKM6Q7z3uKN9Rn5JnEyOx0fkrC6V0 B9hk1RKfbPbtmrFwCl0TcsMtie21QN7yAsGOkQdexrrhq/CE8GU1jTRUgnbMikk8/owk GkDBGqZwCdyWsbiQFFevvMn2rAgo2RMVfDnc0SNljP6mMXHEFYFVjxDwvhSIlfJf0OlJ zltU9Z1Ubbvkt57PiVthIe/AHqn3utx2htZ+Bvzo9PDPOwh3upcvneQkeKXPTkXEl2EC VpJzS/ju0k7kHskMoikeQzpMSzasKR6hxHzlb3FjqvQ+vBwKUrcqv8I/xI5XbYmCOusn bN6A== MIME-Version: 1.0 Received: by 10.112.23.34 with SMTP id j2mr2759038lbf.118.1355943521789; Wed, 19 Dec 2012 10:58:41 -0800 (PST) Received: by 10.114.21.197 with HTTP; Wed, 19 Dec 2012 10:58:41 -0800 (PST) In-Reply-To: <20121219135451.GU71906@kib.kiev.ua> References: <20121219135451.GU71906@kib.kiev.ua> Date: Wed, 19 Dec 2012 12:58:41 -0600 Message-ID: Subject: Re: 
Unmapped I/O From: Alan Cox To: Konstantin Belousov Content-Type: text/plain; charset=ISO-8859-1 X-Content-Filtered-By: Mailman/MimeDel 2.1.14 Cc: arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list Reply-To: alc@freebsd.org List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 18:58:44 -0000 On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov wrote: > One of the known FreeBSD I/O path performance bottleneck is the > necessity to map each I/O buffer pages into KVA. The problem is that > on the multi-core machines, the mapping must flush TLB on all cores, > due to the global mapping of the buffer pages into the kernel. This > means that buffer creation and destruction disrupts execution of all > other cores to perform TLB shootdown through IPI, and the thread > initiating the shootdown must wait for all other cores to execute and > report. > > The patch at > http://people.freebsd.org/~kib/misc/unmapped.4.patch > implements the 'unmapped buffers'. It means an ability to create the > VMIO struct buf, which does not point to the KVA mapping the buffer > pages to the kernel addresses. Since there is no mapping, kernel does > not need to clear TLB. The unmapped buffers are marked with the new > B_NOTMAPPED flag, and should be requested explicitly using the > GB_NOTMAPPED flag to the buffer allocation routines. If the mapped > buffer is requested but unmapped buffer already exists, the buffer > subsystem automatically maps the pages. > > The clustering code is also made aware of the not-mapped buffers, but > this required the KPI change that accounts for the diff in the non-UFS > filesystems. > > UFS is adapted to request not mapped buffers when kernel does not need > to access the content, i.e. mostly for the file data. New helper > function vn_io_fault_pgmove() operates on the unmapped array of pages. 
> It calls new pmap method pmap_copy_pages() to do the data move to and > from usermode. > > Besides not mapped buffers, not mapped BIOs are introduced, marked > with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated > to unmapped BIOs. Geom providers may indicate an acceptance of the > unmapped BIOs. If provider does not handle unmapped i/o requests, > geom now automatically establishes transient mapping for the i/o > pages. > > Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The > gpart providers indicate the unmapped BIOs support if the underlying > provider can do unmapped i/o. I also hacked ahci(4) to handle > unmapped i/o, but this should be changed after the Jeff' physbio patch > is committed, to use proper busdma interface. > > Besides, the swap pager does unmapped swapping if the swap partition > indicated that it can do unmapped i/o. By Jeff request, a buffer > allocation code may reserve the KVA for unmapped buffer in advance. > The unmapped page-in for the vnode pager is also implemented if > filesystem supports it, but the page out is not. The page-out, as well > as the vnode-backed md(4), currently require mappings, mostly due to > the use of VOP_WRITE(). > > As such, the patch worked in my test environment, where I used > ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no > statistically significant difference in the buildworld -j 10 times on > the 4-core machine with HT. On the other hand, when doing sha1 over > the 5GB file, the system time was reduced by 30%. > > Unfinished items: > - Integration with the physbio, will be done after physbio is > committed to HEAD. > - The key per-architecture function needed for the unmapped i/o is the > pmap_copy_pages(). I implemented it for amd64 and i386 right now, it > shall be done for all other architectures. > - The sizing of the submap used for transient mapping of the BIOs is > naive. Should be adjusted, esp. for KVA-lean architectures. 
> - Conversion of the other filesystems. Low priority. > > I am interested in reviews, tests and suggestions. Note that this > only works now for md(4) and ahci(4), for other drivers the patched > kernel should fall back to the mapped i/o. > > Here are a couple things for you to think about: 1. A while back, I developed the patch at http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in trying to reduce the number of TLB shootdowns by the buffer map. The idea is simple: Replace the calls to pmap_q{enter,remove}() with calls to a new machine-dependent function that opportunistically sets the buffer's kernel virtual address to the direct map for physically contiguous pages. However, if the pages are not physically contiguous, it calls pmap_qenter() with the kernel virtual address from the buffer map. This eliminated about half of the TLB shootdowns for a buildworld, because there is a decent amount of physical contiguity that occurs by "accident". Using a buddy allocator for physical page allocation tends to promote this contiguity. However, in a few places, it occurs by explicit action, e.g., mapped files, including large executables, using superpage reservations. So, how does this fit with what you've done? You might think of using what I describe above as a kind of "fast path". As you can see from the patch, it's very simple and non-intrusive. If the pages aren't physically contiguous, then instead of using pmap_qenter(), you fall back to whatever approach for creating ephemeral mappings is appropriate to a given architecture. 2. As for managing the ephemeral mappings on machines that don't support a direct map. I would suggest an approach that is loosely inspired by copying garbage collection (or the segment cleaners in log-structured file systems). Roughly, you manage the buffer map as a few spaces (or segments). When you create a new mapping in one of these spaces (or segments), you simply install the PTEs. 
When you decide to "garbage collect" a space (or spaces), then you perform a global TLB flush. Specifically, you do something like toggling the bit in the cr4 register that enables/disables support for the PG_G bit. If the spaces are sufficiently large, then the number of such global TLB flushes should be quite low. Every space would have an epoch number (or flush number). In the buffer, you would record the epoch number alongside the kernel virtual address. On access to the buffer, if the epoch number was too old, then you have to recreate the buffer's mapping in a new space. Alan From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 19:28:40 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A82E9B4D for ; Wed, 19 Dec 2012 19:28:40 +0000 (UTC) (envelope-from jroberson@jroberson.net) Received: from mail-pa0-f44.google.com (mail-pa0-f44.google.com [209.85.220.44]) by mx1.freebsd.org (Postfix) with ESMTP id 6F8C58FC0C for ; Wed, 19 Dec 2012 19:28:40 +0000 (UTC) Received: by mail-pa0-f44.google.com with SMTP id hz11so1528829pad.31 for ; Wed, 19 Dec 2012 11:28:40 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:date:from:x-x-sender:to:cc:subject:in-reply-to :message-id:references:user-agent:mime-version:content-type :x-gm-message-state; bh=dwzZBcGz0Cw+mIO/PpWA0x3j0liHZOuiiSvGrJPf+rw=; b=Evt9/DuUvEgBk2Pi5LArcWPvkRJIj2NdmpVDPyxEgrjhu6SkIhaAJ8P56DuAkbaaZq dk340WPscaSPp8TwSiGyoc5vivu0gV9pq7deZqbfcGyp57eHJlhjy2/QaHIts1gjikuX QouCoEXLEcr8dIn4zDWXDu1hJ6OInvIGCzSb2x67iMivTYj6uq5rlpavaEl/VDdQjW/1 Idg+xElt7e3Wyc0WC9B41c7PePkyB34/aDiJCXsio74F4wvO7PSXvQuQ2LAUCrWkgVks DdRzcrFFAaEgNHHPDRwV1RC2VtmlyfDdF2dV3F/xmNgUuKymehi2bsx6SA5cMI0rX1Zh YZzQ== X-Received: by 10.68.235.71 with SMTP id uk7mr21912720pbc.10.1355945319949; Wed, 19 Dec 2012 11:28:39 -0800 (PST) Received: from rrcs-66-91-135-210.west.biz.rr.com 
(rrcs-66-91-135-210.west.biz.rr.com. [66.91.135.210]) by mx.google.com with ESMTPS id kl5sm3561500pbc.74.2012.12.19.11.28.38 (version=SSLv3 cipher=OTHER); Wed, 19 Dec 2012 11:28:39 -0800 (PST) Date: Wed, 19 Dec 2012 09:28:46 -1000 (HST) From: Jeff Roberson X-X-Sender: jroberson@desktop To: alc@freebsd.org Subject: Re: Unmapped I/O In-Reply-To: Message-ID: References: <20121219135451.GU71906@kib.kiev.ua> User-Agent: Alpine 2.00 (BSF 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Gm-Message-State: ALoCoQlhQ0Do9pyXm38NQLUchdzBN7fOjUXIjcMHerjIA5srLK50TZi124CP2c7fRUeDz8CJx5BA Cc: Konstantin Belousov , arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 19:28:40 -0000 On Wed, 19 Dec 2012, Alan Cox wrote: > On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov wrote: > >> One of the known FreeBSD I/O path performance bottleneck is the >> necessity to map each I/O buffer pages into KVA. The problem is that >> on the multi-core machines, the mapping must flush TLB on all cores, >> due to the global mapping of the buffer pages into the kernel. This >> means that buffer creation and destruction disrupts execution of all >> other cores to perform TLB shootdown through IPI, and the thread >> initiating the shootdown must wait for all other cores to execute and >> report. >> >> The patch at >> http://people.freebsd.org/~kib/misc/unmapped.4.patch >> implements the 'unmapped buffers'. It means an ability to create the >> VMIO struct buf, which does not point to the KVA mapping the buffer >> pages to the kernel addresses. Since there is no mapping, kernel does >> not need to clear TLB. 
The unmapped buffers are marked with the new >> B_NOTMAPPED flag, and should be requested explicitly using the >> GB_NOTMAPPED flag to the buffer allocation routines. If the mapped >> buffer is requested but unmapped buffer already exists, the buffer >> subsystem automatically maps the pages. >> >> The clustering code is also made aware of the not-mapped buffers, but >> this required the KPI change that accounts for the diff in the non-UFS >> filesystems. >> >> UFS is adapted to request not mapped buffers when kernel does not need >> to access the content, i.e. mostly for the file data. New helper >> function vn_io_fault_pgmove() operates on the unmapped array of pages. >> It calls new pmap method pmap_copy_pages() to do the data move to and >> from usermode. >> >> Besides not mapped buffers, not mapped BIOs are introduced, marked >> with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated >> to unmapped BIOs. Geom providers may indicate an acceptance of the >> unmapped BIOs. If provider does not handle unmapped i/o requests, >> geom now automatically establishes transient mapping for the i/o >> pages. >> >> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The >> gpart providers indicate the unmapped BIOs support if the underlying >> provider can do unmapped i/o. I also hacked ahci(4) to handle >> unmapped i/o, but this should be changed after the Jeff's physbio patch >> is committed, to use proper busdma interface. >> >> Besides, the swap pager does unmapped swapping if the swap partition >> indicated that it can do unmapped i/o. By Jeff request, a buffer >> allocation code may reserve the KVA for unmapped buffer in advance. >> The unmapped page-in for the vnode pager is also implemented if >> filesystem supports it, but the page out is not. The page-out, as well >> as the vnode-backed md(4), currently require mappings, mostly due to >> the use of VOP_WRITE(). 
>> >> As such, the patch worked in my test environment, where I used >> ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no >> statistically significant difference in the buildworld -j 10 times on >> the 4-core machine with HT. On the other hand, when doing sha1 over >> the 5GB file, the system time was reduced by 30%. >> >> Unfinished items: >> - Integration with the physbio, will be done after physbio is >> committed to HEAD. >> - The key per-architecture function needed for the unmapped i/o is the >> pmap_copy_pages(). I implemented it for amd64 and i386 right now, it >> shall be done for all other architectures. >> - The sizing of the submap used for transient mapping of the BIOs is >> naive. Should be adjusted, esp. for KVA-lean architectures. >> - Conversion of the other filesystems. Low priority. >> >> I am interested in reviews, tests and suggestions. Note that this >> only works now for md(4) and ahci(4), for other drivers the patched >> kernel should fall back to the mapped i/o. >> >> > Here are a couple things for you to think about: > > 1. A while back, I developed the patch at > http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in trying to > reduce the number of TLB shootdowns by the buffer map. The idea is simple: > Replace the calls to pmap_q{enter,remove}() with calls to a new > machine-dependent function that opportunistically sets the buffer's kernel > virtual address to the direct map for physically contiguous pages. > However, if the pages are not physically contiguous, it calls pmap_qenter() > with the kernel virtual address from the buffer map. > > This eliminated about half of the TLB shootdowns for a buildworld, because > there is a decent amount of physical contiguity that occurs by "accident". > Using a buddy allocator for physical page allocation tends to promote this > contiguity. 
However, in a few places, it occurs by explicit action, e.g., > mapped files, including large executables, using superpage reservations. > > So, how does this fit with what you've done? You might think of using what > I describe above as a kind of "fast path". As you can see from the patch, > it's very simple and non-intrusive. If the pages aren't physically > contiguous, then instead of using pmap_qenter(), you fall back to whatever > approach for creating ephemeral mappings is appropriate to a given > architecture. I think these are complementary. Kib's patch gives us the fastest possible path for user data. Alan's patch will improve the metadata performance for things that really require the buffer cache. I see no reason not to clean up and commit both. > > 2. As for managing the ephemeral mappings on machines that don't support a > direct map. I would suggest an approach that is loosely inspired by > copying garbage collection (or the segment cleaners in log-structured file > systems). Roughly, you manage the buffer map as a few spaces (or > segments). When you create a new mapping in one of these spaces (or > segments), you simply install the PTEs. When you decide to "garbage > collect" a space (or spaces), then you perform a global TLB flush. > Specifically, you do something like toggling the bit in the cr4 register > that enables/disables support for the PG_G bit. If the spaces are > sufficiently large, then the number of such global TLB flushes should be > quite low. Every space would have an epoch number (or flush number). In > the buffer, you would record the epoch number alongside the kernel virtual > address. On access to the buffer, if the epoch number was too old, then > you have to recreate the buffer's mapping in a new space. Are the machines that don't have a direct map performance-critical? My expectation is that they are legacy or embedded. This seems like a great project to do when the rest of the pieces are stable and fast.
Until then they could just use something like pbufs? Jeff > > Alan > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 19:28:49 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id C15B4BF7; Wed, 19 Dec 2012 19:28:49 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 2C36B8FC14; Wed, 19 Dec 2012 19:28:48 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.5/8.14.5) with ESMTP id qBJJSces060051; Wed, 19 Dec 2012 21:28:38 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.3 kib.kiev.ua qBJJSces060051 Received: (from kostik@localhost) by tom.home (8.14.5/8.14.5/Submit) id qBJJSc9E060050; Wed, 19 Dec 2012 21:28:38 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 19 Dec 2012 21:28:38 +0200 From: Konstantin Belousov To: alc@freebsd.org Subject: Re: Unmapped I/O Message-ID: <20121219192838.GZ71906@kib.kiev.ua> References: <20121219135451.GU71906@kib.kiev.ua> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="5RB7GLe/slk02+tJ" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: 
Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 19:28:49 -0000 --5RB7GLe/slk02+tJ Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Dec 19, 2012 at 12:58:41PM -0600, Alan Cox wrote: > On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov wrote: > > > One of the known FreeBSD I/O path performance bootleneck is the > > neccessity to map each I/O buffer pages into KVA. The problem is that > > on the multi-core machines, the mapping must flush TLB on all cores, > > due to the global mapping of the buffer pages into the kernel. This > > means that buffer creation and destruction disrupts execution of all > > other cores to perform TLB shootdown through IPI, and the thread > > initiating the shootdown must wait for all other cores to execute and > > report. > > > > The patch at > > http://people.freebsd.org/~kib/misc/unmapped.4.patch > > implements the 'unmapped buffers'. It means an ability to create the > > VMIO struct buf, which does not point to the KVA mapping the buffer > > pages to the kernel addresses. Since there is no mapping, kernel does > > not need to clear TLB. The unmapped buffers are marked with the new > > B_NOTMAPPED flag, and should be requested explicitely using the > > GB_NOTMAPPED flag to the buffer allocation routines. If the mapped > > buffer is requested but unmapped buffer already exists, the buffer > > subsystem automatically maps the pages. > > > > The clustering code is also made aware of the not-mapped buffers, but > > this required the KPI change that accounts for the diff in the non-UFS > > filesystems. > > > > UFS is adopted to request not mapped buffers when kernel does not need > > to access the content, i.e. mostly for the file data. New helper > > function vn_io_fault_pgmove() operates on the unmapped array of pages.
> > It calls new pmap method pmap_copy_pages() to do the data move to and > > from usermode. > > > > Besides not mapped buffers, not mapped BIOs are introduced, marked > > with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated > > to unmapped BIOs. Geom providers may indicate an acceptance of the > > unmapped BIOs. If provider does not handle unmapped i/o requests, > > geom now automatically establishes transient mapping for the i/o > > pages. > > > > Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The > > gpart providers indicate the unmapped BIOs support if the underlying > > provider can do unmapped i/o. I also hacked ahci(4) to handle > > unmapped i/o, but this should be changed after the Jeff' physbio patch > > is committed, to use proper busdma interface. > > > > Besides, the swap pager does unmapped swapping if the swap partition > > indicated that it can do unmapped i/o. By Jeff request, a buffer > > allocation code may reserve the KVA for unmapped buffer in advance. > > The unmapped page-in for the vnode pager is also implemented if > > filesystem supports it, but the page out is not. The page-out, as well > > as the vnode-backed md(4), currently require mappings, mostly due to > > the use of VOP_WRITE(). > > > > As such, the patch worked in my test environment, where I used > > ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no > > statistically significant difference in the buildworld -j 10 times on > > the 4-core machine with HT. On the other hand, when doing sha1 over > > the 5GB file, the system time was reduced by 30%. > > > > Unfinished items: > > - Integration with the physbio, will be done after physbio is > > committed to HEAD. > > - The key per-architecture function needed for the unmapped i/o is the > > pmap_copy_pages(). I implemented it for amd64 and i386 right now, it > > shall be done for all other architectures. 
> > - The sizing of the submap used for transient mapping of the BIOs is > > naive. Should be adjusted, esp. for KVA-lean architectures. > > - Conversion of the other filesystems. Low priority. > > > > I am interested in reviews, tests and suggestions. Note that this > > only works now for md(4) and ahci(4), for other drivers the patched > > kernel should fall back to the mapped i/o. > > > > > Here are a couple things for you to think about: > > 1. A while back, I developed the patch at > http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in trying to > reduce the number of TLB shootdowns by the buffer map. The idea is simple: > Replace the calls to pmap_q{enter,remove}() with calls to a new > machine-dependent function that opportunistically sets the buffer's kernel > virtual address to the direct map for physically contiguous pages. > However, if the pages are not physically contiguous, it calls pmap_qenter() > with the kernel virtual address from the buffer map. > > This eliminated about half of the TLB shootdowns for a buildworld, because > there is a decent amount of physical contiguity that occurs by "accident". > Using a buddy allocator for physical page allocation tends to promote this > contiguity. However, in a few places, it occurs by explicit action, e.g., > mapped files, including large executables, using superpage reservations. > > So, how does this fit with what you've done? You might think of using what > I describe above as a kind of "fast path". As you can see from the patch, > it's very simple and non-intrusive. If the pages aren't physically > contiguous, then instead of using pmap_qenter(), you fall back to whatever > approach for creating ephemeral mappings is appropriate to a given > architecture. I remember this. I did not measure the change in the number of IPIs issued during the buildworld, but I do account for the mapped/unmapped buffer space in the patch.
For the buildworld load, mapped buffers make up 5-10% of all buffers, which coincides with the intuitive size of the metadata for the sources. Since unmapped buffers eliminate IPIs at creation and reuse, I can safely guess that the IPI reduction is of a comparable magnitude. The pmap_map_buf() patch is orthogonal to the work I did, and it should nicely reduce the overhead of handling the metadata buffers. I can finish it, if you want. I do not think that it should be added to the already large patch; instead it could be done and committed separately. > > 2. As for managing the ephemeral mappings on machines that don't support a > direct map. I would suggest an approach that is loosely inspired by > copying garbage collection (or the segment cleaners in log-structured file > systems). Roughly, you manage the buffer map as a few spaces (or > segments). When you create a new mapping in one of these spaces (or > segments), you simply install the PTEs. When you decide to "garbage > collect" a space (or spaces), then you perform a global TLB flush. > Specifically, you do something like toggling the bit in the cr4 register > that enables/disables support for the PG_G bit. If the spaces are > sufficiently large, then the number of such global TLB flushes should be > quite low. Every space would have an epoch number (or flush number). In > the buffer, you would record the epoch number alongside the kernel virtual > address. On access to the buffer, if the epoch number was too old, then > you have to recreate the buffer's mapping in a new space. Could you please describe the idea in more detail? For which mappings should the described mechanism be used? Do you mean the pmap_copy_pages() implementation, or the fallback mappings for BIOs? Note that the pmap_copy_pages() implementation on i386 is shamelessly stolen from pmap_copy_page() and uses the per-cpu ephemeral mapping for copying.
For BIOs, this might be used, but I am also quite satisfied with submap and pmap_qenter(). --5RB7GLe/slk02+tJ Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iQIcBAEBAgAGBQJQ0hVlAAoJEJDCuSvBvK1BAjcP/jdbIaqgddYRu0hD9iVJJTnt dlXYUROtXcfFjTGRwt/l0ZEOadE7nTB/Nbtrwcb6+juYzq95yybUFk2yWjmXb0XX fI8lb/i1SCx3y0w9FA08Nd0cBjxbxGKm7qJmnLuROraYGItPAMwh9sVPkHndBGnD K61/mU+icda2gSAvZ8G+vlrtC3i/1+aC2fGLhyLohM3rwwowEHKMDQxq6gshR4XU CWNJVa1t4RgnE5ZfwbPvsZO5HJvXOVNZ2IhlCTvepwzOVevP6zL/CYY4hxOFTjpN lVgG/qZrK780Ye7qEtjHy8H2eYNBgQbn5rjGdaGUAoXUBI1MRm80namBne+OZqQw 1PCPzRDd+a//QKcPkOPaW6UJOPCg9s4V/tpWFqYUjLRG9TcoU5keGV9TZATqj0k3 D3KGcNBMnoShCsVbOEbE/5JozgDPogmvEI0SV3/ei2zeeJzd2HlyE/bmKI5NANPJ 2WPE1Wnhehu0AJNlFB03WhZ1CqOLt48DUmtrgBRSS/841G6r/F5BbxIOZmneFGNw 5Si40F5l4E+mbKbCmxVDCB4KTdyntLa5uB+8dxAvr/q0H1lSTJJOrjzND+BdrepW G3OtwQ5kVQSwp6iWHvEK4j9b53kTaq3Zl5jlRXJi2nD5aYuYHhQMoSPOykU7i2yB 31NfKZoOyS34FwrrlG/Z =xmx1 -----END PGP SIGNATURE----- --5RB7GLe/slk02+tJ-- From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 21:02:23 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9B65CE7F; Wed, 19 Dec 2012 21:02:23 +0000 (UTC) (envelope-from freebsd@damnhippie.dyndns.org) Received: from duck.symmetricom.us (duck.symmetricom.us [206.168.13.214]) by mx1.freebsd.org (Postfix) with ESMTP id D6B568FC12; Wed, 19 Dec 2012 21:02:22 +0000 (UTC) Received: from damnhippie.dyndns.org (daffy.symmetricom.us [206.168.13.218]) by duck.symmetricom.us (8.14.5/8.14.5) with ESMTP id qBJL2Lx8045035; Wed, 19 Dec 2012 14:02:21 -0700 (MST) (envelope-from freebsd@damnhippie.dyndns.org) Received: from [172.22.42.240] (revolution.hippie.lan [172.22.42.240]) by damnhippie.dyndns.org (8.14.3/8.14.3) with ESMTP id qBJL2IfD065631; Wed, 19 Dec 2012 14:02:18 -0700 (MST) (envelope-from freebsd@damnhippie.dyndns.org) Subject: Re: Unmapped I/O From: Ian Lepore To: Konstantin 
Belousov In-Reply-To: <20121219183600.GX71906@kib.kiev.ua> References: <20121219135451.GU71906@kib.kiev.ua> <50D1D2BD.80107@freebsd.org> <50D1ECC5.2070209@freebsd.org> <17252.1355935960@critter.freebsd.dk> <20121219172320.GW71906@kib.kiev.ua> <17479.1355941463@critter.freebsd.dk> <20121219183600.GX71906@kib.kiev.ua> Content-Type: text/plain; charset="us-ascii" Date: Wed, 19 Dec 2012 14:02:18 -0700 Message-ID: <1355950938.1198.227.camel@revolution.hippie.lan> Mime-Version: 1.0 X-Mailer: Evolution 2.32.1 FreeBSD GNOME Team Port Content-Transfer-Encoding: 7bit Cc: Poul-Henning Kamp , mjacob@freebsd.org, freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 21:02:23 -0000 On Wed, 2012-12-19 at 20:36 +0200, Konstantin Belousov wrote: > On Wed, Dec 19, 2012 at 06:24:23PM +0000, Poul-Henning Kamp wrote: > > -------- > > In message <20121219172320.GW71906@kib.kiev.ua>, Konstantin Belousov writes: > > > > >Still, the i386 cannot have much benefit from the unmapped buffers, > > >just because thre is no facilities similar to the direct map for amd64. > > >i386 must use transient mapping even for unmapped buffers to copy > > >the data to the usermode. > > > > Wrong, a Adaptec 1542 could DMA directly into or out of any spot > > of memory and that could have been mapped in userland but not in > > kernel. > And how this can be used while keeping on-disk data coherent with the > buffer ? It can by used by physio, but not for the normal file i/o, which > caches the file data in the vnode pages or buffers for non-unified cache. > The transient mapping is needed to copy between kernel buffer and usermode > address on i386. > > > > > >Also, as I understand the history, VMIO buffers, or unified page/buffer > > >cache, only appeared in the FreeBSD. 
> > > > Correct, but truth to be told, they have probably delayed our > > implementation of unmapped buffers by about 10 years... > Mapped bufers only become an issue on really multi-core machines. > Before large SMP become ubiquitous, additional complexity of the > transient mappings definitely not worth it. On VIVT cache architectures we have to disable caching on all mappings of a page if there are multiple mappings and any are writable. This causes an executable to run with the i-cache disabled if its pages are in the buffer cache, because right now the buffers are mapped with persistent writable mappings. So if I understand the conversation so far, these changes are going to fix that problem by only using ephemeral mappings when needed, right? If so, that's very good news. -- Ian From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 21:16:30 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E0B8092B; Wed, 19 Dec 2012 21:16:30 +0000 (UTC) (envelope-from alc@rice.edu) Received: from mh11.mail.rice.edu (mh11.mail.rice.edu [128.42.199.30]) by mx1.freebsd.org (Postfix) with ESMTP id ACADF8FC0A; Wed, 19 Dec 2012 21:16:30 +0000 (UTC) Received: from mh11.mail.rice.edu (localhost.localdomain [127.0.0.1]) by mh11.mail.rice.edu (Postfix) with ESMTP id EBE104C03E9; Wed, 19 Dec 2012 15:16:23 -0600 (CST) Received: from mh11.mail.rice.edu (localhost.localdomain [127.0.0.1]) by mh11.mail.rice.edu (Postfix) with ESMTP id EA2744C03B6; Wed, 19 Dec 2012 15:16:23 -0600 (CST) X-Virus-Scanned: by amavis-2.7.0 at mh11.mail.rice.edu, auth channel Received: from mh11.mail.rice.edu ([127.0.0.1]) by mh11.mail.rice.edu (mh11.mail.rice.edu [127.0.0.1]) (amavis, port 10026) with ESMTP id Urs2681PyzOC; Wed, 19 Dec 2012 15:16:23 -0600 (CST) Received: from adsl-216-63-78-18.dsl.hstntx.swbell.net (adsl-216-63-78-18.dsl.hstntx.swbell.net [216.63.78.18]) (using TLSv1 with cipher RC4-MD5 (128/128
bits)) (No client certificate requested) (Authenticated sender: alc) by mh11.mail.rice.edu (Postfix) with ESMTPSA id 57EF74C02A4; Wed, 19 Dec 2012 15:16:23 -0600 (CST) Message-ID: <50D22EA6.1040501@rice.edu> Date: Wed, 19 Dec 2012 15:16:22 -0600 From: Alan Cox User-Agent: Mozilla/5.0 (X11; FreeBSD i386; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Jeff Roberson Subject: Re: Unmapped I/O References: <20121219135451.GU71906@kib.kiev.ua> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: alc@freebsd.org, Konstantin Belousov , arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 21:16:31 -0000 On 12/19/2012 13:28, Jeff Roberson wrote: > On Wed, 19 Dec 2012, Alan Cox wrote: > >> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov >> wrote: >> >>> One of the known FreeBSD I/O path performance bootleneck is the >>> neccessity to map each I/O buffer pages into KVA. The problem is that >>> on the multi-core machines, the mapping must flush TLB on all cores, >>> due to the global mapping of the buffer pages into the kernel. This >>> means that buffer creation and destruction disrupts execution of all >>> other cores to perform TLB shootdown through IPI, and the thread >>> initiating the shootdown must wait for all other cores to execute and >>> report. >>> >>> The patch at >>> http://people.freebsd.org/~kib/misc/unmapped.4.patch >>> implements the 'unmapped buffers'. It means an ability to create the >>> VMIO struct buf, which does not point to the KVA mapping the buffer >>> pages to the kernel addresses. Since there is no mapping, kernel does >>> not need to clear TLB. 
The unmapped buffers are marked with the new >>> B_NOTMAPPED flag, and should be requested explicitely using the >>> GB_NOTMAPPED flag to the buffer allocation routines. If the mapped >>> buffer is requested but unmapped buffer already exists, the buffer >>> subsystem automatically maps the pages. >>> >>> The clustering code is also made aware of the not-mapped buffers, but >>> this required the KPI change that accounts for the diff in the non-UFS >>> filesystems. >>> >>> UFS is adopted to request not mapped buffers when kernel does not need >>> to access the content, i.e. mostly for the file data. New helper >>> function vn_io_fault_pgmove() operates on the unmapped array of pages. >>> It calls new pmap method pmap_copy_pages() to do the data move to and >>> from usermode. >>> >>> Besides not mapped buffers, not mapped BIOs are introduced, marked >>> with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated >>> to unmapped BIOs. Geom providers may indicate an acceptance of the >>> unmapped BIOs. If provider does not handle unmapped i/o requests, >>> geom now automatically establishes transient mapping for the i/o >>> pages. >>> >>> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The >>> gpart providers indicate the unmapped BIOs support if the underlying >>> provider can do unmapped i/o. I also hacked ahci(4) to handle >>> unmapped i/o, but this should be changed after the Jeff' physbio patch >>> is committed, to use proper busdma interface. >>> >>> Besides, the swap pager does unmapped swapping if the swap partition >>> indicated that it can do unmapped i/o. By Jeff request, a buffer >>> allocation code may reserve the KVA for unmapped buffer in advance. >>> The unmapped page-in for the vnode pager is also implemented if >>> filesystem supports it, but the page out is not. The page-out, as well >>> as the vnode-backed md(4), currently require mappings, mostly due to >>> the use of VOP_WRITE(). 
>>> >>> As such, the patch worked in my test environment, where I used >>> ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no >>> statistically significant difference in the buildworld -j 10 times on >>> the 4-core machine with HT. On the other hand, when doing sha1 over >>> the 5GB file, the system time was reduced by 30%. >>> >>> Unfinished items: >>> - Integration with the physbio, will be done after physbio is >>> committed to HEAD. >>> - The key per-architecture function needed for the unmapped i/o is the >>> pmap_copy_pages(). I implemented it for amd64 and i386 right now, it >>> shall be done for all other architectures. >>> - The sizing of the submap used for transient mapping of the BIOs is >>> naive. Should be adjusted, esp. for KVA-lean architectures. >>> - Conversion of the other filesystems. Low priority. >>> >>> I am interested in reviews, tests and suggestions. Note that this >>> only works now for md(4) and ahci(4), for other drivers the patched >>> kernel should fall back to the mapped i/o. >>> >>> >> Here are a couple things for you to think about: >> >> 1. A while back, I developed the patch at >> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in >> trying to >> reduce the number of TLB shootdowns by the buffer map. The idea is >> simple: >> Replace the calls to pmap_q{enter,remove}() with calls to a new >> machine-dependent function that opportunistically sets the buffer's >> kernel >> virtual address to the direct map for physically contiguous pages. >> However, if the pages are not physically contiguous, it calls >> pmap_qenter() >> with the kernel virtual address from the buffer map. >> >> This eliminated about half of the TLB shootdowns for a buildworld, >> because >> there is a decent amount of physical contiguity that occurs by >> "accident". >> Using a buddy allocator for physical page allocation tends to promote >> this >> contiguity. 
However, in a few places, it occurs by explicit action, >> e.g., >> mapped files, including large executables, using superpage reservations. >> >> So, how does this fit with what you've done? You might think of >> using what >> I describe above as a kind of "fast path". As you can see from the >> patch, >> it's very simple and non-intrusive. If the pages aren't physically >> contiguous, then instead of using pmap_qenter(), you fall back to >> whatever >> approach for creating ephemeral mappings is appropriate to a given >> architecture. > > I think these are complimentary. Kib's patch gives us the fastest > possible path for user data. Alan's patch will improve the metadata > performance for things that really require the buffer cache. I see no > reason not to clean up and commit both. > >> >> 2. As for managing the ephemeral mappings on machines that don't >> support a >> direct map. I would suggest an approach that is loosely inspired by >> copying garbage collection (or the segment cleaners in log-structured >> file >> systems). Roughly, you manage the buffer map as a few spaces (or >> segments). When you create a new mapping in one of these spaces (or >> segments), you simply install the PTEs. When you decide to "garbage >> collect" a space (or spaces), then you perform a global TLB flush. >> Specifically, you do something like toggling the bit in the cr4 register >> that enables/disables support for the PG_G bit. If the spaces are >> sufficiently large, then the number of such global TLB flushes should be >> quite low. Every space would have an epoch number (or flush >> number). In >> the buffer, you would record the epoch number alongside the kernel >> virtual >> address. On access to the buffer, if the epoch number was too old, then >> you have to recreate the buffer's mapping in a new space. > > Are the machines that don't have a direct map performance critical? > My expectation is that they are legacy or embedded. 
This seems like a > great project to do when the rest of the pieces are stable and fast. > Until then they could just use something like pbufs? > I think the answer to your first question depends entirely on who you are. :-) Also, at the low end of the server space, there are many people trying to promote ARM-based systems. While FreeBSD may never run on your ARM-based phone, I think that ceding the ARM-based server market to others will be a strategic mistake. Alan P.S. I think we're moving the discussion too far away from kib's original, so I suggest changing the subject line on any follow-ups. From owner-freebsd-arch@FreeBSD.ORG Wed Dec 19 23:17:37 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id BE30A772; Wed, 19 Dec 2012 23:17:37 +0000 (UTC) (envelope-from adrian.chadd@gmail.com) Received: from mail-we0-f172.google.com (mail-we0-f172.google.com [74.125.82.172]) by mx1.freebsd.org (Postfix) with ESMTP id 18A8C8FC0A; Wed, 19 Dec 2012 23:17:36 +0000 (UTC) Received: by mail-we0-f172.google.com with SMTP id r3so1267159wey.17 for ; Wed, 19 Dec 2012 15:17:35 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=VCoBOgepYIgPaA0Be+jAYPXhIbelvMJt00E4iBNcW6Y=; b=kxaT/1qDHHKH8BvvQQagL3LTkTbEAzLC2/ocD0K/5r9jDPzWniLhcPG90UaLuXKEd9 q9IaChhtiNBjAgKq4obGmDEB+BQsvJHJ+XAphsboDcBDwGuFHGV05QhakN1ReOV3I6RD ogVRHGXcfEXmv2q8rgBPfjqMNyh7J/K8dF2RArAFIdlu3rnNpKZt9SeQFEzxBf+45LT5 fmVSiLL4KQPaWGlmNE2zvlrEgweeDgqlRocO6UNPKV83IIovXqsdUskG+jPipAJARN/T uzc1MlF6rSn1vuDVKhT/7bqZ6UuEgrLg6jF5LQsy0D3sJPkANr5uQzOA5c6x7lOq2Etp ozfA== MIME-Version: 1.0 Received: by 10.194.93.40 with SMTP id cr8mr14455368wjb.16.1355959055670; Wed, 19 Dec 2012 15:17:35 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.217.57.9 with HTTP; Wed, 19 Dec
2012 15:17:35 -0800 (PST) In-Reply-To: <50D22EA6.1040501@rice.edu> References: <20121219135451.GU71906@kib.kiev.ua> <50D22EA6.1040501@rice.edu> Date: Wed, 19 Dec 2012 15:17:35 -0800 X-Google-Sender-Auth: 3qIopfcW2Mi5NBNl1fg7OkpLduI Message-ID: Subject: Re: Unmapped I/O From: Adrian Chadd To: Alan Cox Content-Type: text/plain; charset=ISO-8859-1 Cc: alc@freebsd.org, Konstantin Belousov , arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Dec 2012 23:17:37 -0000 ... some of us are trying to get FreeBSD ready to run on your ARM phone. Please don't break that. end-goal. :-) Adrian On 19 December 2012 13:16, Alan Cox wrote: > On 12/19/2012 13:28, Jeff Roberson wrote: >> On Wed, 19 Dec 2012, Alan Cox wrote: >> >>> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov >>> wrote: >>> >>>> One of the known FreeBSD I/O path performance bootleneck is the >>>> neccessity to map each I/O buffer pages into KVA. The problem is that >>>> on the multi-core machines, the mapping must flush TLB on all cores, >>>> due to the global mapping of the buffer pages into the kernel. This >>>> means that buffer creation and destruction disrupts execution of all >>>> other cores to perform TLB shootdown through IPI, and the thread >>>> initiating the shootdown must wait for all other cores to execute and >>>> report. >>>> >>>> The patch at >>>> http://people.freebsd.org/~kib/misc/unmapped.4.patch >>>> implements the 'unmapped buffers'. It means an ability to create the >>>> VMIO struct buf, which does not point to the KVA mapping the buffer >>>> pages to the kernel addresses. Since there is no mapping, kernel does >>>> not need to clear TLB. 
The unmapped buffers are marked with the new >>>> B_NOTMAPPED flag, and should be requested explicitely using the >>>> GB_NOTMAPPED flag to the buffer allocation routines. If the mapped >>>> buffer is requested but unmapped buffer already exists, the buffer >>>> subsystem automatically maps the pages. >>>> >>>> The clustering code is also made aware of the not-mapped buffers, but >>>> this required the KPI change that accounts for the diff in the non-UFS >>>> filesystems. >>>> >>>> UFS is adopted to request not mapped buffers when kernel does not need >>>> to access the content, i.e. mostly for the file data. New helper >>>> function vn_io_fault_pgmove() operates on the unmapped array of pages. >>>> It calls new pmap method pmap_copy_pages() to do the data move to and >>>> from usermode. >>>> >>>> Besides not mapped buffers, not mapped BIOs are introduced, marked >>>> with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated >>>> to unmapped BIOs. Geom providers may indicate an acceptance of the >>>> unmapped BIOs. If provider does not handle unmapped i/o requests, >>>> geom now automatically establishes transient mapping for the i/o >>>> pages. >>>> >>>> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The >>>> gpart providers indicate the unmapped BIOs support if the underlying >>>> provider can do unmapped i/o. I also hacked ahci(4) to handle >>>> unmapped i/o, but this should be changed after the Jeff' physbio patch >>>> is committed, to use proper busdma interface. >>>> >>>> Besides, the swap pager does unmapped swapping if the swap partition >>>> indicated that it can do unmapped i/o. By Jeff request, a buffer >>>> allocation code may reserve the KVA for unmapped buffer in advance. >>>> The unmapped page-in for the vnode pager is also implemented if >>>> filesystem supports it, but the page out is not. The page-out, as well >>>> as the vnode-backed md(4), currently require mappings, mostly due to >>>> the use of VOP_WRITE(). 
>>>> >>>> As such, the patch worked in my test environment, where I used >>>> ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no >>>> statistically significant difference in the buildworld -j 10 times on >>>> the 4-core machine with HT. On the other hand, when doing sha1 over >>>> the 5GB file, the system time was reduced by 30%. >>>> >>>> Unfinished items: >>>> - Integration with the physbio, will be done after physbio is >>>> committed to HEAD. >>>> - The key per-architecture function needed for the unmapped i/o is the >>>> pmap_copy_pages(). I implemented it for amd64 and i386 right now, it >>>> shall be done for all other architectures. >>>> - The sizing of the submap used for transient mapping of the BIOs is >>>> naive. Should be adjusted, esp. for KVA-lean architectures. >>>> - Conversion of the other filesystems. Low priority. >>>> >>>> I am interested in reviews, tests and suggestions. Note that this >>>> only works now for md(4) and ahci(4), for other drivers the patched >>>> kernel should fall back to the mapped i/o. >>>> >>>> >>> Here are a couple things for you to think about: >>> >>> 1. A while back, I developed the patch at >>> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in >>> trying to >>> reduce the number of TLB shootdowns by the buffer map. The idea is >>> simple: >>> Replace the calls to pmap_q{enter,remove}() with calls to a new >>> machine-dependent function that opportunistically sets the buffer's >>> kernel >>> virtual address to the direct map for physically contiguous pages. >>> However, if the pages are not physically contiguous, it calls >>> pmap_qenter() >>> with the kernel virtual address from the buffer map. >>> >>> This eliminated about half of the TLB shootdowns for a buildworld, >>> because >>> there is a decent amount of physical contiguity that occurs by >>> "accident". >>> Using a buddy allocator for physical page allocation tends to promote >>> this >>> contiguity. 
However, in a few places, it occurs by explicit action, >>> e.g., >>> mapped files, including large executables, using superpage reservations. >>> >>> So, how does this fit with what you've done? You might think of >>> using what >>> I describe above as a kind of "fast path". As you can see from the >>> patch, >>> it's very simple and non-intrusive. If the pages aren't physically >>> contiguous, then instead of using pmap_qenter(), you fall back to >>> whatever >>> approach for creating ephemeral mappings is appropriate to a given >>> architecture. >> >> I think these are complimentary. Kib's patch gives us the fastest >> possible path for user data. Alan's patch will improve the metadata >> performance for things that really require the buffer cache. I see no >> reason not to clean up and commit both. >> >>> >>> 2. As for managing the ephemeral mappings on machines that don't >>> support a >>> direct map. I would suggest an approach that is loosely inspired by >>> copying garbage collection (or the segment cleaners in log-structured >>> file >>> systems). Roughly, you manage the buffer map as a few spaces (or >>> segments). When you create a new mapping in one of these spaces (or >>> segments), you simply install the PTEs. When you decide to "garbage >>> collect" a space (or spaces), then you perform a global TLB flush. >>> Specifically, you do something like toggling the bit in the cr4 register >>> that enables/disables support for the PG_G bit. If the spaces are >>> sufficiently large, then the number of such global TLB flushes should be >>> quite low. Every space would have an epoch number (or flush >>> number). In >>> the buffer, you would record the epoch number alongside the kernel >>> virtual >>> address. On access to the buffer, if the epoch number was too old, then >>> you have to recreate the buffer's mapping in a new space. >> >> Are the machines that don't have a direct map performance critical? >> My expectation is that they are legacy or embedded. 
This seems like a >> great project to do when the rest of the pieces are stable and fast. >> Until then they could just use something like pbufs? >> > > > I think the answer to your first question depends entirely on who you > are. :-) Also, at the low-end of the server space, there are many > people trying to promote arm-based systems. While FreeBSD may never run > on your arm-based phone, I think that ceding the arm-based server market > to others will be a strategic mistake. > > Alan > > P.S. I think we're moving the discussion to far away from kib's > original, so I suggest changing the subject line on any follow ups. > > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" From owner-freebsd-arch@FreeBSD.ORG Thu Dec 20 07:25:06 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0F260A3F; Thu, 20 Dec 2012 07:25:06 +0000 (UTC) (envelope-from alc@rice.edu) Received: from mh11.mail.rice.edu (mh11.mail.rice.edu [128.42.199.30]) by mx1.freebsd.org (Postfix) with ESMTP id CFCD78FC0C; Thu, 20 Dec 2012 07:25:05 +0000 (UTC) Received: from mh11.mail.rice.edu (localhost.localdomain [127.0.0.1]) by mh11.mail.rice.edu (Postfix) with ESMTP id E9ED04C026C; Thu, 20 Dec 2012 01:25:04 -0600 (CST) Received: from mh11.mail.rice.edu (localhost.localdomain [127.0.0.1]) by mh11.mail.rice.edu (Postfix) with ESMTP id E86364C024E; Thu, 20 Dec 2012 01:25:04 -0600 (CST) X-Virus-Scanned: by amavis-2.7.0 at mh11.mail.rice.edu, auth channel Received: from mh11.mail.rice.edu ([127.0.0.1]) by mh11.mail.rice.edu (mh11.mail.rice.edu [127.0.0.1]) (amavis, port 10026) with ESMTP id uRRMvNTtDQTu; Thu, 20 Dec 2012 01:25:04 -0600 (CST) Received: from adsl-216-63-78-18.dsl.hstntx.swbell.net (adsl-216-63-78-18.dsl.hstntx.swbell.net 
[216.63.78.18]) (using TLSv1 with cipher RC4-MD5 (128/128 bits)) (No client certificate requested) (Authenticated sender: alc) by mh11.mail.rice.edu (Postfix) with ESMTPSA id 6221A4C0235; Thu, 20 Dec 2012 01:25:04 -0600 (CST) Message-ID: <50D2BD4F.7010204@rice.edu> Date: Thu, 20 Dec 2012 01:25:03 -0600 From: Alan Cox User-Agent: Mozilla/5.0 (X11; FreeBSD i386; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Konstantin Belousov Subject: Re: Unmapped I/O References: <20121219135451.GU71906@kib.kiev.ua> <20121219192838.GZ71906@kib.kiev.ua> In-Reply-To: <20121219192838.GZ71906@kib.kiev.ua> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: alc@freebsd.org, arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 20 Dec 2012 07:25:06 -0000 On 12/19/2012 13:28, Konstantin Belousov wrote: > On Wed, Dec 19, 2012 at 12:58:41PM -0600, Alan Cox wrote: >> On Wed, Dec 19, 2012 at 7:54 AM, Konstantin Belousov wrote: >> >>> One of the known FreeBSD I/O path performance bootleneck is the >>> neccessity to map each I/O buffer pages into KVA. The problem is that >>> on the multi-core machines, the mapping must flush TLB on all cores, >>> due to the global mapping of the buffer pages into the kernel. This >>> means that buffer creation and destruction disrupts execution of all >>> other cores to perform TLB shootdown through IPI, and the thread >>> initiating the shootdown must wait for all other cores to execute and >>> report. >>> >>> The patch at >>> http://people.freebsd.org/~kib/misc/unmapped.4.patch >>> implements the 'unmapped buffers'. It means an ability to create the >>> VMIO struct buf, which does not point to the KVA mapping the buffer >>> pages to the kernel addresses. Since there is no mapping, kernel does >>> not need to clear TLB. 
The unmapped buffers are marked with the new >>> B_NOTMAPPED flag, and should be requested explicitely using the >>> GB_NOTMAPPED flag to the buffer allocation routines. If the mapped >>> buffer is requested but unmapped buffer already exists, the buffer >>> subsystem automatically maps the pages. >>> >>> The clustering code is also made aware of the not-mapped buffers, but >>> this required the KPI change that accounts for the diff in the non-UFS >>> filesystems. >>> >>> UFS is adopted to request not mapped buffers when kernel does not need >>> to access the content, i.e. mostly for the file data. New helper >>> function vn_io_fault_pgmove() operates on the unmapped array of pages. >>> It calls new pmap method pmap_copy_pages() to do the data move to and >>> from usermode. >>> >>> Besides not mapped buffers, not mapped BIOs are introduced, marked >>> with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated >>> to unmapped BIOs. Geom providers may indicate an acceptance of the >>> unmapped BIOs. If provider does not handle unmapped i/o requests, >>> geom now automatically establishes transient mapping for the i/o >>> pages. >>> >>> Swap- and malloc-backed md(4) is changed to accept unmapped BIOs. The >>> gpart providers indicate the unmapped BIOs support if the underlying >>> provider can do unmapped i/o. I also hacked ahci(4) to handle >>> unmapped i/o, but this should be changed after the Jeff' physbio patch >>> is committed, to use proper busdma interface. >>> >>> Besides, the swap pager does unmapped swapping if the swap partition >>> indicated that it can do unmapped i/o. By Jeff request, a buffer >>> allocation code may reserve the KVA for unmapped buffer in advance. >>> The unmapped page-in for the vnode pager is also implemented if >>> filesystem supports it, but the page out is not. The page-out, as well >>> as the vnode-backed md(4), currently require mappings, mostly due to >>> the use of VOP_WRITE(). 
>>> >>> As such, the patch worked in my test environment, where I used >>> ahci-attached SATA disks with gpt partitions, md(4) and UFS. I see no >>> statistically significant difference in the buildworld -j 10 times on >>> the 4-core machine with HT. On the other hand, when doing sha1 over >>> the 5GB file, the system time was reduced by 30%. >>> >>> Unfinished items: >>> - Integration with the physbio, will be done after physbio is >>> committed to HEAD. >>> - The key per-architecture function needed for the unmapped i/o is the >>> pmap_copy_pages(). I implemented it for amd64 and i386 right now, it >>> shall be done for all other architectures. >>> - The sizing of the submap used for transient mapping of the BIOs is >>> naive. Should be adjusted, esp. for KVA-lean architectures. >>> - Conversion of the other filesystems. Low priority. >>> >>> I am interested in reviews, tests and suggestions. Note that this >>> only works now for md(4) and ahci(4), for other drivers the patched >>> kernel should fall back to the mapped i/o. >>> >>> >> Here are a couple things for you to think about: >> >> 1. A while back, I developed the patch at >> http://www.cs.rice.edu/~alc/buf_maps5.patch as an experiment in trying to >> reduce the number of TLB shootdowns by the buffer map. The idea is simple: >> Replace the calls to pmap_q{enter,remove}() with calls to a new >> machine-dependent function that opportunistically sets the buffer's kernel >> virtual address to the direct map for physically contiguous pages. >> However, if the pages are not physically contiguous, it calls pmap_qenter() >> with the kernel virtual address from the buffer map. >> >> This eliminated about half of the TLB shootdowns for a buildworld, because >> there is a decent amount of physical contiguity that occurs by "accident". >> Using a buddy allocator for physical page allocation tends to promote this >> contiguity. 
However, in a few places, it occurs by explicit action, e.g., >> mapped files, including large executables, using superpage reservations. >> >> So, how does this fit with what you've done? You might think of using what >> I describe above as a kind of "fast path". As you can see from the patch, >> it's very simple and non-intrusive. If the pages aren't physically >> contiguous, then instead of using pmap_qenter(), you fall back to whatever >> approach for creating ephemeral mappings is appropriate to a given >> architecture. > I remember this. > > I did not measured the change in the amount of IPIs issued during the > buildworld, but I do account for the mapped/unmapped buffer space in > the patch. For the buildworld load, there is 5-10% of the mapped buffers > from the whole buffers, which coincide with the intuitive size of the > metadata for sources. Since unmapped buffers eliminate IPIs at creation > and reuse, I safely guess that IPI reduction is on the comparable numbers. > > The pmap_map_buf() patch is orthohonal to the work I did, and it should > nicely reduce the overhead for the metadata buffers handling. I can finish > it, if you want. I do not think that it should be added to the already > large patch, but instead it could be done and committed separately. I agree. This patch should be completed and committed separately from your patch. I would be happy for you to complete the patch. However, before doing that, let me send you another patch that is an alternate implementation of this same basic idea. Essentially, I was trying to see if I could come up with another way of doing the same thing that didn't require two new pmap functions. After you've had a chance to look at them both, we can discuss the pros and cons of each, and decide which one to complete and commit. I'll dig up this alternate implementation and send it to you on Friday. >> 2. As for managing the ephemeral mappings on machines that don't support a >> direct map. 
I would suggest an approach that is loosely inspired by
>> copying garbage collection (or the segment cleaners in log-structured file
>> systems). Roughly, you manage the buffer map as a few spaces (or
>> segments). When you create a new mapping in one of these spaces (or
>> segments), you simply install the PTEs. When you decide to "garbage
>> collect" a space (or spaces), then you perform a global TLB flush.
>> Specifically, you do something like toggling the bit in the cr4 register
>> that enables/disables support for the PG_G bit. If the spaces are
>> sufficiently large, then the number of such global TLB flushes should be
>> quite low. Every space would have an epoch number (or flush number). In
>> the buffer, you would record the epoch number alongside the kernel virtual
>> address. On access to the buffer, if the epoch number was too old, then
>> you have to recreate the buffer's mapping in a new space.
> Could you please describe the idea in more detail? For which mappings
> should the described mechanism be used?
>
> Do you mean the pmap_copy_pages() implementation, or the fallback mappings
> for BIOs?
>
> Note that the pmap_copy_pages() implementation on i386 is shamelessly stolen
> from pmap_copy_page() and uses the per-cpu ephemeral mapping for copying.
>
> For BIOs, this might be used, but I am also quite satisfied with submap
> and pmap_qenter().

I'll try to answer your questions on Friday.
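
As a rough illustration of the epoch idea quoted in point 2 above, here is a minimal userland sketch in C. It is not code from either patch. Every name in it (struct bufmap, struct space, bufmap_map() and so on) is hypothetical, the "kernel virtual address" is reduced to a slot index, and the global TLB flush is modeled as a plain counter. The only point is to show that installing a mapping costs no flush, that reclaiming a whole space costs exactly one, and that a buffer detects a stale mapping by comparing epoch numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSPACES     2   /* hypothetical: spaces the buffer map is split into */
#define SPACE_SLOTS 4   /* hypothetical: mappings per space (tiny on purpose) */

struct space {
	uint64_t epoch;  /* bumped every time the space is reclaimed */
	size_t   used;   /* slots handed out since the last reclaim */
};

struct bufmap {
	struct space spaces[NSPACES];
	size_t       cur;             /* space currently being filled */
	unsigned     global_flushes;  /* stands in for real global TLB flushes */
};

struct buf_mapping {
	size_t   space;  /* space holding the mapping */
	size_t   slot;   /* slot in that space, stands in for the KVA */
	uint64_t epoch;  /* epoch of the space when the mapping was installed */
	int      valid;  /* zero until the first mapping is installed */
};

/* "Garbage collect" one space: a single global flush retires all its mappings. */
void
bufmap_reclaim(struct bufmap *bm, size_t i)
{
	struct space *sp;

	assert(i < NSPACES);
	sp = &bm->spaces[i];
	if (sp->used == 0)
		return;          /* unused since the last reclaim, nothing is stale */
	sp->epoch++;
	sp->used = 0;
	bm->global_flushes++;
}

/* Install a mapping: no flush unless a full space has to be recycled first. */
void
bufmap_map(struct bufmap *bm, struct buf_mapping *m)
{
	if (bm->spaces[bm->cur].used == SPACE_SLOTS) {
		bm->cur = (bm->cur + 1) % NSPACES;
		bufmap_reclaim(bm, bm->cur);
	}
	m->space = bm->cur;
	m->slot = bm->spaces[bm->cur].used++;
	m->epoch = bm->spaces[bm->cur].epoch;
	m->valid = 1;
}

/* A mapping is usable only if its space has not been reclaimed since. */
int
bufmap_is_current(const struct bufmap *bm, const struct buf_mapping *m)
{
	return (m->valid && m->epoch == bm->spaces[m->space].epoch);
}

/* On access, recreate a stale mapping; returns 1 if a remap was needed. */
int
bufmap_access(struct bufmap *bm, struct buf_mapping *m)
{
	if (bufmap_is_current(bm, m))
		return (0);
	bufmap_map(bm, m);
	return (1);
}
```

With two spaces of four slots each, the first eight mappings are installed without any simulated flush. The ninth recycles the first space, costs one flush, and leaves the four oldest mappings stale until they are next accessed. Larger spaces push the flushes further apart, which is the trade-off described in the quoted text.
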
Alan From owner-freebsd-arch@FreeBSD.ORG Thu Dec 20 10:49:36 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E9787841 for ; Thu, 20 Dec 2012 10:49:36 +0000 (UTC) (envelope-from freebsd-arch@m.gmane.org) Received: from plane.gmane.org (plane.gmane.org [80.91.229.3]) by mx1.freebsd.org (Postfix) with ESMTP id 96ACD8FC0A for ; Thu, 20 Dec 2012 10:49:36 +0000 (UTC) Received: from list by plane.gmane.org with local (Exim 4.69) (envelope-from ) id 1TldhB-0004z4-Hu for freebsd-arch@freebsd.org; Thu, 20 Dec 2012 11:49:45 +0100 Received: from lara.cc.fer.hr ([161.53.72.113]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Thu, 20 Dec 2012 11:49:45 +0100 Received: from ivoras by lara.cc.fer.hr with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Thu, 20 Dec 2012 11:49:45 +0100 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-arch@freebsd.org From: Ivan Voras Subject: Re: Unmapped I/O Date: Thu, 20 Dec 2012 11:49:18 +0100 Lines: 37 Message-ID: References: <20121219135451.GU71906@kib.kiev.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig672A5CF2041B938E8336E022" X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: lara.cc.fer.hr User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:14.0) Gecko/20120812 Thunderbird/14.0 In-Reply-To: <20121219135451.GU71906@kib.kiev.ua> X-Enigmail-Version: 1.4.3 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 20 Dec 2012 10:49:37 -0000 This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig672A5CF2041B938E8336E022 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On 
19/12/2012 14:54, Konstantin Belousov wrote: > Besides not mapped buffers, not mapped BIOs are introduced, marked > with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated > to unmapped BIOs. Geom providers may indicate an acceptance of the > unmapped BIOs. If provider does not handle unmapped i/o requests, > geom now automatically establishes transient mapping for the i/o > pages. Hi, Can you write up more details on what this means for GEOM developers: what is to be gained / lost (on GEOM level) for existing GEOM classes, and how to "indicate acceptance" for umapped BIOs? --------------enig672A5CF2041B938E8336E022 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAlDS7S8ACgkQ/QjVBj3/HSyhmQCfSxnxOEfWaZTnx+IIgZ4sv+Ev fxUAnRIHLBgaSCtkZQdYMyRjrKEG9iK+ =QbNP -----END PGP SIGNATURE----- --------------enig672A5CF2041B938E8336E022-- From owner-freebsd-arch@FreeBSD.ORG Thu Dec 20 20:15:28 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E8DB1B52; Thu, 20 Dec 2012 20:15:28 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 4FAE88FC0A; Thu, 20 Dec 2012 20:15:28 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.5/8.14.5) with ESMTP id qBKKFNDh009632; Thu, 20 Dec 2012 22:15:23 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.3 kib.kiev.ua qBKKFNDh009632 Received: (from kostik@localhost) by tom.home (8.14.5/8.14.5/Submit) id qBKKFN9k009631; Thu, 20 Dec 2012 22:15:23 +0200 (EET) (envelope-from kostikbel@gmail.com) 
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 20 Dec 2012 22:15:23 +0200 From: Konstantin Belousov To: Ivan Voras Subject: Re: Unmapped I/O Message-ID: <20121220201523.GD53644@kib.kiev.ua> References: <20121219135451.GU71906@kib.kiev.ua> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="9dgjiU4MmWPVapMU" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 20 Dec 2012 20:15:29 -0000 --9dgjiU4MmWPVapMU Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Thu, Dec 20, 2012 at 11:49:18AM +0100, Ivan Voras wrote:
> On 19/12/2012 14:54, Konstantin Belousov wrote:
>
> > Besides not mapped buffers, not mapped BIOs are introduced, marked
> > with the flag BIO_NOTMAPPED. Unmapped buffers are directly translated
> > to unmapped BIOs. Geom providers may indicate an acceptance of the
> > unmapped BIOs. If provider does not handle unmapped i/o requests,
> > geom now automatically establishes transient mapping for the i/o
> > pages.
>
> Hi,
>
> Can you write up more details on what this means for GEOM developers:
> what is to be gained / lost (on GEOM level) for existing GEOM classes,
> and how to "indicate acceptance" for umapped BIOs?

Nothing changes for existing GEOM classes, and it does not mean anything
for a GEOM developer, unless she wants to change the GEOM class to handle
unmapped BIOs. Did you look at the patch? Look at the changes for
struct bio and geom_vfs.c.

Geoms which accept unmapped BIOs can now get either a mapped bio, no
different from the current one, or an unmapped bio, where bio_data is
invalid and an array of vm_pages, bio_ma, is passed instead, with
bio_ma_offset specifying the start position of the data in the first
page of the array.

Unless a provider has explicitly agreed to accept unmapped BIOs, it will
never get them.

--9dgjiU4MmWPVapMU Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlDTcdsACgkQC3+MBN1Mb4jY4gCfV3TsBdk5E9TfHqmrkwoLsKUJ 41MAoN+36nNeofx0UyKaKO2YPOgtMQW6 =nKCt -----END PGP SIGNATURE----- --9dgjiU4MmWPVapMU-- From owner-freebsd-arch@FreeBSD.ORG Fri Dec 21 11:53:22 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 84DCCA17 for ; Fri, 21 Dec 2012 11:53:22 +0000 (UTC) (envelope-from ivoras@gmail.com) Received: from mail-vc0-f174.google.com (mail-vc0-f174.google.com [209.85.220.174]) by mx1.freebsd.org (Postfix) with ESMTP id 311BC8FC0A for ; Fri, 21 Dec 2012 11:53:22 +0000 (UTC) Received: by mail-vc0-f174.google.com with SMTP id d16so4927105vcd.5 for ; Fri, 21 Dec 2012 03:53:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date :x-google-sender-auth:message-id:subject:to:cc:content-type; bh=Z4rx0M19dNeSCGknWKYHvhvT4Nd2l3xXfrjC5N6XboA=; b=RZkk1sUkqC4jeNVs1oUUR/CLp/pwhYfDwrk9cCoIwpY32KgaJLT+rIfYI1pMKxAz61 tj/HbTMIv98y0u2BsJf2sfHTR6GN93cHGRg/N6KYX1RavYwHORpz+bMNpPlgRn6XZCtP UxzBsm0Tcl1JgXGwDBOU//vp7RjlWjvJjRCZQ059AWBSUidboNLIi9wa0RyzQ8zbky0m
ewWtfWAsYLG967XXLEWhHUcah2tnr7zlUx14MUnIhrBtM9WxFPEFXw8yxIUW6VI8tGju 1kE9GIHQvwytVv0dBIKm12DrP+W7qAB1ntyk6nGC+LAPpFIdquDzY9qnTQ9kFBm//pLD Ap3A== Received: by 10.58.15.72 with SMTP id v8mr19721426vec.55.1356090795182; Fri, 21 Dec 2012 03:53:15 -0800 (PST) MIME-Version: 1.0 Sender: ivoras@gmail.com Received: by 10.58.107.230 with HTTP; Fri, 21 Dec 2012 03:52:35 -0800 (PST) In-Reply-To: <20121220201523.GD53644@kib.kiev.ua> References: <20121219135451.GU71906@kib.kiev.ua> <20121220201523.GD53644@kib.kiev.ua> From: Ivan Voras Date: Fri, 21 Dec 2012 12:52:35 +0100 X-Google-Sender-Auth: ze5uIP8CwY_zYzbkLp5mWEYK5jE Message-ID: Subject: Re: Unmapped I/O To: Konstantin Belousov Content-Type: text/plain; charset=UTF-8 Cc: freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 21 Dec 2012 11:53:22 -0000 On 20 December 2012 21:15, Konstantin Belousov wrote: > Nothing is changed for existing GEOM classes, and it does not mean anything > for GEOM developers, unless she wants to change the GEOM class to handle > unmapped BIOs. Understood, but the intention of my question was: do you recommend GEOM classes should take the effort and implement unmapped BIOs whenever possible? Your change in g_part.c is trivial - this is because g_part doesn't actually touch the BIO data, only pass it on? 
From owner-freebsd-arch@FreeBSD.ORG Fri Dec 21 12:02:42 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A20DADA5; Fri, 21 Dec 2012 12:02:42 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) by mx1.freebsd.org (Postfix) with ESMTP id 086A98FC12; Fri, 21 Dec 2012 12:02:41 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.5/8.14.5) with ESMTP id qBLC2bkU018521; Fri, 21 Dec 2012 14:02:37 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.7.3 kib.kiev.ua qBLC2bkU018521 Received: (from kostik@localhost) by tom.home (8.14.5/8.14.5/Submit) id qBLC2btY018520; Fri, 21 Dec 2012 14:02:37 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 21 Dec 2012 14:02:37 +0200 From: Konstantin Belousov To: Ivan Voras Subject: Re: Unmapped I/O Message-ID: <20121221120237.GF53644@kib.kiev.ua> References: <20121219135451.GU71906@kib.kiev.ua> <20121220201523.GD53644@kib.kiev.ua> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="vA66WO2vHvL/CRSR" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 21 Dec 2012 12:02:42 -0000 --vA66WO2vHvL/CRSR Content-Type: text/plain; charset=us-ascii 
Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Fri, Dec 21, 2012 at 12:52:35PM +0100, Ivan Voras wrote:
> On 20 December 2012 21:15, Konstantin Belousov wrote:
>
> > Nothing is changed for existing GEOM classes, and it does not mean anything
> > for GEOM developers, unless she wants to change the GEOM class to handle
> > unmapped BIOs.
>
> Understood, but the intention of my question was: do you recommend
> GEOM classes should take the effort and implement unmapped BIOs
> whenever possible?

Depends. RAID 0 and RAID 1 can process unmapped BIOs without changes, I
am sure. For a class like RAID5, you would need hardware support for it
to be able to operate on unmapped BIOs without requiring a remap. There
is indeed Intel IOAT, which I believe can do this. On the other hand,
for encrypting classes like GELI it probably does not make much sense to
care when the encryption is done in software or using AES-NI.

>
> Your change in g_part.c is trivial - this is because g_part doesn't
> actually touch the BIO data, only pass it on?

Right.
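
To make the mapped/unmapped distinction from earlier in this thread concrete, here is a small userland model in C of what a class that touches the data would have to do. It is only a sketch under stated assumptions: struct model_bio and struct model_page are hypothetical stand-ins, not the kernel's struct bio or vm_page; the field names bio_data, bio_ma and bio_ma_offset and the flag name BIO_NOTMAPPED are taken from the thread, but the flag value and the tiny page size are made up. A class like g_part that only passes the request on never needs a loop like this, which is why its change is trivial.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 16    /* tiny page size, for illustration only */
#define BIO_NOTMAPPED   0x01  /* flag name from the thread; value is made up */

/* Hypothetical stand-in for a physical page. */
struct model_page {
	unsigned char bytes[MODEL_PAGE_SIZE];
};

/* Hypothetical, heavily simplified stand-in for an I/O request. */
struct model_bio {
	int                 bio_flags;
	void               *bio_data;       /* valid only for mapped requests */
	struct model_page **bio_ma;         /* page array for unmapped requests */
	size_t              bio_ma_offset;  /* start of data within bio_ma[0] */
	size_t              bio_length;     /* bytes covered by the request */
};

/* Copy the request's data into dst, whichever way it was passed. */
size_t
model_bio_copy_out(const struct model_bio *bp, unsigned char *dst)
{
	size_t done, off, chunk, i;

	assert(bp != NULL && dst != NULL);
	if ((bp->bio_flags & BIO_NOTMAPPED) == 0) {
		/* Mapped: one contiguous region behind bio_data. */
		memcpy(dst, bp->bio_data, bp->bio_length);
		return (bp->bio_length);
	}
	/* Unmapped: walk the pages; only the first one starts at an offset. */
	done = 0;
	off = bp->bio_ma_offset;
	for (i = 0; done < bp->bio_length; i++) {
		chunk = MODEL_PAGE_SIZE - off;
		if (chunk > bp->bio_length - done)
			chunk = bp->bio_length - done;
		memcpy(dst + done, bp->bio_ma[i]->bytes + off, chunk);
		done += chunk;
		off = 0;
	}
	return (done);
}
```

In the real kernel the per-page copy would go through some mapping or a routine such as the pmap_copy_pages() mentioned in the original announcement, rather than a plain memcpy(); the model only shows how the offset applies to the first page and how the length is consumed page by page.
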
--vA66WO2vHvL/CRSR Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlDUT90ACgkQC3+MBN1Mb4hcEgCgtmlwbM98Vjqkt4CWaLTPXbir dBQAoOQsdKvnio2jfME76vKDGdlLOKtq =JPBn -----END PGP SIGNATURE----- --vA66WO2vHvL/CRSR-- From owner-freebsd-arch@FreeBSD.ORG Fri Dec 21 14:22:57 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 3C00CDCC for ; Fri, 21 Dec 2012 14:22:57 +0000 (UTC) (envelope-from freebsd-listen@fabiankeil.de) Received: from smtprelay05.ispgateway.de (smtprelay05.ispgateway.de [80.67.31.98]) by mx1.freebsd.org (Postfix) with ESMTP id BFA878FC17 for ; Fri, 21 Dec 2012 14:22:56 +0000 (UTC) Received: from [78.35.147.221] (helo=fabiankeil.de) by smtprelay05.ispgateway.de with esmtpsa (SSLv3:AES128-SHA:128) (Exim 4.68) (envelope-from ) id 1Tm3KJ-0004JC-Gb; Fri, 21 Dec 2012 15:11:51 +0100 Date: Fri, 21 Dec 2012 15:08:23 +0100 From: Fabian Keil To: Konstantin Belousov Subject: Re: Unmapped I/O Message-ID: <20121221150823.09c9d913@fabiankeil.de> In-Reply-To: <20121219135451.GU71906@kib.kiev.ua> References: <20121219135451.GU71906@kib.kiev.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=PGP-SHA1; boundary="Sig_/lGD3IsDEyeDtcAlfZY15mo8"; protocol="application/pgp-signature" X-Df-Sender: Nzc1MDY3 Cc: arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 21 Dec 2012 14:22:57 -0000 --Sig_/lGD3IsDEyeDtcAlfZY15mo8 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable Konstantin Belousov wrote: > The patch at > http://people.freebsd.org/~kib/misc/unmapped.4.patch > implements the 'unmapped buffers'. > I am interested in reviews, tests and suggestions. 
Note that this
> only works now for md(4) and ahci(4), for other drivers the patched
> kernel should fall back to the mapped i/o.

Using this patch I can't log in as either root or user because most of
the programs, including the login shells, segfault:

Dec 21 14:27:19 r500 kernel: eval: cp: Exec format error
Dec 21 14:27:19 r500 wpa_supplicant[1716]: Trying to associate with [...]
Dec 21 14:27:19 r500 wpa_supplicant[1716]: Associated with [...]
Dec 21 14:27:19 r500 kernel: [54] wlan0: link state changed to UP
Dec 21 14:27:19 r500 wpa_supplicant[1716]: WPA: Key negotiation completed with [...] [PTK=CCMP GTK=CCMP]
Dec 21 14:27:19 r500 wpa_supplicant[1716]: CTRL-EVENT-CONNECTED - Connection to [...] completed (auth) [id=0 id_str=]
Dec 21 14:27:19 r500 kernel: /sbin/mount_nullfs: N^P^Hg[...]0^E: not found
Dec 21 14:27:21 r500 kernel: .
Dec 21 14:27:21 r500 kernel: [56] pid 2989 (egrep), uid 0: exited on signal 11 (core dumped)
Dec 21 14:27:36 r500 kernel: [71] pid 3013 (bash), uid 1001: exited on signal 11 (core dumped)
Dec 21 14:27:36 r500 kernel: [71] pid 3005 (login), uid 0: exited on signal 11
Dec 21 14:27:46 r500 kernel: [81] pid 3016 (bash), uid 1001: exited on signal 11 (core dumped)
Dec 21 14:28:52 r500 login: ROOT LOGIN (root) ON ttyv0
Dec 21 14:28:52 r500 kernel: [148] pid 3020 (csh), uid 0: exited on signal 11 (core dumped)

Jails don't start either.

fk@r500 ~ $gdb75 /usr/sbin/syslogd /syslogd.core
[...]
Reading symbols from /usr/sbin/syslogd...done.
[New process 100356]
Core was generated by `syslogd'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000604878 in ?? ()
(gdb) where
#0 0x0000000000604878 in ?? ()
#1 0x0000000800b4c667 in _swrite (fp=0x800daa720, buf=0x1200000137 , n=21) at /usr/src/lib/libc/stdio/stdio.c:133
#2 0x0000000800b4c269 in __sflush (fp=, fp=) at /usr/src/lib/libc/stdio/fflush.c:123
#3 0x0000000800af0930 in _fwalk (function=0x800b4c200 <__sflush>) at /usr/src/lib/libc/stdio/fwalk.c:65
#4 0x0000000800b4c060 in _cleanup () at /usr/src/lib/libc/stdio/findfp.c:202
#5 0x0000000800acfabd in exit (status=1) at /usr/src/lib/libc/stdlib/exit.c:69
#6 0x0000000000404bc8 in die (signo=) at /usr/src/usr.sbin/syslogd/syslogd.c:1528
#7 0x0000000000404771 in main (argc=21143816, argv=0x7fff00000000) at /usr/src/usr.sbin/syslogd/syslogd.c:615

fk@r500 ~ $gdb75 /usr/sbin/syslogd /syslogd.core
[...]
[New process 100356]
Core was generated by `syslogd'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000604878 in ?? ()
(gdb) where
#0 0x0000000000604878 in ?? ()
#1 0x0000000800b4c667 in _swrite (fp=0x800daa720, buf=0x1200000137 , n=21) at /usr/src/lib/libc/stdio/stdio.c:133
#2 0x0000000800b4c269 in __sflush (fp=, fp=) at /usr/src/lib/libc/stdio/fflush.c:123
#3 0x0000000800af0930 in _fwalk (function=0x800b4c200 <__sflush>) at /usr/src/lib/libc/stdio/fwalk.c:65
#4 0x0000000800b4c060 in _cleanup () at /usr/src/lib/libc/stdio/findfp.c:202
#5 0x0000000800acfabd in exit (status=1) at /usr/src/lib/libc/stdlib/exit.c:69
#6 0x0000000000404bc8 in die (signo=) at /usr/src/usr.sbin/syslogd/syslogd.c:1528
#7 0x0000000000404771 in main (argc=21143816, argv=0x7fff00000000) at /usr/src/usr.sbin/syslogd/syslogd.c:615

fk@r500 ~ $gdb75 /usr/local/bin/bash bash.core
[...]
[New process 100449]
Core was generated by `bash'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000080091b4f0 in _nc_trim_sgr0 (tp=0x8014a2180) from /lib/libncurses.so.8
(gdb) where
#0 0x000000080091b4f0 in _nc_trim_sgr0 (tp=0x8014a2180) from /lib/libncurses.so.8
#1 0x000000080091773d in tgetent (bufp=0x8014a0000 " +C\001\b", name=<optimized out>) at /usr/src/lib/ncurses/ncurses/../../../contrib/ncurses/ncurses/tinfo/lib_termcap.c:162
#2 0x00000000004b46f1 in _rl_init_terminal_io (terminal_name=0x8014061a8 "xterm") at terminal.c:452
#3 0x000000000049a0e4 in readline_initialize_everything () at readline.c:1066
#4 0x0000000000499fb2 in rl_initialize () at readline.c:968
#5 0x0000000000460294 in initialize_readline () at bashline.c:522
#6 0x000000000040a687 in yy_readline_get () at ./parse.y:1428
#7 0x000000000040a63d in yy_getc () at ./parse.y:1376
#8 0x000000000040b49f in shell_getc (remove_quoted_newline=1) at ./parse.y:2231
#9 0x000000000040c5a3 in read_token (command=0) at ./parse.y:2908
#10 0x000000000040bcc8 in yylex () at ./parse.y:2517
#11 0x0000000000407467 in yyparse () at y.tab.c:2065
#12 0x00000000004070be in parse_command () at eval.c:228
#13 0x00000000004071a4 in read_command () at eval.c:272
#14 0x0000000000406ea4 in reader_loop () at eval.c:137
#15 0x0000000000404f7c in main (argc=1, argv=0x7fffffffdcd0, env=0x7fffffffdce0) at shell.c:749

I'm using UFS for / and ZFS (and nullfs ...) for the rest. The ZFS pool
is on ada0s1d.eli, swap is on ada0s1b.eli.

I didn't update the userland as the patch only seems to touch the kernel.
Fabian --Sig_/lGD3IsDEyeDtcAlfZY15mo8 Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (FreeBSD) iEYEARECAAYFAlDUbVkACgkQBYqIVf93VJ2ZAwCgseYIPiy+K0cHtLWzq2hYU5aj 7+IAn3xQdI+FtxyfzD0IeRHDm2vo6aBo =oVZk -----END PGP SIGNATURE----- --Sig_/lGD3IsDEyeDtcAlfZY15mo8-- From owner-freebsd-arch@FreeBSD.ORG Fri Dec 21 17:36:01 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 5DF3BFC4; Fri, 21 Dec 2012 17:36:01 +0000 (UTC) (envelope-from alfred@ixsystems.com) Received: from mail.iXsystems.com (newknight.ixsystems.com [206.40.55.70]) by mx1.freebsd.org (Postfix) with ESMTP id 3D3788FC1A; Fri, 21 Dec 2012 17:36:01 +0000 (UTC) Received: from localhost (mail.ixsystems.com [10.2.55.1]) by mail.iXsystems.com (Postfix) with ESMTP id E92235E604; Fri, 21 Dec 2012 09:36:00 -0800 (PST) Received: from mail.iXsystems.com ([10.2.55.1]) by localhost (mail.ixsystems.com [10.2.55.1]) (maiad, port 10024) with ESMTP id 47429-05; Fri, 21 Dec 2012 09:36:00 -0800 (PST) Received: from Alfreds-MacBook-Pro-9.local (unknown [10.8.0.10]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by mail.iXsystems.com (Postfix) with ESMTPSA id B44105E5FF; Fri, 21 Dec 2012 09:36:00 -0800 (PST) Message-ID: <50D49DFF.3060803@ixsystems.com> Date: Fri, 21 Dec 2012 09:35:59 -0800 From: Alfred Perlstein User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: "arch@freebsd.org" , Rui Paulo Subject: making use of userland dtrace on FreeBSD Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 21 Dec 2012 17:36:01 -0000

Hey folks,

We have had userland DTrace for a while now. However, it's not really hooked up into the build, nor, as far as I can tell, are ports or shared libs.

DTrace can be immensely useful for tracking down hard-to-find bugs, memory leaks, performance problems and a lot more.

What are the thoughts on making this available by default on FreeBSD going forward? What would need to happen?

Supposedly we can do this by just adding "CFLAGS=-fno-omit-frame-pointer" and not completely stripping installed tools/libraries. Would it make sense to set this as the default for the whole system? Just libs+ports?

Or do people think that the performance gain of omit-frame-pointer (which I am unsure of) is worth the loss of debug-ability (like a certain arctic bird based OS)?

I have also factored the size of binaries into this, and I really am not sure why it would be a problem, other than if we didn't offer an "easy button" to make things "small".

Let's figure this out, because it seems to me that we should be offering this to our users if possible.
-Alfred From owner-freebsd-arch@FreeBSD.ORG Fri Dec 21 17:38:04 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 71CF2132; Fri, 21 Dec 2012 17:38:04 +0000 (UTC) (envelope-from bright@mu.org) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 5246D8FC1B; Fri, 21 Dec 2012 17:38:04 +0000 (UTC) Received: from Alfreds-MacBook-Pro-9.local (c-67-180-208-218.hsd1.ca.comcast.net [67.180.208.218]) by elvis.mu.org (Postfix) with ESMTPSA id 2F6721A3C1C; Fri, 21 Dec 2012 09:38:02 -0800 (PST) Message-ID: <50D49E79.6090500@mu.org> Date: Fri, 21 Dec 2012 09:38:01 -0800 From: Alfred Perlstein User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: Konstantin Belousov Subject: Re: Unmapped I/O References: <20121219135451.GU71906@kib.kiev.ua> <20121220201523.GD53644@kib.kiev.ua> <20121221120237.GF53644@kib.kiev.ua> In-Reply-To: <20121221120237.GF53644@kib.kiev.ua> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: Ivan Voras , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 21 Dec 2012 17:38:04 -0000 On 12/21/12 4:02 AM, Konstantin Belousov wrote: > On Fri, Dec 21, 2012 at 12:52:35PM +0100, Ivan Voras wrote: >> On 20 December 2012 21:15, Konstantin Belousov wrote: >> >>> Nothing is changed for existing GEOM classes, and it does not mean anything >>> for GEOM developers, unless she wants to change the GEOM class to handle >>> unmapped BIOs. >> Understood, but the intention of my question was: do you recommend >> GEOM classes should take the effort and implement unmapped BIOs >> whenever possible? > Depends. 
RAID 0 and RAID 1 can process unmapped BIOs without changes, > I am sure. For the class like RAID5, you would need a hardware > for it to be able to operate on the unmapped BIOs without requiring > the remap. There is indeed Intel IOAT, which I believe can do this. > > On the other hand, for encrypting classes like GELI it probably does not > make much sense to care, for the case of encryption done in software or > using AES-NI. *Raising my hand like the annoying kid in class* What about asking for the physaddr for such pages on dmap arches? by the way, thanks for this giant leap forward, it's going to help FreeBSD very much! -Alfred From owner-freebsd-arch@FreeBSD.ORG Sat Dec 22 19:28:29 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 47F39535; Sat, 22 Dec 2012 19:28:29 +0000 (UTC) (envelope-from freebsd@damnhippie.dyndns.org) Received: from duck.symmetricom.us (duck.symmetricom.us [206.168.13.214]) by mx1.freebsd.org (Postfix) with ESMTP id 0D2DB8FC0C; Sat, 22 Dec 2012 19:28:28 +0000 (UTC) Received: from damnhippie.dyndns.org (daffy.symmetricom.us [206.168.13.218]) by duck.symmetricom.us (8.14.5/8.14.5) with ESMTP id qBMJSRB1052561; Sat, 22 Dec 2012 12:28:28 -0700 (MST) (envelope-from freebsd@damnhippie.dyndns.org) Received: from [172.22.42.240] (revolution.hippie.lan [172.22.42.240]) by damnhippie.dyndns.org (8.14.3/8.14.3) with ESMTP id qBMJSPk0069007; Sat, 22 Dec 2012 12:28:25 -0700 (MST) (envelope-from freebsd@damnhippie.dyndns.org) Subject: jemalloc enhancement for small-memory systems From: Ian Lepore To: freebsd-arch@freebsd.org Content-Type: text/plain; charset="us-ascii" Date: Sat, 22 Dec 2012 12:28:25 -0700 Message-ID: <1356204505.1129.21.camel@revolution.hippie.lan> Mime-Version: 1.0 X-Mailer: Evolution 2.32.1 FreeBSD GNOME Team Port Content-Transfer-Encoding: 7bit Cc: Jason Evans X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 
2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 22 Dec 2012 19:28:29 -0000 When a daemon such as watchdogd uses mlockall(2) on a small-memory embedded system, it can end up wiring much of the available ram because jemalloc allocates large chunks of vmspace by default. More background info on this can be found in this thread: http://lists.freebsd.org/pipermail/freebsd-embedded/2012-November/001679.html It's hard to tune jemalloc's allocation behavior for this in a machine-independent way because the minimum chunk size depends on PAGE_SIZE and other factors internal to jemalloc. I've created a patch that addresses this by defining that lg_chunk:0 is implicitly a request to set the chunk size to the smallest value allowable for the machine it's running on. The patch is attached to this PR... http://www.freebsd.org/cgi/query-pr.cgi?pr=174641 Jason, could you please review this and consider incorporating it into jemalloc? Or let us know if there's a better way to handle this situation. 
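For readers who want to see what the request looks like from the daemon's side, here is a minimal sketch, assuming a libc whose allocator is jemalloc and which reads the global `malloc_conf` string at startup. The meaning of "lg_chunk:0" as "smallest allowable chunk" is the behavior proposed in the patch, not something an unpatched jemalloc is known to provide, and `wire_all_memory()` is a hypothetical helper name.

```c
#include <stdio.h>
#include <sys/mman.h>

/*
 * jemalloc consults this global option string when it initializes.
 * ASSUMPTION: "lg_chunk:0" selecting the smallest chunk size the platform
 * allows is the behavior proposed in the patch; an unpatched jemalloc may
 * reject or ignore the value. With a non-jemalloc libc this is only an
 * ordinary, unused global variable.
 */
const char *malloc_conf = "lg_chunk:0";

/*
 * Wire all current and future pages of the process, as a watchdogd-like
 * daemon would. Hypothetical helper: returns 0 on success, -1 on failure
 * (for example when the process lacks the privilege or resource limit).
 */
int
wire_all_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return (-1);
    }
    return (0);
}
```

A daemon would call `wire_all_memory()` once, early in `main()`. The point of the small chunk size is that everything the allocator maps after that call becomes wired memory.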
-- Ian From owner-freebsd-arch@FreeBSD.ORG Sat Dec 22 22:40:43 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D15EB3B3; Sat, 22 Dec 2012 22:40:43 +0000 (UTC) (envelope-from tim@kientzle.com) Received: from monday.kientzle.com (99-115-135-74.uvs.sntcca.sbcglobal.net [99.115.135.74]) by mx1.freebsd.org (Postfix) with ESMTP id A61DB8FC0C; Sat, 22 Dec 2012 22:40:43 +0000 (UTC) Received: (from root@localhost) by monday.kientzle.com (8.14.4/8.14.4) id qBMMeelV049987; Sat, 22 Dec 2012 22:40:40 GMT (envelope-from tim@kientzle.com) Received: from [192.168.2.143] (CiscoE3000 [192.168.1.65]) by kientzle.com with SMTP id 8pqwut2y98yyjain7bupjg43nw; Sat, 22 Dec 2012 22:40:39 +0000 (UTC) (envelope-from tim@kientzle.com) Subject: Re: jemalloc enhancement for small-memory systems Mime-Version: 1.0 (Apple Message framework v1283) Content-Type: text/plain; charset=us-ascii From: Tim Kientzle In-Reply-To: <1356204505.1129.21.camel@revolution.hippie.lan> Date: Sat, 22 Dec 2012 14:40:39 -0800 Content-Transfer-Encoding: quoted-printable Message-Id: <75ECE5AB-9276-44BA-84D7-56EF6BDC3984@kientzle.com> References: <1356204505.1129.21.camel@revolution.hippie.lan> To: Ian Lepore X-Mailer: Apple Mail (2.1283) Cc: Jason Evans , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 22 Dec 2012 22:40:43 -0000

On Dec 22, 2012, at 11:28 AM, Ian Lepore wrote:

> When a daemon such as watchdogd uses mlockall(2) on a small-memory
> embedded system, it can end up wiring much of the available ram because
> jemalloc allocates large chunks of vmspace by default. More background
> info on this can be found in this thread:
>
> http://lists.freebsd.org/pipermail/freebsd-embedded/2012-November/001679.html
>
> It's hard to tune jemalloc's allocation behavior for this in a
> machine-independent way because the minimum chunk size depends on
> PAGE_SIZE and other factors internal to jemalloc. I've created a patch
> that addresses this by defining that lg_chunk:0 is implicitly a request
> to set the chunk size to the smallest value allowable for the machine
> it's running on. The patch is attached to this PR...
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=174641
>
> Jason, could you please review this and consider incorporating it into
> jemalloc? Or let us know if there's a better way to handle this
> situation.

Would it be feasible for jemalloc to initially allocate small blocks (to not over-allocate for small programs and systems with small RAM) and then allocate successively larger blocks as the program requires more memory?
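The growth policy asked about above could look roughly like the following. This is only a sketch of the idea, not how jemalloc sizes its chunks: the function name `next_chunk_size()` and the doubling rule are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical policy: start at min_chunk and double the chunk size each
 * time the program needs another chunk, never exceeding max_chunk.
 * `current` is the size of the most recently allocated chunk (0 if none).
 */
size_t
next_chunk_size(size_t current, size_t min_chunk, size_t max_chunk)
{
    assert(min_chunk > 0 && min_chunk <= max_chunk);

    if (current < min_chunk)
        return (min_chunk);     /* first allocation: start small */
    if (current > max_chunk / 2)
        return (max_chunk);     /* doubling would exceed the cap */
    return (current * 2);       /* safe: current * 2 <= max_chunk */
}
```

With a 4 KB minimum and a 1 MB cap, successive calls yield 4 KB, 8 KB, 16 KB and so on, so a small program never maps more than it has shown it needs, while a large one reaches the cap after a handful of allocations.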
Tim From owner-freebsd-arch@FreeBSD.ORG Sat Dec 22 22:50:28 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9802952C; Sat, 22 Dec 2012 22:50:28 +0000 (UTC) (envelope-from jilles@stack.nl) Received: from mx1.stack.nl (relay02.stack.nl [IPv6:2001:610:1108:5010::104]) by mx1.freebsd.org (Postfix) with ESMTP id 298398FC0A; Sat, 22 Dec 2012 22:50:26 +0000 (UTC) Received: from snail.stack.nl (snail.stack.nl [IPv6:2001:610:1108:5010::131]) by mx1.stack.nl (Postfix) with ESMTP id 5F7793592D9; Sat, 22 Dec 2012 23:50:25 +0100 (CET) Received: by snail.stack.nl (Postfix, from userid 1677) id 479B62848C; Sat, 22 Dec 2012 23:50:25 +0100 (CET) Date: Sat, 22 Dec 2012 23:50:25 +0100 From: Jilles Tjoelker To: Poul-Henning Kamp Subject: Re: API explosion (Re: [RFC/RFT] calloutng) Message-ID: <20121222225025.GA46583@stack.nl> References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <50D192E8.3020704@FreeBSD.org> <15947.1355914806@critter.freebsd.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <15947.1355914806@critter.freebsd.dk> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Davide Italiano , Ian Lepore , Alexander Motin , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 22 Dec 2012 22:50:28 -0000 On Wed, Dec 19, 2012 at 11:00:06AM +0000, Poul-Henning Kamp wrote: > -------- > In message <50D192E8.3020704@FreeBSD.org>, Alexander Motin writes: > >Linux uses 32.32 format 
in their eventtimers code.
> (And that is no accident, I know who they got the number from :-)
> >But if at some point we want to be able to
> >handle absolute wall time, [...]
> Then you have other problems, including but not limited to clock
> being stepped, leap-seconds, suspend/resume and frequency stability.
> If you want to support callouts of the type ("At 14:00 UTC tomorrow")
> (disregarding the time-zone issue), you need to catch all significant
> changes to our UTC estimate and recalibrate your callout based on
> that.
> It is not obvious that we have applications for such an API that
> warrant the complexity.
> Either way, such a facility should be layered on top of the callout
> facility, which should always run in "elapsed time"[1] with no attention
> paid to what NTPD might do to the UTC estimate.

POSIX specifies functions that assume such a facility exists, although applications may not care much if we implement them incorrectly. For example, pthread_mutex_timedlock() and sem_timedwait() shall time out when the CLOCK_REALTIME clock reaches the given value, and pthread_cond_timedwait() and clock_nanosleep() (with TIMER_ABSTIME) shall time out when the specified clock reaches the given value.

> So summary: 32.32 is the right format.
> Poul-Henning
> [1] Notice that "elapsed time" needs a firm definition with respect
> to suspend/resume, and that this decision has big implications
> for the API use and code duplication.
> I think it prudent to specify a flag to callouts, to tell what
> should happen on suspend/resume, something like:
> SR_CANCEL /* Cancel the callout on S/R */
> /* no flag */ /* Toll this callout only when system is running */
> SR_IGNORE /* Toll suspended time from callout */
> If you get this right, callouts from device drivers will just "DTRT",
> if you get it wrong, all device drivers will need boilerplate code
> to handle S/R

Userland could get access to this via CLOCK_REALTIME vs CLOCK_MONOTONIC vs CLOCK_UPTIME.
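To make the POSIX requirement above concrete, here is a small sketch of an absolute-deadline sleep from userland using clock_nanosleep() with TIMER_ABSTIME. The helper names `deadline_after_ms()` and `sleep_until()` are made up for this example. The deadline is expressed on a named clock, so with CLOCK_REALTIME the kernel has to honor it even if the clock is stepped while the caller sleeps.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/*
 * Hypothetical helper: compute an absolute deadline `ms` milliseconds from
 * now on the given clock. Returns 0 on success, -1 if the clock is unusable.
 */
int
deadline_after_ms(clockid_t clock, long ms, struct timespec *out)
{
    assert(ms >= 0 && out != NULL);

    if (clock_gettime(clock, out) != 0)
        return (-1);
    out->tv_sec += ms / 1000;
    out->tv_nsec += (ms % 1000) * 1000000L;
    if (out->tv_nsec >= 1000000000L) {  /* carry into seconds */
        out->tv_sec += 1;
        out->tv_nsec -= 1000000000L;
    }
    return (0);
}

/*
 * Hypothetical helper: sleep until `clock` reaches `deadline`.
 * clock_nanosleep() returns the error number directly (it does not set
 * errno), so restart on EINTR and pass any other value to the caller.
 */
int
sleep_until(clockid_t clock, const struct timespec *deadline)
{
    int error;

    do {
        error = clock_nanosleep(clock, TIMER_ABSTIME, deadline, NULL);
    } while (error == EINTR);
    return (error);
}
```

Because the deadline is absolute, restarting after EINTR needs no recomputation of the remaining time, which is the main practical advantage over a relative sleep.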
-- Jilles Tjoelker From owner-freebsd-arch@FreeBSD.ORG Sat Dec 22 22:58:02 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A4BC6837; Sat, 22 Dec 2012 22:58:02 +0000 (UTC) (envelope-from phk@phk.freebsd.dk) Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222]) by mx1.freebsd.org (Postfix) with ESMTP id 522EA8FC0C; Sat, 22 Dec 2012 22:58:02 +0000 (UTC) Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3]) by phk.freebsd.dk (Postfix) with ESMTP id 1EFD88A512; Sat, 22 Dec 2012 22:57:55 +0000 (UTC) Received: from critter.freebsd.dk (localhost [127.0.0.1]) by critter.freebsd.dk (8.14.5/8.14.5) with ESMTP id qBMMvsts002072; Sat, 22 Dec 2012 22:57:54 GMT (envelope-from phk@phk.freebsd.dk) To: Jilles Tjoelker Subject: Re: API explosion (Re: [RFC/RFT] calloutng) In-reply-to: <20121222225025.GA46583@stack.nl> From: "Poul-Henning Kamp" References: <50CF88B9.6040004@FreeBSD.org> <20121218173643.GA94266@onelab2.iet.unipi.it> <50D0B00D.8090002@FreeBSD.org> <50D0E42B.6030605@FreeBSD.org> <20121218225823.GA96962@onelab2.iet.unipi.it> <1355873265.1198.183.camel@revolution.hippie.lan> <14604.1355910848@critter.freebsd.dk> <50D192E8.3020704@FreeBSD.org> <15947.1355914806@critter.freebsd.dk> <20121222225025.GA46583@stack.nl> Date: Sat, 22 Dec 2012 22:57:54 +0000 Message-ID: <2071.1356217074@critter.freebsd.dk> Cc: Davide Italiano , Ian Lepore , Alexander Motin , freebsd-current , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 22 Dec 2012 22:58:02 -0000 -------- In message <20121222225025.GA46583@stack.nl>, Jilles Tjoelker writes: >> Either way, such a facility should be layered on top of the callout >> facility, which should always run 
in "elapsed time"[1] with no attention
>> paid to what NTPD might do to the UTC estimate.
>
>POSIX specifies functions that assume such a facility exists, although
>applications may not care much if we implement them incorrectly.

It should still be implemented on top of callouts, not as part of them: it is an entirely different thing to try to do right.

>> I think it prudent to specify a flag to callouts, to tell what
>> should happen on suspend/resume, something like:
>
>> SR_CANCEL /* Cancel the callout on S/R */
>> /* no flag */ /* Toll this callout only when system is running */
>> SR_IGNORE /* Toll suspended time from callout */
>
>> If you get this right, callouts from device drivers will just "DTRT",
>> if you get it wrong, all device drivers will need boilerplate code
>> to handle S/R
>
>Userland could get access to this via CLOCK_REALTIME vs CLOCK_MONOTONIC
>vs CLOCK_UPTIME.

I have _no_ idea what you are trying to say here...

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
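One possible reading of the suspend/resume flags quoted above, combined with the 32.32 time format, is sketched below. Only the names SR_CANCEL and SR_IGNORE come from the thread. The type `fix32_t`, the value `SR_DEFAULT` (standing in for "no flag"), the function names, and the interpretation that SR_IGNORE charges the suspended time against the callout are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * 32.32 fixed point: the high 32 bits count whole seconds, the low 32 bits
 * are a binary fraction of a second. Hypothetical type name.
 */
typedef int64_t fix32_t;
#define FIX32_ONE ((fix32_t)1 << 32)    /* exactly one second */

/* Convert milliseconds to 32.32, rounding the fraction toward zero. */
static inline fix32_t
fix32_from_ms(int64_t ms)
{
    return ((ms / 1000) * FIX32_ONE + ((ms % 1000) * FIX32_ONE) / 1000);
}

enum sr_policy {
    SR_DEFAULT, /* "no flag": only time spent running counts */
    SR_CANCEL,  /* cancel the callout on suspend/resume */
    SR_IGNORE   /* time spent suspended counts as elapsed */
};

/*
 * Adjust the time left on a callout after the system was suspended for
 * `suspended`. Returns false if the callout should be cancelled, true if
 * it stays armed with *remaining updated in place.
 */
bool
callout_adjust_on_resume(enum sr_policy policy, fix32_t suspended,
    fix32_t *remaining)
{
    assert(suspended >= 0 && *remaining >= 0);

    switch (policy) {
    case SR_CANCEL:
        return (false);
    case SR_IGNORE:
        /* Charge the suspended time; fire immediately if overdue. */
        *remaining = (*remaining > suspended) ?
            *remaining - suspended : 0;
        return (true);
    case SR_DEFAULT:
    default:
        return (true);  /* suspended time is not counted */
    }
}
```

If the policy lives in the callout layer like this, a driver states its intent once when arming the callout and needs no suspend/resume boilerplate of its own, which is the point made in the quoted footnote.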
From owner-freebsd-arch@FreeBSD.ORG Sat Dec 22 23:04:39 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 43D96B05; Sat, 22 Dec 2012 23:04:39 +0000 (UTC) (envelope-from freebsd@damnhippie.dyndns.org) Received: from duck.symmetricom.us (duck.symmetricom.us [206.168.13.214]) by mx1.freebsd.org (Postfix) with ESMTP id C11D08FC0A; Sat, 22 Dec 2012 23:04:38 +0000 (UTC) Received: from damnhippie.dyndns.org (daffy.symmetricom.us [206.168.13.218]) by duck.symmetricom.us (8.14.5/8.14.5) with ESMTP id qBMN4boL054862; Sat, 22 Dec 2012 16:04:37 -0700 (MST) (envelope-from freebsd@damnhippie.dyndns.org) Received: from [172.22.42.240] (revolution.hippie.lan [172.22.42.240]) by damnhippie.dyndns.org (8.14.3/8.14.3) with ESMTP id qBMN4YqS069151; Sat, 22 Dec 2012 16:04:34 -0700 (MST) (envelope-from freebsd@damnhippie.dyndns.org) Subject: Re: jemalloc enhancement for small-memory systems From: Ian Lepore To: Tim Kientzle In-Reply-To: <75ECE5AB-9276-44BA-84D7-56EF6BDC3984@kientzle.com> References: <1356204505.1129.21.camel@revolution.hippie.lan> <75ECE5AB-9276-44BA-84D7-56EF6BDC3984@kientzle.com> Content-Type: text/plain; charset="us-ascii" Date: Sat, 22 Dec 2012 16:04:34 -0700 Message-ID: <1356217474.1129.40.camel@revolution.hippie.lan> Mime-Version: 1.0 X-Mailer: Evolution 2.32.1 FreeBSD GNOME Team Port Content-Transfer-Encoding: 7bit Cc: Jason Evans , freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 22 Dec 2012 23:04:39 -0000 On Sat, 2012-12-22 at 14:40 -0800, Tim Kientzle wrote: > On Dec 22, 2012, at 11:28 AM, Ian Lepore wrote: > > > When a daemon such as watchdogd uses mlockall(2) on a small-memory > > embedded system, it can end up wiring much of the available 
ram because
> > jemalloc allocates large chunks of vmspace by default. More background
> > info on this can be found in this thread:
> >
> > http://lists.freebsd.org/pipermail/freebsd-embedded/2012-November/001679.html
> >
> > It's hard to tune jemalloc's allocation behavior for this in a
> > machine-independent way because the minimum chunk size depends on
> > PAGE_SIZE and other factors internal to jemalloc. I've created a patch
> > that addresses this by defining that lg_chunk:0 is implicitly a request
> > to set the chunk size to the smallest value allowable for the machine
> > it's running on. The patch is attached to this PR...
> >
> > http://www.freebsd.org/cgi/query-pr.cgi?pr=174641
> >
> > Jason, could you please review this and consider incorporating it into
> > jemalloc? Or let us know if there's a better way to handle this
> > situation.
>
> Would it be feasible for jemalloc to initially allocate
> small blocks (to not over-allocate for small programs and
> systems with small RAM) and then allocate successively
> larger blocks as the program requires more memory?
>
> Tim

It might be nice if it used sysconf(3) to see how much memory is available on the system and auto-tune accordingly. If the machine only has 32 MB it's probably not useful to allocate in 8 MB chunks.

On the other hand, since it's normally only allocating virtual address space, over-allocating should be harmless. It's the addition of mlockall() that makes it problematic, so it's not too onerous to require a program using mlockall() to specifically tune its use of jemalloc.

-- Ian
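The sysconf(3) idea can be sketched as follows. This is a rough illustration under stated assumptions, not a proposal for jemalloc's actual tuning: `_SC_PHYS_PAGES` is a widely available extension rather than a POSIX name, and the "no chunk larger than 1/64 of RAM" rule, the bounds passed in, and both function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Physical memory in bytes, or 0 if the system will not tell us. */
uint64_t
physical_memory_bytes(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);   /* extension, not POSIX */
    long page_size = sysconf(_SC_PAGESIZE);

    if (pages <= 0 || page_size <= 0)
        return (0);
    return ((uint64_t)pages * (uint64_t)page_size);
}

/*
 * Hypothetical heuristic: pick the largest lg2 chunk size in
 * [min_lg, max_lg] such that one chunk is at most 1/64 of RAM,
 * falling back to min_lg on very small (or unknown) memory sizes.
 */
unsigned
pick_lg_chunk(uint64_t ram_bytes, unsigned min_lg, unsigned max_lg)
{
    unsigned lg = max_lg;

    assert(min_lg <= max_lg && max_lg < 64);
    while (lg > min_lg && ((uint64_t)1 << lg) > ram_bytes / 64)
        lg--;
    return (lg);
}
```

With these made-up numbers, a 32 MB machine offered a range of 2^14 to 2^23 bytes ends up with 512 KB chunks (lg 19) instead of 8 MB ones, while a machine with many gigabytes keeps the upper bound.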