From owner-freebsd-arch@FreeBSD.ORG  Sun Nov 29 23:44:42 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1A3261065676
	for <arch@freebsd.org>; Sun, 29 Nov 2009 23:44:42 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.freebsd.org (Postfix) with ESMTP id 7B73C8FC0A
	for <arch@freebsd.org>; Sun, 29 Nov 2009 23:44:41 +0000 (UTC)
Received: from [IPv6:::1] (pooker.samsco.org [168.103.85.57])
	(authenticated bits=0)
	by pooker.samsco.org (8.14.2/8.14.2) with ESMTP id nATNRnnW057251;
	Sun, 29 Nov 2009 16:27:49 -0700 (MST)
	(envelope-from scottl@samsco.org)
Mime-Version: 1.0 (Apple Message framework v1076)
From: Scott Long <scottl@samsco.org>
In-Reply-To: <3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com>
Date: Sun, 29 Nov 2009 16:27:49 -0700
Message-Id: <66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>
References: <200905191458.50764.jhb@freebsd.org>
	<alpine.BSF.2.00.0905200841230.981@desktop>
	<200905201522.58501.jhb@freebsd.org>
	<3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com>
To: John Baldwin <jhb@freebsd.org>
X-Mailer: Apple Mail (2.1076)
X-Spam-Status: No, score=-4.4 required=3.8 tests=ALL_TRUSTED,AWL,BAYES_00,
	HTML_MESSAGE autolearn=ham version=3.1.8
X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on pooker.samsco.org
Content-Type: text/plain;
	charset=us-ascii;
	format=flowed;
	delsp=yes
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: Attilio Rao <attilio@freebsd.org>, arch@freebsd.org
Subject: Re: sglist(9)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 29 Nov 2009 23:44:42 -0000

John,

Sorry for the late reply on this.  Attilio approached me recently  
about moving busdma and storage to sglists; up until then I had  
largely ignored this conversation because I thought that it was only  
about the nvidia driver.

> On Wednesday 20 May 2009 2:49:30 pm Jeff Roberson wrote:
>> On Tue, 19 May 2009, John Baldwin wrote:
>>
>> 2) I worry that if all users do sglist_count() followed by a dynamic
>> allocation and then an _append() they will be very expensive.
>> pmap_kextract() is much more expensive than it may first seem to  
>> be.  Do
>> you have a user of count already?
>
> The only one that does now is sglist_build() and nothing currently  
> uses that.

Kinda silly to have it then.  I also don't see the point of it; if the  
point of the sglist object is to avoid VA mappings, then why start  
with a VA mapping?  But that aside, Jeff is correct, sglist_build() is  
horribly inefficient.

> VOP_GET/PUTPAGES would not need to do this since they could simply  
> append
> the physical addresses extracted directly from vm_page_t's for  
> example.  I'm
> not sure this will be used very much now as originally I thought I  
> would be
> changing all storage drivers to do all DMA operations using sglists  
> and this
> sort of thing would have been used for non-bio requests like firmware
> commands; however, as expounded on below, it actually appears better  
> to
> still treat bio's separate from non-bio requests for bus_dma so that  
> the
> non-bio requests can continue to use bus_dmamap_load_buffer() as  
> they do
> now.
>

I completely disagree with your approach to busdma, but I'll get into  
that later.  What I really don't understand is, why have yet another  
page description format?  Whether the I/O is disk buffers being pushed  
by a pager to a disk controller, or texture buffers being pushed to  
video hardware, they already have vm objects associated with them,  
no?  Why translate to an intermediate format?  I understand that  
you're creating another vm object format to deal directly with the  
needs of nvidia, but that's really a one-off case right now.  What  
about when we want the pagedaemon to push unmapped i/o?  Will it have  
to spend cycles translating its objects to sglists?  This is really a  
larger question that I'm not prepared to answer, but would like to  
discuss.

>> 3) Rather than having sg_segs be an actual pointer, did you consider
>> making it an unsized array?  This removes the overhead of one  
>> pointer from
>> the structure while enforcing that it's always contiguously  
>> allocated.
>
> It's actually a feature to be able to have the header in separate  
> storage from
> segs array.  I use this in the jhb_bio branch in the bus_dma  
> implementations
> where a pre-allocated segs array is stored in the bus dma tag and  
> the header
> is allocated on the stack.
>

I'd like to take this one step further.  Instead of sglists being  
exactly sized, I'd like to see them be much more like mbufs, with a  
header and static storage, maybe somewhere between 128b and 1k in  
total size.  Then they can be allocated and managed in pools, and  
chained together to make for easy appending, splitting, and growing.   
Offset pointers can be stored in the header instead of externally.   
Also, there are a lot of failure points in the API related to the  
sglist object being too small.  Those need to be fixed.
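
A rough userland sketch of the mbuf-like layout described above.  Every
name here (sgc_chunk, sgc_append_phys, SGC_NSEGS) is hypothetical and
the segment count is a placeholder; calloc() stands in for whatever
pool allocator the kernel would really use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SGC_NSEGS 6	/* hypothetical; the right count needs research */

struct sgc_seg {
	uint64_t ss_paddr;	/* physical address of the segment */
	size_t   ss_len;	/* length in bytes */
};

/* One fixed-size link in a chain, loosely modelled on an mbuf. */
struct sgc_chunk {
	struct sgc_chunk *sc_next;	/* next link, or NULL */
	size_t   sc_offset;		/* bytes to skip in the first segment */
	size_t   sc_length;		/* bytes described by this link */
	unsigned sc_nseg;		/* segments in use */
	struct sgc_seg sc_segs[SGC_NSEGS];
};

/*
 * Append a physical range to the chain ending at *tailp.  The range is
 * merged into the last segment when it is physically contiguous with
 * it.  When the current link is full a new one is chained on, so the
 * only failure left is running out of memory.
 */
static int
sgc_append_phys(struct sgc_chunk **tailp, uint64_t paddr, size_t len)
{
	struct sgc_chunk *c = *tailp;
	struct sgc_seg *last;

	assert(c != NULL && len > 0);
	if (c->sc_nseg > 0) {
		last = &c->sc_segs[c->sc_nseg - 1];
		if (last->ss_paddr + last->ss_len == paddr) {
			last->ss_len += len;
			c->sc_length += len;
			return (0);
		}
	}
	if (c->sc_nseg == SGC_NSEGS) {
		struct sgc_chunk *n = calloc(1, sizeof(*n));

		if (n == NULL)
			return (-1);
		c->sc_next = n;
		*tailp = c = n;
	}
	c->sc_segs[c->sc_nseg].ss_paddr = paddr;
	c->sc_segs[c->sc_nseg].ss_len = len;
	c->sc_nseg++;
	c->sc_length += len;
	return (0);
}
```

Because each link records its own length, a consumer can split a chain
at a link boundary without copying any segments.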

>> 4) SGLIST_INIT might be better off as an inline, and may not even  
>> belong
>> in the header file.
>
> That may be true.  I currently only use it in the jhb_bio branch for  
> the
> bus_dma implementations.
>
>> In general I think this is a good idea.  It'd be nice to work on  
>> replacing
>> the buf layer's implementation with something like this that could  
>> be used
>> directly by drivers.  Have you considered a busdma operation to  
>> load from
>> a sglist?
>
> So in regards to the bus_dma stuff, I did work on this a while ago  
> in my
> jhb_bio branch.  I do have a bus_dmamap_load_sglist() and I had  
> planned on
> using that in storage drivers directly.  However, I ended up  
> circling back
> to preferring a bus_dmamap_load_bio() and adding a new 'bio_start'  
> field
> to 'struct bio' that is an offset into an attached sglist.

I strongly disagree with forcing busdma to have intimate knowledge of  
bio's.  All of the needed information can be stored in sglist headers.

>  This let me
> carve up I/O requests in geom_dev to satisfy a disk device's max  
> request
> size while still sharing the same read-only sglist across the various
> BIO's (by simply adjusting bio_length and bio_start to be a subrange  
> of
> the sglist) as opposed to doing memory allocations to allocate  
> specific
> ranges of an sglist (using something like sglist_slice()) for each I/O
> request.

I think this is fundamentally wrong.  You're proposing exchanging a  
cheap operation of splitting VA's with an expensive operation of  
allocating, splitting, copying, and refcounting sglists.  Splitting is  
an excessively common operation, and your proposal will impact  
performance as storage becomes exponentially faster.

We need to stop thinking about maxio as a roadbump at the bottom of  
the storage stack, and instead think of it as a fundamental attribute  
that is honored at the top when a BIO is created.  Instead of loading  
up an sglist with all of the pages (and don't forget coalesced pages  
that might need to be broken up), maybe multiple bio's are created  
that honor maxio from the start, or a single bio with a chained  
sglist, with each chain link honoring maxio, allowing for easy  
splitting.
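
To make the "honor maxio at the top" idea concrete, here is a minimal
sketch that cuts a request into maxio-sized pieces when it is created.
The names (io_range, io_split) are made up for illustration; a real
version would emit bio's or sglist chain links rather than plain
offset/length pairs.

```c
#include <assert.h>
#include <stddef.h>

struct io_range {
	size_t ir_offset;	/* byte offset into the original request */
	size_t ir_length;	/* bytes in this piece, never above maxio */
};

/*
 * Cut a request of `total` bytes into pieces no larger than `maxio`.
 * Fills at most `nout` entries and returns the number of pieces the
 * request needs, so a caller can detect an array that was too short.
 */
static size_t
io_split(size_t total, size_t maxio, struct io_range *out, size_t nout)
{
	size_t off = 0, n = 0;

	assert(maxio > 0);
	while (off < total) {
		size_t len = total - off;

		if (len > maxio)
			len = maxio;
		if (n < nout) {
			out[n].ir_offset = off;
			out[n].ir_length = len;
		}
		off += len;
		n++;
	}
	return (n);
}
```

With a 128KB maxio, a 300KB request comes out as pieces of 128KB,
128KB, and 44KB, and nothing further down the stack has to split it
again.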

>  I then have bus_dmamap_load_bio() use the subrange of the
> sglist internally or fall back to using the KVA pointer if the sglist
> isn't present.

I completely disagree.  Drivers already deal with the details of  
bio's, and should continue to do so.  If a driver gets a bio that has  
a valid bio_data pointer, it should call bus_dmamap_load().  If it  
gets one with a valid sglist, it should call bus_dmamap_load_sglist().  
Your proposal means that every storage driver in the system will have  
to change to use bus_dmamap_load_bio().  It's not a big change,  
but it's disruptive both in the tree and out.  Your proposal also  
implies that CAM will have to start carrying BIO's in CCBs and passing  
them to their SIMs.  I absolutely disagree with this.
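
A sketch of the driver-side dispatch argued for above, as a userland
model.  The xbio structure, the XBIO_SGLIST flag, and the two stub
loaders are all hypothetical stand-ins; they do not reproduce the real
struct bio layout or the real bus_dmamap_load() and
bus_dmamap_load_sglist() signatures.

```c
#include <assert.h>
#include <stddef.h>

#define XBIO_SGLIST 0x01	/* hypothetical flag: request carries an sglist */

struct xsglist {		/* stand-in for struct sglist */
	int xs_dummy;
};

struct xbio {			/* stand-in for the relevant bits of struct bio */
	int    xb_flags;
	void  *xb_data;		/* KVA buffer, used when XBIO_SGLIST is clear */
	struct xsglist *xb_sglist;	/* used when XBIO_SGLIST is set */
	size_t xb_length;
};

enum load_path { LOAD_NONE, LOAD_VA, LOAD_SGLIST };

/* Stub standing in for the existing VA-based busdma load. */
static enum load_path
stub_load_va(void *buf, size_t len)
{
	(void)buf;
	(void)len;
	return (LOAD_VA);
}

/* Stub standing in for the proposed sglist-based busdma load. */
static enum load_path
stub_load_sglist(struct xsglist *sg)
{
	(void)sg;
	return (LOAD_SGLIST);
}

/*
 * The driver inspects the request and picks the loader itself, so the
 * DMA layer never needs to know what a bio is.
 */
static enum load_path
xdrv_start(const struct xbio *bp)
{
	if ((bp->xb_flags & XBIO_SGLIST) != 0 && bp->xb_sglist != NULL)
		return (stub_load_sglist(bp->xb_sglist));
	if (bp->xb_data != NULL)
		return (stub_load_va(bp->xb_data, bp->xb_length));
	return (LOAD_NONE);	/* malformed request */
}
```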

If we keep unneeded complications out of busdma, we avoid a lot of  
churn.  We also leave the busdma interface available for other forms  
of I/O without requiring more specific API additions to accommodate  
them.  What about unmapped network i/o coming from something like  
sendfile?

>
> However, I'm not really trying to get the bio stuff into the tree,  
> this
> is mostly for the Nvidia case and for that use case the driver is  
> simply
> creating simple single-entry lists and using sglist_append_phys().
>

Designing the whole API around a single driver that we can't even get  
the source to makes it hard to evaluate the API.

Attilio and I have spoken about this in private and will begin work on  
a prototype.  Here is the outline of what we're going to do:

1. Change struct sglist as follows:
   a. Uniform size
   b. Determine an optimal number of elements to include in the size  
(waving my hands here, more research is needed).
   c. Chain, offset, and length pointers, very similar to how mbufs  
already work
2.  Expand the sglist API so that I/O producers can allocate slabs of  
sglists and slice them up into pools that they can manage and own
3.  Add an sglist field to struct bio, and add appropriate flags to  
identify VA vs sglist operation
4.  Extend the CAM_DATA_PHYS attributes in CAM to handle sglists.
5.  Add bus_dmamap_load_sglist().  This will be able to walk chains  
and combine, split, and reassign segments as needed.
6.  Modify a select number of drivers to use it.
7.  Add a flag to disk->d_flags to signal if a driver can handle  
sglists.  Have geom_dev look at this flag and generate a  
kmem_alloc_nofault+pmap_kenter sequence for drivers that can't support  
it.
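
Item 7 can be sketched as below, again as a userland model.  The flag
name XDISKFLAG_SGLIST and both structures are hypothetical, and the
mapping step is reduced to flipping two booleans where the kernel
would run the kmem_alloc_nofault+pmap_kenter sequence.

```c
#include <assert.h>
#include <stdbool.h>

#define XDISKFLAG_SGLIST 0x1000	/* hypothetical d_flags bit */

struct xdisk {
	unsigned d_flags;
};

struct xreq {
	bool has_sglist;	/* request describes unmapped pages */
	bool has_kva;		/* request has a kernel virtual mapping */
};

/*
 * Decide whether a request can go to the driver as is.  Returns true
 * when a temporary kernel mapping had to be created because the
 * driver did not advertise sglist support.
 */
static bool
gdev_prepare(const struct xdisk *dp, struct xreq *rq)
{
	if (!rq->has_sglist || (dp->d_flags & XDISKFLAG_SGLIST) != 0)
		return (false);		/* pass through untouched */

	/* Stand-in for mapping the pages into KVA. */
	rq->has_kva = true;
	rq->has_sglist = false;
	return (true);
}
```

Only drivers without the flag pay for the mapping; converted drivers
see the sglist unchanged.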

In the end, no drivers will need to change, but the ones that do  
change will obviously benefit.  We're going to prototype this with an  
i/o source that starts unmapped (via the Xen blkback driver).  The  
downside is that most GEOM transforms that need to touch the data  
won't work, but that's something that can be addressed once the  
prototype is done and evaluated.

Scott


From owner-freebsd-arch@FreeBSD.ORG  Mon Nov 30 00:05:59 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 26201106568D
	for <arch@freebsd.org>; Mon, 30 Nov 2009 00:05:59 +0000 (UTC)
	(envelope-from julian@elischer.org)
Received: from outW.internet-mail-service.net (outw.internet-mail-service.net
	[216.240.47.246])
	by mx1.freebsd.org (Postfix) with ESMTP id 094C98FC24
	for <arch@freebsd.org>; Mon, 30 Nov 2009 00:05:58 +0000 (UTC)
Received: from idiom.com (mx0.idiom.com [216.240.32.160])
	by out.internet-mail-service.net (Postfix) with ESMTP id 0A4012DA6E;
	Sun, 29 Nov 2009 16:05:59 -0800 (PST)
X-Client-Authorized: MaGic Cook1e
Received: from julian-mac.elischer.org
	(h-67-100-89-137.snfccasy.static.covad.net [67.100.89.137])
	by idiom.com (Postfix) with ESMTP id 1F6202D6016;
	Sun, 29 Nov 2009 16:05:57 -0800 (PST)
Message-ID: <4B130C6A.70406@elischer.org>
Date: Sun, 29 Nov 2009 16:06:02 -0800
From: Julian Elischer <julian@elischer.org>
User-Agent: Thunderbird 2.0.0.23 (Macintosh/20090812)
MIME-Version: 1.0
To: Scott Long <scottl@samsco.org>
References: <200905191458.50764.jhb@freebsd.org>	<alpine.BSF.2.00.0905200841230.981@desktop>	<200905201522.58501.jhb@freebsd.org>	<3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com>
	<66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>
In-Reply-To: <66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Attilio Rao <attilio@freebsd.org>, arch@freebsd.org
Subject: Re: sglist(9)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 30 Nov 2009 00:05:59 -0000

Scott Long wrote:

> I think this is fundamentally wrong.  You're proposing exchanging a 
> cheap operation of splitting VA's with an expensive operation of 
> allocating, splitting, copying, and refcounting sglists.  Splitting is 
> an excessively common operation, and your proposal will impact 
> performance as storage becomes exponentially faster.
> 

 From the perspective of a flashdrive driver the more
efficient the better. The current generation of devices is
doing 800MB/sec (6.4Gb/sec) of scatter-gather random IO
and really that will only go up. We are doing over 130,000 independent
transactions per second and we can put multiple drives in a single
machine.

These numbers will only increase with future developments.

From owner-freebsd-arch@FreeBSD.ORG  Mon Nov 30 00:41:33 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1F3B7106566B;
	Mon, 30 Nov 2009 00:41:33 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.freebsd.org (Postfix) with ESMTP id C40198FC17;
	Mon, 30 Nov 2009 00:41:32 +0000 (UTC)
Received: from [IPv6:::1] (pooker.samsco.org [168.103.85.57])
	(authenticated bits=0)
	by pooker.samsco.org (8.14.2/8.14.2) with ESMTP id nAU0eD16057491;
	Sun, 29 Nov 2009 17:40:13 -0700 (MST)
	(envelope-from scottl@samsco.org)
Mime-Version: 1.0 (Apple Message framework v1076)
Content-Type: text/plain; charset=us-ascii; format=flowed; delsp=yes
From: Scott Long <scottl@samsco.org>
In-Reply-To: <4B130C6A.70406@elischer.org>
Date: Sun, 29 Nov 2009 17:40:12 -0700
Content-Transfer-Encoding: 7bit
Message-Id: <F39A82E9-36B0-40F1-B3DA-08843A5799F3@samsco.org>
References: <200905191458.50764.jhb@freebsd.org>	<alpine.BSF.2.00.0905200841230.981@desktop>	<200905201522.58501.jhb@freebsd.org>	<3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com>
	<66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>
	<4B130C6A.70406@elischer.org>
To: Julian Elischer <julian@elischer.org>
X-Mailer: Apple Mail (2.1076)
X-Spam-Status: No, score=-4.5 required=3.8 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.1.8
X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on pooker.samsco.org
Cc: Attilio Rao <attilio@freebsd.org>, arch@freebsd.org
Subject: Re: sglist(9)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 30 Nov 2009 00:41:33 -0000

On Nov 29, 2009, at 5:06 PM, Julian Elischer wrote:
> Scott Long wrote:
>
>> I think this is fundamentally wrong.  You're proposing exchanging a  
>> cheap operation of splitting VA's with an expensive operation of  
>> allocating, splitting, copying, and refcounting sglists.  Splitting  
>> is an excessively common operation, and your proposal will impact  
>> performance as storage becomes exponentially faster.
>
> From the perspective of a flashdrive driver the more
> efficient the better. The current generation of devices is
> doing 800MB/sec (6.4Gb/sec) of scatter-gather random IO
> and really that will only go up. We are doing over 130,000 independent
> transactions per second and we can put multiple drives in a single
> machine.
>
> These numbers will only increase with future developments.

MB/s doesn't tell me much other than the memory bandwidth of the  
pathways (and that the DMA engines involved don't completely suck).   
What about transactions/sec?  That tells me a lot more about the  
efficiency of the OS, drivers, and firmware, as well as latency.

Scott


From owner-freebsd-arch@FreeBSD.ORG  Mon Nov 30 00:47:15 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4CAB21065676;
	Mon, 30 Nov 2009 00:47:15 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.freebsd.org (Postfix) with ESMTP id E8A308FC0A;
	Mon, 30 Nov 2009 00:47:14 +0000 (UTC)
Received: from [IPv6:::1] (pooker.samsco.org [168.103.85.57])
	(authenticated bits=0)
	by pooker.samsco.org (8.14.2/8.14.2) with ESMTP id nAU0l8jm057531;
	Sun, 29 Nov 2009 17:47:08 -0700 (MST)
	(envelope-from scottl@samsco.org)
Mime-Version: 1.0 (Apple Message framework v1076)
From: Scott Long <scottl@samsco.org>
In-Reply-To: <F39A82E9-36B0-40F1-B3DA-08843A5799F3@samsco.org>
Date: Sun, 29 Nov 2009 17:47:08 -0700
Message-Id: <615AB9D0-7171-4FE1-BE38-74E6FA7FE93A@samsco.org>
References: <200905191458.50764.jhb@freebsd.org>	<alpine.BSF.2.00.0905200841230.981@desktop>	<200905201522.58501.jhb@freebsd.org>	<3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com>
	<66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>
	<4B130C6A.70406@elischer.org>
	<F39A82E9-36B0-40F1-B3DA-08843A5799F3@samsco.org>
To: Scott Long <scottl@samsco.org>
X-Mailer: Apple Mail (2.1076)
X-Spam-Status: No, score=-4.3 required=3.8 tests=ALL_TRUSTED,AWL,BAYES_00,
	HTML_40_50,HTML_MESSAGE autolearn=ham version=3.1.8
X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on pooker.samsco.org
Content-Type: text/plain;
	charset=us-ascii;
	format=flowed;
	delsp=yes
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: Attilio Rao <attilio@freebsd.org>, arch@freebsd.org,
	Julian Elischer <julian@elischer.org>
Subject: Re: sglist(9)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 30 Nov 2009 00:47:15 -0000

On Nov 29, 2009, at 5:40 PM, Scott Long wrote:
> On Nov 29, 2009, at 5:06 PM, Julian Elischer wrote:
>> Scott Long wrote:
>>
>>> I think this is fundamentally wrong.  You're proposing exchanging  
>>> a cheap operation of splitting VA's with an expensive operation of  
>>> allocating, splitting, copying, and refcounting sglists.   
>>> Splitting is an excessively common operation, and your proposal  
>>> will impact performance as storage becomes exponentially faster.
>>
>> From the perspective of a flashdrive driver the more
>> efficient the better. The current generation of devices is
>> doing 800MB/sec (6.4Gb/sec) of scatter-gather random IO
>> and really that will only go up. We are doing over 130,000  
>> independent
>> transactions per second and we can put multiple drives in a single
>> machine.
>>
>> These numbers will only increase with future developments.
>
> MB/s doesn't tell me much other than the memory bandwidth of the  
> pathways (and that the DMA engines involved don't completely  
> suck).  What about transactions/sec?  That tells me a lot more about  
> the efficiency of the OS, drivers, and firmware, as well as latency.
>
>

Bah, the answer was right in front of me, sorry =-)  130k is impressive.

Scott


From owner-freebsd-arch@FreeBSD.ORG  Mon Nov 30 11:06:48 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8BCDB106568B
	for <freebsd-arch@FreeBSD.org>; Mon, 30 Nov 2009 11:06:48 +0000 (UTC)
	(envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 5FAA58FC29
	for <freebsd-arch@FreeBSD.org>; Mon, 30 Nov 2009 11:06:48 +0000 (UTC)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.3/8.14.3) with ESMTP id nAUB6m6S043353
	for <freebsd-arch@FreeBSD.org>; Mon, 30 Nov 2009 11:06:48 GMT
	(envelope-from owner-bugmaster@FreeBSD.org)
Received: (from gnats@localhost)
	by freefall.freebsd.org (8.14.3/8.14.3/Submit) id nAUB6lTq043351
	for freebsd-arch@FreeBSD.org; Mon, 30 Nov 2009 11:06:47 GMT
	(envelope-from owner-bugmaster@FreeBSD.org)
Date: Mon, 30 Nov 2009 11:06:47 GMT
Message-Id: <200911301106.nAUB6lTq043351@freefall.freebsd.org>
X-Authentication-Warning: freefall.freebsd.org: gnats set sender to
	owner-bugmaster@FreeBSD.org using -f
From: FreeBSD bugmaster <bugmaster@FreeBSD.org>
To: freebsd-arch@FreeBSD.org
Cc: 
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 30 Nov 2009 11:06:48 -0000

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.


S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.


From owner-freebsd-arch@FreeBSD.ORG  Mon Nov 30 19:18:06 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 9A90A106568D;
	Mon, 30 Nov 2009 19:18:06 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 716438FC08;
	Mon, 30 Nov 2009 19:18:06 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id 2AF2846B06;
	Mon, 30 Nov 2009 14:18:06 -0500 (EST)
Received: from jhbbsd.localnet (unknown [209.249.190.9])
	by bigwig.baldwin.cx (Postfix) with ESMTPA id B114B8A024;
	Mon, 30 Nov 2009 14:18:04 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: Attilio Rao <attilio@freebsd.org>
Date: Mon, 30 Nov 2009 13:05:30 -0500
User-Agent: KMail/1.12.1 (FreeBSD/7.2-CBSD-20091103; KDE/4.3.1; amd64; ; )
References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com>
In-Reply-To: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit
Message-Id: <200911301305.30572.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(bigwig.baldwin.cx); Mon, 30 Nov 2009 14:18:05 -0500 (EST)
X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-2.5 required=4.2 tests=AWL,BAYES_00,RDNS_NONE
	autolearn=no version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx
Cc: FreeBSD Arch <arch@freebsd.org>, Ed Maste <emaste@freebsd.org>
Subject: Re: [PATCH] Statclock aliasing by LAPIC
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 30 Nov 2009 19:18:06 -0000

On Friday 27 November 2009 6:42:50 pm Attilio Rao wrote:
> Handling all three clocks (hardclock, statclock, profclock) within
> the LAPIC can lead to aliasing for the statclock and profclock because
> hz is sized to fit mainly hardclock. Handling all of them within the
> LAPIC was introduced in 2005; before then, the statclock and
> profclock were supposed to be handled in the rtc. Even now, the
> support needed to handle profclock and statclock in atrtc is still
> there, and it gets enabled if the LAPIC signals that it can't take
> charge of the three clocks.
> The proposed patch reverses the situation, preferring the atrtc to
> handle the statclock and profclock (that is, a different source from
> the LAPIC), and so avoids the aliasing problem:
> 
http://www.freebsd.org/~attilio/Sandvine/STABLE_8/statclock_aliasing/statclock_aliasing3.diff
> 
> In this patch, lapic_setup_clock() has been changed to return a
> three-state value that identifies whether the LAPIC took charge of
> all three clocks, just the hardclock, or none of them (the current
> situation is just none/all); the rtc handling runs afterwards.
> A tunable has been added to force the LAPIC to take charge of all
> three clocks if needed.
> On ia32, atrtc compilation is tied to atpic compilation, so a
> compile-time flag has been added to check whether atpic is absent
> and, in that case, force the LAPIC to take charge of all three clocks
> (which is still better than the 'safe belt values' still present in
> the rtc code).
> 
> Please note that statclock and profclock are widely used in our kernel
> (rusage, for example, is statclock driven), and fixing this would
> bring specific improvements (such as correcting the frequently
> reported wrong CPU usage statistics in top).
> This bug has been found by Sandvine Incorporated.
> 
> Reviews, comments and testing are welcome.

Presumably in the RTC case lapic_timer_hz should always be hz and not some 
multiple of hz.  Also, did you check to make sure all the locking is present?  I 
think at one point I changed the locking for the RTC and/or ISA timer to just 
use critical_enter/exit so that UP machines running an SMP kernel wouldn't pay 
the locking overhead since the code was only used on UP machines.  It may very 
well be that I only changed that in a development branch though and not in 
HEAD.
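
For reference while reviewing, the three-state return described in the
quoted summary might look roughly like the sketch below.  This is not
taken from the diff: the enum, the function name, and the three
boolean inputs are hypothetical, and the real decision may depend on
more than these.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical three-state result for the LAPIC clock setup. */
enum xlapic_clock {
	XLAPIC_CLOCK_NONE,	/* LAPIC timer unusable; rtc/ISA timer do it all */
	XLAPIC_CLOCK_HARDCLOCK,	/* LAPIC drives hardclock only */
	XLAPIC_CLOCK_ALL	/* LAPIC drives hardclock, statclock, profclock */
};

/*
 * Prefer a separate source (atrtc) for statclock and profclock so they
 * do not alias with hardclock.  Fall back to LAPIC-for-everything when
 * the tunable forces it or atrtc is not compiled in.
 */
static enum xlapic_clock
xlapic_setup_clock(bool lapic_usable, bool force_all, bool have_atrtc)
{
	if (!lapic_usable)
		return (XLAPIC_CLOCK_NONE);
	if (force_all || !have_atrtc)
		return (XLAPIC_CLOCK_ALL);
	return (XLAPIC_CLOCK_HARDCLOCK);
}

/* The rtc side takes statclock/profclock unless the LAPIC has them. */
static bool
xrtc_runs_statclock(enum xlapic_clock c)
{
	return (c != XLAPIC_CLOCK_ALL);
}
```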

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Mon Nov 30 20:59:06 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C12CD1065670;
	Mon, 30 Nov 2009 20:59:06 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 6D3588FC13;
	Mon, 30 Nov 2009 20:59:06 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id EF6C246B06;
	Mon, 30 Nov 2009 15:59:05 -0500 (EST)
Received: from jhbbsd.localnet (unknown [209.249.190.9])
	by bigwig.baldwin.cx (Postfix) with ESMTPA id C06B98A01F;
	Mon, 30 Nov 2009 15:59:04 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: Scott Long <scottl@samsco.org>
Date: Mon, 30 Nov 2009 14:27:23 -0500
User-Agent: KMail/1.12.1 (FreeBSD/7.2-CBSD-20091103; KDE/4.3.1; amd64; ; )
References: <200905191458.50764.jhb@freebsd.org>
	<3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com>
	<66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>
In-Reply-To: <66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>
MIME-Version: 1.0
Message-Id: <200911301427.23166.jhb@freebsd.org>
Content-Type: Text/Plain;
  charset="iso-8859-15"
Content-Transfer-Encoding: 7bit
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(bigwig.baldwin.cx); Mon, 30 Nov 2009 15:59:04 -0500 (EST)
X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-2.5 required=4.2 tests=AWL,BAYES_00,RDNS_NONE
	autolearn=no version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx
Cc: Attilio Rao <attilio@freebsd.org>, arch@freebsd.org
Subject: Re: sglist(9)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 30 Nov 2009 20:59:06 -0000

On Sunday 29 November 2009 6:27:49 pm Scott Long wrote:
> > On Wednesday 20 May 2009 2:49:30 pm Jeff Roberson wrote:
> >> On Tue, 19 May 2009, John Baldwin wrote:
> >>
> >> 2) I worry that if all users do sglist_count() followed by a dynamic
> >> allocation and then an _append() they will be very expensive.
> >> pmap_kextract() is much more expensive than it may first seem to  
> >> be.  Do
> >> you have a user of count already?
> >
> > The only one that does now is sglist_build() and nothing currently  
> > uses that.
> 
> Kinda silly to have it then.  I also don't see the point of it; if the  
> point of the sglist object is to avoid VA mappings, then why start  
> with a VA mapping?  But that aside, Jeff is correct, sglist_build() is  
> horribly inefficient.

It actually does get used by the nvidia driver, but so far in what I have
done in my jhb_bio branch I have tried several different approaches which
is why the API is as verbose as it is.

> > VOP_GET/PUTPAGES would not need to do this since they could simply  
> > append
> > the physical addresses extracted directly from vm_page_t's for  
> > example.  I'm
> > not sure this will be used very much now as originally I thought I  
> > would be
> > changing all storage drivers to do all DMA operations using sglists  
> > and this
> > sort of thing would have been used for non-bio requests like firmware
> > commands; however, as expounded on below, it actually appears better  
> > to
> > still treat bio's separate from non-bio requests for bus_dma so that  
> > the
> > non-bio requests can continue to use bus_dmamap_load_buffer() as  
> > they do
> > now.
> >
> 
> I completely disagree with your approach to busdma, but I'll get into  
> that later.  What I really don't understand is, why have yet another  
> page description format?  Whether the I/O is disk buffers being pushed  
> by a pager to a disk controller, or texture buffers being pushed to  
> video hardware, they already have vm objects associated with them,  
> no?  Why translate to an intermediate format?  I understand that  
> you're creating another vm object format to deal directly with the  
> needs of nvidia, but that's really a one-off case right now.  What  
> about when we want the pagedaemon to push unmapped i/o?  Will it have  
> to spend cycles translating its objects to sglists?  This is really a  
> larger question that I'm not prepared to answer, but would like to  
> discuss.

Textures do not already have objects associated, no.  However, I did not
design sglist with Nvidia in mind.  I hacked on it in conjunction with
phk@, gibbs@, and jeff@ to work on unmapped bio support.  That was the
only motivation for sglist(9).  Only the OBJT_SG bits were specific to
Nvidia and that was added as an afterthought because sglist(9) already
existed in the jhb_bio branch.

If you look at GETPAGES/PUTPAGES they already deal in terms of vm_page_t's,
not VM objects, and vm_page_t's already provide a constant time way of fetching
the physical address (m->phys_addr), so generating an sglist for GETPAGES and
PUTPAGES will be very cheap.  One of the original proposals from phk@ was
actually to pass around arrays of vm_page_t's to describe I/O buffers.  When
Poul, Peter, and I talked about it we figured we had a choice between passing
either the physical address or the vm_page_t.  However, not all physical
addresses have vm_page_t's, and it was deemed that GEOM's and disk drivers did
not need any properties of the vm_page_t aside from the physical address.  For
those reasons, sglist(9) uses physical addresses.
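
To make the "very cheap" claim concrete, here is a minimal userland sketch
of building a scatter/gather list from an array of pages.  Everything named
toy_* is a hypothetical stand-in invented for illustration (the real types
are struct sglist and vm_page_t, and the real append routine is
sglist_append_phys()); the coalescing of physically adjacent pages is an
assumption about how one would reasonably implement such an append.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* Toy stand-ins; these are NOT the kernel's struct sglist or vm_page_t. */
struct toy_seg { uint64_t paddr; size_t len; };
struct toy_sglist { struct toy_seg segs[8]; unsigned nseg; };
struct toy_page { uint64_t phys_addr; };

/* Append one physical range, merging with the last segment if contiguous. */
static int
toy_append_phys(struct toy_sglist *sg, uint64_t paddr, size_t len)
{
	assert(len > 0);
	if (sg->nseg > 0) {
		struct toy_seg *last = &sg->segs[sg->nseg - 1];

		if (last->paddr + last->len == paddr) {
			last->len += len;
			return (0);
		}
	}
	if (sg->nseg == sizeof(sg->segs) / sizeof(sg->segs[0]))
		return (-1);	/* out of segments */
	sg->segs[sg->nseg].paddr = paddr;
	sg->segs[sg->nseg].len = len;
	sg->nseg++;
	return (0);
}

/* Describe an array of pages: one field read per page, no VA lookup. */
static int
toy_append_pages(struct toy_sglist *sg, const struct toy_page *pages,
    int count)
{
	for (int i = 0; i < count; i++)
		if (toy_append_phys(sg, pages[i].phys_addr,
		    TOY_PAGE_SIZE) != 0)
			return (-1);
	return (0);
}
```

Three pages at 0x10000, 0x11000 and 0x40000 would end up as two segments in
this model: one of 8192 bytes and one of 4096 bytes.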

> >> In general I think this is a good idea.  It'd be nice to work on  
> >> replacing
> >> the buf layer's implementation with something like this that could  
> >> be used
> >> directly by drivers.  Have you considered a busdma operation to  
> >> load from
> >> a sglist?
> >
> > So in regards to the bus_dma stuff, I did work on this a while ago  
> > in my
> > jhb_bio branch.  I do have a bus_dmamap_load_sglist() and I had  
> > planned on
> > using that in storage drivers directly.  However, I ended up  
> > circling back
> > to preferring a bus_dmamap_load_bio() and adding a new 'bio_start'  
> > field
> > to 'struct bio' that is an offset into an attached sglist.
> 
> I strongly disagree with forcing busdma to have intimate knowledge of  
> bio's.  All of the needed information can be stored in sglist headers.

The alternative is to teach every disk driver to handle the difference
as well as every GEOM module.  Not only that, but it doesn't provide any
easy transition path since you can't change any of the top-level code
to use an unmapped bio at all until all the lower levels have been converted
to handle both ways.  Jeff had originally proposed having a
bus_dmamap_load_bio() and I tried to not use it but just have a
bus_dmamap_load_sglist() instead, but when I started looking at the extra
work that would have to be duplicated in every driver to handle both types
of bios, I concluded bus_dmamap_load_bio() would actually be a lot simpler.
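
As a rough illustration of why a single entry point is simpler, the sketch
below keeps the mapped/unmapped decision in one helper instead of in every
driver.  The toy_* types and the function are hypothetical userland models,
not the real struct bio or any existing bus_dma interface; the fallback to
the KVA pointer when no sglist is attached follows the behaviour described
in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical models; not the real struct bio or struct sglist. */
struct toy_sglist { int placeholder; };
struct toy_bio {
	void *data;		/* models bio_data (mapped case) */
	struct toy_sglist *sg;	/* attached sglist (unmapped case) */
	size_t start;		/* models the proposed bio_start */
	size_t length;		/* models bio_length */
};

enum toy_path { TOY_LOAD_KVA, TOY_LOAD_SGLIST };

/*
 * Decide once, in one place, how the request is loaded.  Drivers call
 * this and never look at which kind of bio they were handed.
 */
static enum toy_path
toy_dmamap_load_bio(const struct toy_bio *bp, size_t *off, size_t *len)
{
	assert(bp->data != NULL || bp->sg != NULL);
	*len = bp->length;
	if (bp->sg != NULL) {
		*off = bp->start;	/* sub-range of the shared sglist */
		return (TOY_LOAD_SGLIST);
	}
	*off = 0;			/* plain (buffer, length) pair */
	return (TOY_LOAD_KVA);
}
```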
 
> >  This let me
> > carve up I/O requests in geom_dev to satisfy a disk device's max  
> > request
> > size while still sharing the same read-only sglist across the various
> > BIO's (by simply adjusting bio_length and bio_start to be a subrange  
> > of
> > the sglist) as opposed to doing memory allocations to allocate  
> > specific
> > ranges of an sglist (using something like sglist_slice()) for each I/O
> > request.
> 
> I think this is fundamentally wrong.  You're proposing exchanging a  
> cheap operation of splitting VA's with an expensive operation of  
> allocating, splitting, copying, and refcounting sglists.  Splitting is  
> an excessively common operation, and your proposal will impact  
> performance as storage becomes exponentially faster.

The whole point is to not do anything to the sglist when splitting up requests 
so that it is more efficient.  I wrote above that splitting up the sglist 
would require allocations and be slow, so I specifically avoided that.  
Instead, one just does a simple refcount bump (refcount_acquire()) when 
cloning the bio (which is already doing an allocation to get the new bio) and 
one does a simple 'bio->bio_start += X' where one already does
'bio->bio_data += X' now.

Instead, the sglist describes the "large" buffer at the "top" of the
I/O request tree, and when you split up the large bio into smaller ones
you simply use bio_start and bio_length to specify the sub-range of the
buffer.
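
A minimal sketch of that splitting scheme, using hypothetical toy_* types
in userland rather than the real struct bio: cloning a sub-range costs one
reference bump on the shared list plus the bio allocation that happens
anyway, and the list itself is never copied or modified.  The plain
increment stands in for refcount_acquire().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified models; not the real kernel structures. */
struct toy_sglist { unsigned refs; size_t total_len; };
struct toy_bio {
	struct toy_sglist *sg;	/* shared, read-only buffer description */
	size_t start;		/* models the proposed bio_start */
	size_t length;		/* models bio_length */
};

/* Clone a sub-range of parent: one refcount bump, no sglist copy. */
static struct toy_bio *
toy_bio_clone_range(const struct toy_bio *parent, size_t off, size_t len)
{
	struct toy_bio *child;

	assert(off + len <= parent->length);
	child = malloc(sizeof(*child));
	if (child == NULL)
		return (NULL);
	parent->sg->refs++;	/* stands in for refcount_acquire() */
	child->sg = parent->sg;
	child->start = parent->start + off;
	child->length = len;
	return (child);
}

/* Drop the child's reference on the shared list and free the bio. */
static void
toy_bio_free(struct toy_bio *bp)
{
	bp->sg->refs--;
	free(bp);
}
```

Splitting a 256k request into two 128k halves in this model leaves both
children pointing at the same list, with start offsets of 0 and 128k.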

> We need to stop thinking about maxio as a roadbump at the bottom of  
> the storage stack, and instead think of it as a fundamental attribute  
> that is honored at the top when a BIO is created.  Instead of loading  
> up an sglist with all of the pages (and don't forget coalesced pages  
> that might need to be broken up), maybe multiple bio's are created  
> that honor maxio from the start, or a single bio with a chained  
> sglist, with each chain link honoring maxio, allowing for easy  
> splitting.

It may be that the splitting that geom_dev does is done at the wrong layer;
I'm not debating that. :)  I attempted to make sglist work efficiently with
what is there now and other things like striping will also want to use cheap
splitting of buffers.

> >  I then have bus_dmamap_load_bio() use the subrange of the
> > sglist internally or fall back to using the KVA pointer if the sglist
> > isn't present.
> 
> I completely disagree.  Drivers already deal with the details of  
> bio's, and should continue to do so.  If a driver gets a bio that has  
> a valid bio_data pointer, it should call bus_dmamap_load().  If it  
> gets one with a valid sglist, it should call bus_dmamap_load_sglist().
> Your proposal means that every storage driver in the system will  
> have to change to use bus_dmamap_load_bio().  It's not a big change,  
> but it's disruptive both in the tree and out.  Your proposal also  
> implies that CAM will have to start carrying BIO's in CCBs and passing  
> them to their SIMs.  I absolutely disagree with this.

Ok.  As I mentioned above, while this does add churn, I think it is less
churn than changing all the drivers to handle the two different types of
bio requests.  I also think changing every driver is much less friendly to
doing the unmapped I/O changes in stages, which would allow the work to
progress in parallel in different areas.  I also believe I specifically
mentioned changing CCBs to
pass the bio instead of the raw (data, length) pair when I discussed this
with folks earlier.

> If we keep unneeded complications out of busdma, we avoid a lot of  
> churn.  We also leave the busdma interface available for other forms  
> of I/O without requiring more specific API additions to accommodate  
> them.  What about unmapped network i/o coming from something like  
> sendfile?

I do have a bus_dmamap_load_sglist() in my tree already.  Do note that we
already have bus_dmamap_load_mbuf() and bus_dmamap_load_uio(), so there is
precedent for letting bus_dma handle slightly more complex data structures
than just a (buffer, length) pair.

> > However, I'm not really trying to get the bio stuff into the tree,  
> > this
> > is mostly for the Nvidia case and for that use case the driver is  
> > simply
> > creating simple single-entry lists and using sglist_append_phys().
> >
> 
> Designing the whole API around a single driver that we can't even get  
> the source to makes it hard to evaluate the API.

The API was designed for the bio stuff, and not for any specific driver.  The 
Nvidia stuff was only done as an afterthought because the sglist(9) structure 
already existed at the time.  It was also designed as a result of the 
discussions among several people and not completely in a vacuum.

> Attilio and I have spoken about this in private and will begin work on  
> a prototype.  Here is the outline of what we're going to do:

For those playing along at home, the things that I suggested to Attilio as far
as the next steps that I would do were to add the following APIs and then make
the necessary changes so that drivers and GEOM modules use these:

- bus_dmamap_load_bio():  Fairly simple.  Just takes a bio instead of (buffer,
  length).
- bio_adjust():  This is a lot like m_adj() but for bio's instead of mbuf's.
  It can be inline, but the point is to have GEOM modules use this to split up
  a bio buffer instead of directly manipulating bio_data and bio_length
  (possibly bio_offset as well).
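
What bio_adjust() might look like, as a sketch only: no such function
exists yet, and the signature, the toy_bio fields, and the decision to
advance the device offset are all assumptions made for illustration.  The
sign convention is borrowed from m_adj(): a positive count trims from the
front, a negative count trims from the back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the fields a bio_adjust() would touch. */
struct toy_bio {
	char *data;		/* models bio_data; NULL if unmapped */
	size_t start;		/* models the proposed bio_start */
	size_t length;		/* models bio_length */
	long long offset;	/* models bio_offset (device offset) */
};

/* Trim req bytes from the front (req > 0) or the back (req < 0). */
static void
toy_bio_adjust(struct toy_bio *bp, long req)
{
	size_t n;

	assert(bp != NULL);
	if (req >= 0) {
		n = (size_t)req;
		if (n > bp->length)
			n = bp->length;
		if (bp->data != NULL)
			bp->data += n;	/* mapped: move the pointer */
		else
			bp->start += n;	/* unmapped: move the offset */
		bp->offset += (long long)n;
		bp->length -= n;
	} else {
		n = (size_t)0 - (size_t)req;	/* magnitude of req */
		if (n > bp->length)
			n = bp->length;
		bp->length -= n;
	}
}
```

A GEOM module written against such a helper would not need to know whether
the bio it is carving up is mapped or not.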

Once these changes are done, adding support for simple unmapped bio's
consists of adding sglist support to bus_dma for each architecture and
bus_dmamap_load_bio() on each arch.  Then upper layer code could start using
unmapped bios after that (I had hacky prototype changes to physio).

There would still be several big things to work out, such as GEOM modules
that need to manipulate the data and not just pass it through.  phk's
suggestion here was to have the driver or GEOM module fail the request with a
magic error code.  The originator was then supposed to map the buffer and
retry the request.  Presumably one could note the first time a given device
object failed a request that way and always send down mapped requests
afterwards to avoid delays in subsequent I/O requests.  There are other ways
of handling this problem as well I imagine.  I have not made any attempt to
solve this problem.

Also, the changes Jeff has discussed with regards to tearing up
getpages/putpages and the buffer cache in general to take advantage of
unmapped bios are a separate animal that would build on this stuff further.
I have not made any attempt at this either.

I do find the idea of chaining sglist's together interesting.  It would lose
one of the "benefits" of the current layout which is that the segment array is
ABI-compatible with bus_dma's S/G list format so that in the common case the
sglist that physio or getpages/putpages would generate could be passed 
directly to the device driver's bus_dma callback without having to generate an
intermediate data structure.

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG  Mon Nov 30 21:03:29 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 19D9F1065670;
	Mon, 30 Nov 2009 21:03:29 +0000 (UTC)
	(envelope-from phk@critter.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.freebsd.org (Postfix) with ESMTP id CB0558FC15;
	Mon, 30 Nov 2009 21:03:28 +0000 (UTC)
Received: from critter.freebsd.dk (critter.freebsd.dk [192.168.61.3])
	by phk.freebsd.dk (Postfix) with ESMTP id EE7C67E996;
	Mon, 30 Nov 2009 21:03:27 +0000 (UTC)
Received: from critter.freebsd.dk (localhost [127.0.0.1])
	by critter.freebsd.dk (8.14.3/8.14.3) with ESMTP id nAUL3kxY016357;
	Mon, 30 Nov 2009 21:03:46 GMT (envelope-from phk@critter.freebsd.dk)
To: John Baldwin <jhb@freebsd.org>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Mon, 30 Nov 2009 14:27:23 EST."
	<200911301427.23166.jhb@freebsd.org> 
Date: Mon, 30 Nov 2009 21:03:46 +0000
Message-ID: <16356.1259615026@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: Attilio Rao <attilio@freebsd.org>, arch@freebsd.org,
	Scott Long <scottl@samsco.org>
Subject: Re: sglist(9) 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 30 Nov 2009 21:03:29 -0000

In message <200911301427.23166.jhb@freebsd.org>, John Baldwin writes:
>On Sunday 29 November 2009 6:27:49 pm Scott Long wrote:

>It actually does get used by the nvidia driver, but so far in what I have
>done in my jhb_bio branch I have tried several different approaches which
>is why the API is as verbose as it is.

I would warn equally against rigorous simplification and gratuitous
generalization in this, I've tried both approaches in prototypes and
neither works out well from an API point of view.

The insight that expended CPU cycles are practically unmeasurable
in this context should not be forgotten, even in the quest to
get ever higher transactions per second numbers.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch@FreeBSD.ORG  Mon Nov 30 22:13:23 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5611D106566C;
	Mon, 30 Nov 2009 22:13:23 +0000 (UTC)
	(envelope-from scottl@samsco.org)
Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57])
	by mx1.freebsd.org (Postfix) with ESMTP id EF4D18FC0A;
	Mon, 30 Nov 2009 22:13:22 +0000 (UTC)
Received: from [IPv6:::1] (pooker.samsco.org [168.103.85.57])
	(authenticated bits=0)
	by pooker.samsco.org (8.14.2/8.14.2) with ESMTP id nAUMDJhZ063860;
	Mon, 30 Nov 2009 15:13:19 -0700 (MST)
	(envelope-from scottl@samsco.org)
Mime-Version: 1.0 (Apple Message framework v1076)
Content-Type: text/plain; charset=us-ascii; format=flowed; delsp=yes
From: Scott Long <scottl@samsco.org>
In-Reply-To: <200911301427.23166.jhb@freebsd.org>
Date: Mon, 30 Nov 2009 15:13:19 -0700
Content-Transfer-Encoding: 7bit
Message-Id: <02A7F7FF-EBBC-40F3-8EBB-BFD4E5BE5391@samsco.org>
References: <200905191458.50764.jhb@freebsd.org>
	<3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com>
	<66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>
	<200911301427.23166.jhb@freebsd.org>
To: John Baldwin <jhb@freebsd.org>
X-Mailer: Apple Mail (2.1076)
X-Spam-Status: No, score=-4.5 required=3.8 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.1.8
X-Spam-Checker-Version: SpamAssassin 3.1.8 (2007-02-13) on pooker.samsco.org
Cc: Attilio Rao <attilio@freebsd.org>, arch@freebsd.org
Subject: Re: sglist(9)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 30 Nov 2009 22:13:23 -0000


On Nov 30, 2009, at 12:27 PM, John Baldwin wrote:

> On Sunday 29 November 2009 6:27:49 pm Scott Long wrote:
>>> On Wednesday 20 May 2009 2:49:30 pm Jeff Roberson wrote:
>>>> On Tue, 19 May 2009, John Baldwin wrote:
>>>>
>>>> 2) I worry that if all users do sglist_count() followed by a  
>>>> dynamic
>>>> allocation and then an _append() they will be very expensive.
>>>> pmap_kextract() is much more expensive than it may first seem to
>>>> be.  Do
>>>> you have a user of count already?
>>>
>>> The only one that does now is sglist_build() and nothing currently
>>> uses that.
>>
>> Kinda silly to have it then.  I also don't see the point of it; if  
>> the
>> point of the sglist object is to avoid VA mappings, then why start
>> with a VA mapping?  But that aside, Jeff is correct, sglist_build()  
>> is
>> horribly inefficient.
>
> It actually does get used by the nvidia driver, but so far in what I  
> have
> done in my jhb_bio branch I have tried several different approaches  
> which
> is why the API is as verbose as it is.
>
>>> VOP_GET/PUTPAGES would not need to do this since they could simply
>>> append
>>> the physical addresses extracted directly from vm_page_t's for
>>> example.  I'm
>>> not sure this will be used very much now as originally I thought I
>>> would be
>>> changing all storage drivers to do all DMA operations using sglists
>>> and this
>>> sort of thing would have been used for non-bio requests like  
>>> firmware
>>> commands; however, as expounded on below, it actually appears better
>>> to
>>> still treat bio's separate from non-bio requests for bus_dma so that
>>> the
>>> non-bio requests can continue to use bus_dmamap_load_buffer() as
>>> they do
>>> now.
>>>
>>
>> I completely disagree with your approach to busdma, but I'll get into
>> that later.  What I really don't understand is, why have yet another
>> page description format?  Whether the I/O is disk buffers being  
>> pushed
>> by a pager to a disk controller, or texture buffers being pushed to
>> video hardware, they already have vm objects associated with them,
>> no?  Why translate to an intermediate format?  I understand that
>> you're creating another vm object format to deal directly with the
>> needs of nvidia, but that's really a one-off case right now.  What
>> about when we want the pagedaemon to push unmapped i/o?  Will it have
>> to spend cycles translating its objects to sglists?  This is really a
>> larger question that I'm not prepared to answer, but would like to
>> discuss.
>
> Textures do not already have objects associated, no.  However, I did  
> not
> design sglist with Nvidia in mind.  I hacked on it in conjunction with
> phk@, gibbs@, and jeff@ to work on unmapped bio support.  That was the
> only motivation for sglist(9).  Only the OBJT_SG bits were specific to
> Nvidia and that was added as an afterthought because sglist(9) already
> existed in the jhb_bio branch.
>
> If you look at GETPAGES/PUTPAGES they already deal in terms of  
> vm_page_t's,
> not VM objects, and vm_page_t's already provide a constant time way of  
> fetching
> the physical address (m->phys_addr), so generating an sglist for  
> GETPAGES and
> PUTPAGES will be very cheap.  One of the original proposals from  
> phk@ was
> actually to pass around arrays of vm_page_t's to describe I/O  
> buffers.  When
> Poul, Peter, and I talked about it we figured we had a choice  
> between passing
> either the physical address or the vm_page_t.  However, not all  
> physical
> addresses have vm_page_t's, and it was deemed that GEOM's and disk  
> drivers did
> not need any properties of the vm_page_t aside from the physical  
> address.  For
> those reasons, sglist(9) uses physical addresses.
>
>>>> In general I think this is a good idea.  It'd be nice to work on
>>>> replacing
>>>> the buf layer's implementation with something like this that could
>>>> be used
>>>> directly by drivers.  Have you considered a busdma operation to
>>>> load from
>>>> a sglist?
>>>
>>> So in regards to the bus_dma stuff, I did work on this a while ago
>>> in my
>>> jhb_bio branch.  I do have a bus_dmamap_load_sglist() and I had
>>> planned on
>>> using that in storage drivers directly.  However, I ended up
>>> circling back
>>> to preferring a bus_dmamap_load_bio() and adding a new 'bio_start'
>>> field
>>> to 'struct bio' that is an offset into an attached sglist.
>>
>> I strongly disagree with forcing busdma to have intimate knowledge of
>> bio's.  All of the needed information can be stored in sglist  
>> headers.
>
> The alternative is to teach every disk driver to handle the difference
> as well as every GEOM module.  Not only that, but it doesn't provide  
> any
> easy transition path since you can't change any of the top-level code
> to use an unmapped bio at all until all the lower levels have been  
> converted
> to handle both ways.  Jeff had originally proposed having a
> bus_dmamap_load_bio() and I tried to not use it but just have a
> bus_dmamap_load_sglist() instead, but when I started looking at the  
> extra
> work that would have to be duplicated in every driver to handle both  
> types
> of bios, I concluded bus_dmamap_load_bio() would actually be a lot  
> simpler.

You completely missed the part of my email where I talk about not  
having to update drivers for these new APIs.

In any case, I still respectfully disagree with your approach to  
busdma and bio handling, and ask that you let Attilio and I work on  
our prototype.  Once that's done, we can stop talking in hypotheticals.

Scott


From owner-freebsd-arch@FreeBSD.ORG  Tue Dec  1 15:30:15 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id EB465106568B;
	Tue,  1 Dec 2009 15:30:14 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au
	[211.29.132.185])
	by mx1.freebsd.org (Postfix) with ESMTP id 25C098FC18;
	Tue,  1 Dec 2009 15:30:13 +0000 (UTC)
Received: from besplex.bde.org (c220-239-235-116.carlnfd3.nsw.optusnet.com.au
	[220.239.235.116])
	by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	nB1FU9g6019186
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Wed, 2 Dec 2009 02:30:11 +1100
Date: Wed, 2 Dec 2009 02:30:09 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: John Baldwin <jhb@freebsd.org>
In-Reply-To: <200911301305.30572.jhb@freebsd.org>
Message-ID: <20091201233938.K1089@besplex.bde.org>
References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com>
	<200911301305.30572.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Attilio Rao <attilio@freebsd.org>, FreeBSD Arch <arch@freebsd.org>,
	Ed Maste <emaste@freebsd.org>
Subject: Re: [PATCH] Statclock aliasing by LAPIC
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Dec 2009 15:30:15 -0000

On Mon, 30 Nov 2009, John Baldwin wrote:

> On Friday 27 November 2009 6:42:50 pm Attilio Rao wrote:
>> Handling all three clocks (hardclock, statclock, profclock) within
>> the LAPIC can lead to aliasing for the statclock and profclock because
>> hz is sized to fit mainly hardclock. The fashion to handle all of them
>> within the LAPIC was introduced in 2005 and before then the statclock
>> and profclock were supposed to be handled in the rtc. Right now, too,
>> there is the necessary support to handle profclock and statclock in
>> atrtc which gets enabled if the LAPIC signals it can't take in charge
>> the three clocks.
>> The proposed patch reverts the situation preferring the atrtc to
>> handle the statclock and profclock (then a different source from the
>> LAPIC) and then avoid the aliasing problem:

This would defeat most of the point of using the lapic timer.  RTC
interrupts are too slow to use for anything if there is an alternative
like the lapic timer.  i8254 interrupts are not so bad, and in fact
are just as efficient as lapic timer interrupts iff they are controlled
by the APIC and not by the ATPIC.  This is because RTC interrupts must
be acked and tested for in RTC registers, and the RTC is on the ISA
bus so accessing it is very slow, while the i8254 is programmed for
its interrupts to not need any acking or testing.  RTC and i8254
interrupts may also be controlled by the ATPIC, and then the ATPIC
must be acked on the ISA bus too.  This gives the following number of
ISA bus accesses for most interrupts:

device          read    write
------          ----    -----
lapic_timer        0        0
i8254              0    0+0/1
RTC (current)      1    0+0/2
RTC (old)          3    1+0/2

Here "+0/1" and "+0/2" are for the ATPIC ack, if any.  RTC (old) is
before I optimized rtcin() to not write the index register in the usual
case where it has not changed (writing the index register takes 1 extra
write and uses 2 dummy reads in an attempt to satisfy timing requirements).
However, there is apparently broken (or just incompatible) hardware
that fails with this optimization.  There would probably be more reports
of this brokenness if using the RTC became the default again.
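
The access counts in the table for the two RTC variants can be reproduced
with a small userland model.  This is not the real rtcin(): the toy_*
names, the simulated ports and the exact placement of the two dummy reads
are assumptions; only the counts (3 reads and 1 write when the index is
written, 1 read and 0 writes when it is cached) come from the text above.

```c
#include <assert.h>
#include <stdint.h>

static unsigned isa_reads, isa_writes;	/* simulated ISA bus accesses */
static uint8_t toy_cmos[128];		/* simulated RTC registers */
static uint8_t toy_index_port;		/* simulated index register */
static int toy_cached_reg = -1;		/* last index we wrote, if any */

static void
toy_outb_index(uint8_t reg)
{
	isa_writes++;
	toy_index_port = reg;
}

static void
toy_inb_dummy(void)
{
	isa_reads++;	/* delay-only read; value is discarded */
}

static uint8_t
toy_inb_data(void)
{
	isa_reads++;
	return (toy_cmos[toy_index_port & 0x7f]);
}

/* Old behaviour: always select the register before reading it. */
static uint8_t
toy_rtcin_old(uint8_t reg)
{
	assert(reg < 128);
	toy_outb_index(reg);
	toy_inb_dummy();
	toy_inb_dummy();
	return (toy_inb_data());
}

/* Optimized behaviour: skip the select if the index has not changed. */
static uint8_t
toy_rtcin_cached(uint8_t reg)
{
	assert(reg < 128);
	if (toy_cached_reg != reg) {
		toy_outb_index(reg);
		toy_inb_dummy();
		toy_inb_dummy();
		toy_cached_reg = reg;
	}
	return (toy_inb_data());
}
```

The cached variant depends on the hardware keeping the index between
accesses, which is the assumption that the incompatible hardware mentioned
above apparently violates.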

The 4-6 ISA accesses for RTC (old) take about 4-9 usec, so using the
RTC at stathz = 128 Hz takes only 0.05-0.12% of 1 CPU, which is
acceptable.  Using the RTC at profhz = 1024 Hz takes 0.4-0.9% of 1
CPU, which may also be acceptable, but profhz = 1024 was too slow even
for a 386/20 in 1993; it should be 200-1000 times larger now, but the
RTC just can't support that, and one reason it was never increased is
that the RTC is too inefficient.  Profiling can now be implemented
better using the lapic timer, but using the lapic timer currently
implements profiling slightly worse than using the RTC.
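
The percentages above follow from multiplying the per-interrupt cost by the
interrupt rate.  The helper below is only a worked version of that
arithmetic; its name is made up for this example.

```c
#include <assert.h>

/* Percentage of one CPU used by a handler costing usec_per_intr at hz. */
static double
intr_cpu_percent(double usec_per_intr, double hz)
{
	assert(usec_per_intr >= 0.0 && hz >= 0.0);
	/* usec per second, divided by 1e6 usec/second, as a percentage */
	return (usec_per_intr * hz / 1e6 * 100.0);
}
```

With 4 to 9 usec per interrupt this gives about 0.051% to 0.115% at 128 Hz
and about 0.41% to 0.92% at 1024 Hz, matching the rounded figures quoted.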

> http://www.freebsd.org/~attilio/Sandvine/STABLE_8/statclock_aliasing/statclock_aliasing3.diff
>> 
>> In this patch, lapic_setup_clock() has been changed in order to return
>> a three-states variable which identified if the LAPIC got in charge
>> all the three clocks, just the hardclock or none of them (the current
>> situation is just none/all) and the rtc handling runs subsequently.
>> A tunable has been added to force the LAPIC to take charge of all three
>> clocks if needed.
>> In ia32 atrtc compiling is linked to atpic compiling, so a compile
>> time flag has been added to check if atpic is not present and in case
>> force LAPIC to take in charge all the three clocks (which is still
>> better than the 'safe belt values' still present in the rtc code).

I don't like tunables, especially to switch from one bug to another.
This can be done better using sysctls only, since it is not needed
for booting.  The sysctls would need to be runnable at any time, but
reprogramming the lapic timer at any time is already needed to
fix profiling (cpu_start/stopprofclock() are missing support for
the lapic timer; instead, the default lapic_timer_hz is set excessively
large but not large enough for a good profhz).  sysctls also let you
test this stuff without rebooting.

>> Please note that statclock and profclock are widely used in our kernel
>> (rusage is, for example, statclock driven) and fixing this would
>> result in specific improvements (as a several-reported wrong CPU usage
>> statistic in top).
>> This bug has been found by Sandvine Incorporated.

What bug exactly?  Bugs like this must have been found before 1993,
since statclock() in 4.4BSD was supposed to fix them.  See "A Randomized
Sampling Clock for CPU Utilization Estimation and Code Profiling"
(ftp://ftp.ee.lbl.gov/papers/statclk-usenix93.ps.Z).  FreeBSD never
implemented the "Randomized" part, but its statclock() used to sort
of work, since by default stathz was > hz and was not nearly a multiple
of hz.  Someone broke the former by increasing the default hz to 1000.
This allowed malicious programs to easily hide themselves from statclock()
while consuming a large fraction of CPU cycles (when stathz was > hz,
it was not so easy to hide, and very difficult to both hide and consume
significant CPU, since timeout granularity makes it hard to control
wakeups).  Then using the single lapic timer to generate all periodic
timer interrupts increased the synchronization of these interrupts,
thus moving further from a randomized statclock().  However, the
defaults with the lapic timer give an even larger beat frequency than
before, so I don't see how using the lapic timer can increase the
problem much.  (The beat frequency of (1000, 128) is 16000.  The beat
frequency of (1000, 133) is 133000.  The latter means that, with
defaults, statclock and hardclock() ticks are only perfectly synced
once in every 133 seconds.  Misconfiguring hz to a multiple of 128 can
give perfect synchronization, which may be more of a problem, or a
fix -- see below).
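
The two figures quoted above are what one gets by reading "beat frequency"
as the least common multiple of the two rates.  That reading is an
assumption made for this example, and the helper names are invented.

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm. */
static unsigned long
toy_gcd(unsigned long a, unsigned long b)
{
	while (b != 0) {
		unsigned long t = a % b;

		a = b;
		b = t;
	}
	return (a);
}

/* Least common multiple of two non-zero rates. */
static unsigned long
toy_lcm(unsigned long a, unsigned long b)
{
	assert(a != 0 && b != 0);
	return (a / toy_gcd(a, b) * b);
}
```

For (1000, 128) this gives 16000 and for (1000, 133) it gives 133000.  With
hz set to a multiple of 128, such as 1024, the result collapses to hz
itself, which is the perfectly synchronized case.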

>>
>> Reviews, comments and testing are welcome.

Review of the part of the patch visible in the mail:

.

> Presumably in the RTC case lapic_timer_hz should always be hz and not some
> multiple of hz.

Sure.  Except the allocation of the timers is backwards at best.  You
need profhz on the most efficient timer so that it can be very large
(other changes are required for large profhz to actually work).  You
want stathz on the next most efficient timer so that it can be larger
than hz (see above) (other changes are needed for a stathz much larger
than 128 to actually work.  Note that at least SCHED_4BSD wants a
scheduling clock frequency of much less than 128 -- it essentially
divides stathz by 8 to get this.  Scaling in calcru() is currently
broken after several hundred days of runtime, and would break sooner
with larger stathz).

Perhaps your recent changes (that removed the literal constant dividers)
made the synchronization problem worse.  But these changes make it
easy to implement any number of independent timers with optional
randomness using the lapic timer.  E.g., to randomize statclock(), just
add a small random value (+-) as well as stathz.  Note that statistics
utilities won't like this -- some like systat(1) use statistics ticks
as a timebase so they want statclock() to be perfectly periodic.
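
One way to picture the randomization, as a sketch under stated assumptions:
here the jitter is applied to the length of each statclock period, measured
in timer ticks, rather than to whatever accumulator the lapic code actually
uses.  The toy_* functions and the small generator are made up for this
example; a kernel would use its own random source.

```c
#include <assert.h>
#include <stdint.h>

/* Small deterministic generator so the sketch is reproducible. */
static uint32_t
toy_rand(uint32_t *state)
{
	*state = *state * 1103515245u + 12345u;
	return (*state >> 16);
}

/* Next statclock period in timer ticks: base +/- at most max_jitter. */
static uint32_t
toy_next_stat_period(uint32_t base, uint32_t max_jitter, uint32_t *state)
{
	assert(max_jitter < base);
	return (base - max_jitter + toy_rand(state) % (2 * max_jitter + 1));
}
```

Every period stays within max_jitter of the nominal value, but successive
periods differ, so a program can no longer rely on a fixed phase between
statclock() and hardclock().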

I don't worry about the synchronization or broken profiling, and use
lapic_timer_hz = profhz = stathz = hz = 100 whenever the lapic timer
is used.  I haven't noticed any problems caused by this (mostly using
SCHED_4BSD), except the unavoidable one that hz = 100 gives less
accurate usr/sys decomposition than does hz = 1000.  I have noticed
that this fixed the cosmetic problem that systat(1) shows glitches in
the lapic timer interrupt rates:  Although using the lapic timer for
all timer interrupts makes them all perfectly periodic, systat cannot
see this because stathz = 133 is too small a sampling rate and is not
an exact divisor of lapic_timer_hz -- it caused a glitch every
lapic_timer_hz/stathz seconds.  For other interrupts, we wouldn't
expect the rates to be constant, but we know that the lapic timer
interrupt rate is constant so we know that the oscillation of its
displayed value is a bug.  Right now on ref8-i386.freebsd.org, I see
the values not oscillating much but being weird: for cpu0-1, they are
near 1973; for cpu2-3, they are near 1981; for cpu4-5, they are near
2043, and for cpu6-7 they are near 2003.

A tickless kernel would need to at least consider running the scheduler
and statistics gathering on most context switches (unless it keeps using
ticks when not idle).  The scheduler parts of this would also fix
timer synchronization problems for !tickless kernels, but I don't see
how they can be as efficient as only considering scheduling at
infrequent tick intervals.

> Also, did you check to make sure all the locking is present?  I
> think at one point I changed the locking for the RTC and/or ISA timer to just
> use critical_enter/exit so that UP machines running an SMP kernel wouldn't pay
> the locking overhead since the code was only used on UP machines.  It may very
> well be that I only changed that in a development branch though and not in
> HEAD.

I don't remember any locking changes for RTC ever being committed.
rtcin() still uses mtx_lock_spin(&clock_lock).  clock_lock is the i8254
clock's lock, and is still abused for the RTC.  This abuse was convenient
when the RTC driver was implemented in the same file as the i8254
driver, but now the RTC driver is in its own file.  The i8254's private
variable `clock_lock' is even declared in the RTC driver's public
header, with other style bugs of course.

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Tue Dec  1 16:01:40 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 99CD31065697
	for <arch@freebsd.org>; Tue,  1 Dec 2009 16:01:40 +0000 (UTC)
	(envelope-from avg@icyb.net.ua)
Received: from citadel.icyb.net.ua (citadel.icyb.net.ua [212.40.38.140])
	by mx1.freebsd.org (Postfix) with ESMTP id C7EEE8FC14
	for <arch@freebsd.org>; Tue,  1 Dec 2009 16:01:39 +0000 (UTC)
Received: from odyssey.starpoint.kiev.ua (alpha-e.starpoint.kiev.ua
	[212.40.38.101])
	by citadel.icyb.net.ua (8.8.8p3/ICyb-2.3exp) with ESMTP id SAA06125
	for <arch@freebsd.org>; Tue, 01 Dec 2009 18:01:38 +0200 (EET)
	(envelope-from avg@icyb.net.ua)
Message-ID: <4B153DE1.2030707@icyb.net.ua>
Date: Tue, 01 Dec 2009 18:01:37 +0200
From: Andriy Gapon <avg@icyb.net.ua>
User-Agent: Thunderbird 2.0.0.23 (X11/20090825)
MIME-Version: 1.0
References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com>	<200911301305.30572.jhb@freebsd.org>
	<20091201233938.K1089@besplex.bde.org>
In-Reply-To: <20091201233938.K1089@besplex.bde.org>
X-Enigmail-Version: 0.95.7
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: arch@freebsd.org
Subject: Re: [PATCH] Statclock aliasing by LAPIC
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Dec 2009 16:01:40 -0000


BTW, we could also consider using a periodic HPET timer (perhaps in legacy mode) for
some of these tasks on modern hardware.


-- 
Andriy Gapon

From owner-freebsd-arch@FreeBSD.ORG  Tue Dec  1 16:32:00 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 55FF8106566B;
	Tue,  1 Dec 2009 16:32:00 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail10.syd.optusnet.com.au (mail10.syd.optusnet.com.au
	[211.29.132.191])
	by mx1.freebsd.org (Postfix) with ESMTP id DB2838FC19;
	Tue,  1 Dec 2009 16:31:59 +0000 (UTC)
Received: from c220-239-235-116.carlnfd3.nsw.optusnet.com.au
	(c220-239-235-116.carlnfd3.nsw.optusnet.com.au [220.239.235.116])
	by mail10.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	nB1GVu9Z022984
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Wed, 2 Dec 2009 03:31:57 +1100
Date: Wed, 2 Dec 2009 03:31:56 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20091201233938.K1089@besplex.bde.org>
Message-ID: <20091202025202.B22732@delplex.bde.org>
References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com>
	<200911301305.30572.jhb@freebsd.org>
	<20091201233938.K1089@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Attilio Rao <attilio@freebsd.org>, FreeBSD Arch <arch@freebsd.org>,
	Ed Maste <emaste@freebsd.org>
Subject: Re: [PATCH] Statclock aliasing by LAPIC
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Dec 2009 16:32:00 -0000

On Wed, 2 Dec 2009, Bruce Evans wrote:

> ...  However, the
> defaults with the lapic timer give an even larger beat frequency than
> before, so I don't see how using the lapic timer can increase the
> problem much.  (The beat frequency of (1000, 128) is 16000.  The beat
> frequency of (1000, 133) is 133000.  The latter means that, with
> defaults, statclock and hardclock() ticks are only perfectly synced
> once in every 133 seconds.  Misconfiguring hz to a multiple of 128 can
> give perfect synchronization, which may be more of a problem, or a
> fix -- see below).

PS (the see below part): with perfect sync, statclock() ticks can be
kept perfectly out of phase, and this might work well.  E.g.:

(1) hz = 1000, stathz = 125, lapic_timer_hz = 2000: hz ticks on lapic
     ticks # 0, 2, 4, ...; stathz ticks on lapic ticks # 7, 23, 39, ...
     Malicious programs can still easily hide from statclock().

(2) hz = 100, stathz = 100, lapic_timer_hz = 200: hz ticks on lapic
     ticks # 0, 2, 4, ...; stathz ticks on lapic ticks # 1, 3, 5, ...
     Malicious programs can easily predict statclock(), but can't
     easily use more than half of the CPU: e.g.,
     - run from hz tick N+epsilon to N+0.5-epsilon (seems to need frequent
       clock_gettime() calls to determine when to give up control;
       timeouts are no use since none can occur until tick N+1-epsilon)
     - usleep(1) and/or give up control to another process.  If the
       former only, then there can be no timeout until hz tick N+1-epsilon,
       and we can hog at most half the CPU.  If the latter, then we
       will need to find a different one quite often, else the victim
       processes will accumulate ticks instead of us and they will be
       de-scheduled instead of us.  fork() by us must not be cost-free,
       else we can generate cooperating victim processes too easily for
       this and other types of hogging.

With a randomized statclock(), the randomness would have to be quite
large and not just a small glitch on the increment like I said before,
else maliciousness like in (2) would work to the extent that the
non-glitch part is large.

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Thu Dec  3 22:02:23 2009
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id BEE6F106568B
	for <freebsd-arch@freebsd.org>; Thu,  3 Dec 2009 22:02:23 +0000 (UTC)
	(envelope-from Hartmut.Brandt@dlr.de)
Received: from smtp1.dlr.de (smtp1.dlr.de [129.247.252.32])
	by mx1.freebsd.org (Postfix) with ESMTP id 56AC88FC0A
	for <freebsd-arch@freebsd.org>; Thu,  3 Dec 2009 22:02:23 +0000 (UTC)
Received: from beagle.kn.op.dlr.de ([129.247.178.136]) by smtp1.dlr.de over
	TLS secured channel with Microsoft SMTPSVC(6.0.3790.3959); 
	Thu, 3 Dec 2009 22:18:12 +0100
Date: Thu, 3 Dec 2009 22:18:08 +0100 (CET)
From: Harti Brandt <hartmut.brandt@dlr.de>
X-X-Sender: brandt_h@beagle.kn.op.dlr.de
To: freebsd-arch@freebsd.org
Message-ID: <20091203220011.H53516@beagle.kn.op.dlr.de>
X-OpenPGP-Key: harti@freebsd.org
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-OriginalArrivalTime: 03 Dec 2009 21:18:12.0177 (UTC)
	FILETIME=[1E8CC410:01CA745E]
Subject: struct if_data and ifmibdata
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: Harti Brandt <harti@freebsd.org>
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 03 Dec 2009 22:02:23 -0000


Hi,

I'm currently working on the networking MIBs for bsnmpd to implement the 
more recent RFCs (including IPv6 stuff). While doing this I ran into 
numerous problems accessing interface information. The two sources of this 
information are $subj, each of which has some problems. The main problem 
is a lack of flexibility because of ABI issues. I have some ideas on how to 
relax this somewhat, but before starting to implement anything I thought 
I'd ask around whether this makes sense.

1. struct if_data.

This is embedded into struct ifnet, so any change in size changes the 
ifnet offsets which is bad once we start keeping the ABI stable in 
-current. Other problems are:

   - hard to find out what version of struct if_data one is retrieving via 
either the if_msghdr routing message or via the interface MIB

   - we've run out of ifi_type (u_char) space. IANAIfType is currently at 
251. Actually some of our private defines in if_types.h already overlap 
IANA assigned types

   - ifi_physical is not used anywhere in the kernel as far as I can see 
and should probably be removed together with the associated ioctls. This 
seems to have been replaced a long time ago by the if_media stuff.

   - we've run out of if_baudrate space (u_long) on 32-bit architectures 
for 10GBit/s interfaces

   - broadcast packet statistics are missing (they are required by the 
actual IF-MIB)

   - ifi_datalen is rather short (u_char) and restricts structure size to
255 bytes.

So what I would like to do is:

   - add a version field at the beginning and a #define to help user 
programs in working with different versions of this structure

   - add a couple of dozen bytes at the end to allow extending the 
structure without changing its size

2. struct ifmibdata

   - add a version field here too.

3. struct ifmib_iso_8802_3

   - add a version field here too.

   - add dot3StatsSymbolErrors which are required by the current 
EtherLike-MIB.

Unfortunately, only 4 drivers actually implement the Ethernet statistics 
so far :-(

So, does this make any sense?

harti