From owner-freebsd-arch@FreeBSD.ORG Sun Nov 29 23:44:42 2009
From: Scott Long
To: John Baldwin
Cc: Attilio Rao, arch@freebsd.org
Date: Sun, 29 Nov 2009 16:27:49 -0700
Subject: Re: sglist(9)
Message-Id: <66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>
In-Reply-To: <3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com>

John,

Sorry for the late reply on this.
Attilio approached me recently about moving busdma and storage to
sglists; up until then I had largely ignored this conversation because
I thought that it was only about the nvidia driver.

> On Wednesday 20 May 2009 2:49:30 pm Jeff Roberson wrote:
>> On Tue, 19 May 2009, John Baldwin wrote:
>>
>> 2) I worry that if all users do sglist_count() followed by a dynamic
>> allocation and then an _append() they will be very expensive.
>> pmap_kextract() is much more expensive than it may first seem to be.
>> Do you have a user of count already?
>
> The only one that does now is sglist_build() and nothing currently
> uses that.

Kinda silly to have it then. I also don't see the point of it; if the
point of the sglist object is to avoid VA mappings, then why start
with a VA mapping? But that aside, Jeff is correct, sglist_build() is
horribly inefficient.

> VOP_GET/PUTPAGES would not need to do this since they could simply
> append the physical addresses extracted directly from vm_page_t's for
> example. I'm not sure this will be used very much now as originally I
> thought I would be changing all storage drivers to do all DMA
> operations using sglists and this sort of thing would have been used
> for non-bio requests like firmware commands; however, as expounded on
> below, it actually appears better to still treat bio's separate from
> non-bio requests for bus_dma so that the non-bio requests can continue
> to use bus_dmamap_load_buffer() as they do now.

I completely disagree with your approach to busdma, but I'll get into
that later. What I really don't understand is, why have yet another
page description format? Whether the I/O is disk buffers being pushed
by a pager to a disk controller, or texture buffers being pushed to
video hardware, they already have vm objects associated with them,
no? Why translate to an intermediate format?
I understand that you're creating another vm object format to deal
directly with the needs of nvidia, but that's really a one-off case
right now. What about when we want the pagedaemon to push unmapped
i/o? Will it have to spend cycles translating its objects to sglists?
This is really a larger question that I'm not prepared to answer, but
would like to discuss.

>> 3) Rather than having sg_segs be an actual pointer, did you consider
>> making it an unsized array? This removes the overhead of one pointer
>> from the structure while enforcing that it's always contiguously
>> allocated.
>
> It's actually a feature to be able to have the header in separate
> storage from segs array. I use this in the jhb_bio branch in the
> bus_dma implementations where a pre-allocated segs array is stored in
> the bus dma tag and the header is allocated on the stack.

I'd like to take this one step further. Instead of sglists being
exactly sized, I'd like to see them be much more like mbufs, with a
header and static storage, maybe somewhere between 128b and 1k in
total size. Then they can be allocated and managed in pools, and
chained together to make for easy appending, splitting, and growing.
Offset pointers can be stored in the header instead of externally.
Also, there are a lot of failure points in the API related to the
sglist object being too small. Those need to be fixed.

>> 4) SGLIST_INIT might be better off as an inline, and may not even
>> belong in the header file.
>
> That may be true. I currently only use it in the jhb_bio branch for
> the bus_dma implementations.
>
>> In general I think this is a good idea. It'd be nice to work on
>> replacing the buf layer's implementation with something like this
>> that could be used directly by drivers. Have you considered a busdma
>> operation to load from a sglist?
>
> So in regards to the bus_dma stuff, I did work on this a while ago in
> my jhb_bio branch.
> I do have a bus_dmamap_load_sglist() and I had planned on using that
> in storage drivers directly. However, I ended up circling back to
> preferring a bus_dmamap_load_bio() and adding a new 'bio_start' field
> to 'struct bio' that is an offset into an attached sglist.

I strongly disagree with forcing busdma to have intimate knowledge of
bio's. All of the needed information can be stored in sglist headers.

> This let me carve up I/O requests in geom_dev to satisfy a disk
> device's max request size while still sharing the same read-only
> sglist across the various BIO's (by simply adjusting bio_length and
> bio_start to be a subrange of the sglist) as opposed to doing memory
> allocations to allocate specific ranges of an sglist (using something
> like sglist_slice()) for each I/O request.

I think this is fundamentally wrong. You're proposing exchanging a
cheap operation of splitting VA's with an expensive operation of
allocating, splitting, copying, and refcounting sglists. Splitting is
an excessively common operation, and your proposal will impact
performance as storage becomes exponentially faster.

We need to stop thinking about maxio as a roadbump at the bottom of
the storage stack, and instead think of it as a fundamental attribute
that is honored at the top when a BIO is created. Instead of loading
up an sglist with all of the pages (and don't forget coalesced pages
that might need to be broken up), maybe multiple bio's are created
that honor maxio from the start, or a single bio with a chained
sglist, with each chain link honoring maxio, allowing for easy
splitting.

> I then have bus_dmamap_load_bio() use the subrange of the sglist
> internally or fall back to using the KVA pointer if the sglist isn't
> present.

I completely disagree. Drivers already deal with the details of
bio's, and should continue to do so. If a driver gets a bio that has
a valid bio_data pointer, it should call bus_dmamap_load().
If it gets one with a valid sglist, it should call
bus_dmamap_load_sglist(). Your proposal means that every storage
driver in the system will have to change to use bus_dmamap_load_bio().
It's not a big change, but it's disruptive both in the tree and out.
Your proposal also implies that CAM will have to start carrying BIO's
in CCBs and passing them to their SIMs. I absolutely disagree with
this.

If we keep unneeded complications out of busdma, we avoid a lot of
churn. We also leave the busdma interface available for other forms
of I/O without requiring more specific API additions to accommodate
them. What about unmapped network i/o coming from something like
sendfile?

> However, I'm not really trying to get the bio stuff into the tree,
> this is mostly for the Nvidia case and for that use case the driver is
> simply creating simple single-entry lists and using
> sglist_append_phys().

Designing the whole API around a single driver that we can't even get
the source to makes it hard to evaluate the API. Attilio and I have
spoken about this in private and will begin work on a prototype. Here
is the outline of what we're going to do:

1. Change struct sglist as follows:
   a. Uniform size
   b. Determine an optimal number of elements to include in the size
      (waving my hands here, more research is needed).
   c. Chain, offset, and length pointers, very similar to how mbufs
      already work
2. Expand the sglist API so that I/O producers can allocate slabs of
   sglists and slice them up into pools that they can manage and own
3. Add an sglist field to struct bio, and add appropriate flags to
   identify VA vs sglist operation
4. Extend the CAM_DATA_PHYS attributes in CAM to handle sglists.
5. Add bus_dmamap_load_sglist(). This will be able to walk chains and
   combine, split, and reassign segments as needed.
6. Modify a select number of drivers to use it.
7. Add a flag to disk->d_flags to signal if a driver can handle sglists.
   Have geom_dev look at this flag and generate a
   kmem_alloc_nofault+pmap_kenter sequence for drivers that can't
   support it.

In the end, no drivers will need to change, but the ones that do
change will obviously benefit. We're going to prototype this with an
i/o source that starts unmapped (via the Xen blkback driver). The
downside is that most GEOM transforms that need to touch the data
won't work, but that's something that can be addressed once the
prototype is done and evaluated.

Scott

From owner-freebsd-arch@FreeBSD.ORG Mon Nov 30 00:05:59 2009
From: Julian Elischer
To: Scott Long
Cc: Attilio Rao, arch@freebsd.org
Date: Sun, 29 Nov 2009 16:06:02 -0800
Subject: Re: sglist(9)
Message-ID: <4B130C6A.70406@elischer.org>
In-Reply-To: <66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>

Scott Long wrote:
> I think this is fundamentally wrong. You're proposing exchanging a
> cheap operation of splitting VA's with an expensive operation of
> allocating, splitting, copying, and refcounting sglists. Splitting is
> an excessively common operation, and your proposal will impact
> performance as storage becomes exponentially faster.

From the perspective of a flashdrive driver the more efficient the
better. The current generation of devices are doing 800MB/sec
(6.4Gb/sec) of scatter-gather random IO and really that will only go
up. We are doing over 130,000 independent transactions per second and
we can put multiple drives in a single machine.

These numbers will only increase with future developments.
From owner-freebsd-arch@FreeBSD.ORG Mon Nov 30 00:41:33 2009
From: Scott Long
To: Julian Elischer
Cc: Attilio Rao, arch@freebsd.org
Date: Sun, 29 Nov 2009 17:40:12 -0700
Subject: Re: sglist(9)
In-Reply-To: <4B130C6A.70406@elischer.org>

On Nov 29, 2009, at 5:06 PM, Julian Elischer wrote:

> Scott Long wrote:
>
>> I think this is fundamentally wrong.
>> You're proposing exchanging a cheap operation of splitting VA's with
>> an expensive operation of allocating, splitting, copying, and
>> refcounting sglists. Splitting is an excessively common operation,
>> and your proposal will impact performance as storage becomes
>> exponentially faster.
>
> From the perspective of a flashdrive driver the more efficient the
> better. The current generation of devices are doing 800MB/sec
> (6.4Gb/sec) of scatter-gather random IO and really that will only go
> up. We are doing over 130,000 independent transactions per second and
> we can put multiple drives in a single machine.
>
> These numbers will only increase with future developments.

MB/s doesn't tell me much other than the memory bandwidth of the
pathways (and that the DMA engines involved don't completely suck).
What about transactions/sec? That tells me a lot more about the
efficiency of the OS, drivers, and firmware, as well as latency.

Scott

From owner-freebsd-arch@FreeBSD.ORG Mon Nov 30 00:47:15 2009
From: Scott Long
To: Scott Long
Cc: Attilio Rao, arch@freebsd.org, Julian Elischer
Date: Sun, 29 Nov 2009 17:47:08 -0700
Subject: Re: sglist(9)
Message-Id: <615AB9D0-7171-4FE1-BE38-74E6FA7FE93A@samsco.org>

On Nov 29, 2009, at 5:40 PM, Scott Long wrote:

> On Nov 29, 2009, at 5:06 PM, Julian Elischer wrote:
>> Scott Long wrote:
>>
>>> I think this is fundamentally wrong. You're proposing exchanging a
>>> cheap operation of splitting VA's with an expensive operation of
>>> allocating, splitting, copying, and refcounting sglists. Splitting
>>> is an excessively common operation, and your proposal will impact
>>> performance as storage becomes exponentially faster.
>>
>> From the perspective of a flashdrive driver the more efficient the
>> better. The current generation of devices are doing 800MB/sec
>> (6.4Gb/sec) of scatter-gather random IO and really that will only go
>> up. We are doing over 130,000 independent transactions per second
>> and we can put multiple drives in a single machine.
>>
>> These numbers will only increase with future developments.
>
> MB/s doesn't tell me much other than the memory bandwidth of the
> pathways (and that the DMA engines involved don't completely suck).
> What about transactions/sec? That tells me a lot more about the
> efficiency of the OS, drivers, and firmware, as well as latency.

Bah, the answer was right in front of me, sorry =-) 130k is
impressive.

Scott

From owner-freebsd-arch@FreeBSD.ORG Mon Nov 30 11:06:48 2009
From: FreeBSD bugmaster
To: freebsd-arch@FreeBSD.org
Date: Mon, 30 Nov 2009 11:06:47 GMT
Subject: Current problem reports assigned to freebsd-arch@FreeBSD.org
Message-Id: <200911301106.nAUB6lTq043351@freefall.freebsd.org>

Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD
users. These represent problem reports covering all versions
including experimental development code and obsolete releases.

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/120749  arch       [request] Suggest upping the default kern.ps_arg_cache

1 problem total.

From owner-freebsd-arch@FreeBSD.ORG Mon Nov 30 19:18:06 2009
From: John Baldwin
To: Attilio Rao
Cc: FreeBSD Arch, Ed Maste
Date: Mon, 30 Nov 2009 13:05:30 -0500
Subject: Re: [PATCH] Statclock aliasing by LAPIC
Message-Id: <200911301305.30572.jhb@freebsd.org>
In-Reply-To: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com>

On Friday 27 November 2009 6:42:50 pm Attilio Rao wrote:
> Handling all the three clocks (hardclock, softclock, profclock) within
> the LAPIC can lead to aliasing for the softclock and profclock because
> hz is sized to fit mainly hardclock. The fashion to handle all of them
> within the LAPIC was introduced in 2005 and before then the softclock
> and profclock were supposed to be handled in the rtc. Right now, too,
> there is the necessary support to handle profclock and statclock in
> atrtc which gets enabled if the LAPIC signals it can't take in charge
> the three clocks.
> The proposed patch reverts the situation preferring the atrtc to
> handle the statclock and profclock (then a different source from the
> LAPIC) and then avoid the aliasing problem:
> http://www.freebsd.org/~attilio/Sandvine/STABLE_8/statclock_aliasing/statclock_aliasing3.diff
>
> In this patch, lapic_setup_clock() has been changed in order to return
> a three-states variable which identified if the LAPIC got in charge
> all the three clocks, just the hardclock or none of them (the current
> situation is just none/all) and the rtc handling runs subsequently.
> A tunable has been added to force LAPIC to get in charge all the three
> clocks if needed.
> In ia32 atrtc compiling is linked to atpic compiling, so a compile
> time flag has been added to check if atpic is not present and in case
> force LAPIC to take in charge all the three clocks (which is still
> better than the 'safe belt values' still present in the rtc code).
>
> Please note that statclock and profclock are widely used in our kernel
> (rusage is, for example, statclock driven) and fixing this would
> result in specific improvements (as a several-reported wrong CPU usage
> statistic in top).
> This bug has been found by Sandvine Incorporated.
>
> Reviews, comments and testing are welcome.

Presumably in the RTC case lapic_timer_hz should always be hz and not
some multiple of hz. Also, did you check to make sure all the locking
is present? I think at one point I changed the locking for the RTC
and/or ISA timer to just use critical_enter/exit so that UP machines
running an SMP kernel wouldn't pay the locking overhead since the code
was only used on UP machines. It may very well be that I only changed
that in a development branch though and not in HEAD.

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Mon Nov 30 20:59:06 2009
From: John Baldwin
To: Scott Long
Cc: Attilio Rao, arch@freebsd.org
Date: Mon, 30 Nov 2009 14:27:23 -0500
Subject: Re: sglist(9)
Message-Id: <200911301427.23166.jhb@freebsd.org>
In-Reply-To: <66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org>

On Sunday 29 November 2009 6:27:49 pm Scott Long wrote:
> > On Wednesday 20 May 2009 2:49:30 pm Jeff Roberson wrote:
> >> On Tue, 19 May 2009, John Baldwin wrote:
> >>
> >> 2) I worry that if all users do sglist_count() followed by a dynamic
> >> allocation and then an _append() they will be very expensive.
> >> pmap_kextract() is much more expensive than it may first seem to be.
> >> Do you have a user of count already?
> >
> > The only one that does now is sglist_build() and nothing currently
> > uses that.
>
> Kinda silly to have it then. I also don't see the point of it; if the
> point of the sglist object is to avoid VA mappings, then why start
> with a VA mapping? But that aside, Jeff is correct, sglist_build() is
> horribly inefficient.

It actually does get used by the nvidia driver, but so far in what I
have done in my jhb_bio branch I have tried several different
approaches which is why the API is as verbose as it is.

> > VOP_GET/PUTPAGES would not need to do this since they could simply
> > append the physical addresses extracted directly from vm_page_t's
> > for example.
> > I'm not sure this will be used very much now as originally I thought
> > I would be changing all storage drivers to do all DMA operations
> > using sglists and this sort of thing would have been used for
> > non-bio requests like firmware commands; however, as expounded on
> > below, it actually appears better to still treat bio's separate from
> > non-bio requests for bus_dma so that the non-bio requests can
> > continue to use bus_dmamap_load_buffer() as they do now.
>
> I completely disagree with your approach to busdma, but I'll get into
> that later. What I really don't understand is, why have yet another
> page description format? Whether the I/O is disk buffers being pushed
> by a pager to a disk controller, or texture buffers being pushed to
> video hardware, they already have vm objects associated with them,
> no? Why translate to an intermediate format? I understand that
> you're creating another vm object format to deal directly with the
> needs of nvidia, but that's really a one-off case right now. What
> about when we want the pagedaemon to push unmapped i/o? Will it have
> to spend cycles translating its objects to sglists? This is really a
> larger question that I'm not prepared to answer, but would like to
> discuss.

Textures do not already have objects associated, no. However, I did
not design sglist with Nvidia in mind. I hacked on it in conjunction
with phk@, gibbs@, and jeff@ to work on unmapped bio support. That was
the only motivation for sglist(9). Only the OBJT_SG bits were specific
to Nvidia and that was added as an afterthought because sglist(9)
already existed in the jhb_bio branch.

If you look at GETPAGES/PUTPAGES they already deal in terms of
vm_page_t's, not VM objects, and vm_page_t's already provide a linear
time way of fetching the physical address (m->phys_addr), so
generating an sglist for GETPAGES and PUTPAGES will be very cheap.
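
As a rough illustration of why building a list from pages is cheap, here
is a small userland model, not code from the jhb_bio branch. Every name
in it (toy_page, toy_seg, toy_sglist, toy_append_phys, toy_append_pages,
TOY_PAGE_SIZE) is a hypothetical stand-in; the merging of physically
adjacent ranges and the EFBIG result on overflow are assumptions made
for this sketch. The point is only that each page contributes one field
read and one append, with no virtual mapping involved.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userland stand-ins; these are not the kernel's definitions. */
struct toy_page { uint64_t phys_addr; };   /* models m->phys_addr */
struct toy_seg { uint64_t paddr; size_t len; };
struct toy_sglist {
	struct toy_seg *segs;   /* caller-provided segment storage */
	unsigned nseg;          /* segments in use */
	unsigned maxseg;        /* capacity of segs[] */
};

#define TOY_PAGE_SIZE 4096u

/* Append one physical range, merging it into the last segment if adjacent. */
static int
toy_append_phys(struct toy_sglist *sg, uint64_t paddr, size_t len)
{
	assert(sg != NULL && sg->segs != NULL);
	if (len == 0)
		return (0);
	if (sg->nseg > 0) {
		struct toy_seg *last = &sg->segs[sg->nseg - 1];

		if (last->paddr + last->len == paddr) {
			last->len += len;
			return (0);
		}
	}
	if (sg->nseg == sg->maxseg)
		return (EFBIG);     /* assumed error for "list too small" */
	sg->segs[sg->nseg].paddr = paddr;
	sg->segs[sg->nseg].len = len;
	sg->nseg++;
	return (0);
}

/* Describe an array of pages using only their physical addresses. */
static int
toy_append_pages(struct toy_sglist *sg, const struct toy_page *pages,
    size_t npages)
{
	for (size_t i = 0; i < npages; i++) {
		int error = toy_append_phys(sg, pages[i].phys_addr,
		    TOY_PAGE_SIZE);

		if (error != 0)
			return (error);
	}
	return (0);
}
```

In this model two physically adjacent pages collapse into one 8 KB
segment, so the segment count can be smaller than the page count.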
One of the original proposals from phk@ was actually to pass around
arrays of vm_page_t's to describe I/O buffers. When Poul, Peter, and I
talked about it we figured we had a choice between passing either the
physical address or the vm_page_t. However, not all physical addresses
have vm_page_t's, and it was deemed that GEOM's and disk drivers did
not need any properties of the vm_page_t aside from the physical
address. For those reasons, sglist(9) uses physical addresses.

> >> In general I think this is a good idea. It'd be nice to work on
> >> replacing the buf layer's implementation with something like this
> >> that could be used directly by drivers. Have you considered a
> >> busdma operation to load from a sglist?
> >
> > So in regards to the bus_dma stuff, I did work on this a while ago
> > in my jhb_bio branch. I do have a bus_dmamap_load_sglist() and I
> > had planned on using that in storage drivers directly. However, I
> > ended up circling back to preferring a bus_dmamap_load_bio() and
> > adding a new 'bio_start' field to 'struct bio' that is an offset
> > into an attached sglist.
>
> I strongly disagree with forcing busdma to have intimate knowledge of
> bio's. All of the needed information can be stored in sglist headers.

The alternative is to teach every disk driver to handle the difference
as well as every GEOM module. Not only that, but it doesn't provide
any easy transition path since you can't change any of the top-level
code to use an unmapped bio at all until all the lower levels have
been converted to handle both ways. Jeff had originally proposed
having a bus_dmamap_load_bio() and I tried to not use it but just have
a bus_dmamap_load_sglist() instead, but when I started looking at the
extra work that would have to be duplicated in every driver to handle
both types of bios, I concluded bus_dmamap_load_bio() would actually
be a lot simpler.
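
To make the "extra work duplicated in every driver" concrete, the
following is a userland sketch of the decision a single
bus_dmamap_load_bio()-style helper would centralize: use the
[bio_start, bio_start + bio_length) window of an attached sglist when
there is one, otherwise fall back to the mapped buffer. It is only a
model under stated assumptions. The struct layouts and the names
toy_bio, toy_sglist_subrange, toy_load_bio, and toy_path are
hypothetical, and the mapped path merely reports which branch was
taken instead of translating addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userland stand-ins; these are not the kernel's definitions. */
struct toy_seg { uint64_t paddr; size_t len; };
struct toy_sglist { const struct toy_seg *segs; unsigned nseg; };
struct toy_bio {
	void *data;                    /* mapped buffer, or NULL */
	const struct toy_sglist *sgl;  /* unmapped buffer, or NULL */
	size_t start;                  /* offset into sgl (models bio_start) */
	size_t length;                 /* models bio_length */
};

enum toy_path { TOY_PATH_KVA, TOY_PATH_SGLIST, TOY_PATH_ERROR };

/*
 * Emit the segments covering bytes [start, start + length) of sgl.
 * Returns the number of segments written, or -1 if the window does not
 * fit in the list or in out[].
 */
static int
toy_sglist_subrange(const struct toy_sglist *sgl, size_t start, size_t length,
    struct toy_seg *out, unsigned maxout)
{
	unsigned n = 0;

	for (unsigned i = 0; i < sgl->nseg && length > 0; i++) {
		size_t seglen = sgl->segs[i].len;
		size_t take;

		if (start >= seglen) {  /* segment ends before the window */
			start -= seglen;
			continue;
		}
		take = seglen - start;
		if (take > length)
			take = length;
		if (n == maxout)
			return (-1);
		out[n].paddr = sgl->segs[i].paddr + start;
		out[n].len = take;
		n++;
		length -= take;
		start = 0;
	}
	return (length == 0 ? (int)n : -1);
}

/* One entry point for both kinds of bio, so callers need not care. */
static enum toy_path
toy_load_bio(const struct toy_bio *bp, struct toy_seg *out, unsigned maxout,
    int *nout)
{
	assert(bp != NULL && nout != NULL);
	if (bp->sgl != NULL) {
		*nout = toy_sglist_subrange(bp->sgl, bp->start, bp->length,
		    out, maxout);
		return (*nout < 0 ? TOY_PATH_ERROR : TOY_PATH_SGLIST);
	}
	/* Mapped case: a real implementation would walk the KVA here. */
	*nout = 0;
	return (bp->data != NULL ? TOY_PATH_KVA : TOY_PATH_ERROR);
}
```

With the dispatch in one place, a caller in this model passes the bio
through unchanged and never inspects which representation it carries.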
> > This let me carve up I/O requests in geom_dev to satisfy a disk
> > device's max request size while still sharing the same read-only
> > sglist across the various BIO's (by simply adjusting bio_length and
> > bio_start to be a subrange of the sglist) as opposed to doing
> > memory allocations to allocate specific ranges of an sglist (using
> > something like sglist_slice()) for each I/O request.
>
> I think this is fundamentally wrong. You're proposing exchanging a
> cheap operation of splitting VA's with an expensive operation of
> allocating, splitting, copying, and refcounting sglists. Splitting is
> an excessively common operation, and your proposal will impact
> performance as storage becomes exponentially faster.

The whole point is to not do anything to the sglist when splitting up
requests so that it is more efficient. I wrote above that splitting up
the sglist would require allocations and be slow, so I specifically
avoided that. Instead, one just does a simple refcount bump
(refcount_acquire()) when cloning the bio (which is already doing an
allocation to get the new bio) and one does a simple
'bio->bio_start += X' where one already does 'bio->bio_data += X' now.
Instead, the sglist describes the "large" buffer at the "top" of the
I/O request tree, and when you split up the large bio into smaller
ones you simply use bio_start and bio_length to specify the sub-range
of the buffer.

> We need to stop thinking about maxio as a roadbump at the bottom of
> the storage stack, and instead think of it as a fundamental attribute
> that is honored at the top when a BIO is created. Instead of loading
> up an sglist with all of the pages (and don't forget coalesced pages
> that might need to be broken up), maybe multiple bio's are created
> that honor maxio from the start, or a single bio with a chained
> sglist, with each chain link honoring maxio, allowing for easy
> splitting.

It may be that the splitting that geom_dev does is done at the wrong
layer; I'm not debating that. :) I attempted to make sglist work
efficiently with what is there now, and other things like striping will
also want to use cheap splitting of buffers.

> > I then have bus_dmamap_load_bio() use the subrange of the sglist
> > internally or fall back to using the KVA pointer if the sglist
> > isn't present.
>
> I completely disagree. Drivers already deal with the details of
> bio's, and should continue to do so. If a driver gets a bio that has
> a valid bio_data pointer, it should call bus_dmamap_load(). If it
> gets one with a valid sglist, it should call bus_dmamap_load_sglist().
> Your proposal means that every storage driver in the system will
> have to change to use bus_dmamap_load_bio(). It's not a big change,
> but it's disruptive both in the tree and out. Your proposal also
> implies that CAM will have to start carrying BIO's in CCBs and passing
> them to their SIMs. I absolutely disagree with this.

Ok. As I mentioned above, while this does add churn, I think it is less
churn than changing all the drivers to handle the two different types of
bio requests. I also think the alternative is much less friendly to doing
the unmapped I/O changes in stages, which would allow the work to progress
in parallel in different areas. I also believe I specifically mentioned
changing CCBs to pass the bio instead of the raw (data, length) pair when
I discussed this with folks earlier.

> If we keep unneeded complications out of busdma, we avoid a lot of
> churn. We also leave the busdma interface available for other forms
> of I/O without requiring more specific API additions to accommodate
> them. What about unmapped network i/o coming from something like
> sendfile?

I do have a bus_dmamap_load_sglist() in my tree already.
Do note that we already have bus_dmamap_load_mbuf() and
bus_dmamap_load_uio(), so there is precedent for letting bus_dma handle
slightly more complex data structures than just a (buffer, length) pair.

> > However, I'm not really trying to get the bio stuff into the tree,
> > this is mostly for the Nvidia case and for that use case the driver
> > is simply creating simple single-entry lists and using
> > sglist_append_phys().
> >
>
> Designing the whole API around a single driver that we can't even get
> the source to makes it hard to evaluate the API.

The API was designed for the bio stuff, and not for any specific driver.
The Nvidia stuff was only done as an afterthought because the sglist(9)
structure already existed at the time. It was also designed as a result
of the discussions among several people and not completely in a vacuum.

> Attilio and I have spoken about this in private and will begin work on
> a prototype. Here is the outline of what we're going to do:

For those playing along at home, the things that I suggested to Attilio as
far as the next steps that I would do were to add the following APIs and
then make the necessary changes so that drivers and GEOM modules use
these:

- bus_dmamap_load_bio(): Fairly simple. Just takes a bio instead of
  (buffer, length).
- bio_adjust(): This is a lot like m_adj() but for bio's instead of
  mbuf's. It can be inline, but the point is to have GEOM modules use
  this to split up a bio buffer instead of directly manipulating bio_data
  and bio_length (possibly bio_offset as well).

Once these changes are done, adding support for simple unmapped bio's
consists of adding sglist support to bus_dma for each architecture and
bus_dmamap_load_bio() on each arch. Then upper layer code could start
using unmapped bios after that (I had hacky prototype changes to physio).
There would still be several big things to work out, such as GEOM modules
that need to manipulate the data and not just pass it through.
phk's suggestion here was to have the driver or GEOM module fail the
request with a magic error code. The originator was then supposed to map
the buffer and retry the request. Presumably one could note the first
time a given device object failed a request that way and always send down
mapped requests afterwards to avoid delays in subsequent I/O requests.
There are other ways of handling this problem as well I imagine. I have
not made any attempt to solve this problem.

Also, the changes Jeff has discussed with regards to tearing up
getpages/putpages and the buffer cache in general to take advantage of
unmapped bios are a separate animal that would build on this stuff
further. I have not made any attempt at this either.

I do find the idea of chaining sglist's together interesting. It would
lose one of the "benefits" of the current layout which is that the segment
array is ABI-compatible with bus_dma's S/G list format so that in the
common case the sglist that physio or getpages/putpages would generate
could be passed directly to the device driver's bus_dma callback without
having to generate an intermediate data structure.
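
For readers following along, the cheap splitting described in this message (a refcount bump plus adjusting bio_start and bio_length, with the segment array left alone) can be sketched in plain C roughly as follows. This is only an illustration under stated assumptions: `struct seg`, `struct sgl`, `struct req`, `req_split()` and `req_segments()` are hypothetical stand-ins invented for the example, not the real sglist(9) or struct bio definitions, and the plain counter stands in for refcount_acquire().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the real sglist(9) or struct bio. */
struct seg {
	uint64_t paddr;		/* physical address of the segment */
	size_t len;		/* segment length in bytes */
};

struct sgl {
	unsigned refs;		/* shared by every request pointing here */
	size_t nseg;
	const struct seg *segs;	/* read-only once built */
};

struct req {
	struct sgl *sg;
	size_t start;		/* byte offset into sg (like bio_start) */
	size_t length;		/* byte count (like bio_length) */
};

/* Carve [off, off + len) out of parent without touching the segments. */
static void
req_split(const struct req *parent, struct req *child, size_t off, size_t len)
{
	assert(off <= parent->length && len <= parent->length - off);
	parent->sg->refs++;	/* stand-in for refcount_acquire() */
	child->sg = parent->sg;
	child->start = parent->start + off;
	child->length = len;
}

/*
 * Walk the sub-range a request describes.  Stores up to max segments in
 * out and returns how many segments the sub-range needs in total.
 */
static size_t
req_segments(const struct req *r, struct seg *out, size_t max)
{
	size_t skip = r->start, left = r->length, n = 0;

	for (size_t i = 0; i < r->sg->nseg && left > 0; i++) {
		const struct seg *s = &r->sg->segs[i];

		if (skip >= s->len) {	/* segment lies before the range */
			skip -= s->len;
			continue;
		}
		size_t take = s->len - skip;
		if (take > left)
			take = left;
		if (n < max) {
			out[n].paddr = s->paddr + skip;
			out[n].len = take;
		}
		n++;
		left -= take;
		skip = 0;
	}
	return (n);
}
```

Splitting a 12 KB request over three 4 KB pages into two 6 KB halves then costs two counter increments and no allocation of segment storage; only the consumer that finally loads the DMA map walks the sub-range.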

-- 
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Mon Nov 30 21:03:29 2009
From: "Poul-Henning Kamp"
To: John Baldwin
Cc: Attilio Rao, arch@freebsd.org, Scott Long
Subject: Re: sglist(9)
Date: Mon, 30 Nov 2009 21:03:46 +0000
Message-ID: <16356.1259615026@critter.freebsd.dk>
In-Reply-To: Your message of "Mon, 30 Nov 2009 14:27:23 EST." <200911301427.23166.jhb@freebsd.org>

In message <200911301427.23166.jhb@freebsd.org>, John Baldwin writes:
>On Sunday 29 November 2009 6:27:49 pm Scott Long wrote:

>It actually does get used by the nvidia driver, but so far in what I have
>done in my jhb_bio branch I have tried several different approaches which
>is why the API is as verbose as it is.

I would warn equally against rigorous simplification and gratuitous
generalization in this; I've tried both approaches in prototypes and
neither works out well from an API point of view.

The insight that expended CPU cycles are practically unmeasurable in this
context should not be forgotten, even in the quest to get ever higher
transactions per second numbers.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch@FreeBSD.ORG Mon Nov 30 22:13:23 2009
From: Scott Long
To: John Baldwin
Cc: Attilio Rao, arch@freebsd.org
Subject: Re: sglist(9)
Date: Mon, 30 Nov 2009 15:13:19 -0700
Message-Id: <02A7F7FF-EBBC-40F3-8EBB-BFD4E5BE5391@samsco.org>
In-Reply-To: <200911301427.23166.jhb@freebsd.org>
References: <200905191458.50764.jhb@freebsd.org> <3bbf2fe10911291429k54b4b7cfw9e40aefeca597307@mail.gmail.com> <66707B0F-D0AB-49DB-802F-13146F488E1A@samsco.org> <200911301427.23166.jhb@freebsd.org>

On Nov 30, 2009, at 12:27 PM, John Baldwin wrote:

> On Sunday 29 November 2009 6:27:49 pm Scott Long wrote:
>>> On Wednesday 20 May 2009 2:49:30 pm Jeff Roberson wrote:
>>>> On Tue, 19 May 2009, John Baldwin wrote:
>>>>
>>>> 2) I worry that if all users do sglist_count() followed by a
>>>> dynamic allocation and then an _append() they will be very
>>>> expensive. pmap_kextract() is much more expensive than it may
>>>> first seem to be. Do you have a user of count already?
>>>
>>> The only one that does now is sglist_build() and nothing currently
>>> uses that.
>>
>> Kinda silly to have it then. I also don't see the point of it; if
>> the point of the sglist object is to avoid VA mappings, then why
>> start with a VA mapping? But that aside, Jeff is correct,
>> sglist_build() is horribly inefficient.
>
> It actually does get used by the nvidia driver, but so far in what I
> have done in my jhb_bio branch I have tried several different
> approaches which is why the API is as verbose as it is.
>
>>> VOP_GET/PUTPAGES would not need to do this since they could simply
>>> append the physical addresses extracted directly from vm_page_t's
>>> for example. I'm not sure this will be used very much now as
>>> originally I thought I would be changing all storage drivers to do
>>> all DMA operations using sglists and this sort of thing would have
>>> been used for non-bio requests like firmware commands; however, as
>>> expounded on below, it actually appears better to still treat bio's
>>> separate from non-bio requests for bus_dma so that the non-bio
>>> requests can continue to use bus_dmamap_load_buffer() as they do
>>> now.
>>>
>>
>> I completely disagree with your approach to busdma, but I'll get into
>> that later.
>> What I really don't understand is, why have yet another page
>> description format? Whether the I/O is disk buffers being pushed by
>> a pager to a disk controller, or texture buffers being pushed to
>> video hardware, they already have vm objects associated with them,
>> no? Why translate to an intermediate format? I understand that
>> you're creating another vm object format to deal directly with the
>> needs of nvidia, but that's really a one-off case right now. What
>> about when we want the pagedaemon to push unmapped i/o? Will it have
>> to spend cycles translating its objects to sglists? This is really a
>> larger question that I'm not prepared to answer, but would like to
>> discuss.
>
> Textures do not already have objects associated, no. However, I did
> not design sglist with Nvidia in mind. I hacked on it in conjunction
> with phk@, gibbs@, and jeff@ to work on unmapped bio support. That was
> the only motivation for sglist(9). Only the OBJT_SG bits were specific
> to Nvidia and that was added as an afterthought because sglist(9)
> already existed in the jhb_bio branch.
>
> If you look at GETPAGES/PUTPAGES they already deal in terms of
> vm_page_t's, not VM objects, and vm_page_t's already provide a linear
> time way of fetching the physical address (m->phys_addr), so generating
> an sglist for GETPAGES and PUTPAGES will be very cheap. One of the
> original proposals from phk@ was actually to pass around arrays of
> vm_page_t's to describe I/O buffers. When Poul, Peter, and I talked
> about it we figured we had a choice between passing either the physical
> address or the vm_page_t. However, not all physical addresses have
> vm_page_t's, and it was deemed that GEOM's and disk drivers did not
> need any properties of the vm_page_t aside from the physical address.
> For those reasons, sglist(9) uses physical addresses.
>
>>>> In general I think this is a good idea.
>>>> It'd be nice to work on replacing the buf layer's implementation
>>>> with something like this that could be used directly by drivers.
>>>> Have you considered a busdma operation to load from a sglist?
>>>
>>> So in regards to the bus_dma stuff, I did work on this a while ago
>>> in my jhb_bio branch. I do have a bus_dmamap_load_sglist() and I
>>> had planned on using that in storage drivers directly. However, I
>>> ended up circling back to preferring a bus_dmamap_load_bio() and
>>> adding a new 'bio_start' field to 'struct bio' that is an offset
>>> into an attached sglist.
>>
>> I strongly disagree with forcing busdma to have intimate knowledge of
>> bio's. All of the needed information can be stored in sglist
>> headers.
>
> The alternative is to teach every disk driver to handle the difference
> as well as every GEOM module. Not only that, but it doesn't provide
> any easy transition path since you can't change any of the top-level
> code to use an unmapped bio until all the lower levels have been
> converted to handle both ways. Jeff had originally proposed having a
> bus_dmamap_load_bio() and I tried to not use it but just have a
> bus_dmamap_load_sglist() instead, but when I started looking at the
> extra work that would have to be duplicated in every driver to handle
> both types of bios, I concluded bus_dmamap_load_bio() would actually
> be a lot simpler.

You completely missed the part of my email where I talk about not having
to update drivers for these new APIs. In any case, I still respectfully
disagree with your approach to busdma and bio handling, and ask that you
let Attilio and me work on our prototype. Once that's done, we can stop
talking in hypotheticals.

Scott

From owner-freebsd-arch@FreeBSD.ORG Tue Dec 1 15:30:15 2009
From: Bruce Evans
To: John Baldwin
Cc: Attilio Rao, FreeBSD Arch, Ed Maste
Subject: Re: [PATCH] Statclock aliasing by LAPIC
Date: Wed, 2 Dec 2009 02:30:09 +1100 (EST)
Message-ID: <20091201233938.K1089@besplex.bde.org>
In-Reply-To: <200911301305.30572.jhb@freebsd.org>
References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com> <200911301305.30572.jhb@freebsd.org>

On Mon, 30 Nov 2009, John Baldwin wrote:

> On Friday 27 November 2009 6:42:50 pm Attilio Rao wrote:
>> Handling all the three clocks (hardclock, statclock, profclock) within
>> the LAPIC can lead to aliasing for the statclock and profclock because
>> hz is sized to fit mainly hardclock. The fashion to handle all of them
>> within the LAPIC was introduced in 2005 and before then the statclock
>> and profclock were supposed to be handled in the rtc.
>> Right now, too, there is the necessary support to handle profclock
>> and statclock in atrtc which gets enabled if the LAPIC signals it
>> can't take charge of the three clocks.
>> The proposed patch reverts the situation preferring the atrtc to
>> handle the statclock and profclock (then a different source from the
>> LAPIC) and then avoid the aliasing problem:

This would defeat most of the point of using the lapic timer. RTC
interrupts are too slow to use for anything if there is an alternative
like the lapic timer. i8254 interrupts are not so bad, and in fact are
just as efficient as lapic timer interrupts iff they are controlled by
the APIC and not by the ATPIC. This is because RTC interrupts must be
acked and tested for in RTC registers, and the RTC is on the ISA bus so
accessing it is very slow, while the i8254 is programmed for its
interrupts to not need any acking or testing. RTC and i8254 interrupts
may also be controlled by the ATPIC, and then the ATPIC must be acked on
the ISA bus too. This gives the following number of ISA bus accesses for
most interrupts:

    device         read  write
    ------         ----  -----
    lapic_timer    0     0
    i8254          0     0+0/1
    RTC (current)  1     0+0/2
    RTC (old)      3     1+0/2

Here "+0/1" and "+0/2" are for the ATPIC ack, if any. RTC (old) is before
I optimized rtcin() to not write the index register in the usual case
where it has not changed (writing the index register takes 1 extra write
and uses 2 dummy reads in an attempt to satisfy timing requirements).
However, there is apparently broken (or just incompatible) hardware that
fails with this optimization. There would probably be more reports of
this brokenness if using the RTC became the default again.

The 4-6 ISA accesses for RTC (old) take about 4-9 usec, so using the RTC
at stathz = 128 Hz takes only 0.05-0.12% of 1 CPU, which is acceptable.
Using the RTC at profhz = 1024 Hz takes 0.4-0.9% of 1 CPU, which may also
be acceptable, but profhz = 1024 was too slow even for a 386/20 in 1993;
it should be 200-1000 times larger now, but the RTC just can't support
that, and one reason it was never increased is that the RTC is too
inefficient. Profiling can now be implemented better using the lapic
timer, but using the lapic timer currently implements profiling slightly
worse than using the RTC.

> http://www.freebsd.org/~attilio/Sandvine/STABLE_8/statclock_aliasing/statclock_aliasing3.diff
>>
>> In this patch, lapic_setup_clock() has been changed in order to return
>> a three-states variable which identified if the LAPIC got in charge
>> all the three clocks, just the hardclock or none of them (the current
>> situation is just none/all) and the rtc handling runs subsequently.
>> A tunable has been added to force LAPIC to take charge of all the
>> three clocks if needed.
>> In ia32 atrtc compiling is linked to atpic compiling, so a compile
>> time flag has been added to check if atpic is not present and in case
>> force LAPIC to take charge of all the three clocks (which is still
>> better than the 'safe belt values' still present in the rtc code).

I don't like tunables, especially to switch from one bug to another.
This can be done better using sysctls only, since it is not needed for
booting. The sysctls would need to be runnable at any time, but
reprogramming the lapic timer at any time is already needed to fix
profiling (cpu_start/stopprofclock() are missing support for the lapic
timer; instead, the default lapic_timer_hz is set excessively large but
not large enough for a good profhz). sysctls also let you test this
stuff without rebooting.

>> Please note that statclock and profclock are widely used in our kernel
>> (rusage is, for example, statclock driven) and fixing this would
>> result in specific improvements (as a several-reported wrong CPU usage
>> statistic in top).
>> This bug has been found by Sandvine Incorporated.

What bug exactly? Bugs like this must have been found before 1993, since
statclock() in 4.4BSD was supposed to fix them. See "A Randomized
Sampling Clock for CPU Utilization Estimation and Code Profiling"
(ftp://ftp.ee.lbl.gov/papers/statclk-usenix93.ps.Z). FreeBSD never
implemented the "Randomized" part, but its statclock() used to sort of
work, since by default stathz was > hz and was not nearly a multiple of
hz. Someone broke the former by increasing the default hz to 1000. This
allowed malicious programs to easily hide themselves from statclock()
while consuming a large fraction of CPU cycles (when stathz was > hz, it
was not so easy to hide, and very difficult to both hide and consume
significant CPU, since timeout granularity makes it hard to control
wakeups). Then using the single lapic timer to generate all periodic
timer interrupts increased the synchronization of these interrupts, thus
moving further from a randomized statclock(). However, the defaults with
the lapic timer give an even larger beat frequency than before, so I
don't see how using the lapic timer can increase the problem much. (The
beat frequency of (1000, 128) is 16000. The beat frequency of (1000, 133)
is 133000. The latter means that, with defaults, statclock and
hardclock() ticks are only perfectly synced once in every 133 seconds.
Misconfiguring hz to a multiple of 128 can give perfect synchronization,
which may be more of a problem, or a fix -- see below).

>>
>> Reviews, comments and testing are welcome.

Review of the part of the patch visible in the mail: .

> Presumably in the RTC case lapic_timer_hz should always be hz and not
> some multiple of hz.

Sure. Except the allocation of the timers is backwards at best. You
need profhz on the most efficient timer so that it can be very large
(other changes are required for large profhz to actually work).
You want stathz on the next most efficient timer so that it can be larger
than hz (see above) (other changes are needed for a stathz much larger
than 128 to actually work. Note that at least SCHED_4BSD wants a
scheduling clock frequency of much less than 128 -- it essentially
divides stathz by 8 to get this. Scaling in calcru() is currently broken
after several hundred days of runtime, and would break sooner with larger
stathz).

Perhaps your recent changes (that removed the literal constant dividers)
made the synchronization problem worse. But these changes make it easy
to implement any number of independent timers with optional randomness
using the lapic timer. E.g., to randomize statclock(), just add a small
random value (+-) as well as stathz. Note that statistics utilities
won't like this -- some like systat(1) use statistics ticks as a timebase
so they want statclock() to be perfectly periodic.

I don't worry about the synchronization or broken profiling, and use
lapic_timer_hz = profhz = stathz = hz = 100 whenever the lapic timer is
used. I haven't noticed any problems caused by this (mostly using
SCHED_4BSD), except the unavoidable one that hz = 100 gives less accurate
usr/sys decomposition than does hz = 1000. I have noticed that this
fixed the cosmetic problem that systat(1) shows glitches in the lapic
timer interrupt rates: Although using the lapic timer for all timer
interrupts makes them all perfectly periodic, systat cannot see this
because stathz = 133 is too small a sampling rate and is not an exact
divisor of lapic_timer_hz -- it caused a glitch every
lapic_timer_hz/stathz seconds. For other interrupts, we wouldn't expect
the rates to be constant, but we know that the lapic timer interrupt rate
is constant so we know that the oscillation of its displayed value is a
bug.
Right now on ref8-i386.freebsd.org, I see the values not oscillating much
but being weird: for cpu0-1, they are near 1973; for cpu2-3, they are
near 1981; for cpu4-5, they are near 2043, and for cpu6-7 they are near
2003.

A tickless kernel would need to at least consider running the scheduler
and statistics gathering on most context switches (unless it keeps using
ticks when not idle). The scheduler parts of this would also fix timer
synchronization problems for !tickless kernels, but I don't see how they
can be as efficient as only considering scheduling at infrequent tick
intervals.

> Also, did you check to make sure all the locking is present? I think
> at one point I changed the locking for the RTC and/or ISA timer to
> just use critical_enter/exit so that UP machines running an SMP kernel
> wouldn't pay the locking overhead since the code was only used on UP
> machines. It may very well be that I only changed that in a
> development branch though and not in HEAD.

I don't remember any locking changes for RTC ever being committed.
rtcin() still uses mtx_lock_spin(&clock_lock). clock_lock is the i8254
clock's lock, and is still abused for the RTC. This abuse was convenient
when the RTC driver was implemented in the same file as the i8254 driver,
but now the RTC driver is in its own file. The i8254's private variable
`clock_lock' is even declared in the RTC driver's public header, with
other style bugs of course.
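
The RTC overhead percentages quoted earlier in this message follow from simple arithmetic: cost per interrupt times interrupts per second, as a fraction of one second. A minimal sketch in C, where the function name `intr_cpu_percent()` is made up for illustration and the 4-9 usec cost range is taken from the text above rather than measured:

```c
#include <assert.h>

/*
 * Percentage of one CPU consumed when every interrupt costs
 * usec_per_intr microseconds and interrupts arrive at hz per second.
 * Hypothetical helper for illustration only.
 */
static double
intr_cpu_percent(double usec_per_intr, unsigned hz)
{
	assert(usec_per_intr >= 0.0);
	/* usec per second spent in the handler, scaled to percent */
	return (usec_per_intr * (double)hz / 1e6 * 100.0);
}
```

With 4 and 9 usec at 128 Hz this gives roughly 0.05% and 0.12%, and at 1024 Hz roughly 0.4% and 0.9%, matching the ranges given for stathz and profhz.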

Bruce

From owner-freebsd-arch@FreeBSD.ORG Tue Dec 1 16:01:40 2009
From: Andriy Gapon
Cc: arch@freebsd.org
Subject: Re: [PATCH] Statclock aliasing by LAPIC
Date: Tue, 01 Dec 2009 18:01:37 +0200
Message-ID: <4B153DE1.2030707@icyb.net.ua>
In-Reply-To: <20091201233938.K1089@besplex.bde.org>
References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com> <200911301305.30572.jhb@freebsd.org> <20091201233938.K1089@besplex.bde.org>

BTW, we could also consider using a periodic HPET timer (perhaps in
legacy mode) for some of these tasks on modern hardware.

-- 
Andriy Gapon

From owner-freebsd-arch@FreeBSD.ORG Tue Dec 1 16:32:00 2009
From: Bruce Evans
To: Bruce Evans
Cc: Attilio Rao, FreeBSD Arch, Ed Maste
Subject: Re: [PATCH] Statclock aliasing by LAPIC
Date: Wed, 2 Dec 2009 03:31:56 +1100 (EST)
Message-ID: <20091202025202.B22732@delplex.bde.org>
In-Reply-To: <20091201233938.K1089@besplex.bde.org>
References: <3bbf2fe10911271542h2b179874qa0d9a4a7224dcb2f@mail.gmail.com> <200911301305.30572.jhb@freebsd.org> <20091201233938.K1089@besplex.bde.org>

On Wed, 2 Dec 2009, Bruce Evans wrote:

> ... However, the defaults with the lapic timer give an even larger
> beat frequency than before, so I don't see how using the lapic timer
> can increase the problem much. (The beat frequency of (1000, 128) is
> 16000. The beat frequency of (1000, 133) is 133000.
> The latter means that, with defaults, statclock and hardclock() ticks
> are only perfectly synced once in every 133 seconds. Misconfiguring
> hz to a multiple of 128 can give perfect synchronization, which may be
> more of a problem, or a fix -- see below).

PS (the see below part): with perfect sync, statclock() ticks can be kept
perfectly out of phase, and this might work well. E.g.:

(1) hz = 1000, stathz = 125, lapic_timer_hz = 2000: hz ticks on lapic
    ticks # 0, 2, 4, ...; stathz ticks on lapic ticks # 7, 15, 23, ...
    Malicious programs can still easily hide from statclock().

(2) hz = 100, stathz = 100, lapic_timer_hz = 200: hz ticks on lapic
    ticks # 0, 2, 4, ...; stathz ticks on lapic ticks # 1, 3, 5, ...
    Malicious programs can easily predict statclock(), but can't easily
    use more than half of the CPU: e.g.,
    - run from hz tick N+epsilon to N+0.5-epsilon (seems to need frequent
      clock_gettime() calls to determine when to give up control;
      timeouts are no use since none can occur until tick N+1-epsilon)
    - usleep(1) and/or give up control to another process. If the former
      only, then there can be no timeout until hz tick N+1-epsilon, and
      we can hog at most half the CPU. If the latter, then we will need
      to find a different one quite often, else the victim processes will
      accumulate ticks instead of us and they will be de-scheduled
      instead of us. fork() by us must not be cost-free, else we can
      generate cooperating victim processes too easily for this and other
      types of hogging.

With a randomized statclock(), the randomness would have to be quite
large and not just a small glitch on the increment like I said before,
else maliciousness like in (2) would work to the extent that the
non-glitch part is large.
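
The beat frequencies quoted above are the least common multiples of the two rates, and whether two tick sequences on the same lapic tick counter can ever land on the same tick is a congruence question. A small C sketch of both, offered only as an illustration: the helper names `gcd_u64()`, `beat_frequency()` and `ticks_ever_coincide()` are invented here and do not come from the kernel or the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Greatest common divisor by Euclid's algorithm. */
static uint64_t
gcd_u64(uint64_t a, uint64_t b)
{
	while (b != 0) {
		uint64_t t = a % b;
		a = b;
		b = t;
	}
	return (a);
}

/* "Beat frequency" of two rates in the sense used above: their lcm. */
static uint64_t
beat_frequency(uint64_t f1, uint64_t f2)
{
	assert(f1 > 0 && f2 > 0);
	return (f1 / gcd_u64(f1, f2) * f2);
}

/*
 * Sequence A fires on ticks phase_a + k * period_a, sequence B on
 * phase_b + k * period_b.  They share a tick at some point iff the phase
 * difference is a multiple of gcd(period_a, period_b).
 */
static bool
ticks_ever_coincide(uint64_t period_a, uint64_t phase_a,
    uint64_t period_b, uint64_t phase_b)
{
	assert(period_a > 0 && period_b > 0);
	uint64_t g = gcd_u64(period_a, period_b);
	uint64_t d = phase_a > phase_b ? phase_a - phase_b : phase_b - phase_a;

	return (d % g == 0);
}
```

For case (2), hz ticks have period 2 and phase 0 while stathz ticks have period 2 and phase 1, so the phase difference 1 is not a multiple of 2 and the two never share a lapic tick. The same holds whenever one sequence stays on even ticks and the other on odd ticks.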

Bruce

From owner-freebsd-arch@FreeBSD.ORG Thu Dec 3 22:02:23 2009
From: Harti Brandt
To: freebsd-arch@freebsd.org
Reply-To: Harti Brandt
Subject: struct if_data and ifmibdata
Date: Thu, 3 Dec 2009 22:18:08 +0100 (CET)
Message-ID: <20091203220011.H53516@beagle.kn.op.dlr.de>

Hi,

I'm currently working on the networking MIBs for bsnmpd to implement the
more recent RFCs (including IPv6 stuff). While doing this I ran into
numerous problems accessing interface information. The two sources of
this information are $subj, each of which has some problems. The main
problem is missing flexibility because of ABI issues. I have some ideas
how to relax this somewhat, but before starting to implement anything I
thought I'd ask around whether this makes sense.

1. struct if_data. This is embedded into struct ifnet, so any change in
   size changes the ifnet offsets, which is bad once we start keeping the
   ABI stable in -current.
   Other problems are:

   - hard to find out what version of struct if_data one is retrieving
     via either the if_msghdr routing message or via the interface MIB
   - we've run out of ifi_type (u_char) space. IANAIfType is currently
     at 251. Actually some of our private defines in if_types.h already
     overlap IANA assigned types
   - ifi_physical is not used anywhere in the kernel as far as I can see
     and should probably be removed together with the associated ioctls.
     This seems to have been replaced a long time ago by the if_media
     stuff.
   - we've run out of if_baudrate space (u_long) on 32-bit architectures
     for 10GBit/s interfaces
   - broadcast packet statistics are missing (they are required by the
     actual IF-MIB)
   - ifi_datalen is rather short (u_char) and restricts structure size to
     255 bytes.

   So what I would like to do is:

   - add a version field at the beginning and a #define to help user
     programs in working with different versions of this structure
   - add a couple dozen bytes at the end to allow extending the structure
     without changing its size

2. struct ifmibdata

   - add a version field here too.

3. struct ifmib_iso_8802_3

   - add a version field here too.
   - add dot3StatsSymbolErrors which are required by the current
     EtherLike-MIB. Unfortunately only 4 drivers actually implement the
     ethernet statistics :-( so far

So, does this make any sense?

harti
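
As a rough illustration of the "version field at the beginning plus spare bytes at the end" idea from item 1, here is a minimal C sketch. Every name in it (`struct xif_data`, the `xd_` fields, `XIF_DATA_VERSION`, `xif_data_init()`, `xif_get_ibcasts()`) is hypothetical and chosen for the example; it is not the real struct if_data layout and not a proposed patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XIF_DATA_VERSION 2	/* hypothetical: bumped when a field is added */

/* Hypothetical layout, not the real struct if_data. */
struct xif_data {
	uint32_t xd_version;	/* which fields below are meaningful */
	uint32_t xd_datalen;	/* wide enough for structures > 255 bytes */
	uint64_t xd_baudrate;	/* 64 bits: room for 10 Gbit/s and beyond */
	uint64_t xd_ipackets;
	uint64_t xd_ibcasts;	/* added in version 2, carved from spare */
	uint64_t xd_spare[7];	/* reserved so the size can stay fixed */
};

/* The size is part of the ABI; adding fields must consume spare space. */
_Static_assert(sizeof(struct xif_data) == 88, "xif_data size changed");

/* Producer side: zero everything and stamp the current version. */
static void
xif_data_init(struct xif_data *d)
{
	memset(d, 0, sizeof(*d));
	d->xd_version = XIF_DATA_VERSION;
	d->xd_datalen = (uint32_t)sizeof(*d);
}

/*
 * Consumer side: only trust a field if the producer's version has it.
 * Returns 0 and stores the counter, or -1 if the field is not available.
 */
static int
xif_get_ibcasts(const struct xif_data *d, uint64_t *out)
{
	if (d->xd_version < 2)
		return (-1);
	*out = d->xd_ibcasts;
	return (0);
}
```

A consumer built against version 2 can still read a structure produced by version 1 code: the offsets and total size are unchanged, and the version check tells it that the broadcast counter is not filled in.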