From owner-freebsd-arch@FreeBSD.ORG Tue Feb 17 02:10:15 2015
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by hub.freebsd.org (Postfix) with ESMTPS id E2D41152;
	Tue, 17 Feb 2015 02:10:15 +0000 (UTC)
Received: from mail.karels.net (mail.karels.net [63.231.190.5])
	by mx1.freebsd.org (Postfix) with ESMTP id 637F63B7;
	Tue, 17 Feb 2015 02:10:15 +0000 (UTC)
Received: from mail.karels.net (localhost [127.0.0.1])
	by mail.karels.net (8.14.7/8.14.7) with ESMTP id t1H1ouxM020621;
	Mon, 16 Feb 2015 19:50:57 -0600 (CST)
	(envelope-from mike@karels.net)
Message-Id: <201502170150.t1H1ouxM020621@mail.karels.net>
To: "George Neville-Neil"
From: Mike Karels
Reply-to: mike@karels.net
Subject: Re: Adding new media types to if_media.h
In-reply-to: Your message of Mon, 09 Feb 2015 21:08:41 +0000.
Date: Mon, 16 Feb 2015 19:50:56 -0600
Cc: "freebsd-net@freebsd.org", "freebsd-arch@freebsd.org"
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 Feb 2015 02:10:16 -0000

On Feb 9, gnn wrote:
> On 8 Feb 2015, at 22:41, Mike Karels wrote:
> > Sorry to reply to a thread after such a long delay, but I think it is
> > unresolved, and needs more discussion. I'd like to elaborate a bit on
> > my goals and proposal. I believe Adrian has newer thoughts than have
> > been circulated here as well.
> >
> > The last message(s) have gone to freebsd-arch and freebsd-net. If
> > someone wants to pick one, we could consolidate, but this seems
> > relevant to both.
> >
> > I'm going to top-post to try to summarize and extend the discussion,
> > but the preceding emails follow for reference.
> >
> > To recap: the existing if_media interface is running out of steam, at
> > least in that the "Media variant" field, with 5 bits, is going to be
> > insufficient to express existing 40 Gb/s variants. The if_media media
> > type is a 32-bit int with a bunch of sub-fields for type (e.g.
> > Ethernet), subtype/variant (e.g. 10baseT, 10base5, 1000baseT, etc),
> > flags, and some MII-related fields.
> >
> > I made a proposal to extend the interface in a small way, specifically
> > to replace the "media word" with a 64-bit int that is mostly the same,
> > but has a new, larger variant/subtype field. The main reason for this
> > proposal is to maintain the driver KPI (glimpse showed me 240
> > inclusions of if_media.h in the kernel in 8.2). That interface
> > includes an initialization using a scalar value of fields ORed with
> > each other. It would also be easy to preserve a 32-bit user-level
> > API/ABI that can express most of the current state, with a
> > subtype/variant field value reserved for "other" (there is already one
> > for "unknown", but that is not quite the same). fwiw, I found 45
> > references to this user-level API in our tree, including both base and
> > "ports"-type software, which includes libpcap, snmpd, dhclient,
> > quagga, xorp, atm, devd, and rtsold, which argues for a
> > backward-compatible API/ABI as well as a more-complete current
> > interface for ifconfig at least.
> >
> > More generally, I see two problems with the existing if_media
> > interface:
> >
> > 1. It doesn't have enough bits for all the fields, in particular,
> > variant/subtype for Ethernet. That is the immediate issue.
> >
> > 2. The interface is not sufficiently generic; it was designed around
> > Ethernet including MII, token ring, FDDI, and a few other interface
> > types. Some of the fields like "instance" are primarily for MII as far
> > as I know, and are basically unused. It is definitely not sufficient
> > for 802.11, which has rolled its own interfaces.
> >
> > To solve the second problem, I think the right approach would be to
> > reduce this interface to a truly generic one, such as media type (e.g.
> > Ethernet), generic flags, and perhaps generic status. Then there
> > should be a separate media-specific interface for each type, such as
> > Ethernet and 802.11. To a small extent, we already have that. Solving
> > the second, more general problem, requires a whole new driver KPI that
> > will require surgery to every driver, which is not an exercise that I
> > would consider.
> >
> > Using a separate int for each existing field, as proposed, would break
> > the driver KPI, but would not really make the interface generic.
> > Trying to make a single interface with the union of all network
> > interface requirements seems like a bad idea to me (we failed last
> > time; the "we" is BSDi, where I was the architect when this interface
> > was first designed). (No, I didn't design this interface.)
> >
> > Solving the first problem only, I think it is preferable to preserve a
> > compatible driver KPI, which means using a scalar value encoding what
> > is necessary. Although that interface is rather Ethernet-centric, that
> > is really what it is used for.
> >
> > An additional, selfish goal is to make it easy to back-port drivers
> > using the new interface to older versions (which I am quite likely to
> > do). Preserving the KPI and general user API will be highly useful
> > there. I'd be likely to do a 11-style version of ifconfig personally,
> > but it might not be difficult to do in a more general way.
> >
> > I am willing to do a prototype for -current for evaluation.
> >
> > Comments, alternatives, ?

> I agree with your statements above and I'd like to see the prototype.

Well, I developed the prototype as I had planned, using a 64-bit media
word, and found that I got about 100 files in GENERIC that didn't compile;
they attempted to store "media words" in an int. My kingdom for a typedef.
That didn't meet my goal of KPI compatibility, so I went to Plan B.

Plan B is to steal an unused bit (RFU) to indicate an "extended" media
type. I then used the variant/subtype field to store the extended type.
Effectively, the previously unused bit doubles the effective size of the
subtype field. Given that the previous 5-bit field lasted us 18 years,
I figured that doubling it would last a while. I also changed the
SIOCGIFMEDIA ioctl, splitting it for binary compatibility; extended
types are all mapped to IFM_OTHER (31) using the old interface, but
are visible using the new one.

With these changes, I modified one driver (vtnet) to use an extended type,
and the rest of GENERIC is happy. The changes to ifconfig are also fairly
small. The patch is appended, where email programs will screw it up,
or at ftp://ftp.karels.net/outgoing/if_media.patch.

The VFAST subtype is a throw-away for testing.

This seems like a reasonably pragmatic change to support the new 40 Gb/s
media types until someone wants to design an improved but
non-backward-compatible interface. I think it meets the goal of
suitability for back-porting; it could be MFCed.
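
A minimal sketch of the Plan B encoding, using the constant values from
the patch that follows. The helper names ifm_extended(), ifm_is_extended(),
and ifm_compat() are hypothetical and do not appear in the patch, which
uses the macros _X() and OMEDIA() instead. One assumption differs from the
posted macro: OMEDIA() as written evaluates to the bare value IFM_OTHER,
whereas ifm_compat() in this sketch keeps the type, option, and instance
bits and rewrites only the subtype field.

```c
#include <assert.h>

/* Values copied from the posted patch to sys/net/if_media.h. */
#define IFM_ETHER	0x00000020	/* media type: Ethernet */
#define IFM_40G_LR4	29		/* a basic (non-extended) subtype */
#define IFM_OTHER	31		/* basic subtype meaning "some extended type" */
#define XSUBTYPE	0x00080000	/* bit 19: extended-variant flag */
#define IFM_TMASK	0x0008001f	/* subtype mask, now including bit 19 */

/* Hypothetical helper (the patch spells this _X(var)): extended subtype n. */
static int
ifm_extended(int n)
{
	assert(n >= 0 && n <= 31);	/* only the 5 low bits carry the index */
	return (n | XSUBTYPE);
}

/* Hypothetical helper: does this media word carry an extended subtype? */
static int
ifm_is_extended(int media)
{
	return ((media & XSUBTYPE) != 0);
}

/*
 * Hypothetical helper: the view an old SIOCGIFMEDIA caller would get.
 * Assumption: only the subtype field is rewritten to IFM_OTHER; all
 * other bits of the media word are preserved.
 */
static int
ifm_compat(int media)
{
	if (!ifm_is_extended(media))
		return (media);
	return ((media & ~IFM_TMASK) | IFM_OTHER);
}
```

With these definitions, IFM_ETHER | ifm_extended(0) is 0x80020, and
ifm_compat() reduces it to IFM_ETHER | IFM_OTHER (0x3f); a basic word
such as IFM_ETHER | IFM_40G_LR4 passes through unchanged.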

		Mike

Index: sys/net/if_media.h
===================================================================
--- sys/net/if_media.h (revision 278804)
+++ sys/net/if_media.h (working copy)
@@ -120,15 +120,29 @@
  *	5-7	Media type
  *	8-15	Type specific options
  *	16-18	Mode (for multi-mode devices)
- *	19	RFU
+ *	19	"extended" bit for media variant
  *	20-27	Shared (global) options
  *	28-31	Instance
  */

 /*
+ * As we have used all of the original values for the media variant (subtype)
+ * for Ethernet, extended subtypes have been added, marked with XSUBTYPE,
+ * which is effectively the "high bit" of the media variant (subtype) field.
+ * IFM_OTHER (the highest basic type) is reserved to indicate use of an
+ * extended type when using an old SIOCGIFMEDIA operation. This is true
+ * for all media types, not just Ethernet.
+ */
+#define	XSUBTYPE	0x80000			/* extended variant high bit */
+#define	_X(var)		((var) | XSUBTYPE)	/* extended variant */
+#define	IFM_OTHER	31			/* Other: some extended type */
+#define	OMEDIA(var)	(((var) & XSUBTYPE) ? IFM_OTHER : (var))
+
+/*
  * Ethernet
  */
 #define	IFM_ETHER	0x00000020
+/* NB: 0,1,2 are auto, manual, none defined below */
 #define	IFM_10_T	3		/* 10BaseT - RJ45 */
 #define	IFM_10_2	4		/* 10Base2 - Thinnet */
 #define	IFM_10_5	5		/* 10Base5 - AUI */
@@ -156,11 +170,17 @@
 #define	IFM_40G_CR4	27		/* 40GBase-CR4 */
 #define	IFM_40G_SR4	28		/* 40GBase-SR4 */
 #define	IFM_40G_LR4	29		/* 40GBase-LR4 */
+#define	IFM_AVAIL30	30		/* available */
+/* #define	IFM_OTHER	31		Other: some extended type */
+/* note 31 is the max! */
+
+/* Extended variants/subtypes */
+#define	IFM_VFAST	_X(0)		/* test "V.fast" */
+/* note _X(31) is the max! */
 /*
  * Please update ieee8023ad_lacp.c:lacp_compose_key()
  * after adding new Ethernet media types.
  */
-/* note 31 is the max! */

 #define	IFM_ETH_MASTER	0x00000100	/* master mode (1000baseT) */
 #define	IFM_ETH_RXPAUSE	0x00000200	/* receive PAUSE frames */
@@ -170,6 +190,7 @@
  * Token ring
  */
 #define	IFM_TOKEN	0x00000040
+/* NB: 0,1,2 are auto, manual, none defined below */
 #define	IFM_TOK_STP4	3		/* Shielded twisted pair 4m - DB9 */
 #define	IFM_TOK_STP16	4		/* Shielded twisted pair 16m - DB9 */
 #define	IFM_TOK_UTP4	5		/* Unshielded twisted pair 4m - RJ45 */
@@ -187,6 +208,7 @@
  * FDDI
  */
 #define	IFM_FDDI	0x00000060
+/* NB: 0,1,2 are auto, manual, none defined below */
 #define	IFM_FDDI_SMF	3		/* Single-mode fiber */
 #define	IFM_FDDI_MMF	4		/* Multi-mode fiber */
 #define	IFM_FDDI_UTP	5		/* CDDI / UTP */
@@ -220,6 +242,7 @@
 #define	IFM_IEEE80211_OFDM27	23	/* OFDM 27Mbps */
 /* NB: not enough bits to express MCS fully */
 #define	IFM_IEEE80211_MCS	24	/* HT MCS rate */
+/* #define	IFM_OTHER	31		Other: some extended type */

 #define	IFM_IEEE80211_ADHOC	0x00000100	/* Operate in Adhoc mode */
 #define	IFM_IEEE80211_HOSTAP	0x00000200	/* Operate in Host AP mode */
@@ -241,6 +264,7 @@
  * ATM
  */
 #define	IFM_ATM		0x000000a0
+/* NB: 0,1,2 are auto, manual, none defined below */
 #define	IFM_ATM_UNKNOWN		3
 #define	IFM_ATM_UTP_25		4
 #define	IFM_ATM_TAXI_100	5
@@ -277,7 +301,7 @@
  * Masks
  */
 #define	IFM_NMASK	0x000000e0	/* Network type */
-#define	IFM_TMASK	0x0000001f	/* Media sub-type */
+#define	IFM_TMASK	0x0008001f	/* Media sub-type */
 #define	IFM_IMASK	0xf0000000	/* Instance */
 #define	IFM_ISHIFT	28		/* Instance shift */
 #define	IFM_OMASK	0x0000ff00	/* Type specific options */
@@ -372,6 +396,7 @@
 	{ IFM_40G_CR4,	"40Gbase-CR4" },	\
 	{ IFM_40G_SR4,	"40Gbase-SR4" },	\
 	{ IFM_40G_LR4,	"40Gbase-LR4" },	\
+	{ IFM_VFAST,	"V.fast" },		\
 	{ 0, NULL },				\
 }

@@ -603,6 +628,7 @@
 	{ IFM_AUTO,	"autoselect" },		\
 	{ IFM_MANUAL,	"manual" },		\
 	{ IFM_NONE,	"none" },		\
+	{ IFM_OTHER,	"other" },		\
 	{ 0, NULL },				\
 }

@@ -673,6 +699,7 @@
 	{ IFM_ETHER | IFM_40G_CR4,	IF_Gbps(40ULL) },	\
 	{ IFM_ETHER | IFM_40G_SR4,	IF_Gbps(40ULL) },	\
 	{ IFM_ETHER | IFM_40G_LR4,	IF_Gbps(40ULL) },	\
+	{ IFM_ETHER | IFM_VFAST,	IF_Gbps(40ULL) },	\
 								\
 	{ IFM_TOKEN | IFM_TOK_STP4,	IF_Mbps(4) },		\
 	{ IFM_TOKEN | IFM_TOK_STP16,	IF_Mbps(16) },		\
Index: sys/sys/sockio.h
===================================================================
--- sys/sys/sockio.h (revision 278810)
+++ sys/sys/sockio.h (working copy)
@@ -128,5 +128,6 @@
 #define	SIOCGIFGROUP	_IOWR('i', 136, struct ifgroupreq) /* get ifgroups */
 #define	SIOCDIFGROUP	_IOW('i', 137, struct ifgroupreq) /* delete ifgroup */
 #define	SIOCGIFGMEMB	_IOWR('i', 138, struct ifgroupreq) /* get members */
+#define	SIOCGIFXMEDIA	_IOWR('i', 139, struct ifmediareq) /* get net xmedia */

 #endif /* !_SYS_SOCKIO_H_ */
Index: sys/net/if.c
===================================================================
--- sys/net/if.c (revision 278749)
+++ sys/net/if.c (working copy)
@@ -2561,6 +2561,7 @@
 	case SIOCGIFPSRCADDR:
 	case SIOCGIFPDSTADDR:
 	case SIOCGIFMEDIA:
+	case SIOCGIFXMEDIA:
 	case SIOCGIFGENERIC:
 		if (ifp->if_ioctl == NULL)
 			return (EOPNOTSUPP);
Index: sys/net/if_media.c
===================================================================
--- sys/net/if_media.c (revision 278804)
+++ sys/net/if_media.c (working copy)
@@ -67,7 +67,9 @@
 static struct ifmedia_entry *ifmedia_match(struct ifmedia *ifm,
     int flags, int mask);

+#define IFMEDIA_DEBUG
 #ifdef IFMEDIA_DEBUG
+#include
 int	ifmedia_debug = 0;
 SYSCTL_INT(_debug, OID_AUTO, ifmedia, CTLFLAG_RW, &ifmedia_debug,
 	    0, "if_media debugging msgs");
@@ -271,6 +273,7 @@
 	 * Get list of available media and current media on interface.
 	 */
 	case SIOCGIFMEDIA:
+	case SIOCGIFXMEDIA:
 	{
 		struct ifmedia_entry *ep;
 		int *kptr, count;
@@ -278,8 +281,13 @@

 		kptr = NULL;		/* XXX gcc */

-		ifmr->ifm_active = ifmr->ifm_current = ifm->ifm_cur ?
-		    ifm->ifm_cur->ifm_media : IFM_NONE;
+		if (cmd == SIOCGIFMEDIA) {
+			ifmr->ifm_active = ifmr->ifm_current = ifm->ifm_cur ?
+			    OMEDIA(ifm->ifm_cur->ifm_media) : IFM_NONE;
+		} else {
+			ifmr->ifm_active = ifmr->ifm_current = ifm->ifm_cur ?
+			    ifm->ifm_cur->ifm_media : IFM_NONE;
+		}
 		ifmr->ifm_mask = ifm->ifm_mask;
 		ifmr->ifm_status = 0;
 		(*ifm->ifm_status)(ifp, ifmr);
@@ -317,7 +325,10 @@
 			ep = LIST_FIRST(&ifm->ifm_list);
 			for (; ep != NULL && count < ifmr->ifm_count;
 			    ep = LIST_NEXT(ep, ifm_list), count++)
-				kptr[count] = ep->ifm_media;
+				if (cmd == SIOCGIFMEDIA)
+					kptr[count] = OMEDIA(ep->ifm_media);
+				else
+					kptr[count] = ep->ifm_media;

 			if (ep != NULL)
 				error = E2BIG;	/* oops! */
@@ -505,7 +516,7 @@
 		printf("\n");
 		return;
 	}
-	printf(desc->ifmt_string);
+	printf("%s", desc->ifmt_string);

 	/* Any mode. */
 	for (desc = ttos->modes; desc && desc->ifmt_string != NULL; desc++)
Index: sys/dev/virtio/network/if_vtnet.c
===================================================================
--- sys/dev/virtio/network/if_vtnet.c (revision 278749)
+++ sys/dev/virtio/network/if_vtnet.c (working copy)
@@ -938,6 +938,7 @@
 	ifmedia_init(&sc->vtnet_media, IFM_IMASK, vtnet_ifmedia_upd,
 	    vtnet_ifmedia_sts);
 	ifmedia_add(&sc->vtnet_media, VTNET_MEDIATYPE, 0, NULL);
+	ifmedia_add(&sc->vtnet_media, IFM_ETHER | IFM_VFAST, 0, NULL);
 	ifmedia_set(&sc->vtnet_media, VTNET_MEDIATYPE);

 	/* Read (or generate) the MAC address for the adapter. */
@@ -1103,6 +1104,7 @@

 	case SIOCSIFMEDIA:
 	case SIOCGIFMEDIA:
+	case SIOCGIFXMEDIA:
 		error = ifmedia_ioctl(ifp, ifr, &sc->vtnet_media, cmd);
 		break;

Index: sbin/ifconfig/ifmedia.c
===================================================================
--- sbin/ifconfig/ifmedia.c (revision 278749)
+++ sbin/ifconfig/ifmedia.c (working copy)
@@ -109,11 +109,17 @@
 {
 	struct ifmediareq ifmr;
 	int *media_list, i;
+	int xmedia = 1;

 	(void) memset(&ifmr, 0, sizeof(ifmr));
 	(void) strncpy(ifmr.ifm_name, name, sizeof(ifmr.ifm_name));

-	if (ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0) {
+	/*
+	 * Check if interface supports extended media types.
+	 */
+	if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)&ifmr) < 0)
+		xmedia = 0;
+	if (xmedia == 0 && ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0) {
 		/*
 		 * Interface doesn't support SIOC{G,S}IFMEDIA.
 		 */
@@ -130,8 +136,13 @@
 		err(1, "malloc");
 	ifmr.ifm_ulist = media_list;

-	if (ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0)
-		err(1, "SIOCGIFMEDIA");
+	if (xmedia) {
+		if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)&ifmr) < 0)
+			err(1, "SIOCGIFXMEDIA");
+	} else {
+		if (ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0)
+			err(1, "SIOCGIFMEDIA");
+	}

 	printf("\tmedia: ");
 	print_media_word(ifmr.ifm_current, 1);
@@ -194,6 +205,7 @@
 {
 	static struct ifmediareq *ifmr = NULL;
 	int *mwords;
+	int xmedia = 1;

 	if (ifmr == NULL) {
 		ifmr = (struct ifmediareq *)malloc(sizeof(struct ifmediareq));
@@ -213,7 +225,10 @@
 		 * the current media type and the top-level type.
 		 */

-		if (ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0) {
+		if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)ifmr) < 0) {
+			xmedia = 0;
+		}
+		if (xmedia == 0 && ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0) {
 			err(1, "SIOCGIFMEDIA");
 		}

@@ -225,8 +240,13 @@
 			err(1, "malloc");

 		ifmr->ifm_ulist = mwords;
-		if (ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0)
-			err(1, "SIOCGIFMEDIA");
+		if (xmedia) {
+			if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)ifmr) < 0)
+				err(1, "SIOCGIFXMEDIA");
+		} else {
+			if (ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0)
+				err(1, "SIOCGIFMEDIA");
+		}
 	}

 	return ifmr;

From owner-freebsd-arch@FreeBSD.ORG Tue Feb 17 07:17:11 2015
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by hub.freebsd.org (Postfix) with ESMTPS id E6135DB8;
	Tue, 17 Feb 2015 07:17:11 +0000 (UTC)
Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "vps1.elischer.org", Issuer "CA Cert Signing Authority" (not verified))
	by mx1.freebsd.org (Postfix) with ESMTPS id B75EF6C3;
	Tue, 17 Feb 2015 07:17:10 +0000 (UTC)
Received: from julian-mbp3.pixel8networks.com (50-196-156-133-static.hfc.comcastbusiness.net [50.196.156.133])
	(authenticated bits=0)
	by vps1.elischer.org (8.14.9/8.14.9) with ESMTP id t1H7H9K2003756
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO);
	Mon, 16 Feb 2015 23:17:09 -0800 (PST)
	(envelope-from julian@freebsd.org)
Message-ID: <54E2EAEF.5050201@freebsd.org>
Date: Mon, 16 Feb 2015 23:17:03 -0800
From: Julian Elischer
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: mike@karels.net, George Neville-Neil
Subject: Re: Adding new media types to if_media.h
References: <201502170150.t1H1ouxM020621@mail.karels.net>
In-Reply-To: <201502170150.t1H1ouxM020621@mail.karels.net>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "freebsd-net@freebsd.org", "freebsd-arch@freebsd.org"
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 Feb 2015 07:17:12 -0000

On 2/16/15 5:50 PM, Mike Karels wrote:
> Well, I developed the prototype as I had planned, using a 64-bit
> media word, and found that I got about 100 files in GENERIC that
> didn't compile; [...]
> so I went to Plan B. Plan B is to steal an unused bit (RFU) to
> indicate an "extended" media type. I then used the variant/subtype
> field to store the extended type. [...]
> I modified one driver (vtnet) to use an extended type, and the rest
> of GENERIC is happy. The changes to ifconfig are also fairly small.
> The patch is appended, where email programs will screw it up, or at
> ftp://ftp.karels.net/outgoing/if_media.patch. The VFAST subtype is a
> throw-away for testing. This seems like a reasonably pragmatic
> change to support the new 40 Gb/s media types until someone wants to
> design an improved but non-backward-compatible interface. I think
> it meets the goal of suitability for back-porting; it could be
> MFCed. Mike Index: sys/net/if_media.h

I like it. The patch seems appropriately manageable.

From owner-freebsd-arch@FreeBSD.ORG Tue Feb 17 07:32:37 2015
Return-Path:
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by hub.freebsd.org (Postfix) with ESMTPS id 2B31EF90;
	Tue, 17 Feb 2015 07:32:37 +0000 (UTC)
Received: from mail-wg0-x230.google.com (mail-wg0-x230.google.com [IPv6:2a00:1450:400c:c00::230])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 7E4C38A9;
	Tue, 17 Feb 2015 07:32:36 +0000 (UTC)
Received: by mail-wg0-f48.google.com with SMTP id l18so30390752wgh.7;
	Mon, 16 Feb 2015 23:32:34 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type;
	bh=34WzF5RUBamMbdaJIWuyO+Wy32LXuXeKKCKXIrV9oAA=;
	b=cighN67Ic2UYhwiTfNnr2KE3sD2VqDoPe8GxZHF5K/oI3npFyKBwLVL2X7r35MFY+2
	x6rk9JvZ0DIubrjjS01WBzW6BihzmJchXhfB3mnIyLjScKEtfaq7Wk9fLgZjZq3CSsos
	oPg0iIdTGbL2LOgzzhThtfonRUp7uz4Je0rs2fRNfZaiv/eDvluOd+nfdAgCPrd2W24L
	cn+WF+ggpQN0MzsDsRCDaelpaPNGoXlN+FmGTymBrn3TFbTHSSfjcEV6idsu+BG8CPNz
	q4Hug307dJJwje0bOXWHRsY+vkITuG5AbPGSnPhk4gEupg9+cnlwCCvhuy3RlFdyaWwg
	pxvg==
MIME-Version: 1.0
X-Received: by 10.180.73.205 with SMTP id n13mr22583500wiv.64.1424158354832;
	Mon, 16 Feb 2015 23:32:34 -0800 (PST)
Received: by 10.194.101.106 with HTTP; Mon, 16 Feb 2015 23:32:34 -0800 (PST)
In-Reply-To: <201502170150.t1H1ouxM020621@mail.karels.net>
References: <201502170150.t1H1ouxM020621@mail.karels.net>
Date: Mon, 16 Feb 2015 23:32:34 -0800
Message-ID:
Subject: Re: Adding new media types to if_media.h
From: Jack Vogel
To: Mike Karels
Content-Type: text/plain; charset=ISO-8859-1
X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1
Cc: "freebsd-net@freebsd.org", George Neville-Neil, "freebsd-arch@freebsd.org"
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussion related to FreeBSD architecture
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 Feb 2015 07:32:37 -0000

Nice Mike, I like it also.

Jack

On Mon, Feb 16, 2015 at 5:50 PM, Mike Karels wrote:

> On Feb 9, gnn wrote:
>
> > On 8 Feb 2015, at 22:41, Mike Karels wrote:
> > > Sorry to reply to a thread after such a long delay, but I think it is
> > > unresolved, and needs more discussion. I'd like to elaborate a bit on
> > > my goals and proposal. I believe Adrian has newer thoughts than have
> > > been circulated here as well.
> > >
> > > The last message(s) have gone to freebsd-arch and freebsd-net. If
> > > someone wants to pick one, we could consolidate, but this seems
> > > relevant to both.
> > >
> > > I'm going to top-post to try to summarize and extend the discussion,
> > > but the preceding emails follow for reference.
> > >
> > > To recap: the existing if_media interface is running out of steam, at
> > > least in that the "Media variant" field, with 5 bits, is going to be
> > > insufficient to express existing 40 Gb/s variants. The if_media media
> > > type is a 32-bit int with a bunch of sub-fields for type (e.g.
> > > Ethernet), subtype/variant (e.g. 10baseT, 10base5, 1000baseT, etc),
> > > flags, and some MII-related fields.
> > >
> > > I made a proposal to extend the interface in a small way, specifically
> > > to replace the "media word" with a 64-bit int that is mostly the same,
> > > but has a new, larger variant/subtype field.
> > > The main reason for this proposal is to maintain the driver KPI
> > > (glimpse showed me 240 inclusions of if_media.h in the kernel in
> > > 8.2). That interface includes an initialization using a scalar value
> > > of fields ORed with each other. It would also be easy to preserve a
> > > 32-bit user-level API/ABI that can express most of the current state,
> > > with a subtype/variant field value reserved for "other" (there is
> > > already one for "unknown", but that is not quite the same). fwiw, I
> > > found 45 references to this user-level API in our tree, including
> > > both base and "ports"-type software, which includes libpcap, snmpd,
> > > dhclient, quagga, xorp, atm, devd, and rtsold, which argues for a
> > > backward-compatible API/ABI as well as a more-complete current
> > > interface for ifconfig at least.
> > >
> > > More generally, I see two problems with the existing if_media
> > > interface:
> > >
> > > 1. It doesn't have enough bits for all the fields, in particular,
> > > variant/subtype for Ethernet. That is the immediate issue.
> > >
> > > 2. The interface is not sufficiently generic; it was designed around
> > > Ethernet including MII, token ring, FDDI, and a few other interface
> > > types. Some of the fields like "instance" are primarily for MII as
> > > far as I know, and are basically unused. It is definitely not
> > > sufficient for 802.11, which has rolled its own interfaces.
> > >
> > > To solve the second problem, I think the right approach would be to
> > > reduce this interface to a truly generic one, such as media type
> > > (e.g. Ethernet), generic flags, and perhaps generic status. Then
> > > there should be a separate media-specific interface for each type,
> > > such as Ethernet and 802.11. To a small extent, we already have that.
> > > Solving the second, more general problem, requires a whole new driver
> > > KPI that will require surgery to every driver, which is not an
> > > exercise that I would consider.
> > >
> > > Using a separate int for each existing field, as proposed, would break
> > > the driver KPI, but would not really make the interface generic.
> > > Trying to make a single interface with the union of all network
> > > interface requirements seems like a bad idea to me (we failed last
> > > time; the "we" is BSDi, where I was the architect when this interface
> > > was first designed). (No, I didn't design this interface.)
> > >
> > > Solving the first problem only, I think it is preferable to preserve a
> > > compatible driver KPI, which means using a scalar value encoding what
> > > is necessary. Although that interface is rather Ethernet-centric, that
> > > is really what it is used for.
> > >
> > > An additional, selfish goal is to make it easy to back-port drivers
> > > using the new interface to older versions (which I am quite likely to
> > > do). Preserving the KPI and general user API will be highly useful
> > > there. I'd be likely to do a 11-style version of ifconfig personally,
> > > but it might not be difficult to do in a more general way.
> > >
> > > I am willing to do a prototype for -current for evaluation.
> > >
> > > Comments, alternatives, ?
>
> > I agree with your statements above and I'd like to see the prototype.
>
> Well, I developed the prototype as I had planned, using a 64-bit media
> word, and found that I got about 100 files in GENERIC that didn't compile;
> they attempted to store "media words" in an int. My kingdom for a typedef.
> That didn't meet my goal of KPI compatibility, so I went to Plan B.
>
> Plan B is to steal an unused bit (RFU) to indicate an "extended" media
> type. I then used the variant/subtype field to store the extended type.
> Effectively, the previously unused bit doubles the effective size of the > subtype field. Given that the previous 5-bit field lasted us 18 years, > I figured that doubling it would last a while. I also changed the > SIOGGIFMEDIA ioctl, splitting it for binary compatibility; extended > types are all mapped to IFM_OTHER (31) using the old interface, but > are visible using the new one. > > With these changes, I modified one driver (vtnet) to use an extended type, > and the rest of GENERIC is happy. The changes to ifconfig are also fairly > small. The patch is appended, where email programs will screw it up, > or at ftp://ftp.karels.net/outgoing/if_media.patch. > > The VFAST subtype is a throw-away for testing. > > This seems like a reasonably pragmatic change to support the new 40 Gb/s > media types until someone wants to design an improved but non-backward- > compatible interface. I think it meets the goal of suitability for > back-porting; it could be MFCed. > > Mike > > Index: sys/net/if_media.h > =================================================================== > --- sys/net/if_media.h (revision 278804) > +++ sys/net/if_media.h (working copy) > @@ -120,15 +120,29 @@ > * 5-7 Media type > * 8-15 Type specific options > * 16-18 Mode (for multi-mode devices) > - * 19 RFU > + * 19 "extended" bit for media variant > * 20-27 Shared (global) options > * 28-31 Instance > */ > > /* > + * As we have used all of the original values for the media variant > (subtype) > + * for Ethernet, extended subtypes have been added, marked with XSUBTYPE, > + * which is effectively the "high bit" of the media variant (subtype) > field. > + * IFM_OTHER (the highest basic type) is reserved to indicate use of an > + * extended type when using an old SIOCGIFMEDIA operation. This is true > + * for all media types, not just Ethernet. 
> + */ > +#define XSUBTYPE 0x80000 /* extended variant high > bit */ > +#define _X(var) ((var) | XSUBTYPE) /* extended > variant */ > +#define IFM_OTHER 31 /* Other: some > extended type */ > +#define OMEDIA(var) (((var) & XSUBTYPE) ? IFM_OTHER : (var)) > + > +/* > * Ethernet > */ > #define IFM_ETHER 0x00000020 > +/* NB: 0,1,2 are auto, manual, none defined below */ > #define IFM_10_T 3 /* 10BaseT - RJ45 */ > #define IFM_10_2 4 /* 10Base2 - Thinnet */ > #define IFM_10_5 5 /* 10Base5 - AUI */ > @@ -156,11 +170,17 @@ > #define IFM_40G_CR4 27 /* 40GBase-CR4 */ > #define IFM_40G_SR4 28 /* 40GBase-SR4 */ > #define IFM_40G_LR4 29 /* 40GBase-LR4 */ > +#define IFM_AVAIL30 30 /* available */ > +/* #define IFM_OTHER 31 Other: some extended type */ > +/* note 31 is the max! */ > + > +/* Extended variants/subtypes */ > +#define IFM_VFAST _X(0) /* test "V.fast" */ > +/* note _X(31) is the max! */ > /* > * Please update ieee8023ad_lacp.c:lacp_compose_key() > * after adding new Ethernet media types. > */ > -/* note 31 is the max! 
*/ > > #define IFM_ETH_MASTER 0x00000100 /* master mode (1000baseT) > */ > #define IFM_ETH_RXPAUSE 0x00000200 /* receive PAUSE frames */ > @@ -170,6 +190,7 @@ > * Token ring > */ > #define IFM_TOKEN 0x00000040 > +/* NB: 0,1,2 are auto, manual, none defined below */ > #define IFM_TOK_STP4 3 /* Shielded twisted pair > 4m - DB9 */ > #define IFM_TOK_STP16 4 /* Shielded twisted pair > 16m - DB9 */ > #define IFM_TOK_UTP4 5 /* Unshielded twisted pair > 4m - RJ45 */ > @@ -187,6 +208,7 @@ > * FDDI > */ > #define IFM_FDDI 0x00000060 > +/* NB: 0,1,2 are auto, manual, none defined below */ > #define IFM_FDDI_SMF 3 /* Single-mode fiber */ > #define IFM_FDDI_MMF 4 /* Multi-mode fiber */ > #define IFM_FDDI_UTP 5 /* CDDI / UTP */ > @@ -220,6 +242,7 @@ > #define IFM_IEEE80211_OFDM27 23 /* OFDM 27Mbps */ > /* NB: not enough bits to express MCS fully */ > #define IFM_IEEE80211_MCS 24 /* HT MCS rate */ > +/* #define IFM_OTHER 31 Other: some extended type */ > > #define IFM_IEEE80211_ADHOC 0x00000100 /* Operate in > Adhoc mode */ > #define IFM_IEEE80211_HOSTAP 0x00000200 /* Operate in Host > AP mode */ > @@ -241,6 +264,7 @@ > * ATM > */ > #define IFM_ATM 0x000000a0 > +/* NB: 0,1,2 are auto, manual, none defined below */ > #define IFM_ATM_UNKNOWN 3 > #define IFM_ATM_UTP_25 4 > #define IFM_ATM_TAXI_100 5 > @@ -277,7 +301,7 @@ > * Masks > */ > #define IFM_NMASK 0x000000e0 /* Network type */ > -#define IFM_TMASK 0x0000001f /* Media sub-type */ > +#define IFM_TMASK 0x0008001f /* Media sub-type */ > #define IFM_IMASK 0xf0000000 /* Instance */ > #define IFM_ISHIFT 28 /* Instance shift */ > #define IFM_OMASK 0x0000ff00 /* Type specific options */ > @@ -372,6 +396,7 @@ > { IFM_40G_CR4, "40Gbase-CR4" }, \ > { IFM_40G_SR4, "40Gbase-SR4" }, \ > { IFM_40G_LR4, "40Gbase-LR4" }, \ > + { IFM_VFAST, "V.fast" }, \ > { 0, NULL }, \ > } > > @@ -603,6 +628,7 @@ > { IFM_AUTO, "autoselect" }, \ > { IFM_MANUAL, "manual" }, \ > { IFM_NONE, "none" }, \ > + { IFM_OTHER, "other" }, \ > { 0, NULL }, \ > } > > @@ 
-673,6 +699,7 @@ > { IFM_ETHER | IFM_40G_CR4, IF_Gbps(40ULL) }, \ > { IFM_ETHER | IFM_40G_SR4, IF_Gbps(40ULL) }, \ > { IFM_ETHER | IFM_40G_LR4, IF_Gbps(40ULL) }, \ > + { IFM_ETHER | IFM_VFAST, IF_Gbps(40ULL) }, \ > \ > { IFM_TOKEN | IFM_TOK_STP4, IF_Mbps(4) }, \ > { IFM_TOKEN | IFM_TOK_STP16, IF_Mbps(16) }, \ > Index: sys/sys/sockio.h > =================================================================== > --- sys/sys/sockio.h (revision 278810) > +++ sys/sys/sockio.h (working copy) > @@ -128,5 +128,6 @@ > #define SIOCGIFGROUP _IOWR('i', 136, struct ifgroupreq) /* get ifgroups */ > #define SIOCDIFGROUP _IOW('i', 137, struct ifgroupreq) /* delete ifgroup */ > #define SIOCGIFGMEMB _IOWR('i', 138, struct ifgroupreq) /* get members */ > +#define SIOCGIFXMEDIA _IOWR('i', 139, struct ifmediareq) /* get net xmedia */ > > #endif /* !_SYS_SOCKIO_H_ */ > Index: sys/net/if.c > =================================================================== > --- sys/net/if.c (revision 278749) > +++ sys/net/if.c (working copy) > @@ -2561,6 +2561,7 @@ > case SIOCGIFPSRCADDR: > case SIOCGIFPDSTADDR: > case SIOCGIFMEDIA: > + case SIOCGIFXMEDIA: > case SIOCGIFGENERIC: > if (ifp->if_ioctl == NULL) > return (EOPNOTSUPP); > Index: sys/net/if_media.c > =================================================================== > --- sys/net/if_media.c (revision 278804) > +++ sys/net/if_media.c (working copy) > @@ -67,7 +67,9 @@ > static struct ifmedia_entry *ifmedia_match(struct ifmedia *ifm, > int flags, int mask); > > +#define IFMEDIA_DEBUG > #ifdef IFMEDIA_DEBUG > +#include > int ifmedia_debug = 0; > SYSCTL_INT(_debug, OID_AUTO, ifmedia, CTLFLAG_RW, &ifmedia_debug, > 0, "if_media debugging msgs"); > @@ -271,6 +273,7 @@ > * Get list of available media and current media on interface. > */ > case SIOCGIFMEDIA: > + case SIOCGIFXMEDIA: > { > struct ifmedia_entry *ep; > int *kptr, count; > @@ -278,8 +281,13 @@ > > kptr = NULL; /* XXX gcc */ > > - ifmr->ifm_active = ifmr->ifm_current = ifm->ifm_cur ?
> - ifm->ifm_cur->ifm_media : IFM_NONE; > + if (cmd == SIOCGIFMEDIA) { > + ifmr->ifm_active = ifmr->ifm_current = ifm->ifm_cur ? > + OMEDIA(ifm->ifm_cur->ifm_media) : IFM_NONE; > + } else { > + ifmr->ifm_active = ifmr->ifm_current = ifm->ifm_cur ? > + ifm->ifm_cur->ifm_media : IFM_NONE; > + } > ifmr->ifm_mask = ifm->ifm_mask; > ifmr->ifm_status = 0; > (*ifm->ifm_status)(ifp, ifmr); > @@ -317,7 +325,10 @@ > ep = LIST_FIRST(&ifm->ifm_list); > for (; ep != NULL && count < ifmr->ifm_count; > ep = LIST_NEXT(ep, ifm_list), count++) > - kptr[count] = ep->ifm_media; > + if (cmd == SIOCGIFMEDIA) > + kptr[count] = OMEDIA(ep->ifm_media); > + else > + kptr[count] = ep->ifm_media; > > if (ep != NULL) > error = E2BIG; /* oops! */ > @@ -505,7 +516,7 @@ > printf("\n"); > return; > } > - printf(desc->ifmt_string); > + printf("%s", desc->ifmt_string); > > /* Any mode. */ > for (desc = ttos->modes; desc && desc->ifmt_string != NULL; desc++) > > Index: sys/dev/virtio/network/if_vtnet.c > =================================================================== > --- sys/dev/virtio/network/if_vtnet.c (revision 278749) > +++ sys/dev/virtio/network/if_vtnet.c (working copy) > @@ -938,6 +938,7 @@ > ifmedia_init(&sc->vtnet_media, IFM_IMASK, vtnet_ifmedia_upd, > vtnet_ifmedia_sts); > ifmedia_add(&sc->vtnet_media, VTNET_MEDIATYPE, 0, NULL); > + ifmedia_add(&sc->vtnet_media, IFM_ETHER | IFM_VFAST, 0, NULL); > ifmedia_set(&sc->vtnet_media, VTNET_MEDIATYPE); > > /* Read (or generate) the MAC address for the adapter.
*/ > @@ -1103,6 +1104,7 @@ > > case SIOCSIFMEDIA: > case SIOCGIFMEDIA: > + case SIOCGIFXMEDIA: > error = ifmedia_ioctl(ifp, ifr, &sc->vtnet_media, cmd); > break; > Index: sbin/ifconfig/ifmedia.c > =================================================================== > --- sbin/ifconfig/ifmedia.c (revision 278749) > +++ sbin/ifconfig/ifmedia.c (working copy) > @@ -109,11 +109,17 @@ > { > struct ifmediareq ifmr; > int *media_list, i; > + int xmedia = 1; > > (void) memset(&ifmr, 0, sizeof(ifmr)); > (void) strncpy(ifmr.ifm_name, name, sizeof(ifmr.ifm_name)); > > - if (ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0) { > + /* > + * Check if interface supports extended media types. > + */ > + if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)&ifmr) < 0) > + xmedia = 0; > + if (xmedia == 0 && ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0) { > /* > * Interface doesn't support SIOC{G,S}IFMEDIA. > */ > @@ -130,8 +136,13 @@ > err(1, "malloc"); > ifmr.ifm_ulist = media_list; > > - if (ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0) > - err(1, "SIOCGIFMEDIA"); > + if (xmedia) { > + if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)&ifmr) < 0) > + err(1, "SIOCGIFXMEDIA"); > + } else { > + if (ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0) > + err(1, "SIOCGIFMEDIA"); > + } > > printf("\tmedia: "); > print_media_word(ifmr.ifm_current, 1); > @@ -194,6 +205,7 @@ > { > static struct ifmediareq *ifmr = NULL; > int *mwords; > + int xmedia = 1; > > if (ifmr == NULL) { > ifmr = (struct ifmediareq *)malloc(sizeof(struct ifmediareq)); > @@ -213,7 +225,10 @@ > * the current media type and the top-level type.
> */ > > - if (ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0) { > + if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)ifmr) < 0) { > + xmedia = 0; > + } > + if (xmedia == 0 && ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0) { > err(1, "SIOCGIFMEDIA"); > } > > @@ -225,8 +240,13 @@ > err(1, "malloc"); > > ifmr->ifm_ulist = mwords; > - if (ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0) > - err(1, "SIOCGIFMEDIA"); > + if (xmedia) { > + if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)ifmr) < 0) > + err(1, "SIOCGIFXMEDIA"); > + } else { > + if (ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0) > + err(1, "SIOCGIFMEDIA"); > + } > } > > return ifmr; > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Tue Feb 17 17:17:16 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 82CCB37B; Tue, 17 Feb 2015 17:17:16 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 5961422B; Tue, 17 Feb 2015 17:17:16 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 742CDB91F; Tue, 17 Feb 2015 12:17:15 -0500 (EST) From: John Baldwin To: freebsd-arch@freebsd.org, mike@karels.net Subject: Re: Adding new media types to if_media.h Date: Tue, 17 Feb 2015 10:44:21 -0500 User-Agent: KMail/1.13.5 (FreeBSD/8.4-CBSD-20140415; KDE/4.5.5; amd64; ; ) References: <201502170150.t1H1ouxM020621@mail.karels.net> In-Reply-To: <201502170150.t1H1ouxM020621@mail.karels.net> MIME-Version: 1.0 Content-Type:
Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201502171044.21319.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 17 Feb 2015 12:17:15 -0500 (EST) Cc: "freebsd-net@freebsd.org" , George Neville-Neil X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Feb 2015 17:17:16 -0000 On Monday, February 16, 2015 8:50:56 pm Mike Karels wrote: > On Feb 9, gnn wrote: > > > On 8 Feb 2015, at 22:41, Mike Karels wrote: > > > > Sorry to reply to a thread after such a long delay, but I think it is > > > unresolved, and needs more discussion. I'd like to elaborate a bit on > > > my goals and proposal. I believe Adrian has newer thoughts than have > > > been > > > circulated here as well. > > > > > > The last message(s) have gone to freebsd-arch and freebsd-net. If > > > someone > > > wants to pick one, we could consolidate, but this seems relevant to > > > both. > > > > > > I'm going to top-post to try to summarize and extend the discussion, > > > but the > > > preceding emails follow for reference. > > > > > > To recap: the existing if_media interface is running out of steam, at > > > least > > > in that the "Media variant" field, with 5 bits, is going to be > > > insufficient > > > to express existing 40 Gb/s variants. The if_media media type is a > > > 32-bit > > > int with a bunch of sub-fields for type (e.g. Ethernet), > > > subtype/variant > > > (e.g. 10baseT, 10base5, 1000baseT, etc), flags, and some MII-related > > > fields. > > > > > > I made a proposal to extend the interface in a small way, specifically > > > to > > > replace the "media word" with a 64-bit int that is mostly the same, > > > but > > > has a new, larger variant/subtype field. 
The main reason for this > > > proposal > > > is to maintain the driver KPI (glimpse showed me 240 inclusions of > > > if_media.h > > > in the kernel in 8.2). That interface includes an initialization > > > using a > > > scalar value of fields ORed with each other. It would also be easy to > > > preserve a 32-bit user-level API/ABI that can express most of the > > > current > > > state, with a subtype/variant field value reserved for "other" (there > > > is > > > already one for "unknown", but that is not quite the same). fwiw, I > > > found 45 references to this user-level API in our tree, including both > > > base and "ports"-type software, which includes libpcap, snmpd, > > > dhclient, > > > quagga, xorp, atm, devd, and rtsold, which argues for a > > > backward-compatible > > > API/ABI as well as a more-complete current interface for ifconfig at > > > least. > > > > > > More generally, I see two problems with the existing if_media > > > interface: > > > > > > 1. It doesn't have enough bits for all the fields, in particular, > > > variant/ > > > subtype for Ethernet. That is the immediate issue. > > > > > > 2. The interface is not sufficiently generic; it was designed around > > > Ethernet > > > including MII, token ring, FDDI, and a few other interface types. > > > Some of > > > the fields like "instance" are primarily for MII as far as I know, and > > > are > > > basically unused. It is definitely not sufficient for 802.11, which > > > has > > > rolled its own interfaces. > > > > > > To solve the second problem, I think the right approach would be to > > > reduce > > > this interface to a truly generic one, such as media type (e.g. > > > Ethernet), > > > generic flags, and perhaps generic status. Then there should be a > > > separate > > > media-specific interface for each type, such as Ethernet and 802.11. > > > To a > > > small extent, we already have that. 
Solving the second, more general > > > problem, > > > requires a whole new driver KPI that will require surgery to every > > > driver, > > > which is not an exercise that I would consider. > > > > > > Using a separate int for each existing field, as proposed, would break > > > the > > > driver KPI, but would not really make the interface generic. Trying > > > to > > > make a single interface with the union of all network interface > > > requirements > > > seems like a bad idea to me (we failed last time; the "we" is BSDi, > > > where > > > I was the architect when this interface was first designed). (No, I > > > didn't > > > design this interface.) > > > > > > Solving the first problem only, I think it is preferable to preserve a > > > compatible driver KPI, which means using a scalar value encoding what > > > is > > > necessary. Although that interface is rather Ethernet-centric, that > > > is > > > really what it is used for. > > > > > > An additional, selfish goal is to make it easy to back-port drivers > > > using > > > the new interface to older versions (which I am quite likely to do). > > > Preserving the KPI and general user API will be highly useful there. > > > I'd be likely to do a 11-style version of ifconfig personally, but it > > > might not be difficult to do in a more general way. > > > > > > I am willing to do a prototype for -current for evaluation. > > > > > > Comments, alternatives, ? > > > I agree with your statements above and I'd like to see the prototype. > > Well, I developed the prototype as I had planned, using a 64-bit media > word, and found that I got about 100 files in GENERIC that didn't compile; > they attempted to store "media words" in an int. My kingdom for a typedef. > That didn't meet my goal of KPI compatibility, so I went to Plan B. > > Plan B is to steal an unused bit (RFU) to indicate an "extended" media > type. I then used the variant/subtype field to store the extended type. 
> Effectively, the previously unused bit doubles the effective size of the > subtype field. Given that the previous 5-bit field lasted us 18 years, > I figured that doubling it would last a while. I also changed the > SIOCGIFMEDIA ioctl, splitting it for binary compatibility; extended > types are all mapped to IFM_OTHER (31) using the old interface, but > are visible using the new one. > > With these changes, I modified one driver (vtnet) to use an extended type, > and the rest of GENERIC is happy. The changes to ifconfig are also fairly > small. The patch is appended, where email programs will screw it up, > or at ftp://ftp.karels.net/outgoing/if_media.patch. > > The VFAST subtype is a throw-away for testing. > > This seems like a reasonably pragmatic change to support the new 40 Gb/s > media types until someone wants to design an improved but non-backward- > compatible interface. I think it meets the goal of suitability for > back-porting; it could be MFCed. Seems like a reasonable next step to me.
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Tue Feb 17 17:26:34 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id BB156626; Tue, 17 Feb 2015 17:26:34 +0000 (UTC) Received: from mail-ie0-f181.google.com (mail-ie0-f181.google.com [209.85.223.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 87E05337; Tue, 17 Feb 2015 17:26:34 +0000 (UTC) Received: by iecar1 with SMTP id ar1so42527763iec.11; Tue, 17 Feb 2015 09:26:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=4Z0iZzhq7+jYIAMhFOUrSE6McX7/QGXYbQmkB9g1wJo=; b=HcCTU3hwb0tJ2MLmAqw5u/bFJWzOBDOZ9/jk/zWdmZhjmdwjQwr9tVL84znTNMvFiK wgn3p+I0q0ZgeN+fCckqkxkyrVQHm0GeSxkX9YrZnqTOQ7EStzqifyOa5xs4wBL+LhK3 xI86D688HKQinSodoQ5exVrS+eSrjRa2dBO9KTU+mtH4MQnnEPYtNoWwrb9gJ131TMk3 Ja7NkuQsY+5Irx5g0wXIw+Vvz4D4Hz+ywfvLr3j5YCMqOsq+eCKmNr9GBctGI3q8m9Y7 d+hQuFPf4NazGQESGuWEZ8nu8SkCYyhOH5KRnsc+C5bCLAdNdweH0HXAdFqJmBQats6F /GQA== MIME-Version: 1.0 X-Received: by 10.107.31.16 with SMTP id f16mr36729262iof.88.1424193993379; Tue, 17 Feb 2015 09:26:33 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.36.17.66 with HTTP; Tue, 17 Feb 2015 09:26:33 -0800 (PST) In-Reply-To: <201502170150.t1H1ouxM020621@mail.karels.net> References: <201502170150.t1H1ouxM020621@mail.karels.net> Date: Tue, 17 Feb 2015 09:26:33 -0800 X-Google-Sender-Auth: bHYrDhSSF6iSYhcho5QOaxKGc6Q Message-ID: Subject: Re: Adding new media types to if_media.h From: Adrian Chadd To: mike@karels.net Content-Type: text/plain; charset=UTF-8 Cc: "freebsd-net@freebsd.org" , George 
Neville-Neil , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 Feb 2015 17:26:34 -0000 Looks good to me. Thanks for doing this! -a On 16 February 2015 at 17:50, Mike Karels wrote: > On Feb 9, gnn wrote: > >> On 8 Feb 2015, at 22:41, Mike Karels wrote: > >> > Sorry to reply to a thread after such a long delay, but I think it is >> > unresolved, and needs more discussion. I'd like to elaborate a bit on >> > my goals and proposal. I believe Adrian has newer thoughts than have >> > been >> > circulated here as well. >> > >> > The last message(s) have gone to freebsd-arch and freebsd-net. If >> > someone >> > wants to pick one, we could consolidate, but this seems relevant to >> > both. >> > >> > I'm going to top-post to try to summarize and extend the discussion, >> > but the >> > preceding emails follow for reference. >> > >> > To recap: the existing if_media interface is running out of steam, at >> > least >> > in that the "Media variant" field, with 5 bits, is going to be >> > insufficient >> > to express existing 40 Gb/s variants. The if_media media type is a >> > 32-bit >> > int with a bunch of sub-fields for type (e.g. Ethernet), >> > subtype/variant >> > (e.g. 10baseT, 10base5, 1000baseT, etc), flags, and some MII-related >> > fields. >> > >> > I made a proposal to extend the interface in a small way, specifically >> > to >> > replace the "media word" with a 64-bit int that is mostly the same, >> > but >> > has a new, larger variant/subtype field. The main reason for this >> > proposal >> > is to maintain the driver KPI (glimpse showed me 240 inclusions of >> > if_media.h >> > in the kernel in 8.2). That interface includes an initialization >> > using a >> > scalar value of fields ORed with each other. 
It would also be easy to >> > preserve a 32-bit user-level API/ABI that can express most of the >> > current >> > state, with a subtype/variant field value reserved for "other" (there >> > is >> > already one for "unknown", but that is not quite the same). fwiw, I >> > found 45 references to this user-level API in our tree, including both >> > base and "ports"-type software, which includes libpcap, snmpd, >> > dhclient, >> > quagga, xorp, atm, devd, and rtsold, which argues for a >> > backward-compatible >> > API/ABI as well as a more-complete current interface for ifconfig at >> > least. >> > >> > More generally, I see two problems with the existing if_media >> > interface: >> > >> > 1. It doesn't have enough bits for all the fields, in particular, >> > variant/ >> > subtype for Ethernet. That is the immediate issue. >> > >> > 2. The interface is not sufficiently generic; it was designed around >> > Ethernet >> > including MII, token ring, FDDI, and a few other interface types. >> > Some of >> > the fields like "instance" are primarily for MII as far as I know, and >> > are >> > basically unused. It is definitely not sufficient for 802.11, which >> > has >> > rolled its own interfaces. >> > >> > To solve the second problem, I think the right approach would be to >> > reduce >> > this interface to a truly generic one, such as media type (e.g. >> > Ethernet), >> > generic flags, and perhaps generic status. Then there should be a >> > separate >> > media-specific interface for each type, such as Ethernet and 802.11. >> > To a >> > small extent, we already have that. Solving the second, more general >> > problem, >> > requires a whole new driver KPI that will require surgery to every >> > driver, >> > which is not an exercise that I would consider. >> > >> > Using a separate int for each existing field, as proposed, would break >> > the >> > driver KPI, but would not really make the interface generic. 
Trying >> > to >> > make a single interface with the union of all network interface >> > requirements >> > seems like a bad idea to me (we failed last time; the "we" is BSDi, >> > where >> > I was the architect when this interface was first designed). (No, I >> > didn't >> > design this interface.) >> > >> > Solving the first problem only, I think it is preferable to preserve a >> > compatible driver KPI, which means using a scalar value encoding what >> > is >> > necessary. Although that interface is rather Ethernet-centric, that >> > is >> > really what it is used for. >> > >> > An additional, selfish goal is to make it easy to back-port drivers >> > using >> > the new interface to older versions (which I am quite likely to do). >> > Preserving the KPI and general user API will be highly useful there. >> > I'd be likely to do a 11-style version of ifconfig personally, but it >> > might not be difficult to do in a more general way. >> > >> > I am willing to do a prototype for -current for evaluation. >> > >> > Comments, alternatives, ? > >> I agree with your statements above and I'd like to see the prototype. > > Well, I developed the prototype as I had planned, using a 64-bit media > word, and found that I got about 100 files in GENERIC that didn't compile; > they attempted to store "media words" in an int. My kingdom for a typedef. > That didn't meet my goal of KPI compatibility, so I went to Plan B. > > Plan B is to steal an unused bit (RFU) to indicate an "extended" media > type. I then used the variant/subtype field to store the extended type. > Effectively, the previously unused bit doubles the effective size of the > subtype field. Given that the previous 5-bit field lasted us 18 years, > I figured that doubling it would last a while. I also changed the > SIOCGIFMEDIA ioctl, splitting it for binary compatibility; extended > types are all mapped to IFM_OTHER (31) using the old interface, but > are visible using the new one.
> > With these changes, I modified one driver (vtnet) to use an extended type, > and the rest of GENERIC is happy. The changes to ifconfig are also fairly > small. The patch is appended, where email programs will screw it up, > or at ftp://ftp.karels.net/outgoing/if_media.patch. > > The VFAST subtype is a throw-away for testing. > > This seems like a reasonably pragmatic change to support the new 40 Gb/s > media types until someone wants to design an improved but non-backward- > compatible interface. I think it meets the goal of suitability for > back-porting; it could be MFCed. > > Mike > > Index: sys/net/if_media.h > =================================================================== > --- sys/net/if_media.h (revision 278804) > +++ sys/net/if_media.h (working copy) > @@ -120,15 +120,29 @@ > * 5-7 Media type > * 8-15 Type specific options > * 16-18 Mode (for multi-mode devices) > - * 19 RFU > + * 19 "extended" bit for media variant > * 20-27 Shared (global) options > * 28-31 Instance > */ > > /* > + * As we have used all of the original values for the media variant (subtype) > + * for Ethernet, extended subtypes have been added, marked with XSUBTYPE, > + * which is effectively the "high bit" of the media variant (subtype) field. > + * IFM_OTHER (the highest basic type) is reserved to indicate use of an > + * extended type when using an old SIOCGIFMEDIA operation. This is true > + * for all media types, not just Ethernet. > + */ > +#define XSUBTYPE 0x80000 /* extended variant high bit */ > +#define _X(var) ((var) | XSUBTYPE) /* extended variant */ > +#define IFM_OTHER 31 /* Other: some extended type */ > +#define OMEDIA(var) (((var) & XSUBTYPE) ? 
IFM_OTHER : (var)) > + > +/* > * Ethernet > */ > #define IFM_ETHER 0x00000020 > +/* NB: 0,1,2 are auto, manual, none defined below */ > #define IFM_10_T 3 /* 10BaseT - RJ45 */ > #define IFM_10_2 4 /* 10Base2 - Thinnet */ > #define IFM_10_5 5 /* 10Base5 - AUI */ > @@ -156,11 +170,17 @@ > #define IFM_40G_CR4 27 /* 40GBase-CR4 */ > #define IFM_40G_SR4 28 /* 40GBase-SR4 */ > #define IFM_40G_LR4 29 /* 40GBase-LR4 */ > +#define IFM_AVAIL30 30 /* available */ > +/* #define IFM_OTHER 31 Other: some extended type */ > +/* note 31 is the max! */ > + > +/* Extended variants/subtypes */ > +#define IFM_VFAST _X(0) /* test "V.fast" */ > +/* note _X(31) is the max! */ > /* > * Please update ieee8023ad_lacp.c:lacp_compose_key() > * after adding new Ethernet media types. > */ > -/* note 31 is the max! */ > > #define IFM_ETH_MASTER 0x00000100 /* master mode (1000baseT) */ > #define IFM_ETH_RXPAUSE 0x00000200 /* receive PAUSE frames */ > @@ -170,6 +190,7 @@ > * Token ring > */ > #define IFM_TOKEN 0x00000040 > +/* NB: 0,1,2 are auto, manual, none defined below */ > #define IFM_TOK_STP4 3 /* Shielded twisted pair 4m - DB9 */ > #define IFM_TOK_STP16 4 /* Shielded twisted pair 16m - DB9 */ > #define IFM_TOK_UTP4 5 /* Unshielded twisted pair 4m - RJ45 */ > @@ -187,6 +208,7 @@ > * FDDI > */ > #define IFM_FDDI 0x00000060 > +/* NB: 0,1,2 are auto, manual, none defined below */ > #define IFM_FDDI_SMF 3 /* Single-mode fiber */ > #define IFM_FDDI_MMF 4 /* Multi-mode fiber */ > #define IFM_FDDI_UTP 5 /* CDDI / UTP */ > @@ -220,6 +242,7 @@ > #define IFM_IEEE80211_OFDM27 23 /* OFDM 27Mbps */ > /* NB: not enough bits to express MCS fully */ > #define IFM_IEEE80211_MCS 24 /* HT MCS rate */ > +/* #define IFM_OTHER 31 Other: some extended type */ > > #define IFM_IEEE80211_ADHOC 0x00000100 /* Operate in Adhoc mode */ > #define IFM_IEEE80211_HOSTAP 0x00000200 /* Operate in Host AP mode */ > @@ -241,6 +264,7 @@ > * ATM > */ > #define IFM_ATM 0x000000a0 > +/* NB: 0,1,2 are auto, manual, none defined below 
*/ > #define IFM_ATM_UNKNOWN 3 > #define IFM_ATM_UTP_25 4 > #define IFM_ATM_TAXI_100 5 > @@ -277,7 +301,7 @@ > * Masks > */ > #define IFM_NMASK 0x000000e0 /* Network type */ > -#define IFM_TMASK 0x0000001f /* Media sub-type */ > +#define IFM_TMASK 0x0008001f /* Media sub-type */ > #define IFM_IMASK 0xf0000000 /* Instance */ > #define IFM_ISHIFT 28 /* Instance shift */ > #define IFM_OMASK 0x0000ff00 /* Type specific options */ > @@ -372,6 +396,7 @@ > { IFM_40G_CR4, "40Gbase-CR4" }, \ > { IFM_40G_SR4, "40Gbase-SR4" }, \ > { IFM_40G_LR4, "40Gbase-LR4" }, \ > + { IFM_VFAST, "V.fast" }, \ > { 0, NULL }, \ > } > > @@ -603,6 +628,7 @@ > { IFM_AUTO, "autoselect" }, \ > { IFM_MANUAL, "manual" }, \ > { IFM_NONE, "none" }, \ > + { IFM_OTHER, "other" }, \ > { 0, NULL }, \ > } > > @@ -673,6 +699,7 @@ > { IFM_ETHER | IFM_40G_CR4, IF_Gbps(40ULL) }, \ > { IFM_ETHER | IFM_40G_SR4, IF_Gbps(40ULL) }, \ > { IFM_ETHER | IFM_40G_LR4, IF_Gbps(40ULL) }, \ > + { IFM_ETHER | IFM_VFAST, IF_Gbps(40ULL) }, \ > \ > { IFM_TOKEN | IFM_TOK_STP4, IF_Mbps(4) }, \ > { IFM_TOKEN | IFM_TOK_STP16, IF_Mbps(16) }, \ > Index: sys/sys/sockio.h > =================================================================== > --- sys/sys/sockio.h (revision 278810) > +++ sys/sys/sockio.h (working copy) > @@ -128,5 +128,6 @@ > #define SIOCGIFGROUP _IOWR('i', 136, struct ifgroupreq) /* get ifgroups */ > #define SIOCDIFGROUP _IOW('i', 137, struct ifgroupreq) /* delete ifgroup */ > #define SIOCGIFGMEMB _IOWR('i', 138, struct ifgroupreq) /* get members */ > +#define SIOCGIFXMEDIA _IOWR('i', 139, struct ifmediareq) /* get net xmedia */ > > #endif /* !_SYS_SOCKIO_H_ */ > Index: sys/net/if.c > =================================================================== > --- sys/net/if.c (revision 278749) > +++ sys/net/if.c (working copy) > @@ -2561,6 +2561,7 @@ > case SIOCGIFPSRCADDR: > case SIOCGIFPDSTADDR: > case SIOCGIFMEDIA: > + case SIOCGIFXMEDIA: > case SIOCGIFGENERIC: > if (ifp->if_ioctl == NULL) > return (EOPNOTSUPP); > Index: 
sys/net/if_media.c > =================================================================== > --- sys/net/if_media.c (revision 278804) > +++ sys/net/if_media.c (working copy) > @@ -67,7 +67,9 @@ > static struct ifmedia_entry *ifmedia_match(struct ifmedia *ifm, > int flags, int mask); > > +#define IFMEDIA_DEBUG > #ifdef IFMEDIA_DEBUG > +#include > int ifmedia_debug = 0; > SYSCTL_INT(_debug, OID_AUTO, ifmedia, CTLFLAG_RW, &ifmedia_debug, > 0, "if_media debugging msgs"); > @@ -271,6 +273,7 @@ > * Get list of available media and current media on interface. > */ > case SIOCGIFMEDIA: > + case SIOCGIFXMEDIA: > { > struct ifmedia_entry *ep; > int *kptr, count; > @@ -278,8 +281,13 @@ > > kptr = NULL; /* XXX gcc */ > > - ifmr->ifm_active = ifmr->ifm_current = ifm->ifm_cur ? > - ifm->ifm_cur->ifm_media : IFM_NONE; > + if (cmd == SIOCGIFMEDIA) { > + ifmr->ifm_active = ifmr->ifm_current = ifm->ifm_cur ? > + OMEDIA(ifm->ifm_cur->ifm_media) : IFM_NONE; > + } else { > + ifmr->ifm_active = ifmr->ifm_current = ifm->ifm_cur ? > + ifm->ifm_cur->ifm_media : IFM_NONE; > + } > ifmr->ifm_mask = ifm->ifm_mask; > ifmr->ifm_status = 0; > (*ifm->ifm_status)(ifp, ifmr); > @@ -317,7 +325,10 @@ > ep = LIST_FIRST(&ifm->ifm_list); > for (; ep != NULL && count < ifmr->ifm_count; > ep = LIST_NEXT(ep, ifm_list), count++) > - kptr[count] = ep->ifm_media; > + if (cmd == SIOCGIFMEDIA) > + kptr[count] = OMEDIA(ep->ifm_media); > + else > + kptr[count] = ep->ifm_media; > > if (ep != NULL) > error = E2BIG; /* oops! */ > @@ -505,7 +516,7 @@ > printf("\n"); > return; > } > - printf(desc->ifmt_string); > + printf("%s", desc->ifmt_string); > > /* Any mode. 
*/ > for (desc = ttos->modes; desc && desc->ifmt_string != NULL; desc++) > > Index: sys/dev/virtio/network/if_vtnet.c > =================================================================== > --- sys/dev/virtio/network/if_vtnet.c (revision 278749) > +++ sys/dev/virtio/network/if_vtnet.c (working copy) > @@ -938,6 +938,7 @@ > ifmedia_init(&sc->vtnet_media, IFM_IMASK, vtnet_ifmedia_upd, > vtnet_ifmedia_sts); > ifmedia_add(&sc->vtnet_media, VTNET_MEDIATYPE, 0, NULL); > + ifmedia_add(&sc->vtnet_media, IFM_ETHER | IFM_VFAST, 0, NULL); > ifmedia_set(&sc->vtnet_media, VTNET_MEDIATYPE); > > /* Read (or generate) the MAC address for the adapter. */ > @@ -1103,6 +1104,7 @@ > > case SIOCSIFMEDIA: > case SIOCGIFMEDIA: > + case SIOCGIFXMEDIA: > error = ifmedia_ioctl(ifp, ifr, &sc->vtnet_media, cmd); > break; > Index: sbin/ifconfig/ifmedia.c > =================================================================== > --- sbin/ifconfig/ifmedia.c (revision 278749) > +++ sbin/ifconfig/ifmedia.c (working copy) > @@ -109,11 +109,17 @@ > { > struct ifmediareq ifmr; > int *media_list, i; > + int xmedia = 1; > > (void) memset(&ifmr, 0, sizeof(ifmr)); > (void) strncpy(ifmr.ifm_name, name, sizeof(ifmr.ifm_name)); > > - if (ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0) { > + /* > + * Check if interface supports extended media types. > + */ > + if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)&ifmr) < 0) > + xmedia = 0; > + if (xmedia == 0 && ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0) { > /* > * Interface doesn't support SIOC{G,S}IFMEDIA. 
>           */
> @@ -130,8 +136,13 @@
>          err(1, "malloc");
>      ifmr.ifm_ulist = media_list;
>
> -    if (ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0)
> -        err(1, "SIOCGIFMEDIA");
> +    if (xmedia) {
> +        if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)&ifmr) < 0)
> +            err(1, "SIOCGIFXMEDIA");
> +    } else {
> +        if (ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0)
> +            err(1, "SIOCGIFMEDIA");
> +    }
>
>      printf("\tmedia: ");
>      print_media_word(ifmr.ifm_current, 1);
> @@ -194,6 +205,7 @@
>  {
>      static struct ifmediareq *ifmr = NULL;
>      int *mwords;
> +    int xmedia = 1;
>
>      if (ifmr == NULL) {
>          ifmr = (struct ifmediareq *)malloc(sizeof(struct ifmediareq));
> @@ -213,7 +225,10 @@
>       * the current media type and the top-level type.
>       */
>
> -    if (ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0) {
> +    if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)ifmr) < 0) {
> +        xmedia = 0;
> +    }
> +    if (xmedia == 0 && ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0) {
>          err(1, "SIOCGIFMEDIA");
>      }
>
> @@ -225,8 +240,13 @@
>          err(1, "malloc");
>
>      ifmr->ifm_ulist = mwords;
> -    if (ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0)
> -        err(1, "SIOCGIFMEDIA");
> +    if (xmedia) {
> +        if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)ifmr) < 0)
> +            err(1, "SIOCGIFXMEDIA");
> +    } else {
> +        if (ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0)
> +            err(1, "SIOCGIFMEDIA");
> +    }
>  }
>
>  return ifmr;
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 04:10:15 2015 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 75FF24F2; Thu, 19 Feb 2015 04:10:15 +0000 (UTC) Received: from gold.funkthat.com (gate2.funkthat.com [208.87.223.18]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client CN "gold.funkthat.com", Issuer "gold.funkthat.com" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 512A63B8; Thu, 19 Feb 2015 04:10:14 +0000 (UTC) Received: from gold.funkthat.com (localhost [127.0.0.1]) by gold.funkthat.com (8.14.5/8.14.5) with ESMTP id t1J4ACQI041145 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 18 Feb 2015 20:10:12 -0800 (PST) (envelope-from jmg@gold.funkthat.com) Received: (from jmg@localhost) by gold.funkthat.com (8.14.5/8.14.5/Submit) id t1J4ACaY041144; Wed, 18 Feb 2015 20:10:12 -0800 (PST) (envelope-from jmg) Date: Wed, 18 Feb 2015 20:10:12 -0800 From: John-Mark Gurney To: freebsd-arch@FreeBSD.org Subject: getting NUMA into the tree (userland most interesting for me) Message-ID: <20150219041012.GJ1953@funkthat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline X-Operating-System: FreeBSD 9.1-PRERELEASE amd64 X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396 X-Files: The truth is out there X-URL: http://resnet.uoregon.edu/~gurney_j/ X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html X-TipJar: bitcoin:13Qmb6AeTgQecazTWph4XasEsP7nGRbAPE X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger?
User-Agent: Mutt/1.5.21 (2010-09-15) X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7 (gold.funkthat.com [127.0.0.1]); Wed, 18 Feb 2015 20:10:12 -0800 (PST) Cc: alc@FreeBSD.org, kib@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Feb 2015 04:10:15 -0000

I would like to help drive getting NUMA into the tree. Specifically, getting userland allocations to be done from a specified domain.

I've looked at the projects/numa tree, but it appears that not much was done to get userland mappings to be NUMA aware.

How are we going to do this? Do people have code to do this?

I've looked at how Linux does this, at least from a programming interface. They use mmap to create the mapping, and then use the call mbind to tell the kernel where to handle the allocations. Is this what people are thinking?

I've checked the wiki status, and the userland section is quite empty.

Thanks.

--
John-Mark Gurney Voice: +1 415 225 5579

"All that I will do, has been done, All that I have, has not."
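For reference, the Linux pattern described in the message above (mmap to create the mapping, then mbind to set where its pages come from) might look roughly like the sketch below. This is only an illustration under assumptions: map_on_node() is a hypothetical helper name, the raw mbind system call is used so that no libnuma headers are needed, and the MPOL_BIND fallback value is taken from the Linux headers. It says nothing about what a FreeBSD interface would or should look like.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#ifdef __linux__
#include <sys/syscall.h>
#endif

#ifndef MPOL_BIND
#define MPOL_BIND 2	/* assumed value, as in Linux <linux/mempolicy.h> */
#endif

/*
 * Hypothetical helper: map len bytes of anonymous memory and ask the
 * kernel to allocate its pages from NUMA node `node`.
 *
 * Returns the mapping, or NULL on failure. *bound is set to 1 only if
 * the placement policy was accepted; when mbind is unavailable or
 * refused, the mapping is still returned, just without a policy.
 */
static void *
map_on_node(size_t len, int node, int *bound)
{
	void *p;

	assert(len > 0 && bound != NULL);
	*bound = 0;
	/* Keep the node within the single-word mask used below. */
	if (node < 0 || node >= (int)(sizeof(unsigned long) * 8) - 1) {
		errno = EINVAL;
		return (NULL);
	}

	/* Step 1: create the mapping; no pages are allocated yet. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return (NULL);

#if defined(__linux__) && defined(SYS_mbind)
	{
		/* Step 2: bind the range to one node before first touch. */
		unsigned long nodemask = 1UL << node;

		/* The fifth argument bounds how many mask bits are examined. */
		if (syscall(SYS_mbind, (unsigned long)p, (unsigned long)len,
		    (unsigned long)MPOL_BIND, &nodemask,
		    (unsigned long)(sizeof(nodemask) * 8), 0UL) == 0)
			*bound = 1;
	}
#endif
	return (p);
}
```

A caller would typically check `bound`, then touch the memory (for example with memset) so that the pages are actually faulted in under the policy, and release the range with munmap when done.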
From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 04:38:01 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 849DAD0E for ; Thu, 19 Feb 2015 04:38:01 +0000 (UTC) Received: from elvis.mu.org (elvis.mu.org [192.203.228.196]) by mx1.freebsd.org (Postfix) with ESMTP id 7211E899 for ; Thu, 19 Feb 2015 04:38:00 +0000 (UTC) Received: from Alfreds-MacBook-Pro.local (64-60-248-106.static-ip.telepacific.net [64.60.248.106]) by elvis.mu.org (Postfix) with ESMTPSA id 1C011341F8B2 for ; Wed, 18 Feb 2015 20:38:00 -0800 (PST) Message-ID: <54E568E3.3000905@freebsd.org> Date: Wed, 18 Feb 2015 20:38:59 -0800 From: Alfred Perlstein Organization: FreeBSD User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: freebsd-arch@freebsd.org Subject: Re: getting NUMA into the tree (userland most interesting for me) References: <20150219041012.GJ1953@funkthat.com> In-Reply-To: <20150219041012.GJ1953@funkthat.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Feb 2015 04:38:01 -0000 On 2/18/15 8:10 PM, John-Mark Gurney wrote: > I would like to help drive getting NUMA into the tree. Specificly, > getting userland allocations to be done from a specified domain. > > I've looked at the projects/numa tree, but it appears that not much was > done to get userland mappings to be NUMA aware. > > How are we going to do this? Do people have code to do this? > > I've looked at how Linux does this, at least from a programming > interface. 
They use mmap to create the mapping, and then use the call > mbind to tell the kernel where to handle the allocations. Is this > what people are thinking? > > I've checked the wiki status, and the userland section is quite > empty. > > Thanks. > Going with Linux makes sense at a glance just because it means that the software can run on us without mods. -Alfred From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 13:16:07 2015 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 37419673; Thu, 19 Feb 2015 13:16:07 +0000 (UTC) Received: from zxy.spb.ru (zxy.spb.ru [195.70.199.98]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id E2A7C305; Thu, 19 Feb 2015 13:16:06 +0000 (UTC) Received: from slw by zxy.spb.ru with local (Exim 4.84 (FreeBSD)) (envelope-from ) id 1YOQYy-000CVv-93; Thu, 19 Feb 2015 15:50:40 +0300 Date: Thu, 19 Feb 2015 15:50:40 +0300 From: Slawa Olhovchenkov To: John-Mark Gurney Subject: Re: getting NUMA into the tree (userland most interesting for me) Message-ID: <20150219125040.GA46228@zxy.spb.ru> References: <20150219041012.GJ1953@funkthat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150219041012.GJ1953@funkthat.com> User-Agent: Mutt/1.5.23 (2014-03-12) X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: slw@zxy.spb.ru X-SA-Exim-Scanned: No (on zxy.spb.ru); SAEximRunCond expanded to false Cc: alc@FreeBSD.org, kib@FreeBSD.org, freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 
Feb 2015 13:16:07 -0000

On Wed, Feb 18, 2015 at 08:10:12PM -0800, John-Mark Gurney wrote:

> I would like to help drive getting NUMA into the tree. Specifically,
> getting userland allocations to be done from a specified domain.

When I worked with Tilera, NUMA memory allocation/organization was a loss compared with unified memory. I think only very specific cases will win with NUMA memory allocation.

From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 14:46:43 2015 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id AC98C11F for ; Thu, 19 Feb 2015 14:46:43 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 84F20E88 for ; Thu, 19 Feb 2015 14:46:43 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-54-116-245.nwrknj.fios.verizon.net [173.54.116.245]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 07660B91F for ; Thu, 19 Feb 2015 09:46:42 -0500 (EST) From: John Baldwin To: arch@freebsd.org Subject: RFC: bus_get_cpus(9) Date: Thu, 19 Feb 2015 09:46:35 -0500 Message-ID: <1848011.eGOHhpCEMm@ralph.baldwin.cx> User-Agent: KMail/4.14.2 (FreeBSD/10.1-STABLE; KDE/4.14.2; amd64; ; ) MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Thu, 19 Feb 2015 09:46:42 -0500 (EST) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Feb 2015 14:46:43 -0000

One of the next steps
for NUMA device-awareness is a way to let drivers know which CPUs are ideal to use for interrupts (and in particular this is targeted at multiqueue NICs that want to create a TX/RX ring pair per CPU). However, for modern Intel systems at least, it is usually best to use CPUs from the physical processor package that contains the I/O hub that a device connects to (e.g. to allow DDIO to work).

The PoC API I came up with is a new bus method called bus_get_cpus() that returns a requested cpuset for a given device. It accepts an enum for the second parameter that says the type of cpuset being requested. Currently two values are supported:

- LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to the device when NUMA is enabled)
- INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)

For a NIC driver the expectation is that the driver will call 'bus_get_cpus(dev, INTR_CPUS, &set)' and create queues for each of the CPUs in 'set'. (In my current patchset I have updated igb(4) to use this approach.)

For systems that do not support NUMA (or if it is not enabled in the kernel config), LOCAL_CPUS is mapped to 'all_cpus' by default in the 'root_bus' driver. INTR_CPUS is also mapped to 'all_cpus' by default.

The x86 interrupt code maintains its own set of interrupt CPUs which this patch now exposes via INTR_CPUS in the x86 nexus driver.

The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable LOCAL_CPUS set when _PXM exists and NUMA is enabled. They also AND the global INTR_CPUS set from the nexus driver with the per-domain set from _PXM to generate a local INTR_CPUS set for child devices.
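To make the LOCAL_CPUS / INTR_CPUS distinction concrete, here is a small userland mock of the lookup described above. It is a sketch under stated assumptions, not the kernel code from the patch: mock_bus_get_cpus(), the 64-bit mask standing in for a cpuset, and the topology (2 packages, 4 cores each, 2 SMT threads per core, siblings numbered 2k and 2k+1) are all hypothetical. The device is identified here simply by the domain it is attached to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the two request types named in the proposal. */
enum cpu_sets { LOCAL_CPUS, INTR_CPUS };

/* Hypothetical topology: 2 packages x 4 cores x 2 SMT threads. */
#define NPACKAGES	2
#define CPUS_PER_PKG	8

typedef uint64_t mock_cpuset_t;	/* bit n set means CPU n is in the set */

/* Global interrupt set: one SMT thread (the even-numbered one) per core. */
static mock_cpuset_t
global_intr_cpus(void)
{
	mock_cpuset_t set = 0;
	int cpu;

	for (cpu = 0; cpu < NPACKAGES * CPUS_PER_PKG; cpu += 2)
		set |= (mock_cpuset_t)1 << cpu;
	return (set);
}

/* All CPUs in the package (NUMA domain) closest to the device. */
static mock_cpuset_t
domain_cpus(int domain)
{
	mock_cpuset_t pkg = ((mock_cpuset_t)1 << CPUS_PER_PKG) - 1;

	return (pkg << (domain * CPUS_PER_PKG));
}

/* Mock lookup: returns 0 and fills *set, or -1 on a bad request. */
static int
mock_bus_get_cpus(int dev_domain, enum cpu_sets op, mock_cpuset_t *set)
{
	if (dev_domain < 0 || dev_domain >= NPACKAGES || set == NULL)
		return (-1);
	switch (op) {
	case LOCAL_CPUS:
		*set = domain_cpus(dev_domain);
		return (0);
	case INTR_CPUS:
		/* Per-domain set ANDed with the global interrupt set. */
		*set = domain_cpus(dev_domain) & global_intr_cpus();
		return (0);
	}
	return (-1);
}

/* Driver side: one queue per CPU in the set; records each queue's CPU. */
static int
create_queues(mock_cpuset_t set, int *queue_cpu, int maxq)
{
	int cpu, nq = 0;

	for (cpu = 0; cpu < 64 && nq < maxq; cpu++)
		if (set & ((mock_cpuset_t)1 << cpu))
			queue_cpu[nq++] = cpu;
	return (nq);
}
```

With this mock topology, a device on domain 1 gets CPUs 8 through 15 for LOCAL_CPUS and CPUs 8, 10, 12 and 14 for INTR_CPUS, so a driver following the 1:1 approach would create four queues.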
The current patch can be found here:

https://github.com/bsdjhb/freebsd/compare/bsdjhb:master...numa_bus_get_cpus

It includes a few other fixes besides the implementation of bus_get_cpus() (and some things have already been committed such as taskqueue_start_threads_cpuset() and CPU_COUNT()):

- It fixes the x86 interrupt code to exclude modern SMT threads from the default interrupt set. (Previously only Pentium 4-era HTT threads were excluded.)
- It has a sample conversion of igb(4) to this interface (albeit ugly using #if's).

Longer term I think I would like to make the INTR_CPUS thing a bit more formal. In particular, Solaris allows you to alter the set of CPUs that handle interrupts via prctl (or a tool named something close to that). I think I would like to have a dedicated global cpuset for that (but not named "2", it would be a new WHICH level). That would allow userland to use cpuset to alter the set of CPUs that handle interrupts in case you wanted to use SMT for example. I think if we do this that all ithreads would have their cpusets hang off of this set instead of the root set (which would also remove some of the recent special case handling for ithreads I believe). The one uglier part about this is that we should probably then have a way to notify drivers that INTR_CPUS changed so that they could try to cope gracefully. I think that's a bit of a longer horizon thing, but for now I think bus_get_cpus() is a good next step.

What do other folks think? (And yes, I know it needs a manpage before it goes in, but I'd rather get the API agreed on before polishing that.)
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 15:28:15 2015 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 4E588C6A; Thu, 19 Feb 2015 15:28:15 +0000 (UTC) Received: from mail-ig0-x231.google.com (mail-ig0-x231.google.com [IPv6:2607:f8b0:4001:c05::231]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 15777352; Thu, 19 Feb 2015 15:28:15 +0000 (UTC) Received: by mail-ig0-f177.google.com with SMTP id z20so9979585igj.4; Thu, 19 Feb 2015 07:28:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=BpjX/RCOoIJIpHwEJnv/MH12xLiUb+/Y4HDnKymKpMU=; b=XkMJKRl17KqX7ViaYRStbdAcMWEch89MLB7Jf/BaCZ6UpSMNe8jiozqzWk57M59qEa uLYz2b6It1mUwWNmyiAWW9rjKu7D/L/U0R0bsGIKsIHgdYiVKlY5px83yllYewXaZ3dG Za+TfwWneHatGn1JvHzXeKgKTG4oex0coMzyEWO86sufZZ1CNOV6j48IUvgUzjCgN2bc Sq3tuOrv26rGfqq1plL1ZpDbuOuBtVp8cvEWcnmcYs3eaEZdMOEbsR1W1f0nCkZ66/js zfViAV7eJEwUIR2+mls3WUkj50GQxRbpGXZ/euxfl6qOU1OTph3DF7/s5cxQEdEyxk2E 4WxQ== MIME-Version: 1.0 X-Received: by 10.50.79.135 with SMTP id j7mr5312205igx.32.1424359694287; Thu, 19 Feb 2015 07:28:14 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.36.17.66 with HTTP; Thu, 19 Feb 2015 07:28:14 -0800 (PST) In-Reply-To: <1848011.eGOHhpCEMm@ralph.baldwin.cx> References: <1848011.eGOHhpCEMm@ralph.baldwin.cx> Date: Thu, 19 Feb 2015 07:28:14 -0800 X-Google-Sender-Auth: 35H0WDcYEnyNeQCs6z89dzOhYd8 Message-ID: Subject: Re: RFC: bus_get_cpus(9) From: Adrian Chadd To: John Baldwin Content-Type: text/plain; charset=UTF-8 Cc: "freebsd-arch@freebsd.org" X-BeenThere: 
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Feb 2015 15:28:15 -0000 Hi, On 19 February 2015 at 06:46, John Baldwin wrote: > One of the next steps for NUMA device-awareness is a way to let drivers know > which CPUs are ideal to use for interrupts (and in particular this is targeted > at multiqueue NICs that want to create a TX/RX ring pair per CPU). However, > for modern Intel systems at least, it is usually best to use CPUs from the > physical processor package that contains the I/O hub that a device connects to > (e.g. to allow DDIO to work). > > The PoC API I came up with is a new bus method called bus_get_cpus() that > returns a requested cpuset for a given device. It accepts an enum for the > second parameter that says the type of cpuset being requested. Currently two > valus are supported: > > - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to the > device when NUMA is enabled) > - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core) > > For a NIC driver the expectation is that the driver will call > 'bus_get_cpus(dev, INTR_CPUS, &set)' and create queues for each of the CPUs in > 'set'. (In my current patchset I have updated igb(4) to use this approach.) > > For systems that do not support NUMA (or if it is not enabled in the kernel > config), LOCAL_CPUS is mapped to 'all_cpus' by default in the 'root_bus' > driver. INTR_CPUS is also mapped to 'all_cpus' by default. > > The x86 interrupt code maintains its own set of interrupt CPUs which this > patch now exposes via INTR_CPUS in the x86 nexus driver. > > The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable > LOCAL_CPUS set when _PXM exists and NUMA is enabled. 
They also and the global > INTR_CPUS set from the nexus driver with the per-domain set from _PXM to > generate a local INTR_CPUS set for child devices. > > The current patch can be found here: > > https://github.com/bsdjhb/freebsd/compare/bsdjhb:master...numa_bus_get_cpus > > It includes a few other fixes besides the implementation of bus_get_cpu() (and > some things have already been committed such as > taskqueue_start_threads_cpuset() and CPU_COUNT()): > > - It fixes the x86 interrupt code to exclude modern SMT threads from the > default interrupt set. (Previously only Pentium 4-era HTT threads were > excluded.) > - It has a sample conversion of igb(4) to this interface (albeit ugly using > #if's). > > Longer term I think I would like to make the INTR_CPUS thing a bit more > formal. In particular, Solaris allows you to alter the set of CPUs that > handle interrupts via prctl (or a tool named something close to that). I > think I would like to have a dedicated global cpuset for that (but not named > "2", it would be a new WHICH level). That would allow userland to use cpuset > to alter the set of CPUs that handle interrupts in case you wanted to use SMT > for example. I think if we do this that all ithreads would have their cpusets > hang off of this set instead of the root set (which would also remove some of > the recent special case handling for ithreads I believe). The one uglier part > about this is that we should probably then have a way to notify drivers that > INTR_CPUS changed so that they could try to cope gracefully. I think that's a > bit of a longer horizon thing, but for now I think bus_get_cpus() is a good > next step. > > What do other folks think? (And yes, I know it needs a manpage before it goes > in, but I'd rather get the API agreed on before polishing that.) So I'd rather something slightly more descriptive about the iteration of cpu ids and cpusets for each queue in a driver. 
Eg, you're iterating over the interrupt set for a CPU, but it's still up to the driver to walk the cpuset array and figure out which queue goes to which CPU.

For RSS I'm going to take this stuff and have the driver call into RSS to do this. It'll provide the deviceid, and it'll return:

* how many queues?
* for each queue id, what is the cpuset the interrupts/taskqueues should run on.

It already does this, but it currently has no idea about the underlying device topology.

(Later on for rebalancing we'll want to have those cpusets returned be some top level cpusets per RSS bucket that we can change and have everything cascade "right", but that's a later thing to worry about.)

For the RSS and non-RSS case, I can see situations where the admin may wish to define a mapping for queues to cpusets that isn't necessarily a 1:1 mapping like you've done. For example, saying "I want 16 queues, but I have four CPUs, here's how they're mapped." Right now drivers do round-robin, but it's hard coded. If we have an iteration API like what exists for RSS then we can hide the policy config in the bus code and have it check kenv/hints at boot time for what the config should be.

So I'd like to have the API be a little higher level so we can do interesting things with it.
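The "16 queues on four CPUs" case above can be illustrated with a round-robin mapping helper. This is only a sketch of the default policy the message says drivers hard-code today; map_queues_round_robin() and the 64-bit CPU mask are hypothetical names for illustration and are not part of the RSS or new-bus code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: assign each of nqueues queues to a CPU from the
 * given set (bit n set means CPU n is usable), in round-robin order.
 *
 * Fills queue_cpu[q] with the CPU chosen for queue q and returns the
 * number of queues mapped, or -1 if the set is empty or nqueues < 0.
 */
static int
map_queues_round_robin(uint64_t cpus, int nqueues, int *queue_cpu)
{
	int cpu_ids[64];
	int ncpus = 0, cpu, q;

	/* Collect the CPU ids present in the set, lowest first. */
	for (cpu = 0; cpu < 64; cpu++)
		if (cpus & ((uint64_t)1 << cpu))
			cpu_ids[ncpus++] = cpu;
	if (ncpus == 0 || nqueues < 0)
		return (-1);

	/* Queue q lands on the (q mod ncpus)-th CPU of the set. */
	for (q = 0; q < nqueues; q++)
		queue_cpu[q] = cpu_ids[q % ncpus];
	return (nqueues);
}
```

An admin-defined policy of the kind suggested above would replace the modulo step with a lookup in whatever table was loaded from kenv/hints, leaving the caller's view (queue id in, CPU out) unchanged.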
-adrian From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 15:37:31 2015 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 8DF80E40; Thu, 19 Feb 2015 15:37:31 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 6169A63E; Thu, 19 Feb 2015 15:37:31 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-54-116-245.nwrknj.fios.verizon.net [173.54.116.245]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 0A6C8B915; Thu, 19 Feb 2015 10:37:30 -0500 (EST) From: John Baldwin To: Adrian Chadd Subject: Re: RFC: bus_get_cpus(9) Date: Thu, 19 Feb 2015 10:37:28 -0500 Message-ID: <6147240.5Rne9DUXyM@ralph.baldwin.cx> User-Agent: KMail/4.14.2 (FreeBSD/10.1-STABLE; KDE/4.14.2; amd64; ; ) In-Reply-To: References: <1848011.eGOHhpCEMm@ralph.baldwin.cx> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Thu, 19 Feb 2015 10:37:30 -0500 (EST) Cc: "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Feb 2015 15:37:31 -0000 On Thursday, February 19, 2015 07:28:14 AM Adrian Chadd wrote: > Hi, > > On 19 February 2015 at 06:46, John Baldwin wrote: > > One of the next steps for NUMA device-awareness is a way to let drivers > > know which CPUs are ideal to use for interrupts (and in particular this > > is targeted at multiqueue NICs that want to create a 
TX/RX ring pair per
> > CPU). However, for modern Intel systems at least, it is usually best to
> > use CPUs from the physical processor package that contains the I/O hub
> > that a device connects to (e.g. to allow DDIO to work).
> >
> > The PoC API I came up with is a new bus method called bus_get_cpus() that
> > returns a requested cpuset for a given device. It accepts an enum for the
> > second parameter that says the type of cpuset being requested. Currently
> > two values are supported:
> >
> > - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to
> >   the device when NUMA is enabled)
> > - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
> >
> > For a NIC driver the expectation is that the driver will call
> > 'bus_get_cpus(dev, INTR_CPUS, &set)' and create queues for each of the
> > CPUs in 'set'. (In my current patchset I have updated igb(4) to use this
> > approach.)
> >
> > For systems that do not support NUMA (or if it is not enabled in the
> > kernel config), LOCAL_CPUS is mapped to 'all_cpus' by default in the
> > 'root_bus' driver. INTR_CPUS is also mapped to 'all_cpus' by default.
> >
> > The x86 interrupt code maintains its own set of interrupt CPUs which this
> > patch now exposes via INTR_CPUS in the x86 nexus driver.
> >
> > The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
> > LOCAL_CPUS set when _PXM exists and NUMA is enabled. They also AND the
> > global INTR_CPUS set from the nexus driver with the per-domain set from
> > _PXM to generate a local INTR_CPUS set for child devices.
> >
> > The current patch can be found here:
> >
> > https://github.com/bsdjhb/freebsd/compare/bsdjhb:master...numa_bus_get_cpus
> >
> > It includes a few other fixes besides the implementation of bus_get_cpus()
> > (and some things have already been committed such as
> > taskqueue_start_threads_cpuset() and CPU_COUNT()):
> >
> > - It fixes the x86 interrupt code to exclude modern SMT threads from the
> >   default interrupt set. (Previously only Pentium 4-era HTT threads were
> >   excluded.)
> > - It has a sample conversion of igb(4) to this interface (albeit ugly
> >   using #if's).
> >
> > Longer term I think I would like to make the INTR_CPUS thing a bit more
> > formal. In particular, Solaris allows you to alter the set of CPUs that
> > handle interrupts via prctl (or a tool named something close to that). I
> > think I would like to have a dedicated global cpuset for that (but not
> > named "2", it would be a new WHICH level). That would allow userland to
> > use cpuset to alter the set of CPUs that handle interrupts in case you
> > wanted to use SMT for example. I think if we do this that all ithreads
> > would have their cpusets hang off of this set instead of the root set
> > (which would also remove some of the recent special case handling for
> > ithreads I believe). The one uglier part about this is that we should
> > probably then have a way to notify drivers that INTR_CPUS changed so that
> > they could try to cope gracefully. I think that's a bit of a longer
> > horizon thing, but for now I think bus_get_cpus() is a good next step.
> >
> > What do other folks think? (And yes, I know it needs a manpage before it
> > goes in, but I'd rather get the API agreed on before polishing that.)
>
> So I'd rather something slightly more descriptive about the iteration
> of cpu ids and cpusets for each queue in a driver.
>
> Eg, you're iterating over the interrupt set for a CPU, but it's still
> up to the driver to walk the cpuset array and figure out which queue
> goes to which CPU.
>
> For RSS I'm going to take this stuff and have the driver call into RSS
> to do this. It'll provide the deviceid, and it'll return:
>
> * how many queues?
> * for each queue id, what is the cpuset the interrupts/taskqueues should run
>   on.
>
> It already does this, but it currently has no idea about the
> underlying device topology.
>
> (Later on for rebalancing we'll want to have those cpusets returned be
> some top level cpusets per RSS bucket that we can change and have
> everything cascade "right", but that's a later thing to worry about.)
>
> For the RSS and non-RSS case, I can see situations where the admin may
> wish to define a mapping for queues to cpusets that aren't necessarily
> 1:1 mapping like you've done. For example, saying "I want 16 queues,
> but I have four CPUs, here's how they're mapped." Right now drivers do
> round-robin, but it's hard coded. If we have an iteration API like
> what exists for RSS then we can hide the policy config in the bus code
> and have it check kenv/hints at boot time for what the config should
> be.
>
> So I'd like to have the API a little more higher level so we can do
> interesting things with it.

There's nothing preventing the RSS code from calling bus_get_cpus() internally to populate the info it returns in its APIs.

That is, I imagine something like:

#ifdef RSS
    queue_info = fetch_rss_info(dev);
    for (queue in queue_info) {
        create queue for CPU queue->cpu
    }
#else
    /* Use bus_get_cpus directly and do 1:1 */
#endif

That is, I think RSS should provide a layer on top of new-bus, not be a bus_foo API. At some point all drivers might only have the #ifdef RSS case and not use bus_get_cpus() directly at all, but it doesn't seem like the RSS API is quite there yet.
--
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 16:28:35 2015
From: Adrian Chadd
To: John Baldwin
Cc: "freebsd-arch@freebsd.org"
Subject: Re: RFC: bus_get_cpus(9)
Date: Thu, 19 Feb 2015 08:28:34 -0800
In-Reply-To: <6147240.5Rne9DUXyM@ralph.baldwin.cx>
References: <1848011.eGOHhpCEMm@ralph.baldwin.cx> <6147240.5Rne9DUXyM@ralph.baldwin.cx>

On 19 February 2015 at 07:37, John Baldwin wrote:
> There's nothing preventing the RSS code from calling bus_get_cpus() internally
> to populate the info it returns in its APIs.
>
> That is, I imagine something like:
>
> #ifdef RSS
>     queue_info = fetch_rss_info(dev);
>     for (queue in queue_info) {
>         create queue for CPU queue->cpu
>     }
> #else
>     /* Use bus_get_cpus directly and do 1:1 */
> #endif
>
> That is, I think RSS should provide a layer on top of new-bus, not be a
> bus_foo API. At some point all drivers might only have the #ifdef RSS case
> and not use bus_get_cpus() directly at all, but it doesn't seem like the RSS
> API is quite there yet.

I wasn't suggesting we have RSS as a newbus method, just that drivers
don't necessarily need to call the bus method and iterate themselves.

I was suggesting that we do what i've done for rss, but as a generic
"how should devices create queues and map them to cpusets / interrupt
locality" and that calls the bus method(s) to discover topology and
query local-interrupt and local-memory sets to do things
appropriately.

Then RSS is just a flavour of that API call - network drivers could
either be RSS aware and call it to get the mapping, or call some
higher level bus API call to do the "generic" hints or whatever based
mapping.
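For reference, the 1:1 fallback branch in the pseudocode quoted above can be sketched in plain userland C. This is only an illustration under stated assumptions: a cpuset is modelled as a 64-bit mask, and toy_cpuset_t, toy_intr_cpus() and toy_assign_queues() are hypothetical names, not the kernel's cpuset_t and not the proposed bus_get_cpus() KPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpuset: bit n set means CPU n is a member. */
typedef uint64_t toy_cpuset_t;

/*
 * Assumption for illustration: the interrupt-capable CPUs for a device are
 * the system-wide interrupt CPUs that are also local to the device.
 */
toy_cpuset_t
toy_intr_cpus(toy_cpuset_t global_intr, toy_cpuset_t local)
{
	return (global_intr & local);
}

/*
 * 1:1 mapping: pick one queue per CPU in 'set', lowest CPU id first.
 * The chosen CPU ids are stored in 'cpus' (capacity 'max'); the return
 * value is the number of queues that would be created.
 */
int
toy_assign_queues(toy_cpuset_t set, int *cpus, int max)
{
	int n = 0;

	assert(cpus != NULL && max >= 0);
	for (int cpu = 0; cpu < 64 && n < max; cpu++) {
		if (set & ((toy_cpuset_t)1 << cpu))
			cpus[n++] = cpu;
	}
	return (n);
}
```

With a global interrupt mask of 0x55 and a device-local mask of 0x0f, the intersection contains CPUs 0 and 2, so this sketch would create two queues. A real driver would obtain its sets from the bus rather than from hard-coded masks.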
-adrian

From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 17:03:35 2015
From: John Baldwin
To: Adrian Chadd
Cc: "freebsd-arch@freebsd.org"
Subject: Re: RFC: bus_get_cpus(9)
Date: Thu, 19 Feb 2015 12:03:01 -0500
Message-ID: <2650364.MV3AvSBuVe@ralph.baldwin.cx>
References: <1848011.eGOHhpCEMm@ralph.baldwin.cx> <6147240.5Rne9DUXyM@ralph.baldwin.cx>

On Thursday, February 19, 2015 08:28:34 AM Adrian Chadd wrote:
> On 19 February 2015 at 07:37, John Baldwin wrote:
> > There's nothing preventing the RSS code from calling bus_get_cpus()
> > internally to populate the info it returns in its APIs.
> >
> > That is, I imagine something like:
> >
> > #ifdef RSS
> >     queue_info = fetch_rss_info(dev);
> >     for (queue in queue_info) {
> >         create queue for CPU queue->cpu
> >     }
> > #else
> >     /* Use bus_get_cpus directly and do 1:1 */
> > #endif
> >
> > That is, I think RSS should provide a layer on top of new-bus, not be a
> > bus_foo API. At some point all drivers might only have the #ifdef RSS
> > case and not use bus_get_cpus() directly at all, but it doesn't seem
> > like the RSS API is quite there yet.
>
> I wasn't suggesting we have RSS as a newbus method, just that drivers
> don't necessarily need to call the bus method and iterate themselves.
>
> I was suggesting that we do what i've done for rss, but as a generic
> "how should devices create queues and map them to cpusets / interrupt
> locality" and that calls the bus method(s) to discover topology and
> query local-interrupt and local-memory sets to do things
> appropriately.
>
> Then RSS is just a flavour of that API call - network drivers could
> either be RSS aware and call it to get the mapping, or call some
> higher level bus API call to do the "generic" hints or whatever based
> mapping.

Can you provide a sample API (function prototype, etc.)? Aside from RSS
(which will have its own API for other reasons), I don't know of another use
case that is well understood enough to let us build an abstraction on yet (we
all know the perils of abstracting from one use case), so I'm hesitant to go
much further than "these are the best place to do interrupts".
--
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 17:16:06 2015
From: Warner Losh
To: John Baldwin
Cc: "freebsd-arch@freebsd.org" , Adrian Chadd
Subject: Re: RFC: bus_get_cpus(9)
Date: Thu, 19 Feb 2015 10:15:56 -0700
In-Reply-To: <2650364.MV3AvSBuVe@ralph.baldwin.cx>
References: <1848011.eGOHhpCEMm@ralph.baldwin.cx> <6147240.5Rne9DUXyM@ralph.baldwin.cx> <2650364.MV3AvSBuVe@ralph.baldwin.cx>

> On Feb 19, 2015, at 10:03 AM, John Baldwin wrote:
>
> On Thursday, February 19, 2015 08:28:34 AM Adrian Chadd wrote:
>> On 19 February 2015 at 07:37, John Baldwin wrote:
>>> There's nothing preventing the RSS code from calling bus_get_cpus()
>>> internally to populate the info it returns in its APIs.
>>>
>>> That is, I imagine something like:
>>>
>>> #ifdef RSS
>>>     queue_info = fetch_rss_info(dev);
>>>     for (queue in queue_info) {
>>>         create queue for CPU queue->cpu
>>>     }
>>> #else
>>>     /* Use bus_get_cpus directly and do 1:1 */
>>> #endif
>>>
>>> That is, I think RSS should provide a layer on top of new-bus, not be a
>>> bus_foo API. At some point all drivers might only have the #ifdef RSS
>>> case and not use bus_get_cpus() directly at all, but it doesn't seem
>>> like the RSS API is quite there yet.
>>
>> I wasn't suggesting we have RSS as a newbus method, just that drivers
>> don't necessarily need to call the bus method and iterate themselves.
>>
>> I was suggesting that we do what i've done for rss, but as a generic
>> "how should devices create queues and map them to cpusets / interrupt
>> locality" and that calls the bus method(s) to discover topology and
>> query local-interrupt and local-memory sets to do things
>> appropriately.
>>
>> Then RSS is just a flavour of that API call - network drivers could
>> either be RSS aware and call it to get the mapping, or call some
>> higher level bus API call to do the "generic" hints or whatever based
>> mapping.
>
> Can you provide a sample API (function prototype, etc.)? Aside from RSS
> (which will have its own API for other reasons), I don't know of another use
> case that is well understood enough to let us build an abstraction on yet (we
> all know the perils of abstracting from one use case), so I'm hesitant to go
> much further than "these are the best place to do interrupts".

Newer LSI cards could benefit from that, but the rest of the storage stack
may need tweaks to allow for true multi-queue implementations. Interrupts
would be part of that.

Warner

From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 17:16:34 2015
From: Adrian Chadd
To: John Baldwin
Cc: "freebsd-arch@freebsd.org"
Subject: Re: RFC: bus_get_cpus(9)
Date: Thu, 19 Feb 2015 09:16:34 -0800
In-Reply-To: <2650364.MV3AvSBuVe@ralph.baldwin.cx>
References: <1848011.eGOHhpCEMm@ralph.baldwin.cx> <6147240.5Rne9DUXyM@ralph.baldwin.cx> <2650364.MV3AvSBuVe@ralph.baldwin.cx>

[snip]

We talked on IRC about this. The TL;DR is that once more of the
groundwork changes are in place we can look at the higher level stuff
I'm looking to do.
-adrian

From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 17:49:38 2015
From: Slawa Olhovchenkov
To: John Baldwin
Cc: arch@freebsd.org
Subject: Re: RFC: bus_get_cpus(9)
Date: Thu, 19 Feb 2015 20:49:34 +0300
Message-ID: <20150219174934.GB46228@zxy.spb.ru>
In-Reply-To: <1848011.eGOHhpCEMm@ralph.baldwin.cx>
References: <1848011.eGOHhpCEMm@ralph.baldwin.cx>

On Thu, Feb 19, 2015 at 09:46:35AM -0500, John Baldwin wrote:
> One of the next steps for NUMA device-awareness is a way to let drivers know
> which CPUs are ideal to use for interrupts (and in particular this is targeted
> at multiqueue NICs that want to create a TX/RX ring pair per CPU). However,
> for modern Intel systems at least, it is usually best to use CPUs from the
> physical processor package that contains the I/O hub that a device connects to
> (e.g. to allow DDIO to work).
>
> The PoC API I came up with is a new bus method called bus_get_cpus() that
> returns a requested cpuset for a given device. It accepts an enum for the
> second parameter that says the type of cpuset being requested. Currently two
> values are supported:
>
> - LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to the
>   device when NUMA is enabled)
> - INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)
>
> For a NIC driver the expectation is that the driver will call
> 'bus_get_cpus(dev, INTR_CPUS, &set)' and create queues for each of the CPUs in
> 'set'. (In my current patchset I have updated igb(4) to use this approach.)
>
> For systems that do not support NUMA (or if it is not enabled in the kernel
> config), LOCAL_CPUS is mapped to 'all_cpus' by default in the 'root_bus'
> driver. INTR_CPUS is also mapped to 'all_cpus' by default.
>
> The x86 interrupt code maintains its own set of interrupt CPUs which this
> patch now exposes via INTR_CPUS in the x86 nexus driver.
>
> The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
> LOCAL_CPUS set when _PXM exists and NUMA is enabled. They also AND the global
> INTR_CPUS set from the nexus driver with the per-domain set from _PXM to
> generate a local INTR_CPUS set for child devices.
>
> The current patch can be found here:
>
> https://github.com/bsdjhb/freebsd/compare/bsdjhb:master...numa_bus_get_cpus
>
> It includes a few other fixes besides the implementation of bus_get_cpus() (and
> some things have already been committed such as
> taskqueue_start_threads_cpuset() and CPU_COUNT()):
>
> - It fixes the x86 interrupt code to exclude modern SMT threads from the
>   default interrupt set. (Previously only Pentium 4-era HTT threads were
>   excluded.)
> - It has a sample conversion of igb(4) to this interface (albeit ugly using
>   #if's).
>
> Longer term I think I would like to make the INTR_CPUS thing a bit more
> formal. In particular, Solaris allows you to alter the set of CPUs that
> handle interrupts via prctl (or a tool named something close to that). I
> think I would like to have a dedicated global cpuset for that (but not named
> "2", it would be a new WHICH level). That would allow userland to use cpuset
> to alter the set of CPUs that handle interrupts in case you wanted to use SMT
> for example. I think if we do this that all ithreads would have their cpusets
> hang off of this set instead of the root set (which would also remove some of
> the recent special case handling for ithreads I believe). The one uglier part
> about this is that we should probably then have a way to notify drivers that
> INTR_CPUS changed so that they could try to cope gracefully. I think that's a
> bit of a longer horizon thing, but for now I think bus_get_cpus() is a good
> next step.
>
> What do other folks think? (And yes, I know it needs a manpage before it goes
> in, but I'd rather get the API agreed on before polishing that.)

I already do this manually using cpuset. Some setups need one CPU set
dedicated to interrupt handling and another CPU set for the application.
Because the application may not allow modification, we need cpuset-aware
arithmetic, i.e. a
utility that can answer questions like 'which CPU set is not used by the
interrupt handlers of devices ix0 and ix1'.

From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 17:58:05 2015
From: Slawa Olhovchenkov
To: John Baldwin
Cc: "freebsd-arch@freebsd.org" , Adrian Chadd
Subject: Re: RFC: bus_get_cpus(9)
Date: Thu, 19 Feb 2015 20:58:02 +0300
Message-ID: <20150219175802.GC46228@zxy.spb.ru>
In-Reply-To: <6147240.5Rne9DUXyM@ralph.baldwin.cx>
References: <1848011.eGOHhpCEMm@ralph.baldwin.cx> <6147240.5Rne9DUXyM@ralph.baldwin.cx>

On Thu, Feb 19, 2015 at 10:37:28AM -0500, John Baldwin wrote:
> There's nothing preventing the RSS code from calling bus_get_cpus() internally
> to populate the info it returns in its APIs.
>
> That is, I imagine something like:
>
> #ifdef RSS
>     queue_info = fetch_rss_info(dev);
>     for (queue in queue_info) {
>         create queue for CPU queue->cpu
>     }
> #else
>     /* Use bus_get_cpus directly and do 1:1 */
> #endif
>
> That is, I think RSS should provide a layer on top of new-bus, not be a
> bus_foo API. At some point all drivers might only have the #ifdef RSS case
> and not use bus_get_cpus() directly at all, but it doesn't seem like the RSS
> API is quite there yet.

I haven't played with RSS. (The RSS description is very complex for me;
besides, I think the RSS API could be very simple for the listen-socket case:
just notify only those select/kevent/poll callers that are pinned to the CPU
that handled the interrupt.) For RSS it may be necessary to use all cores,
both NUMA-near and NUMA-far. For the RSS-less case it is best to use only the
NUMA-near cores for interrupts and leave the NUMA-far cores for the
application (in my case this separation gives approx. a 100% performance
rise).

From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 20:22:13 2015
From: John Baldwin
To: Warner Losh
Cc: "freebsd-arch@freebsd.org" , Adrian Chadd
Subject: Re: RFC: bus_get_cpus(9)
Date: Thu, 19 Feb 2015 12:33:38 -0500
Message-ID: <6632720.8QN4idWR9d@ralph.baldwin.cx>
References: <1848011.eGOHhpCEMm@ralph.baldwin.cx> <2650364.MV3AvSBuVe@ralph.baldwin.cx>

On Thursday, February 19, 2015 10:15:56 AM Warner Losh wrote:
> > On Feb 19, 2015, at 10:03 AM, John Baldwin wrote:
> >
> > On Thursday, February 19, 2015 08:28:34 AM Adrian Chadd wrote:
> >> On 19 February 2015 at 07:37, John Baldwin wrote:
> >>> There's nothing preventing the RSS code from calling bus_get_cpus()
> >>> internally to populate the info it returns in its APIs.
> >>>
> >>> That is, I imagine something like:
> >>>
> >>> #ifdef RSS
> >>>     queue_info = fetch_rss_info(dev);
> >>>     for (queue in queue_info) {
> >>>         create queue for CPU queue->cpu
> >>>     }
> >>> #else
> >>>     /* Use bus_get_cpus directly and do 1:1 */
> >>> #endif
> >>>
> >>> That is, I think RSS should provide a layer on top of new-bus, not be a
> >>> bus_foo API. At some point all drivers might only have the #ifdef RSS
> >>> case and not use bus_get_cpus() directly at all, but it doesn't seem
> >>> like the RSS API is quite there yet.
> >>
> >> I wasn't suggesting we have RSS as a newbus method, just that drivers
> >> don't necessarily need to call the bus method and iterate themselves.
> >>
> >> I was suggesting that we do what i've done for rss, but as a generic
> >> "how should devices create queues and map them to cpusets / interrupt
> >> locality" and that calls the bus method(s) to discover topology and
> >> query local-interrupt and local-memory sets to do things
> >> appropriately.
> >>
> >> Then RSS is just a flavour of that API call - network drivers could
> >> either be RSS aware and call it to get the mapping, or call some
> >> higher level bus API call to do the "generic" hints or whatever based
> >> mapping.
> >
> > Can you provide a sample API (function prototype, etc.)? Aside from RSS
> > (which will have its own API for other reasons), I don't know of another
> > use case that is well understood enough to let us build an abstraction on
> > yet (we all know the perils of abstracting from one use case), so I'm
> > hesitant to go much further than "these are the best place to do
> > interrupts".
>
> Newer LSI cards could benefit from that, but the rest of the storage stack
> may need tweaks to allow for true multi-queue implementations. Interrupts
> would be part of that.

Right, storage in particular I think we don't really know what model we want
yet, so aren't really ready for an API that tries to be abstract across both
network and storage.
--
John Baldwin

From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 21:32:14 2015
From: "K. Macy"
To: John-Mark Gurney
Cc: Alan Cox , Konstantin Belousov , "freebsd-arch@freebsd.org"
Subject: Re: getting NUMA into the tree (userland most interesting for me)
Date: Thu, 19 Feb 2015 13:32:13 -0800
In-Reply-To: <20150219041012.GJ1953@funkthat.com>
References: <20150219041012.GJ1953@funkthat.com>

On Wed, Feb 18, 2015 at 8:10 PM, John-Mark Gurney wrote:
> I would like to help drive getting NUMA into the tree. Specifically,
> getting userland allocations to be done from a specified domain.
>
> I've looked at the projects/numa tree, but it appears that not much was
> done to get userland mappings to be NUMA aware.
>
> How are we going to do this? Do people have code to do this?
>
> I've looked at how Linux does this, at least from a programming
> interface. They use mmap to create the mapping, and then use the call
> mbind to tell the kernel where to handle the allocations. Is this
> what people are thinking?
>
> I've checked the wiki status, and the userland section is quite
> empty.
>

I personally don't think the infrastructure is far enough along for
this to be close to an interesting value proposition. However, that
said, I do believe that maintaining Linux compatibility is important.
Thus I would be for adding it to the Linux compatibility layer and
exporting it on the FreeBSD API side purely as an SPI until consensus
is reached.
-K From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 21:53:00 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 93DF3CA9; Thu, 19 Feb 2015 21:53:00 +0000 (UTC) Received: from mail-pa0-x22e.google.com (mail-pa0-x22e.google.com [IPv6:2607:f8b0:400e:c03::22e]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 5B63B984; Thu, 19 Feb 2015 21:53:00 +0000 (UTC) Received: by pabkx10 with SMTP id kx10so2708661pab.13; Thu, 19 Feb 2015 13:53:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=VOBohlgkbCpSdImO5xaCN8gCyxGFILXlek7q2elYk44=; b=D6VzpP9AyLfTaJar9JwD9MSrqQIuRNHIrMuuBxoBTJyvIeHt7//0ZI/X1cqdZYeVj+ 2eL7XuePOjYqaTe1SVmbl3bauUh+95GGQHRiZ8DGkoyOSbg7SGbOtMv7AMwu6VbJIvl+ dghjOAGx2biRjmYb6krWvcAVWWEoaIOSrQsbPxbSKVmK2o3dHTOQdF+Ygo09xG7e/gA4 Irhk+6RnBoExlDVUATFjkVoS5rLluTLRj2G41YGDggDZAe/5T83pU+H6DoI1ntH7AiP8 S4NFYkSyzTPbyKqNjCyRBSxP9EN6qRgK+V12np27PnTi+49PFLKU49rhXspG15TMLPvP JKcg== X-Received: by 10.68.65.36 with SMTP id u4mr10911317pbs.91.1424382779973; Thu, 19 Feb 2015 13:52:59 -0800 (PST) MIME-Version: 1.0 Received: by 10.70.89.108 with HTTP; Thu, 19 Feb 2015 13:52:19 -0800 (PST) In-Reply-To: References: <201502170150.t1H1ouxM020621@mail.karels.net> From: Eric Joyner Date: Thu, 19 Feb 2015 13:52:19 -0800 Message-ID: Subject: Re: Adding new media types to if_media.h To: Adrian Chadd Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1 Cc: "freebsd-net@freebsd.org" , George Neville-Neil , mike@karels.net, "freebsd-arch@freebsd.org" X-BeenThere: 
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Feb 2015 21:53:00 -0000 It does look good! We already have at least a half-dozen new media types to add. --- - Eric Joyner On Tue, Feb 17, 2015 at 9:26 AM, Adrian Chadd wrote: > Looks good to me. > > Thanks for doing this! > > > -a > > > On 16 February 2015 at 17:50, Mike Karels wrote: > > On Feb 9, gnn wrote: > > > >> On 8 Feb 2015, at 22:41, Mike Karels wrote: > > > >> > Sorry to reply to a thread after such a long delay, but I think it is > >> > unresolved, and needs more discussion. I'd like to elaborate a bit on > >> > my goals and proposal. I believe Adrian has newer thoughts than have > >> > been > >> > circulated here as well. > >> > > >> > The last message(s) have gone to freebsd-arch and freebsd-net. If > >> > someone > >> > wants to pick one, we could consolidate, but this seems relevant to > >> > both. > >> > > >> > I'm going to top-post to try to summarize and extend the discussion, > >> > but the > >> > preceding emails follow for reference. > >> > > >> > To recap: the existing if_media interface is running out of steam, at > >> > least > >> > in that the "Media variant" field, with 5 bits, is going to be > >> > insufficient > >> > to express existing 40 Gb/s variants. The if_media media type is a > >> > 32-bit > >> > int with a bunch of sub-fields for type (e.g. Ethernet), > >> > subtype/variant > >> > (e.g. 10baseT, 10base5, 1000baseT, etc), flags, and some MII-related > >> > fields. > >> > > >> > I made a proposal to extend the interface in a small way, specifically > >> > to > >> > replace the "media word" with a 64-bit int that is mostly the same, > >> > but > >> > has a new, larger variant/subtype field. 
The main reason for this > >> > proposal > >> > is to maintain the driver KPI (glimpse showed me 240 inclusions of > >> > if_media.h > >> > in the kernel in 8.2). That interface includes an initialization > >> > using a > >> > scalar value of fields ORed with each other. It would also be easy to > >> > preserve a 32-bit user-level API/ABI that can express most of the > >> > current > >> > state, with a subtype/variant field value reserved for "other" (there > >> > is > >> > already one for "unknown", but that is not quite the same). fwiw, I > >> > found 45 references to this user-level API in our tree, including both > >> > base and "ports"-type software, which includes libpcap, snmpd, > >> > dhclient, > >> > quagga, xorp, atm, devd, and rtsold, which argues for a > >> > backward-compatible > >> > API/ABI as well as a more-complete current interface for ifconfig at > >> > least. > >> > > >> > More generally, I see two problems with the existing if_media > >> > interface: > >> > > >> > 1. It doesn't have enough bits for all the fields, in particular, > >> > variant/ > >> > subtype for Ethernet. That is the immediate issue. > >> > > >> > 2. The interface is not sufficiently generic; it was designed around > >> > Ethernet > >> > including MII, token ring, FDDI, and a few other interface types. > >> > Some of > >> > the fields like "instance" are primarily for MII as far as I know, and > >> > are > >> > basically unused. It is definitely not sufficient for 802.11, which > >> > has > >> > rolled its own interfaces. > >> > > >> > To solve the second problem, I think the right approach would be to > >> > reduce > >> > this interface to a truly generic one, such as media type (e.g. > >> > Ethernet), > >> > generic flags, and perhaps generic status. Then there should be a > >> > separate > >> > media-specific interface for each type, such as Ethernet and 802.11. > >> > To a > >> > small extent, we already have that. 
Solving the second, more general > >> > problem, > >> > requires a whole new driver KPI that will require surgery to every > >> > driver, > >> > which is not an exercise that I would consider. > >> > > >> > Using a separate int for each existing field, as proposed, would break > >> > the > >> > driver KPI, but would not really make the interface generic. Trying > >> > to > >> > make a single interface with the union of all network interface > >> > requirements > >> > seems like a bad idea to me (we failed last time; the "we" is BSDi, > >> > where > >> > I was the architect when this interface was first designed). (No, I > >> > didn't > >> > design this interface.) > >> > > >> > Solving the first problem only, I think it is preferable to preserve a > >> > compatible driver KPI, which means using a scalar value encoding what > >> > is > >> > necessary. Although that interface is rather Ethernet-centric, that > >> > is > >> > really what it is used for. > >> > > >> > An additional, selfish goal is to make it easy to back-port drivers > >> > using > >> > the new interface to older versions (which I am quite likely to do). > >> > Preserving the KPI and general user API will be highly useful there. > >> > I'd be likely to do a 11-style version of ifconfig personally, but it > >> > might not be difficult to do in a more general way. > >> > > >> > I am willing to do a prototype for -current for evaluation. > >> > > >> > Comments, alternatives, ? > > > >> I agree with your statements above and I'd like to see the prototype. > > > > Well, I developed the prototype as I had planned, using a 64-bit media > > word, and found that I got about 100 files in GENERIC that didn't > compile; > > they attempted to store "media words" in an int. My kingdom for a > typedef. > > That didn't meet my goal of KPI compatibility, so I went to Plan B. > > > > Plan B is to steal an unused bit (RFU) to indicate an "extended" media > > type. 
I then used the variant/subtype field to store the extended type. > > Effectively, the previously unused bit doubles the effective size of the > > subtype field. Given that the previous 5-bit field lasted us 18 years, > > I figured that doubling it would last a while. I also changed the > > SIOCGIFMEDIA ioctl, splitting it for binary compatibility; extended > > types are all mapped to IFM_OTHER (31) using the old interface, but > > are visible using the new one. > > > > With these changes, I modified one driver (vtnet) to use an extended > type, > > and the rest of GENERIC is happy. The changes to ifconfig are also > fairly > > small. The patch is appended, where email programs will screw it up, > > or at ftp://ftp.karels.net/outgoing/if_media.patch. > > > > The VFAST subtype is a throw-away for testing. > > > > This seems like a reasonably pragmatic change to support the new 40 Gb/s > > media types until someone wants to design an improved but non-backward- > > compatible interface. I think it meets the goal of suitability for > > back-porting; it could be MFCed. > > > > Mike > > > > Index: sys/net/if_media.h > > =================================================================== > > --- sys/net/if_media.h (revision 278804) > > +++ sys/net/if_media.h (working copy) > > @@ -120,15 +120,29 @@ > > * 5-7 Media type > > * 8-15 Type specific options > > * 16-18 Mode (for multi-mode devices) > > - * 19 RFU > > + * 19 "extended" bit for media variant > > * 20-27 Shared (global) options > > * 28-31 Instance > > */ > > > > /* > > + * As we have used all of the original values for the media variant > (subtype) > > + * for Ethernet, extended subtypes have been added, marked with > XSUBTYPE, > > + * which is effectively the "high bit" of the media variant (subtype) > field. > > + * IFM_OTHER (the highest basic type) is reserved to indicate use of an > > + * extended type when using an old SIOCGIFMEDIA operation. This is true > > + * for all media types, not just Ethernet.
> > + */ > > +#define XSUBTYPE 0x80000 /* extended variant high > bit */ > > +#define _X(var) ((var) | XSUBTYPE) /* extended > variant */ > > +#define IFM_OTHER 31 /* Other: some > extended type */ > > +#define OMEDIA(var) (((var) & XSUBTYPE) ? IFM_OTHER : (var)) > > + > > +/* > > * Ethernet > > */ > > #define IFM_ETHER 0x00000020 > > +/* NB: 0,1,2 are auto, manual, none defined below */ > > #define IFM_10_T 3 /* 10BaseT - RJ45 */ > > #define IFM_10_2 4 /* 10Base2 - Thinnet */ > > #define IFM_10_5 5 /* 10Base5 - AUI */ > > @@ -156,11 +170,17 @@ > > #define IFM_40G_CR4 27 /* 40GBase-CR4 */ > > #define IFM_40G_SR4 28 /* 40GBase-SR4 */ > > #define IFM_40G_LR4 29 /* 40GBase-LR4 */ > > +#define IFM_AVAIL30 30 /* available */ > > +/* #define IFM_OTHER 31 Other: some extended type */ > > +/* note 31 is the max! */ > > + > > +/* Extended variants/subtypes */ > > +#define IFM_VFAST _X(0) /* test "V.fast" */ > > +/* note _X(31) is the max! */ > > /* > > * Please update ieee8023ad_lacp.c:lacp_compose_key() > > * after adding new Ethernet media types. > > */ > > -/* note 31 is the max! 
*/ > > > > #define IFM_ETH_MASTER 0x00000100 /* master mode > (1000baseT) */ > > #define IFM_ETH_RXPAUSE 0x00000200 /* receive PAUSE frames > */ > > @@ -170,6 +190,7 @@ > > * Token ring > > */ > > #define IFM_TOKEN 0x00000040 > > +/* NB: 0,1,2 are auto, manual, none defined below */ > > #define IFM_TOK_STP4 3 /* Shielded twisted pair > 4m - DB9 */ > > #define IFM_TOK_STP16 4 /* Shielded twisted pair > 16m - DB9 */ > > #define IFM_TOK_UTP4 5 /* Unshielded twisted > pair 4m - RJ45 */ > > @@ -187,6 +208,7 @@ > > * FDDI > > */ > > #define IFM_FDDI 0x00000060 > > +/* NB: 0,1,2 are auto, manual, none defined below */ > > #define IFM_FDDI_SMF 3 /* Single-mode fiber */ > > #define IFM_FDDI_MMF 4 /* Multi-mode fiber */ > > #define IFM_FDDI_UTP 5 /* CDDI / UTP */ > > @@ -220,6 +242,7 @@ > > #define IFM_IEEE80211_OFDM27 23 /* OFDM 27Mbps */ > > /* NB: not enough bits to express MCS fully */ > > #define IFM_IEEE80211_MCS 24 /* HT MCS rate */ > > +/* #define IFM_OTHER 31 Other: some extended type */ > > > > #define IFM_IEEE80211_ADHOC 0x00000100 /* Operate in > Adhoc mode */ > > #define IFM_IEEE80211_HOSTAP 0x00000200 /* Operate in > Host AP mode */ > > @@ -241,6 +264,7 @@ > > * ATM > > */ > > #define IFM_ATM 0x000000a0 > > +/* NB: 0,1,2 are auto, manual, none defined below */ > > #define IFM_ATM_UNKNOWN 3 > > #define IFM_ATM_UTP_25 4 > > #define IFM_ATM_TAXI_100 5 > > @@ -277,7 +301,7 @@ > > * Masks > > */ > > #define IFM_NMASK 0x000000e0 /* Network type */ > > -#define IFM_TMASK 0x0000001f /* Media sub-type */ > > +#define IFM_TMASK 0x0008001f /* Media sub-type */ > > #define IFM_IMASK 0xf0000000 /* Instance */ > > #define IFM_ISHIFT 28 /* Instance shift */ > > #define IFM_OMASK 0x0000ff00 /* Type specific options > */ > > @@ -372,6 +396,7 @@ > > { IFM_40G_CR4, "40Gbase-CR4" }, \ > > { IFM_40G_SR4, "40Gbase-SR4" }, \ > > { IFM_40G_LR4, "40Gbase-LR4" }, \ > > + { IFM_VFAST, "V.fast" }, \ > > { 0, NULL }, \ > > } > > > > @@ -603,6 +628,7 @@ > > { IFM_AUTO, "autoselect" }, \ > > 
{ IFM_MANUAL, "manual" }, \ > > { IFM_NONE, "none" }, \ > > + { IFM_OTHER, "other" }, \ > > { 0, NULL }, \ > > } > > > > @@ -673,6 +699,7 @@ > > { IFM_ETHER | IFM_40G_CR4, IF_Gbps(40ULL) }, \ > > { IFM_ETHER | IFM_40G_SR4, IF_Gbps(40ULL) }, \ > > { IFM_ETHER | IFM_40G_LR4, IF_Gbps(40ULL) }, \ > > + { IFM_ETHER | IFM_VFAST, IF_Gbps(40ULL) }, \ > > \ > > { IFM_TOKEN | IFM_TOK_STP4, IF_Mbps(4) }, \ > > { IFM_TOKEN | IFM_TOK_STP16, IF_Mbps(16) }, \ > > Index: sys/sys/sockio.h > > =================================================================== > > --- sys/sys/sockio.h (revision 278810) > > +++ sys/sys/sockio.h (working copy) > > @@ -128,5 +128,6 @@ > > #define SIOCGIFGROUP _IOWR('i', 136, struct ifgroupreq) /* > get ifgroups */ > > #define SIOCDIFGROUP _IOW('i', 137, struct ifgroupreq) /* > delete ifgroup */ > > #define SIOCGIFGMEMB _IOWR('i', 138, struct ifgroupreq) /* > get members */ > > +#define SIOCGIFXMEDIA _IOWR('i', 139, struct ifmediareq) /* > get net xmedia */ > > > > #endif /* !_SYS_SOCKIO_H_ */ > > Index: sys/net/if.c > > =================================================================== > > --- sys/net/if.c (revision 278749) > > +++ sys/net/if.c (working copy) > > @@ -2561,6 +2561,7 @@ > > case SIOCGIFPSRCADDR: > > case SIOCGIFPDSTADDR: > > case SIOCGIFMEDIA: > > + case SIOCGIFXMEDIA: > > case SIOCGIFGENERIC: > > if (ifp->if_ioctl == NULL) > > return (EOPNOTSUPP); > > Index: sys/net/if_media.c > > =================================================================== > > --- sys/net/if_media.c (revision 278804) > > +++ sys/net/if_media.c (working copy) > > @@ -67,7 +67,9 @@ > > static struct ifmedia_entry *ifmedia_match(struct ifmedia *ifm, > > int flags, int mask); > > > > +#define IFMEDIA_DEBUG > > #ifdef IFMEDIA_DEBUG > > +#include > > int ifmedia_debug = 0; > > SYSCTL_INT(_debug, OID_AUTO, ifmedia, CTLFLAG_RW, &ifmedia_debug, > > 0, "if_media debugging msgs"); > > @@ -271,6 +273,7 @@ > > * Get list of available media and current media on interface. 
> > */ > > case SIOCGIFMEDIA: > > + case SIOCGIFXMEDIA: > > { > > struct ifmedia_entry *ep; > > int *kptr, count; > > @@ -278,8 +281,13 @@ > > > > kptr = NULL; /* XXX gcc */ > > > > - ifmr->ifm_active = ifmr->ifm_current = ifm->ifm_cur ? > > - ifm->ifm_cur->ifm_media : IFM_NONE; > > + if (cmd == SIOCGIFMEDIA) { > > + ifmr->ifm_active = ifmr->ifm_current = > ifm->ifm_cur ? > > + OMEDIA(ifm->ifm_cur->ifm_media) : IFM_NONE; > > + } else { > > + ifmr->ifm_active = ifmr->ifm_current = > ifm->ifm_cur ? > > + ifm->ifm_cur->ifm_media : IFM_NONE; > > + } > > ifmr->ifm_mask = ifm->ifm_mask; > > ifmr->ifm_status = 0; > > (*ifm->ifm_status)(ifp, ifmr); > > @@ -317,7 +325,10 @@ > > ep = LIST_FIRST(&ifm->ifm_list); > > for (; ep != NULL && count < ifmr->ifm_count; > > ep = LIST_NEXT(ep, ifm_list), count++) > > - kptr[count] = ep->ifm_media; > > + if (cmd == SIOCGIFMEDIA) > > + kptr[count] = > OMEDIA(ep->ifm_media); > > + else > > + kptr[count] = ep->ifm_media; > > > > if (ep != NULL) > > error = E2BIG; /* oops! */ > > @@ -505,7 +516,7 @@ > > printf("\n"); > > return; > > } > > - printf(desc->ifmt_string); > > + printf("%s", desc->ifmt_string); > > > > /* Any mode. */ > > for (desc = ttos->modes; desc && desc->ifmt_string != NULL; > desc++) > > > > Index: sys/dev/virtio/network/if_vtnet.c > > =================================================================== > > --- sys/dev/virtio/network/if_vtnet.c (revision 278749) > > +++ sys/dev/virtio/network/if_vtnet.c (working copy) > > @@ -938,6 +938,7 @@ > > ifmedia_init(&sc->vtnet_media, IFM_IMASK, vtnet_ifmedia_upd, > > vtnet_ifmedia_sts); > > ifmedia_add(&sc->vtnet_media, VTNET_MEDIATYPE, 0, NULL); > > + ifmedia_add(&sc->vtnet_media, IFM_ETHER | IFM_VFAST, 0, NULL); > > ifmedia_set(&sc->vtnet_media, VTNET_MEDIATYPE); > > > > /* Read (or generate) the MAC address for the adapter. 
*/ > > @@ -1103,6 +1104,7 @@ > > > > case SIOCSIFMEDIA: > > case SIOCGIFMEDIA: > > + case SIOCGIFXMEDIA: > > error = ifmedia_ioctl(ifp, ifr, &sc->vtnet_media, cmd); > > break; > > Index: sbin/ifconfig/ifmedia.c > > =================================================================== > > --- sbin/ifconfig/ifmedia.c (revision 278749) > > +++ sbin/ifconfig/ifmedia.c (working copy) > > @@ -109,11 +109,17 @@ > > { > > struct ifmediareq ifmr; > > int *media_list, i; > > + int xmedia = 1; > > > > (void) memset(&ifmr, 0, sizeof(ifmr)); > > (void) strncpy(ifmr.ifm_name, name, sizeof(ifmr.ifm_name)); > > > > - if (ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0) { > > + /* > > + * Check if interface supports extended media types. > > + */ > > + if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)&ifmr) < 0) > > + xmedia = 0; > > + if (xmedia == 0 && ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0) { > > /* > > * Interface doesn't support SIOC{G,S}IFMEDIA. > > */ > > @@ -130,8 +136,13 @@ > > err(1, "malloc"); > > ifmr.ifm_ulist = media_list; > > > > - if (ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0) > > - err(1, "SIOCGIFMEDIA"); > > + if (xmedia) { > > + if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)&ifmr) < 0) > > + err(1, "SIOCGIFXMEDIA"); > > + } else { > > + if (ioctl(s, SIOCGIFMEDIA, (caddr_t)&ifmr) < 0) > > + err(1, "SIOCGIFMEDIA"); > > + } > > > > printf("\tmedia: "); > > print_media_word(ifmr.ifm_current, 1); > > @@ -194,6 +205,7 @@ > > { > > static struct ifmediareq *ifmr = NULL; > > int *mwords; > > + int xmedia = 1; > > > > if (ifmr == NULL) { > > ifmr = (struct ifmediareq *)malloc(sizeof(struct > ifmediareq)); > > @@ -213,7 +225,10 @@ > > * the current media type and the top-level type. 
> > */ > > > > - if (ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0) { > > + if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)ifmr) < 0) { > > + xmedia = 0; > > + } > > + if (xmedia == 0 && ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) > < 0) { > > err(1, "SIOCGIFMEDIA"); > > } > > > > @@ -225,8 +240,13 @@ > > err(1, "malloc"); > > > > ifmr->ifm_ulist = mwords; > > - if (ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0) > > - err(1, "SIOCGIFMEDIA"); > > + if (xmedia) { > > + if (ioctl(s, SIOCGIFXMEDIA, (caddr_t)ifmr) < 0) > > + err(1, "SIOCGIFXMEDIA"); > > + } else { > > + if (ioctl(s, SIOCGIFMEDIA, (caddr_t)ifmr) < 0) > > + err(1, "SIOCGIFMEDIA"); > > + } > > } > > > > return ifmr; > > _______________________________________________ > > freebsd-net@freebsd.org mailing list > > http://lists.freebsd.org/mailman/listinfo/freebsd-net > > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > _______________________________________________ > freebsd-arch@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-arch > To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org" > From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 22:41:26 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E24F0D37; Thu, 19 Feb 2015 22:41:25 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B71B6EE2; Thu, 19 Feb 2015 22:41:25 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-54-116-245.nwrknj.fios.verizon.net [173.54.116.245]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 3193EB915; Thu, 19 Feb 2015 17:41:24 -0500 (EST) From: John Baldwin To: 
freebsd-arch@freebsd.org Subject: Re: getting NUMA into the tree (userland most interesting for me) Date: Thu, 19 Feb 2015 17:40:58 -0500 Message-ID: <83795148.GHHzUeRKp6@ralph.baldwin.cx> User-Agent: KMail/4.14.2 (FreeBSD/10.1-STABLE; KDE/4.14.2; amd64; ; ) In-Reply-To: References: <20150219041012.GJ1953@funkthat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Thu, 19 Feb 2015 17:41:24 -0500 (EST) Cc: Alan Cox , John-Mark Gurney , Konstantin Belousov , "K. Macy" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Feb 2015 22:41:26 -0000 On Thursday, February 19, 2015 01:32:13 PM K. Macy wrote: > On Wed, Feb 18, 2015 at 8:10 PM, John-Mark Gurney wrote: > > I would like to help drive getting NUMA into the tree. Specifically, > > getting userland allocations to be done from a specified domain. > > > > I've looked at the projects/numa tree, but it appears that not much was > > done to get userland mappings to be NUMA aware. > > > > How are we going to do this? Do people have code to do this? > > > > I've looked at how Linux does this, at least from a programming > > interface. They use mmap to create the mapping, and then use the call > > mbind to tell the kernel where to handle the allocations. Is this > > what people are thinking? > > > > I've checked the wiki status, and the userland section is quite > > empty. > > I personally don't think the infrastructure is far enough along that > this is near to be an interesting value proposition. However, that > said, I do believe that maintaining linux compatibility is important.
> Thus I would be for adding it to the linux compatibility layer and
> export it on the FreeBSD API side purely as an SPI until consensus is
> reached.

Yes, I think we have a fair bit to do in the kernel before we are in a
position to export anything truly useful to userland unfortunately. The last
time I talked with Jeff about projects/numa (after the first draft of the wiki
page) I came away with the impression that there might be some things we can
pull out of that branch, but that it isn't suitable for merging upstream
directly. Jeff noted that he and Alan had gone through several iterations of
this already (I believe at least 3 completely different policy designs) all of
which had their own issues.

Outside of the VM I think that we can keep the APIs somewhat stable by having
this opaque policy cookie to pass around that we can redefine the guts of
later. However, various parts of the VM all have to handle whatever the
policy defines, and while the vm_phys bits and contigmalloc() might be kind of
obvious to implement, higher level VM layers like kmem() and malloc() are more
complicated. One thing that is in projects/numa is changes for UMA that we
can hopefully reuse much of, but I don't recall how much (if any) of
kmem/malloc is in there. Also, while vm_phys is one of the first things to
do, I know that Alan and Jeff have pending patches to remove the cache queue
(since it is far less useful than it seems) which simplify vm_phys making it
easier to implement NUMA policies there, so I'm hoping we can get that in
sooner before having to start tearing up the VM too much. This is why the
stuff I currently have is targeted non-VM bits like interrupts as getting that
correct is lower-hanging fruit that might provide some gains regardless. Even
once vm_phys is done I think the first thing to tackle next is contigmalloc to
facilitate static bus_dma allocations (descriptor rings and such) being local
to a device.
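
As an illustration of the "opaque policy cookie" idea mentioned above, here
is a minimal sketch in C. It rests on loud assumptions: the names
numa_policy_t, numa_policy_create(), numa_policy_pick_domain() and
numa_policy_destroy() are hypothetical and are not taken from projects/numa
or from any existing FreeBSD interface, and the preferred-domain /
round-robin logic is only a stand-in for whatever policy is eventually
chosen. What it tries to show is that callers only ever hold a pointer to an
incomplete type, so the guts of the structure can be redefined later without
touching the interface.

```c
#include <stdlib.h>

/* ---- What consumers would see (hypothetical interface) ---- */
struct numa_policy;				/* opaque: layout hidden */
typedef struct numa_policy *numa_policy_t;	/* the "cookie" */

numa_policy_t	numa_policy_create(int preferred_domain);
int		numa_policy_pick_domain(numa_policy_t p, int ndomains);
void		numa_policy_destroy(numa_policy_t p);

/* ---- Private guts; free to change without breaking callers ---- */
struct numa_policy {
	int	preferred;	/* domain to use, or -1 for round-robin */
	int	next;		/* round-robin cursor */
};

/* Allocate a cookie; returns NULL if memory is exhausted. */
numa_policy_t
numa_policy_create(int preferred_domain)
{
	numa_policy_t p;

	p = malloc(sizeof(*p));
	if (p != NULL) {
		p->preferred = preferred_domain;
		p->next = 0;
	}
	return (p);
}

/*
 * Pick a domain in [0, ndomains).  A valid preferred domain always wins;
 * otherwise domains are handed out in round-robin order.  Returns -1 on
 * bad arguments.
 */
int
numa_policy_pick_domain(numa_policy_t p, int ndomains)
{
	int d;

	if (p == NULL || ndomains <= 0)
		return (-1);
	if (p->preferred >= 0 && p->preferred < ndomains)
		return (p->preferred);
	d = p->next % ndomains;
	p->next = (d + 1) % ndomains;
	return (d);
}

/* Release a cookie; passing NULL is harmless. */
void
numa_policy_destroy(numa_policy_t p)
{
	free(p);
}
```

A caller would create a cookie once, pass it to whichever allocator needs a
domain decision, and never look inside it. In a real tree the struct
definition would sit in a private file rather than next to the prototypes.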
-- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Thu Feb 19 22:49:19 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id E51792A1; Thu, 19 Feb 2015 22:49:18 +0000 (UTC) Received: from mail-yh0-x22e.google.com (mail-yh0-x22e.google.com [IPv6:2607:f8b0:4002:c01::22e]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 9DC3CF3D; Thu, 19 Feb 2015 22:49:18 +0000 (UTC) Received: by yhab6 with SMTP id b6so1663820yha.10; Thu, 19 Feb 2015 14:49:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=ROTBN0Rw+gg0WT8gkfIUFAGUAi+Ct97qNKJYvbmcqGk=; b=dB+OC4qdqXrYrnivV6ZLMT81SEXyD+Y2v3uNDMcUA81mvQ7lRWXJqVv9FzidYF9/DN Tf4oN9PqmMklbr7CoVfpF9rAu1/rzp29JJ3CNeNAIkrdZcUARohfuriMQEbzzladDlXm FedLPlysgnvhl68vl+87cy3MD9COslH1vsl4sIeRHFT42Je/u5os8bvbF6WpCkLAL5hL IQxb61c4Er2MxBEHPRljuPQ5Pf5ZwDrpznC6GRfF84/KFKma+MTlukhW726oiosJzvXG ncGt74E+TloFacpUE45OmcnqytB4da1s+fspFEuWODHZCLW6FfFxx0NgWmwrAh3lyY6z Fvcg== MIME-Version: 1.0 X-Received: by 10.170.185.6 with SMTP id b6mr5212836yke.25.1424386157775; Thu, 19 Feb 2015 14:49:17 -0800 (PST) Sender: kmacybsd@gmail.com Received: by 10.170.76.66 with HTTP; Thu, 19 Feb 2015 14:49:17 -0800 (PST) In-Reply-To: <83795148.GHHzUeRKp6@ralph.baldwin.cx> References: <20150219041012.GJ1953@funkthat.com> <83795148.GHHzUeRKp6@ralph.baldwin.cx> Date: Thu, 19 Feb 2015 14:49:17 -0800 X-Google-Sender-Auth: YIcKVfple5swbc2zQ93mb5LIVnw Message-ID: Subject: Re: getting NUMA into the tree (userland most interesting for me) From: "K. 
Macy" To: John Baldwin Content-Type: text/plain; charset=UTF-8 Cc: Alan Cox , John-Mark Gurney , Konstantin Belousov , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 Feb 2015 22:49:19 -0000 >> I personally don't think the infrastructure is far enough along that >> this is near to be an interesting value proposition. However, that >> said, I do believe that maintaining linux compatibility is important. >> Thus I would be for adding it to the linux compatibility layer and >> export it on the FreeBSD API side purely as an SPI until consensus is >> reached. > > Yes, I think we have a fair bit to do in the kernel before we are in a > position to export anything truly useful to userland unfortunately. The last > time I talked with Jeff about projects/numa (after the first draft of the wiki > page) I came away with the impression that there might be some things we can > pull out of that branch, but that it isn't suitable for merging upstream > directly. Jeff noted that he and Alan had gone through several iterations of > this already (I believe at least 3 completely different policy designs) all of > which had their own issues. > > Outside of the VM I think that we can keep the APIs somewhat stable by having > this opaque policy cookie to pass around that we can redefine the guts of > later. However, various parts of the VM all have to handle whatever the > policy defines, and while the vm_phys bits and contigmalloc() might be kind of > obvious to implement, higher level VM layers like kmem() and malloc() are more > complicated. One thing that is in projects/numa is changes for UMA that we > can hopefully reuse much of, but I don't recall how much (if any) of > kmem/malloc is in there. 
Also, while vm_phys is one of the first things to
> do, I know that Alan and Jeff have pending patches to remove the cache queue
> (since it is far less useful than it seems) which simplify vm_phys making it
> easier to implement NUMA policies there, so I'm hoping we can get that in
> sooner before having to start tearing up the VM too much. This is why the
> stuff I currently have is targeted non-VM bits like interrupts as getting that
> correct is lower-hanging fruit that might provide some gains regardless. Even
> once vm_phys is done I think the first thing to tackle next is contigmalloc to
> facilitate static bus_dma allocations (descriptor rings and such) being local
> to a device.
>

Contigmalloc improvements and cache queue removal are in the
phabricator queue now. They are also prerequisites for per-cpu free
page caches which are a huge scalability improvement for some
workloads such as Netflix's.

There is still a fair amount of scalability work (including Jeffr's
per-domain pagedaemon work) that really needs to happen before we can
seriously think about a general user-level NUMA interface.
-K From owner-freebsd-arch@FreeBSD.ORG Fri Feb 20 01:15:33 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 567C7FE; Fri, 20 Feb 2015 01:15:33 +0000 (UTC) Received: from mail-ig0-x235.google.com (mail-ig0-x235.google.com [IPv6:2607:f8b0:4001:c05::235]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 1813425C; Fri, 20 Feb 2015 01:15:33 +0000 (UTC) Received: by mail-ig0-f181.google.com with SMTP id hn18so13732603igb.2; Thu, 19 Feb 2015 17:15:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=yMLAZl2kDryfKRN6KpE7BfFizPiHeiwhpux88rUbNEk=; b=E4nAWbaQ0gL2P9eYy2X+pMAujvIr7UHNUKujxucKHHEXmJY8vIVfY0kPRYdaKG1zHv kmhymnYucZwvj+CnO4LGp2W8FcBkwy7+1zpMzyjayrD6rfz4fLf+aRa1SeZLQHOwAv/N /w+a7OFnAJ8AuybpglH25+QacQuQ3DE1pLkE2Pdk0Y1MNroRAbBIBPI0AFAhsRLfxHSl nAxYDxxhKKIhqqaCgNYfXuGjLVOOgGg3rIHqb/ElhPOobTfqNA3ugB6CQomk2VNrpFou kZKgu2vgPWAW0y/K3waJT9s/H9rBqiHzQby4SkjMVax5EdlndT7iGzlvRM2qZcEd9z4G X2Eg== MIME-Version: 1.0 X-Received: by 10.42.130.74 with SMTP id u10mr9140784ics.61.1424394932402; Thu, 19 Feb 2015 17:15:32 -0800 (PST) Sender: adrian.chadd@gmail.com Received: by 10.36.17.66 with HTTP; Thu, 19 Feb 2015 17:15:32 -0800 (PST) In-Reply-To: References: <20150219041012.GJ1953@funkthat.com> <83795148.GHHzUeRKp6@ralph.baldwin.cx> Date: Thu, 19 Feb 2015 17:15:32 -0800 X-Google-Sender-Auth: X44NlrdTd1uYsMludV7Gcze1zwU Message-ID: Subject: Re: getting NUMA into the tree (userland most interesting for me) From: Adrian Chadd To: "K. 
Macy" Content-Type: text/plain; charset=UTF-8 Cc: John Baldwin , Alan Cox , John-Mark Gurney , Konstantin Belousov , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Feb 2015 01:15:33 -0000 On 19 February 2015 at 14:49, K. Macy wrote: >>> I personally don't think the infrastructure is far enough along that >>> this is near to be an interesting value proposition. However, that >>> said, I do believe that maintaining linux compatibility is important. >>> Thus I would be for adding it to the linux compatibility layer and >>> export it on the FreeBSD API side purely as an SPI until consensus is >>> reached. >> >> Yes, I think we have a fair bit to do in the kernel before we are in a >> position to export anything truly useful to userland unfortunately. The last >> time I talked with Jeff about projects/numa (after the first draft of the wiki >> page) I came away with the impression that there might be some things we can >> pull out of that branch, but that it isn't suitable for merging upstream >> directly. Jeff noted that he and Alan had gone through several iterations of >> this already (I believe at least 3 completely different policy designs) all of >> which had their own issues. >> >> Outside of the VM I think that we can keep the APIs somewhat stable by having >> this opaque policy cookie to pass around that we can redefine the guts of >> later. However, various parts of the VM all have to handle whatever the >> policy defines, and while the vm_phys bits and contigmalloc() might be kind of >> obvious to implement, higher level VM layers like kmem() and malloc() are more >> complicated. One thing that is in projects/numa is changes for UMA that we >> can hopefully reuse much of, but I don't recall how much (if any) of >> kmem/malloc is in there. 
Also, while vm_phys is one of the first things to
>> do, I know that Alan and Jeff have pending patches to remove the cache queue
>> (since it is far less useful than it seems) which simplify vm_phys making it
>> easier to implement NUMA policies there, so I'm hoping we can get that in
>> sooner before having to start tearing up the VM too much. This is why the
>> stuff I currently have is targeted non-VM bits like interrupts as getting that
>> correct is lower-hanging fruit that might provide some gains regardless. Even
>> once vm_phys is done I think the first thing to tackle next is contigmalloc to
>> facilitate static bus_dma allocations (descriptor rings and such) being local
>> to a device.
>>
>
> Contigmalloc improvements and cache queue removal are in the
> phabricator queue now. They are also prerequisites for per-cpu free
> page caches which are a huge scalability improvement for some
> workloads such as Netflix's.
>
> There is still a fair amount of scalability work (including Jeffr's
> per-domain pagedaemon work) that really needs to happen before we can
> seriously think about a general user-level NUMA interface.

Is there anything wrong with maybe bringing over the basic low level
allocator changes from projects/numa so the basics are there?
-adrian From owner-freebsd-arch@FreeBSD.ORG Fri Feb 20 08:17:11 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2A9A0F18; Fri, 20 Feb 2015 08:17:11 +0000 (UTC) Received: from mail-yk0-x22d.google.com (mail-yk0-x22d.google.com [IPv6:2607:f8b0:4002:c07::22d]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id D54FE3B1; Fri, 20 Feb 2015 08:17:10 +0000 (UTC) Received: by mail-yk0-f173.google.com with SMTP id 19so6392025ykq.4; Fri, 20 Feb 2015 00:17:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=cGUScDvj2ODwc+hnYtMMXlRG5Jr8BdFm+jdvmOEcQII=; b=NEdS7+pCkhqcu8JJCzkDJELW8PoFN6bnONxMBSdw1s49mGuV5OZ3munbda3stEYzPr JAZwGZFvQRIpR7BvJ6jkKuYsmXWDYOT9XlpxDuG/eYuTtMgHw89nV7fj/kpw3zjVgD8z hMoV2acbIb1JCe/krSoRP1ccOeGrPl/cEwryb2zaL8NnJJkGi10Ia7dNUpNZ8gvmZJDt LY7fQnInls54spynjcrYuydCZSeMoP1WEbEijLdPXzuC2VTFdj6IiycEyqvPtH207lXa 4N0L4lliPL84MWxCAvGRTNGpwFb3XDjelYGccPeYEU2mqNRTvcqOEgdTariGb/bBjSHJ ibGQ== MIME-Version: 1.0 X-Received: by 10.236.230.103 with SMTP id i97mr5593343yhq.47.1424420229973; Fri, 20 Feb 2015 00:17:09 -0800 (PST) Sender: kmacybsd@gmail.com Received: by 10.170.76.66 with HTTP; Fri, 20 Feb 2015 00:17:09 -0800 (PST) In-Reply-To: References: <20150219041012.GJ1953@funkthat.com> <83795148.GHHzUeRKp6@ralph.baldwin.cx> Date: Fri, 20 Feb 2015 00:17:09 -0800 X-Google-Sender-Auth: L5BvS7vN1YYpXvkL9p17rRl2BIs Message-ID: Subject: Re: getting NUMA into the tree (userland most interesting for me) From: "K. 
Macy" To: Adrian Chadd Content-Type: text/plain; charset=UTF-8 Cc: John Baldwin , Alan Cox , John-Mark Gurney , Konstantin Belousov , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Feb 2015 08:17:11 -0000 >>> Yes, I think we have a fair bit to do in the kernel before we are in a >>> position to export anything truly useful to userland unfortunately. The last >>> time I talked with Jeff about projects/numa (after the first draft of the wiki >>> page) I came away with the impression that there might be some things we can >>> pull out of that branch, but that it isn't suitable for merging upstream >>> directly. Jeff noted that he and Alan had gone through several iterations of >>> this already (I believe at least 3 completely different policy designs) all of >>> which had their own issues. >>> >>> Outside of the VM I think that we can keep the APIs somewhat stable by having >>> this opaque policy cookie to pass around that we can redefine the guts of >>> later. However, various parts of the VM all have to handle whatever the >>> policy defines, and while the vm_phys bits and contigmalloc() might be kind of >>> obvious to implement, higher level VM layers like kmem() and malloc() are more >>> complicated. One thing that is in projects/numa is changes for UMA that we >>> can hopefully reuse much of, but I don't recall how much (if any) of >>> kmem/malloc is in there. Also, while vm_phys is one of the first things to >>> do, I know that Alan and Jeff have pending patches to remove the cache queue >>> (since it is far less useful than it seems) which simplify vm_phys making it >>> easier to implement NUMA policies there, so I'm hoping we can get that in >>> sooner before having to start tearing up the VM too much. 
This is why the >>> stuff I currently have is targeted non-VM bits like interrupts as getting that >>> correct is lower-hanging fruit that might provide some gains regardless. Even >>> once vm_phys is done I think the first thing to tackle next is contigmalloc to >>> facilitate static bus_dma allocations (descriptor rings and such) being local >>> to a device. >>> >> >> Contigmalloc improvements and cache queue removal are in the >> phabricator queue now. They are also prerequisites for per-cpu free >> page caches which are a huge scalability improvement for some >> workloads such as Netflix's. >> >> There is still a fair amount of scalability work (including Jeffr's >> per-domain pagedaemon work) that really needs to happens before we can >> seriously think about a general user-level NUMA interface. > > Is there anything wrong with maybe bringing over the basic low level > allocator changes from projects/numa so the basics are there? I think they're probably predicated on the work that is being shepherded in now. Even if not, it would require someone to shepherd it in and the corresponding spare cycles from alc to review / revise / repeat - which seem to be in short supply. 
-K From owner-freebsd-arch@FreeBSD.ORG Fri Feb 20 20:16:25 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id B29A51A5; Fri, 20 Feb 2015 20:16:25 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 8747F198; Fri, 20 Feb 2015 20:16:25 +0000 (UTC) Received: from ralph.baldwin.cx (pool-173-54-116-245.nwrknj.fios.verizon.net [173.54.116.245]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 14720B915; Fri, 20 Feb 2015 15:16:24 -0500 (EST) From: John Baldwin To: "K. Macy" Subject: Re: getting NUMA into the tree (userland most interesting for me) Date: Fri, 20 Feb 2015 15:14:28 -0500 Message-ID: <2069208.rjIe3PXOHb@ralph.baldwin.cx> User-Agent: KMail/4.14.2 (FreeBSD/10.1-STABLE; KDE/4.14.2; amd64; ; ) In-Reply-To: References: <20150219041012.GJ1953@funkthat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Fri, 20 Feb 2015 15:16:24 -0500 (EST) Cc: Alan Cox , Adrian Chadd , Konstantin Belousov , John-Mark Gurney , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Feb 2015 20:16:25 -0000 On Friday, February 20, 2015 12:17:09 AM K. Macy wrote: > >>> Yes, I think we have a fair bit to do in the kernel before we are in a > >>> position to export anything truly useful to userland unfortunately. 
The > >>> last time I talked with Jeff about projects/numa (after the first draft > >>> of the wiki page) I came away with the impression that there might be > >>> some things we can pull out of that branch, but that it isn't suitable > >>> for merging upstream directly. Jeff noted that he and Alan had gone > >>> through several iterations of this already (I believe at least 3 > >>> completely different policy designs) all of which had their own issues. > >>> > >>> Outside of the VM I think that we can keep the APIs somewhat stable by > >>> having this opaque policy cookie to pass around that we can redefine > >>> the guts of later. However, various parts of the VM all have to handle > >>> whatever the policy defines, and while the vm_phys bits and > >>> contigmalloc() might be kind of obvious to implement, higher level VM > >>> layers like kmem() and malloc() are more complicated. One thing that > >>> is in projects/numa is changes for UMA that we can hopefully reuse much > >>> of, but I don't recall how much (if any) of kmem/malloc is in there. > >>> Also, while vm_phys is one of the first things to do, I know that Alan > >>> and Jeff have pending patches to remove the cache queue (since it is > >>> far less useful than it seems) which simplify vm_phys making it easier > >>> to implement NUMA policies there, so I'm hoping we can get that in > >>> sooner before having to start tearing up the VM too much. This is why > >>> the stuff I currently have is targeted non-VM bits like interrupts as > >>> getting that correct is lower-hanging fruit that might provide some > >>> gains regardless. Even once vm_phys is done I think the first thing to > >>> tackle next is contigmalloc to facilitate static bus_dma allocations > >>> (descriptor rings and such) being local to a device. > >> > >> Contigmalloc improvements and cache queue removal are in the > >> phabricator queue now. 
They are also prerequisites for per-cpu free > >> page caches which are a huge scalability improvement for some > >> workloads such as Netflix's. > >> > >> There is still a fair amount of scalability work (including Jeffr's > >> per-domain pagedaemon work) that really needs to happens before we can > >> seriously think about a general user-level NUMA interface. > > > > Is there anything wrong with maybe bringing over the basic low level > > allocator changes from projects/numa so the basics are there? > > I think they're probably predicated on the work that is being > shepherded in now. Even if not, it would require someone to shepherd > it in and the corresponding spare cycles from alc to review / revise / > repeat - which seem to be in short supply. Can you add entries for these to the wiki page with links to the phab reviews? I know there is an entry for the page cache queue removal already, but you could add one for contigmalloc right next to it. -- John Baldwin From owner-freebsd-arch@FreeBSD.ORG Fri Feb 20 22:37:21 2015 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id AD9C5E27; Fri, 20 Feb 2015 22:37:21 +0000 (UTC) Received: from pp2.rice.edu (proofpoint2.mail.rice.edu [128.42.201.101]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 6E30D66A; Fri, 20 Feb 2015 22:37:20 +0000 (UTC) Received: from pps.filterd (pp2.rice.edu [127.0.0.1]) by pp2.rice.edu (8.14.5/8.14.5) with SMTP id t1KMbB4V029402; Fri, 20 Feb 2015 16:37:13 -0600 Received: from mh1.mail.rice.edu (mh1.mail.rice.edu [128.42.201.20]) by pp2.rice.edu with ESMTP id 1snuunrjyd-1; Fri, 20 Feb 2015 16:37:13 -0600 X-Virus-Scanned: by amavis-2.7.0 at mh1.mail.rice.edu, auth channel Received: 
from [10.87.76.252] (unknown [10.87.76.252]) (using TLSv1 with cipher RC4-MD5 (128/128 bits)) (No client certificate requested) (Authenticated sender: alc) by mh1.mail.rice.edu (Postfix) with ESMTPSA id 3BACC460129; Fri, 20 Feb 2015 16:37:13 -0600 (CST) Content-Type: text/plain; charset=windows-1252 Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: getting NUMA into the tree (userland most interesting for me) From: Alan Cox In-Reply-To: <2069208.rjIe3PXOHb@ralph.baldwin.cx> Date: Fri, 20 Feb 2015 16:37:12 -0600 Content-Transfer-Encoding: quoted-printable Message-Id: <968C1AD7-D806-4E69-87E4-AB88A4C5AA70@rice.edu> References: <20150219041012.GJ1953@funkthat.com> <2069208.rjIe3PXOHb@ralph.baldwin.cx> To: John Baldwin X-Mailer: Apple Mail (2.1878.6) X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 kscore.is_bulkscore=1.11015641124368e-11 kscore.compositescore=0 circleOfTrustscore=0 compositescore=0.248919945447816 suspectscore=2 recipient_domain_to_sender_totalscore=0 phishscore=0 bulkscore=0 kscore.is_spamscore=0 rbsscore=0.248919945447816 recipient_to_sender_totalscore=0 recipient_domain_to_sender_domain_totalscore=0 spamscore=0 recipient_to_sender_domain_totalscore=0 urlsuspectscore=0.248919945447816 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=7.0.1-1402240000 definitions=main-1502200211 Cc: Adrian Chadd , "K. Macy" , Alan Cox , John-Mark Gurney , Konstantin Belousov , "freebsd-arch@freebsd.org" X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 Feb 2015 22:37:21 -0000 On Feb 20, 2015, at 2:14 PM, John Baldwin wrote: > On Friday, February 20, 2015 12:17:09 AM K. Macy wrote: >>>>> Yes, I think we have a fair bit to do in the kernel before we are in a >>>>> position to export anything truly useful to userland unfortunately.
The >>>>> last time I talked with Jeff about projects/numa (after the first draft >>>>> of the wiki page) I came away with the impression that there might be >>>>> some things we can pull out of that branch, but that it isn't suitable >>>>> for merging upstream directly. Jeff noted that he and Alan had gone >>>>> through several iterations of this already (I believe at least 3 >>>>> completely different policy designs) all of which had their own issues. >>>>> >>>>> Outside of the VM I think that we can keep the APIs somewhat stable by >>>>> having this opaque policy cookie to pass around that we can redefine >>>>> the guts of later. However, various parts of the VM all have to handle >>>>> whatever the policy defines, and while the vm_phys bits and >>>>> contigmalloc() might be kind of obvious to implement, higher level VM >>>>> layers like kmem() and malloc() are more complicated. One thing that >>>>> is in projects/numa is changes for UMA that we can hopefully reuse much >>>>> of, but I don't recall how much (if any) of kmem/malloc is in there. >>>>> Also, while vm_phys is one of the first things to do, I know that Alan >>>>> and Jeff have pending patches to remove the cache queue (since it is >>>>> far less useful than it seems) which simplify vm_phys making it easier >>>>> to implement NUMA policies there, so I'm hoping we can get that in >>>>> sooner before having to start tearing up the VM too much. This is why >>>>> the stuff I currently have is targeted non-VM bits like interrupts as >>>>> getting that correct is lower-hanging fruit that might provide some >>>>> gains regardless. Even once vm_phys is done I think the first thing to >>>>> tackle next is contigmalloc to facilitate static bus_dma allocations >>>>> (descriptor rings and such) being local to a device. >>>> >>>> Contigmalloc improvements and cache queue removal are in the >>>> phabricator queue now.
They are also prerequisites for per-cpu free >>>> page caches which are a huge scalability improvement for some >>>> workloads such as Netflix's. >>>> >>>> There is still a fair amount of scalability work (including Jeffr's >>>> per-domain pagedaemon work) that really needs to happens before we can >>>> seriously think about a general user-level NUMA interface. >>> >>> Is there anything wrong with maybe bringing over the basic low level >>> allocator changes from projects/numa so the basics are there? >> >> I think they're probably predicated on the work that is being >> shepherded in now. Even if not, it would require someone to shepherd >> it in and the corresponding spare cycles from alc to review / revise / >> repeat - which seem to be in short supply. > > Can you add entries for these to the wiki page with links to the phab reviews? > I know there is an entry for the page cache queue removal already, but you > could add one for contigmalloc right next to it. > Essentially, the “Remove the ‘cache’ page queue” task has a number of significant subtasks that aren’t listed, and the contigmalloc() rewrite is the biggest of them. Specifically, the current contigmalloc(M_WAITOK) implementation exploits the existence of the ‘cache’ page queue, and to eliminate that dependence requires the M_WAITOK case to work very differently.