From: Luigi Rizzo
Date: Tue, 18 Feb 2014 05:01:05 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-10@freebsd.org
Subject: svn commit: r262151 - in stable/10: share/man/man4 sys/conf sys/dev/e1000 sys/dev/ixgbe sys/dev/netmap sys/dev/re sys/modules/netmap sys/net tools/tools/netmap

Author: luigi
Date: Tue Feb 18 05:01:04 2014
New Revision: 262151
URL: http://svnweb.freebsd.org/changeset/base/262151

Log:
MFH: sync the netmap code with the one in HEAD (enhanced VALE switch, netmap
pipes, emulated netmap mode). See details in the log for svn 261909. Deleted: stable/10/tools/tools/netmap/click-test.cfg stable/10/tools/tools/netmap/nm_util.c stable/10/tools/tools/netmap/nm_util.h stable/10/tools/tools/netmap/pcap.c Modified: stable/10/share/man/man4/netmap.4 stable/10/sys/conf/files stable/10/sys/dev/e1000/if_em.c stable/10/sys/dev/e1000/if_igb.c stable/10/sys/dev/e1000/if_lem.c stable/10/sys/dev/ixgbe/ixgbe.c stable/10/sys/dev/netmap/if_em_netmap.h stable/10/sys/dev/netmap/if_igb_netmap.h stable/10/sys/dev/netmap/if_lem_netmap.h stable/10/sys/dev/netmap/if_re_netmap.h stable/10/sys/dev/netmap/ixgbe_netmap.h stable/10/sys/dev/netmap/netmap.c stable/10/sys/dev/netmap/netmap_kern.h stable/10/sys/dev/netmap/netmap_mem2.c stable/10/sys/dev/re/if_re.c stable/10/sys/modules/netmap/Makefile stable/10/sys/net/netmap.h stable/10/sys/net/netmap_user.h stable/10/tools/tools/netmap/Makefile stable/10/tools/tools/netmap/README stable/10/tools/tools/netmap/bridge.c stable/10/tools/tools/netmap/pkt-gen.c stable/10/tools/tools/netmap/vale-ctl.c Modified: stable/10/share/man/man4/netmap.4 ============================================================================== --- stable/10/share/man/man4/netmap.4 Tue Feb 18 04:38:26 2014 (r262150) +++ stable/10/share/man/man4/netmap.4 Tue Feb 18 05:01:04 2014 (r262151) @@ -1,4 +1,4 @@ -.\" Copyright (c) 2011 Matteo Landi, Luigi Rizzo, Universita` di Pisa +.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa .\" All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without @@ -21,230 +21,636 @@ .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. -.\" +.\" .\" This document is derived in part from the enet man page (enet.4) .\" distributed with 4.3BSD Unix. 
.\" .\" $FreeBSD$ -.\" $Id: netmap.4 11563 2012-08-02 08:59:12Z luigi $: stable/8/share/man/man4/bpf.4 181694 2008-08-13 17:45:06Z ed $ .\" -.Dd September 23, 2013 +.Dd February 13, 2014 .Dt NETMAP 4 .Os .Sh NAME .Nm netmap .Nd a framework for fast packet I/O +.br +.Nm VALE +.Nd a fast VirtuAl Local Ethernet using the netmap API +.br +.Nm netmap pipes +.Nd a shared memory packet transport channel .Sh SYNOPSIS .Cd device netmap .Sh DESCRIPTION .Nm -is a framework for fast and safe access to network devices -(reaching 14.88 Mpps at less than 1 GHz). -.Nm -uses memory mapped buffers and metadata -(buffer indexes and lengths) to communicate with the kernel, -which is in charge of validating information through -.Pa ioctl() +is a framework for extremely fast and efficient packet I/O +for both userspace and kernel clients. +It runs on FreeBSD and Linux, +and includes +.Nm VALE , +a very fast and modular in-kernel software switch/dataplane, +and +.Nm netmap pipes , +a shared memory packet transport channel. +All these are accessed interchangeably with the same API. +.Pp +.Nm , VALE and -.Pa select()/poll(). +.Nm netmap pipes +are at least one order of magnitude faster than +standard OS mechanisms +(sockets, bpf, tun/tap interfaces, native switches, pipes), +reaching 14.88 million packets per second (Mpps) +with much less than one core on a 10 Gbit NIC, +about 20 Mpps per core for VALE ports, +and over 100 Mpps for netmap pipes. +.Pp +Userspace clients can dynamically switch NICs into .Nm -can exploit the parallelism in multiqueue devices and -multicore systems. +mode and send and receive raw packets through +memory mapped buffers. +Similarly, +.Nm VALE +switch instances and ports, and +.Nm netmap pipes +can be created dynamically, +providing high speed packet I/O between processes, +virtual machines, NICs and the host stack. 
.Pp .Nm +suports both non-blocking I/O through +.Xr ioctls() , +synchronization and blocking I/O through a file descriptor +and standard OS mechanisms such as +.Xr select 2 , +.Xr poll 2 , +.Xr epoll 2 , +.Xr kqueue 2 . +.Nm VALE +and +.Nm netmap pipes +are implemented by a single kernel module, which also emulates the +.Nm +API over standard drivers for devices without native +.Nm +support. +For best performance, +.Nm requires explicit support in device drivers. -For a list of supported devices, see the end of this manual page. -.Sh OPERATION +.Pp +In the rest of this (long) manual page we document +various aspects of the .Nm -clients must first open the -.Pa open("/dev/netmap") , -and then issue an -.Pa ioctl(...,NIOCREGIF,...) -to bind the file descriptor to a network device. -.Pp -When a device is put in -.Nm -mode, its data path is disconnected from the host stack. -The processes owning the file descriptor -can exchange packets with the device, or with the host stack, -through an mmapped memory region that contains pre-allocated -buffers and metadata. +and +.Nm VALE +architecture, features and usage. +.Pp +.Sh ARCHITECTURE +.Nm +supports raw packet I/O through a +.Em port , +which can be connected to a physical interface +.Em ( NIC ) , +to the host stack, +or to a +.Nm VALE +switch). +Ports use preallocated circular queues of buffers +.Em ( rings ) +residing in an mmapped region. +There is one ring for each transmit/receive queue of a +NIC or virtual port. +An additional ring pair connects to the host stack. +.Pp +After binding a file descriptor to a port, a +.Nm +client can send or receive packets in batches through +the rings, and possibly implement zero-copy forwarding +between ports. +.Pp +All NICs operating in +.Nm +mode use the same memory region, +accessible to all processes who own +.Nm /dev/netmap +file descriptors bound to NICs. 
+Independent +.Nm VALE +and +.Nm netmap pipe +ports +by default use separate memory regions, +but can be independently configured to share memory. +.Pp +.Sh ENTERING AND EXITING NETMAP MODE +The following section describes the system calls to create +and control +.Nm netmap +ports (including +.Nm VALE +and +.Nm netmap pipe +ports). +Simpler, higher level functions are described in section +.Xr LIBRARIES . +.Pp +Ports and rings are created and controlled through a file descriptor, +created by opening a special device +.Dl fd = open("/dev/netmap"); +and then bound to a specific port with an +.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg); +.Pp +.Nm +has multiple modes of operation controlled by the +.Vt struct nmreq +argument. +.Va arg.nr_name +specifies the port name, as follows: +.Bl -tag -width XXXX +.It Dv OS network interface name (e.g. 'em0', 'eth1', ... ) +the data path of the NIC is disconnected from the host stack, +and the file descriptor is bound to the NIC (one or all queues), +or to the host stack; +.It Dv valeXXX:YYY (arbitrary XXX and YYY) +the file descriptor is bound to port YYY of a VALE switch called XXX, +both dynamically created if necessary. +The string cannot exceed IFNAMSIZ characters, and YYY cannot +be the name of any existing OS network interface. +.El +.Pp +On return, +.Va arg +indicates the size of the shared memory region, +and the number, size and location of all the +.Nm +data structures, which can be accessed by mmapping the memory +.Dl char *mem = mmap(0, arg.nr_memsize, fd); .Pp Non blocking I/O is done with special -.Pa ioctl()'s , -whereas the file descriptor can be passed to -.Pa select()/poll() -to be notified about incoming packet or available transmit buffers. -.Ss Data structures -All data structures for all devices in -.Nm -mode are in a memory -region shared by the kernel and all processes -who open -.Pa /dev/netmap -(NOTE: visibility may be restricted in future implementations). 
-All references between the shared data structure -are relative (offsets or indexes). Some macros help converting -them into actual pointers. +.Xr ioctl 2 +.Xr select 2 +and +.Xr poll 2 +on the file descriptor permit blocking I/O. +.Xr epoll 2 +and +.Xr kqueue 2 +are not supported on +.Nm +file descriptors. .Pp -The data structures in shared memory are the following: +While a NIC is in +.Nm +mode, the OS will still believe the interface is up and running. +OS-generated packets for that NIC end up into a +.Nm +ring, and another ring is used to send packets into the OS network stack. +A +.Xr close 2 +on the file descriptor removes the binding, +and returns the NIC to normal mode (reconnecting the data path +to the host stack), or destroys the virtual port. +.Pp +.Sh DATA STRUCTURES +The data structures in the mmapped memory region are detailed in +.Xr sys/net/netmap.h , +which is the ultimate reference for the +.Nm +API. The main structures and fields are indicated below: .Bl -tag -width XXX .It Dv struct netmap_if (one per interface) -indicates the number of rings supported by an interface, their -sizes, and the offsets of the -.Pa netmap_rings -associated to the interface. -The offset of a -.Pa struct netmap_if -in the shared memory region is indicated by the -.Pa nr_offset -field in the structure returned by the -.Pa NIOCREGIF -(see below). .Bd -literal struct netmap_if { - char ni_name[IFNAMSIZ]; /* name of the interface. */ - const u_int ni_num_queues; /* number of hw ring pairs */ - const ssize_t ring_ofs[]; /* offset of tx and rx rings */ + ... + const uint32_t ni_flags; /* properties */ + ... + const uint32_t ni_tx_rings; /* NIC tx rings */ + const uint32_t ni_rx_rings; /* NIC rx rings */ + uint32_t ni_bufs_head; /* head of extra bufs list */ + ... }; .Ed +.Pp +Indicates the number of available rings +.Pa ( struct netmap_rings ) +and their position in the mmapped region. 
+The number of tx and rx rings +.Pa ( ni_tx_rings , ni_rx_rings ) +normally depends on the hardware. +NICs also have an extra tx/rx ring pair connected to the host stack. +.Em NIOCREGIF +can also request additional unbound buffers in the same memory space, +to be used as temporary storage for packets. +.Pa ni_bufs_head +contains the index of the first of these free rings, +which are connected in a list (the first uint32_t of each +buffer being the index of the next buffer in the list). +A 0 indicates the end of the list. +.Pp .It Dv struct netmap_ring (one per ring) -contains the index of the current read or write slot (cur), -the number of slots available for reception or transmission (avail), -and an array of -.Pa slots -describing the buffers. -There is one ring pair for each of the N hardware ring pairs -supported by the card (numbered 0..N-1), plus -one ring pair (numbered N) for packets from/to the host stack. .Bd -literal struct netmap_ring { - const ssize_t buf_ofs; - const uint32_t num_slots; /* number of slots in the ring. */ - uint32_t avail; /* number of usable slots */ - uint32_t cur; /* 'current' index for the user side */ - uint32_t reserved; /* not refilled before current */ - - const uint16_t nr_buf_size; - uint16_t flags; - struct netmap_slot slot[0]; /* array of slots. */ + ... + const uint32_t num_slots; /* slots in each ring */ + const uint32_t nr_buf_size; /* size of each buffer */ + ... + uint32_t head; /* (u) first buf owned by user */ + uint32_t cur; /* (u) wakeup position */ + const uint32_t tail; /* (k) first buf owned by kernel */ + ... + uint32_t flags; + struct timeval ts; /* (k) time of last rxsync() */ + ... + struct netmap_slot slot[0]; /* array of slots */ } .Ed -.It Dv struct netmap_slot (one per packet) -contains the metadata for a packet: a buffer index (buf_idx), -a buffer length (len), and some flags. 
+.Pp +Implements transmit and receive rings, with read/write +pointers, metadata and and an array of +.Pa slots +describing the buffers. +.Pp +.It Dv struct netmap_slot (one per buffer) .Bd -literal struct netmap_slot { - uint32_t buf_idx; /* buffer index */ - uint16_t len; /* packet length */ - uint16_t flags; /* buf changed, etc. */ -#define NS_BUF_CHANGED 0x0001 /* must resync, buffer changed */ -#define NS_REPORT 0x0002 /* tell hw to report results - * e.g. by generating an interrupt - */ + uint32_t buf_idx; /* buffer index */ + uint16_t len; /* packet length */ + uint16_t flags; /* buf changed, etc. */ + uint64_t ptr; /* address for indirect buffers */ }; .Ed +.Pp +Describes a packet buffer, which normally is identified by +an index and resides in the mmapped region. .It Dv packet buffers -are fixed size (approximately 2k) buffers allocated by the kernel -that contain packet data. Buffers addresses are computed through -macros. +Fixed size (normally 2 KB) packet buffers allocated by the kernel. .El .Pp -Some macros support the access to objects in the shared memory -region. In particular: +The offset of the +.Pa struct netmap_if +in the mmapped region is indicated by the +.Pa nr_offset +field in the structure returned by +.Pa NIOCREGIF . +From there, all other objects are reachable through +relative references (offsets or indexes). +Macros and functions in +help converting them into actual pointers: +.Pp +.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset); +.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index); +.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index); +.Pp +.Dl char *buf = NETMAP_BUF(ring, buffer_index); +.Sh RINGS, BUFFERS AND DATA I/O +.Va Rings +are circular queues of packets with three indexes/pointers +.Va ( head , cur , tail ) ; +one slot is always kept empty. +The ring size +.Va ( num_slots ) +should not be assumed to be a power of two. 
+.br +(NOTE: older versions of netmap used head/count format to indicate +the content of a ring). +.Pp +.Va head +is the first slot available to userspace; +.br +.Va cur +is the wakeup point: +select/poll will unblock when +.Va tail +passes +.Va cur ; +.br +.Va tail +is the first slot reserved to the kernel. +.Pp +Slot indexes MUST only move forward; +for convenience, the function +.Dl nm_ring_next(ring, index) +returns the next index modulo the ring size. +.Pp +.Va head +and +.Va cur +are only modified by the user program; +.Va tail +is only modified by the kernel. +The kernel only reads/writes the +.Vt struct netmap_ring +slots and buffers +during the execution of a netmap-related system call. +The only exception are slots (and buffers) in the range +.Va tail\ . . . head-1 , +that are explicitly assigned to the kernel. +.Pp +.Ss TRANSMIT RINGS +On transmit rings, after a +.Nm +system call, slots in the range +.Va head\ . . . tail-1 +are available for transmission. +User code should fill the slots sequentially +and advance +.Va head +and +.Va cur +past slots ready to transmit. +.Va cur +may be moved further ahead if the user code needs +more slots before further transmissions (see +.Sx SCATTER GATHER I/O ) . +.Pp +At the next NIOCTXSYNC/select()/poll(), +slots up to +.Va head-1 +are pushed to the port, and +.Va tail +may advance if further slots have become available. +Below is an example of the evolution of a TX ring: +.Pp .Bd -literal -struct netmap_if *nifp; -struct netmap_ring *txring = NETMAP_TXRING(nifp, i); -struct netmap_ring *rxring = NETMAP_RXRING(nifp, i); -int i = txring->slot[txring->cur].buf_idx; -char *buf = NETMAP_BUF(txring, i); + after the syscall, slots between cur and tail are (a)vailable + head=cur tail + | | + v v + TX [.....aaaaaaaaaaa.............] + + user creates new packets to (T)ransmit + head=cur tail + | | + v v + TX [.....TTTTTaaaaaa.............] 
+ + NIOCTXSYNC/poll()/select() sends packets and reports new slots + head=cur tail + | | + v v + TX [..........aaaaaaaaaaa........] +.Ed +.Pp +select() and poll() wlll block if there is no space in the ring, i.e. +.Dl ring->cur == ring->tail +and return when new slots have become available. +.Pp +High speed applications may want to amortize the cost of system calls +by preparing as many packets as possible before issuing them. +.Pp +A transmit ring with pending transmissions has +.Dl ring->head != ring->tail + 1 (modulo the ring size). +The function +.Va int nm_tx_pending(ring) +implements this test. +.Pp +.Ss RECEIVE RINGS +On receive rings, after a +.Nm +system call, the slots in the range +.Va head\& . . . tail-1 +contain received packets. +User code should process them and advance +.Va head +and +.Va cur +past slots it wants to return to the kernel. +.Va cur +may be moved further ahead if the user code wants to +wait for more packets +without returning all the previous slots to the kernel. +.Pp +At the next NIOCRXSYNC/select()/poll(), +slots up to +.Va head-1 +are returned to the kernel for further receives, and +.Va tail +may advance to report new incoming packets. +.br +Below is an example of the evolution of an RX ring: +.Bd -literal + after the syscall, there are some (h)eld and some (R)eceived slots + head cur tail + | | | + v v v + RX [..hhhhhhRRRRRRRR..........] + + user advances head and cur, releasing some slots and holding others + head cur tail + | | | + v v v + RX [..*****hhhRRRRRR...........] + + NICRXSYNC/poll()/select() recovers slots and reports new packets + head cur tail + | | | + v v v + RX [.......hhhRRRRRRRRRRRR....] .Ed +.Pp +.Sh SLOTS AND PACKET BUFFERS +Normally, packets should be stored in the netmap-allocated buffers +assigned to slots when ports are bound to a file descriptor. +One packet is fully contained in a single buffer. 
+.Pp +The following flags affect slot and buffer processing: +.Bl -tag -width XXX +.It NS_BUF_CHANGED +it MUST be used when the buf_idx in the slot is changed. +This can be used to implement +zero-copy forwarding, see +.Sx ZERO-COPY FORWARDING . +.Pp +.It NS_REPORT +reports when this buffer has been transmitted. +Normally, +.Nm +notifies transmit completions in batches, hence signals +can be delayed indefinitely. This flag helps detecting +when packets have been send and a file descriptor can be closed. +.It NS_FORWARD +When a ring is in 'transparent' mode (see +.Sx TRANSPARENT MODE ) , +packets marked with this flags are forwarded to the other endpoint +at the next system call, thus restoring (in a selective way) +the connection between a NIC and the host stack. +.It NS_NO_LEARN +tells the forwarding code that the SRC MAC address for this +packet must not be used in the learning bridge code. +.It NS_INDIRECT +indicates that the packet's payload is in a user-supplied buffer, +whose user virtual address is in the 'ptr' field of the slot. +The size can reach 65535 bytes. +.br +This is only supported on the transmit ring of +.Nm VALE +ports, and it helps reducing data copies in the interconnection +of virtual machines. +.It NS_MOREFRAG +indicates that the packet continues with subsequent buffers; +the last buffer in a packet must have the flag clear. +.El +.Sh SCATTER GATHER I/O +Packets can span multiple slots if the +.Va NS_MOREFRAG +flag is set in all but the last slot. +The maximum length of a chain is 64 buffers. +This is normally used with +.Nm VALE +ports when connecting virtual machines, as they generate large +TSO segments that are not split unless they reach a physical device. +.Pp +NOTE: The length field always refers to the individual +fragment; there is no place with the total length of a packet. +.Pp +On receive rings the macro +.Va NS_RFRAGS(slot) +indicates the remaining number of slots for this packet, +including the current one. 
+Slots with a value greater than 1 also have NS_MOREFRAG set. .Sh IOCTLS .Nm -supports some ioctl() to synchronize the state of the rings -between the kernel and the user processes, plus some -to query and configure the interface. -The former do not require any argument, whereas the latter -use a -.Pa struct netmap_req -defined as follows: +uses two ioctls (NIOCTXSYNC, NIOCRXSYNC) +for non-blocking I/O. They take no argument. +Two more ioctls (NIOCGINFO, NIOCREGIF) are used +to query and configure ports, with the following argument: .Bd -literal struct nmreq { - char nr_name[IFNAMSIZ]; - uint32_t nr_version; /* API version */ -#define NETMAP_API 3 /* current version */ - uint32_t nr_offset; /* nifp offset in the shared region */ - uint32_t nr_memsize; /* size of the shared region */ - uint32_t nr_tx_slots; /* slots in tx rings */ - uint32_t nr_rx_slots; /* slots in rx rings */ - uint16_t nr_tx_rings; /* number of tx rings */ - uint16_t nr_rx_rings; /* number of tx rings */ - uint16_t nr_ringid; /* ring(s) we care about */ -#define NETMAP_HW_RING 0x4000 /* low bits indicate one hw ring */ -#define NETMAP_SW_RING 0x2000 /* we process the sw ring */ -#define NETMAP_NO_TX_POLL 0x1000 /* no gratuitous txsync on poll */ -#define NETMAP_RING_MASK 0xfff /* the actual ring number */ - uint16_t spare1; - uint32_t spare2[4]; + char nr_name[IFNAMSIZ]; /* (i) port name */ + uint32_t nr_version; /* (i) API version */ + uint32_t nr_offset; /* (o) nifp offset in mmap region */ + uint32_t nr_memsize; /* (o) size of the mmap region */ + uint32_t nr_tx_slots; /* (i/o) slots in tx rings */ + uint32_t nr_rx_slots; /* (i/o) slots in rx rings */ + uint16_t nr_tx_rings; /* (i/o) number of tx rings */ + uint16_t nr_rx_rings; /* (i/o) number of tx rings */ + uint16_t nr_ringid; /* (i/o) ring(s) we care about */ + uint16_t nr_cmd; /* (i) special command */ + uint16_t nr_arg1; /* (i/o) extra arguments */ + uint16_t nr_arg2; /* (i/o) extra arguments */ + uint32_t nr_arg3; /* (i/o) extra 
arguments */ + uint32_t nr_flags /* (i/o) open mode */ + ... }; - .Ed -A device descriptor obtained through +.Pp +A file descriptor obtained through .Pa /dev/netmap -also supports the ioctl supported by network devices. +also supports the ioctl supported by network devices, see +.Xr netintro 4 . .Pp -The netmap-specific -.Xr ioctl 2 -command codes below are defined in -.In net/netmap.h -and are: .Bl -tag -width XXXX .It Dv NIOCGINFO -returns information about the interface named in nr_name. -On return, nr_memsize indicates the size of the shared netmap -memory region (this is device-independent), -nr_tx_slots and nr_rx_slots indicates how many buffers are in a -transmit and receive ring, -nr_tx_rings and nr_rx_rings indicates the number of transmit -and receive rings supported by the hardware. +returns EINVAL if the named port does not support netmap. +Otherwise, it returns 0 and (advisory) information +about the port. +Note that all the information below can change before the +interface is actually put in netmap mode. .Pp -If the device does not support netmap, the ioctl returns EINVAL. +.Bl -tag -width XX +.It Pa nr_memsize +indicates the size of the +.Nm +memory region. NICs in +.Nm +mode all share the same memory region, +whereas +.Nm VALE +ports have independent regions for each port. +.It Pa nr_tx_slots , nr_rx_slots +indicate the size of transmit and receive rings. +.It Pa nr_tx_rings , nr_rx_rings +indicate the number of transmit +and receive rings. +Both ring number and sizes may be configured at runtime +using interface-specific functions (e.g. +.Xr ethtool +). +.El .It Dv NIOCREGIF -puts the interface named in nr_name into netmap mode, disconnecting -it from the host stack, and/or defines which rings are controlled -through this file descriptor. -On return, it gives the same info as NIOCGINFO, and nr_ringid -indicates the identity of the rings controlled through the file +binds the port named in +.Va nr_name +to the file descriptor. 
For a physical device this also switches it into +.Nm +mode, disconnecting +it from the host stack. +Multiple file descriptors can be bound to the same port, +with proper synchronization left to the user. +.Pp +.Dv NIOCREGIF can also bind a file descriptor to one endpoint of a +.Em netmap pipe , +consisting of two netmap ports with a crossover connection. +A netmap pipe share the same memory space of the parent port, +and is meant to enable configuration where a master process acts +as a dispatcher towards slave processes. +.Pp +To enable this function, the +.Pa nr_arg1 +field of the structure can be used as a hint to the kernel to +indicate how many pipes we expect to use, and reserve extra space +in the memory region. +.Pp +On return, it gives the same info as NIOCGINFO, +with +.Pa nr_ringid +and +.Pa nr_flags +indicating the identity of the rings controlled through the file descriptor. .Pp -Possible values for nr_ringid are +.Va nr_flags +.Va nr_ringid +selects which rings are controlled through this file descriptor. +Possible values of +.Pa nr_flags +are indicated below, together with the naming schemes +that application libraries (such as the +.Nm nm_open +indicated below) can use to indicate the specific set of rings. +In the example below, "netmap:foo" is any valid netmap port name. +.Pp .Bl -tag -width XXXXX -.It 0 -default, all hardware rings -.It NETMAP_SW_RING -the ``host rings'' connecting to the host stack -.It NETMAP_HW_RING + i -the i-th hardware ring +.It NR_REG_ALL_NIC "netmap:foo" +(default) all hardware ring pairs +.It NR_REG_SW_NIC "netmap:foo^" +the ``host rings'', connecting to the host stack. 
+.It NR_RING_NIC_SW "netmap:foo+ +all hardware rings and the host rings +.It NR_REG_ONE_NIC "netmap:foo-i" +only the i-th hardware ring pair, where the number is in +.Pa nr_ringid ; +.It NR_REG_PIPE_MASTER "netmap:foo{i" +the master side of the netmap pipe whose identifier (i) is in +.Pa nr_ringid ; +.It NR_REG_PIPE_SLAVE "netmap:foo}i" +the slave side of the netmap pipe whose identifier (i) is in +.Pa nr_ringid . +.Pp +The identifier of a pipe must be thought as part of the pipe name, +and does not need to be sequential. On return the pipe +will only have a single ring pair with index 0, +irrespective of the value of i. .El +.Pp By default, a -.Nm poll +.Xr poll 2 or -.Nm select +.Xr select 2 call pushes out any pending packets on the transmit ring, even if no write events are specified. The feature can be disabled by or-ing -.Nm NETMAP_NO_TX_SYNC -to nr_ringid. -But normally you should keep this feature unless you are using -separate file descriptors for the send and receive rings, because -otherwise packets are pushed out only if NETMAP_TXSYNC is called, -or the send queue is full. -.Pp -.Pa NIOCREGIF -can be used multiple times to change the association of a -file descriptor to a ring pair, always within the same device. -.It Dv NIOCUNREGIF -brings an interface back to normal mode. +.Va NETMAP_NO_TX_SYNC +to the value written to +.Va nr_ringid. +When this feature is used, +packets are transmitted only on +.Va ioctl(NIOCTXSYNC) +or select()/poll() are called with a write event (POLLOUT/wfdset) or a full ring. +.Pp +When registering a virtual interface that is dynamically created to a +.Xr vale 4 +switch, we can specify the desired number of rings (1 by default, +and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields. .It Dv NIOCTXSYNC tells the hardware of new packets to transmit, and updates the number of slots available for transmission. 
@@ -252,54 +658,387 @@ number of slots available for transmissi tells the hardware of consumed packets, and asks for newly available packets. .El +.Sh SELECT, POLL, EPOLL, KQUEUE. +.Xr select 2 +and +.Xr poll 2 +on a +.Nm +file descriptor process rings as indicated in +.Sx TRANSMIT RINGS +and +.Sx RECEIVE RINGS , +respectively when write (POLLOUT) and read (POLLIN) events are requested. +Both block if no slots are available in the ring +.Va ( ring->cur == ring->tail ) . +Depending on the platform, +.Xr epoll 2 +and +.Xr kqueue 2 +are supported too. +.Pp +Packets in transmit rings are normally pushed out +(and buffers reclaimed) even without +requesting write events. Passing the NETMAP_NO_TX_SYNC flag to +.Em NIOCREGIF +disables this feature. +By default, receive rings are processed only if read +events are requested. Passing the NETMAP_DO_RX_SYNC flag to +.Em NIOCREGIF updates receive rings even without read events. +Note that on epoll and kqueue, NETMAP_NO_TX_SYNC and NETMAP_DO_RX_SYNC +only have an effect when some event is posted for the file descriptor. +.Sh LIBRARIES +The +.Nm +API is supposed to be used directly, both because of its simplicity and +for efficient integration with applications. +.Pp +For conveniency, the +.Va +header provides a few macros and functions to ease creating +a file descriptor and doing I/O with a +.Nm +port. These are loosely modeled after the +.Xr pcap 3 +API, to ease porting of libpcap-based applications to +.Nm . +To use these extra functions, programs should +.Dl #define NETMAP_WITH_LIBS +before +.Dl #include +.Pp +The following functions are available: +.Bl -tag -width XXXXX +.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg) +similar to +.Xr pcap_open , +binds a file descriptor to a port. +.Bl -tag -width XX +.It Va ifname +is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a +.Nm VALE +port. 
+.It Va req +provides the initial values for the argument to the NIOCREGIF ioctl. +The nm_flags and nm_ringid values are overwritten by parsing +ifname and flags, and other fields can be overridden through +the other two arguments. +.It Va arg +points to a struct nm_desc containing arguments (e.g. from a previously +open file descriptor) that should override the defaults. +The fields are used as described below +.It Va flags +can be set to a combination of the following flags: +.Va NETMAP_NO_TX_POLL , +.Va NETMAP_DO_RX_POLL +(copied into nr_ringid); +.Va NM_OPEN_NO_MMAP (if arg points to the same memory region, +avoids the mmap and uses the values from it); +.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg); +.Va NM_OPEN_ARG1 , +.Va NM_OPEN_ARG2 , +.Va NM_OPEN_ARG3 (uses the fields from arg); +.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg). +.El +.It Va int nm_close(struct nm_desc *d) +closes the file descriptor, unmaps memory, frees resources. +.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size) +similar to pcap_inject(), pushes a packet to a ring, returns the size +of the packet is successful, or 0 on error; +.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg) +similar to pcap_dispatch(), applies a callback to incoming packets +.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr) +similar to pcap_next(), fetches the next packet +.Pp +.El +.Sh SUPPORTED DEVICES +.Nm +natively supports the following devices: +.Pp +On FreeBSD: +.Xr em 4 , +.Xr igb 4 , +.Xr ixgbe 4 , +.Xr lem 4 , +.Xr re 4 . +.Pp +On Linux +.Xr e1000 4 , +.Xr e1000e 4 , +.Xr igb 4 , +.Xr ixgbe 4 , +.Xr mlx4 4 , +.Xr forcedeth 4 , +.Xr r8169 4 . +.Pp +NICs without native support can still be used in +.Nm +mode through emulation. Performance is inferior to native netmap +mode but still significantly higher than sockets, and approaching +that of in-kernel solutions such as Linux's +.Xr pktgen . 
+.Pp +Emulation is also available for devices with native netmap support, +which can be used for testing or performance comparison. +The sysctl variable +.Va dev.netmap.admode +globally controls how netmap mode is implemented. +.Sh SYSCTL VARIABLES AND MODULE PARAMETERS +Some aspect of the operation of +.Nm +are controlled through sysctl variables on FreeBSD +.Em ( dev.netmap.* ) +and module parameters on Linux +.Em ( /sys/module/netmap_lin/parameters/* ) : +.Pp +.Bl -tag -width indent +.It Va dev.netmap.admode: 0 +Controls the use of native or emulated adapter mode. +0 uses the best available option, 1 forces native and +fails if not available, 2 forces emulated hence never fails. +.It Va dev.netmap.generic_ringsize: 1024 +Ring size used for emulated netmap mode +.It Va dev.netmap.generic_mit: 100000 +Controls interrupt moderation for emulated mode +.It Va dev.netmap.mmap_unreg: 0 +.It Va dev.netmap.fwd: 0 +Forces NS_FORWARD mode +.It Va dev.netmap.flags: 0 +.It Va dev.netmap.txsync_retry: 2 +.It Va dev.netmap.no_pendintr: 1 +Forces recovery of transmit buffers on system calls +.It Va dev.netmap.mitigate: 1 +Propagates interrupt mitigation to user processes +.It Va dev.netmap.no_timestamp: 0 +Disables the update of the timestamp in the netmap ring +.It Va dev.netmap.verbose: 0 +Verbose kernel messages +.It Va dev.netmap.buf_num: 163840 +.It Va dev.netmap.buf_size: 2048 +.It Va dev.netmap.ring_num: 200 +.It Va dev.netmap.ring_size: 36864 +.It Va dev.netmap.if_num: 100 +.It Va dev.netmap.if_size: 1024 +Sizes and number of objects (netmap_if, netmap_ring, buffers) +for the global memory region. The only parameter worth modifying is +.Va dev.netmap.buf_num +as it impacts the total amount of memory used by netmap. +.It Va dev.netmap.buf_curr_num: 0 +.It Va dev.netmap.buf_curr_size: 0 +.It Va dev.netmap.ring_curr_num: 0 +.It Va dev.netmap.ring_curr_size: 0 +.It Va dev.netmap.if_curr_num: 0 +.It Va dev.netmap.if_curr_size: 0 +Actual values in use. 
+.It Va dev.netmap.bridge_batch: 1024 +Batch size used when moving packets across a +.Nm VALE +switch. Values above 64 generally guarantee good +performance. +.El .Sh SYSTEM CALLS .Nm uses -.Nm select +.Xr select 2 , +.Xr poll 2 , +.Xr epoll and -.Nm poll -to wake up processes when significant events occur. +.Xr kqueue +to wake up processes when significant events occur, and +.Xr mmap 2 *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
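
The head/cur/tail ring protocol documented in the RINGS, BUFFERS AND DATA I/O section of the man page above can be illustrated with a small standalone model. This is only a sketch: struct toy_ring and the toy_* helpers are hypothetical names invented here, not the real struct netmap_ring or the helpers shipped in the netmap headers, and no kernel or device is involved. It assumes the semantics stated in the diff: indexes only move forward modulo num_slots, the slots cur ... tail-1 are usable by the user, and a transmit ring has pending transmissions when head != tail + 1 (modulo the ring size).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the index fields of a netmap ring. */
struct toy_ring {
	uint32_t num_slots;	/* ring size, not necessarily a power of two */
	uint32_t head;		/* (u) first slot owned by the user */
	uint32_t cur;		/* (u) wakeup position */
	uint32_t tail;		/* (k) first slot owned by the kernel */
};

/* Next index, wrapping at the ring size (no power-of-two masking). */
static inline uint32_t
toy_ring_next(const struct toy_ring *r, uint32_t i)
{
	return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots in the range cur ... tail-1. */
static inline uint32_t
toy_ring_space(const struct toy_ring *r)
{
	return (r->tail >= r->cur) ? r->tail - r->cur
	    : r->tail + r->num_slots - r->cur;
}

/* The man page's test: head != tail + 1 (modulo the ring size). */
static inline int
toy_tx_pending(const struct toy_ring *r)
{
	return toy_ring_next(r, r->tail) != r->head;
}

/*
 * User side: hand n filled slots to the kernel.
 * Simplification of this sketch: head and cur are kept equal,
 * as in the TX ring example of the man page.
 */
static inline void
toy_ring_advance(struct toy_ring *r, uint32_t n)
{
	assert(n <= toy_ring_space(r));
	while (n-- > 0)
		r->cur = toy_ring_next(r, r->cur);
	r->head = r->cur;
}
```

For an 8-slot ring with head = cur = 5 and tail = 2, toy_ring_space() counts slots 5, 6, 7, 0 and 1. After toy_ring_advance(&r, 3) both head and cur wrap around to 0, two slots remain usable, and toy_tx_pending() is true until a (simulated) sync moves tail forward. A ring with head = 3 and tail = 2 has every slot but one available to the user and nothing pending, matching the rule that one slot is always kept empty.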