Date: Mon, 6 Jan 2014 14:56:01 +0000 (UTC)
From: Mark Murray <markm@FreeBSD.org>
To: src-committers@freebsd.org, svn-src-projects@freebsd.org
Subject: svn commit: r260370 - in projects/random_number_generator: share/man/man4 share/mk sys/dev/e1000 sys/dev/ixgbe sys/dev/netmap sys/net sys/rpc tools/tools/netmap
Message-ID: <201401061456.s06Eu11i095200@svn.freebsd.org>
Author: markm
Date: Mon Jan 6 14:56:00 2014
New Revision: 260370
URL: http://svnweb.freebsd.org/changeset/base/260370

Log:
  MFC - tracking commit.

Modified:
  projects/random_number_generator/share/man/man4/netmap.4
  projects/random_number_generator/share/mk/bsd.sys.mk
  projects/random_number_generator/sys/dev/e1000/if_em.c
  projects/random_number_generator/sys/dev/e1000/if_igb.c
  projects/random_number_generator/sys/dev/e1000/if_lem.c
  projects/random_number_generator/sys/dev/ixgbe/ixgbe.c
  projects/random_number_generator/sys/dev/netmap/if_em_netmap.h
  projects/random_number_generator/sys/dev/netmap/if_igb_netmap.h
  projects/random_number_generator/sys/dev/netmap/if_lem_netmap.h
  projects/random_number_generator/sys/dev/netmap/if_re_netmap.h
  projects/random_number_generator/sys/dev/netmap/ixgbe_netmap.h
  projects/random_number_generator/sys/dev/netmap/netmap.c
  projects/random_number_generator/sys/dev/netmap/netmap_freebsd.c
  projects/random_number_generator/sys/dev/netmap/netmap_generic.c
  projects/random_number_generator/sys/dev/netmap/netmap_kern.h
  projects/random_number_generator/sys/dev/netmap/netmap_mbq.c
  projects/random_number_generator/sys/dev/netmap/netmap_mbq.h
  projects/random_number_generator/sys/dev/netmap/netmap_mem2.c
  projects/random_number_generator/sys/dev/netmap/netmap_mem2.h
  projects/random_number_generator/sys/dev/netmap/netmap_vale.c
  projects/random_number_generator/sys/net/netmap.h
  projects/random_number_generator/sys/net/netmap_user.h
  projects/random_number_generator/sys/rpc/svc.h
  projects/random_number_generator/tools/tools/netmap/bridge.c
  projects/random_number_generator/tools/tools/netmap/nm_util.c
  projects/random_number_generator/tools/tools/netmap/nm_util.h
  projects/random_number_generator/tools/tools/netmap/pcap.c
  projects/random_number_generator/tools/tools/netmap/pkt-gen.c
  projects/random_number_generator/tools/tools/netmap/vale-ctl.c

Directory Properties:
  projects/random_number_generator/ (props changed)
  projects/random_number_generator/share/man/man4/ (props changed)
  projects/random_number_generator/sys/ (props changed)

Modified: projects/random_number_generator/share/man/man4/netmap.4
==============================================================================
--- projects/random_number_generator/share/man/man4/netmap.4 Mon Jan 6 14:39:10 2014 (r260369)
+++ projects/random_number_generator/share/man/man4/netmap.4 Mon Jan 6 14:56:00 2014 (r260370)
@@ -1,4 +1,4 @@
-.\" Copyright (c) 2011-2013 Matteo Landi, Luigi Rizzo, Universita` di Pisa
+.\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
@@ -27,434 +27,546 @@
.\"
.\" $FreeBSD$
.\"
-.Dd October 18, 2013
+.Dd January 4, 2014
.Dt NETMAP 4
.Os
.Sh NAME
.Nm netmap
.Nd a framework for fast packet I/O
+.br
+.Nm VALE
+.Nd a fast VirtuAl Local Ethernet using the netmap API
.Sh SYNOPSIS
.Cd device netmap
.Sh DESCRIPTION
.Nm
is a framework for extremely fast and efficient packet I/O
-(reaching 14.88 Mpps with a single core at less than 1 GHz) for both userspace and kernel clients. Userspace clients can use the netmap API
-to send and receive raw packets through physical interfaces
-or ports of the
-.Xr VALE 4
-switch.
-.Pp
-.Nm VALE
-is a very fast (reaching 20 Mpps per port)
-and modular software switch,
-implemented within the kernel, which can interconnect
-virtual ports, physical devices, and the native host stack.
-.Pp -.Nm -uses a memory mapped region to share packet buffers, -descriptors and queues with the kernel. -Simple -.Pa ioctl()s -are used to bind interfaces/ports to file descriptors and -implement non-blocking I/O, whereas blocking I/O uses -.Pa select()/poll() . +It runs on FreeBSD and Linux, +and includes +.Nm VALE , +a very fast and modular in-kernel software switch/dataplane. +.Pp .Nm -can exploit the parallelism in multiqueue devices and -multicore systems. +and +.Nm VALE +are one order of magnitude faster than sockets, bpf or +native switches based on +.Xr tun/tap 4 , +reaching 14.88 Mpps with much less than one core on a 10 Gbit NIC, +and 20 Mpps per core for VALE ports. +.Pp +Userspace clients can dynamically switch NICs into +.Nm +mode and send and receive raw packets through +memory mapped buffers. +A selectable file descriptor supports +synchronization and blocking I/O. +.Pp +Similarly, +.Nm VALE +can dynamically create switch instances and ports, +providing high speed packet I/O between processes, +virtual machines, NICs and the host stack. .Pp -For the best performance, +For best performance, .Nm requires explicit support in device drivers; -a generic emulation layer is available to implement the +however, the .Nm -API on top of unmodified device drivers, +API can be emulated on top of unmodified device drivers, at the price of reduced performance -(but still better than what can be achieved with -sockets or BPF/pcap). +(but still better than sockets or BPF/pcap). .Pp -For a list of devices with native +In the rest of this (long) manual page we document +various aspects of the .Nm -support, see the end of this manual page. -.Sh OPERATION - THE NETMAP API +and +.Nm VALE +architecture, features and usage. +.Pp +.Sh ARCHITECTURE .Nm -clients must first -.Pa open("/dev/netmap") , -and then issue an -.Pa ioctl(fd, NIOCREGIF, (struct nmreq *)arg) -to bind the file descriptor to a specific interface or port. +supports raw packet I/O through a +.Em port , +which can be connected to a physical interface +.Em ( NIC ) , +to the host stack, +or to a +.Nm VALE +switch). +Ports use preallocated circular queues of buffers +.Em ( rings ) +residing in an mmapped region. +There is one ring for each transmit/receive queue of a +NIC or virtual port. +An additional ring pair connects to the host stack. +.Pp +After binding a file descriptor to a port, a +.Nm +client can send or receive packets in batches through +the rings, and possibly implement zero-copy forwarding +between ports. +.Pp +All NICs operating in +.Nm +mode use the same memory region, +accessible to all processes who own +.Nm /dev/netmap +file descriptors bound to NICs. +.Nm VALE +ports instead use separate memory regions. +.Pp +.Sh ENTERING AND EXITING NETMAP MODE +Ports and rings are created and controlled through a file descriptor, +created by opening a special device +.Dl fd = open("/dev/netmap"); +and then bound to a specific port with an +.Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg); +.Pp .Nm has multiple modes of operation controlled by the -content of the -.Pa struct nmreq -passed to the -.Pa ioctl() . -In particular, the -.Em nr_name -field specifies whether the client operates on a physical network -interface or on a port of a -.Nm VALE -switch, as indicated below. Additional fields in the -.Pa struct nmreq -control the details of operation. +.Vt struct nmreq +argument. +.Va arg.nr_name +specifies the port name, as follows: .Bl -tag -width XXXX -.It Dv Interface name (e.g. 'em0', 'eth1', ... 
) -The data path of the interface is disconnected from the host stack. -Depending on additional arguments, -the file descriptor is bound to the NIC (one or all queues), -or to the host stack. +.It Dv OS network interface name (e.g. 'em0', 'eth1', ... ) +the data path of the NIC is disconnected from the host stack, +and the file descriptor is bound to the NIC (one or all queues), +or to the host stack; .It Dv valeXXX:YYY (arbitrary XXX and YYY) -The file descriptor is bound to port YYY of a VALE switch called XXX, -where XXX and YYY are arbitrary alphanumeric strings. +the file descriptor is bound to port YYY of a VALE switch called XXX, +both dynamically created if necessary. The string cannot exceed IFNAMSIZ characters, and YYY cannot -matching the name of any existing interface. -.Pp -The switch and the port are created if not existing. -.It Dv valeXXX:ifname (ifname is an existing interface) -Flags in the argument control whether the physical interface -(and optionally the corrisponding host stack endpoint) -are connected or disconnected from the VALE switch named XXX. -.Pp -In this case the -.Pa ioctl() -is used only for configuring the VALE switch, typically through the -.Nm vale-ctl -command. -The file descriptor cannot be used for I/O, and should be -.Pa close()d -after issuing the -.Pa ioctl(). +be the name of any existing OS network interface. .El .Pp -The binding can be removed (and the interface returns to -regular operation, or the virtual port destroyed) with a -.Pa close() -on the file descriptor. -.Pp -The processes owning the file descriptor can then -.Pa mmap() -the memory region that contains pre-allocated -buffers, descriptors and queues, and use them to -read/write raw packets. +On return, +.Va arg +indicates the size of the shared memory region, +and the number, size and location of all the +.Nm +data structures, which can be accessed by mmapping the memory +.Dl char *mem = mmap(0, arg.nr_memsize, fd); +.Pp Non blocking I/O is done with special -.Pa ioctl()'s , -whereas the file descriptor can be passed to -.Pa select()/poll() -to be notified about incoming packet or available transmit buffers. -.Ss DATA STRUCTURES -The data structures in the mmapped memory are described below -(see -.Xr sys/net/netmap.h -for reference). -All physical devices operating in +.Xr ioctl 2 +.Xr select 2 +and +.Xr poll 2 +on the file descriptor permit blocking I/O. +.Xr epoll 2 +and +.Xr kqueue 2 +are not supported on .Nm -mode use the same memory region, -shared by the kernel and all processes who own -.Pa /dev/netmap -descriptors bound to those devices -(NOTE: visibility may be restricted in future implementations). -Virtual ports instead use separate memory regions, -shared only with the kernel. -.Pp -All references between the shared data structure -are relative (offsets or indexes). Some macros help converting -them into actual pointers. +file descriptors. +.Pp +While a NIC is in +.Nm +mode, the OS will still believe the interface is up and running. +OS-generated packets for that NIC end up into a +.Nm +ring, and another ring is used to send packets into the OS network stack. +A +.Xr close 2 +on the file descriptor removes the binding, +and returns the NIC to normal mode (reconnecting the data path +to the host stack), or destroys the virtual port. +.Pp +.Sh DATA STRUCTURES +The data structures in the mmapped memory region are detailed in +.Xr sys/net/netmap.h , +which is the ultimate reference for the +.Nm +API. 
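As an illustration of the open/NIOCREGIF/mmap sequence just described, a minimal sketch in C follows. Error handling is kept to a bare minimum, and the helper name open_port is an example, not part of the API.
.Bd -literal
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <net/if.h>
#include <net/netmap.h>
#include <net/netmap_user.h>

/* Bind a port and map the shared region; returns NULL on failure. */
static struct netmap_if *
open_port(const char *name, int *fdp, struct nmreq *req)
{
    void *mem;
    int fd = open("/dev/netmap", O_RDWR);

    memset(req, 0, sizeof(*req));
    req->nr_version = NETMAP_API;
    strncpy(req->nr_name, name, sizeof(req->nr_name) - 1);
    if (fd < 0 || ioctl(fd, NIOCREGIF, req) < 0)
        return (NULL);          /* no netmap support for this port */
    mem = mmap(NULL, req->nr_memsize, PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED)
        return (NULL);
    *fdp = fd;
    return (NETMAP_IF(mem, req->nr_offset));
}
.Ed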
The main structures and fields are indicated below: .Bl -tag -width XXX .It Dv struct netmap_if (one per interface) -indicates the number of rings supported by an interface, their -sizes, and the offsets of the -.Pa netmap_rings -associated to the interface. -.Pp -.Pa struct netmap_if -is at offset -.Pa nr_offset -in the shared memory region is indicated by the -field in the structure returned by the -.Pa NIOCREGIF -(see below). .Bd -literal struct netmap_if { - char ni_name[IFNAMSIZ]; /* name of the interface. */ - const u_int ni_version; /* API version */ - const u_int ni_rx_rings; /* number of rx ring pairs */ - const u_int ni_tx_rings; /* if 0, same as ni_rx_rings */ - const ssize_t ring_ofs[]; /* offset of tx and rx rings */ + ... + const uint32_t ni_flags; /* properties */ + ... + const uint32_t ni_tx_rings; /* NIC tx rings */ + const uint32_t ni_rx_rings; /* NIC rx rings */ + const uint32_t ni_extra_tx_rings; /* extra tx rings */ + const uint32_t ni_extra_rx_rings; /* extra rx rings */ + ... }; .Ed +.Pp +Indicates the number of available rings +.Pa ( struct netmap_rings ) +and their position in the mmapped region. +The number of tx and rx rings +.Pa ( ni_tx_rings , ni_rx_rings ) +normally depends on the hardware. +NICs also have an extra tx/rx ring pair connected to the host stack. +.Em NIOCREGIF +can request additional tx/rx rings, +to be used between multiple processes/threads +accessing the same +.Nm +port. .It Dv struct netmap_ring (one per ring) -Contains the positions in the transmit and receive rings to -synchronize the kernel and the application, -and an array of -.Pa slots -describing the buffers. -'reserved' is used in receive rings to tell the kernel the -number of slots after 'cur' that are still in usr -indicates how many slots starting from 'cur' -the -.Pp -Each physical interface has one -.Pa netmap_ring -for each hardware transmit and receive ring, -plus one extra transmit and one receive structure -that connect to the host stack. .Bd -literal struct netmap_ring { - const ssize_t buf_ofs; /* see details */ - const uint32_t num_slots; /* number of slots in the ring */ - uint32_t avail; /* number of usable slots */ - uint32_t cur; /* 'current' read/write index */ - uint32_t reserved; /* not refilled before current */ - - const uint16_t nr_buf_size; - uint16_t flags; -#define NR_TIMESTAMP 0x0002 /* set timestamp on *sync() */ -#define NR_FORWARD 0x0004 /* enable NS_FORWARD for ring */ -#define NR_RX_TSTMP 0x0008 /* set rx timestamp in slots */ - struct timeval ts; - struct netmap_slot slot[0]; /* array of slots */ + ... + const uint32_t num_slots; /* slots in each ring */ + const uint32_t nr_buf_size; /* size of each buffer */ + ... + uint32_t head; /* (u) first buf owned by user */ + uint32_t cur; /* (u) wakeup position */ + const uint32_t tail; /* (k) first buf owned by kernel */ + ... + uint32_t flags; + struct timeval ts; /* (k) time of last rxsync() */ + ... + struct netmap_slot slot[0]; /* array of slots */ } .Ed .Pp -In transmit rings, after a system call 'cur' indicates -the first slot that can be used for transmissions, -and 'avail' reports how many of them are available. -Before the next netmap-related system call on the file -descriptor, the application should fill buffers and -slots with data, and update 'cur' and 'avail' -accordingly, as shown in the figure below: +Implements transmit and receive rings, with read/write +pointers, metadata and and an array of +.Pa slots +describing the buffers. 
+.Pp +.It Dv struct netmap_slot (one per buffer) .Bd -literal - - cur - |----- avail ---| (after syscall) - v - TX [*****aaaaaaaaaaaaaaaaa**] - TX [*****TTTTTaaaaaaaaaaaa**] - ^ - |-- avail --| (before syscall) - cur +struct netmap_slot { + uint32_t buf_idx; /* buffer index */ + uint16_t len; /* packet length */ + uint16_t flags; /* buf changed, etc. */ + uint64_t ptr; /* address for indirect buffers */ +}; .Ed -In receive rings, after a system call 'cur' indicates -the first slot that contains a valid packet, -and 'avail' reports how many of them are available. -Before the next netmap-related system call on the file -descriptor, the application can process buffers and -release them to the kernel updating -'cur' and 'avail' accordingly, as shown in the figure below. -Receive rings have an additional field called 'reserved' -to indicate how many buffers before 'cur' are still -under processing and cannot be released. +.Pp +Describes a packet buffer, which normally is identified by +an index and resides in the mmapped region. +.It Dv packet buffers +Fixed size (normally 2 KB) packet buffers allocated by the kernel. +.El +.Pp +The offset of the +.Pa struct netmap_if +in the mmapped region is indicated by the +.Pa nr_offset +field in the structure returned by +.Pa NIOCREGIF . +From there, all other objects are reachable through +relative references (offsets or indexes). +Macros and functions in <net/netmap_user.h> +help converting them into actual pointers: +.Pp +.Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset); +.Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index); +.Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index); +.Pp +.Dl char *buf = NETMAP_BUF(ring, buffer_index); +.Sh RINGS, BUFFERS AND DATA I/O +.Va Rings +are circular queues of packets with three indexes/pointers +.Va ( head , cur , tail ) ; +one slot is always kept empty. +The ring size +.Va ( num_slots ) +should not be assumed to be a power of two. +.br +(NOTE: older versions of netmap used head/count format to indicate +the content of a ring). +.Pp +.Va head +is the first slot available to userspace; +.br +.Va cur +is the wakeup point: +select/poll will unblock when +.Va tail +passes +.Va cur ; +.br +.Va tail +is the first slot reserved to the kernel. +.Pp +Slot indexes MUST only move forward; +for convenience, the function +.Dl nm_ring_next(ring, index) +returns the next index modulo the ring size. +.Pp +.Va head +and +.Va cur +are only modified by the user program; +.Va tail +is only modified by the kernel. +The kernel only reads/writes the +.Vt struct netmap_ring +slots and buffers +during the execution of a netmap-related system call. +The only exception are slots (and buffers) in the range +.Va tail\ . . . head-1 , +that are explicitly assigned to the kernel. +.Pp +.Ss TRANSMIT RINGS +On transmit rings, after a +.Nm +system call, slots in the range +.Va head\ . . . tail-1 +are available for transmission. +User code should fill the slots sequentially +and advance +.Va head +and +.Va cur +past slots ready to transmit. +.Va cur +may be moved further ahead if the user code needs +more slots before further transmissions (see +.Sx SCATTER GATHER I/O ) . +.Pp +At the next NIOCTXSYNC/select()/poll(), +slots up to +.Va head-1 +are pushed to the port, and +.Va tail +may advance if further slots have become available. 
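The transmit rules above can be sketched as follows, reusing the headers and the netmap_if pointer from the earlier example; build_packet(), the helper name and the choice of ring 0 are placeholders, not part of the API.
.Bd -literal
/* build_packet() stands for application code filling one frame. */
extern uint16_t build_packet(char *buf, uint32_t bufsize);

static void
send_batch(struct netmap_if *nifp, int fd)
{
    struct netmap_ring *ring = NETMAP_TXRING(nifp, 0);
    uint32_t i = ring->head;

    while (i != ring->tail) {        /* slots head .. tail-1 are free */
        struct netmap_slot *slot = &ring->slot[i];
        char *buf = NETMAP_BUF(ring, slot->buf_idx);

        slot->len = build_packet(buf, ring->nr_buf_size);
        i = nm_ring_next(ring, i);   /* indexes only move forward */
    }
    ring->head = ring->cur = i;      /* hand the filled slots to the kernel */
    ioctl(fd, NIOCTXSYNC, NULL);     /* or poll()/select() with POLLOUT */
}
.Ed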
+Below is an example of the evolution of a TX ring: +.Pp .Bd -literal - cur - |-res-|-- avail --| (after syscall) - v - RX [**rrrrrrRRRRRRRRRRRR******] - RX [**...........rrrrRRR******] - |res|--|<avail (before syscall) - ^ - cur + after the syscall, slots between cur and tail are (a)vailable + head=cur tail + | | + v v + TX [.....aaaaaaaaaaa.............] + + user creates new packets to (T)ransmit + head=cur tail + | | + v v + TX [.....TTTTTaaaaaa.............] + NIOCTXSYNC/poll()/select() sends packets and reports new slots + head=cur tail + | | + v v + TX [..........aaaaaaaaaaa........] .Ed -.It Dv struct netmap_slot (one per packet) -contains the metadata for a packet: +.Pp +select() and poll() wlll block if there is no space in the ring, i.e. +.Dl ring->cur == ring->tail +and return when new slots have become available. +.Pp +High speed applications may want to amortize the cost of system calls +by preparing as many packets as possible before issuing them. +.Pp +A transmit ring with pending transmissions has +.Dl ring->head != ring->tail + 1 (modulo the ring size). +The function +.Va int nm_tx_pending(ring) +implements this test. +.Pp +.Ss RECEIVE RINGS +On receive rings, after a +.Nm +system call, the slots in the range +.Va head\& . . . tail-1 +contain received packets. +User code should process them and advance +.Va head +and +.Va cur +past slots it wants to return to the kernel. +.Va cur +may be moved further ahead if the user code wants to +wait for more packets +without returning all the previous slots to the kernel. +.Pp +At the next NIOCRXSYNC/select()/poll(), +slots up to +.Va head-1 +are returned to the kernel for further receives, and +.Va tail +may advance to report new incoming packets. +.br +Below is an example of the evolution of an RX ring: .Bd -literal -struct netmap_slot { - uint32_t buf_idx; /* buffer index */ - uint16_t len; /* packet length */ - uint16_t flags; /* buf changed, etc. */ -#define NS_BUF_CHANGED 0x0001 /* must resync, buffer changed */ -#define NS_REPORT 0x0002 /* tell hw to report results - * e.g. by generating an interrupt - */ -#define NS_FORWARD 0x0004 /* pass packet to the other endpoint - * (host stack or device) - */ -#define NS_NO_LEARN 0x0008 -#define NS_INDIRECT 0x0010 -#define NS_MOREFRAG 0x0020 -#define NS_PORT_SHIFT 8 -#define NS_PORT_MASK (0xff << NS_PORT_SHIFT) -#define NS_RFRAGS(_slot) ( ((_slot)->flags >> 8) & 0xff) - uint64_t ptr; /* buffer address (indirect buffers) */ -}; + after the syscall, there are some (h)eld and some (R)eceived slots + head cur tail + | | | + v v v + RX [..hhhhhhRRRRRRRR..........] + + user advances head and cur, releasing some slots and holding others + head cur tail + | | | + v v v + RX [..*****hhhRRRRRR...........] + + NICRXSYNC/poll()/select() recovers slots and reports new packets + head cur tail + | | | + v v v + RX [.......hhhRRRRRRRRRRRR....] .Ed -The flags control how the the buffer associated to the slot -should be managed. -.It Dv packet buffers -are normally fixed size (2 Kbyte) buffers allocated by the kernel -that contain packet data. Buffers addresses are computed through -macros. -.El -.Bl -tag -width XXX -Some macros support the access to objects in the shared memory -region. In particular, -.It NETMAP_TXRING(nifp, i) -.It NETMAP_RXRING(nifp, i) -return the address of the i-th transmit and receive ring, -respectively, whereas -.It NETMAP_BUF(ring, buf_idx) -returns the address of the buffer with index buf_idx -(which can be part of any ring for the given interface). 
-.El .Pp -Normally, buffers are associated to slots when interfaces are bound, -and one packet is fully contained in a single buffer. -Clients can however modify the mapping using the -following flags: -.Ss FLAGS +.Sh SLOTS AND PACKET BUFFERS +Normally, packets should be stored in the netmap-allocated buffers +assigned to slots when ports are bound to a file descriptor. +One packet is fully contained in a single buffer. +.Pp +The following flags affect slot and buffer processing: .Bl -tag -width XXX .It NS_BUF_CHANGED -indicates that the buf_idx in the slot has changed. -This can be useful if the client wants to implement -some form of zero-copy forwarding (e.g. by passing buffers -from an input interface to an output interface), or -needs to process packets out of order. +it MUST be used when the buf_idx in the slot is changed. +This can be used to implement +zero-copy forwarding, see +.Sx ZERO-COPY FORWARDING . .Pp -The flag MUST be used whenever the buffer index is changed. .It NS_REPORT -indicates that we want to be woken up when this buffer -has been transmitted. This reduces performance but insures -a prompt notification when a buffer has been sent. +reports when this buffer has been transmitted. Normally, .Nm notifies transmit completions in batches, hence signals -can be delayed indefinitely. However, we need such notifications -before closing a descriptor. +can be delayed indefinitely. This flag helps detecting +when packets have been send and a file descriptor can be closed. .It NS_FORWARD -When the device is open in 'transparent' mode, -the client can mark slots in receive rings with this flag. -For all marked slots, marked packets are forwarded to -the other endpoint at the next system call, thus restoring -(in a selective way) the connection between the NIC and the -host stack. +When a ring is in 'transparent' mode (see +.Sx TRANSPARENT MODE ) , +packets marked with this flags are forwarded to the other endpoint +at the next system call, thus restoring (in a selective way) +the connection between a NIC and the host stack. .It NS_NO_LEARN tells the forwarding code that the SRC MAC address for this -packet should not be used in the learning bridge +packet must not be used in the learning bridge code. .It NS_INDIRECT -indicates that the packet's payload is not in the netmap -supplied buffer, but in a user-supplied buffer whose -user virtual address is in the 'ptr' field of the slot. +indicates that the packet's payload is in a user-supplied buffer, +whose user virtual address is in the 'ptr' field of the slot. The size can reach 65535 bytes. -.Em This is only supported on the transmit ring of virtual ports +.br +This is only supported on the transmit ring of +.Nm VALE +ports, and it helps reducing data copies in the interconnection +of virtual machines. .It NS_MOREFRAG indicates that the packet continues with subsequent buffers; the last buffer in a packet must have the flag clear. +.El +.Sh SCATTER GATHER I/O +Packets can span multiple slots if the +.Va NS_MOREFRAG +flag is set in all but the last slot. The maximum length of a chain is 64 buffers. -.Em This is only supported on virtual ports -.It NS_RFRAGS(slot) -on receive rings, returns the number of remaining buffers -in a packet, including this one. -Slots with a value greater than 1 also have NS_MOREFRAG set. -The length refers to the individual buffer, there is no -field for the total length. 
+This is normally used with +.Nm VALE +ports when connecting virtual machines, as they generate large +TSO segments that are not split unless they reach a physical device. .Pp -On transmit rings, if NS_DST is set, it is passed to the lookup -function, which can use it e.g. as the index of the destination -port instead of doing an address lookup. -.El +NOTE: The length field always refers to the individual +fragment; there is no place with the total length of a packet. +.Pp +On receive rings the macro +.Va NS_RFRAGS(slot) +indicates the remaining number of slots for this packet, +including the current one. +Slots with a value greater than 1 also have NS_MOREFRAG set. .Sh IOCTLS .Nm -supports some ioctl() to synchronize the state of the rings -between the kernel and the user processes, plus some -to query and configure the interface. -The former do not require any argument, whereas the latter -use a -.Pa struct nmreq -defined as follows: +uses two ioctls (NIOCTXSYNC, NIOCRXSYNC) +for non-blocking I/O. They take no argument. +Two more ioctls (NIOCGINFO, NIOCREGIF) are used +to query and configure ports, with the following argument: .Bd -literal struct nmreq { - char nr_name[IFNAMSIZ]; - uint32_t nr_version; /* API version */ -#define NETMAP_API 4 /* current version */ - uint32_t nr_offset; /* nifp offset in the shared region */ - uint32_t nr_memsize; /* size of the shared region */ - uint32_t nr_tx_slots; /* slots in tx rings */ - uint32_t nr_rx_slots; /* slots in rx rings */ - uint16_t nr_tx_rings; /* number of tx rings */ - uint16_t nr_rx_rings; /* number of tx rings */ - uint16_t nr_ringid; /* ring(s) we care about */ -#define NETMAP_HW_RING 0x4000 /* low bits indicate one hw ring */ -#define NETMAP_SW_RING 0x2000 /* we process the sw ring */ -#define NETMAP_NO_TX_POLL 0x1000 /* no gratuitous txsync on poll */ -#define NETMAP_RING_MASK 0xfff /* the actual ring number */ - uint16_t nr_cmd; -#define NETMAP_BDG_ATTACH 1 /* attach the NIC */ -#define NETMAP_BDG_DETACH 2 /* detach the NIC */ -#define NETMAP_BDG_LOOKUP_REG 3 /* register lookup function */ -#define NETMAP_BDG_LIST 4 /* get bridge's info */ - uint16_t nr_arg1; - uint16_t nr_arg2; - uint32_t spare2[3]; + char nr_name[IFNAMSIZ]; /* (i) port name */ + uint32_t nr_version; /* (i) API version */ + uint32_t nr_offset; /* (o) nifp offset in mmap region */ + uint32_t nr_memsize; /* (o) size of the mmap region */ + uint32_t nr_tx_slots; /* (o) slots in tx rings */ + uint32_t nr_rx_slots; /* (o) slots in rx rings */ + uint16_t nr_tx_rings; /* (o) number of tx rings */ + uint16_t nr_rx_rings; /* (o) number of tx rings */ + uint16_t nr_ringid; /* (i) ring(s) we care about */ + uint16_t nr_cmd; /* (i) special command */ + uint16_t nr_arg1; /* (i) extra arguments */ + uint16_t nr_arg2; /* (i) extra arguments */ + ... }; - .Ed -A device descriptor obtained through +.Pp +A file descriptor obtained through .Pa /dev/netmap -also supports the ioctl supported by network devices. +also supports the ioctl supported by network devices, see +.Xr netintro 4 . .Pp -The netmap-specific -.Xr ioctl 2 -command codes below are defined in -.In net/netmap.h -and are: .Bl -tag -width XXXX .It Dv NIOCGINFO -returns EINVAL if the named device does not support netmap. +returns EINVAL if the named port does not support netmap. Otherwise, it returns 0 and (advisory) information -about the interface. +about the port. Note that all the information below can change before the interface is actually put in netmap mode. 
.Pp -.Pa nr_memsize -indicates the size of the netmap -memory region. Physical devices all share the same memory region, -whereas VALE ports may have independent regions for each port. -These sizes can be set through system-wise sysctl variables. -.Pa nr_tx_slots, nr_rx_slots +.Bl -tag -width XX +.It Pa nr_memsize +indicates the size of the +.Nm +memory region. NICs in +.Nm +mode all share the same memory region, +whereas +.Nm VALE +ports have independent regions for each port. +.It Pa nr_tx_slots , nr_rx_slots indicate the size of transmit and receive rings. -.Pa nr_tx_rings, nr_rx_rings +.It Pa nr_tx_rings , nr_rx_rings indicate the number of transmit and receive rings. Both ring number and sizes may be configured at runtime using interface-specific functions (e.g. -.Pa sysctl -or -.Pa ethtool . +.Xr ethtool +). +.El .It Dv NIOCREGIF -puts the interface named in nr_name into netmap mode, disconnecting -it from the host stack, and/or defines which rings are controlled -through this file descriptor. +binds the port named in +.Va nr_name +to the file descriptor. For a physical device this also switches it into +.Nm +mode, disconnecting +it from the host stack. +Multiple file descriptors can be bound to the same port, +with proper synchronization left to the user. +.Pp On return, it gives the same info as NIOCGINFO, and nr_ringid indicates the identity of the rings controlled through the file descriptor. .Pp -Possible values for nr_ringid are +.Va nr_ringid +selects which rings are controlled through this file descriptor. +Possible values are: .Bl -tag -width XXXXX .It 0 -default, all hardware rings +(default) all hardware rings .It NETMAP_SW_RING -the ``host rings'' connecting to the host stack -.It NETMAP_HW_RING + i -the i-th hardware ring +the ``host rings'', connecting to the host stack. +.It NETMAP_HW_RING | i +the i-th hardware ring . .El +.Pp By default, a -.Nm poll +.Xr poll 2 or -.Nm select +.Xr select 2 call pushes out any pending packets on the transmit ring, even if no write events are specified. The feature can be disabled by or-ing -.Nm NETMAP_NO_TX_SYNC -to nr_ringid. -But normally you should keep this feature unless you are using -separate file descriptors for the send and receive rings, because -otherwise packets are pushed out only if NETMAP_TXSYNC is called, -or the send queue is full. -.Pp -.Pa NIOCREGIF -can be used multiple times to change the association of a -file descriptor to a ring pair, always within the same device. +.Va NETMAP_NO_TX_SYNC +to the value written to +.Va nr_ringid. +When this feature is used, +packets are transmitted only on +.Va ioctl(NIOCTXSYNC) +or select()/poll() are called with a write event (POLLOUT/wfdset) or a full ring. .Pp When registering a virtual interface that is dynamically created to a .Xr vale 4 @@ -467,6 +579,164 @@ number of slots available for transmissi tells the hardware of consumed packets, and asks for newly available packets. .El +.Sh SELECT AND POLL +.Xr select 2 +and +.Xr poll 2 +on a +.Nm +file descriptor process rings as indicated in +.Sx TRANSMIT RINGS +and +.Sx RECEIVE RINGS +when write (POLLOUT) and read (POLLIN) events are requested. +.Pp +Both block if no slots are available in the ring ( +.Va ring->cur == ring->tail ) +.Pp +Packets in transmit rings are normally pushed out even without +requesting write events. Passing the NETMAP_NO_TX_SYNC flag to +.Em NIOCREGIF +disables this feature. 
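A blocking receive loop following the rules in the sections above could be sketched as below; consume_packet() and the helper name are placeholders, and the headers from the earlier example plus <poll.h> are assumed.
.Bd -literal
#include <poll.h>

/* consume_packet() stands for application code. */
extern void consume_packet(const char *buf, unsigned int len);

static void
receive_loop(struct netmap_if *nifp, int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        poll(&pfd, 1, -1);           /* wakes up when tail passes cur */
        for (unsigned int r = 0; r < nifp->ni_rx_rings; r++) {
            struct netmap_ring *ring = NETMAP_RXRING(nifp, r);

            while (ring->head != ring->tail) {   /* received slots */
                struct netmap_slot *slot = &ring->slot[ring->head];
                char *buf = NETMAP_BUF(ring, slot->buf_idx);

                consume_packet(buf, slot->len);
                ring->head = ring->cur = nm_ring_next(ring, ring->head);
            }
        }
    }
}
.Ed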
+.Sh LIBRARIES +The +.Nm +API is supposed to be used directly, both because of its simplicity and +for efficient integration with applications. +.Pp +For conveniency, the +.Va <net/netmap_user.h> +header provides a few macros and functions to ease creating +a file descriptor and doing I/O with a +.Nm +port. These are loosely modeled after the +.Xr pcap 3 +API, to ease porting of libpcap-based applications to +.Nm . +To use these extra functions, programs should +.Dl #define NETMAP_WITH_LIBS +before +.Dl #include <net/netmap_user.h> +.Pp +The following functions are available: +.Bl -tag -width XXXXX +.It Va struct nm_desc_t * nm_open(const char *ifname, const char *ring_name, int flags, int ring_flags) +similar to +.Xr pcap_open , +binds a file descriptor to a port. +.Bl -tag -width XX +.It Va ifname +is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a +.Nm VALE +port. +.It Va flags +can be set to +.Va NETMAP_SW_RING +to bind to the host ring pair, +or to NETMAP_HW_RING to bind to a specific ring. +.Va ring_name +with NETMAP_HW_RING, +is interpreted as a string or an integer indicating the ring to use. +.It Va ring_flags +is copied directly into the ring flags, to specify additional parameters +such as NR_TIMESTAMP or NR_FORWARD. +.El +.It Va int nm_close(struct nm_desc_t *d) +closes the file descriptor, unmaps memory, frees resources. +.It Va int nm_inject(struct nm_desc_t *d, const void *buf, size_t size) +similar to pcap_inject(), pushes a packet to a ring, returns the size +of the packet is successful, or 0 on error; +.It Va int nm_dispatch(struct nm_desc_t *d, int cnt, nm_cb_t cb, u_char *arg) +similar to pcap_dispatch(), applies a callback to incoming packets +.It Va u_char * nm_nextpkt(struct nm_desc_t *d, struct nm_hdr_t *hdr) +similar to pcap_next(), fetches the next packet +.Pp +.El +.Sh SUPPORTED DEVICES +.Nm +natively supports the following devices: +.Pp +On FreeBSD: +.Xr em 4 , +.Xr igb 4 , +.Xr ixgbe 4 , +.Xr lem 4 , +.Xr re 4 . +.Pp +On Linux +.Xr e1000 4 , +.Xr e1000e 4 , +.Xr igb 4 , +.Xr ixgbe 4 , +.Xr mlx4 4 , +.Xr forcedeth 4 , +.Xr r8169 4 . +.Pp +NICs without native support can still be used in +.Nm +mode through emulation. Performance is inferior to native netmap +mode but still significantly higher than sockets, and approaching +that of in-kernel solutions such as Linux's +.Xr pktgen . +.Pp +Emulation is also available for devices with native netmap support, +which can be used for testing or performance comparison. +The sysctl variable +.Va dev.netmap.admode +globally controls how netmap mode is implemented. +.Sh SYSCTL VARIABLES AND MODULE PARAMETERS +Some aspect of the operation of +.Nm +are controlled through sysctl variables on FreeBSD +.Em ( dev.netmap.* ) +and module parameters on Linux +.Em ( /sys/module/netmap_lin/parameters/* ) : *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
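Tying the library functions above together, a pcap-style reader might be sketched as follows. The port name is only an example, passing NULL as the ring name is assumed to select the default binding, and the len member of struct nm_hdr_t is an assumption (check <net/netmap_user.h> for the exact field names).
.Bd -literal
#define NETMAP_WITH_LIBS        /* enable the helper functions */
#include <sys/types.h>
#include <net/netmap_user.h>

/* handle() stands for application code. */
extern void handle(const u_char *pkt, unsigned int len);

int
main(void)
{
    struct nm_desc_t *d = nm_open("netmap:em0", NULL, 0, 0);
    struct nm_hdr_t hdr;

    if (d == NULL)
        return (1);
    for (;;) {
        u_char *buf = nm_nextpkt(d, &hdr);

        if (buf == NULL)        /* nothing ready; real code would poll() */
            continue;
        handle(buf, hdr.len);   /* length field name is an assumption */
    }
    nm_close(d);                /* not reached in this sketch */
    return (0);
}
.Ed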