From owner-svn-src-all@freebsd.org Tue Dec 4 17:53:58 2018 Message-Id: <201812041753.wB4HruP1062960@repo.freebsd.org> From: Vincenzo Maffione Date: Tue, 4 Dec 2018 17:53:56 +0000 (UTC) To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-11@freebsd.org Subject: svn commit: r341482 - stable/11/share/man/man4 X-SVN-Group: stable-11 X-SVN-Commit-Author: vmaffione X-SVN-Commit-Paths: stable/11/share/man/man4 X-SVN-Commit-Revision: 341482 X-SVN-Commit-Repository: base MIME-Version: 1.0 Content-Type: text/plain; 
charset=UTF-8 Content-Transfer-Encoding: 8bit Author: vmaffione Date: Tue Dec 4 17:53:56 2018 New Revision: 341482 URL: https://svnweb.freebsd.org/changeset/base/341482 Log: MFC r341430 netmap(4): improve man page Reviewed by: bcr Differential Revision: https://reviews.freebsd.org/D18057 Modified: stable/11/share/man/man4/netmap.4 Directory Properties: stable/11/ (props changed) Modified: stable/11/share/man/man4/netmap.4 ============================================================================== --- stable/11/share/man/man4/netmap.4 Tue Dec 4 17:49:44 2018 (r341481) +++ stable/11/share/man/man4/netmap.4 Tue Dec 4 17:53:56 2018 (r341482) @@ -27,45 +27,60 @@ .\" .\" $FreeBSD$ .\" -.Dd October 28, 2018 +.Dd November 20, 2018 .Dt NETMAP 4 .Os .Sh NAME .Nm netmap .Nd a framework for fast packet I/O -.Pp -.Nm VALE -.Nd a fast VirtuAl Local Ethernet using the netmap API -.Pp -.Nm netmap pipes -.Nd a shared memory packet transport channel .Sh SYNOPSIS .Cd device netmap .Sh DESCRIPTION .Nm is a framework for extremely fast and efficient packet I/O -for both userspace and kernel clients. +for userspace and kernel clients, and for Virtual Machines. It runs on .Fx -and Linux, and includes -.Nm VALE , -a very fast and modular in-kernel software switch/dataplane, -and -.Nm netmap pipes , -a shared memory packet transport channel. 
-All these are accessed interchangeably with the same API. +Linux and some versions of Windows, and supports a variety of +.Nm netmap ports , +including +.Bl -tag -width XXXX +.It Nm physical NIC ports +to access individual queues of network interfaces; +.It Nm host ports +to inject packets into the host stack; +.It Nm VALE ports +implementing a very fast and modular in-kernel software switch/dataplane; +.It Nm netmap pipes +a shared memory packet transport channel; +.It Nm netmap monitors +a mechanism similar to +.Xr bpf 4 +to capture traffic +.El .Pp -.Nm , -.Nm VALE -and -.Nm netmap pipes -are at least one order of magnitude faster than +All these +.Nm netmap ports +are accessed interchangeably with the same API, +and are at least one order of magnitude faster than standard OS mechanisms -(sockets, bpf, tun/tap interfaces, native switches, pipes), -reaching 14.88 million packets per second (Mpps) -with much less than one core on a 10 Gbit NIC, -about 20 Mpps per core for VALE ports, -and over 100 Mpps for netmap pipes. +(sockets, bpf, tun/tap interfaces, native switches, pipes). +With suitably fast hardware (NICs, PCIe buses, CPUs), +packet I/O using +.Nm +on supported NICs +reaches 14.88 million packets per second (Mpps) +with much less than one core on 10 Gbit/s NICs; +35-40 Mpps on 40 Gbit/s NICs (limited by the hardware); +about 20 Mpps per core for VALE ports; +and over 100 Mpps for +.Nm netmap pipes . +NICs without native +.Nm +support can still use the API in emulated mode, +which uses unmodified device drivers and is 3-5 times faster than +.Xr bpf 4 +or raw sockets. .Pp Userspace clients can dynamically switch NICs into .Nm @@ -73,8 +88,10 @@ mode and send and receive raw packets through memory mapped buffers. Similarly, .Nm VALE -switch instances and ports, and +switch instances and ports, .Nm netmap pipes +and +.Nm netmap monitors can be created dynamically, providing high speed packet I/O between processes, virtual machines, NICs and the host stack. 
@@ -86,20 +103,20 @@ synchronization and blocking I/O through a file descri and standard OS mechanisms such as .Xr select 2 , .Xr poll 2 , -.Xr epoll 2 , +.Xr kqueue 2 and -.Xr kqueue 2 . -.Nm VALE -and -.Nm netmap pipes +.Xr epoll 7 . +All types of +.Nm netmap ports +and the +.Nm VALE switch are implemented by a single kernel module, which also emulates the .Nm -API over standard drivers for devices without native -.Nm -support. +API over standard drivers. For best performance, .Nm -requires explicit support in device drivers. +requires native support in device drivers. +A list of such devices is at the end of this document. .Pp In the rest of this (long) manual page we document various aspects of the @@ -116,7 +133,7 @@ which can be connected to a physical interface to the host stack, or to a .Nm VALE -switch). +switch. Ports use preallocated circular queues of buffers .Em ( rings ) residing in an mmapped region. @@ -152,8 +169,9 @@ ports (including and .Nm netmap pipe ports). -Simpler, higher level functions are described in section -.Xr LIBRARIES . +Simpler, higher level functions are described in the +.Sx LIBRARIES +section. .Pp Ports and rings are created and controlled through a file descriptor, created by opening a special device @@ -166,16 +184,18 @@ has multiple modes of operation controlled by the .Vt struct nmreq argument. .Va arg.nr_name -specifies the port name, as follows: +specifies the netmap port name, as follows: .Bl -tag -width XXXX -.It Dv OS network interface name (e.g. 'em0', 'eth1', ... ) +.It Dv OS network interface name (e.g., 'em0', 'eth1', ... ) the data path of the NIC is disconnected from the host stack, and the file descriptor is bound to the NIC (one or all queues), or to the host stack; -.It Dv valeXXX:YYY (arbitrary XXX and YYY) -the file descriptor is bound to port YYY of a VALE switch called XXX, -both dynamically created if necessary. 
-The string cannot exceed IFNAMSIZ characters, and YYY cannot +.It Dv valeSSS:PPP +the file descriptor is bound to port PPP of VALE switch SSS. +Switch instances and ports are dynamically created if necessary. +.Pp +Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string +cannot exceed IFNAMSIZ characters, and PPP cannot be the name of any existing OS network interface. .El .Pp @@ -193,12 +213,6 @@ Non-blocking I/O is done with special and .Xr poll 2 on the file descriptor permit blocking I/O. -.Xr epoll 2 -and -.Xr kqueue 2 -are not supported on -.Nm -file descriptors. .Pp While a NIC is in .Nm @@ -219,7 +233,7 @@ which is the ultimate reference for the API. The main structures and fields are indicated below: .Bl -tag -width XXX -.It Dv struct netmap_if (one per interface) +.It Dv struct netmap_if (one per interface ) .Bd -literal struct netmap_if { ... @@ -242,14 +256,30 @@ NICs also have an extra tx/rx ring pair connected to t .Em NIOCREGIF can also request additional unbound buffers in the same memory space, to be used as temporary storage for packets. +The number of extra +buffers is specified in the +.Va arg.nr_arg3 +field. +On success, the kernel writes back to +.Va arg.nr_arg3 +the number of extra buffers actually allocated (they may be less +than the amount requested if the memory space ran out of buffers). .Pa ni_bufs_head -contains the index of the first of these free rings, +contains the index of the first of these extra buffers, which are connected in a list (the first uint32_t of each buffer being the index of the next buffer in the list). A .Dv 0 indicates the end of the list. -.It Dv struct netmap_ring (one per ring) +The application is free to modify +this list and use the buffers (i.e., binding them to the slots of a +netmap ring). +When closing the netmap file descriptor, +the kernel frees the buffers contained in the list pointed by +.Pa ni_bufs_head +, irrespectively of the buffers originally provided by the kernel on +.Em NIOCREGIF . 
+.It Dv struct netmap_ring (one per ring ) .Bd -literal struct netmap_ring { ... @@ -271,7 +301,7 @@ Implements transmit and receive rings, with read/write pointers, metadata and an array of .Em slots describing the buffers. -.It Dv struct netmap_slot (one per buffer) +.It Dv struct netmap_slot (one per buffer ) .Bd -literal struct netmap_slot { uint32_t buf_idx; /* buffer index */ @@ -312,20 +342,17 @@ one slot is always kept empty. The ring size .Va ( num_slots ) should not be assumed to be a power of two. -.br -(NOTE: older versions of netmap used head/count format to indicate -the content of a ring). .Pp .Va head is the first slot available to userspace; -.br +.Pp .Va cur is the wakeup point: select/poll will unblock when .Va tail passes .Va cur ; -.br +.Pp .Va tail is the first slot reserved to the kernel. .Pp @@ -349,7 +376,6 @@ during the execution of a netmap-related system call. The only exception are slots (and buffers) in the range .Va tail\ . . . head-1 , that are explicitly assigned to the kernel. -.Pp .Ss TRANSMIT RINGS On transmit rings, after a .Nm @@ -397,7 +423,7 @@ Below is an example of the evolution of a TX ring: .Fn select and .Fn poll -will block if there is no space in the ring, i.e. +will block if there is no space in the ring, i.e., .Dl ring->cur == ring->tail and return when new slots have become available. .Pp @@ -431,7 +457,7 @@ slots up to are returned to the kernel for further receives, and .Va tail may advance to report new incoming packets. -.br +.Pp Below is an example of the evolution of an RX ring: .Bd -literal after the syscall, there are some (h)eld and some (R)eceived slots @@ -476,10 +502,9 @@ can be delayed indefinitely. This flag helps detect when packets have been sent and a file descriptor can be closed. 
.It NS_FORWARD -When a ring is in 'transparent' mode (see -.Sx TRANSPARENT MODE ) , -packets marked with this flag are forwarded to the other endpoint -at the next system call, thus restoring (in a selective way) +When a ring is in 'transparent' mode, +packets marked with this flag by the user application are forwarded to the +other endpoint at the next system call, thus restoring (in a selective way) the connection between a NIC and the host stack. .It NS_NO_LEARN tells the forwarding code that the source MAC address for this @@ -488,7 +513,7 @@ packet must not be used in the learning bridge code. indicates that the packet's payload is in a user-supplied buffer whose user virtual address is in the 'ptr' field of the slot. The size can reach 65535 bytes. -.br +.Pp This is only supported on the transmit ring of .Nm VALE ports, and it helps reducing data copies in the interconnection @@ -570,8 +595,8 @@ indicate the size of transmit and receive rings. indicate the number of transmit and receive rings. Both ring number and sizes may be configured at runtime -using interface-specific functions (e.g. -.Xr ethtool +using interface-specific functions (e.g., +.Xr ethtool 8 ). .El .It Dv NIOCREGIF @@ -585,6 +610,15 @@ it from the host stack. Multiple file descriptors can be bound to the same port, with proper synchronization left to the user. .Pp +The recommended way to bind a file descriptor to a port is +to use function +.Va nm_open(..) +(see +.Sx LIBRARIES ) +which parses names to access specific port types and +enable features. +In the following we document the main features. +.Pp .Dv NIOCREGIF can also bind a file descriptor to one endpoint of a .Em netmap pipe , consisting of two netmap ports with a crossover connection. @@ -638,7 +672,7 @@ and does not need to be sequential. On return the pipe will only have a single ring pair with index 0, irrespective of the value of -.Va i. +.Va i . .El .Pp By default, a @@ -650,11 +684,14 @@ no write events are specified. 
The feature can be disabled by or-ing .Va NETMAP_NO_TX_POLL to the value written to -.Va nr_ringid. +.Va nr_ringid . When this feature is used, packets are transmitted only on .Va ioctl(NIOCTXSYNC) -or select()/poll() are called with a write event (POLLOUT/wfdset) or a full ring. +or +.Va select() / +.Va poll() +are called with a write event (POLLOUT/wfdset) or a full ring. .Pp When registering a virtual interface that is dynamically created to a .Xr vale 4 @@ -667,7 +704,7 @@ number of slots available for transmission. tells the hardware of consumed packets, and asks for newly available packets. .El -.Sh SELECT, POLL, EPOLL, KQUEUE. +.Sh SELECT, POLL, EPOLL, KQUEUE .Xr select 2 and .Xr poll 2 @@ -681,7 +718,7 @@ respectively when write (POLLOUT) and read (POLLIN) ev Both block if no slots are available in the ring .Va ( ring->cur == ring->tail ) . Depending on the platform, -.Xr epoll 2 +.Xr epoll 7 and .Xr kqueue 2 are supported too. @@ -700,7 +737,10 @@ Passing the .Dv NETMAP_DO_RX_POLL flag to .Em NIOCREGIF updates receive rings even without read events. -Note that on epoll and kqueue, +Note that on +.Xr epoll 7 +and +.Xr kqueue 2 , .Dv NETMAP_NO_TX_POLL and .Dv NETMAP_DO_RX_POLL @@ -728,13 +768,13 @@ before .Pp The following functions are available: .Bl -tag -width XXXXX -.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg) +.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg ) similar to -.Xr pcap_open , +.Xr pcap_open_live 3 , binds a file descriptor to a port. .Bl -tag -width XX .It Va ifname -is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a +is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a .Nm VALE port. .It Va req @@ -743,7 +783,7 @@ The nm_flags and nm_ringid values are overwritten by p ifname and flags, and other fields can be overridden through the other two arguments. 
.It Va arg -points to a struct nm_desc containing arguments (e.g. from a previously +points to a struct nm_desc containing arguments (e.g., from a previously open file descriptor) that should override the defaults. The fields are used as described below .It Va flags @@ -751,52 +791,70 @@ can be set to a combination of the following flags: .Va NETMAP_NO_TX_POLL , .Va NETMAP_DO_RX_POLL (copied into nr_ringid); -.Va NM_OPEN_NO_MMAP (if arg points to the same memory region, +.Va NM_OPEN_NO_MMAP +(if arg points to the same memory region, avoids the mmap and uses the values from it); -.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg); +.Va NM_OPEN_IFNAME +(ignores ifname and uses the values in arg); .Va NM_OPEN_ARG1 , .Va NM_OPEN_ARG2 , -.Va NM_OPEN_ARG3 (uses the fields from arg); -.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg). +.Va NM_OPEN_ARG3 +(uses the fields from arg); +.Va NM_OPEN_RING_CFG +(uses the ring number and sizes from arg). .El -.It Va int nm_close(struct nm_desc *d) +.It Va int nm_close(struct nm_desc *d ) closes the file descriptor, unmaps memory, frees resources. 
-.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size) -similar to pcap_inject(), pushes a packet to a ring, returns the size +.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size ) +similar to +.Va pcap_inject() , +pushes a packet to a ring, returns the size of the packet is successful, or 0 on error; -.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg) -similar to pcap_dispatch(), applies a callback to incoming packets -.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr) -similar to pcap_next(), fetches the next packet +.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg ) +similar to +.Va pcap_dispatch() , +applies a callback to incoming packets +.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr ) +similar to +.Va pcap_next() , +fetches the next packet .El .Sh SUPPORTED DEVICES .Nm natively supports the following devices: .Pp -On FreeBSD: +On +.Fx : +.Xr cxgbe 4 , .Xr em 4 , -.Xr igb 4 , +.Xr iflib 4 +(providing igb, em and lem), .Xr ixgbe 4 , -.Xr lem 4 , -.Xr re 4 . +.Xr ixl 4 , +.Xr re 4 , +.Xr vtnet 4 . .Pp -On Linux -.Xr e1000 4 , -.Xr e1000e 4 , -.Xr igb 4 , -.Xr ixgbe 4 , -.Xr mlx4 4 , -.Xr forcedeth 4 , -.Xr r8169 4 . +On Linux e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3. .Pp NICs without native support can still be used in .Nm mode through emulation. Performance is inferior to native netmap -mode but still significantly higher than sockets, and approaching -that of in-kernel solutions such as Linux's -.Xr pktgen . +mode but still significantly higher than various raw socket types +(bpf, PF_PACKET, etc.). +Note that for slow devices (such as 1 Gbit/s and slower NICs, +or several 10 Gbit/s NICs whose hardware is unable to sustain line rate), +emulated and native mode will likely have similar or same throughput. 
.Pp +When emulation is in use, packet sniffer programs such as tcpdump +could see received packets before they are diverted by netmap. +This behaviour is not intentional, being just an artifact of the implementation +of emulation. +Note that in case the netmap application subsequently moves packets received +from the emulated adapter onto the host RX ring, the sniffer will intercept +those packets again, since the packets are injected to the host stack as they +were received by the network interface. +.Pp Emulation is also available for devices with native netmap support, which can be used for testing or performance comparison. The sysctl variable @@ -805,15 +863,22 @@ globally controls how netmap mode is implemented. .Sh SYSCTL VARIABLES AND MODULE PARAMETERS Some aspect of the operation of .Nm -are controlled through sysctl variables on FreeBSD +are controlled through sysctl variables on +.Fx .Em ( dev.netmap.* ) and module parameters on Linux -.Em ( /sys/module/netmap_lin/parameters/* ) : +.Em ( /sys/module/netmap/parameters/* ) : .Bl -tag -width indent .It Va dev.netmap.admode: 0 Controls the use of native or emulated adapter mode. -0 uses the best available option, 1 forces native and -fails if not available, 2 forces emulated hence never fails. +.Pp +0 uses the best available option; +.Pp +1 forces native mode and fails if not available; +.Pp +2 forces emulated hence never fails. +.It Va dev.netmap.generic_rings: 1 +Number of rings used for emulated netmap mode .It Va dev.netmap.generic_ringsize: 1024 Ring size used for emulated netmap mode .It Va dev.netmap.generic_mit: 100000 @@ -855,15 +920,17 @@ Batch size used when moving packets across a switch. Values above 64 generally guarantee good performance. 
+.It Va dev.netmap.ptnet_vnet_hdr: 1 +Allow ptnet devices to use virtio-net headers .El .Sh SYSTEM CALLS .Nm uses .Xr select 2 , .Xr poll 2 , -.Xr epoll +.Xr epoll 7 and -.Xr kqueue +.Xr kqueue 2 to wake up processes when significant events occur, and .Xr mmap 2 to map memory. @@ -893,7 +960,7 @@ directory in .Fx distributions. .Pp -.Xr pkt-gen +.Xr pkt-gen 8 is a general purpose traffic source/sink. .Pp As an example @@ -904,11 +971,11 @@ is a traffic sink. Both print traffic statistics, to help monitor how the system performs. .Pp -.Xr pkt-gen +.Xr pkt-gen 8 has many options can be uses to set packet sizes, addresses, rates, and use multiple send/receive threads and cores. .Pp -.Xr bridge +.Xr bridge 4 is another test program which interconnects two .Nm ports. @@ -1000,7 +1067,7 @@ to replenish the receive ring: .Ed .Ss ACCESSING THE HOST STACK The host stack is for all practical purposes just a regular ring pair, -which you can access with the netmap API (e.g. with +which you can access with the netmap API (e.g., with .Dl nm_open("netmap:eth0^", ... ) ; All packets that the host would send to an interface in .Nm @@ -1010,13 +1077,13 @@ TX ring are send up to the host stack. A simple way to test the performance of a .Nm VALE switch is to attach a sender and a receiver to it, -e.g. running the following in two different terminals: +e.g., running the following in two different terminals: .Dl pkt-gen -i vale1:a -f rx # receiver .Dl pkt-gen -i vale1:b -f tx # sender The same example can be used to test netmap pipes, by simply -changing port names, e.g. -.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side -.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side +changing port names, e.g., +.Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side +.Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side .Pp The following command attaches an interface and the host stack to a switch: @@ -1030,6 +1097,7 @@ with the network card or the host. 
.Xr vale-ctl 4 , .Xr bridge 8 , .Xr lb 8 , +.Xr nmreplay 8 , .Xr pkt-gen 8 .Pp .Pa http://info.iet.unipi.it/~luigi/netmap/ @@ -1088,7 +1156,7 @@ multiqueue, schedulers, packet filters. Multiple transmit and receive rings are supported natively and can be configured with ordinary OS tools, such as -.Xr ethtool +.Xr ethtool 8 or device-specific sysctl variables. The same goes for Receive Packet Steering (RPS)