Date: Sat, 15 Feb 2014 08:23:32 +0000 (UTC)
From: Luigi Rizzo <luigi@FreeBSD.org>
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r261912 - head/share/man/man4
Message-ID: <201402150823.s1F8NWN4000192@svn.freebsd.org>
Author: luigi
Date: Sat Feb 15 08:23:31 2014
New Revision: 261912
URL: http://svnweb.freebsd.org/changeset/base/261912

Log:
  complete svn 261909 - new netmap version.
  since i updated the manpage i might as well commit it.

  MFC after:	3 days

Modified:
  head/share/man/man4/netmap.4

Modified: head/share/man/man4/netmap.4
==============================================================================
--- head/share/man/man4/netmap.4	Sat Feb 15 07:57:01 2014	(r261911)
+++ head/share/man/man4/netmap.4	Sat Feb 15 08:23:31 2014	(r261912)
@@ -27,7 +27,7 @@
 .\"
 .\" $FreeBSD$
 .\"
-.Dd January 4, 2014
+.Dd February 13, 2014
 .Dt NETMAP 4
 .Os
 .Sh NAME
@@ -36,6 +36,9 @@
 .br
 .Nm VALE
 .Nd a fast VirtuAl Local Ethernet using the netmap API
+.br
+.Nm netmap pipes
+.Nd a shared memory packet transport channel
 .Sh SYNOPSIS
 .Cd device netmap
 .Sh DESCRIPTION
@@ -45,38 +48,55 @@ for both userspace and kernel clients.
 It runs on FreeBSD and Linux, and includes
 .Nm VALE ,
-a very fast and modular in-kernel software switch/dataplane.
+a very fast and modular in-kernel software switch/dataplane,
+and
+.Nm netmap pipes ,
+a shared memory packet transport channel.
+All these are accessed interchangeably with the same API.
 .Pp
-.Nm
+.Nm , VALE
 and
-.Nm VALE
-are one order of magnitude faster than sockets, bpf or
-native switches based on
-.Xr tun/tap 4 ,
-reaching 14.88 Mpps with much less than one core on a 10 Gbit NIC,
-and 20 Mpps per core for VALE ports.
+.Nm netmap pipes
+are at least one order of magnitude faster than
+standard OS mechanisms
+(sockets, bpf, tun/tap interfaces, native switches, pipes),
+reaching 14.88 million packets per second (Mpps)
+with much less than one core on a 10 Gbit NIC,
+about 20 Mpps per core for VALE ports,
+and over 100 Mpps for netmap pipes.
 .Pp
 Userspace clients can dynamically switch NICs into
 .Nm
 mode and send and receive raw packets through
 memory mapped buffers.
-A selectable file descriptor supports
-synchronization and blocking I/O.
-.Pp
 Similarly,
 .Nm VALE
-can dynamically create switch instances and ports,
+switch instances and ports, and
+.Nm netmap pipes
+can be created dynamically,
 providing high speed packet I/O between processes, virtual machines,
 NICs and the host stack.
 .Pp
-For best performance,
 .Nm
-requires explicit support in device drivers;
-however, the
+supports both non-blocking I/O through
+.Xr ioctl 2 ,
+synchronization and blocking I/O through a file descriptor
+and standard OS mechanisms such as
+.Xr select 2 ,
+.Xr poll 2 ,
+.Xr epoll 2 ,
+.Xr kqueue 2 .
+.Nm VALE
+and
+.Nm netmap pipes
+are implemented by a single kernel module, which also emulates the
+.Nm
-API can be emulated on top of unmodified device drivers,
-at the price of reduced performance
-(but still better than sockets or BPF/pcap).
+API over standard drivers for devices without native
+.Nm
+support.
+For best performance,
+.Nm
+requires explicit support in device drivers.
 .Pp
 In the rest of this (long) manual page we document
 various aspects of the
@@ -114,10 +134,26 @@ mode
 use the same memory region,
 accessible to all processes who own
 .Nm /dev/netmap
 file descriptors bound to NICs.
+Independent
 .Nm VALE
-ports instead use separate memory regions.
+and
+.Nm netmap pipe
+ports
+by default use separate memory regions,
+but can be independently configured to share memory.
 .Pp
 .Sh ENTERING AND EXITING NETMAP MODE
+The following section describes the system calls to create
+and control
+.Nm netmap
+ports (including
+.Nm VALE
+and
+.Nm netmap pipe
+ports).
+Simpler, higher level functions are described in section
+.Xr LIBRARIES .
+.Pp
 Ports and rings are created and controlled through a file descriptor,
 created by opening a special device
 .Dl fd = open("/dev/netmap");
@@ -186,12 +222,11 @@ API.
 The main structures and fields are
 .Bd -literal
 struct netmap_if {
 	...
-	const uint32_t ni_flags; /* properties */
+	const uint32_t	ni_flags;	/* properties */
 	...
-	const uint32_t ni_tx_rings; /* NIC tx rings */
-	const uint32_t ni_rx_rings; /* NIC rx rings */
-	const uint32_t ni_extra_tx_rings; /* extra tx rings */
-	const uint32_t ni_extra_rx_rings; /* extra rx rings */
+	const uint32_t	ni_tx_rings;	/* NIC tx rings */
+	const uint32_t	ni_rx_rings;	/* NIC rx rings */
+	uint32_t	ni_bufs_head;	/* head of extra bufs list */
 	...
 };
 .Ed
@@ -204,11 +239,14 @@ The number of tx and rx rings normally
 depends on the hardware.
 NICs also have an extra tx/rx ring pair
 connected to the host stack.
 .Em NIOCREGIF
-can request additional tx/rx rings,
-to be used between multiple processes/threads
-accessing the same
-.Nm
-port.
+can also request additional unbound buffers in the same memory space,
+to be used as temporary storage for packets.
+.Pa ni_bufs_head
+contains the index of the first of these free buffers,
+which are connected in a list (the first uint32_t of each
+buffer being the index of the next buffer in the list).
+A 0 indicates the end of the list.
+.Pp
 .It Dv struct netmap_ring (one per ring)
 .Bd -literal
 struct netmap_ring {
@@ -221,9 +259,9 @@ struct netmap_ring {
 	const uint32_t tail;	/* (k) first buf owned by kernel */
 	...
 	uint32_t flags;
-	struct timeval ts; /* (k) time of last rxsync() */
+	struct timeval	ts;	/* (k) time of last rxsync() */
 	...
-	struct netmap_slot slot[0]; /* array of slots */
+	struct netmap_slot	slot[0];	/* array of slots */
 }
 .Ed
 .Pp
@@ -482,14 +520,16 @@ struct nmreq {
 	uint32_t	nr_version;	/* (i) API version */
 	uint32_t	nr_offset;	/* (o) nifp offset in mmap region */
 	uint32_t	nr_memsize;	/* (o) size of the mmap region */
-	uint32_t	nr_tx_slots;	/* (o) slots in tx rings */
-	uint32_t	nr_rx_slots;	/* (o) slots in rx rings */
-	uint16_t	nr_tx_rings;	/* (o) number of tx rings */
-	uint16_t	nr_rx_rings;	/* (o) number of tx rings */
-	uint16_t	nr_ringid;	/* (i) ring(s) we care about */
+	uint32_t	nr_tx_slots;	/* (i/o) slots in tx rings */
+	uint32_t	nr_rx_slots;	/* (i/o) slots in rx rings */
+	uint16_t	nr_tx_rings;	/* (i/o) number of tx rings */
+	uint16_t	nr_rx_rings;	/* (i/o) number of rx rings */
+	uint16_t	nr_ringid;	/* (i/o) ring(s) we care about */
 	uint16_t	nr_cmd;		/* (i) special command */
-	uint16_t	nr_arg1;	/* (i) extra arguments */
-	uint16_t	nr_arg2;	/* (i) extra arguments */
+	uint16_t	nr_arg1;	/* (i/o) extra arguments */
+	uint16_t	nr_arg2;	/* (i/o) extra arguments */
+	uint32_t	nr_arg3;	/* (i/o) extra arguments */
+	uint32_t	nr_flags;	/* (i/o) open mode */
 	...
 };
 .Ed
@@ -537,20 +577,59 @@ it from the host stack.
 Multiple file descriptors can be bound to the same port,
 with proper synchronization left to the user.
 .Pp
-On return, it gives the same info as NIOCGINFO, and nr_ringid
-indicates the identity of the rings controlled through the file
+.Dv NIOCREGIF can also bind a file descriptor to one endpoint of a
+.Em netmap pipe ,
+consisting of two netmap ports with a crossover connection.
+A netmap pipe shares the same memory space as the parent port,
+and is meant to enable configurations where a master process acts
+as a dispatcher towards slave processes.
+.Pp
+To enable this function, the
+.Pa nr_arg1
+field of the structure can be used as a hint to the kernel to
+indicate how many pipes we expect to use, and reserve extra space
+in the memory region.
+.Pp
+On return, it gives the same info as NIOCGINFO,
+with
+.Pa nr_ringid
+and
+.Pa nr_flags
+indicating the identity of the rings controlled through the file
 descriptor.
 .Pp
+.Va nr_flags
+and
 .Va nr_ringid
 selects which rings are controlled through this file descriptor.
-Possible values are:
+Possible values of
+.Pa nr_flags
+are indicated below, together with the naming schemes
+that application libraries (such as the
+.Nm nm_open
+indicated below) can use to indicate the specific set of rings.
+In the example below, "netmap:foo" is any valid netmap port name.
+.Pp
 .Bl -tag -width XXXXX
-.It 0
-(default) all hardware rings
-.It NETMAP_SW_RING
+.It NR_REG_ALL_NIC "netmap:foo"
+(default) all hardware ring pairs
+.It NR_REG_SW "netmap:foo^"
 the ``host rings'', connecting to the host stack.
-.It NETMAP_HW_RING | i
-the i-th hardware ring .
+.It NR_REG_NIC_SW "netmap:foo+"
+all hardware rings and the host rings
+.It NR_REG_ONE_NIC "netmap:foo-i"
+only the i-th hardware ring pair, where the number is in
+.Pa nr_ringid ;
+.It NR_REG_PIPE_MASTER "netmap:foo{i"
+the master side of the netmap pipe whose identifier (i) is in
+.Pa nr_ringid ;
+.It NR_REG_PIPE_SLAVE "netmap:foo}i"
+the slave side of the netmap pipe whose identifier (i) is in
+.Pa nr_ringid .
+.Pp
+The identifier of a pipe must be thought of as part of the pipe name,
+and does not need to be sequential. On return the pipe
+will only have a single ring pair with index 0,
+irrespective of the value of i.
 .El
 .Pp
 By default, a
@@ -579,7 +658,7 @@ number of slots available for transmissi
 tells the hardware of consumed packets,
 and asks for newly available packets.
 .El
-.Sh SELECT AND POLL
+.Sh SELECT, POLL, EPOLL, KQUEUE.
 .Xr select 2
 and
 .Xr poll 2
@@ -588,16 +667,26 @@ on a file descriptor
 process rings as indicated in
 .Sx TRANSMIT RINGS
 and
-.Sx RECEIVE RINGS
-when write (POLLOUT) and read (POLLIN) events are requested.
-.Pp
-Both block if no slots are available in the ring (
-.Va ring->cur == ring->tail )
+.Sx RECEIVE RINGS ,
+respectively when write (POLLOUT) and read (POLLIN) events are requested.
+Both block if no slots are available in the ring
+.Va ( ring->cur == ring->tail ) .
+Depending on the platform,
+.Xr epoll 2
+and
+.Xr kqueue 2
+are supported too.
 .Pp
-Packets in transmit rings are normally pushed out even without
+Packets in transmit rings are normally pushed out
+(and buffers reclaimed) even without
 requesting write events.
 Passing the NETMAP_NO_TX_SYNC flag to
 .Em NIOCREGIF
 disables this feature.
+By default, receive rings are processed only if read
+events are requested.
+Passing the NETMAP_DO_RX_SYNC flag to
+.Em NIOCREGIF
+updates receive rings even without read events.
+Note that on epoll and kqueue, NETMAP_NO_TX_SYNC and NETMAP_DO_RX_SYNC
+only have an effect when some event is posted for the file descriptor.
 .Sh LIBRARIES
 The
 .Nm
@@ -620,7 +709,7 @@ before
 .Pp
 The following functions are available:
 .Bl -tag -width XXXXX
-.It Va struct nm_desc_t * nm_open(const char *ifname, const char *ring_name, int flags, int ring_flags)
+.It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
 similar to
 .Xr pcap_open ,
 binds a file descriptor to a port.
@@ -629,26 +718,36 @@ binds a file descriptor to a port.
 .It Va ifname
 is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
 .Nm VALE
 port.
+.It Va req
+provides the initial values for the argument to the NIOCREGIF ioctl.
+The nm_flags and nm_ringid values are overwritten by parsing
+ifname and flags, and other fields can be overridden through
+the other two arguments.
+.It Va arg
+points to a struct nm_desc containing arguments (e.g. from a previously
+open file descriptor) that should override the defaults.
+The fields are used as described below.
 .It Va flags
-can be set to
-.Va NETMAP_SW_RING
-to bind to the host ring pair,
-or to NETMAP_HW_RING to bind to a specific ring.
-.Va ring_name
-with NETMAP_HW_RING,
-is interpreted as a string or an integer indicating the ring to use.
-.It Va ring_flags
-is copied directly into the ring flags, to specify additional parameters
-such as NR_TIMESTAMP or NR_FORWARD.
+can be set to a combination of the following flags:
+.Va NETMAP_NO_TX_POLL ,
+.Va NETMAP_DO_RX_POLL
+(copied into nr_ringid);
+.Va NM_OPEN_NO_MMAP (if arg points to the same memory region,
+avoids the mmap and uses the values from it);
+.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg);
+.Va NM_OPEN_ARG1 ,
+.Va NM_OPEN_ARG2 ,
+.Va NM_OPEN_ARG3 (uses the fields from arg);
+.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg).
 .El
-.It Va int nm_close(struct nm_desc_t *d)
+.It Va int nm_close(struct nm_desc *d)
 closes the file descriptor, unmaps memory, frees resources.
-.It Va int nm_inject(struct nm_desc_t *d, const void *buf, size_t size)
+.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
 similar to pcap_inject(), pushes a packet to a ring, returns the size
 of the packet if successful, or 0 on error;
-.It Va int nm_dispatch(struct nm_desc_t *d, int cnt, nm_cb_t cb, u_char *arg)
+.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
 similar to pcap_dispatch(), applies a callback to incoming packets
-.It Va u_char * nm_nextpkt(struct nm_desc_t *d, struct nm_hdr_t *hdr)
+.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
 similar to pcap_next(), fetches the next packet
 .Pp
 .El
@@ -740,9 +839,11 @@ performance.
 .Sh SYSTEM CALLS
 .Nm
 uses
-.Xr select 2
+.Xr select 2 ,
+.Xr poll 2 ,
+.Xr epoll
 and
-.Xr poll 2
+.Xr kqueue
 to wake up processes when significant events occur, and
 .Xr mmap 2
 to map memory.
@@ -872,10 +973,10 @@ A simple receiver can be implemented usi
 ...
 void receiver(void)
 {
-	struct nm_desc_t *d;
+	struct nm_desc *d;
 	struct pollfd fds;
 	u_char *buf;
-	struct nm_hdr_t h;
+	struct nm_pkthdr h;
 	...
 	d = nm_open("netmap:ix0", NULL, 0, 0);
 	fds.fd = NETMAP_FD(d);
@@ -910,6 +1011,13 @@ to replenish the receive ring:
 	...
 .Ed
 .Ss ACCESSING THE HOST STACK
+The host stack is for all practical purposes just a regular ring pair,
+which you can access with the netmap API (e.g. with
+.Dl nm_open("netmap:eth0^", ... ) ;
+All packets that the host would send to an interface in
+.Nm
+mode end up in the RX ring, whereas all packets queued to the
+TX ring are sent up to the host stack.
 .Ss VALE SWITCH
 A simple way to test the performance of a
 .Nm VALE
@@ -917,6 +1025,10 @@ switch is to attach a sender and a recei
 e.g. running the following in two different terminals:
 .Dl pkt-gen -i vale1:a -f rx	# receiver
 .Dl pkt-gen -i vale1:b -f tx	# sender
+The same example can be used to test netmap pipes, by simply
+changing port names, e.g.
+.Dl pkt-gen -i vale:x{3 -f rx	# receiver on the master side
+.Dl pkt-gen -i vale:x}3 -f tx	# sender on the slave side
 .Pp
 The following command attaches an interface and the host stack
 to a switch:
@@ -935,6 +1047,14 @@ Communications of the ACM, 55 (3), pp.45
 .Pp
 Luigi Rizzo, netmap: a novel framework for fast packet I/O,
 Usenix ATC'12, June 2012, Boston
+.Pp
+Luigi Rizzo, Giuseppe Lettieri,
+VALE, a switched ethernet for virtual machines,
+ACM CoNEXT'12, December 2012, Nice
+.Pp
+Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
+Speeding up packet I/O in virtual machines,
+ACM/IEEE ANCS'13, October 2013, San Jose
 .Sh AUTHORS
 .An -nosplit
 The
@@ -953,20 +1073,3 @@ and
 .Nm VALE
 have been funded by the European Commission
 within FP7 Projects CHANGE (257422) and OPENLAB (287581).
-.Pp
-.Ss SPECIAL MODES
-When the device name has the form
-.Dl valeXXX:ifname (ifname is an existing interface)
-the physical interface
-(and optionally the corrisponding host stack endpoint)
-are connected or disconnected from the
-.Nm VALE
-switch named XXX.
-In this case the
-.Pa ioctl()
-is only used only for configuration, typically through the
-.Xr vale-ctl
-command.
-The file descriptor cannot be used for I/O, and should be
-closed after issuing the
-.Pa ioctl() .