Date:      Sat, 15 Feb 2014 08:23:32 +0000 (UTC)
From:      Luigi Rizzo <luigi@FreeBSD.org>
To:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   svn commit: r261912 - head/share/man/man4
Message-ID:  <201402150823.s1F8NWN4000192@svn.freebsd.org>

Author: luigi
Date: Sat Feb 15 08:23:31 2014
New Revision: 261912
URL: http://svnweb.freebsd.org/changeset/base/261912

Log:
  complete svn 261909 - new netmap version.
  since i updated the manpage i might as well commit it.
  
  MFC after:	3 days

Modified:
  head/share/man/man4/netmap.4

Modified: head/share/man/man4/netmap.4
==============================================================================
--- head/share/man/man4/netmap.4	Sat Feb 15 07:57:01 2014	(r261911)
+++ head/share/man/man4/netmap.4	Sat Feb 15 08:23:31 2014	(r261912)
@@ -27,7 +27,7 @@
 .\"
 .\" $FreeBSD$
 .\"
-.Dd January 4, 2014
+.Dd February 13, 2014
 .Dt NETMAP 4
 .Os
 .Sh NAME
@@ -36,6 +36,9 @@
 .br
 .Nm VALE
 .Nd a fast VirtuAl Local Ethernet using the netmap API
+.br
+.Nm netmap pipes
+.Nd a shared memory packet transport channel
 .Sh SYNOPSIS
 .Cd device netmap
 .Sh DESCRIPTION
@@ -45,38 +48,55 @@ for both userspace and kernel clients.
 It runs on FreeBSD and Linux,
 and includes
 .Nm VALE ,
-a very fast and modular in-kernel software switch/dataplane.
+a very fast and modular in-kernel software switch/dataplane,
+and
+.Nm netmap pipes ,
+a shared memory packet transport channel.
+All these are accessed interchangeably with the same API.
 .Pp
-.Nm
+.Nm , VALE
 and
-.Nm VALE
-are one order of magnitude faster than sockets, bpf or
-native switches based on
-.Xr tun/tap 4 ,
-reaching 14.88 Mpps with much less than one core on a 10 Gbit NIC,
-and 20 Mpps per core for VALE ports.
+.Nm netmap pipes
+are at least one order of magnitude faster than
+standard OS mechanisms
+(sockets, bpf, tun/tap interfaces, native switches, pipes),
+reaching 14.88 million packets per second (Mpps)
+with much less than one core on a 10 Gbit NIC,
+about 20 Mpps per core for VALE ports,
+and over 100 Mpps for netmap pipes.
 .Pp
 Userspace clients can dynamically switch NICs into
 .Nm
 mode and send and receive raw packets through
 memory mapped buffers.
-A selectable file descriptor supports
-synchronization and blocking I/O.
-.Pp
 Similarly,
 .Nm VALE
-can dynamically create switch instances and ports,
+switch instances and ports, and
+.Nm netmap pipes
+can be created dynamically,
 providing high speed packet I/O between processes,
 virtual machines, NICs and the host stack.
 .Pp
-For best performance,
 .Nm
-requires explicit support in device drivers;
-however, the
+supports both non-blocking I/O through
+.Xr ioctls() ,
+synchronization and blocking I/O through a file descriptor
+and standard OS mechanisms such as
+.Xr select 2 ,
+.Xr poll 2 ,
+.Xr epoll 2 ,
+.Xr kqueue 2 .
+.Nm VALE
+and
+.Nm netmap pipes
+are implemented by a single kernel module, which also emulates the
 .Nm
-API can be emulated on top of unmodified device drivers,
-at the price of reduced performance
-(but still better than sockets or BPF/pcap).
+API over standard drivers for devices without native
+.Nm
+support.
+For best performance,
+.Nm
+requires explicit support in device drivers.
 .Pp
 In the rest of this (long) manual page we document
 various aspects of the
@@ -114,10 +134,26 @@ mode use the same memory region,
 accessible to all processes who own
 .Nm /dev/netmap
 file descriptors bound to NICs.
+Independent
 .Nm VALE
-ports instead use separate memory regions.
+and
+.Nm netmap pipe
+ports
+by default use separate memory regions,
+but can be independently configured to share memory.
 .Pp
 .Sh ENTERING AND EXITING NETMAP MODE
+The following section describes the system calls to create
+and control
+.Nm netmap 
+ports (including
+.Nm VALE
+and
+.Nm netmap pipe
+ports).
+Simpler, higher level functions are described in section
+.Sx LIBRARIES .
+.Pp
 Ports and rings are created and controlled through a file descriptor,
 created by opening a special device
 .Dl fd = open("/dev/netmap");
@@ -186,12 +222,11 @@ API. The main structures and fields are 
 .Bd -literal
 struct netmap_if {
     ...
-    const uint32_t   ni_flags;          /* properties     */
+    const uint32_t   ni_flags;      /* properties              */
     ...
-    const uint32_t   ni_tx_rings;       /* NIC tx rings   */
-    const uint32_t   ni_rx_rings;       /* NIC rx rings   */
-    const uint32_t   ni_extra_tx_rings; /* extra tx rings */
-    const uint32_t   ni_extra_rx_rings; /* extra rx rings */
+    const uint32_t   ni_tx_rings;   /* NIC tx rings            */
+    const uint32_t   ni_rx_rings;   /* NIC rx rings            */
+    uint32_t         ni_bufs_head;  /* head of extra bufs list */
     ...
 };
 .Ed
@@ -204,11 +239,14 @@ The number of tx and rx rings
 normally depends on the hardware.
 NICs also have an extra tx/rx ring pair connected to the host stack.
 .Em NIOCREGIF
-can request additional tx/rx rings,
-to be used between multiple processes/threads
-accessing the same
-.Nm
-port.
+can also request additional unbound buffers in the same memory space,
+to be used as temporary storage for packets.
+.Pa ni_bufs_head
+contains the index of the first of these free buffers,
+which are connected in a list (the first uint32_t of each
+buffer being the index of the next buffer in the list).
+A 0 indicates the end of the list.
+.Pp
 .It Dv struct netmap_ring (one per ring)
 .Bd -literal
 struct netmap_ring {
@@ -221,9 +259,9 @@ struct netmap_ring {
     const uint32_t tail;        /* (k) first buf owned by kernel */
     ...
     uint32_t       flags;
-    struct timeval ts;          /* (k) time of last rxsync()      */
+    struct timeval ts;          /* (k) time of last rxsync()     */
     ...
-    struct netmap_slot slot[0]; /* array of slots                 */
+    struct netmap_slot slot[0]; /* array of slots                */
 }
 .Ed
 .Pp
@@ -482,14 +520,16 @@ struct nmreq {
     uint32_t  nr_version;        /* (i) API version                */
     uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
     uint32_t  nr_memsize;        /* (o) size of the mmap region    */
-    uint32_t  nr_tx_slots;       /* (o) slots in tx rings          */
-    uint32_t  nr_rx_slots;       /* (o) slots in rx rings          */
-    uint16_t  nr_tx_rings;       /* (o) number of tx rings         */
-    uint16_t  nr_rx_rings;       /* (o) number of tx rings         */
-    uint16_t  nr_ringid;         /* (i) ring(s) we care about      */
+    uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
+    uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
+    uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
+    uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
+    uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
     uint16_t  nr_cmd;            /* (i) special command            */
-    uint16_t  nr_arg1;           /* (i) extra arguments            */
-    uint16_t  nr_arg2;           /* (i) extra arguments            */
+    uint16_t  nr_arg1;           /* (i/o) extra arguments          */
+    uint16_t  nr_arg2;           /* (i/o) extra arguments          */
+    uint32_t  nr_arg3;           /* (i/o) extra arguments          */
+    uint32_t  nr_flags;          /* (i/o) open mode                */
     ...
 };
 .Ed
@@ -537,20 +577,59 @@ it from the host stack.
 Multiple file descriptors can be bound to the same port,
 with proper synchronization left to the user.
 .Pp
-On return, it gives the same info as NIOCGINFO, and nr_ringid
-indicates the identity of the rings controlled through the file
+.Dv NIOCREGIF can also bind a file descriptor to one endpoint of a
+.Em netmap pipe ,
+consisting of two netmap ports with a crossover connection.
+A netmap pipe shares the same memory space as the parent port,
+and is meant to enable configurations where a master process acts
+as a dispatcher towards slave processes.
+.Pp
+To enable this function, the
+.Pa nr_arg1
+field of the structure can be used as a hint to the kernel to
+indicate how many pipes we expect to use, and reserve extra space
+in the memory region.
+.Pp
+On return, it gives the same info as NIOCGINFO,
+with
+.Pa nr_ringid
+and
+.Pa nr_flags
+indicating the identity of the rings controlled through the file
 descriptor.
 .Pp
+.Va nr_flags
 .Va nr_ringid
 selects which rings are controlled through this file descriptor.
-Possible values are:
+Possible values of
+.Pa nr_flags
+are indicated below, together with the naming schemes
+that application libraries (such as the
+.Nm nm_open
+indicated below) can use to indicate the specific set of rings.
+In the example below, "netmap:foo" is any valid netmap port name.
+.Pp
 .Bl -tag -width XXXXX
-.It 0
-(default) all hardware rings
-.It NETMAP_SW_RING
+.It NR_REG_ALL_NIC                         "netmap:foo"
+(default) all hardware ring pairs
+.It NR_REG_SW_NIC           "netmap:foo^"
 the ``host rings'', connecting to the host stack.
-.It NETMAP_HW_RING | i
-the i-th hardware ring .
+.It NR_REG_NIC_SW         "netmap:foo+"
+all hardware rings and the host rings
+.It NR_REG_ONE_NIC       "netmap:foo-i"
+only the i-th hardware ring pair, where the number is in
+.Pa nr_ringid ;
+.It NR_REG_PIPE_MASTER  "netmap:foo{i"
+the master side of the netmap pipe whose identifier (i) is in
+.Pa nr_ringid ;
+.It NR_REG_PIPE_SLAVE   "netmap:foo}i"
+the slave side of the netmap pipe whose identifier (i) is in
+.Pa nr_ringid .
+.Pp
+The identifier of a pipe must be thought of as part of the pipe name,
+and does not need to be sequential. On return the pipe
+will only have a single ring pair with index 0,
+irrespective of the value of i.
 .El
 .Pp
 By default, a
@@ -579,7 +658,7 @@ number of slots available for transmissi
 tells the hardware of consumed packets, and asks for newly available
 packets.
 .El
-.Sh SELECT AND POLL
+.Sh SELECT, POLL, EPOLL, KQUEUE
 .Xr select 2
 and
 .Xr poll 2
@@ -588,16 +667,26 @@ on a
 file descriptor process rings as indicated in
 .Sx TRANSMIT RINGS
 and
-.Sx RECEIVE RINGS
-when write (POLLOUT) and read (POLLIN) events are requested.
-.Pp
-Both block if no slots are available in the ring (
-.Va ring->cur == ring->tail )
+.Sx RECEIVE RINGS ,
+respectively when write (POLLOUT) and read (POLLIN) events are requested.
+Both block if no slots are available in the ring
+.Va ( ring->cur == ring->tail ) .
+Depending on the platform,
+.Xr epoll 2
+and
+.Xr kqueue 2
+are supported too.
 .Pp
-Packets in transmit rings are normally pushed out even without
+Packets in transmit rings are normally pushed out
+(and buffers reclaimed) even without
 requesting write events. Passing the NETMAP_NO_TX_SYNC flag to
 .Em NIOCREGIF
 disables this feature.
+By default, receive rings are processed only if read
+events are requested. Passing the NETMAP_DO_RX_SYNC flag to
+.Em NIOCREGIF updates receive rings even without read events.
+Note that on epoll and kqueue, NETMAP_NO_TX_SYNC and NETMAP_DO_RX_SYNC
+only have an effect when some event is posted for the file descriptor.
 .Sh LIBRARIES
 The
 .Nm
@@ -620,7 +709,7 @@ before
 .Pp
 The following functions are available:
 .Bl -tag -width XXXXX
-.It Va  struct nm_desc_t * nm_open(const char *ifname, const char *ring_name, int flags, int ring_flags)
+.It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
 similar to
 .Xr pcap_open ,
 binds a file descriptor to a port.
@@ -629,26 +718,36 @@ binds a file descriptor to a port.
 is a port name, in the form "netmap:XXX" for a NIC and "valeXXX:YYY" for a
 .Nm VALE
 port.
+.It Va req
+provides the initial values for the argument to the NIOCREGIF ioctl.
+The nr_flags and nr_ringid values are overwritten by parsing
+ifname and flags, and other fields can be overridden through
+the other two arguments.
+.It Va arg
+points to a struct nm_desc containing arguments (e.g. from a previously
+open file descriptor) that should override the defaults.
+The fields are used as described below.
 .It Va flags
-can be set to
-.Va NETMAP_SW_RING
-to bind to the host ring pair,
-or to NETMAP_HW_RING to bind to a specific ring.
-.Va ring_name
-with NETMAP_HW_RING,
-is interpreted as a string or an integer indicating the ring to use.
-.It Va ring_flags
-is copied directly into the ring flags, to specify additional parameters
-such as NR_TIMESTAMP or NR_FORWARD.
+can be set to a combination of the following flags:
+.Va NETMAP_NO_TX_POLL ,
+.Va NETMAP_DO_RX_POLL
+(copied into nr_ringid);
+.Va NM_OPEN_NO_MMAP (if arg points to the same memory region,
+avoids the mmap and uses the values from it);
+.Va NM_OPEN_IFNAME (ignores ifname and uses the values in arg);
+.Va NM_OPEN_ARG1 ,
+.Va NM_OPEN_ARG2 ,
+.Va NM_OPEN_ARG3 (uses the fields from arg);
+.Va NM_OPEN_RING_CFG (uses the ring number and sizes from arg).
 .El
-.It Va int nm_close(struct nm_desc_t *d)
+.It Va int nm_close(struct nm_desc *d)
 closes the file descriptor, unmaps memory, frees resources.
-.It Va int nm_inject(struct nm_desc_t *d, const void *buf, size_t size)
+.It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
 similar to pcap_inject(), pushes a packet to a ring, returns the size
 of the packet if successful, or 0 on error;
-.It Va int nm_dispatch(struct nm_desc_t *d, int cnt, nm_cb_t cb, u_char *arg)
+.It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
 similar to pcap_dispatch(), applies a callback to incoming packets
-.It Va u_char * nm_nextpkt(struct nm_desc_t *d, struct nm_hdr_t *hdr)
+.It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
 similar to pcap_next(), fetches the next packet
 .Pp
 .El
@@ -740,9 +839,11 @@ performance.
 .Sh SYSTEM CALLS
 .Nm
 uses
-.Xr select 2
+.Xr select 2 ,
+.Xr poll 2 ,
+.Xr epoll
 and
-.Xr poll 2
+.Xr kqueue
 to wake up processes when significant events occur, and
 .Xr mmap 2
 to map memory.
@@ -872,10 +973,10 @@ A simple receiver can be implemented usi
 ...
 void receiver(void)
 {
-    struct nm_desc_t *d;
+    struct nm_desc *d;
     struct pollfd fds;
     u_char *buf;
-    struct nm_hdr_t h;
+    struct nm_pkthdr h;
     ...
     d = nm_open("netmap:ix0", NULL, 0, 0);
     fds.fd = NETMAP_FD(d);
@@ -910,6 +1011,13 @@ to replenish the receive ring:
     ...
 .Ed
 .Ss ACCESSING THE HOST STACK
+The host stack is for all practical purposes just a regular ring pair,
+which you can access with the netmap API (e.g. with
+.Dl nm_open("netmap:eth0^", ... ) ;
+All packets that the host would send to an interface in
+.Nm
+mode end up in the RX ring, whereas all packets queued to the
+TX ring are sent up to the host stack.
 .Ss VALE SWITCH
 A simple way to test the performance of a
 .Nm VALE
@@ -917,6 +1025,10 @@ switch is to attach a sender and a recei
 e.g. running the following in two different terminals:
 .Dl pkt-gen -i vale1:a -f rx # receiver
 .Dl pkt-gen -i vale1:b -f tx # sender
+The same example can be used to test netmap pipes, by simply
+changing port names, e.g.
+.Dl pkt-gen -i vale:x{3 -f rx # receiver on the master side
+.Dl pkt-gen -i vale:x}3 -f tx # sender on the slave side
 .Pp
 The following command attaches an interface and the host stack
 to a switch:
@@ -935,6 +1047,14 @@ Communications of the ACM, 55 (3), pp.45
 .Pp
 Luigi Rizzo, netmap: a novel framework for fast packet I/O,
 Usenix ATC'12, June 2012, Boston
+.Pp
+Luigi Rizzo, Giuseppe Lettieri,
+VALE, a switched ethernet for virtual machines,
+ACM CoNEXT'12, December 2012, Nice
+.Pp
+Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
+Speeding up packet I/O in virtual machines,
+ACM/IEEE ANCS'13, October 2013, San Jose
 .Sh AUTHORS
 .An -nosplit
 The
@@ -953,20 +1073,3 @@ and
 .Nm VALE
 have been funded by the European Commission within FP7 Projects
 CHANGE (257422) and OPENLAB (287581).
-.Pp
-.Ss SPECIAL MODES
-When the device name has the form
-.Dl valeXXX:ifname (ifname is an existing interface)
-the physical interface
-(and optionally the corrisponding host stack endpoint)
-are connected or disconnected from the
-.Nm VALE
-switch named XXX.
-In this case the
-.Pa ioctl()
-is only used only for configuration, typically through the
-.Xr vale-ctl
-command.
-The file descriptor cannot be used for I/O, and should be
-closed after issuing the
-.Pa ioctl() .


