Date:      Fri, 01 Mar 2002 18:10:56 -0500
From:      Sergey Babkin <sergey@caldera.com>
To:        arch@freebsd.org, chawla@caldera.com
Subject:   proposition for new socket syscalls {send,recv}fromto
Message-ID:  <3C800A80.96CEA9D2@caldera.com>

Hi all,

In case anyone wonders, it's still me, but writing from my work e-mail address.

The story is that we (Caldera) are considering adding a couple of
system calls to make the coexistence of applications with
high-availability clusters a bit easier. If this happens, it would
be good to make these syscalls not limited to OpenUnix (formerly UnixWare)
and OpenLinux but portable among the Unix systems, BSD included. Personally
I believe that BSD would benefit from these syscalls as well.

The situation we are trying to solve is:

In high-availability clusters it's convenient and typical to assign
an IP address to a logical server (or service). This logical server
may be moved between the physical hosts as necessary (for example,
if a physical host fails or needs to be shut down for maintenance).
So this address gets added to an interface of the current physical
host as an alias. Here comes the bad part: this alias happens to be
on the same subnet as the primary address of this interface, and this
may cause confusion about the source address of the packets coming out
of this host. Yes, I know that this situation is from the area of "you
are not supposed to do this" but the reason seems quite compelling.
It's no big deal for the TCP connections coming to this
host: when accept() is done, the local side gets whatever destination
address was specified in the SYN packet, same as for multi-homed hosts,
and things work fine. But for UDP servers (for example, tftp or BIND)
there is an issue:

UDP server sockets normally have INADDR_ANY as their local address.
When an outgoing packet with INADDR_ANY as its source address goes down
through the IP layer, ip_output() notices that and fills in the
source address with the address of the interface through which the packet
is going to be sent. Obviously, if the machine has two addresses from the
same subnet then the address found first will always be used. And here
comes the problem: this address may not be the one to which the client
has sent its request.

For example, let's suppose that the server has the addresses
192.168.1.1 (the physical host's address) and 192.168.1.3 (the
cluster's logical host address) on the same interface. The client
has the address 192.168.1.100 and sends a request to the server
at 192.168.1.3. The server handles the request and sends back the
reply, but since its source address is filled in as described above,
to the client this reply appears to come from 192.168.1.1, so the
client happily discards it and continues waiting.
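
The check that makes the client drop such a reply can be sketched roughly
as below. This is only an illustration, not code from any particular
client: the helper names make_addr() and reply_is_from_server() are made
up for this example, and a real client would get the reply's source
address from recvfrom().

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: build an IPv4 socket address from dotted-quad text. */
struct sockaddr_in make_addr(const char *ip, unsigned short port)
{
    struct sockaddr_in sin;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    inet_pton(AF_INET, ip, &sin.sin_addr); /* stays 0.0.0.0 on bad input */
    return sin;
}

/*
 * Hypothetical helper: nonzero if the reply's source address and port
 * match the address and port the request was sent to.
 */
int reply_is_from_server(const struct sockaddr_in *server,
                         const struct sockaddr_in *reply_src)
{
    return server->sin_addr.s_addr == reply_src->sin_addr.s_addr &&
           server->sin_port == reply_src->sin_port;
}
```

With the addresses from the example, a request sent to 192.168.1.3 and
a reply arriving from 192.168.1.1 fail this comparison, so the reply is
ignored.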

The fix in short: the server should do a bind() to the right address
before sending the reply. However, in practice this code gets much more
complicated and ugly, as will be discussed further.

Other situations in which the same problem occurs:

One is a service on a multihomed host. Suppose that a host has two
interfaces, 192.168.1.1 and 172.16.2.2, and runs some UDP server
on a socket with local address INADDR_ANY. Some client
with address 192.168.1.100 sends a UDP request to the server at address
172.16.2.2. The server receives the request and sends a reply back.
However, it happens that the reply packet is routed through the interface
192.168.1.1 and has its source address filled in as such, so again the
client won't recognise the reply.

Another one is the Netware emulator. About 6 years ago I tried to
port the Netware emulator from Linux to FreeBSD. However, this emulator
sends all the IPX packets from a specific source address, so I tried
to do bind() and such but did not get it quite right and failed. The
Linux implementation of IPX works around it by sending the whole packet
header with the source and destination addresses in the body of the packet,
which is ugly.

The details of doing the bind():

To reply to a UDP packet destined for some specific address, the
destination address of this packet must be extracted and then used
as the source address for sending the reply packet. This works as follows.
First, do

setsockopt(sockfd, ..., IP_RECVDSTADDR, ....)

to enable extraction of the destination address. Then receive the
packets with recvmsg(), and the control buffer pointed to
by msg_control of struct msghdr will (possibly along with other options)
contain the destination address of the packet received. This option
can be identified by its header (struct cmsghdr) by the fields
cmsg_level==IPPROTO_IP, cmsg_type==IP_RECVDSTADDR. It should be noted
that struct cmsghdr is not portable. OpenUnix calls the logically equivalent
structure "struct opthdr" and has slightly different field names. So the
only portable way is to ignore the header structure and handle the options
in raw byte format or define your own similar structure.

This address can then be used to do bind() before sending the reply.
However, here we have a bad problem: you can't just do a bind() on the
socket where you are listening for incoming datagrams. If you do so,
the datagrams coming to this port but to other addresses of this host
will be thrown away. So what you need to do is create a new socket, set
the option SO_REUSEADDR on it, bind it to the specific address and then
send the datagram from it. Obviously that's a lot of overhead, plus here
comes another catch: after you do so you can't just close this second
socket, since by this time it may have gotten some incoming datagrams
queued to it. So what you have to do is keep a cache of sockets
with various addresses bound to them, do select() on all of them
before doing recvmsg(), and when sending an answer reuse the socket
with the right address from the cache (or, if no socket with
this address is cached yet, create a new one and add it to the cache).
All this is really, really ugly.
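
To give an idea of the amount of code involved, here is a rough sketch of
just the socket cache. The names reply_socket_for() and MAX_REPLY_SOCKS
are made up for this example, and the cache is keyed by address only,
which assumes the server uses a single port. The select() loop over the
cached sockets is not shown.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_REPLY_SOCKS 16 /* hypothetical cache size */

struct reply_sock {
    struct in_addr addr; /* local address the socket is bound to */
    int fd;
};

static struct reply_sock cache[MAX_REPLY_SOCKS];
static int cache_used;

/*
 * Return a UDP socket bound to (local, port), taking it from the cache
 * or creating and caching a new one. port_net is in network byte order.
 * Returns -1 on error or when the cache is full.
 */
int reply_socket_for(struct in_addr local, in_port_t port_net)
{
    struct sockaddr_in sin;
    int i, fd, on = 1;

    for (i = 0; i < cache_used; i++)
        if (cache[i].addr.s_addr == local.s_addr)
            return cache[i].fd;
    if (cache_used == MAX_REPLY_SOCKS)
        return -1;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr = local;
    sin.sin_port = port_net;

    /* SO_REUSEADDR must be set before bind() to share the server's port. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0 ||
        bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        close(fd);
        return -1;
    }

    cache[cache_used].addr = local;
    cache[cache_used].fd = fd;
    cache_used++;
    return fd;
}
```

Every descriptor handed out by this function also has to be added to the
set passed to select(), since datagrams sent to that address may from
then on be queued to it rather than to the wildcard socket. Some systems
may also require SO_REUSEADDR on the wildcard socket itself.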

How can we fix this situation? Everything would become a lot simpler
if we had the calls:

ssize_t
recvfromto(int s, void *buf, size_t len, int flags, 
  struct sockaddr *from, int *fromlen,
  struct sockaddr *to, int *tolen)

This call would receive a datagram and fill both its source address
(from) and its destination address (to) into the buffers.

ssize_t
sendfromto(int s, void *buf, size_t len, int flags, 
  const struct sockaddr *from, int fromlen,
  const struct sockaddr *to, int tolen)

This call would send a datagram from the specified address to the
specified address without any need to do an extra bind(). Of course,
just as when doing bind(), this call should check that the "from" address
actually belongs to some local interface.

With these syscalls added the modifications to the servers become easy 
and obvious.

-SB
P.S. I'm going on a trip next week and will be back only around
March 14th, so I won't be reading or answering much e-mail in the meantime.




