Date: Fri, 01 Mar 2002 18:10:56 -0500
From: Sergey Babkin <sergey@caldera.com>
To: arch@freebsd.org, chawla@caldera.com
Subject: proposition for new socket syscalls {send,recv}fromto
Message-ID: <3C800A80.96CEA9D2@caldera.com>
Hi all,

In case anyone wonders, it's still me, but writing from my work e-mail address.

The story is that we (Caldera) are considering adding a couple more system calls to make the coexistence of applications with high-availability clusters a bit easier. If this happens, it would be good to make these syscalls not limited to OpenUnix (formerly UnixWare) and OpenLinux but portable among the Unix systems, BSD included. Personally I believe that BSD would benefit from these syscalls as well.

The situation we are trying to solve is this. In high-availability clusters it's convenient and typical to assign an IP address to a logical server (or service). This logical server may be moved between the physical hosts as necessary (for example, if a physical host fails or needs to be shut down for maintenance). So this address gets added to an interface of the current physical host as an alias. Here comes the bad part: this alias happens to be on the same subnet as the primary address of this interface, and this may cause confusion about the source address of the packets coming out of this host. Yes, I know that this situation is from the area of "you are not supposed to do this", but the reason seems quite compelling.

It's no big deal for the TCP connections coming to this host: when accept() is done, the local side gets whatever address was specified in the SYN packet, same as for multi-homed hosts, and things work fine. But for the UDP servers (for example, tftp or BIND) there is an issue. UDP server sockets normally have INADDR_ANY as their local address. As an outgoing packet with INADDR_ANY as its source address goes down through the IP layer, ip_output() notices that and fills in the source address with the address of the interface through which the packet is going to be sent. Obviously, if the machine has two addresses from the same subnet, then the address found first will always be used.
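To see which source address the kernel would choose for a given destination, one common trick is to connect() a UDP socket and read the local address back with getsockname(). The sketch below assumes IPv4 and that a connected socket gets the same address an unbound send would normally get; probe_source_addr() is a hypothetical helper name, not an existing API.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Ask the kernel which source address it would pick for a datagram
 * sent to dst_ip. Connecting a UDP socket sends no packet; it only
 * fixes the route and the local address.
 * Returns 0 on success, -1 on error.
 * probe_source_addr() is a hypothetical helper, not a system API.
 */
int probe_source_addr(const char *dst_ip, struct in_addr *src_out)
{
    struct sockaddr_in dst, local;
    socklen_t len = sizeof(local);
    int rc = -1;
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    if (s < 0)
        return -1;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(53);   /* any nonzero port; nothing is sent */

    if (inet_pton(AF_INET, dst_ip, &dst.sin_addr) == 1 &&
        connect(s, (struct sockaddr *)&dst, sizeof(dst)) == 0 &&
        getsockname(s, (struct sockaddr *)&local, &len) == 0) {
        *src_out = local.sin_addr;
        rc = 0;
    }
    close(s);
    return rc;
}
```

On a host with a primary address and an alias on the same subnet, calling this with a client's address would be expected to report only one of the two, whichever the kernel finds first.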
And here comes the problem: the address picked this way may not be the one to which the client sent its request. For example, let's suppose that the server has the addresses 192.168.1.1 (the physical host's address) and 192.168.1.3 (the cluster's logical host address) on the same interface. The client has the address 192.168.1.100 and sends a request to the server at 192.168.1.3. The server handles the request and sends back the reply, but since its source address is filled in as described above, to the client this reply appears to come from 192.168.1.1, so the client happily discards it and continues waiting.

The fix in short: the server should do a bind() to the right address before sending the reply. However, in practice this code gets much more complicated and ugly, as will be discussed further.

There are other situations in which the same problem occurs.

One is a service on a multihomed host. Suppose that a host has two interfaces, 192.168.1.1 and 172.16.2.2, with some UDP server running on it, bound to a socket with local address INADDR_ANY. Some client with address 192.168.1.100 sends a UDP request to the server at address 172.16.2.2. The server receives the request and sends a reply back. However, the reply packet happens to be routed through the interface 192.168.1.1 and has its source address filled in as such, so again the client won't recognise the reply.

Another one is the Netware emulator. About 6 years ago I tried to port the Netware emulator from Linux to FreeBSD. However, this emulator sends all the IPX packets from a specific source address, so I tried to do bind() and such but did not get it quite right and failed. The Linux implementation of IPX works around it by sending the whole packet header with the source and destination addresses in the body of the packet, which is ugly.
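For the UDP cases, the short version of the fix (bind a socket to the wanted address, then send from it) might look like the sketch below. It assumes IPv4, and send_from() is a hypothetical helper name. It deliberately does the naive thing of opening and closing a throwaway socket per datagram, which is exactly where the complications come from.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send one datagram with an explicit source address by binding a
 * throwaway socket to that address first.
 * Returns the number of bytes sent, or -1 on error.
 * send_from() is a hypothetical helper, not a system API.
 */
ssize_t send_from(const struct sockaddr_in *from,
                  const struct sockaddr_in *to,
                  const void *buf, size_t len)
{
    int on = 1;
    ssize_t n = -1;
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    if (s < 0)
        return -1;

    /* SO_REUSEADDR lets the bind coexist with other sockets on the port. */
    if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == 0 &&
        bind(s, (const struct sockaddr *)from, sizeof(*from)) == 0)
        n = sendto(s, buf, len, 0,
                   (const struct sockaddr *)to, sizeof(*to));

    close(s);   /* naive: anything queued to this socket is lost */
    return n;
}
```

As with bind() itself, the call fails if the "from" address does not belong to a local interface.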
The details of doing the bind():

To reply to a UDP packet destined to some specific address, the destination address of this packet must be extracted and then used as the source address for sending the reply packet. This looks as follows.

First, do setsockopt(sockfd, ..., IP_RECVDSTADDR, ....) to enable extraction of the destination address. Then receive the packets with recvmsg(), and the control buffer pointed to by msg_control of struct msghdr will (possibly along with other options) contain the destination address of the packet received. This option can be identified in its header (struct cmsghdr) by the fields cmsg_level==IPPROTO_IP, cmsg_type==IP_RECVDSTADDR. It should be noted that struct cmsghdr is not portable. OpenUnix calls the logically same structure "struct opthdr" and has slightly different field names. So the only portable way is to ignore the header structure and handle the options in raw byte format, or to define your own similar structure.

Then this address can be used to do bind() before sending the reply. However, here we have a bad problem: you can't just do a bind() on the socket where you are listening for incoming datagrams. If you do so, the datagrams coming to this port but to other addresses of this host will be thrown away. So what you need to do is create a new socket, set the option SO_REUSEADDR on it, bind it to the specific address and then send the datagram from it. Obviously that's a lot of overhead, and here comes another catch: after you do so, you can't just close this other socket, since by this time it may have gotten some incoming datagrams queued to it. So what you have to do is keep a cache of sockets with various addresses bound to them, do select() on all of them before doing recvmsg(), and when sending an answer, reuse the socket with the right address from the cache (or, if there is no socket with this address cached yet, create a new one and add it to the cache).

All this is really, really ugly.
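The receive half of this procedure can be sketched roughly as follows, for systems that provide struct cmsghdr and the CMSG_* macros. Two things here are assumptions rather than part of the description above: the helper names enable_dstaddr() and recv_with_dst() are made up for illustration, and the fallback to IP_PKTINFO is added only so that the sketch also builds on Linux, which has no IP_RECVDSTADDR.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc: exposes struct in_pktinfo */
#endif
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Pick whichever destination-address option this system offers. */
#if defined(IP_RECVDSTADDR)
#define DST_OPT IP_RECVDSTADDR  /* BSD: payload is a struct in_addr */
#elif defined(IP_PKTINFO)
#define DST_OPT IP_PKTINFO      /* Linux: payload is a struct in_pktinfo */
#else
#error "no known way to obtain the destination address on this system"
#endif

/* Ask the kernel to deliver the destination address as ancillary data. */
int enable_dstaddr(int s)
{
    int on = 1;
    return setsockopt(s, IPPROTO_IP, DST_OPT, &on, sizeof(on));
}

/*
 * Receive one datagram, storing its source in *from and the address it
 * was sent to in *dst. Returns the datagram length, or -1 on error or
 * when the kernel supplied no destination address.
 */
ssize_t recv_with_dst(int s, void *buf, size_t len,
                      struct sockaddr_in *from, struct in_addr *dst)
{
    union { struct cmsghdr hdr; char space[256]; } ctl; /* aligned buffer */
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;
    struct cmsghdr *c;
    ssize_t n;

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = from;
    msg.msg_namelen = sizeof(*from);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.space;
    msg.msg_controllen = sizeof(ctl.space);

    n = recvmsg(s, &msg, 0);
    if (n < 0)
        return -1;

    /* Walk the control messages looking for the destination address. */
    for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level != IPPROTO_IP || c->cmsg_type != DST_OPT)
            continue;
#if defined(IP_RECVDSTADDR)
        memcpy(dst, CMSG_DATA(c), sizeof(*dst));
#else
        struct in_pktinfo pi;
        memcpy(&pi, CMSG_DATA(c), sizeof(pi));
        *dst = pi.ipi_addr;     /* destination address from the IP header */
#endif
        return n;
    }
    return -1;
}
```

A server would call enable_dstaddr() once on its INADDR_ANY socket and then use the address returned in *dst as the source address of the reply.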
How can we fix this situation?

Everything would become a lot simpler if we had the calls:

    ssize_t recvfromto(int s, void *buf, size_t len, int flags,
                       struct sockaddr *from, int *fromlen,
                       struct sockaddr *to, int *tolen)

This call would receive a datagram and fill both its source address (from) and its destination address (to) into the buffers.

    ssize_t sendfromto(int s, void *buf, size_t len, int flags,
                       const struct sockaddr *from, int fromlen,
                       const struct sockaddr *to, int tolen)

This call would send a datagram from the specified address to the specified address without any need to do an extra bind(). Of course, just as when doing bind(), this call should check that the "from" address actually belongs to some local interface.

With these syscalls added, the modifications to the servers become easy and obvious.

-SB

P.S. I'm going on a trip next week and will be back only around March 14th, so I won't be reading or answering much e-mail in the meantime.