Date: Sat, 19 Jul 2014 13:28:19 -0400 From: John Baldwin <jhb@freebsd.org> To: Julian Elischer <julian@freebsd.org> Cc: pyunyh@gmail.com, "Russell L. Carter" <rcarter@pinyon.org>, Rick Macklem <rmacklem@uoguelph.ca>, freebsd-net@freebsd.org Subject: Re: NFS client READ performance on -current Message-ID: <1780417.KfjTWjeQCU@pippin.baldwin.cx> In-Reply-To: <53C7B774.60304@freebsd.org> References: <2136988575.13956627.1405199640153.JavaMail.root@uoguelph.ca> <201407151034.54681.jhb@freebsd.org> <53C7B774.60304@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Thursday 17 July 2014 19:45:56 Julian Elischer wrote: > On 7/15/14, 10:34 PM, John Baldwin wrote: > > On Saturday, July 12, 2014 5:14:00 pm Rick Macklem wrote: > >> Yonghyeon Pyun wrote: > >>> On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote: > >>>> On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote: > >>>>> John Baldwin wrote: > >>>>>> On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote: > >>>>>>> Russell L. Carter wrote: > >>>>>>>> On 07/02/14 19:09, Rick Macklem wrote: > >>>>>>>>> Could you please post the dmesg stuff for the network > >>>>>>>>> interface, > >>>>>>>>> so I can tell what driver is being used? I'll take a look > >>>>>>>>> at > >>>>>>>>> it, > >>>>>>>>> in case it needs to be changed to use m_defrag(). > >>>>>>>> > >>>>>>>> em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port > >>>>>>>> 0xd020-0xd03f > >>>>>>>> mem > >>>>>>>> 0xfe4a0000-0xfe4bffff,0xfe480000-0xfe49ffff irq 44 at > >>>>>>>> device 0.0 > >>>>>>>> on > >>>>>>>> pci2 > >>>>>>>> em0: Using an MSI interrupt > >>>>>>>> em0: Ethernet address: 00:15:17:bc:29:ba > >>>>>>>> 001.000007 [2323] netmap_attach success for em0 > >>>>>>>> tx > >>>>>>>> 1/1024 > >>>>>>>> rx > >>>>>>>> 1/1024 queues/slots > >>>>>>>> > >>>>>>>> This is one of those dual nic cards, so there is em1 as > >>>>>>>> well... > >>>>>>> > >>>>>>> Well, I took a quick look at the driver and it does use > >>>>>>> m_defrag(), > >>>>>>> but > >>>>>>> I think that the "retry:" label it does a goto after doing so > >>>>>>> might > >>>>>>> be in > >>>>>>> the wrong place. > >>>>>>> > >>>>>>> The attached untested patch might fix this. > >>>>>>> > >>>>>>> Is it convenient to build a kernel with this patch applied > >>>>>>> and then > >>>>>>> try > >>>>>>> it with TSO enabled? > >>>>>>> > >>>>>>> rick > >>>>>>> ps: It does have the transmit segment limit set to 32. I have > >>>>>>> no > >>>>>>> idea if > >>>>>>> > >>>>>>> this is a hardware limitation. > >>>>>> > >>>>>> I think the retry is not in the wrong place, but the overhead > >>>>>> of all > >>>>>> those > >>>>>> pullups is apparently quite severe. > >>>>> > >>>>> The m_defrag() call after the first failure will just barely > >>>>> squeeze > >>>>> the just under 64K TSO segment into 32 mbuf clusters. Then I > >>>>> think any > >>>>> m_pullup() done during the retry will allocate an mbuf > >>>>> (at a glance it seems to always do this when the old mbuf is a > >>>>> cluster) > >>>>> and prepend that to the list. > >>>>> --> Now the list is > 32 mbufs again and the > >>>>> bus_dmammap_load_mbuf_sg() > >>>>> > >>>>> will fail again on the retry, this time fatally, I think? > >>>>> > >>>>> I can't see any reason to re-do all the stuff using m_pullup() > >>>>> and Russell > >>>>> reported that moving the "retry:" fixed his problem, from what I > >>>>> understood. > >>>> > >>>> Ah, I had assumed (incorrectly) that the m_pullup()s would all be > >>>> nops in this > >>>> case. It seems the NIC would really like to have all those things > >>>> in a single > >>>> segment, but it is not required, so I agree that your patch is > >>>> fine. > >>> > >>> I recall em(4) controllers have various limitation in TSO. Driver > >>> has to update IP header to make TSO work so driver has to get a > >>> writable mbufs. bpf(4) consumers will see IP packet length is 0 > >>> after this change. I think tcpdump has a compile time option to > >>> guess correct IP packet length. The firmware of controller also > >>> should be able to access complete IP/TCP header in a single buffer. > >>> I don't remember more details in TSO limitation but I guess you may > >>> be able to get more details TSO limitation from publicly available > >>> Intel data sheet. > >> > >> I think that the patch should handle this ok. All of the m_pullup() > >> stuff gets done the first time. Then, if the result is more than 32 > >> mbufs in the list, m_defrag() is called to copy the chain. This should > >> result in all the header stuff in the first mbuf cluster and the map > >> call is done again with this list of clusters. (Without the patch, > >> m_pullup() would allocate another prepended mbuf and make the chain > >> more than 32mbufs again.) > > > > Hmm, I am surprised by the m_pullup() behavior that it doesn't just > > notice that the first mbuf with a cluster has the desired data already > > and returns without doing anything. That is, I'm surprised the first > > > > statement in m_pullup() isn't just: > > if (n->m_len >= len) > > > > return (n); > > I seem to remember that the standard behaviour is for the caller to do > exactly that. Huh, the manpage doesn't really state that, and it does check in one case. However, I think that means that the code in em(4) is busted and should be checking m_len before all the calls to m_pullup(). I think this will fix the issue the same as Rick's change but it might also avoid unnecessary pullups in some cases when defrag isn't needed in the first place. Index: if_em.c =================================================================== --- if_em.c (revision 268570) +++ if_em.c (working copy) @@ -1857,32 +1857,41 @@ retry: * for IPv6 yet. */ ip_off = sizeof(struct ether_header); - m_head = m_pullup(m_head, ip_off); - if (m_head == NULL) { - *m_headp = NULL; - return (ENOBUFS); + if (m_head->m_len < ip_off) { + m_head = m_pullup(m_head, ip_off); + if (m_head == NULL) { + *m_headp = NULL; + return (ENOBUFS); + } } eh = mtod(m_head, struct ether_header *); if (eh->ether_type == htons(ETHERTYPE_VLAN)) { ip_off = sizeof(struct ether_vlan_header); - m_head = m_pullup(m_head, ip_off); + if (m_head->m_len < ip_off) { + m_head = m_pullup(m_head, ip_off); + if (m_head == NULL) { + *m_headp = NULL; + return (ENOBUFS); + } + } + } + if (m_head->m_len < ip_off + sizeof(struct ip)) { + m_head = m_pullup(m_head, ip_off + sizeof(struct ip)); if (m_head == NULL) { *m_headp = NULL; return (ENOBUFS); } } - m_head = m_pullup(m_head, ip_off + sizeof(struct ip)); - if (m_head == NULL) { - *m_headp = NULL; - return (ENOBUFS); - } ip = (struct ip *)(mtod(m_head, char *) + ip_off); poff = ip_off + (ip->ip_hl << 2); if (do_tso) { - m_head = m_pullup(m_head, poff + sizeof(struct tcphdr)); - if (m_head == NULL) { - *m_headp = NULL; - return (ENOBUFS); + if (m_head->m_len < poff + sizeof(struct tcphdr)) { + m_head = m_pullup(m_head, poff + + sizeof(struct tcphdr)); + if (m_head == NULL) { + *m_headp = NULL; + return (ENOBUFS); + } } tp = (struct tcphdr *)(mtod(m_head, char *) + poff); /* @@ -1889,10 +1898,13 @@ retry: * TSO workaround: * pull 4 more bytes of data into it. */ - m_head = m_pullup(m_head, poff + (tp->th_off << 2) + 4); - if (m_head == NULL) { - *m_headp = NULL; - return (ENOBUFS); + if (m_head->m_len < poff + (tp->th_off << 2) + 4) { + m_head = m_pullup(m_head, poff + + (tp->th_off << 2) + 4); + if (m_head == NULL) { + *m_headp = NULL; + return (ENOBUFS); + } } ip = (struct ip *)(mtod(m_head, char *) + ip_off); ip->ip_len = 0; @@ -1907,24 +1919,33 @@ retry: tp->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons(IPPROTO_TCP)); } else if (m_head->m_pkthdr.csum_flags & CSUM_TCP) { - m_head = m_pullup(m_head, poff + sizeof(struct tcphdr)); - if (m_head == NULL) { - *m_headp = NULL; - return (ENOBUFS); + if (m_head->m_len < poff + sizeof(struct tcphdr)) { + m_head = m_pullup(m_head, poff + + sizeof(struct tcphdr)); + if (m_head == NULL) { + *m_headp = NULL; + return (ENOBUFS); + } } tp = (struct tcphdr *)(mtod(m_head, char *) + poff); - m_head = m_pullup(m_head, poff + (tp->th_off << 2)); - if (m_head == NULL) { - *m_headp = NULL; - return (ENOBUFS); + if (m_head->m_len < poff + (tp->th_off << 2)) { + m_head = m_pullup(m_head, poff + + (tp->th_off << 2)); + if (m_head == NULL) { + *m_headp = NULL; + return (ENOBUFS); + } } ip = (struct ip *)(mtod(m_head, char *) + ip_off); tp = (struct tcphdr *)(mtod(m_head, char *) + poff); } else if (m_head->m_pkthdr.csum_flags & CSUM_UDP) { - m_head = m_pullup(m_head, poff + sizeof(struct udphdr)); - if (m_head == NULL) { - *m_headp = NULL; - return (ENOBUFS); + if (m_head->m_len < poff + sizeof(struct udphdr)) { + m_head = m_pullup(m_head, poff + + sizeof(struct udphdr)); + if (m_head == NULL) { + *m_headp = NULL; + return (ENOBUFS); + } } ip = (struct ip *)(mtod(m_head, char *) + ip_off); } -- John Baldwin
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1780417.KfjTWjeQCU>