From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 225535] Delays in TCP connection over Gigabit Ethernet connections; Regression from 6.3-RELEASE
Date: Mon, 29 Jan 2018 14:36:08 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225535

            Bug ID: 225535
           Summary: Delays in TCP connection over Gigabit Ethernet
                    connections; Regression from 6.3-RELEASE
           Product: Base System
           Version: 10.3-RELEASE
          Hardware: i386
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs@FreeBSD.org
          Reporter: aeder@list.ru

Created attachment 190162
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=190162&action=edit
cycle_clock.c

Delays in TCP connection over Gigabit Ethernet connections; Regression from
6.3-RELEASE.

Long description of the use case:

The company I work for develops soft-realtime applications with an internal
cycle of about 600 ms (0.6 seconds). For functional safety, the core of the
application runs simultaneously on two computers with different
architectures (A - FreeBSD/Intel, B - Linux/PowerPC).

Computers A and B are connected through a copper Gigabit switch, with no
other devices connected to the same network.

The normal cycle of the application looks like this:
1. Process input from object controllers.
2. Cross-compare input between A and B.
3. Make evaluations (simulate complex automata).
4. Produce output to send to object controllers.
5. Cross-compare output between A and B.
6. Send output to object controllers.
7. Sleep until the end of the cycle.
8. Go to step 1.

The cross-compare part is done over a TCP connection on Gigabit Ethernet
between computers A and B.

If A or B is not able to complete all operations in the allotted time
(600 ms +/- 150 ms), the whole system halts due to internal time checks.

Yes, using UDP packets might be better, but even in this configuration the
last released version of the hardware works just fine: uptime is about
1 year per installation, with approximately 100 installations worldwide.
Moreover, no halts due to internal time checks were found; the rare halts
were caused by other defects, software or hardware.

So, the old release uses an industrial computer:

CPU: Intel(R) Core(TM)2 Duo CPU L7400 @ 1.50GHz (1500.12-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x6fb  Stepping = 11
  Features=0xbfebfbff
  Features2=0xe3bd
  AMD Features=0x20100000
  AMD Features2=0x1
  Cores per package: 2
....
em0: port 0xdf00-0xdf1f mem 0xff880000-0xff89ffff,0xff860000-0xff87ffff irq 16 at device 0.0 on pci4
em0: Ethernet address: 00:30:64:09:6f:ee
....
FreeBSD fspa2 6.3-RELEASE FreeBSD 6.3-RELEASE #0: Wed Jan 16 04:45:45 UTC 2008     root@dessler.cse.buffalo.edu:/usr/obj/usr/src/sys/SMP  i386

and this configuration works just fine.

I have written a test application that simulates the main cycle of the
complex system, and the maximum delay is 24 ms over 1 million cycles.

The test application (running on two different computers) works like this:
1. Establish a TCP connection (one instance listens, the other connects).
2. send() a small fixed amount of data from both sides.
3. recv() the small fixed data.
4. send() 40 K of data from both sides.
5. recv() 40 K of data on both sides.
6. Perform a complex but useless computation (to simulate the actual
   application load).
7. Calculate how long to nanosleep() until the end of the cycle, call
   nanosleep(), and go to step 2.
------------------
Every operation is guarded with clock_gettime(), and the durations of send(),
recv(), evaluate() and nanosleep() are printed out.

I will attach the test application sources to this ticket.

For FreeBSD 6.3-RELEASE, the output looks like this, running between two
identical computers with FreeBSD 6.3 installed (only columns 2-4 are shown):

fspa2# grep times fspa2_fspa2_vpu.txt | awk '{print $3 " " $4 " " $6 " " $8 " " $10}' | sort | uniq -c
1115322 send_sync 0 0 0 0
   7425 send_sync 0 0 0 1
  73629 send_sync 0 1 0 0
      1 send_sync 0 13 0 0
     66 send_sync 0 2 0 0
      1 send_sync 0 24 0 0
     27 send_sync 0 3 0 0
     17 send_sync 0 4 0 0

As you can see, the maximum delay is 24 milliseconds (0.024 s), and it
happens only once.

So now the real problem: using FreeBSD 10.3-RELEASE and much more powerful
hardware (Moxa DA-820 industrial computer):

CPU: Intel(R) Core(TM) i7-3555LE CPU @ 2.50GHz (2494.39-MHz 686-class CPU)
  Origin="GenuineIntel"  Id=0x306a9  Family=0x6  Model=0x3a  Stepping=9
  Features=0xbfebfbff
  Features2=0x7fbae3ff
  AMD Features=0x28100000
  AMD Features2=0x1
  Structured Extended Features=0x281
  XSAVE Features=0x1
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
em0: port 0x6600-0x661f mem 0xc0a00000-0xc0a1ffff,0xc0a27000-0xc0a27fff irq 20 at device 25.0 on pci0
em0: Using an MSI interrupt
em0: Ethernet address: 00:90:e8:69:ea:3c

root@fspa2:~/clock/new_res # uname -a
FreeBSD fspa2 10.3-RELEASE FreeBSD 10.3-RELEASE #0 r297264: Fri Mar 25 03:51:29 UTC 2016     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  i386

and with the same test application we get (running between two identical
devices, both Moxa DA-820 with FreeBSD 10.3):

root@fspa2:~/clock/new_res # grep times direct_1.txt | awk '{print $3 " " $4 " " $6 " " $8 " " $10 ;}' | sort | uniq -c
 155042 send_sync 0 0 0 0
 122890 send_sync 0 0 0 1
      1 send_sync 0 0 0 100
      1 send_sync 0 0 0 102
      1 send_sync 0 0 0 111
      1 send_sync 0 0 0 12
      1 send_sync 0 0 0 125
      2 send_sync 0 0 0 13
      1 send_sync 0 0 0 130
      1 send_sync 0 0 0 131
      1 send_sync 0 0 0 133
      2 send_sync 0 0 0 136
      1 send_sync 0 0 0 148
      2 send_sync 0 0 0 149
      3 send_sync 0 0 0 150
      1 send_sync 0 0 0 156
      1 send_sync 0 0 0 159
      1 send_sync 0 0 0 16
      1 send_sync 0 0 0 161
      1 send_sync 0 0 0 17
      1 send_sync 0 0 0 176
      1 send_sync 0 0 0 19
     18 send_sync 0 0 0 2
      1 send_sync 0 0 0 229
      1 send_sync 0 0 0 23
      1 send_sync 0 0 0 24
      2 send_sync 0 0 0 25
      1 send_sync 0 0 0 26
      1 send_sync 0 0 0 28
      1 send_sync 0 0 0 282
      1 send_sync 0 0 0 29
     14 send_sync 0 0 0 3
      1 send_sync 0 0 0 30
      1 send_sync 0 0 0 31
      1 send_sync 0 0 0 32
      1 send_sync 0 0 0 37
      1 send_sync 0 0 0 38
     14 send_sync 0 0 0 4
      1 send_sync 0 0 0 40
      1 send_sync 0 0 0 41
      1 send_sync 0 0 0 43
      1 send_sync 0 0 0 45
      2 send_sync 0 0 0 46
      4 send_sync 0 0 0 49
     16 send_sync 0 0 0 5
      2 send_sync 0 0 0 53
      2 send_sync 0 0 0 57
      4 send_sync 0 0 0 58
     14 send_sync 0 0 0 59
     17 send_sync 0 0 0 6
     20 send_sync 0 0 0 60
     16 send_sync 0 0 0 61
      8 send_sync 0 0 0 62
      4 send_sync 0 0 0 63
      1 send_sync 0 0 0 64
      1 send_sync 0 0 0 67
      1 send_sync 0 0 0 68
      9 send_sync 0 0 0 7
      1 send_sync 0 0 0 70
      1 send_sync 0 0 0 72
      1 send_sync 0 0 0 79
      5 send_sync 0 0 0 8
      3 send_sync 0 0 0 80
      1 send_sync 0 0 0 81
      1 send_sync 0 0 0 82
      1 send_sync 0 0 0 84
      1 send_sync 0 0 0 89
      3 send_sync 0 0 0 9
      1 send_sync 0 0 0 90
      1 send_sync 0 0 0 93
      1 send_sync 0 0 0 95
      1 send_sync 0 0 0 97
    147 send_sync 0 1 0 0
      1 send_sync 0 33 0 0

As you can see, with only 300,000 cycles (versus 1,100,000 cycles in the old
configuration) we have multiple cases of long and very long delays,
including a delay of 229 ms.

The only difference from the standard OS configuration is

sysctl kern.timecounter.alloweddeviation=0

because without it, nanosleep() returns with a random error of up to 4%.
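
For readers without the attachment, the end-of-cycle sleep (step 7 of the test application) and the way its error can be measured might be sketched roughly as follows. This is only an illustration under assumptions, not the code in cycle_clock.c: the helper names ts_diff_ms, ts_add_ms, ts_remaining and run_cycles are hypothetical, and the send()/recv()/evaluate() work is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Difference a - b in whole milliseconds (truncated toward zero). */
long ts_diff_ms(struct timespec a, struct timespec b)
{
    long long ns = (long long)(a.tv_sec - b.tv_sec) * 1000000000LL
                 + (long long)(a.tv_nsec - b.tv_nsec);
    return (long)(ns / 1000000LL);
}

/* Return t advanced by ms (>= 0) milliseconds, tv_nsec kept below 1e9. */
struct timespec ts_add_ms(struct timespec t, long ms)
{
    t.tv_sec += ms / 1000;
    t.tv_nsec += (ms % 1000) * 1000000L;
    if (t.tv_nsec >= 1000000000L) {
        t.tv_sec += 1;
        t.tv_nsec -= 1000000000L;
    }
    return t;
}

/* Time left from now until deadline; zero if the deadline has passed. */
struct timespec ts_remaining(struct timespec deadline, struct timespec now)
{
    struct timespec r;
    r.tv_sec = deadline.tv_sec - now.tv_sec;
    r.tv_nsec = deadline.tv_nsec - now.tv_nsec;
    if (r.tv_nsec < 0) {
        r.tv_sec -= 1;
        r.tv_nsec += 1000000000L;
    }
    if (r.tv_sec < 0) {
        r.tv_sec = 0;
        r.tv_nsec = 0;
    }
    return r;
}

/*
 * Hypothetical driver: run n cycles of cycle_ms each and return the worst
 * wake-up lateness in milliseconds, or -1 if the clock cannot be read.
 * A real test would also restart nanosleep() when it is interrupted.
 */
long run_cycles(int n, long cycle_ms)
{
    struct timespec deadline, now, left;
    long worst = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
        return -1;
    for (int i = 0; i < n; i++) {
        deadline = ts_add_ms(deadline, cycle_ms);
        /* send(), recv() and evaluate() would run here (omitted). */
        if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;
        left = ts_remaining(deadline, now);
        nanosleep(&left, NULL);               /* sleep until end of cycle */
        if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;
        long late = ts_diff_ms(now, deadline); /* how late we woke up */
        if (late > worst)
            worst = late;
    }
    return worst;
}
```

Calling run_cycles(1000, 600) from a small main() and printing the result would give the worst wake-up lateness over 1000 cycles of 600 ms. Sleeping toward an absolute deadline, rather than for a fixed interval, keeps one late wake-up from shifting every later cycle.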
============================================

I have tried everything I can think of:
1. Tweaking em0 sysctls - does not help. In polling mode it works much worse.
2. Replacing the Intel em0 devices with a Realtek re0 device - slightly
   better, but it still produces significant delays.
3. Tweaking kernel sysctls (disabling new RFC compatibility) - does not help.
4. Replacing the cables and the switch (a different model, fresh from the
   box) - does not help.

If anybody has any ideas on how to fix this, please comment here.

-- 
You are receiving this mail because:
You are the assignee for the bug.