Date: Mon, 1 May 2017 18:25:11 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Konstantin Belousov <kostikbel@gmail.com> Cc: Bruce Evans <brde@optusnet.com.au>, Alan Somers <asomers@freebsd.org>, Gleb Smirnoff <glebius@freebsd.org>, "src-committers@freebsd.org" <src-committers@freebsd.org>, "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>, "svn-src-head@freebsd.org" <svn-src-head@freebsd.org> Subject: Re: svn commit: r317061 - in head: libexec/rpc.rstatd sys/amd64/amd64 sys/amd64/include sys/arm/arm sys/arm/include sys/arm64/include sys/cddl/contrib/opensolaris/uts/common/fs/zfs sys/compat/linprocfs... Message-ID: <20170501163725.U972@besplex.bde.org> In-Reply-To: <20170430201324.GV1622@kib.kiev.ua> References: <201704171734.v3HHYlf5022945@repo.freebsd.org> <CAOtMX2jdNj0du0ZuUKPr16iHK_YeNVzf-nDvwC-MuFM003VVAg@mail.gmail.com> <20170419130428.J956@besplex.bde.org> <20170430201324.GV1622@kib.kiev.ua>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, 30 Apr 2017, Konstantin Belousov wrote: > On Wed, Apr 19, 2017 at 02:09:58PM +1000, Bruce Evans wrote: >> On Tue, 18 Apr 2017, Alan Somers wrote: >> >>> On Mon, Apr 17, 2017 at 11:34 AM, Gleb Smirnoff <glebius@freebsd.org> wrote: >> >>>> head/usr.bin/top/machine.c >>>> head/usr.bin/vmstat/vmstat.c >> >> The previous 2 lines turn out to be relevant. I missed the update to >> vmstat and checked a wrong version in my bug report described below >> I checked an updated top, but the change seemed too small to help. >> systat was not updated. >> >>> This change broke backwards compatibility with old top binaries. When >>> I use a kernel at version 317094 but a top from 14-April, I get the >>> error "top: sysctl(vm.stats.vm.v_swappgsin...) failed: Cannot allocate >>> memory". I get the same error when running top from an 11.0-RELEASE >>> jail. Can you please add backward compatibility shims? >> >> I sent a much longer (30 times longer) bug report to the author of the >> bug. >> >> Most or all statistics utilities that use vmmeter are broken. vmstat >> is too broken to even notice the bug -- it silently ignores the error, >> an has type errors which prevent other errors which it doesn't ignore. >> The result is silently truncating to 32 bits, like a quick fix would >> do intentionally. systat -v detects the errors but prints warning >> messages to the status line where they overwrite each other so make >> the problem look smaller than it is. The result is unsilently >> truncating to 32 bits. >> >> I've never seen any backwards compatibility shims added for sysctls. >> None should be needed, but there are few or no utilities other than >> sysctl(8) which use sysctl(3) in a portable way. > See vfs.bufspace handled by vfs_bio.c:sysctl_bufspace(). Ugh. My fix for this continues to work fine. It handles about 162 combinations of integer sizes/signedness for all sysctls in not much more space than sysctl_bufspace(). Unfortunately, the full version breaks the API slightly (it even breaks sysctl(8)), and the dumbed down version causes problems although it technically doesn't change the API. sysctl_bufspace() also changes the API slightly, much like my dumbed down version. This tends to break probes. > In fact, not only top and vmstat are broken. Quite unexpected and > worrying, reboot(8) is broken as well. This is due to get_pageins() > failing and system stopping userspace much earlier than it is desirable. > > I now believe that compat shims to handle int-sized vmstat queries are > required. Here are my current versions and a test program (mainly for old design and implementation bugs with short reads and writes). XX Index: kern_sysctl.c XX =================================================================== XX --- kern_sysctl.c (revision 317604) XX +++ kern_sysctl.c (working copy) This version changes the API too much, and I now use the simpler patch to fix the immediate problem with counters. The full version of it does: - let userland do reads and writes of integers with any size large enough to hold the value - new errno EOVERFLOW for userland sizes that are not large enough. This replaces ENOMEM in most cases for integer sysctls. Only oldlen = 0 should mean to probe. - new errno EOVERFLOW for kernel sizes that are not large enough. Be more careful not to write anything if the kernel size is not large enough. The previous errno for this is EINVAL. - it is no longer very fragile to read with a type larger than the kernel type. The kernel expands to the requested size. This used to be a non-error with garbage in the bits after the end of the unexpanded kernel size. - it is now possible to write with a type smaller than the kernel type. The kernel expands to its size. The case is documented to fail with EINVAL, and works as documented IIRC. - it is now possible to write correctly with a type larger than the kernel type, provided the value fits. The case is documented to fail with EINVAL, but IIRC it is the case that has been broken for almost 20 years. Usually, the bug is harmless since the value fits (due to are 0's or 1's in the unwritten top bits). The dumbed down version keeps ENOMEM instead of EOVERFLOW, and doesn't allow all combinations of sizes. For the counter ABI change, the case where the userland variable is smaller must be supported for reading, and the opposite is not needed yet. I forget what happens for writing. XX @@ -1344,6 +1344,302 @@ XX return (error); XX } XX XX +int sysctl_uicvt(const void *src, size_t srclen, void *dst, size_t dstlen); XX +int XX +sysctl_uicvt(const void *src, size_t srclen, void *dst, size_t dstlen) XX +{ XX + uintmax_t val, valtrunc; XX + uint64_t val64; XX + uint32_t val32; XX + uint16_t val16; XX + uint8_t val8; XX + XX + if (dstlen > sizeof(uintmax_t)) XX + dstlen = sizeof(uintmax_t); XX + if (dstlen > srclen) XX + dstlen = srclen; The conversion functions are trivial, but not as small or as fast as they could be. In most case, no conversion is needed. These dstlen adjustments are dumbing down. The first one is mainly to reduce the number of cases, but is not quite write. Suppose that the userland uintmax_t is larger than the kernel's, or userland is reading into a large u_char array or struct for some reason. Then reducing dstlen to sizeof(uintmax_t) gives garbage after the end of a uintmax_t, the same as in the current API. The second one loses the size adjustment in 1 direction -- see below. This is complicated and perhaps buggy because it tries to support both directions in userland <-> kernel in shared code. XX + switch (srclen) { XX + case 1: XX + val = *(const int8_t *)src; XX + break; XX + case 2: XX + val = *(const int16_t *)src; XX + break; XX + case 4: XX + val = *(const int32_t *)src; XX + break; XX + case 8: XX + val = *(const int64_t *)src; XX + break; XX + default: XX + bcopy(src, dst, MIN(srclen, dstlen)); XX + return (srclen); XX + } XX + XX + switch (dstlen) { XX + case 1: XX + valtrunc = val8 = val; XX + src = &val8; XX + break; XX + case 2: XX + valtrunc = val16 = val; XX + src = &val16; XX + break; XX + case 4: XX + valtrunc = val32 = val; XX + src = &val32; XX + break; XX + case 8: XX + valtrunc = val64 = val; XX + src = &val64; XX + break; XX + default: XX + valtrunc = val - 1; XX + break; XX + } XX + if (valtrunc != val) { XX + if (dstlen != 0) XX + printf("val %#jx, valtrunc %#jx\n", val, valtrunc); XX + bcopy(src, dst, dstlen); XX + return (srclen); XX + } else { XX + bcopy(src, dst, dstlen); XX + return (dstlen); XX + } XX +} XX + XX +/* Output a signed integer if possible. */ XX +static int XX +sysctl_out_i(struct sysctl_req *req, intmax_t val, size_t kernlen) XX +{ XX + size_t len; XX + intmax_t valtrunc; XX + int64_t val64; XX + int32_t val32; XX + int16_t val16; XX + int8_t val8; XX + int error; XX + XX + len = req->oldlen != 0 ? req->oldlen : kernlen; XX + /* XX + * XXX: sysctl(8) determines the correct length, but then twee-ly XX + * doubles it, so we can't tell the application's integer size. XX + * If we supplied the doubled length (as is possible up to size 4), XX + * then sysctl(8) would treat the result as an array. XX + * XX + * Work around this by only supplying the default length if that XX + * fits. This restores the original broken behaviour of not XX + * extending, so applications get garbage in the top bits if they XX + * use a larger integer size. XX + */ XX + if (len > kernlen) XX + len = kernlen; See the comment. Without this hack, sysctl(8) prints most variables as "v 0", where v is the correct output, because it sees int as [u_]int[2]. Statistics utilities are not as "smart" as sysctl, so they kept seeing plain [u_]int, as intended for the counter variables which became uint64_t. XX + switch (len) { XX + case 1: XX + valtrunc = val8 = val; XX + error = SYSCTL_OUT(req, &val8, 1); XX + break; XX + case 2: XX + valtrunc = val16 = val; XX + error = SYSCTL_OUT(req, &val16, 2); XX + break; XX + case 4: XX + valtrunc = val32 = val; XX + error = SYSCTL_OUT(req, &val32, 4); XX + break; XX + case 8: XX + valtrunc = val64 = val; XX + error = SYSCTL_OUT(req, &val64, 8); XX + break; XX + default: XX + /* XXX: not quite compat; should convert back to kernel type. */ XX + return (SYSCTL_OUT(req, &val, sizeof(val))); XX + } XX + if (valtrunc != val && error == 0) { XX + printf("val %#jx, valtrunc %#jx\n", val, valtrunc); XX + error = ENOMEM; /* should be EOVERFLOW */ XX + } XX + return (error); XX +} XX + XX +/* Output an usigned integer if possible. */ XX +static int XX +sysctl_out_ui(struct sysctl_req *req, uintmax_t val, size_t kernlen) XX +{ XX + size_t len; XX + uintmax_t valtrunc; XX + uint64_t val64; XX + uint32_t val32; XX + uint16_t val16; XX + uint8_t val8; XX + int error; XX + XX + len = req->oldlen != 0 ? req->oldlen : kernlen; XX + if (len > kernlen) XX + len = kernlen; XX + switch (len) { XX + case 1: XX + valtrunc = val8 = val; XX + error = SYSCTL_OUT(req, &val8, 1); XX + break; XX + case 2: XX + valtrunc = val16 = val; XX + error = SYSCTL_OUT(req, &val16, 2); XX + break; XX + case 4: XX + valtrunc = val32 = val; XX + error = SYSCTL_OUT(req, &val32, 4); XX + break; XX + case 8: XX + valtrunc = val64 = val; XX + error = SYSCTL_OUT(req, &val64, 8); XX + break; XX + default: XX + return (SYSCTL_OUT(req, &val, sizeof(val))); XX + } XX + if (valtrunc != val && error == 0) { XX + printf("val %#jx, valtrunc %#jx\n", val, valtrunc); XX + error = ENOMEM; /* should be EOVERFLOW */ XX + } XX + return (error); XX +} XX + XX +/* Input a signed integer if possible. */ XX +static int XX +sysctl_in_i(struct sysctl_req *req, void *valp, size_t kernlen) XX +{ XX + intmax_t val; XX + int64_t val64; XX + int32_t val32; XX + int16_t val16; XX + int8_t val8; XX + int error; XX + XX + switch (req->newlen) { XX + case 1: XX + error = SYSCTL_IN(req, &val8, 1); XX + val = val8; XX + break; XX + case 2: XX + error = SYSCTL_IN(req, &val16, 2); XX + val = val16; XX + break; XX + case 4: XX + error = SYSCTL_IN(req, &val32, 4); XX + val = val32; XX + break; XX + case 8: XX + error = SYSCTL_IN(req, &val64, 8); XX + val = val64; XX + break; XX + default: XX + /* Only kernel lengths are supported; we hard-code {1,2,4,8}. */ XX + error = EINVAL; XX + break; XX + } XX + if (error != 0) XX + return (error); XX + XX + switch (kernlen) { XX + case 1: XX + if ((int8_t)val == val) XX + *(int8_t *)valp = val; XX + else XX + error = EINVAL; /* should be EOVERFLOW everywhere */ XX + break; XX + case 2: XX + if ((int16_t)val == val) XX + *(int16_t *)valp = val; XX + else XX + error = EINVAL; XX + break; XX + case 4: XX + if ((int32_t)val == val) XX + *(int32_t *)valp = val; XX + else XX + error = EINVAL; XX + break; XX + case 8: XX + if ((uint64_t)val == val) XX + *(uint64_t *)valp = val; XX + else XX + error = EINVAL; XX + default: XX + error = EDOOFUS; XX + break; XX + } XX + return (error); XX +} XX + XX +/* Input an unsigned integer if possible. */ XX +static int XX +sysctl_in_ui(struct sysctl_req *req, void *valp, size_t kernlen) XX +{ XX + uintmax_t val; XX + uint64_t val64; XX + uint32_t val32; XX + uint16_t val16; XX + uint8_t val8; XX + int error; XX + XX + switch (req->newlen) { XX + case 1: XX + error = SYSCTL_IN(req, &val8, 1); XX + val = val8; XX + break; XX + case 2: XX + error = SYSCTL_IN(req, &val16, 2); XX + val = val16; XX + break; XX + case 4: XX + error = SYSCTL_IN(req, &val32, 4); XX + val = val32; XX + break; XX + case 8: XX + default: XX + if (req->newlen == sizeof(val)) XX + error = SYSCTL_IN(req, &val, sizeof(val)); XX + else if (req->newlen == 8) { XX + error = SYSCTL_IN(req, &val64, 8); XX + val = val64; XX + } else XX + error = EINVAL; XX + break; XX + } XX + if (error != 0) XX + return (error); XX + XX + switch (kernlen) { XX + case 1: XX + if ((uint8_t)val == val) XX + *(uint8_t *)valp = val; XX + else XX + error = EINVAL; XX + break; XX + case 2: XX + if ((uint16_t)val == val) XX + *(uint16_t *)valp = val; XX + else XX + error = EINVAL; XX + break; XX + case 4: XX + if ((uint32_t)val == val) XX + *(uint32_t *)valp = val; XX + else XX + error = EINVAL; XX + break; XX + case 8: XX + if ((uint64_t)val == val) XX + *(uint64_t *)valp = val; XX + else XX + error = EINVAL; XX + break; XX + default: XX + error = EDOOFUS; XX + } XX + return (error); XX +} XX + XX /* XX * Handle an int, signed or unsigned. XX * Two cases: XX @@ -1354,24 +1650,35 @@ XX int XX sysctl_handle_int(SYSCTL_HANDLER_ARGS) XX { XX - int tmpout, error = 0; XX + int error; XX XX - /* XX - * Attempt to get a coherent snapshot by making a copy of the data. XX - */ XX - if (arg1) XX - tmpout = *(int *)arg1; XX - else XX - tmpout = arg2; XX - error = SYSCTL_OUT(req, &tmpout, sizeof(int)); XX + switch (oidp->oid_kind & CTLTYPE) { XX + case CTLTYPE_UINT: XX + error = sysctl_out_ui(req, XX + arg1 != NULL ? *(u_int *)arg1 : arg2, sizeof(u_int)); XX + break; XX + case CTLTYPE_INT: XX + default: XX + error = sysctl_out_i(req, XX + arg1 != NULL ? *(int *)arg1 : arg2, sizeof(int)); XX + break; XX + } XX XX if (error || !req->newptr) XX return (error); XX XX if (!arg1) XX - error = EPERM; XX - else XX - error = SYSCTL_IN(req, arg1, sizeof(int)); XX + return (EPERM); XX + XX + switch (oidp->oid_kind & CTLTYPE) { XX + case CTLTYPE_UINT: XX + error = sysctl_in_ui(req, arg1, sizeof(int)); XX + break; XX + case CTLTYPE_INT: XX + default: XX + error = sysctl_in_i(req, arg1, sizeof(int)); XX + break; XX + } XX return (error); XX } I only fixed sysctl_handle_int(), including style bugs and its type puns for u_int. All of the integer handling functions should be in this one, with the case statement expanded in the obvious way. XX XX Index: subr_counter.c XX =================================================================== XX --- subr_counter.c (revision 317604) XX +++ subr_counter.c (working copy) XX @@ -78,11 +78,15 @@ XX sysctl_handle_counter_u64(SYSCTL_HANDLER_ARGS) XX { XX uint64_t out; XX + uint32_t out32; XX int error; XX XX out = counter_u64_fetch(*(counter_u64_t *)arg1); XX XX - error = SYSCTL_OUT(req, &out, sizeof(uint64_t)); XX + if (req->oldlen == 4 && (out32 = out) == out) XX + error = SYSCTL_OUT(req, &out32, sizeof(out32)); XX + else XX + error = SYSCTL_OUT(req, &out, sizeof(out)); XX XX if (error || !req->newptr) XX return (error); This does the minimum necessary to fix the current problem. This works until the counters become too large for a u_int. There could be an option to get truncation with no error, but that would require an API change, so applications should be required to do the truncation for themself if that is what they want. "Smart" applications are still broken by the above. They might have support for reading with 64 bit sizes, but probe with 32-bit sizes. Then the above will succeed early, but later when the counters overflow the sysctl will start failing. It fails with the old errno of ENOMEM, so the "smart" applications should start using 64 bit sizes, but it might assume that the initial sizing is enough for integers. After all, the size of an integer can't change in the kernel while the kernel is running, and the API tacitly implies returning actual kernel integers and not their representation in any convenient smaller type. Test program. It tests mainly for old bugs involving no errors for short reads and writes. After the API change, these non-errors become correct. YY #include <sys/types.h> YY #include <sys/sysctl.h> YY YY #include <err.h> YY #include <errno.h> YY #include <stdint.h> YY #include <stdlib.h> YY YY #define TESTMIB "vfs.ffs.doasyncfree" YY YY /* Use for initialization and later tests. */ YY static void YY get(int32_t *oldvalp) YY { YY size_t oldlen; YY YY oldlen = sizeof(*oldvalp); YY if (sysctlbyname(TESTMIB, oldvalp, &oldlen, NULL, 0) != 0) YY err(1, "basic: read failed"); YY /* YY * It is currently difficult to find the correct size/type to use. YY * Assume that the size is 32 bits here since the focus is not quite YY * on the design errors behind that, and we want to do a safe null YY * change to the MIB (the kernel variable should be a bool or a YY * 1-bit bit-field, and we should be able to set it to 0 or 1 but YY * no other values using any integer type (but not a bit-field). YY */ YY if (oldlen != sizeof(*oldvalp)) YY err(1, "basic: size mismatch"); YY } YY YY /* Use for initial test and unclobber. */ YY static void YY set(int32_t newval) YY { YY size_t newlen; YY YY newlen = sizeof(newval); YY if (sysctlbyname(TESTMIB, NULL, NULL, &newval, YY newlen) != 0) YY err(1, "basic: write failed"); YY } YY YY int YY main(void) YY { YY size_t newlen, oldlen; YY int64_t longnewval, longoldval; YY int32_t oldval, origval; YY int16_t shortnewval, shortoldval; YY int error; YY YY /* Normal operation. */ YY get(&origval); YY set(origval); YY warnx("basic: passed: val %d", origval); YY YY /* Long read. No bugs expected. Just design errors. */ YY shortoldval = ~origval; YY oldlen = sizeof(shortoldval); YY error = sysctlbyname(TESTMIB, &shortoldval, &oldlen, NULL, 0); YY if (error == 0 && shortoldval == origval) YY warnx("long read: unexpectedly worked right"); YY else if (error == 0) YY warnx("long read: unexpectedly succeeded, but gave garbage"); YY else if (errno != ENOMEM) YY warn("long read: failed with unexpected errno"); YY else YY warnx("long read: passed"); YY YY /* Short read. No bugs expected. Just design errors. */ YY longoldval = ~origval; YY oldlen = sizeof(longoldval); YY error = sysctlbyname(TESTMIB, &longoldval, &oldlen, NULL, 0); YY if (error != 0) YY warn("short read: failed"); YY else if (longoldval != origval) YY warnx("short read: passed: garbage in top bits as expected"); YY else if (oldlen == sizeof(longoldval) && longoldval == origval) YY warnx("short read: unexpectedly worked right"); YY else YY warnx("short read: succeeded with right val but wrong length"); YY YY /* Long write. Expect success with top bits not written (broken). */ YY longnewval = 0x100000666; /* unrepresentable */ YY newlen = sizeof(longnewval); YY error = sysctlbyname(TESTMIB, NULL, NULL, &longnewval, newlen); YY get(&oldval); YY if (error != 0 && errno == EINVAL) YY warnx("long write: unexpectedly unbroken (failed with EINVAL)"); YY else if (error == 0 && oldval == 0x666) YY warnx("long write: broken as expected (success and value)"); YY else if (error == 0) YY warnx("long write: more broken than expected (succ. and !val)"); YY else YY warn("long write: failed with unexpected errno"); YY if (oldval == origval) YY warnx("long write: preserved value"); YY else if (oldval == longnewval) YY warnx("long write: set value correctly (impossible)"); YY else YY warnx("long write: clobbered value"); YY set(origval); YY YY /* Another case for long write. */ YY longnewval = -1; /* representable assuming signed */ YY newlen = sizeof(longnewval); YY error = sysctlbyname(TESTMIB, NULL, NULL, &longnewval, newlen); YY get(&oldval); YY if (error != 0 && errno == EINVAL) YY warnx("long write: unexpectedly unbroken (failed with EINVAL)"); YY else if (error == 0 && oldval == longnewval) YY warnx("long write: broken as expected, or working"); YY else if (error == 0) YY warnx("long write: more broken than expected (succ. and !val)"); YY else YY warn("long write: failed with unexpected errno"); YY if (oldval == origval) YY warnx("long write: preserved value"); YY else if (oldval == longnewval) YY warnx("long write: set value correctly"); YY else YY warnx("long write: clobbered value"); YY set(origval); YY YY /* Short write. No bugs expected. Just design errors. */ YY shortnewval = origval + 1; YY newlen = sizeof(shortnewval); YY error = sysctlbyname(TESTMIB, NULL, NULL, &shortnewval, newlen); YY get(&oldval); YY if (error == 0 && oldval == shortnewval) YY warnx("short write: unexpectedly worked"); YY else if (error == 0) YY warnx("short write: unexpectedly succeed but set garbage"); YY else if (errno != EINVAL) YY warn("short write: failed with unexpected errno"); YY else YY warnx("short write: passed"); YY if (oldval == origval) YY warnx("short write: preserved value"); YY else if (oldval == shortnewval) YY warnx("short write: set value correctly"); YY else YY warnx("short write: clobbered value"); YY set(origval); YY return (0); YY } Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20170501163725.U972>