Date: Sun, 6 Dec 2015 21:39:17 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Gavin Mu <gavin.mu@qq.com>
Cc: freebsd-stable <freebsd-stable@freebsd.org>
Subject: Re: application coredump behavior differences between FreeBSD 7.0 and FreeBSD 10.1
Message-ID: <20151206193917.GH2202@kib.kiev.ua>
In-Reply-To: <tencent_215F812C18B67A5703B4A4F3@qq.com>
References: <tencent_16AD659A73E203C91A0E112F@qq.com> <20151204094550.GO2405@kib.kiev.ua> <tencent_532DCE3C7376849271198715@qq.com> <20151205142403.GE2202@kib.kiev.ua> <tencent_24B00D6972156E7D55E1566E@qq.com> <tencent_215F812C18B67A5703B4A4F3@qq.com>
On Sun, Dec 06, 2015 at 04:54:38PM +0800, Gavin Mu wrote:
> Hi, kib,
>
> It is really related to the madvise behaviour. I checked the code around
> MADV_SEQUENTIAL, and it seems there is something wrong with vm_fault() in
> FreeBSD 10.1.
> I did a simple patch:
> diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
> index b5ac58f..135fc67 100644
> --- a/sys/vm/vm_fault.c
> +++ b/sys/vm/vm_fault.c
> @@ -966,6 +966,8 @@ vnode_locked:
>  	 */
>  	if (hardfault)
>  		fs.entry->next_read = fs.pindex + faultcount - reqpage;
> +	else
> +		fs.entry->next_read = fs.pindex + 1;
>
>  	vm_fault_dirty(fs.entry, fs.m, prot, fault_type, fault_flags, TRUE);
>  	vm_page_assert_xbusied(fs.m);
>
> Without this, next_read is never updated and stays zero in my testing. I
> think next_read should be set to pindex + 1 here. Is my understanding
> correct? Thanks.

Yes, you are right to point out that soft faults are not accounted for in
the read-ahead and cache-behind behaviour, and as a result the pages behind
the read point are not deactivated. On the other hand, ignoring soft faults
is intentional: read-ahead (and cache-behind) should only be triggered when
the pager is performing costly I/O.

I think the right question to ask, which I did not ask in my first reply,
is why you consider this behaviour an issue at all. Generally, the fact
that the pages of the shared segment are instantiated, mapped and activated
by accesses (be it core dumping or anything else) is the expected VM
behaviour. I do not think that the behaviour of the 7.x kernel in this
specific situation, where most pages of the large shared segment were not
even instantiated, was intentional; I believe it was a side effect of some
other, really undesirable, cache-behind configuration. A specific
optimization for this case (i.e. not-yet-instantiated pages of the swap
object accessed by the exiting process) is possible, but somewhat
questionable.

>
> Regards,
> Gavin Mu
>
>
> ------------------ Original ------------------
> From: "Gavin Mu" <gavin.mu@qq.com>
> Date: Sun, Dec 6, 2015 08:14 AM
> To: "Konstantin Belousov" <kostikbel@gmail.com>
> Cc: "freebsd-stable" <freebsd-stable@freebsd.org>
> Subject: Re: application coredump behavior differences between FreeBSD 7.0 and FreeBSD 10.1
>
> Hi, kib,
>
> It does not help.
> I added:
>     ret = madvise(shm_handle, size * 1024 * 1024 * 1024, MADV_SEQUENTIAL);
>     if (ret != 0) {
>         printf("madvise return %d\n", ret);
>     }
>
> top shows the process still using the full amount of memory; below is a
> snapshot taken during the core dump.
> last pid:  3656;  load averages:  1.84,  1.29,  1.04    up 0+00:18:06  23:58:37
> 43 processes:  2 running, 41 sleeping
> CPU:  1.2% user,  0.0% nice, 85.2% system,  7.8% interrupt,  5.9% idle
> Mem: 924M Active, 57M Inact, 745M Wired, 8980K Cache, 103M Buf, 34M Free
> Swap: 4096M Total, 188M Used, 3908M Free, 4% Inuse
>
>   PID USERNAME    THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
>  3646 root          1  84    0  1036M   710M RUN      0:13 42.29% tt
>
> Regards,
> Gavin Mu
>
>
> ------------------ Original ------------------
> From: "Konstantin Belousov" <kostikbel@gmail.com>
> Date: Sat, Dec 5, 2015 10:24 PM
> To: "Gavin Mu" <gavin.mu@qq.com>
> Cc: "freebsd-stable" <freebsd-stable@freebsd.org>
> Subject: Re: application coredump behavior differences between FreeBSD 7.0 and FreeBSD 10.1
>
> On Sat, Dec 05, 2015 at 01:09:31PM +0800, Gavin Mu wrote:
> > Hi, kib,
> >
> > Please see my testing on FreeBSD 7.0.
> > freebsd7# sysctl kern.ipc.shmall
> > kern.ipc.shmall: 819200
> > freebsd7# sysctl kern.ipc.shmmax
> > kern.ipc.shmmax: 3355443200
> > freebsd7# uname -a
> > FreeBSD freebsd7.localdomain 7.0-RELEASE FreeBSD 7.0-RELEASE #0: Sun Feb 24 10:35:36 UTC 2008     root@driscoll.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
> >
> >
> > testing code:
> > freebsd7# cat tt.c
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <machine/param.h>
> > #include <sys/types.h>
> > #include <sys/ipc.h>
> > #include <sys/shm.h>
> >
> > int
> > main(int argc, char **argv)
> > {
> >     char **p;
> >     int size;
> >     int i;
> >     char *c = NULL;
> >     int shmid;
> >     void *shm_handle;
> >
> >     size = atoi(argv[1]);
> >     printf("will alloc %dGB\n", size);
> >
> >     shmid = shmget(100, size * 1024 * 1024 * 1024, 0644 | IPC_CREAT);
> >     if (shmid == -1) {
> >         printf("shmid = %d\n", shmid);
> >     }
> >
> >     shm_handle = shmat(shmid, NULL, 0);
> (shm_handle is not a handle).
> >     if (shm_handle == -1) {
> >         printf("null shm_handle\n");
> >     }
>
> What if you add
> 	madvise(shm_handle, size, MADV_SEQUENTIAL);
> there? Does the 10.x behaviour become similar to that of 7.x?
>
> >     *c = 0;
> >     return 0;
> > }
> >
> >
> > freebsd7# ./a.out 1
> > will alloc 1GB
> > Segmentation fault (core dumped)
> >
> >
> > While a.out is running, RES stays at 2024K without increasing:
> >
> > last pid:   735;  load averages:  0.00,  0.01,  0.03    up 0+00:15:11  04:43:35
> > 25 processes:  1 running, 24 sleeping
> > CPU states:  0.0% user,  0.0% nice, 22.6% system,  0.8% interrupt, 76.7% idle
> > Mem: 13M Active, 6380K Inact, 52M Wired, 32K Cache, 39M Buf, 910M Free
> > Swap: 2015M Total, 2015M Free
> >
> >   PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
> >   734 root        1 -16    0  1027M  2024K wdrain   0:02 13.27% a.out
> >
> > But when the same code runs on FreeBSD 10.1, RES keeps increasing up to 1GB.
> > From my testing, if the memory is allocated with malloc(), RES keeps
> > increasing on both 7.0 and 10.1; only sysv_shm on 7.0 behaves differently.
> > I have checked the coredump() code but did not find any clue why it differs.
> >
> > Regards,
> > Gavin Mu
> >
> >
> > ------------------ Original ------------------
> > From: "Konstantin Belousov" <kostikbel@gmail.com>
> > Date: Fri, Dec 4, 2015 05:45 PM
> > To: "Gavin Mu" <gavin.mu@qq.com>
> > Cc: "freebsd-stable" <freebsd-stable@freebsd.org>
> > Subject: Re: application coredump behavior differences between FreeBSD 7.0 and FreeBSD 10.1
> >
> > On Fri, Dec 04, 2015 at 09:35:54AM +0800, Gavin Mu wrote:
> > > Hi,
> > >
> > > We have an application running on old FreeBSD 7.0, and we are upgrading the
> > > base system to FreeBSD 10.1. The application uses sysv_shm and allocates a
> > > lot of shared memory, though most of the time only part of it is actually
> > > used, i.e. large SIZE and small RES in the /usr/bin/top view.
> > >
> > > When the application dumps core, the core file is large. On FreeBSD 7.0 it
> > > uses only a little extra memory while dumping, but on FreeBSD 10.1 it seems
> > > all of the shared memory is touched and a lot of physical memory is used
> > > (RES in the /usr/bin/top output grows a great deal), which drains memory.
> > >
> > > I have been debugging but cannot find any clue yet. Could someone point out
> > > where the issue might be? Thanks.
> >
> > Both stable/7 and latest HEAD read the whole mapped segment in order to
> > write the core dump. This behaviour has not changed, probably since the
> > introduction of ELF support into FreeBSD. How else could the core file
> > contain the contents of the mapped segments?
> >
> > What changed in FreeBSD 10 in this regard is a fix for a deadlock that
> > could occur in some scenarios, including core dumping. In stable/7, the
> > page instantiation or swap-in for pages accessed by the core write was
> > done while holding several VFS locks, which sometimes caused a deadlock.
> > In stable/10 the deadlock avoidance code is enabled by default, and when
> > the kernel detects the possibility of a deadlock it switches to reading
> > carefully, in small chunks.
> >
> > Still, this does not explain the effect you describe. In fact, I am more
> > suspicious of the claim that stable/7 did not increase the RSS of the
> > dumping process, or did not access the whole mapped shared segment, than
> > of the claim that there is a regression in stable/10.
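
For reference, a minimal, self-contained sketch of the experiment under
discussion: create and attach a SysV shared memory segment, hint sequential
access over the whole attached range with madvise(MADV_SEQUENTIAL), then
fault to force a core dump. This is an illustrative sketch, not the code
that was actually run in this thread. It keeps the arbitrary segment key
100 from the tt.c test above, but computes the length as a size_t so that
multi-gigabyte sizes do not overflow an int, and checks shmat() against
(void *)-1, which is how shmat(2) reports failure.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

int
main(int argc, char **argv)
{
	size_t gb, len;
	int shmid;
	void *addr;
	char *c;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <GB>\n", argv[0]);
		return (1);
	}

	/* Compute the length in size_t so sizes >= 2GB do not overflow an int. */
	gb = (size_t)atoi(argv[1]);
	len = gb * 1024 * 1024 * 1024;
	printf("will alloc %zuGB\n", gb);

	/* Key 100 is kept from the tt.c test program; it is otherwise arbitrary. */
	shmid = shmget(100, len, 0644 | IPC_CREAT);
	if (shmid == -1) {
		fprintf(stderr, "shmget: %s\n", strerror(errno));
		return (1);
	}

	/* shmat(2) reports failure by returning (void *)-1, not NULL. */
	addr = shmat(shmid, NULL, 0);
	if (addr == (void *)-1) {
		fprintf(stderr, "shmat: %s\n", strerror(errno));
		return (1);
	}

	/* Advise sequential access over the whole attached segment, in bytes. */
	if (madvise(addr, len, MADV_SEQUENTIAL) != 0)
		fprintf(stderr, "madvise: %s\n", strerror(errno));

	/* Write through a NULL pointer to force the core dump, as in the thread. */
	c = NULL;
	*c = 0;
	return (0);
}

Built with cc and run as ./a.out 1, this should exercise the same path as
the tt.c test above, with the advice applied to the full segment length in
bytes.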