Date: Tue, 29 May 2018 19:38:19 +0300
From: Andriy Gapon <avg@FreeBSD.org>
To: Mark Johnston <markj@FreeBSD.org>
Cc: freebsd-current <freebsd-current@FreeBSD.org>, Julian Elischer <julian@FreeBSD.org>, Bryan Drewery <bdrewery@FreeBSD.org>
Subject: Re: Bad link elm in vm_object_terminate [Was: crash on process exit.. current at about r332467]
Message-ID: <8ac5295c-d915-2994-6bcd-bc5a1a68f075@FreeBSD.org>
In-Reply-To: <20180529162217.GA99109@raichu>
References: <9479e941-39be-e6e2-869e-aac475c5e33a@freebsd.org> <9bf4b2b0-65a2-90ef-c8c0-3022e80bc149@FreeBSD.org> <20180529162217.GA99109@raichu>

On 29/05/2018 19:22, Mark Johnston wrote:
> On Tue, May 29, 2018 at 04:50:14PM +0300, Andriy Gapon wrote:
>> On 23/04/2018 17:50, Julian Elischer wrote:
>>> back trace at: http://www.freebsd.org/~julian/bob-crash.png
>>>
>>> If anyone wants to take a look..
>>>
>>> In the exit syscall, while deallocating a vm object.
>>>
>>> I haven't see references to a similar crash in the last 10 days or so.. But if
>>> it rings any bells...
>>
>> We have just got another one:
>> panic: Bad link elm 0xfffff80cc3938360 prev->next != elm
>>
>> Matching disassembled code to C code, it seems that the crash is somewhere in
>> vm_object_terminate_pages (inlined into vm_object_terminate), probably in one of
>> TAILQ_REMOVE-s there:
>>         if (p->queue != PQ_NONE) {
>>                 KASSERT(p->queue < PQ_COUNT, ("vm_object_terminate: "
>>                     "page %p is not queued", p));
>>                 pq1 = vm_page_pagequeue(p);
>>                 if (pq != pq1) {
>>                         if (pq != NULL) {
>>                                 vm_pagequeue_cnt_add(pq, dequeued);
>>                                 vm_pagequeue_unlock(pq);
>>                         }
>>                         pq = pq1;
>>                         vm_pagequeue_lock(pq);
>>                         dequeued = 0;
>>                 }
>>                 p->queue = PQ_NONE;
>>                 TAILQ_REMOVE(&pq->pq_pl, p, plinks.q);
>>                 dequeued--;
>>         }
>>         if (vm_page_free_prep(p, true))
>>                 continue;
>> unlist:
>>         TAILQ_REMOVE(&object->memq, p, listq);
>> }
>>
>>
>> Please note that this is the code before r332974 Improve VM page queue scalability.
>> I am not sure if r332974 + r333256 would fix the problem or if it just would get
>> moved to a different place.
>>
>> Does this ring a bell to anyone who tinkered with that part of the VM code recently?
>
> This doesn't look familiar to me and I doubt that r332974 fixed the
> underlying problem, whatever it is.

I see.

>> Looking a little bit further, I think that object->memq somehow got corrupted.
>> memq contains just two elements and the reported element is not there.
>
> Based on the debugging session, it would be interesting to know if there
> were any other threads somehow manipulating the (dead) object at the
> time of the panic.

I will check for this.

> Among the panics that you observed, is it the same application that is
> causing the crash in each case?

I have two crash dumps right now and in both cases it's sh exec-ing grep.
But I cannot imagine what could be so special about that.
Actually, I see that the shell ran a long pipeline with many grep-s in it, so
there were many exec-s and exits of grep, perhaps some of them concurrent.

-- 
Andriy Gapon
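
For reference, the "prev->next != elm" text comes from the link sanity checks that the debug variants of the sys/queue.h TAILQ macros run before unlinking an element. Below is a minimal userland sketch of that kind of check, not the kernel code. struct page, struct pagelist and every function name here are hypothetical stand-ins, and the corruption scenario (an element dropped from the list while its own link fields are left untouched) is only one assumed way to end up with a two-element list that no longer contains the reported element.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for vm_page: only the list linkage is modelled. */
struct page {
    struct page  *next;  /* next element, or NULL at the tail */
    struct page **prev;  /* address of the pointer that points at us */
};

/* Hypothetical stand-in for a TAILQ head such as object->memq. */
struct pagelist {
    struct page  *first;
    struct page **last;  /* address of the tail's next field */
};

void pagelist_init(struct pagelist *l)
{
    l->first = NULL;
    l->last = &l->first;
}

void pagelist_insert_tail(struct pagelist *l, struct page *p)
{
    assert(l != NULL && p != NULL);
    p->next = NULL;
    p->prev = l->last;
    *l->last = p;
    l->last = &p->next;
}

int pagelist_length(const struct pagelist *l)
{
    int n = 0;
    for (const struct page *p = l->first; p != NULL; p = p->next)
        n++;
    return n;
}

bool pagelist_contains(const struct pagelist *l, const struct page *q)
{
    for (const struct page *p = l->first; p != NULL; p = p->next)
        if (p == q)
            return true;
    return false;
}

/* "next->prev == elm": our successor must point back at our next field. */
bool link_next_ok(struct page *p)
{
    return p->next == NULL || p->next->prev == &p->next;
}

/* "prev->next == elm": whatever precedes us must point at us. */
bool link_prev_ok(struct page *p)
{
    return *p->prev == p;
}

/* Unlink p; report false where a debug kernel would panic instead. */
bool pagelist_remove_checked(struct pagelist *l, struct page *p)
{
    if (!link_next_ok(p) || !link_prev_ok(p))
        return false;
    if (p->next != NULL)
        p->next->prev = p->prev;
    else
        l->last = p->prev;
    *p->prev = p->next;
    return true;
}

/* Assumed scenario: p is dropped from the list but keeps stale links. */
bool simulate_stale_element(void)
{
    struct pagelist memq;
    struct page a, c, p;

    pagelist_init(&memq);
    pagelist_insert_tail(&memq, &a);
    pagelist_insert_tail(&memq, &c);
    pagelist_insert_tail(&memq, &p);

    /* Something truncates the list after c without touching p itself. */
    c.next = NULL;
    memq.last = &c.next;

    bool walk_misses_p = pagelist_length(&memq) == 2 &&
        !pagelist_contains(&memq, &p);
    /* p->next is NULL, so only the prev-side check can fail. */
    bool only_prev_bad = link_next_ok(&p) && !link_prev_ok(&p);
    bool removal_rejected = !pagelist_remove_checked(&memq, &p);

    return walk_misses_p && only_prev_bad && removal_rejected;
}
```

In this sketch simulate_stale_element() returns true: walking the list finds two elements and not p, while removing p trips the prev-side check. It says nothing about which code path actually corrupted memq in the reported crash.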