From nobody Tue Nov 9 15:56:19 2021 X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 2BDB51854FE0 for ; Tue, 9 Nov 2021 15:56:22 +0000 (UTC) (envelope-from avg@freebsd.org) Received: from smtp.freebsd.org (smtp.freebsd.org [96.47.72.83]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "smtp.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4HpXdx6tYbz3LWx; Tue, 9 Nov 2021 15:56:21 +0000 (UTC) (envelope-from avg@freebsd.org) Received: from [192.168.0.88] (unknown [195.64.148.76]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) (Authenticated sender: avg/mail) by smtp.freebsd.org (Postfix) with ESMTPSA id 78FCB27356; Tue, 9 Nov 2021 15:56:21 +0000 (UTC) (envelope-from avg@freebsd.org) From: Andriy Gapon To: Konstantin Belousov Cc: FreeBSD Current References: <20191022104434.GM73312@kib.kiev.ua> <3a67f9a9-31cf-5814-4a68-8bdd6063b21e@FreeBSD.org> <20191022131633.GN73312@kib.kiev.ua> <9c131a2a-cc94-4d93-1ba8-595c0151e366@FreeBSD.org> <32fe4f76-155b-8b99-5782-4daafd4219d5@FreeBSD.org> Subject: Re: thread on sleepqueue does not wake up after timeout Message-ID: Date: Tue, 9 Nov 2021 17:56:19 +0200 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:78.0) Gecko/20100101 Firefox/78.0 Thunderbird/78.14.0 List-Id: Discussions about the use of FreeBSD-current List-Archive: https://lists.freebsd.org/archives/freebsd-current List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-current@freebsd.org MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit X-ThisMailContainsUnwantedMimeParts: N On Tue, Nov 09, 2021 at 11:58:30AM +0200, Andriy Gapon wrote: > Here is an explanation for the numbers reported in the panic message (sorted > from earliest to latest): > 190543869603412 - 'now' as seen in sleepq_timeout(), returned by sbinuptime(); > 190543869738008 - td_sleeptimo, also c_time in the callout; > 190543869798505 - 'now' as captured when the LAPIC timer fired, seen in the > stack trace and also recorded as c_exec_time in the callout. Kostik, thank you very much for the pointers! I spent some more time staring at the code and at the timehands data from the dump (which I neglected to share earlier). I am starting to think that there is a bug in the FreeBSD code (at least in the copy that we use) unless I got confused somewhere or made a mistake in calculations. So, I think that there is a discrepancy between how "large deltas" are handled in tc_windup and bintime_off. As large deltas happen very rarely, especially on good hardware, the bug should be very rare as well. Now to the data. (kgdb) p *timehands->th_counter $28 = {tc_get_timecount = 0xffffffff809de380 , tc_poll_pps = 0x0, tc_counter_mask = 4294967295, tc_frequency = 14318180, tc_name = 0xffffffff80b0ff97 "HPET", tc_quality = 950, tc_flags = 0, tc_priv = 0xfffffe0010916000, tc_next = 0xffffffff810f6c30 , tc_fill_vdso_timehands = 0xffffffff809dc6b0 , tc_fill_vdso_timehands32 = 0x0} (kgdb) p timehands_count $76 = 2 (kgdb) p timehands $75 = (struct timehands * volatile) 0xffffffff8109e1a0 (kgdb) p &ths[0] $77 = (struct timehands *) 0xffffffff8109e120 (kgdb) p &ths[1] $78 = (struct timehands *) 0xffffffff8109e1a0 (kgdb) p ths[0] $79 = {th_counter = 0xfffffe0010916060, th_adjustment = 254493021346896, th_scale = 1288420532592, th_large_delta = 14317331, th_offset_count = 3817197766, th_offset = {sec = 44363, frac = 7084573033620442688}, th_bintime = { sec = 1636195324, frac = 14622574300909856022}, th_microtime = {tv_sec = 1636195324, tv_usec = 792691}, th_nanotime = {tv_sec = 1636195324, tv_nsec = 792691341}, th_boottime = {sec = 1636150961, frac = 7538001267289413334}, th_generation = 2204358, th_next = 0xffffffff8109e1a0 } (kgdb) p ths[1] $80 = {th_counter = 0xfffffe0010916060, th_adjustment = 254492583661022, th_scale = 1288420532460, th_large_delta = 14317331, th_offset_count = 3832485779, th_offset = {sec = 44364, frac = 8334125784005739824}, th_bintime = { sec = 1636195325, frac = 15872127051295153158}, th_microtime = {tv_sec = 1636195325, tv_usec = 860429}, th_nanotime = {tv_sec = 1636195325, tv_nsec = 860429731}, th_boottime = {sec = 1636150961, frac = 7538001267289413334}, th_generation = 2204358, th_next = 0xffffffff8109e120 } th_offset_count difference between the hands is 15288013. It's a bit above tc_frequency of 14318180, so before the latest wind-up there hasn't been a wind-up for more than a second, a rare situation indeed. The difference is also greater than th_large_delta of 14317331. I redid the th_offset calculations in tc_windup by hand and arrived at exactly the same value of ths[1].th_offset as seen in kgdb using ths[0].th_offset and the delta as inputs. So, this is consistent. Then I did a thought experiment: what would binuptime() return at exactly the same moment when tc_windup was called? That binuptime() would still use ths[0] as the timehands because the hands have not been switched yet and it would also see exactly the same timecounter delta. So, starting conditions: delta = 15288013 th_large_delta = 14317331 th_offset = {sec = 44363, frac = 7084573033620442688} th_scale = 1288420532592 The calculations in the code (bintime_off) are: if (__predict_false(delta >= large_delta)) { /* Avoid overflow for scale * delta. */ x = (scale >> 32) * delta; bt->sec += x >> 32; bintime_addx(bt, x << 32); bintime_addx(bt, (scale & 0xffffffff) * delta); } else { My manual calculations: x = (1288420532592 >> 32) * 15288013 == 4571115887 bt->sec += 4571115887 >> 32 == 1 bt = { 44364, 7084573033620442688 } bt 4571115887 << 32 == 1186049167181479936 bt = { 44364, 8270622200801922624 } bt (scale & 0xffffffff) * delta == 4225311088 * 15288013 == 64596610842388144 bt = { 44364, 8335218811644310768 } So, comparing to ths[1].th_offset the resulting time has larger 'frac' part: 8335218811644310768 - 8334125784005739824 == 1093027638570944 So, IMO, this means that at the moment of the hands switch the binuptime (and all other times) would jump backwards. Converting both times to sbintime_t I got: 190543869814104 is sbinuptime using ths[0] (and delta of 15288013) 190543869559614 is sbinuptime using ths[1] (and delta of 0) This is a jump backwards by 254490 parts. If I put these times together with the times found in the crash dump stack (quoted at the start), then I get: 190543869559614 - hand switch time calculated using ths[1] (and delta of 0) 190543869603412 - 'now' as seen in sleepq_timeout(), returned by sbinuptime(); 190543869738008 - td_sleeptimo, also c_time in the callout; 190543869798505 - 'now' as captured when the LAPIC timer fired 190543869814104 - hand switch time calculated using ths[0] (and delta of 15288013) So, it seems that the original (greater) 'now' was obtained some time before the switch. And the second (lesser) 'now' was obtained some time after the switch. So, as I was saying, when the delta is large the calculations in tc_windup and bintime_off give slightly different results and that can lead to a discontinuity of the time when timehands are switched. Just in case, the code in tc_windup for comparison: th->th_offset_count += delta; th->th_offset_count &= th->th_counter->tc_counter_mask; while (delta > th->th_counter->tc_frequency) { /* Eat complete unadjusted seconds. */ delta -= th->th_counter->tc_frequency; th->th_offset.sec++; } if ((delta > th->th_counter->tc_frequency / 2) && (th->th_scale * delta < ((uint64_t)1 << 63))) { /* The product th_scale * delta just barely overflows. */ th->th_offset.sec++; } bintime_addx(&th->th_offset, th->th_scale * delta); In my case there was a single iteration of the while loop and the (delta > th->th_counter->tc_frequency / 2) branch was not taken. -- Andriy Gapon