Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 5 Jan 2022 21:00:51 +0100
From:      Peter <pmc@citylink.dinoex.sub.org>
To:        Mark Johnston <markj@freebsd.org>
Cc:        freebsd-stable@freebsd.org, jtl@freebsd.org
Subject:   Re: dtrace bitfields failure (was: 12.3-RC1 fails ...)
Message-ID:  <YdX48%2B2ZooheyGiB@gate.intra.daemon.contact>
In-Reply-To: <YdTRBU2EoMORjPOw@nuc>
References:  <YdRiUXviQK8hCBhC@gate.intra.daemon.contact> <YdSLk72bft8ed%2BdH@nuc> <YdTC9WzqT2p2NgxX@gate.intra.daemon.contact> <YdTRBU2EoMORjPOw@nuc>

next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, Jan 04, 2022 at 05:58:13PM -0500, Mark Johnston wrote:
! On Tue, Jan 04, 2022 at 10:58:13PM +0100, Peter wrote:
! > On Tue, Jan 04, 2022 at 01:01:55PM -0500, Mark Johnston wrote:
! > ! On Tue, Jan 04, 2022 at 04:05:53PM +0100, Peter wrote:
! > ! > 
! > ! > Hija,
! > ! > 
! > ! >   sadly, I was too early in agreeing that the two patches
! > ! >      22082f15f9
! > ! >      68396709e7
! > ! > together do solve the issue. They only do on a certain assumption,
! > ! > which does not hold true in all cases.
! > ! > 
! > ! > 
! > ! > Let's look at https://reviews.freebsd.org/D27213
! > ! > 
! > ! > This is the code in question that will trigger the action:
! > ! > 
! > ! >      if (dst_type == CTF_ERR && name[0] != '\0' &&
! > ! >              (hep = ctf_hash_lookup(&src_fp->ctf_names, src_fp, name,
! > ! >              strlen(name))) != NULL &&
! > ! >              src_type != (ctf_id_t)hep->h_type) {
! > ! > 
! > ! > What happens here: in the case of a bitfield type we need to also
! > ! > copy the corresponding intrinsic type. This condition here checks for
! > ! > the case and also should deliver that respective intrinsic type
! > ! > into the "hep" variable.
! > ! > 
! > ! > But this depends on the assumption that the intrinsic type appears
! > ! > first in the "src_fp" container, so that the hash will point to it.
! > ! > And that is not necessarily true; it depends on what options you have
! > ! > in your kernel config.
! > ! > 
! > ! > 
! > ! > For instance, with my custom kernel, things look like this:
! > ! > 
! > ! > $ ctfdump -t kernel.full
! > ! > 
! > ! > - Types ----------------------------------------------------------------------
! > ! > 
! > ! >   [1] STRUCT (anon) (8 bytes)
! > ! >         sle_next type=262 off=0
! > ! > 
! > ! >   [2] STRUCT (anon) (8 bytes)
! > ! >         stqe_next type=262 off=0
! > ! > 
! > ! >   [3] UNION (anon) (8 bytes)
! > ! >         m_next type=262 off=0
! > ! >         m_slist type=1 off=0
! > ! >         m_stailq type=2 off=0
! > ! > 
! > ! >   [4] UNION (anon) (8 bytes)
! > ! >         m_nextpkt type=262 off=0
! > ! >         m_slistpkt type=1 off=0
! > ! >         m_stailqpkt type=2 off=0
! > ! > 
! > ! >   <5> INTEGER char encoding=SIGNED CHAR offset=0 bits=8
! > ! >   <6> POINTER (anon) refers to 5
! > ! >   <7> TYPEDEF caddr_t refers to 6
! > ! >   <8> INTEGER int encoding=SIGNED offset=0 bits=32
! > ! >   <9> TYPEDEF __int32_t refers to 8
! > ! >   <10> TYPEDEF int32_t refers to 9
! > ! >   [11] INTEGER unsigned int encoding=0x0 offset=0 bits=8
! > ! >   [12] INTEGER unsigned int encoding=0x0 offset=0 bits=24
! > ! >   [13] STRUCT (anon) (8 bytes)
! > ! >         cstqe_next type=229 off=0
! > ! > 
! > ! >   <14> POINTER (anon) refers to 229
! > ! >   [15] STRUCT (anon) (16 bytes)
! > ! >         le_next type=229 off=0
! > ! >         le_prev type=14 off=64
! > ! > 
! > ! >   <16> INTEGER long encoding=SIGNED offset=0 bits=64
! > ! >   <17> ARRAY (anon) content: 5 index: 16 nelems: 16
! > ! > 
! > ! >   <18> INTEGER unsigned int encoding=0x0 offset=0 bits=32
! > ! >   <19> TYPEDEF u_int refers to 18
! > ! > [etc.etc.]
! > ! > 
! > ! > 
! > ! > As we can see, this one has the bitfield types as #11 and #12, and
! > ! > the intrinsic type as #18. And consequentially things do fail.
! > ! > 
! > ! > 
! > ! > I currently do not know what is the culprit. Has the linking stage of
! > ! > the kernel a flaw? Or is the patch D27213 based on a wrong assumption?
! > ! > 
! > ! > I hope You guys can answer that. For now I changed the patch D27213
! > ! > to cover the case, so that things do work.
! > ! > Further details on request.
! > ! 
! > ! I'm not immediately sure where the problem is.  Could you please post
! > ! the kernel configuration and src revision that you're using, so that I
! > ! can try and reproduce this?
! > 
! > Oh, I feared that would come...
! > Src revision is easy now: release/12.3.0 (70cb68e7a00)
! > 
! > Kernel config is difficult. I have compiled into the kernel
! >  * ipfw (obviousely)
! >  * dtraceall 
! >  * drm2 & friends (that needs objects to be added to conf/files)
! >  * khelp/h_ertt/etc. (that needs the files and fixing the SI_SUB
! >    sequence to make it boot)
! > So the kernel config itself doesn't help to reproduce. 
! 
! Can you show output of "ctfdump -S /path/to/your/kernel"?  Though you're
! on a fairly old revision, with this set of extra modules linked into the
! kernel you might be overflowing CTFv2's limit of 2^15 distinct type
! definitions.  That is not the proximate cause of the problem, which as
! you identified is that a bitfield type is appearing before the
! corresponding intrinsic, but it might be the root cause if a type ID
! overflow is causing ctfmerge to emit types in the wrong order.


# ctfdump -S kernel.full 

- CTF Statistics -------------------------------------------------------------

  total number of data objects        = 27754

  total number of functions           = 27789
  total number of function arguments  = 66565
  maximum argument list length        = 22
  average argument list length        = 2.40

  total number of types               = 23838
  total number of integers            = 60
  total number of floats              = 1
  total number of pointers            = 6618
  total number of arrays              = 2079
  total number of func types          = 1910
  total number of structs             = 7474
  total number of unions              = 362
  total number of enums               = 622
  total number of forward tags        = 38
  total number of typedefs            = 3929
  total number of volatile types      = 44
  total number of const types         = 551
  total number of restrict types      = 0
  total number of unknowns (holes)    = 150

  total number of struct members      = 50431
  maximum number of struct members    = 248
  total size of all structs           = 6834343
  maximum size of a struct            = 1593440
  average number of struct members    = 6.75
  average size of a struct            = 914.42

  total number of union members       = 1234
  maximum number of union members     = 36
  total size of all unions            = 39472
  maximum size of a union             = 8208
  average number of union members     = 3.41
  average size of a union             = 109.04

  total number of enum members        = 6380
  maximum number of enum members      = 1023
  average number of enum members      = 10.26

  total number of unique strings      = 47418
  bytes of string data                = 630726
  maximum string length               = 69
  average string length               = 13.30

 
! > What I am currently looking for is only an educated statement, about
! > if that types sequence (as quoted above) can possibly happen, or, should
! > never happen at all.
! > If it should not happen, then it's my fault and I might go and look why
! > it happens.
! > 
! > ! How exactly does the bug manifest?
! > 
! > Exactly as is to be expected, with either of these two errors
! > (depending on the native order of files in /usr/lib/dtrace);
! > 
! > [1]  dtrace: failed to establish error handler: "/usr/lib/dtrace/ipfw.d",
! >   line 107: failed to copy type of 'inp': Conflicting type is already
! >   defined
! > [2]  dtrace: failed to establish error handler:
! >   "/usr/lib/dtrace/psinfo.d", line 41: failed to copy type of
! >   'pr_gid': Conflicting type is already defined 
! > 
! > 
! > Then I single-stepped the libctf and it clearly showed the mismatch
! > between type #11 and type #18 (and the patch 68396709e7 one time doing
! > things where it shouldn't and the other time not doing things where
! > it should).
! > 
! > So I am probably on track with understanding what happens, nevertheless
! > I would greatly appreciate some input from You how it *is supposed to*
! > work.
! 
! Reading libctf's init_types(), it seems pretty clear that the comment
! added in D27213 is true: the second pass over the type graph inserts
! definitions into various hash tables, and for integer types only the
! first instance of a type with a particular name is inserted.  In your
! case this means that a lookup by name of "unsigned int" will return a
! bitfield type.

That is my understanding so far, as well. So then, when I had figured
that the hash lookup delivers the wrong type, I just removed the hash
lookup, and instead search through the types sequentially. That seems
to work. [3], see below

! I can't immediately see how exactly ctfconvert/ctfmerge ensure that
! bitfields are ordered after intrinsic types, if they really do at all.
! It would be interesting to try running ctfconvert on each object file
! for your kernel to see if the ordering you showed above exists in a
! specific object file.  If not, then I think we should check for a type
! ID overflow, then look more closely at ctfmerge.

Lets do it, that will be fun. Because, I tried already by removing
files from the kernel config. The funny thing is, if I remove any
one of (dtraceall, drm2, khelp/h_errt), the problem goes away. Also, if
I remove any one of the h_ertt files (cc_vegas, cc_cdg), the problem
goes away. And, if I *add* things to the kernel config, it also goes
away.
Conclusion: it seems impossible to pinpoint a single file as the cause.

At that point I gave up and started to singlestep the libctf instead.

So, if You're curious and want to figure it out, here is a git-am patch,
that should apply right on top of release/12.3.0 (70cb68e7a00).
It should create a sys/amd64/conf/D6R12V1 kernelconfig and all my
sourcetree modifications, and should cleanly build the respective
kernel. [3] is included, so the flaw is already covered.

http://oper.dinoex.de/.well-known/acme-challenge/patch-for-mark-johnston.patch
(That one should be accessible)

concerning [3]: this is just an utterly non-optimized proof-of-concept
patch.

cheerio,
PMc



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?YdX48%2B2ZooheyGiB>