Date: Fri, 24 Jan 1997 09:59:40 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: julian@whistle.com (Julian Elischer)
Cc: bde@FreeBSD.ORG, hackers@FreeBSD.ORG
Subject: Re: Bruce! HHHEEELLLLPPPP!
Message-ID: <199701241659.JAA27323@phaeton.artisoft.com>
In-Reply-To: <32E8BEFD.167EB0E7@whistle.com> from "Julian Elischer" at Jan 24, 97 05:54:05 am

> I can run the test program once which links the nodes up and
> sends some messages around, and then shuts down.
> the second time I do that the system hangs.

How are you draining the interrupts prior to deregistering, so that
when you reuse the kernel address space for a different driver, an
interrupt handler currently in process is not overwritten?

There is a generic problem with the LKM system in terms of reference
counting, since really there is no active reference counting that can
be inversely implied from devices anyway.

The only place I really took care of this in the original LKM system
was FS's and system calls.  The system calls were predicated on the
idea that you would write your call's stat routine, and write an
unload routine that would:

1)  Prevent further entry
2)  Wait for processes currently in the code to exit OR force
    them to exit with error (if possible)
3)  Unload when the code was no longer in use

Because the rundown code was module-specific, the actual practice was
to veto unload and start rundown; this was shown in the sample system
call, and in the FS case, where unload is vetoed without rundown if
the FS module is in use for a mount at the time of the request.

> attempts to follow it in the kernel debugger
> result in the system 'stopping' on passing an splx().
> I presume that it's racing off somewhere and falling over
> or looping, but whatever it's doing, it doesn't make it back to
> the debugger. I've traced the problem down to a few simple,
> TOTALLY INNOCENT bits of code (i.e. if they are linked into
> the node graph the problem occurs, but if they are replaced by
> a dummy (echo) node, I can run the test forever).

You cannot permit interrupts to be processed while you are
manipulating the structures.  Most likely, the code is not present
when this happens.

It might be worthwhile to flush the instruction cache pipeline
(assuming you are using a Pentium or better) by causing a sync
operation for the pipe.
This would guard against the code in the icache being stale (not very
likely, but possible).

Alternately, you could be experiencing a nasty L2 cache interaction
(I think this is less likely than the "replacing active but
unreferenced code entered by an interrupt handler" case).

> Bruce.. I spent quite a few hours trying to understand
> the interrupt masking system and failed..

I think it's a case of unregistration being required, not masking;
Bruce can correct me, but I think an interrupt which occurs while
masked is delivered at unmask time.

This is a scary thought, since it bears on the question "how do I
disable an interrupt without potentially causing it to occur following
unregistration?".  Probably this will have to be done in the
routing/delivery code instead of at the hardware level (mask it,
disable it, unmask it, and eat it if it subsequently occurs).

Masking a hardware interrupt cannot be the answer; I think it would
fail in APIC I/O mode anyway, but it would *certainly* fail to do the
right thing for a shared PCI or level-triggered EISA interrupt.

One diagnostic you might try is to modify the LKM system to not free
the memory on unload.  This would leave identical code in memory.  If
you are masking for stack architecture changes, and this causes the
code to not fail, then you have identified the problem as the
"unloaded" code being entered.

Alternately, you could replace the LKM loaded code with a JMP to
panic code at the end of the allocation area, and incrementally
reverse back up with NOP's.  This would let you catch the code being
erroneously entered, since after the fill, if the IP goes to one of
the NOP's, it executes them all until the JMP (or whatever; I
recommend something as small as you can possibly get, though).  The
inverted fill is in case you run into the IP going the other way.

                                        Regards,
                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
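
The NOP-fill diagnostic suggested above could be sketched roughly as
follows.  This is only an illustration under stated assumptions: the
helper name fill_trap_sled() and its parameters are hypothetical (they
are not part of the LKM code), the target is assumed to be 32-bit i386,
and only the forward fill is shown, not the inverted one.  The region
is overwritten with one-byte NOPs (0x90) and the last five bytes hold
a JMP rel32 (0xE9) to the trap routine, so a stray entry anywhere in
the sled slides down to the jump.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OP_NOP     0x90u   /* i386 one-byte NOP */
#define OP_JMP_REL 0xE9u   /* i386 JMP rel32 */
#define JMP_LEN    5u      /* opcode byte + 32-bit displacement */

/*
 * Hypothetical helper (an assumption for illustration only).
 * Overwrite an unloaded module's region with a NOP sled that ends in
 * a jump to a trap routine.
 *
 *   region       writable view of the memory to fill
 *   len          size of the region in bytes
 *   region_addr  address the region executes at
 *   trap_addr    address of the trap (e.g. panic) routine
 *
 * Returns 0 on success, -1 if the region cannot hold the jump.
 */
static int
fill_trap_sled(uint8_t *region, size_t len, uint32_t region_addr,
    uint32_t trap_addr)
{
	uint32_t next_ip, rel;
	size_t jmp_off;

	if (region == NULL || len < JMP_LEN)
		return (-1);

	jmp_off = len - JMP_LEN;

	/* Everything before the jump becomes NOPs. */
	memset(region, OP_NOP, jmp_off);

	/* rel32 is relative to the address following the JMP. */
	next_ip = region_addr + (uint32_t)len;
	rel = trap_addr - next_ip;	/* wraps modulo 2^32 by design */

	region[jmp_off] = OP_JMP_REL;
	region[jmp_off + 1] = (uint8_t)(rel & 0xff);	/* little-endian */
	region[jmp_off + 2] = (uint8_t)((rel >> 8) & 0xff);
	region[jmp_off + 3] = (uint8_t)((rel >> 16) & 0xff);
	region[jmp_off + 4] = (uint8_t)((rel >> 24) & 0xff);

	return (0);
}
```

In a kernel, this would be called on the module's text instead of
freeing it at unload, with trap_addr pointing at whatever routine is
meant to panic or drop into the debugger.  A smaller trap, such as a
one-byte INT3 (0xCC) as the final byte, would fit the "as small as
you can possibly get" advice better; the five-byte jump is used here
only because the post names a JMP.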