Date: Sun, 8 May 2011 10:53:14 +0200
From: Joerg Wunsch <freebsd-scsi@uriah.heep.sax.de>
To: freebsd-scsi@freebsd.org
Subject: Panic when removing a SCSI device entry
Message-ID: <20110508085314.GA5364@uriah.heep.sax.de>
I've got a setup where a tape library is attached to a computer-controllable power switch, so it is only turned on while backups (or restores) are being done. This is mainly to reduce the noise level, but also to reduce the overall energy consumption while the library is not needed.

Every now and then, the kernel panics with a page fault during the power cycling and the surrounding actions (unattended, since this happens at night). The current process when the page fault happens is always mt(1), which is used inside the powerup/down script to ensure the drive is properly rewound.

The page fault happens in destroy_devl(), at this location:

        /* If we are a child, remove us from the parents list */
        if (dev->si_flags & SI_CHILD) {
here --->>>     LIST_REMOVE(dev, si_siblings);
                dev->si_flags &= ~SI_CHILD;
        }

The preprocessed code of that looks like:

        if (dev->si_flags & 0x0010) {
                if ((((dev))->si_siblings.le_next) != ((void *)0))
                        (((dev))->si_siblings.le_next)->si_siblings.le_prev =
                            (dev)->si_siblings.le_prev;
                *(dev)->si_siblings.le_prev = (((dev))->si_siblings.le_next);
                dev->si_flags &= ~0x0010;
        }

and it's the indirection of *(dev)->si_siblings.le_prev that hits a NULL pointer. Obviously, LIST_REMOVE doesn't anticipate that dev->si_siblings.le_prev might be a NULL pointer, so this is a usage error somehow. Could it be that destroy_devl() is called twice for the same device?

This used to happen on an earlier system (some version of 7.x-STABLE), and I eventually managed to tweak the library's powerup/down scripts to avoid the critical sequence of actions that triggers this situation. Now that I have finally upgraded the machine to 8.2-STABLE, though, it is triggered very frequently again.
Any ideas how to fix it, or at least apply a workaround, other than turning

        *(elm)->field.le_prev = LIST_NEXT((elm), field);                \

in the LIST_REMOVE macro into

        if ((elm)->field.le_prev != NULL)                               \
                *(elm)->field.le_prev = LIST_NEXT((elm), field);        \

which affects the entire system, not just the SCSI subsystem part?

-- 
cheers, J"org               .-.-.   --... ...--   -.. .  DL8DTL

http://www.sax.de/~joerg/                        NIC: JW11-RIPE
Never trust an operating system you don't have sources for. ;-)