From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 259611] [smartpqi] file system checksum on blocks give errors on higher parallel loads
Date: Tue, 02 Nov 2021 14:08:27 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=259611

            Bug ID: 259611
           Summary: [smartpqi] file system checksum on blocks give errors on higher parallel loads
           Product: Base System
           Version: 13.0-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: girgen@FreeBSD.org

Hi!

I also have problems with this controller. With 13.0 installed, it crashed quite quickly under just intermediate IO load. After upgrading to -STABLE on October 12 2021, the system is quite stable, BUT when restoring PostgreSQL databases with pg_restore -j 5 (five writes in parallel), the database later reports checksum errors when reading some blocks back. This seems to happen mainly for big database indexes that were generated in parallel. I didn't notice until I took a pg_basebackup, because PostgreSQL does not validate a block's checksum until the block is read.

Sorry, lots of database methods, not necessarily common knowledge for SCSI experts.

A pg_basebackup basically copies all the files, quite similar to an rsync, but optionally also validates, as it reads the data, a CRC checksum that was calculated for each block as it was written.

pg_restore reads a database dump, writes all the data to disk and creates the indexes using SQL CREATE INDEX commands, that is, it reads the written files, calculates the indexes and writes them.

For about 1.3 TB of database data, the system had 2324 blocks with checksum errors. All but two of them were in indexes, which kind of suggests that this *could* be a PostgreSQL issue, but given the number of users using PostgreSQL as opposed to the number of users using this controller with FreeBSD, I'm reluctant to discredit PostgreSQL here. We should have heard of it if there was a problem with PostgreSQL?

Since most errors were in the indexes, they could be reindexed, and the one data table that was broken I managed to fix, so at the moment my data seems to be safe, but I do not trust this controller-driver-OS combo much at the moment.

Anything I can do to help find a solution to the problem? I'm considering moving the databases back to an old "trusted" box, so if it could help, I could perhaps supply you with a login to the box in a week or so? Would that help? It has an ILO for remote console as well.
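
To illustrate the per-block checksum validation described above, here is a minimal Python sketch. It is an assumption-laden stand-in, not PostgreSQL code: `zlib.crc32` replaces PostgreSQL's own page checksum algorithm, the block layout (4-byte checksum followed by data) is invented for the example, and the names `write_blocks`, `find_bad_blocks` and `demo` are hypothetical. It only shows the principle: a checksum is stored when a block is written, and a mismatch is noticed only when the block is read back.

```python
import os
import struct
import tempfile
import zlib

BLOCK_SIZE = 8192                     # same size as a default PostgreSQL page
CRC_SIZE = 4                          # illustrative layout, not PostgreSQL's page header
PAYLOAD_SIZE = BLOCK_SIZE - CRC_SIZE


def write_blocks(path, payloads):
    """Write each payload as one block: 4-byte CRC32 followed by the data."""
    with open(path, "wb") as f:
        for data in payloads:
            if len(data) != PAYLOAD_SIZE:
                raise ValueError("payload must be exactly PAYLOAD_SIZE bytes")
            f.write(struct.pack("<I", zlib.crc32(data)) + data)


def find_bad_blocks(path):
    """Read the file block by block; return indexes whose checksum mismatches."""
    bad = []
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            if len(block) < BLOCK_SIZE:
                bad.append(index)     # short (torn) block at end of file
                break
            (stored,) = struct.unpack("<I", block[:CRC_SIZE])
            if stored != zlib.crc32(block[CRC_SIZE:]):
                bad.append(index)
            index += 1
    return bad


def demo():
    """Write 16 random blocks, flip one byte in block 5, report bad blocks."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "blocks.bin")
        write_blocks(path, [os.urandom(PAYLOAD_SIZE) for _ in range(16)])
        clean = find_bad_blocks(path)
        offset = 5 * BLOCK_SIZE + 100
        with open(path, "r+b") as f:
            f.seek(offset)
            original = f.read(1)
            f.seek(offset)
            f.write(bytes([original[0] ^ 0xFF]))   # simulate silent corruption
        return clean, find_bad_blocks(path)


if __name__ == "__main__":
    print(demo())
```

Note that reading a file back right after writing it is normally served from the operating system's cache, so a check like this would probably only expose on-disk corruption once the cached copy is gone (for example after a reboot or remount).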

I am using the built-in RAID:

$ dmesg |grep -i smart
smartpqi0: port 0x8000-0x80ff mem 0xe6c00000-0xe6c07fff at device 0.0 numa-domain 0 on pci9
smartpqi0: using MSI-X interrupts (32 vectors)
da0 at smartpqi0 bus 0 scbus0 target 0 lun 1
da1 at smartpqi0 bus 0 scbus0 target 0 lun 2
ses0 at smartpqi0 bus 0 scbus0 target 72 lun 0
ses0: Fixed Enclosure Services SPC-3 SCSI device
pass3 at smartpqi0 bus 0 scbus0 target 1088 lun 1

$ sudo camcontrol devlist
at scbus0 target 0 lun 1 (pass0,da0)
at scbus0 target 0 lun 2 (pass1,da1)
at scbus0 target 72 lun 0 (ses0,pass2)
at scbus0 target 1088 lun 1 (pass3)
at scbus1 target 0 lun 0 (da2,pass4)

I use the UFS filesystem.

Regards,
Palle

-- 
You are receiving this mail because:
You are the assignee for the bug.