Date: Wed, 17 Mar 2010 13:23:23 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: freebsd-hackers@freebsd.org
Subject: Re: ATA 4K sector issues
Message-ID: <201003172023.o2HKNNbj069321@apollo.backplane.com>
References: <alpine.BSF.2.00.1003171114280.74067@mignon.ki.iif.hu> <86tysf58a2.fsf@ds4.des.no> <alpine.BSF.2.00.1003171652260.74067@mignon.ki.iif.hu> <f8e3d83f1003171034m5e75eae4r5e8b31d88d361d3b@mail.gmail.com> <367b2c981003171112n785ea9d4q21d00b533819ca67@mail.gmail.com> <f8e3d83f1003171117k20d553b7y7ce4c3c8ed2f5c96@mail.gmail.com>
We experimented a bit with aligning fdisk (DOS slices) by changing the sector offset to 2, but I came to the conclusion that it was better to do the alignment in disklabel / gpt / whatever higher-level partitioner floats your boat and not mess with anything the BIOS uses to boot the machine.

My recommendation is to use a 1MB physical base alignment. That's what I adjusted DragonFly's disklabel64 to do. It's definitely best to have the partitioner deal with it instead of having to mess around manually, because the partitioner can calculate the actual physical alignment by querying the kernel's disk subsystem regardless of the topology (a small sketch of the round-up arithmetic appears further down).

There are several reasons for using a large alignment:

* A variety of media already uses much larger physical block sizes. MLC flash uses 128K blocks and SLC uses 64K blocks. See the note below on why this matters even though SSDs do write combining.

* A larger alignment is more likely to work well as a default in RAID configurations and doesn't hurt non-RAID.

* The kernel cluster I/O subsystem wants to collect stuff into 64K-256K clusters for reading and writing (writing being the most important). A larger alignment plus some minor tweaks in the cluster code will cause the cluster writes to also be well aligned.

* Even though UFS does not take advantage of cluster alignment (because BMAP tends to align only to the UFS block size, which is fairly small, usually <= 32K), filesystems such as ZFS (with 128K blocks, I believe) and HAMMER (with 64K blocks and 8MB super blocks) will. And fixing up UFS isn't difficult. One might need to mess with the cylinder group alignment and make some minor tweaks to the bmap allocator, but that's about it.

* A large alignment hurts nothing. Who cares about ~512K-1MB of wasted space at the beginning of the drive? I don't.

This is particularly important for SSDs. Even though SSDs do write combining, a properly aligned write will theoretically greatly improve write endurance by reducing internal fragmentation, reducing write amplification effects, and also reducing the amount of internal rewriting the drive does to defragment and wear-level. It is hard to test this, but I am seeing wear rates consistent with a 100TB write endurance on 40G Intel drives vendor-spec'd for a 35TB write endurance. So even though you might not see a major difference in performance, you could very well see a big difference in write endurance.

It isn't possible to benchmark this with a standard benchmark which keeps the SSD 100% active, so I've been using real work loads and it just takes forever to tick down the SSD's wear meter. The SSD also needs idle time to implement internal defragmentation and wear leveling efficiently (this seems more apparent in the OCZs than in the Intels).

There are a lot of moving parts in the kernel related to alignment. The cluster code and the filesystem block allocation code are the two biggest issues, and adjustments have to be made to take proper advantage of it, particularly for SSDs.

So the answer is: aligning things certainly isn't going to hurt anything, so you might as well kick it hard (use a large alignment) so you don't have to revisit the problem again a year from now.
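To make that concrete, here is a minimal sketch of the round-up a partitioner has to do once it knows the sector size the kernel reports for the device. This is not the actual disklabel64 code; the names and structure are made up purely for illustration.

    /*
     * Illustrative sketch only (not disklabel64): round a requested
     * starting sector up to the next 1MB boundary, given the device's
     * reported sector size.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define BASE_ALIGN	(1024 * 1024)	/* 1MB physical base alignment */

    static uint64_t
    align_start_sector(uint64_t req_sector, uint32_t sector_size)
    {
    	uint64_t align_sectors = BASE_ALIGN / sector_size;

    	/* round up to the next 1MB boundary, expressed in sectors */
    	return ((req_sector + align_sectors - 1) / align_sectors) *
    	       align_sectors;
    }

    int
    main(void)
    {
    	/* the old fdisk-style start of sector 63 on a 512-byte device */
    	printf("%llu\n", (unsigned long long)align_start_sector(63, 512));
    	/* prints 2048, i.e. the partition now begins 1MB into the disk */
    	return (0);
    }

The same arithmetic works whether the underlying physical unit is a 4K sector, a 64K/128K flash block, or a RAID stripe, which is the point of picking one large alignment and being done with it.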
--

For hard drives with larger physical sector sizes it shouldn't matter for asynchronous writes. It really shouldn't. And nearly all of UFS's writes are asynchronous.

That said: I read Thiago's posting. I will note some specifics about a ports tarball. Ports has 261,000+ files in it, mostly small. UFS and the cluster code CANNOT COMBINE those writes (because the buffer cache for file data is per-vnode), so UFS will wind up doing a very large number of fragment-sized writes.

These fragment-sized writes (4K in Thiago's aligned test that ran in 1:25, and 2K in Thiago's aligned test that ran in 10:24) should STILL be write-combined in the drive. That is, UFS STILL has good write linearity even with the small writes. So I suspect the issue here is that the drive is not properly write-combining the writes, possibly coupled with additional issues in UFS's bmap and inode allocator that might not be presenting the drive with enough write-combinable data that fits in the drive's cache, forcing the drive to do a lot of read-before-write.

In terms of write-combinable data and UFS, it could be a cylinder-group alignment issue. Bitmap blocks are a particular problem because they use an odd-sized block size (typically 6K if I remember right), though I'm not sure how the filesystem fragment size affects it. You would have to instrument the write activity to determine how good the linearity is versus the size of the drive's RAM cache. There are definitely several possible explanations for the horrible performance when using 2K fragments.

ZFS (and also HAMMER) would not have this particular problem. ZFS clearly has other issues in those tests, but I don't know enough about its internals to guess, other than maybe it is a ZIL tuning issue.
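To spell out the read-before-write point, here is a toy check (the helper is made up, it is not UFS or driver code): a 2K fragment can never cover a whole 4K physical sector, and a 4K fragment behind a 63-sector offset straddles two of them, so unless the drive manages to combine adjacent writes in its cache, each such write turns into a read-modify-write inside the firmware.

    /*
     * Toy sketch only: does a write land cleanly on the drive's physical
     * sectors, or does it force the firmware into read-modify-write?
     * phys_size would be 4096 on a 4K-sector drive emulating 512-byte
     * logical sectors.
     */
    #include <stdint.h>
    #include <stdio.h>

    static int
    needs_rmw(uint64_t byte_offset, uint32_t nbytes, uint32_t phys_size)
    {
    	/* a misaligned start or a partial tail touches a partial sector */
    	return ((byte_offset % phys_size) != 0 ||
    	        (nbytes % phys_size) != 0);
    }

    int
    main(void)
    {
    	/* 4K fragment on a 1MB-aligned partition: clean overwrite */
    	printf("%d\n", needs_rmw(1024 * 1024, 4096, 4096));	/* 0 */

    	/* 2K fragment: never fills a 4K physical sector by itself */
    	printf("%d\n", needs_rmw(1024 * 1024, 2048, 4096));	/* 1 */

    	/* 4K fragment behind the old 63-sector offset: straddles two */
    	printf("%d\n", needs_rmw(63ULL * 512, 4096, 4096));	/* 1 */
    	return (0);
    }

-Matt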