Date: Sun, 11 May 2014 08:33:56 -0700
From: Nathan Whitehorn <nwhitehorn@freebsd.org>
To: Bruce Evans <brde@optusnet.com.au>
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r265864 - head/sys/dev/vt/hw/ofwfb
Message-ID: <536F9864.9080606@freebsd.org>
In-Reply-To: <20140511133517.N1100@besplex.bde.org>
References: <201405110158.s4B1wvFA072381@svn.freebsd.org> <20140511133517.N1100@besplex.bde.org>

On 05/10/14 23:51, Bruce Evans wrote:
> On Sun, 11 May 2014, Nathan Whitehorn wrote:
>
>> Log:
>>   Make ofwfb not be painfully slow. This reduces the time for a verbose boot
>>   on my G4 iBook by more than half. Still 10% slower than syscons, but that's
>>   much better than a factor of 2.
>>
>>   The slowness had to do with pathological write performance on 8-bit
>>   framebuffers, which are almost universally used on Open Firmware systems.
>>   Writing 1 byte at a time, potentially nonconsecutively, resulted in many
>>   extra PCI write cycles. This patch, in the common case where it's writing
>>   one or several characters in an 8x8 font, gangs the writes together into
>>   a set of 32-bit writes. This is a port of r143830 to vt(4).
>
> Only 10% slower? Bitmapped mode with 256 colors is inherently 4 times
> slower for an 8x8 font (8 bytes/char instead of 2) and 8 times slower for
> an 8x16 font. That's without any I/O pathology. Perhaps you are comparing
> with a syscons that is already very slow due to the hardware not supporting
> text mode.
>
> However, syscons has buffering that should limit this problem.

This is indeed a comparison to syscons in bitmap mode. PowerPC has no VGA
text mode, so that's the best we could do. That using newcons's bitmap
console instead of syscons's bitmap console almost tripled my boot time,
however, was totally unreasonable and needed fixing. Whatever buffering
syscons may have beyond what newcons has is at most a 10% thing.

>>   The EFI framebuffer is also extremely slow, probably for the same reason,
>>   and the same patch will likely help there.
>>
>> Modified:
>>   head/sys/dev/vt/hw/ofwfb/ofwfb.c
>>
>> Modified: head/sys/dev/vt/hw/ofwfb/ofwfb.c
>> ==============================================================================
>> --- head/sys/dev/vt/hw/ofwfb/ofwfb.c    Sun May 11 01:44:11 2014    (r265863)
>> +++ head/sys/dev/vt/hw/ofwfb/ofwfb.c    Sun May 11 01:58:56 2014    (r265864)
>> @@ -136,6 +136,10 @@ ofwfb_bitbltchr(struct vt_device *vd, co
>>      uint32_t fgc, bgc;
>>      int c;
>>      uint8_t b, m;
>> +    union {
>> +        uint32_t l;
>> +        uint8_t c[4];
>> +    } ch1, ch2;
>>
>>      fgc = sc->sc_colormap[fg];
>>      bgc = sc->sc_colormap[bg];
>> @@ -147,36 +151,70 @@ ofwfb_bitbltchr(struct vt_device *vd, co
>>          return;
>>
>>      line = (sc->sc_stride * top) + left * sc->sc_depth/8;
>> -    for (; height > 0; height--) {
>> -        for (c = 0; c < width; c++) {
>> -            if (c % 8 == 0)
>> +    if (mask == NULL && sc->sc_depth == 8 && (width % 8 == 0)) {
>> +        for (; height > 0; height--) {
>> +            for (c = 0; c < width; c += 8) {
>>                  b = *src++;
>> -            else
>> -                b <<= 1;
>> -            if (mask != NULL) {
>> +
>
> Style bug (extra newline).

This newline is an artifact of the weird way SVN has chosen to make a
diff. There is no actual newline in the inside of this loop.

>> +                /*
>> +                 * Assume that there is more background than
>> +                 * foreground in characters and init accordingly
>> +                 */
>> +                ch1.l = ch2.l = (bg << 24) | (bg << 16) |
>> +                    (bg << 8) | bg;
>> +
>> +                /*
>> +                 * Calculate 2 x 4-chars at a time, and then
>> +                 * write these out.
>> +                 */
>> +                if (b & 0x80) ch1.c[0] = fg;
>> +                if (b & 0x40) ch1.c[1] = fg;
>> +                if (b & 0x20) ch1.c[2] = fg;
>> +                if (b & 0x10) ch1.c[3] = fg;
>> +
>> +                if (b & 0x08) ch2.c[0] = fg;
>> +                if (b & 0x04) ch2.c[1] = fg;
>> +                if (b & 0x02) ch2.c[2] = fg;
>> +                if (b & 0x01) ch2.c[3] = fg;
>
> Style bugs (missing newlines).

This is copied and pasted from the syscons driver. I'd prefer to keep it
the same while we have both in the tree and are trying to keep them in sync.
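
For readers who want the ganging idea in isolation, here is a minimal sketch. It is not the ofwfb or syscons code: the function name `blit_glyph_row8` and its parameters are hypothetical, and an ordinary aligned buffer stands in for the frame buffer. It expands one 8-pixel glyph row at 8 bits per pixel and stores it with two 32-bit writes rather than eight 1-byte writes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not the ofwfb code: expand one row of an
 * 8-pixel-wide, 1-bit-per-pixel glyph into eight 8-bit pixels and
 * store them with two 32-bit writes instead of eight 1-byte writes.
 * 'dst' must be 4-byte aligned and stands in for the frame buffer row.
 */
static void
blit_glyph_row8(volatile uint32_t *dst, uint8_t bits, uint8_t fg, uint8_t bg)
{
    uint8_t px[8];
    uint32_t w[2];
    int i;

    assert(((uintptr_t)dst & 3) == 0);

    /* Bit 7 is the leftmost pixel, matching the b & 0x80 test in the diff. */
    for (i = 0; i < 8; i++)
        px[i] = (bits & (0x80 >> i)) ? fg : bg;

    /* Pack in memory order so pixel placement does not depend on endianness. */
    memcpy(w, px, sizeof(w));
    dst[0] = w[0];
    dst[1] = w[1];
}
```

A caller would invoke this once per glyph row and advance `dst` by the stride. Whether two 32-bit stores actually reach the bus as two cycles depends on the platform and on how the frame buffer is mapped, which this sketch does not model.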
>> +
>> +                *(uint32_t *)(sc->sc_addr + line + c) = ch1.l;
>> +                *(uint32_t *)(sc->sc_addr + line + c + 4) =
>> +                    ch2.l;
>> +            }
>> +            line += sc->sc_stride;
>> +        }
>> +    } else {
>> +        for (; height > 0; height--) {
>> +            for (c = 0; c < width; c++) {
>>                  if (c % 8 == 0)
>> -                    m = *mask++;
>> +                    b = *src++;
>>                  else
>> -                    m <<= 1;
>> -                /* Skip pixel write, if mask has no bit set. */
>> -                if ((m & 0x80) == 0)
>> -                    continue;
>> -            }
>> -            switch(sc->sc_depth) {
>> -            case 8:
>> -                *(uint8_t *)(sc->sc_addr + line + c) =
>> -                    b & 0x80 ? fg : bg;
>> -                break;
>> -            case 32:
>> -                *(uint32_t *)(sc->sc_addr + line + 4*c) =
>> -                    (b & 0x80) ? fgc : bgc;
>> -                break;
>> -            default:
>> -                /* panic? */
>> -                break;
>> +                    b <<= 1;
>> +                if (mask != NULL) {
>> +                    if (c % 8 == 0)
>> +                        m = *mask++;
>> +                    else
>> +                        m <<= 1;
>> +                    /* Skip pixel write, if mask not set. */
>> +                    if ((m & 0x80) == 0)
>> +                        continue;
>> +                }
>> +                switch(sc->sc_depth) {
>> +                case 8:
>> +                    *(uint8_t *)(sc->sc_addr + line + c) =
>> +                        b & 0x80 ? fg : bg;
>> +                    break;
>> +                case 32:
>> +                    *(uint32_t *)(sc->sc_addr + line + 4*c)
>> +                        = (b & 0x80) ? fgc : bgc;
>> +                    break;
>> +                default:
>> +                    /* panic? */
>> +                    break;
>> +                }
>>              }
>> +            line += sc->sc_stride;
>>          }
>> -        line += sc->sc_stride;
>>      }
>>  }
>
> A correctly-implemented console driver doesn't have itty-bitty hardware
> i/o like the old version of this or itty-bitty buffering like the changed
> version.

There are many deficiencies in the general approach being used here. I'm
trying to patch it just to work for the time being so that it isn't a
huge regression in console performance compared to syscons. Hopefully,
the general architectural issues -- which you outline well below -- get
solved in due course. This patch at least fixes the immediate problem.
-Nathan

> I thought that syscons always had correct buffering.
> Actually, it uses a hybrid scheme where, at least in text mode, the
> initial i/o is itty-bitty 1 character+attribute at a time (16-bit i/o),
> but scrolling and screen refresh is done by bcopy(), bcopy_io(),
> bcopy_fromio() and bcopy_toio() and a couple of other functions
> (bzero_io(), fill*()) from/to a properly cached buffer in normal memory.
> It used to use only bcopy() and a couple of others (bzero(), fill*()),
> so it automatically did 64-bit i/o's on 64-bit systems, except for
> fillw*() which was intentionally 16 bits for compatibility (but it didn't
> use bcopy() which is needed for even more compatibility). It is unclear
> which old systems break with frame buffer i/o's larger (or smaller)
> than 16 bits. I never had any (x86) hardware that didn't work with any
> size. The video card might be 16-bit only, but then it should just
> tell the CPU this so that the CPU reduces to 16 bits using standard
> x86 mechanisms. Video cards have been PCI or better for about 20
> years. PCI should support precisely 32 bits, but 64-bit frame buffer
> accesses to PCI and AGP video cards always worked for me.
>
> bcopy*io() is more technically correct, but is very badly implemented
> and much slower than bcopy() on most systems. Its misimplementation
> includes not even using bus-space on x86. All bcopy*io() functions
> use copyw() on x86, and copyw() is just a dumb 16-bit memcpy() written
> in C. Writing it in C doesn't lose anything when it is used for
> slow i/o memory, but doing 16-bit i/o's does. And doing 16-bit i/o's
> doesn't even give compatibility, since bzero_io() is just bzero() on
> x86, so it always does wider i/o's. syscons has always used fillw*()
> and never plain fill() since it doesn't want the corresponding 32-bit
> writes that might be given by fill(). fill() actually does 8-bit
> writes. fb also uses the badly named and implemented filll_io().
> This doesn't actually support longs, but only u_int32_t.
> fill_io() is at least ifdefed on ${ARCH}, so its access size is not
> completely hard-coded. On arm and mips, all the ifdefed "io" functions
> except fill_io use plain memcpy() or memset() so they get a maximum
> access size and minimum hardware compatibility. fillw() is 16 bits on
> these arches since the access size is hard-coded in the API (and
> conversion to memset() is not done).
>
> Pessimizations in syscons have made it about twice as slow as in
> FreeBSD-5. This is probably mostly due to switching from bcopy() to
> copyw(). There is a lot of bloat in upper layers, but with 2GHz CPUs it
> would take a factor of about 10 pessimizations there to be comparable
> with i/o pessimizations.
>
> A correctly-implemented console driver assembles an image of the frame
> buffer in fast memory and copies from there to the frame buffer in
> large chunks. It is tricky to keep track of changed regions so as to
> not copy unchanged regions. Copying everything at a refresh rate of
> not much slower than 20 Hz works well. 200 Hz for animation, but that
> is rarely needed. The bandwidth for 80x25 text mode at 20 Hz is 80 kB/
> second. That was easy in 1982. I aimed for 100 Hz refresh on 2 MHz
> 6809 systems in 1987. PC hardware at 5 MHz was about twice as slow,
> especially for frame buffers. But it could do 80 kB/second. The
> bandwidth for 80x25 8x16 256 color bitmapped mode is 640kB/second.
> This was difficult in 1982, but very easy now. Yet the WindowsXP
> safe mode with command prompt console is about as slow at scrolling
> as a 1982 system in graphics mode. It uses similar techniques to
> implement the slowness:
> - a large bitmapped screen. 640x200 8 colors in 1982. Quite
>   a bit larger (something like 1024x768 256 colors) in 20XX.
> - write to the screen very slowly. Use 8-bit writes with i/o artifacts
>   if possible. The 1982 system had to do 8-bit writes to 3 color planes.
>   256-color mode is simpler than most.
>   Writes can also be done very slowly by using another mode and
>   misaligning text so that every character written needs merging with
>   pixels from adjacent characters.
> - do scrolling in software by copying 1 pixel at a time, using
>   read-modify-write
> - I only tested this on 5-10 year old hardware, with a 1920x1080 screen
>   but not all of it used for the console window, and with a laptop
>   1024x768 screen. A good way to be slow, one that has been portable to
>   PC systems since 1982, is to use the BIOS for video. The console was
>   about twice as fast on the laptop. This might be due to a combination
>   of fewer pixels and a less well pessimized BIOS.
>
> Some old screen benchmarks. The benchmark is basically to write lines
> of the screen width and scroll. I stopped updating this often about 15
> years ago when frame buffers and CPUs became fast enough. But it appears
> that software bloat and design errors have caught up.
>
> % ISA ET4000:       2.4MB/sec read,  5.9MB/sec write
> % VLB ET4000/W32i:  6.8MB/sec read, 25.5MB/sec write
> % PCI S3/868:       3.5MB/sec read, 23.1MB/sec write
> % PCI S3/Virge:     4.1MB/sec read, 40.0MB/sec write
> % PCI S3/Savage:    3.3MB/sec read, 25.8MB/sec write
> % PCI Xpert:        5.3MB/sec read, 21.8MB/sec write
> % PCI R9200SE:      5.8MB/sec read, 60.2MB/sec write (but 120MB/sec fpu, 250/sec sfpu)
> % -o means stty flag -opost
> %
> % No-scroll:
>
> Scrolling is avoided by repositioning the cursor after every screenful.
>
> % % machine video           O/S             where       real   user  sys    speed
> % --------- -------         --------------  ---------   -----  ----  -----  -----
> % A/2223    PCI R9200SE     FreeBSD-5.2m    onscreen-o  .026   0.00  .026   76.9
> % A/2223    PCI R9200SE     FreeBSD-5.2m    offscreen-o .026   0.00  .026   76.9
> % A/2223    PCI R9200SE     FreeBSD-5.2m    onscreen    .031   0.00  .031   64.5
> % A/2223    PCI R9200SE     FreeBSD-5.2m    offscreen   .031   0.00  .031   64.5
>
> An 11 year old system.
>
> 'onscreen' means output to an active vty, 'offscreen' to an inactive vty.
> The mere existence of vtys requires full buffering to fast memory for
> inactive vtys, since there is no hardware frame buffer memory to write
> to for the inactive vtys. You have to buffer the writes in a form that
> can be replayed when an inactive vty becomes active, and converting
> immediately to the final form is a good method (it does take more memory
> and limits history to a raw form). 'offscreen' is potentially much
> faster, but in most cases it is only slightly faster, due to delayed
> refreshes for 'onscreen' and relatively fast frame buffer memory.
>
> -opost is tested separately because the Linux console driver was
> amazingly slow without it. This shows that it is possible for the
> software bloat to be so large that it dominates hardware slowness.
> FreeBSD also has lots of bloat in the tty and syscons layers near opost,
> but it is in the noise compared with the old Linux console driver.
>
> I forget the units for these measurements, except that the speed column
> gives a bandwidth in MB/sec. (I don't remember whether this is write(2)
> bandwidth or is related to frame buffer bandwidth.) Interpret them as
> relative.
>
> On a system similar to the above, syscons scrolls at 50000 lines/sec.
> Non-virtually, this would require a frame buffer bandwidth of 200MB/sec,
> which is several times faster than possible. Since syscons only does
> a direct update for bytes written, it needs only about 1/25 of this
> bandwidth or 8MB/sec. This is not quite in the noise compared with
> a frame buffer bandwidth of 60.2MB/sec.
>
> % K6/233    PCI S3/Virge    minix-1.6.25++  offscreen   0.2    0.00  0.12   16.0
> % K6/233    PCI S3/Virge    minix-1.6.25++  onscreen    0.2    0.00  0.12   16.0
>
> The Minix driver from 1990 (rewritten to support virtual consoles and to
> be efficient) is faster than syscons of course. It is smarter about
> buffering, so the onscreen case goes at almost the same speed as the
> offscreen case.
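
The buffering scheme described above (keep the screen image in ordinary memory, track what changed, copy to the frame buffer in large chunks) might look roughly like the following for an 80x25 text screen. This is only an illustrative sketch: `struct shadow_console` and the `sc_*` functions are hypothetical names, not syscons, vt(4) or Minix interfaces, and the "frame buffer" here is a plain array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COLS 80
#define ROWS 25

/* Hypothetical shadow console: image in fast memory plus a dirty row range. */
struct shadow_console {
    uint16_t cells[ROWS][COLS]; /* character + attribute per cell */
    int dirty_lo;               /* first dirty row */
    int dirty_hi;               /* last dirty row; lo > hi means clean */
};

static void
sc_init(struct shadow_console *sc)
{
    memset(sc->cells, 0, sizeof(sc->cells));
    sc->dirty_lo = ROWS;
    sc->dirty_hi = -1;
}

/* Write one cell to the shadow image only; no frame buffer access here. */
static void
sc_putc(struct shadow_console *sc, int row, int col, uint16_t cell)
{
    assert(row >= 0 && row < ROWS && col >= 0 && col < COLS);
    sc->cells[row][col] = cell;
    if (row < sc->dirty_lo)
        sc->dirty_lo = row;
    if (row > sc->dirty_hi)
        sc->dirty_hi = row;
}

/* Scroll the shadow image up one line; every row is dirty afterwards. */
static void
sc_scroll_up(struct shadow_console *sc)
{
    memmove(&sc->cells[0], &sc->cells[1], (ROWS - 1) * sizeof(sc->cells[0]));
    memset(&sc->cells[ROWS - 1], 0, sizeof(sc->cells[0]));
    sc->dirty_lo = 0;
    sc->dirty_hi = ROWS - 1;
}

/* Copy the dirty rows to 'fb' in one chunk; returns the bytes copied. */
static size_t
sc_flush(struct shadow_console *sc, uint16_t (*fb)[COLS])
{
    size_t n;

    if (sc->dirty_lo > sc->dirty_hi)
        return (0);
    n = (size_t)(sc->dirty_hi - sc->dirty_lo + 1) * sizeof(sc->cells[0]);
    memcpy(&fb[sc->dirty_lo], &sc->cells[sc->dirty_lo], n);
    sc->dirty_lo = ROWS;
    sc->dirty_hi = -1;
    return (n);
}
```

If `sc_flush()` is called from a timer at something like the 20 Hz mentioned above, any number of `sc_putc()` and `sc_scroll_up()` calls between two flushes cost at most one full-screen copy, which is what makes most scrolling steps virtual. A single dirty row range is the crudest possible tracking; finer-grained schemes trade bookkeeping for less copying.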
>
> % K6/233    PCI S3/Virge    FreeBSD-current onscreen-o  0.23   0.00  0.23   8.85
> % K6/233    PCI S3/Virge    FreeBSD-current offscreen-o 0.23   0.00  0.23   8.85
>
> syscons is just slightly slower for the offscreen case. -current was
> only current in ~2004.
>
> % K6/233    PCI S3/Virge    FreeBSD-current onscreen    0.34   0.00  0.34   5.83
> % K6/233    PCI S3/Virge    FreeBSD-current offscreen   0.34   0.00  0.34   5.81
>
> But in the onscreen case, syscons is more than 50% slower, due to less
> virtualization. This slowness became smaller with faster frame buffers,
> but is still noticeable in benchmarks with the S3/Virge's write bandwidth
> of 40.0MB/sec.
>
> % P5/133    PCI S3/868      FreeBSD-current onscreen-o  0.39   0.00  0.39   5.10
> % P5/133    PCI S3/868      FreeBSD-current offscreen-o 0.40   0.00  0.40   5.00
> % P5/133    PCI S3/868      FreeBSD-current onscreen    0.51   0.00  0.50   3.92
> % P5/133    PCI S3/868      FreeBSD-current offscreen   0.51   0.00  0.51   3.92
> % K6/233    PCI S3/Virge    linux-2.1.63    offscreen-o 0.97   0.00  0.97   2.06
> % K6/233    PCI S3/Virge    linux-2.1.63    onscreen-o  1.03   0.00  1.03   1.93
> % K6/233    PCI S3/Virge    linux-2.1.63    offscreen   1.18   0.00  1.18   1.69
> % DX2/66    VLB ET4000/W32i FreeBSD-current offscreen-o 1.18   0.00  1.16   1.69
> % DX2/66    VLB ET4000/W32i FreeBSD-current onscreen-o  1.27   0.02  1.23   1.57
> % K6/233    PCI S3/Virge    linux-2.1.63    onscreen    1.38   0.00  1.38   1.45
> % 486/33    ISA ET4000      minix-1.6.25++  offscreen   2      0.01  1.45   1.37
> % 486/33    ISA ET4000      minix-1.6.25++  onscreen    2      0.01  1.60   1.24
> % DX2/66    VLB ET4000/W32i FreeBSD-current offscreen   1.60   0.00  1.59   1.25
> % DX2/66    VLB ET4000/W32i FreeBSD-current onscreen    1.70   0.01  1.66   1.18
> % 486/33    ISA ET4000      FreeBSD-current offscreen-o 2.30   0.01  2.28   0.87
> % 486/33    ISA ET4000      FreeBSD-current onscreen-o  2.39   0.02  2.32   0.84
> % 486/33    ISA ET4000      FreeBSD-current offscreen   3.15   0.03  3.10   0.63
> % 486/33    ISA ET4000      FreeBSD-current onscreen    3.27   0.00  3.21   0.61
> % DX2/66    VLB ET4000/W32i linux-1.2.13    offscreen-o 3.63   0.01  3.62   0.15
> % DX2/66    VLB ET4000/W32i linux-1.2.13    onscreen-o  3.65   0.01  3.63   0.55
> % DX2/66    VLB ET4000/W32i linux-1.2.13    offscreen   12.48  0.01  12.47  0.16
> % 486/33    ISA ET4000      linux-1.1.36    offscreen   20.80  0.00  20.80  0.10
> % DX2/66    VLB ET4000/W32i linux-1.2.13    onscreen    26.98  0.01  26.95  0.07
> % 486/33    ISA ET4000      linux-1.1.36    onscreen    38.34  0.02  38.38  0.05
>
> The speedup from the worst case (old Linux on old hardware) to the
> best case (old Minix on new hardware) is a factor of 38.34/0.026 = 1475.
> Hardware speeds only increased by a factor of about 2223/33 = 67. Minix
> was only 1.5 times faster than syscons and 10-20 times faster than Linux
> on old hardware.
>
> % % Scroll:
> % % machine video           O/S             where       real   user  sys    speed
> % --------- -------         --------------  ---------   -----  ----  -----  -----
> % A/2223    PCI R9200SE     FreeBSD-5.2m    onscreen-o  .047   0.00  .047   42.6
> % A/2223    PCI R9200SE     FreeBSD-5.2m    offscreen-o .047   0.00  .047   42.6
> % A/2223    PCI R9200SE     FreeBSD-5.2m    onscreen    .051   0.00  .051   39.2
> % A/2223    PCI R9200SE     FreeBSD-5.2m    offscreen   .051   0.00  .051   39.2
> % K6/233    PCI S3/Virge    minix-1.6.25++  offscreen   0.2    0.00  0.14   14.0
> % K6/233    PCI S3/Virge    minix-1.6.25++  onscreen    0.2    0.00  0.14   14.0
> % K6/233    PCI S3/Virge    FreeBSD-current onscreen-o  0.36   0.00  0.36   5.54
> % K6/233    PCI S3/Virge    FreeBSD-current offscreen-o 0.40   0.00  0.40   5.01
> % K6/233    PCI S3/Virge    FreeBSD-current onscreen    0.47   0.00  0.47   4.22
> % K6/233    PCI S3/Virge    FreeBSD-current offscreen   0.51   0.00  0.51   3.92
>
> Scrolling makes no difference for Minix due to the better virtualization.
> It slows down syscons by about 50%. Strangely, the onscreen case is now
> faster?!
>
> % P5/133    PCI S3/868      FreeBSD-current onscreen-o  1.24   0.00  1.23   1.61
> % P5/133    PCI S3/868      FreeBSD-current offscreen-o 1.28   0.00  1.27   1.56
> % P5/133    PCI S3/868      FreeBSD-current onscreen    1.35   0.00  1.34   1.48
> % P5/133    PCI S3/868      FreeBSD-current offscreen   1.39   0.00  1.38   1.44
> % K6/233    PCI S3/Virge    linux-2.1.63    onscreen-o  1.49   0.00  1.49   1.34
> % 486/33    ISA ET4000      minix-1.6.25++  offscreen   2      0.00  1.70   1.18
> % 486/33    ISA ET4000      minix-1.6.25++  onscreen    2      0.00  1.81   1.10
> % K6/233    PCI S3/Virge    linux-2.1.63    onscreen    1.85   0.00  1.85   1.08
> % K6/233    PCI S3/Virge    linux-2.1.63    offscreen-o 2.88   0.00  2.88   0.69
> % K6/233    PCI S3/Virge    linux-2.1.63    offscreen   3.10   0.00  3.10   0.65
> % DX2/66    VLB ET4000/W32i FreeBSD-current offscreen-o 3.39   0.02  3.36   0.59
> % DX2/66    VLB ET4000/W32i FreeBSD-current onscreen-o  3.67   0.02  3.63   0.54
> % DX2/66    VLB ET4000/W32i FreeBSD-current offscreen   3.82   0.00  3.81   0.52
> % DX2/66    VLB ET4000/W32i FreeBSD-current onscreen    4.14   0.03  4.06   0.48
> % DX2/66    VLB ET4000/W32i linux-1.2.13    onscreen-o  4.34   0.01  4.32   0.46
> % 486/33    ISA ET4000      FreeBSD-current offscreen-o 5.54   0.03  5.48   0.36
> % 486/33    ISA ET4000      FreeBSD-current onscreen-o  5.73   0.00  5.61   0.35
> % 486/33    ISA ET4000      FreeBSD-current offscreen   6.41   0.03  6.34   0.31
> % 486/33    ISA ET4000      FreeBSD-current onscreen    6.62   0.01  6.45   0.30
>
> The old systems didn't have the CPU or frame buffer bandwidth to scroll
> at 50000 lines/sec. Rescaling 50000 by this 6.62 divided by the above
> 0.026 gives only 196 lines/sec. That was usable, but since you can see
> the scroll move it is not very good. Rescaling Minix's 2.0 gives 650
> lines/sec, or a full screen refresh rate of 26 Hz. You can probably see
> the scroll flicker but not move at this rate. Of course, the
> implementation does delayed refresh to reach this rate, so most of the
> scrolling steps are virtual and you can only see the screen flicker for
> other reasons. syscons' scrolling is also virtual.
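
The rescaling used above is a simple proportion: a reference rate measured at one benchmark time is scaled by the ratio of that time to another. As a worked check, assuming the 50000 lines/sec and 0.026 figures are taken as the reference, it can be written as a small function; the name `rescale_rate` is made up for this illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper: scale a reference scroll rate (lines/sec) that was
 * measured at benchmark time 'ref_time' to a system whose benchmark time
 * is 'time'. Slower systems (larger 'time') get proportionally lower rates.
 */
static double
rescale_rate(double ref_rate, double ref_time, double time)
{
    assert(ref_time > 0.0 && time > 0.0);
    return (ref_rate * ref_time / time);
}
```

With these inputs, `rescale_rate(50000.0, 0.026, 6.62)` comes to roughly 196 lines/sec and `rescale_rate(50000.0, 0.026, 2.0)` to 650 lines/sec; dividing 650 by the 25 lines of a text screen gives the 26 Hz full-screen rate.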
>
> % DX2/66    VLB ET4000/W32i linux-1.2.13    offscreen-o 13.48  0.01  13.47  0.15
> % DX2/66    VLB ET4000/W32i linux-1.2.13    offscreen   22.60  0.01  22.42  0.09
> % 486/33    ISA ET4000      linux-1.1.36    offscreen   23.56  0.03  23.60  0.08
> % DX2/66    VLB ET4000/W32i linux-1.2.13    onscreen    27.73  0.01  27.72  0.08
> % 486/33    ISA ET4000      linux-1.1.36    onscreen    40.26  0.00  40.27  0.05
>
> Rescaling 50000 by this 40.26 divided by the above 0.026 gives about 32
> lines/sec. That is only a bit better than 1982 pixel mode quality. But
> this is for text mode.
>
> Bruce