From owner-freebsd-arm@FreeBSD.ORG Mon Dec 29 06:41:27 2014
From: bugzilla-noreply@freebsd.org
To: freebsd-arm@FreeBSD.org
Subject: [Bug 194635] Speed optimisation for framebuffer console driver on Raspberry Pi
Date: Mon, 29 Dec 2014 06:41:25 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: arm
X-Bugzilla-Version: 10.0-RELEASE
X-Bugzilla-Severity: Affects Many People
X-Bugzilla-Who: adrian@freebsd.org
X-Bugzilla-Status: Open
X-Bugzilla-Priority: Normal
X-Bugzilla-Assigned-To: freebsd-arm@FreeBSD.org
X-Bugzilla-Target-Milestone: ---
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-arm@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: "Porting FreeBSD to ARM processors."
X-List-Received-Date: Mon, 29 Dec 2014 06:41:27 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=194635

--- Comment #8 from Adrian Chadd ---
Ok, so I finally got around to this!

FreeBSD-HEAD is using vt now, not syscons. I'll still merge your stuff at some
point, but your code is for the syscons console.

For vt, the driver exposes a plain memory-mapped framebuffer to the vt code,
which then uses the code in sys/dev/vt/hw/fb/ to draw things. So it also does
mostly what you've done, and it does it 8, 16, or 32 bits at a time depending
upon the bpp depth.

So, I figured I'd write something that just mmap'ed /dev/fb0 into userland and
tried 8, 16 and 32 bit stores to see what's faster.

#include <sys/mman.h>

#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

//fb0: 1184x624(0x0@0,0) 16bpp
#define WIDTH   1184
#define HEIGHT  624
#define BPP     16

// Not true - need to know "stride".
// but treat this as if it's in bytes
#define FB_SIZE (1184*624*2)

/* Return end - start, borrowing a second if the nanoseconds go negative. */
struct timespec
ts_diff(struct timespec start, struct timespec end)
{
        struct timespec temp;

        if ((end.tv_nsec - start.tv_nsec) < 0) {
                temp.tv_sec = end.tv_sec - start.tv_sec - 1;
                temp.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;
        } else {
                temp.tv_sec = end.tv_sec - start.tv_sec;
                temp.tv_nsec = end.tv_nsec - start.tv_nsec;
        }
        return temp;
}

/* Fill the whole framebuffer with one 8-bit store per byte. */
void
fill_1byte(char *fb, char val)
{
        int i;

        for (i = 0; i < FB_SIZE; i++)
                fb[i] = val;
}

/* Same region, one 16-bit store per two bytes. */
void
fill_2byte(char *fb, uint16_t val)
{
        uint16_t *f = (void *) fb;
        int i;

        for (i = 0; i < FB_SIZE / 2; i++) {
                f[i] = val;
        }
}

/* Same region, one 32-bit store per four bytes. */
void
fill_4byte(char *fb, uint32_t val)
{
        uint32_t *f = (void *) fb;
        int i;

        for (i = 0; i < FB_SIZE / 4; i++) {
                f[i] = val;
        }
}

int
main(int argc, const char *argv[])
{
        char *fb = NULL;
        int fd;
        int i;
        struct timespec tv_start, tv_end, tv_diff;

        fd = open("/dev/fb0", O_RDWR);
        if (fd < 0) {
                err(1, "%s: open", __func__);
        }

        fb = mmap(NULL, FB_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (fb == MAP_FAILED) {
                err(1, "%s: mmap", __func__);
        }

        clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_start);
        for (i = 0; i < 100; i++)
                fill_1byte(fb, i);
        clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_end);
        tv_diff = ts_diff(tv_start, tv_end);
        /* tv_nsec is nanoseconds, so pad the fraction to 9 digits */
        printf("8 bit: 100 runs: %lld.%09lld sec\n",
            (long long) tv_diff.tv_sec,
            (long long) tv_diff.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_start);
        for (i = 0; i < 100; i++)
                fill_2byte(fb, i);
        clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_end);
        tv_diff = ts_diff(tv_start, tv_end);
        printf("16 bit: 100 runs: %lld.%09lld sec\n",
            (long long) tv_diff.tv_sec,
            (long long) tv_diff.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_start);
        for (i = 0; i < 100; i++)
                fill_4byte(fb, i);
        clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_end);
        tv_diff = ts_diff(tv_start, tv_end);
        printf("32 bit: 100 runs: %lld.%09lld sec\n",
            (long long) tv_diff.tv_sec,
            (long long) tv_diff.tv_nsec);

        exit(0);
}

.. and the output:

root@raspberry-pi:~ # ./test
8 bit: 100 runs: 4.015364000 sec
16 bit: 100 runs: 2.107316000 sec
32 bit: 100 runs: 1.012614000 sec
root@raspberry-pi:~ #

.. so:

* Your work is good and it's still good for people using syscons, but you
  should double-check what's in sys/dev/vt/hw/fb/ to see if there's any
  optimisation there;
* To get really fast speed, we should be doing 32 bit stores, not lots of 8 or
  16 bit stores.

The above test filled the same region of memory with 8, 16 and 32 bit
stores. The difference is substantial: roughly 4.0, 2.1 and 1.0 seconds for
100 fills, so each doubling of the store width about halves the time.
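
To make the "32 bit stores" point a bit more concrete for a 16bpp console,
here is a rough sketch of a rectangle fill that writes two pixels per 32-bit
store and honours a row stride. It is not the code in sys/dev/vt/hw/fb/; the
names fill_rect16, count_mismatches and fill_selftest are made up for
illustration, and the stride parameter is an assumption (the real value has to
come from the framebuffer driver). The self-test runs on an ordinary heap
buffer standing in for the mmap'ed /dev/fb0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Fill a w x h rectangle of 16bpp pixels starting at fb.  stride is the
 * row pitch in bytes (assumed parameter, not taken from any driver).
 * Two pixels go out per 32-bit store wherever the address is 4-byte
 * aligned; stray pixels at either end of a row use a 16-bit store.
 */
void
fill_rect16(uint8_t *fb, size_t stride, size_t w, size_t h, uint16_t pix)
{
        const uint32_t pair = ((uint32_t)pix << 16) | pix;
        size_t x, y;

        /* Assumptions: rows start on a 2-byte boundary and fit the stride. */
        assert(((uintptr_t)fb & 1) == 0 && (stride & 1) == 0);
        assert(w * sizeof(uint16_t) <= stride);

        for (y = 0; y < h; y++) {
                uint16_t *row = (uint16_t *)(void *)(fb + y * stride);

                x = 0;
                /* One 16-bit store to reach 4-byte alignment, if needed. */
                if (w > 0 && ((uintptr_t)row & 3) != 0)
                        row[x++] = pix;
                /* Bulk of the row: two pixels per 32-bit store. */
                for (; x + 2 <= w; x += 2)
                        *(uint32_t *)(void *)&row[x] = pair;
                /* Odd pixel left over at the end of the row. */
                if (x < w)
                        row[x] = pix;
        }
}

/* Count pixels inside the rectangle that do not equal pix. */
size_t
count_mismatches(const uint8_t *fb, size_t stride, size_t w, size_t h,
    uint16_t pix)
{
        size_t x, y, bad = 0;
        uint16_t v;

        for (y = 0; y < h; y++) {
                for (x = 0; x < w; x++) {
                        memcpy(&v, fb + y * stride + x * 2, sizeof(v));
                        if (v != pix)
                                bad++;
                }
        }
        return bad;
}

/* Fill a heap buffer and check pixels and row padding; 0 means success. */
int
fill_selftest(void)
{
        const size_t stride = 32, w = 13, h = 4; /* 6 padding bytes per row */
        uint8_t *buf = calloc(h, stride);
        size_t i, y, bad;

        if (buf == NULL)
                return -1;
        fill_rect16(buf, stride, w, h, 0xF800);
        bad = count_mismatches(buf, stride, w, h, 0xF800);
        /* Bytes between the end of a row and the next row must stay zero. */
        for (y = 0; y < h; y++)
                for (i = w * 2; i < stride; i++)
                        if (buf[y * stride + i] != 0)
                                bad++;
        free(buf);
        return bad == 0 ? 0 : 1;
}
```

Calling fill_selftest() from a small main() is enough to try it. Whether this
beats what vt already does would need measuring on the Pi itself, in the same
way as the test above.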