From owner-freebsd-hackers Tue Jan 23 06:03:58 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3)
	id GAA19841 for hackers-outgoing; Tue, 23 Jan 1996 06:03:58 -0800 (PST)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19])
	by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id GAA19783
	for ; Tue, 23 Jan 1996 06:03:33 -0800 (PST)
Received: (from bde@localhost) by godzilla.zeta.org.au (8.6.9/8.6.9)
	id AAA29712; Wed, 24 Jan 1996 00:43:55 +1100
Date: Wed, 24 Jan 1996 00:43:55 +1100
From: Bruce Evans
Message-Id: <199601231343.AAA29712@godzilla.zeta.org.au>
To: davidg@Root.COM, terry@lambert.org
Subject: Re: stanford benchmark/usenix
Cc: freebsd-hackers@freefall.freebsd.org, hasty@rah.star-gate.com,
	rmallory@wiley.csusb.edu
Sender: owner-hackers@FreeBSD.ORG
Precedence: bulk

>Do you remember Bruce's message regarding reordering the cache line
>loads in the P5 optimized bcopy?  He said:
>
>| On my 486DX2/66 with an unknown writing strategy, copy() is about 20%
>| faster than memcpy() (*) but can be improved another 20% by changing the
>| cache line allocation strategy slightly: replace the load of 28(%edi) by
>| a load of 12(%edi) and add a load of 28(%edi) in the middle of the loop.
>| The pairing stuff and the nops make little difference.  Cache-line
>| alignment of the source and target made little difference.
>|
>| (*) When memcpy() is run a second time, it is as fast as the fastest
>| version of copy()!
>
>I didn't quite follow the reasoning, since it would write the contents
>of 12(%edi) into 28(%edi)?!?
>
>I mailed Bruce about this directly, but haven't seen a response yet...

I didn't see the mail.  The contents of 12(%edi) should be loaded into a
free register and not stored anywhere (cmpl $0,12(%edi) can be used if it
is too inconvenient to have a free register, but it takes longer).  All of
this is very machine-dependent (not just CPU-dependent).
For my 486DX2/66 with an unknown writing strategy, the best way to write
is to write a (16 byte) cache line at a time and to prefetch each cache
line by reading an aligned 32 bit word from it.  For my 486DX/33 with a
write buffer, the prefetch just wastes time.  The (*) has something to do
with data being left in a cache by a previous benchmark.  Prefetching
apparently helps by causing a cache hit for the writes; already-cached
data works in the same way.  Memory benchmarks should do something to set
all caches to a nearly known state before starting.

The best way to read depends on whether the data is in the L1 cache, the
L2 cache, or in main memory.  For 486's:

(1) If it's in the L1 cache, just reading aligned 32 bit words works as
    fast as possible (1 cycle/word).

(2) If it's in the L2 cache but not in the L1 cache, then prefetching
    (16 byte) cache lines (the next one you need, not the current one as
    for writes) works best (stalls for prefetching can be overlapped
    with stalls for reading).

(3) If it's only in main memory, then prefetching makes little difference
    on my systems.  There might be a difference on systems with faster
    main memory or different L2 caches...

Copying requires both reading and writing.  I don't know what the best
combined strategy is.  There are cases where you have a good idea which
cache the data is in.  E.g., after busting the L1 cache by reading >= 8K
of data from an IDE drive, the data is probably in the L2 cache and may
be in the L1 cache, so it would be good to copy it immediately to user
space to satisfy any user reads; bcopy could take lots of flags telling
it where you think the data is or which strategy you think is best.  If
you don't plan to copy the data immediately, then you should have
disabled caching before reading it.

Bruce