Date:      Mon, 12 Jul 1999 22:28:02 -0700 (PDT)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        Peter Jeremy <jeremyp@gsmx07.alcatel.com.au>
Cc:        mike@smith.net.au, freebsd-current@FreeBSD.ORG
Subject:   Re: "objtrm" problem probably found (was Re: Stuck in "objtrm")
Message-ID:  <199907130528.WAA74299@apollo.backplane.com>
References:   <99Jul13.134051est.40360@border.alcanet.com.au>

:
:Based on general computer architecture principles, I'd say that a lock
:prefix is likely to become more expensive[1], whilst a function call
:will become cheaper[2] over time.
:...
:
:[1] A locked instruction implies a synchronous RMW cycle.  In order
:    to meet write-ordering guarantees (without which, a locked RMW
:    cycle would be useless as a semaphore primitive), it implies a
:    complete write serialization, and probably some level of
:    instruction serialisation.  Since write-back pipelines will get

    A locked instruction only implies cache coherency across the
    instruction.  It does not imply serialization.  Intel blows it
    big time, but that's Intel for you.

:    longer and parallel execution units more numerous, the cost of
:    a serialisation operation will get relatively higher.  Also,
:Peter

    It is not a large number of execution units that implies a higher
    cost of serialization, but rather data interdependencies.  A
    locked instruction does not have to imply serialization.
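
    For concreteness, here is a minimal sketch of what such a locked
    RMW primitive looks like on i386 using GCC inline asm.  This is an
    illustration only, not the actual sys/i386/include/atomic.h code:

	/*
	 * Sketch of a lock-prefixed atomic add.  The "lock" prefix
	 * requests exclusive ownership of the cache line containing
	 * *p for the duration of the addl; architecturally it does
	 * not have to drain the pipeline.
	 */
	static __inline void
	atomic_add_int_sketch(volatile unsigned int *p, unsigned int v)
	{
		__asm __volatile("lock; addl %1,%0"
		    : "+m" (*p)
		    : "ir" (v));
	}

    A refcount bump is then just atomic_add_int_sketch(&refcnt, 1), with
    the cost depending only on who currently owns the cache line.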

    Modern cache coherency protocols do not have a problem with 
    a large number of caches in a parallel processing subsystem.

    This is how a typical two-level cache coherency protocol works with an
    L1 and L2 cache (a toy model in C follows the list):

    * The L2 cache is usually a superset of the L1 cache.  That is,
      all cache lines in the L1 cache also exist in the L2 cache.

    * Both the L1 and L2 caches implement a shared and an exclusive
      bit, usually on a per-cache-line basis.

    * When a processor, A, issues a memory op that is not in its cache,
      it can request the cache line from main memory either shared 
      (unlocked) or exclusive (locked).

    * All other processors snoop the main memory access.

    * If another processor, B, has the cache line being requested by
      processor A in its L2 cache, B can provide the cache line to
      processor A and no main memory op actually occurs.

      In this case, the shared and exclusive bits in processor B's L1
      and L2 caches are updated appropriately.

      - if A is requesting shared (unlocked) access, both A and B will
	set the shared bit and B will clear the exclusive bit.  (B will
	cause A to stall if B is in the middle of operating on the
	locked cache line).

      - if A is requesting exclusive (locked) access, B will invalidate
	its cache line and clear both the shared and exclusive bits,
	and A will set its exclusive bit.  A now owns that cache line.

      - If no other processor owns the cache line, A obtains the data
	from main memory and other processors have the option to snoop
	the access in their L2 caches.
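
    Here is a toy model in C of the snoop transitions above.  The names
    and structure are mine, and it collapses the L1/L2 pair into a single
    per-cpu line, but it follows the shared/exclusive rules as stated:

	#include <stdio.h>
	#include <stdbool.h>

	#define NCPU	4

	struct line {
		bool	valid;
		bool	shared;
		bool	exclusive;
	};

	static struct line cache[NCPU];	/* one line, per-cpu copy */

	/* cpu 'a' requests the line; 'excl' is true for a locked access */
	static void
	request(int a, bool excl)
	{
		int b;

		for (b = 0; b < NCPU; b++) {
			if (b == a || !cache[b].valid)
				continue;
			if (excl) {
				/* locked: every other copy is invalidated */
				cache[b].valid = false;
				cache[b].shared = false;
				cache[b].exclusive = false;
			} else {
				/* unlocked: other holders downgrade to shared */
				cache[b].shared = true;
				cache[b].exclusive = false;
			}
		}
		cache[a].valid = true;
		cache[a].shared = !excl;
		cache[a].exclusive = excl;
	}

	int
	main(void)
	{
		int i;

		request(0, false);	/* cpu 0 reads: shared */
		request(1, false);	/* cpu 1 reads: both shared */
		request(2, true);	/* cpu 2 locks: 0 and 1 invalidated */
		for (i = 0; i < NCPU; i++)
			printf("cpu%d: valid=%d shared=%d excl=%d\n", i,
			    cache[i].valid, cache[i].shared,
			    cache[i].exclusive);
		return (0);
	}

    Note that the locked request leaves cpu 2 the sole exclusive owner;
    nothing in the model ever has to serialize instruction execution.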

    So, using the above rules as an example, a locked instruction can cost
    as little as zero extra cycles no matter how many CPUs you have running
    in parallel.  There is no need to serialize or synchronize anything.

    The worst case is actually not even as bad as a complete cache-miss
    case.  The cost of snooping another CPU's L2 cache is much less than
    the cost of going to main memory.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>

