Date: Wed, 26 Jul 2000 03:50:26 -0700 (PDT)
From: Kris Kennaway <kris@FreeBSD.org>
To: Mark Murray <mark@grondar.za>
Cc: arch@FreeBSD.org
Subject: Re: Estimating entropy
Message-ID: <Pine.BSF.4.21.0007260333360.95417-100000@freefall.freebsd.org>
In-Reply-To: <200007261019.MAA00605@grimreaper.grondar.za>
On Wed, 26 Jul 2000, Mark Murray wrote:

> > 2) Keep a frequency table and calculate or estimate the shannon entropy
> > periodically. This may be feasible if we treat the samples as 8-bit
> > sources, as you only have to loop over 256 values and calculate a log_2
> > of the probabilities (although lack of FP in the kernel would complicate
> > this)
>
> I have been looking for articles on Shannon entropy; all I can find
> is a theorem that covers ergodic systems. Do you have any online
> references?

It's pretty simple: if you have an information source which outputs data
with a given probability distribution p(x) (technically speaking, a
statistical ensemble of sources, in order to define the probability), then
the entropy content is

  S = - \Int_x dx p(x) log_2 p(x)

where \Int_x is an integral over x. If x is a discrete variable (as it is
in most/all of the cases we'll be looking at), then replace \Int_x dx with
\Sigma_x (i.e. a sum over all possible output values).

Basically, (Shannon) entropy is a measure (in bits) of how much information
you learn by performing the measurement, given that some measurements are
more likely than others and therefore not as much of a "surprise" (you
probably knew this, but for the benefit of anyone else who might not..).
It's also the absolute minimum length a given piece of data can be
compressed to by a general-purpose, "perfect" compression algorithm, which
is why gzip is a decent metric for it.

The requirement about ergodic systems basically means that the way a single
system behaves if you watch it for a long period of time is the same as the
way a large number of identical systems (in different states) behave when
observed at a single instant (the aforementioned statistical ensemble
assumption). There's also another implicit assumption about a "stationary"
distribution, which basically means it's not time-varying: if you have a
probability distribution which changes over time, then Shannon entropy will
not be a true measure of the information content of the source, because it
doesn't take into account temporal patterns which may decrease the actual
information content.

See the code I posted a few days ago for a C implementation which
calculates the entropy of the PCM input; it should make the above clearer
(a rough sketch of the same frequency-table calculation is appended below).

Kris

--
In God we Trust -- all others must submit an X.509 certificate.
    -- Charles Forsythe <forsythe@alum.mit.edu>


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message
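A minimal sketch of the frequency-table calculation described above,
assuming 8-bit samples tallied into a 256-entry histogram. This is not the
PCM entropy code referenced in the message; the names, the stdin-driven
main(), and the use of userland floating point (log2() from libm) are
illustrative assumptions only, and the FP is exactly the complication noted
for kernel use:

/*
 * Estimate the Shannon entropy of a byte stream, in bits per byte:
 *   S = -sum_x p(x) log_2 p(x), summed over the 256 possible byte values.
 * Build with:  cc -o entropy entropy.c -lm
 */
#include <math.h>
#include <stdio.h>

static double
shannon_entropy(const unsigned char *buf, size_t len)
{
	unsigned long freq[256] = { 0 };
	double entropy, p;
	size_t i;

	if (len == 0)
		return (0.0);

	/* Tally each 8-bit sample into the frequency table. */
	for (i = 0; i < len; i++)
		freq[buf[i]]++;

	/* Empty bins contribute nothing (p log p -> 0 as p -> 0). */
	entropy = 0.0;
	for (i = 0; i < 256; i++) {
		if (freq[i] == 0)
			continue;
		p = (double)freq[i] / (double)len;
		entropy -= p * log2(p);
	}
	return (entropy);		/* 0.0 .. 8.0 bits per byte */
}

int
main(void)
{
	static unsigned char buf[65536];
	size_t n;

	n = fread(buf, 1, sizeof(buf), stdin);
	printf("%.4f bits/byte over %lu bytes\n",
	    shannon_entropy(buf, n), (unsigned long)n);
	return (0);
}

Fed from /dev/zero it should report about 0 bits/byte, while the output of
a good random source should come out close to 8; note that, as the message
says, this per-byte figure ignores temporal patterns, so it is an upper
bound rather than a true measure for non-stationary sources.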