
Information Theory

steve_bank

Shannon is the granddaddy of digital communications. He developed the fundamental relationships between link bandwidth (frequency response), noise, signal level, bit rate, and probability of detection (bit error rate) for digital communications. He adapted entropy from thermodynamics.

His original paper is probably behind a paywall. The subject belongs more under math than science: mostly probability, statistics, and combinatorics.

https://en.wikipedia.org/wiki/Information_theory

Information theory studies the transmission, processing, extraction, and utilization of information. Abstractly, information can be thought of as the resolution of uncertainty. In the case of communication of information over a noisy channel, this abstract concept was made concrete in 1948 by Claude Shannon in his paper "A Mathematical Theory of Communication", in which "information" is thought of as a set of possible messages, where the goal is to send these messages over a noisy channel, and then to have the receiver reconstruct the message with low probability of error, in spite of the channel noise. Shannon's main result, the noisy-channel coding theorem, showed that, in the limit of many channel uses, the rate of information that is asymptotically achievable is equal to the channel capacity, a quantity dependent merely on the statistics of the channel over which the messages are sent.[1] ....

Information theory is based on probability theory and statistics. Information theory often concerns itself with measures of information of the distributions associated with random variables. Important quantities of information are entropy, a measure of information in a single random variable, and mutual information, a measure of information in common between two random variables. The former quantity is a property of the probability distribution of a random variable and gives a limit on the rate at which data generated by independent samples with the given distribution can be reliably compressed. The latter is a property of the joint distribution of two random variables, and is the maximum rate of reliable communication across a noisy channel in the limit of long block lengths, when the channel statistics are determined by the joint distribution. ....

Entropy of an information source

Based on the probability mass function of each source symbol to be communicated, the Shannon entropy H, in units of bits (per symbol), is given by
H = -∑i pi log2(pi)
where pi is the probability of occurrence of the i-th possible value of the source symbol. This equation gives the entropy in the units of "bits" (per symbol) because it uses a logarithm of base 2, and this base-2 measure of entropy has sometimes been called the "shannon" in his honor. Entropy is also commonly computed using the natural logarithm (base e, where e is Euler's number), which produces a measurement of entropy in "nats" per symbol and sometimes simplifies the analysis by avoiding the need to include extra constants in the formulas. Other bases are also possible, but less commonly used. For example, a logarithm of base 2^8 = 256 will produce a measurement in bytes per symbol, and a logarithm of base 10 will produce a measurement in decimal digits (or hartleys) per symbol.
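Here is a quick Python sketch of that formula (my own illustration, not from the article; the name shannon_entropy is just a label I picked). It computes H for a probability mass function and lets you choose the log base, so the same distribution comes out in bits, nats, or hartleys.

```python
import math

def shannon_entropy(pmf, base=2.0):
    """H = -sum_i p_i * log(p_i), for a normalized probability mass function.

    base=2 gives bits (shannons), base=math.e gives nats,
    base=10 gives hartleys, base=256 gives bytes per symbol.
    """
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A source emitting four symbols with unequal probabilities.
pmf = [0.5, 0.25, 0.125, 0.125]
print(shannon_entropy(pmf))          # 1.75 bits per symbol
print(shannon_entropy(pmf, math.e))  # ~1.213 nats per symbol
print(shannon_entropy(pmf, 10))      # ~0.527 hartleys per symbol
```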

Intuitively, the entropy HX of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X when only its distribution is known.

The entropy of a source that emits a sequence of N symbols that are independent and identically distributed (iid) is N ⋅ H bits (per message of N symbols). If the source data symbols are identically distributed but not independent, the entropy of a message of length N will be less than N ⋅ H.
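The N*H claim for an iid source can be checked numerically with a small brute-force sketch of my own: enumerate every length-N sequence of symbols, assign it the product of the per-symbol probabilities, and compute the entropy of that joint distribution.

```python
import math
from itertools import product

def entropy_bits(pmf):
    # H = -sum p*log2(p), in bits
    return -sum(p * math.log2(p) for p in pmf if p > 0)

pmf = [0.5, 0.25, 0.25]   # single-symbol source, H = 1.5 bits
N = 3

# Joint distribution of N independent, identically distributed symbols:
# each length-N sequence gets the product of its symbols' probabilities.
joint = [math.prod(seq) for seq in product(pmf, repeat=N)]

print(entropy_bits(pmf))    # 1.5
print(entropy_bits(joint))  # 4.5 = N * H for the iid source
```

A source with dependence between its symbols would come out below N*H.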



[Figure: The entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function, Hb(p). The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss.]
If one transmits 1000 bits (0s and 1s), and the value of each of these bits is known to the receiver (has a specific value with certainty) ahead of transmission, it is clear that no information is transmitted. If, however, each bit is independently equally likely to be 0 or 1, 1000 shannons of information (more often called bits) have been transmitted. Between these two extremes, information can be quantified as follows. If 𝕏 is the set of all messages {x1, ..., xn} that X could be, and p(x) is the probability of some x ∈ 𝕏, then the entropy, H, of X is defined:[9] ....
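The binary entropy function described in the figure caption is easy to tabulate. Here is a short sketch (again my own; binary_entropy is a name I chose) showing the peak of 1 bit per trial at p = 0.5 and the 1000-fair-bits example.

```python
import math

def binary_entropy(p):
    """Hb(p) = -p*log2(p) - (1-p)*log2(1-p), in bits per trial."""
    if p in (0.0, 1.0):
        return 0.0   # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 4))   # 0.0, 0.469, 1.0, 0.469, 0.0

# 1000 independent, fair bits carry 1000 * Hb(0.5) = 1000 shannons;
# 1000 bits already known to the receiver carry 1000 * Hb(1.0) = 0.
print(1000 * binary_entropy(0.5))
```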
 
https://en.wikipedia.org/wiki/Entropy_(information_theory)

When the data source produces a low-probability value (i.e., when a low-probability event occurs), the event carries more "information" ("surprisal") than when the source data produces a high-probability value. The amount of information conveyed by each event defined in this way becomes a random variable whose expected value is the information entropy. Generally, entropy refers to disorder or uncertainty, and the definition of entropy used in information theory is directly analogous to the definition used in statistical thermodynamics. The concept of information entropy was introduced by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication".[2]

The basic model of a data communication system is composed of three elements, a source of data, a communication channel, and a receiver, and – as expressed by Shannon – the "fundamental problem of communication" is for the receiver to be able to identify what data was generated by the source, based on the signal it receives through the channel.[3]:379–423 and 623–656 The entropy provides an absolute limit on the shortest possible average length of a lossless compression encoding of the data produced by a source, and if the entropy of the source is less than the channel capacity of the communication channel, the data generated by the source can be reliably communicated to the receiver (at least in theory, possibly neglecting some practical considerations such as the complexity of the system needed to convey the data and the amount of time it may take for the data to be conveyed).
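As a concrete illustration of the compression limit mentioned above, here is a sketch using ordinary Huffman coding (my own example, not a construction from Shannon's paper; huffman_code_lengths is a name I made up). For a source whose probabilities are powers of 1/2, the average code length lands exactly on the entropy; in general it sits between H and H + 1 bits per symbol.

```python
import heapq
import math

def huffman_code_lengths(pmf):
    """Codeword length (in bits) assigned to each symbol by a Huffman code."""
    # Heap entries: (probability, tie-breaker, indices of merged symbols).
    heap = [(p, i, [i]) for i, p in enumerate(pmf)]
    heapq.heapify(heap)
    lengths = [0] * len(pmf)
    counter = len(pmf)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1      # every merge adds one bit to the merged symbols
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

pmf = [0.5, 0.25, 0.125, 0.125]
entropy = -sum(p * math.log2(p) for p in pmf)
avg_len = sum(p * n for p, n in zip(pmf, huffman_code_lengths(pmf)))
print(entropy, avg_len)   # 1.75 and 1.75: the code meets the entropy bound here
```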

Information entropy is typically measured in bits (alternatively called "shannons") or sometimes in "natural units" (nats) or decimal digits (called "dits", "bans", or "hartleys"). The unit of the measurement depends on the base of the logarithm that is used to define the entropy.

Introduction

The basic idea of information theory is that the more one knows about a topic, the less new information one is apt to get about it. If an event is very probable, it is no surprise when it happens and provides little new information. Inversely, if the event was improbable, it is much more informative that the event happened. The information content is an increasing function of the reciprocal of the probability of the event (1/p, where p is the probability of the event). If more events may happen, entropy measures the average information content you can expect to get if one of the events actually happens. This implies that casting a die has more entropy than tossing a coin because each outcome of the die has smaller probability than each outcome of the coin.

Entropy is a measure of unpredictability of the state, or equivalently, of its average information content. To get an intuitive understanding of these terms, consider the example of a political poll. Usually, such polls happen because the outcome of the poll is not already known. In other words, the outcome of the poll is relatively unpredictable, and actually performing the poll and learning the results gives some new information; these are just different ways of saying that the a priori entropy of the poll results is large. Now, consider the case that the same poll is performed a second time shortly after the first poll. Since the result of the first poll is already known, the outcome of the second poll can be predicted well and the results should not contain much new information; in this case the a priori entropy of the second poll result is small relative to that of the first.

Consider the example of a coin toss. Assuming the probability of heads is the same as the probability of tails, then the entropy of the coin toss is as high as it could be. There is no way to predict the outcome of the coin toss ahead of time: if one has to choose, the best one can do is predict that the coin will come up heads, and this prediction will be correct with probability 1/2. Such a coin toss has one bit of entropy since there are two possible outcomes that occur with equal probability, and learning the actual outcome contains one bit of information. In contrast, a coin toss using a coin that has two heads and no tails has zero entropy since the coin will always come up heads, and the outcome can be predicted perfectly. Analogously, a binary event with equiprobable outcomes has a Shannon entropy of log2(2) = 1 bit. Similarly, one trit with equiprobable values contains log2(3) (about 1.58496) bits of information because it can have one of three values.
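A three-line check of those numbers (my own snippet, writing H as a sum of p*log2(1/p) terms):

```python
import math

H = lambda pmf: sum(p * math.log2(1 / p) for p in pmf)
print(H([0.5, 0.5]))        # 1.0 bit: fair coin
print(H([1.0]))             # 0.0 bits: two-headed coin, outcome certain
print(H([1/3, 1/3, 1/3]))   # ~1.58496 bits = log2(3): one equiprobable trit
```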

English text, treated as a string of characters, has fairly low entropy, i.e., is fairly predictable. Even if we do not know exactly what is going to come next, we can be fairly certain that, for example, 'e' will be far more common than 'z', that the combination 'qu' will be much more common than any other combination with a 'q' in it, and that the combination 'th' will be more common than 'z', 'q', or 'qu'. After the first few letters one can often guess the rest of the word. English text has between 0.6 and 1.3 bits of entropy per character of the message.
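For comparison, here is a rough sketch (my own) that estimates per-character entropy from single-character frequencies alone. Because it ignores context (the 'qu' and 'th' dependence just described), it lands well above the 0.6 to 1.3 bits per character quoted for English.

```python
from collections import Counter
import math

def frequency_entropy(text):
    """Zeroth-order estimate: entropy of the single-character frequency distribution."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog and then sleeps"
print(frequency_entropy(sample))   # a few bits per character, far above 0.6-1.3,
                                   # because inter-character dependence is ignored
```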
 
I will now derive Claude Shannon's entropy formula. It may be interpreted as the number of bits needed to specify a system's state when the system can be in any of several states with a probability for each state. For states 1, 2, ..., n:

Entropy = H(p1, p2, ..., pn)

Entropy obeys a composition law. Consider a set of states 1, 2, ..., n together with states a1, a2, ..., am, where the "a" states have combined probability pa = pa1 + pa2 + ... + pam.

One can group together all the "a" states, and their contribution to the total entropy will be pa * entropy of all the "a" states relative to their combined state.

H(p1, p2, ..., pn, pa1, pa2, ..., pam) = H(p1, p2, ..., pn, pa) + pa*H(pa1/pa, pa2/pa, ..., pam/pa)
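Here is a quick numerical check of that composition law (my own sketch; H is the usual base-2 entropy), with two ordinary states and three grouped "a" states:

```python
import math

def H(pmf):
    # Shannon entropy in bits of a normalized distribution
    return -sum(p * math.log2(p) for p in pmf if p > 0)

p = [0.4, 0.2]                # the ordinary states
pa_parts = [0.2, 0.1, 0.1]    # the "a" states
pa = sum(pa_parts)            # 0.4

lhs = H(p + pa_parts)
rhs = H(p + [pa]) + pa * H([x / pa for x in pa_parts])
print(lhs, rhs)               # both ~2.1219, so the grouping law holds here
```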

This suggests a simplification: J(p1, p2, ..., pn) = psum*H(p1/psum, p2/psum, ..., pn/psum)

where psum is the sum of all the p's here. This gives us

J(p1, p2, ..., pn, pa1, pa2, ..., pam) = J(p1, p2, ..., pn, pa) + J(pa1, pa2, ..., pam)

This has the consequence that J(p) = 0 for a single argument, and this is consistent with J(p1, p2, ..., pn) = K(p1) + K(p2) + ... + K(pn) - K(psum) for some function K, though I can't think of a way to prove it.
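A quick numerical check of the J additivity relation (my own sketch), using unnormalized probabilities to show that psum = 1 is not required, and confirming J(p) = 0 for a single argument:

```python
import math

def H(pmf):
    # entropy in bits, written as a sum of p*log2(1/p) terms
    return sum(p * math.log2(1 / p) for p in pmf if p > 0)

def J(ps):
    # J(p1..pn) = psum * H(p1/psum, ..., pn/psum)
    s = sum(ps)
    return s * H([p / s for p in ps])

p = [0.8, 0.4]
pa_parts = [0.4, 0.2, 0.2]
pa = sum(pa_parts)

print(J(p + pa_parts), J(p + [pa]) + J(pa_parts))  # both ~4.2439: additivity holds
print(J([0.37]))                                   # 0.0: a single argument carries no entropy
```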
 
I will consider the case of two independent sets of probabilities multiplied together: p1*q1, p1*q2, p2*q1, p2*q2, ...

This has a very interesting result: H({p*q}) = H({p}) + H({q}) and likewise for J. Independent sets of probabilities have additive entropy. For all the probabilities equal, we have

H(m*n of 1/(m*n)) = H(m of 1/m) + H(n of 1/n)

With H(1) = 0, and since f(n) = H(n of 1/n) satisfies f(m*n) = f(m) + f(n), we find H(n of 1/n) = H0*log(n) for some constant H0 (the standard solution of that functional equation).
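A numerical check of the additivity for independent probabilities (my own sketch), both for two arbitrary independent distributions and for the all-equal case, where H(n of 1/n) comes out as log2(n):

```python
import math
from itertools import product

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

p = [0.7, 0.3]
q = [0.5, 0.25, 0.25]
pq = [a * b for a, b in product(p, q)]     # joint distribution of independent sources
print(H(pq), H(p) + H(q))                  # equal: independent probabilities add entropies

m, n = 4, 8
uniform = lambda k: [1.0 / k] * k
print(H(uniform(m * n)), H(uniform(m)) + H(uniform(n)))   # both log2(32) = 5.0 bits
```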

Returning to the J function, it has this scaling: J(a*p1, a*p2, ..., a*pn) = a * J(p1, p2, ..., pn)

For equal probabilities (n arguments all equal to a), J(n of a) = (n*a)*H(n of 1/n) = (n*a)*H0*log(n)

With the J-K decomposition, (n*a)*H0*log(n) = n*K(a) - K(n*a)

Dividing out n*a gives H0*log(n) = K(a)/a - K(n*a)/(n*a)

Holding a fixed and varying n (so that x = n*a takes arbitrary values) gives K(x)/x = - H0*log(x) + K0 for some constant K0, or K(x) = x*(- H0*log(x) + K0)

For J, this gives us J(p1, p2, ..., pn) = I(p1) + I(p2) + ... + I(pn) - I(psum), where I(x) = - H0*x*log(x); the K0 terms cancel, since K0*(p1 + p2 + ... + pn) - K0*psum = 0.

For H, we use the expression for J with the constraint that psum = 1. Since I(1) = 0, that gives us H(p1, p2, ..., pn) = I(p1) + I(p2) + ... + I(pn)

The familiar result: H(p1, p2, ..., pn) = -H0*(p1*log(p1) + p2*log(p2) + ... + pn*log(pn)), which is the standard Shannon entropy in bits once H0 and the base of the logarithm are chosen so that H0*log is the base-2 logarithm.
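To close the loop, here is a small numerical check of the whole J-K construction (my own sketch). It uses natural logs with H0 = 1/ln(2) so the result comes out in bits, picks an arbitrary K0 to confirm that it cancels when psum = 1, and checks the scaling law J(a*p) = a*J(p):

```python
import math

H0 = 1.0 / math.log(2.0)   # with natural logs, this H0 gives entropy in bits
K0 = 7.0                   # arbitrary: the derivation says K0 must drop out

def K(x):
    return x * (-H0 * math.log(x) + K0)

def J(ps):
    # J(p1..pn) = K(p1) + ... + K(pn) - K(psum)
    return sum(K(p) for p in ps) - K(sum(ps))

pmf = [0.5, 0.25, 0.125, 0.125]                  # normalized, psum = 1
shannon = -sum(p * math.log2(p) for p in pmf)    # the textbook formula

print(J(pmf), shannon)                           # both 1.75 bits: K0 cancels when psum = 1
print(J([0.3 * p for p in pmf]), 0.3 * J(pmf))   # scaling law J(a*p) = a*J(p)
```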
 