The basic intuition of information theory: learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.
Quantifying information:
- Likely events should have low information content
- Less likely events should have higher information content
- Independent events should have additive information content
The self-information of an event $\mathbf{x} = x$:
\begin{equation}
I(x) = - \log P(x)
\end{equation}
Base e: nats; base 2: bits or shannons. (Information measured in bits is just a rescaling of information measured in nats.)
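A minimal numerical sketch (assuming NumPy; the helper `self_information` is just an illustration, not from the source) showing the nats/bits rescaling and the additivity of information for independent events:

```python
# Sketch: self-information I(x) = -log P(x), in nats (base e) and bits (base 2).
import numpy as np

def self_information(p, base=np.e):
    """I(x) = -log P(x); base e gives nats, base 2 gives bits."""
    return -np.log(p) / np.log(base)

p_x, p_y = 0.25, 0.5                      # probabilities of two independent events
print(self_information(p_x))              # ~1.386 nats
print(self_information(p_x, base=2))      # 2.0 bits (bits are a rescaling of nats)

# Independent events have additive information: I(x, y) = I(x) + I(y)
print(np.isclose(self_information(p_x * p_y),
                 self_information(p_x) + self_information(p_y)))  # True
```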
Shannon entropy:
the amount of uncertainty in an entire probability distribution:
\begin{equation}
H(\mathbf{x}) = E_{x \sim P}[I(x)] = -E_{x \sim P}[\log P(x)] = - \sum_x P(x) \log P(x)
\end{equation}
- Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy (a binary random variable has minimum entropy when p is 0 or 1);
- distributions that are closer to uniform have high entropy (a binary random variable has maximum entropy when p = 0.5), as the sketch below illustrates.
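A small sketch of the binary case (assuming NumPy; `bernoulli_entropy` is a hypothetical helper, and the probabilities are example values):

```python
# Sketch: Shannon entropy of a Bernoulli(p) variable, in nats.
import numpy as np

def bernoulli_entropy(p):
    """H = -p*log(p) - (1-p)*log(1-p), treating 0*log(0) as 0."""
    terms = np.array([p, 1.0 - p])
    terms = terms[terms > 0]              # drop zero-probability outcomes
    return -np.sum(terms * np.log(terms))

print(bernoulli_entropy(0.01))   # ~0.056 -> nearly deterministic, low entropy
print(bernoulli_entropy(0.5))    # ~0.693 = log 2 -> uniform, maximum entropy
```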
When the random variable $\mathbf{x}$ is continuous, the Shannon entropy is known as the differential entropy.
Kullback-Leibler (KL) divergence
Also called the relative entropy.
It measures how different two distributions are: suppose we have two separate probability distributions $P(x)$ and $Q(x)$ over the same random variable $\mathbf{x}$.
The Kullback–Leibler divergence from Q to P (the relative entropy of P with respect to Q) is:
\begin{equation}
D_{KL}(P||Q) = E_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = E_{x \sim P}[\log P(x) - \log Q(x)] = \sum_x P(x) \log \frac{P(x)}{Q(x)}
\end{equation}
Useful properties of the KL divergence:
- It is non-negative: $D_{KL}(P||Q) \ge 0$, with equality if and only if $P$ and $Q$ are the same distribution.
- It can be used to measure how different two distributions are, but it is not a true distance metric because it is not symmetric: in general $D_{KL}(P||Q) \ne D_{KL}(Q||P)$, as the example below shows.
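A sketch of both properties (assuming NumPy; `P` and `Q` are arbitrary example distributions, and `kl_divergence` is a hypothetical helper that assumes strictly positive probabilities):

```python
# Sketch: KL divergence between two discrete distributions,
# illustrating non-negativity and asymmetry.
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_x P(x) * (log P(x) - log Q(x)); assumes p, q > 0."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * (np.log(p) - np.log(q)))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.1, 0.3, 0.6])
print(kl_divergence(P, Q))   # ~1.10, non-negative
print(kl_divergence(Q, P))   # ~1.00, generally differs from D_KL(P||Q)
```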
Cross-entropy:
\begin{equation}
H(P,Q)= H(P)+D_{KL}(P||Q)
\end{equation}
where:
\begin{equation}
H(P) = -E_{x \sim P}[\log P(x)] = - \sum_x P(x) \log P(x)
\end{equation}
\begin{equation}
H(P,Q) = -E_{x \sim P}[\log Q(x)] = - \sum_x P(x) \log Q(x)
\end{equation}
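The identity $H(P,Q) = H(P) + D_{KL}(P||Q)$ can be checked numerically; a sketch assuming NumPy, reusing the example distributions from above:

```python
# Sketch: verify H(P, Q) = H(P) + D_KL(P||Q) on two example distributions.
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.1, 0.3, 0.6])

entropy_P     = -np.sum(P * np.log(P))                   # H(P)
cross_entropy = -np.sum(P * np.log(Q))                   # H(P, Q)
kl_divergence =  np.sum(P * (np.log(P) - np.log(Q)))     # D_KL(P||Q)

print(np.isclose(cross_entropy, entropy_P + kl_divergence))  # True
```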