The basic intuition of information theory: learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.
Quantifying information:
- Likely events should have low information content
- Less likely events should have higher information content
- Independent events should have additive information content
The self-information of an event $\mathbf{x} = x$:
\begin{equation}
I(x) = - \log P(x)
\end{equation}
Base e: nats; base 2: bits or shannons. (Information measured in bits is just a rescaling of information measured in nats.)
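A minimal numerical sketch (assuming NumPy; the helper `self_information` is just an illustration, not from the source) showing the nats/bits rescaling and the additivity of information for independent events:

```python
# Sketch: self-information I(x) = -log P(x), in nats (base e) and bits (base 2).
import numpy as np

def self_information(p, base=np.e):
    """I(x) = -log P(x); base e gives nats, base 2 gives bits."""
    return -np.log(p) / np.log(base)

p_x, p_y = 0.25, 0.5                      # probabilities of two independent events
print(self_information(p_x))              # ~1.386 nats
print(self_information(p_x, base=2))      # 2.0 bits (bits are a rescaling of nats)

# Independent events have additive information: I(x, y) = I(x) + I(y)
print(np.isclose(self_information(p_x * p_y),
                 self_information(p_x) + self_information(p_y)))  # True
```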
Shannon entropy:
the amount of uncertainty in an entire probability distribution:
\begin{equation}
H(\mathbf{x}) = E_{x \sim P}[I(x)] = -E_{x \sim P}[\log P(x)] = - \sum_x P(x) \log P(x)
\end{equation}
- Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy (a binary random variable has minimum entropy when p is 0 or 1);
- distributions that are closer to uniform have high entropy (a binary random variable has maximum entropy when p = 0.5), as the sketch below illustrates.
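A small sketch of the binary case (assuming NumPy; `bernoulli_entropy` is a hypothetical helper, and the probabilities are example values):

```python
# Sketch: Shannon entropy of a Bernoulli(p) variable, in nats.
import numpy as np

def bernoulli_entropy(p):
    """H = -p*log(p) - (1-p)*log(1-p), treating 0*log(0) as 0."""
    terms = np.array([p, 1.0 - p])
    terms = terms[terms > 0]              # drop zero-probability outcomes
    return -np.sum(terms * np.log(terms))

print(bernoulli_entropy(0.01))   # ~0.056 -> nearly deterministic, low entropy
print(bernoulli_entropy(0.5))    # ~0.693 = log 2 -> uniform, maximum entropy
```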
When the random variable $\mathbf{x}$ is continuous, the Shannon entropy is known as the differential entropy.
Kullback-Leibler (KL) divergence
Also called the relative entropy.
It measures how different two distributions are: suppose we have two separate probability distributions $P(x)$ and $Q(x)$ over the same random variable $\mathbf{x}$.
The Kullback–Leibler divergence from Q to P (the relative entropy of P with respect to Q) is:
\begin{equation}
D_{KL}(P||Q) = E_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = E_{x \sim P}[\log P(x) - \log Q(x)] = \sum_x P(x) \log \frac{P(x)}{Q(x)}
\end{equation}
Useful properties of the KL divergence:
- It is non-negative: $D_{KL}(P||Q) \ge 0$, with equality if and only if $P$ and $Q$ are the same distribution.
- It can be used to measure how different two distributions are, but it is not a true distance metric because it is not symmetric: in general $D_{KL}(P||Q) \ne D_{KL}(Q||P)$, as the example below shows.
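A sketch of both properties (assuming NumPy; `P` and `Q` are arbitrary example distributions, and `kl_divergence` is a hypothetical helper that assumes strictly positive probabilities):

```python
# Sketch: KL divergence between two discrete distributions,
# illustrating non-negativity and asymmetry.
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_x P(x) * (log P(x) - log Q(x)); assumes p, q > 0."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * (np.log(p) - np.log(q)))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.1, 0.3, 0.6])
print(kl_divergence(P, Q))   # ~1.10, non-negative
print(kl_divergence(Q, P))   # ~1.00, generally differs from D_KL(P||Q)
```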
Cross-entropy:
\begin{equation}
H(P,Q)= H(P)+D_{KL}(P||Q)
\end{equation}
where:
\begin{equation}
H(P) = -E_{x \sim P}[\log P(x)] = - \sum_x P(x) \log P(x)
\end{equation}
\begin{equation}
H(P,Q) = -E_{x \sim P}[\log Q(x)] = - \sum_x P(x) \log Q(x)
\end{equation}
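The identity $H(P,Q) = H(P) + D_{KL}(P||Q)$ can be checked numerically; a sketch assuming NumPy, reusing the example distributions from above:

```python
# Sketch: verify H(P, Q) = H(P) + D_KL(P||Q) on two example distributions.
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.1, 0.3, 0.6])

entropy_P     = -np.sum(P * np.log(P))                   # H(P)
cross_entropy = -np.sum(P * np.log(Q))                   # H(P, Q)
kl_divergence =  np.sum(P * (np.log(P) - np.log(Q)))     # D_KL(P||Q)

print(np.isclose(cross_entropy, entropy_P + kl_divergence))  # True
```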