Probability in deep learning

Three possible sources of uncertainty:

  1. Inherent stochasticity in the system.
  2. Incomplete observability.
  3. Incomplete modeling. (For example, discretizing a continuous trajectory means the exact position is lost.)

Random Variables

A random variable is a variable that can take on different values randomly.

Random variables may be discrete or continuous.

  • A discrete random variable is one that has a finite or countably infinite number of states.
  • A continuous random variable is associated with a real value.

Probability Distributions

A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states.

Discrete Variables and Probability Mass Function (PMF)

Single random variable: $\mathbf{x} \sim P(\mathbf{x})$ means the random variable $\mathbf{x}$ follows the probability distribution $P(\mathbf{x})$; the probability that it takes the value $x$ is written $P(\mathbf{x}=x)$.
Multiple random variables: the joint probability distribution $P(\mathbf{x} = x,\mathbf{y} = y)$, abbreviated as $P(x,y)$.

A PMF $P$ must satisfy the following properties:

  • The domain of P must be the set of all possible states of $\mathbf{x}$.
  • $\forall x \in \mathbf{x},0 \le P(x) \le 1$
  • $\sum_{x \in \mathbf{x}} P(x) = 1$
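
A minimal sketch (assuming NumPy) that checks these properties on a small hand-written PMF over three states:

```python
import numpy as np

# A hypothetical PMF over three states of a discrete random variable x.
states = ["a", "b", "c"]
P = np.array([0.2, 0.5, 0.3])

# Each probability lies in [0, 1].
assert np.all((P >= 0) & (P <= 1))

# The probabilities sum to 1.
assert np.isclose(P.sum(), 1.0)
```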

Continuous Variables and Probability Density Functions (PDF)

A PDF $p$ must satisfy the following properties:

  • The domain of p must be the set of all possible states of $\mathbf{x}$.
  • $\forall x \in \mathbf{x}, p(x) \ge 0$. Note that we do not require $p(x) \le 1$.
  • $\int p(x)dx = 1$
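
To see that a density may exceed 1 pointwise while still integrating to 1, the sketch below (assuming NumPy) numerically integrates a narrow Gaussian density:

```python
import numpy as np

# Narrow zero-mean Gaussian density with sigma = 0.1; its peak value is about 3.99 > 1.
sigma = 0.1
x = np.linspace(-1.0, 1.0, 100001)
p = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

print(p.max())         # ~3.99: a valid density value greater than 1
print(np.trapz(p, x))  # ~1.0: the density still integrates to 1
```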

Marginal Probability

The probability distribution over a subset of a set of random variables is known as the marginal probability distribution.

Given the joint probability distribution $P(\mathbf{x},\mathbf{y})$, the marginal $P(\mathbf{x})$ is:

\begin{equation}
\forall x \in \mathbf{x},P(x) = \sum_y P(\mathbf{x} = x,\mathbf{y} = y)
\end{equation}

For continuous variables:
\begin{equation}
p(x) = \int p(x,y)dy
\end{equation}
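
A small sketch of marginalization on a hypothetical 2x2 joint table $P(\mathbf{x},\mathbf{y})$, summing out $\mathbf{y}$ along one axis (assuming NumPy):

```python
import numpy as np

# Hypothetical joint distribution P(x, y): rows index x, columns index y.
P_xy = np.array([[0.10, 0.30],
                 [0.25, 0.35]])

# Marginal P(x): sum over all values of y (axis 1).
P_x = P_xy.sum(axis=1)
print(P_x)        # [0.4 0.6]
print(P_x.sum())  # 1.0
```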

Conditional Probability

\begin{equation}
P(\mathbf{y} = y | \mathbf{x} = x) = \frac{P(\mathbf{y} = y,\mathbf{x} = x)}{P(\mathbf{x} = x)}
\end{equation}

The conditional probability is defined only when $P(\mathbf{x} = x) > 0$. (We cannot compute the conditional probability conditioned on an event that never happens.)
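
On the same kind of hypothetical joint table as above, the conditional distribution $P(\mathbf{y}|\mathbf{x}=x)$ is just the row for $x$ renormalized by the marginal $P(\mathbf{x}=x)$:

```python
import numpy as np

# Hypothetical joint distribution P(x, y): rows index x, columns index y.
P_xy = np.array([[0.10, 0.30],
                 [0.25, 0.35]])

P_x = P_xy.sum(axis=1)          # marginal P(x)
x = 0                           # condition on the event x = 0
assert P_x[x] > 0               # the conditioning event must have nonzero probability
P_y_given_x = P_xy[x] / P_x[x]  # P(y | x = 0)
print(P_y_given_x)              # [0.25 0.75], sums to 1
```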

The Chain Rule of Conditional Probabilities
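
Any joint probability distribution over many random variables can be decomposed into a product of conditional distributions over a single variable each:

\begin{equation}
P(x^{(1)},\dots,x^{(n)}) = P(x^{(1)})\prod_{i=2}^{n} P(x^{(i)}|x^{(1)},\dots,x^{(i-1)})
\end{equation}

For example, $P(a,b,c) = P(a|b,c)P(b|c)P(c)$.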

Independence and Conditional Independence

Two random variables $\mathbf{x}$ and $\mathbf{y}$ are independent, denoted $\mathbf{x} \perp \mathbf{y}$, if

\begin{equation}
\forall x \in \mathbf{x}, y \in \mathbf{y}, p(x,y) = p(x)p(y)
\end{equation}

Two random variables $\mathbf{x}$ and $\mathbf{y}$ are conditionally independent given a random variable $\mathbf{z}$, denoted $\mathbf{x} \perp \mathbf{y}\,|\,\mathbf{z}$, if
\begin{equation}
\forall x \in \mathbf{x}, y \in \mathbf{y}, z \in \mathbf{z} ,p(x,y|z) = p(x|z)p(y|z)
\end{equation}
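
A quick numeric sketch (assuming NumPy) that tests independence on a hypothetical joint table by comparing the joint to the outer product of its marginals:

```python
import numpy as np

# Hypothetical joint table P(x, y) built as an outer product, so x and y are independent.
P_x = np.array([0.4, 0.6])
P_y = np.array([0.3, 0.7])
P_xy = np.outer(P_x, P_y)

# Independence holds exactly when P(x, y) = P(x) P(y) for every pair of states.
print(np.allclose(P_xy, np.outer(P_xy.sum(axis=1), P_xy.sum(axis=0))))  # True
```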

Expectation, Variance and Covariance

expectation or expected value

For a discrete random variable $\mathbf{x}$ with probability distribution $P(\mathbf{x})$, the expected value is:

\begin{equation}
E[\mathbf{x}] = \sum_x xP(x)
\end{equation}

\begin{equation}
E[f(\mathbf{x})] = \sum_x f(x)P(x)
\end{equation}

For a continuous random variable $\mathbf{x}$, the expected value is:

\begin{equation}
E[\mathbf{x}] = \int xp(x)dx
\end{equation}

\begin{equation}
E[f(\mathbf{x})] = \int f(x)p(x)dx
\end{equation}
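
Expectations can also be approximated by averaging $f$ over samples drawn from the distribution; a minimal Monte Carlo sketch (assuming NumPy) for a standard normal $\mathbf{x}$ and $f(x)=x^2$, whose exact expectation is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)  # x ~ N(0, 1)

# Monte Carlo estimate of E[f(x)] with f(x) = x^2; the exact value is 1.
estimate = np.mean(samples**2)
print(estimate)                           # ~1.0
```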

variance

\begin{equation}
Var(\mathbf{x}) = E[(\mathbf{x}-\mu)^2] = E[\mathbf{x}^2]-(E[\mathbf{x}])^2
\end{equation}

\begin{equation}
Var(f(\mathbf{x})) = E[(f(\mathbf{x})-E[f(\mathbf{x})])^2]
\end{equation}

covariance

\begin{equation}
Cov(f(x),g(y)) = E[(f(x)-E[f(x)])(g(y)-E[g(y)])]
\end{equation}

\begin{equation}
Cov(X,Y) = E[(X-\mu_X)(Y-\mu_Y)]
\end{equation}

\begin{equation}
Cov(X,Y) = E[XY]-E[X]E[Y]
\end{equation}

\begin{equation}
Cov(X,X) = Var(X)
\end{equation}
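
A small sketch (assuming NumPy) that checks the identities $Var(\mathbf{x}) = E[\mathbf{x}^2]-(E[\mathbf{x}])^2$ and $Cov(X,X) = Var(X)$ on sampled data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)
y = 0.5 * x + rng.normal(size=1_000_000)  # y is correlated with x

# Var(x) = E[x^2] - (E[x])^2
print(np.mean(x**2) - np.mean(x)**2, np.var(x))  # both ~9

# Cov(x, y) = E[xy] - E[x]E[y]  (here ~0.5 * Var(x) = 4.5)
print(np.mean(x * y) - np.mean(x) * np.mean(y))

# Cov(x, x) = Var(x)
print(np.cov(x, x, bias=True)[0, 1], np.var(x))  # both ~9
```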

Common Probability Distributions

Bernoulli distribution

Multinoulli Distribution

Gaussian distribution / normal distribution

Exponential and Laplace Distributions

The Dirac Distribution and Empirical Distribution

When the mass of a probability distribution is concentrated around a single point, we can define the PDF using the Dirac delta function $\delta (x)$ (a generalized function):
\begin{equation}
p(x) = \delta (x-\mu)
\end{equation}
The Dirac delta function is defined such that it is zero-valued everywhere except 0, yet integrates to 1.

The Dirac delta function can be viewed as the limit (in the sense of distributions) of a sequence of zero-centered normal distributions whose standard deviation shrinks toward zero.
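
A brief numeric illustration of this limit (assuming NumPy): as the standard deviation shrinks, zero-centered Gaussian densities keep unit integral while concentrating essentially all of their mass near 0:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 200001)
for sigma in [0.5, 0.1, 0.02]:
    p = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    near_zero = np.abs(x) < 0.05
    # Total integral stays ~1; the mass inside |x| < 0.05 approaches 1 as sigma -> 0.
    print(sigma, np.trapz(p, x), np.trapz(p[near_zero], x[near_zero]))
```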

Mixtures of Distributions

A latent variable c is a random variable that we cannot observe directly; x is the variable we can observe.
\begin{equation}
P(x,c) = P(x|c)P(c)
\end{equation}

Gaussian mixture model

  • prior probability: the model's beliefs about c before observing x
  • posterior probability: $P(c|x)$, the model's beliefs about c after observing x

A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific, non-zero amount of error by a Gaussian mixture model with enough components.
(In other words, a Gaussian mixture model (GMM) can in principle approximate a smooth probability density of essentially any shape.)
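
A minimal ancestral-sampling sketch for a hypothetical one-dimensional Gaussian mixture (assuming NumPy): first draw the latent component c from its prior P(c), then draw x from the chosen component's Gaussian, exactly the factorization P(x,c) = P(x|c)P(c).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-component 1-D Gaussian mixture.
priors = np.array([0.5, 0.3, 0.2])   # P(c)
means  = np.array([-2.0, 0.0, 3.0])  # component means
stds   = np.array([0.5, 1.0, 0.8])   # component standard deviations

n = 10_000
c = rng.choice(len(priors), size=n, p=priors)  # sample the latent component c ~ P(c)
x = rng.normal(loc=means[c], scale=stds[c])    # sample x ~ P(x | c)

print(np.bincount(c) / n)  # empirical component frequencies, close to the prior P(c)
```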

Structured Probabilistic Models (graphical models)

A graphical model describes the factorization of a probability distribution with a graph in which each node corresponds to a random variable, and an edge connecting two random variables means that the probability distribution is able to represent direct interactions between those two random variables.

  • Directed models: use directed edges to factorize the distribution into conditional probability distributions. Letting $Pa(x_i)$ denote the set of parent nodes of the random variable (node) $x_i$, the distribution over $\mathbf{x}$ factorizes as (see the sketch after the equations below)
    \begin{equation}
    p(\mathbf{x}) = \prod_i p(x_i|Pa(x_i))
    \end{equation}
  • Undirected models: use undirected edges to factorize the distribution into a set of functions. A clique $C^i$ is any set of nodes that are all connected to each other; $\phi^i$ is a function associated with $C^i$, and $Z$ is the normalizing constant, the sum or integral over all states of the product of the $\phi$ functions:

\begin{equation}
p(\mathbf{x}) = \frac{1}{Z}\prod_i \phi^i (C^i)
\end{equation}
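
A small numeric sketch (assuming NumPy) of both factorizations on binary variables: a directed chain a -> b -> c built from conditional tables, and an undirected pair of variables scored by a single clique function $\phi$ and normalized by $Z$.

```python
import numpy as np

# Directed chain a -> b -> c: p(a, b, c) = p(a) p(b | a) p(c | b).
p_a = np.array([0.6, 0.4])            # p(a)
p_b_given_a = np.array([[0.7, 0.3],   # rows index a, columns index b
                        [0.2, 0.8]])
p_c_given_b = np.array([[0.9, 0.1],   # rows index b, columns index c
                        [0.5, 0.5]])
p_abc = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_b[None, :, :]
print(p_abc.sum())                    # 1.0: a valid joint with no extra normalization needed

# Undirected pair {x, y} with one clique function phi: p(x, y) = phi(x, y) / Z.
phi = np.array([[5.0, 1.0],           # unnormalized compatibility scores
                [1.0, 5.0]])
Z = phi.sum()                         # normalizing constant: sum over all states
p_xy = phi / Z
print(p_xy.sum())                     # 1.0 after dividing by Z
```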