Numerical algorithms update estimates of the solution via an iterative process (gradually approaching the solution through iteration), rather than analytically deriving a formula that provides a symbolic expression for the correct solution.
Overflow and Underflow
Rounding error: loss of precision that occurs when numerical values are stored in a computer, including
- underflow: numbers near zero are rounded down to zero.
- overflow: numbers of large magnitude are approximated as positive or negative infinity.
Functions must therefore be designed to be stabilized against underflow and overflow.
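As a concrete illustration (a minimal sketch of my own, not from the original notes), the softmax function is the standard example of such stabilization: subtracting $\max(x)$ before exponentiating leaves the result mathematically unchanged, but keeps `exp` from overflowing and guarantees the denominator never underflows to zero.

```python
import numpy as np

def naive_softmax(x):
    # Direct formula: exp can overflow for large x or underflow for very negative x.
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    # Subtracting max(x) does not change the result analytically, but keeps the
    # largest exponent at 0 (no overflow) and makes the denominator at least 1.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(naive_softmax(x))   # [nan nan nan]: exp overflows to inf, inf/inf gives nan
print(stable_softmax(x))  # well-behaved probabilities
```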
Poor Conditioning
Poor conditioning refers to the situation where a small change in the input causes a drastic change in the computed result.
Consider $Ax=b$: if the solution is overly sensitive to the coefficient matrix $A$ or to $b$ (in practice, $A$ and $b$ are usually estimated from experimental data and therefore contain errors), the system is ill-conditioned; otherwise it is well-conditioned.
Condition number: a measure of how trustworthy the output is, i.e., how much the output changes when the input changes slightly; in other words, the system's sensitivity to small perturbations.
- Small condition number (close to 1): well-conditioned.
- Large condition number (much greater than 1): ill-conditioned; the output is not reliable.
(The L2 norm, i.e., L2 regularization, helps deal with the difficulty of inverting matrices that have a poor condition number.)
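To make this sensitivity concrete, here is an illustrative sketch of my own (not from the original notes) using a Hilbert matrix, a classic ill-conditioned example: a tiny relative perturbation of $b$ produces a change in the solution $x$ many orders of magnitude larger.

```python
import numpy as np

# An 8x8 Hilbert matrix, a classic ill-conditioned coefficient matrix.
A = np.array([[1.0 / (i + j + 1) for j in range(8)] for i in range(8)])
x_true = np.ones(8)
b = A @ x_true

# Perturb b by a tiny relative amount, mimicking measurement error.
rng = np.random.default_rng(0)
b_noisy = b * (1 + 1e-10 * rng.standard_normal(8))

x_noisy = np.linalg.solve(A, b_noisy)
print(np.linalg.cond(A))                 # condition number on the order of 1e10
print(np.linalg.norm(x_noisy - x_true))  # error far larger than the 1e-10 perturbation
```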
If the square matrix $A$ is nonsingular, its condition number is defined as:
\begin{equation}
\kappa (A) = ||A|| ||A^{-1}||
\end{equation}
If the square matrix $A$ is singular, its condition number is positive infinity.
Consider the function $f(x) = A^{-1}x$. When the square matrix $A$ has an eigenvalue decomposition, its condition number is:
\begin{equation}
\kappa (A) = \max_{i,j} |\frac{\lambda_i}{\lambda_j}|
\end{equation}
That is, the condition number is the ratio of the magnitudes of the largest and smallest eigenvalues.
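A quick numerical check of the two definitions above (an illustrative sketch using NumPy, with a symmetric matrix of my own choosing so that singular values equal eigenvalue magnitudes and both formulas agree):

```python
import numpy as np

# A symmetric (hence normal) matrix: singular values equal |eigenvalues|.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Definition 1: kappa(A) = ||A|| * ||A^{-1}||, using the spectral (2-) norm.
kappa_norm = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)

# Definition 2: ratio of the largest to the smallest eigenvalue magnitude.
eigvals = np.linalg.eigvals(A)
kappa_eig = np.max(np.abs(eigvals)) / np.min(np.abs(eigvals))

print(kappa_norm, kappa_eig, np.linalg.cond(A, 2))  # all three agree
```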
Gradient-Based Optimization
Jacobian and Hessian Matrices
(The reference article "Jacobian矩阵和Hessian矩阵" explains both matrices well.)
Jacobian matrix
Suppose $f: \mathbb{R}^n \to \mathbb{R}^m$ is a function mapping n-dimensional Euclidean space to m-dimensional Euclidean space. This function $f(x) \in \mathbb{R}^m$ consists of m real-valued component functions: $f_1(x_1,\dots,x_n), \dots, f_m(x_1,\dots,x_n)$. The partial derivatives of these functions (if they exist) can be arranged into an m-row, n-column matrix, the so-called Jacobian matrix:
\begin{equation}
J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}
\end{equation}
(The $i$-th row of this matrix is the transpose of the gradient of $f_i$, for $i = 1, \dots, m$.)
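A small sketch of my own (using central finite differences, not anything from the original text) that builds this m-by-n Jacobian numerically for a vector-valued function:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f: R^n -> R^m at x by central differences."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.asarray(f(x + dx)) - np.asarray(f(x - dx))) / (2 * eps)
    return J

# Example: f maps R^2 to R^3, so the Jacobian is 3 x 2.
f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
print(numerical_jacobian(f, np.array([1.0, 2.0])))
# Each row is (approximately) the transposed gradient of one component f_i.
```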
Hessian matrix
the Hessian is the Jacobian of the gradient
(My understanding: take one row of the Jacobian, i.e., the gradient of a single scalar component, and apply the Jacobian to it again.)
Let $f(x_1, x_2, \dots, x_n)$ be a function. If all second-order partial derivatives of $f$ exist, the Hessian matrix of $f$ has entries:
\begin{equation}
H(f)_{i,j} = \frac{\partial^2 f}{\partial x_i \partial x_j}
\end{equation}
Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.
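To connect the two statements above, here is a self-contained sketch (my own toy function, with its gradient written out analytically) that forms the Hessian as the Jacobian of the gradient and checks that it is symmetric with real eigenvalues:

```python
import numpy as np

# f(x, y) = x^2 * y + y^3, with gradient (2xy, x^2 + 3y^2) written analytically.
grad = lambda p: np.array([2 * p[0] * p[1], p[0] ** 2 + 3 * p[1] ** 2])

def hessian_from_gradient(grad, x, eps=1e-6):
    """The Hessian is the Jacobian of the gradient: differentiate grad numerically."""
    n = x.size
    H = np.zeros((n, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        H[:, j] = (grad(x + dx) - grad(x - dx)) / (2 * eps)
    return H

H = hessian_from_gradient(grad, np.array([1.0, 2.0]))
print(np.allclose(H, H.T, atol=1e-4))   # real and (numerically) symmetric
eigvals, eigvecs = np.linalg.eigh(H)    # real eigenvalues, orthogonal eigenvectors
print(eigvals)
```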
Determining local maxima and minima
The second derivative can be used to determine whether a critical point of a function is a local maximum or a local minimum. (Points where the first derivative $f'(x) = 0$ are called critical points or stationary points.)
- $f'(x) = 0, f''(x) > 0$: $x$ is a local minimum.
- $f'(x) = 0, f''(x) < 0$: $x$ is a local maximum.
- $f'(x) = 0, f''(x) = 0$: $x$ may be a saddle point, or part of a flat region.
In the multidimensional case, use the eigendecomposition of the Hessian matrix:
- If the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum.
- If the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum.
- If at least one eigenvalue is positive and at least one eigenvalue is negative, then $x$ is a local maximum on one cross section of $f$ but a local minimum on another cross section: a saddle point.
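A sketch of my own (with hand-picked example Hessians, not taken from the notes) that applies this eigenvalue test to classify a critical point:

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    eigvals = np.linalg.eigvalsh(H)  # symmetric matrix, so eigenvalues are real
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive (some eigenvalues are approximately 0)"

# f(x, y) = x^2 + y^2 has Hessian 2*I at its critical point (0, 0): a minimum.
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))   # local minimum
# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at (0, 0): a saddle point.
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))  # saddle point
```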
Optimization methods
When the Hessian has a poor condition number, gradient descent performs poorly. This can be addressed by using the Hessian matrix to guide the search; the simplest such approach is Newton's method, based on a second-order Taylor expansion (see the sketch after the list below).
- First-order optimization algorithms: use only the gradient, e.g., gradient descent.
- Second-order optimization algorithms: also use the Hessian matrix, e.g., Newton's method.
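As a sketch of the contrast (my own toy quadratic, not from the original notes): on $f(x) = \frac{1}{2}x^\top A x$ with a poorly conditioned $A$, gradient descent makes slow progress along the low-curvature direction, while a single Newton step $x \leftarrow x - H^{-1}\nabla f(x)$ reaches the minimum exactly, because the objective is quadratic.

```python
import numpy as np

# Quadratic objective f(x) = 0.5 * x^T A x with a poorly conditioned Hessian A.
A = np.diag([1.0, 100.0])            # condition number = 100
grad = lambda x: A @ x               # gradient of f
hess = lambda x: A                   # Hessian of f (constant for a quadratic)

x0 = np.array([1.0, 1.0])

# First-order method: gradient descent with a small fixed step size.
x = x0.copy()
for _ in range(100):
    x = x - 0.01 * grad(x)
print("gradient descent after 100 steps:", x)   # still far from 0 in the flat direction

# Second-order method: Newton's method; one step solves the quadratic exactly.
x = x0 - np.linalg.solve(hess(x0), grad(x0))
print("Newton's method after 1 step:  ", x)     # exactly the minimizer [0, 0]
```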
Weak restriction: require the functions to be Lipschitz continuous or to have Lipschitz continuous derivatives, i.e., $|f(x) - f(y)| \le \mathcal{L}\,\|x - y\|_2$ for some Lipschitz constant $\mathcal{L}$. This quantifies our assumption that a small change in the input made by an algorithm such as gradient descent will produce only a small change in the output.
Strong restriction: convex optimization. It applies only to convex functions, i.e., functions for which the Hessian is positive semidefinite everywhere. Convex optimization is not widely used in deep learning.
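A tiny numerical illustration of the Lipschitz condition (my own example: $f(x) = \sin(x)$ is Lipschitz continuous with constant $\mathcal{L} = 1$ because $|\cos| \le 1$), checking the inequality on many random input pairs:

```python
import numpy as np

# sin is Lipschitz continuous with constant L = 1, since |d/dx sin(x)| = |cos(x)| <= 1.
f = np.sin
L = 1.0

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=100_000)
y = rng.uniform(-10, 10, size=100_000)

# Empirically verify |f(x) - f(y)| <= L * |x - y| for all sampled pairs.
print(np.all(np.abs(f(x) - f(y)) <= L * np.abs(x - y) + 1e-12))  # True
```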