Numerical Computation in Deep Learning

Machine learning algorithms usually rely on numerical computation: estimates of the solution are updated via an iterative process that gradually approaches it, rather than by analytically deriving a formula that gives a symbolic expression for the correct solution.

Overflow and Underflow

Rounding error: the loss of precision that occurs when real numbers are stored with a finite number of bits in a computer. Two harmful forms:

  • underflow: numbers near zero are rounded to zero.
  • overflow: numbers of large magnitude are approximated as positive or negative infinity.

Functions must therefore be designed to be stabilized against underflow and overflow.
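
A classic illustration is the softmax function; the following minimal NumPy sketch (not part of the original notes) shows how naively exponentiating large inputs overflows, while shifting the inputs by their maximum keeps the computation stable:

```python
import numpy as np

def softmax_naive(x):
    # Overflows for large x (exp(1000) -> inf), giving inf/inf = nan.
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Subtracting max(x) leaves the result mathematically unchanged,
    # but keeps every exponent <= 0, so exp() can no longer overflow.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(x))   # [nan nan nan] -- overflow
print(softmax_stable(x))  # [0.09003057 0.24472847 0.66524096]
```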

Poor Conditioning

Poor conditioning: small changes in the input cause the result of the computation to change dramatically.

Consider the linear system $Ax = b$. If the solution is overly sensitive to the coefficient matrix $A$ or to $b$ (both are usually estimated from experimental data and therefore contain error), the system is ill-conditioned; otherwise it is well-conditioned.

Condition number: a measure of how trustworthy the result of an ill-conditioned system is. It quantifies how much the output changes when the input changes slightly, in other words the system's sensitivity to small perturbations.

  • small condition number (near 1): well-conditioned.
  • large condition number (much larger than 1): ill-conditioned; the output is not very reliable.

(The L2 norm is helpful when inverting a matrix with a poor condition number is difficult.)

If the square matrix $A$ is nonsingular, its condition number is defined as:
\begin{equation}
\kappa(A) = \|A\| \, \|A^{-1}\|
\end{equation}

If $A$ is singular, its condition number is infinite.

Consider the function $f(x) = A^{-1}x$. When the square matrix $A$ has an eigenvalue decomposition, its condition number is:
\begin{equation}
\kappa(A) = \max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|
\end{equation}

The condition number is the ratio of the magnitudes of the largest and smallest eigenvalues.
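
As a quick check, here is a minimal NumPy sketch (assuming a symmetric matrix, so its eigenvalues are real) showing that the eigenvalue ratio above agrees with the condition number reported by `np.linalg.cond`:

```python
import numpy as np

# A symmetric matrix with very different eigenvalue magnitudes is ill-conditioned.
A = np.array([[1.0, 0.0],
              [0.0, 1e-6]])

eigvals = np.linalg.eigvalsh(A)      # real eigenvalues of a symmetric matrix
kappa_eig = np.abs(eigvals).max() / np.abs(eigvals).min()
kappa_lib = np.linalg.cond(A)        # default: 2-norm condition number

print(kappa_eig)  # 1000000.0
print(kappa_lib)  # 1000000.0 (matches for symmetric A)
```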

Gradient-Based Optimization

Jacobian and Hessian Matrices

Jacobian matrix

Suppose $f: \mathbb{R}^n \to \mathbb{R}^m$ is a function from n-dimensional Euclidean space to m-dimensional Euclidean space. The function $f(x) \in \mathbb{R}^m$ is made up of m real-valued component functions $f_1(x_1,\dots,x_n), \dots, f_m(x_1,\dots,x_n)$. The partial derivatives of these functions (when they exist) can be arranged into an m-row, n-column matrix, the so-called Jacobian matrix.
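
Written out in full (the standard definition):

\begin{equation}
J =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
\end{equation}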

(The i-th row of this matrix is the transpose of the gradient of the component function $f_i$, $i = 1, \dots, m$.)

Hessian matrix

The Hessian is the Jacobian of the gradient (my understanding: apply the Jacobian operation once more to a single row of the Jacobian matrix, i.e., to the gradient).

For a function $f(x_1, x_2, \dots, x_n)$ whose second-order partial derivatives all exist, the Hessian matrix of $f$ collects those second derivatives.
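
Written out in full (the standard definition):

\begin{equation}
H(f) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}
\end{equation}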

Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.

Determining local maxima and minima

The second derivative can be used to determine whether a critical point of a function is a local maximum or a local minimum. (Points where the first derivative $f'(x) = 0$ are called critical points or stationary points.)

  • $f'(x) = 0$ and $f''(x) > 0$: $x$ is a local minimum.
  • $f'(x) = 0$ and $f''(x) < 0$: $x$ is a local maximum.
  • $f'(x) = 0$ and $f''(x) = 0$: $x$ may be a saddle point or part of a flat region.

In the multidimensional case, use the eigendecomposition of the Hessian matrix (a code sketch follows the list):

  • If the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum.
  • If the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum.
  • If at least one eigenvalue is positive and at least one eigenvalue is negative, then x is a local maximum on one cross section of f but a local minimum on another cross section, i.e., a saddle point.
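
A minimal NumPy sketch of this test (the example function and its Hessian are made up for illustration): classify the critical point of $f(x, y) = x^2 - y^2$ at the origin from the signs of the Hessian eigenvalues.

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    # The Hessian of a twice-differentiable function is symmetric,
    # so its eigenvalues are real.
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "local minimum"      # positive definite
    if np.all(eigvals < -tol):
        return "local maximum"      # negative definite
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"       # indefinite
    return "inconclusive"           # some eigenvalues are (near) zero

# f(x, y) = x**2 - y**2 has a critical point at (0, 0);
# its (constant) Hessian is diag(2, -2).
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(classify_critical_point(H))  # saddle point
```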

Optimization methods

When the Hessian has a poor condition number, gradient descent performs poorly. In that case the Hessian matrix can be used to guide the search; the simplest such method is Newton's method, which is based on a second-order Taylor expansion (see the sketch after the list below).

  • first-order optimization algorithms: use only the gradient, e.g., gradient descent.
  • second-order optimization algorithms: also use the Hessian matrix, e.g., Newton's method.
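
A minimal NumPy sketch (the quadratic objective, step size, and iteration count are made up for illustration): on an ill-conditioned quadratic $f(x) = \frac{1}{2} x^\top A x$, gradient descent crawls along the poorly scaled direction, while a single Newton step $x \leftarrow x - H^{-1} \nabla f(x)$ uses the Hessian and lands on the minimum immediately.

```python
import numpy as np

# Ill-conditioned quadratic: f(x) = 0.5 * x^T A x, minimum at x = 0.
A = np.diag([1.0, 100.0])          # condition number = 100
grad = lambda x: A @ x             # gradient of f
hess = lambda x: A                 # Hessian of f (constant for a quadratic)

x_gd = np.array([1.0, 1.0])
lr = 1.0 / 100.0                   # step size limited by the largest eigenvalue
for _ in range(100):
    x_gd = x_gd - lr * grad(x_gd)  # gradient descent: slow along the flat direction

x_nt = np.array([1.0, 1.0])
x_nt = x_nt - np.linalg.solve(hess(x_nt), grad(x_nt))  # one Newton step

print(x_gd)   # first coordinate still ~0.37, far from the minimum
print(x_nt)   # [0. 0.] -- exact minimum for a quadratic
```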

Weak restriction: require the function to be Lipschitz continuous or to have Lipschitz continuous derivatives. This quantifies our assumption that a small change in the input made by an algorithm such as gradient descent will produce only a small change in the output.
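
The Lipschitz condition states that, for some Lipschitz constant $\mathcal{L}$:

\begin{equation}
\forall x, \forall y, \; |f(x) - f(y)| \leq \mathcal{L} \, \|x - y\|_2
\end{equation}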

Strong restriction: convex optimization. Convex optimization algorithms apply only to convex functions, i.e., functions for which the Hessian is positive semidefinite everywhere. They are not widely used in deep learning.