Numerical algorithms update estimates of the solution via an iterative process (gradually approaching the solution through iteration), rather than analytically deriving a formula that provides a symbolic expression for the correct solution.
Overflow and Underflow
Rounding error: loss of precision that occurs when numerical values are stored in a computer, including
- underflow: numbers near zero are rounded down to zero.
- overflow: numbers of large magnitude are approximated as positive or negative infinity.
Functions must therefore be designed to be stabilized against underflow and overflow.
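As a concrete illustration (a minimal sketch of my own, not from the original notes), the softmax function is the standard example of such stabilization: subtracting $\max(x)$ before exponentiating leaves the result mathematically unchanged, but keeps `exp` from overflowing and guarantees the denominator never underflows to zero.

```python
import numpy as np

def naive_softmax(x):
    # Direct formula: exp can overflow for large x or underflow for very negative x.
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    # Subtracting max(x) does not change the result analytically, but keeps the
    # largest exponent at 0 (no overflow) and makes the denominator at least 1.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(naive_softmax(x))   # [nan nan nan]: exp overflows to inf, inf/inf gives nan
print(stable_softmax(x))  # well-behaved probabilities
```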
Poor Conditioning
Poor conditioning refers to the situation where a small change in the input causes a drastic change in the computed result.
Consider $Ax=b$: if the solution is overly sensitive to the coefficient matrix $A$ or to $b$ (in practice, $A$ and $b$ are usually estimated from experimental data and therefore contain errors), the system is ill-conditioned; otherwise it is well-conditioned.
Condition number: a measure of how trustworthy the output is, i.e., how much the output changes when the input changes slightly; in other words, the system's sensitivity to small perturbations.
- Small condition number (close to 1): well-conditioned.
- Large condition number (much greater than 1): ill-conditioned; the output is not reliable.
(The L2 norm, i.e., L2 regularization, helps deal with the difficulty of inverting matrices that have a poor condition number.)
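To make this sensitivity concrete, here is an illustrative sketch of my own (not from the original notes) using a Hilbert matrix, a classic ill-conditioned example: a tiny relative perturbation of $b$ produces a change in the solution $x$ many orders of magnitude larger.

```python
import numpy as np

# An 8x8 Hilbert matrix, a classic ill-conditioned coefficient matrix.
A = np.array([[1.0 / (i + j + 1) for j in range(8)] for i in range(8)])
x_true = np.ones(8)
b = A @ x_true

# Perturb b by a tiny relative amount, mimicking measurement error.
rng = np.random.default_rng(0)
b_noisy = b * (1 + 1e-10 * rng.standard_normal(8))

x_noisy = np.linalg.solve(A, b_noisy)
print(np.linalg.cond(A))                 # condition number on the order of 1e10
print(np.linalg.norm(x_noisy - x_true))  # error far larger than the 1e-10 perturbation
```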
If the square matrix $A$ is nonsingular, its condition number is defined as:
\begin{equation}
\kappa (A) = ||A|| ||A^{-1}||
\end{equation}
If the square matrix $A$ is singular, its condition number is positive infinity.
Consider the function $f(x) = A^{-1}x$. When the square matrix $A$ has an eigenvalue decomposition, its condition number is:
\begin{equation}
\kappa (A) = \max_{i,j} |\frac{\lambda_i}{\lambda_j}|
\end{equation}
That is, the condition number is the ratio of the magnitudes of the largest and smallest eigenvalues.
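A quick numerical check of the two definitions above (an illustrative sketch using NumPy, with a symmetric matrix of my own choosing so that singular values equal eigenvalue magnitudes and both formulas agree):

```python
import numpy as np

# A symmetric (hence normal) matrix: singular values equal |eigenvalues|.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Definition 1: kappa(A) = ||A|| * ||A^{-1}||, using the spectral (2-) norm.
kappa_norm = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)

# Definition 2: ratio of the largest to the smallest eigenvalue magnitude.
eigvals = np.linalg.eigvals(A)
kappa_eig = np.max(np.abs(eigvals)) / np.min(np.abs(eigvals))

print(kappa_norm, kappa_eig, np.linalg.cond(A, 2))  # all three agree
```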
Gradient-Based Optimization
Jacobian and Hessian Matrices
(The reference article "Jacobian矩阵和Hessian矩阵" explains both matrices well.)
Jacobian matrix
Suppose $f: \mathbb{R}^n \to \mathbb{R}^m$ is a function mapping n-dimensional Euclidean space to m-dimensional Euclidean space. This function $f(x) \in \mathbb{R}^m$ consists of m real-valued component functions: $f_1(x_1,\dots,x_n), \dots, f_m(x_1,\dots,x_n)$. The partial derivatives of these functions (if they exist) can be arranged into an m-row, n-column matrix, the so-called Jacobian matrix:
\begin{equation}
J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}
\end{equation}
(The $i$-th row of this matrix is the transpose of the gradient of $f_i$, for $i = 1, \dots, m$.)
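A small sketch of my own (using central finite differences, not anything from the original text) that builds this m-by-n Jacobian numerically for a vector-valued function:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f: R^n -> R^m at x by central differences."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.asarray(f(x + dx)) - np.asarray(f(x - dx))) / (2 * eps)
    return J

# Example: f maps R^2 to R^3, so the Jacobian is 3 x 2.
f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
print(numerical_jacobian(f, np.array([1.0, 2.0])))
# Each row is (approximately) the transposed gradient of one component f_i.
```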
Hessian matrix
the Hessian is the Jacobian of the gradient
(My understanding: take one row of the Jacobian, i.e., the gradient of a single scalar component, and apply the Jacobian to it again.)
Let $f(x_1, x_2, \dots, x_n)$ be a function. If all second-order partial derivatives of $f$ exist, the Hessian matrix of $f$ has entries:
\begin{equation}
H(f)_{i,j} = \frac{\partial^2 f}{\partial x_i \partial x_j}
\end{equation}
Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors.
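To connect the two statements above, here is a self-contained sketch (my own toy function, with its gradient written out analytically) that forms the Hessian as the Jacobian of the gradient and checks that it is symmetric with real eigenvalues:

```python
import numpy as np

# f(x, y) = x^2 * y + y^3, with gradient (2xy, x^2 + 3y^2) written analytically.
grad = lambda p: np.array([2 * p[0] * p[1], p[0] ** 2 + 3 * p[1] ** 2])

def hessian_from_gradient(grad, x, eps=1e-6):
    """The Hessian is the Jacobian of the gradient: differentiate grad numerically."""
    n = x.size
    H = np.zeros((n, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        H[:, j] = (grad(x + dx) - grad(x - dx)) / (2 * eps)
    return H

H = hessian_from_gradient(grad, np.array([1.0, 2.0]))
print(np.allclose(H, H.T, atol=1e-4))   # real and (numerically) symmetric
eigvals, eigvecs = np.linalg.eigh(H)    # real eigenvalues, orthogonal eigenvectors
print(eigvals)
```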
Determining local maxima and minima
The second derivative can be used to determine whether a critical point of a function is a local maximum or a local minimum. (Points where the first derivative $f'(x) = 0$ are called critical points or stationary points.)
- $f'(x) = 0, f''(x) > 0$: $x$ is a local minimum.
- $f'(x) = 0, f''(x) < 0$: $x$ is a local maximum.
- $f'(x) = 0, f''(x) = 0$: $x$ may be a saddle point, or part of a flat region.
In the multidimensional case, use the eigendecomposition of the Hessian matrix:
- If the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum.
- If the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum.
- If at least one eigenvalue is positive and at least one eigenvalue is negative, then $x$ is a local maximum on one cross section of $f$ but a local minimum on another cross section: a saddle point.
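A sketch of my own (with hand-picked example Hessians, not taken from the notes) that applies this eigenvalue test to classify a critical point:

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    eigvals = np.linalg.eigvalsh(H)  # symmetric matrix, so eigenvalues are real
    if np.all(eigvals > tol):
        return "local minimum"
    if np.all(eigvals < -tol):
        return "local maximum"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "inconclusive (some eigenvalues are approximately 0)"

# f(x, y) = x^2 + y^2 has Hessian 2*I at its critical point (0, 0): a minimum.
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]])))   # local minimum
# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at (0, 0): a saddle point.
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]])))  # saddle point
```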
Optimization methods
When the Hessian has a poor condition number, gradient descent performs poorly. This can be addressed by using the Hessian matrix to guide the search; the simplest such approach is Newton's method, based on a second-order Taylor expansion (see the sketch after the list below).
- First-order optimization algorithms: use only the gradient, e.g., gradient descent.
- Second-order optimization algorithms: also use the Hessian matrix, e.g., Newton's method.
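As a sketch of the contrast (my own toy quadratic, not from the original notes): on $f(x) = \frac{1}{2}x^\top A x$ with a poorly conditioned $A$, gradient descent makes slow progress along the low-curvature direction, while a single Newton step $x \leftarrow x - H^{-1}\nabla f(x)$ reaches the minimum exactly, because the objective is quadratic.

```python
import numpy as np

# Quadratic objective f(x) = 0.5 * x^T A x with a poorly conditioned Hessian A.
A = np.diag([1.0, 100.0])            # condition number = 100
grad = lambda x: A @ x               # gradient of f
hess = lambda x: A                   # Hessian of f (constant for a quadratic)

x0 = np.array([1.0, 1.0])

# First-order method: gradient descent with a small fixed step size.
x = x0.copy()
for _ in range(100):
    x = x - 0.01 * grad(x)
print("gradient descent after 100 steps:", x)   # still far from 0 in the flat direction

# Second-order method: Newton's method; one step solves the quadratic exactly.
x = x0 - np.linalg.solve(hess(x0), grad(x0))
print("Newton's method after 1 step:  ", x)     # exactly the minimizer [0, 0]
```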
Weak restriction: require the functions to be Lipschitz continuous or to have Lipschitz continuous derivatives, i.e., $|f(x) - f(y)| \le \mathcal{L}\,\|x - y\|_2$ for some Lipschitz constant $\mathcal{L}$. This quantifies our assumption that a small change in the input made by an algorithm such as gradient descent will produce only a small change in the output.
Strong restriction: convex optimization. It applies only to convex functions, i.e., functions for which the Hessian is positive semidefinite everywhere. Convex optimization is not widely used in deep learning.
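A tiny numerical illustration of the Lipschitz condition (my own example: $f(x) = \sin(x)$ is Lipschitz continuous with constant $\mathcal{L} = 1$ because $|\cos| \le 1$), checking the inequality on many random input pairs:

```python
import numpy as np

# sin is Lipschitz continuous with constant L = 1, since |d/dx sin(x)| = |cos(x)| <= 1.
f = np.sin
L = 1.0

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=100_000)
y = rng.uniform(-10, 10, size=100_000)

# Empirically verify |f(x) - f(y)| <= L * |x - y| for all sampled pairs.
print(np.all(np.abs(f(x) - f(y)) <= L * np.abs(x - y) + 1e-12))  # True
```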