Information Theory in deep learning

Posted on 2017-12-05 | In notes.math

信息理论的基本概念：learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.

对信息的量化：

Likely events should have low information content
Less likely events should have higher information content
Independent events should have additive information.

事件$\mathbf{x} = x$的 self-information：
\begin{equation}
I(x) = - \log P(x)
\end{equation}

base e: nats; base-2: bits or shannons. (information measured in bits is just a rescaling of information measured in nats.)

Shannon entropy:

the amount of uncertainty in an entire probability distribution:
\begin{equation}
H(\mathbf{x}) = E[I(x)] = -E[\log P(x)] = - \sum_x P(x) \log P(x)
\end{equation}

Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; (a binary random variable p取0或1时，熵最小)
distributions that are closer to uniform have high entropy. (a binary random variable p取0.5时，熵最大)

当随机变量$\mathbf{x}$连续时，Shannon entropy 也称为 differential entropy

Kullback-Leibler (KL) divergence

也称为 relative entropy
measures the difference between two distributions

suppose two separate probability distributions $P (x)$ and $Q(x) $ over the same random variable x

The Kullback–Leibler divergence from Q to P,(relative entropy of P with respect to Q.)
\begin{equation}
D_{KL}(P||Q) = E[\log \frac{P(x)}{Q(x)}] = E[\log P(x) - \log Q(x)] = \sum_x P(x) \log \frac{P(x)}{Q(x)}
\end{equation}

useful properties of KL divergence

non-negative.
可以用来衡量两个分布之间的距离，但不是一个真正的距离函数，因为不满足对称性。存在$D_{KL}(P||Q) \ne D_{KL}(Q||P)$

cross-entropy

\begin{equation}
H(P,Q)= H(P)+D_{KL}(P||Q)
\end{equation}

其中：
\begin{equation}
H(P) = -E[\log P(x)] = - \sum_x P(x) \log P(x)
\end{equation}

\begin{equation}
H(P,Q) = -E[\log Q(x)] = - \sum_x P(x) \log Q(x)
\end{equation}

logistic sigmoid & softplus

Posted on 2017-12-04 | In notes.math

logistic sigmoid:

\begin{equation}
\sigma(x) = \frac{1}{1+\exp(-x)}
\end{equation}

logistic sigmoid经常用于产生Bernoulli distribution的参数p，因为值域(0,1)

softplus function

\begin{equation}
\varsigma(x) = \log {(1+\exp(x))}
\end{equation}

softplus function常用语产生正态分布的$\beta$ 或$\sigma$参数，因为值域$(0,\infty)$

Useful Properties of logistic sigmoid and softplus function

\begin{equation}
\sigma(x) = \frac{\exp(x)}{\exp(x)+\exp(0)}
\end{equation}

\begin{equation}
\frac{d}{dx}\sigma(x) = \sigma(x)(1-\sigma(x))
\end{equation}

\begin{equation}
1-\sigma(x) = \sigma(-x)
\end{equation}

\begin{equation}
\log \sigma(x) =- \varsigma(-x)
\end{equation}

\begin{equation}
\frac{d}{dx}\varsigma(x) = \sigma(x)
\end{equation}

\begin{equation}
\forall x \in (0,1),\sigma^{-1}(x) = \log {(\frac{x}{1-x})}
\end{equation}

$\sigma^{-1}(x)$ is called the logit in statistics

\begin{equation}
\forall x > 0,\varsigma^{-1}(x) = \log {(\exp(x)-1)}
\end{equation}

\begin{equation}
\varsigma(x) = \int_{-\infty}^x \sigma(y)dy
\end{equation}

\begin{equation}
\varsigma(x) - \varsigma(-x)= x
\end{equation}

Probability in deep learning

Posted on 2017-12-04 | In notes.math

three possible sources of uncertainty

Inherent stochasticity(随机性) in the system
Incomplete observability.
Incomplete modeling. (比如把连续的运行轨迹离散化后回丢失准确位置)

Random Variables

A random variable is a variable that can take on different values randomly.

Random variables may be discrete or continuous.

A discrete random variable is one that has a finite or countably infinite number of states.
A continuous random variable is associated with a real value.

Probability Distributions

A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states.

Discrete Variables and Probability Mass Function (PMF)

单个随机变量：$\mathbf{x} \sim P(\mathbf{x})$ 随机变量$\mathbf{x}$服从概率分布$P(\mathbf{x})$, $P(\mathbf{x}=x)$
多个随机变量：joint probability distribution: $P(\mathbf{x} = x,\mathbf{y} = y)$ 简写为$P(x,y)$

PMF 函数$P$需满足以下性质：

The domain of P must be the set of all possible states of $\mathbf{x}$.
$\forall x \in \mathbf{x},0 \le P(x) \le 1$
$\sum_{x \in \mathbf{x}} P(x) = 1$

Continuous Variables and Probability Density Functions (PDF)

PDF 函数$p$需满足以下性质：

The domain of p must be the set of all possible states of $\mathbf{x}$.
$\forall x \in \mathbf{x},p(x) \ge 0$ 不要求 $p(x) \le 1$
$\int p(x)dx = 1$

Marginal Probability

The probability distribution over the subset is known as the marginal probability distribution.

如果已知联合概率分布$P(\mathbf{x},\mathbf{y})$
则$P(\mathbf{x})$为：

\begin{equation}
\forall x \in \mathbf{x},P(x) = \sum_y P(\mathbf{x} = x,\mathbf{y} = y)
\end{equation}

对于连续变量：
\begin{equation}
p(x) = \int p(x,y)dy
\end{equation}

Conditional Probability

\begin{equation}
P(\mathbf{y} = y | \mathbf{x} = x) = \frac{P(\mathbf{y} = y,\mathbf{x} = x)}{P(\mathbf{x} = x)}
\end{equation}

只有$P(\mathbf{x} = x) > 0$时才能计算条件概率。（We cannot compute the conditional probability conditioned on an event that never happens.）

The Chain Rule of Conditional Probabilities

略

Independence and Conditional Independence

Two random variables x and y are independent，记为$\mathbf{x} \perp \mathbf{y}$, if

\begin{equation}
\forall x \in \mathbf{x}, y \in \mathbf{y}, p(x,y) = p(x)p(y)
\end{equation}

Two random variables x and y are conditionally independent given a random variable z, 记住为$\mathbf{x} \perp \mathbf{y}|z$
\begin{equation}
\forall x \in \mathbf{x}, y \in \mathbf{y}, z \in \mathbf{z} ,p(x,y|z) = p(x|z)p(y|z)
\end{equation}

Expectation, Variance and Covariance

expectation or expected value

设离散随机变量$\mathbf{x}$, 其概率分布为$P(\mathbf{x})$,期望值为：

\begin{equation}
E[\mathbf{x}] = \sum_x xP(x)
\end{equation}

\begin{equation}
E[f(\mathbf{x})] = \sum_x f(x)P(x)
\end{equation}

设连续随机变量$\mathbf{x}$, 期望值为：

\begin{equation}
E[\mathbf{x}] = \int xp(x)dx
\end{equation}

\begin{equation}
E[f(\mathbf{x})] = \int f(x)p(x)dx
\end{equation}

variance

\begin{equation}
Var(\mathbf{x}) = E[(\mathbf{x}-\mu)^2]=E(\mathbf{x}^2)-[E(\mathbf{x})]^2
\end{equation}

\begin{equation}
Var(f(\mathbf{x})) = E[(f(\mathbf{x})-E[f(\mathbf{x})])^2]
\end{equation}

covariance 协方差

\begin{equation}
Cov(f(x),g(y)) = E[(f(x)-E[f(x)])(g(y)-E[g(y)])]
\end{equation}

\begin{equation}
Cov(X,Y) = E[(X-\mu_x)(Y-\mu_y)]
\end{equation}

\begin{equation}
Cov(X,Y) = E(XY)-E(X)E(Y)
\end{equation}

\begin{equation}
Cov(X,X) = Var(X)
\end{equation}

Common Probability Distributions

Bernoulli distribution

Multinoulli Distribution

Gaussian distribution / normal distribution

Exponential and Laplace Distributions

The Dirac Distribution and Empirical Distribution

概率分布只在某个单独点附近有，使用Dirac delta function $\delta (x)$ (是一种generalized function )来定义PDF
\begin{equation}
p(x) = \delta (x-\mu)
\end{equation}
The Dirac delta function is defined such that it is zero-valued everywhere except 0, yet integrates to 1

The Dirac delta function as the limit (in the sense of distributions) of the sequence of zero-centered normal distributions

Mixtures of Distributions

A latent variable c is a random variable that we cannot observe directly. x 是能观测到的变量。
\begin{equation}
P(x,c) = P(x|c)P(c)
\end{equation}

Gaussian mixture model

prior probability：在观测到x之前， the model’s beliefs about c
posterior probability：$P(c|x)$, 观测到x之后….

A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific, non-zero amount of error by a Gaussian mixture model with enough components.
（高斯模型混合模型（GMM）理论上可以拟合任意形状的概率分布）

Structured Probabilistic Models (graphical models)

factorization of a probability distribution with a graph in which each node in the graph corresponds to a random variable, and an edge connecting two random variables means that the probability distribution is able to represent direct interactions between those two random variables.

Directed models: 使用directed edges来分解成conditional probability distributions. 假设随机变量（节点）$x_i$的父亲节点集合为$Pa(x_i)$,则随机变量$\mathbf{x}$的概率分布可以分解为
\begin{equation}
p(\mathbf{x}) = \prod_i p(x_i|Pa(x_i))
\end{equation}
Undirected models: 使用undirected edges来分解成函数集合。clique $C^i$: any set of nodes that are all connected to each other. 其中$\phi$是与$C^i$有关的函数， Z是normalizing constant ，the sum or integral over all states of the product of the $\phi$ functions

\begin{equation}
p(\mathbf{x}) = \frac{1}{Z}\prod_i \phi^i (C^i)
\end{equation}

norms

Posted on 2017-12-01 | In notes.math

范数包括向量范数和矩阵范数，

向量范数表征向量空间中向量的大小。
矩阵范数表征矩阵引起变化的大小。

可以把范数当作距离来理解，向量x的范数即从原点到点x的距离。

任意满足以下性质的函数都是norm。(将向量映射到非负值)

$f(x)=0 \Rightarrow x = 0$
$f(x+y) \le f(x) + f(y)$ (the triangle inequality)
$\forall \alpha \in \mathbb{R},f(\alpha x) = |\alpha|f(x)$

$L^p$ norm

对应闵可夫斯基距离(Minkowski Distance)。定义了一组范数。
\begin{equation}
||x||= (\sum_i |x_i|^p)^\frac{1}{p}
\end{equation}

“aaaa”
上图表示了p从无穷到0变化时，三维空间中到原点的距离（范数）为1的点构成的图形的变化情况。

$p=0$ 即L0范数：度量向量中非零元素的个数。不是一个真正的范数。
$0 \le p < 1$时，$L^p$不满足三角不等式。（有些书籍中规定$p \ge 1$）
$p=1$ 即L1范数：$||x||$ 为x向量各个元素绝对值之和。对应曼哈顿距离。
$p=2$ 即L2范数：$||x||$为x向量各个元素平方和的1/2次方。对应欧氏距离。
$p \to \infty$ 即$L^\infty$范数：$||x||$为x向量各个元素绝对值最大那个元素的绝对值，对应切比雪夫距离。

L0范数

度量向量中非零元素的个数，在实际情况中，L0的最优问题会被放宽到L1或L2下的最优化。(L1范数是L0范数的最优凸近似?)

L1范数

向量x中非零元素的绝对值之和。也叫曼哈顿距离、最小绝对误差等。

\begin{equation}
||x||_1= \sum_i |x_i|
\end{equation}

使用L1范数可以度量两个向量间的差异，如绝对误差和（Sum of Absolute Difference）：

\begin{equation}
SAD(x_1,x_2)= \sum_i |x_{1i}-x_{2i}|
\end{equation}

L1范数也被叫做稀疏规则算子（Lasso regularization）。通过L1可以实现特征选择，过滤掉一些没有信息的特征,使模型更有可解释性（Interpretability）。

L2范数

\begin{equation}
||x||_2= \sqrt {\sum_i x_i^2}
\end{equation}

L2也可以度量两个向量间的差异，如平方差和（Sum of Squared Difference）:
\begin{equation}
SSD(x_1,x_2)= \sum_i (x_{1i}-x_{2i})^2
\end{equation}
L2范数通常会被用来做优化目标函数的正则化项，防止模型为了迎合训练集而过于复杂造成过拟合（训练误差很小而测试误差很大）的情况，从而提高模型的泛化能力。在回归里面叫“岭回归”（Ridge Regression）。

“bbb”

$L^\infty$范数(max norm)

用来度量向量元素的最大值

矩阵范数（Frobenius norm）

类似于向量的$L^2$ 范数。
\begin{equation}
||A||_F= \sqrt {\sum_{i,j} A_{ij}^2}
\end{equation}

向量点乘的范数形式

\begin{equation}
x^\top y = ||x||_2 ||y||_2 \cos \theta
\end{equation}

$\theta$是向量x与y之间的夹角。

matrix decomposition in deep learning

Posted on 2017-11-30 | In notes.math

矩阵分解 (decomposition, factorization)是将矩阵拆解为数个矩阵的乘积。(就像可以把整数分解为素数的乘积来研究整数的一些性质)

eigendecomposition

特征分解（Eigendecomposition），又称谱分解（Spectral decomposition）是将矩阵分解为由其特征值和特征向量表示的矩阵之积的方法。(需要注意只有对可对角化矩阵才可以施以特征分解）

N 维非零向量 v 是 N×N 矩阵 A 的特征向量，当且仅当下式成立：

\begin{equation}
Av = \lambda v
\end{equation}

其中$\lambda$ 是一个scalar，特征向量 v 对应的特征值。也即特征向量被施以线性变换 A 只会使向量伸长或缩短而其方向不被改变。（左特征向量也成立：$v^\top A = \lambda v^\top$）

如果v是A的一个特征向量，则v的伸长或缩短 (sv for $ s \in \mathbb{R},s \ne 0$ )仍然是A的特征向量。因此只关注单位特征向量。

假设矩阵A的n个线性独立特征向量（列向量）组成矩阵V，对应的特征值组成列向量$\lambda = [\lambda_1,…,\lambda_n]^\top$. 则A的特征分解为：

\begin{equation}
A = V \mathbf{ diag}(\lambda) V^{-1}
\end{equation}

不是所有的矩阵都能分解成特征值和特征向量，这里我们只关注实对称矩阵（能分解成only real-valued eigenvectors and eigenvalues）

\begin{equation}
A = Q \Lambda Q^{-1}
\end{equation}

其中Q是orthogonal matrix composed of eigenvectors of A, $\Lambda = diag(\lambda)$。由于Q是正交矩阵，可以将A看成是在Q的不同方向(列向量$v^i$)伸缩$\lambda_i$倍。

特征分解的用处：

The matrix is singular if and only if any of the eigenvalues are 0.
optimize quadratic expressions of the form $f(x)=x^\top Ax $ subject to $||x||_2 =1$. f的最大值是最大的特征值，f的最小值是最小的特征值。
判定矩阵是否正定。
- positive semidefinite：eigenvalues are all positive or zero-valued. 半正定矩阵能保证$\forall x,x^\top A x \ge 0$
- positive definite：eigenvalues are all positive. 正定矩阵能额外保证 $x^\top A x = 0 \Rightarrow x = 0$
- negative definite： all eigenvalues are negative
- negative semidefinite： all eigenvalues are negative or zero-valued

Singular Value Decomposition (SVD分解)

奇异值分解（Singular Value Decomposition）将矩阵分解为奇异向量和奇异值（singular vectors and singular values.）,对矩阵的扰动不敏感。

SVD分解比特征分解更通用，任意实数矩阵（即使不是方阵）都有奇异值分解，但特征分解不是。

\begin{equation}
A = U D V^\top
\end{equation}
如果A是$m \times n$矩阵，则U是$m \times m$,D 是 $m \times n$, V 是$n \times n$.
U和V是正交矩阵（orthogonal matrices），D是对角矩阵（diagonal matrix），D不一定是方阵。

D中对角线上的元素是矩阵A的奇异值（singular values），矩阵A的非零奇异值是$A^\top A$或$A^\top A$的特征值的平方根。
U中的列向量是左奇异向量（left-singular vectors），左奇异向量是$AA^\top$的特征向量。
V中的列向量是右奇异向量（right-singular vectors），右奇异向量是$A^\top A$的特征向量。

SVD主要用于求矩阵（non-square matrices）的伪逆和主成分分析（PCA）。

matlab code:
[U,D,V] = svd(X)

SVD分解用于求伪逆（The Moore-Penrose Pseudoinverse）

矩阵A的伪逆定义：
\begin{equation}
A^+ = \lim_{\alpha \to 0}(A^\top A+ \alpha I)^{-1}A^\top
\end{equation}

实际算法更趋于使用下式：
\begin{equation}
A^+ = VD^+U^\top
\end{equation}
其中U,D,V是矩阵A的SVD分解。D的伪逆$D^+$是通过把矩阵D主对角线上每个非零元素都求倒数之后再转置得到的。

求伪逆通常可以用来求解线性最小平方、最小二乘法问题。

当矩阵A的列数量多于行数量时，可能存在许多解。用伪逆可以算出其中一个。在所有的答案中，答案$x = A^+y$拥有最小的$L^2$ norm $||x||_2$.
当矩阵A的行数量多于列数量时，可能不存在解。通过伪逆算出来的解会使得$Ax$尽可能接近y，$||Ax-y||_2$最小。

Linear Algebra in Deep learning

Posted on 2017-11-29 | In notes.math

只涉及到与理解deep learning有关的线性代数部分。详细请参考《The Matrix Cookbook》。

Scalars, Vectors, Matrices and Tensors

scalars：单纯数字。 $a = a^\top$
Vectors：数字的数组。（只含有一列的矩阵，如无特别说明，所提到的向量都为列向量）
Matrices：矩阵。2维数组。$\mathbf{A}_{i,:}$ 矩阵A的第i行， $\mathbf{A}_{:,j}$ 矩阵A的第j列。
Tensors：张量。3维

transpose

转置。沿主对角线做镜像翻转。 $(\mathbf{A}^\top)_{ij}=\mathbf{A}_{ji}$

矩阵乘法

distributive: $A(B+C) = AB+AC$
associative: $A(BC) = (AB)C$
not commutative: $AB = BA$不总是满足。但向量的点乘满足：$x^\top y = y^\top x$
矩阵乘积的转置: $(AB)^\top = B^\top A^\top$
线性等式的矩阵形式: $Ax=b$ A矩阵，x：变量，列向量。b：列向量。

Identity and Inverse Matrices

identity matrix: $I_n$, 主对角线上n个元素都为1，其余元素全为0的矩阵。
matrix inverse: $A^{-1}A = I_n$

Linear Dependence and Span

space & subspace

space: 空间内的元素对加法和乘法封闭，即任意的加或者乘，所得的结果仍然属于该空间。
subspace：W是线性空间V的一个非空子集，如果 W对于V 中定义的加法和乘法也构成线性空间，那么就成W是V的线性子空间。

span & range

设有一个列向量集合$\\{v^1,v^2,…,v^n\\}$
linear combination： $\sum_i{c_iv^i}$
span: 向量的所有线性组合
column space/ range of A：矩阵A中列向量的span
null space: 矩阵A的零空间为使A中的列向量组合和为零

线性方程组有解的条件

$Ax=b$是否有解取决于b是否存在于矩阵A的列空间中。
对任意$b \in \mathbb{R}^m$要求有解，要求A的列空间为所有的$\mathbb{R}^m$。如果$\mathbb{R}^m$中存在一个点，在列空间之外，则该点对应的b不存在解。

有解的necessary condition：矩阵A的列数量 $n \ge m$。 m列中可能存在冗余（称为linear dependence）。
有解的necessary and sufficient condition：矩阵A中包含m列线性独立的列。the matrix must contain at least one set of m linearly independent columns.

矩阵有逆的条件

有唯一解，矩阵为方阵（square，n=m）且m列线性独立。
singular matrix：A square matrix with linearly dependent columns

如果矩阵A不是方阵，或者A是singular矩阵，则不能使用matrix inversion。

Special Kinds of Matrices and Vectors

Diagonal matrices

主对角线元素非零，其余元素为零的矩阵。
对角线上元素相等的对角矩阵称为数量矩阵；对角线上元素全为1的对角矩阵称为单位矩阵。

性质：

和差运算：同阶对角阵的和、差仍是对角阵
数乘运算：数与对角阵的乘积仍为对角阵
乘积运算：同阶对角矩阵的乘积仍为对角阵，且它们的乘积是可交换的，即AB=BA

diag(v): a square diagonal matrix whose diagonal entries are given by the entries of the vector v.

对角矩阵的乘法运算非常高效，$diag(v)x=v \odot x$

只有当主对角线上元素全为非零元素时，对角矩阵的逆存在。$diag(v)^{-1} = diag([1/v_1,…,1/v_n]^\top)$

symmetric matrix

\begin{equation}
A = A^\top
\end{equation}

unit vector 单位向量

范数为1的向量。$||x||_2 = 1$

orthogonal matrix 正交矩阵

当向量x和y满足 $x^\top y = 0$时，x和y垂直。如果向量不但垂直而且范数为1，则称之为orthonormal.

orthogonal matrix: a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal
如果A是正交矩阵

$A^\top 是正交矩阵$
$A^\top A = A A^\top = I$ (I为单位矩阵)
A的各行是单位向量且两两正交
A的各列是单位向量且两两正交
$A^{-1} = A^\top$

The Trace Operator （矩阵的迹）

矩阵A的迹是矩阵A的主对角线上各个元素的总和。 (迹是所有特征值的和)
\begin{equation}
Tr(A) = \sum_i A_{ii}
\end{equation}

矩阵范数(Frobenius norm)的迹形式：

\begin{equation}
||A||_F = \sqrt {Tr(AA^T)}
\end{equation}

迹的不变性（invariant）

$Tr(A)=Tr(A^\top)$ invariant to the transpose operator
$Tr(ABC) = Tr(CAB) = Tr(BCA)$
$Tr(A_{m \times n}B_{n \times m}) = Tr(B_{n \times m}A_{m \times n})$
$a = Tr(a)$ a scalar is its own trace
$Tr(mA+nB)=m Tr(A)+n Tr(B)$

The Determinant 矩阵的行列式

det(A)：方阵A的行列式

The determinant is equal to the product of all the eigenvalues of the matrix.

行列时的性质：
https://en.wikipedia.org/wiki/Determinant

matlab稀疏矩阵

Posted on 2017-11-23 | In notes.code

sparse函数

S=sparse(X) 返回矩阵x的稀疏矩阵形式
S = sparse(I,J,V,m,n,nzmax) 行索引I，列索引J，及对应的值向量V. m和n指定生成的矩阵的维数，nzmax指定为非0元素分配的空间.默认 nzmax = length(I); m = max(I); n = max(J)
S = sparse(m,n), 创建全0的稀疏矩阵

full函数

X = full(S) 将稀疏矩阵S转化为非稀疏矩阵。

稀疏矩阵的存储

使用find函数
[I, J, V] = find(A) 返回矩阵A中非零元素的行索引I，列索引J，及对应的值向量V
其中矩阵A可以是稀疏矩阵也可以是非稀疏矩阵。

文本标注工具

Posted on 2017-11-22 | In 调研

IEPY

github代码地址： https://github.com/machinalis/iepy
一个关注关系提取（Relation Extraction）的开源信息提取工具，前端对用户不太友好。

特点：

使用active learning技术。根据用户提供的信息（让用户标注更为重要的样本）来预测剩下的案例，默认分类器为C-Support Vector Classification，可选分类器Stochastic Gradient Descent、Nearest Neighbors、Random Forest、AdaBoost。
对半结构化数据和高准确率要求的案例采用基于规则的关系提取工具。需要用户自定义“regular expression like” rules，一个规则可以认为是一个python函数。
使用Stanford CoreNLP技术来实现共指消解（Coreference resolution is the task of finding all expressions that refer to the same entity in a text，即将文章中所有表述划分为现实世界中不同实体的等价描述）

DeepDive

github代码地址： https://github.com/HazyResearch/mindbender
迭代开发知识库的工具mindbender，通过弱监督学习从非结构化的文本中抽取结构化的关系数据，可以判断两个实体间是否存在指定关系。其核心的标注工具是Mindtagger，交互式用户界面十分友好。

当前版本只支持对precision/recall的估计。对于precision估计任务，首先使用SQL查询语句从数据集中找出正相关子集，对系统识别到的实体只让用户判断相关与否，以及增加ad-hoc 标签。
允许用户判断哪些特征能增强预测的表现，这一任务中ad-hoc标签很重要。
其标注结果可以导出成(SQL, CSV/TSV, and JSON) 等格式以作它用，比如作为DeepDive应用的ground truth.

brat

github代码地址： https://github.com/nlplab/brat

brat rapid annotation tool(brat)旨在提供一个直觉性质的、快速的方式创造受文本约束的实体和关系标签。

通过选择文本的方式标注（annotate by select text）。
通过拖拉的方式标注关系（connect by drag and drop）。
支持标注命名实体（Named Entity annotation），标注依赖关系（Dependency annotation）,分块（chunking，比如名词词组分块），共指标注（coreference annotation，找出同一实体的不同表述），事件标注（event annotation）等。
支持任意语言的标注。

YEDDA

github代码地址：https://github.com/jiesutd/SUTDAnnotator
一个标注文本、符号和表情中的分块/实体／事件的标注工具，基本支持所有语言。

支持将标注过的文本导出成序列文本。
支持两种标注方式。a，选择文本并在shortcup map中按下相应的标签（标签可自己配置）。b，在文本框中输入（其格式为第几个子串及对应的标签缩写）
推荐系统。两个按钮控制推荐系统的开启与关闭。推荐内容为子串及对应的推荐标签，以不同的颜色在原始文本中标出。并考虑了不能忽视推荐标签错误时的修正时间，设计了针对推荐系统的撤销，调整和删除操作。

Elasticsearch

Posted on 2017-11-22 | In notes

什么是Elasticsearch

Elasticsearch是一个分布式搜索服务，提供Restful API，底层基于Lucene，采用多shard的方式保证数据安全，并且提供自动resharding的功能。

数据存储

Elastcisearch 是分布式的文档存储。它能以实时的方式将数据序列化成为json文档来存储。

索引

全文本数据每个字段都设置专用倒排索引
数值与位置数据使用BKD树（A Dynamic Scalable kd-Tree，用来索引多维点数据）

Elasticsearch能自动判断field类型并建立合适的索引，也可以定制mapping的方式来设置不同field的索引规则。

Elasticsearch与传统关系型数据库的关系

Elasticsearch	关系型数据库
index	DB
type	table
Document	row
Field	column

相似度计算方法

默认情况下，返回结果按相关性倒序排列，Elasticsearch 的相似度算法被定义为检索词频率/反向文档频率:

检索词频率 检索词在该字段出现的频率越高，相关性也越高。
反向文档频率 每个检索词在索引中出现的频率越高，相关性越低。检索词出现在多数文档中会比出现在少数文档中的权重更低。
字段长度准则 长度越长，相关性越低。检索词出现在一个短的 title 要比同样的词出现在一个长的 content 字段权重更大。

单个查询可以联合使用 TF/IDF 和其他方式，比如短语查询中检索词的距离或模糊查询里的检索词相似度。

模糊查询

以编辑距离衡量模糊性，查询时使用 fuzziness参数设置最大编辑距离，默认为2。模糊匹配不参与评分，只在有拼写错误时扩大匹配项的范围。

Reference

Elasticsearch: 权威指南

Evaluation of clustering

Posted on 2017-11-17 | In notes.algorithm

聚类的目标

high intra-cluster similarity and low inter-cluster similarity

数据集有groundtruth时，如何评估聚类结果的优劣？

4种评估方法

purity。简单。
NMI。从信息论角度解释。
RI。实质是accuracy，同时惩罚了FP和FN两种错误。
F-measure。为不同的错误类型（FP和FN）赋予不同的权重。

“bbb”

purity

purity取值范围[0，1]，越大越好。当cluster数量(k)很大时，purity值容易高，极端情况下，当一个类只包含一条记录时，purity取1.

\begin{equation}
purity(W,C) = \frac {1}{N} \sum_k \max_j |w_k \cap c_j| \label{rij}
\end{equation}

N为数据集大小，$W=\\{w_1,w_2,…,w_k\\}$ is the set of clusters,即聚类结果。 $C = \\{c_1,c_2,…,c_j\\}$ is the set of classes，即groundtruth. 取$w_k$与$c_j$交集的最大值。

\begin{equation}
purity(W,C) = \frac {1}{17} *（5+4+3）\approx 0.71
\end{equation}

NMI

\begin{equation}
NMI(W,C) = \frac {I(W,C)}{[H(W)+H(C)]/2}
\end{equation}

$I(W,C)$衡量类W与groundtruth C之间的互信息。

\begin{align}
I(W,C) &= \sum_k \sum_j P(w_k \cap c_j) \log \frac {P(w_k \cap c_j)}{P(w_k)P(c_j)} \nonumber \\
&= \sum_k \sum_j \frac{|w_k \cap c_j| }{N} \log \frac {N|w_k \cap c_j| }{|w_k||c_j|} \label{imp}
\end{align}

当类与groundtruth之间关系随机时，互信息取0。当类与groundtruth一样时，互信息取1，当类进一步分割成更小的类时，互信息不变。
极端情况下，当一个类只包含一条记录时，互信息仍然取1. 因此使用熵来惩罚过大的类数量。（熵随着类数目的增加而增加，极端情况下，当一个类只包含一条记录时，熵取最大值）

$H(W)$ 和$H(C)$分布衡量类W与groundtruth C的熵 (entropy).

\begin{equation}
H(W) = - \sum_k P(w_k) \log P(w_k)
\end{equation}

$P(w_k)$ 用 cluster $w_k$中的记录在总数据集的出现频率来估算，即$P(w_k)=|w_k|/N$

RI （Rand index）

类型	说明
TP	2个相似实体被分配到了相同类中
TN	2个不相似实体分配到了不同类中
FP	2个不相似实体分配到了相同类中
FN	2个相似实体分配到了不同类中

\begin{equation}
RI(W,C) = \frac {TP+TN}{TP+FP+FN+TN}
\end{equation}

例子：

$TP+FP = C_6^2+C_6^2+C_5^2 = 40$

$TP = C_5^2+C_4^2+C_3^2+C_2^2 = 20$

thus $FP = 40 -20 = 20$

for each classes, $x:8,o:5,\diamondsuit:4$

the number of similar entities: $TP+FN = C_8^2+C_5^2+C_4^2 = 44$

$FN = 44 - 20 = 24$

the number of dissimilar entities : $ TN+FP = 8\times5+8\times 4+5\times4=92$

$TN = 92-20 = 72$

	same cluster	different clusters
same class	TP=20	FN=24
different classes	FP=20	TN=72

F-measure

$Precision = TP/(TP+FP)$

$Recall= TP/(TP+FN)$
$F_\beta = \frac{(\beta^2+1)PR}{\beta^2P+R}$

当$\beta > 1$时，更多的权重给Recall，对FN的惩罚力度更强。

Reference

Evaluation of clustering

Shannon entropy:

Kullback-Leibler (KL) divergence

useful properties of KL divergence

cross-entropy

logistic sigmoid:

softplus function

Useful Properties of logistic sigmoid and softplus function

three possible sources of uncertainty

Random Variables

Probability Distributions

Discrete Variables and Probability Mass Function (PMF)

Continuous Variables and Probability Density Functions (PDF)

Marginal Probability

Conditional Probability

The Chain Rule of Conditional Probabilities

Independence and Conditional Independence

Expectation, Variance and Covariance

expectation or expected value

variance

covariance 协方差

Common Probability Distributions

Bernoulli distribution

Multinoulli Distribution

Gaussian distribution / normal distribution

Exponential and Laplace Distributions

The Dirac Distribution and Empirical Distribution

Mixtures of Distributions

Structured Probabilistic Models (graphical models)

$L^p$ norm

L0范数

L1范数

L2范数

$L^\infty$范数(max norm)

矩阵范数（Frobenius norm）

向量点乘的范数形式

eigendecomposition

Singular Value Decomposition (SVD分解)

SVD分解用于求伪逆 （The Moore-Penrose Pseudoinverse）

Scalars, Vectors, Matrices and Tensors

transpose

矩阵乘法

Identity and Inverse Matrices

Linear Dependence and Span

space & subspace

span & range

线性方程组有解的条件

矩阵有逆的条件

Special Kinds of Matrices and Vectors

Diagonal matrices

symmetric matrix

unit vector 单位向量

orthogonal matrix 正交矩阵

The Trace Operator （矩阵的迹）

矩阵范数(Frobenius norm)的迹形式：

迹的不变性 （invariant）

The Determinant 矩阵的行列式

sparse函数

full函数

稀疏矩阵的存储

IEPY

DeepDive

brat

YEDDA

什么是Elasticsearch

数据存储

索引

Elasticsearch与传统关系型数据库的关系

相似度计算方法

模糊查询

Reference

聚类的目标

4种评估方法

purity

NMI

RI （Rand index）

F-measure

Reference

SVD分解用于求伪逆（The Moore-Penrose Pseudoinverse）

迹的不变性（invariant）