How do we find the best \(W\)?
\[w^*=\arg\min_w L(w)\]
Random search (bad idea!)
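Randomly generate weight matrices, compute the loss for each, and keep the matrix with the lowest loss. On the test set this reaches only 15.5% accuracy.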
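The random search procedure described above can be sketched as follows. This is a minimal illustration, not the classifier experiment behind the 15.5% figure: the quadratic `loss`, the sampling range `scale`, and the function name `random_search` are all assumptions made for the example.

```python
import random

def loss(w):
    # Hypothetical toy loss: a quadratic bowl with its minimum at (3, -2).
    return (w[0] - 3.0) ** 2 + (w[1] + 2.0) ** 2

def random_search(loss_fn, num_trials=1000, dim=2, scale=5.0, seed=0):
    """Sample random weight vectors and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_w, best_loss = None, float("inf")
    for _ in range(num_trials):
        w = [rng.uniform(-scale, scale) for _ in range(dim)]
        current = loss_fn(w)
        if current < best_loss:
            best_w, best_loss = w, current
    return best_w, best_loss

best_w, best_loss = random_search(loss)
print(best_w, best_loss)
```

Nothing in the loop uses information about the shape of the loss, which is why the approach scales so poorly as the number of weights grows.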
Follow the slope
In multiple dimensions, the gradient is the vector of (partial derivatives) along each dimension.
Numeric gradient: approximate, slow, easy to write.
Computing the numeric gradient costs \(O(\#dimensions)\) loss evaluations, one finite-difference estimate per dimension.
Analytic gradient: exact, fast, error-prone.
Loss is a function of \(W\). Use calculus to compute an analytic gradient.
In practice: Always use the analytic gradient, but check the implementation against the numerical gradient. This is called a gradient check, e.g. `torch.autograd.gradcheck` and `torch.autograd.gradgradcheck`.
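A gradient check can be sketched in plain Python without any library. The toy loss, its hand-derived gradient, the step `h`, and the tolerance `tol` below are assumptions chosen for illustration; the numeric gradient uses central differences, which is one common choice.

```python
def loss(w):
    # Hypothetical toy loss: L(w) = w0^2 + 3*w0*w1 + 2*w1^2.
    return w[0] ** 2 + 3.0 * w[0] * w[1] + 2.0 * w[1] ** 2

def analytic_gradient(w):
    # Derived by hand from the toy loss above.
    return [2.0 * w[0] + 3.0 * w[1], 3.0 * w[0] + 4.0 * w[1]]

def numeric_gradient(loss_fn, w, h=1e-5):
    """Central differences: two loss evaluations per dimension."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((loss_fn(w_plus) - loss_fn(w_minus)) / (2.0 * h))
    return grad

def gradient_check(loss_fn, grad_fn, w, tol=1e-6):
    """Return True if analytic and numeric gradients agree within tol."""
    num = numeric_gradient(loss_fn, w)
    ana = grad_fn(w)
    return all(abs(n - a) <= tol * max(1.0, abs(n), abs(a))
               for n, a in zip(num, ana))

w = [1.5, -0.5]
print(gradient_check(loss, analytic_gradient, w))
```

The loop over dimensions is what makes the numeric gradient slow, and the hand-written `analytic_gradient` is where bugs tend to hide, which is the reason for checking one against the other.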
Gradient Descent¶
Idea: Iteratively step in the direction of the negative gradient (direction of local steepest descent).
    w = initialize_weights()
    for t in range(num_iterations):
        dw = compute_gradient(L, data, w)
        w -= learning_rate * dw

Hyperparameters:
- Weight initialization method
- Number of steps
- Learning rate
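The loop above is pseudocode. A minimal runnable sketch of the same idea, assuming a hypothetical toy loss \((w_0-3)^2+(w_1+2)^2\) and plain Python lists in place of tensors, could look like this:

```python
def compute_gradient(w):
    # Gradient of the hypothetical toy loss L(w) = (w0 - 3)^2 + (w1 + 2)^2.
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 2.0)]

def gradient_descent(w_init, learning_rate=0.1, num_iterations=100):
    """Repeatedly step in the direction of the negative gradient."""
    w = list(w_init)  # weight initialization
    for _ in range(num_iterations):  # number of steps
        dw = compute_gradient(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, dw)]
    return w

w_final = gradient_descent([0.0, 0.0])
print(w_final)
```

The three hyperparameters from the list appear as `w_init`, `num_iterations`, and `learning_rate`.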

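The version of gradient descent discussed so far is actually (Full) Batch Gradient Descent.
\[ \begin{aligned} L & =\dfrac{1}{N}\sum\limits_iL_i(x_i, y_i, W)+\lambda R(W)\\ \nabla_W L(W)& = \dfrac{1}{N}\sum\limits_i\nabla_W L_i(x_i, y_i,W) + \lambda \nabla_W R(W) \end{aligned} \]
It is called “batch” because the loss is an average of the per-example losses over the whole dataset, so the gradient is likewise an average of the per-example gradients.
This can be expensive: every gradient step requires a pass over the entire dataset, which may take a long time. In practice we use SGD instead.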
Stochastic Gradient Descent (SGD)¶
Idea: Rather than computing the sum over the entire training set, approximate the gradient using a small subsample (mini-batch) drawn from the full training set.
Approximate the sum using a minibatch of examples; batch sizes of 32, 64, or 128 are common.
Hyperparameters:
- Weight initialization
- Number of steps
- Learning rate
- Batch size: the number of examples in each mini-batch. Don’t worry too much about it; a common practice is to make the mini-batch as large as will fit in GPU memory.
- Data sampling: does not matter too much. A common strategy is to shuffle the dataset at the start and then iterate through it in order.
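Think of the loss as an expectation over the full data distribution \(p_{data}\).
Our data are sampled from a probability distribution, so the average loss over the sampled examples is a Monte Carlo estimate of that expectation.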
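A minimal sketch of minibatch SGD, assuming a hypothetical one-dimensional linear regression problem with noise-free synthetic data. The helper names (`make_data`, `minibatch_gradient`, `sgd`) and the choice to reshuffle at the start of every epoch are assumptions for this example.

```python
import random

def make_data(n=200, seed=0):
    # Hypothetical synthetic data: y = 2x + 1, no noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0) for x in xs]

def minibatch_gradient(batch, w, b):
    # Gradient of the mean squared error over the minibatch only,
    # used as an estimate of the gradient over the full dataset.
    n = len(batch)
    dw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    db = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return dw, db

def sgd(data, batch_size=32, learning_rate=0.1, num_epochs=50, seed=0):
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0  # weight initialization
    for _ in range(num_epochs):
        rng.shuffle(data)  # shuffle, then sweep through the data in order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            dw, db = minibatch_gradient(batch, w, b)
            w -= learning_rate * dw
            b -= learning_rate * db
    return w, b

w, b = sgd(make_data())
print(w, b)
```

Each update touches only `batch_size` examples, so the cost of a step no longer depends on the size of the dataset.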
Problems with SGD
What if loss changes quickly in one direction and slowly in another?
Loss function has high condition number: ratio of largest to smallest singular value of the Hessian matrix is large.
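- With a large step size, the updates oscillate and more steps are needed.
- With a very small step size, we avoid the oscillation but converge very slowly.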
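The trade-off in the two bullets above can be reproduced on a toy problem. The sketch below assumes a hypothetical loss \(f(x,y)=\tfrac{1}{2}(x^2+100y^2)\), whose Hessian has condition number 100, and compares a step size near the stability limit with a much smaller one.

```python
def gradient(p):
    # Gradient of the hypothetical loss f(x, y) = 0.5 * (x^2 + 100 * y^2).
    # The loss is steep along y and flat along x.
    return [p[0], 100.0 * p[1]]

def run_gd(learning_rate, num_steps, start=(10.0, 1.0)):
    """Plain gradient descent; returns the list of visited points."""
    p = list(start)
    path = [tuple(p)]
    for _ in range(num_steps):
        g = gradient(p)
        p = [pi - learning_rate * gi for pi, gi in zip(p, g)]
        path.append(tuple(p))
    return path

def sign_flips(path, coord):
    # Number of consecutive steps where the coordinate changes sign,
    # used here as a simple measure of zig-zagging.
    return sum(1 for a, b in zip(path, path[1:]) if a[coord] * b[coord] < 0)

large = run_gd(learning_rate=0.019, num_steps=100)
small = run_gd(learning_rate=0.001, num_steps=100)

print("large step: y sign flips =", sign_flips(large, 1), "final x =", large[-1][0])
print("small step: y sign flips =", sign_flips(small, 1), "final x =", small[-1][0])
```

With the larger step the steep coordinate overshoots on every update, while with the smaller step the flat coordinate has barely moved after the same number of steps.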
What if the loss function has a local minimum or saddle point?
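- Local minimum: the gradient is 0 but the point is not the lowest point of the function. We may stop here, because a zero gradient means a zero step.
- Saddle point: the function increases in one direction and decreases in another. The gradient is also 0 here. (Saddle points are common in high-dimensional optimization.)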
Zero gradient, gradient descent gets stuck.
The stochastic part: our gradients come from minibatches, so they can be noisy!
SGD + Momentum
- Build up “velocity” as a running mean of gradients
- Rho gives “friction”; typically rho=0.9 or 0.99
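Picture a ball rolling downhill: it picks up speed as it goes, and it keeps moving in that direction even when the local gradient is not aligned with its direction of motion.
You may see SGD+Momentum formulated in different ways, but they are equivalent: they give the same sequence of x.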
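How momentum addresses the three problems with SGD:
- Local minima and saddle points: even at a point where the gradient is zero, we still have velocity and can keep moving.
- Poor conditioning: the velocity is effectively a weighted average of the gradients seen during training, so when the updates oscillate, the velocity vector helps smooth them out.
- Noisy gradients: the same averaging also smooths out noise.
Momentum update: combine the gradient at the current point with the velocity to get the step used to update the weights.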
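The update formulas themselves are not reproduced in these notes. The sketch below assumes the two commonly seen forms, \(v_{t+1}=\rho v_t+\nabla f(x_t),\ x_{t+1}=x_t-\alpha v_{t+1}\) and \(v_{t+1}=\rho v_t-\alpha\nabla f(x_t),\ x_{t+1}=x_t+v_{t+1}\), and checks on a hypothetical toy quadratic that they produce the same sequence of x when the learning rate is constant.

```python
def grad(x):
    # Gradient of the hypothetical toy loss f(x) = 0.5 * (x0^2 + 10 * x1^2).
    return [x[0], 10.0 * x[1]]

def momentum_a(x0, lr=0.05, rho=0.9, steps=50):
    # Form A: velocity accumulates raw gradients;
    # the learning rate is applied when updating x.
    x, v = list(x0), [0.0] * len(x0)
    xs = [list(x)]
    for _ in range(steps):
        g = grad(x)
        v = [rho * vi + gi for vi, gi in zip(v, g)]
        x = [xi - lr * vi for xi, vi in zip(x, v)]
        xs.append(list(x))
    return xs

def momentum_b(x0, lr=0.05, rho=0.9, steps=50):
    # Form B: velocity already includes the sign and the learning rate.
    x, v = list(x0), [0.0] * len(x0)
    xs = [list(x)]
    for _ in range(steps):
        g = grad(x)
        v = [rho * vi - lr * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
        xs.append(list(x))
    return xs

xs_a = momentum_a([1.0, 1.0])
xs_b = momentum_b([1.0, 1.0])
max_diff = max(abs(a - b) for pa, pb in zip(xs_a, xs_b) for a, b in zip(pa, pb))
print(max_diff)
```

The velocity in form B is just \(-\alpha\) times the velocity in form A, which is why the two trajectories coincide up to floating-point rounding.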
Nesterov Momentum, an alternative form:
“Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction.
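\[v_{t+1}=\rho v_t - \alpha \nabla f(x_t+\rho v_t), \quad x_{t+1}=x_t+v_{t+1}\]
Annoying: usually we want the update in terms of \(x_t, \nabla f(x_t)\).
We normally optimize in terms of the current point. With the change of variables \(\tilde{x}_t=x_t+\rho v_t\), the update can be rewritten in terms of the current position.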
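Substituting \(\tilde{x}_t=x_t+\rho v_t\) into the update gives \(v_{t+1}=\rho v_t-\alpha\nabla f(\tilde{x}_t)\) and \(\tilde{x}_{t+1}=\tilde{x}_t-\rho v_t+(1+\rho)v_{t+1}\). The sketch below, on a hypothetical toy quadratic, runs both forms side by side and checks that \(\tilde{x}_t=x_t+\rho v_t\) holds at every step. Function names and hyperparameter values are assumptions for illustration.

```python
def grad(x):
    # Gradient of the hypothetical toy loss f(x) = 0.5 * (x0^2 + 10 * x1^2).
    return [x[0], 10.0 * x[1]]

def nesterov_lookahead(x0, lr=0.05, rho=0.9, steps=50):
    # Original form: evaluate the gradient at the look-ahead point x + rho * v.
    x, v = list(x0), [0.0] * len(x0)
    history = [(list(x), list(v))]
    for _ in range(steps):
        lookahead = [xi + rho * vi for xi, vi in zip(x, v)]
        g = grad(lookahead)
        v = [rho * vi - lr * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
        history.append((list(x), list(v)))
    return history

def nesterov_current_point(x0, lr=0.05, rho=0.9, steps=50):
    # Rewritten form: store x_tilde = x + rho * v, so the gradient is
    # evaluated at the stored point itself.
    x_tilde, v = list(x0), [0.0] * len(x0)  # v_0 = 0, so x_tilde_0 = x_0
    history = [(list(x_tilde), list(v))]
    for _ in range(steps):
        g = grad(x_tilde)
        v_new = [rho * vi - lr * gi for vi, gi in zip(v, g)]
        x_tilde = [xt - rho * vi + (1.0 + rho) * vn
                   for xt, vi, vn in zip(x_tilde, v, v_new)]
        v = v_new
        history.append((list(x_tilde), list(v)))
    return history

RHO = 0.9
a = nesterov_lookahead([1.0, 1.0], rho=RHO)
b = nesterov_current_point([1.0, 1.0], rho=RHO)
max_gap = max(abs(x[i] + RHO * v[i] - xt[i])
              for (x, v), (xt, _) in zip(a, b) for i in range(2))
print(max_gap)
```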
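We would still like to find other ways to overcome the problems with SGD.
AdaGrad
Idea: Add element-wise scaling of the gradient based on the historical sum of squares in each dimension.
- Progress along “steep” directions is damped; progress along “flat” directions is accelerated.
- What happens if AdaGrad runs for a long time: the squared gradients keep accumulating, so the effective step size keeps shrinking and progress may stall.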
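A minimal sketch of the AdaGrad update described above, assuming a hypothetical toy quadratic and plain Python lists. The names `grad_squared` and `eps` are assumptions for this example; `eps` is a small constant that avoids division by zero.

```python
import math

def grad(x):
    # Gradient of the hypothetical toy loss f(x) = 0.5 * (x0^2 + 10 * x1^2):
    # coordinate 1 is "steep", coordinate 0 is "flat".
    return [x[0], 10.0 * x[1]]

def adagrad(x0, lr=0.1, eps=1e-8, steps=100):
    x = list(x0)
    grad_squared = [0.0] * len(x)  # historical sum of squared gradients
    effective_lr = []              # per-step scaling applied to coordinate 0
    for _ in range(steps):
        g = grad(x)
        grad_squared = [s + gi * gi for s, gi in zip(grad_squared, g)]
        x = [xi - lr * gi / (math.sqrt(s) + eps)
             for xi, gi, s in zip(x, g, grad_squared)]
        effective_lr.append(lr / (math.sqrt(grad_squared[0]) + eps))
    return x, effective_lr

x_final, effective_lr = adagrad([1.0, 1.0])
print(x_final)
print(effective_lr[0], effective_lr[-1])
```

In this toy run both coordinates follow almost the same trajectory even though one gradient is ten times larger, and `effective_lr` never increases because the accumulated sum can only grow.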
RMSProp: “Leaky Adagrad”¶
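- Idea: a decay rate, acting like friction, makes the running sum of squared gradients decay over time instead of growing without bound.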
- Code:
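The code for this step is not included in these notes. A minimal sketch of the usual RMSProp update, assuming a hypothetical toy quadratic and illustrative hyperparameter values (`decay_rate=0.99`, `lr=0.01`), might look like this:

```python
import math

def grad(x):
    # Gradient of the hypothetical toy loss f(x) = 0.5 * (x0^2 + 10 * x1^2).
    return [x[0], 10.0 * x[1]]

def rmsprop(x0, lr=0.01, decay_rate=0.99, eps=1e-8, steps=200):
    x = list(x0)
    grad_squared = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        # Leaky accumulation: old squared gradients decay by decay_rate,
        # so the accumulator tracks a running average rather than a sum.
        grad_squared = [decay_rate * s + (1.0 - decay_rate) * gi * gi
                        for s, gi in zip(grad_squared, g)]
        x = [xi - lr * gi / (math.sqrt(s) + eps)
             for xi, gi, s in zip(x, g, grad_squared)]
    return x, grad_squared

x_final, acc = rmsprop([1.0, 1.0])
print(x_final, acc)
```

The only change from an AdaGrad-style accumulator is the `decay_rate` weighting, which keeps `grad_squared` bounded by the largest squared gradient seen.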
Adam (almost): RMSProp + Momentum¶
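What happens at t=0? (Assume beta2=0.999)
The second-moment estimate starts near 0, so we divide by a square root that is close to 0 and the first steps can be much larger than intended.
Idea: at the start of optimization we want robust estimates of the first and second moments.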
Bias correction for the fact that first and second moment estimates start at zero.
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3, 5e-4, 1e-4 is a great starting point for many models!
Adam: Very Common in Practice!
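A minimal sketch of Adam with bias correction, assuming a hypothetical toy quadratic and plain Python lists. The `bias_correction` flag is an assumption added for this example so that the “almost Adam” variant can be compared on the very first step.

```python
import math

def grad(x):
    # Gradient of the hypothetical toy loss f(x) = 0.5 * (x0^2 + 10 * x1^2).
    return [x[0], 10.0 * x[1]]

def adam(x0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1,
         bias_correction=True):
    x = list(x0)
    m1 = [0.0] * len(x)  # first moment: running mean of gradients (momentum)
    m2 = [0.0] * len(x)  # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(x)
        m1 = [beta1 * m + (1.0 - beta1) * gi for m, gi in zip(m1, g)]
        m2 = [beta2 * m + (1.0 - beta2) * gi * gi for m, gi in zip(m2, g)]
        if bias_correction:
            # Both moments start at zero; dividing by (1 - beta^t) undoes
            # the resulting bias toward zero in early steps.
            m1_hat = [m / (1.0 - beta1 ** t) for m in m1]
            m2_hat = [m / (1.0 - beta2 ** t) for m in m2]
        else:
            m1_hat, m2_hat = m1, m2
        x = [xi - lr * a / (math.sqrt(b) + eps)
             for xi, a, b in zip(x, m1_hat, m2_hat)]
    return x

x_corrected = adam([1.0, 1.0], bias_correction=True)
x_uncorrected = adam([1.0, 1.0], bias_correction=False)
print(1.0 - x_corrected[0], 1.0 - x_uncorrected[0])
```

With bias correction the first step has size close to `lr` in each coordinate; without it, under these settings, the first step is roughly three times larger because the second moment is underestimated more strongly than the first.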
Second-Order Optimization¶
So far: First-Order Optimization.
- Use gradient to make linear approximation
- Step to minimize the approximation
Use gradient and Hessian to make quadratic approximation.
\[ \begin{aligned} L(w) & \approx L(w_0)+(w-w_0)^\top\nabla_w L(w_0)+\dfrac{1}{2}(w-w_0)^\top H_w L(w_0)(w-w_0)\\ w^* & = w_0 - \left(H_w L(w_0)\right)^{-1}\nabla_w L(w_0) \end{aligned} \]
Why is this impractical?
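The Hessian has \(O(N^2)\) elements, and inverting it takes \(O(N^3)\) time, where \(N\) is in the (tens or hundreds of) millions.
In high dimensions the Hessian matrix is very large, and computing its inverse is even more costly.
So this is usable in low dimensions but impractical in high dimensions.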
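In low dimensions the second-order step is easy to carry out. The sketch below assumes a hypothetical two-dimensional quadratic loss \(L(w)=\tfrac{1}{2}w^\top A w-b^\top w\), whose Hessian is the constant matrix \(A\), and solves the \(2\times 2\) system by Cramer's rule instead of forming an explicit inverse.

```python
# Hypothetical quadratic loss L(w) = 0.5 * w^T A w - b^T w,
# chosen so that the minimum is at w* = (3, -2).
A = [[2.0, 1.0],
     [1.0, 3.0]]
b = [4.0, -3.0]

def gradient(w):
    # grad L(w) = A w - b
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]

def solve_2x2(H, g):
    """Solve H s = g for a 2x2 matrix H using Cramer's rule."""
    (h00, h01), (h10, h11) = H
    det = h00 * h11 - h01 * h10
    return [(h11 * g[0] - h01 * g[1]) / det,
            (h00 * g[1] - h10 * g[0]) / det]

def newton_step(w):
    # w_new = w - H^{-1} grad L(w); here the Hessian H equals A everywhere.
    step = solve_2x2(A, gradient(w))
    return [wi - si for wi, si in zip(w, step)]

w1 = newton_step([10.0, 10.0])
print(w1)
```

Because the loss is exactly quadratic, a single step lands on the minimum. For a network with millions of weights, the matrix that plays the role of `A` could not even be stored, which is the point made above.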
In practice:
- Adam is a good default choice in many cases.
- SGD+Momentum can outperform Adam but may require more tuning.
- If you can afford to do full batch updates then try out L-BFGS (and don’t forget to disable all sources of noise).
Summary:
- Use Linear Models for image classification problems
- Use Loss Functions to express preferences over different choices of weights
- Use Regularization to prevent overfitting to training data
- Use Stochastic Gradient Descent to minimize our loss functions and train the model