# Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis

@article{Wojtowytsch2021StochasticGD, title={Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis}, author={Stephan Wojtowytsch}, journal={ArXiv}, year={2021}, volume={abs/2105.01650} }

Stochastic gradient descent (SGD) is one of the most popular algorithms in modern machine learning. The noise encountered in these applications is different from that in many theoretical analyses of stochastic gradient algorithms. In this article, we discuss some of the common properties of energy landscapes and stochastic noise encountered in machine learning problems, and how they affect SGD-based optimization. In particular, we show that the learning rate in SGD with machine learning noise…

#### 3 Citations

Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis

- Computer Science, Mathematics
- ArXiv
- 2021

In a continuous time model for SGD with noise that follows the ‘machine learning scaling’, it is shown that in a certain noise regime, the optimization algorithm prefers ‘flat’ minima of the objective function in a sense which is different from the flat minimum selection of continuous time SGD with homogeneous noise.

A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions

- Computer Science, Mathematics
- ArXiv
- 2021

This article proves the conjecture that the risk of the GD optimization method converges in the training of such ANNs to zero as the width of the ANNs, the number of independent random initializations, and the number of GD steps increase to infinity in the situation where the probability distribution of the input data is equivalent to the continuous uniform distribution on a compact interval.

Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity

- Computer Science
- ArXiv
- 2021

The findings highlight the fact that structured noise can induce better generalisation, and they help explain the better performance of stochastic gradient descent over gradient descent observed in practice.

#### References

Showing 1-10 of 36 references

Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis

- Computer Science, Mathematics
- ArXiv
- 2021

In a continuous time model for SGD with noise that follows the ‘machine learning scaling’, it is shown that in a certain noise regime, the optimization algorithm prefers ‘flat’ minima of the objective function in a sense which is different from the flat minimum selection of continuous time SGD with homogeneous noise.

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

- Mathematics, Computer Science
- ICML
- 2012

This paper investigates the optimality of SGD in a stochastic setting, and shows that for smooth problems, the algorithm attains the optimal O(1/T) rate. However, for non-smooth problems, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.

Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

- Computer Science, Mathematics
- NIPS
- 2011

This work provides a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent as well as a simple modification where iterates are averaged, suggesting that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate, is not robust to the lack of strong convexity or the setting of the proportionality constant.

Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)

- Computer Science, Mathematics
- NIPS
- 2013

We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…

Strong error analysis for stochastic gradient descent optimization algorithms

- Mathematics
- 2018

Stochastic gradient descent (SGD) optimization algorithms are key ingredients in a series of machine learning applications. In this article we perform a rigorous strong error analysis for SGD…

Optimization Methods for Large-Scale Machine Learning

- Computer Science, Mathematics
- SIAM Rev.
- 2018

A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion about the next generation of optimization methods for large-scale machine learning.

AdaGrad stepsizes: sharp convergence over nonconvex landscapes

- Computer Science, Mathematics
- ICML
- 2019

The norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the O(log(N)/√N) rate in the stochastic setting, and at the optimal O(1/N) rate in the batch (non-stochastic) setting; in this sense, the convergence guarantees are “sharp”.

Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron

- Computer Science, Mathematics
- AISTATS
- 2019

It is proved that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions.

On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems

- Computer Science, Mathematics
- NeurIPS
- 2020

This paper analyzes the trajectories of stochastic gradient descent (SGD) to help understand the algorithm's convergence properties in non-convex problems. We first show that the sequence of iterates…

Linear Convergence of Adaptive Stochastic Gradient Descent

- Mathematics, Computer Science
- AISTATS
- 2020

We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions…