Limited-memory BFGS
Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning.[1][2]
Like the original BFGS, L-BFGS uses an estimate of the inverse Hessian matrix to steer its search through variable space, but where BFGS stores a dense n×n approximation to the inverse Hessian (n being the number of variables in the problem), L-BFGS stores only a few vectors that represent the approximation implicitly. Due to its resulting linear memory requirement, the L-BFGS method is particularly well suited for optimization problems with a large number of variables. Instead of the inverse Hessian Hk, L-BFGS maintains a history of the past m updates of the position x and gradient ∇f(x), where generally the history size m can be small (often m < 10). These updates are used to implicitly perform operations requiring the Hk-vector product.
Algorithm
L-BFGS shares many features with other quasi-Newton algorithms, but is very different in how the matrix-vector multiplication for finding the search direction is carried out. There are multiple published approaches using a history of updates to form this direction vector. Here, we give a common approach, the so-called "two-loop recursion."[3][4]
We'll take as given $x_k$, the position at the $k$-th iteration, and $g_k \equiv \nabla f(x_k)$, where $f$ is the function being minimized, and all vectors are column vectors. We also assume that we have stored the last $m$ updates of the form

$$s_k = x_{k+1} - x_k, \qquad y_k = g_{k+1} - g_k.$$

We'll define $\rho_k = \frac{1}{y_k^\top s_k}$, and $H^0_k$ will be the 'initial' approximation of the inverse Hessian that our estimate at iteration $k$ begins with. Then we can compute the (uphill) direction $z$ as follows:

$q = g_k$
For $i = k-1, k-2, \ldots, k-m$:
    $\alpha_i = \rho_i\, s_i^\top q$
    $q = q - \alpha_i y_i$
$\gamma_k = \dfrac{s_{k-1}^\top y_{k-1}}{y_{k-1}^\top y_{k-1}}$
$H^0_k = \gamma_k I$
$z = H^0_k q$
For $i = k-m, k-m+1, \ldots, k-1$:
    $\beta_i = \rho_i\, y_i^\top z$
    $z = z + s_i (\alpha_i - \beta_i)$
Stop with $H_k g_k = z$.
This formulation is valid whether we are minimizing or maximizing. Note that if we are minimizing, the search direction would be the negative of $z$ (since $z$ is "uphill"), and if we are maximizing, $H^0_k$ should be negative definite rather than positive definite. We would typically do a backtracking line search in the search direction (any line search would be valid, but L-BFGS does not require exact line searches in order to converge).
Commonly, the inverse Hessian $H^0_k$ is represented as a diagonal matrix, so that initially setting $z$ requires only an element-by-element multiplication.
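For concreteness, here is a minimal NumPy sketch of the two-loop recursion above. It is an illustrative implementation rather than the code of any particular library; the function name, the oldest-first ordering of the stored $(s_i, y_i)$ pairs, and the use of the $\gamma_k$ scaling for $H^0_k$ are assumptions made for this example.

```python
import numpy as np

def two_loop_recursion(g, s_list, y_list):
    """Compute z ~ H_k g via the L-BFGS two-loop recursion (illustrative sketch).

    g       -- current gradient (1-D array)
    s_list  -- last m position differences s_i = x_{i+1} - x_i (oldest first)
    y_list  -- last m gradient differences y_i = g_{i+1} - g_i (oldest first)
    """
    rho = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)

    q = g.astype(float).copy()
    # First loop: run over the stored updates from newest to oldest.
    for i in reversed(range(len(s_list))):
        alpha[i] = rho[i] * np.dot(s_list[i], q)
        q -= alpha[i] * y_list[i]

    # Initial inverse-Hessian approximation H_k^0 = gamma_k * I (a diagonal matrix).
    gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    z = gamma * q

    # Second loop: run from oldest to newest.
    for i in range(len(s_list)):
        beta = rho[i] * np.dot(y_list[i], z)
        z += s_list[i] * (alpha[i] - beta)
    return z  # the (uphill) direction; step along -z when minimizing
```

When minimizing, a step would then be taken along $-z$, typically with a backtracking line search choosing the step length.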
This two-loop update only works for the inverse Hessian. Approaches to implementing L-BFGS using the direct approximate Hessian have also been developed, as have other means of approximating the inverse Hessian.[5]
Applications
L-BFGS has been called "the algorithm of choice" for fitting log-linear (MaxEnt) models and conditional random fields with $\ell_2$-regularization.[2]
Variants
Since BFGS (and hence L-BFGS) is designed to minimize smooth functions without constraints, the L-BFGS algorithm must be modified to handle functions that include non-differentiable components or constraints. A popular class of modifications are called active-set methods, based on the concept of the active set. The idea is that when restricted to a small neighborhood of the current iterate, the function and constraints can be simplified.
L-BFGS-B
The L-BFGS-B algorithm extends L-BFGS to handle simple box constraints (aka bound constraints) on variables; that is, constraints of the form li ≤ xi ≤ ui where li and ui are per-variable constant lower and upper bounds, respectively (for each xi, either or both bounds may be omitted).[6][7] The method works by identifying fixed and free variables at every step (using a simple gradient method), and then using the L-BFGS method on the free variables only to get higher accuracy, and then repeating the process.
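As a usage illustration, SciPy's scipy.optimize.minimize provides an interface to the L-BFGS-B code via method="L-BFGS-B" (see the Implementations section below); the quadratic objective and the particular bounds here are made up for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative objective: unconstrained minimum at x = (3, 3, 3, 3).
def f(x):
    return np.sum((x - 3.0) ** 2)

def grad_f(x):
    return 2.0 * (x - 3.0)

x0 = np.zeros(4)
# Box constraints l_i <= x_i <= u_i; None means that bound is omitted.
bounds = [(0.0, 1.0), (0.0, 2.0), (None, 5.0), (None, None)]

res = minimize(f, x0, jac=grad_f, method="L-BFGS-B", bounds=bounds)
print(res.x)  # approximately [1.0, 2.0, 3.0, 3.0]: each component respects its box
```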
OWL-QN
Orthant-wise limited-memory quasi-Newton (OWL-QN) is an L-BFGS variant for fitting $\ell_1$-regularized models, exploiting the inherent sparsity of such models.[2] It minimizes functions of the form

$$f(\vec x) = g(\vec x) + C \|\vec x\|_1$$

where $g$ is a differentiable convex loss function and $C > 0$ is a constant. The method is an active-set type method: at each iterate, it estimates the sign of each component of the variable, and restricts the subsequent step to have the same sign. Once the sign is fixed, the non-differentiable $\|\vec x\|_1$ term becomes a smooth linear term which can be handled by L-BFGS. After an L-BFGS step, the method allows some variables to change sign, and repeats the process.
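As a small illustration of the orthant restriction (a sketch assuming the estimated signs are already available; this is not the full OWL-QN update), the projection that zeroes out any component leaving its chosen orthant could look like:

```python
import numpy as np

def project_onto_orthant(x_trial, orthant_signs):
    """Keep components whose sign matches the chosen orthant; zero out the rest.

    orthant_signs holds the estimated sign (+1, -1, or 0) of each variable.
    This is only the projection step applied after a (modified) L-BFGS step,
    not a complete OWL-QN implementation.
    """
    return np.where(np.sign(x_trial) == orthant_signs, x_trial, 0.0)
```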
O-LBFGS
Schraudolph et al. present an online approximation to both BFGS and L-BFGS.[8] Similar to stochastic gradient descent, this can be used to reduce the computational complexity by evaluating the error function and gradient on a randomly drawn subset of the overall dataset in each iteration.
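The following is a minimal sketch of the minibatch idea only, not Schraudolph et al.'s actual O-LBFGS update: the exact objective and gradient are replaced by estimates computed on a randomly drawn subset of the data, which an online L-BFGS-style method would then use in place of $f(x_k)$ and $g_k$. The least-squares problem, data sizes, and batch size are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10_000, 20))                          # illustrative dataset
b = A @ rng.normal(size=20) + 0.1 * rng.normal(size=10_000)

def minibatch_value_and_grad(x, batch_size=256):
    """Estimate the least-squares loss and its gradient on a random subset."""
    idx = rng.choice(A.shape[0], size=batch_size, replace=False)
    A_b, b_b = A[idx], b[idx]
    r = A_b @ x - b_b
    value = 0.5 * np.dot(r, r) / batch_size
    grad = A_b.T @ r / batch_size
    return value, grad
```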
Implementations
An early, open-source implementation of L-BFGS in Fortran is available from Netlib as a shar archive. Multiple other open-source implementations have been produced as translations of this Fortran code (e.g. in Java, and in Python via SciPy). Other implementations exist:
- fmincon (Matlab optimization toolbox)
- FMINLBFGS (for Matlab, BSD license)
- minFunc (also for Matlab)
- LBFGS-D (in the D programming language)
- Frequently as part of generic optimization libraries (e.g. Mathematica, FuncLib C# library, and dlib C++ library)
- libLBFGS (a C implementation)
- Maximization in Two-Class Logistic Regression (in Microsoft Azure ML)
Implementations of variants
The L-BFGS-B variant also exists as ACM TOMS algorithm 778.[7] In February 2011, some of the authors of the original L-BFGS-B code posted a major update (version 3.0).
A reference implementation[9] is available in Fortran 77 (with a Fortran 90 interface) at the author's website. This version, as well as older versions, has been converted to many other languages, including a Java wrapper for v3.0; Matlab interfaces for v3.0, v2.4, and v2.1; a C++ interface for v2.1; a Python interface for v3.0 as part of scipy.optimize.minimize; an OCaml interface for v2.1 and v3.0; and a C translation of v2.3 produced by f2c. R's general-purpose optimizer routine optim includes L-BFGS-B via method="L-BFGS-B".[10]
There exists a complete C++11 rewrite of the L-BFGS-B solver using Eigen3.
OWL-QN implementations are available in:
- A C++ implementation by its designers, which includes the original ICML paper on the algorithm[2]
- The CRF toolkit Wapiti includes a C implementation
- libLBFGS
Works cited