Conjugate Gradient

The Conjugate Gradient method was originally developed for solving systems of linear equations of the form

{$\mathbf{Ax}=\mathbf{b}$}

where $x$ is an unknown vector, $b$ is a known vector, and $A$ is a square, symmetric, positive-definite matrix. Solving this system is equivalent to finding the minimum of the corresponding quadratic function

{$f(x) = \frac{1}{2}x^TAx - b^Tx + c$}
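To see the connection, note that for a symmetric $A$ the gradient of $f$ is

{$\nabla f(x) = Ax - b$}

so setting the gradient to zero gives exactly $Ax = b$, and for a positive-definite $A$ this stationary point is the unique minimum.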

This method is very similar to Gradient Descent in the sense that, starting from an initial point $x_0$, we choose a direction $d_i$ that takes us to the next point, where the objective function decreases. This is repeated until we reach the minimum of the function, i.e. until the gradient becomes zero.

Unlike Gradient Descent, the direction $d_{i+1}$ is not simply set to $r_{i+1} = -\nabla f(x_{i+1})$; instead it is chosen so that $d_{i+1}$ is conjugate to the previous direction $d_i$ with respect to the Hessian matrix $A$, i.e. $d_{i+1}^T A d_i = 0$:

{$d_{i+1} = r_{i+1} + \beta_{i+1}d_i$}

A few approaches exist for calculating $\beta$. The most popular ones are those of Fletcher-Reeves (1) and Polak-Ribière (2); a short code sketch of both follows the formulas.

  1. {$\beta_{i+1} = \frac{r_{i+1}^T r_{i+1}}{r_{i}^T r_{i}}$}
  2. {$\beta_{i+1} = \frac{r_{i+1}^T(r_{i+1} - r_{i})}{r_{i}^T r_{i}}$}
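As a minimal sketch of the two rules, assuming the current and previous residuals are available as NumPy arrays (the function names here are illustrative):

```python
import numpy as np

def beta_fletcher_reeves(r_new, r_old):
    # Formula (1): ratio of squared norms of successive residuals.
    return (r_new @ r_new) / (r_old @ r_old)

def beta_polak_ribiere(r_new, r_old):
    # Formula (2): uses the change in the residual between iterations.
    return (r_new @ (r_new - r_old)) / (r_old @ r_old)
```

In practice the Polak-Ribière value is often clamped at zero, $\beta \leftarrow \max(0, \beta)$, which effectively restarts the method and helps keep the direction a descent direction.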

So the algorithm looks like this (a code sketch is given after the list):

  1. Initialize $x_0$ to a random value
  2. Initialize direction $d_0 = r_0 = -\nabla f(x_0)$
  3. Perform a line search: $a_i = \arg\min_a f(x_i + a d_i)$
  4. Move to the next point: $x_{i+1} = x_i + a_id_i$
  5. Calculate the negative gradient at the new point: $r_{i+1} = -\nabla f(x_{i+1})$
  6. Calculate the new direction: $d_{i+1} = r_{i+1} + \beta_{i+1}d_i$, with $\beta_{i+1}$ given by (1) or (2)
  7. If $\nabla f(x_{i+1})$ is not small enough, go to step 3
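A minimal sketch of the full loop in Python, assuming the objective f and its gradient grad_f are supplied by the caller and using SciPy's one-dimensional minimizer for the line search in step 3 (all names are illustrative, not a reference implementation):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad_f, x0, tol=1e-6, max_iter=1000):
    x = np.asarray(x0, dtype=float)   # step 1: starting point
    r = -grad_f(x)                    # step 2: negative gradient
    d = r.copy()                      # initial direction d_0 = r_0
    for _ in range(max_iter):
        # Step 3: line search along d for the step size a_i.
        a = minimize_scalar(lambda t: f(x + t * d)).x
        x = x + a * d                 # step 4: move to the next point
        r_new = -grad_f(x)            # step 5: new negative gradient
        if np.linalg.norm(r_new) < tol:
            break                     # step 7: gradient small enough, stop
        # Polak-Ribiere beta (formula 2); Fletcher-Reeves (formula 1)
        # would be (r_new @ r_new) / (r @ r).
        beta = (r_new @ (r_new - r)) / (r @ r)
        d = r_new + beta * d          # step 6: new search direction
        r = r_new
    return x
```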

The line search in step 3 can be replaced with an explicit solution in cases where we can calculate the Hessian matrix $A$ of our objective function. In this case we get

{$a_i = \frac{d_i^T r_i}{d_i^TAd_i}$}
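For a quadratic objective with a known symmetric positive-definite $A$, a sketch of this variant (again with illustrative names) could look like:

```python
import numpy as np

def linear_cg(A, b, x0, tol=1e-10, max_iter=None):
    # Solves Ax = b using the explicit step size
    # a_i = (d_i^T r_i) / (d_i^T A d_i) instead of a line search.
    x = np.asarray(x0, dtype=float)
    r = b - A @ x                     # residual, equal to -grad f(x)
    d = r.copy()
    steps = len(b) if max_iter is None else max_iter
    for _ in range(steps):
        Ad = A @ d
        a = (d @ r) / (d @ Ad)        # explicit step size
        x = x + a * d
        r_new = r - a * Ad            # updated residual
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)   # formula (1)
        d = r_new + beta * d
        r = r_new
    return x
```

For example, `linear_cg(np.array([[4.0, 1.0], [1.0, 3.0]]), np.array([1.0, 2.0]), np.zeros(2))` returns the solution of that small 2×2 system.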
