Optimization

In this chapter, we explain how elsa solves different problems: the general approach it takes, and some algorithms for solving certain specific and common problem formulations.

X-ray CT as an Inverse Problem

Before we dive in, we need to explain how we arrive at the optimization problem. This will only be a sketch; if you are unsure about any of it, we kindly refer you to the references given in the previous chapter. If you already know this part, feel free to skip ahead.

Inverse problems appear in many different forms. Intuitively, you can think of solving an inverse problem as finding the cause of an effect, while only knowing the observations. For X-ray CT, the cause is the object of interest, the effect is the X-rays traveling through the object from different positions around it, and the observations are the measured X-ray projections.

Mathematically, we describe the effect as a linear operator (in finite dimensions, you can think of it as a matrix). So what we have is a simple system of linear equations:

$$\mathcal{A} x = m$$

where $\mathcal{A}$ is the effect, which acts on some quantity $x$, and the observation of the effect on the quantity is $m$. Again, for X-ray CT, the quantities are the attenuation coefficients of the object of interest, the observations are the X-ray projections, and the effect is the so-called X-ray transform, which models the interaction of X-rays with the object of interest.

Now, this is simply a linear system of equations, and one could use classical methods, such as the QR decomposition. However, for X-ray CT, the size of the system is typically very large. Each row of the matrix represents the interaction of a single ray with the object of interest, so the number of rows grows quickly with the number of projections and the resolution of each projection. Storing the matrix densely would quickly cost many terabytes of memory, and even a sparse representation is not sufficient. For practical implementations, the X-ray transform is therefore computed on the fly, which makes classical direct methods impossible to use. Here, we will present one way to solve the above equation.
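
To make the idea of an on-the-fly operator concrete, here is a minimal, library-agnostic sketch in plain NumPy (not elsa's actual operator interface): the operator only exposes a forward application and its adjoint, and the full matrix is never stored. The toy matrix `M` merely stands in for a real X-ray transform.

```python
import numpy as np


class MatrixFreeOperator:
    """Minimal sketch of a matrix-free linear operator A.

    Instead of storing A explicitly, we only provide apply (A @ x) and
    apply_adjoint (A.T @ y). A real X-ray transform would compute the
    ray-voxel intersections on the fly inside these two methods.
    """

    def __init__(self, forward, adjoint):
        self._forward = forward
        self._adjoint = adjoint

    def apply(self, x):
        return self._forward(x)

    def apply_adjoint(self, y):
        return self._adjoint(y)


# Toy stand-in for the X-ray transform: a small random matrix, wrapped so
# that the rest of the code never needs to see it explicitly.
rng = np.random.default_rng(0)
M = rng.standard_normal((20, 10))
A = MatrixFreeOperator(lambda x: M @ x, lambda y: M.T @ y)
```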

On the way to a simple optimization problem

A mathematically intuitive approach is to find a solution for which the error is minimal. This could look like this:

$$\min_x \frac{1}{2} || \mathcal{A}x - m ||_2^2$$

This is the classical least squares problem. We try to find some $x$ which, when transformed into the measurement space using the operator $\mathcal{A}$, is as close as possible to the measurements. This formulation of least squares is highly important to many imaging modalities: it measures how far our estimate $x$ is from the measurements we took.
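
As a small sketch (plain NumPy, reusing the hypothetical MatrixFreeOperator from above, not elsa code), evaluating the least squares objective and its gradient only requires the forward and adjoint application of the operator:

```python
import numpy as np


def least_squares(A, x, m):
    """Value of 0.5 * ||A x - m||_2^2."""
    r = A.apply(x) - m
    return 0.5 * float(r @ r)


def least_squares_gradient(A, x, m):
    """Gradient A^T (A x - m), used by the first-order methods below."""
    return A.apply_adjoint(A.apply(x) - m)
```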

This formulation has some other benefits. If the measurements contain some kind of noise, we can still expect some robustness (as long as the noise isn't too strong). This wasn't the case for the above system of linear equations.

Please note that this is only one way to arrive at the least squares formulation. There are many other derivations, most of which are equally valid.

Generalization

Now, assume we have arrived at the above least squares problem. In elsa, we generalize this further to the following formulation:

$$\min_x D(\mathcal{A}x, m) + \sum_k \lambda_k \mathcal{R}_k(x)$$

where $x$ is the quantity one wishes to reconstruct (e.g. the attenuation coefficients), $D(\mathcal{A}x, m)$ represents the data fidelity term, and $\mathcal{R}_k$ are regularization terms with the corresponding regularization parameters $\lambda_k$. $\mathcal{A}$ is the forward model (i.e. the X-ray transform).

The data fidelity term represents how closely our desired quantity models the measurements. In X-ray CT, if you assume Gaussian noise, the least squares functional $\frac{1}{2} || \mathcal{A}x - m ||_2^2$ is most often used. If the dose level is low, and hence the assumption of Gaussian noise is wrong, one often uses the weighted least squares formulation or the transmission log-likelihood.
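
For illustration, here is a sketch of the weighted least squares term (plain NumPy, with a hypothetical weight vector `w`; how the weights are derived from the measurements is an assumption of this sketch, and this is not elsa's functional API):

```python
import numpy as np


def weighted_least_squares(A, x, m, w):
    """Value of 0.5 * ||A x - m||_W^2 with W = diag(w).

    How w is chosen (e.g. from the measured photon counts) depends on the
    noise model and is left open here.
    """
    r = A.apply(x) - m
    return 0.5 * float(r @ (w * r))


def weighted_least_squares_gradient(A, x, m, w):
    """Gradient A^T W (A x - m)."""
    return A.apply_adjoint(w * (A.apply(x) - m))
```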

Regularization enforces some assumed prior information. Common types of regularization are the $\ell^2$, $\ell^1$ and Total Variation (TV) regularization, which all assume different prior information.

Next, we will cover some common types of problems and which algorithms can be used to solve each of them.

Least Squares

Let us first assume no regularization terms and the least squares functional as the data fidelity term. A number of algorithms have been developed to solve this problem. First, as the least squares functional is differentiable, all first-order methods (e.g. FGM or OGM) are possible options. Gradient descent for the least squares functional is also known as the Landweber iteration, for which some properties are well studied (such as the step lengths for which the iteration is guaranteed to converge).
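
A minimal sketch of the Landweber iteration (plain NumPy, not elsa's solver classes), together with a power iteration to estimate $||\mathcal{A}||^2$, from which a convergent step length can be chosen:

```python
import numpy as np


def landweber(A, m, x0, step, iterations=100):
    """Landweber iteration: x_{k+1} = x_k - step * A^T (A x_k - m).

    Converges for 0 < step < 2 / ||A||^2.
    """
    x = x0.copy()
    for _ in range(iterations):
        x = x - step * A.apply_adjoint(A.apply(x) - m)
    return x


def operator_norm_squared(A, x0, iterations=50):
    """Rough power-iteration estimate of ||A||^2, i.e. the largest
    eigenvalue of A^T A."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iterations):
        x = A.apply_adjoint(A.apply(x))
        x = x / np.linalg.norm(x)
    return float(x @ A.apply_adjoint(A.apply(x)))
```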

Another popular algorithm is Conjugate Gradient (CG). CG differs from the usual first-order methods in that it does not simply follow the negative gradient (as vanilla gradient descent does), but always chooses mutually conjugate directions. This drastically improves performance. Importantly, CG usually provides pleasing results in a few iterations, while other gradient-based methods usually need on the order of hundreds of iterations.
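
A sketch of CG applied to the normal equations $\mathcal{A}^T \mathcal{A} x = \mathcal{A}^T m$ (the CGLS variant), again in plain NumPy and not elsa's CG implementation:

```python
import numpy as np


def cgls(A, m, x0, iterations=20):
    """Conjugate Gradient on the normal equations (CGLS sketch)."""
    x = x0.copy()
    r = m - A.apply(x)            # residual in measurement space
    s = A.apply_adjoint(r)        # residual of the normal equations
    p = s.copy()
    gamma = float(s @ s)
    for _ in range(iterations):
        q = A.apply(p)
        alpha = gamma / float(q @ q)
        x = x + alpha * p
        r = r - alpha * q
        s = A.apply_adjoint(r)
        gamma_new = float(s @ s)
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x
```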

However, one of the key problems with the plain least squares approach is noise. If the noise is too severe, the reconstruction will quickly overfit the noise.

$\ell^2$ Regularization

$\ell^2$ regularization is an often used and well studied form of regularization. It assumes that the solution's $\ell^2$-norm is small. As a result, images reconstructed with the $\ell^2$-norm are usually blurry and edges are washed out. However, artifacts which appear as sharp edges are suppressed as well. The problem is also known as Tikhonov regularization.

Most of the algorithms discussed for the least squares problem are also applicable for the $\ell^2$ regularization. Specifically, the first-order methods and CG can be used for it.
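
As a sketch (plain NumPy, not elsa code; the exact scaling of the regularization term, here $\frac{\lambda}{2}||x||_2^2$, is an assumption of this sketch), gradient descent only needs the additional term $\lambda x$ in the gradient:

```python
import numpy as np


def tikhonov_gradient_descent(A, m, x0, lam, step, iterations=200):
    """Gradient descent on 0.5*||A x - m||^2 + 0.5*lam*||x||^2.

    The extra lam * x term in the gradient pulls the solution towards a
    small l2-norm, which suppresses noise but also blurs edges.
    """
    x = x0.copy()
    for _ in range(iterations):
        grad = A.apply_adjoint(A.apply(x) - m) + lam * x
        x = x - step * grad
    return x
```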

Interestingly, the regularization term can incorporate further operators, resulting in the general Tikhonov regularization. Of specific interest is the gradient operator. The problem is then given as:

$$\min_x \frac{1}{2} || \mathcal{A}x - m ||_2^2 + \frac{\lambda}{2} || \mathcal{L} x ||_2^2$$

To a certain degree this reduces the blurriness of the reconstructed images. However, the assumption of e.g. a small gradient (if $\mathcal{L} = \nabla$) is often not practical.
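
A sketch of the general Tikhonov gradient with a simple 1D forward-difference operator standing in for $\mathcal{L} = \nabla$ (plain NumPy; a real implementation would use a proper multi-dimensional gradient operator):

```python
import numpy as np


def forward_diff(x):
    """1D forward differences, a simple stand-in for L = ∇."""
    return x[1:] - x[:-1]


def forward_diff_adjoint(y):
    """Adjoint of the forward-difference operator (a negative divergence)."""
    out = np.zeros(y.size + 1)
    out[:-1] -= y
    out[1:] += y
    return out


def general_tikhonov_gradient(A, x, m, lam):
    """Gradient of 0.5*||A x - m||^2 + 0.5*lam*||L x||^2 with L = forward_diff."""
    return (A.apply_adjoint(A.apply(x) - m)
            + lam * forward_diff_adjoint(forward_diff(x)))
```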

Edge preserving Regularization

Edge preserving regularization improves upon the general Tikhonov problem by replacing the $\ell^2$-norm with an edge preserving functional. A quite well known option is the Huber functional, which is a differentiable approximation of the $\ell^1$-norm. Hence, first-order methods are still usable to solve the problem.

This problem formulation creates reasonably good images. However, it must be tuned carefully: the closer the approximation to the $\ell^1$-norm gets, the larger the Lipschitz constant of the problem becomes, which in turn decreases the possible step size for the first-order methods. On the other hand, if the approximation is not close enough to the $\ell^1$-norm, it will create images which are very similar to the plain $\ell^2$ regularization.
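
A sketch of the Huber functional and its derivative (plain NumPy, one common parameterization; elsa's definition may use a different scaling). Applied to the image gradient, e.g. `huber(forward_diff(x), delta).sum()`, it yields an edge-preserving regularizer; the derivative's Lipschitz constant grows like $1/\delta$, which is exactly the tuning trade-off described above.

```python
import numpy as np


def huber(t, delta):
    """Huber functional approximating |t|: quadratic near zero, linear outside."""
    quad = np.abs(t) <= delta
    return np.where(quad, t**2 / (2.0 * delta), np.abs(t) - delta / 2.0)


def huber_derivative(t, delta):
    """Derivative of the Huber functional, usable inside first-order methods."""
    return np.where(np.abs(t) <= delta, t / delta, np.sign(t))
```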

$\ell^1$ Regularization

Optimization problems with $\ell^1$ regularization are of the form:

$$\min_x \frac{1}{2} || \mathcal{A}x - m ||_2^2 + \lambda || x ||_1$$

This problem is also often known as LASSO (least absolute shrinkage and selection operator). The problem favours sparse solutions, i.e. many components will be zero. For many applications this is beneficial. Although X-ray CT images are not necessarily sparse, it turns out that LASSO problems still handle noise quite well.

However, the LASSO problem comes at a cost: none of the algorithms discussed above still work. As the problem is no longer continuously differentiable, plain first-order methods cannot be used. Other algorithms have been developed for it; among the most important ones are ISTA and FISTA, which are special cases of proximal gradient descent and accelerated proximal gradient descent, respectively.
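
A sketch of FISTA for the LASSO problem (plain NumPy, not elsa's ISTA/FISTA classes); the proximal operator of $\lambda || \cdot ||_1$ is the componentwise soft-thresholding, and ISTA is the same loop without the momentum bookkeeping:

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (componentwise soft shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def fista(A, m, x0, lam, step, iterations=200):
    """FISTA for 0.5*||A x - m||^2 + lam*||x||_1.

    'step' should be at most 1 / ||A||^2 (see the power iteration sketch above).
    """
    x = x0.copy()
    z = x0.copy()
    t = 1.0
    for _ in range(iterations):
        x_new = soft_threshold(z - step * A.apply_adjoint(A.apply(z) - m),
                               lam * step)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```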

Constrained optimization

So far, all problems introduced here are unconstrained. For X-ray CT, the non-negativity constraint is a sensible and powerful constraint, i.e. the solution is restricted to the closed convex set $\mathcal{C} = \mathbb{R}_+^n$. The problem is then defined as:

$$\min_{x \in \mathcal{C}} \frac{1}{2} || \mathcal{A}x - m ||_2^2$$

It can also be formulated using the indicator function $\chi_{\mathcal{C}}$, which is zero on $\mathcal{C}$ and $+\infty$ outside:

$$\min_x \frac{1}{2} || \mathcal{A}x - m ||_2^2 + \chi_{\mathcal{C}}(x)$$

This problem can be solved using proximal gradient descent and accelerated proximal gradient descent, as the proximal operator of the indicator function is the projection onto the set.
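
A sketch of proximal gradient descent with the non-negativity constraint (plain NumPy, not elsa code); since the proximal operator of the indicator function is the projection onto $\mathcal{C} = \mathbb{R}_+^n$, it reduces to clipping negative entries to zero after each gradient step:

```python
import numpy as np


def projected_gradient_descent(A, m, x0, step, iterations=200):
    """Proximal gradient descent for least squares subject to x >= 0.

    The proximal operator of the indicator of the non-negative orthant
    is the projection, i.e. clipping negative entries to zero.
    """
    x = np.maximum(x0, 0.0)
    for _ in range(iterations):
        x = np.maximum(x - step * A.apply_adjoint(A.apply(x) - m), 0.0)
    return x
```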

This problem is a very nice default for X-ray CT. Assuming sufficient and good data, reconstructions with a non-negativity constraint are usually excellent.

Total Variation Regularization

The final form of regularization discussed here is total variation (TV). The problem is defined as:

$$\min_x \frac{1}{2} || \mathcal{A}x - m ||_2^2 + \lambda || \nabla x ||_1$$

The prior information included in this regularization is a sparse gradient, i.e. the gradient is expected to be zero almost everywhere and non-zero only at a few places such as edges. This results in piecewise constant images. For many imaging applications in X-ray CT this is a good default assumption.

This is the hardest problem to solve in this collection; none of the previously mentioned algorithms are able to solve it. There are two popular algorithms for problems with TV regularization: the alternating direction method of multipliers (ADMM) and the Primal-Dual Hybrid Gradient (PDHG) method. In elsa, we currently implement two forms of ADMM: one with an explicit least squares term (ADMML2), and a special version called linearized ADMM (see its documentation for details).
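
As a rough, library-agnostic sketch (plain NumPy, not elsa's ADMM or PDHG implementations), one PDHG (Chambolle-Pock) variant for the TV problem looks like this, reusing the `forward_diff` helpers from the general Tikhonov sketch above; both the data term and the TV term are dualized, and convergence requires $\tau \sigma (||\mathcal{A}||^2 + ||\nabla||^2) \le 1$:

```python
import numpy as np


def pdhg_tv(A, m, x0, lam, tau, sigma, iterations=500):
    """PDHG sketch for 0.5*||A x - m||^2 + lam*||D x||_1, with D = forward_diff.

    The caller has to pick tau and sigma such that
    tau * sigma * (||A||^2 + ||D||^2) <= 1.
    """
    x = x0.copy()
    x_bar = x0.copy()
    y1 = np.zeros(m.size)          # dual variable of the data term
    y2 = np.zeros(x0.size - 1)     # dual variable of the TV term
    for _ in range(iterations):
        # dual updates: proximal operators of the convex conjugates
        y1 = (y1 + sigma * (A.apply(x_bar) - m)) / (1.0 + sigma)
        y2 = np.clip(y2 + sigma * forward_diff(x_bar), -lam, lam)
        # primal update and over-relaxation
        x_new = x - tau * (A.apply_adjoint(y1) + forward_diff_adjoint(y2))
        x_bar = 2.0 * x_new - x
        x = x_new
    return x
```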

Though these algorithms can solve a variety of complex and important problems, they also need tuning of multiple parameters, for which no real default values exist. This makes them cumbersome to work with.