Sparse modeling: Chapter 5 From exact solution to approximate solution

Overview

Sparse modeling Implemented the numerical example in Chapter 5 in Python. Jupyter notebook summarizing the code and experimental results. Introduction

Chapter 5 deals with the case where the observation data contains noise. Since the observation data contains noise, the constraint of the error term is weakened to $ \ epsilon $ or less.

(P_{0}^{\epsilon}): \min_{\mathbf{x}} \|\|\mathbf{x}\|\|_{0} \text{ subject to } \|\|\mathbf{b}-\mathbf{Ax}\|\| _{2} \leq \epsilon

In Orthogonal Matching Tracking (OMP), simply find the solution as the stop condition $ \ epsilon_ {0} = \ epsilon $ for iterative calculation.

Relaxing the $ \ ell_ {0} $ norm to the $ \ ell_ {1} $ norm yields a variant of $ (P_ {1}) $, basis pursuit denoising (BPDN).

(P_{1}^{\epsilon}): \min_{\mathbf{x}} \|\|\mathbf{x}\|\|_{1} \text{ subject to } \|\|\mathbf{b}-\mathbf{Ax}\|\| _{2} \leq \epsilon

For a suitable Lagrange multiplier $ \ lambda $, the solution of $ (P_ {1} ^ {\ epsilon}) $ is consistent with the solution of the following unconstrained optimization problem.

(Q_{1}^{\lambda}): \min_{\mathbf{x}} \lambda \|\|\mathbf{x}\|\|_{1} + \frac{1}{2} \|\|\mathbf{b}-\mathbf{Ax}\|\| _{2}

A simple way to solve $ (Q_ {1} ^ {\ lambda}) $ is the iterative-reweighted-least-squares (IRLS) method.

\mathbf{X}=\rm{diag}(\mathbf{\|x\|})Then$ ||\mathbf{x}||_{1} = \mathbf{x}\mathbf{X}^{-1}\mathbf{x}It becomes. In other words\ell_{1}The norm is (weighted) squared\ell_{2}It can be regarded as a norm. Current approximate solution\mathbf{x_{\rm{k-1}}}Given,\mathbf{X_{\rm{k-1}}}=\rm{diag}(\left|\mathbf{x_{\rm{k-1}}}\right|)$Anyway, solve the following problem.

(M_{k}): \min_{\mathbf{x}} \lambda \mathbf{x^{\rm{T}}} \mathbf{X^{\rm{-1}}} \mathbf{x} + \frac{1}{2} \|\| \mathbf{b} - \mathbf{Ax} \|\|_{2}^{2}

So the IRLS algorithm uses the current weights $ \ mathbf {X_ {\ rm {k-1}}} $ to get $ \ mathbf {x_ {k}} $ and then $ \ mathbf {x_ {k} } $ Is used to obtain $ \ mathbf {X_ {\ rm {k}}} $ alternately, and approximately $ (Q_ {1} ^ {\ lambda}) $ is obtained. .. Chapter 5 mainly consists of an explanation of IRLS that approximates $ (Q_ {1} ^ {\ lambda}) $ and an outline of LARS (least angle regression stagewise).

IRLS algorithm

Initialization

As $ k = 0 $

(Arbitrary) initial solution $ \ mathbf {x_ {\ rm {0}}} = \ mathbf {1} $
Initial weight matrix $ \ mathbf {X_ {\ rm {0}}} = \ mathbf {I} $

Main loop

Set $ k \ leftarrow k + 1 $ and execute the following steps.

Least squares with regularization: The following linear simultaneous equations $ \ left (2 \ lambda \ mathbf {X_ {\ rm {k-1}} ^ {\ rm {-1}}} + \ mathbf {A ^ {\ rm {T}} A} \ right) \ mathbf {x} = \ mathbf {A ^ {\ rm {T}} b} $ is solved using the iterative method, and the approximate solution $ \ mathbf {x_ { Get \ rm {k}}} $. The textbook states that several iterations of the conjugate gradient method are sufficient. In this study, scipy.optimize.minimize is used.

*Weight update:\mathbf{x_{\rm{k}}}UsingX_{k}(j,j)=\|x_{k}(j)\|+\epsilonAs a weighted diagonal matrix\mathbf{X}Is updated. *Stop condition: If\|\|\mathbf{x_{\rm{k}}}-\mathbf{x_{\rm{k-1}}}\|\|_{2}If is less than the pre-given threshold, it exits, otherwise it repeats.

Experiment

simulation

$ \ mathbf {x} $ 200 dimensional vector $ \ mathbf {b} $ 100 dimensional vector $ \ mathbf {A} $ 100 × 200 dimensional matrix, column normalization of uniform random numbers of $ [-1, 1) $

$ \ mathbf {x} $ has a uniform distribution in the range of $ [-2, -1] \ cup [1, 2] $, where $ k = 4 $ for the number of nonzero elements Extract iid from.
Calculate $ \ mathbf {b} = \ mathbf {Ax} + \ mathbf {e} $ using additive noise $ \ mathbf {e} $ extracted from a Gaussian distribution with standard deviation $ \ sigma = 0.1 $. To do.

One of the issues to be solved is how to determine the value of $ \ lambda $. Here, according to the rule of thumb, $ \ lambda $ is a value close to the ratio of $ \ sigma $ and the standard deviation of the nonzero element. Since the standard deviation of non-zero elements is about 2, the value near $ \ sigma / 2 = 0.05 $ is set as $ \ lambda $.

result

\lambdaThe solution obtained while changing the value of$ \frac{|| \hat{\mathbf{x_{\rm{\lambda}}}} - \mathbf{x_{\rm{0}}} ||}{||\mathbf{x_{\rm{0}}}||} $Evaluated in. The number of IRLS repetitions was 100.

The dashed line is\left\|\log\left(\|\|\mathbf{A\hat{x_{\rm{\lambda}}}}-\mathbf{b}\|\|^{2}/\left(n\sigma^{2}\right)\right)\right\|Indicates the value of\lambdaResidual at what value\|\|\mathbf{A\hat{x_{\rm{\lambda}}}}-\mathbf{b}\|\|Is the power of noisen\sigma^{2}It shows whether it will be about the same as. this is\lambdaAnother empirical rule that determines the value of.

Three solutions when using the best $ \ lambda $ value and when using larger and smaller values

Compared to the true solution $ \ mathbf {x_ {\ rm {0}}} $ (red circle), $ \ mathbf {x_ {\ rm {0}} when using the optimal $ \ lambda $ value } The position of the nonzero element of $ is best restored. If the value of $ \ lambda $ is small, the solution is dense, and if it is large, the solution is sparse.

Display solutions for all $ \ lambda $ at once in one graph

Each plot shows the value of one element of $ \ mathbf {x} $ for $ \ lambda $. The dashed line is the value of $ \ mathbf {x_ {\ rm {0}}} $.

*Optimal\lambdaFunction value for the number of IRLS iterations when(\lambda\|\|\mathbf{x}\|\|_{1}+\frac{1}{2}\|\|\mathbf{b}-\mathbf{Ax}\|\|)

LARS algorithm

I couldn't understand it, so I only experimented with sklearn's Lasso Lars.

Experiment

simulation

$ \ mathbf {x} $ 50 dimensional vector $ \ mathbf {b} $ 30 dimensional vector $ \ mathbf {A} $ 30 × 50 dimensional matrix, column normalization of uniform random numbers of $ [-1, 1) $

Assuming that the number of non-zero elements k of $ \ mathbf {x} $, extract the value of the non-zero element from the uniform distribution in the range of $ [-2, -1] \ cup [1, 2] $. ..
Calculate $ \ mathbf {b} = \ mathbf {Ax} + \ mathbf {e} $ using additive noise $ \ mathbf {e} $ extracted from a Gaussian distribution with standard deviation $ \ sigma = 0.1 $. To do.
Simulate 200 times each with $ k = 1, ..., 15 $
Compare OMP and LARS

index

The $ L_2 $ norm of the estimated $ \ mathbf {\ hat {x}} $ and the true $ \ mathbf {x} $ *Estimated\mathbf{\hat{x}}And true\mathbf{x}Distance between supports$dist(\hat{S},S)=\frac{max\\{|\hat{S}|, |S| \\} - |\hat{S}\cap S|}{max\\{|\hat{S}|,|S|\\}}$
Take the average of 200 simulations.

result

Consideration

If k is small, OMP has a smaller error. When k is large, LARS has a smaller error.
Support restoration seems to be more accurate in OMP. It reverses as K increases.