Derivatives of xitorch.linalg.symeig¶
Author: Muhammad Firmansyah Kasim (2020)
Problem¶
The function xitorch.linalg.symeig
decomposes a linear operator into its
\(k\) smallest or largest eigenvectors and eigenvalues,

\[
\mathbf{AX} = \mathbf{MXE},
\]

where \(\mathbf{A}, \mathbf{M}\) are symmetric \(n\times n\) linear operators that act as the inputs of the function. The outputs are: \(\mathbf{X}\), an \(n\times k\) matrix containing the eigenvectors in its columns, and \(\mathbf{E}\), a \(k\times k\) diagonal matrix containing the corresponding eigenvalues.
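The decomposition can be checked numerically. The sketch below uses hypothetical small dense matrices in place of xitorch's implicit linear operators, and a Cholesky reduction to a standard symmetric eigenproblem; it verifies \(\mathbf{AX} = \mathbf{MXE}\) and the \(\mathbf{M}\)-normalization of the eigenvectors:

```python
import torch

torch.manual_seed(0)
n, k = 6, 3

# Hypothetical dense stand-ins for the implicit linear operators
A = torch.randn(n, n, dtype=torch.float64)
A = (A + A.T) / 2                                     # symmetric
R = torch.randn(n, n, dtype=torch.float64)
M = R @ R.T + n * torch.eye(n, dtype=torch.float64)   # symmetric positive definite

# Reduce A x = lam M x to a standard problem with the Cholesky factor M = L L^T
L = torch.linalg.cholesky(M)
Li = torch.linalg.inv(L)
evals, V = torch.linalg.eigh(Li @ A @ Li.T)

# Keep the k smallest eigenpairs; columns of X are M-normalized (X^T M X = I)
E = torch.diag(evals[:k])
X = Li.T @ V[:, :k]

assert torch.allclose(A @ X, M @ X @ E, atol=1e-8)
assert torch.allclose(X.T @ M @ X, torch.eye(k, dtype=torch.float64), atol=1e-8)
```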
The linear operators \(\mathbf{A}\) and \(\mathbf{M}\) depend on some parameters, denoted by \(\theta_A\) and \(\theta_M\), respectively. Here we consider only one parameter for each linear operator; extending the result to multiple parameters per operator is trivial, since each parameter yields an expression of the same form.
On this page, we derive the expression for the backward derivative (a.k.a. the vector-Jacobian product) with respect to the linear operators' parameters: \(\overline{\theta_A} \equiv \partial \mathcal{L}/\partial \theta_A\) and \(\overline{\theta_M} \equiv \partial \mathcal{L}/\partial \theta_M\), as functions of \(\mathbf{\overline{X}} \equiv \partial \mathcal{L}/\partial \mathbf{X}\) and \(\mathbf{\overline{E}} \equiv \partial \mathcal{L}/\partial \mathbf{E}\) for a loss value \(\mathcal{L}\). One challenge is that the linear operators \(\mathbf{A}\) and \(\mathbf{M}\) are only available implicitly, through their matrix-vector and right-multiplication routines, without an explicit representation of their matrix elements. Another challenge is that only \(k\) eigenpairs are available, so calculations involving the full set of eigenvectors and eigenvalues cannot be used.
This derivation assumes the eigenvalues are all unique. Cases with degenerate eigenvalues are treated differently.
Forward derivative of a single eigenpair¶
Let’s start with the eigendecomposition expression for one eigenvector and eigenvalue,

\[
\mathbf{Ax} = \lambda \mathbf{Mx},
\tag{1}
\]

where the eigenvector is normalized with respect to \(\mathbf{M}\),

\[
\mathbf{x}^T \mathbf{Mx} = 1.
\tag{2}
\]
Applying the first order derivative to the equations above, we obtain

\[
\mathbf{A'x} + \mathbf{Ax'} = \lambda' \mathbf{Mx} + \lambda \mathbf{M'x} + \lambda \mathbf{Mx'}
\tag{3}
\]

and

\[
\mathbf{x'}^T \mathbf{Mx} + \mathbf{x}^T \mathbf{M'x} + \mathbf{x}^T \mathbf{Mx'} = 0,
\tag{4}
\]

where the prime (\('\)) denotes the first derivative with respect to a parameter.
Applying \(\mathbf{x}^T\) on both sides of equation (3), we obtain

\[
\mathbf{x}^T \mathbf{A'x} + \mathbf{x}^T \mathbf{Ax'} = \lambda' \mathbf{x}^T \mathbf{Mx} + \lambda \mathbf{x}^T \mathbf{M'x} + \lambda \mathbf{x}^T \mathbf{Mx'}.
\tag{5}
\]

Substituting \(\mathbf{x}^T\mathbf{Mx} = 1\) from equation (2) and \(\mathbf{x}^T\mathbf{A} = \lambda \mathbf{x}^T\mathbf{M}\) from the transposed equation (1) (so that the terms \(\lambda \mathbf{x}^T \mathbf{Mx'}\) cancel), we get the derivative of the eigenvalue,

\[
\lambda' = \mathbf{x}^T \mathbf{A'x} - \lambda \mathbf{x}^T \mathbf{M'x}.
\tag{6}
\]
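The eigenvalue derivative \(\lambda' = \mathbf{x}^T\mathbf{A'x} - \lambda\mathbf{x}^T\mathbf{M'x}\) can be verified with a finite-difference check. The sketch below uses hypothetical dense matrices with arbitrary symmetric perturbation directions `dA`, `dM` standing in for the parameter derivatives \(\mathbf{A'}\), \(\mathbf{M'}\):

```python
import torch

torch.manual_seed(1)
n = 5
def sym(B): return (B + B.T) / 2

# Hypothetical dense stand-ins for A, M and their parameter derivatives A', M'
A = sym(torch.randn(n, n, dtype=torch.float64))
R = torch.randn(n, n, dtype=torch.float64)
M = R @ R.T + n * torch.eye(n, dtype=torch.float64)   # symmetric positive definite
dA = sym(torch.randn(n, n, dtype=torch.float64))
dM = sym(torch.randn(n, n, dtype=torch.float64))

def smallest_eigpair(A, M):
    # reduce A x = lam M x to a standard problem via M = L L^T
    L = torch.linalg.cholesky(M)
    Li = torch.linalg.inv(L)
    w, V = torch.linalg.eigh(Li @ A @ Li.T)
    return w[0], Li.T @ V[:, 0]   # x is M-normalized: x^T M x = 1

lam, x = smallest_eigpair(A, M)
pred = x @ dA @ x - lam * (x @ dM @ x)   # lam' = x^T A' x - lam x^T M' x

# central finite difference of the eigenvalue along (dA, dM)
eps = 1e-6
lp, _ = smallest_eigpair(A + eps * dA, M + eps * dM)
lm, _ = smallest_eigpair(A - eps * dA, M - eps * dM)
fd = (lp - lm) / (2 * eps)
assert abs(pred - fd) < 1e-5
```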
To obtain the derivative of the eigenvector, we substitute equation (6) into equation (3) and rearrange it to obtain

\[
(\mathbf{A} - \lambda \mathbf{M}) \mathbf{x'} = (\lambda' \mathbf{M} + \lambda \mathbf{M'} - \mathbf{A'}) \mathbf{x}.
\tag{7}
\]
The matrix \((\mathbf{A} - \lambda \mathbf{M})\) is not full rank, so when it multiplies \(\mathbf{x'}\), the component of \(\mathbf{x'}\) along \(\mathbf{x}\) is lost. To handle this, we split \(\mathbf{x'}\) into two components, one \(\mathbf{M}\)-orthogonal to \(\mathbf{x}\) (\(\mathbf{x_M'}\)) and one parallel to \(\mathbf{x}\) (\(\mathbf{x_{-M}'}\)):

\[
\mathbf{x'} = \mathbf{x_M'} + \mathbf{x_{-M}'},
\tag{8}
\]

where

\[
\mathbf{x}^T \mathbf{M} \mathbf{x_M'} = 0 \mathrm{\ \ and\ \ } \mathbf{x_{-M}'} = c\mathbf{x} \mathrm{\ for\ some\ scalar\ } c.
\tag{9}
\]

A simple rearrangement of the equations above yields

\[
\mathbf{x_M'} = (\mathbf{I} - \mathbf{xx}^T\mathbf{M}) \mathbf{x'} \mathrm{\ \ and\ \ } \mathbf{x_{-M}'} = \mathbf{xx}^T\mathbf{M} \mathbf{x'}.
\tag{10}
\]
Using the equations (10) in equations (4) and (7) produces

\[
\begin{aligned}
2\mathbf{x}^T \mathbf{M} \mathbf{x_{-M}'} + \mathbf{x}^T \mathbf{M'x} &= 0 \\
(\mathbf{A} - \lambda \mathbf{M}) \mathbf{x_M'} &= (\lambda' \mathbf{M} + \lambda \mathbf{M'} - \mathbf{A'}) \mathbf{x},
\end{aligned}
\tag{11}
\]

where the second line uses \((\mathbf{A} - \lambda \mathbf{M})\mathbf{x_{-M}'} = \mathbf{0}\), since \(\mathbf{x_{-M}'}\) is parallel to \(\mathbf{x}\). Multiplying the first equation above with \(\mathbf{x}\) and using the second equation from (10), we obtain

\[
\mathbf{x_{-M}'} = -\frac{1}{2} \mathbf{x} \left(\mathbf{x}^T \mathbf{M'x}\right).
\tag{12}
\]
Moving the matrix \((\mathbf{A} - \lambda \mathbf{M})\) in the second equation of (11) to the right hand side gives us

\[
\mathbf{x_M'} = (\mathbf{I} - \mathbf{xx}^T\mathbf{M}) (\mathbf{A} - \lambda \mathbf{M})^{+} (\lambda' \mathbf{M} + \lambda \mathbf{M'} - \mathbf{A'}) \mathbf{x},
\tag{13}
\]

where the symbol \(\mathbf{C}^{+}\) indicates the pseudo-inverse of a matrix \(\mathbf{C}\). The additional term \((\mathbf{I} - \mathbf{xx}^T\mathbf{M})\) makes sure the result is \(\mathbf{M}\)-orthogonal to \(\mathbf{x}\). The application of the pseudo-inverse to a vector can be computed with a standard linear-system solver, since the right hand side of (11) is orthogonal to \(\mathbf{x}\) and therefore lies in the range of \((\mathbf{A} - \lambda \mathbf{M})\).
To summarize, the forward derivatives are given by

\[
\begin{aligned}
\lambda' &= \mathbf{x}^T (\mathbf{A'} - \lambda \mathbf{M'}) \mathbf{x} \\
\mathbf{x'} &= (\mathbf{I} - \mathbf{xx}^T\mathbf{M}) (\mathbf{A} - \lambda \mathbf{M})^{+} (\lambda' \mathbf{M} + \lambda \mathbf{M'} - \mathbf{A'}) \mathbf{x} - \frac{1}{2} \mathbf{x} \left(\mathbf{x}^T \mathbf{M'x}\right).
\end{aligned}
\tag{14}
\]
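The summarized forward derivatives can be checked against a central finite difference of the eigenvector. The sketch below uses hypothetical dense matrices and `torch.linalg.pinv` for the pseudo-inverse (an explicit solver would be used in the implicit-operator setting):

```python
import torch

torch.manual_seed(2)
n = 6
def sym(B): return (B + B.T) / 2

A = sym(torch.randn(n, n, dtype=torch.float64))
R = torch.randn(n, n, dtype=torch.float64)
M = R @ R.T + n * torch.eye(n, dtype=torch.float64)
dA = sym(torch.randn(n, n, dtype=torch.float64))   # stand-in for A'
dM = sym(torch.randn(n, n, dtype=torch.float64))   # stand-in for M'

def smallest_eigpair(A, M, ref=None):
    L = torch.linalg.cholesky(M)
    Li = torch.linalg.inv(L)
    w, V = torch.linalg.eigh(Li @ A @ Li.T)
    x = Li.T @ V[:, 0]
    if ref is not None and torch.dot(x, ref) < 0:
        x = -x   # eigenvectors are defined up to a sign; align with a reference
    return w[0], x

lam, x = smallest_eigpair(A, M)
I = torch.eye(n, dtype=torch.float64)
xxT = torch.outer(x, x)

dlam = x @ (dA - lam * dM) @ x                     # eigenvalue derivative
rhs = (dlam * M + lam * dM - dA) @ x
dx = (I - xxT @ M) @ (torch.linalg.pinv(A - lam * M) @ rhs) - 0.5 * xxT @ dM @ x

# compare against a central finite difference of the eigenvector
eps = 1e-6
_, xp = smallest_eigpair(A + eps * dA, M + eps * dM, ref=x)
_, xm = smallest_eigpair(A - eps * dA, M - eps * dM, ref=x)
fd = (xp - xm) / (2 * eps)
assert torch.allclose(dx, fd, atol=1e-4)
```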
Backward derivative¶
From the forward derivatives, it is relatively straightforward to get the backward derivatives. Using the relation

\[
\overline{\theta} = \overline{\lambda} \frac{\partial \lambda}{\partial \theta} + \mathbf{\overline{x}}^T \frac{\partial \mathbf{x}}{\partial \theta},
\tag{15}
\]

and substituting the forward derivatives from (14), we get the gradients with respect to the elements of \(\mathbf{A}\) and \(\mathbf{M}\) for a single eigenpair,

\[
\begin{aligned}
\mathbf{\overline{A}} &= \overline{\lambda} \mathbf{xx}^T - \mathbf{\overline{y}} \mathbf{x}^T \\
\mathbf{\overline{M}} &= -\lambda \overline{\lambda} \mathbf{xx}^T + \lambda \mathbf{\overline{y}} \mathbf{x}^T - \frac{1}{2} \left(\mathbf{\overline{x}}^T \mathbf{x}\right) \mathbf{xx}^T,
\end{aligned}
\tag{16}
\]

where

\[
\mathbf{\overline{y}} = (\mathbf{I} - \mathbf{xx}^T\mathbf{M}) (\mathbf{A} - \lambda \mathbf{M})^{+} (\mathbf{I} - \mathbf{M}\mathbf{xx}^T) \mathbf{\overline{x}}.
\tag{17}
\]
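Since the backward results must reproduce \(\overline{\lambda}\lambda' + \mathbf{\overline{x}}^T\mathbf{x'}\) for any perturbation of \(\mathbf{A}\) and \(\mathbf{M}\), they can be checked with a finite difference. A sketch with hypothetical dense matrices and arbitrary upstream gradients \(\overline{\lambda}\), \(\mathbf{\overline{x}}\):

```python
import torch

torch.manual_seed(3)
n = 6
def sym(B): return (B + B.T) / 2

A = sym(torch.randn(n, n, dtype=torch.float64))
R = torch.randn(n, n, dtype=torch.float64)
M = R @ R.T + n * torch.eye(n, dtype=torch.float64)
dA = sym(torch.randn(n, n, dtype=torch.float64))   # arbitrary symmetric perturbations
dM = sym(torch.randn(n, n, dtype=torch.float64))

def smallest_eigpair(A, M, ref=None):
    L = torch.linalg.cholesky(M)
    Li = torch.linalg.inv(L)
    w, V = torch.linalg.eigh(Li @ A @ Li.T)
    x = Li.T @ V[:, 0]
    if ref is not None and torch.dot(x, ref) < 0:
        x = -x
    return w[0], x

lam, x = smallest_eigpair(A, M)
lam_bar = 1.7                                      # arbitrary upstream gradients
x_bar = torch.randn(n, dtype=torch.float64)

I = torch.eye(n, dtype=torch.float64)
xxT = torch.outer(x, x)
ybar = (I - xxT @ M) @ torch.linalg.pinv(A - lam * M) @ (I - M @ xxT) @ x_bar
Abar = lam_bar * xxT - torch.outer(ybar, x)
Mbar = -lam * lam_bar * xxT + lam * torch.outer(ybar, x) - 0.5 * (x_bar @ x) * xxT

# the backward results must reproduce the directional derivative of
# L = lam_bar * lam + x_bar . x along (dA, dM)
pred = (Abar * dA).sum() + (Mbar * dM).sum()
eps = 1e-6
lp, xp = smallest_eigpair(A + eps * dA, M + eps * dM, ref=x)
lm, xm = smallest_eigpair(A - eps * dA, M - eps * dM, ref=x)
fd = (lam_bar * (lp - lm) + x_bar @ (xp - xm)) / (2 * eps)
assert abs(pred - fd) < 1e-4
```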
For cases with multiple eigenpairs, the contributions should be summed over all eigenvalues and eigenvectors,

\[
\begin{aligned}
\mathbf{\overline{A}} &= \left(\mathbf{X\overline{E}} - \mathbf{\overline{Y}}\right) \mathbf{X}^T \\
\mathbf{\overline{M}} &= -\left(\mathbf{X\overline{E}} - \mathbf{\overline{Y}}\right) \mathbf{E} \mathbf{X}^T - \frac{1}{2} \mathbf{X} \left[\mathbf{I} \circ \left(\mathbf{\overline{X}}^T \mathbf{X}\right)\right] \mathbf{X}^T,
\end{aligned}
\tag{18}
\]

where \(\circ\) indicates element-wise multiplication and the \(i\)-th column of \(\mathbf{\overline{Y}}\) is given by

\[
\mathbf{\overline{y}}_i = (\mathbf{I} - \mathbf{x}_i\mathbf{x}_i^T\mathbf{M}) (\mathbf{A} - \lambda_i \mathbf{M})^{+} (\mathbf{I} - \mathbf{M}\mathbf{x}_i\mathbf{x}_i^T) \mathbf{\overline{x}}_i,
\tag{19}
\]

with \(\mathbf{x}_i\) and \(\mathbf{\overline{x}}_i\) the \(i\)-th columns of \(\mathbf{X}\) and \(\mathbf{\overline{X}}\), and \(\lambda_i\) the \(i\)-th eigenvalue.
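A sketch of the multi-eigenpair formulas with hypothetical dense matrices, checked against a finite difference of \(\textstyle\sum_i \overline{\lambda}_i \lambda_i + \sum_i \mathbf{\overline{x}}_i^T \mathbf{x}_i\):

```python
import torch

torch.manual_seed(4)
n, k = 7, 3
def sym(B): return (B + B.T) / 2

A = sym(torch.randn(n, n, dtype=torch.float64))
R = torch.randn(n, n, dtype=torch.float64)
M = R @ R.T + n * torch.eye(n, dtype=torch.float64)
dA = sym(torch.randn(n, n, dtype=torch.float64))
dM = sym(torch.randn(n, n, dtype=torch.float64))

def symeig_dense(A, M, k, ref=None):
    L = torch.linalg.cholesky(M)
    Li = torch.linalg.inv(L)
    w, V = torch.linalg.eigh(Li @ A @ Li.T)
    X = Li.T @ V[:, :k]
    if ref is not None:
        X = X * torch.sign((X * ref).sum(0))   # align column signs with a reference
    return w[:k], X

evals, X = symeig_dense(A, M, k)
Ebar = torch.randn(k, dtype=torch.float64)     # arbitrary upstream gradients
Xbar = torch.randn(n, k, dtype=torch.float64)

I = torch.eye(n, dtype=torch.float64)
Ybar = torch.zeros(n, k, dtype=torch.float64)
for i in range(k):
    x = X[:, i]
    P = I - torch.outer(x, x) @ M              # note: P.T = I - M x x^T
    Ybar[:, i] = P @ torch.linalg.pinv(A - evals[i] * M) @ P.T @ Xbar[:, i]

XEY = X @ torch.diag(Ebar) - Ybar
Abar = XEY @ X.T
Mbar = -XEY @ torch.diag(evals) @ X.T - 0.5 * X @ torch.diag((Xbar * X).sum(0)) @ X.T

pred = (Abar * dA).sum() + (Mbar * dM).sum()
eps = 1e-6
wp, Xp = symeig_dense(A + eps * dA, M + eps * dM, k, ref=X)
wm, Xm = symeig_dense(A - eps * dA, M - eps * dM, k, ref=X)
fd = (Ebar @ (wp - wm) + (Xbar * (Xp - Xm)).sum()) / (2 * eps)
assert abs(pred - fd) < 1e-4
```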
Given the gradient with respect to each element of the linear operator, the gradients with respect to the parameters of \(\mathbf{A}\) and \(\mathbf{M}\) are

\[
\begin{aligned}
\overline{\theta_A} &= \sum_{ij} \mathbf{\overline{A}}_{ij} \frac{\partial \mathbf{A}_{ij}}{\partial \theta_A} = \mathrm{tr}\left(\mathbf{\overline{A}}^T \frac{\partial \mathbf{A}}{\partial \theta_A}\right) \\
\overline{\theta_M} &= \sum_{ij} \mathbf{\overline{M}}_{ij} \frac{\partial \mathbf{M}_{ij}}{\partial \theta_M} = \mathrm{tr}\left(\mathbf{\overline{M}}^T \frac{\partial \mathbf{M}}{\partial \theta_M}\right),
\end{aligned}
\tag{20}
\]

or more conveniently written as

\[
\begin{aligned}
\overline{\theta_A} &= \mathrm{tr}\left[\left(\mathbf{X\overline{E}} - \mathbf{\overline{Y}}\right)^T \frac{\partial \mathbf{A}}{\partial \theta_A} \mathbf{X}\right] \\
\overline{\theta_M} &= \mathrm{tr}\left[\left(-\left(\mathbf{X\overline{E}} - \mathbf{\overline{Y}}\right)\mathbf{E} - \frac{1}{2} \mathbf{X}\left[\mathbf{I} \circ \left(\mathbf{\overline{X}}^T \mathbf{X}\right)\right]\right)^T \frac{\partial \mathbf{M}}{\partial \theta_M} \mathbf{X}\right].
\end{aligned}
\tag{21}
\]
In PyTorch, the terms above can be calculated by propagating the gradient from \(\mathbf{AX}\) or \(\mathbf{MX}\) with the initial gradient given by the left term, e.g. \(\left(\mathbf{X\overline{E}} - \mathbf{\overline{Y}}\right)\) for \(\overline{\theta_A}\).
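A minimal sketch of this propagation trick, assuming a hypothetical parameterization \(\mathbf{A}(\theta) = \theta \mathbf{A}_0 + \mathbf{C}\) and stand-in tensors for \(\mathbf{X}\) and \(\mathbf{X\overline{E}} - \mathbf{\overline{Y}}\):

```python
import torch

torch.manual_seed(5)
n, k = 5, 2
def sym(B): return (B + B.T) / 2

# hypothetical parameterization A(theta) = theta * A0 + C, for illustration only
theta = torch.tensor(1.3, dtype=torch.float64, requires_grad=True)
A0 = sym(torch.randn(n, n, dtype=torch.float64))
C = sym(torch.randn(n, n, dtype=torch.float64))

X = torch.randn(n, k, dtype=torch.float64)   # stand-in for the eigenvectors (detached)
W = torch.randn(n, k, dtype=torch.float64)   # stand-in for X Ebar - Ybar

# propagate the gradient from A(theta) @ X with W as the initial (upstream) gradient
AX = (theta * A0 + C) @ X
theta_bar, = torch.autograd.grad(AX, theta, grad_outputs=W)

# this equals tr[W^T (dA/dtheta) X] with dA/dtheta = A0
assert torch.allclose(theta_bar, (W * (A0 @ X)).sum())
```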