Multi-Output Kernel Adaptive Filtering with Reduced Complexity

In this paper, two new multi-output kernel adaptive filtering algorithms are developed that exploit the temporal and spatial correlations among the input-output multivariate time series. They are multi-output versions of the popular kernel least mean squares (KLMS) algorithm with two different sparsification criteria. The first one, denoted as MO-QKLMS, uses the coherence criterion in order to limit the dictionary size. The second one, denoted as MO-RFF-KLMS, uses random Fourier features (RFF) to approximate the kernel functions by linear inner products. Simulation results with synthetic and real data are presented to assess convergence speed, steady-state performance and complexities of the proposed algorithms.


INTRODUCTION
Nowadays, many modern machine learning applications require solving several decision-making or prediction problems and, in many cases, the key to obtaining better results and coping with a lack of data is to exploit the existing dependencies between those problems [1][2][3][4], which is often broadly referred to as multitask learning [5][6][7].
This paper focuses on the development of new kernel adaptive filtering (KAF) algorithms for multitask learning and, more specifically, for multi-output online regression [8]. Kernel adaptive filters, which perform adaptive filtering in a high-dimensional reproducing kernel Hilbert space (RKHS), have been successfully applied over the past decades to a variety of nonlinear signal processing problems [9][10][11]. These algorithms have been extensively studied for noise cancellation, channel estimation and nonlinear system identification in an online manner, due to their universal modeling capabilities and modest computational complexity.
However, most of the multi-output KAF algorithms proposed in the recent literature are based on Gaussian processes or on kernelized versions of the recursive least-squares algorithm [12][13][14], which entail a high computational complexity. In this paper, we attempt to fill this gap by studying multi-output versions of the popular KLMS that exploit the temporal (intra) and spatial (inter) correlations among the input-output multivariate time series. A straightforward multi-output KLMS filter is derived by concatenating the time-embedded input time series and performing a matrix-valued kernel expansion to obtain the multivariate output. This is equivalent to a multi-output LMS operating in an RKHS. We explore two sparsification methods to curb the linear growth of the KLMS with the number of training data. The first one uses the coherence criterion to limit the dictionary size and leads to a multi-output version of the quantized KLMS [15] (MO-QKLMS). The second one uses random Fourier features (RFF) [16] to approximate the kernel functions by linear inner products and leads to a multi-output version of the RFF-KLMS [17] (MO-RFF-KLMS). Simulation results with synthetic and real data are presented to assess the convergence speed, steady-state performance and complexity of the proposed algorithms.

(This work was supported by the Ministerio de Ciencia, Innovación y Universidades and AEI/FEDER funds of the E.U., under grant PID2019-104958RB-C43 (ADELE).)

MIMO REGRESSION
We consider a multiple-input multiple-output (MIMO) real nonlinear system whose input consists of M time series, x^m_n, m = 1, ..., M, and which produces P output signals, d^p_n, p = 1, ..., P. We assume the nonlinear MIMO system is causal with a memory (time-embedding) of L taps. Let \mathbf{x}_n \in \mathbb{R}^{LM} denote the regressor obtained by stacking the L most recent samples of each of the M input series, and let \mathbf{d}_n \in \mathbb{R}^{P} collect the P outputs at time n.

Existing kernel least-mean-square (KLMS) algorithms are highly efficient learning machines for the identification of single-output nonlinear systems, but their MIMO counterparts have not yet been widely researched. A suboptimal approach to the problem would be to use P independent KLMS filters, each predicting or identifying one component of the P-dimensional output vector \mathbf{d}_n. That is, we could apply standard KLMS filters to learn independent single-output regression models. However, this approach does not exploit the existing correlations among the time series and is therefore clearly suboptimal. In the following section, we propose two multi-output KLMS-like algorithms that fully exploit the inter-series correlation with reduced complexity.
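As a concrete illustration, the time-embedded regressor can be assembled as follows (a minimal NumPy sketch; the function name and the ordering of samples inside the regressor are our own conventions, not taken from the paper):

```python
import numpy as np

def embed(X, n, L):
    """Stack the L most recent samples of each of the M input series
    into one time-embedded regressor x_n of dimension L*M.

    X : array of shape (N, M), one column per input time series.
    n : current time index (n >= L - 1).
    L : time-embedding (memory) length in taps.
    """
    # For each series m, take x^m_n, x^m_{n-1}, ..., x^m_{n-L+1} and concatenate.
    return np.concatenate([X[n - L + 1:n + 1, m][::-1] for m in range(X.shape[1])])

# Example: M = 3 input series, embedding L = 2.
X = np.arange(12, dtype=float).reshape(4, 3)  # 4 time steps, 3 series
x2 = embed(X, n=2, L=2)                       # regressor of dimension L*M = 6
```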

MULTI-OUTPUT KLMS
Kernel methods are based on a nonlinear transformation \Phi of the data \mathbf{x}_i into a high-dimensional feature space. In this feature space, inner products can be calculated by means of a positive-definite kernel function satisfying Mercer's condition [18]:

\kappa(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle.

This simple idea, also known as the kernel trick, allows us to perform inner-product-based algorithms implicitly in feature space by replacing all inner products with kernel evaluations. Many kernel functions exist, though the most commonly used is the Gaussian kernel

\kappa(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\|\mathbf{x}_i - \mathbf{x}_j\|^2 / (2\sigma_k^2) \right).   (1)

Thanks to the Representer Theorem [19], the output of a single-output KLMS to a new input \mathbf{x}_{n+1} at time instant n+1 can be expressed as a kernel expansion in terms of the training data \mathcal{D}_n = \{\mathbf{x}_i\}_{i=1}^{n}:

y_{n+1} = \sum_{i=1}^{n} a_i \, \kappa(\mathbf{x}_i, \mathbf{x}_{n+1}).   (2)

A straightforward multi-output generalization of (2) is

\mathbf{y}_{n+1} = \mathbf{A}_n \boldsymbol{\kappa}_n,   (3)

where \mathbf{A}_n \in \mathbb{R}^{P \times n} stores one P-dimensional coefficient column per dictionary element and \boldsymbol{\kappa}_n = [\kappa(\mathbf{x}_1, \mathbf{x}_{n+1}), \ldots, \kappa(\mathbf{x}_n, \mathbf{x}_{n+1})]^\top represents the kernel vector.

[Published in the 2021 IEEE Statistical Signal Processing Workshop (SSP), 978-1-7281-5767-2/21/$31.00 ©2021 IEEE.]
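A minimal sketch of this multi-output kernel expansion run online without any sparsification (our own NumPy illustration, not the authors' code; the toy target and hyperparameter values are arbitrary):

```python
import numpy as np

def gaussian_kernel(D, x, sigma):
    """Unit-norm Gaussian kernel between every dictionary element and x."""
    d2 = np.sum((np.asarray(D) - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mo_klms(X, Y, mu=0.5, sigma=1.0):
    """Multi-output KLMS without sparsification: every input joins the
    dictionary and the coefficient matrix gains one column mu*e per step."""
    D, A, preds = [], [], []          # dictionary, coefficient columns, outputs
    for x, d in zip(X, Y):
        if D:
            k = gaussian_kernel(D, x, sigma)
            y = np.asarray(A).T @ k   # multivariate prediction in R^P
        else:
            y = np.zeros_like(d)
        preds.append(y)
        e = d - y                     # multivariate error vector
        D.append(x)
        A.append(mu * e)              # new expansion coefficient (column of A)
    return np.asarray(preds)

# Toy usage: learn a 2-input / 2-output nonlinear map online.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
preds = mo_klms(X, np.tanh(X))
mse = np.mean((np.tanh(X) - preds) ** 2, axis=1)
```

Because every sample joins the dictionary, both the memory and the per-sample cost grow linearly with n, which is precisely the growth that sparsification criteria curb.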
It is well known that the functional representation of the KLMS algorithm grows linearly with the number of processed data, i.e., the dictionary size, leading to a heavy computational burden and huge memory requirements. Therefore, various online sparsification criteria have been developed to curb the growth of the kernel expansion [11,20]. In the following subsections, we explore two sparsification approaches that are particularly well suited to multi-output scenarios.

MO-QKLMS
A single-output KLMS algorithm, named quantized kernel least mean squares (QKLMS), which applies the coherence criterion to achieve sparsification, was proposed in [15]. Here we propose a multi-output generalization of the QKLMS, termed MO-QKLMS.
The MO-QKLMS sparsification procedure uses coherence as a measure to characterize the dictionary, defined in a kernel context as

\mu_{\mathcal{D}} = \max_{i \neq j} \frac{|\kappa(\mathbf{x}_i, \mathbf{x}_j)|}{\sqrt{\kappa(\mathbf{x}_i, \mathbf{x}_i)\,\kappa(\mathbf{x}_j, \mathbf{x}_j)}}.   (4)

Using the unit-norm Gaussian kernel defined in (1) to compute the inner products, (4) simplifies to

\mu_{\mathcal{D}} = \max_{i \neq j} |\kappa(\mathbf{x}_i, \mathbf{x}_j)|.   (5)

When the coherence between the new datum \mathbf{x}_{n+1} and the dictionary elements at time n, \mathcal{D}_n, is below a given threshold u, MO-QKLMS includes \mathbf{x}_{n+1} in the dictionary and the coefficient matrix is updated by appending a new column,

\mathbf{A}_{n+1} = [\mathbf{A}_n, \ \mu \mathbf{e}_{n+1}],   (6)

where \mathbf{e}_{n+1} = \mathbf{d}_{n+1} - \mathbf{y}_{n+1} is the multivariate error vector. When the coherence is above the threshold, the new datum is not included in the dictionary and the coefficients corresponding to the dictionary element closest to \mathbf{x}_{n+1}, say \mathbf{x}_j, are updated as \mathbf{A}_{n+1,j} = \mathbf{A}_{n,j} + \mu \mathbf{e}_{n+1}, where \mu is the step size and \mathbf{A}_{n,j} denotes the j-th column of \mathbf{A}_n. The MO-QKLMS algorithm is summarized in Algorithm 1.
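One update of the procedure above can be sketched as follows (an illustrative NumPy implementation under our own naming; the threshold and step-size values are arbitrary):

```python
import numpy as np

def mo_qklms_step(D, A, x, d, mu=0.5, u=0.9, sigma=1.0):
    """One MO-QKLMS update on a non-empty dictionary D with coefficient
    list A (one vector in R^P per dictionary element)."""
    # Unit-norm Gaussian kernel between x and every dictionary element;
    # its maximum is the coherence of x with the dictionary.
    k = np.exp(-np.sum((np.asarray(D) - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    y = np.asarray(A).T @ k            # multivariate prediction
    e = d - y                          # multivariate error vector
    if k.max() <= u:                   # coherence below threshold: grow dictionary
        D.append(x)
        A.append(mu * e)
    else:                              # coherent datum: update the closest element
        j = int(np.argmax(k))          # largest kernel value = nearest center
        A[j] = A[j] + mu * e
    return e

# Usage: seed the dictionary with a first (input, zero-coefficient) pair.
D = [np.zeros(2)]
A = [np.zeros(3)]                                      # P = 3 outputs
mo_qklms_step(D, A, np.array([0.1, 0.0]), np.ones(3))  # coherent: reuse entry
mo_qklms_step(D, A, np.array([3.0, 0.0]), np.ones(3))  # novel: dictionary grows
```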

MO-RFF-KLMS
Another alternative to limit the growth of the multi-output KLMS, and thus reduce its computational complexity, builds on the random Fourier features KLMS (RFF-KLMS) proposed in [17]. Here we propose a multi-output generalization of the RFF-KLMS, named MO-RFF-KLMS. This KLMS-type algorithm is based on finding good finite-dimensional approximations of the kernel functions. That is, it obtains a mapping \Psi : \mathbb{R}^{LM} \to \mathbb{R}^{D}, with D > LM, such that [16]

\langle \Psi(\mathbf{x}_i), \Psi(\mathbf{x}_j) \rangle \approx \kappa(\mathbf{x}_i, \mathbf{x}_j).   (7)
The feature vectors \Psi(\mathbf{x}) are obtained using random Fourier feature (RFF) maps. The underlying idea is based on Bochner's theorem, which guarantees that the Fourier transform of an appropriately scaled, shift-invariant kernel is a probability density function [16], [21]:

\kappa(\mathbf{x}_i, \mathbf{x}_j) = \int p(\boldsymbol{\omega})\, e^{j\boldsymbol{\omega}^\top (\mathbf{x}_i - \mathbf{x}_j)}\, d\boldsymbol{\omega} = \mathbb{E}_{\boldsymbol{\omega}}\left[ \Psi_{\boldsymbol{\omega}}(\mathbf{x}_i)\, \Psi_{\boldsymbol{\omega}}(\mathbf{x}_j)^* \right],

with \Psi_{\boldsymbol{\omega}}(\mathbf{x}) = e^{j\boldsymbol{\omega}^\top \mathbf{x}}. Therefore, \Psi_{\boldsymbol{\omega}}(\mathbf{x}_i)\,\Psi_{\boldsymbol{\omega}}(\mathbf{x}_j)^* is an unbiased estimate of \kappa(\mathbf{x}_i, \mathbf{x}_j) when \boldsymbol{\omega} is drawn from p(\boldsymbol{\omega}). To reduce the variance of this estimate, a sample average over D randomly drawn frequencies is used. Hence, the D-dimensional inner product

\frac{1}{D} \sum_{k=1}^{D} \Psi_{\boldsymbol{\omega}_k}(\mathbf{x}_i)\, \Psi_{\boldsymbol{\omega}_k}(\mathbf{x}_j)^*   (8)

is a low-variance approximation to the kernel evaluation \kappa(\mathbf{x}_i, \mathbf{x}_j). This approximation improves exponentially fast in D [16].
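The sample-average approximation can be checked numerically (a small NumPy sketch; the kernel width, input dimension and number of features are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_k, dim, D = 1.5, 4, 5000       # kernel width, input dim., number of features

# For the Gaussian kernel, Bochner's theorem gives p(w) = N(0, I/sigma_k^2).
W = rng.normal(scale=1.0 / sigma_k, size=(D, dim))

def psi(x):
    """Complex-exponential features Psi_w(x) = exp(j w^T x), one per frequency."""
    return np.exp(1j * W @ x)

xi, xj = rng.normal(size=dim), rng.normal(size=dim)
approx = np.real(np.mean(psi(xi) * np.conj(psi(xj))))   # (1/D) sum Psi Psi^*
exact = np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma_k ** 2))
```

With this many random frequencies the empirical average closely matches the exact Gaussian kernel value.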
The vector \Psi(\mathbf{x}) = [\Psi_{\boldsymbol{\omega}_1}(\mathbf{x}), \Psi_{\boldsymbol{\omega}_2}(\mathbf{x}), \ldots, \Psi_{\boldsymbol{\omega}_D}(\mathbf{x})]^\top is thus a D-dimensional RFF map of the input vector \mathbf{x}, and it satisfies the approximation in (8). To approximate the Gaussian kernel of width \sigma_k, we therefore draw \boldsymbol{\omega}_i, i = 1, \ldots, D, from the normal distribution \mathcal{N}(\mathbf{0}, \mathbf{I}_{LM}/\sigma_k^2). Since the RFF space is finite-dimensional, it now becomes possible to work directly with the filter weights \boldsymbol{\Omega} in this space. Therefore, the kernel expansion (3) in this case yields

\mathbf{y}_{n+1} = \boldsymbol{\Omega}_n^\top \Psi(\mathbf{x}_{n+1}).

Notice that the update equation in the RFF space becomes exactly that of the classical multi-output LMS algorithm,

\boldsymbol{\Omega}_{n+1} = \boldsymbol{\Omega}_n + \mu\, \Psi(\mathbf{x}_{n+1})\, \mathbf{e}_{n+1}^\top,

where \mathbf{e}_{n+1} = \mathbf{d}_{n+1} - \mathbf{y}_{n+1} is the multivariate error vector and \mu is the step size. The MO-RFF-KLMS algorithm is summarized in Algorithm 2.
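In the RFF space the filter is just a linear multi-output LMS on the features. The sketch below uses the real cosine variant of the RFF map, z(x) = sqrt(2/D) cos(Wx + b), rather than the complex exponentials above; the toy two-output target and all hyperparameter values are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, D, P, mu, sigma_k = 2, 200, 2, 0.5, 1.0

# Real cosine RFF map: w_k ~ N(0, I/sigma_k^2), b_k ~ U[0, 2*pi).
W = rng.normal(scale=1.0 / sigma_k, size=(D, dim))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
psi = lambda x: np.sqrt(2.0 / D) * np.cos(W @ x + b)

Omega = np.zeros((D, P))             # filter weights in the RFF space
errs = []
for _ in range(2000):
    x = rng.normal(size=dim)
    d = np.array([np.tanh(x[0]), x[0] * x[1]])   # toy 2-output target
    z = psi(x)
    y = Omega.T @ z                  # prediction: kernel expansion in RFF space
    e = d - y                        # multivariate error
    Omega += mu * np.outer(z, e)     # classical multi-output LMS update
    errs.append(float(np.mean(e ** 2)))
```

Since the weights live in a fixed D-dimensional space, the per-sample cost is O(DP) regardless of how many samples have been processed.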

SIMULATION RESULTS
In this section, we provide two simulation examples to assess the performance of MO-QKLMS and MO-RFF-KLMS in comparison to their single-output versions in stationary and non-stationary environments. The first set of experiments is conducted using a synthetic dataset and the second one using real data. The optimal kernel width σ_k for both datasets is estimated from the first 200 samples using the parameter estimation script included in the KAFBOX toolbox [22]. The curves for the synthetic dataset are averaged over 500 independent Monte Carlo runs and those for the real dataset over 50 runs. The mean squared error (MSE) shown in the figures is averaged over the P output time series. All experiments are conducted in MATLAB on a laptop with an Intel(R) Core(TM) i7-7500U 2.70 GHz CPU and 8 GB of RAM.

Nonlinear MIMO System Identification
In the first experiment, we consider a 3×3 nonlinear MIMO system. First, 3 white Gaussian time series of 4000 samples each are generated as input signals. Second, these signals are filtered through a MIMO linear system that introduces intra- and inter-correlation among the time series. To achieve this, the MIMO filter is H = C ⊗ h, where h = [1, 0.6] determines the temporal (intra-) correlation and the covariance matrix C, whose elements are c_ij = 1 if i = j and c_ij = ρ otherwise, determines the spatial (inter-) correlation. The time-embedding is L = 2. Finally, we apply a memoryless quadratic nonlinearity with fixed coefficient γ = 0.3 and add white Gaussian noise with variance σ² = 0.01 that models the measurement noise.
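Our reading of this data-generation pipeline can be sketched as follows (note that the exact placement of the quadratic term, z + γz², and the per-sample stacking order are assumptions on our part):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L, rho, gamma = 4000, 3, 2, 0.3, 0.3

h = np.array([1.0, 0.6])                          # temporal (intra-) correlation
C = np.full((M, M), rho)
np.fill_diagonal(C, 1.0)                          # spatial (inter-) correlation
H = np.kron(C, h)                                 # MIMO filter, shape (M, M*L)

U = rng.normal(size=(N, M))                       # M white Gaussian input series
Dout = np.zeros((N, M))
for n in range(L - 1, N):
    # Stack the L most recent samples of every input series.
    u = np.concatenate([U[n - L + 1:n + 1, m][::-1] for m in range(M)])
    z = H @ u                                     # correlated linear mixture
    Dout[n] = z + gamma * z ** 2                  # memoryless quadratic nonlinearity
Dout += np.sqrt(0.01) * rng.normal(size=(N, M))   # measurement noise, variance 0.01
```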
The values of the parameters used in the different adaptive filters under comparison are listed in Table 1, where μ is the step size, m is the final dictionary size, u is the coherence threshold and D is the RFF dimension. In order to make a fair comparison between the different KAF algorithms, the parameter u of SO-QKLMS and MO-QKLMS is chosen so that m = 1000.

Table 1. Algorithms' parameters used in the nonlinear MIMO system identification example.

Figure 1 shows the convergence curves obtained for the nonlinear MIMO system with spatial correlation coefficient ρ = 0.3. As expected, MO-QKLMS and MO-RFF-KLMS provide a much lower steady-state MSE than their single-output counterparts. Moreover, the two proposed algorithms outperform a multi-output version of the standard KLMS without sparsification (denoted MO-KLMS) and its corresponding single-output version. Further, in this example MO-QKLMS is slightly better than MO-RFF-KLMS. In this case, the optimal value chosen via cross-validation for the kernel width is σ_k = 2.23. Figure 2 shows the convergence curves obtained when the spatial correlation coefficient is increased to ρ = 0.6. For this system, the optimal value chosen via cross-validation for the kernel width is σ_k = 1.67. Again, MO-QKLMS and MO-RFF-KLMS show a better steady-state MSE than their single-output counterparts and the MO-KLMS algorithm.
It should also be pointed out that the gap between the multi-output algorithms and their single-output versions widens as the spatial correlation coefficient ρ is increased from 0.3 to 0.6. This result supports the intuitive idea that multi-output algorithms outperform their single-output versions because they take advantage of the spatial correlation among the time series.

Lorenz Chaotic Time-Series Prediction
In the second experiment, we consider the Lorenz chaotic system determined by the differential equations [23]

\dot{x} = \sigma (y - x), \quad \dot{y} = x(R - z) - y, \quad \dot{z} = xy - bz,

with σ = 10, R = 28 and b = 8/3. A prediction horizon of h = 1 and an embedding of L = 3 are used. The values of the parameters used in the different algorithms are listed in Table 2. In this case, the optimal value chosen for the kernel width is σ_k = 1.94. Table 3 shows a comparison between the two proposed algorithms in terms of training time, computational complexity (measured in FLOPS) and storage requirements for the Lorenz chaotic time-series dataset. It can be observed that MO-RFF-KLMS outperforms MO-QKLMS in terms of training time and computational complexity, making MO-RFF-KLMS our preferred option for real-time applications or systems with limited computing capabilities.
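For reference, the Lorenz series can be generated by integrating the system above, here with a simple forward-Euler step (the step size and initial condition are our own illustrative choices):

```python
import numpy as np

def lorenz_series(N, dt=0.01, sigma=10.0, R=28.0, b=8.0 / 3.0):
    """Generate N samples of the Lorenz system with a forward-Euler step."""
    s = np.array([1.0, 1.0, 1.0])
    out = np.empty((N, 3))
    for n in range(N):
        x, y, z = s
        s = s + dt * np.array([sigma * (y - x),   # dx/dt
                               x * (R - z) - y,   # dy/dt
                               x * y - b * z])    # dz/dt
        out[n] = s
    return out

series = lorenz_series(5000)
# With embedding L = 3 and horizon h = 1, each regressor stacks
# [s_n, s_{n-1}, s_{n-2}] and the prediction target is s_{n+1}.
```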

Table 3. Training time and complexity for the Lorenz time-series example.

CONCLUSIONS
Two multi-output KLMS-like algorithms are proposed in this paper: MO-QKLMS, which uses the coherence sparsification criterion to limit the growth of the dictionary, and MO-RFF-KLMS, which uses random Fourier features to limit the complexity of the algorithm. Both algorithms outperform their single-output counterparts in terms of steady-state MSE and convergence speed. The two proposed algorithms achieve similar steady-state MSE, but MO-RFF-KLMS proves to be the better choice in terms of computational complexity and training time. Future work can harness the proposed algorithms to solve multitask learning problems over an underlying graph, as in [24,25].