Generalized Convolution Spectral Mixture for Multitask Gaussian Processes

Multitask Gaussian processes (MTGPs) are a powerful approach for modeling dependencies between multiple related tasks or functions for joint regression. Current kernels for MTGPs cannot fully model nonlinear task correlations and other types of dependencies. In this article, we address this limitation. We focus on spectral mixture (SM) kernels and propose an enhancement of this type of kernel, called the multitask generalized convolution SM (MT-GCSM) kernel. The MT-GCSM kernel can model nonlinear task correlations and dependence between components, including time- and phase-delay dependence. Each task in MT-GCSM has its own GCSM kernel with its own number of convolution structures, and dependencies between all components from different tasks are considered. Another constraint of current kernels for MTGPs is that components from different tasks are aligned. Here, we lift this constraint by using inner and outer full cross convolution between a base component and the reversed complex conjugate of another base component. Extensive experiments on two synthetic and three real-life data sets illustrate the difference between MT-GCSM and previous SM kernels as well as the practical effectiveness of MT-GCSM.

I. INTRODUCTION
Gaussian processes (GPs) are flexible Bayesian nonparametric models that are robust against overfitting in the presence of little evidence [1], [2]. A GP can model a large class of phenomena through the choice of its kernel, which characterizes one's assumptions about the autocovariance of the unknown function. The choice of the kernel is a core step when designing a GP since the posterior distribution can vary significantly for different kernels. As a consequence, various kernels, such as squared exponential (SE), periodic, and Matérn, as well as methods for kernel design, have been proposed (see [2]). The extension of GPs to multiple sources of data is known as multitask GPs (MTGPs). MTGPs model temporal or spatial relationships among infinitely many random variables, as scalar (that is, single-source) GPs do, but they also account for statistical dependencies across different sources of data (or tasks) [3]-[8]. How to choose an appropriate kernel that jointly models the cross covariance between tasks and the autocovariance within each task is the core problem of MTGP design [4], [9]-[13]. Early approaches to this problem, such as the linear model of coregionalization (LMC) framework [3], [7], [14], consider only linear combinations of independent single-source GPs.
More expressive approaches, such as the multikernel method [11] and the convolved latent function framework [15]-[18], use convolution to construct cross covariance functions and assume that each task has its own kernel. The use of spectral mixture (SM) kernels has further boosted the development of MTGP methods. First, the SM-LMC kernel was proposed [19], [20], which uses independent SM components and global linear weights for all tasks; then, the cross-SM (CSM) kernel [21], a more flexible kernel, considers power and phase correlation between tasks. However, CSM cannot capture complicated cross correlations because it only considers phase dependencies. The multioutput SM (MOSM) kernel [22], [23] addressed this limitation, but MOSM itself has three limitations: 1) all tasks have the same number of components; 2) components in different tasks must be aligned; and 3) dependencies between components within each task are ignored [24], [25]. In an MTGP context, SM-LMC and CSM involve a linear combination of multiple independent components, where each component has its weight encoded in an LMC matrix that quantifies the amplitude of the tasks' correlation. Both SM-LMC and CSM also have an interpretation as a sum of separable kernels [21], [26]. The GCSM with coupling coregionalizations (GCSM-CC) kernel [27] extends previous approaches by modeling nonlinear correlations between tasks and dependencies between components. In GCSM-CC, coupling coregionalization (CC) is used to learn task-level correlations, addressing the third limitation of MOSM.
In this article, we overcome limitations of current MTGP methods, such as statistical independence of components within and between tasks and alignment of components between tasks, and propose the multitask generalized convolution SM (MT-GCSM) kernel. MT-GCSM accounts for two types of dependencies, namely, at the task level and at the component level, by using the GCSM [28]. This approach overcomes the restriction that all tasks should have the same number of components since task-level correlations and component-level dependencies are fully convolved with time and phase delays. In MT-GCSM, each task has its own GCSM kernel characterized by the task structure. Moreover, without increasing the hyperparameter space, MT-GCSM can involve a high number of convolution structures. When only a single task is available, MT-GCSM reduces to the GCSM kernel, which has been shown to be a generalization of the standard SM kernel.
We can summarize the questions addressed here as follows: 1) how to model component-level dependence within a task? 2) how to build generalized cross components between tasks for modeling task-level correlation? 3) how to incorporate inner component-level dependence and cross-task-level correlation simultaneously? and 4) what is the relationship between MT-GCSM and other (SM-based) kernels?
There are two main novelties of the proposed approach: 1) capturing the statistical dependence between components within and between tasks and 2) allowing tasks to have a different number of unaligned components. By considering dependencies between components within and between tasks, improved performance can be achieved without increasing the number of hyperparameters.
The remainder of this article is organized as follows. Background on GPs and SM kernels is given in Section II. In Section III, the underlying motivation to model statistical dependencies between kernel components within and between tasks in the proposed kernel is provided by using the framework by [29]. The proposed MT-GCSM kernel is described in Section IV. Section V compares MT-GCSM with SM-based kernels for MTGP in terms of hyperparameter space and degrees of freedom. Section VI analyzes the beneficial aspects of MT-GCSM. Section VII describes the results of experiments on synthetic and real-world data sets. A summary of this article's contributions, concluding remarks, and future work are given in Section VIII.

II. BACKGROUND
We start with some background information on GPs, MTGPs, and SM kernels. We follow the standard notation by [2].

A. Gaussian Processes
GPs are distributions over functions f(x) such that any finite set of function values has a joint Gaussian distribution. A GP is specified by its mean function $m(x) = \mathbb{E}[f(x)]$ and its covariance function $k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]$:

$f(x) \sim \mathcal{GP}(m(x), k(x, x')). \quad (1)$

Without loss of generality, we assume the mean of a GP to be zero. The covariance function is applied to construct a positive definite covariance matrix on the set X of n training points, denoted by K = K(X, X). By placing a GP prior over functions through the choice of a kernel and parameter initialization, from the training data X, we can predict the unknown function value ỹ* and its variance V[ỹ*] (that is, its uncertainty) for a test point x* using the following key predictive equations for GP regression [2]:

$\tilde{y}_* = \mathbf{k}_*^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y}, \quad (2)$
$\mathbb{V}[\tilde{y}_*] = k(x_*, x_*) - \mathbf{k}_*^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{k}_*, \quad (3)$

where σ_n² is the noise level, k* is the vector of covariances between x* and X, and y are the observed values corresponding to X. Typically, GPs contain free parameters Θ, called hyperparameters, which can be optimized by minimizing the negative log marginal likelihood (NLML)

$\mathrm{NLML} = -\log p(\mathbf{y} \mid X, \Theta) = \tfrac{1}{2} \mathbf{y}^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y} + \tfrac{1}{2} \log |K + \sigma_n^2 I| + \tfrac{n}{2} \log 2\pi. \quad (4)$

This formulation follows directly from the fact that y ∼ N(0, K + σ_n² I). In an MTGP, we have multiple sources of data from related tasks. The MTGP covariance function k_MTGP models dependencies between pairs of points from two tasks; its construction is the core design problem.
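To make the predictive equations (2)-(3) and the NLML (4) concrete, here is a minimal numerical sketch; it assumes an SE kernel as the covariance function, and the helper names (`se_kernel`, `gp_predict`) are illustrative, not from the paper:

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel used as a stand-in covariance function.
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, X_star, noise=0.1):
    # Key predictive equations for GP regression:
    #   mean = k_*^T (K + sigma_n^2 I)^{-1} y
    #   var  = k(x_*, x_*) - k_*^T (K + sigma_n^2 I)^{-1} k_*
    K = se_kernel(X, X) + noise ** 2 * np.eye(len(X))
    K_star = se_kernel(X, X_star)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    var = se_kernel(X_star, X_star).diagonal() - np.sum(v ** 2, axis=0)
    return mean, var

def negative_log_marginal_likelihood(X, y, noise=0.1):
    # NLML = 0.5 y^T (K + s^2 I)^{-1} y + 0.5 log|K + s^2 I| + (n/2) log(2 pi)
    K = se_kernel(X, X) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha + np.sum(np.log(np.diag(L)))
            + 0.5 * len(X) * np.log(2 * np.pi))
```

The Cholesky factorization is the standard way to evaluate both the predictive equations and the NLML stably; the log determinant is obtained from the diagonal of the factor.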

B. Spectral Mixture Kernels
Choosing an appropriate kernel function and its initial hyperparameters based on prior knowledge of the data is the core step in designing a GP. Various kernel functions have been proposed [2], such as SE, periodic (PER), and general Matérn (MA). Recently, new covariance kernels, called SM kernels, were proposed in [19] and [30]. An SM kernel, here denoted by k_SM, is derived by modeling the spectral density (the Fourier transform of a kernel) with a Gaussian mixture. A desirable property of SM kernels is that they can be used to reconstruct other popular standard covariance kernels. Indeed, for stationary kernels, which are functions of τ = x − x', that is, are invariant to translation of the inputs, the seminal Bochner's theorem [31] can be applied: the properties of a stationary kernel entirely depend on its spectral density. With enough components, k_SM can approximate any stationary kernel [30]. An SM kernel can be written as

$k_{\mathrm{SM}}(\tau) = \sum_{i=1}^{Q} k_{\mathrm{SM}_i}(\tau) = \sum_{i=1}^{Q} w_i \prod_{p=1}^{P} \exp\!\left(-2\pi^2 (\tau^{(p)})^2 (\sigma_i^2)^{(p)}\right) \cos\!\left(2\pi \tau^{(p)} \mu_i^{(p)}\right),$

where Q is the number of components, k_SM_i is the i-th component, P is the dimension of the input, and $w_i$, $\mu_i = [\mu_i^{(1)}, \ldots, \mu_i^{(P)}]$, and $\Sigma_i = \mathrm{diag}([(\sigma_i^2)^{(1)}, \ldots, (\sigma_i^2)^{(P)}])$ are the weight, mean, and variance of the i-th component in the frequency domain, respectively. The variance $(\sigma_i^2)^{(p)}$ can be thought of as an inverse length scale, $\mu_i^{(p)}$ as a frequency, and $w_i$ as a contribution. Bochner's theorem [31], [32] indicates how to construct a valid kernel from the frequency domain. We will use $\hat{k}(s)$ to denote the spectral density of a covariance function k(τ) in the frequency domain. For the SM kernel [19], the spectral density $\hat{k}_{\mathrm{SM}_i}(s)$ is a symmetrized scale-location Gaussian in the frequency domain, and applying the inverse Fourier transform yields the component in the time domain,

$k_{\mathrm{SM}_i}(\tau) = \mathcal{F}^{-1}_{s \to \tau}\big[\hat{k}_{\mathrm{SM}_i}(s)\big](\tau),$

where $\mathcal{F}^{-1}_{s \to \tau}$ denotes the inverse Fourier transform.
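The 1-D (P = 1) form of the SM kernel above can be sketched directly; `sm_kernel` is an illustrative helper name, and the expression follows the exponential-times-cosine form of the equation:

```python
import numpy as np

def sm_kernel(tau, weights, means, variances):
    # 1-D spectral mixture kernel:
    #   k_SM(tau) = sum_i w_i * exp(-2 pi^2 tau^2 sigma_i^2) * cos(2 pi tau mu_i)
    # Each component is a Gaussian (inverse length scale sigma_i^2) times a
    # cosine at frequency mu_i, weighted by w_i.
    tau = np.asarray(tau, dtype=float)
    k = np.zeros_like(tau)
    for w, mu, var in zip(weights, means, variances):
        k += w * np.exp(-2.0 * np.pi ** 2 * tau ** 2 * var) \
               * np.cos(2.0 * np.pi * tau * mu)
    return k
```

Note that k_SM(0) equals the sum of the weights, and every component's contribution is bounded by its weight, so |k_SM(τ)| ≤ k_SM(0) for all τ.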

C. Generalized Convolution Spectral Mixture Kernel
The GCSM kernel extends the SM kernel with cross-component terms [25], [28]. The construction of this kernel relies on the fact that any stationary covariance function can be represented in convolution form on $\mathbb{R}^P$ [33]-[35],

$k(x, x') = k(\tau) = \int_{\mathbb{R}^P} g(u)\, g(\tau + u)\, du,$

which results in $\hat{k}(s) = (\hat{g}(s))^2$ for the kernel in the frequency domain. On the other hand, based on the Fourier-transform correspondence of time and phase delays between the time and frequency domains, the weighted GCSM spectral density $\hat{k}_{\mathrm{GCSM}_i}(s)$ extends the Gaussian spectral density of the weighted SM component ($w_i \hat{k}_{\mathrm{SM}_i}(s)$) to include a time delay $\theta_i \in \mathbb{R}^P$ and a phase delay $\varphi_i \in \mathbb{R}^P$ simultaneously:

$\hat{k}_{\mathrm{GCSM}_i}(s) = w_i\, \hat{k}_{\mathrm{SM}_i}(s)\, e^{-j\left(2\pi\, \theta_i^{\top} s + \varphi_i\right)},$

where j denotes the imaginary unit. For $\hat{k}_{\mathrm{GCSM}_i}(s)$, we have $\hat{g}_{\mathrm{GCSM}_i}(s) = (\hat{k}_{\mathrm{GCSM}_i}(s))^{1/2}$, which can be seen as the basis function of $\hat{k}_{\mathrm{GCSM}_i}(s)$. The cross correlation between $g_{\mathrm{GCSM}_i}(\tau)$ and $g_{\mathrm{GCSM}_j}(\tau)$ is equivalent to the cross convolution of $g_{\mathrm{GCSM}_i}(\tau)$ and $\overline{g_{\mathrm{GCSM}_j}(-\tau)}$ (the overline denotes the complex conjugate operator), which we call the cross GCSM component $k^{i \times j}_{\mathrm{GCSM}}(\tau)$. By the symmetry property and the Fourier transform of the convolution,

$\hat{k}^{i \times j}_{\mathrm{GCSM}}(s) = \hat{g}_{\mathrm{GCSM}_i}(s)\, \overline{\hat{g}_{\mathrm{GCSM}_j}(s)},$

whose cross weight, cross mean, and cross variance are determined by the parameters of components i and j. Using GCSM, we can model the time- and phase-delay dependence between components within a single task.
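As an illustration of how cross-component parameters arise from multiplying the square roots of two Gaussian spectral densities, here is a hedged sketch: the closed forms below follow from the Gaussian product rule and may differ in normalization from the paper's exact parameterization, and `cross_params` is an illustrative name:

```python
import numpy as np

def cross_params(w_i, mu_i, var_i, w_j, mu_j, var_j):
    # Multiplying sqrt(N(s; mu_i, var_i)) by sqrt(N(s; mu_j, var_j)) gives a
    # scaled Gaussian; completing the square yields the cross parameters.
    var_ij = 2.0 * var_i * var_j / (var_i + var_j)            # cross variance
    mu_ij = (var_j * mu_i + var_i * mu_j) / (var_i + var_j)   # cross mean
    # Cross amplitude: decays with the distance between the two means,
    # so only components with overlapping spectral mass interact strongly.
    a_ij = np.exp(-0.25 * (mu_i - mu_j) ** 2 / (var_i + var_j))
    w_ij = np.sqrt(w_i * w_j) * a_ij                          # cross weight
    return w_ij, mu_ij, var_ij
```

When i = j, the cross parameters reduce to the component's own parameters, which matches the intuition that a component fully "overlaps" with itself.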
Before extending the GCSM kernel for multitask problems, in Section III, the underlying motivation to model statistical dependencies between kernel components within and between tasks in our new kernel is provided by using the framework by [29].

III. STATISTICAL DEPENDENCE BETWEEN COMPONENTS IN MTGPs
Current SM kernels for MTGPs, such as SM-LMC, CSM, and MOSM, are additive and consist of a sum of independent components. In this case, any function f drawn from such an additive GP is the sum of component functions f_i, each drawn a priori from a GP with covariance K_i. With a slight abuse of notation, we denote by f_i also the function values at the training locations X, and by f*_i both the function and the function values at some set of query locations X*.
From the additivity of the SM-LMC, CSM, and MOSM kernels, it follows that the f_i's are a priori independent. Then, by using the formula for Gaussian conditionals, we can give the conditional distribution of a GP-distributed component given the sum of two components:

$f_i \mid f_{i+j} \sim \mathcal{N}\!\left(K_i (K_{i+j})^{-1} f_{i+j},\; K_i - K_i (K_{i+j})^{-1} K_i\right),$

where $f_{i+j} = f_i + f_j$ and $K_{i+j} = K_i + K_j$. The reader is referred to [29, Sec. 2.4.5] for the derivation of these results. In our MTGP setting, $K_i = B_i \otimes k_i$ for SM-LMC and CSM, and $K_i = k^{m,m'}_{\mathrm{MOSM},i}$ for MOSM. The Gaussian conditionals express the model's posterior uncertainty about the different components of the signal, integrating over the possible configurations of the other components. We can also compute the posterior covariance between any two component functions, conditioned on their sum [25], [29]:

$\mathrm{Cov}\!\left[f_i, f_j \mid f_{i+j}\right] = -K_i (K_{i+j})^{-1} K_j.$

For the SM-LMC, CSM, and MOSM kernels for MTGPs, this posterior covariance is in general nonzero: it indicates the possible presence of statistical dependencies between components and provides a principled justification for modeling such dependencies within a kernel.
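The Gaussian conditionals above can be checked numerically. The sketch below assumes a small observation noise for numerical stability, and the helper names (`component_posterior`, `cross_cov`) are illustrative:

```python
import numpy as np

def component_posterior(K_list, y, noise=1e-4):
    # Posterior over additive GP components f_i conditioned on their sum y.
    # Prior: f_i ~ N(0, K_i), a priori independent; y = sum_i f_i (+ noise).
    K_sum = sum(K_list) + noise * np.eye(len(y))
    K_inv_y = np.linalg.solve(K_sum, y)
    # Posterior mean of each component: K_i K_sum^{-1} y.
    means = [K_i @ K_inv_y for K_i in K_list]

    def cross_cov(i, j):
        # Posterior (cross-)covariance conditioned on the sum:
        #   Cov[f_i, f_j | y] = delta_ij K_i - K_i K_sum^{-1} K_j,
        # which is generally nonzero for i != j although the priors are
        # independent -- the motivation for modeling the dependence.
        prior = K_list[i] if i == j else np.zeros_like(K_list[i])
        return prior - K_list[i] @ np.linalg.solve(K_sum, K_list[j])

    return means, cross_cov
```

The component means sum back (up to the noise term) to the observed signal, while the off-diagonal posterior covariance is negative-definite-leaning: raising one component's explanation of the data lowers the other's.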

IV. MULTITASK GENERALIZED CONVOLUTION SM KERNELS
In order to capture statistical dependencies between components in a kernel for MTGPs, we now extend the GCSM kernel to multitask problems. In a multitask context, a stationary kernel can depend on each pair of tasks: in the stationary setting, the covariance between a point $x^{(m)}$ from task m and a point $x^{(m')}$ from task m' can be written as a function of $\tau = x^{(m)} - x^{(m')}$. While in a single-task GP, time- and phase-delay dependencies exist within the task, in MTGPs, such dependencies also exist between components from different tasks. The cross component $k^{i \times j}_{\mathrm{MT\text{-}GCSM}}(\tau)$ in multitask GCSM is based on the cross convolution between $g^{(m)}_{\mathrm{GCSM}_i}(\tau)$ in task m and the reversed complex conjugate $\overline{g^{(m')}_{\mathrm{GCSM}_j}(-\tau)}$ in task m', which, in the frequency domain, can be written as

$\hat{k}^{i \times j}_{\mathrm{MT\text{-}GCSM}}(s) = \hat{g}^{(m)}_{\mathrm{GCSM}_i}(s)\, \overline{\hat{g}^{(m')}_{\mathrm{GCSM}_j}(s)}. \quad (18)$

In our kernel for MTGPs, we aim to capture statistical dependencies between components within a task as well as between tasks (see Fig. 1). The amount of dependence between two components is determined by the overlap of their spectral densities in the frequency domain (see Fig. 2). Therefore, by considering the symmetry of the spectral density [2], we can construct a kernel between task m and task m' as the sum of the cross GCSM components between m and m':

$k^{m \times m'}_{\mathrm{MT\text{-}GCSM}}(\tau) = \sum_{i=1}^{Q^{(m)}} \sum_{j=1}^{Q^{(m')}} k^{i \times j}_{\mathrm{GCSM}}(\tau), \quad (19)$

where $Q^{(m)}$ and $Q^{(m')}$ are, respectively, the numbers of components in tasks m and m', and $i \in \{1, \ldots, Q^{(m)}\}$ and $j \in \{1, \ldots, Q^{(m')}\}$ are, respectively, the indices of the components in tasks m and m'. Note that $Q^{(m)}$ and $Q^{(m')}$ may be different.
In its most general form, where the parameterization of each task is independent, this construction yields the multitask GCSM kernel, whose block for the pair of tasks m and m' is given by (19); the cross weight for a pair m, m' of tasks and components i and j is denoted $w^{m,m'}_{i,j}$. In the proposed kernel, time- and phase-delay dependencies are considered not only inside each task but also across tasks. The convolution relations are shown in Fig. 1. The single-task GCSM can be seen as an inner full cross-convolution SM within the multitask GCSM kernel. Since the GCSM kernel of a task has its own number of components, and different tasks may have a different number of components depending on the task's complexity, one can gain insight into a task's complexity by analyzing its spectral density in the frequency domain. In this way, we increase flexibility over existing SM kernels for MTGPs, notably SM-LMC, MOSM, and CSM, which assume that all tasks have the same number of components and that components between tasks are aligned.
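A sketch of assembling one task-pair block of such a kernel, with the two tasks allowed a different number of (unaligned) components; delays are omitted for brevity, `cross_sm_term` and `mt_gcsm_block` are illustrative names, and the cross parameters follow the simplified Gaussian-product form rather than the paper's full MT-GCSM parameterization:

```python
import numpy as np

def cross_sm_term(tau, wi, mui, vari, wj, muj, varj):
    # Time-domain cross component from the product of the square roots of
    # two Gaussian spectral densities (time/phase delays omitted).
    var_ij = 2.0 * vari * varj / (vari + varj)
    mu_ij = (varj * mui + vari * muj) / (vari + varj)
    a_ij = np.exp(-0.25 * (mui - muj) ** 2 / (vari + varj))
    w_ij = np.sqrt(wi * wj) * a_ij
    return w_ij * np.exp(-2.0 * np.pi ** 2 * tau ** 2 * var_ij) \
                * np.cos(2.0 * np.pi * tau * mu_ij)

def mt_gcsm_block(tau, params_m, params_mp):
    # Covariance block between task m and task m': the sum over all
    # Q^(m) x Q^(m') cross components, as in (19). params_* is a list of
    # (weight, mean, variance) triples; the lists may differ in length.
    tau = np.asarray(tau, dtype=float)
    k = np.zeros_like(tau)
    for (wi, mui, vari) in params_m:
        for (wj, muj, varj) in params_mp:
            k += cross_sm_term(tau, wi, mui, vari, wj, muj, varj)
    return k
```

Because every cross term is symmetric in its two components, swapping the roles of the tasks leaves the block unchanged, consistent with the symmetry of a valid cross covariance.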
The proposed MT-GCSM kernel is illustrated in Fig. 1, where each connection represents a convolution structure. When task m is equal to task m', there is no cross-component-level convolution between tasks, and MT-GCSM reduces to GCSM. Furthermore, without considering component dependencies, GCSM reduces to the ordinary SM kernel.
MT-GCSM kernels for MTGP with an arbitrary number of tasks fulfill the positive semidefinite condition (proof is given in the Appendix). The proof of the positive semidefinite condition for MT-GCSM differs from that for GCSM because of the presence of outer full cross-convolution terms between tasks in MT-GCSM.

V. RELATION TO OTHER KERNELS
Here, we address the last question mentioned in Section I and investigate the relation between MT-GCSM and other SM-based kernels. Significant progress in the theory and applications of MTGPs has been achieved in previous works, e.g., [3], [7], [11], [15], [20]-[22], [36]. Since the introduction of SM kernels [30], [37], MTGPs with SM kernels [30], [37]-[42] have shown strong learning ability and interpretability. Here, we focus on MTGP methods based on such kernels [20]-[22]. The first MTGP using an SM kernel is based on the LMC framework [20], used to construct a Gaussian process regression network (GPRN):

$K_{\mathrm{SM\text{-}LMC}}(\tau) = \sum_{i=1}^{Q} B_i\, k_{\mathrm{SM}_i}(\tau).$

The $B_i$ term in K_SM-LMC encodes cross weights that represent task correlations and involves a linear combination of components. The CSM kernel [21] improved the expressiveness of SM-LMC: it contains a cross phase spectrum and is defined within the LMC framework as

$K_{\mathrm{CSM}}(\tau) = \sum_{i=1}^{Q} B_i\, \mathrm{Re}\!\left(k_{\mathrm{SG}_i}(\tau; \theta_i)\right),$

where $k_{\mathrm{SG}_i}(\tau; \theta_i)$ is the i-th phase-parameterized, complex-valued spectral Gaussian function, and the operator Re returns the real part of its argument. The kernels $k_{\mathrm{SG}_i}$ used in CSM are, however, only phase dependent. The MOSM kernel [22] provides a principled framework to construct multivariate covariance functions with a better interpretation of the cross relationships between tasks; it directly parameterizes a complex-valued cross-spectral density for each pair of tasks, with per-pair amplitude, mean, variance, and delay parameters. A more recent SM-based kernel for MTGP employs GCSM-CC [27]. The GCSM-CC kernel is

$K_{\mathrm{GCSM\text{-}CC}}(\tau) = \sum_{i=1}^{Q} \sum_{j=1}^{Q} C_i C_j\, k^{i \times j}_{\mathrm{GCSM}}(\tau),$

where the $C_i C_j$ term is the CC term and $k^{i \times j}_{\mathrm{GCSM}}$ is the single-task GCSM kernel. Although GCSM-CC considers both task-level correlations and component-level dependencies, using CC has the limitation that all tasks share the same kernel. This is a common drawback of MTGPs involving coregionalization [3]. There are also differences between the LMC framework and the multikernel framework for multiple tasks, which make it difficult to find a unifying view incorporating both. Table I summarizes the characteristics of these kernels in terms of parameters and degrees of freedom.
In Table I, θ_f and θ are the scale and length-scale parameters of the SE and Matérn kernels, respectively, and we use q instead of i to indicate the index of the components.

VI. INTERPRETATION OF THE DEPENDENCIES MODELED IN MT-GCSM
In this section, we illustrate the dependencies modeled in MT-GCSM. Fig. 2 shows the four outer cross-convolution structures of two tasks in the time domain and their corresponding spectral densities in the frequency domain. Here, each task has Q = 2 base components. According to (19), there are in total eight convolution structures (four inner cross-convolution structures plus four outer cross-convolution structures). We only show the four outer cross-convolution structures [see (19)] because the four inner convolution structures are given by single-task GCSM [see (10)] [28]. The first row in Fig. 2 shows the four cross-convolution structures (solid lines with different colors) in MT-GCSM for different time- and phase-delay settings. The second row in Fig. 2 shows the corresponding cross-spectral densities in the frequency domain. From (19) and from the definitions of cross weight, cross amplitude, cross mean, and cross covariance, we can observe that closer frequencies ($\mu^{(m)}_i$, $\mu^{(m')}_j$) and weights ($w^{(m)}_i$, $w^{(m')}_j$) between tasks m and m' in MT-GCSM induce stronger cross dependencies.

VII. EXPERIMENTS
We compare MT-GCSM with published MTGP methods, namely, SM-LMC, CSM, MOSM, and GCSM-CC. First, we consider a mixed signal sampled from a Gaussian distribution specified by GP(0, K_SM(Q = 3)), together with its integral, its derivative, and its time-delayed version, learned simultaneously. Next, we consider prediction tasks on a real-world problem with two sensor-array data sets: humidity monitoring related to climate change and nitrogen dioxide (NO2) concentration related to air pollution. For the synthetic and real-world data sets, we follow the experimental setting used in [22] and [15], add a task with the integral signal, and increase the difficulty of learning so as to cover interpolation, forward extrapolation, backward extrapolation, and middle extrapolation simultaneously. We implemented our models in TensorFlow [43] and GPflow [44] to improve scalability and to facilitate gradient computation. The open-source code will be made available on request. In all experiments, we use the mean absolute error, $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \tilde{y}_i|$, as the performance metric, where ỹ_i is the predicted value.
In general, SM-based kernels are sensitive to the initialization of their hyperparameters, which can lead to a locally optimal solution of the NLML [see (4)]. The MT-GCSM kernel has the same initialization problem. In MT-GCSM, the degrees of freedom and the parameter space are M times as large as when using GCSM with a single task. Hyperparameter initialization has a direct impact on the ability to discover and extrapolate patterns, especially in the presence of complex multiple tasks. Therefore, we apply an initialization strategy that uses the empirical spectral densities, which has been shown to be effective in other contexts [30]. The empirical spectral density is, however, often noisy, so its direct use is not possible. Past research suggests that sharp peaks of the empirical spectral density lie near the true frequencies [30], [45]. We make use of this observation and fit a Bayesian Gaussian mixture model, $p(\mu \mid s) = \sum_{i=1}^{Q} \tilde{w}_i \mathcal{N}(\tilde{\mu}_i, \tilde{\Sigma}_i)$, to the empirical spectral density in order to obtain the Q cluster centers of the Gaussian spectral density. We use the expectation-maximization clustering algorithm (see [46]) to estimate the parameters $w_i$, $\mu_i$, and $\Sigma_i$. The resulting estimates $\tilde{w}_i$, $\tilde{\mu}_i$, and $\tilde{\Sigma}_i$ are used as initial values in each task. The time- and phase-delay parameters {θ_i, φ_i} are randomly initialized for each task in MT-GCSM. We use this initialization procedure in all our experiments on artificial and real-world data, with all considered kernels: SM-LMC, CSM, MOSM, and GCSM-CC. The coregionalization terms in GCSM-CC are randomly initialized. We run each experiment ten times with multiple hyperparameter initializations and report the average MAE and standard deviation.

A. Learning Signal, Integral, Derivative, and Time-Delayed Signal Simultaneously
We design an artificial experiment to validate the interpolation, extrapolation, signal recovery, and block-missing filling ability of MT-GCSM and to compare its performance with that of other MTGP methods. We generate four nonlinearly correlated tasks: a mixed signal, its integral, its derivative, and its time-delayed version. Specifically, we sample a Gaussian signal f(x) ∼ GP(0, K_SM(Q = 3)) of length 300 on the interval [−10, 10], numerically compute its first integral and derivative, and delay the signal as f(x + t) with t = 2.
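A sketch of this task-generation procedure; as a self-contained stand-in for a draw from GP(0, K_SM(Q = 3)), a fixed sum of three sinusoids is used, and the function name is illustrative:

```python
import numpy as np

def make_tasks(n=300, t_delay=2.0):
    # Four nonlinearly correlated tasks from one base signal: the signal,
    # its numerical integral, its numerical derivative, and a delayed copy.
    x = np.linspace(-10.0, 10.0, n)
    def f_func(z):
        # Three-component stand-in for a GP(0, K_SM(Q=3)) sample.
        return (np.sin(1.3 * z) + 0.5 * np.cos(2.7 * z)
                + 0.3 * np.sin(0.4 * z))
    f = f_func(x)
    integral = np.cumsum(f) * (x[1] - x[0])   # left-Riemann running integral
    derivative = np.gradient(f, x)            # central-difference derivative
    delayed = f_func(x + t_delay)             # time-delayed signal f(x + t)
    return x, f, integral, derivative, delayed
```

In the paper's setup, the actual signal is a GP sample; the numerical integration, differentiation, and delay steps are the same regardless of where the base signal comes from.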
From the signal f(x) [see Fig. 3(a)], we randomly choose half of the data as training data and the rest as test data; the splits of the integral, derivative, and time-delayed signals [see Fig. 3(b)-(d)] are chosen so that learning covers interpolation as well as forward, backward, and middle extrapolation. The performance of MT-GCSM on the generated signal is shown in Fig. 3(a). As shown in Table II, all considered GP methods have comparable performance on this task; they learn the covariance between tasks and are able to interpolate missing values well. The second task, i.e., the integral of the signal, is shown in Fig. 3(b). In this case, its inherent patterns are more difficult to recognize and extrapolate. Here, MT-GCSM and GCSM-CC perform better than the other methods, and MT-GCSM achieves the lowest MAE as well as the smallest confidence interval. Both MT-GCSM and GCSM-CC excel also in the last two tasks. For instance, on the derivative signal [see Fig. 3(c)] and the time-delayed signal [see Fig. 3(d)], MT-GCSM and GCSM-CC show pattern-learning and extrapolation capability, with MT-GCSM having the best performance. Predictions obtained using the SE-LMC and Matérn-LMC kernels are of low quality, especially for the long-range extrapolation tasks (integral, derivative, and time-delayed signals); it is very hard for them to find valid patterns in the data, such as the change of trend over time. Overall, the results indicate the capability of MT-GCSM to capture the integration and differentiation patterns of the generated signal simultaneously. Note that here we have Q = 5 base components for each task for SM-LMC, CSM, MOSM, and GCSM-CC, with a total of 5, 5, 5, and 25 components, respectively. In MT-GCSM, we also consider the same number of base components for each task (here, $Q^{(1)} = Q^{(2)} = Q^{(3)} = Q^{(4)} = 5$). Thus, in MT-GCSM, we have in total $\sum_{i=1}^{4} \sum_{j=1}^{4} Q^{(i)} Q^{(j)} = 400$ convolution structures (100 inner convolution structures and 300 outer convolution structures), which can capture involved structure in the data without the need for extra hyperparameters.
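The count of convolution structures quoted above can be verified directly; a quick sketch:

```python
# Component counts per task in the synthetic experiment (Q^(1..4) = 5).
Q = [5, 5, 5, 5]

# Every pair of components from every pair of tasks is convolved:
total = sum(qi * qj for qi in Q for qj in Q)  # inner + outer structures
inner = sum(qi * qi for qi in Q)              # within-task (single-task GCSM)
outer = total - inner                         # across-task structures
```

With four tasks of five components each, this gives 400 structures in total, of which 100 are inner and 300 are outer, matching the counts reported above.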
As a result of applying full cross convolution, MT-GCSM can better learn patterns and dependencies. Indeed, Table II shows that on all tasks, MT-GCSM yields MAE smaller than that of other kernels.

B. Unstable Low Peaked Humidity Pattern Extrapolation
Sensor networks monitoring climate change in the city of Stockholm provide historical analysis and information about the future evolution of the regional environment. The recordings allow us to investigate long-range extrapolation of meteorological parameter values, such as humidity, which could guide environmental policy making. Meteorological parameters are highly relevant for local air pollution monitoring because they determine how air pollutants spread. The humidity recordings come from a number of stations in and around Stockholm (Torkel Knutssonsgatan, Marsta, Norr Malma, and Högdalen): Torkel Knutssonsgatan measures the urban background, Marsta and Högdalen measure at high-altitude towers, and Norr Malma measures the regional background. Since Stockholm is a seaside city, the change of humidity in the city depends not only on weather conditions but also on geographical location and surroundings. For example, stations located in the same area may experience the same weather conditions but have totally different humidity values, because stations located near a river or the seaside have higher humidity. These factors are time and phase dependent on different scales.
In our experiment, we use humidity time series from November 5, 2017, to November 25, 2017, at 1-h intervals. We assign a different complexity to each task by varying the number of components: $Q^{(1)} = 4$, $Q^{(2)} = 5$, $Q^{(3)} = 5$, and $Q^{(4)} = 4$ for tasks 1-4, respectively. For the baseline kernels, we set Q = 5, and we also consider a variant of MT-GCSM with five components for each task for a direct comparison. We randomly choose half of the humidity data in Torkel Knutssonsgatan (task 1), the first half of the humidity data in Marsta (task 2), the last half of the humidity data in Norr Malma (task 3), and the first and last quarters of the humidity data in Högdalen (task 4) for training, while the last half of the humidity data in Marsta and the middle part of the humidity data in Högdalen are used for testing. We aim to extrapolate the long-range missing values of the humidity data in Marsta and the middle-block missing values of the humidity data in Högdalen. From Fig. 4, we can see that there are no stable background values in the humidity recordings. Interestingly, over time, the high peaks in humidity are more stable than the low peaks. From Fig. 4, in addition to time- and phase-dependent patterns within tasks (local patterns depending on the surroundings) and time- and phase-dependent patterns across tasks (global patterns depending on seasonal or yearly factors), we observe that the low peaks in humidity appear irregularly. The change in humidity is complicated and caused by the nonlinear interaction of local and global patterns. Therefore, data from multiple stations should help when modeling long-range extrapolation trends. The results indicate that all SM-based kernels can extrapolate the humidity well, with MT-GCSM consistently achieving better performance in terms of MAE and predicted confidence interval (see Fig. 4 and Table III). The results also indicate a performance improvement from using a different number of components for each task.

C. Sharp and High Peaked Nitrogen Dioxide Concentration Extrapolation
As the second real-world experiment, we use NO2 concentration recordings from sensor networks. Nitrogen dioxide (NO2) is an important air pollution parameter reflecting the increase in industrial fossil fuel emissions and automobile exhaust; long-term exposure at high concentrations causes inflammation of the airways. Discovering the variation in NO2 concentration through pattern recognition may suggest how to control and prevent its negative effects on health and the environment. The NO2 concentration recordings are collected from four stations in the city of Stockholm: Essingeleden, with an open-path measurement, and Hornsgatan, Sveavägen, and Norrlandsgatan, with street-level measurements. All recordings cover 24 h at 1-h intervals, and missing values are filtered out. Each station corresponds to a task: Essingeleden is task 1, Hornsgatan is task 2, Sveavägen is task 3, and Norrlandsgatan is task 4. The NO2 evolution has time- and phase-related patterns. Different stations have different local patterns that depend on the station's surroundings. Still, these tasks share global trends because of seasonal change and the periodic characteristics of human and industrial activities. The evolution of the NO2 concentration in each task results from the nonlinear interaction of time- and phase-dependent local and global patterns.
In this case, we consider NO2 concentration recordings from July 25, 2017, to August 15, 2017. From Fig. 5, in addition to time- and phase-dependent global trends across tasks (global patterns depending on seasonal or yearly factors) and local patterns within each task (depending on local human activities and the surroundings), we observe that the high peaks in NO2 appear nonperiodically. Here, we randomly choose half of the NO2 data in Essingeleden for training, with the remaining stations split analogously to the humidity experiment, and we set Q = 5 for all the other kernels. The results indicate that all SM-based kernels can extrapolate the future NO2 concentration well, with MT-GCSM achieving the best performance (see Fig. 5 and Table III). In this experiment, GCSM-based kernels perform better than the other kernels, which suggests that there is indeed dependent structure in the NO2 concentration data.

D. Multidimensional Heavy Metal Concentration Prediction
In this section, we compare MT-GCSM and the baseline methods on the multivariate Jura data set. In order to obtain a comparable setting, we use the same experimental setting reported in [15], [20], and [21]. The concentrations of cadmium and copper can depend on other variables that are easier to measure, such as nickel and zinc for cadmium, and lead, nickel, and zinc for copper.
Specifically, we aim to estimate the concentrations of cadmium and copper at 100 test locations within a 14.5-km² region of the Swiss Jura. The training points are sampled from the other 259 neighboring locations. In these tasks, a standard independent GP regression model can usually make use of measurements of only one kind of metal, while MTGPs can exploit the dependencies between cadmium and nickel and between cadmium and zinc, as well as the dependencies between copper and nickel and between copper and zinc, to enhance the cadmium and copper concentration predictions, respectively. The results indicate the improved performance of MT-GCSM (see Table III), notably when using a different number of components for each task.

VIII. CONCLUSION
We proposed MT-GCSM to capture statistical dependencies within tasks and across tasks. MT-GCSM is able to extrapolate over multiple complex tasks simultaneously by using inner and outer cross convolution of time- and phase-delay-dependent components. The results of extensive experiments on artificial and real-world data sets indicated the capability of MT-GCSM to exploit dependencies within and across tasks, as well as the diverse complexity of the tasks (through a different number of components per task), and to improve performance over baseline kernels. The results also indicated that MT-GCSM can recognize and model complex structure in data, discover nonlinear correlations between tasks, and make long-term extrapolations.
A limitation of the proposed kernel is the intrinsically inefficient inference of MTGPs using a multikernel framework: the main computational cost is due to the Cholesky decomposition of the covariance matrix, which requires O(m³n³) computation time and O(m²n²) memory for m tasks with n data points each. Sparse approximations and Bayesian optimization can be adopted for efficient inference and better hyperparameter initialization [41], [47]-[49], in order to improve the scalability of MT-GCSM. However, current efficient inference approximation methods, such as FITC and PITC [15], [16], [50]-[52], are not very effective for MTGPs using a multikernel framework because the labels of the inducing points are fixed and all outputs share the same inducing points. Future research is needed on sparse representations and efficient inference for MT-GCSM.

APPENDIX PROOF OF THE POSITIVE SEMIDEFINITE CONDITION FOR MT-GCSM
The proof of the positive semidefinite condition for MT-GCSM differs from that of single-task GCSM because of the introduction of outer cross-convolution terms between tasks. The kernel $k^{m \times m'}_{\mathrm{MT\text{-}GCSM}}(\tau)$ is positive semidefinite if and only if its spectral density $\hat{k}^{m \times m'}_{\mathrm{MT\text{-}GCSM}}(s)$ is positive semidefinite [31], [32]. Given any finite set of nonzero vectors $[z_1, \ldots, z_N] \in \mathbb{C}^{N \times P}$ with complex entries and $s \in \mathbb{R}^P$, according to (18), the quadratic form of the cross-spectral densities satisfies

$\sum_{m, m'} z_m \overline{z_{m'}} \sum_{i=1}^{Q^{(m)}} \sum_{j=1}^{Q^{(m')}} \hat{g}^{(m)}_{\mathrm{GCSM}_i}(s)\, \overline{\hat{g}^{(m')}_{\mathrm{GCSM}_j}(s)} = \Bigg| \sum_{m} z_m \sum_{i=1}^{Q^{(m)}} \hat{g}^{(m)}_{\mathrm{GCSM}_i}(s) \Bigg|^2 \geq 0, \quad (24)$

and the same result holds for $\hat{k}^{m \times m'}_{\mathrm{MT\text{-}GCSM}}(-s)$. Thus, the sum of cross-spectral densities satisfies the positive semidefinite condition, and the proposed MT-GCSM kernel $k_{\mathrm{MT\text{-}GCSM}}(\tau)$ is positive semidefinite. Note that in (24), each task has a free number of components; hence, any MTGP kernel constructed by MT-GCSM with an arbitrary number of tasks fulfills the positive semidefinite condition.