Well, you have inadvertently hit on an iceberg indeed...
As a prelude, let's make clear that the concepts of variance & standard deviation are defined only for scalar variables; for vector variables (like your own 3d output here), the concept of variance is no longer meaningful, and the covariance matrix is used instead (Wikipedia, Wolfram).
Continuing on the prelude, the shape of your sigma
is indeed as expected according to the scikit-learn docs on the predict
method (i.e. there is no coding error in your case):
Returns:
y_mean : array, shape = (n_samples, [n_output_dims])
Mean of predictive distribution a query points
y_std : array, shape = (n_samples,), optional
Standard deviation of predictive distribution at query points. Only returned when return_std is True.
y_cov : array, shape = (n_samples, n_samples), optional
Covariance of joint predictive distribution a query points. Only returned when return_cov is True.
Combined with my previous remark about the covariance matrix, the first choice would be to try the predict
function with the argument return_cov=True
instead (since asking for the variance of a vector variable is meaningless); but again, this will lead to a 16x16 matrix, instead of a 3x3 one (the expected shape of a covariance matrix for 3 output variables)...
Having clarified these details, let's proceed to the essence of the issue.
At the heart of your issue lies something rarely mentioned (or even hinted at) in practice and in relevant tutorials: Gaussian Process regression with multiple outputs is highly non-trivial and still a field of active research. Arguably, scikit-learn cannot really handle the case, despite the fact that it will superficially appear to do so, without issuing at least some relevant warning.
Let's look for some corroboration of this claim in the recent scientific literature:
Gaussian process regression with multiple response variables (2015) - quoting (emphasis mine):
most GPR implementations model only a single response variable, due to
the difficulty in the formulation of covariance function for
correlated multiple response variables, which describes not only the
correlation between data points, but also the correlation between
responses. In the paper we propose a direct formulation of the
covariance function for multi-response GPR, based on the idea that [...]
Despite the high uptake of GPR for various modelling tasks, there
still exists some outstanding issues with the GPR method. Of
particular interest in this paper is the need to model multiple
response variables. Traditionally, one response variable is treated as
a Gaussian process, and multiple responses are modelled independently
without considering their correlation. This pragmatic and
straightforward approach was taken in many applications (e.g. [7, 26,
27]), though it is not ideal. A key to modelling multi-response
Gaussian processes is the formulation of covariance function that
describes not only the correlation between data points, but also the
correlation between responses.
Remarks on multi-output Gaussian process regression (2018) - quoting (emphasis in the original):
Typical GPs are usually designed for single-output scenarios wherein
the output is a scalar. However, the multi-output problems have
arisen in various fields, [...]. Suppose that we attempt to approximate T outputs {f(t}, 1 ≤t ≤T , one intuitive idea is to use the single-output GP (SOGP) to approximate them individually using the associated training data D(t) = { X(t), y(t) }, see Fig. 1(a). Considering that the outputs are correlated in some way, modeling them individually may result in the loss of valuable information. Hence, an increasing diversity of engineering applications are embarking on the use of multi-output GP (MOGP), which is conceptually depicted in Fig. 1(b), for surrogate modeling.
The study of MOGP has a long history and is known as multivariate
Kriging or Co-Kriging in the geostatistic community; [...] The MOGP handles problems with the basic assumption that the outputs are correlated in some way. Hence, a key issue in MOGP is to exploit the output correlations such that the outputs can leverage information from one another in order to provide more accurate predictions in comparison to modeling them individually.
Physics-Based Covariance Models for Gaussian Processes with Multiple Outputs (2013) - quoting:
Gaussian process analysis of processes with multiple outputs is
limited by the fact that far fewer good classes of covariance
functions exist compared with the scalar (single-output) case. [...]
The difficulty of finding “good” covariance models for multiple
outputs can have important practical consequences. An incorrect
structure of the covariance matrix can significantly reduce the
efficiency of the uncertainty quantification process, as well as the
forecast efficiency in kriging inferences [16]. Therefore, we argue,
the covariance model may play an even more profound role in co-kriging
[7, 17]. This argument applies when the covariance structure is
inferred from data, as is typically the case.
Hence, my understanding, as I said, is that sckit-learn is not really capable of handling such cases, despite the fact that something like that is not mentioned or hinted at in the documentation (it may be interesting to open a relevant issue at the project page). This seems to be the conclusion in this relevant SO thread, too, as well as in this CrossValidated thread regarding the GPML (Matlab) toolbox.
Having said that, and apart from reverting to the choice of simply modeling each output separately (not an invalid choice, as long as you keep in mind that you may be throwing away useful information from the correlation between your 3-D output elements), there is at least one Python toolbox which seems capable of modeling multiple-output GPs, namely the runlmc (paper, code, documentation).