Flow Prediction in Ungauged Catchments Using Probabilistic Random Forests Regionalization and New Statistical Adequacy Tests

Flow prediction in ungauged catchments is a major unresolved challenge in scientific and engineering hydrology. This study attacks the prediction in ungauged catchment problem by exploiting advances in flow index selection and regionalization in Bayesian inference and by developing new statistical tests of model performance in ungauged catchments. First, an extensive set of available flow indices is reduced using principal component (PC) analysis to a compact orthogonal set of “flow index PCs.” These flow index PCs are regionalized under minimal assumptions using random forests regression augmented with a residual error model and used to condition hydrological model parameters using a Bayesian scheme. Second, “adequacy” tests are proposed to evaluate a priori the hydrological and regionalization model performance in the space of flow index PCs. The proposed regionalization approach is applied to 92 northern Spain catchments, with 16 catchments treated as ungauged. It is shown that (1) a small number of PCs capture approximately 87% of variability in the flow indices and (2) adequacy tests with respect to regionalized information are indicative of (but do not guarantee) the ability of a hydrological model to predict flow time series and are hence proposed as a prerequisite for flow prediction in ungauged catchments. The adequacy tests identify the regionalization of flow index PCs as adequate in 12 of 16 catchments but the hydrological model as adequate in only 1 of 16 catchments. Hence, a focus on improving hydrological model structure and input data (the effects of which are not disaggregated in this work) is recommended.

Traditionally, regionalization proceeds in terms of model parameters, which are calibrated in gauged catchments and then transferred to ungauged catchments according to assumed relationships between model parameters and catchment characteristics (see Pechlivanidis et al., 2010;Samaniego et al., 2010;Wagener et al., 2004;see He et al., 2011, for a review of the corresponding methods). Parameter regionalization has several drawbacks, including the following: (i) hydrological model parameters often suffer from poor identifiability and strong interdependencies McIntyre et al., 2005) and (ii) parameter regionalization relationships are difficult or impossible to derive , in no small part due to difficulties in establishing correspondence between model parameters and physical catchment attributes (Duan et al., 2006;Koren et al., 2003; see also Fenicia et al., 2014). hydrological (flow) indices or signatures calculated in gauged catchments from observed data (Almeida et al., 2013;Almeida, 2014;Almeida et al., 2016;Blöschl et al., 2013;Bulygina et al., 2009Bulygina et al., 2012;Gupta et al., 2008;Hrachowitz et al., 2013;Hrachowitz et al., 2013;Olden & Poff, 2003;Sawicz et al., 2011;Wagener & Montanari, 2011;Yadav et al., 2007;Zhang et al., 2008). Examples of flow indices include average annual and monthly flows (Peñas et al., 2014), runoff coefficient (Almeida et al., 2016), quantiles and slope of the flow duration curves (Westerberg et al., 2014;Yilmaz et al., 2008), and the base flow index (Bulygina et al., 2009). The regionalized relationship is used to estimate flow indices in an ungauged location based on its catchment descriptors, and these estimated indices are then used to infer (condition) both hydrological model parameters and predictions. This regionalization strategy is based on the assumption that catchments that are similar physiographically and climatologically are also similar hydrologically.
Regionalized indices are affected by several sources of uncertainties (Almeida, 2014;Westerberg et al., 2016), due to the limited number of gauged catchments, limited quantity and quality of both dynamic observations and catchment descriptors, and the simplified nature of the regionalization model relationships. The uncertainty in the regionalized flow indices results in uncertainty in the estimated hydrological model parameters and predictions. Hence, reliable and precise regionalization of flow indices is recognized as one of the main challenges of the prediction in ungauged catchment (Hrachowitz et al., 2013;Sivapalan et al., 2003). If such regionalized indices were available, the Bayesian paradigm offers promising ideas and techniques to quantify and reduce the uncertainty in the hydrological parameters and flow predictions Bulygina et al., 2009;Singh, 2013;Yadav et al., 2007).
Like any inference technique, Bayesian methods rely on multiple modeling choices, including the specification of (1) flow indices, (2) regionalization procedure, and (3) hydrological model. From the multitude of possible indices, the hydrologist must select those that capture best the key characteristics of the flow regime. For example, Olden and Poff (2003) used 171 indices that represent five aspects of the flow regime: average annual and monthly flows, high and low flows, duration and frequency of high flows, rate of change in flows, and time of maximum and minimum flow events. They demonstrated the ability of principal component (PC) analysis (PCA) to efficiently summarize the information (variability) in the observed indices. More recently, Yadav et al. (2007) considered 39 indices, which were divided into seven classes by means of linear and Spearman rank correlation coefficients; Coxon et al. (2014) used 3 signatures to evaluate modeled flow behavior over decadal, annual, and monthly time scales; Westerberg et al. (2016) used 15 flow indices to quantify the uncertainty coming from the discharge data and propagated these uncertainties into the regionalization of the indices; Almeida et al. (2012Almeida et al. ( , 2016 combined information from multiple regionalized signatures to condition rainfall-runoff models in ungauged catchments. To regionalize flow indices to ungauged catchments, linear regression relationships are usually fitted between flow indices (e.g., runoff ratio, base flow index, flow elasticity, slope of flow duration curve, high pulse count, etc) and catchment characteristics (e.g., average annual flow, average annual precipitation, average annual potential evapotranspiration [PET], aridity index, average elevation, etc) in gauged catchments (Almeida, 2014;Almeida et al., 2016;Yadav et al., 2007;Zhang et al., 2008). As the regionalization relationships are usually nonlinear, regression models based on the random forests (RF) technique have been proposed (Peñas, 2013;Snelder et al., 2009Snelder et al., , 2013. RF regression extends the concept of a regression tree (Breiman et al., 1984), a machine learning technique that can be used to relate a set of predictors (here catchment descriptors) to a predictand (here a single flow index), from a single tree (Breiman et al., 1984) to a set of trees (Breiman, 2001). RF regression retains the advantages of a single regression tree (flexibility to accommodate different data patterns and error distributions), offers an improvement in accuracy, and is more robust with respect to the selection of predictors than a single regression tree (Snelder et al., 2013). RF regression has been successfully applied by the ecohydrological community to regionalize flow indices and to explain variations in hydrological patterns (e.g., Booker, 2013;Peñas, 2013;Snelder et al., 2013). It has also proven effective for predicting combinations of flow indices compared with other machine learning algorithms (Peñas et al., 2014) and physically based approaches (Booker & Woods, 2014). However, the previous studies have not explored the use of RF regression within a probabilistic framework to predict flow dynamics in ungauged catchments.

Water Resources Research
The use of hydrological models conditioned on regionalized flow indices rests on the following two assumptions: (i) the regionalization model is capable of estimating flow characteristics from other sources of information and (ii) the hydrological model is capable of characterizing hydrological dynamics in the catchment of interest (Yadav et al., 2007;Bulygina et al., 2009Almeida et al., 2012;Almeida, 2014). Such modeling assumptions require careful testing, as demonstrated in previous studies that explored model structure adequacy (Bulygina & Gupta, 2010;Clark et al., 2008;Clark et al., 2011;Fenicia et al., 2008;Fenicia et al., 2011;Wagener et al., 2001), the quality of hydrological observations used to drive and evaluate models Kavetski et al., 2002Kavetski et al., , 2006McMillan et al., 2012;Renard et al., 2011;Westerberg & Birkel, 2015), and the quality of regionalization procedures (Beven, 2000;Wagener & Montanari, 2011). A key complication in estimating the quality of flow predictions in ungauged catchments is that observed flow data typically used in posterior diagnostics and model verification (e.g., Gneiting et al., 2007;McInerney et al., 2017;Schoups & Vrugt, 2010, and many others) are not available. Hence, practical methods for flow prediction in ungauged catchments must either (i) assume error characteristics estimated using leave-one-out strategies on gauged sites are applicable at the ungauged site of interest (Almeida et al., 2016) or (ii) employ performance tests based solely on the conditioning data available (e.g., regionalized flow indices) and assume that these tests are indicative of the quality of predicted flow time series (e.g., Gupta et al., 2008). In addition, for practical reasons, it is of interest to develop model tests that are applied a priori rather than a posteriori with respect to the conditioning of hydrological model parameters-that is, being able to assess the adequacy of a model before estimating its parameters. If a model is found inadequate a priori and rejected, the modeler is spared the effort in conditioning the model parameters, which can be a substantial saving when the conditioning is implemented using computationally expensive Monte Carlo techniques. The ability to diagnose model adequacy a priori can hence help inform the selection of models for specific sites, help identify dominant sources of uncertainty, and so forth.
This study has the following objectives: 1. to incorporate PC-based methods, as well as RF regression techniques for regionalization, into a Bayesian framework to condition hydrological model parameters and flow predictions in ungauged catchments; 2. to develop statistical tests to evaluate a priori the adequacy of hydrological and regionalization models in representing flow indices, in PC space, available in a catchment; and 3. to empirically assess the behavior of the proposed model adequacy tests under different model quality and data availability scenarios and explore the extent to which model performance in flow index PC space is representative of the model's ability to predict flow time series.
The paper is organized as follows: Section 2 presents theoretical developments; section 3 describes the case study setup; section 4 presents the case study results, which are then discussed in section 5; and section 6 summarizes the key conclusions.

Theoretical Development
The proposed approach for flow prediction in ungauged catchments using a hydrological model comprises the following key steps. First, a set of flow indices available in gauged catchments is summarized in PC space. Second, a regionalization model that relates catchment descriptors to catchment flow indices (in PC space) is developed using the RF regression technique supplemented with a residual error model to describe regionalization model uncertainty. Third, the regionalization model is applied to ungauged locations to predict flow index PCs from available catchment descriptors. Fourth, the combination of the regionalization model and the hydrological model is evaluated in flow index PC space via new proposed statistical adequacy tests, which are assumed to provide an indirect indication of potential model performance in the flow time series space. The term model includes the model structure, parameters, and inputs. In the specific case of the RF model, the term model refers to its functional form, parameters, and the set of selected predictors (here the catchment descriptors). Fifth, if the adequacy tests are passed, the hydrological model parameters are conditioned on the regionalized flow index PCs and the model used for flow prediction; conversely, if the adequacy tests are failed, the model cannot be expected to provide trustworthy flow predictions and should be replaced or enhanced. This section provides a detailed description of each step above; Figure 1 provides a summary illustration.   (Figure 1a): The influence of s 0 is minimized using a warm-up period; s 0 is hence not inferred and is excluded from subsequent notation.
In ungauged catchments, observed flow data q obs typically used to condition parameters θ are not available. In this work, the hydrological parameter estimation problem is informed by the PCs of flow indices, which are estimated at ungauged locations using a regionalization model (section 2.2).
Note that the term hydrological model will be used to refer to the combination of a hydrological model structure and a transformation of observations of hydrological drivers to hydrological model inputs (e.g., for rainfall, using Thiessen polygons, nearest neighbor, areal average, etc.). In the absence of more detailed information about the input data and their associated uncertainty (e.g., Renard et al., 2011), it is not possible to distinguish between these two sources of uncertainty.

Flow Indices and PCs
Let w = {w i ; i = 1, N w } denote a set of flow indices computable from a flow time series ( Figure 1b): For example, this work makes use of indices such as mean annual and monthly flows, timing of events and so forth (see section 3.2).
The set of flow indices w can be transformed into a set of uncorrelated (orthogonal) PCs via PCA and the subset of dominant components z = {z i ; i = 1, N z } selected using a technique such as the broken stick method (Jackson, 1993;Peres-Neto et al., 2005;Figure 1c).
where the prefix "d" denotes the dominant PCs.
The quantity z will be referred to as "flow index PCs." Flow index PCs computed from observed flows q obs are denoted z obs , while those computed from simulated flows q sim are denoted z sim . In addition, the notation z reg is used to refer to the flow index PCs estimated by regionalization (see section 2.4.1).
Note that earlier work (Peñas et al., 2014) has suggested that applying additional transformations (such as dividing by the mean annual flow) to the flow series prior to calculating the flow indices and applying the PCA procedure can be advantageous. In addition, the selection of flow indices might be affected by the purpose of the application (e.g., high flow characteristics might be selected in preference, and/or given particular weight, if the model is intended for flood analysis). Although such flow index transformations or weights are not applied in this paper's case study, the procedures are general and can be applied with or without additional transformations and for any selection of flow indices.

Regionalization of Hydrological Information 2.2.1. Regionalization Model Structure
The probabilistic regionalization model for the flow index PCs is denoted by R(d,θ R ). It is constructed to estimate a set of flow index PCs z from a set of catchment descriptors d such as catchment area, climate, and topography (section 3.2), as schematized in Figure 1d.
where r(d,θ r ) is a deterministic regionalization model with parameters θ r and ε(θ ε ) is a (random) residual error model with parameters θ ε intended to represent all sources of uncertainty in the deterministic model r. The complete set of parameters of the regionalization model is θ R = {θ r , θ ε }. Note that the number (dimension) of models R and r and the dimension of ε(θ ε ) are equal to N z (i.e., R = {R i ; i = 1, N z }, etc.).

Water Resources Research
The number of parameters in the residual error model within the regionalization model is denoted by N θ ε , that is, θ ε ¼ fθ ε j ; j ¼ 1 N θ ε g. The regionalization model R is estimated using z obs and observed catchment descriptors d obs derived from available data from n gauged catchments, as follows: 1. The deterministic term r is estimated using the RF regression technique as described in section 2.2.2. 2. The uncertainty in each regionalized flow index PC is estimated using a jackknife strategy followed by parametric distribution fitting, as described in section 2.2.3.
Once the regionalization model is constructed, it is used to estimate the flow index PCs z reg in an ungauged catchment of interest ( Figure 1e); in turn, z reg is used in the model adequacy tests (section 2.3) and to condition hydrological model parameters (section 2.4).
Similarly to section 2.1.1, the term regionalization model will be used to refer to the combination of the regionalization model structure and its fixed inputs (catchment descriptors), as it is not possible to distinguish between these two sources of uncertainty given the available data. There are several publications that demonstrate the difficulties of separating model structure uncertainty and input uncertainty in the rainfall-runoff model community, for example, Renard et al. (2010Renard et al. ( , 2011, Beven and Westerberg (2011), Beven and Smith (2015), and others. Similar reasoning suggests difficulties in the separation of uncertainties due to regionalization model structure and its inputs.

RF Regression Model for Regionalization
The deterministic term r in the regionalization model in equation (4) is constructed using the RF regression technique from the machine learning community (Liaw & Wiener, 2002;Snelder et al., 2012). A separate model r i is built for each flow index PC z i . The RF algorithm resamples (with replacement) an ensemble of regression "trees" to create a "forest." Each tree relates predictors (catchment descriptors) to the predictand (a flow index PC). The trees "grow" so that combinations of multiple catchment descriptors are randomly sampled at each node, and the combinations providing the lowest mean square error in the predictands are retained (Liaw & Wiener, 2002;Snelder et al., 2012). The resampling of predictors introduces randomness into the regression model built using RF (Breiman, 2001), in contrast to the single regression tree approach (Breiman et al., 1984). The model prediction is computed as the expected value of all individual predictions from each tree in the forest. Note that once the deterministic term of the RF model is estimated and the selected set of parameters are specified, the resulting RF model is deterministic, in the sense that given the same input, the model will always produce the same numerical results. In addition, although the RF model is flexible, it still incurs model structural error even after it is estimated.
The RF technique offers many theoretical and practical benefits compared with traditional linear regression techniques (Breiman, 2001), including the following: 1. flexible model structure that does not require preselecting a model equation form or preselecting predictors from the available set of candidate predictors; and 2. relatively low computational cost (see Peñas, 2013, for a comparison of machine learning algorithms for regionalization).
This work uses the RF algorithm implementation in the R package "randomForest v4.6.7" (Liaw & Wiener, 2002). The deterministic term in the regionalization model cannot be expected to reproduce the observed flow index PCs exactly: (i) the relationship between catchment descriptors and flow index PCs is unlikely to be deterministic, especially given a limited set of catchment descriptors; (ii) RF regression can provide only an approximate representation of data relationships, especially when estimated from a finite set of samples (catchments); and (iii) observed flow index PCs and catchment attributes differ from their "true values" due to observation errors in the underlying data. To characterize regionalization uncertainty, a residual error model is constructed as described next.

Uncertainty Characterization in the Regionalization Model
The uncertainty in each regionalized flow index PC is estimated using a jackknife strategy, followed by parametric distribution fitting (Almeida et al., 2016), as follows: (i) leave out a single catchment from the n gauged catchments, (ii) use the remaining (n − 1) catchments to estimate the regionalization model r as described in section 2.2.2; (iii) use the model r to estimate ("regionalize") the flow index PCs in the left-out catchment; (iv) compute the residual error vector of the regionalized flow index PCs using the observed flow index PCs (available in the left-out catchment); (v) repeat steps (i)-(iv) for each catchment, resulting in a set of n residual error vectors; and (vi) fit a parametric joint probability distribution ε(θ ε ) to the set of residuals from step (v).
The parametric joint distribution is fitted to the residual errors of the regionalization model as follows: (i) estimate the cross-correlation structure using the Pearson correlation between all pairings of the residual errors of individual flow index PCs; (ii) hypothesize a particular parametric distribution for the residual errors-for example, the distributions considered in this work include the Gaussian, extreme value type 1 (also known as the Gumbel distribution), and generalized extreme value distributions; (iii) check the hypothesized distribution against the actual residuals using the χ 2 , Lillietest (Lilliefors, 1967(Lilliefors, , 1969, Jbtest (Jarque & Bera, 1987), and Mardia statistical tests (Mardia, 1970;Trujillo-Ortiz & Hernandez-Walls, 2003).
The parameters of the residual error model reflect the quality of the regionalization model. For example, location parameters such as the mean of the residual errors provide a characterization of bias in the regionalization model, whereas scale parameters such as the standard deviation of residual errors are indicative of the typical magnitude of random errors. To facilitate their interpretation, the estimated residual error parameters of the regionalization model are averaged and normalized by the range of observed or simulated PCs of the flow indices, as described next. Note that because the residual error model parameters are estimated separately for each flow index PC, the auxiliary notation θ ε u ð Þ i;j is introduced to refer to the jth parameter of the residual error model for the regionalization of the ith flow index PC in ungauged catchment u.
When the normalization is done using the range of PCs based on observed data, equation (5) is applied: is the average normalized value of the jth parameter of the residual error model for the are, respectively, the largest and smallest values of the ith flow index PC across N cat case study catchments; and N u is the number of ungauged catchments.
When the normalization is done using the range of PCs simulated by the hydrological model H, equation (6) is applied:

Model Adequacy Tests in Flow Index PC Space
This section introduces two model tests to scrutinize the quality of the hydrological and regionalization models in the flow index PC space: 1. "DistanceTest", which quantifies the ability of a model (hydrological or regionalization) to reproduce the set of flow index PCs (denoted by z) in a catchment; and 2. "InfoTest", which quantifies the information added by a model (hydrological or regionalization) over prior knowledge about flow index PCs z in a catchment.
A model (hydrological or regionalization) is considered "adequate" if it passes both DistanceTest and InfoTest, as described in the following sections.

10.1029/2018WR023254
Water Resources Research 2.3.1. Ability to Reproduce Flow Index PCs: DistanceTest DistanceTest evaluates the hypothesis that a model (hydrological or regionalization) is statistically likely to reproduce a given set of flow index PCs z in a catchment. A DistanceTest p value ψ(z) is calculated as follows (see Figure 1f): 1. Construct the distribution of z, p(z| •), from the model being tested. When testing the hydrological model, p(z| H) is constructed by propagating the prior parameter distribution through the hydrological model in equation (1); when testing the regionalization model, p(z| R) is constructed by sampling residual errors within the regionalization model in equation (4). 2. Approximate p(z| •) by a mixture of K Gaussian distributions (components) with mean μ k and covariance matrix Σ k for k = 1, K (Bulygina & Gupta, 2011;Muller et al., 1996). 3. Calculate the probability mass A k of the confidence region associated with the value z being tested, within each Gaussian component k: Þis the inverse cumulative distribution function of the χ 2 distribution with m degrees of freedom, and transp denotes the vector transpose.
Equations (7) and (8) can be derived by considering that elliptical contours in z-space equidistant in Mahalanobis distance from μ k represent Gaussian equidensity contours and define the most compact confidence region with probability mass A k (Gallego et al., 2013;Ribeiro, 2004). Assuming z represents a sample from the kth Gaussian component, z~N(μ k , Σ k ), the quantity D 2 k follows a χ 2 distribution with N z degrees of freedom (Gallego et al., 2013).
4. The DistanceTest p value ψ over all K Gaussian components is defined as where A max = max{A k ; k = 1, K}. DistanceTest is passed if ψ > ψ*, where ψ* is a prescribed significance level; in this work, ψ* = 0.05 is used.
DistanceTest can be used in the following two ways: i The main purpose of DistanceTest is to test the combined regionalization and hydrological model under "real operating" conditions in an ungauged catchment, where observed flow is not available. In this case, DistanceTest is applied with z = z reg . This setup tests whether the hydrological model is likely to reproduce the flow index PCs estimated using the regionalization model, so that the test outcome is affected by deficiencies in the regionalization and/or hydrological models (which as noted in sections 2.1.1 and 2.2.1 include the respective model structures and forcing inputs). To allow for regionalization uncertainty, DistanceTest is conducted using replication as follows: draw a sample z reg(i) from the distribution of regionalized flow index PCs, compute its p value ψ (i) , repeat N sam times, and compute the average In this study, N sam = 10,000 draws are used.
ii DistanceTest can also be used to test an individual model-hydrological or regionalization-under "verification" conditions when z obs is available. In this case, z obs can be used in DistanceTest either directly as z = z obs (no replication) or indirectly by constructing "synthetic" scenarios (using replication when testing the hydrological model). In this work, such verification is conducted in scenario 0 and scenarios 2-4 in section 3.5.

Model Informativeness About Flow Index PCs: InfoTest
InfoTest quantifies how much information about catchment flow index PCs is added by a model (hydrological or regionalization) over prior knowledge. In other words, it quantifies the added value of using a model to predict the dominant flow characteristics as represented by the flow index PCs. The test is defined via the Bayes Factor (BF; Gelman et al., 2013; see Figure 1g). For a hydrological model, the Bayes Factor BF H is defined as where p(z| H) is the probability density function (pdf) of the distribution of flow index PCs given the model H with parameters sampled from the prior distribution p(θ). The term p(z) denotes the pdf of the prior distribution of observed flow index PCs (see below).
For the regionalization model, the Bayes factor BF R is defined as where p(z| R) is the pdf of the distribution of flow index PCs associated with the probabilistic regionalization model R defined in section 2.2.1.
In this work, the prior density p(z) is set as uniform over the range of flow index PCs observed across the available gauged catchments: where N z is the number of PCs retained for the analysis and z max respectively, the largest and smallest values of the ith PC across N cat case study catchments.
The pdf p(z| H) is approximated by fitting a mixture of Gaussians to the set of flow index PCs generated using the hydrological model with parameters sampled from the prior (same as in DistanceTest). The pdf p(z| R) is obtained as described in section 2.2.3. Table 1 provides a qualitative interpretation of BF values. BF = 1 indicates that the model provides the same information as the prior, BF < 1 indicates that the prior provides more information than does the model (i.e., the model is not informative), and BF > 1 indicates that the model adds information beyond the prior. BF can be interpreted quantitatively, for example, BF = 10 indicates that the model provides "10 times more information" than the prior, and so forth. While other choices can be made, the study considers InfoTest "passed" when BF > 1.
Similar to DistanceTest, InfoTest can be used in two ways: i InfoTest can be used to test the combined regionalization and hydrological models under "real operating" conditions, that is, when observed flow is not available, and z is estimated via regionalization. In effect, this tests the added information value of the hydrological model in reproducing the regionalized flow index PCs, as compared with prior estimates of observed flow index PCs. If the test is failed, it could be due to deficiencies in the regionalization and/or hydrological models (including their structures and inputs). To allow for regionalization uncertainty, InfoTest uses a replication similar to that in DistanceTest, drawing from the distribution of regionalized flow index PCs, computing the BF using equation (10) and averaging over multiple draws to get the overall BF. ii InfoTest can also be used to test an individual model-hydrological or regionalization-under "verification" conditions, when z obs is available (scenarios 0 and 2-4 in section 3.5).

Practical Usage of Adequacy Tests
The use of DistanceTest and InfoTest relies on the assumption that model adequacy in the space of flow index PCs is at least broadly indicative of model performance in predicting flow time series. This assumption is necessary given that ungauged catchments do not have observed flow time series available for model verification; its validity will be appraised in the empirical case study (sections 3 and 4).
The adequacy tests are carried out prior to any conditioning of hydrological model parameters. If the combined regionalization/hydrological model passes the adequacy tests, the modeler can proceed to condition

10.1029/2018WR023254
Water Resources Research the hydrological model parameters θ on the regionalized flow index PCs and produce flow predictions (next section). If the adequacy tests are failed, the model cannot be expected to provide trustworthy flow predictions and should be replaced or enhanced. Note that the proposed adequacy tests represent necessary but insufficient conditions for a model to be considered capable of predicting streamflow. The tests are a priori necessary because the model must be able to represent the set of streamflow indices (e.g., annual and monthly flows), before being used to predict detailed flow dynamics (hydrograph) in the ungauged catchment. However, the tests are not sufficient on their own, because even if the model can reproduce the streamflow indices, it might not be able to reproduce the full dynamics. Á is given by Bayes equation (Bulygina et al., 2009;see Box & Tiao, 1973, for basic theory; Figure 1h).
The likelihood function p z reg jθ; H; x ð Þ ¼ p z reg jz sim θ À Á describes the statistical relationship between regionalized and simulated flow index PCs (see section 2.2); z sim θ = dPCA(q sim ) = dPCA(H(θ,x)) denotes flow index PCs simulated by model H with input x and parameters θ.
Under the assumption that the errors of the regionalization model dominate the errors of the hydrological model, the likelihood function can be constructed using the same probability distribution as estimated for the residual errors of the regionalization model in section 2.2.3. This assumption follows published work on conditioning of hydrological parameters to streamflow statistics (e.g., Almeida et al., 2016;Bulygina et al., 2009Bulygina et al., , 2012Yadav et al., 2007), though it can be questioned (see section 6).
Unless specific prior information is available, p(θ| H,x) can be specified as uniform over the feasible parameter ranges. The denominator in equation (13) is a normalizing constant and is not required explicitly by many sampling algorithms. The posterior distribution in equation (13) is sampled as described in section 2.4.2. The predictive distribution of flows is then constructed by running the hydrological model H(θ, x) with the posterior parameter samples θ (Figure 1i).

Posterior Distribution Sampling
The posterior parameter distribution in equation (13) is approximated using importance sampling (Doucet et al., 2000; see also Kuczera & Parent, 1998), as follows: 1. Draw S parameter sets {θ i ; i = 1, S} from the uniform prior distribution p(θ| H,x) using the Latin hypercube method (S = 1,000 in the study). 2. Run the hydrological model (with fixed inputs and initial conditions) with each parameter set θ i to generate S flow time series {q sim i ; i = 1, S}. 3. Calculate the flow index PCs z sim i for each simulated flow time series. 4. Compute the (unscaled) weight w i for each parameter set using the likelihood function in equation (13); each parameter set θ i generated in step 1 is assigned a weight based on the likelihood function value of its corresponding flow index z sim i computed in step 3. 5. Scale the weights w i so they add up to 1.
The parameter sets and their weights, {θ i , w i ; i = 1, S}, provide an approximation to the posterior parameter distribution in equation (13) (Doucet et al., 2000).

Case Study Catchments
The case study is based on a set of 92 catchments in northern Spain (Figure 2) selected from the larger set of 156 catchments used by Peñas et al. (2014). The selected catchments are characterized by a "natural" hydrological regime as defined by the European Water Framework Directive (European Commission, 2000). From the selected catchments, 62 catchments drain into the Cantabrian Sea, and the remaining 30 catchments drain into the Mediterranean Sea. The climatic characteristics described below are derived from monthly climate series calculated from a 1 km × 1 km grid map developed by the Centre for Hydrographic Studies using data from more than 5,000 weather stations across Spain (CEDEX, Ministry of Public Works and Ministry of Agriculture, Food and Environment, 2013, Spain). Topography and catchment geometry are derived using a 25-m-resolution digital elevation model. Land use is derived from the Soil Occupancy Information System (in Spanish "SIOSE") at a 1:25,000 scale developed by the National Geographic Institute of the Spanish Government. The geological variables are derived from the lithostratigraphic and permeability maps at scale 1:200,000, developed by the Spanish Geologic and Mining Institute (in Spanish "IGM") of the Spanish Government.
The selected catchments exhibit a wide variety of geologies, soils, topographies, land uses, and climatic conditions. Dominant lithological groups in the catchments draining into the Mediterranean Sea are clay, sand, and gravel; from these catchments, those in the Pyrenees also contain some siliceous and calcareous rock. The western catchments draining into the Cantabrian Sea are composed primarily of slates, while the eastern catchments are dominated by calcareous rock (Geological and Mining Institute of Spain, 2013). In each catchment, urbanized zones comprise less than 8% of the total area, with land cover dominated by pastures, broadleaf forests, and coniferous forests. Table 2 reports the ranges of average catchment elevation, slopes of main river channels, catchment areas, annual average rainfall, annual average PET, aridity index, annual average flows, annual runoff coefficients, and annual average temperatures. Figure 2 classifies the catchments based on their aridity index (Arora, 2002) and minimum monthly average temperatures (snowfall is likely below 0°C). In this work the aridity index is defined as average annual PET divided by average annual precipitation; dry and humid climatic conditions are described, respectively, by aridity indices above and below 1 (Arora, 2002).
A subset of 16 catchments (out of the 92 catchments) is selected and treated as "ungauged" for the purposes of evaluating the proposed prediction methods and adequacy tests. These catchments have synchronized daily data of rainfall, daily flow, and daily PET (estimated from monthly PET) of sufficient length (at least 8 years). Such data are not available in other catchments, making it impossible to implement complete leave-one-out cross-validation ( Figure 2 and Table 3). See Peñas et al. (2014) for detailed information. Daily precipitation data for the catchments are provided by the Spanish Meteorological Agency (AEMET), while daily flow data are provided by different Spanish water agencies and regional governments.

. Catchment Flow Indices and Catchment Descriptors
The case study uses the same 103 flow indices w and 16 catchment descriptors used by Peñas et al. (2014). Here a brief summary is provided; for more details, see Peñas et al. (2014). The following 16 catchment descriptors d are used (Peñas et al., 2014): area, climate (mean annual precipitation and PET, and ratio of minimum quarterly precipitation to maximum quarterly precipitation), topography (average catchment elevation and gradient), catchment geometry (drainage density and number of river confluences), land use (area covered by agricultural land, broadleaf forest, coniferous forest, bare land, pasture, and urban areas), and geology (average rock density and permeability). These catchment descriptors are the least correlated from a larger set of catchment descriptors, with Pearson correlation coefficients below 0.7 (Peñas et al., 2014).

Hydrological Model
The conceptual rainfall-runoff model used in this paper is the Probability Distributed Model (PDM) (Moore, 2007). This model has a simple structure and is used widely across the world, including in the United Kingdom Pechlivanidis et al., 2010), Europe (Arnell, 1999;Cabus, 2008;Willems et al., 2014), the United States (Kollat et al., 2012), Southeast Asia (Thompson et al., 2013), Southern Africa (MacKellar et al., 2013), and Australia (Srikanthan et al., 2007). Here PDM is used as a lumped model over each catchment and run on a daily time step.
A schematic of PDM is shown in Figure 3. Spatial variability of soil moisture storage capacity is represented using a Pareto distribution, and runoff routing is represented using two linear reservoirs in parallel. The model has a total of five parameters, namely, the Pareto distribution parameters C max and b (which control storage capacity and its variability, respectively), parameter α PDM (which controls the split of effective rainfall into quick flow and slow flow), and two routing parameters T q and T s (which control the residence time of the quick flow and slow flow reservoirs, respectively). A more detailed model description can be found in Moore (2007). Prior parameter ranges are adapted from Kollat et al. (2012) and shown in Figure 3. The initial soil moisture storage is set to 0, and the first year of the simulations is used for a model warm-up.  (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) 92 catchments Min  (1) (2) (3) (4) (5) (6) (7) (8) (9)

Analysis 2. Evaluation of the Probabilistic RF Regionalization Model
This analysis comprises an exploration of the probability distribution fitted to the residual errors of the regionalization model. The following aspects are considered: 1. The first aspect is regionalization model bias, quantified by the average of the normalized mean parameter of the residual errors of the regionalization model. Two normalization approaches are considered: (i) by the range of observed flow index PCs across all catchments and (ii) by the range of simulated flow index PCs in the given catchment, obtained using the hydrological model with parameter sets sampled from the prior p(θ). A normalized mean of 0 corresponds to an unbiased regionalization. 2. The second aspect is regionalization model precision, quantified by the standard deviation parameter of the residual errors, using the two normalizations described above. The average of the normalized standard deviation quantifies the relative errors of the regionalization model with respect to the range of flow indices PCs. A narrow spread indicates lower uncertainty (higher precision).
In addition to the parameter analysis above, the fitted residual error distribution is plotted for one of the catchments.

Analysis 3. Evaluation of the Correspondence Between Model Adequacy in the Flow Index PC Space and Model Performance in the Flow Time Series Space
This analysis investigates the behavior of the adequacy tests DistanceTest and InfoTest, which are applied in the flow index PC space yet are intended to provide at least an indirect indication of model ability to predict flow time series. The following investigations are carried out over the 16 "ungauged" catchments: 1. Apply the adequacy tests under five scenarios of model quality and data availability (section 3.5). 2. Evaluate flow predictions using the probabilistic Nash-Sutcliffe efficiency (probabilistic NSE) Φ NSP and the 95% posterior probability limits γ 95% (section 3.6). Results are then briefly discussed in the context of operational predictions in the same geographical area. 3. Carry out an analysis of variance (ANOVA) to quantify whether improvements in the hydrological and/or regionalization models make a statistically significant difference on the quality of the predictions as assessed using the Φ NSP and γ 95% performance metrics.

Analysis 4. Illustration of Model Performance in Specific Catchments
This analysis reports observed and simulated flow time series for two selected catchments: 1. Leizarán River Basin (code c8z1 in Figure 2), which is located in the Basque Country and has one of the highest data quality of the available catchments. This catchment is typical of north of Spain, with a humid climate and no snow; PDM is expected to perform well in this catchment.

10.1029/2018WR023254
Water Resources Research 2. Ara River Basin (code X9040 in Figure 2), which is located in the Pyrenees and has a glacial valley at its higher elevations. This catchment experiences snow melt in April-July; PDM is expected to perform poorly in this catchment.

Description of Scenarios in Analysis 3
The studies in Analysis 3 are carried out for the following five scenarios, intended to represent different levels of model quality and available data.
• Scenario 0. This scenario assumes that z obs is available, which allows a direct assessment of the adequacy of the regionalization and hydrological models but in real practice will not be possible in ungauged catchments. The regionalization model is tested by estimating z reg from the catchment descriptors in the "ungauged" catchment and applying DistanceTest and InfoTest with z = z obs . The hydrological model is tested by applying the adequacy tests with z = z obs . • Scenario 1. This scenario represents the intended usage of the proposed framework for flow prediction in ungauged catchments, where observation-based flow index PCs are not available. Flow index PCs for an ungauged catchment are estimated using the regionalization model as described in section 2.2, and regionalized flow index PCs are used to condition hydrological model parameters/simulations via Bayes equation (13). Adequacy tests are performed only for the hydrological model, with z = z reg . • Scenario 2. This scenario is devised such that the regionalization model is accurate and has low noise (i.e., near exact). Specifically, with reference to equation (4), the deterministic term is set equal to the observed values of flow index PCs in the catchment treated as "ungauged" (rather than estimated using RF regression), and the random noise term is set to a Gaussian distribution with zero mean and a standard deviation equal to 5% of the full range of observed flow index PCs across the 92 catchments treated as "gauged." Though other values of the standard deviation might be used, a standard deviation equal to 5% is used in this paper. The purpose of using this reduced noise is to explore the impact of improvements in the regionalization model on the adequacy tests and predictive performance. The adequacy tests are applied only to the hydrological model, because the regionalization model is adequate by construction (p value = 1 and BF > 1). • Scenario 3. In this scenario the hydrological model is replaced by the observed time series (and hence the hydrological model reproduces the flow index PCs exactly) in the ungauged catchment, but the regionalization model has a "realistic" error (based on scenario 1). The flow index PCs in a given ungauged catchment are generated synthetically, as follows. First, a "reference" synthetic flow time series q ref = H(θ) is generated using the hydrological model with a known "reference" parameter set θ ref .
This reference parameter set is obtained by minimizing the normalized distance between the N z dominant PCs calculated for the observed and simulated flows in the ungauged catchment, ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi  (4). Fourth, the random term in equation (4) is set to have the same characteristics as estimated in scenario 1. Given this synthetic model setup, adequacy tests of the regionalization model produce the same results as in scenario 0. However, as will be discussed in section 4.3.1, an "exact" hydrological model might be not "adequate" in simulating flow index PCs if the latter are biased and/or noisy. • Scenario 4. In this scenario the hydrological model is replaced by the observed time series and the regionalization model is near perfect (accurate and with little noise in the space of flow index PCs). Synthetic z obs are generated as in scenario 3, and then the regionalization model is constructed as in scenario 2.

Model Performance Metrics for Flow Time Series
In Analyses 3 and 4, model performance in the flow time series space is appraised in terms of three attributes of a probabilistic prediction: accuracy, precision, and reliability. Accuracy quantifies the distance between the central values of a predictive distribution (e.g., its expectation) and the observed values; precision characterizes the spread of the predictive distribution; and reliability quantifies the degree of statistical consistency of the predictive distribution with the observed data.
Accuracy and precision are quantified jointly using a probabilistic extension of the original (deterministic) NSE (Nash & Sutcliffe, 1970). In this metric, denoted as Φ NSP , accuracy is penalized by (poor) precision as follows (Bulygina et al., 2009): where q t is the set of simulated flows at time step t generated using the set of sampled hydrological model parameters, q obs t is the observed flow, E[.] denotes the expected value, var [.] denotes the variance, and T is the total number of time steps. The accuracy term corresponds to the deterministic NSE of the expected values of simulated flows. The precision term is given by the sum of variances of predicted flows, scaled by the sum of squared residuals (higher variance corresponds to lower precision). When Φ NSP = 1, the predictions provide a perfect match to the data (i.e., both perfectly accurate and precise).
Note that except for the case of predictions with no uncertainty (not relevant here), Φ NSP always has lower values than has the deterministic NSE, due to the (nonnegative) precision penalty term in equation (14). Therefore, the probabilistic NSE is a more stringent metric than is the deterministic one. It is also a more complete metric because it accounts for the width of probability limits (precision).
Predictive reliability is quantified using the 95% coverage interval γ 95% , defined as the percentage of observations that fall into the 95% posterior probability limits (Yadav et al., 2007). A reliable prediction is characterized by γ 95% values close to 95%; larger and smaller values indicate, respectively, underestimation and overestimation of predictive uncertainty. A more comprehensive measure of reliability is given by the predictive QQ plot (Renard et al., 2010), but is not carried out in this work.

Analysis 1: Selection of Flow Index PCs
From the complete set of 103 flow indices w obs in each of the 92 catchments, the application of PCA and the broken-stick method identifies four flow index PCs z obs that collectively explain 87% of the variability in the flow indices (individual contributions of 63%, 12%, 7%, and 5%, respectively). These dominant PCs appear to be made up by combination of multiple flow indices, with no single index having much larger weight than the others.

Analysis 2: Evaluation of the Probabilistic Regionalization Model
The four-dimensional likelihood function p z reg jz sim θ À Á in equation (13) is estimated by fitting a parametric pdf to the distribution of residuals z reg −z sim θ of the RF regionalization model (section 2.2). Pearson's linear correlation test indicates a significant correlation (p value of 0.03) between the residuals of z 2 and z 3 , and no significant correlation (at the 0.05 significance level) between other pairs of residuals. Hence, the likelihood function is constructed as a product of marginal probability distributions of regionalization residual errors for z 1 and z 4 and joint probability distribution for z 2 and z 3 , as follows: The χ 2 test (Pearson, 1900) (Pearson, 1900) and Mardia tests (Mardia, 1970) Table 4 shows the ranges of the mean and standard deviation of the residuals distribution of the regionalization model (columns 2 and 3 in Table 4) and averaged normalized mean and standard deviation of the residual errors (columns 4 to 7 in Table 4) across all 92 catchments. The mean and standard deviation are normalized by the range of flow index PCs from observations (columns 4 and 5 in Table 4) or from hydrological model simulations (columns 6 and 7 in Table 4).
When normalized by the range of observed PCs, the regionalization model residuals have a bias (mean) of 0-12% and a spread (standard deviation) of 12-16%. The regionalization of z 2 , z 3 , and z 4 appears relatively unbiased (mean of residual errors close to 0), whereas the regionalization of z 1 has a (normalized) bias of 12%. The regionalization of z 1 and z 4 is the most precise (narrowest spread), while the regionalization of z 3 is the least precise (widest relative spread).  When the parameters of the regionalization error model are normalized by the range of simulated PCs, the apparent bias and spread increase, due to a tighter normalization range. The biggest effect is seen for the z 1 residuals-normalized bias and spread increase to 40% and 42%, respectively. Figure 5 shows DistanceTest and InfoTest results for the five scenarios of Analysis 3 in each of the 16 catchments treated as ungauged. The scenarios are "S0:R," "S0:H," "S1:H," "S2:H," "S3:R," "S3:H," and "S4:H," where "R" and "H" refer, respectively, to whether the regionalization or hydrological model is being tested. Note that the regionalization model can only be tested in scenarios 0 and 3. The complete set of numerical values is reported in the supporting information. The results are summarized below for each scenario.

Analysis 3: Understanding the Properties of Adequacy Tests 4.3.1. Adequacy Test Results
• Scenario 0: z obs available and used to check the hydrological and regionalization models individually Figure 5 (S0:R) shows that the RF regionalization model passes DistanceTest in all 16 catchments but passes InfoTest in only 12 catchments. In other words, the regionalization model is able to adequately reproduce z obs in 12 of 16 catchments (75% of the catchments). Figure 5 (S0:H) shows that the hydrological model PDM passes DistanceTest in 6 catchments and passes InfoTests in 3 catchments, so that PDM is able to adequately reproduce z obs in only 3 of the 16 catchments (c8z1, X1353, and X9257). The large number of catchments where the regionalization model is adequate, but the hydrological model is not, suggests that the hydrological model tends to be the dominant source of error, at least in the case study area.
• Scenario 1: Real operating conditions where the hydrological model is conditioned on regionalized flow index PCs Figure 5 (S1:H) shows that the combined regionalization/hydrological model is adequate in only a single catchment, c8z1, where it reproduces regionalized flow index PCs with high probability and adds information over the prior knowledge about the flow index PCs.
• Scenario 2: Condition an "inexact" hydrological model on flow index PCs with small regionalization error In this scenario, the regionalization model is devised to be adequate (section 3.5); hence, its adequacy tests results are not shown in Figure 5. The hydrological model is inadequate in representing the regionalized information in all catchments ( Figure 5, S2:H). More specifically, the hydrological model passes DistanceTest in 3 of 16 catchments but fails InfoTest in all of them (BF < 0.5).
• Scenario 3: Condition an "exact" hydrological model on flow index PCs with "realistic" regionalization errors. Figure 5 (S3:H) shows that even an "exact" hydrological model conditioned on inaccurate and noisy z reg fails DistanceTest and InfoTests in 9 of 16 catchments.
• Scenario 4: Condition an "exact" hydrological model on flow index PCs with small regionalization error In this scenario, the regionalization model is devised to be adequate (section 3.5), the same as in scenario 2 (results hence not shown in Figure 5). The focus is hence on the hydrological model, which in this synthetic scenario is "exact" (section 3.5). Interestingly, Figure 5 shows that the hydrological model fails DistanceTest in 1 of 16 catchments and fails InfoTest in 4 of 16 catchments (in different catchments than in scenario 0). The single DistanceTest failure is not unexpected: As the p value threshold to pass DistanceTest is set to 0.05 (5%), even an adequate model is expected to fail DistanceTest on average once for every 20 catchments.

Performance of Flow Time Series Predictions in the Ungauged Catchments
This section reports model performance in terms of flow time series predictions and relates it to model performance in terms of the adequacy tests. Figure 6 reports the Φ NSP and γ 95% performance metrics (section 3.6) in all 16 "ungauged" catchments.
First, consider changes in model performance from scenario 1 to scenario 2. Figures 6a and 6c show that in the single catchment where the hydrological model is found adequate in representing regionalized information (i.e., catchment c8z1; see section 4.3.1, scenario 1), improving the regionalization model results in improved flow predictions: Φ NSP improves slightly from 0.68 to 0.71, and γ 95% improves from 70% to 74%. In contrast, in the 15 catchments where the hydrological model is found inadequate, improving the regionalization model as achieved "synthetically" in scenario 2 does not systematically improve prediction quality. For example, in catchment AN313 Φ NSP worsens from 0.3 to 0.25 and γ 95% worsens from 77% to 75%, but in catchment X1353 Φ NSP improves from 0.6 to 0.7 and γ 95% improves from 45% to 51%.
Second, consider changes in model performance from scenario 3 to scenario 4. Figure 6 shows improvements in Φ NSP and γ 95% in all catchments. For example, in catchment AN313 Φ NSP improves from 0.77 to 0.99 and γ 95% remains at 100% (indicating an overestimation of predictive uncertainty). This result suggests that when the hydrological model is exact, improving the regionalization model leads to better flow predictions. This finding is also supported by the ANOVA, which rejects the null hypothesis that "when the hydrological model is exact, improving regionalization has no effect on Φ NSP and γ 95% ."

Analysis 4: Detailed Illustration of Model Performance in Selected Catchments
Figures 7a-7d show the predicted flow time series (95% probability limits) for scenarios 1-4 in catchment c8z1, where PDM is adequate. In scenarios where regionalization quality is representative of real conditions (scenarios 1 and 3), the adequate hydrological model provides predicted flow time series of similar quality to the exact hydrological model. Further, improving the regionalization model improves the quality of predicted flow time series generated by the adequate hydrological model-both when switching from scenario 1 to scenario 2 and when switching from scenario 3 to scenario 4.
Figures 7e-7h show the predicted flow time series (95% probability limits) for scenarios 1 to 4 in catchment X9040, where neither the hydrological nor regionalization models are adequate. Note that catchment X9040 is affected by snow melt, which is not represented by PDM (April-July period in scenarios 1 and 2 in Figures 7e and 7f). Results in scenario 2 illustrate that when the hydrological model is inadequate, its error precludes the reproduction of flow time series even when the regionalization model is adequate. Moreover, results in scenario 3 show the importance of regionalization model adequacy: Even when the hydrological model is exact, conditioning on poor-quality flow index PCs yields low-quality flow predictions; for example, the 95% probability limits envelop just 14% of observed flows (γ 95% = 14%).

Selection of Flow Indices PCs
Section 4.1 shows that the first 4 flow index PCs collectively explain 87% of the variability in the flow indices. This indicates that most of the information contained in the hydrological indices can be captured by just a few quantities, which is expected to be useful in ungauged catchment applications (Wagener & Montanari, 2011). Further, section 4.3 shows that in some catchments, the most significant PCs regionalized by RF are sufficient to obtain good performance of the hydrological model predictions. Note that the

Evaluation of the Probabilistic Regionalization Model
Section 4.2 shows that when the parameters of the regionalization model are normalized by the range of most significant PCs computed from observed data, the spread is relatively narrow. The spread is comparable with the case in which the normalization range is based on the range of most significant PCs computed from model simulations. The only exception is z 1 , for which the regionalized value has the highest bias and spread when normalized by the range of simulated PCs (see Table 4). This behavior indicates that the hydrological model provides only a limited coverage of z 1 value range generated by the regionalization model and suggests that the regionalization error is large relative to the range of z 1 values simulated by the hydrological model. Therefore, z 1 might be of limited utility for conditioning the selected hydrological model parameters despite being the dominant (most informative) flow index PC from the complete set of flow indices w, as conditioning on z 1 might not be efficient in reducing PDM parametric uncertainty or generalized z 1 might be outside of z 1 range produced by PDM.
Nonparametric distributions (e.g., mixtures of Gaussians) might be advantageous when working with larger data sets (e.g., flow indices from more than the 90 catchments used in this study). For shorter data sets, nonparametric distributions could be difficult to fit parsimoniously. For example, to model the four-dimensional vector of flow index PCs considered in the paper using a single joint mixture of Gaussians distribution, one needs 4 parameters to define a Gaussian kernel mean, 10 parameters to define the covariance matrix, and 1 kernel weight coefficient, hence 15 parameters in total. If the mixture distribution were to comprise several such kernels, say, 6 kernels, the number of parameters (6 × 15 − 1 = 90 − 1 = 89) would approach the number of data points (90 sets of PCs in our case). This might result in poor identifiability and overfitting.

Adequacy Tests in Analysis 3
Scenario 0 in section 4.3.1 shows that catchments where the regionalization model fails InfoTest are harder to model using the combined regionalization/hydrological model, because the information available to condition the hydrological model is less reliable. In addition, results in scenario 0 indicate that there is a large number of catchments where the regionalization model is adequate, but the hydrological model is not (see scenario 0 in Figure 5). In other words, in the case study area, the hydrological model tends to be the dominant source of error in comparison with the regionalization model.
Moving from the adequacy test for the hydrological model in scenario 0 (S0:H in Figure 5) to scenario 1 (S1:H in Figure 5), the number of catchments where the hydrological model is adequate reduces from 3 to 1 (see section 4.3.1). This finding can be attributed to the uncertainty due to the regionalization model, especially as the uncertainty in z reg can often be expected to exceed the uncertainty in z obs . More generally, the combined regionalization/hydrological model could be inadequate for several potential reasons, including model structure deficiencies and data errors. For example, for the hydrological model, the following potential sources of error are noted: 1. The PDM model structure does not represent snow melt (e.g., catchment X9040 in Figure 7), or Hortonian runoff process/intermittent rivers (e.g., catchment X9257), or deep aquifers (e.g., catchment AN439-Larraún Irutzun River Basin; J. L. Navarra, personal communication, March 15, 2018). 2. The rainfall data are uncertain due to a limited number of rain gauges and their nonuniform coverage, while rainfall variability is expected to be high due to orographic precipitation in catchments draining into the Cantabrian sea and convective precipitation in catchments draining into the Mediterranean sea. Daily PET values are calculated for each month and distributed uniformly to produce daily values, which might be inaccurate for dry catchments. 3. The flow data (and hence flow index PCs) are expected to be affected by discharge gauging errors, rating curves extrapolation, flow regime hysteresis , and channel hydraulic property changes due to the lack of a control section in the majority of the catchments (Kuczera, 1996;Renard et al., 2011).
Potential sources of error in the regionalization model have been noted in section 2.2.2.

Water Resources Research
Results in scenario 2 are generally consistent with results in scenario 0: In both scenarios the hydrological model struggled to match the observed flow index PCs. Scenario 2 adds small noise to the observed flow index PCs, and the hydrological model adequacy deteriorates further, as compared with scenario 0.
Results of the adequacy test for the hydrological model in scenario 3 are reassuring as they imply that the hydrological model will not reproduce erroneous flow index PCs. However, it is not clear if this finding is specific to PDM, nor is it clear if a more heavily parameterized model might behave differently because of its increased flexibility in matching conditioning data.
Finally, there are some catchments that fail to pass InfoTest in scenario 4. This might be due to the 5% noise in the regionalization model. Note that InfoTest is only indicative of whether the model generates z sim close to z reg more frequently than the prior (see section 2.3.2), and achieving a high frequency of such occurrences is not guaranteed for the hydrological model parameter set used in this scenario. Hence, unless the hydrological model is conditioned to additional data (e.g., regionalized flow index PCs beyond those identified as dominant by the broken stick method; see section 4.3.2), the frequency at which the model will reproduce the exact flow index PCs will be relatively low. This question links to information quality and quantity of information, that is, what additional data/information are needed for those catchments that fail in scenario 4 and what is the size of error that can be propagated into the hydrological model. These research questions are recommended for future work.

Performance of Flow Time Series Predictions
The analysis of flow time series predictions in ungauged catchments (section 4.3.2) showed that moving from scenario 1 to scenario 2-that is, improving only the regionalization model while the hydrological model is kept as is-does not systematically translate into an improvement in the flow hydrograph (little improvement is seen in Φ NSP and γ 95% ). In other words, the error in the hydrological model (which includes its structure, inputs and parameters) is precluding an improvement in flow predictions even if the regionalization model improves (see scenario 2).
However, when the hydrological model is exact, improving the regionalization model (i.e., moving from scenario 3 to scenario 4) leads to better flow predictions. This finding is supported by the ANOVA, which rejects the null hypothesis that "when the hydrological model is exact, improving regionalization has no effect on Φ NSP and γ 95% ." Consequently, an improvement in the regionalization leads to improved Φ NSP and γ 95% over most catchments only if the hydrological model error is eliminated/reduced first (see scenario 4). This finding is consistent with the results in section 4.3.1 (scenarios 2 and 4).
Therefore, given their definition based on flow index PCs, the model adequacy tests proposed in this work can be expected to become better indicators of flow time series performance if the following enhancements are undertaken in the regionalization process: (i) more diverse and representative sets of hydrological (flow) indices w and catchment descriptors d obs are used and (ii) more PCs are included in z obs .

Benefits and Limitations of the Adequacy Metrics
This section compares the adequacy metrics proposed in section 2.3 with traditional metrics of hydrological model performance.
First, and most importantly in the context of prediction in ungauged catchments, the adequacy metrics are applied a priori using regionalized information. In contrast, conventional application of NSE-type metrics (deterministic or probabilistic) and prediction interval coverage to predicted flow time series represent a posteriori diagnostic metrics and cannot be applied in ungauged applications (where observed flow data are unavailable).
Second, the adequacy metrics are applied to flow index PCs, which allows quantifying the degree to which the model reproduces dominant flow characteristics (Fenicia et al., 2018;Westerberg & McMillan, 2015;Yilmaz et al., 2008). This study uses indices characterizing annual and monthly flows, high and low flows, hydrograph timing, and so on (section 3.2), and the modelers could refine their selection if guided by particular operational priorities. In contrast, conventional application of NSE and prediction interval coverage to streamflow time series yields highly aggregated metrics (e.g., Clark et al., 2011;Gupta et al., 2008;Seibert, 2001). For example, NSE and similar metrics are most sensitive to high flow values, which often results in poor prediction of low flows and is undesirable when modeling low-yield ephemeral catchments (e.g., Ye et al., 1998). In the context of this work, Φ NSP and γ 95% do not provide any indication of the added value of using a model to predict the dominant flow characteristics (as represented by the flow index PCs).
Third, the adequacy metrics are probabilistic in nature, whereas many traditional hydrological metrics, most notably the original NSE (Nash & Sutcliffe, 1970), cater solely to deterministic predictions. In this respect, the adequacy metrics contribute toward better probabilistic prediction assessment, which currently includes the probabilistic NSE (Bulygina et al., 2009), the continuous rank probability score (Hersbach, 2000), reliability and precision metrics (McInerney et al., 2017), and other metrics.
However, like any other diagnostic metric, the adequacy tests have their limitations. As highlighted in section 2.2.3, passing the adequacy tests for the flow indices does not guarantee that the model is able to reproduce the flow time series.
Overall, these considerations make the adequacy metrics highly attractive for assessing model performance, especially in ungauged applications.  -7h), where the hydrological and regionalization models are inadequate, the hydrological model error precludes the reproduction of flow time series even when the regionalization model is adequate. In this catchment, an improvement in the regionalization model will lead to better predictions only if the hydrological model is improved first.

Model Performance in the Context of Operational Predictions
The hydrological prediction results obtained in this study can be put in the context of operational work in the same geographical area. In scenario 1, where real data/models are used, the probabilistic NSE of predictions in catchments (treated as ungauged) was in the range [−0.65, 0.68], as seen in Figure 6. In operational studies in the north of Spain, representative ranges of deterministic NSE of predictions in gauged catchments were found to be [0.25, 0.72] for a study in Cantabria (García et al., 2008;Gobierno de Cantabria, 2005) and [−0.4, 0.9] for a study in the Basque Country (Agencia Vasca del Agua, 2003). Considering that this research study calibrated to regionalized flow indices, whereas the operational studies cited above calibrated to observed flow time series data, and that the probabilistic NSE is lower than the deterministic NSE (e.g., by 0.02-0.2 units in scenario 1), it can be seen that comparable or somewhat better performance is achieved in this research study. According to the classification suggested by Foglia et al. (2009) to interpret deterministic NSE values of flow prediction in gauged catchments, the probabilistic NSE values in ungauged catchments achieved in this work are in the range of "sufficient" to "very good" in 13 of the 16 catchments. While this classification is obviously subjective, taken together with the quantitative results cited above, it highlights the promise offered by the prediction in ungauged catchments methods developed in the present paper.

Future Research
First, our application assumes that hydrological model errors are small/negligible compared with regionalization model errors. This assumption follows published work on the conditioning of hydrological parameters to streamflow statistics (e.g., Almeida et al., 2016;Bulygina et al., 2009Bulygina et al., , 2012Yadav et al., 2007). However, this assumption is questionable (see the conclusions in Almeida et al., 2016;Bulygina et al., 2009, and it is of interest to investigate ways to relax it, for example, by considering an ensemble of (sufficiently different) hydrological models. Multiple models can yield insight into hydrological model errors (e.g., Clark et al., 2008;Clark et al., 2015;Fenicia et al., 2011;Wagener et al., 2004;Wagener & Montanari, 2011), and in some cases model ensembles have been shown to reduce predictive errors (e.g., Georgakakos et al., 2004;Hsu et al., 2009;Shamseldin et al., 1997). Future work could consider the use of adequacy metrics on model ensembles.
Second, the error in the regionalized estimates of z could be estimated by expanding the number of residuals draws and looking into individual tree predictive errors (not the entire forest average). This approach would reduce the computational cost of the estimation at the expense of larger regionalization errors.
Computational cost is not a major aspect in this work (less than a second per catchment) but could be important in some other applications.
Third, the sensitivity of the predictions of the flow dynamics (hydrograph) to the quantity and quality of information in z warrants further investigation. For example, the relationship between the variance represented by z and the information needed to reproduce the hydrograph is of interest, as well as understanding how this relationship depends on the type of catchment.
Finally, the analyses presented in this paper could be extended to a larger set of catchments, and the model results verified more comprehensively using cross-validation over more than the 16 catchments used here. This was not possible in this study due to the lack of synchronized data.

Conclusions
This study offers two advances to flow prediction in ungauged catchments: (1) combination of a regionalization method, implemented using the machine learning technique RF augmented with a probabilistic residual error model, with a Bayesian inference formulated for regionalized PCs of a set of flow indices, and (2) development of model adequacy tests, namely, DistanceTest and InfoTest, computed using the regionalized flow index PCs, to provide an a priori indication of the ability of a hydrological model to predict flow time series in an ungauged catchment. More specifically, in a given ungauged catchment, DistanceTest quantifies whether a model (hydrological and/or regionalization) is likely to reproduce regionalized flow index PCs, and InfoTest quantifies whether a model adds information about flow index PCs beyond what is already known from prior knowledge (here the ranges of flow index PCs in gauged catchments).
An empirical case study based on 92 catchments in northern Spain is presented, where the proposed methods are tested on 16 catchments treated as ungauged. The following findings are obtained: 1. The high-dimensional space of flow indices is reduced substantially via PC analysis and the broken-stick method, from 103 indices to just 4 dominant PCs that explain 87% of the variance in the indices. 2. The errors in the regionalized flow index PCs estimated using RF regression are of the order of 12-16% of the observed values. Regionalization via RF regression is adequate in 12 of the 16 catchments (75% of catchments). 3. The first four PCs regionalized via RFs provide sufficient information for predictions of flow time series in many (though not all) of the case study catchments (treated as ungauged), with probabilistic NSE values ranging from −0.65 to 0.68. A preliminary comparison with selected operational water resources studies on gauged catchments in the same geographic area suggests broadly comparable performance, with opportunities for further improvement. 4. The adequacy tests in the flow index PC space are indicative of model performance in the flow time series space. The following insights are attained: a. When a hydrological model is adequate (i.e., passes the adequacy tests), improvements in the regionalization model translate into improvements in flow predictions. b. When a hydrological model is inadequate, improvements in the regionalization model do not systematically propagate into improved flow predictions. The model adequacy tests can be considered a prerequisite for a hydrological model to attain meaningful and high-quality flow time series predictions in ungauged catchments. 5. The adequacy tests yield useful insights that can help identify dominant sources of predictive error: hydrological model, regionalization model, or both. This error attribution can help prioritize future studies and developments.
An important limitation identified using model adequacy tests and flow time series metrics is the poor performance of the hydrological model PDM forced with observed rainfall and estimated PET in the case study catchments. In these catchments, PDM struggles to reproduce the hydrological characteristics given by the observed flow index PCs. Note that this study has not attempted to separate total flow error into individual contributions due to model structural errors versus the effects of observational data errors. Overall, current results suggest that priority should be given to improving the hydrological model structure and its inputs (including a better characterization of predictive uncertainty) and then to improving the regionalization model.