Consistency of climate change projections from multiple global and regional model intercomparison projects

We present an unprecedented ensemble of 196 future climate projections arising from different global and regional model intercomparison projects (MIPs): CMIP3, CMIP5, ENSEMBLES, ESCENA, EURO- and Med-CORDEX. This multi-MIP ensemble includes all regional climate model (RCM) projections publicly available to date, along with their driving global climate models (GCMs). We illustrate consistent and conflicting messages using continental Spain and the Balearic Islands as target region. The study considers near future (2021–2050) changes and their dependence on several uncertainty sources sampled in the multi-MIP ensemble: GCM, future scenario, internal variability, RCM, and spatial resolution. This initial work focuses on mean seasonal precipitation and temperature changes. The results show that the potential GCM–RCM combinations have been explored very unevenly, with favoured GCMs and large ensembles of a few RCMs that do not respond to any ensemble design. Therefore, the grand-ensemble is weighted towards a few models. The selection of a balanced, credible sub-ensemble is challenged in this study by illustrating several conflicting responses between the RCM and its driving GCM and among different RCMs. Sub-ensembles from different initiatives are dominated by different uncertainty sources, being the driving GCM the main contributor to uncertainty in the grand-ensemble. For this analysis of the near future changes, the emission scenario does not lead to a strong uncertainty. Despite the extra computational effort, for mean seasonal changes, the increase in resolution does not lead to important changes.


Introduction
Regional climate change information is increasingly being demanded by different vulnerability, impact and adaptation (VIA) research communities (Hewitson et al. 2013). This information is required to feed models which, eventually, will produce information for specific sectors (health, energy, food availability, risk management, water resources) and enter decision-making processes at different levels (Huynen and Martens 2015;Koutroulis et al. 2015;Giannini et al. 2016). The distillation of information out of the huge amount of available data is a technical and ethical challenge (Hewitson et al. 2013). A key approach to produce future climate scenarios is the use of numerical models. A recent series of international projects and initiatives have produced, using both global (GCM) and regional (RCM) climate models, a huge ensemble of future climate projections, which samples most of the uncertainties affecting climate change. The model development cycle, along with the periodic Intergovernmental Panel on Climate Change (IPCC) reporting cycle, impose a rhythm in climate scenario production which can hardly be followed by the scientific community feeding from them. Specific technological infrastructures, such as the Earth System Grid Federation (ESGF; Williams et al. 2015)), have been developed to face the data storage and discovery challenge posed by the 5th Coupled Model Intercomparison Project (CMIP5) feeding the last IPCC Assessment Report (AR5) (Stocker et al. 2013). Each new initiative increases model complexity and/or spatial resolution, resolving more and more processes relevant for the anthropogenic climate change assessment. Data users switch to the latest available products, without exhausting previously existing databases. In contrast, this work combines 6 ensembles of future climate projections from different model types and generations, showing common and conflicting messages that arise from very simple analyses.
VIA research communities are usually advised to consider an ensemble of model projections in order to propagate the uncertainty arising from different greenhouse gases (GHG) scenarios and climate simulation tools. In the past, data availability usually drove the selection of models, e.g. researchers used models from their own institution (Easterling et al. 2001;Knowlton et al. 2007), from a project consortium (Dessai 2003;Sima et al. 2015), or data from the few models openly available (Kalkstein and Greene 1997;Rötter et al. 2013). Since 2004, the advent of the ESGF, climate change scenario model output data can be accessed homogeneously, through a single access infrastructure and in a common data format which eases the processing of any number of models and ensemble members. The use of the ESGF is still complex and slow for an average user, and a number of tools have been developed to bridge the gap with the users; for example, climate4impact (Plieger et al. 2015) or ESGFToolsUI (Terry 2014). Despite these technical advances to access data, fundamental transdisciplinary issues might still hamper a proper usage of regional climate data (Rössler et al. 2017;Hewitson et al. 2017).
An important gap in the provision of regional climate change information is the scale mismatch between GCM model output and the spatial scale of most climate impact applications. Downscaling techniques (Wilby and Wigley 1997) have been developed in the last decades to bridge this gap by means of two main approaches: dynamical (RCMs; (Laprise 2008;Rummukainen 2010;Giorgi and Gutowski 2015) ) and empirical-statistical (Maraun et al. 2010;Gutiérrez et al. 2012). The empirical-statistical downscaling (ESD) builds empirical relationships between large-scale and local observations, which are exploited to generate local climate information out of coarse GCM projections. Unfortunately, the availability of comprehensive ensembles of future climate projections derived from ESD is very scarce to date, although initiatives such as VALUE (Maraun et al. 2015) and the COordinated Regional climate Downscaling EXperiment (CORDEX; (Giorgi and Gutowski 2015)) are currently devoting a strong effort to this approach. Dynamical downscaling through regional climate models, on the other hand, has been coordinated during the last two decades, particularly over certain regions. In Europe, a series of chained projects (Regionalization (1993(Regionalization ( -1994, RACCS (1995RACCS ( -1996, MERCURE (1997MERCURE ( -2000, PRUDENCE (2001PRUDENCE ( -2004 and ENSEMBLES (2004ENSEMBLES ( -2009) coordinated a regional climate modelling community, nowadays under the umbrella of CORDEX forming the EURO-CORDEX ) and Med-CORDEX (Ruti et al. 2016) communities.
In parallel to international initiatives, some national efforts are also producing downscaled regional climate projections. Some examples are the DRIAS project for France (Lémond et al. 2011), KNMI'15 for the Netherlands (Tank et al. 2015), UKCP for the United Kingdom (Murphy et al. 2009) or PNACC-2012 for Spain (Gómez et al. 2016;San-Martín et al. 2017). We included dynamical downscaling (ESCENA project, see Sect. 2.2) results from the latter, given that it covers our target region. Other regional, national climate change projection studies rely on data from international initiatives and usually focus on a single data source: Soares et al. (2014) for Portugal/ENSEMBLES, Ouzeau et al. (2016) for France/EURO-CORDEX or Kis et al. (2017) for Carpathian region/ENSEMBLES. Some studies combine a couple of initiatives, but this is not common practice. For example,  and Tolika et al. (2012) consider ENSEMBLES andPRUDENCE, Jacob et al. (2014) compare EURO-CORDEX and ENSEMBLES. Data from the global models used to drive regional projections are seldom considered (Turco et al. 2013;Kjellström et al. 2017).
The objective of this work is twofold: 1. Compare climate change signals arising from different recent ensembles of regional climate projections produced by numerical climate models. The comparison will also consider other factors sampled in the grandensemble, such as different GCMs, RCMs, RCM spatial resolutions, GHG scenarios, etc. 2. Provide basic climate change scenarios for continental Spain and the Balearic Islands. The body of the paper focuses on specific results illustrating the grand-ensemble heterogeneity, while the Electronic Supplementary Material provides results for other seasons, variables or factors.
The paper is structured as follows: Sect. 2 summarizes the data gathered for this study. The data processing and analysis methodology is described in Sect. 3. The results Sect. 4 introduces spatially-averaged delta changes for precipitation and near surface temperature (Sect. 4.1), which are decomposed in different uncertainty sources in Sect. 4.2. Cautionary remarks on the use of delta changes (Sect. 4.3) and conflicting projections (Sect. 4.4) are provided next. The paper closes with a discussion (Sect. 5) and a brief summary of the main conclusions (Sect. 6).

Data
This study uses only dynamical (global and regional) model output from transient climate change simulations. Observations and evaluation simulations (using reanalysis data as boundaries) were not considered, but all models presented have been evaluated elsewhere (see below). A total of 196 future climate projections ( Fig. 1) for Spain have been collected from the global (CMIP3 and CMIP5) and the regional (ENSEMBLES, ESCENA, EURO-COR-DEX and Med-CORDEX) initiatives described below. This is, to our knowledge, the largest ensemble of scenarios ever considered for a region.

ENSEMBLES
As part of the ENSEMBLES FP6 project (van der Linden and Mitchell 2009), a multi-model ensemble of RCMs was produced. An evaluation of this ensemble forced by boundary conditions from reanalysis data for precipitation and maximum and minimum temperature over Europe can be found in Kjellström et al. (2010). The quality of the climate change simulations for these variables has been reported in several project technical reports (http://ensem bles-eu.metof fice.com/tech_repor ts.html) and elsewhere for particular RCMs or variables (Boberg et al. 2009;Kjellström et al. 2011). 13 RCMs from ENSEMBLES were used in this study, mostly at a horizontal resolution of about 25 km, but also at about 50 km. Boundary conditions for these RCMs were taken from 7 CMIP3 GCMs forced by the A1B emissions scenario. Publicly available data were obtained from the ENSEMBLES RCM data archive at the Danish Meteorological Institute (DMI). Project ESCENA (2008 was part of the Spanish Strategic action on energy and climate change (http://meteo .unica n.es/proje cts/escen a). Funded by the Spanish government, this call aimed at providing the scientific basis for the assessment of regional climate change impacts over Spain (including continental Spain, the Balearic and Canary Islands). This project agglutinated the regional climate modelling efforts of this action and coordinated four different institutions using four different RCMs: Universidad de Castilla-La Mancha (PROMES model), Universidad de Murcia (MM5 model), Universidad de Alcalá de Henares (REMO model) and Universidad de Cantabria (WRF model). Two versions of the WRF-v3.1.1 model were considered, WRA and WRB, which differ in the planetary boundary layer physics scheme (using a local vs. non-local closure, respectively). The boundary forcing came from CMIP3, as in ENSEMBLES, and the horizontal resolution was also about 25 km. The historical and scenario simulations span the period 1951-2050. A brief description of the RCMs used in the project and an evaluation of their performance under present climate conditions in terms of precipitation and temperature can be found in Jiménez-  and Domínguez et al. (2013). Gómez et al. (2016) evaluated surface wind according to this data set and investigated future changes.

EURO-CORDEX
As part of the global CORDEX initiative, EURO-COR-DEX provides climate scenarios for Europe mainly at two resolutions, ∼ 12 km (EUR-11) and ∼ 50 km (EUR-44), complementing previous data sets from ENSEMBLES. The regional simulations are based on CMIP5 global climate projections (Taylor et al. 2011), forced by different representative concentration pathways (RCPs; van Vuuren et al. 2011). In this case low (RCP 2.6), midrange (RCP 4.5) and high (RCP 8.5) emissions scenarios are considered depending on the model. Previous studies have evaluated the EURO-CORDEX results in present day climate Kotlarski et al. 2014) or compared the climate change projections to those from ENSEMBLES  for two key variables for impact assessment studies: temperature and precipitation. Kjellström et al. (2017) consider, additionally, near-surface wind changes. Overall, the EURO-CORDEX evaluation highlights the general ability of the RCMs to represent the basic spatio-temporal patterns over Europe with some limitations for selected metrics, regions and seasons . It was also found an agreement between EURO-CORDEX and ENSEMBLES results .
10 RCMs from EUR-44 resolution and 8 from EUR-11 were considered in this analysis (Fig. 1). A few more simulations were considered since the beginning of this study, but during this time they were removed from the server due to different problems detected (modellers communication to EURO-CORDEX community). The volatility of these data is discussed later on in this work. This study finally considers only those RCM simulations available through the ESGF as of April 30th, 2017.

Med-CORDEX
Med-CORDEX is a coordinated contribution to the COR-DEX initiative focused on the Mediterranean basin. The particularity of Med-CORDEX is that it includes regional atmospheric, land surface, river and oceanic climate models and coupled RCMs (Ruti et al. 2016). Evaluation simulations from Med-CORDEX are analysed by Dell'Aquila et al. (2016) and compared to the previous generation simulations from the ENSEMBLES project. Ayar et al. (2016) evaluate the precipitation of three RCMs from Med-CORDEX, two from EURO-CORDEX and six statistical downscaling methods. There are a few other evaluation studies for particular Med-CORDEX models and/or regions (Tramblay et al. 2013;Flaounas et al. 2013;Stéfanon et al. 2015;Liguori et al. 2017).
Currently, most of the simulations available at the Med-CORDEX database (http://www.medco rdex.eu) are from atmospheric models. We considered three RCMs and one fully-coupled (atmosphere-land-ocean) model for 50 km resolution (MED-44) and two RCMs for 12 km resolution (MED-11).

Global model data
Finally, we considered also the output from the GCMs used as driving boundary conditions for the RCM simulations in the previous projects/initiatives. They were downloaded from different repositories. CMIP3 data (Meehl et al. 2007) was mainly obtained from the Climate and Environmental Retrieval and Archive (CERA) database . This is the driving data for the ENSEMBLES and ESCENA projects. CMIP5 data were obtained through the ESGF. Model data from this experiment drove EUROand Med-CORDEX regional simulations. Some driving models were not publicly available and, therefore, are missing in this study (See empty positions in the first column of Fig. 1).

Methodology
There is a high number of dimensions involved in regional climate change projections. Giorgi et al. (2008) envisaged them as components of a regional climate change "hypermatrix", which is now at the core of the CORDEX framework. One can (should) consider different: Concentration scenarios, GCMs, RCMs, RCM resolutions, GCM (RCM) realizations, GCM (RCM) versions, etc.; each spanning a different uncertainty, which depends on the projection horizon (Hawkins and Sutton 2009;Déqué et al. 2012). Uncertainties can be properly evaluated using ensembles of simulations (Palmer 2000), if these are sufficiently large and well designed (Daron and Stainforth 2013). However, even when projecting the available ensemble on just two dimensions, such as GCM and RCM, the phase space spanned is relatively empty (Fig. 1). Moreover, the analysis can focus on different variables, seasons, future time-slices, statistics (mean, variability, extremes,…).
With this perspective, we focus on a simple approach. We examine only precipitation and mean surface temperature, usually considered in the applications and climate change projection summaries, although we discuss (Sect. 5.2) the limited view these two variables provide. We consider only the mean climate, in particular, seasonal climatologies of 30-year periods.
We concentrate on a near future (2021-2050), which maximises the ensemble members available (limited by ESCENA and some ENSEMBLES RCMs). We will see that this also has the advantage to consider only weak GHGconcentration differences. The response to these weak differences is smaller than the internal model variability. Pooling these projections is, therefore, closer to a random sample from the same population (of realizations from the same model climate), than in far future periods, when the different scenarios clearly represent different populations and cannot be easily merged (what is the weight of each GHG scenario?).
We selected the reference historical period 1971-2000 and consider the standard delta-change approach (Räisänen, 2007) to explore future climate change projections. We highlight some of the caveats behind this approach in Sect. 4.3.
Even though we pool the ensembles from different sources in a grand-ensemble, in recognition of potentially different populations being merged, we avoid grand-ensemble summaries (means, PDFs) and illustrate individual members factorized by the many dimensions considered (GCM, RCM, project/initiative, resolution,… ). We warn against grand-ensemble summaries by showing the split of bimodal PDFs under some factorizations. The lack of independence (complex and unknown dependence among projections) and the incompleteness of the interactions between dimensions prevents the use of many standard statistical tools. Instead of summary/factorial statistics, we favour graphical displays of the full grand-ensemble.
For brevity, we show mainly summer, June through August (JJA), results. The Electronic Supplementary Material provides equivalent results for other seasons, which are commented where appropriate and, in any case, left as reference for downscaled scenario users interested in this region. Figure 2 shows that in JJA occur the largest differences among projects.
All data were interpolated by a nearest neighbour algorithm from their different projected grids to a regular 0.2 • latitude-longitude grid spanning continental Spain and the Balearic Islands. Spatial averages (e.g. those in Sect. 4.1) consider only these land areas.

Projected delta changes for precipitation and temperature
Projected delta changes for precipitation and temperature are summarized for the different model intercomparison projects (MIP) in Fig. 2. To emphasize the inherent uncertainty in these projections, no central estimate is provided. The ranges encompassing 50 and 90% of the projections for each MIP are provided instead, split into forcing GCMs and dynamically downscaled (RCMs) results. Overall, the different MIPs agree, showing the most prominent change during summer (JJA) for both variables. Some differences can also be identified across MIPs, e.g. the extended upper end of the temperature change range during spring (MAM) for the CMIP3-based projections (ENSEMBLES and ESCENA), or the less dry summer projection in ESCENA. However, one of the most striking features of Fig. 2 is the systematic reduction of the temperature change by the RCM ensembles with respect to their driving GCMs. No MIP projected a single range limit higher than their driving GCM ensemble. For precipitation, this is not systematic, and RCMs even show opposite effects on the deltas in different seasons: drier summers and wetter winters than in the driving GCMs. We look into individual ensemble members to trace the origin of these summary statistics. The analysis of the full 196-member ensemble ( Fig. 3a) shows that 50% (90%) of the simulated summer near surface temperatures project an increase of 1.3-2.1 K (1.0-2.6 K) by 2021-2050, with respect to the average of the 1971-2000 period, on average for continental Spain and the Balearic Islands. The median increase is about 1.7 K, but this value is misleading, since the sample of projections is slightly bimodal, and there are relatively few projections close to this value.
Marginal results for precipitation show that more than 75% of the models project a decrease in summer rainfall, with a median decrease of 11%. Even though this decrease could be not statistically significant, what is clear is the inverse relationship between precipitation and temperature, typical of moisture-limited land-atmosphere feedback regimes (Seneviratne et al. 2010;Jerez et al. 2013;Stegehuis et al. 2012), which respond to increasing temperatures by drying out the soils and limiting moisture availability for summer precipitation. Causality is always difficult to establish (a decrease in precipitation also leads to increasing temperature by reducing soil moisture, thus increasing the Bowen ratio), but the relationship in Fig. 3a is clear. The relationship also emerges during spring (MAM) and slightly in autumn (SON) but, as expected, is absent during winter (DJF) (Fig. ESM2). In winter, an energy-limited regime prevails and frontal precipitation feeds from moisture advected from the Atlantic ocean and less from local evaporation (Rios-Entenza et al. 2014).
The bimodality of the projected near surface temperature can be partly explained by the driving GCM. Moreover, GCM "families" tend to cluster and provide high or low delta values (Fig. 3b). In particular, due to their different climate sensitivities, models developed by the EC-EARTH international consortium and the Max Plank Institute (MPI) for Meteorology in Germany give rise to deltas in the lower range (median about 1.5 K), while Hadley Centre MetOffice (UK) models project the largest deltas (median close to 2.3 K). The "Other" family labelled in Fig. 3b includes the rest of the models, which contribute fewer members to the grand-ensemble, and consist of driving GCMs developed in research centres across the world (France, Norway, Canada, USA, Japan, Australia). The clustering in temperature change induces a clustering in precipitation according to the relationship previously mentioned. Most Hadley Centrederived projections show a precipitation decrease, while the EC-EARTH family shows mixed sign projections. Interestingly, the MPI model family shows lower end temperature deltas, but mostly precipitation decreases. The leading role of GCM family in partitioning the delta range is also clear in other seasons (not shown).
Other factors, such as source project ( Figure ESM1b) or model resolution ( Figure ESM1c) do not lead to specific clusters. The GHG scenario leads to some clusters (Figure ESM1d), but not consistent with their radiative forcing. A high-emission scenario, such as SRES A2, lies at the lower end of the temperature delta range. Very low concentration scenarios (RCP2.6) do not rank at the lower end and intermediate-to high-concentration scenarios (A1B, RCP4.5, RCP8.5) span the whole temperature delta range. Central 50% (25th to 75th percentile) and 90% (5th to 95th percentile) ranges for the delta changes spatially averaged over Continental Spain and the Balearic Islands. Changes are shown for nearsurface temperature (left, in K) and precipitation (right, as percent change). The changes projected by different sub-ensembles are shown in rows. Each project (ENSembles, eSCeNa, EuroCordeX, MedCor-deX) is divided into the GCMs used to drive the regional projections and the RCM results. The last rows (All) show grand-ensemble results, separately for GCMs and RCMs Incidentally, the coldest and the warmest member of the ensemble is driven by the same scenario (A1B). Therefore, scenario forcing does not seem to be playing an important role in the near future projections.
Global vs. regional climate model projections show differences in near future temperature which are likely statistically significant (Fig. ESM1f). They arise from an upper end of the temperature delta range mostly populated by GCM projections and a lower end populated by RCMs. In precipitation, the driest future projections are provided by RCMs. Statistically sound tests could be applied to assign a likelihood to the occurrence of such differences in two random samples if their population means were equal. However, we will see in the next section that the deltas are not a random sample, but the result of very specific choices in the driving GCMs used in the different projects.

Uncertainty cascade
The distribution of temperature delta projections is clearly different in the direct GCM model output (median above 2 K) and the dynamically downscaled projections (median 1.6 K; see empirical PDF estimates in Fig. 4). Given the non-systematic GCM-RCM combinations in our ensemble ( Fig. 1), the behaviour of individual nestings is of interest.
In order to decompose the uncertainty cascade down to individual projections, we start decomposing by GCM, given that this is the main source of uncertainty identified so far. Figure 4 (top) shows the JJA near future temperature uncertainty cascade considering only GCM projections. The mean JJA delta derived from the ensemble of GCM projections (47 members) is 2K. Ensembles of GHG scenario forcing for each GCM are averaged next and coloured according to their mean JJA temperature delta. The GCM delta ranking shows the 4 Hadley Centre models in the top 8, while BCM projects the smallest delta change (0.75 K, the only one below 1 K). Individual members are split when reaching the scenario step. Although, in general, higher radiative forcing scenarios from a single GCM lead to warmer projections, this is not systematic. This reinforces the view, for this time horizon, of GHG scenario as a smaller source of uncertainty as compared to internal model variability. For instance, EC-EARTH r1 and r12 members forced by RCP4.5 show larger differences than EC-EARTH-r12 forced by RCP2.6 and RCP4.5.
Dynamically-downscaled projections usually lead to smaller JJA temperature deltas than their driving GCM projections in this area (Fig. 4, bottom). This is also true for other European regions (Kjellström et al. 2017). Specific GCM choices tend to enhance this effect. For instance, the GCM showing less warming (BCM) was downscaled by three different RCMs, increasing its weight in the ensemble. The GCM delta ranking is only slightly altered by (a) (b) Fig. 3 Precipitation vs mean surface temperature delta changes spatially averaged over Continental Spain and the Balearic Islands for each ensemble member. In a, marginal probability density functions are shown for each variable pooling the whole 196-member ensemble, using a Gaussian kernel density estimator. Different shading shows the ranges from the 5th to 95th percentile (90% of the sample) and from the 1st to 3rd quartile (50%). The median is also shown as an inner tickmark. In b, the scatter plot and PDFs are factorized by GCM family (see text). In this case, only the quartiles and median are shown (as tickmarks). Raw GCM output deltas are shown as empty circles downscaling. RCMs nested into the same GCM tend to produce similar deltas (small spread at the RCM step), although the are exceptions. WRF-and MM5-based projections tend to give rise to smaller delta changes than other models nested in the same GCM. In the next section, we will see that this is linked to absolute temperature biases.
Scenario and RCM resolution (0.44, 0.22 or 0.11) make little difference (small spread at those steps) in the countrylevel-averaged projected delta changes. The only exceptions, where the scenario step shows some spread, are inherited from the GCM (e.g. from large differences between EC-EARTH-r12 RCP2.6/4.5 and RCP8.5).

Delta method caveats
The so-called delta method (Räisänen 2007) has been used as a standard method to present future projections from our ensemble. In the simple terms applied here, we assume that the difference in future minus reference mean climate will cancel out likely model errors. This is related to bias correction methods. In particular, delta changes are insensitive to local shift bias correction methods, the simplest correction applied to temperatures to match the modelled and observed climatology. Relative delta changes are insensitive to local scaling, usually applied to precipitation due to its statistical distribution and to preserve its lower bound. More sophisticated bias correction methods could be applied, such as quantile mapping. However, for those methods, bias corrected and delta change projections differ (Ho et al. 2012;Räisänen and Räty 2013), leading to a new source of uncertainty.
The contribution of sophisticated bias correction methods to climate change signals is still a field of active study (see e.g. (Gobiet et al. 2015;Casanueva et al. 2017;Turco et al. 2017), and references therein). However, there are clear indications that the simple delta method has a limited accuracy. An illustration is shown in Fig. 5, where JJA near surface temperature deltas are plotted as a function of mean near surface temperature (relative to the coldest estimate) during the reference period. One striking fact is that the 30-year temperature average, spatially averaged over a relatively large region, shows a range of 10 K across the grand ensemble. This range is not the result of some outliers, but it is densely filled at least up to 7 K. The . From top to bottom, for each line plot, members forced by the same GCM are averaged (and coloured by GCM). Next, each line splits into the mean for each RCM nested into that GCM. Next, each line splits into the mean for each scenario. Finally, those lines corresponding to RCMs nested with two different horizontal resolutions in the same scenario and GCM are split. Therefore, at the bottom of each plot, all individual member projections can be found. The right legends show each individual GCM ranked by their projected JJA temperature delta. Additionally, a kernel density estimation of the PDF for each ensemble is shown, shading 90 and 50% sample ranges and marking the ensemble median. Cascades follow the visualization proposed by Ed Hawkins on his Climate Lab Book http://www.clima te-lab-book. ac.uk/2014/casca de-of-uncer taint y range is smaller for GCMs, which start with a minimum anomaly of 1.5 K, probably due to the lower orography. Decadal variability cannot explain this spread, as illustrated by the GCMs with multiple realizations present in our ensemble (Fig. 5). The spread of these sub-ensembles, started from different initial conditions in 1850, is well below 0.5 K after more than a century of simulation. Differences among models dominate the spread in simulated present climate.
Another evident feature is the non-random distribution of delta changes. Warm models during the reference period tend to produce stronger delta changes. This could be explained by soil moisture-temperature feedbacks. Warm (cool) models during summer in this area are likely to be due to dry (wet) soils. The extra radiative forcing from future scenarios goes directly into sensible heat (i.e. temperature increase) for those models with dry soils, while it can be partly used to evaporate soil water (latent heat) in the models that preserve wet soils in the future. The relationship shown in Fig. 5 is, therefore, typical of regions/seasons with transitional soil moisture-temperature feedback regimes between energy-and moisture-limited. Models have shown problems reproducing this transitional regimes (Stegehuis et al. 2012) and slight changes can have drastic consequences on temperature. For instance, the cold summers produced by WRF are due to excessive rainfall and soil moisture (Knist et al. 2017).
The delta change dependence on absolute temperature has been known for some time and there are several proposals to correct (constrain) the projections based on it (Boberg and Christensen, 2012) or some other physically relevant processes (Hall and Qu, 2006;Bellprat et al. 2013;Stegehuis et al. 2013). These corrections are subject to their own uncertainties, since they are based on additional models for the relationship and involve observations, with their own uncertainties.

Conflicting messages
What to expect from downscaling? GCMs provide coarse resolution climate change projections. Individual grid points should not be considered representative of their exact location due to the so-called skilful scale (Grotch and Mac-Cracken 1991;von Storch et al. 1993). Moreover, many important features affecting the climate of a region could be entirely absent in a GCM. For example, the Pyrenees mountain chain or the Balearic Islands are simply not there in many of the GCMs considered in this study. The point of downscaling is combining the climate information at skilful scales of the GCM with local, non-resolved features (orography, land-sea contrasts, etc.). As such, downscaling is not expected to change GCM projections overall for extended regions, but smaller scale changes could appear, as a result of the interaction of the GCM dynamics with the regional characteristics. The ability, or even possibility, of RCMs in changing (improving?) the GCM large-scale circulation is debated (Diaconescu and Laprise 2013;Xue et al. 2014;Hall 2014). The principle of downscaling is to estimate local climate subject to the large-scale climate generated by the GCM. In this sense, strong changes of the overall climate change pattern should be analysed with care. Also, conflicting messages could arise between the GCM and RCM or between several RCMs nested into the same GCM (Turco et al. 2013). These should also be analysed to discern genuine modelling uncertainty from unrealistic response. The line between these two might be difficult to draw, though.
In this work, no attempt is made to assess the added value of downscaling. Added value could be behind some of the RCM projected changes shown next that differ from those projected by their driving GCM. Some of these differences could be interpreted as added value, but this would require a more in-depth, case-by-case analysis and a clear definition of added value (see e.g. for a recent review (Rummukainen 2016)). Figure 6 shows some spatial responses to climate change forcing scenarios by several GCMs and RCMs nested into them. They were selected to illustrate the plausibility of the downscaled response. Each row shows the forcing GCM JJA precipitation relative delta change (first column) and the dynamical downscaling by different RCMs. For example, the top row shows the climate change response by the CNRM-CM5 model (realization r1) to the RCP8.5 scenario. There is a slight southwest-tonortheast dipole, from a summer precipitation decrease to an increase. An overall similar pattern can be discerned in the CCLM-4.8.17 RCM nested into it. However, small scale features appear. CCLM projects a stronger precipitation decrease over southern parts of the domain (especially over the Alboran Sea, just east of the Gibraltar Strait). In EC-EARTH-r12 (second row), the dipole is northwestto-southeast, and this pattern is, again, put into regional context by CCLM, which projects a precipitation increase west of the Balearic Islands, consistent with a similar but coarser pattern in the GCM. This Mediterranean precipitation increase also appears in CCLM when nested into HadGEM2-ES-r1 (third row). However, this global model does not show a trace of this precipitation increase. This increase in summer precipitation over the western Mediterranean Sea could be interpreted either as added value of the dynamical downscaling, responding more properly to the large scale forcing, or as a specific bias of this RCM.
Other RCM (RCA4, in the third column) nested into the same GCMs shows overall similar patterns, but regional details differ instead of reinforce. For instance, a GCMconflicting response appears north of the Alboran Sea when nested into HadGEM2-ES-r1, with a centre of strong precipitation increase where CCLM shows a decrease. Other simulations nested into HadGEM2-ES-r1 also show inconsistent responses so it is likely that a problem exists in the coupling with this GCM. All KNMI runs nesting the RACMO RCM into this GCM have been withdrawn from the ESGF and re-run during the development of this study. The new version (v2 in ESGF, shown in Fig. 6) shows a moderate summer rainfall increase over the Mediterranean Sea, unlike the initial run, which showed larger and more extended increases (not shown).
Several other examples of conflicting messages are shown in Fig. 6 and need to be analysed in detail to find out the causes behind the differences (this is out of the scope of this work). Some of them could be regional added value of the downscaling, while others might simply reflect downscaling deficiencies, which artificially enlarge uncertainties. See, e.g. the increased precipitation developed over the north African coast by CCLM nested into MPI-ESM-LR-r1, over the Cantabrian Sea (north of the Iberian Peninsula) by WRFv3.3.1-F nested into the IPSL-CM5A or the conflicting pattern between WRFv3.4.1-I and RCA4, nested into CanESM2.
An example of physically consistent, but unplausible, delta changes is shown in Fig. 7. RCA4 at 0.11 resolution provides a very local pattern of change (reaching 4K) over the Pyrenees. This elevation-dependent warming (Pepin et al. 2015) pattern appears in every RCA4 0.11 projection, and it is systematically absent in the lower resolution runs (0.44) or in any other RCM projection at 0.11 resolution (CCLM-4.8.17 shown as an example). This pattern is typical, in most models and resolutions, during winter (not shown) and is related to reduced snow cover due to warming and a positive snow-albedo feedback. We checked the snow depth (not shown) and confirmed that RCA4 produces significant snow cover in the Pyrenees during summer when the resolution resolves high terrain. The resolution increase makes the modelled surface temperature cross the threshold to support snow cover, but slightly enough to lie back behind the threshold in the near future projections. Since extended snow cover in the Pyrenees during summer is unrealistic under present conditions, so it is the small-scale climate change pattern depicted by RCA4 0.11 over this area. This feature appears also over the Alpine region (Frei et al. 2018). Of course, during winter, when similar small-scale features are produced by many RCMs, this is a clear added value of downscaling. Given that the smooth orography of GCMs cannot resolve snow accumulating at mountain tops, they miss important local delta temperature changes arising from changes in the snowline. This is another example of a threshold-dependent, nonlinear process breaking the delta method basic assumption that model errors remain constant. Processes depending on absolute thresholds give rise to delta changes which cannot be improved by assuming that model error evolves as some smooth function.

Lack of ensemble design
The data shown in Fig. 1 are essentially all future projections publicly available for our target region in the last decade. GCM output is restricted to those models used as boundary conditions in the downscaled projections. Therefore, the first column could be greatly enlarged (only from CMIP5, there are more than 200 other projections available through the ESGF), but the interior of the matrix would remain empty.
The matrix does not explore RCM internal variability, which could be behind some of the conflicting RCM responses shown in Fig. 6. GCM internal variability is only slightly sampled, even though recent studies show that e.g. the climate change signal for precipitation extremes cannot Fig. 6 Summer (JJA) precipitation relative delta changes (in %) for several members of the EURO-CORDEX ensemble, along with the direct model output (in the corresponding first column) in which the RCMs were nested. Dotted grid cells indicate significant changes with 90% confidence after a t test on the absolute mean difference ◂ be determined reliably from individual simulations nested into different GCMs (Aalbers et al. 2017).
The GCM-RCM matrix is not only mostly empty, but heterogeneously filled, with preferred GCMs and RCMs. A few GCMs were selected by most regional climate modelling groups. For instance: MPI-ESM-LR was downscaled 23 times, ECHAM5 (20), CNRM-CM5 (19), EC-EARTH (18). Few RCMs contribute a large fraction of the grandensemble, notably RCA4 with 36 scenario simulations.
The decision of the GCM to downscale usually falls within two categories: These can also be combined: e.g. in category 1, apart from the obvious institutional interest, we can assume motivations regarding the early and ease of access to full resolution GCM data. RCM simulation consortia including institutions running global simulations (e.g. ENSEMBLES) have traditionally eased the access to GCM boundary conditions. GCM selection criteria in ESCENA also followed ease of GCM data availability by several participating groups. Incidentally (not by design), the GCMs in ESCENA span a wide range of future projections, including the ECHAM and Hadley Centre families. As a result of the above, none of the ensembles considered in this study fully explores future climate change uncertainties. Thanks to GCM boundary sharing within consortia, some uncertainties can be partially explored: In ENSEMBLES, RCM uncertainty can be explored in the ECHAM5-r3 sub-ensemble (five RCMs) or in the Had-CM3Q0 sub-ensemble, with a different set of four RCMs. These are the most populated sub-ensembles to explore RCM uncertainty and their intersection is empty: none of the RCMs downscaled both GCMs.
ESCENA was designed to have a GCM (ECHAM5-r2) downscaled by all RCMs and two RCMs (PROMES and MM5) downscaling all GCMs. Unfortunately, a different ECHAM5 realization (r2) from that in ENSEMBLES (r3) was selected, preventing the exploration of domain position (in REMO) or the inclusion of an RCM (PROMES) in both, the ECHAM5-r3 and HadCM3Q0 sub-ensembles.
A 3-way analysis of variance using full factorial GCM-RCM-Scenario sub-ensembles within ESCENA ( 4 × 2 × 2 ) and EURO-CORDEX ( 4 × 2 × 2 ) shows conflicting results due to the very partial representativity of the full matrix. No significant interaction is found for either two-or three-factor interaction terms. However, in ESCENA, the only significant (confidence above 99.99%) main factor is the RCM, while in EURO-CORDEX the GCM shows the main significant effect on summer temperature deltas (above 99.99% confidence), along with the emissions scenario (above 95% confidence).
The problem to fill a GCM-RCM matrix is not the computational cost. A single EURO-CORDEX 0.11-resolution simulation requires, at least, 8 times as much computer resources as a 0.22 ENSEMBLES simulation. The full GCM-RCM matrix from ENSEMBLES (or an updated version using CMIP5 input and the latest generation RCMs) Fig. 7 Summer (JJA) near surface temperature delta (K) patterns for the direct global model output of EC-EARTH (r12) forced by the RCP8.5 scenario and downscaled by  consists of 143 (11×13) simulations, accounting for an effort of less than 18 EURO-CORDEX 0.11 simulations (of which, at least, 42 have been produced with little chance for systematic comparison). At a much lower cost, a full 0.44-resolution matrix could have been filled.
There are some examples of ensemble design out of Europe: for instance, the NARCliM project in Australia (Evans et al. 2014) or the NARCCAP program in North America (Mearns et al. 2013). These two examples use different design criteria to maximize the resulting information according to practical limitations defined in each case. The selection of models in NARCliM is based on different evaluation metrics and model independence, although some subjective choices were also required. In case of NARCCAP, the selection of 24 nesting combinations was based on a balanced RCM-GCM design.
Up to now, there is no multi-RCM/multi-GCM experiment where all cross-combinations have been simulated (full matrix). In this situation, the effect of having relatively empty GCM-RCM matrices to explore uncertainties cannot be quantified. There are several attempts in the literature to use non-designed ensembles to explore uncertainties (Déqué et al. 2012). Strong assumptions on the independence of the contribution of different uncertainty sources need to be taken.

Observational constraints
Observational constraints are increasingly used to reduce the uncertainty in future projections. In essence, the idea is to use a plot such as Fig. 5, along with the observed anomaly (it would be a vertical line in this Figure, showing the observed temperature anomaly with respect to the coldest simulation) to produce a distribution of future delta changes conditional on the observations. This is a sort of model weighting, favouring the models which better reproduce the reference climate for the particular variable used as abscissa.
Of course, this introduces plenty of new uncertainties. The abscissa (predictor) should be able to provide a relationship as clear as possible with the delta change being corrected. Figure 5 is probably not enough and introduces more errors than improvements in the delta. The predictor should also be a densely observed variable (to be able to produce an observational product comparable to the grid-cell output of a model), and ideally with a low observational uncertainty (otherwise, the vertical line becomes a wide distribution, and the conditional delta distribution tends back to that of the unweighted ensemble). Up to now, observational constraints have relied on predictors such as: temperature trends (Boberg and Christensen 2012), the seasonal cycle of snow albedo (Hall and Qu 2006), precipitation from previous seasons (Quesada et al. 2012), sensible heat flux (Stegehuis et al. 2012), or soil moisture (Bellprat et al. 2013). There is no overall solution, and these procedures work only where and when the relationship found is strong enough.
In any case, these works illustrate that the assumption that delta change cancels out model error is ill posed. Moreover, observational constraints are likely to favour models which reproduce the right predictor for the wrong reason (Räisänen 2007). For instance, radiation biases can easily be compensated by soil moisture to produce reasonable surface temperatures (García-Díez et al. 2015). Therefore, model evaluation and future projection analyses should extend beyond temperature and precipitation to include variables controlling key surface, radiative and dynamical processes that drive the changes. Data availability, both from models and observations, is a challenge for this kind of analyses, which are a natural continuation of this basic study.

Earth System Grid Federation
The ESGF infrastructure (Williams et al. 2015) is a breakthrough in climate data availability and it makes possible unprecedented analyses of multi-model ensembles of climate projections, like the one presented in this work. Based on the experience retrieving these data, we found that the ESGF data discovery services are powerful, but the access and retrieval is still cumbersome because of the file granularity of the data. Accessing and retrieving datasets effectively requires some advanced skills in data analysis and management, including metadata standards and file formats like NetCDF, and handling a huge batch of datasets.
Despite the strong quality controls on metadata prior to data publishing, there is a lack of quality control regarding the match of data and metadata (e.g. units not matching the data order of magnitude) or the completeness of the dataset, either in time (missing years) or variables (despite mandatory variable lists). Many of such mismatches have been reported back to the modelling teams during the development of this study.
These metadata mismatches take time to realize and correct for every user, and mine the core of the ESGF concept of providing homogeneous access to data. Times or variables missing prevent models to enter studies, thus artificially weighting the ensemble. As an example, the direct output for EC-EARTH-r1 RCP4.5 was missing for years 2027 and 2029 and seasonal means used in this work did not consider these years. Currently, one of the priorities for CMIP6 data distribution is the deployment of a new errata service in ESGF to provide a central repository for reporting and accessing documentation related to problems with the data.
ESGF data un-publishing also poses some challenges on reproducibility. During the development of this study, data from several RCMs were un-published. These simulations were excluded from this study, but it is likely that similar problems arise in the future and, then, the work will no longer be reproducible. Also, a mean to link ESGF entries with scientific literature analysing them would ease quality assessment by data users and scientific development. Related to this, Hewitson et al. (2017) have recently shown some key issues in climate information websites, including ESGF, highlighting the need to include user guidance to overcome the confusion arising from the varied messages of different data products. This relevant issue is part of the current priorities for future developments of ESGF metadata services. They include persistent identifiers to each dataset to trace its version history, an early citation service or some provenance capture to ease the traceability and reproducibility of results. See Williams et al. (2017) for more details on the current status and developments of ESGF.

Sub-ensemble selection
Users of future climate change projections are commonly advised to use as many projections as possible, in order to sample the strong uncertainties present. This advise is hardly followed, and recent analyses show that the sub-ensembles selected by users of climate information can be near-random and not fit for their purpose (Hewitson et al. 2017).
Even though ESGF homogenizes the access to data, data availability can still be an issue. There are now dozens of RCM projections over Europe and more are still to come. Studies usually rely on the first models publicly available, leading to a race to publish data, ultimately driven by IPCC assessment report cycles, which drive CMIPs, which drive CORDEX and regional climate change studies.
Despite the diversity of publicly available projections, there are plenty of recent works still relying on a few models (Bavay et al. 2013;Ruelland et al. 2015) or even just one (Tramblay et al. 2013;Lemesios et al. 2016). Burke et al. (2015) show the underestimation of the range of projected climate impacts in 7 well-cited articles that failed to account for multi-model uncertainty. Commonly, there is no mention of the motivation behind the selection of a particular sub-ensemble (e.g. (Räisänen and Räty 2013;Filahi et al. 2017).) The motivation for particular sub-ensembles usually includes (1) good performance in current climate against observations (e.g. (Herrera et al. 2010)) or (2) span a large range of possible future scenarios (e.g. (Bavay et al. 2013).) Both criteria can hardly be met simultaneously, given that the largest spread usually comes from outliers (Fig. 5), and opposite outliers cannot simultaneously fit the observations.
The assumption that good performing models under current climate will project better climate changes can easily be challenged due to differing relevance of parameterized processes in the present and in the future (Jerez et al. 2013) and a potential good performance due to the wrong reason (Räisänen 2007;García-Díez et al. 2015). In a recent work, Monerie et al. (2016) show that present-day biases in precipitation and temperature are not good metrics for the credibility of future projections and propose other methods based on clustering.
Sub-ensemble selection is an open and complicated problem. Unfortunately, the alternative-use all projections available (see e.g. (Burke et al. 2015))-is no better: as we show in this study, the full ensemble is dominated by somewhat arbitrary decisions on which GCMs were downscaled or which RCMs produced more projections, thus weighting the full ensemble towards particular GCM/RCM deficiencies. Our recommendation would be to cherry-pick GCM/ RCMs based on their ability to represent the process of interest in current climate. Not just the variable of interest (e.g. precipitation or temperature) but the mechanisms behind their variability (circulation, water/energy budgets,… ). The credibility of the dynamical downscaling response (Sect. 4.4) can also be used as guidance.

Summary and conclusions
We presented an ensemble of future climate change projections of unprecedented size (196 members) combining several dynamical downscaling projects and initiatives. All data used are publicly available and include not only the high resolution RCM projections, but also their driving GCMs. For simplicity, this initial study has been limited to precipitation and temperature changes in the near future period 2021-2050. The target region selected is continental Spain and the Balearic Islands.
The GCM-RCM combination matrix is sparse and all initiatives, except ESCENA, lacked a systematic design to explore the uncertainty sources (GCM, RCM, scenario, resolution,… ) and their relative contribution to the total uncertainty range. In this scenario, we avoided summary statistics and favoured graphical displays of the full ensemble.
Near-future projections of precipitation and temperature essentially agree across the different initiatives, based on different model generations. The summer season shows the largest differences across initiatives and we focused on it. There is a common tendency in the different initiatives to project smaller temperature changes by the RCMs than their driving GCMs. This is partly explained by an uneven downscaling, favouring low climate sensitivity GCMs, and some particularly cold-biased RCMs (WRF, MM5).
The main contributor to the uncertainty range in this area is the GCM, which dominates the resulting delta changes. This agrees with previous studies (Déqué et al. 2012;Rajczak et al. 2013), although they establish that this result depends on the region (southwestern Europe), and also on the season (Christensen and Christensen 2007;Mearns et al. 2013). We also found particular full factorial sub-ensembles (e.g. in ESCENA [CNRM-CM,ECHAM5]× [PROMES,MM5]×[A1B,B1]) where the RCM dominates the uncertainty range. Emissions scenario and RCM resolution have a much lower contribution to uncertainty; the former should take prevalence on longer-term projections (Hawkins and Sutton 2009).
This large ensemble shows a clear relation between projected temperature changes and present-day temperature biases, indicating an overestimation of the uncertainty range by the standard delta method. Several methods have been proposed in the literature to alleviate this problem (Hall and Qu 2006;Quesada et al. 2012;Boberg and Christensen 2012;Stegehuis et al. 2012;Bellprat et al. 2013) and this ensemble is an ideal testbed for their comparison. However, this is out of the scope of the current study.
Focusing on specific projections, we found several conflicting results between RCM projections and their driving GCM, and between RCMs nested to the same GCM. These conflicts need to be further analysed to identify the causes behind them, and discern if they consist of added value of downscaling and a genuine contribution to uncertainty. We checked one of these conflicts, linking it to unrealistic summer snow cover on the Pyrenees, but a full identification of the causes behind conflicting messages is beyond this study.
Finally, we discuss on the process to obtain and homogenize the large ensemble of projections used in the study, which took a large amount of time, unveiling the potential of the ESGF, but also drawbacks which might affect the reproducibility of results. This initial study just covers an overview of the potential of this multi-initiative ensemble, which can feed many future studies, e.g. on the impact of sub-ensemble selection strategies, observational constrain of the projected uncertainty, explanation of conflicting messages, etc., which are currently planned and ongoing.