Show Summary Details

Page of

Printed from Oxford Research Encyclopedias, Climate Science. Under the terms of the licence agreement, an individual user may print out a single article for personal use (for details see Privacy Policy and Legal Notice).

date: 22 June 2021

Clustering Techniques in Climate Analysislocked

Clustering Techniques in Climate Analysislocked

  • David M. StrausDavid M. StrausGeorge Mason University


This is an advance summary of a forthcoming article in the Oxford Research Encyclopedia of Climate Science. Please check back later for the full article.

Clustering techniques are used in the analysis of weather and climate to identify distinct, discrete groups of atmospheric and oceanic structures and evolutions from observations, reanalyses, and numerical model simulations and predictions. The goal of cluster analysis is to provide physical insight into states (and trajectories) that are preferred and also possibly unusually persistent, when such states can be identified and distinguished from the continuous background distribution of geophysical variables. “Preferred” states (or evolutions) are those that are significantly more likely to occur than would be predicted by a suitable background distribution (such as a multivariate Gaussian distribution), while “persistent” states are those with lifetimes distinctly longer than those of the background states.

The choice of technique depends to a large extent on its application. For example, the identification of a small number of distinct patterns of the seasonal mean mid-latitude response to large seasonal mean shifts in tropical diabatic heating (perhaps due to the El Niño–Southern Oscillation) can be accomplished with the use of either a partitioning or hierarchical cluster analysis. The partitioning cluster method groups all states (maps of a given variable) into clusters so as to minimize the within-cluster variance, while the hierarchical analysis merges fields into groups based on their similarities. The identification of preferred patterns (whether or not they are tropically forced) on intra-seasonal time scales can also be accomplished in this way. The partitioning approach can easily be adapted to include multiple variables, and to describe tracks of localized features (such as cyclones). A variant of the partitioning cluster analysis, the “self -organizing map” approach, allows for a greater richness in cluster patterns and so can be useful on shorter, weather-related time scales.

In either the partitioning or hierarchical analysis, each state (map) is identified uniquely with a given cluster. However, in certain applications it may be desirable to allow a given state to belong to multiple clusters with differing probabilities. In such cases one can estimate the underlying probability distribution function with a mixture model, which is a sum of a (usually small) number of component multivariate Gaussian distributions.

The partitioning, hierarchical, and mixture model approaches, applied to a sequence of maps, all have one common feature: the sequencing (order in time) of the maps is not taken into account. This is not the case with the hidden Markov method, an approach that identifies not only preferred states but ones that are also unusually persistent. This approach, based on a simple neural network approach, makes use of an underlying “hidden variable” whose evolution is modeled by a Markov process. Each state is assigned to a number of clusters with a certain probability, but the most likely evolution of states from one cluster to another can be estimated. This approach can be generalized by letting the evolution of the hidden state be governed by a nonstationary multivariate autoregressive factor process. The resulting cluster analysis can then also detect long-term changes in the population of the clusters.


  • Climate Systems and Climate Dynamics

Overview of Cluster Analysis

The Use of Cluster Analysis in Climate Science

Cluster analysis is used in the analysis of weather and climate to identify distinct, discrete groups of atmospheric circulation states, and sequences of states (also called trajectories), from observations, reanalyses, and numerical model simulations and predictions (Ghil & Robertson, 2002; Hannachi, Straus, Franzke, Corti, & Woollings, 2017; Straus, Molteni, & Corti, 2017). (Reanalyses are sequences of full three-dimensional atmospheric and oceanic states constructed by combining observations with short-range model forecasts.)

One of the goals of cluster analysis is to provide physical insight into the states that are preferred and also possibly unusually persistent (Kimoto & Ghil, 1993a), when such states can be identified and distinguished from the continuous background distribution of geophysical variables. Preferred states (and their sequences) are those that are significantly more likely to occur than would be predicted by a suitable background distribution, such as a multivariate Gaussian distribution (Toth, 1993), while persistent states are those with lifetimes distinctly longer than those of the background states (Dole & Gordon, 1983; Vautard & Legras, 1988; Toth, 1992).

Preferred states arise due to mechanisms that operate on multiple timescales, as reviewed by Hannachi et al. (2017). On seasonal and inter-annual timescales these states are related to the interaction of Rossby waves forced by topography (mountain heights) and the distribution of land and ocean surface temperatures, and in midlatitudes to the remote forcing by large anomalies in seasonal mean tropical diabatic heating, such as those due to the El Niño Southern Oscillation. On intra-seasonal timescales, preferred states arise due to the interaction of rapidly propagating transient eddies associated with midlatitude storms with regional jet streams, and with Rossby waves forced by transient tropical diabetic heating.

A separate goal of cluster analysis is to provide a physical classification of weather and climate patterns for several purposes. One is to provide a good classification of the basic surface climates on earth (e.g., Netzel & Stepinski, 2016). Another is to better understand and forecast the occurrence of weather extremes (storms, floods), climate extremes (heat waves, cold snaps), and other factors relating to human health (e.g., Cassou, Terray, & Phillips, 2005; Coleman & Rogers, 2007; Polo, Ullmann, Roucou, & Fontaine, 2011). In these applications, the emphasis is on reducing the very large nominal dimensionality of the atmospheric dynamic and thermodynamic space to a manageable, discrete set of weather and climate states that have predictive value in terms of how the climate system affects human beings. On weather timescales, these states often correspond to patterns familiar to weather forecasters (James, 2006; Ferranti & Corti, 2011; Ferranti, Corti, & Janousek, 2015).

The Goals of This Article

This article is meant to be an in-depth introduction to the many techniques used in clustering techniques in the field of climate and weather analysis. The techniques most commonly used are described in enough detail so that the reader can better understand the literature, while the interested user can make an intelligent decision as to the best technique for a given problem. The resources necessary to further explore a given technique are provided in the bibliography. In cases where the basic algorithms can be described succinctly, and shed some light on the clustering technique, the algorithms are described. In other cases, the algorithms and their underlying statistical theory are too complex to lie within the scope of this article, but appropriate references are given.

The article does not attempt to follow a historical chronology, as this would not have been very useful in terms of understanding the clustering techniques themselves or their use in climate science.

Basic Concepts

Space-Time Objects of Clustering

In climate science applications, the various states on which clustering are carried out are usually geophysical fields, or sequences of these fields. Examples of fields are geopotential height, temperature, and horizontal winds on a suitable surface. These fields may be global in extent, but are most often hemispheric or regional. In some special applications, the states are not fields, but the locations of well-defined features, such as midlatitude or tropical cyclones, and it is their trajectories that are subject to clustering. The clustering of trajectories is discussed further in the section “Dynamic Clustering Techniques.”

The states (fields) may arise from time averages (such as annual, seasonal, monthly, or even five-day averages, also called pentads), or they may represent instantaneous (“snapshot”) states taken once per day or more often. In the majority of applications, the annual cycle of the particular field is assumed to be known, and it is the departures from this annual cycle (“anomalies”) that are of interest. However, this is not always the case. When the annual cycle interacts very strongly with the phenomena of interest, clustering may be carried out on the full fields (see, e.g., Qian, Robertson, & Moron, 2010; Moron, Robertson, Qian, & Ghil, 2014).

In addition to removing the annual cycle, further filtering of the states to be clustered is also appropriate for many applications. In order to identify preferred states of low frequency variability on intra-seasonal timescales (roughly 10- to 90-day periods), a “low-pass” time filter is often applied, as in Vautard (1990). Such a filter typically removes fluctuations with timescales less than roughly 6 to 10 days, fluctuations that are characteristic of rapidly moving baroclinic systems in midlatitudes.

In some applications the states may not be time filtered, but rather prechosen to correspond to special situations. These may include persistent periods (Toth, 1992), periods of extreme precipitation (Zhao, Deng, & Black, 2017a), extreme dryness (Zhao, Deng, & Black, 2017b), extreme temperature (Xie, Black, & Deng, 2017), or periods of particular synoptic developments such as surges of cold air (Park, Ho, & Deng, 2014).

The two most common mathematical choices for representing states are in terms of grid point values, or in terms of principal components (PCs) and empirical orthogonal functions. The transformation to the PC representation is defined by the covariance matrix of the original field, whose eigenvectors yield a set of fixed patterns, each associated with its own time series (the PC). The PCs are obtained by projecting the original data onto the fixed patterns. More than one type of field can be combined into the definition of a single state; if the types have different units then each is divided by its own temporal standard deviation prior to the computation of the covariance matrix.

The individual state can be represented by an M-dimensional vector z, where M is either the number of grid points, or the number of PCs retained. The major advantage of using the PC representation is that M can be greatly reduced while still retain most of the space-time variability of the set of fields. Because the great majority of studies use the PC representation as a starting point, it will be assumed throughout this article unless otherwise stated.

Definitions of Distance (Dissimilarity)

A critical quantity in most clustering methods is the distance between two states (or sequences of states), also called the dissimilarity. We denote the N states in a data set as the set {zα}(1αN), where the PC coordinates of each M-dimensional vector (denoted by i or j) are Cartesian. The scalar distance dαβ between two states zα and zβ can be defined in a number of ways.

Euclidean Distance: This is the M-dimensional version of the Euclidean geometric definition of length:


This is the most often used measure of distance. When applied to meteorological maps, the quantity dαβ is called the root-mean-square difference. This distance measure is very often used in conjunction with hierarchical and partitioning methods of cluster analysis (e.g., Cheng & Wallace, 1993; Michelangeli, Vautard, & Legras, 1995).

Cosine Distance: This distance can be defined as:


where the M-dimensional dot product is zαzβ=izαizβi and ααβ is the angle between the two vectors. If all states are projected onto the M-dimensional unit hypersphere, this is a natural measure of the distance on that sphere. Such a projection is used, for example, by Crommelin (2004) and Mo and Ghil (1988). Note that if two physical fields are represented using principal components and empirical orthogonal functions, the (un-centered) spatial correlation is just the cosααβ.

Mahalanobis Distance: This is a measure of the dissimilarity between two vectors z1 and z2, which are drawn from a distribution D of states {zα}(1αN) characterized by the mean μ and the covariance matrix Sij:


The Mahalanobis distance is given by:


Several properties of this distance measure are worth noting: it is preserved under invertible linear coordinate transformations. If the mean vanishes, and the covariance matrix is diagonal, one can scale the coordinates so that this measure of distance becomes the Euclidean measure. The Mahalanobis distance appears in the definition of a multivariate Gaussian probability distribution, and in this context is used in the mixture modeling approach to probabilty density estimation (Stephenson, 1997; Stephenson, Hannachi, & O’Neill, 2001).

Dynamic Time Warping Dissimilarity: This measures the dissimilarity between two sequences of states {zα},1αN and {yβ},1βN which are not necessarily of equal length. This method does not assume a one-to-one association between each member of the two sequences, but rather considers the N×N matrix whose elements dαβ=D(zα,yβ) are given by the dissimilarity between the two states zα and yβ (using any measure of dissimilarity). If each element of this matrix is considered a small box, a one-dimensional path through this matrix would consist of choosing a set of connecting boxes whose beginning is at one corner, and whose end is at the opposite corner. More formally, a warping path is defined as a contiguous set of matrix elements connecting d11 to dNN. The optimal warping path is then defined as that path that minimizes the sum of dαβ along the path. This sum is the dynamic time warping dissimilarity. Its use in climate science is for measuring the alignment between two time series that may be locally out of phase. As an example, Netzel and Stepinski (2016), who are concerned with defining global climate types, use this technique to compare the similarity between the annual cycles of temperature and precipitation in different regions.

Probability Density Estimation

The motivation behind cluster analysis is closely related to the behavior of the underlying probability distribution (pdf) of a data set: clusters should somehow represent regions in an appropriate phase space that are more populated than expected based on a background pdf (Toth, 1993), while avoiding areas that are less populated. The definition of the background pdf, equivalent to formulating a null hypothesis for the pdf, depends on the physical problem at hand. Very often, the background pdf is taken to be a multivariate Gaussian (or normal) distribution (Toth, 1991).

Multiple, distinct peaks in the pdf, which would indicate the existence of preferred states, is universally carried out in a very low-dimensional phase space (Benzi, Malgauzzi, Speranza, & Sutera, 1986; Hansen & Sutera, 1986; Molteni, Sutera, & Tronci, 1988; Corti, Molteni, & Palmer, 1999; Hsu & Zwiers, 2001). This is because the exponential increase in hyper-volume associated with added dimensions very rapidly leads to severe under-sampling, given the size of current weather and climate data records. The reduction to low dimension is usually accomplished through the use of principal components.

Kernel Density Estimation

In one dimension (say that of the leading PC), the raw data can be represented by a simple histogram, in which, however, the bin width is somewhat arbitrary. In two or more dimensions M2, a histogram becomes unwieldy, and a continuous estimate is helpful. The concept behind the kernel density estimate is that the contribution of each data point (essentially a delta-function in the phase space) gets smoothed out to a local pdf (with unit area under the curve). The final (normalized) estimate of the pdf is then given by:


where the vector z is a general point in the PC coordinate space, the vectors zα are given in the PC coordinates, the index α labels the coordinate, and h is a smoothing parameter (Silverman, 1986; Corti et al., 1999). The kernel function K can be a range of functions. One commonly used kernel function is the standard normal density function:


In fact the equation for f^(z) is over-simplified, because in more than one dimension a separate value of h in each dimension is appropriate; usually h is proportional to the standard deviation of the data in that dimension. Furthermore, it is very helpful to choose the smoothing in an adaptive manner, so that the parameters vary according to the value of the local pdf at each sample point in order to achieve more reliable estimates in data-sparse regions. See Kimoto and Ghil (1993a) for details on how this is done.

For a sample drawn from an underlying Gaussian (or normal) distribution one would expect the pdf estimate to also be normal, with a maximum at the point representing the overall mean (usually the origin). Significant deviations from this suggest preferred (and also unpreferred) regions of phase space. It is in this context that the choice of the smoothing parameters is of importance: if too large, local maxima in f^ that represent significant preferred anomalies will be smoothed out, if too small, spurious local maxima due to sampling error may appear. An illustration of how increasing the smoothing parameters changes the appearance of the pdf in two-dimensions is given in Figure 1.

Figure 1. Kernel density estimation of the probability distribution function in the space of the leading two principal components of filtered 500 hPa geopotential height for winter (November 16–March 31 for 1980–1981 through 2011–2012). The filter removed periods of 20 days or fewer. An adaptive algorithm was used. The kernel width is given by h times the value of the standard deviation, with the value of h indicated in each panel.

One simple approach to determining the set of parameters h is to use synthetic (or surrogate) data sets to define the background pdf. The purpose in using these synthetic data sets is to get some estimate of how likely multiple peaks in the pdf are due only to sampling error. Each of these synthetic data sets has the same size as the true data set, but the states in PC-space are generated by stochastic processes. We give a particular example here in some detail. In each realization of the synthetic data set, each PC is generated by an independent stochastic Gaussian process, whose variance and auto-correlation are similar to that of the corresponding original PC. The underlying probability distribution P of each PC is then Gaussian, hence has a single maximum. Considering for example two dimensions, the underlying joint probability distribution of variables x (value of PC-1) and y (value of PC-2) would be simply P(x,y)=P1(x)P2(y), which would have one peak in the x/y plane. While the true P(x,y) has only one peak, estimates of P(x,y) obtained from finite samples may have multiple peaks.

Computing the sample pdf of many such synthetic data sets with a given set of h, one finds that p% of the samples show distinct maxima (not at the origin), maxima that must be due to sampling error. The smoothing parameters must be increased until p is reduced to an acceptable level, usually 5% or 1%. For other choices of the kernel function K, and alternative methods for choosing the values of h, see Silverman (1986).

Mixture Modeling

Many of the clustering methods described here have the property that individual states are either assigned to a single cluster uniquely, or may lie within a preferred region of the pdf. The mixture modeling method expands this framework to include the possibility that an individual state may be associated with more than one cluster, but with different probabilities. The approach is to model the pdf of the data set as a linear combination of k multivariate normal (Gaussian) distributions, or components, where each of the k components is proportional to:


where M is the dimension of the state vectors z (number of PCs), μ is the mean for that component, and Σ the (symmetric) covariance matrix. The vertical bars denote the determinant, and the superscript T denoted vector transpose. Note the use of the Mahalanobis distance. We will denote this distribution for the jth component as:


emphasizing that this component depends on the actual state vectors {zα} on the one hand, and the unknown parameters μj,Σj on the other. The final pdf can then by written as a sum of these component distributions, with weights that add to unity:


where j=1kwj=1. The quantity pj (z) is used to indicate the contribution of component j to the overall probability at point z.

Each of the component distributions gj can then be thought of as a generalized cluster. The weight wj can be thought of as the probability that a particular state z is in the cluster j. The parameters (mean μk and covariance Σk and the weight wj of each component) are determined as those that maximize the likelihood (or logarithm of the likelihood) given the data. In other words, we treat the observed state vectors as fixed, and take the partial derivatives of p(z) with respect to each parameter and set that to zero. This yields a set of coupled nonlinear equations, not usually amenable to a direct solution.

The expectation maximization (EM) procedure can be adapted to this problem. This iterative procedure starts out with an initial guess for the parameters, and then estimates the probability that each data point zα belongs to the jth component (cluster) as:


where the pj are evaluated with initial guesses for the parameters. Then at the next iteration we estimate:


The procedure is repeated until convergence is achieved.

Detailed explanations of this method are given in the context of its application to climate by Haines and Hannachi (1995), Hannachi (1997), Smyth, Ide, and Ghil (1999), Hannachi and O’Neill (2001), and Hannachi (2007). The optimum number of component distributions (clusters) k have been determined by a variety of methods: cross-validation using independent subsamples of the data (as in Branstator & Selten, 2009), cross validation as discussed by Smyth et al. (1999), or the use of order statistics (Hannachi, 2007). Note that the parameters to be chosen consist of the M coordinates for the mean of each component, and the M(M+1)/2 parameters for the covariance matrix, thus kM(M+3)/21 parameters in all. This large number of parameters tends to keep the number k of mixture components relatively low in order to avoid over-fitting of the sample data.

Static Clustering Techniques

Hierarchical Method (Agglomerative)

The hierarchical method sequentially merges individual clusters (groups) of states pairwise, leading to fewer clusters as the process proceeds. At each step of this iterative method, a number of clusters k have been defined, each containing a number of states. Each cluster is characterized by its own internal variance Ek or error sum of squares, which is simply the sum of the squared distance of each state about the centroid of that cluster, the centroid being defined as that point whose coordinates are the average of the coordinates of the points in that cluster. Each step of the clustering is thus characterized by its total error sum of squares E=kEk. To proceed to the next step, two of the k clusters are merged, thereby increasing E. The two clusters to be merged are chosen so as to minimize the increase inE.

To initiate the iteration, each state is considered its own cluster. The final step consists of merging two clusters into a single cluster containing all the states. The entire procedure can be represented by a tree (also called a dendogram), with the various branches merging as you move upward toward the top. The height at which two branches merge, thus defining a new cluster, can be chosen to be the increase in E that characterizes the merging. The entire procedure is known as agglomerative hierarchical clustering. The final number of clusters can be chosen in several ways, including identifying that step that causes a kink, or sudden upturn in E, and determining which clusters which are most consistent in randomly chosen half-length data sets. A straightforward algorithm is described by Cheng and Wallace (1993), who apply this method to find clusters of the geopotential height field. Applications of hierarchical clustering to precipitation and temperature extremes can be found in Park et al. (2014), Xie et al. (2017), and Zhao et al. (2017a, 2017b).

Partitioning Methods


In the k-means approach, the number of clusters (which is denoted by k) is determined a priori, after which all states are assigned uniquely to one of the k clusters, based on a similarity criterion involving the distance of each state to the cluster centroid. (Methods that assign each state to different clusters with an associated weight, or probability, are covered in the sections on “Fuzzy Clustering” and “Mixture Modeling.”) In this manner the entire data set is partitioned among the k clusters. The method described here (the so called k-means method) has been very widely applied in climate science (Michelangeli et al., 1995; Cassou, Terray, Hurrell, & Deser, 2004; Straus & Molteni, 2004; Straus, Corti, & Molteni, 2007; Cassou, 2008; Fereday, Knight, Scaife, & Folland, 2008; Straus, 2010; Dawson, Palmer, & Corti, 2012; Dawson & Palmer, 2015; Munoz, Yang, Vecchi, Robertson, & Cooke, 2017).

The partitioning is almost always carried out in a reduced dimensional space, via the use of principal components. Each vector zα is then represented as a point in the state space of the PCs, which provide the coordinates. The idea is to assign each of the N states (denoted by the index α) uniquely to one of k clusters (where k should be at least 2, and is typically less than 10). We denote this assignment to cluster β as:


The cluster centroid is then defined as the point whose coordinates are the average of the coordinates of all the points assigned to the cluster:


where the sum over α(β) means the sum over all points α assigned to cluster β, and Nβ is the number of states assigned to cluster β. Note that each centroid is a point in PC-space, and so is associated with a physical map through the use of the empirical orthogonal functions associated with the PCs.

To decide on the best choice of partitions, a measure of the clustering strength is needed. The spread among the cluster centroids is defined by:


which is a measure of the variance of the centroids. The spread within clusters is defined as:


Because the relationship between Δ, S, and the (fixed) overall variance of the data set


is simply


a measure of the clustering strength is given by the variance ratioΔ/S. Given that (N1)σ2 is fixed, a maximum of the variance ratio corresponds to a minimum of S.

The iterative algorithm is initiated by randomly identifying k initial points called seeds, whose coordinates form the first estimate of the cluster centroids. Each of the N points in the data set is then assigned to the seed to which it is closest, so that k groups (clusters) are formed. Each of the k centroids are then recomputed based on all members in that cluster, and with this new definition of the centroids, the cluster assignments are recomputed. This iterative process of redefining clusters proceeds until the spread within clusters S has converged to what is assumed to be a local minimum within the set of all possible choices of k clusters. In order to get closer to the global minimum within this set, the algorithm is repeated many times, but with different initial seeds. The final cluster assignment with minimum S (or maximum variance ratio) is chosen.

In one variant of this algorithm, the “annealing” algorithm (Hannachi & Legras, 1995), cluster assignments are occasionally made to increase S in an attempt to avoid local minima of S and ultimately reach a global minimum. As the algorithm proceeds these “wrong” steps become more infrequent. Alternatively, one can ensure a rapid convergence to a global minimum by choosing the k seed points so that they are truly representative of the entire set of points. In PC space, this can be accomplished by ensuring that each seed has a norm less than a maximum value in the full PC space, that each pair of seeds has a minimum distance from each other, and that each pair of seeds belongs to a different sector in the PC1/PC2 plane.

The choice of k is sometimes dictated by the “classifiability index,” which measures how well the algorithm converges. Running the k-means algorithm M times yields M partitions, hence M(M−1)/2 distinct pairs of partitions. The similarity of a single pair of partitions is measured by computing the pattern correlation (equivalent to the angle in the PC space) between each of the k pairs of best-matched cluster centroids, and then taking the minimum correlation. This is then averaged over the M(M−1)/2 pairs of partitions to give the index, which is then plotted as a function of k. A maximum in the index at a particular value of k = k* suggests that k* clusters are preferred.

Another procedure to determine the number of clusters k is to repeat the calculation for increasing k, until the additional (new) cluster obtained represents a subset from a previous cluster, or until the size (number of members) of the new cluster is less than some pre-determined fraction of the size of the next smallest cluster.

Fuzzy Clustering

In its simplest form, this variant of k-means allows for each state to belong to more than one cluster, and to have a weighted value that identifies its relative strength of membership. If the objective distance (dissimilarity) between the state zα and the centroid of the cluster i is given by d(α,i), the weight associated with the ith cluster is defined by:


where the sum is over all clusters j. The parameter q determines the degree of cluster fuzziness. For very small values of q−1, the fraction 2/(q−1) becomes very large, strongly favoring distance ratios near 1. Thus if i refers to the centroid closest to the state zα only the term j=i will contribute to the sum, leading to a weight wi = 1. On the other hand, very large values of q indicate more smoothing of the weights (for very large q, the quantity in the square brackets is approximately k, indicating each cluster gets equal weight). Values of q~2 are typical. One way of using these weights is to not assign any cluster to those points that have weights for all clusters less than w¯σ (where w¯ is the mean and σ standard deviation of all weights). Such points lie on the border between clusters. For more details on the method the reader is referred to Harr and Elsberry (1995) and Harr, Anwender, and Jones (2008).

Partitioning Around Medoids

This method is similar to the k-means method. However, medoids are used instead of centroids. If we consider the set {zα} of all states within a cluster, then the medoid is that state zβ for which αd(zα,zβ) is a minimum, where d(zα,zβ) is the distance (dissimilarity) between zα and zβ. While the centroid of a cluster is a point that does not correspond to any of the individual states, the medoid is one of those states. Given a measure of distance, one can use the same algorithm as for the k-means method to obtain clusters.

Self-Organizing Maps: Local Clustering

In the self-organizing map (SOM) approach (Johnson, Feldstein, & Tremblay, 2008; Johnson & Feldstein, 2010; Polo et al., 2011), the goal is to identify a (not necessarily small) set of representative vectors mi that can be viewed as a composite of relatively similar individual states within a local cluster. The idea is that each state zi is reasonably similar to one of these representative vectors mi. The method is best understood by considering the map, or physical, representation of each state, and of the representative vectors. The SOM approach assigns each representative map to a node in a two-dimensional grid or array of maps, in which similar patterns are nearby and dissimilar patterns more widely separated. The size of this grid (which determines the number of representative maps) is typically in the range of 5 x 5 to 10 x 10 (hence 25 to 100 representative maps). There is no rule for determining the size of the array, which may depend on the size of the data set and represents a compromise between economy and detail.

In this iterative method, the initial representative vectors mi might be chosen to span the range of coordinates of the leading two PCs, but they are then modified through an iterative process. In the first step (call it μ=1), each vector zi is assigned the representative vector mi to which it is closest. This node, as well as those nodes that are close to it, iteratively adjust their representative vectors mi a specific amount toward the input vector zi by the relation:


where h is a neighborhood kernel that decreases rapidly with increasing distance between zi and mi thereby including influencing a certain neighborhood of zi. This procedure acts to “bend” the representative vectors toward the data points zi. As the iteration proceeds (μ increases), the kernel h is tightened to influence smaller neighborhoods. The entire procedure approximately minimizes the sum of Euclidean distances between each state and its representative vector (Johnson et al., 2008).

Dynamic Clustering Techniques

The estimation of the pdf (covered in the section on “Probability Density Estimation”) and approaches discussed in the section on “Static Clustering Techniques” do not take make use of the temporal order in which the states occur. Although it is of course possible to trace the evolution of the system from one cluster to another after the clusters have been defined, the time evolution of the states plays no role in the original identification of the clusters. There are certain special modifications in the static clustering techniques that can remedy this. One is to apply the clustering only to so-called “quasi-stationary” states, that is, ones that change relatively slowly and can be considered more persistent. This can be accomplished by computing the “speed” in phase space for each state (the distance in phase space between successive points in time), and considering only those states with a fraction (say ½) of the maximum speed as candidates for clustering (Toth,1992; Straus et al., 2007). Another modification is to compute the PCs themselves not using the fields zi but sequences of fields, so that the state vectors are expanded to include a number of successive points in time. A particular form of this approach allows one to cluster trajectories of well-defined objects, such as tropical or mid-latitude cyclones, as described in the “Clustering Trajectories” section.

Clustering Trajectories

An adaption of the mixture modeling approach allows for the clustering of trajectories of tropical or midlatitude cyclones (Gaffney, Robertson, Smyth, Camargo, & Ghil, 2007; Camargo, Robertson, Gaffney, & Smyth, 2007a, 2007b). Here the state vectors zi now refer to the successive positions of the cyclone center in discrete time. To be concrete, let the n discrete times t be described by the indices tj:


and let the longitudes λj and latitudes ϕj of a single cyclone be described by the following second order polynomial regression (p=2) model in time:


where ϵ1 and ϵ2 represent Gaussian noise terms with zero mean and variances σ12 and σ22. The state vector z for this cyclone becomes an n×2 matrix that describes successive positions of the cyclone in time, and can be written as:


where T is the n×3 matrix whose elements Tjm are given by Tjm=tjm1, β is the (3×2) matrix of regression coefficients:


and ϵ is the (n×2) noise matrix. To generalize to a higher order regression model (p>2) is straightforward: T becomes an n×(p+1) matrix and β an (p+1)×2 matrix.

This model of cyclone trajectories can then be used to describe the full data set of trajectories (all with length n) as a linear combination of k component pdfs, each of the form:


where the matrix of regression coefficients β, the covariance matrix Σ for each component, and the weights given each component are to be determined, while the length of the trajectory n is fixed. Here the vertical bars denote the determinant, the superscript T the transpose, and Tr means the trace. This is very similar to the component pdf written in the section on the “Mixture Modeling” approach, with some modifications necessary due to the matrix nature of the states z.

Hidden Markov Models

This approach seeks to determine persistent states in the atmosphere (Franzke, Crommelin, Fischer, & Majda, 2008; Franzke, Woollings, & Martius, 2011). The method identifies a pre-specified number of underlying multivariate Gaussian pdfs (called output distributions), and associates each pdf with an underlying “hidden” discrete state X. The evolution between these discrete states is governed by a Markov (first-order autoregressive) chain. This corresponds to the assumption that the hidden state transitions are governed by a stationary process.

The identification of persistent states (clusters) depends on the character of the matrix A, which describes the transition between the discrete states X. This transition probability matrix has elements Aij: if the system resides in state Xi, the probability of transitioning to hidden state Xj in a fixed time interval is Aij. Note that the diagonal elements give the probability of remaining in a certain state. From this matrix it may be possible to identify certain groups of hidden states for which transitions are very likely within the group, but unlikely between a state in the group and one outside of it. Such groups are termed “metastable” and correspond to persistent states. Figure 2 gives a simple illustration of this, while figure 3 illustrates how a number of Gaussian pdfs, each associated with a hidden state, combine to give a distinctly non-Gaussian pdf for the observed system. Note the similarity here with the mixture modeling method.

Figure 2. Schematic of Hidden Markov model of clusters. Each dark oval is a hidden state, corresponding to a cluster (or persistent regime). Each regime is associated with a probability distribution function (pdf) of an observed variable, which is given in the curve below the regime. The black numbers indicate the probability that each regime will persist for one time unit. The colored numbers indicate the probability of a transition between regimes.

Figure 3. The three probability distribution functions shown are each associated with a hidden state, or cluster. The final pdf of the observed variable is a combination of the three component pdfs.

The main statistical problem in constructing a Hidden Markov model is the estimation of the elements of the transition matrix as well as the means and variances of the multivariate Gaussian output pdfs. This is accomplished by use of the expectation maximization algorithm. In addition, because each hidden state is associated with a range of possible observed states, one would like to estimate the most probable hidden state sequence. The technical approach to the statistical approaches, including estimates of statistical significance, are covered at length in Franzke, Crommelin, Fischer, and Majda (2008) and Franzke, Woollings, and Martius (2011).

This approach yields the possibility of a non-Gaussian (even multimodal) pdf of all the states even though each hidden state is associated with a Gaussian pdf (see figure 2 for an illustration). This feature is also seen in mixture modeling. However, the time sequencing of the observed states, which plays no role in the mixture modeling, is very important in the Hidden Markov model.

Non-Stationary Clustering; Finite Element Clustering

The great majority of approaches to cluster analysis assume that the statistics of the variables considered are stationary in time, so that there is no underlying slow change in such basic quantities as first and second-order moments. However, long-term changes in the climate system are known to occur, due to for example the changes in greenhouse gases or volcanic eruptions. Non-stationary clustering methods are designed to take into account these possible influences of time-dependent forcing functions on the preferred states. In the finite element, vector autoregressive factor approach (Horenko, 2010), the sequence of observed states z is modeled by a discrete process in time t, in general form:


Ignoring for the moment the subscript X, this equation represents a general, discrete auto-regressive model, plus a term ut representing time-dependent external forcing, plus noise ϵ.

The autoregressive model uses multiples of the time lag τ up to m, the memory depth. The term μ represents the mean. However, the set of parameters (μ,Aq,B,C) depends on which cluster X the system is in, and this affiliation of the system with a particular cluster, thus a particular set of parameters, depends in general on time. This is illustrated in the schematic in figure 4.

Figure 4. Schematic of the finite element vector autoregressive model. The time evolution of the state z(t) is given schematically in the black line. Two clusters are illustrated by the red and blue lines; z(t) is described by either the upper or lower equation, depending on which cluster the system is in at a given time.

Determination of the parameters for each cluster involves the adaptive finite element technique, while the order of the model (values of m and the number of clusters) can be determined via the Akaike information criterion (Horenko, 2010).

Compared to the Hidden Markov model, this technique includes a more general treatment of the evolution of the observed states, including the effects of external forcing, which may have a timescale that is very long compared to the shortest timescale sampled in the data. The disadvantage of this technique is that more parameters have to be estimated compared to the Hidden Markov model.

Significance of Clusters

Statistical Significance

One general approach to the assessment of statistical significance of multiple maxima (multimodality) in the direct estimate of a probability distribution function, or in the results of a cluster analysis, is to rely on the use of synthetic (also called surrogate) data sets (Nitsche, Wallace, & Kooperberg, 1994; Hsu & Zwiers, 2001; Christiansen, 2007; Straus et al., 2007; Franzke et al., 2008). For example, assume the original (physical) data set is expressed in terms of principal components. Then each of the N original maps (or states of the system) is represented as a vector in the PC-space of dimension M, where M is the number of PCs retained. The entire data set thus consists of M time series, each of length N. Each synthetic data set is constructed to have the same number N of original states as the physical data set, but with the time series of each of the M PC components replaced by a stochastically generated Gaussian time series, generated so that the lag-correlation statistics are close to the observed statistics for that PC component. By construction the time series of the various components are statistically independent of each other. Such synthetic data sets are constructed so that any deviations from a multi-normal distribution can be due only to sampling error.

The use of many such synthetic data sets can help establish the confidence of either the occurrence of multiple maxima (multimodality) in the pdf. For example, estimating the pdf in two dimensions with the kernel density estimation, one can increase the kernel smoothing h until only an acceptable percentage of the synthetic series show multiple maxima (Corti et al., 1999). On the other hand, one may rely on more stringent tests to distinguish the entire pdf from a multi-normal pdf, such as testing the marginal distributions against a normal distribution with the Kolmogorov-Smirnov test (Stephenson et al., 2001).

For the k-means method, one can similarly establish the confidence level by the percentage of synthetic data sets for which the variance ratio is larger than that associated with the physical data set (Straus et al., 2007). Often one finds that the confidence level is high for all values of k greater than some threshold, because the variance ratio generally increases with k (becoming infinite when k=N [each state its own cluster])! There is no accepted definition of the best or optimal value of k, which may depend on the size of the data set and the use to which the clusters are put. For example, the reproducibility of results to subsampling tends to decrease with increasing k, and could be used to determine an upper limit.

One source of difficulty in assessing significance is the relevance of a Gaussian distribution as the relevant background against which to test, because it is known that many geophysical variables have skewed distributions (White, 1980). Skewness can be taken into account in generating the stochastic surrogate data sets (Christiansen, 2007), and generally lowers the significance.

An important issue in the significance testing of clusters is the size of the sample of data. Often the available data record is just too short to rigorously confirm significance (Stephenson et al., 2001), yet the clusters obtained may be consistent with other statistical analyses or may provide a useful coarse-graining of the original data for various purposes. This is discussed in the section on “Physical Significance and Usefulness of Clusters.”

A second, often critical, issue is the choice of how the physical data is chosen and prepared for clustering. To the extent that the user has some a priori knowledge of a particular set of structures that dominate fluctuations on a certain timescale, filtering of the data to remove much fast timescales (which may represent noise) will boost the significance of the clusters. For example, in seeking the preferred states associated with intra-seasonal fluctuations (roughly 10- to 90-day periods), the failure to filter out rapidly evolving synoptic scale, baroclinic weather systems can degrade the significance (e.g., Fereday et al., 2008). For static clustering techniques, significance can also be boosted if only quasi-stationary periods in the data record, periods in which the states change more slowly than usual, are first isolated (e.g., Toth, 1992; Straus et al., 2007).

Physical Significance and Usefulness of Clusters

The existence of preferred regions of the state space of the atmospheric circulation is suggested by studies of a hierarchy of dynamical models truncated to retain only a few modes. In the most highly truncated models, steady states were identified and associated with preferred states of the real atmosphere (Charney & DeVore, 1979; Charney & Straus, 1981; Reinhold & Pierrehumbert, 1982). States that characterize minima in the tendency of large-scale fields were identified in less highly truncated barotropic (Anderson, 1992; Branstator & Opsteegh, 1989) and two-level baroclinic models (Hannachi, 1997). In the intermediate three-level model of Itoh and Kimoto (1996), remnants of unstable periodic cycles could be identified in the full dynamical attractor, suggesting a physical basis for preferred states.

The clusters of intra-seasonal variability for the more complete (primitive equation) general circulation models (GCMs) with full suites of physical parameterizations generally compare well with those obtained from reanalyses, lending support to the physical nature of the results (Dawson & Palmer, 2015; Swenson & Straus, 2017). Further, the importance of the sample size in establishing formal significance is underlined by cluster analysis of large ensembles of GCM simulations and forecasts: the significance of the k-means clusters of intra-seasonal variability tends to be much higher for GCMs compared to reanalyses because the sample size is much larger. This indicates that the GCM results are robust across ensemble members.

Cluster analysis has also been proven to be a useful tool in a number of topics of weather and climate research, including not only understanding the midlatitude response to tropical forcing, but also in downscaling and extremes, model evaluation, weather forecasting, and synoptic climatology, all of which are reviewed in Hannachi et al. (2017). Downscaling is required for climate impact assessment. Statistical downscaling is, in some sense, the inverse of parameterization used in numerical weather prediction, in that it uses large-scale flow features and circulation regimes (clusters) as predictors to infer smaller-scale processes, notably precipitation and surface temperature at regional and local scales.

The large-scale atmospheric circulation patterns obtained from clustering can also be used for ocean downscaling. In particular, winter and summer circulation regimes over the North Atlantic–European region have been used as predictors to reconstruct ocean surface variables consisting of surface winds and temperature, which can then be used to force ocean models (Barrier, Cassou, Deshayes, & Treguier, 2014).

The link to surface weather and climate extremes is another application of clusters (Park et al., 2014; Xie et al., 2017; Zhao et al., 2017a, 2017b; Amini & Straus, 2018). The consequent persistence of these large-scale and synoptic patterns yields, through a cumulative process of weather fluctuations, an amplitude increase leading to extremes. A related application of regimes (clusters) to local weather is the endeavor known as synoptic climatology (James, 2006; Coleman & Rogers, 2007). Here it is the synoptic weather patterns themselves that are classified into clusters representing preferred synoptic structures.

The very nature of the preferred (and often persistent) large-scale circulation regimes obtained by cluster analysis suggests a link to weather forecasting and predictability. Operational forecast centers thus use clustering to summarize the wealth of dynamical information available from large ensembles of forecasts to indicate scenarios, or alternate distinct possibilities suggested by the forecasts (Ferranti & Corti, 2011; Ferranti, Corti, & Janousek, 2015).

A Brief Guide to the Use of Clustering Techniques

While a number of methods are available for obtaining clusters, they each have distinct advantages and disadvantages for the user, which include the types of information they provide and the difficulty in establishing significance. The goal of this section is to provide some guidance in this regard.

Hierarchical Method

This is a very basic method that yields a hierarchy of representative patterns or maps. This method is very good for getting a feel for the way in which preferred patterns emerge from the data, and how they combine to form a few distinct clusters. Moreover, the dendogram (graphical tree representation of how clusters merge, and at what level of dissimilarity) provides some guidance as to the choice of how many clusters to retain. However, other than establishing reproducibility among many subsets of the data, overall significance is hard to establish.

Partitioning Methods

These methods include both k-means and a variant, clustering around medoids. These approaches also yield distinct preferred patterns, and assignments of each original state (or map) to one of the k patterns. Because for each value of k there is a well-defined metric of how clustered the data are (the variance ratio), synthetic data sets can be used in a straightforward manner to establish the statistical significance. In addition, the user can decide the null hypothesis (nature of the background pdf) against which the data are to be tested. For example, is the null hypothesis represented by a multivariate Gaussian pdf, or a multivariate pdf that includes skewness (or other properties)? Similar to the hierarchical method, the choice of the final number of clusters is not unique. This method is useful for clustering ensemble forecasts, by picking k to be the lowest value for which a threshold significance is achieved, thus allowing for the possibility that there are not multiple preferred states in the forecasts.

Fuzzy Clustering

This method is similar to the partitioning methods, but assigns a probability of each original state belonging to more than one cluster. Because this approach identifies states that are not particularly close to any single cluster centroid, it provides a method for filtering out those states that might be considered ambiguous with regard to the clusters. The control the user has for the degree of smoothing in assigning probabilities is both an advantage (sensitivity can be established) and a disadvantage, because it is difficult to establish a statistically optimal level of smoothing.

Self-Organizing Maps

The goal of this approach is quite distinct from other methods: it seeks to determine a relatively large (of order 25) set of representative patterns that are ordered by their similarity, and so provides a coarse-grained view of the continuum of possible states. This is useful if one is interested in studying the evolution of a particular regional field (such as sea-level pressure over the Pacific) during the annual cycle, or as function of tropical forcing on intra-seasonal and inter-annual time scales (Johnson & Feldstein, 2010). Because the goal is not the identification of just a few preferred patterns, but the sense of evolution of the entire field, the arbitrariness in defining the number of representative patterns is not of concern. However, the assessment of statistical significance is problematic.

Mixture Modeling

This technique is useful if a statistical model for the entire pdf is sought. The pdf is modeled as a sum of multivariate Gaussian component distributions, with each component distribution interpreted as a cluster. Every observed state is associated with multiple components (clusters), each with a certain probability. Mixture modeling is a particularly efficient approach to modeling the entire pdf, in the sense that a relatively few components can describe a variety of pdfs (given that each component has a number of parameters to fit). This strength in pdf modeling, however, suggests a certain caution in using this technique as a cluster analysis technique: because each component does have many parameters, the number of clusters (components) is likely to be quite small in order to avoid over-fitting.

Hidden Markov Model

This approach can be very useful in identifying preferred states that are also persistent, as well as the most likely transition paths between these states. As with many other approaches, programming packages on various popular platforms are available. However, because the underlying assumptions made in identifying the number of preferred states and their transition properties, as well as significance testing, are not transparent, it is recommended that the user have some prior knowledge and physical intuition with regard to the system under study before application of this method.

Finite Element, Vector Autoregressive Factor Model

This is the only approach discussed that can explicitly take into account non-stationarity, and so may be essential in some applications. Because it, like the Hidden Markov model, is a relatively new approach, the climate community has not had as much experience in its application. Thus it is recommended that the user obtain some some prior knowledge and physical intuition with regard to the system under study before application of this method.


The extensive body of literature seems to provide evidence in support of the existence of significant clusters, particularly of the large-scale circulation, from simplified, for example, quasi-geostrophic, models of the atmosphere, from comprehensive climate models (GCMs) and from reanalyses. It is also perhaps fair to say, however, that the significance of clusters is not established in all cases. This points to the need for more research on the existence and identification of preferred and persistent structures, and to how the statistical significance relates to the physical significance.

Suggested Reading

The following articles give an overview of the goals of cluster analysis and how it fits into our understanding of atmospheric and climate variability. They also give many references to both cluster analysis and related subjects.


  • Amini, S., & Straus, D. M. (2018). Control of storminess over the Pacific and North America by circulation regimes. Climate Dynamics.
  • Anderson, J. L. (1992). Barotropic stationary states and persistent anomalies in the atmosphere. Journal of the Atmospheric Sciences, 49, 1709–1722.
  • Barrier, N., Cassou, C., Deshayes, J., & Treguier, A.-M. (2014). Response of North Atlantic ocean circulation to atmospheric weather regimes. Journal of Physical Oceanography, 44, 179–201.
  • Benzi, R., Malgauzzi, P., Speranza, A., & Sutera, A. (1986). The statistical properties of the general atmospheric circulation: Observational evidence and a minimal theory of bimodality. Quarterly Journal of the Royal Meteorological Society, 112, 661–676.
  • Branstator, G., & Opsteegh, J. D. (1989). Free solutions of the barotropic vorticity equation. Journal of the Atmospheric Sciences, 46, 1799–1814.
  • Branstator, G., & Selten, F. (2009). Modes of variability and climate change. Journal of Climate, 22(10), 2639–2658.
  • Camargo, S. J., Robertson, A. W., Gaffney, S. J., & Smyth, P. (2007a). Cluster analysis of typhoon tracks. Part I: General properties. Journal of Climate, 20, 3635–3653.
  • Camargo, S. J., Robertson, A. W., Gaffney, S. J., & Smyth, P. (2007b). Cluster analysis of typhoon tracks. Part II: Large-scale circulation and ENSO. Journal of Climate, 20, 3654–3673.
  • Cassou, C., Terray, L., Hurrell, J., & Deser, C. (2004). North Atlantic winter climate regimes: Spatial asymmetry, stationarity with time, and oceanic forcing. Journal of Climate, 17, 1055–1068.
  • Cassou, C., Terray, L., & Phillips, A. S. (2005). Tropical Atlantic influence on European heat waves. Journal of Climate, 18(15), 2806–2811.
  • Charney, J. G., & Devore, J. G. (1979). Multiple flow equilibria in the atmosphere and blocking. Journal of the Atmospheric Sciences, 36, 1205–1216.
  • Charney, J. G., & Straus, D. M. (1981). Form-drag instability, multiple equilibria and propagating planetary waves in baroclinic, orographically forced planetary wave systems. Journal of the Atmospheric Sciences, 37, 1157–1176.
  • Cheng, X., & Wallace, J. M. (1993). Cluster analysis of the Northern Hemisphere wintertime 500 hPa height field: Spatial patterns. Journal of the Atmospheric Sciences, 50, 2674–2696.
  • Christiansen, B. (2007). Atmospheric circulation regimes: Can cluster analysis provide the number? Journal of Climate, 20, 2229–2250.
  • Coleman, J. S. M., & Rogers, J. C. (2007). A synoptic climatology of the central United States and associations with Pacific teleconnection pattern frequency. Journal of Climate, 20(14), 3485–3497.
  • Corti, S., Molteni, F., & Palmer, T. N. (1999). Signature of recent climate change in frequencies of natural atmospheric regimes. Nature, 398, 799–802.
  • Crommelin, D. T. (2004). Observed nondiffusive dynamics in large-scale atmospheric flow. Journal of the Atmospheric Sciences, 61, 2384–2396.
  • Dawson, A., & Palmer, T. N. (2015). Simulating weather regimes: Impact of model resolution and stochastic parameterization. Climate Dynamics, 44(7–8), 2177–2193.
  • Dawson, A., Palmer, T. N., & Corti, S. (2012). Simulating regime structures in weather and climate prediction models. Geophysical Research Letters, 39(21).
  • Dole, R. M., & Gordon, N. D. (1983). Persistent anomalies of the extratropical Northern Hemisphere wintertime circulation: Geographical distribution and regional persistence characteristics. Monthly Weather Review, 111(8), 1567–1586.
  • Fereday, D. R., Knight, J. R., Scaife, A. A., & Folland, C. K. (2008). Cluster analysis of North Atlantic–European circulation types and links with tropical sea surface temperatures. Journal of Climate, 21, 3687–3703.
  • Ferranti, L., & Corti, S. (2011, Spring). New clustering products. ECMWF Newsletter, 127, 6–11.
  • Ferranti, L., Corti, S., & Janousek, M. (2015) Flow-dependent verification of the ECMWF ensemble over the Euro-Atlantic sector. Quarterly Journal of the Roy Meteorological Society, 141, 916–924.
  • Franzke, C., Crommelin, D., Fischer, A., & Majda, A. J. (2008). A Hidden Markov model perspective on regimes and metastability in atmospheric flows. Journal of Climate, 21, 1740–1757.
  • Franzke, C., & Feldstein, S. (2005). The continuum and dynamics of Northern Hemisphere teleconnection patterns. Journal of the Atmospheric Sciences, 62, 3250–3267.
  • Franzke, C.Woollings, T., & Martius, O. (2011). Persistent circulation regimes and preferred regime transitions in the North Atlantic. Journal of the Atmospheric Sciences, 68, 2809–2825.
  • Gaffney, S. J., Robertson, A. W., Smyth, P., Camargo, S. J., & Ghil, M. (2007). Probabilistic clustering of extratropical cyclones using regression mixture models. Climate Dynamics, 29, 423–440.
  • Ghil, M., & Robertson, A. W. (2002). “Waves” vs. “particles” in the atmosphere’s phase space: A pathway to long-range forecasting? Proceedings of the National Academy of Science, 99, 2493–2500.
  • Haines, K., & Hannachi, A. (1995). Weather regimes in the Pacific from a GCM. Journal of the Atmospheric Sciences, 52, 2444–2462.
  • Hannachi, A. (1997). Low-frequency variability in a GCM: Three-dimensional flow regimes and their dynamics. Journal of Climate, 10(6), 1357–1379.
  • Hannachi, A. (2007). Tropospheric planetary wave dynamics and mixture models: Two preferred regimes and a regime shift. Journal of the Atmospheric Sciences, 64(10), 3521–3541.
  • Hannachi, A., & Legras, B. (1995). Simulated annealing and weather regime classification. Tellus, 47A, 955–973.
  • Hannachi, A., & O’Neill, A. (2001). Atmospheric multiple equilibria and non-Gaussian behaviour in model simulations. Quarterly Journal of the Royal Meteorological Society, 127, 939–958.
  • Hannachi, A., Straus, D. M., Franzke, C. L. E., Corti, S., & Woollings, S. (2017). Low-frequency nonlinearity and regime behavior in the Northern Hemisphere extratropical atmosphere. Reviews of Geophysics, 55.
  • Hansen, A. R., & Sutera, A. (1986). On the probability density distribution of large-scale atmospheric wave amplitude. Journal of the Atmospheric Sciences, 43, 3250–3265.
  • Harr, P. A., Anwender, D., & Jones, S. C. (2008). Predictability associated with the downstream impacts of the extratropical transition of tropical cyclones: Methodology and a case study of Typhoon Nabi (2005). Monthly Weather Review, 136, 3205–3225.
  • Harr, P. A., & Elsberry, R. L. (1995). Large-scale circulation variability over the tropical western North Pacific. Part I: Spatial patterns and tropical cyclone characteristics. Monthly Weather Review, 123, 1225–1246.
  • Horenko, I. (2010). On the identification of nonstationary factor models and their application to atmospheric data analysis. Journal of the Atmospheric Sciences, 67, 1559–1574.
  • Hsu, C. J., & Zwiers, F. (2001). Climate change in recurrent regimes and modes of Northern Hemisphere atmospheric variability. Journal of Geophysical Research, 106(D17), 20145–20159.
  • Ioth, H., & Kimoto, M. (1996). Multiple attractors and chaotic itinerancy in a quasigeostrophic model with realistic topography: Implications for weather regimes and low-frequency variability. Journal of the Atmospheric Sciences, 53(7), 2217–2231.
  • James, P. (2006). An assessment of European synoptic variability in Hadley Centre global environmental models based on an objective classification of weather regimes. Climate Dynamics, 27, 215–231.
  • Johnson, N. C., & Feldstein, S. B. (2010). The continuum of North Pacific sea level pressure patterns: Intraseasonal, interannual, and interdecadal variability. Journal of Climate, 23, 851–857.
  • Johnson, N. C., Feldstein, S. B., & Tremblay, B. (2008). The continuum of Northern Hemisphere teleconnection patterns and a description of the NAO shift with the use of self-organizing maps. Journal of Climate, 21, 6354–6371.
  • Kimoto, M., & Ghil, M. (1993a). Multiple flow regimes in the Northern Hemisphere winter. Part I: Methodology and hemispheric regimes. Journal of the Atmospheric Sciences, 50(16), 2625–2643.
  • Kimoto, M., & Ghil, M. (1993b). Multiple flow regimes in the Northern Hemisphere winter. Part II: Sectorial regimes and preferred transitions. Journal of the Atmospheric Sciences, 50(16), 2645–2673.
  • Marshall, J. C., & Molteni, F. (1993). Towards a dynamical understanding of planetary-scale flow. Journal of the Atmospheric Sciences, 50, 1792–1818.
  • Michelangeli, P. A., Vautard, R., & Legras, B. (1995). Weather regimes: Recurrence and quasi stationarity. Journal of the Atmospheric Sciences, 52, 1237–1256.
  • Mo, K., & Ghil, M. (1988). Cluster analysis of multiple planetary flow regimes. Journal of Geophysical Research, 93, 10927–10952.
  • Molteni, F., Sutera, A., & Tronci, N. (1988). The EOFs of the geopotential eddies at 500 mb in winter and their probability density distribution. Journal of the Atmospheric Sciences, 45(21), 3063–3080.
  • Molteni, F., Tibaldi, S., & Palmer, T. N. (1990). Regimes in the wintertime circulation over northern extratropics. I: Observational evidence. Quarterly Journal of the Royal Meteorological Society, 116(491), 31–67.
  • Moron, V., Robertson, A. W., Qian, J., & Ghil, M. (2014). Weather types across the Maritime continent: From the diurnal cycle to interannual variations. Frontiers of Environmental Science, 2.
  • Munoz, A. G., Yang, X., Vecchi, G. A., Roberston, A. W., & Cooke, W. F. (2017). A weather-type-based cross-time-scale diagnostic framework for coupled circulation models. Journal of Climate, 30(21), 8951–8972.
  • Netzel, P., & Stepinski, T. (2016). On using a clustering approach for global climate classification. Journal of Climate, 29, 3387–3401.
  • Nitsche, G., Wallace, J. M., & Kooperberg, C. (1994). Is there evidence of multiple equilibria in planetary wave amplitude statistics? Journal of the Atmospheric Sciences, 51, 314–322.
  • Park, T. W., Ho, C. H., & Deng, Y. (2014). A synoptic and dynamical characterization of wave-train and blocking cold surge over East Asia. Climate Dynamics, 43, 753–770.
  • Polo, I., Ullmann, A., Roucou, P., & Fontaine, B. (2011). Weather regimes in the Euro-Atlantic and Mediterranean sector and relationship with West African rainfall over the period 1989–2008 from a self-organizing maps approach. Journal of Climate, 24, 3423–3432.
  • Qian, J. H., Robertson A. W., & Moron, V. (2010). Interactions among ENSO, the monsoon and diurnal cycle in the rainfall variability over Java, Indonesia. Journal of the Atmospheric Sciences, 67, 3509–3524.
  • Reinhold, B. B., & Pierrehumbert, R. P. (1982). Dynamics of weather regimes: Quasi-stationary waves and blocking. Monthly Weather Review, 110, 1105–1145.
  • Silverman, B. W. (1986). Density estimation for statistics and data analysis. London, U.K.: Chapman and Hall.
  • Smyth, P., Ide, K., & Ghil, M. (1999). Multiple regimes in Northern Hemisphere height fields via mixture model clustering. Journal of the Atmospheric Sciences, 56, 3704–3723.
  • Stephenson, D. B. (1997). Correlation of spatial climate/weather maps and the advantages of using the Mahalanobis metric in predictions. Tellus A, 49, 513–527.
  • Stephenson, D. B., Hannachi, A., & O’Neill, A. (2001). On the existence of multiple flow regimes. Quarterly Journal of the Royal Meteorological Society, 130(491), 583–605.
  • Straus, D. M. (2010). Synoptic-eddy feedbacks and circulation regime analysis. Monthly Weather Review, 138(11), 4026–4034.
  • Straus, D. M., Corti, S., & Molteni, F. (2007). Circulation regimes: Chaotic variability versus SST-forced predictability. Journal of Climate, 20(10), 2251–2272.
  • Straus, D. M., & Molteni, F. (2004). Circulation regimes and SST forcing: Results from large GCM ensembles. Journal of Climate, 17, 1641–1656.
  • Straus, D. M., Molteni, F., & Corti, S. (2017). Atmospheric regimes: The link between weather and the large-scale circulation. In C. L. E. Franzke & T. J. O’Kane (Eds.), Nonlinear and stochastic climate dynamics (pp. 105–135). Cambridge, U.K.: Cambridge University Press.
  • Swenson, E., & Straus, D. M. (2017). Rossby wave breaking and transient eddy forcing during Euro-Atlantic circulation regimes. Journal of the Atmospheric Sciences, 74, 1735–1755.
  • Toth, Z. (1991). Circulation patterns in phase space: A multinormal distribution? Monthly Weather Review, 119, 1501–1511.
  • Toth, Z. (1992). Quasi-stationary and transient periods in the Northern Hemisphere circulation series. Journal of Climate, 5(11), 1235–1247.
  • Toth, Z. (1993). Preferred and unpreferred circulation types in the Northern Hemisphere wintertime phase space. Journal of the Atmospheric Sciences, 50(17), 2868–2888.
  • Vautard, R. (1990). Multiple weather regimes over the North Atlantic: Analysis of precursors and successors. Monthly Weather Review, 118, 2056–2081.
  • Vautard, R., & Legras, B. (1988). On the source of mid-latitude low-frequency variability. Part II: Nonlinear equilibration of weather regimes. Journal of the Atmospheric Sciences, 45, 2845–2867.
  • White, G. H. (1980). Skewness, kurtosis, and extreme values of Northern Hemisphere geopotential heights. Monthly Weather Review, 108, 1446–1455.
  • Xie, Z., Black, R. X., & Deng, Y. (2017). The structure and large-scale organization of extreme cold waves over the conterminous United States. Climate Dynamics, 29, 4075–4088
  • Zhao, S., Deng, Y., & Black, R. X. (2017a). A dynamical and statistical characterization of U.S. extreme precipitation events and their associated large-scale meteorological patterns. Journal of Climate, 30, 1307–1326.
  • Zhao, S., Deng, Y., & Black, R. X. (2017b). Observed and simulated spring and summer dryness in the United States: The impact of the Pacific sea surface temperature and beyond. Journal of Geophysical Research: Atmospheres, 122(13), 12713–12731.