2.5 Design of Aggregated Segments
In order to solve the model as an intertemporal optimization over several time steps, it is necessary to reduce the number of intra-annual segments over which dispatch is resolved. US-REGEN uses a novel approach for selecting and appropriately weighting a subset of representative hours. In choosing a subset of weighted hours from the 8,760 hours of the year, the goal is to preserve the important characteristics of the disaggregated data so that model outcomes in the reduced-form version are as close as possible to the hypothetical outcome using the full data.[1] These characteristics include:
- The area under the load duration curve (i.e. total annual load, for each region)
- The shape of the load duration curve (for each region)
- The capacity factors of new wind and solar capacity (for each region and class)
- The shape of wind and solar output relative to load (for each region), in particular the extremes of the joint distribution (e.g. hours when load is high but wind/solar output is low)
If the problem were simply to capture the load profile, only a handful of segments, perhaps a peak, shoulder, and base load, would be necessary. This level of aggregation is often employed by models to approximate a load shape. However, wind and solar output are far more variable, ranging from near 0% to near 100% of capacity, and considering load, wind, and solar together extends the variability to multiple dimensions. Furthermore, the set of representative hours should apply across all regions, so that the synchronicity of inter-regional transmission is preserved. This necessarily adds hours to the set; while the distributions in adjacent regions may have many similarities, conditions in Florida will bear little resemblance to those at the same hour in the Pacific Northwest.
Given these complexities, trade-offs among the various criteria are inevitable, and a systematic approach is clearly needed. We have developed a novel heuristic that combines a simple integer formulation, which identifies a set of hours covering at a minimum the extremes of the joint distribution in each region, with a clustering algorithm that captures the interior of the joint load-wind-solar distribution. Including the extremes ensures, on the one hand, that capacity values are not over-estimated (e.g. the moment of highest load and lowest wind/solar output is represented) and, on the other hand, that hours of abundant wind/solar output relative to load (which can force transmission, storage, or curtailment) are included. After the hours are chosen, they are weighted so as to minimize error in total load and average annual capacity factor across all regions and classes. We provide an abbreviated description of this algorithm below and refer the reader to Blanford et al. (2018) for more detail.
2.5.1 Choosing Extreme Hours
In each region, we consider three synchronous hourly time series corresponding to load (as a percentage of peak) and wind and solar output (as a percentage of the annual maximum, weighted average across available classes). These hourly values can be plotted in three-dimensional space; they are the red markers in Figure 2‑15, which gives the example of Texas. The extremes of interest are the eight vertices of the space spanned by the hourly data, as well as the vertices of the one- and two-dimensional projections of this space, which may or may not coincide. These vertex hours are identified as the hours closest (in the ordinary Euclidean sense) to the actual vertices of the unit cube (or unit line or square in the projection spaces). For example, the hour closest to the unit cube vertex (1,0,0) (with coordinates referring to load, wind, and solar respectively) will not literally have values (1,0,0) unless the hour with peak load also happened to have zero output of both wind and solar. Instead, we identify the hour whose values are closest to this point, which might be, for example, (0.9, 0.09, 0) at 7:00 p.m. local time. In this case, the vertex hour would capture the moment when load is still high after a hot summer day, wind has not picked up, and the sun has set. It is crucial for the model with aggregated hours to know that such moments exist.
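As a minimal sketch of this vertex-hour identification (assuming the three hourly series have already been normalized to the 0–1 range; the function and variable names below are illustrative, not US-REGEN code), the nearest hour to each corner of the unit cube can be found as follows. The one- and two-dimensional projections would apply the same logic to the corresponding subsets of coordinates.

```python
import numpy as np
from itertools import product

def find_vertex_hours(load, wind, solar):
    """Return, for each vertex of the unit cube, the index of the hour whose
    (load, wind, solar) values are closest to that vertex."""
    points = np.column_stack([load, wind, solar])    # shape (8760, 3), values in [0, 1]
    vertex_hours = {}
    for vertex in product([0.0, 1.0], repeat=3):     # the 8 corners of the cube
        dist = np.linalg.norm(points - np.array(vertex), axis=1)
        vertex_hours[vertex] = int(np.argmin(dist))  # hour nearest to this corner
    return vertex_hours

# Example: vertex_hours[(1.0, 0.0, 0.0)] is the high-load, low-wind, low-solar extreme.
```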
The essential principle behind the extreme hour selection algorithm is to identify the minimum number of hours such that at least one hour is selected with sufficient proximity to each vertex in each region. If we define "sufficient proximity" as "exactly equal," we would use the set of vertex hours themselves. However, some of the vertex hours turn out to be vertices in more than one region, and other hours close to a vertex can represent extreme conditions in multiple regions. Thus, if we allow a selected hour to cover a vertex whenever it lies within some small distance of the actual vertex, the number of representative extreme hours can be reduced. The bubbles in Figure 2‑15 are centered on the identified vertices for Texas and extend five percentage points in each dimension. The tolerances used in the current version of the model are shown in Table X. When we define "sufficient proximity" to each vertex according to these tolerances, the minimum number of "extreme-spanning" hours turns out to be roughly 100.[2]
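Finding the smallest extreme-spanning set can be cast as a set-covering integer program: select the fewest hours such that every (region, vertex) pair has at least one selected hour inside its tolerance bubble. The sketch below expresses that formulation with the open-source PuLP library rather than the GAMS/CPLEX-MIP implementation actually used; the `coverage` data structure and all names are illustrative assumptions.

```python
import pulp

def min_extreme_spanning_hours(coverage):
    """coverage[(region, vertex)] is the set of hour indices falling inside the
    tolerance "bubble" around that vertex in that region (vertex hours included)."""
    hours = sorted({h for hrs in coverage.values() for h in hrs})
    prob = pulp.LpProblem("extreme_spanning_hours", pulp.LpMinimize)
    pick = {h: pulp.LpVariable(f"pick_{h}", cat="Binary") for h in hours}

    prob += pulp.lpSum(pick.values())                    # objective: fewest selected hours

    for (region, vertex), hrs in coverage.items():       # every vertex in every region
        prob += pulp.lpSum(pick[h] for h in hrs) >= 1    # ...must be covered by some hour

    prob.solve()
    return sorted(h for h in hours if pick[h].value() > 0.5)
```

Shrinking the tolerance bubbles shrinks each coverage set, so more hours are needed to satisfy the constraints, consistent with the trade-off noted in the footnote.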

2.5.2 Choosing Cluster Hours
For most of the year, conditions are not at any of these extremes, so using only the extreme hours would over-represent the tails of the load, wind, and solar distributions. Thus, in addition to the extreme hours (which drive capacity requirements), US-REGEN employs a clustering algorithm to select additional segments so that the interior of the joint load-wind-solar distribution is adequately sampled. These interior hours describe shoulder and base operating conditions and better capture region-specific load duration curves, capacity factor distributions, and correlations between load and intermittent resources. Although adding these cluster hours increases runtimes, it yields a better fit for regional outputs such as load duration curves. The model uses k-means clustering to partition the observations of regional load and of capacity factors for intermittent technologies, including all resource classes, into a specified number of mutually exclusive sets ("clusters") by minimizing the within-cluster sums of point-to-centroid distances. The selected hours are the segments closest to the cluster centers ("centroids"). This partitioning provides representative clusters whose hours are as close to each other (and as far from observations in other clusters) as possible. The blue markers in Figure 2‑15 show the hours chosen by the algorithm (both extreme and cluster hours) for Texas.
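A minimal sketch of this clustering step, using scikit-learn's k-means in place of the model's own implementation (array shapes and names are illustrative; in US-REGEN the feature columns would span regional load and wind/solar capacity factors for all resource classes, and whether already-selected extreme hours are excluded from the candidate pool is an assumption here):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_cluster_hours(features, n_clusters, exclude=()):
    """features: (8760, n_features) array of normalized load and wind/solar
    capacity factors; returns the hour index nearest to each cluster centroid."""
    candidates = np.setdiff1d(np.arange(features.shape[0]),
                              np.asarray(exclude, dtype=int))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features[candidates])
    chosen = []
    for centroid in km.cluster_centers_:
        dist = np.linalg.norm(features[candidates] - centroid, axis=1)
        chosen.append(int(candidates[np.argmin(dist)]))  # hour closest to this centroid
    return sorted(set(chosen))
```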
In the 2019 version of US-REGEN, the extreme hours are chosen to achieve the specified tolerances, and then enough cluster hours are added to reach a total of 120 representative hours in the default 16-region version of the model. For comparison, the 2016 version of EIA's National Energy Modeling System (NEMS) uses 9 hours in its electric sector expansion model. Blanford et al. (2018) provides an extended discussion and a comparison with other segment selection methods. When US-REGEN uses different aggregations of states into regions, the number of hours may vary, because the number of extreme hours is determined endogenously to satisfy the given tolerance limit. Additionally, when run in conjunction with the end-use model, a different set of hours is chosen for each time period, since the load shape changes over time as the end-use mix changes.
2.5.3 Weighting Chosen Hours
Once the representative hours have been chosen, they must be weighted such that the sum of weights equals 8,760. That is, for each moment described by a representative hour, in what fraction of the year do those conditions prevail? Since the conditions in a given representative hour likely vary significantly across regions and since only one set of weights can be applied, it is an over-constrained problem to select mean-preserving weights (i.e. weights such that total load and average annual capacity factor for the aggregated distribution are equal to those in the hourly distribution). Thus, the objective of the weighting procedure is to minimize the sum of squared normalized errors between the aggregated averages and the hourly averages across regions for load and each wind and solar class. To avoid numerical problems associated with very small weights, we enforce a lower bound of one (i.e. each representative hour gets a weight of at least one hour). This formulation is easily solved by non-linear optimization in GAMS with errors of 5% or less.
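A minimal sketch of such a weighting problem, using SciPy's SLSQP solver in place of the non-linear program solved in GAMS (the construction of the target averages and all names are illustrative assumptions, and the targets are assumed to be non-zero):

```python
import numpy as np
from scipy.optimize import minimize

def weight_hours(values, targets, n_hours_total=8760):
    """values: (n_segments, n_series) matrix of selected-hour values for each
    regional load, wind, and solar series; targets: length-n_series vector of
    the true annual averages of those series over all 8,760 hours."""
    n_seg = values.shape[0]

    def objective(w):
        # Weighted average of each series across segments vs. the true annual
        # average, normalized by the target so series are on a comparable scale.
        agg = values.T @ w / n_hours_total
        return np.sum(((agg - targets) / targets) ** 2)

    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - n_hours_total}]
    bounds = [(1.0, None)] * n_seg                    # each weight is at least one hour
    w0 = np.full(n_seg, n_hours_total / n_seg)        # start from equal weights
    res = minimize(objective, w0, method="SLSQP", bounds=bounds, constraints=constraints)
    return res.x
```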
In summary, the first step of the aggregation heuristic ensures that the shapes of load, wind, and solar relative to each other are adequately represented, and the second ensures that the magnitude of load and of the wind/solar resource is not significantly altered. A sample of the results is shown for load, wind, and solar photovoltaic output in Texas in Figure 2‑16, Figure 2‑17, and Figure 2‑18. In each figure, the duration curve (i.e. the time series values sorted in descending order) is shown in black for the hourly data (smooth curve) and in red for the aggregated representative hour segments (piecewise linear curve). In each case, while there is some deviation, the basic characteristics of the shape and the area under the curve (for which error was minimized) are preserved by the aggregation. Moreover, it can be shown that the distributions of more complex attributes, such as load relative to wind, are also captured well by the representative hours approach. These graphs are representative of the methodology's results.
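As a minimal sketch of how the aggregated duration curve in these figures can be reconstructed from the weighted segments (purely illustrative names; the hourly series, segment values, and weights would come from the steps above):

```python
import numpy as np

def duration_curves(hourly, segment_values, segment_weights):
    """Return the hourly duration curve and its aggregated approximation.
    hourly: full 8,760-hour series; segment_values/segment_weights: values and
    weights of the representative hours (weights sum to 8,760)."""
    hourly_curve = np.sort(hourly)[::-1]              # smooth curve from the full data

    # Sort the segments by value and expand each one over its weighted duration
    # to approximate the duration curve of the aggregated data.
    order = np.argsort(segment_values)[::-1]
    widths = np.round(np.asarray(segment_weights)[order]).astype(int)
    aggregated_curve = np.repeat(np.asarray(segment_values)[order], widths)

    # The areas under the two curves should nearly match, since the weighting
    # step minimizes the error in total load and average capacity factor.
    return hourly_curve, aggregated_curve
```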



Although this cannot be tested in the dynamic setting, we can evaluate the success in reproducing results in the static setting, which can be run with all 8,760 hours. ↩︎
This result was obtained through the application of the mixed-integer solver CPLEX-MIP in GAMS. With a smaller "bubble" radius, the minimum number increases. The number of representative hours vs. the tolerance of the "bubbles" reflects a trade-off between speed of computation in the model (which is highly sensitive to the number of segments) and accuracy of the approximation. ↩︎