Fast, Accurate Processor Evaluation Through Heterogeneous, Sample-Based Benchmarking

Performance evaluation is a key task in computing and communication systems. Benchmarking is one of the most common techniques for evaluation purposes, where the performance of a set of representative applications is used to infer system responsiveness in a general usage scenario. Unfortunately, most benchmarking suites are limited to a reduced number of applications, and in some cases, rigid execution configurations. This makes it hard to extrapolate performance metrics for a general-purpose architecture, supposed to have a multi-year lifecycle, running dissimilar applications concurrently. The main culprit of this situation is that current benchmark-derived metrics lack generality, statistical soundness and fail to represent general-purpose environments. Previous attempts to overcome these limitations through random app mixes significantly increase computational cost (workload population shoots up), making the evaluation process barely affordable. To circumvent this problem, in this article we present a more elaborate performance evaluation methodology named BenchCast. Our proposal provides more representative performance metrics, but with a drastic reduction of computational cost, limiting app execution to a small and representative fraction marked through code annotation. Thanks to this labeling and making use of synchronization techniques, we generate heterogeneous workloads where every app runs simultaneously inside its Region Of Interest, making a few execution seconds highly representative of full application execution.


INTRODUCTION
Reaching the 50th anniversary of the commercialization of the first CPU-on-a-chip [1], we have witnessed a technology evolution that has turned computing devices into the core component of nearly every activity in our everyday life. Currently, despite the recent emergence of domain-specific processors [2] (led by GPU computing for deep-learning applications), the general-purpose computing model still constitutes a relevant fraction of the semiconductor market. In this computing model, the processor runs applications (often concurrently) with quite dissimilar characteristics. Under these conditions, measuring (and defining) the expected processor behavior (performance) is challenging. Each piece of application code can interact in a different way with the processor microarchitecture, and concurrency might introduce "unwanted" cross-effects, affecting overall system behavior negatively.
Benchmarking is the predominant methodology employed for performance evaluation, providing a standardized way to measure and compare alternative processors. A meticulous selection process is usually performed in order to define a reduced set of applications that are sufficiently representative of a much broader usage scenario, corresponding to a specific target environment (scientific [3], NoSQL serving [4], Machine Learning [5], etc.) or one closer to the "general purpose" scenario [6], [7]. Unfortunately, many of these benchmarking suites present two important drawbacks. First, the number of applications under evaluation is usually limited to a few tens. Even when considered a representative sample, if we want to model performance as a random variable, the available number of values is usually below the recommended limit to reach a reasonable confidence margin in the evaluation process. Second, most of the CPU market share corresponds to environments (desktop, cloud computing) where there is limited control of the kind of applications that run on the same processor chip simultaneously. Current benchmarking metrics (latency, rate-mode throughput) might not suffice to gain insight into the consequences of this resource sharing. Therefore, this makes the re-design of the "representative workload" and "representative metric" concepts necessary.
A straightforward technique to increase the benchmark size (and hence the statistical soundness of the results), targeting both heterogeneous and concurrent environments, consists of a random mix of benchmark applications running in parallel inside the same computer [8], [9], [10]. To the best of our knowledge, this technique is usually employed with a single benchmark suite, and parallel execution relies merely on launching every application in a synchronous way. Despite partly solving traditional benchmarking limitations, this methodology significantly increases the computational cost of the evaluation process (to the point of being impractical in certain conditions). Relying on the same principle of random mixing, in this paper we propose a much more elaborate methodology that avoids these increased costs through the following features:
- Computational resource usage is limited to a small fraction of application code, belonging to its Region of Interest (ROI). Our preliminary explorations demonstrate that many applications from different benchmarks show a similar loop-based ROI structure with repetitive behavior from the microarchitectural viewpoint.
- A fine-grain synchronization process ensures that every application runs its ROI while performance is being measured.
- Automated hardware event counting during evaluation increases the variety of information available about execution, given the profuse list of events available in state-of-the-art processors.
- The methodology is generalized to any application, independently of its benchmark suite. This allows a sort of meta-benchmarking methodology to be created, which can increase metric coverage. To do this, we formally define the code and execution conditions that must be met by a new application to be part of the random mixes.
Following the proposed methodology, we can increase performance metric representativeness, yet under constrained time.
This enables the concurrent exploration of alternative performance metrics (such as fairness) and the study of diverse microarchitectural behaviors.
This work expands on previous work [11] by generalizing our methodology to multiple benchmark suites and enhancing its evaluation features. In this work, we make the following contributions:
- We develop a multi-benchmark tool for exhaustive and accurate system evaluation. Thanks to the automated workload generation, execution and monitoring process, the user gains insight into performance issues transparently and in a feasible amount of time.
- We define and standardize the process to add new benchmarks to the initial application pool, including the conditions that must be met by any candidate application. Around 50 applications have been profiled and employed in this work to test the methodology.
- We carry out a raw performance evaluation of two counterpart server architectures from the two main CPU vendors, AMD and Intel. Our evaluation is compared to a "conventional" one, such as the one performed through the SPEC CPU17 benchmark [6]. Direct access to hardware counters during ROI execution enables elaborate performance evaluation methodologies such as Top-Down [12] and more subtle microarchitectural analysis.
- We extend processor evaluation to micro-architectural parametrization (SMT and hardware prefetching), proving that the technique is suitable to enhance understanding of the effect of these features.

MOTIVATION
As mentioned in the previous section, computational cost can hinder the evaluation process when it moves from a few workloads to several hundreds. This problem has been widely addressed for simulation-based research, where the entire execution of an application is, in most cases, unattainable. To circumvent the problem, sampling techniques (i.e., measuring performance only in a relevant fraction of the original application) are usually employed [13], [14], [15]. Our proposal follows the same approach in a different context: evaluation of real systems when the number of workloads to be considered is impractical for full execution.
The core operation of BenchCast is based on a well-known observation about the execution structure found in many programs. As described in [16], computationally bound applications go through different stages of execution. They usually start with an initialization phase where data structures are set up, moving next to a stage corresponding to the bulk of the execution and ending up with a phase devoted to presenting the application's results. The central stage of the three described is usually labeled as the Region of Interest (ROI), because it corresponds to the largest fraction of execution time and is devoted to the resolution of the main tasks. For this reason, a program's ROI is the most relevant stage in terms of performance. This stage usually has a marked periodical behavior [16] because it tends to be implemented as a set of hierarchical procedures contained in a main loop. Analyzed in detail, this arrangement implies non-uniform behavior from a performance viewpoint, making it difficult to find an execution phase that is representative of the whole program's execution. Fig. 1 shows an example of this time-varying behavior for the 505.mcf application from the SPEC CPU 2017 benchmark [6]. In both graphs, we measured the temporal evolution of alternative performance metrics (instructions per cycle, branch prediction accuracy and L1D Cache miss rate) making use of two different granularities. In the upper graph, performance metrics were collected through the Linux perf command [17], with a fixed period of 100 milliseconds. In contrast, for the lower graph, events were measured at the end of each ROI iteration (variable period), modifying source code to perform this task. The obvious differences between the two graphs reveal a special feature of the aforementioned periodic behavior. When the sampling period is "randomly" selected as a constant time interval, the high variability makes it hard to find a single representative execution phase. 
In contrast, when the sampling period is somehow adapted to the internal structure of the program (fitting in this case the length of a ROI iteration), the performance metrics become much steadier, and average metrics are close to the global ones. According to this observation, we hypothesize that the execution of a single ROI iteration can represent the whole application with accuracy.
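The contrast between the two sampling granularities can be illustrated with a small synthetic experiment. The trace below is invented for illustration (a 250 ms "iteration" with a fast and a slow phase); it only reproduces the qualitative effect shown in Fig. 1, not 505.mcf's actual behavior:

```python
import statistics

def ipc_trace(t_ms):
    """Synthetic per-millisecond IPC: each 250 ms 'iteration' runs a fast
    phase (IPC 2.0) followed by a slow phase (IPC 0.5)."""
    return 2.0 if (t_ms % 250) < 150 else 0.5

def fixed_period_samples(duration_ms, period_ms=100):
    # One instantaneous reading every period_ms, as perf does with -I 100:
    # samples land at arbitrary points of the iteration, so they scatter.
    return [ipc_trace(t) for t in range(0, duration_ms, period_ms)]

def iteration_aligned_samples(duration_ms, iter_ms=250):
    # One averaged reading per full ROI iteration: every window covers the
    # same internal structure, so the readings are steady.
    return [statistics.mean(ipc_trace(t) for t in range(start, start + iter_ms))
            for start in range(0, duration_ms, iter_ms)]

if __name__ == "__main__":
    fixed = fixed_period_samples(10_000)
    aligned = iteration_aligned_samples(10_000)
    print(f"fixed-period stdev:      {statistics.stdev(fixed):.3f}")
    print(f"iteration-aligned stdev: {statistics.stdev(aligned):.3f}")
```

The fixed-period readings show a large spread while the iteration-aligned readings are essentially constant, mirroring the difference between the upper and lower graphs of Fig. 1.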
The next step in this process consisted of the exploration of a large set of applications to verify our hypothesis. We extended this kind of analysis (see Section 3.4 for system configuration) to all the applications from three different benchmarks focused on stressing the system's processor and memory subsystem: SPEC CPU17 [6], Parsec [7] and the NAS Parallel Benchmarks [3]. SPEC CPU is an industry-standardized suite with 23 benchmarks (rate mode) organized in two different suites (int and float), representative of very different application areas (from desktop to scientific). Similarly, the Parsec suite contains 13 applications focused on emerging fields (computer vision, animation physics, financial analytics, etc.), attempting to be representative of next-generation software. Finally, the eight workloads from the NAS Parallel Benchmarks have a more specific target, all of them being derived from computational fluid dynamics. Fig. 2 summarizes the results obtained in our exploration. We were able to identify a loop-based ROI in 47 out of 50 applications (94 percent). For each of these 47 applications, we measured performance values for each ROI iteration, calculating the average and standard deviation of each metric dataset. Next, average values were compared to full-execution results, calculating their relative error, which is the value represented by the horizontal bars in Fig. 2. The relative errors of each performance metric form the graph and, as can be seen, almost every application presents a total value below 10 percent.
Therefore, in most cases, it seems accurate to consider that a single iteration of the main loop inside the ROI represents the whole execution with a high degree of confidence. This means it could be possible to reduce the computational effort required to evaluate heterogeneous workloads. If ROI execution can be synchronized, then simply running one (or a few) iterations of the ROI loop of each application simultaneously would be enough to characterize system performance for each workload. This is the cornerstone of the proposed methodology, mixing smart sampling and synchronization to build a computationally feasible and statistically sound evaluation methodology. Throughout the rest of the paper the proposal is thoroughly described (Section 3) and alternative evaluation procedures are presented (Section 4). To facilitate access to the tool by other researchers and simplify the adoption of their own modifications, a public source code repository and project management tools have been made available (https://github.com/prietop/BenchCast).

METHODOLOGY (BENCHCAST)
The three main features of BenchCast are described in detail in the following subsections.

Application Profiling & ROI Evaluation
Despite being found in most of the applications analyzed, not every workload code corresponds to the loop-ROI structure, or shows the observed steady state between iterations. For this reason, every new application proposed as part of BenchCast must fulfill the set of requirements defined in this section. The profiling process was standardized to guarantee minimal deviation between the fraction of ROI executed and the whole application. Unfortunately, given the heterogeneous nature of the methodology (multi-language, multi-benchmark, etc.), the complete automation of this profiling process was nearly impossible, and minimal manual work was required to identify and label the ROI.
In summary, this preliminary process involves the following steps:
- ROI identification: the application is profiled to identify the functions consuming the largest fraction of execution time.
- Loop labeling: the previous functions are analyzed looking for the outer loop structure. Code is annotated to measure the fraction of time spent in that loop, considering only applications returning values over 70 percent.
- Variability analysis: several performance metrics are measured for every loop iteration. Variability is measured to find out the sample size (number of iterations) required for a pre-defined error and confidence interval.
- Execution time: the time required to execute the number of iterations calculated previously is estimated. Only those applications with a value below a certain threshold (in this case the maximum ROI is set to 20 seconds) are eligible. The total time required to perform the measurement is highly sensitive to this parameter (and hence, the chosen threshold is relatively small).
To gain insight into this process, we will walk through a specific example in detail. Fig. 3 describes the steps for the 505.mcf application, corresponding to the SPEC CPU2017 benchmark.
The process starts with hot code-path exploration (ROI candidates). Stack traces are captured to generate their associated call-graph (calling relations between code functions), and the later profiling can be performed with the scripting tools provided with perf (stackcollapse) or by generating a graphical representation called a Flame Graph [18]. Both solutions provide alternative representations of equivalent information. Fig. 3 shows the Flame Graph for 505.mcf (Step 1), where we can identify the function chain consuming the largest fraction of execution time. It corresponds to the following stack: main → global_opt → primal_net_simplex → master → primal_bea_mpp → spec_qsort. This part of the process, automated in BenchCast, finishes by locating the source code files where these functions are defined. The information generated in this process facilitates the manual annotation performed in the next step.
Main loop identification is the only supervised action in this process. This search is performed from bottom to top of the flame graph (the bottom functions in Fig. 3 are those consuming a larger fraction of time). Every function is examined looking for the outermost loop structure. In this case, main and global_opt functions can be found in the same file, mcf.c. It is easy to identify a while declaration inside global_opt consuming about 90 percent of the execution time. Once located, it is necessary to verify that this loop consumes a significant fraction of total execution time. In our experiments, only those loops consuming more than 70 percent of total execution time are considered as a suitable ROI. The loop code is annotated to measure both execution time and performance metrics for every iteration. As Fig. 3 (Step 2) shows, we make use of the PAPI C interface to obtain a precise event count every iteration.
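The annotation pattern described above can be sketched as follows. In the tool itself the annotation uses the PAPI C interface (Fig. 3, Step 2); this dependency-free Python sketch substitutes a wall-clock timestamp for the PAPI event reads, and `run_annotated_roi` is an illustrative name, not part of BenchCast:

```python
import time

def run_annotated_roi(kernel, iterations):
    """Annotated ROI loop: take one measurement at the boundary of every
    iteration of the outer loop (in BenchCast, a PAPI event read; here,
    a wall-clock timestamp standing in for it)."""
    per_iteration = []
    for i in range(iterations):
        t0 = time.perf_counter()   # PAPI read in the real annotation
        kernel(i)                  # body of the outer ROI loop
        per_iteration.append(time.perf_counter() - t0)
    return per_iteration
```

Collecting one sample per iteration (rather than at a fixed period) is what produces the steady per-iteration metrics shown in the lower graph of Fig. 1.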
Once identified as a valid ROI loop, the next step consists of a variability analysis of performance metrics across loop iterations. The mean and standard deviation of IPC, Bpred accuracy and L1Cache hit rate values are obtained (Step 3). Making use of these values, the sample size (number of iterations to be executed) can be estimated for a pre-defined error rate and confidence level.
Previous evaluations [11] show that a 10 percent error with a 95 percent confidence level is enough to ensure the representativeness of the workloads generated. In the 505.mcf application, the required number of iterations is 2 (the maximum of the three Ns obtained). As a final step, the execution time required to run N iterations is calculated, and only when this value is below the 20-second threshold is the application included as part of BenchCast.
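The sample-size step follows the standard normal-approximation formula; a minimal sketch (the function name and the numeric inputs in the test below are illustrative, not values taken from Table 1):

```python
import math

def required_iterations(mean, stdev, rel_error=0.10, z=1.96):
    """Number of ROI iterations needed so that the sample mean lies
    within rel_error of the true mean at the confidence level implied
    by z (z = 1.96 for a 95 percent confidence level)."""
    n = (z * stdev / (rel_error * mean)) ** 2
    return max(1, math.ceil(n))
```

For each application, this estimate is computed per metric and the final N is the maximum across the three metrics, as done for 505.mcf.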
A similar process to the one described here was carried out for every application in Fig. 2. The values in Table 1 summarize the results of this profiling process, showing information about the ROI fraction of total execution time (second column), the per-iteration average and standard deviation of the main performance metrics (IPC, L1D, L1I, Bpred columns) and the N-iteration execution time (last column).
The literature does not provide a formal definition of which fraction of execution is required to establish that a portion of code makes up a ROI. To select an appropriate value for this threshold, we decided to guarantee that the IPC measured for the whole ROI and the whole application should have a relative error below 5 percent. A 70 percent ROI keeps the applications listed in Fig. 2 below this error rate. Similarly, the 20-second threshold for ROI execution fulfills two conditions. First, it is small enough to ensure an evaluation process at least one order of magnitude faster than a complete one (in this case, the average execution time of whole applications is around 300 seconds). Second, it is large enough to fulfill the representativeness error and confidence interval margins for most applications.
For some applications, a large variability was observed between iterations (high Stdev values), caused mainly by variable ROI behavior across different execution phases. In many of these cases we observed that the phases respond to simple patterns, making it feasible to split the application into multiple workloads, one for each phase [11].
After this analysis, only 5 applications were ruled out. Three of them with a ROI execution fraction below the 70 percent limit (538, 704, 806) and the remaining two exceeding the 20-second threshold imposed for ROI execution (510, 710). Relative error and 20-second ROI are mutually related. Relaxing error-related values could lead to a smaller number of discarded applications if ROI length is maintained or to an even shorter ROI execution for the same applications.

ROI Annotation & Synchronization
Once the ROI of each selected application is known, heterogeneous workloads can be defined. This increases the available number of samples in our evaluation mechanism. All the applications in any workload will be executing their region of interest simultaneously, and each application should execute at least one iteration of its main loop. To achieve this, we create a master application launcher that executes each application of the workload and synchronizes them at the beginning of their ROIs. BenchCast uses a POSIX thread barrier mapped onto a shared memory region through a POSIX shared memory object. The barrier and the shared memory object are created by the BenchCast master launcher. We append barrier calls within the ROI annotation code introduced in the previous section. The BenchCast master launcher then creates child processes for each application to be executed in the workload, attaching each process to a different core (or hardware context) of the system under evaluation using the Linux sched_setaffinity system call. The BenchCast master and the applications wait at the same barrier until all the applications reach their ROI. This process can be repeated as many times as needed; in our experiments, workloads usually begin after all applications have executed at least one ROI loop (so the workload starts the second time the barrier is reached). Then, the barrier is raised and disabled, and measurements can begin with all the applications executing their ROI concurrently.
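The master/children synchronization scheme can be sketched as follows. The real tool uses a process-shared POSIX barrier placed in a POSIX shared memory object and C annotations; this simplified sketch uses Python's multiprocessing.Barrier instead, and the application names are placeholders:

```python
import multiprocessing as mp
import os

def run_app(name, barrier, core):
    # Pin the child to one core/hardware context, as the master launcher
    # does with sched_setaffinity (ignored if the core is unavailable).
    try:
        os.sched_setaffinity(0, {core})
    except (AttributeError, OSError):
        pass
    # ... application initialization phase would run here ...
    barrier.wait()   # annotation placed at the beginning of the ROI
    # ... ROI iterations execute while the master measures ...
    return name

if __name__ == "__main__":
    apps = ["505.mcf", "519.lbm", "531.deepsjeng"]   # placeholder mix
    barrier = mp.Barrier(len(apps) + 1)              # +1 slot for the master
    children = [mp.Process(target=run_app, args=(a, barrier, i))
                for i, a in enumerate(apps)]
    for p in children:
        p.start()
    barrier.wait()   # master: released once every app reaches its ROI
    print("all applications inside their ROI; measurement can start")
    for p in children:
        p.join()
```

Because the master holds one slot of the barrier, it is released exactly when the last application enters its ROI, which is the moment measurements can safely begin.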
BenchCast comes with code annotations for SPEC17, PARSEC and NPB applications. BenchCast includes the necessary information to launch the applications of these benchmarks as well as the PATH to the local installation.
To add a new application to the pool (provided it complies with the previous section's requirements), some information must be provided to the master launcher program, such as the PATH to the new application and its launch command.

Workload Generation and Execution
BenchCast both creates workloads and evaluates their behavior during execution. Making use of the PAPI library and attaching PAPI events to the applications executing on the system, BenchCast can measure any performance counter available through the PAPI interface. The PAPI library and PAPI event initialization are performed by the BenchCast master launcher, and the event list is provided through an easy-to-modify configuration file. Example configuration files for top-down analysis and basic performance analysis are provided.
To perform an evaluation using BenchCast, we dynamically generate a sufficient variety of workloads so that the results are statistically significant. Workloads are generated choosing randomly among the available applications in the pool (SPEC2017, PARSEC and NPB out of the box). By default, BenchCast launches one application per available core in the system under test. If the number of selected applications is fewer than the number of available cores, multiple copies of each application are launched until all hardware contexts are allocated.
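The generation step above can be sketched in a few lines; the function name and pool contents are illustrative, not BenchCast's actual interface:

```python
import random

def generate_workload(pool, n_cores, n_apps, seed=None):
    """Pick n_apps distinct applications from the pool at random and
    replicate them round-robin until every core (or hardware context)
    of the system under test has one application assigned."""
    rng = random.Random(seed)
    chosen = rng.sample(pool, n_apps)
    return [chosen[i % n_apps] for i in range(n_cores)]
```

For example, requesting 2 distinct applications on an 8-core machine yields a list of 8 entries containing 4 copies of each chosen application.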
Once the applications reach the synchronization point, at the beginning of their ROIs, they start running simultaneously. The master launcher then starts the PAPI measurement, for the duration of at least one loop of the ROI (at least 20 seconds). Once the execution completes, BenchCast stops the measurement and stops all the applications, so the next workload execution can be initiated. The results obtained through the performance counters are written in a results file when each workload ends.
Among others, BenchCast provides the following parameters to perform an evaluation of a system:
- Number of cores: number of cores to use on the system. By default, all the cores available (including simultaneous multithreading hardware contexts), but a lower number can be provided so that some of the cores of the system under test are left unused during the evaluation.
- Number of applications: number of different applications used in each workload. Multiple copies of each application are launched until each of the selected cores has one application.
- Number of workloads: number of different workloads generated for the complete evaluation.
- Event list: a file containing the list of PAPI events to be measured for each application in each workload.
- Measurement time: the execution time each application runs for the evaluation. Typically 20 seconds, to guarantee at least one iteration of the ROI loop.

Methodology Validation
For the experiment in this section, we used a desktop-like computer configuration, an Intel i5-7500 4-core chip running at 3.4 GHz with 6 MB of cache and a main memory of 16 GB. The software stack corresponds to the Debian 9 distribution (Linux kernel 4.9.0). 1000 random combinations are generated, enough to guarantee that the variables follow a normal distribution. For TOTAL workloads, each core runs a single application of the combination in an "infinite loop" and execution is terminated when every application completes at least one execution. BenchCast results are obtained executing 20 seconds of their synchronized ROIs. For this number of applications and these execution-time values (20-second ROI versus 300 seconds of average application execution time), BenchCast reduces the computational cost from more than a week to only 20 hours. These savings hold for each experiment performed, meaning that all the data included in this paper were obtained in less than 7 days, in contrast to the multiple months that would have been necessary without the proper methodology. Fig. 4 shows the IPC histogram for both experiments. The degree of similarity between the two measurements suggests that the performance figures of BenchCast are equivalent to those of full application execution, at a fraction of the computational cost. This postulation is statistically supported through a two-sample Kolmogorov-Smirnov (henceforth KS) test [19]. This is a nonparametric test used to compare the equality (probability distribution fit test) of two data samples. The KS statistic is based on the largest vertical difference between the cumulative distribution functions (CDF) of both samples and is defined as

$D_N = \max_x \left| CDF_{ITER}(x) - CDF_{TOTAL}(x) \right|,$

where $CDF_{ITER}$ and $CDF_{TOTAL}$ are the empirical distribution functions of the samples under test and N is the number of observations. This KS statistic is meant for testing the (null) hypothesis of both samples coming from a common distribution.
The hypothesis regarding the distributional form is rejected if the test statistic D is greater than a critical value obtained from a table [19]. In this case, with a number of samples larger than 40 and a 1 percent significance level, the critical value can be calculated as

$D_{crit} = 1.63 \sqrt{\frac{N_1 + N_2}{N_1 N_2}} \approx 0.0729 \quad (N_1 = N_2 = 1000).$

According to the data collected for both samples, the maximum difference is 0.0109, which is less than the critical value. Therefore, we would accept, at the 1 percent significance level, the hypothesis that both sample distributions come from the same population.
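The KS check above can be reproduced with a short script; a minimal sketch of the two-sample statistic and the 1 percent critical value (function names are illustrative):

```python
import bisect
import math

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest vertical gap between the two
    empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    def ecdf(s, x):   # fraction of sorted sample s with values <= x
        return bisect.bisect_right(s, x) / len(s)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

def ks_critical_value(n1, n2, c_alpha=1.63):
    """Critical value of the two-sample test; c_alpha = 1.63 corresponds
    to the 1 percent significance level."""
    return c_alpha * math.sqrt((n1 + n2) / (n1 * n2))
```

For two samples of 1000 workloads each, the critical value is about 0.0729, well above the 0.0109 difference observed between the BenchCast and TOTAL distributions.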
To gain even more insight into similarity we evaluate the random variable e(w), defined as

$e(w) = 1 - \frac{IPC_{iter}(w)}{IPC_{total}(w)}.$

In words, e(w) is the per-workload IPC relative error between TOTAL and ITER results. Both $IPC_{iter}$ and $IPC_{total}$ can be approximated by a normal distribution [11], and as the results in Fig. 5 show, the generated error variable e(w) seems to fit a similar kind of distribution. As can be seen, the average value of the error distribution is 0.0012, while the standard deviation is 0.014. These values mean that, under the normal approximation, the vast majority of workloads (roughly two standard deviations around the mean) show a relative error below 3 percent.

SYSTEM EVALUATION THROUGH BENCHCAST
In this section, we will describe the versatility of the BenchCast methodology to carry out alternative performance evaluations. It should be noted that all these evaluations would also be possible running complete applications, but at a prohibitive computational cost.
The flexibility of hardware performance counters enables the evaluation of many events, which provides not only performance metrics, but also enables the analysis of the hidden architectural causes explaining these results. For this reason, each of the experiments presented provides both the raw performance numbers and additional information about microarchitectural behavior, which leads to a much more consistent discussion of results.
The number of potential experiments is nearly as large as the number of events available in the performance monitoring unit. In this work we limit the content to three basic experiments that provide a reasonable idea of the strengths of our tool. These experiments are described in the next three subsections.

System-wide Performance
The first experiment compares two alternative commercial processors using the proposed methodology. Basic performance evaluation with BenchCast is carried out measuring the total number of instructions retired during the 20-second ROI interval (IPC is not a valid comparison metric in this case due to the different operating frequencies of the processors evaluated and potential power scaling across measurements). Two similar servers with AMD and Intel processors are used. The first server configuration is a two-socket system with two Intel Xeon Silver 4216 chips running at 2.10 GHz, with 22 MB of cache and a main memory of 110 GB. The counterpart server configuration corresponds to a two-socket AMD EPYC 7352 with a 24-core processor per socket (48 cores in total) at 1.5 GHz (up to 3.2 GHz), with 128 MB of cache and 110 GB of main memory. The software stack is the same in both systems. The same 1000 workloads are generated, executed and profiled to obtain final results in both systems. Our performance results are compared to the SPECrate metrics collected through the "official" procedure described in the "Run and Reporting Rules" section of the SPEC CPU documentation [6].
Both SPECrate (above) and BenchCast (below) results are presented in Fig. 6. In both cases, the performance metrics are represented with a frequency histogram (bars) and its estimated probability distribution function (lines). Results are normalized to Intel's average value. As observed in Fig. 6 (above), the SPECrate evaluation estimates a 2.33 times better average performance for the AMD server compared to the Intel one. With 1.5 times the core count (32 vs. 48), AMD seems to obtain better per-core performance than its counterpart. Unfortunately, the small number of workloads employed by SPECrate provides two probability distributions with a large standard deviation, reducing the confidence interval below 50 percent, which is far from statistical standards.
Comparing SPECrate results to BenchCast ones, we observe two significant differences. First, the number of workloads evaluated with BenchCast enables a drastic reduction in standard deviation. Second, the margin of AMD is reduced from 2.33 to 1.62 in this case. This result indicates that when a large, heterogeneous number of workloads is evaluated, per-core performance becomes nearly equal in both server configurations and the only advantage of the AMD server comes from the number of cores. This substantial difference between the two evaluation methodologies could be a determining factor in a tradeoff metric such as performance-cost in certain multitenancy environments, such as cloud providers.
BenchCast enables us to move one step further in the performance comparison process. Thanks to event counting tools, we can explore in detail the divergence observed between the SPECrate and BenchCast results. Since we are interested in the performance of a single core (not a hardware context; SMT is disabled for this experiment), we divide the performance results by the number of cores of each processor chip and define the random variable D(w) as

$D(w) = \frac{Perf_{Intel}(w)}{N_{cores}^{Intel}} - \frac{Perf_{AMD}(w)}{N_{cores}^{AMD}},$

that is, the per-core performance difference between the Intel-based and AMD-based servers running the same workload w. Since both performance variables can be approximated by a normal distribution [11], D(w) is also normal. We show the results in Fig. 7. In this graph, the x-axis indicates which processor performs better (Intel for positive values, AMD for negative ones), confirming the similar behavior pointed out in Fig. 6 (BenchCast). The estimated distribution mean is -0.01, which represents a marginal advantage of AMD cores over Intel ones. With this value, we can conclude that on average both cores perform similarly, and the AMD-server benefit is derived nearly exclusively from core count.
Despite the near-zero mean, the standard deviation indicates the presence of many non-zero values where processors perform differently. This dataset can be useful to continue obtaining relevant performance information, dividing workloads executed into two groups, depending on their side of the x-axis. Thus, we could determine whether the workloads from each side have different features which could indicate the strengths and weaknesses of each processor microarchitecture.
For this group analysis we employ a performance analysis methodology known as Top-Down [12]. It is a practical method to identify true bottlenecks in out-of-order processors, built on top of the performance counters available in Intel microarchitectures. From the total pipeline slots (the number of instructions that can be issued/retired per cycle), Top-Down estimates which fraction is utilized by "good instructions" and which fraction remains empty due to stalls in different parts of the processor pipeline. Processor stalls are classified following a hierarchical approach. At the top of this hierarchy, four major categories are defined:

Frontend Bound: fraction of slots wasted because the frontend undersupplies instructions to the backend, mainly due to fetching and decoding issues.

Backend Bound: fraction of slots wasted because no uops are delivered at the issue pipeline due to a lack of required resources, mainly memory hierarchy or functional unit issues.

Bad Speculation: fraction of slots wasted due to incorrect speculation associated with branch prediction.

Retiring: fraction of slots utilized by issued uops that get retired, i.e., "good instructions" (100 percent Retiring corresponds to the maximal IPC of the given microarchitecture).

In this experiment we evaluate the behavior of the two application groups (x-axis sides), analyzing whether Top-Down results differ. We limit this evaluation to the Intel server, which is enough to provide a preliminary idea of what makes each processor core better or worse from a software perspective. For the same set of workloads employed throughout this section, we obtained the probability distribution function of each of the four categories, shown in Fig. 8. With one graph per category, we pair the results of both groups in order to check for any observable difference. Both the Frontend Bound and Bad Speculation categories obtain quite similar results, which means that no particular difference is noted between the two groups for these parts of the pipeline. In contrast, there is a significant difference in the Backend Bound category, where we can observe that those applications with a more relevant bottleneck in the backend seem to behave better on Intel processors than on AMD ones. Applications pressuring the core backend seem to be better suited to the Intel-based server. To understand the source of the inefficiencies of the AMD backend, this exploration should analyze the lower levels of the Top-Down hierarchy. However, this is beyond the scope of this work, which is limited to demonstrating that this kind of systematic analysis is feasible through the BenchCast tool.
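The four top-level fractions can be derived from a handful of raw events. The sketch below follows the canonical level-1 Top-Down formulas for a 4-wide Intel core (corresponding to events such as IDQ_UOPS_NOT_DELIVERED.CORE, UOPS_ISSUED.ANY, UOPS_RETIRED.RETIRE_SLOTS and INT_MISC.RECOVERY_CYCLES); the counter values used in the test are illustrative, not measurements from this work:

```python
# Level-1 Top-Down breakdown for a 4-wide out-of-order core.
# Total slots = issue width * unhalted cycles; each category is a
# fraction of those slots, and the four fractions sum to 1.
def topdown_level1(cycles, uops_issued, uops_retired_slots,
                   idq_not_delivered, recovery_cycles, width=4):
    slots = width * cycles
    frontend = idq_not_delivered / slots
    bad_spec = (uops_issued - uops_retired_slots
                + width * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    backend = 1.0 - frontend - bad_spec - retiring
    return {"frontend": frontend, "bad_speculation": bad_spec,
            "retiring": retiring, "backend": backend}
```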

Simultaneous Multithreading
Simultaneous Multithreading (SMT) is a performance enhancement technique present in almost every modern general-purpose processor. It basically consists of splitting a physical core into multiple (usually two) virtual cores known as hardware threads or hardware contexts. This organization allows two instruction streams to run simultaneously through the same pipeline, improving aggregate ILP through better processor resource utilization (especially if one of the threads has a clear bottleneck in some stage of the execution pipeline). The final performance improvement is usually far from the theoretical upper limit of adding the full IPC of each thread to the aggregate. The available hardware resources are shared with a "rival", and this has a significant impact that depends on the nature of each thread (even to the point of being detrimental under certain scenarios or resource sharing policies). Using BenchCast it is possible to estimate the actual benefit of enabling SMT in a general scenario, as well as to understand how each shared resource can impact performance.
For this exploration, we limit our experiments to the Intel-based server configuration employed in the previous section. 1000 randomly generated workloads were executed, first with SMT activated and then with SMT deactivated through EFI settings. The results of both executions are shown in Fig. 9, represented by a frequency histogram (bars) and its estimated probability distribution function (lines). Performance values were normalized to the NoSMT results (mean = 1 for the NoSMT distribution). According to the graph, SMT improves raw processor performance by 25 percent on average. The measured SMT benefit is far below the theoretical upper limit, which means that each vCPU performs roughly 40 percent worse than a physical core.
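The gap between the measured 25 percent aggregate gain and the 2x theoretical limit translates directly into per-context efficiency. The computation is trivial, but making it explicit clarifies where the "roughly 40 percent" figure comes from:

```python
# With SMT, two hardware contexts share one physical core. If the aggregate
# speedup over the no-SMT run is S, each context delivers S / 2 of a core.
def per_context_efficiency(aggregate_speedup, contexts=2):
    return aggregate_speedup / contexts

eff = per_context_efficiency(1.25)  # 0.625: each vCPU ~37.5% below a core
```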
As mentioned previously, access to every available performance counter allows us to look for the multiple sources of inefficiency and their contribution to the observed performance gap. Concerning SMT, we can distinguish between two kinds of shared resources: core-level and processor-level resources. Core-level resources are those shared inside each physical core, such as the L1 caches, branch prediction, issue queue, etc. Processor-level resources are those shared among all cores, such as the Last Level Cache or memory bandwidth. Through the appropriate selection of hardware events, BenchCast enables fast exploration of the performance effect of SMT on the different core-level and processor-level resources. For this work we limit our experiments to the most performance-sensitive elements: branch prediction and the cache hierarchy (both L1 and LLC).
First, we evaluate the different behavior of three core-level resources in the presence and absence of SMT: the branch predictor and the L1 data and instruction caches. Results for these metrics are presented in Fig. 10 as cumulative frequency graphs of misses/mispredictions per kilo instruction (MPKI). In this kind of graph, the x-axis represents the parameter under test, while the y-axis indicates the fraction of workloads that fall below that MPKI value. The presence of two instruction streams doubles, on average, the instruction cache misses per kilo-instruction. However, contrary to the intuition that sharing the instruction cache between two threads would be harmful, the final impact of this degradation on overall performance seems to be insignificant, because very low miss rates are observed in both cases. The negative impact of SMT is more subtle on both branch prediction and L1D performance. In the case of branch prediction, the low values on the x-axis indicate that the degradation might have a minimal impact on performance. In contrast, L1D MPKI results are one order of magnitude larger, which could indicate that the pressure on this cache has a greater impact on performance results.
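The two quantities behind Fig. 10 are straightforward to compute from raw counters. A minimal sketch of both the MPKI metric and the cumulative-frequency curve (fraction of workloads at or below each value):

```python
# MPKI: misses (or mispredictions) per thousand retired instructions.
def mpki(misses, instructions):
    return misses / instructions * 1000.0

# Cumulative frequency curve: for each sorted value x, the fraction of
# workloads whose metric is <= x (the y-axis of Fig. 10-style plots).
def cumulative_curve(values):
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]
```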
Concerning uncore resources, we close this section by analyzing the impact of SMT on L3 performance. As can be seen in Fig. 11, SMT produces a degradation similar to that observed in the L1D. While LLC capacity remains constant, the number of active working sets doubles with SMT, increasing the number of misses per kilo instruction. The exploration enabled by event counting in BenchCast allows us to conclude that the pressure imposed by application datasets on the memory hierarchy has a more noticeable effect on performance than doubling the instruction working sets or branch patterns.

Hardware Prefetching
The last BenchCast use-case example is focused on hardware prefetching, a fundamental technique to tolerate cache miss latency in state-of-the-art processors [20], [21]. Proactively fetching data from slower locations to a faster cache level in advance can significantly reduce the average memory access time. Nearly every modern processor includes some hardware prefetching support, exploiting simple and regular access patterns. This is the case in many Intel microarchitectures (starting with Nehalem), where four different data prefetchers are implemented in hardware. Two of these prefetchers, known as DCU prefetchers, are associated with the L1 data cache, where prefetching is triggered by load instructions when certain conditions are met [22]. The streaming prefetcher is triggered by an ascending access to recently loaded data; assuming it is part of a streaming algorithm, the next line is automatically fetched. A PC-based prefetcher keeps track of individual load instructions, looking for a regular stride. When one is found, a prefetch is sent to the next address, the sum of the current address and the stride. The two remaining prefetchers are associated with the L2 cache. The L2 Spatial Prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk. The L2 Streamer prefetcher monitors read requests (loads, stores, L1 prefetches and code fetches) from the L1 cache, looking for ascending and descending sequences of addresses. When a stream of requests is detected, the anticipated cache lines are prefetched.
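The PC-based (IP-stride) behavior described above can be captured by a tiny model: per load PC, remember the last address and last stride, and issue a prefetch to addr + stride once the same stride repeats. This is a simplified illustration of the technique, not Intel's actual implementation:

```python
# Minimal IP-stride prefetcher model: one (last_addr, stride) entry per
# load instruction address (PC).
class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Observe a load; return a prefetch address, or None."""
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            # Same non-zero stride seen twice in a row: prefetch ahead.
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetch
```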
Again, the purpose of this section is to demonstrate the versatility of BenchCast, carrying out a detailed analysis of the performance effect of prefetching. The Model Specific Register (MSR) with address 0x1A4 will be used to control the activation/deactivation of these prefetchers. We define different combinations of enabled/disabled prefetchers, analyzing performance metrics for each of them. Fig. 12 shows the IPC distribution of every prefetcher enabled (ALL), L1 prefetching disabled and L2 enabled (L2), L2 disabled and L1 enabled (L1) and every prefetcher disabled (NONE). In this graph, all performance values were normalized to NONE mean.
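Per Intel's public documentation, the low four bits of MSR 0x1A4 disable (when set) the L2 streamer, the L2 adjacent-line prefetcher, the DCU streamer and the DCU IP prefetcher, respectively. The sketch below only computes the mask for each configuration used in this section; actually writing the MSR requires root privileges and a tool such as wrmsr or the /dev/cpu/*/msr interface:

```python
# MSR 0x1A4 prefetcher-control bits (1 = disabled), per Intel documentation.
L2_STREAMER, L2_ADJACENT, DCU_STREAMER, DCU_IP = 1 << 0, 1 << 1, 1 << 2, 1 << 3

def prefetch_mask(l1_enabled, l2_enabled):
    """Build the disable mask for a given L1/L2 prefetcher configuration."""
    mask = 0
    if not l2_enabled:
        mask |= L2_STREAMER | L2_ADJACENT
    if not l1_enabled:
        mask |= DCU_STREAMER | DCU_IP
    return mask

# The four configurations evaluated in this section:
CONFIGS = {
    "ALL":  prefetch_mask(l1_enabled=True,  l2_enabled=True),   # 0x0
    "L2":   prefetch_mask(l1_enabled=False, l2_enabled=True),   # 0xC
    "L1":   prefetch_mask(l1_enabled=True,  l2_enabled=False),  # 0x3
    "NONE": prefetch_mask(l1_enabled=False, l2_enabled=False),  # 0xF
}
```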
As expected, the absence of prefetching has a negative effect on performance: the average IPC decreases from 1.6 to 1.29, which corresponds to a 20 percent performance degradation. Another observable result is the unbalanced contribution of the L1 and L2 prefetchers to the performance improvement. The activation of prefetching at each level has a positive effect in both cases, but it seems to be more relevant in the case of L2. The reason for this result might be the large penalty of LLC misses and the larger LLC size, which minimizes the negative pollution effects caused by prefetching. A noteworthy result is that when both prefetchers are combined, there is no benefit compared to L2-only prefetching.
In order to establish which fraction of workloads undergoes a performance degradation caused by prefetching, we also define three performance-difference variables (ALL vs. NONE, L2 vs. NONE and L1 vs. NONE), each one measuring, for the same workload, the performance difference between the corresponding prefetcher configuration and the NONE configuration. The normal distribution of these three variables is shown in Fig. 12. In this graph, all values below zero represent those workloads with poorer performance after prefetching activation. As can be seen, activating both prefetchers, or only the L2 ones, improves performance in a consistent way. According to the measured means (0.296, 0.27) and standard deviations (0.117, 0.105), less than 1 percent of applications will suffer a performance degradation. In the case of L1 prefetching, this fraction grows to 3 percent of applications (0.151 mean, 0.068 stdev). This still represents a small fraction but, combined with the lower average IPC improvement, explains its worse results when compared to L2 prefetching.
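Under the normal approximation, the "fraction of workloads degraded" follows directly from the distribution parameters: it is the probability mass below zero. A quick check with the reported means and standard deviations:

```python
import math

# Probability that a normally distributed performance difference is
# negative, i.e., the fraction of workloads degraded by the prefetcher(s).
# P(D < 0) for D ~ N(mean, stdev), via the standard normal CDF.
def frac_degraded(mean, stdev):
    return 0.5 * (1.0 + math.erf((0.0 - mean) / (stdev * math.sqrt(2.0))))

p_all = frac_degraded(0.296, 0.117)  # ALL vs. NONE: well below 1 percent
p_l2 = frac_degraded(0.270, 0.105)   # L2 vs. NONE: well below 1 percent
```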
Again, access to event counting allows us to get a sense of this result. In this case, we analyzed how the different hierarchy levels react to changes in the prefetcher configuration. As an initial step, we focused our attention on the L2 cache, as it is the closest-to-processor level where prefetches are stored [22]. Fig. 13 shows the cumulative frequency distribution of the MPKI for each prefetcher combination. These results are consistent with the performance ones, showing an MPKI improvement as more prefetchers are activated. This result shows that the L2 MPKI improvement has a direct impact on performance, with a similar tendency being observed for both metrics. Despite seeming obvious, we highlight this conclusion because, as we will see next, this relationship is not as straightforward for every metric, and in some cases a closer look is necessary.
Next, focusing on LLC metrics, we obtained the MPKI results shown in Fig. 14 (above). As can be seen, in this case the evolution of MPKI is inconsistent with the performance results. The best MPKI results were obtained when no prefetcher was activated, and MPKI degraded progressively as prefetchers were enabled. This apparently contradictory behavior has a simple explanation: if the raw LLC access numbers are analyzed, it can be immediately observed that the activation of hardware prefetching doubles the number of LLC accesses on average. If we move to alternative events and measure the miss rate in the LLC (see Fig. 14 (below)), we can observe that, in this case, the results are consistent with the performance results and also with the expected prefetching behavior.
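The apparent contradiction between LLC MPKI and miss rate is purely a normalization effect: MPKI divides misses by instructions, while miss rate divides by accesses, and prefetching inflates the access count. A toy example with made-up numbers, showing the two metrics moving in opposite directions:

```python
# MPKI normalizes misses by instructions; miss rate normalizes by accesses.
def llc_metrics(misses, accesses, instructions):
    return {"mpki": misses / instructions * 1000.0,
            "miss_rate": misses / accesses}

# Hypothetical workload: prefetching doubles LLC accesses and slightly
# raises absolute misses, yet the miss *rate* still improves.
no_pf = llc_metrics(misses=10_000, accesses=50_000, instructions=1_000_000)
with_pf = llc_metrics(misses=12_000, accesses=100_000, instructions=1_000_000)
# with_pf["mpki"] > no_pf["mpki"], but with_pf["miss_rate"] < no_pf["miss_rate"]
```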

RELATED WORK
The search for tools and methodologies targeting computer performance evaluation has been constant over time. However, both hardware and (mainly) software heterogeneity have increased the complexity of this task. Consequently, many benchmark suites are currently available, representative of the multiple environments where computing systems can be found. Thus, hardware environments such as image processing (GPUs) or High-Performance Computing (HPC) employ specific benchmark suites. Some examples of GPU benchmarking tools are Rodinia [23], Parboil [24] and Lonestar [25]. In contrast, HPC employs suites such as the NAS Parallel Benchmark [3] (parallel performance measurement developed by NASA), High-Performance Linpack [26] (measuring a CPU's floating-point performance and used to build the Top-500 ranking), High-Performance Conjugate Gradients [27] (HPCG) as an alternative to HPL, and the HPCC suite [28].
From a software perspective, emerging environments such as Big Data, Cloud Computing and Deep Learning have created the need for new benchmarking tools that allow a representative evaluation of these computing fields. Some of the most representative examples of benchmark suites targeting these environments are MLPerf [29], CloudSuite [30], BigDataBench [31], YCSB [4] and HiBench [32].
General purpose hardware relies on heterogeneous benchmark suites such as PARSEC [7] or SPEC [6], in an attempt to be representative of a computing environment where applications cover a wide spectrum. Focused on the evaluation of commercial CPUs, the initial public release of BenchCast already includes workloads generated from applications from these two suites, as well as from the NAS Parallel Benchmark. However, BenchCast was designed to be a sort of metabenchmarking framework, like Google's PerfKit Benchmarker [33]. The rules for including new applications are simple and new benchmarks can be easily added.
Some performance evaluation processes are not suitable for the aforementioned benchmarks. This is the case of detailed architectural simulation [34], where the execution time required to run a complete application makes it unaffordable in practice. In those cases, many studies have focused on alternative solutions to reduce the computational cost required for evaluation. One of the most common techniques is known as simulation sampling [15], [16], [13], where the evaluation process is limited to only a small relevant fraction of each application. SimPoints [35] is a well-known sampling methodology that automatically identifies long, repetitive execution phases in benchmarks and limits simulations to a few instances of these phases. Similarly, [36] and [37] make use of statistical tools to evaluate the representativeness of a benchmark. This means limiting the execution of an application to a reduced number of instructions, able to maintain representativeness. With a similar objective, Craeynest and Eeckhout [9] analyze the problem of the limited validity of current practices in multi-core simulation. Velazques et al. [8] carried out a benchmark selection, which was as small as possible, also analyzing alternative sampling methods, and Singh and Awasthi [38] evaluated the accuracy of characterizing the SPEC CPU2017 benchmarks using the SimPoints methodology. Loop-dominant programs are targeted in [39]. For each loop found, the authors define a signature, creating a signature vector for each application. Using similarity scores between these vectors, they selected a reduced set of representative loops and created microbenchmarks emulating the original applications.
In a similar way, our work also looks for a subset of instructions able to resemble a whole application. However, since we target real hardware execution, we have much more flexibility to choose the sample size. This enabled the definition of a single sample per application, as well as its precise labeling for synchronization purposes. These two features enable an easy, uniform and statistically sound evaluation process for multicore architectures running heterogeneous workloads. Runtime sampling makes the synchronization process more difficult, and the large number of samples defined by simulation tools makes it nearly impossible to build BenchCast on top of existing simulation sampling techniques. Alternative application modifications, such as iteration or working-set reduction, were discarded. Reducing iterations is not always possible, as some applications present convergent algorithms without a predefined number of iterations. Similarly, reducing the working set to non-realistic inputs could reduce execution time, but it modifies the micro-architectural behavior.

CONCLUSION
In this work we presented a processor evaluation methodology suitable for both performance and microarchitectural analyses. Taking advantage of some basic execution features present in many applications, we identified, labeled and synchronized the execution of their ROIs. This process was standardized to include applications from different benchmarks, starting with the three (SPEC, PARSEC, NPB) already provided in the public release of the tool. The number of combinations (and therefore workloads) was large enough to provide statistically-sound results. Additionally, we demonstrated that a small fraction of the ROI is, in most cases, representative of the whole-program execution, which significantly reduces the computational effort required for evaluation. The experiments that previously required several days can now be finished within hours.
The accuracy of 20-second ROI execution was amply validated, demonstrating its suitability when statistical analysis is required. Hybrid workloads, where different applications run simultaneously on the same system, enabled the exploration of alternative performance metrics such as fairness. Finally, the utilization of hardware events for evaluation enabled the exploration of multiple microarchitectural parameters (as many as were available in the PMU of the system under evaluation).
We defined and presented three simple experiments that demonstrate the flexibility of BenchCast. We carried out a deep performance comparison of two commercial processors, providing more accurate results than existing methodologies and establishing the architectural implications on performance. We also extended the evaluation process to configurable hardware features, such as SMT or prefetching. We encourage readers to adapt the tools to the huge number of possibilities provided. All the code generated for this work is open access, with the intention of facilitating its utilization by the research community.

Pablo Prieto received the BS, MS, and PhD degrees from the University of Cantabria, Spain, in 2005 and 2014, respectively. He is currently an associate professor of computer architecture with the University of Cantabria, Spain. His research interests include cache hierarchies and memory controller design.
Pablo Abad received the BS, MS, and PhD degrees from the University of Cantabria, Spain, in 2003 and 2010, respectively. He is currently an associate professor of computer architecture with the Department of Computers and Electronics, University of Cantabria, Spain. His research interests include performance evaluation of chip multiprocessors.
Jose-Angel Gregorio received the BS, MS, and PhD degrees in physics (electronics) from the University of Cantabria, Spain, in 1978 and 1983 respectively. He is currently a professor of computer architecture with the Department of Computers and Electronics, University of Cantabria, Spain. His main research interests include chip multiprocessors (CMPs) with special emphasis on the memory subsystem, interconnection network and coherence protocol of these systems.
Valentin Puente received the BS, MS, and PhD degrees from the University of Cantabria, Spain, in 1995 and 2000, respectively. He is currently a professor of computer architecture with the Department of Computers and Electronics, University of Cantabria, Spain. His research interests include memory hierarchy design and the impact that upcoming technology changes might have on it.