On a Catalogue of Metrics for Evaluating Commercial Cloud Services
Zheng Li
School of CS
NICTA and ANU
Canberra, Australia
Zheng.au
Liam O’Brien
CSIRO eResearch
CSIRO and ANU
Canberra, Australia
Liam.OBrien@csiro.au
He Zhang
School of CSE
NICTA and UNSW
Sydney, Australia
He.au
Rainbow Cai
School of CS
NICTA and ANU
Canberra, Australia
Rainbow.au
Abstract— Given the continually increasing number of commercial Cloud services in the market, the evaluation of different services plays a significant role in cost-benefit analysis or decision making for choosing Cloud Computing. In particular, employing suitable metrics is essential in evaluation implementations. However, to the best of our knowledge, there is not yet any systematic discussion about metrics for evaluating Cloud services. By using the method of Systematic Literature Review (SLR), we have collected the de facto metrics adopted in the existing Cloud services evaluation work. The collected metrics were arranged following the different Cloud service features to be evaluated, which essentially constructed an evaluation metrics catalogue, as shown in this paper. This metrics catalogue can be used to facilitate future practice and research in the area of Cloud services evaluation. Moreover, considering that metrics selection is a prerequisite of benchmark selection in evaluation implementations, this work also supplements the existing research in benchmarking commercial Cloud services.
Keywords— Cloud Computing; Commercial Cloud Service; Cloud Services Evaluation; Evaluation Metrics; Catalogue
I. INTRODUCTION
Cloud Computing, as one of the most promising computing paradigms [1], has become increasingly accepted in industry. Correspondingly, more and more commercial Cloud services offered by an increasing number of providers are available in the market [2, 5]. Considering that customers have little knowledge and control over the precise nature of commercial Cloud services even in the "locked down" environment [3], evaluation of those services would be crucial for many purposes ranging from cost-benefit analysis for Cloud Computing adoption to decision making for Cloud provider selection.
When evaluating Cloud services, a set of suitable measurement criteria or metrics must be chosen. In fact, according to the rich research in the evaluation of traditional computer systems, the selection of metrics plays an essential role in evaluation implementations [32]. However, compared to the large amount of research effort into benchmarks for the Cloud [3, 4, 16, 21, 34, 45], to the best of our knowledge, there is not yet any systematic discussion about metrics for evaluating Cloud services. Considering that metrics selection is one of the prerequisites of benchmark selection [31], we proposed to perform a comprehensive investigation into evaluation metrics in the Cloud Computing domain.
Unfortunately, in contrast with traditional computing systems, the Cloud nowadays is still chaotic [56]. The most outstanding issue is the lack of consensus on a standard definition of Cloud Computing, which inevitably leads to market hype as well as skepticism and confusion [28]. As a result, it is hard to delimit the range of Cloud Computing or specify a full scope of metrics for evaluating different commercial Cloud services. Therefore, we decided to unfold the investigation in a regression manner. In other words, we tried to isolate the de facto evaluation metrics from the existing evaluation work to help understand the state of the practice of the metrics used in Cloud services evaluation. When it comes to exploring the existing evaluation practices of Cloud services, we employed three constraints:
• This study focused on the evaluation of only commercial Cloud services, rather than that of private or academic Cloud services, to make our effort closer to industry's needs.
• This study concerned Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) without considering Software as a Service (SaaS). Since SaaS with special functionalities is not used to further build individual business applications [21], the evaluation of various SaaS instances could require infinite and exclusive metrics that would be out of the scope of this investigation.
• This study only explored empirical evaluation practices in academic publications. There is no doubt that informal descriptions of Cloud services evaluation in blogs and technical websites can also provide highly relevant information. However, on the one hand, it is impossible to explore and collect useful data from different study sources all at once. On the other hand, the published evaluation reports can be viewed as typical and peer-reviewed representatives of the existing ad hoc evaluation practices.
Considering that the Systematic Literature Review (SLR) has been widely accepted as a standard and rigorous approach to evidence collection for investigating specific research questions [26, 27], we adopted the SLR method to identify, assess and synthesize the published primary studies
of Cloud services evaluation. Due to the limit of space, the detailed SLR process is not elaborated in this paper (see Footnote 1). Overall, we have identified 46 relevant primary studies covering six commercial Cloud providers, namely Amazon, GoGrid, Google, IBM, Microsoft, and Rackspace, from a set of popular digital publication databases (all the identified primary studies have been listed online for reference: /groups/1104801/slr4cloud/papers/). More than 500 evaluation metrics, including duplications, were finally extracted from the identified Cloud services evaluation studies.
This paper reports our investigation result. After removing duplications and differentiating metric types, the evaluation metrics were arranged according to different Cloud service features covering the following aspects: Performance, Economics, and Security. The arranged result essentially constructed a catalogue of metrics for evaluating commercial Cloud services. In turn, we can use this metrics catalogue to facilitate Cloud services evaluation work, such as quickly looking up suitable evaluation metrics, identifying current research gaps and future research opportunities, and developing sophisticated metrics based on the existing metrics.
The remainder of the paper is organized as follows. Section II arranges all the identified evaluation metrics under different Cloud service features. Section III introduces three scenarios of applying this metrics catalogue. Conclusions and some future work are discussed in Section IV.
II. THE METRICS FOR CLOUD SERVICES EVALUATION
It is clear that the choice of appropriate metrics depends on the service features to be evaluated [31]. Therefore, we naturally organized the identified evaluation metrics according to their corresponding Cloud service features. In detail, the evaluated features in the reviewed primary studies can be found scattered over three aspects of Cloud services (namely Performance, Economics [35], and Security) and their properties. Thus, we use the following three subsections to respectively introduce those identified metrics.
A. Performance Evaluation Metrics
In practice, an evaluated performance feature is usually represented by a combination of a physical property of Cloud services and its capacity, for example Communication Latency or Storage Reliability. Therefore, we divide a performance feature into two parts: a Physical Property part and a Capacity part. Thus, all the elements of performance features identified from the aforementioned primary studies can be summarized as shown in Figure 1. The detailed explanations and descriptions of the different performance feature elements have been clarified in our previous taxonomy work [57]. In particular, Scalability and Variability are also regarded as two elements in the Capacity part, while being further distinguished from the other capacities because they are inevitably reflected by changes in the indices of the normal performance features.
Footnote 1: The SLR report can be found online: /open?id=0B9KzcoAAmi43LV9IaEgtNnVUenVXSy1FWTJKSzRsdw
Naturally, here we display the performance evaluation metrics mainly following the sequence of the performance elements. In addition, the evaluation metrics for the overall performance of Cloud services are listed in their own subsection, and the metrics for evaluating Scalability and Variability are likewise treated separately.
Figure 1. Performance features of Cloud services for evaluation.
1) Communication Evaluation Metrics (cf. Table I): Communication refers to the data/message transfer between internal service instances (or different Cloud services), or between external clients and the Cloud. In particular, given the separate discussions about IP-level and MPI-message-level networking among public Clouds [e.g. 8], we also distinguished evaluation metrics between TCP/UDP/IP and MPI communications.
Brief descriptions of particular metrics in Table I:
• Packet Loss Frequency vs. Probe Loss Rate: Here we directly copied the names of the two metrics from [43]. Packet Loss Frequency is defined as the ratio of loss_time_slot to total_time_slot, and Probe Loss Rate is defined as the ratio of lost_probes to total_probes. Considering that the concept of Availability is driven by the time lost while Reliability is driven by the number of failures [10], we can see that the former metric is for Communication Availability evaluation while the latter is for Communication Reliability.
• Correlation between Total Runtime and Communication Time: This metric observes, for a set of applications, their total runtime and the amount of time they spend communicating in the Cloud. The trend of the correlation can be used to qualitatively discuss the influence of Communication on the applications running in the Cloud. A small numeric sketch of these Communication metrics follows this list.
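The following minimal sketch (in Python) illustrates how the three Communication metrics above could be computed from raw measurements. All sample values are invented for illustration; only the counter names loss_time_slot/total_time_slot and lost_probes/total_probes come from the definitions quoted from [43].

    from statistics import correlation  # requires Python 3.10+

    # Communication Availability: fraction of time slots in which losses occurred.
    loss_time_slot, total_time_slot = 12, 1000
    packet_loss_frequency = loss_time_slot / total_time_slot      # 0.012

    # Communication Reliability: fraction of probes that were lost.
    lost_probes, total_probes = 37, 50000
    probe_loss_rate = lost_probes / total_probes                  # 0.00074

    # Latency influence: Pearson correlation between each application's total
    # runtime and the time it spent communicating (invented sample data, seconds).
    total_runtime      = [120.0, 340.0, 95.0, 610.0, 210.0]
    communication_time = [ 15.0, 160.0,  8.0, 420.0,  70.0]
    runtime_comm_corr = correlation(total_runtime, communication_time)

    print(packet_loss_frequency, probe_loss_rate, round(runtime_comm_corr, 3))

A strongly positive correlation would suggest that Communication dominates the runtime of the observed applications.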
TABLE I. COMMUNICATION EVALUATION METRICS

Capacity          | Metrics                                                     | Benchmark
Transaction Speed | Max Number of Transfer Sessions                             | SPECweb 2005 [22]
Availability      | Packet Loss Frequency                                       | Badabing Tool [43]
Latency           | Correlation between Total Runtime and Communication Time    | Application Suite [30]
Latency           | TCP/UDP/IP Transfer Delay (s, ms)                           | CARE [45]; Ping [5]; Send 1 byte data [20]; Latency Sensitive Website [5]; Badabing Tool [43]
Latency           | MPI Transfer Delay (s, μs)                                  | HPCC: b_eff [42]; Intel MPI Bench [18]; mpptest [8]; OMB-3.1 with MPI [44]
Reliability       | Connection Error Rate                                       | CARE [45]
Reliability       | Probe Loss Rate                                             | Badabing Tool [43]
Data Throughput   | TCP/UDP/IP Transfer bit/Byte Speed (bps, Mbps, MB/s, GB/s)  | iperf [5]; Private tools TCPTest/UDPTest [43]; SPECweb 2005 [22]; Upload/Download/Send large size data [23]
Data Throughput   | MPI Transfer bit/Byte Speed (bps, MB/s, GB/s)               | HPCC: b_eff [42]; Intel MPI Bench [18]; mpptest [8]; OMB-3.1 with MPI [44]
2) Computation Evaluation Metrics (cf. Table II): Computation refers to the computing-intensive data/job processing in the Cloud. Note that, although coarse-grain Cloud-hosted applications are generally used to evaluate the overall performance of Cloud services (see Subsection 5)), CPU-intensive applications have been particularly adopted for the specific Computation evaluation.
Brief descriptions of particular metrics in Table II:
• Benchmark Efficiency vs. Instance Efficiency: The two metrics both measure the real individual-instance Computation performance as a percentage of a baseline threshold. In Benchmark Efficiency, the baseline threshold is the theoretical peak of the benchmark result, while in Instance Efficiency it is the theoretical CPU peak.
• ECU Ratio: This metric uses the Elastic Compute Unit (ECU) instead of traditional FLOPS to measure Computation performance. An ECU is defined as the CPU power of a 1.0-1.2 GHz 2007 Opteron or Xeon processor [42].
• CPU Load: This metric is usually used together with other performance evaluation metrics to judge bottleneck features. For example, a low CPU load with the maximum number of communication sessions indicates that data transfer on an EC2 c1.xlarge instance is the bottleneck for a particular workload [22]. A brief sketch of the efficiency-style metrics follows this list.
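As a rough illustration of the efficiency-style Computation metrics, the sketch below computes Benchmark Efficiency, Instance Efficiency, and the ECU Ratio from a single HPL-style measurement; the measured value, the two peak values, and the advertised ECU count are invented assumptions, not figures from the reviewed studies.

    measured_gflops        = 41.8   # HPL result measured on one service instance
    theoretical_peak_bench = 52.0   # theoretical peak of the benchmark (Gflops)
    theoretical_peak_cpu   = 68.0   # theoretical CPU peak of the instance (Gflops)
    advertised_ecus        = 8      # Elastic Compute Units claimed by the provider

    benchmark_efficiency = measured_gflops / theoretical_peak_bench * 100  # % of benchmark peak
    instance_efficiency  = measured_gflops / theoretical_peak_cpu * 100    # % of CPU peak
    ecu_ratio            = measured_gflops / advertised_ecus               # Gflops per ECU

    print(f"{benchmark_efficiency:.1f}%  {instance_efficiency:.1f}%  {ecu_ratio:.2f} Gflops/ECU")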
TABLE II. COMPUTATION EVALUATION METRICS

Capacity          | Metrics                                    | Benchmark
Transaction Speed | Benchmark Efficiency (% Benchmark Peak)    | HPL [42]
Transaction Speed | ECU Ratio (Gflops/ECU)                     | HPL [42]
Transaction Speed | Instance Efficiency (% CPU peak)           | HPL [17]
Transaction Speed | Benchmark OP (FLOP) Rate (Gflops, Tflops)  | DGEMM [30]; FFTE [30]; HPL [30]; LMbench [42]; NPB: EP [4]; Whetstone [39]
Latency           | Benchmark Runtime (hr, min, s, ms)         | Private benchmark/application [6]; Compiling Linux Kernel [46]; Fibonacci [12]; DGEMM [17]; HPL [17]; NPB [41]
Other             | CPU Load (%)                               | SPECweb 2005 [22]
Other             | Ubench CPU Score                           | Ubench [47]
3) Memory (Cache) Evaluation Metrics (cf. Table III): Memory (Cache) is intended for fast access to temporarily saved data that would otherwise have to be fetched from slower-access hard drive storage. Since it can be hard to exactly distinguish the effect on performance brought by memory/cache, there are fewer evaluation practices and metrics for memory/cache than for the other physical properties. However, in addition to the normal capacity evaluation, there are some interesting metrics for verifying the memory hierarchies in Cloud services, as elaborated below.
TABLE III. MEMORY (CACHE) EVALUATION METRICS

Capacity          | Metrics                                       | Benchmark
Transaction Speed | Random Memory Update Rate (MUP/s, GUP/s)      | HPCC: RandomAccess [30]
Latency           | Mean Hit Time (s)                             | Land Elevation Change App [13]
Latency           | Memcache Get/Put/Response Time (ms)           | Operate 1 Byte / 1 MB data [12]
Data Throughput   | Memory bit/Byte Speed (MB/s, GB/s)            | CacheBench [42]; HPCC: PTRANS [30]; HPCC: STREAM [42]
Memory Hierarchy  | Intra-node Scaling                            | DGEMM [17]; HPL [17]
Memory Hierarchy  | Sharp Performance Drop (increasing workload)  | Bonnie [42]; CacheBench [42]
Other             | Ubench Memory Score                           | Ubench [47]
Brief descriptions of particular metrics in Table III:
• Intra-node Scaling: This metric is relatively complex. It is used to judge the position of cache contention by employing Scalability evaluation metrics (see Subsection 6)). To observe the scaling capacity of a service instance, the benchmark is executed repeatedly while varying the workload and the number of used CPU cores [17].
• Sharp Performance Drop: This metric is used to find the cache boundaries of the memory hierarchy in a particular service instance. In detail, when repeatedly executing the benchmark with gradually increasing workload, the major performance drop-offs can roughly indicate the memory hierarchy sizes [42]. A small detection sketch follows this list.
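To make the Sharp Performance Drop idea concrete, the minimal sketch below scans an invented CacheBench-style sweep of throughput against doubling working-set sizes and flags any drop beyond an assumed 30% threshold as a likely cache boundary; both the numbers and the threshold are illustrative assumptions.

    working_set_kb = [16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
    throughput_mbs = [9100, 9050, 8980, 5200, 5150, 5100, 5050, 1900, 1850, 1800]

    DROP_THRESHOLD = 0.30  # assumed: flag a boundary when throughput falls by >30%

    for size, prev, curr in zip(working_set_kb[1:], throughput_mbs, throughput_mbs[1:]):
        if (prev - curr) / prev > DROP_THRESHOLD:
            print(f"sharp drop near a {size} KB working set "
                  f"({prev} -> {curr} MB/s): likely cache boundary")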
4) Storage Evaluation Metrics (cf. Table IV): Storage of Cloud services is used to permanently store users' data, until the data are removed or the services are suspended intentionally. Compared to accessing Memory (Cache), accessing data permanently stored in Cloud services usually takes a longer time.
TABLE IV. STORAGE EVALUATION METRICS

Capacity          | Metrics                                        | Benchmark
Transaction Speed | One Byte Data Access Rate (bytes/s)            | Download 1 byte data [38]
Transaction Speed | Benchmark I/O Operation Speed (ops)            | Bonnie/Bonnie++ [42]
Transaction Speed | Blob/Table/Queue I/O Operation Speed (ops)     | Operate Blob/Table/Queue Data [5]
Transaction Speed | Performance Rate between Blob & Table          | Operate Blob & Table Data [20]
Availability      | Histogram of GET Throughput (in chart)         | Get data of 1 Byte/100 MB [9]
Latency           | Benchmark I/O Delay (min, s, ms)               | BitTorrent [38]; Private benchmark/application [6]; NPB: BT [4]
Latency           | Blob/Table/Queue I/O Operation Time (s, ms)    | Operate Blob/Table/Queue Data [5]
Latency           | Page Generation Time (s)                       | TPC-W [5]
Reliability       | I/O Access Retried Rate                        | Download Data [38]; HTTP Get/Put [25]
Data Throughput   | Benchmark I/O bit/Byte Speed (KB/s, MB/s)      | Bonnie/Bonnie++ [42]; IOR in POSIX [44]; PostMark [7]; NPB: BT-IO [44]
Data Throughput   | Blob I/O bit/Byte Speed (Mbps, Bytes/s, MB/s)  | Operate Blob Data [38]
Brief descriptions of particular metrics in Table IV:
• One Byte Data Access Rate: Although the unit here seems suited to Data Throughput evaluation, this metric has been particularly used for measuring Storage Transaction Speed. In contrast with accessing large-size files, the performance of accessing very small-size data can be dominated by the transaction overhead of the storage services [38].
• Blob/Table/Queue I/O Operation metrics: Although not all of the public Cloud providers specify the definitions, the Storage services can be categorized into three types of offers: Blob, Table and Queue [5]. In particular, the typical Blob I/O operations are Download and Upload; the typical Table I/O operations are Get, Put and Query; and the typical Queue I/O operations are Insert, Retrieve, and Remove.
• Histogram of GET Throughput (in chart): Unlike the other traditional metrics, this metric is represented as a chart instead of a quantitative number. In this case, the histogram vividly illustrates the change of GET Throughput during a particular period of time, which intuitively reflects the Availability of a Cloud service. Therefore, the histogram chart here is also regarded as a special metric, as are the other charts and tables in Subsections 6) and 7). A small sketch of the first and last of these metrics follows this list.
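The sketch below, built entirely on invented timing and throughput samples, shows how the One Byte Data Access Rate and a text-mode Histogram of GET Throughput could be derived from raw probes; the 10 MB/s bucket width is likewise an assumption for illustration.

    from collections import Counter

    # One Byte Data Access Rate: per-transaction overhead dominates for 1-byte GETs.
    one_byte_get_times_s = [0.21, 0.19, 0.25, 0.22, 0.20]     # seconds per 1-byte download
    access_rate_bytes_per_s = len(one_byte_get_times_s) / sum(one_byte_get_times_s)

    # Histogram of GET throughput: bucket periodic throughput probes (MB/s) so the
    # distribution, rather than a single average, reflects Availability.
    get_throughput_mbs = [88, 91, 12, 90, 87, 0, 89, 92, 85, 90]
    buckets = Counter((sample // 10) * 10 for sample in get_throughput_mbs)
    for lower in sorted(buckets):
        print(f"{lower:3d}-{lower + 9:3d} MB/s: {'#' * buckets[lower]}")
    print(f"one-byte access rate = {access_rate_bytes_per_s:.2f} bytes/s")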
5) Overall Performance Evaluation Metrics (cf. Table V): In addition to the performance evaluations of specific physical properties, there are also a large number of evaluations of the overall performance of commercial Cloud services. We treat a metric as an overall performance evaluation metric as long as it was intentionally used for measuring the overall performance of Cloud services in the primary study.
Brief descriptions of particular metrics in Table V:
• Relative Performance over a Baseline (rate): This metric is usually used to standardize a set of performance evaluation results, which can further facilitate the comparison between those evaluation results. Note the difference between this metric and the metric Performance Speedup over a Baseline. The latter is a typical Scalability evaluation metric, as explained in Subsection 6).
• Sustained System Performance (SSP): This metric uses a set of applications to give an aggregate measure of the performance of a Cloud service [30]. In fact, two other metrics are involved in its calculation: the Geometric Mean of the individual applications' Performance per CPU Core results is multiplied by the number of computational cores.
• Average Weighted Response Time (AWRT): By using the resource consumption of each request as a weight, this metric gives a measure of how long, on average, users have to wait to accomplish their required work [33]. The resource consumption of each request is estimated by multiplying the request's execution time by the required number of Cloud service instances. A small sketch of computing SSP and AWRT follows this list.
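Following the verbal definitions of SSP [30] and AWRT [33] given above, the sketch below computes both aggregates from invented measurements; the per-core results, the core count, and the request records are assumptions for illustration only.

    from math import prod

    # Sustained System Performance (SSP): geometric mean of each application's
    # Performance per CPU Core, multiplied by the number of computational cores.
    per_core_gflops = [1.8, 2.4, 0.9, 3.1]      # one result per application in the suite
    total_cores = 64
    geometric_mean = prod(per_core_gflops) ** (1 / len(per_core_gflops))
    ssp = geometric_mean * total_cores           # aggregate Gflops

    # Average Weighted Response Time (AWRT): response times weighted by each
    # request's resource consumption (execution time x number of instances used).
    requests = [                 # (response_time_s, execution_time_s, instances)
        (400, 300, 4),
        (150, 120, 1),
        (900, 600, 8),
    ]
    weights = [exec_t * n for _, exec_t, n in requests]
    awrt = sum(w * resp for (resp, _, _), w in zip(requests, weights)) / sum(weights)

    print(f"SSP = {ssp:.1f} Gflops, AWRT = {awrt:.1f} s")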
6) Scalability Evaluation Metrics (cf. Table VI): Scalability has been variously defined within different contexts or from different perspectives [20]. However, no matter which definition is applied, the evaluation of Cloud services' Scalability inevitably requires varying the workload and/or the Cloud resources. Since the variations are usually represented in charts and tables, we treat the corresponding charts and tables as special metrics as well. In fact, unlike evaluating other performance properties, the evaluation of Scalability (and also Variability) normally implies a comparison among a set of data that can be conveniently organized in charts and tables.
TABLE V. OVERALL PERFORMANCE EVALUATION METRICS

Capacity          | Metrics                                           | Benchmark
Transaction Speed | Benchmark OP (FLOP) Rate (Mflops, Gflops, Mops)   | HPL [4]; GASOLINE [48]; NPB [4]
Transaction Speed | Benchmark Transactional Job Rate                  | BLAST [52]; Sysbench on MySQL [3]; TPC-W [29]; WSTest [49]
Transaction Speed | Geometric Mean of Serial NPB Results (Mop/s)      | NPB [44]
Transaction Speed | Relative Performance over a Baseline (rate)       | MODIS Processing [15]; NPB [4]
Transaction Speed | Sustained System Performance (SSP)                | Application Suite [30]
Transaction Speed | Performance per Client                            | TPC-E [20]
Transaction Speed | Performance per CPU Cycle (Mops/GHz)              | NPB [4]
Transaction Speed | Performance per CPU Core (Gflops/core)            | Application Suite [30]
Availability      | Histogram of Average Transaction Time             | TPC-E [20]
Latency           | Benchmark Delay (hr, min, s, ms)                  | Broadband/Epigenome/Montage [24]; CSFV [8]; FEFF84 MPI [48]; MapReduce App [47]; MCB Hadoop [50]; MG-RAST+BLAST [37]; MODIS Processing [15]; NPB-OMP/MPI [51]; WCD [23]; WSTest [49]
Latency           | Benchmark Transactional Job Delay (min, s)        | BLAST [5]; C-Meter [16]; MODIS Processing [15]; SAGA BigJob Sys [40]; TPC-E [20]; TPC-W [53]
Latency           | Relative Runtime over a Baseline (rate)           | Application Suite [30]; SPECjvm2008 [5]
Latency           | Average Weighted Response Time (AWRT)             | Lublin99 [33]
Reliability       | Error Rate of DB R/W                              | CARE [45]
Data Throughput   | DB Processing Throughput (byte/sec)               | CARE [45]
Data Throughput   | BLAST Processing Rate (Mbp/instance/day)          | MG-RAST + BLAST [37]
Brief descriptions of particular metrics in Table VI:
• Aggregate Performance & Performance Degradation/Slowdown over a Baseline: The two metrics are often used to reflect the Scalability of a Cloud service (or feature) when the service (or feature) is requested with increasing workload. Therefore, the Scalability evaluation here is from the perspective of workload.
• Performance Speedup over a Baseline: This metric is often used to reflect the Scalability of a Cloud service (or feature) when the service (or feature) is requested with different amounts or capabilities of Cloud resources. Therefore, the Scalability evaluation here is from the perspective of Cloud resources.
• Performance Degradation/Slowdown over a Baseline: Interestingly, this metric can be intuitively regarded as the opposite of the above metric Performance Speedup over a Baseline. However, it is more meaningful to use this metric to reflect the Scalability of a Cloud service (or feature) when the service (or feature) is requested to deal with different amounts of workload. Therefore, the Scalability evaluation here is from the perspective of workload.
• Parallelization Efficiency E(n): Interestingly, this metric can be viewed as a "reciprocal" of the normal Performance Speedup metric. T(n) is defined as the time taken to run a job with n service instances, and then E(n) can be calculated through T(1)/T(n)/n. A small numeric sketch of these Scalability metrics follows this list.
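A small numeric sketch of the Scalability metrics above, using the definition E(n) = T(1)/T(n)/n quoted from [23]; all runtimes are invented.

    runtimes_s = {1: 1200.0, 2: 640.0, 4: 350.0, 8: 210.0}   # T(n) for n service instances

    for n, t_n in runtimes_s.items():
        speedup = runtimes_s[1] / t_n      # Performance Speedup over the n=1 baseline
        efficiency = speedup / n           # Parallelization Efficiency E(n)
        print(f"n={n}: speedup={speedup:.2f}, E(n)={efficiency:.2f}")

    # Performance Degradation/Slowdown over a Baseline reads the ratio the other way
    # round, e.g. runtime under a heavier workload divided by the baseline runtime.
    baseline_runtime_s, heavy_workload_runtime_s = 1200.0, 4100.0
    print(f"slowdown: {heavy_workload_runtime_s / baseline_runtime_s:.2f}x")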
TABLE VI. SCALABILITY EVALUATION METRICS

Sample | Metrics
[22]   | Aggregate Performance
[13]   | Performance Speedup over a Baseline
[20]   | Performance Degradation/Slowdown over a Baseline
[23]   | Parallelization Efficiency E(n) = T(1)/T(n)/n
[48]   | Representation in Single Chart (Column, Line, Scatter)
[47]   | Representation in Separate Charts
[42]   | Representation in Table
7) Variability Evaluation Metrics (cf. Table VII): In the context of Cloud services evaluation, Variability indicates the extent of fluctuation in the values of an individual performance property of a commercial Cloud service. The variation of evaluation results can be caused by the performance differences of Cloud services at different times and/or different locations. Moreover, even at the same location and time, variation may still exist within a cluster of service instances. Note that, similar to the Scalability evaluation, the relevant charts and tables are also regarded as special metrics.
Brief descriptions of particular metrics in Table VII:
• Average, Minimum, and Maximum Value together: Although the three indicators in this metric cannot be individually used for Variability evaluation, they can still reflect the variation of a Cloud service (or feature) when placed together.
• Coefficient of Variation (COV): COV is defined as the ratio of the standard deviation (STD) to the mean of the evaluation results. Therefore, this metric has also been directly represented as the STD/Mean Rate [5].
• Cumulative Distribution Function vs. Probability Density Function: Both metrics distribute the probabilities of different evaluation results to reflect the variation of a Cloud service (or feature). In the