RobuSTore: Robust Performance for Distributed Storage Systems
Huaxia Xia and Andrew A. Chien
Computer Science and Center for Networked Systems, University of California, San Diego

Abstract

Emerging large-scale scientific applications involve massive, distributed data collections (petabytes) and require robust, high performance for read-dominated workloads. Achieving robust performance (low variability) in storage systems is difficult. We propose RobuSTore, a novel storage technique which combines erasure codes and speculative access to reduce performance variability and increase performance. RobuSTore uses erasure codes to add flexible redundancy, then spreads the encoded data across a large number of disks. Speculative access to the redundant data enables application requests to be satisfied with only early-arriving blocks, reducing performance dependence on the behavior of individual disks. We present the design and an evaluation of RobuSTore which shows improved robustness, reducing the standard deviation of access latencies by as much as 5-fold. In addition, RobuSTore improves access bandwidth by as much as 15-fold (vs. traditional RAID). For example, a 1 GB read from 64 disks has an average latency of 2 seconds with a standard deviation of only 0.5 seconds, or 25%. RobuSTore secures these benefits at the cost of a 2-3x storage capacity overhead and ~1.5x network and disk I/O overhead.

1 Introduction

Existing and emerging large-scale scientific applications and data-intensive applications
require dramatically higher levels of performance from distributed storage systems. These applications are far from isolated anomalies; they are emerging in virtually every area of science, engineering, and commerce, including biology [1], high-energy physics [2], geology [3], and astronomy [4]. For example, one effort in the OptIPuter project [5] is to construct custom indices across 10 GB 3-D images in a collection of 1,000,000 images (10 petabytes of data) and to enable interactive, real-time data exploration for collaborative visualization on a large-scale tiled display such as LambdaVision [6] (55 panels, 100M pixels, 300 Gbps network bandwidth).

Typical properties of large-scale scientific applications include: (1) massive data collections, with objects as large as 10 GB and collections larger than tens of petabytes; (2) distributed data sources and owners; (3) large, read-dominated workloads (100s of megabytes to 10s of gigabytes per read); (4) thousands of users; and (5) a need for robust, high performance. High performance is a requirement derived from the large quantity of data, and consistent access latency is important for both user interaction and resource scheduling. Throughout, we use the term robust to mean low variability in access latency, that is, consistent performance. Robust performance is a major goal of RobuSTore.

Meeting the performance requirements of these emerging large-scale scientific applications requires the aggregation of large numbers of disks: thousands of disks are needed to meet application capacity and bandwidth requirements. Meeting the community needs for distributed data collection sharing dictates that the data collections will be stored on distributed sets of disks. Only a few years ago, intensive sharing of large data across wide-area networks would have been infeasible, but the development of low-cost optical transmission and Dense Wavelength Division Multiplexing (DWDM) techniques now provides wide-area networks with private 10 Gbps (even 40 Gbps) connections [7]. Indeed, individual fibers have the potential to carry 100s of 10 Gbps "lambdas", providing terabits of bandwidth in each fiber bundle. Numerous research infrastructures with private, high-speed "lambdas" have been deployed (OptIPuter [5], National LambdaRail [8], NetherLight [9], and the Global Lambda Integrated Facility (GLIF) [10]).
While aggregation of distributed storage performance is not a new problem [11-14], existing parallel and distributed file systems do not address it. Distributed file systems have historically presumed slow (and expensive) wide-area networks (NFS [11], AFS [12]), and thereby focused on minimizing access latency for small data, which can be optimized via techniques like caching and prefetching. In contrast, parallel disk arrays (e.g. RAID [15]) and parallel file systems [16-20] focus on aggregating disk performance; they presume not only near-perfect homogeneity, but also private access, with no intensive sharing of the storage as is typical in most distributed systems. When the storage is shared, the performance disruption of even a few disks can destroy overall performance. Thus, such parallel file systems do not work well in wide-area environments with high variability and heterogeneity.

We propose a novel storage architecture and a basic design for RobuSTore which enables robust, high-performance access to distributed storage collections. RobuSTore uses erasure codes to add continuous redundancy for striping, exploiting large decreases in the per-byte cost of storage [21]. With such layouts, clients can use speculative parallel access and decoding of the fast-returning blocks both to increase performance and to reduce performance dependence on stragglers (lower variability). As a result, RobuSTore can efficiently aggregate large numbers of distributed storage devices to deliver robust, high access performance.

We evaluate the RobuSTore idea via detailed simulation, exploring a wide range of possible system parameters. These studies show when and by how much the RobuSTore idea increases absolute performance, reduces performance variability, and increases storage overhead, disk accesses, and network bandwidth. Our simulation results demonstrate the promise of the RobuSTore idea. Specific contributions of the paper include:
• exposition of the RobuSTore idea, which combines erasure codes and speculative parallel disk access to reduce variability and improve read performance;
• description of a system framework which realizes the RobuSTore idea, enabling systematic exploration of the design space;
• evaluation of RobuSTore robustness for a variety of system configurations, characterizing the achievable reduction in access latency variability (as large as 5-fold);
• evaluation of RobuSTore performance for a wide range of system configurations, characterizing read bandwidth increases as large as 15-fold;
• study of RobuSTore overhead, showing that to secure these benefits, only moderate storage capacity overhead (~2-3x) and network bandwidth and disk I/O overhead (~1.5x) are required;
• finally, study of a real OptIPuter configuration, showing that RobuSTore could improve application performance by 16 times (1.07 GBps) and reduce access variability to 0.1 second (30 times smaller than for RAID-S).
RobuSTore is part of the OptIPuter Project [5], which explores configurable optical networks with 100s of Gbps to support large-scale scientific applications. Network issues, such as lambda configuration, protocols, etc., are addressed by other parts of the project: DVC [22], LambdaRouter [23], CEP [24], and GTP [25].

The remainder of the paper is organized as follows. In Section 2, we describe the problem and present the RobuSTore approach. We describe the RobuSTore design in Section 3, and our simulation-based evaluation in Section 4. Section 5 surveys related work, and Section 6 summarizes results, describing directions for future work.
2 Robust, High Performance Distributed Storage

2.1 The Problem
Numerous emerging, large-scale scientific applications involve massive distributed data collections and demand distributed storage systems that deliver robust, high performance for their read-mostly workloads. The distributed nature of the scientific communities using these applications implies that their needs will be met by distributed collections of disk arrays. Supporting many users at these high performance levels requires robust and efficient aggregation of disk performance.

For the distributed storage in such systems, disk performance varies greatly due to the federated, evolving nature of such infrastructures. Sources of performance variation or heterogeneity include: heterogeneous disk types (deployed over time at each site), irregular layout (due to unique disk history and local/distributed use interaction), and differences in loading (varied competitive access). As disk technology has improved, read bandwidth has increased, so heterogeneous disk collections often include individual disks with 20-fold read bandwidth variation [21]. In modern storage devices, disk performance is sensitive to disk layout [26, 27], as physical contiguity, seek distance, and rotational latency can vary read bandwidth by 100-fold. Finally, sharing causes access streams to be interleaved; the additional seeks this incurs can also cause as much as 100-fold variance in performance.

Traditional parallel filesystems use data striping and replication to aggregate disk performance. However, they only achieve good performance with nearly homogeneous collections of disks, and often only when layout is tightly controlled [18]. Further, the best performance is often achieved only for single users. Because parallel filesystems deliver performance through careful layout scheduling and control, a key limitation is that they do not perform well in the face of uncontrolled performance variation, even if replication is used. For example, if access to a single data block or disk is slow, an entire large request can be delayed. For parallel systems with replicated data (see Figure 1), late-arriving disk responses can still delay completion of the larger request.
Figure 1. Aggregating Disks with Data Striping: 8 blocks, 2 replicas. Disk performance is varied.
To meet the needs of emerging large-scale scientific applications, we need a new set of technologies for aggregating the performance of disk collections. These technologies must:
- tolerate high variation in the performance of individual disks,
- achieve robust access latency (low variance) for large requests, and
- deliver high performance.
In subsequent sections, we describe the RobuSTore approach, which meets these goals.
2.2 RobuSTore: Erasure Codes and Speculative Access
The key idea in RobuSTore is to add redundancy to stored data and then exploit it to statistically aggregate the disk access latencies, producing earlier completion and lower variance in access latency for large requests. Erasure codes allow the systematic introduction of redundancy, so many combinations of returning blocks can be used to assemble the desired data. This redundancy reduces the dependence of a large request on the performance of any individual disk request. Further, we employ speculative access, requesting a redundant set of blocks and increasing the chances that a set of blocks which makes decoding possible (and hence completion of the large request) will be received quickly. The decoding freedom provided by erasure codes is depicted in Figure 2. RobuSTore only requires the first set of returned blocks to decode the data, and need not wait for later blocks; a sketch of this access strategy appears below.
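The following Python fragment is our own minimal sketch of speculative access, not the paper's implementation; srv.fetch_block, decoder.add_block, decoder.can_decode, and decoder.decode are hypothetical interfaces, and Future.cancel only stops requests that have not yet started.

from concurrent.futures import ThreadPoolExecutor, as_completed

def speculative_read(servers, decoder):
    """Request redundant coded blocks from all servers in parallel;
    finish as soon as any sufficient subset arrives, abandoning stragglers."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(srv.fetch_block) for srv in servers]
        for done in as_completed(futures):
            decoder.add_block(done.result())
            if decoder.can_decode():       # enough early-arriving blocks
                for f in futures:
                    f.cancel()             # cancel not-yet-started requests
                return decoder.decode()
    raise IOError("not enough blocks received to reconstruct the data")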
Figure 2. RobuSTore Data Access with Similar Storage: reconstruction with the first-arriving coded blocks. Disk performance is varied.

For reasons of flexibility in degree of redundancy and reconstruction dependences, we use LT codes rather than Reed-Solomon codes [28, 29]. The Luby Transform (LT) codes [30] are part of a general class of low-density parity check codes [31] and use block-XOR operations based on sparse bipartite graphs. Experiments have shown that these codes can be implemented with sufficiently high performance on conventional microprocessors [32]. We use erasure codes to provide a continuum of redundancy, enabling a flexible tradeoff of resources against robustness and high performance. In general, we encode N source data blocks into M erasure-coded blocks for any M > N, and the coded blocks contain near-uniform redundancy for all N original blocks. During decoding, the structure of erasure codes allows reconstruction of the original data from almost any N(1+δ) coded blocks, where δ is zero or a small number determined by the choice of coding algorithm, and (1+δ) is called the reception overhead. In our current design and experiments, we use LT codes (modified to guarantee decodability [32]) with parameters that enable decoding with almost any set of blocks 1.4 times larger than the original data.

RobuSTore combines the decoding flexibility of erasure codes with speculative access. A read client in RobuSTore requests redundant coded blocks in parallel. In general, it receives them over a range of time, as shown in Figure 2. Benefiting from the decoding flexibility, the client can then reconstruct the original data using a set of early-returning blocks. We do not discuss write operations in this paper, as our workloads are read-dominated. Mixed workloads will be explored in future RobuSTore studies, as in [33-35].

2.2.1 Analyzing Benefits of Erasure Codes

Erasure codes provide greater reordering freedom than replication. For example, if we replicate an N-block file 3 times, only a smaller number of subsets of the resulting 4N blocks are sufficient to reconstruct the original data. This is because each block has 4 copies, one of which is required in any subset which enables reconstruction. In contrast, if we encode the N-block file into 4N blocks using Luby Transform (LT) codes, we can reconstruct with many more subsets. More formally, for replication, the probability of reconstruction with M random blocks for an N-block file replicated 3 times is:

$$P_r(M) = \sum_{i=1}^{N} (-1)^{N-i} \binom{N}{i} \frac{\binom{4i}{M}}{\binom{4N}{M}}$$
For an LT-encoded file, using our encoding settings and an equivalent expansion, the probability of reconstruction with M random blocks for an N-block file expanded to 4N total blocks is:
$$P_c(M) = \sum_{i=1}^{N} (-1)^{N-i} \binom{N}{i} \left(\frac{i}{N}\right)^{5M}$$
See the Appendix for the full analysis. In practice, nearly 3N blocks are needed for replication versus about 1.5N blocks for LT codes (see Figure 3).
Figure 3. Cumulative Probability of Reassembly of N Original Blocks from M Random Blocks
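To illustrate the gap, the following short Python sketch (ours, not from the paper) evaluates both formulas numerically by exact inclusion-exclusion; the degree parameter of 5 in p_lt mirrors the 5M exponent above and is an assumption about the encoding settings.

from math import comb
from fractions import Fraction

def p_replication(M, N, copies=4):
    """P_r(M): probability that M random blocks, drawn from copies*N total
    (each original block stored `copies` times), cover all N originals."""
    total = comb(copies * N, M)
    return float(sum((-1) ** (N - i) * comb(N, i)
                     * Fraction(comb(copies * i, M), total)
                     for i in range(1, N + 1)))

def p_lt(M, N, degree=5):
    """P_c(M): reconstruction probability for M LT-coded blocks, each
    referencing `degree` random source blocks (matches the 5M exponent)."""
    return float(sum((-1) ** (N - i) * comb(N, i)
                     * Fraction(i, N) ** (degree * M)
                     for i in range(1, N + 1)))

# Example: for N = 16 original blocks, find the smallest M with >= 99% success.
N = 16
for name, p in (("replication", p_replication), ("LT codes", p_lt)):
    M = next(m for m in range(N, 4 * N + 1) if p(m, N) >= 0.99)
    print(f"{name}: M = {M} blocks (~{M / N:.2f}x N)")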
2.2.2 RobuSTore Research Questions

The degree of redundancy encoded and the aggressiveness of speculative access are critical for RobuSTore performance. However, many other important research questions are also critical to understanding RobuSTore's potential to improve robustness and performance:
- With realistic disk variability, what improvements in robustness are possible?
- What coding parameters and access strategies are most effective?
- What performance improvement can be expected? With different numbers of disks? Block sizes?
- What capacity overhead, additional disk I/Os, and network bandwidth are required for a particular performance benefit?
We explore these questions in the following sections.

3 RobuSTore Design

The RobuSTore system framework is flexible, enabling broad exploration of the design space. We describe the components of a RobuSTore system and then discuss the key system architecture, component, and other design issues.
3.1 RobuSTore System Architecture
Figure 4. RobuSTore System Architecture
The RobuSTore storage system architecture includes three key components: clients, a metadata server, and storage servers (see Figure 4). Clients perform many distributed filesystem functions: accessing metadata, planning layout, encoding data, sending requests to storage servers, and decoding/assembling the completed response. The client includes an erasure code implementation to perform the encoding and decoding. It also includes a layout planner which places data on disk servers to meet specified QoS requirements. The Metadata Server maintains both file information (size, organization, and location) and storage server information (available space, performance, connectivity, current load, etc.). Storage Servers provide data storage at the block level (erasure-coded blocks, presumed to be larger than disk blocks). Servers may be single disks or arrays, and each implements local admission and access control. Servers typically have variable performance due to heterogeneity in hardware, data layout, or load.
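As a rough illustration of the state the Metadata Server might track, here is a small Python sketch; the field names are hypothetical, not the paper's schema.

from dataclasses import dataclass, field

@dataclass
class FileRecord:
    """Per-file metadata (illustrative fields only)."""
    name: str
    size: int                    # bytes of original (unencoded) data
    coding: dict                 # LT parameters, e.g. {"c": ..., "delta": ..., "redundancy": ...}
    block_locations: dict = field(default_factory=dict)  # coded block id -> server id

@dataclass
class ServerRecord:
    """Per-storage-server state (illustrative fields only)."""
    server_id: str
    free_space: int              # bytes available
    est_bandwidth: float         # recent performance estimate, MB/s
    current_load: float          # e.g. outstanding requests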
3.2 Basic RobuSTore Operations
Basic RobuSTore operation interfaces include:
open(filename, read/write, QoS_options)
read(fdescriptor, buffer, offset, length)
write(fdescriptor, data, offset, length)
close(fdescriptor)
Figure 5 depicts the basic read process: open, read, and close. The client opens the file, obtaining storage server and block location information from the Metadata Server, and obtaining any required locks. It also negotiates admission (bandwidth) and access with each storage server. To read, the client requests all coded blocks from the servers and performs LT decoding in parallel. When enough blocks have been received, the decoding completes and the original data is returned. Simultaneously, outstanding requests to the storage servers are cancelled. Close notifies the Metadata Server, releasing read locks, and also releases bandwidth reservations on the storage servers.
Figure 5. The Read Process in RobuSTore
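To make the flow concrete, here is an illustrative client-side read in Python against the interface listed in Section 3.2; the robustore module, the file path, and the QoS option names are hypothetical, not the paper's actual binding.

# Illustrative client-side read flow; robustore is a hypothetical binding.
import robustore

CHUNK = 256 * 1024 * 1024      # one large, read-dominated request (256 MB)

fd = robustore.open("sky_survey/image_0042.raw", "read",
                    QoS_options={"latency_target_s": 2.0})
buffer = bytearray(CHUNK)
# The library requests redundant coded blocks from all storage servers,
# decodes with the first sufficient subset, and cancels stragglers.
nread = robustore.read(fd, buffer, offset=0, length=CHUNK)
robustore.close(fd)            # releases locks and bandwidth reservations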
Writes are similar. On a write, the client accesses the Metadata Server, opens the file, and, based on disk map information and application QoS requirements, plans a layout: an encoding and distribution across storage servers. The client then allocates the needed storage on the servers, and encodes and writes the data appropriately using the specified LT coding parameters (C, δ, and redundancy). After the data is committed on the servers, the client registers the data locations and file structures with the Metadata Server.
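The corresponding write flow, under the same hypothetical robustore binding as above (the redundancy option name is our assumption):

import robustore

data = open("local/image_0042.raw", "rb").read()

fd = robustore.open("sky_survey/image_0042.raw", "write",
                    QoS_options={"redundancy": 4.0})  # ~4x coded expansion
# Internally: query the Metadata Server's disk map, plan server placement,
# LT-encode (parameters C, delta, redundancy), write blocks in parallel,
# then register block locations with the Metadata Server on commit.
robustore.write(fd, data, offset=0, length=len(data))
robustore.close(fd)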
3.3 Key Design Decisions
The RobuSTore design reflects major choices in architecture and component implementation. General design principles include modularity, performance, scalability, and simplicity. We summarize the major design decisions below.

Coding: The coding modules (ENC/DEC) are logically in the client, which is critical if the composition of variable performance is to encompass network variability. If desired, coding can run on a separate machine or cluster to save client CPU or improve coding speed. Because the clients manage the coding (and store information about it in the Metadata Server), the storage servers need not be aware of coding. While there are many different erasure codes, such as Reed-Solomon [29], Raptor [36], and Tornado [37], RobuSTore uses modified Luby Transform (LT) codes [30, 32]. LT codes have a number of advantages. First, they are "rateless", allowing redundancy to be decoupled from other system design issues, such as the number of storage servers used. Second, LT codes use irregular bipartite graphs and block-XOR operations, enabling fast implementations [32]. Third, LT code structure allows decoding to be overlapped with receiving data from storage servers, masking coding time. However, as defined, pure LT codes do not guarantee that the original data can be recovered. To fix this, we adapted LT codes by generating coding graphs [32] which guarantee decodability, and built a fast implementation which improved decoding bandwidth ten-fold, from 30 MByte/s to 320 MByte/s on an Intel 2.4 GHz Xeon CPU. Our work suggests that multiprocessors of the future will surpass 1 GByte/s of coding bandwidth. A simplified encoder illustrating the block-XOR structure appears below.
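This is a toy Python sketch of an LT-style encoder under assumed parameters, not the paper's modified implementation; real LT codes draw block degrees from the robust soliton distribution and require the decodability guarantees discussed above.

import random

def lt_encode_block(source_blocks, rng):
    """Produce one coded block: the XOR of a randomly chosen subset of
    source blocks. Toy version: degree drawn uniformly, whereas real
    LT codes use the robust soliton distribution."""
    n = len(source_blocks)
    degree = rng.randint(1, min(5, n))           # assumed small max degree
    neighbors = rng.sample(range(n), degree)     # edges of the bipartite graph
    coded = bytearray(source_blocks[neighbors[0]])
    for idx in neighbors[1:]:
        block = source_blocks[idx]
        for j in range(len(coded)):              # block-XOR accumulation
            coded[j] ^= block[j]
    return neighbors, bytes(coded)               # graph edges + payload

# Encode N source blocks into M > N coded blocks ("rateless": any M works).
rng = random.Random(0)
source = [bytes([i]) * 4096 for i in range(16)]  # 16 source blocks of 4 KB
coded = [lt_encode_block(source, rng) for _ in range(40)]  # 2.5x redundancy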
Metadata Service: A simple central metadata server minimizes update and synchronization costs at some penalty in scalability. Reliability and scalability can be addressed by well-known replication and failover techniques [14, 38]. The metadata server maintains object-level information, so the storage servers function as object servers. The benefits of object-based storage systems are well recognized [20, 39, 40], and this approach supports our future extension of RobuSTore to federated storage servers in the Grid [41].

Admission and Access Control: Because RobuSTore aggregates distributed resources which may be shared with local or other distributed use, central admission control is not possible. Each storage server implements local admission control (for QoS) and access control. Clients must negotiate access on file open. There are many good candidates for distributed filesystem security; credential-based access control [42] is flexible in environments with a large number of servers and users.

4 Evaluation

We use detailed discrete-event simulation to evaluate the RobuSTore idea across the