External Memory Parallel Sorting by Sampling
Yongzheng Zhang, Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5
yongzhen@cs.dal.ca  www.cs.dal.ca/~yongzhen
Rui Zhang, Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5
rzhang@cs.dal.ca  www.cs.dal.ca/~rzhang
Abstract
This paper introduces an external memory parallel sorting algorithm for a multiprocessor architecture. The overall goal is to choose p−1 partitioning elements so that the final p sorted files, one per processor, are of roughly equal size. The algorithm first determines a sample of splitters by either regular sampling or random sampling. Then the data file at each processor is partitioned according to the final splitters, and the resulting sublists are redistributed to the appropriate processors. Finally, each processor sorts incoming records into runs and merges the sorted runs into a fully sorted file. We implemented the algorithm using C and the MPI package and tested its performance on both a cluster of Sun Solaris workstations and a Linux cluster, CGM1. The results indicate that regular sampling provides better performance than random sampling.
1 Introduction
The classical problem of sorting and related processing is universally acknowledged to be important and fundamental in computing [3, 5], because it accounts for a significant percentage of computing resources in large-scale applications [9, 12], and because sorting is an important paradigm in the design of efficient external memory algorithms. Currently, the substantial access gap between internal memory and disk memory is growing rapidly, since the latency and bandwidth of memory chips are improving more quickly than those of disks. The use of parallel processors further widens the gap [16], which leads to a performance bottleneck. In light of this, the specific problem of external memory sorting assumes particular importance [3, 5].
In external memory sorting, data sets are often too massive to fit completely inside the computer's internal memory. The data items are typically stored on disks. The resulting input/output communication (or I/O) between fast internal memory and slower external memory can be a major performance bottleneck [16], as I/O is a fundamental and frequently used operation during the sorting process [3].
External memory algorithms explicitly control data placement and movement, so it is very important for algorithm designers to have a simple but reasonably accurate model of the memory system's characteristics. Vitter and Shriver introduced the commonly used parallel disk model (PDM) [15] to capture the main properties of multiple disk systems:

• N = problem size (in units of data items),
• M = internal memory size (in units of data items),
• B = block transfer size (in units of data items),
• D = number of independent disk drives, and
• p = number of processors,

where M < N and 1 < DB < M/2. All data items are assumed to be of fixed length. In a single I/O, each of the D disks can simultaneously transfer a block of B contiguous data items. PDM provides an elegant and reasonably accurate model for analyzing the relative performance of external memory algorithms and data structures [16].
The key to achieving efficient I/O performance in external memory applications is to design the application to access its data with a high degree of locality [16]. In our project, we assume D = p, which means each processor holds exactly one independent disk (precisely, one big file). We also assume that the work of the individual nodes of the machine (including I/O, CPU, and network) is done completely in parallel with that of the other nodes. These assumptions allow us to consider the time for a single processor to perform its segment of the computation as the total time for the multiprocessor to perform the full computation [6].
At the top level,our algorithm works as follows:
• Determine a sample S with p−1 records s_1, s_2, ..., s_{p−1}, such that in the final sorted order, all records on processor P_1 have sorting key values less than or equal to s_1, all records on processor P_2 have sorting key values greater than s_1 but less than or equal to s_2, and so on, until all records on processor P_p have sorting key values greater than s_{p−1}.
• Based on the sample, all processors redistribute each block of records in the file so that each record ends up at the appropriate processor. After the redistribution, each processor sorts the local records and saves them in a temporary file.
• Each processor merges its local temporary files into a final sorted file.
The remainder of this paper is organized as follows: first we review the relevant literature in the area of external memory parallel sorting, next we present our main algorithms, and finally we show the experimental results.
2 Review of the Relevant Literature
Over the last few decades, there has been a great deal of work on sorting, because sorting and sorting-like operations account for a significant percentage of computer use [16, 9]. Many recently developed external memory sorting algorithms use disks independently. These algorithms are based upon the distribution and merge paradigms, which are the two generic approaches to sorting [16].
Distribution sort is a recursive process in which we determine a set of p−1 partitioning elements to partition all the items into p disjoint buckets. All the items in one bucket precede all the items in the next bucket. We complete the sorting by recursively sorting the individual buckets and concatenating them to form a single fully sorted list [16]. The requirement here is that we choose the p−1 partitioning elements so that the buckets are of roughly equal size, i.e., the load balance is good. Much work in this area has been done.
DeWitt et al. [6] consider external memory parallel sorting on a shared-nothing multiprocessor. The critical step is to determine the range of sorting keys to be handled by each processor. Two techniques for determining these ranges are introduced: probabilistic splitting, which uses sampling to estimate quantiles, and exact splitting, which uses a parallel version of the algorithm proposed by Iyer et al. [8]. The first step of the exact splitting algorithm requires that each processor fully sort its file fragment, producing p sorted runs, one per processor. It does not work efficiently when the size of the file segment is much larger than the internal memory size.
Quinn [13] has suggested implementing a parallel quicksort. The algorithm chooses p−1 partitioning elements at random and runs the recursive quicksorts in parallel. Huang and Chow [7] consider external memory parallel sorting using approximate partitioning based upon sampling.
The performance of distribution sort primarily depends on how well the data can be evenly partitioned into smaller ordered buckets. Unfortunately, no general, effective method is currently available, and how to achieve linear speedup for parallel sorting on multiprocessors remains an open question [14].
The merge sort paradigm is somewhat orthogonal to the distribution paradigm [16]. A typical merge sort algorithm consists of two main phases. In the "run formation" phase, a set of data blocks is scanned, one memory load at a time. Each memory load is sorted into a single "run", which is then written onto the disks. After the initial runs are formed, the merging phase begins. In each pass of the merging phase, data items in a group of runs are merged as they stream through the internal memory. Merge sort algorithms (e.g., [1, 3, 4]) perform well only with a small number of processors [14].
Chaudhry et al. [5] introduce an efficient algorithm based on Leighton's columnsort algorithm [11]. It sorts N numbers, which are treated as an r×s matrix, where N = rs, s is a divisor of r, and r ≥ 2(s−1)^2. When columnsort completes, the matrix is sorted in column-major order. That is, each column is sorted, and the keys in each column are no larger than the keys in the columns to its right.
3 Parallel Sorting Algorithms
Our overall goal is to choose p−1 partitioning elements so that the final p sorted files are of roughly equal size, i.e., the load balance is good. We used two sampling techniques to select a sample of splitters: regular sampling and random sampling. Regular sampling selects splitters at equal intervals, while random sampling selects a certain number of pivots at random.
3.1 Main Algorithm
Our approach mainly derives from the work by Shi et al. [14], an internal memory parallel sort by regular sampling. We also applied the random sampling technique introduced by Quinn [13]. In this project we consider the problem of external memory parallel sorting in a distributed memory system. In this multiprocessor architecture, each processor has its own memory and an independent disk (precisely, a big file), and all communication among processors must happen through an interconnection network.
In this sorting scenario, we have a data file with N integer numbers created with a data generator. Each processor holds a disk file with N/p unsorted records. At the termination of the sorting algorithm, the files have been partitioned, redistributed and merged into approximately equal-sized, non-overlapping sorted files, which must again be on disks, one at each processor. In more detail, our algorithm can be described as follows:
/* Input: original file list F = f_1, f_2, ..., f_p (total size is N, where p is the number of processors). Processor P_i holds N/p unsorted data items stored in file f_i (1 ≤ i ≤ p). Output: sorted file list F' = f'_1, f'_2, ..., f'_p, where all records in f'_i are less than or equal to those in f'_{i+1} (1 ≤ i ≤ p−1). */
1. Each processor P_i samples its disk-resident file f_i: it reads all data elements block by block (B = 128K) and selects p−1 pivots at equal intervals (or at random) from each block to form the set of splitters S_i. The size of S_i is therefore N(p−1)/(Bp).
2. The coordinating processor P_1 gathers the unsorted samples S_i from all other processors. P_1 then sorts the set of samples to form a regular sample S' of size N(p−1)/B. The final sample S, with p−1 elements selected at equal intervals from the regular sample S', is then broadcast to all other processors.
3. Each processor P_i reads and sorts data items block by block from local file f_i, and redistributes the records to the appropriate processors using the final sample S. When a processor's memory has been filled with incoming records, the processor sorts the records, writes the sorted run onto disk as a temporary file, and continues reading incoming records.
4. In parallel, the processors merge the sorted runs (precisely, the temporary files) back onto disk as the final sorted files f'_i.
Note that during the sampling step, if we select pivots by regular sampling, then each data block has to be sorted first. This additional sorting time can be saved in an alternative way: write the sorted blocks into local temporary files, and in the third step have each processor read these files (sorted blocks) rather than the unsorted data blocks of the original data file. This means each data block of the original file has to be sorted only once.
3.2 Parallel Disk Model Parameters
Parallel Disk Model (PDM) is an elegant and reasonably accurate model for analyzing the relative performance of external memory parallel algorithms. The PDM parameters of our project are as follows:

• N = 16M, problem size (in units of data items),
• M = 8M, internal memory size (in units of data items),
• B = 128K, optimal block transfer size (in units of data items),
• D = p = 16, maximum number of processors (disk files).
All data items are assumed to be of fixed length, 4 bytes each. This means the problem size is 64MB and the internal memory size is 32MB. For example, when 16 processors sort the 16M data items (1M data items per processor) in parallel, the first two steps of the algorithm above look like this: First, each processor can simultaneously transfer a block of 128K contiguous data items, giving a total of 1M/128K = 8 local blocks. From each block, 15 pivots are selected at equal intervals or at random, yielding a set of 15 × 8 = 120 splitters. Next, the coordinating processor P_1 gathers all 16 sets of splitters and forms the regular sample of size 120 × 16 = 1920. Then the final sample of 15 records is determined at equal intervals.
If we fix the optimal block size B = 128K (as shown in the next section) and the internal memory size M = 8M, then in order for the regular sample of N(p−1)/B splitters to fit completely on the coordinating processor (supposing it may take one quarter of the internal memory), the total problem size can be as large as N = (8M/4) × 128K / (16−1) = 17.1G (in data items). This means we can make the reasonable assumption that the whole regular sample fits on one processor.
4 Performance Evaluation
In this section we discuss the experimental examination of our algorithm. We first describe our experimental methodology and then present the performance results obtained.
4.1 Experimental Setup and Methodology
In order to investigate the performance of our external memory parallel sorting algorithm based on regular sampling and random sampling, we implemented the algorithm using C and the MPI communication library [2]. We also used a data generator which can create data sets of various sizes (from uniform to skewed data created via Zipf distributions).
We constructed a parallel machine for preliminary experiments. It consists of 16 UltraSPARC-III processors running Sun Solaris 8; each processor has 64MB of local memory and a 10Mb Ethernet card. We ran all final experiments on CGM1, a commercial-grade, 32-board, dual-processor Linux cluster. In total there are 64 Xeon 1.8 GHz chips (2 per board), 32GB of distributed memory (1GB per board), and 2.56TB of distributed external memory (two 40GB IDE disks per board). The boards communicate over a 100Mb (Fast Ethernet) interconnect.
All sequential times were measured as wall clock times in seconds, running on each processor. All parallel times were measured as the wall clock time between the start of the first process and the termination of the last process. These times include all I/O. Furthermore, all wall clock times were measured with no user other than us on CGM1. Instead of running applications directly from the command line, CGM1 uses a batch submission infrastructure that guarantees fair, efficient use of resources. A queuing system known as OpenPBS (Portable Batch System) monitors all submission requests and takes care of running the applications and returning the results.
In our project, we ran a set of experiments (each test was repeated four times in order to eliminate the influence of varying run-time characteristics) to evaluate the performance of our algorithm in the following steps:
1. Optimal block size: we executed our algorithm with a fixed data size and number of processors, but various block sizes, to determine which block size achieves the shortest execution time.
2. Speedup experiments: we executed our algorithm on a single processor of the CGM1 machine and measured the sequential wall clock time. Then we executed the algorithm on up to 16 processors of the CGM1 machine and measured the parallel wall clock time.
3. Load balance experiments: we executed our algorithm with the regular sampling and random sampling techniques on up to 16 processors of the CGM1 machine and evaluated the load balance performance in each case.
4. Scaleup experiments: we executed our algorithm with regular sampling on 2 to 16 processors of the CGM1 machine and evaluated its scalability.
5. Sizeup experiments: we executed our algorithm with regular sampling on 4 processors of the CGM1 machine and evaluated its sizeup performance.
4.2 Performance Results
Typically there are four metrics for evaluating the performance of a parallel algorithm: speedup, load balance, scaleup and sizeup. Speedup is a useful metric because it indicates whether additional processors result in a corresponding decrease in the sorting time. Load balance measures whether all processors carry an equal share of the load. Scaleup indicates whether a constant response time can be maintained when the workload is increased by adding a proportional number of processors and files. For the sizeup evaluation, we instead fix the number of processors and files but increase the number of data elements; sizeup experiments indicate the growth rate of the execution time as a function of the problem size [5].
4.2.1 Optimal Block Size
The minimum block transfer size imposed by hardware is often 512 bytes, but operating systems generally use a larger block size, such as 16KB [16]. Since the CPU cost of the sorting is independent of the size of each block, producing longer runs tends to reduce the I/O cost while making the sorting CPU bound [6].
It is possible to use larger blocks to reduce the relative significance of seek and rotational latency, but the wall clock time per I/O will increase accordingly. For best results in applications where the data are streamed through internal memory, the block transfer size B in PDM "should be considered to be a fixed hardware parameter a little larger than the track size (say, on the order of 100KB for most disks)" [16].
In order to find the optimal block size for our algorithm on the cluster, we executed the algorithm with a fixed 16M data items and 16 processors, but various block sizes ranging from 8K to 512K, to determine which block size achieves the shortest execution time.
As shown in Table 1, we ran seven tests based on block sizes of 8K, 16K, 32K, 64K, 128K, 256K and 512K (in data items), respectively. By calculating the average time in seconds to finish the parallel sorting, we can see that a block size of 128K achieves the shortest sorting time, as indicated in Figure 1.
4.2.2 Speedup Performance Evaluation
For the speedup experiments, we fixed the problem size at 16M data items while varying the number of processors from 1 to 16. The data items were created using the data generator