External Memory Parallel Sorting by Sampling
Yongzheng Zhang, Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5
yongzhen@cs.dal.ca  www.cs.dal.ca/~yongzhen
Rui Zhang, Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5
rzhang@cs.dal.ca  www.cs.dal.ca/~rzhang
Abstract
This paper introduces an external memory parallel sorting algorithm for a multiprocessor architecture. The overall goal is to choose p−1 partitioning elements so that the final p sorted files, one per processor, are of roughly equal size. The algorithm first determines a sample of splitters by either regular sampling or random sampling. Then the data file at each processor is partitioned according to the final splitters, and the resulting sublists are redistributed to the appropriate processors. Finally, each processor sorts incoming records into runs and merges the sorted runs into a fully sorted file. We implemented the algorithm using C and the MPI package and tested its performance on both a cluster of Sun Solaris workstations and a Linux cluster, CGM1. The results indicate that regular sampling provides better performance than random sampling.
1 Introduction
The classical problem of sorting and related processing is universally acknowledged to be important and fundamental in computing [3, 5], because it accounts for a significant percentage of computing resources in large-scale applications [9, 12], and because sorting is an important paradigm in the design of efficient external memory algorithms. Currently, the substantial access gap between internal memory and disk memory is growing rapidly, since the latency and bandwidth of memory chips are improving more quickly than those of disks. The use of parallel processors further widens the gap [16], which leads to a performance bottleneck. In light of this, the specific problem of external memory sorting assumes particular importance [3, 5].
In external memory sorting, data sets are often too massive to fit completely inside the computer's internal memory. The data items are typically stored on disks. The resulting input/output communication (or I/O) between fast internal memory and slower external memory can be a major performance bottleneck [16], as I/O is a fundamental and frequently used operation during the sorting process [3].
External memory algorithms explicitly control data placement and movement, so it is very important for algorithm designers to have a simple but reasonably accurate model of the memory system's characteristics. Vitter and Shriver introduced the commonly used parallel disk model (PDM) [15] to capture the main properties of multiple disk systems:

• N = problem size (in units of data items),
• M = internal memory size (in units of data items),
• B = block transfer size (in units of data items),
• D = number of independent disk drives, and
• p = number of processors,

where M < N and 1 < DB < M/2. All data items are assumed to be of fixed length. In a single I/O, each of the D disks can simultaneously transfer a block of B contiguous data items. PDM provides an elegant and reasonably accurate model for analyzing the relative performance of external memory algorithms and data structures [16].
The key to achieving efficient I/O performance in external memory applications is to design the application to access its data with a high degree of locality [16]. In our project, we assume D = p, which means each processor holds exactly one independent disk (precisely, one big file). We also assume that the work of the individual nodes of the machine (including I/O, CPU, and network) is done completely in parallel with that of the other nodes. These assumptions allow us to consider the time for a single processor to perform its segment of the computation as the total time for the multiprocessor to perform the full computation [6].
At the top level,our algorithm works as follows:
• Determine a sample S with p−1 records s_1, s_2, ..., s_{p−1}, such that in the final sorted order, all records on processor P_1 have sorting key values less than or equal to s_1, all records on processor P_2 have sorting key values greater than s_1 but less than or equal to s_2, and so on, until all records on processor P_p have sorting key values greater than s_{p−1}.
• Based on the sample, all processors redistribute each block of records in the file so that each record ends up at the appropriate processor. After the redistribution, each processor sorts the local records and saves them in a temporary file.
• Each processor merges its local temporary files into a final sorted file.
The remainder of this paper is organized as follows: first we review the relevant literature in the area of external memory parallel sorting, next we present our main algorithms, and finally we show the experimental results.
2 Review of the Relevant Literature
Over the last few decades, there has been a great deal of work on sorting, because sorting and sorting-like operations account for a significant percentage of computer use [16, 9]. Many recently developed external memory sorting algorithms use disks independently. These algorithms are based upon the distribution and merge paradigms, which are the two generic approaches to sorting [16].
Distribution sort is a recursive process in which we determine a set of p−1 partitioning elements to partition all the items into p disjoint buckets. All the items in one bucket precede all the items in the next bucket. We complete the sorting by recursively sorting the individual buckets and concatenating them to form a single fully sorted list [16]. The requirement here is that we choose the p−1 partitioning elements so that the buckets are of roughly equal size, i.e., the load balance is good. Much work in this area has been done.
DeWitt et al. [6] consider external memory parallel sorting on a shared-nothing multiprocessor. The critical step is to determine the range of sorting keys to be handled by each processor. Two techniques for determining these ranges are introduced: probabilistic splitting, which uses sampling to estimate quantiles, and exact splitting, which uses a parallel version of the algorithm proposed by Iyer et al. [8]. The first step of the exact splitting algorithm requires that each processor fully sort its file fragment, producing p sorted runs, one per processor. It does not work efficiently when the size of the file segment is much larger than the internal memory size.
Quinn [13] has suggested implementing a parallel quicksort. The algorithm chooses p−1 partitioning elements at random and runs the recursive quicksorts in parallel. Huang and Chow [7] consider external memory parallel sorting using approximate partitioning based upon sampling.
The performance of distribution sort primarily depends on how well the data can be evenly partitioned into smaller ordered buckets. Unfortunately, no general, effective method is currently available, and how to achieve linear speedup for parallel sorting on multiprocessors remains an open question [14].
The merge sort paradigm is somewhat orthogonal to the distribution paradigm [16]. A typical merge sort algorithm consists of two main phases. In the "run formation" phase, a set of data blocks is scanned, one memory load at a time. Each memory load is sorted into a single "run", which is then written onto the disks. After the initial runs are formed, the merging phase begins. In each pass of the merging phase, data items in a group of runs are merged as they stream through the internal memory. Merge sort algorithms (e.g., [1, 3, 4]) perform well only with a small number of processors [14].
Chaudhry et al. [5] introduce an efficient algorithm based on Leighton's columnsort algorithm [11]. It sorts N numbers, which are treated as an r×s matrix, where N = rs, s is a divisor of r, and r ≥ 2(s−1)^2. When columnsort completes, the matrix is sorted in column-major order. That is, each column is sorted, and the keys in each column are no larger than the keys in the columns to its right.
3 Parallel Sorting Algorithms
Our overall goal is to choose p−1 partitioning elements so that the final p sorted files are of roughly equal size, i.e., the load balance is good. We used two sampling techniques to select a sample of splitters: regular sampling and random sampling. Regular sampling selects splitters at equal intervals, while random sampling selects a certain number of pivots at random.
3.1 Main Algorithm
Our approach mainly derives from the work by Shi et al. [14], an internal memory parallel sort by regular sampling. We also applied the random sampling technique introduced by Quinn [13]. In this project we consider the problem of external memory parallel sorting in a distributed memory system. In this multiprocessor architecture, each processor has its own memory and an independent disk (precisely, a big file), and all communication among processors must happen through an interconnection network.
In this sorting scenario, we have a data file with N integer numbers created with a data generator. Each processor holds a disk file with N/p unsorted records. At the termination of the sorting algorithm, the files have been partitioned, redistributed and merged into approximately equal-sized, non-overlapping sorted files, which must again be on disks, one at each processor. In more detail, our algorithm can be described as follows:
/* Input: original file list F = f_1, f_2, ..., f_p (total size is N, where p is the number of processors). Processor P_i holds N/p unsorted data items stored in file f_i (1 ≤ i ≤ p). Output: sorted file list F' = f'_1, f'_2, ..., f'_p, where all records in f'_i are less than or equal to those in f'_{i+1} (1 ≤ i ≤ p−1). */
1. Each processor P_i samples its disk-resident file f_i: it reads all data elements block by block (B = 128K) and selects p−1 pivots at equal intervals (or at random) from each block to form the set of splitters S_i. The size of S_i is therefore N(p−1)/(Bp).
2. The coordinating processor P_1 gathers the unsorted samples S_i from all other processors. P_1 then sorts the set of samples to form a regular sample S' of size N(p−1)/B. The final sample S, with p−1 elements selected at equal intervals from the regular sample S', is then broadcast to all other processors.
3. Each processor P_i reads and sorts data items block by block from local file f_i, and redistributes the records to the appropriate processors using the final sample S. When a processor's memory has been filled with incoming records, the processor sorts the records, writes the sorted run onto disk as a temporary file, and continues reading incoming records.
4. In parallel, the processors merge the sorted runs (precisely, the temporary files) back onto disk as the final sorted files f'_i.
Note that during the sampling step, if we select pivots by regular sampling, then each data block has to be sorted first. This additional sorting time can be saved in an alternative way: write the sorted blocks into local temporary files, and in the third step have each processor read these files (sorted blocks) rather than the unsorted data blocks of the original data file. This means each data block of the original file has to be sorted only once.
3.2 Parallel Disk Model Parameters
Parallel Disk Model (PDM) is an elegant and reasonably accurate model for analyzing the relative performance of external memory parallel algorithms. The PDM parameters of our project are as follows:

• N = 16M, problem size (in units of data items),
• M = 8M, internal memory size (in units of data items),
• B = 128K, optimal block transfer size (in units of data items),
• D = p = 16, maximum number of processors (disk files).
All data items are assumed to be of fixed length, 4 bytes each. This means the problem size is 64MB and the internal memory size is 32MB. For example, when 16 processors sort the 16M data items (1M data items per processor) in parallel, the first two steps of the algorithm above look like this: First, each processor can simultaneously transfer a block of 128K contiguous data items, giving a total of 1M/128K = 8 local blocks. From each block, 15 pivots are selected at equal intervals or at random, yielding a set of 15 × 8 = 120 splitters. Next, the coordinating processor P_1 gathers all 16 sets of splitters and forms the regular sample of size 120 × 16 = 1920. Then the final sample of 15 records is determined at equal intervals.
If we fix the optimal block size B = 128K (as shown in the next section) and the internal memory size M = 8M, then in order for the regular sample of N(p−1)/B splitters to fit completely on the coordinating processor (supposing it may take one quarter of the internal memory), the total problem size can be as large as N = (8M/4) × 128K / (16−1) = 17.1G (in data items). This means we can make the reasonable assumption that the whole regular sample fits on one processor.
4 Performance Evaluation
In this section we discuss the experimental examination of our algorithm. We first describe our experimental methodology and then present the performance results obtained.
4.1 Experimental Setup and Methodology
In order to investigate the performance of our external memory parallel sorting algorithm based on regular sampling and random sampling, we implemented the algorithm using C and the MPI communication library [2]. We also used a data generator which can create data sets of various sizes (from uniform to skewed data created via Zipf distributions).
We constructed a parallel machine for preliminary experiments. It consists of 16 UltraSPARC-III processors running Sun Solaris 8; each processor has 64MB of local memory and a 10Mb Ethernet card. We ran all final experiments on CGM1, a commercial-grade, 32-board, dual-processor Linux cluster. In total there are 64 Xeon 1.8 GHz chips (2 per board), 32GB of distributed memory (1GB per board), and 2.56TB of distributed external memory (two 40GB IDE disks per board). The boards communicate over a 100Mb (Fast Ethernet) interconnect.
All sequential times were measured as wall clock times in seconds, running on each processor. All parallel times were measured as the wall clock time between the start of the first process and the termination of the last process. These times include all I/O. Furthermore, all wall clock times were measured with no user other than us on CGM1. Instead of running applications directly from the command line, CGM1 uses a batch submission infrastructure that guarantees fair, efficient use of resources. A queuing system known as OpenPBS (Portable Batch System) monitors all submission requests and takes care of running the applications and returning the results.
In our project, we ran a set of experiments (each test was repeated four times in order to eliminate the influence of varying run-time characteristics) to evaluate the performance of our algorithm in the following steps:
1. Optimal block size: we executed our algorithm with a fixed data size and number of processors, but various block sizes, to determine which block size achieves the shortest execution time.
2. Speedup experiments: we executed our algorithm on a single processor of the CGM1 machine and measured the sequential wall clock time. Then we executed the algorithm on up to 16 processors of the CGM1 machine and measured the parallel wall clock time.
3. Load balance experiments: we executed our algorithm with the regular sampling and random sampling techniques on up to 16 processors of the CGM1 machine and evaluated the load balance performance in each case.
4. Scaleup experiments: we executed our algorithm with regular sampling on 2 to 16 processors of the CGM1 machine and evaluated its scalability.
5. Sizeup experiments: we executed our algorithm with regular sampling on 4 processors of the CGM1 machine and evaluated its sizeup performance.
4.2 Performance Results
Typically there are four metrics for evaluating the performance of a parallel algorithm: speedup, load balance, scaleup and sizeup. Speedup is a useful metric because it indicates whether additional processors result in a corresponding decrease in the sorting time. Load balance measures whether all processors carry an equal share of the load. Scaleup indicates whether a constant response time can be maintained when the workload is increased by adding a proportional number of processors and files. For the sizeup evaluation, we instead fix the number of processors and files but increase the number of data elements; sizeup experiments indicate the growth rate of the execution time as a function of the problem size [5].
4.2.1 Optimal Block Size
The minimum block transfer size imposed by hardware is often 512 bytes, but operating systems generally use a larger block size, such as 16KB [16]. Since the CPU cost of the sorting is independent of the size of each block, producing longer runs tends to reduce the I/O cost while making the sorting CPU bound [6].
It is possible to use larger blocks to reduce the relative significance of seek and rotational latency, but the wall clock time per I/O will increase accordingly. For best results in applications where the data are streamed through internal memory, the block transfer size B in PDM "should be considered to be a fixed hardware parameter a little larger than the track size (say, on the order of 100KB for most disks)" [16].
In order to find the optimal block size for our algorithm on the cluster, we executed the algorithm with a fixed 16M data items and 16 processors, but various block sizes ranging from 8K to 512K, to determine which block size achieves the shortest execution time.
As shown in Table 1, we ran seven tests based on block sizes of 8K, 16K, 32K, 64K, 128K, 256K and 512K (in data items), respectively. By calculating the average time in seconds to finish the parallel sorting, we can see that a block size of 128K achieves the shortest sorting time, as indicated in Figure 1.
4.2.2 Speedup Performance Evaluation
For the speedup experiments, we fixed the problem size at 16M data items while varying the number of processors from 1 to 16. The data items were created using the data generator