
repartition vs. coalesce in Spark: Usage Differences and Source Code Analysis

Updated: 2023-06-19 14:50:04


In the Spark source code, repartition simply calls: coalesce(numPartitions, shuffle = true)

/**
 * Return a new RDD that has exactly numPartitions partitions.
 *
 * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
 * a shuffle to redistribute data.
 *
 * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
 * which can avoid performing a shuffle.
 */
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

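Because the shuffle is always enabled, repartition can both increase and decrease the number of partitions.

Also because of the shuffle, intermediate data is written to disk. The drawback is slower execution; the advantage is that it is less prone to OOM than coalesce, since a coalesce without shuffle packs more data into each remaining task.

It accepts only a single Int parameter (the target number of partitions).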

The source of coalesce in Spark: def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)(implicit ord: Ordering[T] = null)

def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}
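The key assignment in the shuffle branch can be tried out without a cluster. The following is a minimal sketch in plain Scala, not Spark code: assignKeys copies the logic of distributePartition, while targetPartition and outputSizes are hypothetical helpers that stand in for HashPartitioner and the shuffle itself. It assumes only the Scala standard library.

```scala
import scala.util.Random

object DistributePartitionSketch {
  /** Same logic as distributePartition: start at a random position seeded
    * by the input partition index, then increment the key per element. */
  def assignKeys[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  /** Hypothetical stand-in for HashPartitioner: an Int key hashes to itself,
    * so the output partition is the non-negative remainder of the key. */
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val raw = key % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  /** Counts how many elements land in each output partition. */
  def outputSizes[T](inputs: Seq[Seq[T]], numPartitions: Int): Vector[Int] = {
    val sizes = Array.fill(numPartitions)(0)
    inputs.zipWithIndex.foreach { case (part, index) =>
      assignKeys(index, part.iterator, numPartitions).foreach { case (key, _) =>
        sizes(targetPartition(key, numPartitions)) += 1
      }
    }
    sizes.toVector
  }

  def main(args: Array[String]): Unit = {
    // Two skewed input partitions holding 9 and 3 elements.
    val inputs: Seq[Seq[Int]] = Seq(1 to 9, 10 to 12)
    println(outputSizes(inputs, 3)) // prints Vector(4, 4, 4)
  }
}
```

Because the keys within one input partition are consecutive integers, each input partition spreads its elements round-robin over the output partitions. The random starting position only decides which output partition receives the first element, so skewed inputs still come out balanced.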

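coalesce repartitions the data and lets you choose whether to shuffle, controlled by the parameter shuffle: Boolean = false/true.

Shuffle is off by default, so by default coalesce can only reduce the number of partitions; if a larger number is requested, the partition count stays unchanged.

With shuffle enabled, the effect is the same as repartition, and the data is distributed with a HashPartitioner.

Unlike repartition, coalesce also accepts a custom PartitionCoalescer, which must be Serializable.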

Summary: to reduce the number of partitions, coalesce is sufficient; avoid the shuffle where possible.
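The behaviour described above can be checked through the partition counts alone. This is a minimal sketch that assumes spark-core is on the classpath and runs in local mode; the object name, app name, and data are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionVsCoalesceDemo {
  /** Partition counts for, in order: the original RDD, repartition(8),
    * coalesce(2), coalesce(8) without shuffle, coalesce(8) with shuffle. */
  def partitionCounts(): Seq[Int] = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("repartition-vs-coalesce")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 100, numSlices = 4)
      Seq(
        rdd.getNumPartitions,                             // starting point
        rdd.repartition(8).getNumPartitions,              // shuffle: can grow
        rdd.coalesce(2).getNumPartitions,                 // no shuffle: can shrink
        rdd.coalesce(8).getNumPartitions,                 // no shuffle: cannot grow
        rdd.coalesce(8, shuffle = true).getNumPartitions  // same as repartition(8)
      )
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(partitionCounts().mkString(", "))
}
```

Starting from 4 partitions, the expected counts are 4, 8, 2, 4, 8: only the calls that shuffle manage to increase the partition count.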
