repartition vs. coalesce in Spark: Usage Differences and Source Code Analysis
In the Spark source code, repartition simply calls: coalesce(numPartitions, shuffle = true)
  /**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
  }
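Because the shuffle is always enabled, repartition can both increase and decrease the number of partitions.
Also because of the shuffle, intermediate data is written to disk. The drawback is lower performance; the benefit is that it is less prone to OOM than coalesce.
It accepts only a single Int argument (the target number of partitions).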
The source of coalesce in Spark: def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)(implicit ord: Ordering[T] = null)
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = (new Random(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
  }
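To see why the shuffle branch spreads elements evenly, the key-assignment step can be imitated in plain Scala without Spark. The following is only a sketch: the object name DistributePartitionSketch and the helper targetPartition are made up for illustration, and targetPartition assumes that a HashPartitioner maps a non-negative Int key to key mod numPartitions (as the comment in the source describes).

```scala
import scala.util.Random

object DistributePartitionSketch {
  // Imitates distributePartition from the source: every element gets a key that
  // grows by one, starting from a pseudo-random offset seeded by the index of
  // the input partition.
  def distributePartition[T](numPartitions: Int)(index: Int, items: Iterator[T]): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Assumption for illustration: for a non-negative Int key the target
  // partition is simply the key modulo the number of partitions.
  def targetPartition(key: Int, numPartitions: Int): Int = key % numPartitions

  def main(args: Array[String]): Unit = {
    val numPartitions = 4
    // Hypothetical input: upstream partition 0 holding ten elements.
    val keyed = distributePartition(numPartitions)(0, (1 to 10).iterator).toList
    val sizes = keyed
      .groupBy { case (key, _) => targetPartition(key, numPartitions) }
      .map { case (partition, elems) => partition -> elems.size }
    // Consecutive keys land in consecutive partitions, so the sizes differ by at most one.
    println(sizes.toSeq.sortBy(_._1))
  }
}
```

Because the keys are consecutive integers, the elements of each input partition are dealt out round-robin; the random starting offset only keeps different input partitions from all starting at the same output partition.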
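coalesce repartitions the data and lets you choose whether a shuffle takes place, controlled by the parameter shuffle: Boolean = false/true.
The shuffle is disabled by default, so by default coalesce can only reduce the number of partitions; asking for more partitions than the RDD already has leaves the count unchanged.
With the shuffle enabled, the effect is the same as repartition, and the data is distributed with a HashPartitioner.
Unlike repartition, coalesce can also take a custom PartitionCoalescer, which must be serializable (implement Serializable).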
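As an illustration of the custom coalescer option, here is a minimal sketch of a PartitionCoalescer that deals parent partitions into groups round-robin. The class name RoundRobinCoalescer is hypothetical, and the sketch assumes the developer API in org.apache.spark.rdd (PartitionCoalescer and PartitionGroup) as it exists in Spark 2.x and 3.x; it ignores data locality, which the built-in coalescer does take into account.

```scala
import org.apache.spark.rdd.{PartitionCoalescer, PartitionGroup, RDD}

// Hypothetical example class: groups parent partitions round-robin.
// It mixes in Serializable because the coalescer must be serializable.
class RoundRobinCoalescer extends PartitionCoalescer with Serializable {
  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup] = {
    val parentPartitions = parent.partitions
    // Never create more groups than there are parent partitions.
    val groupCount = math.min(maxPartitions, parentPartitions.length)
    val groups = Array.fill(groupCount)(new PartitionGroup())
    parentPartitions.zipWithIndex.foreach { case (partition, i) =>
      groups(i % groupCount).partitions += partition
    }
    groups
  }
}
```

It would be passed as the third argument, for example rdd.coalesce(2, shuffle = false, Some(new RoundRobinCoalescer)).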
Summary: if you are reducing the number of partitions, coalesce is enough; avoid the shuffle whenever possible.
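The behaviour described above can be checked with a small local job. This is a sketch that assumes spark-core is on the classpath; the object name PartitionCountDemo, the local[2] master and the sample data are illustrative choices, not part of the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountDemo {
  // Returns the partition counts produced by each call, starting from 4 partitions.
  def partitionCounts(sc: SparkContext): Seq[Int] = {
    val rdd = sc.parallelize(1 to 100, numSlices = 4)
    Seq(
      rdd.repartition(8).getNumPartitions,              // shuffle: can grow to 8
      rdd.repartition(2).getNumPartitions,              // shuffle: can shrink to 2
      rdd.coalesce(2).getNumPartitions,                 // no shuffle: shrinks to 2
      rdd.coalesce(8).getNumPartitions,                 // no shuffle: cannot grow, stays at 4
      rdd.coalesce(8, shuffle = true).getNumPartitions  // same as repartition(8)
    )
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-count-demo")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc))
    finally sc.stop()
  }
}
```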