Hadoop Fair Scheduler Design Document
October 18, 2010
Contents
1 Introduction
2 Fair Scheduler Goals
3 Scheduler Features
  3.1 Pools
  3.2 Minimum Shares
  3.3 Preemption
  3.4 Running Job Limits
  3.5 Job Priorities
  3.6 Pool Weights
  3.7 Delay Scheduling
  3.8 Administration
4 Implementation
  4.1 Hadoop Scheduling Background
  4.2 Fair Scheduler Basics
  4.3 The Schedulable Class
  4.4 Fair Sharing Algorithm
  4.5 Preemption
  4.6 Fair Share Computation
  4.7 Running Job Limits
  4.8 Delay Scheduling
  4.9 Locking Order
  4.10 Unit Tests
5 Code Guide
1 Introduction
The Hadoop Fair Scheduler started as a simple means to share MapReduce clusters. Over time, it has grown in functionality to support hierarchical scheduling, preemption, and multiple ways of organizing and weighing jobs. This document explains the goals and features of the Fair Scheduler and its internal design.
2 Fair Scheduler Goals
The Fair Scheduler was designed with four main goals:
1. Run small jobs quickly even if they are sharing a cluster with large jobs. Unlike Hadoop's built-in FIFO scheduler, fair scheduling lets small jobs make progress even if a large job is running, without starving the large job.
2. Provide guaranteed service levels to "production" jobs, to let them run alongside experimental jobs in a shared cluster.
3. Be simple to administer and configure. The scheduler should do something reasonable "out of the box," and users should only need to configure it as they discover that they want to use more advanced features.
4. Support reconfiguration at runtime, without requiring a cluster restart.
3 Scheduler Features
This section provides a quick overview of the features of the Fair Scheduler. A detailed usage guide is available in the Hadoop documentation in build/docs/fair_scheduler.html.
3.1 Pools
The Fair Scheduler groups jobs into "pools" and performs fair sharing between these pools. Each pool can use either FIFO or fair sharing to schedule jobs internal to the pool. The pool that a job is placed in is determined by a JobConf property, the "pool name property". By default, this is mapreduce.job.user.name, so that there is one pool per user. However, different properties can be used, e.g. group.name to have one pool per Unix group.
A common trick is to set the pool name property to an unused property name such as pool.name and make this default to mapreduce.job.user.name, so that there is one pool per user but it is also possible to place jobs into "special" pools by setting their pool.name directly. The XML snippet below shows how to do this:
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>
<property>
  <name>pool.name</name>
  <value>${mapreduce.job.user.name}</value>
</property>
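Under this configuration, a job can be placed into a "special" pool by setting pool.name directly on its JobConf. The fragment below is a minimal sketch of how this might look wherever the job's configuration is built; the pool name "production" is just an example value, not one defined by the scheduler:

import org.apache.hadoop.mapred.JobConf;

// Place this job into the "production" pool (an example name) instead of
// the default per-user pool derived from mapreduce.job.user.name.
JobConf conf = new JobConf();
conf.set("pool.name", "production");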
3.2 Minimum Shares
Normally, active pools (those that contain jobs) will get equal shares of the map and reduce task slots in the cluster. However, it is also possible to set a minimum share of map and reduce slots on a given pool, which is a number of slots that it will always get when it is active, even if its fair share would be below this number. This is useful for guaranteeing that production jobs get a certain desired level of service when sharing a cluster with non-production jobs. Minimum shares have three effects:
1. The pool's fair share will always be at least as large as its minimum share. Slots are taken from the share of other pools to achieve this. The only exception is if the minimum shares of the active pools add up to more than the total number of slots in the cluster; in this case, each pool's share will be scaled down proportionally (see the sketch after this list).
2. Pools whose running task count is below their minimum share get assigned slots first when slots are available.
3. It is possible to set a preemption timeout on the pool after which, if it has not received enough task slots to meet its minimum share, it is allowed to kill tasks in other jobs to meet its share. Minimum shares with preemption timeouts thus act like SLAs.
Note that when a pool is inactive (contains no jobs), its minimum share is not "reserved" for it – the slots are split up among the other pools.
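The proportional scale-down in point 1 can be illustrated with a small sketch (example pool names and slot counts only; this is an illustration of the rule, not the scheduler's actual code):

import java.util.HashMap;
import java.util.Map;

public class MinShareScaleDown {
  public static void main(String[] args) {
    // Example: configured minimum shares add up to 120 slots but the
    // cluster only has 100, so each guarantee is scaled by 100/120.
    int totalSlots = 100;
    Map<String, Integer> minShare = new HashMap<String, Integer>();
    minShare.put("production", 80);
    minShare.put("research", 40);

    int sum = 0;
    for (int m : minShare.values()) sum += m;
    double scale = sum > totalSlots ? (double) totalSlots / sum : 1.0;

    for (Map.Entry<String, Integer> e : minShare.entrySet()) {
      int guaranteed = (int) Math.floor(e.getValue() * scale);
      // prints: production is guaranteed 66 slots, research 33
      System.out.println(e.getKey() + " is guaranteed " + guaranteed + " slots");
    }
  }
}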
3.3 Preemption
As explained above, the scheduler may kill tasks from a job in one pool in order to meet the minimum share of another pool. We call this preemption, although this usage of the word is somewhat strange given the normal definition of preemption as pausing; really it is the job that gets preempted, while the task gets killed. The feature explained above is called min share preemption. In addition, the scheduler supports fair share preemption, to kill tasks when a pool's fair share is not being met. Fair share preemption is much more conservative than min share preemption, because pools without min shares are expected to be non-production jobs where some amount of unfairness is tolerable. In particular, fair share preemption activates if a pool has been below half of its fair share for a configurable fair share preemption timeout, which is recommended to be set fairly high (e.g. 10 minutes).
In both types of preemption, the scheduler kills the most recently launched tasks from over-scheduled pools, to minimize the amount of computation wasted by preemption.
3.4 Running Job Limits
The fair scheduler can limit the number of concurrently running jobs from each user and from each pool. This is useful for limiting the amount of intermediate data generated on the cluster. The jobs that will run are chosen in order of submit time and priority. Jobs submitted beyond the limit wait for one of the running jobs to finish.
3.5 Job Priorities
Within a pool, job priorities can be used to control the scheduling of jobs, whether the pool's internal scheduling mode is FIFO or fair sharing:
• In FIFO pools, jobs are ordered first by priority and then by submit time, as in Hadoop's default scheduler.
• In fair sharing pools, job priorities are used as weights to control how much share a job gets. The normal priority corresponds to a weight of 1.0, and each level gives 2x more weight. For example, a high-priority job gets a weight of 2.0, and will therefore get 2x the share of a normal-priority job (see the sketch after this list).
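The priority-to-weight rule can be summarized in a short sketch (the enum and helper names below are hypothetical, not the scheduler's own classes):

public class PriorityWeights {
  // Hypothetical priority levels; NORMAL corresponds to weight 1.0.
  enum Priority { VERY_LOW, LOW, NORMAL, HIGH, VERY_HIGH }

  // Each level above NORMAL doubles the weight; each level below halves it.
  static double priorityWeight(Priority p) {
    int levelsAboveNormal = p.ordinal() - Priority.NORMAL.ordinal();
    return Math.pow(2.0, levelsAboveNormal);  // HIGH -> 2.0, LOW -> 0.5
  }
}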
3.6 Pool Weights
Pools can be given weights to achieve unequal sharing of the cluster. For example, a pool with weight 2.0 gets 2x the share of a pool with weight 1.0.
3.7 Delay Scheduling
The Fair Scheduler contains an algorithm called delay scheduling to improve data locality. Jobs that cannot launch a data-local map task wait for some period of time before they are allowed to launch non-data-local tasks, ensuring that they will run locally if some node in the cluster has the relevant data. Delay scheduling is described in detail in Section 4.8.
3.8 Administration
The Fair Scheduler includes a web UI for displaying the active pools and jobs and their fair shares, moving jobs between pools, and changing job priorities. In addition, the Fair Scheduler's allocation file (specifying min shares and preemption timeouts for the pools) is automatically reloaded if it is modified on disk, to allow runtime reconfiguration.
4 Implementation
4.1 Hadoop Scheduling Background
Hadoop jobs consist of a number of map and reduce tasks. These tasks run in slots on the nodes of the cluster. Each node is configured with a number of map slots and reduce slots based on its computational resources (typically one slot per core). The role of the scheduler is to assign tasks to any slots that are free.
All schedulers in Hadoop, including the Fair Scheduler, inherit from the TaskScheduler abstract class. This class provides access to a TaskTrackerManager – an interface to the JobTracker – as well as a Configuration instance. It also asks the scheduler to implement three abstract methods: the lifecycle methods start and terminate, and a method called assignTasks to launch tasks on a given TaskTracker. Task assignment in Hadoop is reactive. TaskTrackers periodically send heartbeats to the JobTracker with their TaskTrackerStatus, which contains a list of running tasks, the number of slots on the node, and other information. The JobTracker then calls assignTasks on the scheduler to obtain tasks to launch. These are returned with the heartbeat response.
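As a rough sketch, a scheduler plugin therefore has the following shape. The parameter and return types below are simplified for illustration (TaskScheduler, Task, and TaskTracker come from org.apache.hadoop.mapred; package and import details are omitted) and may not match every Hadoop version exactly:

import java.io.IOException;
import java.util.Collections;
import java.util.List;

// Simplified sketch of a scheduler plugin; signatures are approximate.
public class SketchScheduler extends TaskScheduler {
  @Override
  public void start() throws IOException {
    // Lifecycle hook: typically used to register job listeners with the
    // TaskTrackerManager (see the listener sketch below).
  }

  @Override
  public void terminate() throws IOException {
    // Lifecycle hook: unregister listeners and release resources.
  }

  @Override
  public List<Task> assignTasks(TaskTracker tracker) throws IOException {
    // Called on each heartbeat: inspect the tracker's free map and reduce
    // slots and return the tasks to launch (possibly none).
    return Collections.emptyList();
  }
}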
Apart from reacting to heartbeats through assignTasks, schedulers can also be notified when jobs have been submitted to the cluster, killed, or removed by adding listeners to the TaskTrackerManager. The Fair Scheduler sets up these listeners in its start method. An important role of the listeners is to initialize jobs that are submitted – until a job is initialized, it cannot launch tasks. The Fair Scheduler currently initializes all jobs right away, but it may also be desirable to hold off initializing jobs if too many are submitted, to limit memory usage on the JobTracker.
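Continuing the sketch above, listener registration might look like the following. This assumes a JobInProgressListener-style callback class, an addJobInProgressListener registration method on TaskTrackerManager, and a taskTrackerManager field inherited from TaskScheduler; the exact names and signatures may differ between Hadoop versions:

// Sketch only: register a job lifecycle listener in start().
@Override
public void start() throws IOException {
  taskTrackerManager.addJobInProgressListener(new JobInProgressListener() {
    public void jobAdded(JobInProgress job) {
      // Newly submitted jobs must be initialized before they can launch
      // tasks; the Fair Scheduler currently initializes them right away.
    }
    public void jobRemoved(JobInProgress job) {
      // Drop any per-job state kept by the scheduler.
    }
    public void jobUpdated(JobChangeEvent event) {
      // React to changes such as job priority updates.
    }
  });
}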
Selection of tasks within a job is mostly done by the JobInProgress class, and not by individual schedulers. JobInProgress exposes two methods, obtainNewMapTask and obtainNewReduceTask, to launch a task of either type. Both methods may either return a Task object or null if the job does not wish to launch a task. Whether a job wishes to launch a task may change back and forth during its lifetime. Even after all tasks in the job have been started, the job may wish to run another task for speculative execution. In addition, if a node containing a completed map task fails, the job will wish to re-run that task to rebuild its output for use in the reduce tasks. Schedulers may therefore need to poll multiple jobs until they find one with a task to run.
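This polling pattern can be sketched as follows, continuing the scheduler sketch above and assuming a simplified obtainNewMapTask(TaskTrackerStatus, int, int) signature that returns null when a job has no map task to launch right now:

// Illustrative polling loop over jobs in scheduling order.
Task pickMapTask(List<JobInProgress> sortedJobs, TaskTrackerStatus tts,
                 int clusterSize, int numUniqueHosts) throws IOException {
  for (JobInProgress job : sortedJobs) {
    Task task = job.obtainNewMapTask(tts, clusterSize, numUniqueHosts);
    if (task != null) {
      return task;             // first job with a runnable map task wins
    }
  }
  return null;                 // no job currently wants to launch a map task
}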
Finally, for map tasks, an important scheduling criterion is data locality: running the task on a node or rack that contains its input data. Normally, JobInProgress.obtainNewMapTask returns the "closest" map task to a given node. However, to give schedulers slightly more control over data locality, there is also a version of obtainNewMapTask that allows the scheduler to cap the level of non-locality allowed for the task (e.g. to request a task only on the same node, returning null if none is available). The Fair Scheduler uses this method with an algorithm called delay scheduling (Section 4.8) to optimize data locality.