Analysis of an Elastic-Job Startup "Hang" Problem
Problem Description
We recently introduced Elastic-Job (version 2.1.5) into a project for distributed scheduling of timed jobs. After adding the job configuration and starting the project, the main thread appeared to hang ("fake death"): no further logic ran and no more log lines were printed.
The log output was:
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.049] [] [StdSchedulerFactory] [Using default implementation for ThreadExecutor]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.130] [] [SchedulerSignalerImpl] [Initialized Scheduler Signaller of type: class SchedulerSignalerImpl]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.131] [] [QuartzScheduler] [Quartz Scheduler v.2.2.1 created.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.135] [] [JobShutdownHookPlugin] [Registering Quartz shutdown hook.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.136] [] [RAMJobStore] [RAMJobStore initialized.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.139] [] [QuartzScheduler] [Scheduler meta-data: Quartz Scheduler (v2.2.1) 'dailyScanMercReratingJob' with instanceId 'NON_CLUSTERED'
Scheduler class: 'QuartzScheduler' - running locally.
NOT STARTED.
Currently in standby mode.
Number of jobs executed: 0
Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 1 threads.
Using job-store 'org.quartz.simpl.RAMJobStore' - which does not support persistence. and is not clustered.
]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.139] [] [StdSchedulerFactory] [Quartz scheduler
'dailyScanMercReratingJob' initialized from an externally provided properties instance.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.139] [] [StdSchedulerFactory] [Quartz scheduler version: 2.2.1]
Solution
The fix, up front: align every Curator artifact in the project to version 2.10.0, including:
curator-client
curator-recipes
curator-framework
The problem occurred because curator-framework was being pulled in at version 2.7.0.
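If the build is Maven-based, one way to enforce this alignment (a sketch; adapt to your own pom) is to pin all three artifacts in dependencyManagement so no transitive 2.7.0 can win dependency mediation:

```xml
<!-- Pin every Curator artifact to 2.10.0 so a transitive 2.7.0 cannot be selected -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.curator</groupId>
      <artifactId>curator-client</artifactId>
      <version>2.10.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.curator</groupId>
      <artifactId>curator-recipes</artifactId>
      <version>2.10.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.curator</groupId>
      <artifactId>curator-framework</artifactId>
      <version>2.10.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Running `mvn dependency:tree -Dincludes=org.apache.curator` afterwards confirms which Curator versions actually end up on the classpath.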
Tracing the Problem
Breakpoints were set in the project's Spring framework code to trace the startup process.
Execution was found to block inside the refresh() method, at the call to finishBeanFactoryInitialization(beanFactory), and never returned.
In Spring's startup sequence, finishBeanFactoryInitialization is the method that finishes initializing singleton beans; it is this method that actually runs elastic-job's job setup code.
From the log, the last line printed was:
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.139] [] [StdSchedulerFactory] [Quartz scheduler version: 2.2.1]
Find the log statement containing "Quartz scheduler version" in the StdSchedulerFactory class, set a breakpoint there, and continue tracing:
// ...... earlier code omitted
jrsf.initialize(scheduler);
qs.initialize();
getLog().info(
        "Quartz scheduler '" + scheduler.getSchedulerName()
        + "' initialized from " + propSrc);
// breakpoint here
getLog().info("Quartz scheduler version: " + qs.getVersion());
// prevents the repository from being garbage collected
qs.addNoGCObject(schedRep);
// prevents the db manager from being garbage collected
if (dbMgr != null) {
    qs.addNoGCObject(dbMgr);
}
schedRep.bind(scheduler);
return scheduler;
With a breakpoint at the position above, elastic-job was observed to keep executing; further tracing finally led to the registerStartUpInfo method of the SchedulerFacade class:
/**
 * Register the job startup info.
 *
 * @param enabled whether the job is enabled
 */
public void registerStartUpInfo(final boolean enabled) {
    listenerManager.startAllListeners();
    leaderService.electLeader();
    serverService.persistOnline(enabled);
    instanceService.persistOnline();
    shardingService.setReshardingFlag();
    monitorService.listen();
    if (!reconcileService.isRunning()) {
        reconcileService.startAsync();
    }
}
Execution blocks at leaderService.electLeader().
From the above we can conclude:
elastic-job falls into an indefinite wait during leader election for the job, i.e. it is unable to elect a leader node to run the task.
Reading the LeaderService code shows that elastic-job performs leader election using the LeaderLatch class from the Curator framework.
The actual thread wait happens in JobNodeStorage's executeInLeader method:
/**
 * Execute the operation on the leader node.
 *
 * @param latchNode job node name used for the distributed latch
 * @param callback callback to execute
 */
public void executeInLeader(final String latchNode, final LeaderExecutionCallback callback) {
    try (LeaderLatch latch = new LeaderLatch(getClient(), jobNodePath.getFullPath(latchNode))) {
        latch.start();
        latch.await();
        callback.execute();
    //CHECKSTYLE:OFF
    } catch (final Exception ex) {
    //CHECKSTYLE:ON
        handleException(ex);
    }
}
The method above calls latch.await() to wait for leadership. Because leadership can never be acquired, the thread waits indefinitely.
LeaderLatch works roughly as follows: all clients race to write to the same ZooKeeper path, and whichever client writes first acquires leadership. LeaderLatch's await method looks like this:
public void await() throws InterruptedException, EOFException
{
    synchronized(this)
    {
        while ( (state.get() == State.STARTED) && !hasLeadership.get() )
        {
            wait();
        }
    }
    if ( state.get() != State.STARTED )
    {
        throw new EOFException();
    }
}
If the LeaderLatch never acquires leadership, the current thread stays in wait() forever.
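The blocking pattern above can be sketched with a minimal, self-contained class: the waiting thread loops on a condition inside a synchronized wait(), and only a notify after the condition flips lets it out. All names here (ToyLatch, grantLeadership, demo) are illustrative, not Curator API.

```java
public class ToyLatch {

    private boolean hasLeadership;

    // Mirrors LeaderLatch.await(): block until leadership is granted.
    public synchronized void await() throws InterruptedException {
        while (!hasLeadership) {
            wait(); // parks here forever if leadership is never granted
        }
    }

    // Mirrors what happens when the election succeeds: flip the flag, wake waiters.
    public synchronized void grantLeadership() {
        hasLeadership = true;
        notifyAll();
    }

    // Returns true if the waiting thread got released within the join timeout.
    public static boolean demo() {
        ToyLatch latch = new ToyLatch();
        Thread worker = new Thread(() -> {
            try {
                latch.await(); // same blocking point as JobNodeStorage.executeInLeader
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        latch.grantLeadership(); // without this call, worker would wait forever
        try {
            worker.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker released: " + demo());
    }
}
```

Remove the grantLeadership() call and the worker thread never exits its wait() loop, which is exactly the state the elastic-job main thread was stuck in.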
Resolving the Problem
Having located where the hang occurs, the way to solve it is to find out why leadership cannot be acquired.
Logging into ZooKeeper to inspect the nodes shows that after a normal project start, elastic-job writes a node of the following form:
/{job-namespace}/{job-id}/leader/election/latch
The broken project, however, has no such node, so something must be failing in the ZooKeeper operations; exactly where was not yet clear at this point.
Raising the project's log level to DEBUG reveals output like the following:
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:47.687] [] [RAMJobStore] [RAMJobStore initialized.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:47.689] [] [QuartzScheduler] [Scheduler meta-data: Quartz Scheduler (v2.2.1) 'dailyScanMercReratingJob' with instanceId 'NON_CLUSTERED'
Scheduler class: 'QuartzScheduler' - running locally.
NOT STARTED.
Currently in standby mode.
Number of jobs executed: 0
Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 1 threads.
Using job-store 'org.quartz.simpl.RAMJobStore' - which does not support persistence. and is not clustered.
]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:47.689] [] [StdSchedulerFactory] [Quartz scheduler
'dailyScanMercReratingJob' initialized from an externally provided properties instance.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:47.689] [] [StdSchedulerFactory] [Quartz scheduler version: 2.2.1]
[DEBUG] [Timer-0] [2018-10-10 17:51:49.553] [] [UpdateChecker] [Checking for available updated version ]
[DEBUG] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:49.586] [] [LeaderService] [Elect a new leader now.]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.724] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.738] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.759] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.769] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.791] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.803] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.813] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.818] [] [LeaderService] [Elect a new leader now.]
These log lines are produced by elastic-job's RegExceptionHandler.handleException() method:
/**
 * Handle an exception.
 *
 * <p>Swallows interrupted and connection-loss exceptions; any other exception is rethrown as a registry center exception.</p>
 *
 * @param cause the exception to handle
 */
public static void handleException(final Exception cause) {
    if (null == cause) {
        return;
    }
    if (isIgnoredException(cause) || null != cause.getCause() && isIgnoredException(cause.getCause())) {
        log.debug("Elastic job: ignored exception for: {}", cause.getMessage());
    } else if (cause instanceof InterruptedException) {
        Thread.currentThread().interrupt();
    } else {
        throw new RegException(cause);
    }
}
Here elastic-job ignores the ZooKeeper operation exception: leader election fails, but there is no fallback handling, so the main thread stays in wait() forever.
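The failure mode can be sketched end to end in a few lines: the "election" task throws (standing in for the failing ZooKeeper write), the exception is swallowed the way RegExceptionHandler swallows "ignored" exceptions, leadership is never granted, and the waiter is left waiting. This is a hypothetical illustration, not elastic-job code; the real code waits forever, so the sketch uses a timed wait to make the outcome observable.

```java
public class SwallowedElection {

    private boolean hasLeadership;

    private synchronized void grantLeadership() {
        hasLeadership = true;
        notifyAll();
    }

    // Wait up to timeoutMillis for leadership; returns whether it was acquired.
    private synchronized boolean awaitLeadership(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!hasLeadership) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false;
            }
            wait(remaining);
        }
        return true;
    }

    // Returns whether leadership was acquired after the election "ran".
    public static boolean demo() {
        SwallowedElection election = new SwallowedElection();
        Thread elector = new Thread(() -> {
            try {
                // Simulates the ZooKeeper write failing during election.
                throw new IllegalStateException("KeeperErrorCode = NoNode");
                // election.grantLeadership() is never reached.
            } catch (Exception ignored) {
                // Swallowed with no retry and no propagation, like
                // RegExceptionHandler does for "ignored" exceptions.
            }
        });
        elector.start();
        try {
            elector.join();
            return election.awaitLeadership(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("leadership acquired: " + demo());
    }
}
```

Once the incompatible curator-framework 2.7.0 was replaced with 2.10.0, the ZooKeeper writes succeeded, the latch node was created, and the election (and startup) completed normally.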