A Detailed Analysis of a Redis Cluster Failure

Updated: 2023-05-17 09:26:54

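Symptoms:

The application layer reported that queries to Redis were failing.

Cluster composition:

3 masters and 3 replicas (slaves), with 8 GB of data on each node.

Machine placement:

All in the same rack: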

<199

<200

<201

redis-server process status:

Running ps -eo pid,lstart | grep $pid

showed that the process had been running continuously for 3 months.

Cluster node states before the failure:

<200:8371(bedab2c537fe94f8c0363ac4ae97d56832316e65) master

<199:8373(792020fe66c00ae56e27cd7a048ba6bb2b67adb6) slave

<201:8375(5ab4f85306da6d633e4834b4d3327f45af02171b) master

<201:8372(826607654f5ec81c3756a4a21f357e644efe605a) slave

<199:8370(462cadcb41e635d460425430d318f2fe464665c5) master

<200:8374(1238085b578390f3c8efa30824fd9a4baba10ddf) slave

--------------------------------- Log analysis ---------------------------------

Step 1:

Master 8371 loses its connection to replica 8373:

46590:M 09 Sep 18:57:51.379 # Connection with 199:8373 lost.

Step 2:

Masters 8370/8375 decide that 8371 is unreachable:

42645:M 09 Sep 18:57:50.117 * Marking node bedab2c537fe94f8c0363ac4ae97d56832316e65 as failing (quorum reached).
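
The "quorum reached" wording can be illustrated with a small calculation. Redis Cluster marks a node as FAIL once a majority of masters report it as unreachable. The sketch below only shows that majority arithmetic; the function name `fail_quorum` is hypothetical and not part of Redis.

```python
def fail_quorum(master_count: int) -> int:
    """Smallest number of masters that forms a majority."""
    if master_count < 1:
        raise ValueError("a cluster needs at least one master")
    return master_count // 2 + 1


# The cluster in this incident has 3 masters (8370, 8371, 8375).
needed = fail_quorum(3)
print(needed)  # 2
```

With 3 masters the majority is 2, so agreement between 8370 and 8375 is enough to mark 8371 as failing while 8371 itself is not responding.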

Step 3:

Replicas 8372/8373/8374 receive the FAIL message about 8371 from master 8375:

46986:S 09 Sep 18:57:50.120 * FAIL message received from 5ab4f85306da6d633e4834b4d3327f45af02171b about bedab2c537fe94f8c0363ac4ae97d56832316e65

Step 4:

Masters 8370/8375 grant 8373 the failover authorization to be promoted to master:

42645:M 09 Sep 18:57:51.055 # Failover auth granted to 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 for epoch 16

Step 5:

The original master 8371 changes its own configuration and becomes a replica of 8373:

46590:M 09 Sep 18:57:51.488 # Configuration change detected. Reconfiguring myself as a replica of 792020fe66c00ae56e27cd7a048ba6bb2b67adb6

Step 6:

Masters 8370/8375/8373 clear the FAIL state of 8371:

42645:M 09 Sep 18:57:51.522 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: master without slots is reachable again.

Step 7:

The new replica 8371 starts its first full resync from the new master 8373:

8373 log:

4255:M 09 Sep 18:57:51.906 * Full resync requested by 200:8371

4255:M 09 Sep 18:57:51.906 * Starting BGSAVE for SYNC with target: disk

4255:M 09 Sep 18:57:51.941 * Background saving started by pid 5230

8371 log:

46590:S 09 Sep 18:57:51.948 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440721826993

Step 8:

Masters 8370/8375 decide that 8373 (the new master) is unreachable:

42645:M 09 Sep 18:58:00.320 * Marking node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 as failing (quorum reached).

Step 9:

Masters 8370/8375 decide that 8373 (the new master) has recovered:

60295:M 09 Sep 18:58:18.181 * Clear FAIL state for node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6: is reachable again and nobody is serving its slots after some time.

Step 10:

Master 8373 completes the BGSAVE required for the full resync:

5230:C 09 Sep 18:59:01.474 * DB saved on disk

5230:C 09 Sep 18:59:01.491 * RDB: 7112 MB of memory used by copy-on-write

4255:M 09 Sep 18:59:01.877 * Background saving terminated with success

Step 11:

Replica 8371 starts receiving data from master 8373:

46590:S 09 Sep 18:59:02.263 * MASTER <-> SLAVE sync: receiving 2657606930 bytes from master

Step 12:

Master 8373 finds that the output buffer for replica 8371 has exceeded its limit and schedules the connection to be closed:

4255:M 09 Sep 19:00:19.014 # Client id=14259015 200:21772 fd=844 name= age=148 idle=148 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=4103 omem=95944066 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.

4255:M 09 Sep 19:00:19.015 # Connection with 200:8371 lost.
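
To read the client description in that log line more easily, the key=value fields can be pulled apart with a few lines of Python. This is only a sketch for inspecting the line quoted above; the helper name `parse_client_fields` is hypothetical.

```python
import re

LOG_LINE = (
    "Client id=14259015 200:21772 fd=844 name= age=148 idle=148 flags=S "
    "db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=4103 "
    "omem=95944066 events=rw cmd=psync"
)


def parse_client_fields(line: str) -> dict:
    """Extract key=value pairs from a client description (empty values allowed)."""
    return dict(re.findall(r"([\w-]+)=(\S*)", line))


fields = parse_client_fields(LOG_LINE)
omem_mb = int(fields["omem"]) / (1024 * 1024)  # output buffer memory in MB
print(f"flags={fields['flags']} age={fields['age']}s omem={omem_mb:.1f} MB")
```

The buffer held roughly 91.5 MB when the connection was closed. That is above a 64mb soft limit and below a 256mb hard limit, which would be consistent with the soft limit being the one that tripped, assuming the slave limits quoted later in this article were the ones in effect.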

Step 13:

Replica 8371 fails to sync from master 8373 because the connection dropped; the first full resync fails:

46590:S 09 Sep 19:00:19.018 # I/O error trying to sync with MASTER: connection lost

46590:S 09 Sep 19:00:20.102 * Connecting to 199:8373

46590:S 09 Sep 19:00:20.102 * MASTER <-> SLAVE sync started

Step 14:

Replica 8371 tries to sync again, but the attempt fails because master 8373 has reached its maximum number of clients:

46590:S 09 Sep 19:00:21.103 * Connecting to 199:8373

46590:S 09 Sep 19:00:21.103 * MASTER <-> SLAVE sync started

46590:S 09 Sep 19:00:21.104 * Non blocking connect for SYNC fired the event.

46590:S 09 Sep 19:00:21.104 # Error reply to PING from master: '-ERR max number of clients reached'

Step 15:

Replica 8371 reconnects to master 8373 and starts the second full resync:

8371 log:

46590:S 09 Sep 19:00:49.175 * Connecting to 199:8373

46590:S 09 Sep 19:00:49.175 * MASTER <-> SLAVE sync started

46590:S 09 Sep 19:00:49.175 * Non blocking connect for SYNC fired the event.

46590:S 09 Sep 19:00:49.176 * Master replied to PING, replication

46590:S 09 Sep 19:00:49.179 * Partial resynchronization not possible (no cached master)

46590:S 09 Sep 19:00:49.501 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440780763454

8373 log:

4255:M 09 Sep 19:00:49.176 * 200:8371 asks for synchronization

4255:M 09 Sep 19:00:49.176 * Full resync requested by 200:8371

4255:M 09 Sep 19:00:49.176 * Starting BGSAVE for SYNC with target: disk

4255:M 09 Sep 19:00:49.498 * Background saving started by pid 18413

18413:C 09 Sep 19:01:52.466 * DB saved on disk

18413:C 09 Sep 19:01:52.620 * RDB: 2124 MB of memory used by copy-on-write

4255:M 09 Sep 19:01:53.186 * Background saving terminated with success

Step 16:

Replica 8371 receives the data successfully and starts loading it into memory:

46590:S 09 Sep 19:01:53.190 * MASTER <-> SLAVE sync: receiving 2637183250 bytes from master

46590:S 09 Sep 19:04:51.485 * MASTER <-> SLAVE sync: Flushing old data

46590:S 09 Sep 19:05:58.695 * MASTER <-> SLAVE sync: Loading DB in memory

Step 17:

The cluster returns to normal:

42645:M 09 Sep 19:05:58.786 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: slave is reachable again.

Step 18:

Replica 8371 finishes syncing successfully; the second full resync took about 7 minutes:

46590:S 09 Sep 19:08:19.303 * MASTER <-> SLAVE sync: Finished with success
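
The durations in the steps above can be checked directly from the log timestamps. The sketch below assumes all entries are from the same day in 2016 (the year comes from the timeline in this article, not from the log lines); the helper name `elapsed_seconds` is hypothetical.

```python
from datetime import datetime

FMT = "%Y %d %b %H:%M:%S.%f"
YEAR = "2016"  # assumption: the log lines themselves carry no year


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two Redis-log timestamps such as '09 Sep 18:57:51.941'."""
    t0 = datetime.strptime(f"{YEAR} {start}", FMT)
    t1 = datetime.strptime(f"{YEAR} {end}", FMT)
    return (t1 - t0).total_seconds()


first_bgsave = elapsed_seconds("09 Sep 18:57:51.941", "09 Sep 18:59:01.474")
second_resync = elapsed_seconds("09 Sep 19:00:49.175", "09 Sep 19:08:19.303")
whole_incident = elapsed_seconds("09 Sep 18:57:50.117", "09 Sep 19:08:19.303")

print(f"first BGSAVE:  {first_bgsave:.1f} s")
print(f"second resync: {second_resync / 60:.1f} min")
print(f"FAIL mark to sync finished: {whole_incident / 60:.1f} min")
```

Measured this way, the second full resync runs from the reconnect in step 15 to "Finished with success" in step 18, about 450 seconds, and the whole incident spans roughly ten and a half minutes.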

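Analysis of why 8371 was marked unreachable:

Because the machines are in the same rack, a network outage was unlikely. Checking the slow log with SLOWLOG GET showed that a KEYS command had been executed and had taken 8.3 seconds. Checking the cluster node timeout showed that it was set to 5 s (cluster-node-timeout 5000).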

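Why the node was marked unreachable:

A client executed a single command that took 8.3 s:

2016/9/9 18:57:43 KEYS command starts executing

2016/9/9 18:57:50 8371 is judged unreachable (Redis log)

2016/9/9 18:57:51 KEYS command finishes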
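
The timeline can be checked against the timeout with simple arithmetic. The sketch below uses only the values quoted in this article (start time to whole-second precision, the 8.3 s duration from the slow log, and the FAIL timestamp from step 2); the variable names are illustrative.

```python
from datetime import datetime, timedelta

NODE_TIMEOUT = timedelta(milliseconds=5000)  # cluster-node-timeout 5000
KEYS_DURATION = timedelta(seconds=8.3)       # from SLOWLOG GET

keys_start = datetime(2016, 9, 9, 18, 57, 43)
marked_failing = datetime(2016, 9, 9, 18, 57, 50, 117000)  # step 2 log entry

keys_end = keys_start + KEYS_DURATION
blocked_longer_than_timeout = KEYS_DURATION > NODE_TIMEOUT
marked_while_blocked = keys_start <= marked_failing <= keys_end

print(keys_end.time())               # 18:57:51.300000
print(blocked_longer_than_timeout)   # True
print(marked_while_blocked)          # True
```

The node was blocked for longer than the 5 s timeout, and the FAIL mark falls inside the window in which KEYS was still running, which matches the explanation given here.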

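In summary, there were the following problems:

1. Because cluster-node-timeout was set fairly short, the slow KEYS query caused the cluster to judge node 8371 unreachable

2. Because 8371 was judged unreachable, 8373 was promoted to master and master-replica synchronization started

3. Because of the configured client-output-buffer-limit, the first full resync failed

4. In addition, the PHP client's connection pool was faulty and connected to the server frantically, producing an effect similar to a SYN flood

5. After the first full resync failed, the replica took 30 seconds to reconnect to the master (the maximum connection count of 10,000 had been exceeded)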
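
For problem 4, one common way to keep clients from hammering a server is to reconnect with capped exponential backoff and random jitter. The article does not say how the PHP client was fixed, so the following is only a sketch of that general technique in Python; the function name and the default values for `base` and `cap` are assumptions.

```python
import random


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 10.0, rng=None):
    """Return one randomized delay in seconds per reconnect attempt.

    "Full jitter": attempt n waits a uniform random time in
    [0, min(cap, base * 2**n)], so retries from many clients spread out.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


# Upper bounds grow 0.1, 0.2, 0.4, ... and stop growing at the 10 s cap.
for attempt, delay in enumerate(backoff_delays(8)):
    print(f"attempt {attempt}: wait up to {min(10.0, 0.1 * 2 ** attempt):.1f} s, "
          f"chosen {delay:.3f} s")
```

A client would sleep for each delay before its next connection attempt and stop at the first success, instead of reconnecting in a tight loop.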

About the client-output-buffer-limit parameter:

# The syntax of every client-output-buffer-limit directive is the following:

# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>

# A client is immediately disconnected once the hard limit is reached, or if

# the soft limit is reached and remains reached for the specified number of

# seconds (continuously).

# So for instance if the hard limit is 32 megabytes and the soft limit is

# 16 megabytes / 10 seconds, the client will get disconnected immediately

# if the size of the output buffers reach 32 megabytes, but will also get

# disconnected if the client reaches 16 megabytes and continuously overcomes

# the limit for 10 seconds.

# By default normal clients are not limited because they don't receive data

# without asking (in a push way), but just after a request, so only

# asynchronous clients may create a scenario where data is requested faster

# than it can read.

# Instead there is a default limit for pubsub and slave clients, since

# subscribers and slaves receive data in a push fashion.

# Both the hard or the soft limit can be disabled by setting them to zero.

client-output-buffer-limit normal 0 0 0

client-output-buffer-limit slave 256mb 64mb 60

client-output-buffer-limit pubsub 32mb 8mb 60
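
The hard/soft rule described in those comments can be modelled in a few lines. This is a simplified sketch of the rule as worded above, not Redis's actual implementation; the function name `disconnect_time` and the sampled buffer sizes are made up for illustration.

```python
MB = 1024 * 1024


def disconnect_time(samples, hard, soft, soft_seconds):
    """Return the timestamp at which a client would be dropped, or None.

    samples: (timestamp_seconds, buffer_bytes) pairs in time order.
    A limit of 0 disables that check, as in the config comments.
    """
    over_soft_since = None
    for ts, size in samples:
        if hard and size >= hard:
            return ts  # hard limit: disconnect immediately
        if soft and size >= soft:
            if over_soft_since is None:
                over_soft_since = ts  # start of the continuous overrun
            elif ts - over_soft_since > soft_seconds:
                return ts  # over the soft limit for too long
        else:
            over_soft_since = None  # dropped below the soft limit: reset
    return None


# A replica buffer that sits at 91 MB, sampled once per second.
steady = [(t, 91 * MB) for t in range(0, 120)]
print(disconnect_time(steady, 256 * MB, 64 * MB, 60))  # 61
```

With the slave limits shown above (256mb hard, 64mb soft for 60 seconds), a buffer that stays near 91 MB never reaches the hard limit but is still dropped once it has been above the soft limit for more than a minute.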

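Measures taken:

1. Split each single instance down to under 4 GB; otherwise a master-replica failover takes a long time

2. Adjust the client-output-buffer-limit parameter so that synchronization does not fail halfway through

3. Adjust cluster-node-timeout; it must not be less than 15 s

4. Forbid any slow query that takes longer than cluster-node-timeout, because it will trigger a failover

5. Fix the client's frantic, SYN-flood-like connection behavior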
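
For measure 4, a widely used alternative to KEYS is to walk the keyspace incrementally with SCAN, so that no single command blocks the server for seconds. The article itself does not mention SCAN, so this is a suggested substitute and not the author's fix. The sketch assumes a client object whose `scan(cursor=..., match=..., count=...)` method returns `(next_cursor, keys)`, as redis-py's `Redis.scan` does; `FakeClient` is a made-up stand-in so the example runs without a server.

```python
import fnmatch


def iter_keys(client, pattern="*", count=1000):
    """Yield keys matching pattern using repeated, short SCAN calls."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        yield from keys
        if cursor == 0:  # the server signals the end of the scan with cursor 0
            break


class FakeClient:
    """Stand-in for a real client; returns keys in fixed-size pages."""

    def __init__(self, keys, page=2):
        self._keys = list(keys)
        self._page = page

    def scan(self, cursor=0, match="*", count=None):
        chunk = self._keys[cursor:cursor + self._page]
        next_cursor = cursor + self._page
        if next_cursor >= len(self._keys):
            next_cursor = 0
        return next_cursor, [k for k in chunk if fnmatch.fnmatchcase(k, match)]


client = FakeClient(["user:1", "user:2", "order:1", "user:3", "order:2"])
print(list(iter_keys(client, "user:*")))  # ['user:1', 'user:2', 'user:3']
```

Against a real server the cursor is an opaque value, not an offset, and in a cluster each node has to be scanned separately because SCAN only covers the keys of the node it is sent to.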

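Summary

That is the whole of this detailed analysis of a Redis cluster failure; we hope it is helpful. If anything is missing or wrong, please leave a comment and the editor will correct it promptly.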

Published: 2023-05-17 09:26:54

Article link: https://www.wtabcd.cn/fanwen/fan/82/665908.html
