ClickHouse: Common Problems and Their Solutions
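Overview
While performance-testing ClickHouse in a high-availability setup (Distributed tables + Replicated tables + ZooKeeper), we ran into the pitfalls below, collected here for reference.
1 Distributed table JOIN problem: Unknown identifier: LO_CUSTKEY, context:…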
1.1 Problem description
The SQL is as follows:
SELECT count(1)
FROM performance.line_all AS c
LEFT JOIN performance.customer_all AS l ON l.C_CUSTKEY = c.LO_CUSTKEY
Running this SQL fails with the following error:
Received exception from server (version 19.4.0):
Code: 47. DB::Exception: Received from 10.0.0.50:9000. DB::Exception: Received from ambari04:9000, 10.0.0.54. DB::Exception: Unknown identifier: LO_CUSTKEY, context: query: 'LO_CUSTKEY' required_names: 'LO_CUSTKEY' source_tables: table_alias: complex_alias: masked_columns: array_join_columns: source_columns: .
The error shows that the join column LO_CUSTKEY cannot be resolved.
1.2 Solution
When joining distributed tables, the join column of the table that follows FROM must come first in the ON condition. The SQL is changed as follows:
SELECT count(1)
FROM performance.customer_all AS c
LEFT JOIN performance.line_all AS l ON c.C_CUSTKEY = l.LO_CUSTKEY
2 Lost connection to ZooKeeper: Unknown status, Cannot allocate block number in ZooKeeper, ZooKeeper session has been expired…
2.1 Problem description
Errors such as the following may appear while executing SQL:
↑ Progress: 157.94 million rows, 6.91 GB (92.63 thousand rows/s., 4.05 MB/s.) Received exception from server (version 19.4.0):
Code: 319. DB::Exception: Received from 10.0.0.50:9000. DB::Exception: Unknown status, client must retry. Reason: Connection loss.
↖ Progress: 94.47 million rows, 4.18 GB (95.07 thousand rows/s., 4.20 MB/s.) Received exception from server (version 19.4.0):
Code: 999. DB::Exception: Received from 10.0.0.50:9000. DB::Exception: Cannot allocate block number in ZooKeeper: Coordination::Exception: Connection loss.
lineorder_flat_all.Distributed.DirectoryMonitor: Code: 225, e.displayText() = DB::Exception: Received from ambari02:9000, 10.0.0.52. DB::Exception: ZooKeeper session has been expired.. Stack trace:
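The errors show that the connection to ZooKeeper was lost, which is why block numbers could not be allocated, among other failures. ClickHouse depends very heavily on ZooKeeper: table metadata, information about every data block, every insert, and every data synchronization all require interaction with ZooKeeper. While the ZooKeeper service is syncing its log, it cannot respond to external requests, which in turn causes session expiry and related problems.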
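The first error above states that the client must retry. A minimal sketch of that idea in Python follows, assuming a hypothetical `run_insert` callable that performs one insert and a hypothetical `TransientCoordinationError` raised when the server reports a ZooKeeper connection loss; neither name comes from ClickHouse or from any client library. Whether a blind retry can duplicate rows depends on the table engine and the insert path, so check that before relying on this pattern.

```python
import time


class TransientCoordinationError(Exception):
    """Hypothetical stand-in for a ZooKeeper-related failure reported by the server."""


def insert_with_retry(run_insert, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call run_insert() until it succeeds or the attempts are exhausted.

    The wait doubles after every failure: base_delay, 2 * base_delay, ...
    The sleep function is injectable so the backoff can be tested without waiting.
    """
    for attempt in range(attempts):
        try:
            return run_insert()
        except TransientCoordinationError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * (2 ** attempt))
```

The backoff gives ZooKeeper time to finish its log sync before the next attempt instead of adding load while it is unresponsive.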
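2.2 Solution
(1) Increase the maximum ZooKeeper session timeout: set maxSessionTimeout=120000 in zoo.cfg, then restart ZooKeeper.
Note: do not set the ZooKeeper timeout too high; if a service goes down, the failure is then detected very slowly.
(2) The disk that stores the ZooKeeper snapshot files should be no smaller than 1 TB; pay attention to the cleanup policy.
(3) In ZooKeeper, keep the dataLogDir directory separate from dataDir; a dedicated storage device can be used for the ZK log.
(4) Add forceSync=no to zoo.cfg. The option is enabled by default: when ZK receives data, it immediately syncs the current state to the on-disk log file and replies only after the sync completes. Turning it off avoids this sync latency, so client connections get a fast response. Disabling forceSync carries a risk: the log is still flushed (log.flush() is executed first), but the operating system writes to its cache first to make disk writes more efficient, so after a machine failure some ZK state may not have reached the disk, leaving the ZK data inconsistent before and after the failure.
(5) When creating ClickHouse tables, add the use_minimalistic_part_header_in_zookeeper setting, which stores the metadata in ZooKeeper in a compact form. Note that once this is changed it cannot be rolled back.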
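The zoo.cfg recommendations in items (1), (3), and (4) can be checked mechanically. The sketch below is a minimal illustration in Python, not a ZooKeeper tool: the function names and the sample paths are hypothetical, and only the keys maxSessionTimeout, dataDir, dataLogDir, and forceSync from the list above are examined.

```python
def parse_zoo_cfg(text):
    """Parse key=value lines from zoo.cfg text, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def review_zoo_cfg(text, min_session_timeout_ms=120000):
    """Return a list of warnings for the settings discussed in items (1), (3), (4)."""
    cfg = parse_zoo_cfg(text)
    warnings = []
    # An unset maxSessionTimeout is treated as 0 here, so it is always reported.
    if int(cfg.get("maxSessionTimeout", "0")) < min_session_timeout_ms:
        warnings.append("maxSessionTimeout is below %d ms" % min_session_timeout_ms)
    log_dir = cfg.get("dataLogDir")
    if log_dir is None or log_dir == cfg.get("dataDir"):
        warnings.append("dataLogDir is not separated from dataDir")
    if cfg.get("forceSync", "yes") != "no":
        warnings.append("forceSync is enabled (safer, but slower to acknowledge writes)")
    return warnings


# Hypothetical configuration that follows all three recommendations.
SAMPLE = """
tickTime=2000
dataDir=/data/zookeeper
dataLogDir=/data-log/zookeeper
maxSessionTimeout=120000
forceSync=no
"""

if __name__ == "__main__":
    for warning in review_zoo_cfg(SAMPLE):
        print(warning)
```

The forceSync warning is informational: leaving the option enabled is the safe default, and item (4) describes the durability risk of turning it off.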
3 Distributed table is read-only: Table is in readonly mode
3.1 Problem description
An insert may fail with the following error:
2020.05.28 10:59:11.048910 [ 47 ] {} <Error> lineorder_flat_all.Distributed.DirectoryMonitor: Code: 242, e.displayText() = DB::Exception: Received from ambari04:9000, 10.0.0.54. DB::Exception: Table is in readonly mode. Stack trace:
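The cause is excessive load on ZooKeeper: the table enters read-only mode, so the insert fails.
3.2 Solution
(1) In ZooKeeper, keep the dataLogDir directory separate from dataDir; a dedicated storage device can be used for the ZK log.
(2) Plan the ZooKeeper and ClickHouse clusters carefully; several ZooKeeper clusters can serve a single ClickHouse cluster.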
4 ClickHouse cluster lost its ZooKeeper data: Can't get data for node /clickhouse/tables/…
4.1 Problem description
The log may contain an error like the following:
Cannot create table from metadata file /var/lib/clickhouse/metadata/xx/xxx.sql, error: Coordination::Exception: Can't get data for node /clickhouse/tables/xx/cluster_xxx-01/xxxx/metadata: node doesn't exist (No node), stack trace:
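The cause is that the ZooKeeper data was lost, so the ClickHouse database cannot start.
4.2 Solution
(1) Back up the SQL files under /var/lib/clickhouse/metadata/ and the data under /var/lib/clickhouse/data/, then delete them.
(2) Start the database.
(3) Create a MergeTree table with the same structure as the original table.
(4) Copy the data directory of the former distributed table into the data directory of the new table.
(5) Restart the database.
(6) Recreate the local table with its original structure.
(7) Recreate the distributed table with its original structure.
(8) insert into [distributed table] select * from [MergeTree table]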
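Step (1) is the destructive part of the procedure, so it is worth scripting carefully. The following Python sketch is one possible way to do it under stated assumptions: `backup_and_clear` is a hypothetical helper, the server is stopped while it runs, and the caller decides which directories to pass. It moves the contents into a backup location instead of copying and deleting, which has the same end result when the move succeeds.

```python
import shutil
from pathlib import Path


def backup_and_clear(source_dirs, backup_root):
    """Move every entry of each source directory into backup_root/<dir name>/.

    The source directories themselves are kept (left empty).
    Returns the list of backup paths that were created.
    """
    backup_root = Path(backup_root)
    moved = []
    for src in map(Path, source_dirs):
        target = backup_root / src.name
        # exist_ok=False: refuse to write into an existing backup of the same name.
        target.mkdir(parents=True, exist_ok=False)
        # sorted() materializes the listing before anything is moved.
        for entry in sorted(src.iterdir()):
            destination = target / entry.name
            shutil.move(str(entry), str(destination))
            moved.append(destination)
    return moved


# Example call (hypothetical backup location; run only with the server stopped):
# backup_and_clear(
#     ["/var/lib/clickhouse/metadata", "/var/lib/clickhouse/data"],
#     "/backup/clickhouse-before-recovery",
# )
```

Keeping the backup intact until step (8) has been verified leaves a way back if the copied data in step (4) turns out to be incomplete.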