Hive Optimization: Multiple count(distinct)
This post introduces three ways to optimize queries with multiple count(distinct) in Hive.
First, the code to be optimized:
select count(distinct sid) as sid
,count(distinct entity_id) as entity_id
,count(distinct billing_status_code) as billing_status_code
from c_detail
where cal_dt='2020-03-30';
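Because count(distinct) has to deduplicate, all the data must be sent to the same task for deduplication, so only one reduce task is produced. When the data volume is large, that single reducer becomes the performance bottleneck.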
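The bottleneck can be pictured with a small Python sketch. It is only an illustration under stated assumptions: `reducer_for` is a hypothetical stand-in for the shuffle partitioner, not Hive's actual implementation. It shows why a key-based split is safe: each key lands on exactly one reducer, so the per-reducer distinct counts add up to the global distinct count.

```python
import zlib


def reducer_for(key, n_reducers):
    # Deterministic stand-in for a hash partitioner (an assumption for this sketch).
    return zlib.crc32(str(key).encode("utf-8")) % n_reducers


def distinct_single_reducer(values):
    # count(distinct): every value goes to one task, which deduplicates all of them.
    return len(set(values))


def distinct_by_key_partition(values, n_reducers=4):
    # group by: each key is routed to exactly one reducer, so the reducers
    # deduplicate disjoint sets of keys and the partial counts simply add up.
    buckets = [set() for _ in range(n_reducers)]
    for value in values:
        buckets[reducer_for(value, n_reducers)].add(value)
    return sum(len(bucket) for bucket in buckets)


values = [f"s{i % 37}" for i in range(1000)]
print(distinct_single_reducer(values), distinct_by_key_partition(values))
```

Both functions return the same number; the difference is only in how much data a single task has to hold.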
Two ordinary approaches:
1. Load-balancing method (turn the one-step query into two steps: ① deduplicate ② count)
select max(case when col = 'sid' then cn end) as sid
,max(case when col = 'entity_id' then cn end) as entity_id
,max(case when col = 'billing_status_code' then cn end) as billing_status_code
from
(select count(1) as cn,'sid' as col from (select sid from c_detail where cal_dt='2020-03-30' group by sid)a union all
select count(1) as cn,'entity_id' as col from (select entity_id from c_detail where cal_dt='2020-03-30' group by entity_id)a union all
select count(1) as cn,'billing_status_code' as col from (select billing_status_code from c_detail where cal_dt='2020-03-30' group by billing_status_code)a )b ;
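Pros: the load is balanced, which removes the bottleneck caused by a single reducer.
Cons: it is tedious to write. Every column has to be deduplicated separately and stitched together with union all, and the result then has to be pivoted from rows to columns. With dozens of count(distinct) columns, writing it all out by hand is exhausting.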
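As a sanity check, the two-step rewrite can be compared with the direct query on a toy table. The sketch below uses Python's sqlite3 module purely as a convenient stand-in for Hive, and the sample rows are made up; it says nothing about performance, only that both forms return the same counts on this data.

```python
import sqlite3

ROWS = [
    # (sid, entity_id, billing_status_code, cal_dt): made-up sample data
    ("s1", "e1", "A", "2020-03-30"),
    ("s2", "e1", "A", "2020-03-30"),
    ("s3", "e2", "B", "2020-03-30"),
    ("s3", "e2", "A", "2020-03-30"),
    ("s4", "e3", "B", "2020-03-29"),  # other date, filtered out by the where clause
]

DIRECT = """
select count(distinct sid), count(distinct entity_id), count(distinct billing_status_code)
from c_detail where cal_dt = '2020-03-30'
"""

TWO_STEP = """
select max(case when col = 'sid' then cn end),
       max(case when col = 'entity_id' then cn end),
       max(case when col = 'billing_status_code' then cn end)
from (
  select count(1) as cn, 'sid' as col
  from (select sid from c_detail where cal_dt = '2020-03-30' group by sid) a
  union all
  select count(1) as cn, 'entity_id' as col
  from (select entity_id from c_detail where cal_dt = '2020-03-30' group by entity_id) a
  union all
  select count(1) as cn, 'billing_status_code' as col
  from (select billing_status_code from c_detail where cal_dt = '2020-03-30'
        group by billing_status_code) a
) b
"""


def run(sql, rows=ROWS):
    """Load the sample rows into an in-memory table and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table c_detail (sid text, entity_id text, billing_status_code text, cal_dt text)"
    )
    conn.executemany("insert into c_detail values (?, ?, ?, ?)", rows)
    result = conn.execute(sql).fetchone()
    conn.close()
    return result


print(run(DIRECT), run(TWO_STEP))
```

One caveat worth keeping in mind: count(distinct col) ignores NULLs, while count(1) over a group by also counts the NULL group, so the two forms can differ by one for a column that contains NULL values. Using count(col) in the outer step would avoid that.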
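2. Low-effort method: ① deduplicate on the column combination first to reduce the data volume ② count(distinct). The second step still runs in a single reducer, but on less data.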
select count(distinct sid),count(distinct entity_id),count(distinct billing_status_code)
from
(select sid,entity_id,billing_status_code
from c_detail
where cal_dt='2020-03-30'
group by sid,entity_id,billing_status_code)a;
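Pros: simple to write.
Cons: if the data is still very large after deduplication, the query still cannot finish. For example, if sid is a user id with very high cardinality while the other two columns have very low cardinality, the deduplication achieves almost nothing, and methods 1 and 2 have to be combined.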
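The effect of cardinality on the first step can be illustrated in a few lines of Python. The data here is synthetic and the column layout (sid, entity_id, billing_status_code) is only assumed for the example; the point is how many rows survive a group by on all three columns.

```python
def rows_after_dedup(rows):
    """Rows left after `group by` on every column, i.e. the number of distinct tuples."""
    return len(set(rows))


N = 10_000

# All three columns have low cardinality: 20 sids, 5 entities, 3 status codes.
low_cardinality = [(i % 20, i % 5, i % 3) for i in range(N)]

# sid is unique per row (think user id); the other two columns stay tiny.
high_cardinality_sid = [(i, i % 5, i % 3) for i in range(N)]

print(rows_after_dedup(low_cardinality))       # a small fraction of N
print(rows_after_dedup(high_cardinality_sid))  # still N rows, nothing was saved
```

In the second case the single reducer in step ② receives as many rows as before, which is the situation the combined method below is meant to handle.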
Combined compromise method
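-- c.sid_cnt is already the distinct count of sid, so it is carried through with max() rather than counted again
select max(c.sid_cnt) as sid,count(distinct a.entity_id) as entity_id,count(distinct a.billing_status_code) as billing_status_code
from
(select entity_id,billing_status_code from c_detail
where cal_dt='2020-03-30'
group by entity_id,billing_status_code)a
join (select count(1) as sid_cnt from (select distinct sid from c_detail
where cal_dt='2020-03-30') b)c on 1=1;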
If there are many low-cardinality columns, the combined approach is also very tedious to write.
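A quick way to convince yourself that the combined form returns the right numbers is to run it next to the direct query on a toy table. The sketch below again uses sqlite3 as a stand-in for Hive with made-up rows, and the alias `sid_cnt` is a name chosen for this example. The high-cardinality column is counted in its own subquery, and that one-row result is carried into the outer query with max().

```python
import sqlite3

ROWS = [
    # (sid, entity_id, billing_status_code, cal_dt): made-up sample data
    ("s1", "e1", "A", "2020-03-30"),
    ("s2", "e1", "A", "2020-03-30"),
    ("s3", "e2", "B", "2020-03-30"),
    ("s3", "e2", "A", "2020-03-30"),
    ("s4", "e3", "B", "2020-03-29"),  # other date, filtered out
]

DIRECT = """
select count(distinct sid), count(distinct entity_id), count(distinct billing_status_code)
from c_detail where cal_dt = '2020-03-30'
"""

COMBINED = """
select max(c.sid_cnt), count(distinct a.entity_id), count(distinct a.billing_status_code)
from (select entity_id, billing_status_code from c_detail
      where cal_dt = '2020-03-30'
      group by entity_id, billing_status_code) a
join (select count(1) as sid_cnt
      from (select distinct sid from c_detail where cal_dt = '2020-03-30') b) c
  on 1 = 1
"""


def run(sql, rows=ROWS):
    """Load the sample rows into an in-memory table and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table c_detail (sid text, entity_id text, billing_status_code text, cal_dt text)"
    )
    conn.executemany("insert into c_detail values (?, ?, ?, ?)", rows)
    result = conn.execute(sql).fetchone()
    conn.close()
    return result


print(run(DIRECT), run(COMBINED))
```

Applying count(distinct ...) to the joined count column instead of max() would always yield 1, because every joined row carries the same precomputed value.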
3. The classy grouping sets method: convenient to write and load-balanced as well
-- each grouping set emits rows in which only its own column is non-null
select count(case when entity_id is null and billing_status_code is null then 1 end) as sid
,count(case when sid is null and billing_status_code is null then 1 end) as entity_id
,count(case when sid is null and entity_id is null then 1 end) as billing_status_code
from
(select sid,entity_id,billing_status_code
from c_detail
where cal_dt='2020-03-30'
group by sid,entity_id,billing_status_code
grouping sets((sid),(entity_id),(billing_status_code)))a;
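This is equivalent to using grouping sets in place of countless group by + union all branches.
Cons: too many grouping sets cause performance problems. How does the grouping sets syntax work, and what kind of performance problems does it cause? See the next post, Hive Multi-Dimensional Aggregation.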
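To make the mechanics concrete, here is a plain-Python emulation of what the inner grouping sets query produces and how the outer case when counts it. This is only a model of the semantics for single-column grouping sets, with hypothetical helper names and made-up rows; it is not how Hive executes the query.

```python
COLUMNS = ("sid", "entity_id", "billing_status_code")

ROWS = [
    # made-up sample data, already filtered to one cal_dt
    ("s1", "e1", "A"),
    ("s2", "e1", "A"),
    ("s3", "e2", "B"),
    ("s3", "e2", "A"),
]


def grouping_sets_single_columns(rows, columns=COLUMNS):
    """Emulate `grouping sets ((c1), (c2), ...)`: one output row per distinct
    value of each column, with every other column set to None."""
    out = []
    for idx in range(len(columns)):
        for value in {row[idx] for row in rows}:
            out.append(tuple(value if j == idx else None for j in range(len(columns))))
    return out


def count_by_null_pattern(grouped, columns=COLUMNS):
    """Emulate count(case when <all other columns are null> then 1 end)."""
    counts = {}
    for idx, name in enumerate(columns):
        counts[name] = sum(
            1
            for row in grouped
            if all(row[j] is None for j in range(len(columns)) if j != idx)
        )
    return counts


grouped = grouping_sets_single_columns(ROWS)
print(count_by_null_pattern(grouped))
```

The null-pattern test assumes the columns themselves never contain NULL. If they do, the all-NULL output row satisfies every condition and each count is inflated; distinguishing the sets with Hive's GROUPING__ID is likely a more robust choice in that case.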