Spark SQL distinct: Analysis and Optimization Summary

Updated: 2023-07-11 19:06:35

How Spark count distinct works
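The distinct process expands the data and causes skew on both the shuffle side and the reduce side, which makes distinct operators especially slow.
The main reasons distinct is slow:
How the data expansion arises: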
SELECT
    count(distinct id),
    count(distinct name)
FROM table_a
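1. The distinct operator passes the distinct columns together with the GROUP BY columns to reduce as one combined key. As a result there is no pre-aggregation on the map side before the shuffle, more data crosses the network during the shuffle, and the load on the reducers goes up.
2. The distinct operator expands the original data: with N DISTINCT keywords the data grows N-fold on the map side, and the long-tail effect on shuffle and reduce (reason 1) is amplified N-fold as well.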
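The N-fold expansion can be pictured with a small simulation. This is only a sketch in plain Python, not Spark's actual operator: it assumes the two-column query above, and the `expand` and `count_distincts` helpers and the group id `gid` are names made up for illustration. Each input row is emitted once per DISTINCT column, and the deduplication key is the whole (gid, id, name) tuple, which is why nothing can be pre-aggregated away cheaply.

```python
def expand(rows):
    """Emit each (id, name) row once per DISTINCT column, tagged with a group id."""
    out = []
    for id_, name in rows:
        out.append((1, id_, None))   # gid 1 feeds count(distinct id)
        out.append((2, None, name))  # gid 2 feeds count(distinct name)
    return out


def count_distincts(rows):
    expanded = expand(rows)
    # First aggregation: deduplicate on the full key (gid, id, name).
    keys = set(expanded)
    # Second aggregation: count the surviving non-null values per group id.
    distinct_ids = sum(1 for gid, id_, _ in keys if gid == 1 and id_ is not None)
    distinct_names = sum(1 for gid, _, name in keys if gid == 2 and name is not None)
    return len(expanded), distinct_ids, distinct_names


rows = [(1, "a"), (1, "b"), (2, "a"), (2, "a")]
expanded_rows, n_ids, n_names = count_distincts(rows)
print(expanded_rows, n_ids, n_names)  # 8 2 2
```

Four input rows become eight expanded rows for two DISTINCT columns; with seven DISTINCT columns the same input would become 28 rows.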
Job analysis: this job contains 7 count distinct operations, so the data was expanded 7-fold (figure reported for the job: 37,758,580).
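Optimization:
When computing UV you need both count() and count(DISTINCT). The calculation can be split into two steps:
1. First aggregate once, using the GROUP BY dimensions plus the distinct column as the grouping key, to get the pv of each cuid (user ID).
2. Then aggregate on the original dimensions and use SUM(pv) and COUNT(cuid), so that no DISTINCT keyword appears. This avoids both the data expansion and the distinct shuffle.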
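The two steps can be checked on a toy dataset. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Spark SQL; the `events` table, its `dim` and `uid` columns, and the sample rows are assumptions made up for illustration. It only shows that the two formulations return the same pv and uv, not how either performs on Spark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (dim TEXT, uid TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", "u1"), ("a", "u1"), ("a", "u2"), ("b", "u1"), ("b", "u3"), ("b", "u3")],
)

# Direct form: pv and uv in one pass, using COUNT(DISTINCT).
direct = conn.execute(
    "SELECT dim, COUNT(1) AS pv, COUNT(DISTINCT uid) AS uv "
    "FROM events GROUP BY dim ORDER BY dim"
).fetchall()

# Two-step form: step 1 computes pv per (dim, uid); step 2 sums pv and counts rows.
two_step = conn.execute(
    """
    SELECT dim, SUM(pv) AS pv, COUNT(uid) AS uv
    FROM (
        SELECT dim, uid, COUNT(1) AS pv
        FROM events
        GROUP BY dim, uid
    ) per_user
    GROUP BY dim
    ORDER BY dim
    """
).fetchall()

print(direct)    # [('a', 3, 2), ('b', 3, 2)]
print(two_step)  # [('a', 3, 2), ('b', 3, 2)]
```

After step 1 there is exactly one row per user within each dimension value, so a plain COUNT in step 2 already is the distinct count.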
Example:
Before optimization:
SELECT
    field1_all,
    field2_all,
    field3_all,
    field4_all,
    field5_all,
    count(1) AS pv,
    count(distinct uid) AS uv,
    sum(if(field1 = '1', 1, 0)) AS download_pv,
    sum(if(field1 = '2', 1, 0)) AS install_pv,
    sum(if(field1 = '3', 1, 0)) AS launch_pv,
    count(distinct case when field1 = '1' then uid else null end) AS download_uv,
    count(distinct case when field1 = '2' then uid else null end) AS install_uv,
    count(distinct case when field1 = '3' then uid else null end) AS launch_uv,
    sum(if(field1 = '1' and field2 = '2', 1, 0)) AS download_succ_pv,
    sum(if(field1 = '2' and field2 = '13', 1, 0)) AS install_succ_pv,
    sum(if(field1 = '3' and field2 = '14', 1, 0)) AS launch_succ_pv,
    count(distinct case when field1 = '1' and field2 = '2' then uid else null end) AS download_succ_uv,
    count(distinct case when field1 = '2' and field2 = '13' then uid else null end) AS install_succ_uv,
    count(distinct case when field1 = '3' and field2 = '14' then uid else null end) AS launch_succ_uv
FROM
(
    SELECT
        field1,
        uid,
        field2,
        field3,
        field4,
        field5,
        field6
    FROM table
    WHERE day = '{DATE}'
        AND id = 'xxx'
        AND from = 'xxx'
) tbl_1
LATERAL VIEW explode(array(field1, 'all')) A AS field1_all
LATERAL VIEW explode(array(field2, 'all')) B AS field2_all
LATERAL VIEW explode(array(field3, 'all')) C AS field3_all
LATERAL VIEW explode(array(field4, 'all')) D AS field4_all
LATERAL VIEW explode(array(field5, 'all')) E AS field5_all
GROUP BY
    field1_all,
    field2_all,
    field3_all,
    field4_all,
    field5_all
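The five LATERAL VIEW explode(array(fieldN, 'all')) clauses add a second source of expansion on top of the DISTINCT one: each row is emitted once with its real value and once with 'all' for every dimension. A rough sketch of that fan-out in plain Python, with itertools.product standing in for the chained lateral views (the `explode_with_all` helper and the sample row are illustrative assumptions):

```python
from itertools import product


def explode_with_all(row):
    """Return every combination of (real value | 'all') across the row's dimensions."""
    return list(product(*[(value, "all") for value in row]))


row = ("1", "2", "x", "y", "z")  # hypothetical field1..field5 values
expanded = explode_with_all(row)
print(len(expanded))  # 32
```

With five dimensions every input row turns into 2^5 = 32 rows before the GROUP BY, which is one reason to shrink the input before the lateral views run.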
After optimization:
Runtime dropped from 30 h to 5 h, and the shuffle-stage data volume fell by roughly 10x.
SELECT
    field1_all,
    field2_all,
    field3_all,
    field4_all,
    field5_all,
    sum(cnt) AS pv,
    sum(case when uv > 0 then 1 else 0 end) AS uv,
    sum(download_pv) AS download_pv,
    sum(install_pv) AS install_pv,
    sum(launch_pv) AS launch_pv,
    sum(case when download_uv > 0 then 1 else 0 end) AS download_uv,
    sum(case when install_uv > 0 then 1 else 0 end) AS install_uv,
    sum(case when launch_uv > 0 then 1 else 0 end) AS launch_uv,
    sum(download_succ_pv) AS download_succ_pv,
    sum(install_succ_pv) AS install_succ_pv,
    sum(launch_succ_pv) AS launch_succ_pv,
    sum(case when download_succ_uv > 0 then 1 else 0 end) AS download_succ_uv,
    sum(case when install_succ_uv > 0 then 1 else 0 end) AS install_succ_uv,
    sum(case when launch_succ_uv > 0 then 1 else 0 end) AS launch_succ_uv
FROM (
    -- One row per (dimension combination, uid): per-user pv plus per-user hit counts.
    SELECT
        field1_all,
        field2_all,
        field3_all,
        field4_all,
        field5_all,
        sum(cnt) AS cnt,
        count(uid) AS uv,
        sum(if(field1 = '1', cnt, 0)) AS download_pv,
        sum(if(field1 = '2', cnt, 0)) AS install_pv,
        sum(if(field1 = '3', cnt, 0)) AS launch_pv,
        count(case when field1 = '1' then uid else null end) AS download_uv,
        count(case when field1 = '2' then uid else null end) AS install_uv,
        count(case when field1 = '3' then uid else null end) AS launch_uv,
        sum(if(field1 = '1' and field2 = '2', cnt, 0)) AS download_succ_pv,
        sum(if(field1 = '2' and field2 = '13', cnt, 0)) AS install_succ_pv,
        sum(if(field1 = '3' and field2 = '14', cnt, 0)) AS launch_succ_pv,
        count(case when field1 = '1' and field2 = '2' then uid else null end) AS download_succ_uv,
        count(case when field1 = '2' and field2 = '13' then uid else null end) AS install_succ_uv,
        count(case when field1 = '3' and field2 = '14' then uid else null end) AS launch_succ_uv
    FROM
    (
        -- Pre-aggregate before the lateral views; cnt keeps the raw row count so pv stays exact.
        SELECT
            field1,
            uid,
            field2,
            field3,
            field4,
            field5,
            count(1) AS cnt
        FROM table
        WHERE day = '{DATE}'
            AND id = 'xxx'
            AND from = 'xxx'
        GROUP BY
            uid,
            field1,
            field2,
            field3,
            field4,
            field5
    ) t1
    LATERAL VIEW explode(array(field1, 'all')) A AS field1_all
    LATERAL VIEW explode(array(field2, 'all')) B AS field2_all
    LATERAL VIEW explode(array(field3, 'all')) C AS field3_all
    LATERAL VIEW explode(array(field4, 'all')) D AS field4_all
    LATERAL VIEW explode(array(field5, 'all')) E AS field5_all
    GROUP BY
        field1_all,
        field2_all,
        field3_all,
        field4_all,
        field5_all,
        uid
) t2
GROUP BY
    field1_all,
    field2_all,
    field3_all,
    field4_all,
    field5_all;
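The conditional UV columns rely on the same idea: count the matching rows per user first, then count the users whose per-user count is above zero. A small check of that equivalence, again using Python's sqlite3 as a stand-in for Spark SQL; the `events` table, the `download_hits` alias, and the sample rows (including one row with a NULL uid) are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (dim TEXT, uid TEXT, field1 TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("a", "u1", "1"), ("a", "u1", "1"), ("a", "u1", "2"),
        ("a", "u2", "2"), ("a", "u3", "1"),
        ("b", "u1", "3"), ("b", None, "1"),
    ],
)

# Conditional distinct count, written directly.
direct = conn.execute(
    """
    SELECT dim,
           COUNT(DISTINCT CASE WHEN field1 = '1' THEN uid ELSE NULL END) AS download_uv
    FROM events
    GROUP BY dim
    ORDER BY dim
    """
).fetchall()

# Same metric without DISTINCT: per-user hit count first, then count users with hits.
rewritten = conn.execute(
    """
    SELECT dim,
           SUM(CASE WHEN download_hits > 0 THEN 1 ELSE 0 END) AS download_uv
    FROM (
        SELECT dim, uid,
               COUNT(CASE WHEN field1 = '1' THEN uid ELSE NULL END) AS download_hits
        FROM events
        GROUP BY dim, uid
    ) per_user
    GROUP BY dim
    ORDER BY dim
    """
).fetchall()

print(direct)     # [('a', 2), ('b', 0)]
print(rewritten)  # [('a', 2), ('b', 0)]
```

Counting `uid` rather than a constant inside the CASE keeps rows with a NULL user ID out of the result, matching what COUNT(DISTINCT ...) does.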
Reference:
《阿里巴巴大数据之路》, p. 269


Article link: https://www.wtabcd.cn/fanwen/fan/82/1091215.html
