Hive针对distinct的优化(二)

Hive针对distinct的优化。

之前一篇针对单个count(distinct xxx)的优化,本文来讲讲对多个count(distinct xxx)的优化。

0x00 解决思路

优化是在之前单个count的基础上,通过使用union all以及窗口分析函数lag的结合来进行的。具体思路如下。

0x01 分治法

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
SELECT 
pid, c1, c2
FROM
(select
pid, c1, lag(c2,1) over win c2, row_number() over win rn
from
(select--取得c1的值
pid, sum(tn) c1, null c2
from
(select
pid, substr(uid,1,4) tag, count(distinct substr(uid, 5)) tn
from
xxtable
group by
pid,substr(uid,1,4)
)t1
group by pid

union all
select--取得c2的值
pid, null c1, sum(tn1) c2
from
(select
pid, substr(cid,1,4) tag, count(distinct substr(cid, 5)) tn1
from
xxtable
group by
pid,substr(cid,1,4)
)t2
group by pid
)t3
window win as (partition by pid order by c1)
)t4
WHERE
rn = 2 --值取决于具体情况
  • 此方法适用于求少量count distinct的情况,有多少个count distinct就union all多少次。
  • 在最内层t1、t2中分别求得需要的计数值。
  • 在中间层t3使用窗口分析函数lag或者lead(或者使用max方法也行)对多个结果值进行聚合。
  • 在最外层t4通过row_number筛选出需要行。

经过验证,该方法在5000万数据量的情况下,不优化需要4.5分钟,经过优化需要1.5分钟,提升效果较为明显。

0x10 随机分组法

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
SELECT 
pid, c1, c2
FROM
(select
pid, c1, lag(c2,1) over win c2, row_number() over win rn
from
(select
pid,sum(tc) c1 , null c2
from
(select
pid, count(1) tc,tag
from
(select
pid, cast(rand() * 100 as bigint) tag, uid
from
xxtable
group by
pid, uid
)t1
group by pid, tag
)t2
group by pid

union all
select
pid, null c1 , sum(tc) c2
from
(select
pid, count(1) tc,tag
from
(select
pid, cast(rand() * 100 as bigint) tag, class_id
from
xxtable
group by pid, cid
)t1
group by pid, tag
)t2
group by pid
)t3
window win as (partition by pid order by c1)
)t4
WHERE
rn = 2 --值取决于具体情况

经过验证,该方法在5000万数据量的情况下,不优化需要4.5分钟,经过优化需要40秒,效果更加明显。