2017-08-29

Hive针对distinct的优化(二)

Hive针对distinct的优化。

之前一篇针对单个count(distinct xxx)的优化，本文来讲讲对多个count(distinct xxx)的优化。

0x00 解决思路

优化是在之前单个count的基础上，通过使用union all以及窗口分析函数lag的结合来进行的。具体思路如下。

0x01 分治法

SELECT 
    pid, c1, c2 
FROM
    (select 
        pid, c1, lag(c2,1) over win c2, row_number() over win rn 
    from 
        (select--取得c1的值 
            pid, sum(tn) c1, null c2  
        from 
            (select 
                pid, substr(uid,1,4) tag, count(distinct substr(uid, 5)) tn 
            from 
                xxtable 
            group by  
                pid,substr(uid,1,4)
            )t1
        group by pid
        
        union all
        select--取得c2的值 
            pid, null c1, sum(tn1) c2  
        from 
            (select 
                pid, substr(cid,1,4) tag, count(distinct substr(cid, 5)) tn1 
            from
                xxtable 
            group by  
                pid,substr(cid,1,4)
            )t2
        group by pid
        )t3
    window win as (partition by pid order by c1)
    )t4
WHERE 
    rn = 2 --值取决于具体情况

此方法适用于求少量count distinct的情况，有多少个count distinct就union all多少次。
在最内层t1、t2中分别求得需要的计数值。
在中间层t3使用窗口分析函数lag或者lead（或者使用max方法也行）对多个结果值进行聚合。
在最外层t4通过row_number筛选出需要行。

经过验证，该方法在5000万数据量的情况下，不优化需要4.5分钟，经过优化需要1.5分钟，提升效果较为明显。

0x10 随机分组法

SELECT 
    pid, c1, c2 
FROM
    (select 
        pid, c1, lag(c2,1) over win c2, row_number() over win rn 
    from 
        (select 
            pid,sum(tc) c1 , null c2 
        from 
            (select 
                pid, count(1) tc,tag  
            from 
                (select 
                    pid, cast(rand() * 100 as bigint) tag, uid 
                from 
                    xxtable
                group by 
                    pid, uid
                )t1 
            group by pid, tag
            )t2
        group by pid
    
        union all
        select 
            pid, null c1 , sum(tc) c2 
        from 
            (select 
                pid, count(1) tc,tag  
            from 
                (select 
                    pid, cast(rand() * 100 as bigint) tag, class_id 
                from 
                    xxtable
                group by pid, cid
                )t1 
            group by pid, tag
            )t2
            group by pid
        )t3
    window win as (partition by pid order by c1)
    )t4
WHERE 
    rn = 2 --值取决于具体情况

经过验证，该方法在5000万数据量的情况下，不优化需要4.5分钟，经过优化需要40秒，效果更加明显。