In databases, Distinct Count operations come up all the time, for example, counting the number of students enrolled in each course:
select course, count(distinct sid)
from stu_table
group by course;
Hive
In big data scenarios, an important reporting metric is UV (Unique Visitor), i.e., the number of distinct users within a given time period. For example, to get the per-app user counts over one week, the HiveQL is:
select app, count(distinct uid) as uv
from log_table
where week_cal = '2016-03-27'
group by app;
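When count(distinct) over many groups becomes a bottleneck, a common rewrite is to deduplicate first and then count; a sketch against the same hypothetical log_table:

select app, count(*) as uv
from (
    select distinct app, uid
    from log_table
    where week_cal = '2016-03-27'
) t
group by app;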
Pig
The equivalent in Pig:
-- all users
define DISTINCT_COUNT(A, a) returns dist {
    B = foreach $A generate $a;
    unique_B = distinct B;
    C = group unique_B all;
    $dist = foreach C generate SIZE(unique_B);
}
A = load '/path/to/data' using PigStorage() as (app, uid);
B = DISTINCT_COUNT(A, uid);

-- users per app
A = load '/path/to/data' using PigStorage() as (app, uid);
B = distinct A;
C = group B by app;
D = foreach C generate group as app, COUNT($1) as uv;
-- suitable for small cardinality scenarios
D = foreach C generate group as app, SIZE($1) as uv;
DataFu provides a cardinality-estimating UDF for Pig, datafu.pig.stats.HyperLogLogPlusPlus, which implements the HyperLogLog++ algorithm for a much faster Distinct Count:
-- the DataFu jar must be registered beforehand (register <path-to-datafu-jar>;)
define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
A = load '/path/to/data' using PigStorage() as (app, uid);
B = group A by app;
C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;
Spark
In Spark, after loading the data, Distinct Count can be computed through a series of RDD transformations: map, distinct, and reduceByKey:
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .map { line => (line._1, 1) }
  .reduceByKey(_ + _)

// or
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .mapValues { _ => 1 }
  .reduceByKey(_ + _)

// or (countByValue is an action and returns a Map to the driver)
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .map(_._1)
  .countByValue()
Spark also provides an API for approximate Distinct Count:
rdd.map { row => (row.app, row.uid) }
  .countApproxDistinctByKey(0.001)
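If a single overall UV is needed rather than a per-app one, plain RDDs expose the analogous countApproxDistinct; a minimal sketch, assuming the same rdd of (app, uid) rows:

// approximate number of distinct users across all apps;
// the argument is the target relative standard deviation
val totalUv = rdd.map { row => row.uid }
  .countApproxDistinct(0.001)

The smaller the relative standard deviation requested, the more HyperLogLog registers the sketch allocates and the more memory it uses.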
The implementation is based on the HyperLogLog algorithm; from the Spark docs:
The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available at https://doi.org/10.1145/2452376.2452456.
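To make the idea concrete, here is a toy HyperLogLog sketch in Scala (purely illustrative, not streamlib's implementation; the class name SimpleHLL and the register count are assumptions): each element is hashed, the first b bits of the hash route it to one of 2^b registers, each register remembers the longest run of leading zeros it has seen, and a harmonic mean over the registers yields the cardinality estimate.

import scala.util.hashing.MurmurHash3

// Toy HyperLogLog sketch: 2^b registers, each storing the maximum "rank"
// (position of the first 1-bit) observed among the hashes routed to it.
// No small/large-range corrections are applied.
class SimpleHLL(b: Int = 14) {
  private val m = 1 << b                        // number of registers
  private val registers = new Array[Int](m)
  private val alpha = 0.7213 / (1 + 1.079 / m)  // bias correction for large m

  def add(value: String): Unit = {
    val h = MurmurHash3.stringHash(value)
    val idx = h >>> (32 - b)                    // first b bits choose a register
    val rest = h << b                           // remaining 32 - b bits
    val rank =
      if (rest == 0) (32 - b) + 1
      else Integer.numberOfLeadingZeros(rest) + 1
    if (rank > registers(idx)) registers(idx) = rank
  }

  def estimate(): Double = {
    val z = registers.map(r => math.pow(2.0, -r)).sum
    alpha * m * m / z                           // raw HyperLogLog estimate
  }
}

// e.g. feeding in 100000 distinct ids typically estimates within ~1%
val hll = new SimpleHLL()
(1 to 100000).foreach(i => hll.add(s"user-$i"))
println(hll.estimate())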
Alternatively, convert the schema-bearing RDD to a DataFrame, register it as a temporary table via registerTempTable, and run the SQL directly:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // needed for rdd.toDF()
val df = rdd.toDF()
df.registerTempTable("app_table")
val appUsers = sqlContext.sql("select app, count(distinct uid) as uv from app_table group by app")
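The same query can also be expressed with the DataFrame API; a sketch assuming the df built above, using countDistinct and approxCountDistinct from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions._

// exact distinct count per app
val appUsersDf = df.groupBy("app").agg(countDistinct("uid").as("uv"))

// approximate variant, backed by a HyperLogLog-style sketch
val appUsersApprox = df.groupBy("app").agg(approxCountDistinct("uid").as("uv"))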