Distinct Count on Big Data (1): Introduction
Published: 2019-06-22


Databases frequently need a distinct count, for example to find the number of students enrolled in each elective course:

    select course, count(distinct sid)
    from stu_table
    group by course;
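To try the query above without a database server, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows and the column types are assumptions made for illustration; only the table and column names come from the query itself.

```python
import sqlite3

# Hypothetical sample data: (course, sid) enrollment rows, with one duplicate.
rows = [
    ("math", 1),
    ("math", 2),
    ("math", 2),      # duplicate record for the same student
    ("physics", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table stu_table (course text, sid integer)")
conn.executemany("insert into stu_table values (?, ?)", rows)

# Same query as above, with an ORDER BY added so the row order is deterministic.
result = conn.execute(
    "select course, count(distinct sid) "
    "from stu_table group by course order by course"
).fetchall()
conn.close()

print(result)  # [('math', 2), ('physics', 1)]
```

The duplicate ("math", 2) row is counted once, which is the difference between count(distinct sid) and a plain count(sid).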

Hive

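In big-data scenarios, one important report metric is UV (Unique Visitor), that is, the number of distinct users within a given time period. For example, to see how users are distributed across apps during one week, the HiveQL is: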

    select app, count(distinct uid) as uv
    from log_table
    where week_cal = '2016-03-27'
    group by app;

Pig

Similarly, in Pig:

    -- all users
    define DISTINCT_COUNT(A, a) returns dist {
        B = foreach $A generate $a;
        unique_B = distinct B;
        C = group unique_B all;
        $dist = foreach C generate SIZE(unique_B);
    }
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = DISTINCT_COUNT(A, uid);

    --
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = distinct A;
    C = group B by app;
    D = foreach C generate group as app, COUNT($1) as uv;
    -- suitable for small cardinality scenarios
    D = foreach C generate group as app, SIZE($1) as uv;

DataFu provides Pig with a cardinality-estimation UDF, datafu.pig.stats.HyperLogLogPlusPlus. It uses the HyperLogLog++ algorithm to compute an approximate distinct count more quickly:

    define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = group A by app;
    C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;

Spark

In Spark, after loading the data, a distinct count can be computed with a chain of RDD transformations (map, distinct, reduceByKey):

    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .map { line => (line._1, 1) }
      .reduceByKey(_ + _)

    // or
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .mapValues { _ => 1 }
      .reduceByKey(_ + _)

    // or
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .map(_._1)
      .countByValue()
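The three variants share one idea: deduplicate the (app, uid) pairs first, then count how many pairs remain per app. The sketch below mirrors those steps in plain Python on an in-memory list. It is only an illustration of the logic, not Spark code; the function name distinct_count_by_key and the sample records are hypothetical.

```python
from collections import Counter

def distinct_count_by_key(pairs):
    """Count distinct values per key by deduplicating (key, value) pairs first."""
    unique_pairs = set(pairs)                 # plays the role of distinct()
    keys = (key for key, _ in unique_pairs)   # plays the role of map(_._1)
    return dict(Counter(keys))                # plays the role of countByValue()

# Hypothetical log records: (app, uid), including repeated visits.
log = [("a", 1), ("a", 1), ("a", 2), ("b", 1), ("b", 1)]
uv = distinct_count_by_key(log)
print(uv)
```

Because the deduplication has to see every distinct pair, the memory and shuffle cost grows with the number of distinct (app, uid) pairs, which is what motivates approximate counting on large inputs.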

Spark also provides an API for approximate distinct counts:

    rdd.map { row => (row.app, row.uid) }
      .countApproxDistinctByKey(0.001)

The implementation is based on the HyperLogLog algorithm:

The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm".
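To give a feel for how a HyperLogLog-style estimator works, here is a minimal sketch of the classic algorithm in Python. It is not the streamlib, Spark, or DataFu implementation, and it leaves out the HyperLogLog++ refinements (sparse representation, empirical bias correction). The class name, the choice of SHA-1 as the hash, and the default precision p=12 are all assumptions made for illustration.

```python
import hashlib
import math

class HyperLogLog:
    """Classic HyperLogLog with a 64-bit hash (illustrative sketch only)."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 in this sketch")
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        idx = x >> (64 - self.p)                     # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        # Harmonic-mean based raw estimate.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros > 0:
            estimate = self.m * math.log(self.m / zeros)
        return estimate

hll = HyperLogLog(p=12)
n = 50000
for i in range(n):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # a repeated item never changes the registers
estimate = hll.count()
print(round(estimate))
```

The sketch keeps only m = 2^p small registers regardless of input size, and the expected relative standard error of classic HyperLogLog is roughly 1.04 / sqrt(m), about 1.6% for p=12. A smaller target error, such as the 0.001 passed to countApproxDistinctByKey above, therefore requires many more registers.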

Alternatively, convert an RDD that has a schema into a DataFrame, call registerTempTable, and then run a SQL query:

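    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // brings rdd.toDF() into scope

    val df = rdd.toDF()
    df.registerTempTable("app_table")
    val appUsers = sqlContext.sql("select app, count(distinct uid) as uv from app_table group by app")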

Reposted from: https://www.cnblogs.com/en-heng/p/5332703.html
