大数据组件

基数估计，count distinct

“基数”指集合的总个数。假设哈希函数h能把值均匀映射到[0,1]，取k个不同数，它们哈希值的最小值Y的期望为。或取一组Y计算平均值Z，算很多个Z取中位数。

LogLog计数，内存O(lglgn)，原理是n次伯努利过程。hash值的二进制数可当作抛硬币的1个伯努利过程，正面为1反面为0。n个二进制数对应n个伯努利过程。通过n个二进制数中1出现的第一个位置的最大值，估算总数。

HyperLogLog为减小误差，使用分桶平均。把二进制数的后14位作为桶编号，把前50位中1出现的第一个位置的最大值记录在桶中，每个桶需要6位来计数。最终估计=每个桶估计值的调和平均数*桶数。

频繁项，heavy hitters

初始化k个bin，每个bin内含元素e（初始为null）和计数counter（初始为0）

for each element e in the stream do
	if e is in a bin b then
		increment b's counter
	elseif find a bin whose counter is 0
		set its element to e and its counter to 1
	else decrement the counter of every bin

最终找到的不一定是频繁项，但频繁项一定都被找到

集合包含，bloom filter

说集合不含某元素时肯定不含，说集合含某未知元素时可能不含

近似计数，count-min sketch

一个二维计数表r行c列，对数据流中的每一项做r次不同哈希，第i次哈希将第i行某列值+1。最终该数的近似计数为 min{它的r次不同哈希对应的计数}

🧠 Brain

Explorer

大数据组件

大数据组件

基数估计，count distinct

频繁项，heavy hitters

集合包含，bloom filter

近似计数，count-min sketch

Graph View

Table of Contents

Backlinks

🧠 Brain

Explorer

大数据组件

大数据组件 §

基数估计，count distinct §

频繁项，heavy hitters §

集合包含，bloom filter §

近似计数，count-min sketch §

Graph View

Table of Contents

Backlinks

大数据组件

基数估计，count distinct

频繁项，heavy hitters

集合包含，bloom filter

近似计数，count-min sketch