This feels very similar to how the total work expended by the Bitcoin network can be approximated (not the current hashrate, but the sum of all hashes over time). You take all of the block hashes and find the lowest one of the lot; from that number you can estimate the total work performed by the network in aggregate. The concept is essentially the same here.
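To make the analogy concrete, here is a rough sketch of the idea (a toy simulation, not real block hashes): treat each hash as a uniform draw in [0, 1); after n draws the minimum is about 1/(n+1), so the reciprocal of the minimum estimates how many hashes were performed.

    import random

    # Toy version of "estimate total work from the lowest hash": each draw stands
    # in for a block hash, rescaled to [0, 1). With n draws, the expected minimum
    # is about 1/(n + 1), so 1/min - 1 is a rough estimate of n.
    random.seed(42)
    true_hashes = 1_000_000
    min_hash = min(random.random() for _ in range(true_hashes))
    estimate = 1 / min_hash - 1
    print(f"actual: {true_hashes:,}  estimated: {estimate:,.0f}")

    # A single minimum is a noisy estimator; real sketches keep several of the
    # smallest values (or split the hash space) to tighten the estimate.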
I find this interesting, but the writing is somewhat sloppy. Examples:
"Problem: Estimate the number of distinct items in a data stream that is too large to fit in memory"
That is not what the title promises, which is the size of the stream.
"assume the data in the stream consists of uniformly random integers between zero and N"
I guess the lower bound is inclusive and the upper bound exclusive. That probably doesn't matter in practice; it would only matter if you wanted to estimate the number of distinct byte values in a stream on a truly tiny device, and I'd guess anything that can do the division to get the estimate also has the 32 bytes to store a bitmask covering all 256 values.
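(Just to illustrate how small that exact version is, a rough Python sketch, not taken from the post:)

    # Exact count of distinct byte values using a 256-bit (32-byte) bitmask.
    def count_distinct_bytes(stream):
        seen = 0  # conceptually a 256-bit bitmask, one bit per possible byte value
        for b in stream:
            seen |= 1 << b
        return bin(seen).count("1")

    print(count_distinct_bytes(b"hello world"))  # -> 8 distinct byte values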
"Maintain a single integer xMin
[...]
expected value would be 1/(xMin+1)"
With xMin a nonnegative integer, that estimate (of the number of different values in the stream) would never get larger than 1. The code is correct; it multiplies by N to get a number between 1 and N (inclusive).
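For reference, my reading of the estimator is roughly this (a Python sketch of the idea as described, not the author's actual code):

    import random

    # Minimum-tracking cardinality estimate: for n distinct uniform integers in
    # [0, N), the smallest value xMin is around N / (n + 1) on average, so
    # N / (xMin + 1) is roughly n (the +1 also guards against xMin == 0).
    def estimate_distinct(stream, N):
        x_min = None
        for x in stream:
            if x_min is None or x < x_min:
                x_min = x
        return N / (x_min + 1) if x_min is not None else 0

    N = 10**9
    stream = [random.randrange(N) for _ in range(100_000)]
    # A single minimum is very noisy; expect the estimate to be off by a
    # sizable factor on any one run.
    print(estimate_distinct(stream, N))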
Thanks for the feedback. I updated the post a bit, but I'm not sure what to use in place of "size." Maybe "cardinality," but that's not a very appealing word for a title. Suggestions welcome :)
It shouldn't; the algorithm as described works for arbitrary data. The whole point of the hashing is to transform the stream of arbitrary data into a stream of uniformly distributed random integers, so that you can then apply the reasoning from that paragraph.
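Something along these lines (a sketch only; I'm using Python's hashlib here, not whatever hash function the post actually uses):

    import hashlib

    # Hash each item to a uniformly distributed 64-bit integer, then apply the
    # same minimum-tracking estimate as for already-uniform integers.
    def estimate_distinct(items):
        N = 2 ** 64
        x_min = N
        for item in items:
            digest = hashlib.sha256(item.encode()).digest()
            x = int.from_bytes(digest[:8], "big")  # first 8 bytes -> [0, 2^64)
            x_min = min(x_min, x)
        return N / (x_min + 1)

    # Duplicates hash to the same value, so they never lower the minimum further.
    words = [f"user-{i % 5000}" for i in range(100_000)]  # ~5000 distinct items
    print(estimate_distinct(words))  # rough, high-variance estimate near 5000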