Hacker News new | past | comments | ask | show | jobs | submit login

I find this interesting, but somewhat sloppy writing. Examples:

"Problem: Estimate the number of distinct items in a data stream that is too large to fit in memory"

That is not what the title promises: the size of the stream.

"assume the data in the stream consists of uniformly random integers between zero and N"

I guess the lower bound is inclusive and the upper bound exclusive (that probably doesn't matter in practice. It would matter if you want to estimate the number of different byte values in a data stream on a truly tiny device, but I would guess that anything that can do that division to get the estimate would have the 32 bytes to store a bitmask with 256 values)

"Maintain a single integer xMin [...] expected value would be 1/(xMin+1)"

With xMin a nonnegative integer, that estimate (of the number of different values in the stream) would never get larger than 1. The code is correct; it multiplies by N to get a number between 1 and N (inclusive).




Thanks for the feedback. I updated the post a bit, but I'm not sure what to use in place of "size." Maybe "cardinality" but that's not a very appealing word for a title. Suggestions welcome :)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: