Hacker News new | past | comments | ask | show | jobs | submit login

A common metric for how much actual content has changed is the Jaccard Index. Even for large numbers of datasets that are too large to fit in memory it can be approximated with various forms of MinHash algorithms. Some write up here: https://blog.nelhage.com/post/fuzzy-dedup/

https://en.wikipedia.org/wiki/Jaccard_index




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: