How about this: for each row, compute an 8-bit hash and append the row to a file whose name is the hash value. Now you have 256 files that you can each dedupe in memory (and if a bucket is still too large, use a 10-bit hash or whatever). Since identical rows always hash to the same bucket, deduping each bucket independently dedupes the whole dataset.
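A minimal sketch of the two-phase idea in Python, assuming newline-delimited rows; the file names (`rows.txt`, `buckets/`, `deduped.txt`) and the choice of SHA-1 for the short hash are just illustrative:

```python
import hashlib
import os

def partition(input_path, bucket_dir, bits=8):
    """Phase 1: scatter rows into 2**bits bucket files keyed by a short hash."""
    os.makedirs(bucket_dir, exist_ok=True)
    # 256 simultaneously open handles is well under the default fd limit.
    buckets = [open(os.path.join(bucket_dir, f"{i:03d}"), "w")
               for i in range(2 ** bits)]
    try:
        with open(input_path) as f:
            for row in f:
                # Low `bits` bits of a stable hash; identical rows always
                # land in the same bucket, so per-bucket dedup is global dedup.
                h = int.from_bytes(hashlib.sha1(row.encode()).digest()[:4],
                                   "big") % (2 ** bits)
                buckets[h].write(row)
    finally:
        for b in buckets:
            b.close()

def dedupe(bucket_dir, output_path):
    """Phase 2: each bucket should now fit in memory, so a set suffices."""
    with open(output_path, "w") as out:
        for name in sorted(os.listdir(bucket_dir)):
            seen = set()
            with open(os.path.join(bucket_dir, name)) as f:
                for row in f:
                    if row not in seen:
                        seen.add(row)
                        out.write(row)

partition("rows.txt", "buckets")
dedupe("buckets", "deduped.txt")
```

Bumping `bits` to 10 just means 1024 smaller buckets; the rest is unchanged.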
I tried something similar: writing each entry to a trie on the filesystem and storing each duplicate row there as well. The problem was that it created a heck of a mess and was taking too long, but the approach does seem feasible.