Ask HN: How to Deal with CSVs?
3 points by nullbytesmatter on April 18, 2022 | 16 comments
I have a massive CSV (~1.6TB) and I need to merge rows that reference the same data. What are your goto tools for such tasks/data?



Assuming you want to do this on a single machine with finite memory, a feasible approach would be:

1. Split into multiple shards using e.g. `split`, where each piece is small enough to fit comfortably into memory.

2. Sort each shard. You could use something like `zsv` (sql sub-command, https://github.com/liquidaty/zsv) or `xsv` (sort subcommand, https://github.com/BurntSushi/xsv) for this.

3. Merge-sort all shards, at the same time, into a single output table, de-duping and/or merging in the process. (If you have too many shards for your operating system to allow that many files to be open at once, you might need an intermediate merge to consolidate shards.)

For #3, I don't know of a tool that does this; one probably exists for simple de-duping, but it may be harder if you need support for merge logic. If a tool for this does not exist, I would imagine `zsv` could quite easily be extended to handle it.
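
For what it's worth, step 3 is easy to sketch in Python: `heapq.merge` does a lazy k-way merge over the sorted shards, holding roughly one row per shard in memory. Assuming the merge key is the first column and using a placeholder merge rule:

    import csv
    import glob
    import heapq

    def read_rows(path):
        # Stream rows from one pre-sorted shard.
        with open(path, newline="") as f:
            yield from csv.reader(f)

    def merge_rows(a, b):
        # Placeholder merge rule: prefer non-empty fields from the later row.
        return [y if y else x for x, y in zip(a, b)]

    shards = sorted(glob.glob("shard_*.sorted.csv"))  # output of steps 1 and 2
    streams = [read_rows(p) for p in shards]

    with open("merged.csv", "w", newline="") as out:
        writer = csv.writer(out)
        current = None
        # heapq.merge does a lazy k-way merge over all shard streams.
        for row in heapq.merge(*streams, key=lambda r: r[0]):
            if current is None:
                current = row
            elif row[0] == current[0]:   # same key as the previous row: fold it in
                current = merge_rows(current, row)
            else:
                writer.writerow(current)
                current = row
        if current is not None:
            writer.writerow(current)

Note that this keeps one file handle open per shard, which is where the open-files limit mentioned above comes in.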


Put it into a database; most databases can import CSV files directly.

From there you can start creating SELECT queries, and depending on how much processing you need to do, you can create intermediate views for multiple steps. After that you can export the data directly to CSV.

https://www.mysqltutorial.org/import-csv-file-mysql-table/

https://www.mysqltutorial.org/mysql-export-table-to-csv/

https://www.sqlitetutorial.net/sqlite-import-csv/

https://www.sqlitetutorial.net/sqlite-export-csv/
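
For SQLite specifically, a minimal sketch of the import, query, export flow from Python (the table name, a key column called `id`, and the GROUP BY de-dupe are all placeholder assumptions; the links above cover the native import/export commands):

    import csv
    import sqlite3

    con = sqlite3.connect("data.db")
    cur = con.cursor()

    # Import: stream the CSV into a table in one pass (no full load into memory).
    with open("big.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        cur.execute(f"CREATE TABLE IF NOT EXISTS rows ({cols})")
        cur.executemany(f"INSERT INTO rows VALUES ({marks})", reader)
    con.commit()

    # Process: an intermediate view; SQLite's bare-column GROUP BY keeps one
    # arbitrary row per id, so any real merge logic would go into this query.
    cur.execute("CREATE VIEW IF NOT EXISTS deduped AS SELECT * FROM rows GROUP BY id")

    # Export the result back to CSV.
    with open("out.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(cur.execute("SELECT * FROM deduped"))
    con.close()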


I managed to "dedupe and merge" by importing the CSV into SQLite and adding an index on a column, then scanning the index by distinct values, doing an index lookup for each key, and merging the rows.

Not ideal at all, but it seems fast enough. I calculated it will complete in ~50 hours.


I was wrong, it took around 5 days, but it finished!
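
For reference, that index-scan-and-merge loop might look something like this in Python (table and column names follow the SQLite sketch above and are placeholders; the merge rule is an assumption):

    import csv
    import sqlite3

    con = sqlite3.connect("data.db")
    scan_cur, fetch_cur = con.cursor(), con.cursor()
    fetch_cur.execute("CREATE INDEX IF NOT EXISTS idx_id ON rows (id)")

    with open("merged.csv", "w", newline="") as out:
        writer = csv.writer(out)
        # Walk the index once per distinct key, then pull and fold that key's rows.
        for (key,) in scan_cur.execute("SELECT DISTINCT id FROM rows ORDER BY id"):
            group = fetch_cur.execute("SELECT * FROM rows WHERE id = ?", (key,)).fetchall()
            merged = list(group[0])
            for row in group[1:]:
                # Placeholder merge rule: prefer non-empty fields from later rows.
                merged = [b if b else a for a, b in zip(merged, row)]
            writer.writerow(merged)
    con.close()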


I would write a script in PHP to iterate over the CSV (though most languages would work here). In the first pass, scan for the rows that should be merged. On the second pass, when a merged row is encountered, write it if it hasn't been written yet. If you do this, just be sure to make a demo or two first to ensure that the behavior is what you expect. Also make use of buffering, since reading and writing a file line by line will probably be very slow compared to reading it in blocks at a time.
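
A rough sketch of the two-pass idea (in Python here, but as noted, most languages would work; the merge key being the first column and the merge rule are assumptions). Pass 1 still keeps one counter per key in memory, which is the limitation that comes up in the replies:

    import csv
    from collections import Counter

    KEY = 0  # assumed: the merge key is the first column

    def fold(a, b):
        # Placeholder merge rule: prefer non-empty fields from the later row.
        return [y if y else x for x, y in zip(a, b)]

    # Pass 1: count how often each key occurs (counts only, not whole rows).
    counts = Counter()
    with open("big.csv", newline="") as f:
        for row in csv.reader(f):
            counts[row[KEY]] += 1

    # Pass 2: fold duplicated keys as their rows stream past; emit the merged
    # row once, when the key's last occurrence has been seen.
    pending, remaining = {}, dict(counts)
    with open("big.csv", newline="") as f, open("out.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for row in csv.reader(f):
            k = row[KEY]
            if counts[k] == 1:
                writer.writerow(row)
                continue
            pending[k] = fold(pending[k], row) if k in pending else row
            remaining[k] -= 1
            if remaining[k] == 0:
                writer.writerow(pending.pop(k))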


Interesting idea to do two passes and build a merge list ahead of time. I am not sure if there's enough memory to build up such a list, but it's worth a try.

Edit: I tried, and the process got OOM-killed by the OS. Apparently 64GB of RAM isn't enough to key off the unique column.


If you are finding that many matches that early, it might help to do multiple passes. On the first pass, identify up to X duplicates. Once you have X duplicates or reach the end of the file, iterate over the CSV and merge those rows. Then start again, identifying a new set of duplicates and applying those updates, over and over, until the file has no more duplicates.


I once created a simple script in NodeJS to merge two or more CSV files and remove duplicates. You might find it useful.

https://www.npmjs.com/package/merge-csv-keepass


Your script parses the entire CSV and then dedupes in memory. This would require ~3TB+ RAM.


Seems like a naive approach will quickly run out of memory trying to detect duplicates. When possible, I would try to put the data into your favorite database system; a lot of optimization and knowledge about handling such things is built into those systems.


How about this: for each row, compute an 8-bit hash and write the row to a file whose name is the hash value. Now you have 256 files, each of which you can dedupe in memory (and if not, use a 10-bit hash or whatever).
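
A rough sketch of that, hashing the merge key rather than the whole row so that rows which need merging always land in the same bucket (the key column and merge rule are assumptions):

    import csv
    import hashlib

    KEY = 0        # assumed: the merge key is the first column
    BUCKETS = 256  # 8-bit hash -> 256 bucket files

    def bucket_of(key):
        # Stable 8-bit hash of the merge key (hash the key, not the whole row,
        # so rows that need merging always land in the same bucket).
        return hashlib.md5(key.encode()).digest()[0]

    # Phase 1: partition the big file into 256 bucket files in a single pass.
    buckets = [open(f"bucket_{i:03d}.csv", "w", newline="") for i in range(BUCKETS)]
    writers = [csv.writer(f) for f in buckets]
    with open("big.csv", newline="") as f:
        for row in csv.reader(f):
            writers[bucket_of(row[KEY])].writerow(row)
    for f in buckets:
        f.close()

    # Phase 2: each bucket is roughly 1/256 of the data, so merge it in memory.
    with open("merged.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for i in range(BUCKETS):
            groups = {}
            with open(f"bucket_{i:03d}.csv", newline="") as f:
                for row in csv.reader(f):
                    k = row[KEY]
                    if k in groups:
                        # Placeholder merge rule: prefer non-empty later fields.
                        groups[k] = [y if y else x for x, y in zip(groups[k], row)]
                    else:
                        groups[k] = row
            writer.writerows(groups.values())

If a bucket still doesn't fit in memory, widen the hash as suggested (10 bits gives 1024 files).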


I tried something similar, writing each entry to a trie on the filesystem and storing each duplicate row there. The problem was that this created a heck of a mess and was taking too long, but it seems feasible.


Are you asking how to do this with a single machine, or how to do this assuming you have access to cloud resources?


Stream it thru Kafka, Redis or Elasticache or something pub-sub like that and dump the resulting set?


Convert to parquet and use spark.
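
A rough PySpark sketch of that, assuming a key column named `id` and a first-non-null merge rule (both assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("csv-merge").getOrCreate()

    # One-time conversion: CSV -> Parquet (columnar, compressed, splittable).
    spark.read.csv("big.csv", header=True).write.parquet("big.parquet")

    df = spark.read.parquet("big.parquet")

    # Merge rows sharing the same key; 'id' and the first-non-null rule are placeholders.
    merged = df.groupBy("id").agg(
        *[F.first(c, ignorenulls=True).alias(c) for c in df.columns if c != "id"]
    )

    merged.write.parquet("merged.parquet")
    # Or back to CSV if that's the required output format:
    merged.write.csv("merged_csv", header=True)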


Split it into smaller files.



