I agree that keeping data local is great and should be the first option when possible. It works fine at 10GB or even 100GB, but beyond that, what you optimize for starts to matter, because you begin hitting execution bottlenecks.
To mitigate these bottlenecks you either buy fancy hardware (e.g. an Oracle appliance) or you scale out (and get TCO/performance gains from separating storage and compute, which is how Snowflake pitched itself as roughly 3x cheaper than appliances when it launched).
I believe Trino on HDFS would finish faster than awk on 6 enterprise disks for 6TB of data.
In conclusion, I would say we should keep data local if possible, but 6TB is getting into the realm where Big Data tech starts to be useful if you run such workloads often.
I wouldn't underestimate how much a modern machine with plenty of RAM and SSDs can do vs HDFS. This post[1] is now 10 years old and has find + awk running an analysis in 12 seconds (at roughly the sequential read speed of the author's hard drive) vs Hadoop taking 26 minutes. I've had similar experiences with much bigger datasets at work (think years of per-second manufacturing data across tens of thousands of sensors).
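To make the local-first approach concrete, here is a minimal sketch in the spirit of the find + awk pipeline from the linked post. The directory, filenames, and column layout (sensor_id,value) are my own hypothetical stand-ins, not the post's actual data:

```shell
# Hypothetical sample of per-second sensor readings: sensor_id,value
mkdir -p /tmp/sensor_demo
printf 's1,10\ns2,20\ns1,30\ns2,40\n' > /tmp/sensor_demo/day1.csv
printf 's1,50\ns2,60\n'               > /tmp/sensor_demo/day2.csv

# Stream every file through a single awk pass: one sequential read,
# no cluster, no scheduler overhead. awk keeps running sums and counts
# per sensor in associative arrays and prints averages at the end.
find /tmp/sensor_demo -name '*.csv' -print0 |
  xargs -0 cat |
  awk -F, '{ sum[$1] += $2; n[$1]++ }
           END { for (s in sum) printf "%s avg=%.1f\n", s, sum[s] / n[s] }' |
  sort
```

On a machine whose disks can feed awk at full sequential speed, this pattern is exactly what beats a small Hadoop cluster on modest data: the whole job is one pass with no shuffle or JVM startup cost.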
I get that the post only covers 3.5GB, but consumer SSDs are now much faster: 7.5GB/s vs the ~270MB/s HDD from when the article was written. Even with only mildly optimised solutions, people are churning through the One Billion Row Challenge (~12GB) in seconds as well. And if you have the data in memory (not impossible), your bottleneck won't even be read speed.
> I agree that keeping data local is great and should be the first option when possible. It works fine at 10GB or even 100GB, but beyond that, what you optimize for starts to matter, because you begin hitting execution bottlenecks.
The point of the article is that 99.99% of businesses never pass even the 10GB point, though.