I agree that keeping data local is great and should be the first option when possible. It works fine at 10GB or even 100GB, but beyond that, what you optimize for starts to matter, because you begin hitting execution bottlenecks.
To mitigate these bottlenecks you either buy fancy hardware (e.g. an Oracle appliance) or you scale out (and get TCO/performance gains from separating storage and compute, which is how Snowflake pitched itself as roughly 3x cheaper than appliances when it launched).
I believe Trino on HDFS would finish faster than awk on 6 enterprise disks for 6TB of data.
In conclusion, I would say we should keep data local if possible, but 6TB is getting into the realm where Big Data tech starts to be useful if you run such workloads often.
I wouldn't underestimate how much a modern machine with plenty of RAM and SSDs can do vs HDFS. This post[1] is now 10 years old and has find + awk running an analysis in 12 seconds (at roughly the sequential read speed of the author's hard drive) vs Hadoop taking 26 minutes. I've had similar experiences with much bigger datasets at work (think years of per-second manufacturing data across tens of thousands of sensors).
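To make the local-first approach concrete, here is a minimal sketch in the spirit of the find + awk pipeline from the linked post. The directory, filenames, and column layout (sensor_id,value) are my own hypothetical stand-ins, not the post's actual data:

```shell
# Hypothetical sample of per-second sensor readings: sensor_id,value
mkdir -p /tmp/sensor_demo
printf 's1,10\ns2,20\ns1,30\ns2,40\n' > /tmp/sensor_demo/day1.csv
printf 's1,50\ns2,60\n'               > /tmp/sensor_demo/day2.csv

# Stream every file through a single awk pass: one sequential read,
# no cluster, no scheduler overhead. awk keeps running sums and counts
# per sensor in associative arrays and prints averages at the end.
find /tmp/sensor_demo -name '*.csv' -print0 |
  xargs -0 cat |
  awk -F, '{ sum[$1] += $2; n[$1]++ }
           END { for (s in sum) printf "%s avg=%.1f\n", s, sum[s] / n[s] }' |
  sort
```

On a machine whose disks can feed awk at full sequential speed, this pattern is exactly what beats a small Hadoop cluster on modest data: the whole job is one pass with no shuffle or JVM startup cost.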
I get that the post only covers 3.5GB, but consumer SSDs are now much faster: 7.5GB/s vs the ~270MB/s HDD from when the article was written. Even with only mildly optimised solutions, people are churning through the One Billion Row Challenge (~12GB) in seconds as well. And if you have the data in memory (not impossible), your bottleneck won't even be read speed.
> I agree that keeping data local is great and should be the first option when possible. It works fine at 10GB or even 100GB, but beyond that, what you optimize for starts to matter, because you begin hitting execution bottlenecks.
The point of the article is that 99.99% of businesses never pass even the 10GB point, though.