
IMHO, what the author is trying to say is that running a database on an MPP system (quite common these days, and easy to stand up in a data center or on Amazon) is still a pain.

Outside of Postgres, none of the truly open-source databases scale well with MPPs. You're still stuck with mostly single-core processing.

So you can pay good $$$$$ to a commercial database vendor for a database that may scale to your needs.

Or use Redshift for a great analytics DB at a reasonable price. (Note: we use HBase heavily where I work, but that's comparing apples to oranges.)




You can add Impala to HBase and get a pretty good SQL-based, low-latency analytics solution (if your data is structured so that Impala can take advantage of row key ordering).

* Disclosure: I'm an Apache HBase committer, I've written parts of Impala's HBase integration, and I work at Cloudera.
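To illustrate why row key ordering matters, here is a toy sketch in plain Python. It is not HBase or Impala code; the table, the key layout (`user###date`), and the function names are all made up for illustration. The point is only that when rows are stored sorted by key, a filter on a key prefix can jump straight to a contiguous slice, while any other filter has to look at every row.

```python
import bisect

# Hypothetical toy table: rows kept sorted by row key, as in a key-ordered store.
rows = sorted(
    (f"user{u:03d}#2014-01-{d:02d}", u * d)
    for u in range(1, 4)
    for d in range(1, 6)
)
keys = [k for k, _ in rows]


def range_scan(start_key, stop_key):
    """Return rows with start_key <= key < stop_key, touching only that slice."""
    lo = bisect.bisect_left(keys, start_key)
    hi = bisect.bisect_left(keys, stop_key)
    return rows[lo:hi]


def full_scan(predicate):
    """Fallback when the filter is not on a key prefix: every row is examined."""
    return [r for r in rows if predicate(r[0])]


# '$' is the character right after '#', so this range covers the whole prefix.
by_prefix = range_scan("user002#", "user002$")
by_filter = full_scan(lambda k: k.startswith("user002#"))
```

Both calls return the same five rows, but `range_scan` finds them with two binary searches instead of walking the whole table.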


Apples to oranges how so, if you don't mind me asking? What about HBase makes it different from Redshift/Greenplum/Oracle?


HBase isn't built for analysis workloads. It doesn't have a complex query engine, so you end up having to do massive scans (which aren't especially fast), transfer a ton of data to the node where your analytics code is running, and do the computation there. If things are too big to run on one machine, that's your problem, not HBase's.

On the other hand, Impala, Redshift, Oracle Exadata, etc. let you ask the database to do work at a much higher level. That allows for much better performance because the data storage and computation layers can work in tandem (for example, each storage node can prune down to only the data your query needs before anything hits the network). The database, not the writer of the analysis routine, does the work of optimizing for multiple cores and nodes.
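A rough way to picture the difference, as a toy Python sketch rather than anything resembling real HBase, Impala, or Redshift internals. The `NODES` data and both function names are hypothetical. The first function ships every row to the client before filtering; the second lets each "node" filter and pre-aggregate locally, so only one number per node crosses the "network".

```python
# Hypothetical toy cluster: each "node" holds a list of (region, amount) rows.
NODES = [
    [("us", 10), ("eu", 5), ("us", 7)],
    [("eu", 3), ("ap", 8), ("us", 1)],
    [("ap", 2), ("eu", 9), ("eu", 4)],
]


def client_side_sum(region):
    """Scan-everything style: ship every row to the client, then filter and sum."""
    shipped = [row for node in NODES for row in node]
    total = sum(amount for r, amount in shipped if r == region)
    return total, len(shipped)


def pushdown_sum(region):
    """Pushdown style: each node filters and pre-aggregates, then ships one number."""
    partials = [sum(a for r, a in node if r == region) for node in NODES]
    return sum(partials), len(partials)


total_a, rows_moved_a = client_side_sum("eu")
total_b, values_moved_b = pushdown_sum("eu")
```

Both approaches compute the same total, but the first moves all nine rows while the second moves three partial sums. With real data volumes that gap is what separates a massive scan from a query that stays cheap on the wire.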



