I don't see any mention of sqlite. Is a sqlite file not the same thing they're t...

nerdponx · on Jan 18, 2024

This is more about arranging files in cloud storage than about the engine used to query them.

You could store your data in a SQLite database, but that's not really interoperable in the way a bunch of Parquet files are. Source: tried it.

crabbone · on Jan 18, 2024

Here's one more problem not mentioned by others: SQLite has some functions that it declares but doesn't define. For example, regexp(). So, if your table is defined as having a constraint with regexp()... well, sucks to be you: it might depend on what regexp() implementation you load whether that constraint can be applied or not.

I know this because I needed this functionality in Ada code, which links statically with SQLite, and I didn't want to also link with PCRE library just to get regexp(), especially since GNATCOLL already, sort of has regexp... except it doesn't have the "fancy" features, s.a. lookaheads / lookbehinds / Unicode support etc.

So, a table that uses regexp() in its constraint definition isn't portable. But, if you only use the table data without the schema you lose a lot of valuable information...

----

Also, come to think about it: unlike server-client databases (eg. MySQL, PostgreSQL etc) SQLite doesn't have a wire-transfer format. Its interface returns values in the way C language understands them. The on-disk binary format isn't at all designed for transfer because it's optimized for access efficiency. This, beside other things, results in SQLite database file typically having tons of empty space, it doesn't use efficient (compressed) value representation etc.

So, trying to use SQLite format for transferring data isn't going to be a good arrangement. It's going to be wasteful and slow.

lalaland1125 · on Jan 18, 2024

The main disadvantage of sqlite compared to these other formats is that sqlite is designed for single machine processing. This can be problematic for huge datasets or complex queries.

These other formats easily support using clusters to process your data.

dist-epoch · on Jan 18, 2024

A sqlite file is no more single machine processing than a parquet file.

CyberDildonics · on Jan 18, 2024

If there are no writers it doesn't matter how many readers there are.

benjaminwootton · on Jan 18, 2024

I agree that sqlite has a number of similar benefits - openness, table abstractions and concurrent transactions. It’s also a library so close to how delta, iceberg and hudi are implemented.

I’m glad it’s had an uptick in interest recently but I haven’t yet seen it mentioned for analytics yet.

I assume it’s row rather than column oriented?

speedgoose · on Jan 18, 2024

Yes, SQLite is row oriented. It's not a very space efficient format because it also doesn't support compression or compact representations of numbers in binary.

But it doesn't rely on the JVM and a typical JVM ecosystem. It is a big benefit for some use-cases, like dealing with numerical data on the edge.

hobofan · on Jan 18, 2024

Delta Lake also isn't JVM based. The main implementation is a Rust library with first-party Python bindings.

bbuchert · on Jan 18, 2024

If you are searching for the analytics equivalent you should check out duckdb.

speedgoose · on Jan 18, 2024

DuckDB is about querying existing files or am I missing something?

hobofan · on Jan 18, 2024

The documentation puts a lot of emphasis on loading/querying existing files, as that's the first thing you will want to do for a OLAP use-case, but in general it's the same as sqlite where you maintain a database file.

speedgoose · on Jan 18, 2024

Thanks. That's good to know. The file format is not highlighted in the documentation but I found some information about it: https://duckdb.org/internals/storage.html

It seems to be a bit early to rely on it to store data in an object store, but I will do some tests to compare with SQLite:

> The DuckDB internal storage format is currently in flux, and is expected to change with each release until we reach v1.0.0.

dist-epoch · on Jan 18, 2024

DuckDB file format is the compressed column oriented equivalent of SQLite files.

gigatexal · on Jan 18, 2024

OLAP or column oriented SQLite is DuckDB.

prpl · on Jan 18, 2024

These are metadata layers on top of file formats, expecting to process many files (where schema may change) in typically object storage, with volumes in excess of a petabyte. There is nothing preventing you from using sqlite with iceberg, for example - except implementing it (may involve implementation in your execution engine too). It already supports Parquet, Orc, and Avro (which is more row oriented than column oriented)

Iceberg and Sqlite might be interesting if you wanted to colocate two tables in the same file, for example. A smart enough data access layer with an appropriate execution engine could possibly see the data for both.

lysecret · on Jan 18, 2024

Column orientation is extremely important for query + storage efficiency. Furthermore, you need a distributed query engine (Athena, Bigquery etc.) and they all support parquet.

datadrivenangel · on Jan 18, 2024

DuckDB is SQLite but column oriented!

lysecret · on Jan 18, 2024

Yes we are using DuckDB right now. Love it. Works extremely well if you can partition down your data to manageable (few gbs) size.

ytjohn · on Jan 18, 2024

I'm a fan of datasette.io, which is built around sqlite. Essentially it's a set of tools that you point at a sqlite db to explore datasettes.

- https://datasette.io/examples