
Stavros from TileDB here (Founder and CEO). I thought I'd request some feedback from the community on this blog post. It was only natural for a multi-dimensional array database like TileDB to offer vector (i.e., 1D array) search capabilities, but the team managed to do it very well and the results surprised us. We are just getting started in this ___domain and a lot of new algorithms and features are coming up, so the sooner we get feedback the better.

TileDB-Vector-Search Github repo: https://github.com/TileDB-Inc/TileDB-Vector-Search

TileDB-Embedded (core array engine) Github repo: https://github.com/TileDB-Inc/TileDB

TileDB 101: Vector Search (blog to get kickstarted): https://tiledb.com/blog/tiledb-101-vector-search/
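
For a quick feel of the API, here is a minimal sketch based on the 101 blog above. The parameter names (ingest, index_uri, input_vectors) are illustrative and may differ slightly across versions; the repo README is the source of truth:

    # Hedged sketch: build and query an IVF_FLAT index over random float32 vectors.
    import numpy as np
    from tiledb.vector_search import ingest

    vectors = np.random.rand(10_000, 128).astype(np.float32)
    index = ingest(index_type="IVF_FLAT", index_uri="my_index", input_vectors=vectors)

    # Retrieve the 10 approximate nearest neighbors for two query vectors
    # (the exact return shape depends on the version -- typically distances and ids).
    results = index.query(np.random.rand(2, 128).astype(np.float32), k=10)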


Hi folks, Stavros from TileDB here. Here are my two cents on tabular data. TileDB (Embedded) is a very serious competitor to Parquet, which is IMO the only other sane choice for storing large volumes of tabular data (especially when combined with Arrow). Admittedly, we haven't been advertising TileDB's tabular capabilities, but that's only because we were busy with much more challenging applications, such as genomics (population and single-cell), LiDAR, imaging and other domains that are very convoluted from a data format perspective.

Similar to Parquet:

* TileDB is columnar and comes with a lot of compressors, checksum and encryption filters.

* TileDB is built in C++ with multi-threading and vectorization in mind

* TileDB integrates with Arrow, using zero-copy techniques

* TileDB has numerous optimized APIs (C, C++, C#, Python, R, Java, Go)

* TileDB pushes compute down to storage, similar to what Arrow does

Better than Parquet:

* TileDB is multi-dimensional, allowing rapid slicing on multi-column conditions

* TileDB builds versioning and time traveling into the format (no need for Delta Lake, Iceberg, etc.); see the sketch right after this list

* TileDB allows for lock-free parallel writes / parallel reads with ACID properties (no need for Delta Lake, Iceberg, etc)

* TileDB can handle more than tables, for example n-dimensional dense arrays (e.g., for imaging, video, etc)
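
To make the versioning/time-traveling bullet concrete, here is a minimal sketch with the TileDB Python API. The array name and values are made up, and the exact fragment-info accessors may vary slightly across tiledb-py versions:

    import numpy as np
    import tiledb

    uri = "versioned_example"  # hypothetical local array URI

    # A trivial 1D dense array with a single int32 attribute.
    dom = tiledb.Domain(tiledb.Dim(name="i", domain=(0, 9), tile=10, dtype=np.uint64))
    schema = tiledb.ArraySchema(domain=dom, sparse=False,
                                attrs=[tiledb.Attr(name="a", dtype=np.int32)])
    tiledb.Array.create(uri, schema)

    # Each write creates a new immutable, timestamped fragment; nothing is overwritten.
    with tiledb.open(uri, "w") as A:
        A[:] = np.zeros(10, dtype=np.int32)
    with tiledb.open(uri, "w") as A:
        A[:] = np.ones(10, dtype=np.int32)

    # Time travel: open the array as of the first write and read the zeros back.
    first_write = tiledb.array_fragments(uri)[0].timestamp_range
    with tiledb.open(uri, "r", timestamp=first_write) as A:
        print(A[:]["a"])  # -> the zeros from the first write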

Useful links:

* Github repo (https://github.com/TileDB-Inc/TileDB)

* TileDB Embedded overview (https://tiledb.com/products/tiledb-embedded/)

* Docs (https://docs.tiledb.com/)

* Webinar on why arrays as a universal data model (https://tiledb.com/blog/why-arrays-as-a-universal-data-model)

Happy to hear everyone’s thoughts.


Disclosure: I am the author of this article and the Founder/CEO of TileDB. The article touches on some important and rather thought-provoking issues around data management, so I thought I'd get some feedback from the HN community.


TileDB Embedded is a storage engine like HDF5, with the following differentiators: (1) it is cloud-native, (2) it also supports sparse arrays, (3) it offers rapid updates, and (4) it has data versioning and time traveling built into its format. TileDB Cloud (our cloud SaaS solution) further allows you to see which arrays you own in the cloud and which ones you share with others, along with full access logs. You can also attach arbitrary descriptions and metadata that you can search on, and even find and access public datasets posted by you or others.


TileDB and Hail are rather complementary. We have customers that use TileDB to store and manage their variants, and Hail to perform GWAS (by exporting from TileDB to Hail's format). We are currently designing a tighter integration with Hail. This expands on our vision for a universal data engine that integrates with pretty much everything out there and does not lock you into a single framework (e.g., Spark).


That was our feeling about the two products as well; the limitations with TileDB-VCF, though, sort of forced our hand. I was (and still am) of the opinion that TileDB would be a good variant store, since it does so many of the things we want and does them well.


Stavros from TileDB here. Great description of the genomics use case for TileDB. We'd be interested in learning what limitations you've found. Happy to discuss over email as well ([email protected]).


Hey Stavros. We were looking for a data store to integrate into a clinical genomics LIMS that supports in-system analysis. We deal with de novo sequenced clinical samples (not genotyped samples, which seems to be what TileDB-VCF had in mind?). There are some edge cases that TileDB-VCF explicitly disallows (updates/reinserts of the same sample, overlapping variants) that are not edge cases for us but rather common occurrences.


This is an API issue with TileDB-VCF. The core TileDB library supports inserts/appends/overwrites without issues and we just need to expose those operations in the TileDB-VCF APIs. Added to our backlog, thanks!


Folks, apologies, but I think we got a bit sidetracked here. TileDB does not suffer from the consistency issues mentioned above.

Here is how TileDB performs a new (potentially concurrent with other reads and writes) write:

- It creates a fragment folder (or "prefix" of a set of objects on S3 - there are no "folders" on S3), which is timestamped and carries a UUID. This fragment is self-contained and represents the entire write (e.g., all cells and all attribute values).

- It writes all data objects under the fragment prefix. Note that TileDB never updates in place; it always writes new immutable objects.

- After all the PUT requests succeed for the data objects, it creates an empty "ok" object.

Here is how TileDB performs a (potentially concurrent with other reads and writes) read:

- It lists the array prefix to get the ok objects

- There are two cases:

1. The ok object is not there for some fragment. That fragment is completely ignored.

2. The ok object is there. Since TileDB writes the ok object last, all the data objects it wrote have been committed and are all visible with GET requests. TileDB reads the data objects only with GET requests (not ListObjects requests). Due to S3's read-after-write consistency model (https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction...), all those objects will be available for reading (now in all S3 regions) with GET, and there will be no errors.

Therefore, TileDB follows the eventual consistency model of S3 without any errors or surprises. The user doesn't need to handle anything. Our customers have been using TileDB in production for a long time, storing hundreds of TBs of data on S3, and no consistency issue has ever come up.
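
For intuition, here is a schematic sketch of that commit protocol in Python with boto3. It mirrors the steps above in spirit only; the actual TileDB object naming and layout are different:

    import time
    import uuid
    import boto3

    s3 = boto3.client("s3")
    BUCKET, ARRAY = "my-bucket", "my_array"  # hypothetical names

    def write_fragment(objects):
        # One self-contained, timestamped fragment with a UUID; objects are immutable.
        frag = f"{ARRAY}/fragment_{int(time.time() * 1000)}_{uuid.uuid4().hex}"
        for name, data in objects.items():
            s3.put_object(Bucket=BUCKET, Key=f"{frag}/{name}", Body=data)
        # Only after every data PUT succeeds is the empty "ok" marker written.
        s3.put_object(Bucket=BUCKET, Key=f"{frag}.ok", Body=b"")

    def committed_fragments():
        # Readers trust only fragments whose "ok" marker exists (pagination omitted).
        resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{ARRAY}/")
        keys = [obj["Key"] for obj in resp.get("Contents", [])]
        return [k[: -len(".ok")] for k in keys if k.endswith(".ok")]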

Summarizing, what xyzzy_plugh is raising here is that TileDB does not have ACID guarantees. That is true (we never claimed the contrary) and intentional. We are building a transactional layer outside of the storage engine. The reason is that this transactional layer needs to be a constantly running distributed service, whereas we want the TileDB storage engine to be embeddable and usable without performance regression even by applications that do not need ACID (that is, the majority of data science applications).


Stavros from TileDB here. Here is a more verbose explanation. Up until 2.0, TileDB was already powerful for the main applications we targeted: geospatial and genomics. The support for both dense and sparse arrays, and the way it handles data versioning, made it quite unique vs. HDF5 and Zarr. But we noticed that most of the data scientists we were working with had a lot of data beyond genomic variants, LiDAR points and rasters. They had tons of dataframes. And they were using at least two storage engines: TileDB for arrays, and Parquet or a relational database for dataframes. If you are in a large organization, this is a big pain.

In TileDB 2.0 we made a huge refactoring to support something seemingly simple: dimensions in sparse arrays that can have different types and can even be strings. This allowed us to model any dataframe as a sparse array, effectively making TileDB act as a primary multi-dimensional index. In relational database terms, this means that your data is sorted on disk in an order that enormously favors your multi-column slicing, so range searches become rapid.

Therefore, what we are telling the community with this release is that you can have dense arrays, sparse arrays and dataframes in a single embeddable library that integrates with pretty much every data science tool out there, so that data scientists never have to worry about backends, files, updates, or anything other than their scientific analysis. In other words, we believe the future of data science is more science.


Show us code usage examples please!


Full developer docs here: https://docs.tiledb.com/main/

Specific dataframe examples coming up shortly.
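
In the meantime, here is a minimal sketch with the TileDB Python API of the heterogeneous-dimension feature described above: a dataframe with a datetime dimension and a string dimension, modeled as a 2D sparse array (the URI and values are made up):

    import numpy as np
    import tiledb

    uri = "stock_prices"  # hypothetical array URI

    # Two heterogeneous dimensions: a datetime and a variable-length string.
    dom = tiledb.Domain(
        tiledb.Dim(name="Date", dtype="datetime64[D]",
                   domain=(np.datetime64("2000-01-01"), np.datetime64("2030-01-01")),
                   tile=np.timedelta64(365, "D")),
        tiledb.Dim(name="Stock", dtype="ascii", domain=(None, None), tile=None),
    )
    schema = tiledb.ArraySchema(domain=dom, sparse=True,
                                attrs=[tiledb.Attr(name="Price", dtype=np.float64)])
    tiledb.Array.create(uri, schema)

    # Writing dataframe rows is a sparse write on the (Date, Stock) coordinates.
    with tiledb.open(uri, "w") as A:
        dates = np.array(["2020-01-02", "2020-01-02"], dtype="datetime64[D]")
        stocks = np.array(["MSFT", "AAPL"])
        A[dates, stocks] = {"Price": np.array([160.62, 75.09])}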


Thanks for the post!


Stavros from TileDB here. TL;DR, it's all about fast slicing on multiple columns while supporting updates, locally or in the cloud.

Suppose you serialize your dataframe in HDF5 or Redis. Let your dataframe have schema (Date, Stock, Price). Assume this dataframe is 1 TB in size and stored on S3, GCS or Azure (as they are cheap). How would you efficiently perform an average query on Price for a specific Date range and Stock symbol? With HDF5 you would have to download the whole 1 TB (there is no notion of "fast slicing on variable predicates") and apply the predicates locally. If you stored the dataframe in Parquet (a better choice for this use case), you would be able to build some logic in your code that uses the Parquet metadata/indexes to prune a lot of unnecessary information (as Spark does). However, Parquet is "one-dimensional": your pruning would be efficient on Date, but not on Stock (you'd have to "partition" your Parquet files with Spark or Hive, and things could get quite complicated). Most importantly, you wouldn't be able to update the Parquet files; you would have to generate new files and build a catalog on top (or use services like Delta Lake) to manage them. And this is an extremely cumbersome task.

TileDB abstracts everything for you, while allowing you to slice fast on any number of columns. You just define Date and Stock as "dimensions", and slicing on both those columns becomes uber efficient locally or in the cloud. Effectively, you turn this dataframe into a sparse 2D array. Updates and time traveling are handled by TileDB. You get to use Spark, Dask, MariaDB and PrestoDB as you did before, but there is no need for Hive, Delta Lake or any other cataloging service. Thank you for pointing out the confusion though. We just launched and we have tons of examples coming up.
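
Here is a hedged sketch of what that slicing looks like with the TileDB Python API, assuming a sparse array with Date and Stock dimensions and a Price attribute (the URI is hypothetical):

    import numpy as np
    import tiledb

    # Only the tiles relevant to the slice are fetched from S3, not the whole 1 TB.
    with tiledb.open("s3://my-bucket/stock_prices", "r") as A:
        res = A.multi_index[np.datetime64("2020-01-02"):np.datetime64("2020-03-31"), "MSFT"]
        print(res["Price"].mean())  # average Price over the Date range for one Stock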


Stavros, thanks for the explanation. How does TileDB avoid downloading the entire matrix when slicing locally? Do you achieve this by breaking a big matrix down into a set of smaller ones, so that you only download the subset that the current query needs? If so, what measures are in place to avoid a mismatch in the metadata that links them together (e.g., from some error while uploading to S3)? Thanks!


Efficient slicing happens because of "tiling", hence the name TileDB. A tile is similar to an HDF5 or Zarr "chunk", or more loosely to a Parquet page. Although fully configurable, tiling is handled solely by TileDB; the user doesn't need to know about it. A tile is the atomic unit of IO and compression. TileDB maintains all the necessary metadata and indexing built into its format and, given a query, it knows how to fetch only the tiles that might include results. The tiles are decompressed in memory and filtered further for the actual results. The dense array case is rather straightforward. The sparse case is a big differentiator for TileDB and is quite challenging, especially in the presence of updates. TileDB handles the sparse case via bulk-loaded R-trees for multi-dimensional indexing, and via an LSM-tree-like approach with immutable objects that allows time traveling.

Concerning your point on potential errors occurring on S3, this is addressed by TileDB's immutable-object approach. If an error occurs during some write, there will be no array corruption. Happy to discuss this topic in a separate thread.

Some related docs:

https://docs.tiledb.com/main/performance-tips/choosing-tilin...

https://docs.tiledb.com/main/basic-concepts/tile-filters#til...

https://docs.tiledb.com/main/basic-concepts/definitions/frag...
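
For a concrete feel of these knobs, here is a minimal sketch with the TileDB Python API showing how the tile extent and a per-attribute filter (compression) pipeline are declared in the schema (the names are illustrative):

    import numpy as np
    import tiledb

    # The tile extent controls the atomic unit of IO and compression along a dimension.
    dim = tiledb.Dim(name="rows", domain=(0, 999_999), tile=10_000, dtype=np.uint64)

    # Each attribute gets its own filter pipeline, e.g., byte-shuffle then Zstd.
    attr = tiledb.Attr(
        name="values",
        dtype=np.float32,
        filters=tiledb.FilterList([tiledb.ByteShuffleFilter(), tiledb.ZstdFilter(level=5)]),
    )
    schema = tiledb.ArraySchema(domain=tiledb.Domain(dim), sparse=False, attrs=[attr])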

