I'm trying to wrap my head around embeddings and am not sure I understand how "realtime" embeddings will be in a real world application.
For example, if you use embeddings, do you need to know "a priori" what you want to search for? In other words, if you don't know your search queries up front, you have to generate the embeddings, store them inside the database, and then use them. The first step requires an API call to a commercial company (OpenAI here), running against a private model, via a service (which could have downtime, etc). (I imagine there are other embedding technologies but then I would need to manage the hardware costs of training and all the things that come with running ML models on my own.)
Compare that to a regular LIKE search inside my database: I can do that with just a term that a user provides, without preparing my database beforehand; the database has native support for finding that term in whatever column I choose. Embeddings seem much more powerful in that you can search for something in a much fuzzier way using the cosine distance of the embedding vectors, but they require me to generate and store those embeddings first.
Am I wrong about my assumptions here?
1. Create embeddings of your db entries by running them through a neural network in inference mode, and save them in your database in vector format.
2. Convert your query to an embedding by running it through the same network in inference mode.
3. Perform a nearest-neighbor search of your query embedding against your db embeddings. There are also libraries and databases optimized for this, for example FAISS from Meta/FB [1].
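Here is a minimal sketch of those three steps in Python, using FAISS for the nearest-neighbor part. The embed() helper is a placeholder for whatever embedding API or local model you would actually call; here it just returns pseudo-random vectors so the sketch runs end to end (real results obviously need a real model):

    import numpy as np
    import faiss  # pip install faiss-cpu

    DIM = 384  # arbitrary embedding size for this sketch

    def embed(text: str) -> np.ndarray:
        # Placeholder for the real call (OpenAI API, local model, ...).
        # Returns a pseudo-random vector so the example is self-contained.
        rng = np.random.default_rng(abs(hash(text)))
        return rng.standard_normal(DIM).astype("float32")

    # 1. Embed the db entries once and put them in a vector index.
    entries = ["tango classes on Friday", "tuning postgres queries", "salsa night downtown"]
    vectors = np.stack([embed(e) for e in entries])
    faiss.normalize_L2(vectors)            # normalize so inner product == cosine similarity
    index = faiss.IndexFlatIP(DIM)
    index.add(vectors)

    # 2. Embed the query at search time.
    query = embed("dancing")[None, :].copy()
    faiss.normalize_L2(query)

    # 3. Nearest-neighbor search: similarity scores plus row ids back into `entries`.
    scores, ids = index.search(query, 2)
    print([(entries[i], float(s)) for i, s in zip(ids[0], scores[0])])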
So if your network is already trained, or you use something like OpenAI for embeddings, it can still be done in near real time; just think of getting your embedding vector as part of the indexing process.
You can do more things too, like cluster your embeddings db to find similar entries.
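As a sketch of that last idea, clustering the stored vectors with plain k-means already gives you groups of similar entries (the random vectors below stand in for the embedding matrix you would load from your db):

    import numpy as np
    from sklearn.cluster import KMeans  # pip install scikit-learn

    # Stand-in for the (n_entries, dim) embedding matrix loaded from your db.
    embeddings = np.random.default_rng(0).standard_normal((1000, 384)).astype("float32")

    # Entries that end up with the same label are nearest to the same centroid,
    # i.e. roughly "about the same thing".
    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)

    cluster_0 = np.where(labels == 0)[0]   # row ids of one group of similar entries
    print(len(cluster_0), "entries in cluster 0")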
> if you use embeddings, do you need to know "a priori" what you want to search for?
No. Your embedding is something that represents the data. You initially calculate an embedding for each datapoint with the API (or model) and store them all in an index.
When a user makes a query, it is first embedded by a call to the API. Then you can measure a similarity score between these embeddings with a very simple multiply-and-add operation (a dot product).
More concretely, an embedding is an array of floats, usually between 300 and 10,000 long. To compare two embeddings a and b you compute sum_i(a_i * b_i); the larger this score, the more similar a and b are. If you sort by similarity, you have ranked your results.
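As a toy example with 4-dimensional vectors (real ones are just longer), the scoring and ranking is a couple of lines of numpy:

    import numpy as np

    query = np.array([0.1, 0.9, 0.0, 0.2])   # embedding of the query
    docs = np.array([
        [0.1, 0.8, 0.1, 0.3],                # doc 0: points in a similar direction
        [0.9, 0.0, 0.4, 0.0],                # doc 1: points elsewhere
    ])

    scores = docs @ query                    # sum_i(a_i * b_i) for each doc
    ranking = np.argsort(-scores)            # highest score first
    print(scores, ranking)                   # doc 0 outranks doc 1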
The fun part is when you compare embeddings of text with embeddings of images. They can exist in the same space, and that's how generative AI relates text to images.
You need three things: a query, a corpus of text to be searched against, and a language model that can map text to vectors ("compute the embedding").
Roughly speaking, when you get a query, you compute its embedding (through an API call) and return the parts of your corpus whose embeddings are cosine-close to the query.
In order to do that efficiently, you indeed have to pre-compute all the embeddings of your corpus beforehand and store them in your database.
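In code, that split between an offline pre-compute step and an online query step looks roughly like this (with a hypothetical embed() standing in for the API call):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Hypothetical helper: in practice this is the API call or local model.
        rng = np.random.default_rng(abs(hash(text)))
        return rng.standard_normal(384).astype("float32")

    def normalize(v: np.ndarray) -> np.ndarray:
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    # Offline: pre-compute the corpus embeddings once and store them (db, file, ...).
    corpus = ["tango classes on Friday", "tuning postgres queries", "salsa night downtown"]
    corpus_vecs = normalize(np.stack([embed(d) for d in corpus]))

    # Online: embed the query and return the cosine-closest documents.
    query_vec = normalize(embed("dancing"))
    cosine = corpus_vecs @ query_vec         # cosine similarity, since both sides are unit length
    print([corpus[i] for i in np.argsort(-cosine)[:2]])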
Thank you. So, this does imply that you need to know the target of your embeddings in advance, right?
If I want to know whether my text has something like "dancing" in it, and it actually contains "tango", why wouldn't I just generate a list of synonyms, run each of those queries against my text, and then aggregate the results myself?
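Concretely, the for-each approach I have in mind is something like this, where the synonym list is the part I'd have to write and maintain by hand:

    # Hand-maintained synonym expansion: every concept has to be enumerated up front.
    synonyms = {"dancing": ["dancing", "dance", "tango", "salsa", "waltz"]}

    documents = ["tango classes on Friday", "tuning postgres queries"]

    def matches(doc: str, concept: str) -> bool:
        # LIKE-style substring check for the concept or any of its synonyms.
        return any(term in doc.lower() for term in synonyms.get(concept, [concept]))

    print([doc for doc in documents if matches(doc, "dancing")])  # finds the tango doc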
Is the value here that OpenAI can get me a better list of synonyms than I could do on my own?
If OpenAI were better at generating that list of synonyms, especially with more current data (I need to search for a concept like "fuzzy-text" and want text with "embeddings" to be a positive match!), that would be valuable.
It feels like OpenAI will probably be faster to update their model with current data from the Internet than those synonym lists linked above. Having said that, one of the criticisms of ChatGPT is that it does not have great knowledge of more recent events, right? Don't ask it about the Ukraine war unless you want a completely fabricated result.
That's the value prop of large language models (and here, of OpenAI's LLM): because it's been trained on some "somehow sufficiently large" corpus of data, it has internalized a lot of real world concepts, in a "superficial yet oddly good enough" way. Good enough for it to have embeddings for "dancing" and "tango" that will be fairly close.
And if you really need to, you can also fine-tune your LLM or do few-shot learning to further customise your embeddings to your dataset.
You don’t need to know the target queries. If you compute embeddings of your entries and of your query, you just find which entry embeddings are closest to your query embedding. The advantage over using synonyms is that the embedding is meant to encode the meaning of the content, so that similar embeddings represent similar meaning, and you don’t need to deal with the combinatorial explosion of all the different ways you can say the same thing with different words. (It can also work for other content, like images, or across languages if your network is trained for it.)
But yes, if you ask OpenAI to predict the next set of tokens (which is how chat works), it won’t be up to date with the latest information. But if you’re using it for embeddings this is less of a problem, since language itself doesn’t evolve as quickly, and embeddings are all about encoding the meaning of text, which is likely not going to change much. That’s not to say it can’t: for example, the definition of “transformer” pre-2017 is probably not referring to the “transformer architecture”.
I think what you would do is some long-running “index” process where you dynamically generate embeddings for all changes that get made to text in the database. Databases that support searching large amounts of text probably do a way simpler version of this (i.e. reverse keyword indexes) already. Granted, this does involve a very beefy database.
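A rough sketch of what that indexer could look like, using sqlite and a hypothetical embed() wrapper around the embedding API (the table layout is made up for the example):

    import sqlite3
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Hypothetical wrapper around the embedding API / local model.
        rng = np.random.default_rng(abs(hash(text)))
        return rng.standard_normal(384).astype("float32")

    conn = sqlite3.connect("app.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS docs (
        id INTEGER PRIMARY KEY,
        body TEXT NOT NULL,
        embedding BLOB        -- NULL until the indexer has processed the row
    )""")
    conn.execute("INSERT INTO docs (body) VALUES (?)", ("tango classes on Friday",))
    conn.commit()

    # The long-running "index" step: pick up new or changed rows and fill in
    # their embeddings. In production this would run on a schedule or be driven
    # by change notifications from the database.
    for doc_id, body in conn.execute("SELECT id, body FROM docs WHERE embedding IS NULL").fetchall():
        conn.execute("UPDATE docs SET embedding = ? WHERE id = ?", (embed(body).tobytes(), doc_id))
    conn.commit()

    # At query time the stored blobs are decoded back into vectors for the similarity search.
    blob = conn.execute("SELECT embedding FROM docs WHERE id = 1").fetchone()[0]
    print(np.frombuffer(blob, dtype="float32").shape)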