I'm trying to wrap my head around embeddings and am not sure I understand how "realtime" embeddings will be in a real world application.
For example, if you use embeddings, do you need to know "a priori" what you want to search for? In other words, if you don't know your search queries up front, you have to generate the embeddings, store them inside the database, and then use them. The first step requires an API call to a commercial company (OpenAI here), running against a private model, via a service (which could have downtime, etc). (I imagine there are other embedding technologies but then I would need to manage the hardware costs of training and all the things that come with running ML models on my own.)
Compare that to a regular LIKE search inside my database: I can do that with just a term that a user provides, without preparing my database beforehand; the database has native support for finding that term in whatever column I choose. Embeddings seem much more powerful in that you can search for something in a much fuzzier way using the cosine distance of the embedding vectors, but they require me to generate and store those embeddings first.
Am I wrong about my assumptions here?
1. Create embeddings of your db entries by running them through a neural network in inference mode, and save them in your database in vector format.
2. Convert your query to an embedding by running it through the same network in inference mode.
3. Perform a nearest-neighbor search of your query embedding against your db embeddings. There are also libraries and databases optimized for this, for example FAISS from Meta/FB [1].
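Here is a minimal sketch of those three steps in Python, using FAISS for the nearest-neighbor part. The embed() helper is a placeholder for whatever embedding API or local model you would actually call; here it just returns pseudo-random vectors so the sketch runs end to end (real results obviously need a real model):

    import numpy as np
    import faiss  # pip install faiss-cpu

    DIM = 384  # arbitrary embedding size for this sketch

    def embed(text: str) -> np.ndarray:
        # Placeholder for the real call (OpenAI API, local model, ...).
        # Returns a pseudo-random vector so the example is self-contained.
        rng = np.random.default_rng(abs(hash(text)))
        return rng.standard_normal(DIM).astype("float32")

    # 1. Embed the db entries once and put them in a vector index.
    entries = ["tango classes on Friday", "tuning postgres queries", "salsa night downtown"]
    vectors = np.stack([embed(e) for e in entries])
    faiss.normalize_L2(vectors)            # normalize so inner product == cosine similarity
    index = faiss.IndexFlatIP(DIM)
    index.add(vectors)

    # 2. Embed the query at search time.
    query = embed("dancing")[None, :].copy()
    faiss.normalize_L2(query)

    # 3. Nearest-neighbor search: similarity scores plus row ids back into `entries`.
    scores, ids = index.search(query, 2)
    print([(entries[i], float(s)) for i, s in zip(ids[0], scores[0])])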
So if your network is already trained, or you use something like OpenAI for embeddings, it can still be done in near real time; just think of getting your embedding vector as part of the indexing process.
You can do more things too, like cluster your embeddings db to find similar entries.
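As a sketch of that last idea, clustering the stored vectors with plain k-means already gives you groups of similar entries (the random vectors below stand in for the embedding matrix you would load from your db):

    import numpy as np
    from sklearn.cluster import KMeans  # pip install scikit-learn

    # Stand-in for the (n_entries, dim) embedding matrix loaded from your db.
    embeddings = np.random.default_rng(0).standard_normal((1000, 384)).astype("float32")

    # Entries that end up with the same label are nearest to the same centroid,
    # i.e. roughly "about the same thing".
    labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)

    cluster_0 = np.where(labels == 0)[0]   # row ids of one group of similar entries
    print(len(cluster_0), "entries in cluster 0")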
> if you use embeddings, do you need to know "a priori" what you want to search for?
No. Your embedding is something that represents the data. You initially calculate an embedding for each datapoint with the API (or model) and store them all in an index.
When a user makes a query, it is first embedded by a call to the API. Then you can measure a similarity score between these embeddings with a very simple multiply-and-add operation (a dot product).
More concretely, an embedding is an array of floats, usually between 300 and 10,000 long. To compare two embeddings a and b you compute sum_i(a_i * b_i); the larger this score, the more similar a and b are. If you sort by similarity, you have ranked your results.
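As a toy example with 4-dimensional vectors (real ones are just longer), the scoring and ranking is a couple of lines of numpy:

    import numpy as np

    query = np.array([0.1, 0.9, 0.0, 0.2])   # embedding of the query
    docs = np.array([
        [0.1, 0.8, 0.1, 0.3],                # doc 0: points in a similar direction
        [0.9, 0.0, 0.4, 0.0],                # doc 1: points elsewhere
    ])

    scores = docs @ query                    # sum_i(a_i * b_i) for each doc
    ranking = np.argsort(-scores)            # highest score first
    print(scores, ranking)                   # doc 0 outranks doc 1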
The fun part is when you compare embeddings of text with embeddings of images. They can exist in the same space, and that's how generative AI relates text to images.
You need three things: a query, a corpus of text to be searched against, and a language model that can map text to vectors ("compute the embedding").
Roughly speaking, when you get a query, you compute its embedding (through an API call) and return the parts of your corpus whose embeddings are cosine-close to the query.
In order to do that efficiently, you indeed have to pre-compute all the embeddings of your corpus beforehand and store them in your database.
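In code, that split between an offline pre-compute step and an online query step looks roughly like this (with a hypothetical embed() standing in for the API call):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Hypothetical helper: in practice this is the API call or local model.
        rng = np.random.default_rng(abs(hash(text)))
        return rng.standard_normal(384).astype("float32")

    def normalize(v: np.ndarray) -> np.ndarray:
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    # Offline: pre-compute the corpus embeddings once and store them (db, file, ...).
    corpus = ["tango classes on Friday", "tuning postgres queries", "salsa night downtown"]
    corpus_vecs = normalize(np.stack([embed(d) for d in corpus]))

    # Online: embed the query and return the cosine-closest documents.
    query_vec = normalize(embed("dancing"))
    cosine = corpus_vecs @ query_vec         # cosine similarity, since both sides are unit length
    print([corpus[i] for i in np.argsort(-cosine)[:2]])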
Thank you. So, this does imply that you need to know the target of your embeddings in advance, right?
If I want to know whether my text has something like "dancing" in it, and it actually contains "tango", why wouldn't I just generate a list of synonyms, run each of those queries against my text, and then aggregate the results myself?
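Concretely, the for-each approach I have in mind is something like this, where the synonym list is the part I'd have to write and maintain by hand:

    # Hand-maintained synonym expansion: every concept has to be enumerated up front.
    synonyms = {"dancing": ["dancing", "dance", "tango", "salsa", "waltz"]}

    documents = ["tango classes on Friday", "tuning postgres queries"]

    def matches(doc: str, concept: str) -> bool:
        # LIKE-style substring check for the concept or any of its synonyms.
        return any(term in doc.lower() for term in synonyms.get(concept, [concept]))

    print([doc for doc in documents if matches(doc, "dancing")])  # finds the tango doc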
Is the value here that OpenAI can get me a better list of synonyms than I could do on my own?
If OpenAI were better at generating that list of synonyms, especially with more current data (I need to search for a concept like "fuzzy-text" and want text with "embeddings" to be a positive match!), that would be valuable.
It feels like OpenAI will probably be faster to update their model with current data from the Internet than those synonym lists linked above. Having said that, one of the criticisms of ChatGPT is that it does not have great knowledge of more recent events, right? Don't ask it about the Ukraine war unless you want a completely fabricated result.
That's the value prop of large language models (and here, of OpenAI's LLM): because it's been trained on some "somehow sufficiently large" corpus of data, it has internalized a lot of real world concepts, in a "superficial yet oddly good enough" way. Good enough for it to have embeddings for "dancing" and "tango" that will be fairly close.
And if you really need to, you can also fine-tune your LLM or do few-shot learning to further customise your embeddings to your dataset.
You don’t need to know the target queries. If you compute embeddings of your entries and of your query, you just find which entry embeddings are closest to your query embedding. The advantage over using synonyms is that the embedding is meant to encode the meaning of the content, so that similar embeddings represent similar meaning, and you don’t need to deal with the combinatorial explosion of all the different ways you can say the same thing with different words. (It can also work for other content, like images, or across languages if your network is trained for it.)
But yes, if you ask OpenAI to predict the next set of tokens (which is how chat works), it won’t be up to date with the latest information. But if you’re using it for embeddings this is less of a problem, since language itself doesn’t evolve as quickly, and embeddings are all about encoding the meaning of text, which is likely not going to change much. That’s not to say it can’t: for example, the definition of “transformer” pre-2017 is probably not referring to the “transformer architecture”.
I think what you would do is some long-running “index” process where you dynamically generate embeddings for all changes that get made to text in the database. Databases that support searching large amounts of text probably do a way simpler version of this (i.e. reverse keyword indexes) already. Granted, this does involve a very beefy database.
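A rough sketch of what that indexer could look like, using sqlite and a hypothetical embed() wrapper around the embedding API (the table layout is made up for the example):

    import sqlite3
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Hypothetical wrapper around the embedding API / local model.
        rng = np.random.default_rng(abs(hash(text)))
        return rng.standard_normal(384).astype("float32")

    conn = sqlite3.connect("app.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS docs (
        id INTEGER PRIMARY KEY,
        body TEXT NOT NULL,
        embedding BLOB        -- NULL until the indexer has processed the row
    )""")
    conn.execute("INSERT INTO docs (body) VALUES (?)", ("tango classes on Friday",))
    conn.commit()

    # The long-running "index" step: pick up new or changed rows and fill in
    # their embeddings. In production this would run on a schedule or be driven
    # by change notifications from the database.
    for doc_id, body in conn.execute("SELECT id, body FROM docs WHERE embedding IS NULL").fetchall():
        conn.execute("UPDATE docs SET embedding = ? WHERE id = ?", (embed(body).tobytes(), doc_id))
    conn.commit()

    # At query time the stored blobs are decoded back into vectors for the similarity search.
    blob = conn.execute("SELECT embedding FROM docs WHERE id = 1").fetchone()[0]
    print(np.frombuffer(blob, dtype="float32").shape)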