We've built out a decently complex pipeline for this, but a lot of the magic has to do with the specific embedding model we've trained to know what text is relevant to feed in and what text isn't.
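The comment above describes the general shape of such a pipeline: score candidate text against a query with an embedding and keep only what clears a relevance threshold. The actual model and pipeline aren't public, so here is a minimal sketch of the idea where a toy bag-of-words vector stands in for the trained embedding model (the `embed`, `cosine`, and `filter_relevant` names are all hypothetical):

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector.
    Stand-in for a trained embedding model."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_relevant(chunks, query, threshold=0.2):
    """Keep only chunks whose similarity to the query clears the threshold."""
    q = embed(query)
    return [c for c in chunks if cosine(embed(c), q) >= threshold]
```

In a real pipeline the bag-of-words stand-in would be replaced by the trained model, but the filtering step itself stays this simple: embed, score, threshold.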
This is a really cool tool. Have you considered filtering out known blog-spam / content-mill / SEO'd garbage sites (e.g. GeeksForGeeks, W3Schools, TutorialsPoint)? That would definitely make me jump on this, and even pay for a subscription. I spend way too much time scrolling down Google past all this junk before I hit the official documentation for the module I'm using.
If you use DuckDuckGo, there's the ddg-filter Firefox plugin that lets you block domains. I use it to block exactly the low-quality domains you mention.
Maybe there are similar plugins for other search engines as well...
I don't think they really need to. Maybe for citations, but for training, if the content is the same on site A and site B, it doesn't matter which one it pulled from.
That said, if the content itself is bad, that would be a problem. We'll probably start seeing that: sites designed to poison LLMs.
I know, it's just irritating to have to do that, or have an extension do it. I would be happy to support a search engine that lets me filter out unwanted crud.
Any pointers on how to build custom embeddings? I'm working on a specialized ___domain where words may mean different things than they do in the rest of the world, and I suspect that training my own embeddings would help.
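One starting point, before reaching for word2vec/fastText or fine-tuning a transformer on your ___domain corpus: classical count-based embeddings built from co-occurrence statistics of your own text. The sketch below builds PPMI (positive pointwise mutual information) word vectors from a toy corpus; all function names are illustrative, and a real ___domain corpus would be far larger:

```python
import math
from collections import Counter

def ppmi_vectors(sentences, window=2):
    """Build PPMI word vectors from raw sentences.
    Each word's vector has one dimension per vocabulary word (its contexts)."""
    word_counts = Counter()
    pair_counts = Counter()
    for sent in sentences:
        toks = sent.lower().split()
        word_counts.update(toks)
        # Count (word, context) pairs within the sliding window.
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if i != j:
                    pair_counts[(w, toks[j])] += 1
    total_pairs = sum(pair_counts.values())
    total_words = sum(word_counts.values())
    vocab = sorted(word_counts)
    index = {w: k for k, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for (w, c), n in pair_counts.items():
        p_wc = n / total_pairs
        p_w = word_counts[w] / total_words
        p_c = word_counts[c] / total_words
        # PPMI: clamp negative PMI to zero.
        vectors[w][index[c]] = max(0.0, math.log(p_wc / (p_w * p_c)))
    return vectors

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Because the vectors are built only from your corpus, a word gets the meaning your ___domain gives it, which is exactly the effect you're after; once this baseline works, swapping in word2vec or a fine-tuned transformer trained on the same corpus is the usual next step.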