I've been doing this recently with every URL I've bookmarked over the last 15 years or so, since I signed up for pinboard.in. http://spider.cloud has been really nice for crawling sites and saving the results as markdown. I plan to expand it to cover transcripts of YouTube videos I've saved, GitHub repos I've starred, HN posts, etc.
Ultimately I'm trying to index my "window" to the web as embedded content in a vector store. Not sure exactly what I'm going to do with it yet but I imagine it will be a component of some kind of personal agent system I can use to reference old info and help as a writing tool or as an "idea generator" of some kind. I'll likely end up not using most of it but you never know.
I've scraped about 10k markdown files so far, which has produced a ~10 GB ChromaDB instance. Eventually I'll probably create separate collections based on domain, and filter down to the items I care about more.
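For the per-domain split, here's a rough sketch of how I might bucket the scraped files. It only uses the standard library; the function names and the `(url, path)` pair format are made up for illustration, and I haven't checked the generated names against ChromaDB's collection naming rules. Each key could then be handed to something like the client's `get_or_create_collection`.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def collection_name_for(url: str) -> str:
    """Turn a bookmark URL into a slug usable as a collection name (hypothetical helper)."""
    host = (urlparse(url).hostname or "unknown").lower()
    if host.startswith("www."):
        host = host[4:]
    # Collapse anything that is not a lowercase letter or digit into a single dash.
    name = re.sub(r"[^a-z0-9]+", "-", host).strip("-")
    return name or "unknown"


def group_by_domain(bookmarks):
    """Group (url, markdown_path) pairs by the collection name derived from the URL."""
    groups = defaultdict(list)
    for url, path in bookmarks:
        groups[collection_name_for(url)].append(path)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        ("https://news.ycombinator.com/item?id=1", "a.md"),
        ("https://news.ycombinator.com/item?id=2", "b.md"),
        ("https://www.github.com/foo/bar", "c.md"),
    ]
    for name, paths in group_by_domain(sample).items():
        print(name, paths)
```

Grouping up front like this would also make it easy to skip whole domains I don't care about before embedding anything.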