All those broken links in your “django signals” results seem to have come from a page full of mangled URLs that got indexed; unfortunately they’ve pushed the actual results all the way down to page 6! I definitely need to give official documentation a ranking boost.
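Roughly what I have in mind for that boost is a per-domain multiplier applied to the relevance score. Here’s a minimal sketch; the domain list, field names, and multiplier values are all invented for illustration, not the actual pipeline:

```python
# Hypothetical per-domain boost table; the multipliers are made up.
OFFICIAL_DOCS_DOMAINS = {
    "docs.djangoproject.com": 2.0,
    "docs.python.org": 2.0,
    "pkg.go.dev": 1.5,
}

def boosted_score(result: dict) -> float:
    """Multiply the base relevance score for results from official docs."""
    multiplier = OFFICIAL_DOCS_DOMAINS.get(result["domain"], 1.0)
    return result["score"] * multiplier

results = [
    {"url": "https://docs.djangoproject.com/en/stable/topics/signals/",
     "domain": "docs.djangoproject.com", "score": 0.6},
    {"url": "https://example.com/page-of-mangled-links",
     "domain": "example.com", "score": 0.7},
]
# After boosting, the official docs page outranks the junk page.
results.sort(key=boosted_score, reverse=True)
```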
“golang cobra” gets what appears to be the official repo as the first result, but the rest of the page clearly isn’t what you’re going for here. This is a good example of the sort of challenge a search engine faces: both “go” and “cobra” have multiple meanings, and it needs the surrounding context to figure out whether a given link is relevant to this particular search. I think something like a vector search would be useful here, but I haven’t looked into setting that up yet.
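To make that concrete, here’s a sketch of embedding-based reranking with the sentence-transformers library. The model name is just a common default and the candidate snippets are contrived, but it shows why vectors help with the “go”/“cobra” ambiguity:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "golang cobra"
candidates = [
    "Cobra is a library for creating powerful modern CLI applications in Go.",
    "Care sheet for keeping king cobras in captivity.",
    "Go (board game) opening strategies for beginners.",
]

query_vec = model.encode(query, convert_to_tensor=True)
doc_vecs = model.encode(candidates, convert_to_tensor=True)
# Cosine similarity between the query and each candidate snippet.
scores = util.cos_sim(query_vec, doc_vecs)[0].tolist()

# The CLI-library snippet should score highest: the embedding puts
# "golang" near the programming senses of "go" and "cobra".
for score, text in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {text}")
```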
GitHub is on my list, but it’s very big and is going to require careful optimization. (Even if I only load top-level READMEs it’s still a ton of data.)
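For a sense of scale, even the cheapest approach means one API call per repository. A sketch of pulling just the top-level README via the GitHub REST API (the endpoint is real; the error handling is simplified for illustration):

```python
import base64
import requests

def fetch_readme(owner: str, repo: str) -> str | None:
    """Fetch a repo's top-level README via the GitHub REST API."""
    url = f"https://api.github.com/repos/{owner}/{repo}/readme"
    # Unauthenticated requests are heavily rate-limited; a real loader
    # would need a token and a lot of patience.
    resp = requests.get(url, headers={"Accept": "application/vnd.github+json"})
    if resp.status_code != 200:
        return None
    # The API returns the file content base64-encoded.
    return base64.b64decode(resp.json()["content"]).decode("utf-8")

readme = fetch_readme("spf13", "cobra")
if readme:
    print(readme[:200])
```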
ReadTheDocs would be great, but they don’t seem to offer any dump/download support, or even a list of all the documentation sites they host, so they’re going to have to wait until I have a general web crawler.
I have some heuristics to collapse multiple versions of a page into a single result with a version picker (roughly along the lines of the sketch below), but they require some adjustments to the rest of my data-processing pipeline which I haven’t gotten round to yet.
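The core of the heuristic is normalizing away the version segment of the URL path and grouping hits by the normalized key. The regex and data shapes here are invented for illustration:

```python
import re
from collections import defaultdict

# Matches a path segment like /4.2/, /v2/, /stable/, or /latest/.
VERSION_SEGMENT = re.compile(r"/(?:v?\d+(?:\.\d+)*|stable|latest)/")

def version_key(url: str) -> str:
    """Collapse e.g. .../en/4.2/topics/signals/ and .../en/5.0/topics/signals/."""
    return VERSION_SEGMENT.sub("/{version}/", url)

hits = [
    "https://docs.djangoproject.com/en/4.2/topics/signals/",
    "https://docs.djangoproject.com/en/5.0/topics/signals/",
    "https://docs.djangoproject.com/en/5.1/topics/signals/",
]

grouped: dict[str, list[str]] = defaultdict(list)
for url in hits:
    grouped[version_key(url)].append(url)

# One result per group; the grouped URLs feed the version picker.
for key, versions in grouped.items():
    print(key, "->", len(versions), "versions")
```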