That would be a hilariously bad idea for them. Their business is based on fair use. The only way to enforce restrictions against scraping is through copyright law, because you can obviously run the spidering code from any jurisdiction you want; any law that says “thou shalt not scrape” is toothless unless it acts through copyright. And any workable restriction on using scraped data would make ChatGPT illegal too.
Nonsense. Regulation rarely works retroactively. Their model is trained and they have the money to license incremental data going forward, potentially exclusively.
My point is that laws regulating the act of scraping itself cannot work, because you can easily scrape in a different country where that law doesn’t apply and then transfer the data in - or indeed train your NN in a different country and transfer the model.
Only copyright can see through all of that; you would have to gut fair use in order to have an effective anti-scraping law.
I'm going largely by memory, but when the U.S. expanded copyright at one point it actually took some works out of the public ___domain. You can look it up, but the current formula is the author's life plus 70 years, with a different formula for corporate works, and when the term was most recently expanded some public ___domain works actually became non-public-___domain retroactively. (A quick Google search reveals the 1976 Act added 19 years to the terms of existing copyrights, which might be what I'm thinking of - in other words, some works whose copyrights had expired then had them renewed and were removed from the public ___domain.)
There's also copyright reversion, a related provision the 1976 Act introduced that applied to older copyrighted works. Quoting from an article I just pulled up:
"...the 1976 Act created a new right allowing authors and their heirs to terminate a prior grant of copyright, the Act also set forth specific steps concerning the timing and contents of the termination notice that must be served in order to effectuate termination. The termination of a grant may be effective “at any time during a period of five years beginning of the end of 56 years from the date the copyright was originally secured”..."
But this is a red herring, because the fact that a model was trained in the past doesn't mean a copyright lawsuit is "retroactive". The infringement would presumably be occurring anew every day you make it available on your website.
How's that gonna work when they need to update their model? Also, how would they compete with companies like FB that have an insane amount of conversational data, or Google, a company that literally indexes the internet?
Spend money on licensing deals, lock out the competition. The value of the LLM isn’t up-to-date data, it’s the concepts it extracts. There’s very limited value in a large amount of crap if Chinchilla is to be believed.
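For a sense of scale: the Chinchilla paper (Hoffmann et al., 2022) found compute-optimal training wants roughly 20 tokens per parameter, so the amount of data a model can usefully absorb is bounded by its size. A quick back-of-the-envelope in Python, treating the 20x figure as the rough heuristic it is:

    # Chinchilla-style compute-optimal data sizes, using the rough
    # ~20-tokens-per-parameter heuristic from Hoffmann et al. (2022).
    TOKENS_PER_PARAM = 20  # approximate; the paper fits this empirically

    for params_b in (7, 70, 175):
        optimal_tokens = TOKENS_PER_PARAM * params_b * 1e9
        print(f"{params_b:>4}B params -> ~{optimal_tokens / 1e12:.1f}T tokens")

Even a 175B model “only” wants ~3.5T training tokens, which is why high-quality licensed data beats an unbounded pile of scraped crap once you can cover that budget.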
I don’t think Stack Overflow is all that valuable once your model has access to GitHub, thanks to their good friends at MS.
The money in proprietary AI is at the top end now; open source / edge is destroying monetisation at the lower end. Top end means high-quality, ___domain-specific data.
As a heavy ChatGPT user I disagree. Lack of up-to-date data is one of the biggest issues I face every day: technology changes fast, libraries change APIs, new tech comes out, etc.
I’m working on this problem (heavy user of ChatGPT too). What kinds of libraries do you use it for that are out of date? I could hopefully get you into the beta so it gives better responses for those libs. Please email me [email protected]
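For context, the usual way to attack stale training data is retrieval: fetch the current docs for a library and put them in the prompt, so answers come from today’s API surface rather than the training snapshot. A minimal sketch in Python - the endpoint and function names are hypothetical placeholders, not a real API (and not necessarily how the beta works):

    import requests

    def fetch_docs(library: str, version: str) -> str:
        # Hypothetical docs endpoint; in practice you'd index the
        # library's changelog and reference pages yourself.
        resp = requests.get(f"https://example.com/docs/{library}/{version}")
        resp.raise_for_status()
        return resp.text

    def build_prompt(question: str, library: str, version: str) -> str:
        # Prepend retrieved, current documentation so the model answers
        # against today's API instead of its training-time snapshot.
        docs = fetch_docs(library, version)
        return (
            f"Answer using the {library} {version} docs below.\n\n"
            f"{docs[:4000]}\n\n"  # naive truncation to fit the context window
            f"Question: {question}"
        )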
It has information from 2021.
ChatGPT presents Quickwit as follows:
As of my last knowledge update in September 2021, Quickwit is an open-source search engine infrastructure that is designed for building and deploying search solutions quickly and efficiently. It focuses on providing fast and scalable full-text search capabilities for applications and websites. Quickwit is built on top of the Rust programming language and leverages technologies like the tantivy search engine library.