That would be a hilariously bad idea for them. Their business is based on fair use. The only way to enforce restrictions against scraping is through copyright law, because you can obviously run the spidering code from any jurisdiction you want; any law that says “thou shalt not scrape” is toothless unless it acts through copyright. And any workable restriction on using scraped data would make ChatGPT illegal too.
Nonsense. Regulation rarely works retroactively. Their model is trained and they have the money to license incremental data going forward, potentially exclusively.
My point is that laws regulating the act of scraping itself cannot work, because you can easily scrape in a different country where that law doesn’t apply and then transfer the data in - or indeed train your NN in a different country and transfer the model.
Only copyright can see through all of that; you would have to gut fair use in order to have an effective anti-scraping law.
I'm going largely by memory, but when the U.S. expanded copyright at one point it actually took some works out of the public ___domain. You can look it up, but the current formula is the author's life plus 70 years, with a different formula for corporate works, and when the term was most recently expanded some public ___domain works actually became non-public-___domain retroactively. (A quick Google search reveals the 1976 Act added 19 years to the terms of existing copyrights, which might be what I'm thinking of - in other words, some works whose copyrights had expired then had them renewed and were removed from the public ___domain.)
There's also copyright reversion, a related provision the 1976 Act introduced that applied to older copyrighted works. Quoting from an article I just pulled up:
"...the 1976 Act created a new right allowing authors and their heirs to terminate a prior grant of copyright, the Act also set forth specific steps concerning the timing and contents of the termination notice that must be served in order to effectuate termination. The termination of a grant may be effective “at any time during a period of five years beginning of the end of 56 years from the date the copyright was originally secured”..."
But this is a red herring, because the fact that a model was trained in the past doesn't mean a copyright lawsuit is "retroactive". The infringement would presumably be occurring anew every day you make it available on your website.
How's that gonna work when they need to update their model? Also, how would they compete with companies like FB that have an insane amount of conversational data, or Google, a company that literally indexes the internet?
Spend money on licensing deals, lock out the competition. The value of the LLM isn’t up-to-date data, it’s the concepts it extracts. There’s very limited value in a large amount of crap if Chinchilla is to be believed.
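For a sense of scale: the Chinchilla paper (Hoffmann et al., 2022) found compute-optimal training wants roughly 20 tokens per parameter, so the amount of data a model can usefully absorb is bounded by its size. A quick back-of-the-envelope in Python, treating the 20x figure as the rough heuristic it is:

    # Chinchilla-style compute-optimal data sizes, using the rough
    # ~20-tokens-per-parameter heuristic from Hoffmann et al. (2022).
    TOKENS_PER_PARAM = 20  # approximate; the paper fits this empirically

    for params_b in (7, 70, 175):
        optimal_tokens = TOKENS_PER_PARAM * params_b * 1e9
        print(f"{params_b:>4}B params -> ~{optimal_tokens / 1e12:.1f}T tokens")

Even a 175B model “only” wants ~3.5T training tokens, which is why high-quality licensed data beats an unbounded pile of scraped crap once you can cover that budget.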
I don’t think Stack Overflow is all that valuable once your model has access to GitHub, thanks to their good friends at MS.
The money in proprietary AI is at the top end now; open source / edge is destroying monetisation at the lower end. Top end means high-quality, ___domain-specific data.
As a heavy ChatGPT user I disagree. Lack of up-to-date data is one of the biggest issues I face every day: technology changes fast, libraries change APIs, new tech comes out, etc.
I’m working on this problem (heavy user of ChatGPT too). What kinds of libraries do you use it for that are out of date? I could hopefully get you into the beta so it gives better responses for those libs. Please email me [email protected]
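For context, the usual way to attack stale training data is retrieval: fetch the current docs for a library and put them in the prompt, so answers come from today’s API surface rather than the training snapshot. A minimal sketch in Python - the endpoint and function names are hypothetical placeholders, not a real API (and not necessarily how the beta works):

    import requests

    def fetch_docs(library: str, version: str) -> str:
        # Hypothetical docs endpoint; in practice you'd index the
        # library's changelog and reference pages yourself.
        resp = requests.get(f"https://example.com/docs/{library}/{version}")
        resp.raise_for_status()
        return resp.text

    def build_prompt(question: str, library: str, version: str) -> str:
        # Prepend retrieved, current documentation so the model answers
        # against today's API instead of its training-time snapshot.
        docs = fetch_docs(library, version)
        return (
            f"Answer using the {library} {version} docs below.\n\n"
            f"{docs[:4000]}\n\n"  # naive truncation to fit the context window
            f"Question: {question}"
        )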
It has information from 2021.
ChatGPT presents Quickwit as follows:
As of my last knowledge update in September 2021, Quickwit is an open-source search engine infrastructure that is designed for building and deploying search solutions quickly and efficiently. It focuses on providing fast and scalable full-text search capabilities for applications and websites. Quickwit is built on top of the Rust programming language and leverages technologies like the tantivy search engine library.