Training algorithms on copyrighted data not illegal: US Supreme Court (towardsdatascience.com)
327 points by alok-g on Nov 15, 2019 | hide | past | favorite | 46 comments



SCOTUS denied the petition for writ of certiorari, thereby leaving the 2nd Circuit's ruling in Google's favor intact.

However, the 2nd Circuit's ruling is not binding on any other federal circuits.

Also, as Enginerrd stated, the holding is not nearly as broad as the article makes it out to be.

The holding was:

1. Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google’s commercial nature and profit motivation do not justify denial of fair use.

2. Google’s provision of digitized copies to the libraries that supplied the books, on the understanding that the libraries will use the copies in a manner consistent with the copyright law, also does not constitute infringement.

Based on the above holding, I think the article's conclusion is a stretch for general training algorithms using copyrighted data because: (1) there would not be a library supplying the information to the training algorithm, (2) there would be no similar display of snippets, and (3) we do not know if a training algorithm would provide a market substitute for the copyrighted data.


While the decision is only binding in the 2nd Circuit, the precedent is admissible in other courts. If this goes to trial in a different circuit you can bring the finding to the judge, who will consider it - it won't be binding, but he will consider it. If it goes to appeal in a different circuit, the next circuit will reference this decision; if they decide the 2nd Circuit is wrong, they will be very clear about why when they make their ruling (and this can in turn be re-submitted to the 2nd Circuit, which might change its mind if the reasoning is good enough). If this goes to the Supreme Court in the future they will read this decision and it will influence them - again, they can decide either way.


Yes, the 2nd Circuit's decision is persuasive authority for other circuits. However, that's not what the article claims. The article claims that SCOTUS ruled when it, in fact, did not.


SCOTUS ruled on the cert petition; what people may not understand is that while that is a ruling, it is (and there is explicit precedent on this point) not one which has precedential weight (even as persuasive authority) as regards the merits of the issues addressed in the lower court ruling.


Isn't it unlikely that a case will be granted cert if appeals courts in different circuits are in agreement? I.e. not a circuit split? So while not legally binding, it might be in practice indicative.

I wonder, does this mean I can scrape Instagram/Facebook for photos and use them for face recognition? Is that 'fair use'? Is an Instagram post a publication?


> Isn't it unlikely that a case will be granted cert if appeals courts in different circuits are in agreement?

As I understand it, it's generally viewed to be the case that a circuit split makes cert. more likely, sure.

> So while not legally binding, it might be in practice indicative.

I guess that it's indicative that, barring a change in the membership of the court, cert. would likely be denied in a future case raising the same issue from the same or a different circuit with the same result.

It definitely should not be seen as indicative of anything on the merits other than that the members of the court don't see it as obviously and urgently wrong.


It's not hard to see why people are not understanding this correctly.

The HN link text:

> Training algorithms on copyrighted data not illegal: US Supreme Court

The sub-heading for the article:

> Training algorithms on copyrighted data is not illegal, according to the United States Supreme Court.


I'm going to take a picture of an Anish Kapoor sculpture tomorrow. Will that be an infringing transformative work?


Towardsdatascience.com is rapidly rising on my irritation meter. Tons of submissions to HN of questionable value. This article is worse than most: the person interpreting the ruling does not appear to have a legal background and has essentially twisted the ruling to support his foregone conclusion. I'd love for an actual lawyer (paging Rayiner) to interpret this and to see whether or not any such far-reaching conclusions are supported by the ruling the article is about. I've read the ruling and I've come to the conclusion that it makes no such statement, but since I'm also not a lawyer you should assign as much value to that opinion as to the article itself.


Medium hosted articles in general tend to have both low quality and value. It's a very shiny platform which hosts a massive amount of really poorly reasoned articles.


We probably need a technical rather than legal solution for this. The problem is that generative algorithms are susceptible to accidental memorization, so you can't guarantee that the output will be transformative. For example, play with https://talktotransformer.com to see how many well-known pieces of text it can spew out verbatim. It is very prone to derailing into Harry Potter fan fiction regardless of prompt.

Other than copyright there are privacy considerations too. For example Gmail's Smart Compose is trained on users' private messages so you don't want it to memorize "private" details (such as credit card numbers): https://arxiv.org/abs/1802.08232

Is it possible to solve this by adversarially checking if the output is "original" enough or not? Or is that intractable, given how much resource our society already pours into making the same classification in court?
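One crude first pass at such an "originality" check is n-gram overlap against the training corpus: flag generated text that reproduces long verbatim word spans. A minimal sketch, with a made-up corpus, threshold, and shingle size purely for illustration:

```python
def ngrams(text, n=8):
    """Set of all n-word shingles in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(generated, corpus_docs, n=8):
    """Fraction of the generated text's n-grams that appear verbatim
    in any training document. High values suggest memorization."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    corpus = set().union(*(ngrams(doc, n) for doc in corpus_docs))
    return len(gen & corpus) / len(gen)

corpus = ["the boy who lived had survived the killing curse once more"]
copied = "the boy who lived had survived the killing curse once more"
fresh = "an entirely different sentence about machine learning models today"

print(verbatim_overlap(copied, corpus))  # 1.0 - pure memorization
print(verbatim_overlap(fresh, corpus))   # 0.0 - no overlap
```

Of course this only catches verbatim copying, not close paraphrases, which is exactly where the court-style "substantial similarity" judgment gets hard.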


I agree with your point, but I think the tendency to accidentally produce substantial quotations (of unknown fidelity) of some copyrighted material is definitely not a "significant market substitute for the protected" material. There's no way I'm going to try and trick a bot into reciting a whole story to me instead of buying a copy of the story.

That said, these many-pointed court tests make it really hard to do anything around copyrighted stuff without being a large corporation, which is problematic.


If you look at it from a perspective of information theory, you may find that compression and AI are equivalent problems. And we know that compression (including lossy compression) doesn't remove copyrights.

Good thing that these court tests are establishing that the argument, rather than being technical, is about the creation of value to society, and about potential economic damage to the copyright owners. As long as there's no damage and there is value to society, it seems that the courts are fine with the use of the data.


This point is extremely important, particularly in the healthcare field (which I happen to be in at the moment). We have to be very, very positive that our deidentification process is thorough and accurate to prevent a HIPAA violation from occurring.


> We have to be very, very positive that our deidentification process is thorough and accurate to prevent a HIPAA violation from occurring

If your model is any more useful than something trained on coarse aggregates, then it can be used to reidentify individuals. This is a pretty hard dilemma in the entire industry, not just health.
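A toy illustration of why "aggregates only" is weaker than it sounds: two individually allowed aggregate queries that differ by one person reveal that person's value exactly. The table and figures below are invented:

```python
# Hypothetical salary table; suppose only SUM queries are "allowed".
salaries = {"alice": 91000, "bob": 72000, "carol": 65000}

def total(names):
    return sum(salaries[n] for n in names)

everyone = total(salaries)                                   # allowed aggregate
everyone_but_bob = total(n for n in salaries if n != "bob")  # also allowed

# Difference attack: subtracting the two reidentifies bob exactly.
bob_salary = everyone - everyone_but_bob
print(bob_salary)  # 72000
```

This difference attack is the textbook motivation for differential privacy: it's the pair of answers, not either answer alone, that leaks.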

I hope my observations are skewed, but instead of trying to seriously address this issue I've seen an entire legal-loophole style data laundering industry emerge where highly identifiable information changes hands without it 'technically' changing hands in the legal sense. I'm talking about entities like datarepublic.


It really is a complicated topic, and one we spend a lot of time thinking about. We're using a peer reviewed method for removing PHI identifiers, and are combining that with an approach that involves using the least amount of data possible to get results. Our models will also never be released to the public, but instead will have an API where we can see abnormal behavior (such as sending lots of requests to try and tease out other information) and intervene.
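The "spot abnormal request behavior" part can start as simply as a per-client sliding window, with anything over a threshold flagged for intervention. A minimal sketch; the window size and limit here are invented:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 100  # invented threshold

_history = defaultdict(deque)  # client_id -> timestamps of recent requests

def allow_request(client_id, now=None):
    """Return True if the request is within the rate limit, else False."""
    now = time.monotonic() if now is None else now
    q = _history[client_id]
    while q and now - q[0] > WINDOW_SECONDS:  # drop expired timestamps
        q.popleft()
    q.append(now)
    return len(q) <= MAX_REQUESTS

# Simulated burst: the 101st request inside one window is flagged.
results = [allow_request("client-a", now=float(i) / 10) for i in range(101)]
print(results[99], results[100])  # True False
```

A real deployment would also look at query content (many near-duplicate probes of one record), not just volume, since extraction attacks can be slow and patient.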


Other than easily controlled walled garden type accesses it is hard to limit with throttling.

Is there a way to determine how much information about a particular individual has been leaked out?


If you start from real data that's a very tough nut to crack. Depending on your purpose you may be better off with generating data that has the same statistical properties.
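For simple numeric fields, "same statistical properties" can mean as little as fitting a distribution and resampling from it. A stdlib-only sketch, with the "real" measurements invented for illustration:

```python
import random
import statistics

random.seed(0)

# Invented "real" measurements, e.g. patient ages.
real = [34, 41, 29, 55, 48, 62, 37, 44, 51, 39]

mu = statistics.mean(real)
sigma = statistics.stdev(real)

# Synthetic sample drawn from a normal distribution fit to the real data.
synthetic = [random.gauss(mu, sigma) for _ in range(10_000)]

# Moments of the synthetic sample track the real data closely.
print(round(statistics.mean(synthetic), 1))
print(round(statistics.stdev(synthetic), 1))
```

Matching marginal moments is the easy part; as the reply below notes, preserving the joint structure of a multi-year clinical record is a much harder problem.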


That's true, but in healthcare generating data that matches the statistical properties of real data is also a very tough nut to crack. Essentially you need experienced clinicians who can document a fake patient's care journey through multiple years with thousands of concept codes and narrative text blocks. The level of effort is on the same order of magnitude as writing a novel.

Of course if you don't need a complete longitudinal patient chart then generating realistic data for just one aspect can be a lot simpler.



Sure and Synthea is doing something similar, as well as a few others. Those can produce data good enough for certain use cases. But it's not quite good enough for realistic product demos involving complex conditions, or testing anything related to record linkage.


We've tested numerous synthetic data companies and so far have not found any which can produce data that is useful for real world applications. There are some that are close enough where academic groups might be able to publish around it.

There is a single startup we're working with that has come much closer, but still isn't quite there. I think that if they manage to get funding they'll be somewhere usable in the near future.


Which one is that? Super interested for one of my customers...


It is possible to solve it by either:

1) Federated machine learning. Basically you let Google train their model on your private data, with the understanding that some of it would be used for the public good, but this keeps your raw data separate from the shared weights.

2) Train your own models, and ask Google to open Gmail such that it will call your models.
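The core of option 1 is that only weight updates, never raw data, leave each client. A toy federated-averaging sketch for a one-parameter model (all data and structure here are invented):

```python
# Toy federated averaging: each client fits y = w * x on its own private
# data and ships only the resulting weight; the server averages weights.

clients = [
    [(1.0, 2.0), (2.0, 4.0)],   # private (x, y) pairs, never shared
    [(1.0, 2.2), (3.0, 6.6)],
    [(2.0, 3.6), (4.0, 7.2)],
]

def local_fit(data):
    """Least-squares slope through the origin, computed on local data only."""
    return sum(x * y for x, y in data) / sum(x * x for x, y in data)

local_weights = [local_fit(d) for d in clients]     # computed client-side
global_w = sum(local_weights) / len(local_weights)  # server sees only weights
print(round(global_w, 2))  # 2.0
```

Note that weights can still leak information about the data they were fit on, which is why real federated systems often combine this with secure aggregation or differential privacy.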


This isn't nearly as broad a precedent as the title sounds.

They used some pretty reasonable tests of copyright infringement to conclude that no such infringement occurred.


If this legal theory is correct and upheld, possession of data will be pretty important. So all those data you licensed for one reason or another might be able to be used to train.

I think the logic makes sense because imagine if humans were prevented from getting ideas from watching movies. It seems similar to not letting AI watch every movie ever and learn.


It sounds like to get the data into their AI, Google had to make copies. They digitized the physical books, and then trained using those copies they made. Also, their system included excerpts from the books which it could retrieve and show users.

Hence, they had to use fair use to justify it.

I think if you could train the AI without having to make a copy first, such as having the AI read the physical books directly, or in the case of your movie example having the AI watch the movie on a TV hooked up to a DVD player playing a copy of the movie on DVD that you bought from a retailer authorized by the copyright owner to sell such DVDs, you might not even need to make a fair use argument.

The definitions section of the US copyright statutes, 17 USC 101 [1], defines "copies" like this:

> “Copies” are material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. The term “copies” includes the material object, other than a phonorecord, in which the work is first fixed.

and "fixed" is defined like this:

> A work is “fixed” in a tangible medium of expression when its embodiment in a copy or phonorecord, by or under the authority of the author, is sufficiently permanent or stable to permit it to be perceived, reproduced, or otherwise communicated for a period of more than transitory duration.

An AI reading or watching the work as one of many many works in order to learn weights for a neural net does not result in a material object from which the work can be perceived, reproduced, or otherwise communicated. Thus, there is no copy, and hence no copyright issue.

[1] https://www.law.cornell.edu/uscode/text/17/101


"It sounds like to get the data into their AI, Google had to make copies"

It seems to me that "copy" is a legal term of art. Like, if I view a website, it might say I'm not allowed to copy the information. But depending on the level of abstraction, the data has been copied many times by many entities just to get to me, all the layers of machines, caches, retransmissions, etc. and exactly what I do with the normal functions of my browser cause more copies to be made.

Either this is not considered copying or it is considered fair use, but it seems pretty arbitrary to me, except that obviously considering it infringement would not advance the constitutional purpose of IP protections.


> "does not result in a material object from which the work can be perceived"

This does not hold true in all cases. Note that the ruling lists the end goals as 'fair use' goals and that that seems to have been an important part in the conclusion reached.

The key thing to strive for in creating derivative works that are deserving of copyright protection in their own right is that they contain 'substantial originality', mere machine transformation does not qualify.


> mere machine transformation does not qualify

That minimises the contribution of thousands of researchers in designing the models and their training regimens.


> I think if you could train the AI without having to make a copy first, such as having the AI read the physical books directly

Presumably this would require a digital camera capturing a copy of the book and storing it for some amount of time in computer memory, which seems equivalent to copying a digital version of a book or movie into the AI's computer memory.


The "period of more than transitory duration" makes all the difference. If you digitize the page for just long enough to run the model on it once and then delete it, that's different from storing it indefinitely to train all your future models with.


I don't think the judgment supports the conclusion in the article.


From a 2016 refusal to review a case


I mean, why would it be? We train ourselves on copyrighted material all the time.


We should be surprised that nine non-technical people are making technical decisions that impact 300m people. It seems that legal decisions are increasingly scope-creeping into domains where ___domain experts are necessary. This is even more evident after seeing the Zuck-Congress hearings, where Congress proved to the people that they aren't really the best people to work on technological issues.


Misleading title. SCOTUS refused to hear the case. They didn’t rule in Google’s favour.


Is using pirated or otherwise illegally acquired data to train an algorithm legal? If yes then why is it illegal to use it for other purposes?

Is it legal to use the data to train an algorithm if the license disallows that explicitly?


The end decider of that question will be another judge, but what I've been told so far by counsel retained for that express purpose is that the model (you don't really train an algorithm) resulting from the use of that data would count as an automatic transformation of the data, leaving it very likely that the model itself would be classed as a derivative work without deserving copyright protection in its own right.

In other words: you could not have come up with the model without the original data.


Does copyright law actually make it illegal (as opposed to just frowned-upon, and maybe hindered by one's ISP) to receive pirated material? I was under the impression that the illegal part was the distribution of it.


> Does copyright law actually make it illegal (as opposed to just frowned-upon, and maybe hindered by one's ISP) to receive pirated material?

Receiving, strictly, no. However, with digital material most use involves copying; for legitimate copies that copying is covered by an implied license for the normal use of the work, while for copies which are not themselves authorized there is no such implied license.

Also, receipt of digital copies itself often involves copying directed by the receiver, which is prohibited, and may even involve a request from the receiver to the originator to make the copy under circumstances where the receiver knows it is unauthorized - which, as a solicitation of an unlawful act, may often itself be illegal.


Could this be used against Google? E.g. train an algorithm from their road traffic information (fetched legally via their API or web UI) to improve rough time estimates based on OpenStreetMap data.


> Could this be used against Google? E.g. train an algorithm from their road traffic information (fetched legally via their API or web UI) to improve rough time estimates based on OpenStreetMap data.

No, because then contractual ToS, not naked copyright law, will be at issue. Even if Google doesn't have the right terms to forestall this now, it's a trivial change for them to adopt.


Anyone know of any organizations working to repeal Intellectual Monopoly laws? I know there are organizations that try to counter the influence of the IM industry like EFF and FSF, but I’m looking for groups that have come out and said Intellectual Monopoly laws need to go, period.


Pirate Parties worldwide. Sometimes they even get seats in parliament.


Does anybody know the situation in the EU on that?



