>nobody owns AI outputs

This would be fantastic imo. A new era of the commons.

>Adobe

I disagree here. Adobe has trained only on public ___domain and their own stock images. So why would Adobe oppose a ruling that training on unlicensed data is infringement? It would eliminate much of their competition...




> Adobe has trained only on public ___domain and their own stock images.

Adobe is lying. They are relying on general ignorance about the technology to get away with it.

Adobe has not shown how they train the text encoders in Firefly, or which images were used for the text-based conditioning (i.e. "text to image") part of their image generation model. They are almost certainly using CLIP or T5, which are trained on LAION-2B (an image dataset with the very problems they claim to be addressing), C4 (a text dataset that is similarly encumbered), and the like.
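
To make that concrete, here's a minimal sketch of the conditioning pathway using a public Stable Diffusion checkpoint (Firefly's weights aren't published, so the model name is just an illustrative stand-in): the text encoder is a separate, pre-trained CLIP component that ships alongside the diffusion model.

    # Minimal sketch: inspect the text-conditioning path of a public
    # text-to-image model. Checkpoint is illustrative, not Firefly.
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

    # prompt -> tokenizer -> text encoder -> cross-attention in the U-Net
    print(type(pipe.tokenizer))     # a CLIP tokenizer
    print(type(pipe.text_encoder))  # a pre-trained CLIP text model
    print(pipe.text_encoder.config.hidden_size)  # embedding width fed to the U-Net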

"But no one else has brought this up!" It's so arcane for non-practitioners. Talk about this directly with someone like Astropulse, who monetizes a Stable Diffusion model: no confusion, totally agrees with me. By comparison, I've pinged the Ars Technica journalist who just wrote about this issue: crickets. Posted to the Adobe forum: crickets. E-mailed them at their dedicated address for this: crickets. I have no idea how something so obvious has slipped under everyone's radar!


Would it be impossible for them to train their own text encoder on just the images they have? How many images would one need?


I welcome anyone who works at Adobe to simply answer this question and put it to rest. There is absolutely nothing sensitive about the issue, unless it exposes them in a lie.

So no chance. I think it's a big fat lie. They would have had to make some other scientific breakthrough, and they didn't.

Using information from https://openai.com/research/clip and https://github.com/mlfoundations/open_clip, it's possible to answer this question.

It's certainly not impossible, but it's impractical. Trained on 248m images (roughly the size of Adobe Stock), CLIP gets about 37% zero-shot accuracy on ImageNet; trained on the 2000m images from LAION, it reaches 71-80%. And even with 2000m images, CLIP performs substantially worse than the approach Imagen uses for "text comprehension," which relies on many billions more images and text tokens.
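
For reference, those zero-shot ImageNet numbers come from evaluations like the one sketched below with open_clip; the model and pretrained tags are publicly listed checkpoints, chosen here only as examples.

    # Zero-shot classification with a CLIP trained on ~2B LAION pairs.
    # Running this over the ImageNet validation set is what yields the
    # ~71-80% figures; models trained on ~250M pairs score far lower.
    import torch
    import open_clip
    from PIL import Image

    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")

    image = preprocess(Image.open("example.jpg")).unsqueeze(0)
    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
    text = tokenizer(labels)

    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(text)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        probs = (100.0 * img @ txt.T).softmax(dim=-1)

    print(dict(zip(labels, probs[0].tolist())))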


Interesting. I looked through the LAION datasets a bit, and it was astonishing how bad the captions really are: very short, when not completely wrong. It's amazing to me that this works at all. I wonder how much better and more efficient CLIP etc. would be if the images were properly tagged rather than just captioned with alt text. Maybe that's why DALL-E 3 is so good at following prompts?
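
For anyone who wants to check for themselves, something like this streams the published LAION metadata and prints a handful of captions (dataset and column names as LAION released them on Hugging Face; adjust if the repo has moved or been taken down).

    # Print a few LAION alt-text captions to see how short/noisy they are.
    from datasets import load_dataset

    ds = load_dataset("laion/laion2B-en", split="train", streaming=True)
    for i, row in enumerate(ds):
        print(repr(row["TEXT"]))  # raw alt text used as the caption
        if i >= 9:
            break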



