
I haven't used Sora, but none of the GenAI systems I'm aware of could produce a competent comic book. When a human artist draws a character in a house in panel 1, they'll draw the same house in panel 2, not a different, procedurally generated house for each image.

If a 60-year-old grizzled detective is introduced on page 1, a human artist will draw the same grizzled detective on pages 2, 3, and so on, not procedurally generate a new grizzled detective each time.




A human artist keeps state :). They keep it between drawing sessions, and more importantly, they keep very detailed state - their imagination or interpretation of what the thing (house, grizzled detective, etc.) is.

Most models people currently use don't keep state between invocations, and whatever interpretation they make from provided context (e.g. reference image, previous frame) is surface level and doesn't translate well to output. This is akin to giving each panel in a comic to a different artist, and also telling them to sketch it out by their gut, without any deep analysis of prior work. It's a big limitation, alright, but researchers and practitioners are actively working to overcome it.

(Same applies to LLMs, too.)


Btw, there's a way to match characters in a batch in the Forge WebUI which guarantees that all images in the batch contain the same figure. It's trivial to implement this in all other image generators. This critique is baseless.


So prove it. If you are arguing in good faith that an AI, via automation, can draw a comic script with consistent figures, please tell an AI to draw the images for the first 3 pages of this script I pulled from the comic book script archive:

https://www.comicsexperience.com/wp-content/uploads/2018/09/...

Or, if you can't do this, explain why the feature you mentioned cannot, and what it is good for.


As long as you're not asking for a zero-shot solution with a single model run three times in a row, this should be entirely doable, though I imagine ensuring a good result would require a complex pipeline consisting of:

- An LLM to inflate descriptions in the script to very detailed prompts (equivalent to artist thinking up how characters will look, how the scene is organized);

- A step to generate a representative drawing of every character via txt2img - or more likely, multiple ones, with a multimodal LLM rating adherence to the prompt;

- A step to generate a lot of variations of every character in different poses, using e.g. ControlNet or whatever is currently the SOTA solution used by the Stable Diffusion community to create consistent variations of a character;

- A step to bake all those character variations into a LoRA;

- Finally, scenes would be generated by another call to txt2img, with prompts computed in step 1, and appropriate LoRAs active (this can be handled through prompt too).

Then iterate on that, e.g. maybe an additional img2img pass to force a comic book style (with a different SD derivative, most likely), etc.

Point being, every subproblem of the task has many different solutions already developed, with new ones appearing every month - all that's left to have an "AI artist" capable of solving your challenge is to wire the building blocks up. For that, you need just a trivial bit of Python code using existing libraries (e.g. hooking up to ComfyUI), and guess what, GPT-4 and Claude 3.5 Sonnet are quite good at Python.
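To make the "wire it up" part concrete, here's roughly what the glue code could look like. This is only a structural sketch - every helper below is a hypothetical stub standing in for a ComfyUI workflow, a LoRA trainer, or an LLM call, not a real API:

    # Hypothetical glue code for the pipeline above; every helper is a stub
    # standing in for a ComfyUI workflow, a LoRA trainer, or an LLM call.

    def expand_script(script_text: str) -> tuple[dict, list]:
        """Step 1 (LLM): return character descriptions {name: prompt} and
        panels as a list of {"prompt": ..., "characters": [names]}."""
        ...

    def character_sheet(description: str) -> list[str]:
        """Steps 2-3 (txt2img + ControlNet): generate pose/angle variations
        of one character, keep the best ones; return image paths."""
        ...

    def train_lora(image_paths: list[str]) -> str:
        """Step 4: bake the variations into a LoRA; return its file path."""
        ...

    def render_panel(prompt: str, loras: list[str]) -> str:
        """Step 5: txt2img with the relevant LoRAs active; return the panel image path."""
        ...

    def make_comic(script_text: str) -> list[str]:
        characters, panels = expand_script(script_text)
        loras = {name: train_lora(character_sheet(desc))
                 for name, desc in characters.items()}
        return [render_panel(p["prompt"], [loras[c] for c in p["characters"]])
                for p in panels]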

EDIT: I asked Claude to generate a "pseudocode" diagram of the solution from our two comments:

http://www.plantuml.com/plantuml/img/dLLDQnin4BthLmpn9JaafOR...

Each of the nodes here would be like 3-5 real ComfyUI nodes in practice.
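And to give a sense of what "real ComfyUI nodes" plus a trivial bit of Python actually looks like: ComfyUI can export a workflow as JSON in its API format, and a locally running instance accepts it over HTTP. A minimal sketch (the node id, filename, and port are assumptions; adjust to your own workflow and install):

    import json
    import urllib.request

    COMFY_URL = "http://127.0.0.1:8188"  # default local ComfyUI address

    def queue_workflow(workflow: dict) -> str:
        """POST a workflow graph (API format) to ComfyUI; return the prompt id."""
        data = json.dumps({"prompt": workflow}).encode("utf-8")
        req = urllib.request.Request(f"{COMFY_URL}/prompt", data=data,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["prompt_id"]

    with open("workflow_api.json") as f:  # workflow exported from the ComfyUI UI
        workflow = json.load(f)
    # Patch the positive-prompt text node before queueing ("6" is just an example id).
    workflow["6"]["inputs"]["text"] = "60-year-old grizzled detective, film noir comic panel"
    print(queue_workflow(workflow))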


I appreciate the detailed response. I had a feeling the answer was some variation of "well I could get an AI to draw that but I'd have to hack at it for a few hours...". If a human has to work at it for hours, it's more like using Blender than "having an AI draw it" in my mind.

I suspect that if someone went to the trouble to implement your above solution, they'd find the end result isn't as good as they'd hoped. In practice you'd probably find one or more steps don't work correctly - for example, maybe today's multimodal LLMs can't evaluate prompt adherence acceptably. If the technology were ready, the evidence would be pretty clear - I'd expect to see some very good, very quickly made comic books shown off by AI enthusiasts on Reddit, rather than the clearly limited, not-very-good comic book experiments that have been demonstrated so far.


> If a human has to work at it for hours, it's more like using Blender than "having an AI draw it" in my mind.

A human has to work at it too; more than a few hours when doing more than a few quick sketches (memory has its limits; there's a reason artists keep reference drawings around), and obviously they've already put years into learning their skills before that. But fair - the human artist already knows how to do things that any given model doesn't yet[0]; we kind of have to assemble the overall flow ourselves for now[1].

Then again, you only need to assemble it once, putting those hours of work up front - and if it's done, and it works, it becomes fair to say that AI can, in fact, generate self-consistent comic books.

> I suspect that if someone went to the trouble to implement your above solution, they'd find the end result isn't as good as they'd hoped. In practice you'd probably find one or more steps don't work correctly - for example, maybe today's multimodal LLMs can't evaluate prompt adherence acceptably.

I agree. I obviously didn't try this myself either (yet - I'm very tempted to, to satisfy my own curiosity). However, between my own experience with LLMs and Stable Diffusion, and occasionally browsing Stable Diffusion subreddits, I'm convinced all the individual steps work well (and have multiple working alternatives), except for the one you flagged, i.e. evaluating prompt adherence using a multimodal LLM - that last one I only feel should work, but I don't know for sure. However, see [1] for an alternative approach :).
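For what it's worth, the adherence-rating step itself is a one-call sketch with any vision-capable LLM API - e.g. with the Anthropic Python SDK, something like the following (the model name and the 0-10 scale are my assumptions; whether the scores are actually reliable is exactly the open question):

    import base64
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def adherence_score(image_path: str, prompt: str) -> int:
        """Ask a vision-capable model to rate how well an image matches its prompt."""
        with open(image_path, "rb") as f:
            b64 = base64.standard_b64encode(f.read()).decode("ascii")
        msg = client.messages.create(
            model="claude-3-5-sonnet-20240620",  # any vision-capable model would do
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image",
                     "source": {"type": "base64", "media_type": "image/png", "data": b64}},
                    {"type": "text",
                     "text": f"Rate from 0 to 10 how well this image matches the prompt:\n"
                             f"{prompt}\nReply with a single integer only."},
                ],
            }],
        )
        return int(msg.content[0].text.strip())

    # e.g. keep the best of N candidate generations:
    # best = max(candidates, key=lambda path: adherence_score(path, panel_prompt))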

My point, then, is that all the individual steps are possible, and wiring them together seems pretty straightforward; therefore the whole thing should work if someone bothers to do it.

> If the technology were ready, the evidence would be pretty clear - I'd expect to see some very good, very quickly made comic books shown off by AI enthusiasts on Reddit, rather than the clearly limited, not-very-good comic book experiments that have been demonstrated so far.

I think the biggest concentration of enthusiasm is to be found in NSFW uses of SD :). On the one hand, you're right; we probably should've seen it done already. On the other hand, my impression is that most people doing advanced SD magic are perfectly satisfied with partially manual workflows. And it kind of makes sense - manual steps allow for flexibility and experimentation, and some things are much simpler to wire by hand or patch up with some tactical photoshopping than to try and automate fully. In particular, judging the quality of output is both easy for humans and hard to automate.

Still, I've recently seen ads for various AI apps claiming to do complex work (such as animating characters in photos) end-to-end automatically - exactly the kind of work that's typically done in a partially manual process. So I suspect fully-automated solutions are being built on a case-by-case basis, driven by businesses making apps for the general population; a process that lags some months behind what image gen communities figure out in the open.

--

[0] - Though arguably, LLMs contain the procedural knowledge of how a task should be done; just ask one to ELI5 or explain in WikiHow style.

[1] - In fact, I just asked Claude to solve this problem in detail, without giving it my own solution to look at (but hinting at the required complexity level); see this: https://cloud.typingmind.com/share/db36fc29-6229-4127-8336-b... (and excuse the weird errors; Claude is overloaded at the moment, so some responses had to be regenerated; also styling on the shared conversation sucks, so be sure to use the "pop out" button on diagrams to see them in detail).

At a very high level, it's the same as mine, but one level below it uses different tools and approaches, some of which I never knew about - like keeping memory in embedding space instead of text space, and using various other models I didn't know existed.

EDIT: I did some quick web searching for some of the ideas Claude proposed, and discovered even more techniques and models I'd never heard of. Even my own awareness of the image generation space is only scratching the surface of what people are doing.



