Some notes:
- This uses a new type of diffusion transformer (similar to Sora) combined with flow matching and other improvements.
- It takes advantage of transformer improvements & can not only scale further but also accept multimodal inputs.
- It will be released open; the preview is to improve its quality & safety, just like the original Stable Diffusion.
- It will launch with a full ecosystem of tools.
- It's a new base taking advantage of the latest hardware & comes in all sizes.
- Enables video, 3D & more.
- Need moar GPUs.
- More technical details soon.
>Can we create videos similar to Sora?
Given enough GPUs and good data, yes.
>How does it perform on a 3090, 4090 or less? Are we mere mortals gonna be able to have fun with it?
It's in sizes from 800M to 8B parameters now; there will be all sizes for all sorts of edge-to-giant-GPU deployment.
(adding some later replies)
>Awesome. I assume these aren't heavily cherry-picked seeds?
No, this is all one generation. With DPO, refinement and further improvement it should get better.
>Do you have any solves coming for driving coherency and consistency across image generations? For example, putting the same dog in another scene?
Yeah, see @Scenario_gg's great work with IP adapters, for example. Our team builds ComfyUI, so you can expect some really great stuff around this...
>Dall-e often doesn’t even understand negation, let alone complex spatial relations in combination with color assignments to objects.
I imagine the new version will. DALL-E and MJ are also pipelines; you can pretty much do anything accurately with pipelines now.
>Nice. Is it an open-source / open-parameters / open-data model?
Like prior SD models, it will be open source/parameters after the feedback and improvement phase. We are open data for our LMs but not for other modalities.
>Cool!!! What do you mean by good data? Can it directly output videos?
If we trained it on video, yes; it is very much like the architecture of Sora.
Stability has to make money somehow. By releasing an 8B parameter model, they’re encouraging people to use their paid API for inference. It’s not a terrible business decision. And hobbyists can play with the smaller models, which with some refining will probably be just fine for most non-professional use cases.
Oh they’ll never let you pay for porn generation. But they will happily entertain having you pay for quality commercial images that are basically a replacement for the entire graphic design industry.
Don't people quantize SD down to 8 bits? I understand plenty of people don't have 8GB of VRAM (and I suppose you need some extra for supplemental data, so maybe 10GB?). But that's still well within the realm of consumer hardware capabilities.
I am going to look at quantization for the 8B model. But also, these are transformers, so a variety of merging / Frankenstein-tuning is possible. For example, you could use the 8B model to populate the KV cache (which is computed once, so the weights can be loaded from slower devices such as RAM / SSD) and use the 800M model for diffusion by replicating its weights to match the layers of the 8B model.
Do you know how the memory demands compare to LLMs at the same number of parameters? For example, Mistral 7B quantized to 4 bits works very well on an 8GB card, though there isn’t room for long context.
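For a rough sense of scale, here is a back-of-envelope comparison using the parameter counts quoted in this thread (weights only; activations, any KV cache and framework overhead come on top, which is why a 4-bit 7B model fits an 8 GB card but leaves little room for context). A sketch in Python, not measured numbers; the helper name is made up:

    # Weight memory ~= parameter count x bits per parameter / 8 bits per byte.
    def weight_gb(params_billion: float, bits: int) -> float:
        return params_billion * 1e9 * bits / 8 / 1e9

    for name, params in [("800M model", 0.8), ("Mistral 7B", 7.0), ("8B model", 8.0)]:
        sizes = {bits: round(weight_gb(params, bits), 1) for bits in (16, 8, 4)}
        print(name, sizes)   # e.g. Mistral 7B -> {16: 14.0, 8: 7.0, 4: 3.5}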
Nvidia is making way too much money keeping cards with lots of memory exclusive to server GPUs they sell with insanely high margins.
AMD still suffers from limited resources and doesn't seem willing to spend too much chasing a market that might just be temporary hype; Google's TPUs are a pain to use and seem to have stalled out; and Intel lacks commitment, and even their products that went roughly in that direction aren't a great match for neural networks because of their philosophy of having fewer, more complex cores.
MPS is promising and the memory bandwidth is definitely there, but Stable Diffusion performance on Apple Silicon remains terribly poor compared with consumer Nvidia cards (in my humble opinion). Perhaps this is partly because so many bits of the SD ecosystem are tied to Nvidia primitives.
Image diffusion models tend to have relatively low memory requirements compared to LLMs (and don’t benefit from batching), so having access to 128 GB of unified memory is kinda pointless.
Last I saw they performed really poorly, like low single-digit t/s. Don't get me wrong, they're probably a decent value for experimenting with it, but it's flat out pathetic compared to an A100 or H100. And I think useless for training?
You can run a 180B model like Falcon Q4 at around 4-5 tk/s, a 120B model like Goliath Q4 at around 6-10 tk/s, and a 70B Q4 at around 8-12 tk/s, with smaller models much quicker, but it really depends on the context size, model architecture and other settings. An A100 or H100 is obviously going to be a lot faster, but it costs significantly more once you take its supporting requirements into account, and it can't be run on a light, battery-powered laptop, etc.
I kind of wonder if gaming will start incorporating AI stuff. What if, instead of generating a Stable Diffusion image, you could generate levels and monsters?
GPU memory is all about bandwidth, not latency. DDR5 can do 4-8 GT/s over a 64-bit bus per DIMM, so it maxes out around 128 GB/s with a dual memory controller and 512 GB/s with 8 memory controllers on server chips. The GDDR6X in the 4090 runs at roughly twice the frequency over a 384-bit bus (~6x as wide as a single DIMM), so you get an order-of-magnitude bump in throughput: nearly 1 TB/s on a consumer product. Datacenter GPUs (e.g. the A100) with HBM2e double that to 2 TB/s.
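A quick back-of-envelope version of those figures in Python, using rounded spec-sheet peaks rather than measured throughput (the 21 Gbps / 384-bit numbers for the 4090 are the published memory specs):

    # Peak bandwidth = transfer rate (MT/s) x bus width (bytes) x number of channels.
    def peak_gb_s(mt_per_s: float, bus_bits: int, channels: int = 1) -> float:
        return mt_per_s * 1e6 * (bus_bits / 8) * channels / 1e9

    print(peak_gb_s(8000, 64, 1))    # DDR5-8000, one DIMM/channel:    64 GB/s
    print(peak_gb_s(8000, 64, 2))    # dual memory controller:        128 GB/s
    print(peak_gb_s(8000, 64, 8))    # 8-channel server chip:         512 GB/s
    print(peak_gb_s(21000, 384))     # RTX 4090, 21 Gbps GDDR6X:    ~1008 GB/s
    # A100 80GB with HBM2e is rated at roughly 2,000 GB/s on the spec sheet.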
I've never tried it, but in Windows you can have CUDA apps fall back to system RAM when GPU VRAM is exhausted. You could slap 128 GB in your rig with a 4070. I'm sure performance falls off a cliff, but if it's the difference between possible and impossible, that might be acceptable.
Please give me some DIMM slots on the GPU so that I can choose my own memory, like I'm used to in the CPU world, and re-use it when I upgrade my GPU.
An M1 Mac Studio with that much RAM can be had for around $3K if you look for good deals, and will give you ~8 tok/s on a 70B model, or ~5 tok/s for a 120B one.
Unfortunately, production capacity for that is limited, and with sufficient demand, all pricing is an auction. Therefore, we aren't going to be seeing that card for years.
We have highly efficient models for inference and a quantization team.
Need moar GPUs to do a video version of this model similar to Sora, now that they have proved Diffusion Transformers can scale with latent patches (see stablevideo.com and our work on that model, currently the best open video model).
We have 1/100th of the resources of OpenAI and 1/1000th of Google etc.
Google has cheap TPU chips, which means they circumvent the extremely expensive Nvidia corporate licenses. I can easily see them having 10x the resources of OpenAI for this.
Yes, they have deep pockets and could increase investment if needed. But the actual resources devoted today are public, and in line with what the parent said.
Can someone explain why Nvidia doesn't just run their own AI, and literally devote 50% of their production to their own compute center? In an age where even ancient companies like Cisco are getting in the AI race, why wouldn't the people with the keys to the kingdom get involved?
They've been very happy selling shovels at a steep margin to literally endless customers.
The reason is that they instantly get a risk-free, guaranteed, VERY healthy margin on every card they sell, and there are endless customers lined up for them.
If they kept the cards, they'd give up the opportunity to make those margins, and instead take on the risk of developing a money-generating service (one that makes more money than selling the cards).
This way there's no risk of a competitor out-competing them, failing to develop a profitable product, "the AI bubble popping", stagnating development, etc.
There's also the advantage that this capital has allowed them to buy up most of TSMC's production capacity, which limits competitors like Google's TPUs.
Because history has shown that the money is in selling the picks and shovels, not operating the mine. (At least for now. There very well may come a point later on when operating the mine makes more sense, but not until it's clear where the most profitable spot will be)
Don’t stretch that analogy too far. It was applicable to gold rushes, which were low hanging fruit where any idiot could dig a hole and find gold.
Historically, once the easy to find gold was all gone it was the people who owned the deep gold mines and had the capital to exploit them who became wealthy.
1. the real keys to the kingdom are held by TSMC whose fab capacity rules the advanced chips we all get, from NVIDIA to Apple to AMD to even Intel these days.
2. the old advice is to sell shovels during a gold rush
> Why is there not a greater focus on quantization to optimize model performance, given the evident need for more GPU resources?
There is an inherent trade off between model size and quality. Quantization reduces model size at the expense of quality. Sometimes it's a better way to do that than reducing the number of parameters, but it's still fundamentally the same trade off. You can't make the highest quality model use the smallest amount of memory. It's information theory, not sorcery.
Yes. Quantization compresses float32 values to int8 by mapping the large range of floats onto a smaller integer range using a scale factor. This scale factor is key for converting back to floats (dequantization), aiming to preserve as much information as possible within the int8 limits. While quantization reduces model size and speeds up computation, it trades off some accuracy due to the compression. It's a balance between efficiency and model quality, not a magic solution for shrinking models without losing some performance.
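A minimal Python sketch of that scale-factor scheme (symmetric, per-tensor quantization; real quantizers are usually per-channel and calibrate the range more carefully):

    import numpy as np

    def quantize_int8(w: np.ndarray):
        """Map float weights onto [-127, 127] with a single scale factor."""
        scale = np.abs(w).max() / 127.0                      # largest weight maps to +/-127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        """Recover approximate floats; the rounding error is the accuracy cost."""
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(w)
    print("max abs error:", np.abs(w - dequantize(q, s)).max())   # small but nonzero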
Quantization is essential for me since a 7B model won't fit on my RTX 2060 with only 6GB of VRAM. It allows me to compress the model so it can run on my hardware.
I understand that Sora is very popular, so it makes sense to refer to it, but rather than saying it is similar to Sora, I guess it actually makes more sense to say that it uses a Diffusion Transformer (DiT) (https://arxiv.org/abs/2212.09748) like Sora does. We don't really know many details about Sora, while the original DiT paper has all the details.
Is anyone else struck by the similarities in textures between the images in the appendix of the above "Scalable Diffusion Models with Transformers" paper?
If you size the browser window right and page with the arrow keys (so the document doesn't scroll), you'll see (e.g., pages 20-21) that the textures of the parrot's feathers are almost identical to the textures of bark on the tree behind the panda bear, or that the forest behind the red panda is very similar to the undersea environment.
Even if I'm misunderstanding something fundamental here about this technique, I still find this interesting!
So is this "SDXL safe" or "SD2.1" safe, cause SDXL safe we can deal with, if it's 2.1 safe it's gonna end up DOA for a large part of the opensource community again
Don't know about 3.0, but Cascade has different levels of safety between the full model and the light model. The full model is far more prudish, but both completely fail with some prompts.
>>>How does it perform on 3090, 4090 or less? Are us mere mortals gonna be able to have fun with it ?
>>>Its in sizes from 800m to 8b parameters now, will be all sizes for all sorts of edge to giant GPU deployment.
--
Can you fragment responses such that if an edge device (a mobile app) is prompted for [thing], it can pass tokens upstream on the prompt - effectively torrenting responses - and you could push actual GPU edge devices in certain climates... like dense cities, which are expected to account for an F-ton of GPU-cycle consumption around the edge?
So you have tiered processing (speed is handled locally, quality level 1 can take some edge GPU, and corporate shit can be handled in the cloud)...
----
Can you fragment and torrent a response?
If so, how is that request torn up and routed to appropriate resources?
BOFH me if this is a stupid question? (But it's valid for how quickly we are evolving to AI being intrinsic to our society.)
Soon the GPU and its associated memory will be on different cards, as once happened with CPUs. The day of the GPU with RAM slots is fast approaching. We will soon plug terabytes of RAM into our 4090s, then plug a half-dozen 4090s into a Raspberry Pi to create a Cronenberg rendering monster. Can it generate movies faster than Pixar can write them? Sure. Can it play Factorio? Heck no.
Any separation of a GPU from its VRAM is going to come at the expense of (a lot of) bandwidth. VRAM is only as fast as it is because the memory chips are as close as possible to the GPU, either on separate packages immediately next to the GPU package or integrated onto the same package as the GPU itself in the fanciest stuff.
If you don't care about bandwidth you can already have a GPU access terabytes of memory across the PCIe bus, but it's too slow to be useful for basically anything. Best case you're getting 64GB/sec over PCIe 5.0 x16, when VRAM is reaching 3.3TB/sec on the highest end hardware and even mid-range consumer cards are doing >500GB/sec.
Things are headed the other way if anything, Apple and Intel are integrating RAM onto the CPU package for better performance than is possible with socketed RAM.
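The 64 GB/s figure above falls straight out of the PCIe 5.0 spec: 32 GT/s per lane with 128b/130b encoding, times 16 lanes, per direction (a rough peak in Python, before protocol overhead):

    per_lane_gb_s = 32e9 * (128 / 130) / 8 / 1e9    # ~3.94 GB/s per PCIe 5.0 lane, one direction
    print(round(16 * per_lane_gb_s))                # ~63 GB/s for x16, vs ~1000+ GB/s for on-card VRAM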
That depends on whether performance or capacity is the goal. Smaller amounts of RAM closer to the processing unit make for faster computation, but AI also presents a capacity issue. If the workload needs the space, having a boatload of less-fast RAM is still preferable to offloading data to something more stable like flash. That is where bulk memory modules connected through slots may one day appear on GPUs.
Is there a way to partition the data so that a given GPU had access to all the data it needs but the job itself was parallelized over multiple GPUs?
Thinking of the classic neural network, for example: each column of nodes would only need to talk to the next column. You could group several columns per GPU and then each GPU would process its own set of nodes. While an individual job would be slower, you could run multiple tasks in parallel, processing new inputs after each set of nodes is finished.
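That is essentially layer-wise model parallelism (and, once different micro-batches occupy different stages at the same time, pipeline parallelism). A hypothetical PyTorch sketch of the basic split, assuming two CUDA devices are available; the class name and dimensions are made up for illustration:

    import torch
    import torch.nn as nn

    # Split a plain stack of layers across two GPUs: only the activation tensor
    # ("the output of one column of nodes") ever crosses between devices.
    class TwoStageMLP(nn.Module):
        def __init__(self, dim: int = 1024, layers_per_stage: int = 4):
            super().__init__()
            def stage():
                return nn.Sequential(*(nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                                       for _ in range(layers_per_stage)))
            self.stage0 = stage().to("cuda:0")   # first group of layers
            self.stage1 = stage().to("cuda:1")   # second group of layers

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.stage0(x.to("cuda:0"))
            return self.stage1(h.to("cuda:1"))   # hand the activations to the next GPU

    model = TwoStageMLP()
    out = model(torch.randn(8, 1024))
    # A single forward pass is slower than on one big GPU, but in true pipeline
    # parallelism each stage would work on a different micro-batch concurrently.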
No it won't. GPUs are good at ML partly because of the huge memory bandwidth: thousands of bits wide. You won't find connectors that have that many terminals and maintain signal quality. Even putting a second bank soldered onto the same signals can be enough to mess things up.
I doubt it. The latest GPUs utilize HBM which is necessarily part of the same package as the main die. If you had a RAM slot for a GPU you might as well just go out to system RAM, way too much latency to be useful.
It isn't the latency which is the problem, it's the bandwidth. A memory socket with that much bandwidth would need a lot of pins. In principle you could just have more memory slots where each slot has its own channel. 16 channels of DDR5-8000 would have more bandwidth than the RTX 4090. But an ordinary desktop board with 16 memory channels is probably not happening. You could plausibly see that on servers however.
What's more likely is hybrid systems. Your basic desktop CPU gets e.g. 8GB of HBM, but then also has 16GB of DRAM in slots. Another CPU/APU model that fits into the same socket has 32GB of HBM (and so costs more), which you could then combine with 128GB of DRAM. Or none, by leaving the slots empty, if you want entirely HBM. A server or HEDT CPU might have 256GB of HBM and support 4TB of DRAM.
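As a quick sanity check on the 16-channel DDR5-8000 figure above (peak spec numbers only):

    channel_gb_s = 8000e6 * 8 / 1e9     # one 64-bit channel of DDR5-8000: 64 GB/s
    print(16 * channel_gb_s)            # 1024 GB/s, just above the RTX 4090's ~1008 GB/s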
I don't think you really understand the current trends in computer architecture. Even CPUs are moving to on-package RAM for higher bandwidth. Everything is the opposite of what you said.
Higher bandwidth but lower capacity. The real trend is different physical architectures for different compute loads. There is a place in AI for bulk, albeit slower, memory, such as extremely large data sets that want to run internally on a discrete card without involving PCIe lanes.
This is also not true. You can transfer from main memory to cards plenty fast enough that it is not a bottleneck. Consumer GPUs don't even use PCIe 5 yet, which doubles the bandwidth of PCIe 4. Professional datacenter cards don't use PCIe at all, but they do put a huge amount of RAM on the package with the GPUs.