
From: https://twitter.com/EMostaque/status/1760660709308846135

Some notes:

- This uses a new type of diffusion transformer (similar to Sora) combined with flow matching and other improvements.

- This takes advantage of transformer improvements & can not only scale further but also accept multimodal inputs.

- Will be released open, the preview is to improve its quality & safety just like og stable diffusion

- It will launch with a full ecosystem of tools

- It's a new base taking advantage of latest hardware & comes in all sizes

- Enables video, 3D & more..

- Need moar GPUs..

- More technical details soon

>Can we create videos similar to Sora?

Given enough GPUs and good data yes.

>How does it perform on 3090, 4090 or less? Are us mere mortals gonna be able to have fun with it ?

It's in sizes from 800M to 8B parameters now; it will come in all sizes for all sorts of edge to giant GPU deployment.

(adding some later replies)

>awesome. I assume these aren't heavily cherry picked seeds?

No this is all one generation. With DPO, refinement, further improvement should get better.

>Do you have any solves coming for driving coherency and consistency across image generations? For example, putting the same dog in another scene?

yeah see @Scenario_gg's great work with IP adapters for example. Our team builds ComfyUI so you can expect some really great stuff around this...
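(As a rough sketch of what the IP-Adapter approach looks like in practice, using the diffusers integration; the checkpoint names and the local reference photo below are illustrative assumptions, not anything confirmed in the thread.)

    # Rough sketch: conditioning new generations on a reference image (e.g. a
    # specific dog) with an IP-Adapter via diffusers. Checkpoint names and the
    # reference photo are hypothetical, for illustration only.
    import torch
    from diffusers import StableDiffusionPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                         weight_name="ip-adapter_sd15.bin")
    pipe.set_ip_adapter_scale(0.7)  # how strongly to follow the reference image

    dog = load_image("my_dog.png")  # hypothetical reference photo
    image = pipe(
        prompt="the same dog sitting on a beach at sunset",
        ip_adapter_image=dog,
        num_inference_steps=30,
    ).images[0]
    image.save("dog_on_beach.png")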

>Dall-e often doesn’t even understand negation, let alone complex spatial relations in combination with color assignments to objects.

I imagine the new version will. DALL-E and MJ are also pipelines; you can pretty much do anything accurately with pipelines now.

>Nice. Is it an open-source / open-parameters / open-data model?

Like prior SD models it will be open source/parameters after the feedback and improvement phase. We are open data for our LMs but not other modalities.

>Cool!!! What do you mean by good data? Can it directly output videos?

If we trained it on video yes, it is very much like the arch of sora.




SD 1.5 is 983m parameters, SDXL is 3.5b, for reference.

Very interesting. I've been stretching my 12GB 3060 as far as I can; it's exciting that smaller hardware is still usable even with modern improvements.


Stability has to make money somehow. By releasing an 8B parameter model, they’re encouraging people to use their paid API for inference. It’s not a terrible business decision. And hobbyists can play with the smaller models, which with some refining will probably be just fine for most non-professional use cases.


I would LOL if they released the "safe" model for free but made you pay for the one with boobs.


Oh they’ll never let you pay for porn generation. But they will happily entertain having you pay for quality commercial images that are basically a replacement for the entire graphic design industry.


It's not an easy fap, but I guess I'm watching people get f*cked either way.


Don't people quantize SD down to 8 bits? I understand plenty of people don't have 8GB of VRAM (and I suppose you need some extra for supplemental data, so maybe 10GB?). But that's still well within the realm of consumer hardware capabilities.


I’m the wrong person to ask, but it seems Stability intends to offer models from 800M to 8B parameters in size, which offers something for everyone.


I am going to look at quantization for 8B. But also, these are transformers, so a variety of merging / Frankenstein-tuning is possible. For example, you can use the 8B model to populate the KV cache (which is computed once, so it can be loaded from slower devices, such as RAM / SSD) and use the 800M model for diffusion by replicating weights to match layers of the 8B model.


800m is good for mobile, 8b for graphics cards.

Bigger than that is also possible, not saturated yet but need more GPUs.


Do you know how the memory demands compare to LLMs at the same number of parameters? For example, Mistral 7B quantized to 4 bits works very well on an 8GB card, though there isn’t room for long context.


You can also use quantization, which lowers memory requirements at a small loss of performance.
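For a rough sense of the weight memory alone (ignoring activations and framework overhead), it's just parameters times bytes per parameter, e.g.:

    # Back-of-envelope weight memory: parameters x bytes per parameter.
    # Ignores activations and framework overhead, so real usage is higher.
    def weight_gb(params_billion, bits):
        return params_billion * 1e9 * bits / 8 / 1e9

    for name, params, bits in [
        ("Mistral 7B @ 4-bit", 7, 4),
        ("8B diffusion model @ fp16", 8, 16),
        ("8B diffusion model @ int8", 8, 8),
        ("800M diffusion model @ fp16", 0.8, 16),
    ]:
        print(f"{name}: ~{weight_gb(params, bits):.1f} GB")
    # Mistral 7B @ 4-bit: ~3.5 GB
    # 8B diffusion model @ fp16: ~16.0 GB
    # 8B diffusion model @ int8: ~8.0 GB
    # 800M diffusion model @ fp16: ~1.6 GB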


I'm curious - where are the GPUs with decent processing power but enormous memory? Seems like there'd be a big market for them.


Nvidia is making way too much money keeping cards with lots of memory exclusive to server GPUs they sell with insanely high margins.

AMD still suffers from limited resources and doesn't seem willing to spend too much chasing a market that might just be temporary hype. Google's TPUs are a pain to use and seem to have stalled out. And Intel lacks commitment; even their products that went roughly in that direction aren't a great match for neural networks because of their philosophy of having fewer, more complex cores.


MacBooks with M2 or M3 Max. I’m serious. They perform like a 2070 or 2080 but have up to 128GB of unified memory, most of which can be used as VRAM.


MPS is promising and the memory bandwidth is definitely there, but stable diffusion performance on Apple Silicon remains terribly poor compared with consumer Nvidia cards (in my humble opinion). Perhaps this is partly because so many bits of the SD ecosystem are tied to Nvidia primitives.
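(For reference, a minimal sketch of what running it via the "mps" backend with diffusers looks like; the model name and settings are illustrative, not a tuned setup.)

    # Minimal sketch: Stable Diffusion on Apple Silicon via PyTorch's "mps"
    # backend. Unified memory means the whole pipeline sits in system RAM.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("mps")

    image = pipe("a watercolor of a lighthouse at dawn",
                 num_inference_steps=30).images[0]
    image.save("lighthouse.png")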


Image diffusion models tend to have relatively low memory requirements compared to LLMs (and don’t benefit from batching), so having access to 128 GB of unified memory is kinda pointless.


They do benefit from batching; up to a 50% performance improvement, in my experience.

That might seem small compared to LLMs, but it isn't small in absolute terms.


I got a 2x jump on my 4090 from batching SDXL.
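(Concretely, batching here just means one denoising run producing several images instead of several separate runs; a sketch with diffusers, model name illustrative:)

    # Sketch: batched SDXL generation. One denoising run produces several
    # images, amortizing per-step overhead; the speedup depends on the GPU.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "a studio photo of a vintage motorcycle"

    # Unbatched baseline: four separate runs.
    # for _ in range(4): pipe(prompt).images[0]

    # Batched: four images from a single run.
    images = pipe(prompt, num_images_per_prompt=4).images
    for i, img in enumerate(images):
        img.save(f"motorcycle_{i}.png")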


Stable diffusion will run fine on a 3090, or 4070ti Super and higher.


How many tokens/s are we talking for a 70B model?

Last I saw they performed really poorly, like low single digits t/s. Don't get me wrong, they're probably a decent value for experimenting with it, but that's flat out pathetic compared to an A100 or H100. And I think useless for training?


You can run a 180B model like Falcon Q4 at around 4-5 tk/s, a 120B model like Goliath Q4 at around 6-10 tk/s, and a 70B Q4 at around 8-12 tk/s, with smaller models much quicker, but it really depends on the context size, model architecture and other settings. An A100 or H100 is obviously going to be a lot faster, but it costs significantly more once you take its supporting requirements into account, and it can't be run on a light, battery-powered laptop etc…


For text inference, what you want is the M1/M2 Ultra with its 800 GB/s RAM. The Max only goes up to 400 GB/s.


Yeah, but the Ultra only comes in desktop platforms, which may be limiting to some.


But that's no different from mid-to-high-end GPUs, which is what the original ask was about.


I’ll bet you the Nvidia 50xx series will have cards that are asymmetric for this reason. But nothing that will cannibalize their gaming market.

You’ll be able to get higher resolution but slowly. Or pay the $2800 for a 5090 and get high res with good speed.


I kind of wonder if gaming will start incorporating AI stuff. What if instead of generating a stable diffusion image, you could generate levels and monsters


I think the AMD 8600XT is a move in this direction, otherwise there was little point in releasing it.

GPUs need a decent virtual memory system though. The current "it runs or it crashes" situation isn't good enough.


Nvidia has a system for DMA from GPU to system memory, GPUDirect. That seems like a potentially better route if latency can be handled well.


GPU memory is all about bandwidth, not latency. DDR5 can do 4-8 GT/s over a 64-bit bus per DIMM, so it maxes out around 128 GB/s with a dual memory controller, or 512 GB/s with 8 memory controllers on server chips. GDDR6X runs at roughly twice the frequency, and the 4090's memory bus is 6x as wide (384-bit vs 64-bit), so you get close to an order of magnitude bump in throughput: nearly 1 TB/s on a consumer product. Datacenter GPUs (e.g. the A100) with HBM2e double that to 2 TB/s.
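The arithmetic behind those numbers, as a quick sanity check (nominal spec-sheet values):

    # Peak memory bandwidth = transfer rate per pin (GT/s) x bus width (bytes).
    def bandwidth_gbs(gt_per_s, bus_width_bits):
        return gt_per_s * bus_width_bits / 8

    print(bandwidth_gbs(8, 64))      # one DDR5-8000 channel:              64 GB/s
    print(bandwidth_gbs(8, 128))     # dual-channel desktop DDR5:          128 GB/s
    print(bandwidth_gbs(8, 512))     # 8-channel server DDR5:              512 GB/s
    print(bandwidth_gbs(21, 384))    # RTX 4090 (21 GT/s GDDR6X, 384-bit): 1008 GB/s
    print(bandwidth_gbs(3.2, 5120))  # A100 80GB (HBM2e, 5120-bit):        ~2 TB/s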


I dream of AMD or Intel creating cards to do just that


Tesla P40


H200 has 141GB, B100 (out next month) will probably have even more. How much memory do you need?


We need 128GB with a 4070 chip for about 2000 dollars. That's what we want.


I've never tried it, but in Windows you can have CUDA apps fall back to system ram when GPU vram is exhausted. You could slap 128gb in your rig with a 4070. I'm sure performance falls off a cliff, but if it's the difference between possible and impossible that might be acceptable.

https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/s...


Nvidia will not build that any time soon. RAM is the dividing line between charging $40,000 vs $2500…


Please give me some DIMM slots on the GPU so that I can choose my own memory like I'm used to from the CPU-world and which I can re-use when I upgrade my GPU.


An M1 Mac Studio with that much RAM can be had for around $3K if you look for good deals, and will give you ~8 tok/s on a 70B model, or ~5 tok/s for a 120B one.


Unfortunately, production capacity for that is limited, and with sufficient demand all pricing is an auction. Therefore we aren't going to be seeing that card for years.


Yes please.


> - Need moar GPUs..

Why is there not a greater focus on quantization to optimize model performance, given the evident need for more GPU resources?


We have highly efficient models for inference and a quantization team.

Need moar GPUs to do a video version of this model similar to Sora, now that they have proved that Diffusion Transformers can scale with latent patches (see stablevideo.com and our work on that model, currently the best open video model).

We have 1/100th of the resources of OpenAI and 1/1000th of Google etc.

So we focus on great algorithms and community.

But now we need those GPUs.


Don't fall for it: OpenAI is Microsoft. They have as much as Google, if not more.


Google has its own cheap TPU chips, which means they circumvent Nvidia's extremely expensive corporate licensing. I can easily see them having 10x the resources of OpenAI for this.


Yes, they have deep pockets and could increase investment if needed. But the actual resources devoted today are public, and in line with what the parent said.


To be clear here, you think that Microsoft has more AI compute than Google?


This isn’t OpenAI that make GPTx.

It’s StabilityAI that makes Stable Diffusion X.


Can someone explain why Nvidia doesn't just build their own AI and literally devote 50% of their production to their own compute center? In an age where even ancient companies like Cisco are getting into the AI race, why wouldn't the people with the keys to the kingdom get involved?


They've been very happy selling shovels at a steep margin to literally endless customers.

The reason is that they instantly get a risk-free, guaranteed, VERY healthy margin on every card they sell, and there's an endless line of customers waiting for them.

If they kept the cards, they'd give up the opportunity to make those margins and instead take on the risk of developing a money-generating service (one that makes more money than selling the cards would).

This way there's no risk of a competitor out-competing them, of failing to develop a profitable product, of "the AI bubble popping", of stagnating development, etc.

There's also the advantage that this capital has allowed them to buy up most of TSMC's production capacity, which limits the competitors like Google's TPUs.


Because history has shown that the money is in selling the picks and shovels, not operating the mine. (At least for now. There very well may come a point later on when operating the mine makes more sense, but not until it's clear where the most profitable spot will be)


Don’t stretch that analogy too far. It was applicable to gold rushes, which were low hanging fruit where any idiot could dig a hole and find gold.

Historically, once the easy to find gold was all gone it was the people who owned the deep gold mines and had the capital to exploit them who became wealthy.


"The people that made the most money in the gold rush were selling shovels, not digging gold".


1. the real keys to the kingdom are held by TSMC whose fab capacity rules the advanced chips we all get, from NVIDIA to Apple to AMD to even Intel these days.

2. the old advice is to sell shovels during a gold rush


Jensen was just talking about a new kind of data center: AI-generation factories.


> Why is there not a greater focus on quantization to optimize model performance, given the evident need for more GPU resources?

There is an inherent trade off between model size and quality. Quantization reduces model size at the expense of quality. Sometimes it's a better way to do that than reducing the number of parameters, but it's still fundamentally the same trade off. You can't make the highest quality model use the smallest amount of memory. It's information theory, not sorcery.


Yes, quantization compresses float32 values to int8 by mapping the large range of floats to a smaller integer range using a scale factor. This scale factor is key for converting back to floats (dequantization), aiming to preserve as much information as possible within the int8 limits. While quantization reduces model size and speeds up computation, it trades off some accuracy due to the compression. It's a balance between efficiency and model quality, not a magic solution to shrink models without losing some performance.

Quantization is essential for me since a 7B model won't fit on my RTX 2060 with only 6GB of VRAM. It allows me to compress the model so it can run on my hardware.
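A toy illustration of that scale-factor idea (symmetric per-tensor int8; real libraries add per-channel scales, zero-points and calibration on top):

    # Toy symmetric int8 quantization: map float32 weights to int8 with one
    # scale factor, then dequantize. Shows the size/precision trade-off only.
    import numpy as np

    def quantize_int8(w):
        scale = np.abs(w).max() / 127.0            # largest magnitude -> +/-127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    print("max abs error:", np.abs(w - w_hat).max())  # small but non-zero
    print("size ratio:", q.nbytes / w.nbytes)         # 0.25 (int8 vs float32)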


I believe he means for training


I understand that Sora is very popular, so it makes sense to refer to it, but when saying it is similar to Sora, I guess it actually makes more sense to say that it uses a Diffusion Transformer (DiT) (https://arxiv.org/abs/2212.09748) like Sora. We don't really know more details on Sora, while the original DiT has all the details.


Is anyone else struck by the similarities in textures between the images in the appendix of the above "Scalable Diffusion Models with Transformers" paper?

If you size the browser window right and page with the arrow keys (so the document doesn't scroll), you'll see (e.g., pages 20-21) that the texture of the parrot's feathers is almost identical to the texture of the bark on the tree behind the panda bear, and the forest behind the red panda is very similar to the undersea environment.

Even if I'm misunderstanding something fundamental here about this technique, I still find this interesting!


Could be that they’re all generated from the same seed. And we humans are really good at spotting patterns like that.


So is this "SDXL safe" or "SD2.1 safe"? Because SDXL-safe we can deal with; if it's 2.1-safe it's gonna end up DOA for a large part of the open-source community again.


SD2.1 was not "overly safe"; SD2.0 was, because of a training bug.

2.1 didn't have adoption because people didn't want to deal with the open replacement for CLIP. Or possibly because everyone confused 2.0 and 2.1.


There was a replacement for CLIP? That is awesome. What was the issue with it?


Don't know about 3.0, but Cascade has different levels of safety between the full model and the light model. The full model is far more prudish, but both completely fail with some prompts.


> SDXL safe we can deal with

how exactly did the community deal with it? interested to learn how to unlearn safety


>>>How does it perform on 3090, 4090 or less? Are us mere mortals gonna be able to have fun with it ?

>>>Its in sizes from 800m to 8b parameters now, will be all sizes for all sorts of edge to giant GPU deployment.

--

Can you fragment responses such that if an edge device (a mobile app) is prompted for [thing], it can pass tokens upstream on the prompt -- torrenting responses, effectively -- and you could push actual GPU edge devices in certain climates... like dense cities, which are expected to account for a ton of GPU cycle consumption around the edge?

So you have tiered processing (speed is handled locally, quality level 1 can take some edge GPU, and corporate shit can be handled in the cloud)...

----

Can you fragment and torrent a response?

If so, how is that request torn up and routed to appropriate resources?

BOFH me if this is a stupid question (but it's valid given how quickly we are evolving toward AI being intrinsic to our society).


> Dall-e often doesn’t even understand negation, let alone complex spatial relations in combination with color assignments to objects.

Can someone explain how negation is currently done in Stable Diffusion? And why can't we do it in text LLMs?


You can use a negative logit bias.
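For the Stable Diffusion side, negation is usually handled with a negative prompt rather than by the text encoder understanding "no X": classifier-free guidance steers samples toward the prompt and away from the negative prompt. A minimal sketch with diffusers (model name illustrative):

    # Sketch: negation via a negative prompt. Guidance pushes the sample toward
    # the prompt and away from the negative prompt at each denoising step.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe(
        prompt="a city street at night",
        negative_prompt="cars, people",   # concepts to steer away from
        guidance_scale=7.5,
    ).images[0]
    image.save("empty_street.png")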


>> all sorts of edge to giant GPU deployment.

Soon the GPU and its associated memory will be on different cards, as once happened with CPUs. The day of the GPU with RAM slots is fast approaching. We will soon plug terabytes of RAM into our 4090s, then plug a half-dozen 4090s into a Raspberry Pi to create a Cronenberg rendering monster. Can it generate movies faster than Pixar can write them? Sure. Can it play Factorio? Heck no.


Any separation of a GPU from its VRAM is going to come at the expense of (a lot of) bandwidth. VRAM is only as fast as it is because the memory chips are as close as possible to the GPU, either in separate packages immediately next to the GPU package or integrated onto the same package as the GPU itself in the fanciest stuff.

If you don't care about bandwidth you can already have a GPU access terabytes of memory across the PCIe bus, but it's too slow to be useful for basically anything. Best case you're getting 64GB/sec over PCIe 5.0 x16, when VRAM is reaching 3.3TB/sec on the highest end hardware and even mid-range consumer cards are doing >500GB/sec.

Things are headed the other way if anything, Apple and Intel are integrating RAM onto the CPU package for better performance than is possible with socketed RAM.


That depends on whether performance or capacity is the goal. Smaller amounts of RAM closer to the processing unit make for faster computation, but AI also presents a capacity issue. If the workload needs the space, having a boatload of less-fast RAM is still preferable to offloading data to something more stable like flash. That is where bulk memory modules connected through slots may one day appear on GPUs.


I'm having flashbacks to owning a Matrox Millenium as a kid. I never did get that 4MB vram upgrade.

https://www.512bit.net/matrox/matrox_millenium.html


Is there a way to partition the data so that a given GPU had access to all the data it needs but the job itself was parallelized over multiple GPUs?

Thinking on the classic neural network for example, each column of nodes would only need to talk to the next column. You could group several columns per GPU and then each would process its own set of nodes. While an individual job would be slower, you could run multiple tasks in parallel, processing new inputs after each set of nodes is finished.
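(A minimal sketch of that layer-group-per-GPU idea in PyTorch, assuming two GPUs; real pipeline-parallel frameworks, like the DeepSpeed mentioned in the reply below, also split each batch into micro-batches so both devices stay busy.)

    # Naive model parallelism: consecutive layer groups live on different GPUs,
    # and activations hop between devices. Pipeline parallelism adds
    # micro-batching on top so the GPUs work concurrently instead of in turn.
    import torch
    import torch.nn as nn

    stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(),
                           nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
    stage2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(),
                           nn.Linear(4096, 10)).to("cuda:1")

    def forward(x):
        h = stage1(x.to("cuda:0"))     # first group of layers on GPU 0
        return stage2(h.to("cuda:1"))  # remaining layers on GPU 1

    out = forward(torch.randn(32, 1024))
    print(out.shape)  # torch.Size([32, 10])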


Of course, this is common with LLMs which are too large to fit in any single GPU. I believe Deepspeed implements what you're referring to.


No it won't. GPUs are good at ML partly because of their huge memory bandwidth: buses thousands of bits wide. You won't find connectors that have that many terminals and maintain signal quality. Even putting a second bank soldered onto the same signals can be enough to mess things up.


I doubt it. The latest GPUs utilize HBM which is necessarily part of the same package as the main die. If you had a RAM slot for a GPU you might as well just go out to system RAM, way too much latency to be useful.


It isn't the latency which is the problem, it's the bandwidth. A memory socket with that much bandwidth would need a lot of pins. In principle you could just have more memory slots where each slot has its own channel. 16 channels of DDR5-8000 would have more bandwidth than the RTX 4090. But an ordinary desktop board with 16 memory channels is probably not happening. You could plausibly see that on servers however.

What's more likely is hybrid systems. Your basic desktop CPU gets e.g. 8GB of HBM, but then also has 16GB of DRAM in slots. Another CPU/APU model that fits into the same socket has 32GB of HBM (and so costs more), which you could then combine with 128GB of DRAM. Or none, by leaving the slots empty, if you want entirely HBM. A server or HEDT CPU might have 256GB of HBM and support 4TB of DRAM.


Agree, this is the likely future. It's really just an extension of the existing tiered CPU cache model.


I don’t think you really understand the current trends in computer architecture. Even cpus are being moved to have on package ram for higher bandwidth. Everything is the opposite of what you said.


Higher bandwidth but lower capacity. The real trend is different physical architectures for different compute loads. There is a place in AI for bulk, albeit slower, memory, such as extremely large data sets that want to run entirely on a discrete card without involving PCIe lanes.


This is also not true. You can transfer from main memory to cards plenty fast enough that it is not a bottleneck. Consumer GPUs don't even use PCIe 5 yet, which doubles the bandwidth of PCIe 4. Professional datacenter cards don't use PCIe AT ALL, but they do put a huge amount of RAM on the package with the GPUs.



