Some notes:
- This uses a new type of diffusion transformer (similar to Sora) combined with flow matching and other improvements.
- It takes advantage of transformer improvements & can not only scale further but also accept multimodal inputs.
- It will be released open; the preview is to improve its quality & safety, just like the original Stable Diffusion.
- It will launch with a full ecosystem of tools.
- It's a new base taking advantage of the latest hardware & comes in all sizes.
- Enables video, 3D & more.
- Need moar GPUs.
- More technical details soon.
>Can we create videos similar to Sora?
Given enough GPUs and good data, yes.
>How does it perform on a 3090, 4090 or less? Are we mere mortals gonna be able to have fun with it?
It's in sizes from 800M to 8B parameters now; there will be all sizes for all sorts of edge-to-giant-GPU deployment.
(adding some later replies)
>Awesome. I assume these aren't heavily cherry-picked seeds?
No, this is all one generation. With DPO, refinement and further improvement it should get better.
>Do you have any solves coming for driving coherency and consistency across image generations? For example, putting the same dog in another scene?
Yeah, see @Scenario_gg's great work with IP adapters, for example. Our team builds ComfyUI, so you can expect some really great stuff around this...
>Dall-e often doesn’t even understand negation, let alone complex spatial relations in combination with color assignments to objects.
I imagine the new version will. DALL-E and MJ are also pipelines; you can pretty much do anything accurately with pipelines now.
>Nice. Is it an open-source / open-parameters / open-data model?
Like prior SD models, it will be open source/parameters after the feedback and improvement phase. We are open data for our LMs but not for other modalities.
>Cool!!! What do you mean by good data? Can it directly output videos?
If we trained it on video, yes; it is very much like the architecture of Sora.
Stability has to make money somehow. By releasing an 8B parameter model, they’re encouraging people to use their paid API for inference. It’s not a terrible business decision. And hobbyists can play with the smaller models, which with some refining will probably be just fine for most non-professional use cases.
Oh they’ll never let you pay for porn generation. But they will happily entertain having you pay for quality commercial images that are basically a replacement for the entire graphic design industry.
Don't people quantize SD down to 8 bits? I understand plenty of people don't have 8GB of VRAM (and I suppose you need some extra for supplemental data, so maybe 10GB?). But that's still well within the realm of consumer hardware capabilities.
I am going to look at quantization for the 8B model. But also, these are transformers, so a variety of merging / Frankenstein-tuning is possible. For example, you could use the 8B model to populate the KV cache (which is computed once, so the weights can be loaded from slower devices such as RAM / SSD) and use the 800M model for diffusion by replicating its weights to match the layers of the 8B model.
Do you know how the memory demands compare to LLMs at the same number of parameters? For example, Mistral 7B quantized to 4 bits works very well on an 8GB card, though there isn’t room for long context.
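For a rough sense of scale, here is a back-of-envelope comparison using the parameter counts quoted in this thread (weights only; activations, any KV cache and framework overhead come on top, which is why a 4-bit 7B model fits an 8 GB card but leaves little room for context). A sketch in Python, not measured numbers; the helper name is made up:

    # Weight memory ~= parameter count x bits per parameter / 8 bits per byte.
    def weight_gb(params_billion: float, bits: int) -> float:
        return params_billion * 1e9 * bits / 8 / 1e9

    for name, params in [("800M model", 0.8), ("Mistral 7B", 7.0), ("8B model", 8.0)]:
        sizes = {bits: round(weight_gb(params, bits), 1) for bits in (16, 8, 4)}
        print(name, sizes)   # e.g. Mistral 7B -> {16: 14.0, 8: 7.0, 4: 3.5}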
Nvidia is making way too much money keeping cards with lots of memory exclusive to server GPUs they sell with insanely high margins.
AMD still suffers from limited resources and doesn't seem willing to spend too much chasing a market that might just be temporary hype; Google's TPUs are a pain to use and seem to have stalled out; and Intel lacks commitment, and even their products that went roughly in that direction aren't a great match for neural networks because of their philosophy of having fewer, more complex cores.
MPS is promising and the memory bandwidth is definitely there, but Stable Diffusion performance on Apple Silicon remains terribly poor compared with consumer Nvidia cards (in my humble opinion). Perhaps this is partly because so many bits of the SD ecosystem are tied to Nvidia primitives.
Image diffusion models tend to have relatively low memory requirements compared to LLMs (and don’t benefit from batching), so having access to 128 GB of unified memory is kinda pointless.
Last I saw they performed really poorly, like low single-digit t/s. Don't get me wrong, they're probably a decent value for experimenting with it, but it's flat out pathetic compared to an A100 or H100. And I think useless for training?
You can run a 180B model like Falcon Q4 at around 4-5 tk/s, a 120B model like Goliath Q4 at around 6-10 tk/s, and a 70B Q4 at around 8-12 tk/s, with smaller models much quicker, but it really depends on the context size, model architecture and other settings. An A100 or H100 is obviously going to be a lot faster, but it costs significantly more once you take its supporting requirements into account, and it can't be run on a light, battery-powered laptop, etc.
I kind of wonder if gaming will start incorporating AI stuff. What if, instead of generating a Stable Diffusion image, you could generate levels and monsters?
GPU memory is all about bandwidth, not latency. DDR5 can do 4-8 GT/s over a 64-bit bus per DIMM, so it maxes out around 128 GB/s with a dual memory controller and 512 GB/s with 8 memory controllers on server chips. The GDDR6X in the 4090 runs at roughly twice the frequency over a 384-bit bus (~6x as wide as a single DIMM), so you get an order-of-magnitude bump in throughput: nearly 1 TB/s on a consumer product. Datacenter GPUs (e.g. the A100) with HBM2e double that to 2 TB/s.
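A quick back-of-envelope version of those figures in Python, using rounded spec-sheet peaks rather than measured throughput (the 21 Gbps / 384-bit numbers for the 4090 are the published memory specs):

    # Peak bandwidth = transfer rate (MT/s) x bus width (bytes) x number of channels.
    def peak_gb_s(mt_per_s: float, bus_bits: int, channels: int = 1) -> float:
        return mt_per_s * 1e6 * (bus_bits / 8) * channels / 1e9

    print(peak_gb_s(8000, 64, 1))    # DDR5-8000, one DIMM/channel:    64 GB/s
    print(peak_gb_s(8000, 64, 2))    # dual memory controller:        128 GB/s
    print(peak_gb_s(8000, 64, 8))    # 8-channel server chip:         512 GB/s
    print(peak_gb_s(21000, 384))     # RTX 4090, 21 Gbps GDDR6X:    ~1008 GB/s
    # A100 80GB with HBM2e is rated at roughly 2,000 GB/s on the spec sheet.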
I've never tried it, but in Windows you can have CUDA apps fall back to system RAM when GPU VRAM is exhausted. You could slap 128 GB in your rig with a 4070. I'm sure performance falls off a cliff, but if it's the difference between possible and impossible, that might be acceptable.
Please give me some DIMM slots on the GPU so that I can choose my own memory, like I'm used to in the CPU world, and re-use it when I upgrade my GPU.
An M1 Mac Studio with that much RAM can be had for around $3K if you look for good deals, and will give you ~8 tok/s on a 70B model, or ~5 tok/s for a 120B one.
Unfortunately, production capacity for that is limited, and with sufficient demand, all pricing is an auction. Therefore, we aren't going to be seeing that card for years.
We have highly efficient models for inference and a quantization team.
Need moar GPUs to do a video version of this model similar to Sora, now that they have proved Diffusion Transformers can scale with latent patches (see stablevideo.com and our work on that model, currently the best open video model).
We have 1/100th of the resources of OpenAI and 1/1000th of Google etc.
Google has cheap TPU chips, which means they circumvent the extremely expensive Nvidia corporate licenses. I can easily see them having 10x the resources of OpenAI for this.
Yes, they have deep pockets and could increase investment if needed. But the actual resources devoted today are public, and in line with what the parent said.
Can someone explain why Nvidia doesn't just run their own AI, and literally devote 50% of their production to their own compute center? In an age where even ancient companies like Cisco are getting in the AI race, why wouldn't the people with the keys to the kingdom get involved?
They've been very happy selling shovels at a steep margin to literally endless customers.
The reason is that they instantly get a risk-free, guaranteed, VERY healthy margin on every card they sell, and there are endless customers lined up for them.
If they kept the cards, they'd give up the opportunity to make those margins, and instead take on the risk of developing a money-generating service (one that makes more money than selling the cards).
This way there's no risk of a competitor out-competing them, failing to develop a profitable product, "the AI bubble popping", stagnating development, etc.
There's also the advantage that this capital has allowed them to buy up most of TSMC's production capacity, which limits competitors like Google's TPUs.
Because history has shown that the money is in selling the picks and shovels, not operating the mine. (At least for now. There very well may come a point later on when operating the mine makes more sense, but not until it's clear where the most profitable spot will be)
Don’t stretch that analogy too far. It was applicable to gold rushes, which were low hanging fruit where any idiot could dig a hole and find gold.
Historically, once the easy to find gold was all gone it was the people who owned the deep gold mines and had the capital to exploit them who became wealthy.
1. the real keys to the kingdom are held by TSMC whose fab capacity rules the advanced chips we all get, from NVIDIA to Apple to AMD to even Intel these days.
2. the old advice is to sell shovels during a gold rush
> Why is there not a greater focus on quantization to optimize model performance, given the evident need for more GPU resources?
There is an inherent trade off between model size and quality. Quantization reduces model size at the expense of quality. Sometimes it's a better way to do that than reducing the number of parameters, but it's still fundamentally the same trade off. You can't make the highest quality model use the smallest amount of memory. It's information theory, not sorcery.
Yes. Quantization compresses float32 values to int8 by mapping the large range of floats onto a smaller integer range using a scale factor. This scale factor is key for converting back to floats (dequantization), aiming to preserve as much information as possible within the int8 limits. While quantization reduces model size and speeds up computation, it trades off some accuracy due to the compression. It's a balance between efficiency and model quality, not a magic solution for shrinking models without losing some performance.
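A minimal Python sketch of that scale-factor scheme (symmetric, per-tensor quantization; real quantizers are usually per-channel and calibrate the range more carefully):

    import numpy as np

    def quantize_int8(w: np.ndarray):
        """Map float weights onto [-127, 127] with a single scale factor."""
        scale = np.abs(w).max() / 127.0                      # largest weight maps to +/-127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        """Recover approximate floats; the rounding error is the accuracy cost."""
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(w)
    print("max abs error:", np.abs(w - dequantize(q, s)).max())   # small but nonzero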
Quantization is essential for me since a 7B model won't fit on my RTX 2060 with only 6GB of VRAM. It allows me to compress the model so it can run on my hardware.
I understand that Sora is very popular, so it makes sense to refer to it, but rather than saying it is similar to Sora, I guess it actually makes more sense to say that it uses a Diffusion Transformer (DiT) (https://arxiv.org/abs/2212.09748) like Sora does. We don't really know many details about Sora, while the original DiT paper has all the details.
Is anyone else struck by the similarities in textures between the images in the appendix of the above "Scalable Diffusion Models with Transformers" paper?
If you size the browser window right and page with the arrow keys (so the document doesn't scroll), you'll see (e.g., pages 20-21) that the textures of the parrot's feathers are almost identical to the textures of bark on the tree behind the panda bear, or that the forest behind the red panda is very similar to the undersea environment.
Even if I'm misunderstanding something fundamental here about this technique, I still find this interesting!
So is this "SDXL safe" or "SD2.1" safe, cause SDXL safe we can deal with, if it's 2.1 safe it's gonna end up DOA for a large part of the opensource community again
Don't know about 3.0, but Cascade has different levels of safety between the full model and the light model. The full model is far more prudish, but both completely fail with some prompts.
>>>How does it perform on 3090, 4090 or less? Are us mere mortals gonna be able to have fun with it ?
>>>Its in sizes from 800m to 8b parameters now, will be all sizes for all sorts of edge to giant GPU deployment.
--
Can you fragment responses such that if an edge device (a mobile app) is prompted for [thing], it can pass tokens upstream on the prompt - effectively torrenting responses - and you could push actual GPU edge devices in certain climates... like dense cities, which are expected to account for an F-ton of GPU-cycle consumption around the edge?
So you have tiered processing (speed is handled locally, quality level 1 can take some edge GPU, and corporate shit can be handled in the cloud)...
----
Can you fragment and torrent a response?
If so, how is that request torn up and routed to appropriate resources?
BOFH me if this is a stupid question? (But it's valid for how quickly we are evolving to AI being intrinsic to our society.)
Soon the GPU and its associated memory will be on different cards, as once happened with CPUs. The day of the GPU with RAM slots is fast approaching. We will soon plug terabytes of RAM into our 4090s, then plug a half-dozen 4090s into a Raspberry Pi to create a Cronenberg rendering monster. Can it generate movies faster than Pixar can write them? Sure. Can it play Factorio? Heck no.
Any separation of a GPU from its VRAM is going to come at the expense of (a lot of) bandwidth. VRAM is only as fast as it is because the memory chips are as close as possible to the GPU, either on separate packages immediately next to the GPU package or integrated onto the same package as the GPU itself in the fanciest stuff.
If you don't care about bandwidth you can already have a GPU access terabytes of memory across the PCIe bus, but it's too slow to be useful for basically anything. Best case you're getting 64GB/sec over PCIe 5.0 x16, when VRAM is reaching 3.3TB/sec on the highest end hardware and even mid-range consumer cards are doing >500GB/sec.
Things are headed the other way if anything, Apple and Intel are integrating RAM onto the CPU package for better performance than is possible with socketed RAM.
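The 64 GB/s figure above falls straight out of the PCIe 5.0 spec: 32 GT/s per lane with 128b/130b encoding, times 16 lanes, per direction (a rough peak in Python, before protocol overhead):

    per_lane_gb_s = 32e9 * (128 / 130) / 8 / 1e9    # ~3.94 GB/s per PCIe 5.0 lane, one direction
    print(round(16 * per_lane_gb_s))                # ~63 GB/s for x16, vs ~1000+ GB/s for on-card VRAM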
That depends on whether performance or capacity is the goal. Smaller amounts of RAM closer to the processing unit make for faster computation, but AI also presents a capacity issue. If the workload needs the space, having a boatload of less-fast RAM is still preferable to offloading data to something more stable like flash. That is where bulk memory modules connected through slots may one day appear on GPUs.
Is there a way to partition the data so that a given GPU had access to all the data it needs but the job itself was parallelized over multiple GPUs?
Thinking of the classic neural network, for example: each column of nodes would only need to talk to the next column. You could group several columns per GPU and then each GPU would process its own set of nodes. While an individual job would be slower, you could run multiple tasks in parallel, processing new inputs after each set of nodes is finished.
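That is essentially layer-wise model parallelism (and, once different micro-batches occupy different stages at the same time, pipeline parallelism). A hypothetical PyTorch sketch of the basic split, assuming two CUDA devices are available; the class name and dimensions are made up for illustration:

    import torch
    import torch.nn as nn

    # Split a plain stack of layers across two GPUs: only the activation tensor
    # ("the output of one column of nodes") ever crosses between devices.
    class TwoStageMLP(nn.Module):
        def __init__(self, dim: int = 1024, layers_per_stage: int = 4):
            super().__init__()
            def stage():
                return nn.Sequential(*(nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                                       for _ in range(layers_per_stage)))
            self.stage0 = stage().to("cuda:0")   # first group of layers
            self.stage1 = stage().to("cuda:1")   # second group of layers

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.stage0(x.to("cuda:0"))
            return self.stage1(h.to("cuda:1"))   # hand the activations to the next GPU

    model = TwoStageMLP()
    out = model(torch.randn(8, 1024))
    # A single forward pass is slower than on one big GPU, but in true pipeline
    # parallelism each stage would work on a different micro-batch concurrently.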
No it won't. GPUs are good at ML partly because of the huge memory bandwidth: thousands of bits wide. You won't find connectors that have that many terminals and maintain signal quality. Even putting a second bank soldered onto the same signals can be enough to mess things up.
I doubt it. The latest GPUs utilize HBM which is necessarily part of the same package as the main die. If you had a RAM slot for a GPU you might as well just go out to system RAM, way too much latency to be useful.
It isn't the latency which is the problem, it's the bandwidth. A memory socket with that much bandwidth would need a lot of pins. In principle you could just have more memory slots where each slot has its own channel. 16 channels of DDR5-8000 would have more bandwidth than the RTX 4090. But an ordinary desktop board with 16 memory channels is probably not happening. You could plausibly see that on servers however.
What's more likely is hybrid systems. Your basic desktop CPU gets e.g. 8GB of HBM, but then also has 16GB of DRAM in slots. Another CPU/APU model that fits into the same socket has 32GB of HBM (and so costs more), which you could then combine with 128GB of DRAM. Or none, by leaving the slots empty, if you want entirely HBM. A server or HEDT CPU might have 256GB of HBM and support 4TB of DRAM.
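As a quick sanity check on the 16-channel DDR5-8000 figure above (peak spec numbers only):

    channel_gb_s = 8000e6 * 8 / 1e9     # one 64-bit channel of DDR5-8000: 64 GB/s
    print(16 * channel_gb_s)            # 1024 GB/s, just above the RTX 4090's ~1008 GB/s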
I don't think you really understand the current trends in computer architecture. Even CPUs are moving to on-package RAM for higher bandwidth. Everything is the opposite of what you said.
Higher bandwidth but lower capacity. The real trend is different physical architectures for different compute loads. There is a place in AI for bulk, albeit slower, memory, such as extremely large data sets that want to run internally on a discrete card without involving PCIe lanes.
This is also not true. You can transfer from main memory to cards plenty fast enough that it is not a bottleneck. Consumer GPUs don't even use PCIe 5 yet, which doubles the bandwidth of PCIe 4. Professional datacenter cards don't use PCIe at all, but they do put a huge amount of RAM on the package with the GPUs.